Bayesian cluster geographically weighted regression for spatial heterogeneous data

Wala Draidi Areed; Aiden Price; Helen Thompson; Conor Hassan; Reid Malseed; Kerrie Mengersen

doi:10.1098/rsos.231780

2024 Jun 19;11(6):231780. doi: 10.1098/rsos.231780

PMCID: PMC11293802 PMID: 39092145

Abstract

Spatial statistical models are commonly used in geographical scenarios to ensure spatial variation is captured effectively. However, spatial models and cluster algorithms can be complicated and expensive. One of these algorithms is geographically weighted regression (GWR), which was proposed in the geography literature to allow relationships in a regression model to vary over space. In contrast to traditional linear regression models, which have constant regression coefficients over space, regression coefficients are estimated locally at spatially referenced data points with GWR. The motivation for the adaption of GWR is the idea that a set of constant regression coefficients cannot adequately capture spatially varying relationships between covariates and an outcome variable. GWR has been applied widely in diverse fields, such as ecology, forestry, epidemiology, neurology and astronomy. While frequentist GWR gives us point estimates and confidence intervals, Bayesian GWR enriches our understanding by including prior knowledge and providing probability distributions for parameters and predictions of interest. This paper pursues three main objectives. First, it introduces covariate effect clustering by integrating a Bayesian geographically weighted regression (BGWR) with a post-processing step that includes a Gaussian mixture model and the Dirichlet process mixture model. Second, this paper examines situations in which a particular covariate holds significant importance in one region but not in another in the Bayesian framework. Lastly, it addresses computational challenges in existing BGWR, leading to enhancements in Markov chain Monte Carlo estimation suitable for large spatial datasets. The efficacy of the proposed method is demonstrated using simulated data and is further validated in a case study examining children’s development domains in Queensland, Australia, using data provided by Children’s Health Queensland and Australia’s Early Development Census.

Keywords: children’s development, Bayesian geographically weighted regression, Dirichlet process mixture model, clustering

1. Introduction

Spatial regression models and algorithms are widely used to model the relationships between a response variable and a set of covariates over a region of interest, taking into account location-specific information and allowing for spatial relationships in the data. These methods are prominent in the health and epidemiological sciences where the study of the impact of the geographical distribution of health data and outcomes is a major research field. Spatial data can be represented and analysed in a variety of ways. For instance, objects at discrete locations or sampled from a continuous surface can be modelled as point patterns or point processes [1]. An alternative representation is areal data, which are spatial regions typically comprising summaries (e.g. counts, means, rates) of variables of interest. Popular examples of spatial regression models that explicitly allow for spatial correlation between neighbouring areas include simultaneous spatial autoregressive models, conditional spatial autoregressive models and spatial moving average models [2–4]. These can be likened to autoregressive methods in time-series analysis [5].

Generalized additive models have emerged as a potent class of models for capturing nonlinear effects of continuous covariates in spatial regression models. The modelling of nonlinear effects in continuous covariates can be rooted in methods such as smoothing splines [6] and local polynomials [7]. Cressie [2] proposed a spatial linear regression model in which only the intercept accounts for the spatial random effect. Diggle et al. [8] extended the spatial linear regression to the spatial generalized linear model. Brunsdon et al. [9] proposed geographically weighted regression (GWR) to capture smoothly varying patterns of the regression coefficients.

The GWR fits a local weighted regression model at the location of each observation and captures spatial information by accounting for nearby observations, using a weight matrix defined by a kernel function. Xue et al. [10] extended the GWR to the Cox survival model, by providing a novel approach to analysing spatially dependent survival data. This extension enhances the ability to explore how geographical factors impact time-to-event outcomes.

Despite the appeal of the GWR model, there are some limitations in these frequentist approaches. For example, a critical issue is the violation of the usual assumption of constant variance between observations, and the resultant normality assumption for the errors [11]. Additionally, frequentist GWR struggles to address issues of model complexity, overfitting, variable selection and multicollinearity, and it might yield unstable results or high variance when dealing with small sample sizes [12]. Bayesian GWR (BGWR) is considered one of the best solutions to address these problems [13]. In the Bayesian framework, Gelfand & Schliep [14] built a model with spatially varying coefficients by applying a Gaussian process to the distribution of regression coefficients. LeSage [15] suggested an early version of BGWR, where the prior distribution of the parameters depends on expert knowledge. More recent approaches have been proposed by Ma et al. [5], who based BGWR on the weighted log-likelihood, and by Liu & Goudie [16], who based BGWR on a weighted least-squares approach.

In this paper, we propose a new extension that integrates BGWR with unsupervised probabilistic clustering. Cluster analysis is of great interest in many spatial contexts. The most common method for spatial clustering is the scan statistic [17], which is constructed via likelihood ratio statistics. Similar efforts have been made in Bayesian and non-parametric frameworks. For example, Li et al. [18] proposed a non-parametric Bayesian method to find cluster boundaries for areal data. Neill et al. [19] described an extension of the spatial scan statistics based on improving space–time cluster detection. Unsupervised clustering is a set of statistical and machine learning approaches that partition cohorts into subgroups based on the structure of the data. The most common unsupervised clustering algorithms are K-means [20] and a Gaussian mixture model (GMM) [21]. We extend the BGWR to identify groups of observations that exhibit similar behaviour, by clustering the posterior regression coefficients obtained from BGWR with a GMM and the Dirichlet process mixture model (DPMM). The proposed BGWR model thus clusters the coefficients into distinct homogeneous groups using these two probabilistic clustering algorithms. One notable advantage of our algorithm is its ability to automatically determine the optimal number of clusters without requiring prior knowledge, which sets it apart from the traditional K-means algorithm and provides an edge over frequentist GWR.

In this paper, we also introduce methods that significantly improve the Markov chain Monte Carlo (MCMC) estimation for the BGWR model proposed by Ma et al. [5]. Owing to the high computational cost of their algorithm, the geographical regions must be divided into subsets and BGWR computed for each subset separately. This compromises the accuracy of the model for areas on the boundaries of the subsets. In our proposed approach this is not necessary, making it more suitable for large spatial scales and numerous areas while preserving information from the boundaries. Our approach involves removing unnecessary computations and factorizing declarations. It also enables the inclusion and efficient handling of both continuous and categorical explanatory variables. We also integrate BGWR with dynamic variable selection using the reversible jump MCMC (RJMCMC), which identifies when a particular covariate is important in a specific local region.

The power of the proposed algorithm is demonstrated through comprehensive simulation studies. As a practical illustration, this methodology is then applied to a case study focusing on children’s development domains in Queensland, Australia.

Child development includes the biological, psychological and emotional changes that occur between birth and maturity [22]. Children’s development in the early years from birth to five years of age is crucial since it is at this time that the foundations for health development, emotional wellbeing and life success are built [23]. The early identification of groups of children who are developmentally vulnerable may lead to prompt early intervention. Physical, social, emotional, speech and language, and communication skills are the five critical domains of growth [24]. Development domains have been used in previous research to classify children into subgroups that describe their development using an unsupervised clustering algorithm [25–27]. While BGWR offers a more robust framework for spatial regression, there remains a challenge in identifying and interpreting localized patterns or homogeneity within the spatial data. This is where the integration with clustering becomes important. Clustering, especially with advanced methods like GMMs and the DPMM, allows us to group similar spatial behaviours based on the posterior regression coefficients derived from BGWR. We are aiming to identify regions within the spatial data where specific behaviours or patterns are consistent, enabling targeted interventions or insights.

The novelty of our approach lies in this unique integration. By combining BGWR with probabilistic clustering, we not only get a refined understanding of spatial relationships but also connect these relationships into distinct clusters or groups. This analysis offers depth via BGWR and comprehensive insights through clustering. Moreover, our methodology’s capability to autonomously determine the optimal number of clusters offers a more adaptive and intuitive clustering mechanism than traditional methods. In the context of children’s development domains, this can be valuable. Identifying clusters can help in pinpointing areas or groups of children who might have similar developmental challenges or needs, leading to more targeted interventions and policy implementations.

2. Methods

In this section, we introduce frequentist GWR, the BGWR model and detail the three methodological contributions of this paper. First, we describe the vectorization method to enhance BGWR efficiency. Second, we extend the BGWR analysis to include variable selection for each covariate in each specific region. Finally, we describe the extension of the BGWR model to incorporate the DPMM and GMM in order to identify clusters of coefficients of interest.

2.1. Frequentist geographically weighted regression

This section summarizes the methods used in the study, including GWR.

The GWR model suggested by Brunsdon et al. [9] is used to estimate the relationship between a dependent variable (y) and a set of covariates (x) at a specific location (s). The model takes the form of:

y (s) = β_{1} (s) x_{1} (s) + \dots + β_{p} (s) x_{p} (s) + ϵ (s),

2.1

where x _i, i = 1, 2, …, p are the covariates and β _i (s) denotes the coefficient of covariate x _i at location s. The error term, $ϵ (s)$, is assumed to have a mean of zero and a variance of σ ² I, where I is the identity matrix. The estimated coefficients are found using a method similar to weighted least squares for each location s:

\hat{β} (s) = {(X^{⊤} W (s) X)}^{- 1} X^{⊤} W (s) Y,

2.2

where X is the n × p covariates matrix, Y is the n × 1 responses vector and W(s) = diag(w ₁(s), …, w _n(s)) is the diagonal matrix of the weights.
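The estimator in equation (2.2) can be illustrated with a minimal sketch. It is written in Python with NumPy for illustration only and is not the authors' code; the function name `gwr_coefficients`, the simulated data and the distance vector `d` are assumptions made for this example.

```python
import numpy as np

def gwr_coefficients(X, y, weights):
    """Local estimate (X^T W X)^{-1} X^T W y for one location s, with W = diag(weights)."""
    XtW = X.T * weights              # equivalent to X.T @ np.diag(weights)
    return np.linalg.solve(XtW @ X, XtW @ y)

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, 4.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# With equal weights the local estimate coincides with ordinary least squares.
beta_equal = gwr_coefficients(X, y, np.ones(n))
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Hypothetical distances from location s to each observation, turned into
# weights that decay with distance.
d = rng.uniform(0.0, 10.0, size=n)
beta_local = gwr_coefficients(X, y, np.exp(-d / 5.0))
```

Solving the weighted normal equations with `np.linalg.solve` avoids forming an explicit matrix inverse, which is usually the more stable choice.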

As noted above, a common assumption for GWR is that the error terms are normally distributed with constant variance, $ϵ (s) \sim N (0, σ^{2} I)$. There are many situations in different fields where the assumption of constant variance is invalid. According to Páez et al. [28], the error term can be written as $ϵ \sim N (0, Ω),$ where Ω = σ ² W is a general n × n diagonal covariance matrix with the elements

σ^{2} W = \left\{ \begin{matrix} w_{i i} = σ^{2} \exp (λ, d_{i j}) \\ w_{i j} = 0 \end{matrix} \right. .

2.2. Bayesian geographically weighted regression

A common assumption for the GWR is that the error terms are normally distributed with constant variance $ϵ (s) \sim N (0, σ^{2} I)$ [29] for a specific geographical location s. There are many situations in different fields where the assumption of constant variance is invalid. According to Páez et al. [28], the error term can be written as $ϵ (s) \sim N (0, Ω (s))$ , where Ω(s) = σ ²(s)W(s) and W(s) is a diagonal matrix of geographical weights given by a function f(d _i(s)|b) that decreases with the distance d _i(s) from the location s to the location i. GWR is seen as a locally weighted regression method that operates by assigning a weight to every observation i depending on its distance from a specific geographical location s [30]. This local perspective of the variance is often called location heterogeneity. A common approach is to use only the observations that are within a certain distance b from a specific location s. Different weights can be used in the GWR model, including:

W (s) = \left\{ \begin{matrix} 1, & d_{i} (s) \leq b \\ 0, & \text{otherwise} \end{matrix} \right. ,

where d _i(s) is the distance between the locations s and i. When a weight is set to zero for certain observations, those observations are not considered in the regression model for predicting the response variable at a given location. This can lead to rows or columns of the weight matrix W being entirely zero. In such cases, the weight matrix can become singular (non-invertible) because it would have linearly dependent rows or columns. Other weights include the exponential and Gaussian functions, which give the following expressions for weights, respectively:

W (s) = e^{- (d_{i} (s) / b)}

2.3

and

W (s) = e^{- {(d_{i} (s) / b)}^{2}},

2.4

where b represents the bandwidth that controls the decay over distance [31]. Equations (2.3) and (2.4) are decreasing functions of d _i(s), which shows that an observation far away from the location of interest contributes little to the estimate of parameters at that location. Different choices have been used to find d _i(s); the Euclidean distance is the most popular choice when latitude and longitude for each observation are available [32]. Other choices of distance matrices include the graph distance [33] and great circle distance (GCD) [34]; these approaches are used in the BGWR in §3 and 4. A further explanation of these distance measures can be found in appendix A.
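The three weighting schemes described above (the distance cut-off and the kernels in equations (2.3) and (2.4)) can be compared with a short Python/NumPy sketch. The function names and the example distances are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def boxcar_weights(d, b):
    """Weight 1 for observations within distance b of location s, 0 otherwise."""
    return (d <= b).astype(float)

def exponential_weights(d, b):
    """Exponential kernel, equation (2.3): exp(-d / b)."""
    return np.exp(-d / b)

def gaussian_weights(d, b):
    """Gaussian kernel, equation (2.4): exp(-(d / b)^2)."""
    return np.exp(-(d / b) ** 2)

d = np.array([0.0, 1.0, 5.0, 10.0])   # hypothetical distances from location s
b = 5.0                               # hypothetical bandwidth
w_box = boxcar_weights(d, b)
w_exp = exponential_weights(d, b)
w_gau = gaussian_weights(d, b)
```

Unlike the cut-off scheme, both kernels keep every weight strictly positive, so no observation is dropped entirely from the local regression.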

The likelihood of the BGWR model can be written as [5]:

Y | β (s), X, W (s), σ^{2} (s) \sim M V N (X β (s), σ^{2} (s) W^{- 1} (s)) .

2.5

Y is the n × 1 observation or response vector, X the n × p predictor matrix, β(s) is a p × 1 vector of spatially varying coefficients and σ ²(s)W ⁻¹(s) is an n × n matrix. Common conjugate prior distributions are $β (s) | σ_{β}^{2} \sim N (0, σ_{β}^{2})$ and σ ²(s) ∼ IGamma(α ₁, α ₂). The posterior distribution is given as:

p (β (s), σ^{2} (s) | Y, X, W (s)) \propto p (Y | β (s), X, W (s), σ^{2} (s)) \times p (β (s) | σ_{β}^{2}) \times p (σ^{2} (s)) .

2.6

In the GWR model, it is crucial to choose a suitable bandwidth for the weighted function. In a BGWR context, a prior can also be applied to the bandwidth b to allow estimation of the best bandwidth given other parameters. The choice of the prior also depends on the measure of distance that is used. A common prior for the bandwidth is [35]

b \sim Uniform (0, D), \quad D > 0 .

2.7

Without any prior knowledge, D can be selected to be large enough that we begin to approximate with a non-informative prior; i.e. we begin with an approximate global model in which all observations are weighted equally.

2.3. Vectorization methodology

In this section, we introduce our first contribution aimed at enhancing the computational efficiency of the BGWR to accommodate large-scale datasets, ensuring that there is no loss of information at the boundaries, and removing the need to split the data into smaller regions to run the BGWR model.

The use of a multivariate normal distribution in BGWR leads to high computational costs. In this paper, we introduce a new approach based on vectorization, achievable in the ‘nimble’ package in R. This improves efficiency by reducing the number of calculations and nodes in the model, thus enhancing MCMC performance. The likelihoods for each region in the BGWR model are vectorized. The covariance matrix σ ²(s)W ⁻¹(s) is a diagonal matrix, allowing for an alternative representation using univariate Gaussian functions. This permits independent estimation of the mean and variance in each dimension and characterizes the multivariate density function as a product of univariate Gaussian functions for each location s:

Y | β (s), X, W (s), σ^{2} (s) \sim MVN (X β (s), σ^{2} (s) W^{- 1} (s)) .

2.8

When W(s) is diagonal, the inverse W ⁻¹(s) is also diagonal. In this case, the likelihood function can be expressed as:

Y | β (s), X, W (s), σ^{2} (s) \sim MVN (X β (s), σ^{2} (s) \cdot diag (w_{1 1}^{- 1} (s), w_{2 2}^{- 1} (s), \dots, w_{n n}^{- 1} (s))) .

Because the covariance matrix is σ ²(s)W ⁻¹(s), its inverse is σ ⁻²(s)W(s), so the quadratic form in the exponent of the density expands as

\begin{aligned} {(Y - X β (s))}^{T} \cdot diag (w_{1 1} (s), w_{2 2} (s), \dots, w_{n n} (s)) \cdot (Y - X β (s)) \\ = \sum_{i = 1}^{n} w_{i i} (s) \cdot {(y_{i} - x_{i}^{T} β (s))}^{2} . \end{aligned}

2.9

The likelihood function can then be written as a product of univariate Gaussian likelihoods:

\begin{aligned} f (Y | β (s), X, W (s), σ^{2} (s)) & = {(2 π)}^{- n / 2} \cdot σ^{- n} (s) \cdot | W (s) |^{1 / 2} \\ & \cdot \exp (- \frac{1}{2} σ^{- 2} (s) \sum_{i = 1}^{n} w_{i i} (s) \cdot {(y_{i} - x_{i}^{T} β (s))}^{2}), \end{aligned}

2.10

where $(w_{1 1} (s), w_{2 2} (s), \dots, w_{n n} (s))$ is the diagonal of the weight matrix W(s), which holds the weights of the n observations relative to location s. Thus, observations with higher weights (indicating higher importance or reliability) contribute more to the overall likelihood value compared to those with lower weights.

The term inside the exponential is the sum of squared differences between the observed response y _i and the predicted value $x_{i}^{T} β (s)$ , weighted by $w_{i i} (s)$ , which scales the precision of each observation. The determinant of a diagonal matrix is simply the product of its diagonal entries. Thus,

| W (s) | = w_{1 1} (s) \cdot w_{2 2} (s) \cdot \dots \cdot w_{n n} (s) = \prod_{i = 1}^{n} w_{i i} (s) .

The likelihood in the above form demonstrates that, when the weight matrix is diagonal, the multivariate Gaussian likelihood for location s reduces to a product of n univariate Gaussian likelihoods, one for each observation i.

The likelihood for each y _i can be represented as:

y_{i} | β (s), x_{i}, w_{i i} (s), σ^{2} (s) \sim N (x_{i}^{T} β (s), σ^{2} (s) w_{i i}^{- 1} (s)) .

2.11

The full likelihood for the entire dataset is the product of the likelihood for each observation.
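The equivalence between the multivariate likelihood with a diagonal covariance matrix and the product of univariate likelihoods can be checked numerically. The following Python/NumPy sketch is an illustration under assumed data, not the nimble implementation used in the paper; the function names and simulated values are hypothetical.

```python
import numpy as np

def mvn_diag_loglik(y, X, beta, sigma2, w):
    """Log-density of MVN(X beta, sigma2 * W^{-1}), W = diag(w), using full n x n matrices."""
    n = len(y)
    cov = sigma2 * np.diag(1.0 / w)
    r = y - X @ beta
    _, logdet = np.linalg.slogdet(cov)
    quad = r @ np.linalg.solve(cov, r)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

def univariate_loglik(y, X, beta, sigma2, w):
    """Sum of N(x_i^T beta, sigma2 / w_i) log-densities; no n x n matrix is formed."""
    var = sigma2 / w
    r = y - X @ beta
    return np.sum(-0.5 * (np.log(2.0 * np.pi * var) + r ** 2 / var))

rng = np.random.default_rng(1)
n, p = 30, 2
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0])
y = X @ beta + rng.normal(size=n)
w = np.exp(-rng.uniform(0.0, 3.0, size=n))   # hypothetical weights for one location s

ll_mvn = mvn_diag_loglik(y, X, beta, 1.5, w)
ll_uni = univariate_loglik(y, X, beta, 1.5, w)
```

The two values agree up to floating-point error, while the univariate form needs only O(n) operations per location instead of building and factorizing an n × n matrix.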

In summary, the BGWR approach developed in this paper can be represented in the following form:

y_{i} | β (s), x_{i}, w_{i i} (s), σ^{2} (s) \sim N (x_{i}^{T} β (s), σ^{2} (s) w_{i i}^{- 1} (s)),

2.12

β_{j} (s) | σ_{β}^{2} \sim N (0, σ_{β}^{2}),

2.13

σ_{β}^{2} \sim IGamma (α, β),

2.14

σ^{2} (s) \sim IGamma (α_{1}, α_{2}),

2.15

w_{i i} (s) = f (d_{i} (s) | b)

2.16

and b \sim Uniform (0, D),

2.17

where f is the weighting function in equations (2.3) or (2.4), and b is the bandwidth. The posterior distribution is obtained using MCMC methods, and the parameters of interest are estimated using the posterior mean of the respective marginal posterior distributions. The steps of fitting the proposed model in nimble [36] are provided in appendix B.

2.4. Bayesian geographically weighted regression with dynamic variable selection: reversible jump Markov chain Monte Carlo approach

This section introduces our second contribution, the identification of locally important covariates for each location.

In the Bayesian framework, various algorithms have been developed to determine the most relevant predictors that should be included in a model to provide the best explanation for a response variable. A popular approach was developed by Mitchell & Beauchamp [37], which employs a prior distribution that encourages sparse models in Bayesian linear regression. A spike and slab approach to variable selection has also been proposed by George et al. [38], and Kuo & Mallick [39] proposed a simpler approach that embeds indicator variables in the regression equation. Green [40] introduced RJMCMC, which enables models of different dimensionality to be explored. Bhattacharya & Dunson [41] proposed a Bayesian non-parametric approach to sparse regression that can handle infinite potential predictors. The approach uses a spike-and-slab prior to calculate posterior model probabilities for each possible subset of variables but does not require specifying the number of potential predictors a priori. Ma et al. [5] employed a traditional spike and slab approach in the context of BGWR. However, their method includes or excludes the coefficient entirely without considering its potential importance in specific regions.

In this paper, a novel approach is described for spatial local BGWR that includes or excludes coefficients based on their importance in specific regions. This advancement enhances the model’s ability to capture location-specific effects accurately and opens new possibilities for understanding the spatially varying impact of coefficients. The proposed algorithm replaces the likelihood in equation (2.12) with:

y_{i} | β (s), x_{i}, w_{i i} (s), σ^{2} (s) \sim N (x_{i}^{T} (Γ_{s} * β (s)), σ^{2} (s) w_{i i}^{- 1} (s)),

2.18

γ_{j} (s) \sim Bernoulli (ψ_{j})

2.19

and ψ_{j} \sim Beta (1, 1),

2.20

where Γ _s is a 1 × p vector with elements γ _j(s), and * represents the broadcast operation (element-wise multiplication), with other priors given by equations (2.13) to (2.17). In equation (2.18), for each region s and each corresponding covariate j, an indicator variable γ _j(s) is assigned, following a Bernoulli distribution with probability parameter ψ _j. Thus γ _j(s) allows probabilistic determination of whether the coefficient β _j should be included in the model for the region s or not, depending on the contribution of the jth covariate in the model for the sth region. This dynamic approach to variable selection allows the model to make data-driven decisions about the importance of covariates in explaining the response variable for each region. A RJMCMC algorithm is adopted to implement this approach [40]. The steps of the RJMCMC algorithm are detailed in appendix D. We have tested our algorithm using simulated and real datasets; the code for the analysis of the simulated data is available on the first author’s GitHub (https://github.com/waladraidi/BGWR-Clustering).
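The role of the indicators in equation (2.18) can be illustrated with a small Python/NumPy sketch. It only shows how the element-wise product of indicators and coefficients switches covariates on and off per region; it does not implement RJMCMC, and the inclusion probabilities, coefficients and covariate vector below are hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
S, p = 4, 3                                  # number of regions and covariates

psi = np.array([0.9, 0.5, 0.1])              # hypothetical inclusion probabilities psi_j
gamma = rng.binomial(1, psi, size=(S, p))    # gamma[s, j] ~ Bernoulli(psi_j)
beta = rng.normal(size=(S, p))               # hypothetical local coefficients beta_j(s)

x_i = np.array([1.0, 2.0, -1.0])             # covariates of one observation
# Mean of y_i under each region's model: x_i^T (Gamma_s * beta(s)).
# Covariates with gamma[s, j] = 0 contribute nothing in region s.
mean_i = (gamma * beta) @ x_i
```

In a full sampler, the indicators and coefficients would be updated jointly at each iteration, and the posterior mean of each indicator would estimate the local inclusion probability of that covariate.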

2.5. Cluster Bayesian geographically weighted regression

In this section, we present two techniques for probabilistic clustering of the posterior distributions of the region-specific (local) coefficients obtained from the BGWR analysis. The first parametric unsupervised learning technique assumes that the coefficients can be represented by a known probability distribution, the GMM. The second method is a non-parametric unsupervised learning approach that extends the GMM using a DPMM. Unlike traditional GMMs where the number of mixtures is predefined, DPMM infers the number of clusters directly from the data. Let us consider a random sample of size n drawn from the posterior distribution of the coefficient of interest, obtained from the MCMC iterations. We denote this sample generically as y = {y ₁, y ₂, …, y _n}. This sample y represents a collection of n data points, each corresponding to a particular observation or measurement. We use this sample to illustrate the concepts of both traditional GMMs and DPMMs, elucidating their differences and advantages.

2.6. Gaussian mixture model

The GMM is represented as a weighted sum of Gaussian density functions, each with its own set of parameters [42]. The GMM is represented as:

p (y_{i} | θ) = \sum_{k = 1}^{K} α_{k} p_{k} (y_{i} | θ_{k}),

2.21

where $\sum_{k = 1}^{K} α_{k} = 1$ and p _k(y _i|θ _k) is a Gaussian density function parameterized by θ _k [43]. For computational simplicity and to enhance inferential capability, a latent cluster indicator, z = {z ₁, z ₂, … , z _n} is introduced, where each z _i is a K-dimensional vector and the kth element of z _i takes the value 1 if y _i comes from cluster k, (k = 1, …, K) and takes the value 0 otherwise.

The parameters of the GMM can be estimated in a Bayesian framework using MCMC or approximations such as variational inference. Another popular approach that we adopted here is expectation-maximization (EM) [43]. The EM algorithm iterates between two steps until convergence: the expectation step, which takes the conditional expectation of the complete data log-likelihood given the observed data and current parameter estimates, and the maximization step, which maximizes the log-likelihood with respect to the parameters to give updated estimates [44].

There are many ways to select the optimal number of clusters (K) for GMM. Common approaches include the silhouette score [45], the Bayesian information criterion (BIC) [46], the elbow method which is based on a plot of the explained variance of the model versus the number of clusters [47], and cross-validation, which involves splitting the data into multiple subsets and then training and evaluating the GMM model on different subsets of the data [48]. In this study, we used BIC to find the optimal number of clusters. The choice of BIC is motivated by the nature and assumptions of the GMM. The GMM assumes that data are generated from a mixture of several Gaussian distributions. BIC is particularly suited for model selection in probabilistic models like GMM. It balances the likelihood of the data under the model against the complexity of the model, effectively penalizing overfitting. Given that GMM involves the estimation of several parameters, especially as the number of mixtures (or clusters) increases, BIC becomes a particularly relevant criterion for determining the optimal model.
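The combination of EM fitting and BIC-based selection of K can be sketched for a one-dimensional sample. The Python/NumPy code below is a simplified illustration, not the software used in the paper: the quantile-based initialization, the fixed number of iterations, the small variance floor and the simulated two-cluster sample are all assumptions of this sketch.

```python
import numpy as np

def log_joint(y, alpha, mu, var):
    """log(alpha_k) + log N(y_i | mu_k, var_k) for every i and k; shape (n, K)."""
    return (np.log(alpha) - 0.5 * np.log(2.0 * np.pi * var)
            - 0.5 * (y[:, None] - mu) ** 2 / var)

def gmm_bic_1d(y, K, n_iter=200):
    """Fit a K-component 1-D GMM by EM and return its BIC (lower is better)."""
    n = len(y)
    mu = np.quantile(y, (np.arange(K) + 0.5) / K)   # deterministic starting means
    var = np.full(K, np.var(y))
    alpha = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior probability that y_i belongs to component k.
        lj = log_joint(y, alpha, mu, var)
        resp = np.exp(lj - np.logaddexp.reduce(lj, axis=1, keepdims=True))
        # M-step: update mixing weights, means and variances.
        nk = resp.sum(axis=0) + 1e-12
        alpha = nk / n
        mu = (resp * y[:, None]).sum(axis=0) / nk
        var = (resp * (y[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    loglik = np.logaddexp.reduce(log_joint(y, alpha, mu, var), axis=1).sum()
    n_params = 3 * K - 1          # K means, K variances, K - 1 free weights
    return -2.0 * loglik + n_params * np.log(n)

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-3.0, 0.5, 150), rng.normal(3.0, 0.5, 150)])
bics = {K: gmm_bic_1d(y, K) for K in range(1, 5)}
best_K = min(bics, key=bics.get)   # candidate K with the smallest BIC
```

The penalty term grows with the number of free parameters, so an extra component is only preferred when it improves the log-likelihood by more than the added penalty.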

2.7. Dirichlet process mixture model

DPMM is a non-parametric method that replaces the fixed number of clusters with a random probability measure [49]. The DPMM is defined by a base distribution G ₀ and a concentration parameter α. The base distribution represents the prior distribution over the means of the clusters, while the concentration parameter controls the level of clustering.

The DPMM can be summarized as follows:

y_{i} | θ_{i} \sim p (y_{i} | θ_{i}),

2.22

θ_{i} | G \sim G

2.23

and G \sim D P (α, G_{0}),

2.24

where each θ _i is drawn from a mixing distribution G. This mixing distribution has a Dirichlet process prior, with concentration parameter α and base distribution G ₀ where E[G] = G ₀ [50]. The concentration parameter acts as an inverse variance where larger values of α result in smaller variances, which creates more concentrated draws around the mean of the base distribution [51]. The DPMM is typically represented through a construction such as the Chinese restaurant process or the stick-breaking process [52]. The stick-breaking construction is used in this paper. In this construction, G can be written as an infinite sum of weighted point masses:

G = \sum_{k = 1}^{\infty} C_{k} δ_{θ_{k}},

2.25

where $δ_{θ_{k}}$ is a point mass of 1 located at θ _k, which is sampled directly from G ₀, i.e. θ _k ∼ G ₀. The weights C _k are obtained through the stick-breaking process:

\begin{matrix} V_{1}, V_{2}, \dots \sim Beta (1, α) \\ C_{1} = V_{1} \\ C_{k} = V_{k} \prod_{j = 1}^{k - 1} (1 - V_{j}); k \geq 2. \end{matrix}

2.26
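The stick-breaking weights in equation (2.26) can be simulated directly. The Python/NumPy sketch below truncates the infinite sequence at K terms, which is an assumption made for illustration; the function name and the chosen values of α are hypothetical.

```python
import numpy as np

def stick_breaking_weights(alpha, K, rng):
    """First K weights C_1, ..., C_K of a truncated stick-breaking construction."""
    V = rng.beta(1.0, alpha, size=K)
    # Length of stick remaining before each break: 1, (1-V_1), (1-V_1)(1-V_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    return V * remaining

rng = np.random.default_rng(42)
C_small = stick_breaking_weights(alpha=0.5, K=50, rng=rng)
C_large = stick_breaking_weights(alpha=20.0, K=50, rng=rng)
```

With a small α most of the mass falls on the first few weights, so few clusters are expected; with a large α the mass is spread over many small weights, and the truncated weights can sum to noticeably less than 1.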

The cluster assignment process involves two stages: within-sample cluster configuration and across-sample cluster configuration. In the first stage, MCMC analysis is performed on the DPMM for each sample to determine cluster assignments for regions, employing the mode method to derive cluster configurations. These configurations are aggregated across iterations to accommodate varying cluster numbers. In the second stage, cluster configurations are compiled into a matrix, considering potential NA values. The mode is then calculated for each region across all samples, disregarding NA entries, to construct the final cluster assignment vector. This vector represents the most frequently assigned clusters for each region, accommodating variations in cluster numbers across iterations. A further explanation can be found in appendix F.

2.8. Cluster configurations

Two approaches were considered to determine cluster membership of the observed data. The first method, known as Dahl’s approach [53], involves computing the membership matrices for each iteration, denoted as B ⁽¹⁾, …, B ^(M), with M being the number of post-burn-in MCMC iterations. The membership matrix for the cth iteration B ^(c) is defined as:

B^{(c)} = {(B^{(c)} (i, j))}_{i, j \in \{1 : n\}} = 1 {(z_{i}^{(c)} = z_{j}^{(c)})}_{n \times n},

2.27

where 1( · ) represents the indicator function. The entries B ^(c)(i, j) take values in {0, 1} for all i, j = 1, …, n and c = 1, …, M with B ^(c)(i, j) = 1, indicating that observations i and j belong to the same cluster in the cth iteration. An empirical estimate of the probability that locations i and j are in the same cluster is given by the average of B ⁽¹⁾, …, B ^(M):

\bar{B} = \frac{1}{M} \sum_{c = 1}^{M} B^{(c)},

2.28

where $\sum$ denotes the element-wise summation of matrices. The (i, j)th entry of $\bar{B}$ provides the required empirical estimate.

Subsequently, we determine the iteration that exhibits the least-squared distance to $\bar{B}$ as:

C_{L S} = \arg\min_{c \in \{1 : M\}} [\sum_{i = 1}^{n} \sum_{j = 1}^{n} {(B^{(c)} (i, j) - \bar{B} (i, j))}^{2}] .

2.29
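Equations (2.27) to (2.29) can be traced on a toy set of posterior draws. The Python/NumPy sketch below is an illustration of Dahl's least-squares selection; the function name and the four hand-written label vectors are assumptions, standing in for real MCMC output.

```python
import numpy as np

def dahl_least_squares(z_samples):
    """Return the index of the iteration whose membership matrix is closest to B-bar."""
    z = np.asarray(z_samples)                               # shape (M, n)
    B = (z[:, :, None] == z[:, None, :]).astype(float)      # B^(c), shape (M, n, n)
    B_bar = B.mean(axis=0)                                  # equation (2.28)
    sq_dist = ((B - B_bar) ** 2).sum(axis=(1, 2))           # objective in (2.29)
    c_ls = int(np.argmin(sq_dist))
    return c_ls, z[c_ls], B_bar

z_samples = np.array([
    [0, 0, 1, 1],
    [1, 1, 0, 0],   # same partition as the first draw, with labels switched
    [0, 0, 0, 1],
    [0, 0, 1, 1],
])
c_ls, z_hat, B_bar = dahl_least_squares(z_samples)
```

Here the first draw is selected (`c_ls` is 0). Because membership matrices depend only on which pairs share a cluster, the second draw counts as the same partition despite its switched labels, which is why this approach is insensitive to label switching.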

The second method used here is the mode method. This method uses the posterior samples of z _i, where z denotes the cluster assignments specific to each region. Each iteration generates a new set of cluster assignments z, which are dependent on the parameters. Consequently, following multiple iterations, each region will have a collection of cluster assignments z. The mode indicates the cluster with the highest probability of assignment for a given region. Moreover, obtaining probabilities for assignment to alternative clusters provides valuable insights, aiding in the inferential and decision-making process.
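The mode method can be sketched in the same setting. The Python/NumPy code below assumes that cluster labels are comparable across iterations (that is, no label switching), which is an assumption of this illustration; the function name and the toy draws are hypothetical.

```python
import numpy as np

def mode_assignments(z_samples, n_clusters):
    """Most frequent label per region, plus empirical assignment probabilities."""
    z = np.asarray(z_samples)                     # shape (M, n)
    # counts[k, i] = number of iterations that assign region i to cluster k
    counts = np.stack([(z == k).sum(axis=0) for k in range(n_clusters)])
    probs = counts / z.shape[0]
    return probs.argmax(axis=0), probs

z_samples = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
])
labels, probs = mode_assignments(z_samples, n_clusters=2)
```

The columns of `probs` give, for each region, the empirical probability of assignment to every cluster, so uncertain regions (such as the second and fourth here, at 0.75) can be flagged alongside their modal label.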

2.9. Clustered accuracy

To assess the accuracy of our proposed algorithm, we used the Rand index (RI) [54] in our simulation study (see §3) to compare the cluster configurations obtained using either Dahl’s method or the mode method with the true clusters. This metric is used only for the simulated data because true labels are unavailable for the real data. The RI quantifies the level of agreement between two sets of cluster assignments, denoted as C and C′, with respect to a given dataset X = {x ₁, x ₂, …, x _n}. Each data point x _i is assigned a cluster label c _i in C and c′_i in C′. The computation of RI is based on the following formula:

RI = \frac{a + b}{a + b + c + d},

where a is the number of pairs of data points that are in the same cluster in both C and C′ (true positives), b is the number of pairs of data points that are in different clusters in both C and C′ (true negatives), c is the number of pairs of data points that are in the same cluster in C but in different clusters in C′ (false positives), d is the number of pairs of data points that are in different clusters in C but in the same cluster in C′ (false negatives).

The RI value ranges from 0 to 1, where a value of 1 signifies perfect agreement between the two clusterings (C and C′ agree on all pairs of data points). On the other hand, a value close to 0 indicates a low level of agreement between the two clusterings.
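The pair-counting definition of the RI translates directly into code. The sketch below is a minimal Python illustration with a hypothetical function name; it counts the agreeing pairs (a + b) and divides by the total number of pairs (a + b + c + d).

```python
from itertools import combinations
from typing import Sequence


def rand_index(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Rand index between two clusterings of the same data points."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two label vectors of equal length >= 2")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # A pair agrees if it is together in both clusterings (a)
        # or apart in both clusterings (b).
        if same_a == same_b:
            agree += 1
        total += 1
    return agree / total


# Only the pair (2, 3) is treated differently, so 5 of the 6 pairs agree.
ri_example = rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

Since only pair memberships enter the calculation, relabelling the clusters does not change the index: `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is 1.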

3. Simulated data analysis

In this section, we present two simulated datasets: one with spatially varying coefficients and the other without, see figures 1 and 2. The first simulation is related to a dataset based on Louisiana and is generated in a manner similar to the work by Ma et al. [5]. This dataset comprises a total of 64 regions, with three observations in each region i.

Figure 1. — Flowchart for the simulated study without spatially varying coefficients. MAB, mean absolute bias; MSD, mean standard deviation; MMSE, mean of mean squared error.

Figure 2. — Flowchart for the simulated study with spatially varying coefficients.

Under the scenario in which there are no spatially varying coefficients, we generated independent continuous covariates denoted as (X ₁, X ₂, …, X ₅) from a standard normal distribution N(0, 1) for each region. These covariates are used to create the response vector Y, generated as $Y = X β + ϵ$ , where $ϵ \sim N (0, 1)$ . The parameter vector β is set to be β = (2, 0, 0, 4, 8).
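For concreteness, one replicate of this data-generating process might be sketched as follows in Python. The seed and variable names are arbitrary choices for illustration; the sketch uses 64 regions with three observations each, matching the Louisiana-based design.

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility of the illustration only

n_regions, n_obs = 64, 3
beta = [2.0, 0.0, 0.0, 4.0, 8.0]  # true coefficients, constant over space

rows = []
for region in range(n_regions):
    for _ in range(n_obs):
        # Five independent standard normal covariates X1, ..., X5.
        x = [random.gauss(0.0, 1.0) for _ in beta]
        # Response y = x'beta + eps with eps ~ N(0, 1).
        y = sum(b * xi for b, xi in zip(beta, x)) + random.gauss(0.0, 1.0)
        rows.append((region, x, y))
```

Each element of `rows` holds the region index, the covariate vector and the response, giving 192 observations per replicate.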

We set the bandwidth parameter D to 100. The maximum GCD in the spatial structure of the 64 regions is 10, so a bandwidth of 100 keeps all relative weights close to one: even for the two most distant regions, the relative weight is approximately exp(−10/100) ≈ 0.905. The model therefore behaves similarly to a global model in which every observation is weighted almost equally, ensuring a sufficiently non-informative prior on the bandwidth b.
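The weight quoted above can be checked numerically. The short Python sketch below evaluates the negative exponential weight exp(−d/D) at a few distances between 0 and the maximum of 10; the intermediate distances are arbitrary illustrative values.

```python
import math

D = 100.0            # bandwidth parameter used in the simulation
max_distance = 10.0  # maximum distance between any two regions

# Negative exponential weights at a few illustrative distances.
distances = (0.0, 1.0, 5.0, max_distance)
weights = [math.exp(-d / D) for d in distances]
smallest_weight = weights[-1]  # weight for the two most distant regions
```

All weights lie between roughly 0.905 and 1, which is why the fitted model is close to a global regression under this choice.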

In each region, three observations are generated, resulting in a total of 192 observations per replicate. The analysis is replicated 100 times. Each replicate involves running an MCMC chain of length 10 000 without thinning, and the initial 2000 samples are discarded as burn-in. The mean bandwidth selected in the 100 replicates was calculated as well as the posterior means of the parameters β. The results are reported in table 1. Additionally, we employed the dynamic variable selection (RJMCMC) technique described in §2.4. This analysis was also replicated 100 times. Each replicate involves running an RJMCMC chain of length 400 000 without thinning, and the initial 20 000 samples are discarded as burn-in. The results are summarized in table 2. This table illustrates how our algorithm identifies important covariates for each location.

Table 1.

Average parameter estimates and their performance when there is no spatial variation in the underlying true parameters. (The performance metrics used include mean absolute bias (MAB), mean standard deviation (MSD) and mean of mean squared error (MMSE).)

	$\bar{\hat{β}}$	MAB	MSD	MMSE	bandwidth
β ₁	1.975	0.079	0.012	0.006	49.06
β ₂	−0.089	0.014	0.011	0.005
β ₃	0.074	0.012	0.076	0.007
β ₄	3.927	0.013	0.072	0.004
β ₅	8.383	0.009	0.06	0.003


Table 2.

Average parameter estimates and the performance of parameter estimates when there is no spatial variation in the underlying true parameters using dynamic variable selection. (The performance metrics used include MAB, MSD and MMSE.)

	true β	$\bar{\hat{β}}$	MAB	MSD	MMSE	bandwidth
β ₁	2	1.96	0.048	0.081	0.002	47.89
β ₂	0	−0.0003	0.027	0.058	0.0007
β ₃	0	0.0001	0.018	0.046	0.0003
β ₄	4	3.92	0.014	0.066	0.002
β ₅	8	7.95	0.060	0.072	0.003


Our proposed BGWR model with vectorization significantly improves computational efficiency, completing each replicate in under 250 s. By contrast, a model that emulated the approach of Ma et al. [5], which relies on a multivariate normal distribution to calculate the likelihood, took over 15 min to run for each replicate. Finally, we evaluated various weighting functions in this analysis. Table 3 presents the Watanabe-Akaike information criterion (WAIC), deviance information criterion (DIC), and effective number of parameters (P _D) obtained with different kernels. The explanation of WAIC and DIC for the BGWR model can be found in Appendix E. The bi-square kernel demonstrated a better fit to the data compared to the exponential and Gaussian kernels.
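The vectorized implementation itself is not reproduced in this section. As a hedged illustration of why avoiding a full multivariate normal evaluation can help, the Python sketch below shows that, when the covariance matrix is diagonal, the multivariate normal log-density equals a sum of univariate normal log-densities, so no n × n matrix needs to be formed. The function names and the per-observation variances are hypothetical, and this is not claimed to be the exact computation performed in nimble.

```python
import math
from typing import Sequence


def loglik_full_matrix(resid: Sequence[float], variances: Sequence[float]) -> float:
    """Gaussian log-density written with an explicit n x n diagonal precision matrix."""
    n = len(resid)
    prec = [[(1.0 / variances[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]
    log_det = sum(math.log(v) for v in variances)  # log-determinant of a diagonal matrix
    quad = sum(resid[i] * prec[i][j] * resid[j] for i in range(n) for j in range(n))
    return -0.5 * (n * math.log(2 * math.pi) + log_det + quad)


def loglik_elementwise(resid: Sequence[float], variances: Sequence[float]) -> float:
    """Same quantity as a sum of n independent univariate normal log-densities."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + r * r / v)
        for r, v in zip(resid, variances)
    )


# Hypothetical residuals and per-observation variances.
resid = [0.5, -1.0, 2.0]
variances = [1.0, 2.0, 0.5]
full = loglik_full_matrix(resid, variances)
elementwise = loglik_elementwise(resid, variances)
```

The two functions return the same value, but the first performs on the order of n² operations per evaluation and the second on the order of n, a difference that accumulates over the many likelihood evaluations in an MCMC run.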

Table 3.

Model assessment for the proposed algorithm BGWR using different kernels for the simulated data.

	exponential kernel	bi-square kernel	Gaussian kernel
WAIC	181878.4	181854	181855
DIC	35846.3	35831.9	35853.4
P _D	391.7	381.8	404.7


The second simulated study was created based on the structure of the state of Georgia, also considered by Ma et al. [55]. This dataset includes 159 regions. Six spatially correlated covariates (X ₁ to X ₆) were generated using multivariate normal distributions with spatial weight matrices derived from the distance matrix and parameter bandwidth. The response variable (Y) in the simulation was generated by the GWR model:

y (s) = β_{0} (u (s), v (s)) + \sum_{k = 1}^{K} β_{k} (u (s), v (s)) \cdot X_{k} (s) + ε (s) .

Notably, the parameters (β ₁–β ₆) of this GWR model were spatially varying, based on the spatial weight matrices. We then visually partitioned the areas into three large regions to define true clustering settings based on the spatial coordinates of centroids. This approach allowed us to create distinct spatial patterns in the data, which incorporated spatial autocorrelation, spatial variability, and true clustering settings. Figure 3 summarizes the partition of the Georgia areas into three clusters containing 51, 49 and 59 regions, respectively.

The true parameter vectors for the three clusters are given in table 4 for three settings with different signal strengths. Setting 1 shows relatively low signal strengths for all three clusters. The signals in this setting are primarily found in specific variables within each cluster, with no variable consistently having a strong signal across all clusters. Setting 2 exhibits stronger signals than setting 1, with more pronounced variability among the clusters. Cluster 3 in this setting shows the strongest signals among the clusters, and certain variables have high signal strengths across multiple clusters. In setting 3, the signals are well spread across different variables in each cluster, and the magnitude of the signals is higher than in the previous settings.

The performance of the proposed algorithm is presented in table 5. The analysis includes the three settings (setting 1, setting 2 and setting 3) and the two clustering methods (GMM and DPMM). Setting 3 exhibits the highest RI values in three of the four combinations of clustering and summary methods (the exception is GMM with Dahl’s method), signifying superior clustering accuracy compared to the other settings in those cases. Furthermore, the number of clusters varies across settings and methods. As anticipated, the DPMM method generally generates a larger number of clusters than the GMM method across most settings [56]. Additionally, the cluster configurations differ across settings and methods, revealing distinctive patterns and structures in the data for each configuration. The results for the final clusters in each setting are visually presented in appendix G (figures 18–20 for the GMM method, and figures 21–23 for the DPMM method).

Figure 3. — Cluster assignment for Georgia regions used for simulation studies.

Table 4.

True parameter vectors used in data generation for three clusters.

setting	cluster 1	cluster 2	cluster 3
1	(1, 0, 1, 0, 0.5, 2)	(1, 0.7, 0.3, 2, 0, 3)	(2, 1, 0.8, 1, 0, 1)
2	(2, 0, 1, 0, 4, 2)	(1, 0, 3, 2, 0, 3)	(4, 1, 0, 3, 0, 1)
3	(9, 0, − 4, 0, 2, 5)	(1, 7, 3, 6, 0, − 1)	(2, 0, 6, 1, 7, 0)


Table 5.

The cluster accuracy and cluster configuration using three different settings when there is spatial variation in the underlying true parameters based on the Georgia dataset.

	setting 1		setting 2		setting 3
	GMM	DP	GMM	DP	GMM	DP
RI for Dahl’s method	0.78	0.68	0.80	0.79	0.76	0.84
RI for mode method	0.76	0.64	0.76	0.82	0.85	0.88
no. clusters (Dahl’s method)	3	10	3	7	8	5
no. clusters (mode method)	3	7	3	4	5	6
cluster summary (Dahl’s method)	C1 = 54, C2 = 54, C3 = 51	C1 = 18, C2 = 10, C3 = 21, C4 = 7, C5 = 34, C6 = 18, C7 = 12, C8 = 17, C9 = 6, C10 = 16	C1 = 38, C2 = 42, C3 = 79	C1 = 44, C2 = 2, C3 = 21, C4 = 31, C5 = 58, C6 = 1, C7 = 2	C1 = 20, C2 = 19, C3 = 13, C4 = 29, C5 = 19, C6 = 41, C7 = 18	C1 = 41, C2 = 45, C3 = 63, C4 = 3, C5 = 7
cluster summary (mode method)	C1 = 57, C2 = 60, C3 = 42	C1 = 44, C2 = 3, C3 = 22, C4 = 51, C5 = 11, C6 = 23, C7 = 5	C1 = 36, C2 = 34, C3 = 89	C1 = 49, C2 = 17, C3 = 32, C4 = 61	C1 = 25, C2 = 51, C3 = 11, C4 = 65, C5 = 7	C1 = 1, C2 = 42, C3 = 47, C4 = 63, C5 = 2, C6 = 4


4. Real data analysis

In this section, we provide a comprehensive overview of our findings derived from applying the BGWR model in a real-world scenario. We explain the sources of the data employed and describe the cluster configurations associated with the BGWR parameter coefficients. Moreover, we present a thorough investigation into preschool attendance as estimated by the BGWR model, alongside the corresponding probability values. We also discuss the incorporation of dynamic variable selection into our analysis of the actual data, and we present a visual representation of substantively important coefficients for each region (figure 4).

Figure 4. — Real data case study flowchart summary.

4.1. Sources of the data

The Children’s Health Queensland (CHQ) has created an impressive resource called the CHQ Population Health Dashboard, which provides data on key health outcomes and socio-demographic factors for a 1-year period from 2018 to 2019. The dashboard is based on information from 528 small areas (statistical area level SA2) across the state of Queensland, Australia. The dashboard includes over 40 variables, visualized in a user-friendly format. The case study considered in this paper focuses on the health outcomes section of the dashboard, specifically, vulnerability indicators, which measure developmental vulnerability for children across five Australian Early Development Census (AEDC) domains:

(i) physical health and wellbeing, which evaluates children’s physical readiness for school, their level of physical independence, and their gross and fine motor skills;
(ii) social competence, which assesses children’s overall social competence, responsibility, respect, approaches to learning and readiness to explore new things;
(iii) emotional maturity, which examines children’s pro-social and helping behaviour, anxious and fearful behaviour, aggressive behaviour, and hyperactivity and inattention;
(iv) language and cognitive skills (school-based), which evaluates children’s basic literacy, interest in literacy, numeracy and memory, as well as their advanced literacy and basic numeracy; and
(v) communication skills and general knowledge, which measures children’s communication skills and general knowledge.

The AEDC data also includes two additional indicators: vulnerable on one or more domains (Vuln 1) and vulnerable on two or more domains (Vuln 2). Additionally, the CHQ dashboard encompasses data on socio-demographic factors that may be linked to health outcomes. Three factors considered in the analysis are the Socio-Economic Indexes for Areas (SEIFA) score, attendance at a preschool and remoteness. The SEIFA score is a socio-economic index that summarizes a variety of data on individual and family economic and social conditions in a given area. It ranges from 1 to 5, with a low score indicating greater disadvantage. The remoteness factor includes the categories of cities, regional, and remote. In 2018, Queensland had 294 SA2s in cities, 208 regional SA2s and 24 remote SA2s. The analysis uses data from the AEDC, which is conducted every 3 years and collects data on children in their first year of school. The data used in this study is from the 2018 census. Owing to the aggregated nature of the data, the analysis focuses on the proportion of vulnerable children within each SA2 [57], collected by first-year teachers across Australian government and non-government schools, with parents’ agreement. The final dataset therefore comprises the proportion of children who attended preschool, the SEIFA, and remoteness for each SA2.

Between 3 and 6% of the data had missing values, which were imputed using the average of the proportions from the neighbouring SA2s. For categorical data, such as remoteness, missing values were imputed using the most frequent category among the neighbouring SA2s. However, missing values for two islands could not be imputed as these regions have no contiguous neighbours. As a result, the analysis was carried out on the remaining 526 SA2 areas. In our analysis, all the aforementioned covariates are considered, with specific additional investigation focusing on the factor of attendance at preschool.
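The neighbour-based imputation can be sketched as follows in Python. The area names, values and adjacency lists are hypothetical toy inputs, and the function names are illustrative; areas without any observed neighbour (such as the two islands) are left missing.

```python
from collections import Counter
from typing import Dict, List, Optional


def impute_mean(
    values: Dict[str, Optional[float]], neighbours: Dict[str, List[str]]
) -> Dict[str, Optional[float]]:
    """Replace a missing proportion by the average over observed neighbouring areas."""
    out = dict(values)
    for area, v in values.items():
        if v is None:
            obs = [values[nb] for nb in neighbours.get(area, []) if values.get(nb) is not None]
            if obs:  # areas with no observed neighbour stay missing
                out[area] = sum(obs) / len(obs)
    return out


def impute_mode(
    values: Dict[str, Optional[str]], neighbours: Dict[str, List[str]]
) -> Dict[str, Optional[str]]:
    """Replace a missing category by the most frequent category among neighbours."""
    out = dict(values)
    for area, v in values.items():
        if v is None:
            obs = [values[nb] for nb in neighbours.get(area, []) if values.get(nb) is not None]
            if obs:
                out[area] = Counter(obs).most_common(1)[0][0]
    return out


# Hypothetical toy map: three contiguous areas, one further neighbour and an island.
adjacency = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"], "island": []}
preschool = {"A": 0.2, "B": None, "C": 0.4, "D": 0.6, "island": None}
remoteness = {"A": "cities", "B": None, "C": "cities", "D": "regional", "island": None}

preschool_imputed = impute_mean(preschool, adjacency)
remoteness_imputed = impute_mode(remoteness, adjacency)
```

In this toy example area B receives the average of its three neighbours and the category "cities", while the island remains missing in both variables.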

4.2. Case study analysis

The main objective of the analysis was to examine the factors that influence children’s developmental vulnerability in one or more domains (Vuln1) with a particular focus on the importance of attendance at preschool. Using the proposed approach, we aimed to identify clusters of regions that are similar with respect to the influence of attendance at preschool on Vuln1. The summary of this section can be found in figure 4.

The previous work on the BGWR model was designed to deal with continuous covariates [5]. However, in this paper, we extended the BGWR model to deal with categorical variables as well, specifically for the ‘remoteness factor’. Since a region can only belong to one of the three levels (cities, regional or remote), we created a list of covariates for each region for inclusion in the BGWR model. This avoids the critical issue of drawing from the prior when there is no valid information available for a particular category.

In §4.2.1, we explore inferences associated with the proportion of attendance at preschool, as derived from the BGWR model. In §4.2.2, we explore the efficiency of our variable selection algorithm. Finally, in §4.2.3, we discuss the probabilistic clustering results from the two algorithms. Additionally, we investigate the probability of each region belonging to specific clusters.

4.2.1. Inferences from the Bayesian geographically weighted regression

We used both the graph distance and the GCD in the BGWR model. We adopted a non-informative prior for the bandwidth by choosing D = 100, given that the maximum distance between any two points is 10, and estimated the optimal bandwidth as part of the analysis.

The MCMC algorithm was run for 400 000 iterations with a burn-in period of 20 000 iterations. The results reported are based on the remaining 380 000 iterations. We focused on exemplar insights from the BGWR model. First, the posterior mean and 95% credible interval were obtained for the coefficients associated with attendance at preschool in each region. The results are summarized in figure 5. It appears that there is a substantive negative relationship between the proportion of attendance at preschool and Vuln 1. This negative association is stronger in the southeast of Queensland, particularly in greater Brisbane, and comparatively weaker in the northern regions of Queensland. Several factors could contribute to this relationship, such as parental background, Indigenous status and access to preschool services. The second insight is on the evaluation of the non-stationary variance. Further analysis and insights from BGWR can be found in appendix C.
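The posterior summaries described above can be computed from a chain of draws as sketched below in Python. The chain here is a short simulated stand-in rather than output from the BGWR model, the function name is hypothetical, and the equal-tailed form of the 95% interval is an assumption, since the interval type is not specified here.

```python
import random
import statistics
from typing import Sequence, Tuple


def posterior_summary(chain: Sequence[float], burn_in: int) -> Tuple[float, float, float]:
    """Posterior mean and equal-tailed 95% credible interval after discarding burn-in."""
    kept = list(chain[burn_in:])
    mean = statistics.fmean(kept)
    # n=40 gives cut points at 2.5%, 5%, ..., 97.5%; keep the first and the last.
    cuts = statistics.quantiles(kept, n=40, method="inclusive")
    return mean, cuts[0], cuts[-1]


# Simulated stand-in for the draws of one regional preschool coefficient.
random.seed(0)
toy_chain = [random.gauss(-0.5, 0.1) for _ in range(5000)]
post_mean, lower, upper = posterior_summary(toy_chain, burn_in=1000)
```

Applying such a summary to the draws for each region gives the per-region means and intervals of the kind displayed in figure 5; a coefficient whose interval excludes zero would be read as substantively negative or positive.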

Figure 5. — (a) posterior means and 95% credible interval for the coefficients associated with preschool attendance, (b) the geographical distribution of the posterior means (the regions have been reordered to show the posterior mean range).

The table shows that the bi-square kernel outperforms the other two kernels, exponential and Gaussian, in all three criteria. Furthermore, we also observed a consistent pattern in the behaviour of the GCD and the graph distance when used with non-informative priors for the bandwidth in both our case study and simulated data. The results from both distance metrics provide similar posterior estimates.

4.2.2. Dynamic variable selection: real data analysis

The algorithm described in §2.4 was applied to the real case study. The summarized results presented in figure 7 show that the preschool factors have a substantive negative impact, particularly in southeast and central Queensland. In addition, the index of relative socio-economic disadvantage (IRSD) covariate is also important for some specific locations in the case study. Its effect becomes more obvious when focusing on southeast Queensland and the far north of Queensland. A summary of the posterior estimates obtained from the model is presented in figure 6. Figures 6 and 7 are interconnected: figure 6 depicts the spatial distribution of these posterior means on a map, while figure 7 illustrates the posterior mean for each covariate across the 526 SA2 data points (x-axis). The proposed algorithm highlights instances where certain factors, such as the IRSD factor, may not be significant for certain locations, as evidenced by credible intervals that include zero.

Figure 6. — The posterior summary from the proposed selection algorithm in 526 SA2 regions.

4.2.3. Probabilistic cluster analysis and its insights

We extracted 500 random iterations from the posterior distributions obtained from the BGWR analysis and applied the GMM and DPMM clustering algorithms to all the regression coefficients at each iteration. To determine the optimal number of clusters for the GMM, we employed the BIC across these samples. The analysis revealed that the optimal number of clusters is 4, and we present the optimal number of clusters in figure 24 from appendix H.

Using Dahl’s method, we obtained four clusters with sizes 103, 101, 220 and 102, whereas using the mode method we also obtained four clusters but with different sizes: 303, 102, 20 and 101. In our case study, Dahl’s method yielded more evenly sized clusters. The mode method, on the other hand, resulted in one notably large cluster and one exceptionally small one, making Dahl’s method the more consistent of the two. The spatial distribution of these cluster assignments using Dahl’s method and the mode method can be observed in figure 8.

Figure 8. — Spatial pattern of GMM clusters: case study results using (a) Dahl’s and (b) mode methods.

We present the probability of each region belonging to a specific cluster from the GMM compared with the mode method in the heatmap. Figure 9 shows these probabilities for 10 regions in Queensland. Each row in the heatmap represents one of the four possible values (1, 2, 3, 4) indicating the clusters, while each column corresponds to one of the first 10 regions in Queensland. The colour intensity in each cell represents the probability of the corresponding region belonging to the respective cluster: lighter shades signify higher probabilities, while darker shades indicate lower probabilities. A complete view of the probabilities for all regions in Queensland is provided in appendix H (figures 26–30).

The cluster configuration process of DPMM happens in two stages. The first stage is a within-sample cluster configuration, obtained for each of the 500 randomly chosen iterations in the MCMC analysis of the DPMM. The configurations are based on Dahl’s method and the mode method. The second stage is an across-sample cluster configuration. In this stage, we consider the entire set of 500 samples and their corresponding cluster configurations obtained from the first stage and perform the cluster configuration again to get the final cluster configuration for each region (see appendix F). For our case study, the final number of clusters using Dahl’s method is 12, with sizes as follows: 11, 196, 1, 1, 2, 1, 1, 107, 3, 3, 100 and 100. The mode method found five clusters with sizes 1, 212, 109, 101 and 103. Dahl’s method therefore resulted in more clusters. However, the maps from both methods show overlapping regions, with specific differences highlighted in figure 10. We also calculated the probabilities of each region belonging to one of the 12 clusters using Dahl’s method. Figure 11 presents the probabilities for the first 10 regions. The complete probabilities are available in appendix H (figures 31–36).

Figure 10. — The spatial distribution of the cluster configuration obtained from DPMM using (a) Dahl’s method as well as (b) the mode method.

Figure 11. — The probabilities of each region in Queensland belonging to specific clusters obtained from the DPMM.

5. Discussion

In this section, we discuss key aspects of our clustered BGWR approach for estimating and clustering regression coefficients in spatially heterogeneous data. Firstly, our model demonstrates versatility by accommodating both continuous and categorical variables. The introduction of a new vectorization technique has significantly improved MCMC sampling efficiency. This enhancement allows scalable analyses of large-scale datasets without the need for sub-regional data segmentation. Secondly, an intriguing feature of the BGWR model is its ability to identify covariates that are important in some locations but have minimal impact in others. This capability enhances our understanding of spatial variations in covariate effects. Thirdly, we extend our approach to include probabilistic clustering, substantially expanding the use of the analysis. This extension provides valuable insights into similar geographical regions, particularly relevant in subdomains such as health. Additionally, our exploration of various weighting schemes based on graph distance and GCD reveals robust parameter estimation capabilities, especially when confronting spatially heterogeneous data. Our proposed methods were successfully implemented in R using the nimble computational framework [36]. The interpretation of results from the clustered BGWR model represents a substantial advancement in our understanding of developmental vulnerability in children. The model’s capability to discern spatially variable covariate effects unveils nuanced patterns across geographical regions. For instance, the model highlights a significant negative relationship between preschool attendance and vulnerability indicators, particularly accentuated in southeast Queensland. This spatial variation prompts a deeper exploration of the underlying factors contributing to this regional discrepancy.

The dynamic variable selection algorithm provides valuable insights into the changing significance of covariates across iterations. The substantive negative impact of preschool factors in specific regions, coupled with the influence of the IRSD in southeast and far north Queensland, underscores the intricate interplay of factors influencing children’s developmental vulnerability.

The extension of the BGWR model to incorporate probabilistic clustering enriches the depth of our analysis. The use of clustering algorithms reveals distinct spatial patterns in regression coefficients, enabling the identification of regions with similar influences on vulnerability indicators. The comparison between GMM and DPMM methods provides a robust understanding of the uncertainty and variability associated with clustering.

In terms of real-world application, the findings from the case study using data from the CHQ Population Health Dashboard have direct implications for policy and intervention strategies. The spatial variations uncovered in factors influencing developmental vulnerability enable targeted and region-specific initiatives. Regions exhibiting a stronger negative association between preschool attendance and vulnerability may necessitate tailored interventions addressing underlying socio-economic factors.

Furthermore, the insights derived from the analysis, particularly the identification of clusters with similar influences on vulnerability indicators, offer valuable inputs for health and education planning. Policymakers can use this information to allocate resources more effectively, focusing on regions with specific needs related to children’s developmental outcomes.

Notwithstanding the above contributions, the study has certain limitations. The reliance on data from a specific dashboard may introduce biases, and future research could explore additional datasets for validation and extension of findings. The complexity of the BGWR model necessitates computational resources, and future research might explore advanced computational techniques for enhanced scalability. Additionally, the generalizability of findings to other regions or countries remains an open question, and future studies could replicate the analysis in diverse geographical contexts. Exploring temporal dynamics in greater detail could be a promising avenue for future research, providing a more comprehensive understanding of how covariate effects evolve over longer time periods and informing intervention planning.

5.1. Future research directions

While our current work presents promising results, several avenues for future research extend beyond the scope of this paper.

Our Bayesian approach could be extended to handle generalized linear models (GLMs), broadening its applicability to a diverse range of regression models and data types. This extension involves adapting the algorithm to accommodate various link functions and likelihood distributions [58].

Furthermore, the integration of penalized methods such as ridge regression, lasso or elastic net regularization into our approach could enhance its efficiency in handling high-dimensional data, potentially improving parameter estimation accuracy and stability.

Developing a robust framework to determine optimal bandwidths is essential for ensuring the efficiency and accuracy of the BGWR algorithm, particularly when dealing with spatially varying covariate effects within a Bayesian framework [59].

For improved model interpretability and identification of relevant covariates, exploring appropriate approaches for variable selection in the context of clustered regression is warranted.

Using the DPMM and GMM to obtain clustering information of regression coefficients shows promise. However, further investigation is needed to address inconsistencies in the posterior on the number of clusters [60].

To enhance the reliability and robustness of the proposed algorithm, especially when dealing with spatially dependent data and clustered structures, a prior for spatially clustered regression coefficients should be developed [58,60].

Developing a comprehensive spatial probabilistic framework would provide a solid foundation for handling complex spatial structures. This framework could incorporate various spatial components and enable integration with other spatial models and techniques.

Investigating the connection of the proposed BGWR with latent variable models might lead to more advanced and efficient clustering techniques.

Another promising direction is to extend the proposed algorithm for a multivariate BGWR model, considering multiple response variables simultaneously. This could enhance the model’s capability to capture complex spatial dependencies and better understand the relationships between variables in a spatial context.

6. Conclusion

In this paper, we have achieved three main contributions. The first is the combination of BGWR with clustering via the GMM and the DPMM to detect patterns in how different factors influence geographical data. The second is the optimization of the computational algorithm to work well with large geographical areas and many spatial regions, leading to faster estimations. The third is the inclusion of the dynamic variable selection for each location in the BGWR model.

In addition to validating the effectiveness of our method through substantive simulated studies, we considered a case study focusing on children’s development in Queensland. Real data from CHQ and the Australian Early Development Census were used for this study. By examining vulnerability in at least one of five critical development domains (physical health and wellbeing, social competence, emotional maturity, language and cognitive skills (school-based), and communication skills and general knowledge), our approach successfully identified clusters of regions with similar developmental vulnerabilities. This discovery holds significant implications for health services planning and early intervention efforts, aiming to enhance the overall wellbeing and success of children during their formative years. Furthermore, we conducted an in-depth exploration of the preschool attendance covariate using the BGWR model.

Overall, our research contributes to the field of spatial regression analysis by offering a powerful and efficient tool for understanding spatially varying patterns in the relationship between risk variables and responses and opens up new avenues for exploring spatial clusters and their implications. Our approach can be used in research in various domains beyond child development, such as health services planning and spatial analysis in general.

Appendix A

A.1. Spatial distances

When working with areal data, the graph distance is an alternative distance metric that can be used. It is based on the concept of a graph, where V = {v ₁, … , v _m} represents the set of nodes (vertices) and E(G) = {e ₁, …, e _n} represents the set of edges connecting these nodes. The graph distance is defined as the distance between any two nodes in the graph:

d_{v_{s} v_{i}} = \begin{cases} | V (e) | & \text{if } e \text{ is the shortest distance connecting a pair of nodes} \\ \infty & \text{if the two nodes are not connected} \end{cases},

where |V(e)| is the number of edges in e [33]. Choosing the right settings for distance methods can be tricky and depends on ‘how near is really near’. One way to decide this is by using graph distance. If regions have common boundaries, they have a graph distance of 1, which means they are close. However, if they have a distance of more than 1, they are not so close. Regions that are not very close should receive less weight in the calculations. The graph distance-based weighted function is given by:

W (s) = \begin{cases} 1 & \text{if } d_{i} (s) \leq b \\ f (d_{i} (s), b) & \text{otherwise} \end{cases},

where d is the graph distance, f is a weighting function and b represents the bandwidth. In this study, we suppose that f() is a negative exponential function [5], so that:

W (s) = \begin{cases} 1 & \text{if } d_{i} (s) \leq 1 \\ e^{(- d_{i} (s) / b)} & \text{otherwise} \end{cases} .
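The graph distance and the resulting weights can be illustrated with a small Python sketch. The adjacency lists below describe a hypothetical four-area map, and the function names are illustrative; the graph distance is obtained by breadth-first search and the weight follows the negative exponential form above.

```python
import math
from collections import deque
from typing import Dict, List


def graph_distances(adjacency: Dict[str, List[str]], source: str) -> Dict[str, float]:
    """Number of edges on the shortest path from source to every node (inf if unreachable)."""
    dist = {node: math.inf for node in adjacency}
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if dist[v] == math.inf:  # first visit gives the shortest path length
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def graph_weight(d: float, b: float) -> float:
    """Weight 1 for the area itself and its direct neighbours, exp(-d / b) otherwise."""
    return 1.0 if d <= 1 else math.exp(-d / b)


# Hypothetical map: A-B-C form a chain and D has no contiguous neighbours.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
dist_from_a = graph_distances(adjacency, "A")
weights_from_a = {node: graph_weight(d, b=100.0) for node, d in dist_from_a.items()}
```

Under this form, an area that is not connected to the source has infinite graph distance and therefore weight zero, while every connected area receives a strictly positive weight.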

Another way to calculate the distance from the areal data is the GCD which calculates the shortest distance d _i(s) between two points on the surface of a sphere (e.g. Earth) using their latitude and longitude coordinates. This accounts for the curvature of the Earth and gives the distance along the surface of the sphere, following the path of a great circle. The great circle represents the largest circle that can be drawn on a sphere and passes through the two points. The GCD is given as [34]:

d_{i} (s) = r \cdot \cos^{- 1} (\cos (a_{s}) \cdot \cos (a_{i}) \cdot \cos (b_{s} - b_{i}) + \sin (a_{s}) \cdot \sin (a_{i})) .

A 1

To calculate the GCD, the formula involves trigonometric functions. Specifically, it calculates the arc-cosine of a term involving the cosines and sines of the latitudes and longitudes of the two points. a _s and b _s represent the latitude and longitude for location s, and a _i and b _i represent the latitude and longitude for location i. The term inside the arc-cosine represents the dot product of two vectors in a three-dimensional space, which is used to determine the angle between the two points. The inverse cosine function of this dot product yields the angle, and by multiplying it with the Earth’s radius r, we get the GCD between the two points [34].
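Equation (A 1) can be sketched in Python as follows. The function name is hypothetical, coordinates are assumed to be supplied in degrees, and the mean Earth radius of 6371 km is an assumed value for r, which is left unspecified above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; the text only calls it r


def great_circle_distance(
    lat_s: float, lon_s: float, lat_i: float, lon_i: float, r: float = EARTH_RADIUS_KM
) -> float:
    """Great circle distance between locations s and i, following equation (A 1)."""
    a_s, b_s, a_i, b_i = map(math.radians, (lat_s, lon_s, lat_i, lon_i))
    cos_angle = (
        math.cos(a_s) * math.cos(a_i) * math.cos(b_s - b_i)
        + math.sin(a_s) * math.sin(a_i)
    )
    # Guard against rounding error pushing the value slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return r * math.acos(cos_angle)


# Two points on the equator separated by 90 degrees of longitude:
# a quarter of a great circle, i.e. r * pi / 2.
quarter_circle = great_circle_distance(0.0, 0.0, 0.0, 90.0)
```

The arc-cosine form can lose precision for points that are very close together, which is why the argument is clamped before `math.acos` is called.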

The weighting scheme introduced in graph distance and GCD, where neighbours are assigned the same weight and all others receive positive weight, provides additional assurance of parameter estimation stability when compared with the other weighting schemes.

Appendix B. Metropolis-Hastings sampling for Bayesian geographically weighted regression

This section explains how to draw samples from the posterior distribution of the BGWR model parameters, considering the spatial dependencies, likelihood function and prior distributions.

Initialization:

—
start with an initial set of values for all parameters: β(s), σ ²(s), $σ_{β}$ and b.

Iterative sampling process: for each iteration until the desired number of samples:

—
proposal generation: propose new values for each model parameter based on their respective proposal distributions and a random walk step:
- —
  coefficients β(s): propose $β^{'} (s) = β (s) + ϵ$ where $ϵ \sim N (0, σ_{β}^{2})$ ;
- —
  spatial variance σ ²(s): given the prior is σ ²(s) ∼ IGamma(α ₁, α ₂), propose σ′²(s) = σ ²(s) + δ where δ is drawn from some distribution (e.g. Gaussian);
- —
  coefficient variance $σ_{β}$ : given the prior is $σ_{β} \sim IGamma (α, β)$ , propose $σ_{β}^{'} = σ_{β} + ζ$ where ζ is drawn from some distribution;
- —
  bandwidth b: propose b′ = b + η, where η is drawn from some distribution (e.g. uniform).

Compute acceptance ratio R:

—

coefficients β(s):

R_{β} = \frac{π^{- n / 2} \cdot σ^{' - n} (s) \cdot | W (s) |^{- 1 / 2} \exp (- \frac{1}{2} \sum_{i} w_{i}^{- 1} (s) \cdot {(y (s) - x^{T} (s) β^{'} (s))}^{2}) \times P (β^{'} (s))}{π^{- n / 2} \cdot σ^{- n} (s) \cdot | W (s) |^{- 1 / 2} \exp (- \frac{1}{2} \sum_{i} w_{i}^{- 1} (s) \cdot {(y (s) - x^{T} (s) β (s))}^{2}) \times P (β (s))},

—

compute acceptance ratio $R_{σ^{2}}$ :

R_{σ^{2}} = \frac{π^{- n / 2} \cdot σ^{' - n} (s) \cdot | W (s) |^{- 1 / 2} \exp (- \frac{1}{2} \sum_{i} w_{i}^{- 1} (s) \cdot {(y (s) - x^{T} (s) β (s))}^{2}) \times P (σ^{' 2} (s))}{π^{- n / 2} \cdot σ^{- n} (s) \cdot | W (s) |^{- 1 / 2} \exp (- \frac{1}{2} \sum_{i} w_{i}^{- 1} (s) \cdot {(y (s) - x^{T} (s) β (s))}^{2}) \times P (σ^{2} (s))},

—
compute acceptance ratio $R_{σ_{β}}$ :
$R_{σ_{β}} = \frac{π^{- n / 2} \cdot σ^{' - n} (s) \cdot | W (s) |^{- 1 / 2} \exp (- \frac{1}{2} \sum_{i} w_{i}^{- 1} (s) \cdot {(y (s) - x^{T} (s) β (s))}^{2}) \times P (σ_{β}^{'})}{π^{- n / 2} \cdot σ^{- n} (s) \cdot | W (s) |^{- 1 / 2} \exp (- \frac{1}{2} \sum_{i} w_{i}^{- 1} (s) \cdot {(y (s) - x^{T} (s) β (s))}^{2}) \times P (σ_{β})},$
—
compute acceptance ratio R _b:
$R_{b} = \frac{π^{- n / 2} \cdot σ^{- n} (s) \cdot | W (s) |^{- 1 / 2} \exp (- \frac{1}{2} \sum_{i} w_{i}^{- 1} (s) \cdot {(y (s) - x^{T} (s) β (s))}^{2}) \times P (b^{'})}{π^{- n / 2} \cdot σ^{- n} (s) \cdot | W (s) |^{- 1 / 2} \exp (- \frac{1}{2} \sum_{i} w_{i}^{- 1} (s) \cdot {(y (s) - x^{T} (s) β (s))}^{2}) \times P (b)} .$

Evaluate:

—
R (be it $R_{β}$ , $R_{σ^{2}}$ , $R_{σ_{β}}$ or R _b).

Accept or reject:

—
draw a random value u from a uniform distribution U(0, 1). If $u < min (1, R)$ , accept the proposed parameters. Otherwise, retain the current parameter values.

Diagnostics:

—
repeat the above steps for the desired number of iterations, until convergence.
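The propose, evaluate and accept-or-reject cycle described in this appendix can be sketched generically in Python. This is not the BGWR sampler itself: the target below is a hypothetical toy log-posterior (independent standard normals), the Gaussian random-walk step sizes are arbitrary, and the function names are invented for illustration. Because the random-walk proposal is symmetric, the acceptance ratio reduces to the ratio of likelihood times prior, which is evaluated on the log scale for numerical stability.

```python
import math
import random


def random_walk_metropolis(log_posterior, initial, step_sizes, n_iter, seed=0):
    """Update each parameter in turn with a Gaussian random-walk proposal."""
    rng = random.Random(seed)
    current = list(initial)
    current_lp = log_posterior(current)
    samples = []
    n_accepted = 0
    n_proposed = 0
    for _ in range(n_iter):
        for k, step in enumerate(step_sizes):
            # Proposal generation: perturb one parameter, keep the others fixed.
            proposal = list(current)
            proposal[k] += rng.gauss(0.0, step)
            proposal_lp = log_posterior(proposal)
            # Log acceptance ratio; proposal densities cancel by symmetry.
            log_r = proposal_lp - current_lp
            n_proposed += 1
            # Accept with probability min(1, R), otherwise retain current values.
            if log_r >= 0.0 or rng.random() < math.exp(log_r):
                current, current_lp = proposal, proposal_lp
                n_accepted += 1
        samples.append(list(current))
    return samples, n_accepted / n_proposed


def toy_log_posterior(theta):
    """Hypothetical target (not the BGWR posterior): independent standard normals."""
    return -0.5 * sum(x * x for x in theta)


samples, acceptance_rate = random_walk_metropolis(
    toy_log_posterior, initial=[0.0, 0.0], step_sizes=[1.0, 1.0], n_iter=5000
)
```

To adapt the sketch, `toy_log_posterior` would be replaced by a function returning the log-likelihood plus log-prior of the model at hand, returning negative infinity for invalid values such as a negative variance.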

Appendix C. Probabilistic analysis

Figure 12 provides valuable insights into the variability and disparity among regions by depicting a scatter plot of the posterior means and variances derived from the BGWR model. One notable observation is the inverse relationship between posterior means and variances across regions. Regions with smaller posterior means tend to exhibit larger variances, indicating greater variability in the estimated coefficients within those areas. Additionally, the influence of the three levels of remoteness factor in the case study is apparent, as it affects the observed relationship between posterior means and variances. Furthermore, figure 12 facilitates comparisons between different regions by selecting two regions from urban areas, two from suburban or outer regions, and two from remote areas. This deliberate selection allows for the identification of both differences and similarities in the estimated coefficients among these diverse regions. For instance, comparing regions such as Thorlands and Clayfield, representing urban areas, with regions like Rockhampton region-West and Lockyer Valley-East, representing outer regions, may reveal distinct patterns in the spatial distribution of coefficients. Similarly, examining regions such as the Ingham Region and Kowanyama Pormpuraaw, representing remote areas, can provide insights into unique spatial dynamics and contextual factors influencing the coefficients within these regions. A visualization of the correlation between these six regions is shown in figure 13, which reveals a slight association among the posterior estimates for preschool attendance. Figure 14 presents a comparative analysis of probabilities among the six regions regarding the posterior estimates of the coefficient associated with attendance at preschool. The probabilities are computed by comparing the posterior draws of the coefficient for one region with those of other regions, for each MCMC simulation. 
The vulnerabilities in our study are expressed as proportions ranging between 0 and 1. Given this scale, large coefficient values would not only be unexpected but also potentially misleading. Instead, smaller coefficients are more appropriate and consistent with the data’s inherent structure. Additionally, while the range of these coefficients might appear limited, even small variations can be meaningful. Especially, in a context where the dependent variable operates on such a narrow scale, even slight changes in predictor values can lead to significant shifts in outcomes. Therefore, allowing the coefficients to vary, even within a small range, is essential. This flexibility ensures that our model can capture and represent subtle spatial patterns and gradients in vulnerabilities, which might be critical for understanding and addressing them effectively.

Figure 13. — The correlation between the posterior distributions of the attendance at preschool parameter across the six regions.

Assume β _p(i) represents the coefficient for preschool attendance at location i. In figure 14, probabilities are computed using P _r(β _p(k) > β _p(q)), the probability that the coefficient at location k is greater than that at location q. For each posterior sample, an indicator function assigns 1 when β _p(k) > β _p(q) and 0 otherwise. The resulting probability is the sum of these indicators divided by the total number of samples. For instance, the value at row 5, column 3 (0.87) denotes the probability that the preschool attendance coefficient in Kowanyama Pormpuraaw exceeds that in Ingham.
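The indicator-based calculation can be illustrated with a short Python sketch. The posterior draws below are made-up numbers, not values from the case study, and the function name is hypothetical.

```python
def exceedance_probability(draws_k, draws_q):
    """Proportion of paired posterior draws in which region k exceeds region q."""
    if len(draws_k) != len(draws_q):
        raise ValueError("both regions need the same number of posterior draws")
    indicators = [1 if bk > bq else 0 for bk, bq in zip(draws_k, draws_q)]
    return sum(indicators) / len(indicators)


# Made-up posterior draws for two regions, paired by MCMC iteration.
draws_k = [0.12, 0.08, 0.15, 0.11]
draws_q = [0.10, 0.09, 0.07, 0.05]
prob = exceedance_probability(draws_k, draws_q)  # 3 of 4 draws exceed, so 0.75
```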

Extending the analysis to explore $P_{r} (β_{p} (i) > \bar{β_{p}} (i))$ , where $\bar{β_{p}} (i)$ is the posterior mean for preschool attendance in region i, figure 15 visualizes the resulting probabilities, highlighting variability among regions. Over 300 SA2 regions have a probability of less than 0.5 of exceeding their posterior mean, suggesting potential overestimation of coefficient values. This uncertainty has implications for decision-making.

Furthermore, this uncertainty underscores the importance of conducting sensitivity analyses and exploring alternative model specifications to assess the robustness of the findings and account for the variability in the estimated coefficients across different regions. Additionally, further research and data collection efforts may be warranted to improve the precision and reliability of the estimates for these regions.

The distribution of these probabilities is depicted in figure 16. Regions with higher probabilities are highlighted, indicating a greater probability of exceeding the posterior mean for that particular region. In this plot, 101 SA2 regions exhibit a probability greater than 0.8 of having $(β_{p} (i) > \bar{β_{p}})$ across M iterations.

Additionally, our analysis involved calculating the probability of each region being among the top 10 regions for the coefficient associated with attendance at preschool, β _p. Further analysis can be found in appendix C. To achieve this, we performed the following steps: for each iteration in our study (M iterations in total), we obtained a posterior draw for β _p in each of the 526 SA2 regions. These draws were then sorted, and the top 10 were identified. An indicator variable was created for each region at each iteration, taking the value 1 if the region was among the top 10 and 0 otherwise. This process was repeated for all iterations in the posterior. After obtaining the indicator values for all iterations, we calculated the probabilities by summing the values corresponding to a specific region and dividing by the total number of iterations. This gave us the probability of each region being in the top 10, which is shown in figure 17. This visual representation highlights that the far north of Queensland has a higher probability of being in the top 10 than southeast Queensland.

Finally, an assessment was undertaken of the impact of different kernel functions in the BGWR analysis of the case study. The exponential, bi-square and Gaussian functions were considered. Results are presented in table 6. As in the simulated study in §3, table 3, the performance of the algorithm is evaluated based on three criteria: WAIC, DIC and P _D (effective number of parameters).
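The top 10 probability calculation described above can be sketched in Python as follows. The sketch assumes that ‘top’ means the largest coefficient values, and it uses a tiny made-up matrix of draws (with k = 1 rather than 10) so that the result is easy to check by hand; the function name is hypothetical.

```python
def top_k_probabilities(draws, k=10):
    """draws[m][r] is the posterior draw for region r at iteration m."""
    n_iter = len(draws)
    n_regions = len(draws[0])
    counts = [0] * n_regions
    for row in draws:
        # Sort region indices by their draw, largest first (assumed meaning of 'top').
        ranked = sorted(range(n_regions), key=lambda r: row[r], reverse=True)
        for r in ranked[:k]:
            counts[r] += 1  # indicator equals 1 for regions in the top k
    return [c / n_iter for c in counts]


# Made-up draws: 4 iterations (rows) for 3 regions (columns).
draws = [
    [0.3, 0.1, 0.2],
    [0.1, 0.3, 0.2],
    [0.3, 0.2, 0.1],
    [0.2, 0.1, 0.3],
]
probs = top_k_probabilities(draws, k=1)  # [0.5, 0.25, 0.25]
```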

Figure 17. — The probability of each region on the map to be in the top 10.

Table 6.

WAIC, DIC and effective number of parameters (P _D) values from the proposed BGWR algorithm using different kernels for the case study.

	exponential kernel	bi-square kernel	Gaussian kernel
WAIC	−801304.5	−802014.9	−801161.4
DIC	−742933.6	−743896.3	−744056.8
P _D	3253.3	3176.5	3176.5


Appendix D. Reversible jump Markov chain Monte Carlo algorithm

RJMCMC lets the chain switch between models with different parameter counts. In this paper, RJMCMC is used to assess the inclusion or exclusion of predictors at each spatial location. This section explains the RJMCMC steps mentioned in §2.4 and how they are used to find important local variables.

Initialization:

—
initialize all parameters of the model, i.e. choose initial values for γ _j(s), ψ, β(s), σ ²(s) and b;
—
example: set γ _j(s) = 1 for all s and j (all predictors are included initially);
—
set ψ = 0.5, reflecting the assumption that each predictor is equally likely to be included or excluded;
—
initialize β(s) and σ ²(s) based on a simple linear regression with all predictors included;
—
initialize b based on a suitable initial value (e.g. 10).

Proposal step:

—
select a predictor j and a spatial location s randomly;
—
propose a change in the model. If γ _j(s) = 0 (predictor j is currently excluded), propose to include predictor j in the model (γ _j(s) = 1). If γ _j(s) = 1 (predictor j is currently included), propose to exclude predictor j from the model (γ _j(s) = 0);
—
propose a new value for b by adding a small random change to the current value.

Calculation of acceptance ratio R: compute the acceptance ratio as:

R = \frac{p (y \mid \text{new model}) \times p (\text{new model}) \times q (\text{delete})}{p (y \mid \text{current model}) \times p (\text{current model}) \times q (\text{add})} \times J,

where

—
likelihood of data given model p(y|model): the likelihood function is given by the normal distribution:
$p (y | β (s), x, W (s), σ^{2} (s)) = π^{- (n / 2)} \cdot σ^{- n} (s) \cdot | W (s) |^{- (1 / 2)} \cdot \exp (- \frac{1}{2} \sum_{i} w_{i}^{- 1} (s) \cdot {(y (s) - x^{T} (s) β (s))}^{2}),$
—
prior probability of model p(model): this term includes the prior distributions of the model parameters: a Bernoulli prior for γ _j(s), a Beta prior for ψ, a $N (0, Σ_{β})$ prior for β(s), an IGamma(α ₁, α ₂) prior for σ ²(s) and a uniform(0, D) prior for b;
—
proposal densities q(add), q(delete): these represent the probability of proposing to add or delete a predictor. In this case, let us assume q(add) = q(delete) = 1/p, where p is the total number of predictors. This assumes that each predictor is equally likely to be proposed for addition or deletion;
—
Jacobian J : the Jacobian accounts for the change in the dimensionality of the parameter space when adding or deleting predictors. In this case, adding a predictor increases the dimension by one, while deleting reduces it by one. We use the Jacobian to properly adjust our calculations in the RJMCMC algorithm. Specifically, we apply the reverse of the Jacobian that was used when we proposed changing the model. This helps to ensure our probabilities are correct and the Markov chain remains balanced.

Acceptance/rejection step:

—
draw u ∼ Uniform(0, 1). If $u < min (1, R)$ , accept the proposed change in the model; otherwise, keep the current model.

Repeat:

—
iterate the above steps, allowing the Markov chain to jump between different model spaces, determining the inclusion/exclusion of predictors for each spatial location s, and updating the parameters β(s), σ ²(s) and b.

Post-processing:

—
after running the RJMCMC for a large number of iterations, post-process the Markov chain to obtain posterior distributions of the parameters β(s), σ ²(s), γ _j(s), b and ψ. These posterior distributions provide estimates of the parameters along with measures of uncertainty.
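The proposal and acceptance steps above can be illustrated with a small Python sketch. It only shows the bookkeeping of one add/delete move on the log scale; it does not implement the BGWR likelihood or priors, and every numerical value and function name below is hypothetical. The proposal probabilities are taken as 1/p for both directions, as assumed in the text, so they cancel in the ratio, and the log-Jacobian is set to zero purely as an assumption of the sketch.

```python
import math
import random


def propose_flip(gamma, rng):
    """Pick one predictor at random and flip its inclusion indicator."""
    j = rng.randrange(len(gamma))
    proposal = list(gamma)
    proposal[j] = 1 - proposal[j]
    return j, proposal


def log_acceptance_ratio(loglik_new, loglik_cur, logprior_new, logprior_cur,
                         log_q_reverse, log_q_forward, log_jacobian=0.0):
    """log R = log of [p(y|new) p(new) q(reverse)] / [p(y|cur) p(cur) q(forward)] times J."""
    return ((loglik_new + logprior_new + log_q_reverse)
            - (loglik_cur + logprior_cur + log_q_forward)
            + log_jacobian)


def accept(log_r, rng):
    """Accept with probability min(1, R)."""
    return log_r >= 0.0 or rng.random() < math.exp(log_r)


rng = random.Random(1)
gamma = [1, 1, 1, 1]                 # all p = 4 predictors included initially
j, gamma_new = propose_flip(gamma, rng)

log_q = math.log(1.0 / len(gamma))   # q(add) = q(delete) = 1/p
# Hypothetical log-likelihood and log-prior values for the two models.
log_r = log_acceptance_ratio(loglik_new=-10.0, loglik_cur=-12.0,
                             logprior_new=-2.0, logprior_cur=-1.5,
                             log_q_reverse=log_q, log_q_forward=log_q)
if accept(log_r, rng):
    gamma = gamma_new
```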

Appendix E. Model assessment

The choice of spatial weighting function influences the fitted GWR model, and various weighting functions can be incorporated into it. To determine the optimal fit for the data, we employed standard evaluation tools, including WAIC [61] and DIC [62].

WAIC is commonly computed from samples of the posterior distribution of interest using a pointwise log predictive density. WAIC is given as:

\begin{aligned} WAIC & = - 2 \sum_{i = 1}^{n} \log \tilde{p} (y (s) | y) + 2 V \\ & = - 2 \sum_{i = 1}^{n} \log \int p (y (s) | θ) p (θ | y) d θ + 2 V, \end{aligned}

E 1

where $V = \sum_{i = 1}^{i = n} V_{i}$ is the sum of the posterior variance of the pointwise log-likelihoods:

V_{i} = {Var}_{θ | y} \log (p (y (s) | θ)),

E 2

WAIC can be estimated using posterior samples from p(θ|y) to evaluate $\tilde{p} (y (s) | y)$ and V.

To find WAIC for the BGWR model, for each data point y(s): calculate the log-likelihood for each posterior sample j = 1, 2, …, M using the posterior estimates $\bar{β} (s)$ , $\bar{W} (i)$ and ${\bar{σ}}^{2} (s)$

\log f (y (s) | \bar{β} (s), X, \bar{W} (i), {\bar{σ}}^{2} (s)) = \log [π^{- n / 2} \cdot {\bar{σ}}^{- n} \cdot | \bar{W} |^{- 1 / 2} \cdot \exp (- \frac{1}{2} {\bar{σ}}^{- 2} {(y (s) - X \bar{β})}^{T} \cdot {\bar{W}}^{- 1} \cdot (y (s) - X \bar{β}))],

E 3

then calculate the mean log-likelihood over all posterior samples for data point y(s):

\bar{\log} f (y (s)) = \frac{1}{M} \sum_{j = 1}^{M} \log f (y (s) | \bar{β} (s), X, \bar{W} (i), {\bar{σ}}^{2} (s)) .

E 4

Take the squared difference between the individual log-likelihood and the mean log-likelihood, and take the average over all the M posterior samples:

V_{i} = \frac{1}{M} \sum_{j = 1}^{M} {(\log f (y (s) | \bar{β} (s), X, \bar{W} (i), {\bar{σ}}^{2} (s)) - \bar{\log} f (y (s)))}^{2} .

E 5

Sum up all the individual V _i values to get the total V:

V = \sum_{i = 1}^{n} V_{i} .

E 6

Calculate the log pointwise predictive density for each data point y(s):

\log \tilde{p} (y (s) | y) = \log f (y (s) | \bar{β} (s), X, \bar{W} (i), {\bar{σ}}^{2} (s)),

E 7

finally, calculate the WAIC for the BGWR model using the formula:

WAIC = - 2 \sum_{i = 1}^{n} \log \tilde{p} (y (s) | y) + 2 V,

E 8

where V is the sum of the posterior variances V _i for all data points, and $\log \tilde{p} (y (s) | y)$ is the log pointwise predictive density for each data point. A smaller WAIC value indicates a better-fitting model.
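As a hedged illustration, WAIC can be computed from a matrix of pointwise log-likelihood values with a few lines of Python. The sketch assumes the log-likelihoods have already been evaluated for each posterior sample and each data point, and it approximates the integral in equation (E 1) by averaging the likelihood over posterior samples, which is the common Monte Carlo estimate rather than a plug-in evaluation at posterior estimates. The variance uses the 1/M divisor of equation (E 5). The function name and toy numbers are hypothetical.

```python
import math


def waic(loglik):
    """loglik[j][i]: log-likelihood of data point i under posterior sample j."""
    n_samples = len(loglik)
    n_obs = len(loglik[0])
    lppd = 0.0     # log pointwise predictive density, summed over data points
    v_total = 0.0  # V: sum of posterior variances of the pointwise log-likelihoods
    for i in range(n_obs):
        column = [loglik[j][i] for j in range(n_samples)]
        # log of the sample-averaged likelihood, computed stably via log-sum-exp.
        peak = max(column)
        lppd += peak + math.log(sum(math.exp(v - peak) for v in column) / n_samples)
        # V_i: mean squared deviation from the mean log-likelihood.
        mean_ll = sum(column) / n_samples
        v_total += sum((v - mean_ll) ** 2 for v in column) / n_samples
    return -2.0 * lppd + 2.0 * v_total, lppd, v_total


# Toy input: 2 posterior samples, 2 data points, identical across samples.
value, lppd, v_total = waic([[-1.0, -2.0], [-1.0, -2.0]])
```

With identical rows the variance term vanishes, so the toy call returns WAIC = 6 with lppd = −3 and V = 0.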

The DIC is given as:

DIC = Dev (\bar{θ}) + 2 p_{D} .

E 9

In this context, θ denotes the parameters of interest, and $\bar{θ}$ represents the posterior mean. The deviance function, denoted as Dev( · ), is used, and the effective number of parameters in the model, denoted as p _D, is calculated using the equation $p_{D} = \bar{Dev} (θ) - Dev (\bar{θ})$ . The specific expression for the deviance function is provided in [63] and is as follows:

\begin{aligned} Dev (β (s), W (s), σ^{2} (s)) & = - 2 \log f (Y | β (s), X, W (s), σ^{2} (s)) \\ & = n \log (2 π) + \log (σ^{2} (s)) - \log (| W (s) |) + {(Y - X β (s))}^{T} σ^{- 2} W (s) (Y - X β (s)) . \end{aligned}

E 10

The DIC for GWR can be summarized as

\begin{aligned} DIC & = Dev (\bar{β} (s), \bar{W} (i), {\bar{σ}}^{2} (s)) + 2 p_{D} \\ & = 2 \bar{Dev} (β (s), W (s), σ^{2} (s)) - Dev (\bar{β} (s), \bar{W} (i), {\bar{σ}}^{2} (s)) . \end{aligned}

E 11

The quantities $\bar{β} (s)$ , $\bar{W} (i)$ and ${\bar{σ}}^{2} (s)$ are the posterior estimates derived from the MCMC results, as before. A smaller DIC value indicates a more favourable model [5].
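The relationship between the mean deviance, the deviance at the posterior mean, p _D and DIC can be checked numerically with a small Python sketch. To keep it self-contained, the sketch replaces the weighted BGWR deviance with a deliberately simplified, hypothetical model: independent normal observations with an unknown mean and a known variance. The data, posterior draws and function names are all made up for illustration.

```python
import math


def normal_deviance(y, mu, sigma2):
    """-2 log-likelihood for independent N(mu, sigma2) observations."""
    n = len(y)
    rss = sum((v - mu) ** 2 for v in y)
    return n * math.log(2.0 * math.pi * sigma2) + rss / sigma2


def dic_from_draws(y, mu_draws, sigma2):
    """Return (DIC, p_D) from posterior draws of the mean."""
    mean_dev = sum(normal_deviance(y, mu, sigma2) for mu in mu_draws) / len(mu_draws)
    mu_bar = sum(mu_draws) / len(mu_draws)           # posterior mean of the parameter
    dev_at_mean = normal_deviance(y, mu_bar, sigma2)
    p_d = mean_dev - dev_at_mean                     # effective number of parameters
    return dev_at_mean + 2.0 * p_d, p_d


# Made-up data and posterior draws.
y = [1.0, 2.0, 3.0]
mu_draws = [1.8, 2.0, 2.2]
dic_value, p_d = dic_from_draws(y, mu_draws, sigma2=1.0)
```

For this toy model, p _D equals n times the variance of the posterior draws divided by σ ², which gives approximately 0.08 here; the two forms of DIC in equation (E 11) agree by construction.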

Appendix F. Dirichlet process mixture model cluster configuration process

In this section, we provide a further explanation of the two-stage cluster configuration that we developed to find the final cluster labelling for the DPMM in the case of the mode method.

F.1. Stage 1: within-sample cluster configuration

Given:

—
N: total number of regions
—
M: number of MCMC samples obtained from BGWR (here, M = 500)
—
Q: number of MCMC iterations for DPMM for each sample i (here, Q = 2000)
—
C _i: cluster configuration for the ith sample

For each sample i from 1 to M:

(i)
perform MCMC analysis on the DPMM and determine the cluster assignment for each region;
(ii)
based on the given regions, we can find the cluster configuration, C _i, using the mode method; and
(iii)
arrange the cluster configurations derived from each sample i across the Q iterations, and enter NA where a value does not exist. Each sample i may give a different number of clusters.

Mathematically, for the mode method, the cluster assignment for a region j in the ith sample, c _i,j, is

c_{i, j} = mode (p (x_{j} | C_{i})),

where p(x _j|C _i) represents the probability of region j belonging to each cluster in configuration C _i.

F.2. Stage 2: across-sample cluster configuration

Given the set of cluster configurations from the first stage:

C = [\begin{matrix} c_{1, 1} & c_{1, 2} & \dots & c_{1, j} \\ c_{2, 1} & c_{2, 2} & \dots & c_{2, j} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ c_{M, 1} & c_{M, 2} & \dots & c_{M, j} \end{matrix}],

where

—
c _i,j is the cluster assignment for region j for sample i;
—
if the number of clusters during sample i is less than M _max, some c _i,j values in that row will be NA.

(i)
for each region j, consider its column in matrix C;
(ii)
calculate the mode across all entries in that column, ignoring the NA values.

The final cluster assignment for each region j, c _final,j, is:

c_{final, j} = mode (\{c_{1, j}, c_{2, j}, \dots, c_{M, j}\}),

where

—
c _final,j is the most frequently assigned cluster for region j across all 500 MCMC samples;
—
the mode function will exclude NA values.

Using the matrix C, we construct the final cluster assignment vector C _final as:

C_{final} = [\begin{matrix} c_{final, 1} \\ c_{final, 2} \\ ⋮ \\ c_{final, j} \end{matrix}],

where each entry c _final,j corresponds to the final cluster assignment for each region j. To handle variations in cluster numbers across iterations, the use of NA simplifies the mode calculation by indicating a lack of assignment.
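The second stage can be sketched in Python as a column-wise mode over the configuration matrix. The sketch represents NA by `None`, assumes that cluster labels are comparable across samples, and breaks ties by choosing the smallest label; the tie rule, the function names and the toy matrix are assumptions made for illustration and are not specified in the text.

```python
from collections import Counter


def mode_ignoring_na(values):
    """Most frequent non-missing label; ties go to the smallest label (an assumption)."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None  # the region was never assigned a cluster
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)


def final_assignment(config_matrix):
    """config_matrix[i][j]: cluster of region j in sample i, or None for NA."""
    n_regions = len(config_matrix[0])
    return [mode_ignoring_na([row[j] for row in config_matrix])
            for j in range(n_regions)]


# Toy configuration matrix: 3 samples (rows) by 4 regions (columns).
C = [
    [1, 2, 2, None],
    [1, 2, 3, 3],
    [1, 1, 3, 3],
]
c_final = final_assignment(C)  # [1, 2, 3, 3]
```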

Appendix G. Simulated data further analysis

In this section, we present additional results from the analysis of simulated data for the three settings across 100 replicates and report the average performance within these replicates. Our main focus is on the spatial distributions of the clusters obtained using our proposed algorithm. The final cluster configuration is determined using both Dahl’s method and the mode method as the assignment criteria.

Additionally, in this section, we present an empirical density plot showcasing the optimal number of clusters obtained from each sample of the BGWR coefficient posteriors. To ensure accuracy, we considered approximately 500 samples, drawn without replacement from the posterior of the beta parameters obtained from BGWR, during the evaluation of our algorithm. For the simulated data, we ran our algorithm for 5000 iterations, with 2000 burn-in iterations. Furthermore, figure 19 depicts the spatial distribution obtained from the proposed algorithm using the GMM cluster method. Notably, our algorithm consistently identifies the optimal number of clusters as 3 across the mode method and Dahl’s method. Noticeably, in this setting, the range of the RI is higher compared to setting 1. Compared to figures 18 and 19, figure 20 demonstrates higher accuracy (RI) with the true labels. The mode method and Dahl’s method generate more clusters. As anticipated, increasing the signal strength leads to improved clustering accuracy. However, the proposed approach exhibits a tendency for over-clustering. Nevertheless, this tendency diminishes as the signal strength increases.

Figure 18. — The optimal number of clusters obtained from the GMM using BIC from 500 samples from the posterior, for setting 1, along with the cluster configuration on the spatial map using the proposed algorithm.

Figure 19. — The optimal number of clusters obtained from the GMM using BIC from 500 samples from the posterior, for setting 2, along with the cluster configuration on the spatial map using the proposed algorithm.

Figure 20. — The optimal number of clusters obtained from the GMM using BIC from 500 samples from the posterior, for setting 3, along with the cluster configuration on the spatial map using the proposed algorithm.

Figure 21. — The spatial distribution of the clusters obtained from setting 1 using DPMM.

Figure 22. — The spatial distribution of the clusters obtained from setting 2 using DPMM.

Figure 23. — The spatial distribution of the clusters obtained from setting 3 using DPMM.

Figure 24. — Optimal cluster distribution per sample.

Figure 25. — The probabilities of the first 100 regions to belong to each cluster from 1 to 4 for GMM.

Figure 26. — The probabilities of the second 100 regions to belong to each cluster from 1 to 4 for GMM.

Figure 27. — The probabilities of the third 100 regions to belong to each cluster from 1 to 4 for GMM.

Figure 28. — The probabilities of the fourth 100 regions to belong to each cluster from 1 to 4 for GMM.

Figure 29. — The probabilities of the fifth 100 regions to belong to each cluster from 1 to 4 for GMM.

Figure 30. — The probabilities of the 26 regions to belong to each cluster from 1 to 4 for GMM.

Figure 31. — The probabilities of the first 100 regions to belong to each cluster from 1 to 12 for DPMM.

Figure 32. — The probabilities of the second 100 regions to belong to each cluster from 1 to 12 for DPMM.

Figure 33. — The probabilities of the third 100 regions to belong to each cluster from 1 to 12 for DPMM.

Figure 34. — The probabilities of the fourth 100 regions to belong to each cluster from 1 to 12 for DPMM.

Figure 35. — The probabilities of the fifth 100 regions to belong to each cluster from 1 to 12 for DPMM.

Figure 36. — The probabilities of the 26 regions to belong to each cluster from 1 to 12 for DPMM.

We compared two models for grouping data, DPMM and GMM, in three different situations. Notably, the DPMM outperformed the GMM by creating more clusters, especially smaller ones. This pattern persisted consistently across all three scenarios, demonstrating the DPMM’s remarkable ability to uncover structures in the data. The DPMM’s adaptive nature allowed it to identify complex patterns, making it a powerful tool for revealing hidden insights in complex datasets.

Appendix H. Case study further analysis

The optimal number of clusters derived from 500 samples with the GMM method, assessed using BIC, is depicted in figure 24. The provided plots illustrate the probability of each region belonging to various clusters. Since we have 526 regions to visualize on the heatmap, we have presented this information across six figures, with the first five figures each displaying the probabilities for 100 regions and the sixth displaying the remaining 26. The heatmaps show the probability of each region being assigned to each of the four clusters obtained from the GMM (figures 25–30) and to each of the 12 clusters obtained from the DPMM (figures 31–36).

Ethics

This work did not require ethical approval from a human subject or animal welfare committee.

Data accessibility

Data and relevant code for this research work are stored in GitHub: (https://github.com/waladraidi/BGWR-Clustering) and have been archived within the Zenodo repository: [https://zenodo.org/doi/10.5281/zenodo.11176324].

Declaration of AI use

We have not used AI-assisted technologies in creating this article.

Authors' contributions

W.D.A.: conceptualization, data curation, formal analysis, methodology, software, validation, visualization, writing–original draft, writing—review and editing; A.P.: project administration, supervision; H.T.: supervision, writing—review and editing; C.H.: methodology, software, writing—review and editing; R.M.: data curation, supervision, writing—review and editing; K.M.: supervision, writing—review and editing.

All authors gave final approval for publication and agreed to be held accountable for the work performed therein.

Conflict of interest declaration

We declare we have no competing interests.

Funding

No funding has been received for this article.

References

1. Bailey TC. 2001. Spatial statistical methods in health. Cadernos de Saúde Pública 17, 1083-1098. (doi:10.1590/S0102-311X2001000500011)
2. Cressie N. 2015. Statistics for spatial data. New York, NY: John Wiley & Sons.
3. Haining RP. 1990. Spatial data analysis in the social and environmental sciences. Cambridge, UK: Cambridge University Press.
4. Ripley B. 1981. Spatial statistics. Chichester, UK: John Wiley & Sons.
5. Ma Z, Xue Y, Hu G. 2021. Geographically weighted regression analysis for spatial economics data: a Bayesian recourse. Int. Reg. Sci. Rev. 44, 582-604. (doi:10.1177/0160017620959823)
6. Hastie T, Tibshirani R. 1990. Generalized additive models. London, UK: Chapman and Hall.
7. Fan J, Gijbels I. 1996. Local polynomial modeling and its applications. London, UK: Chapman and Hall.
8. Diggle PJ, Tawn JA, Moyeed RA. 1998. Model-based geostatistics. J. R. Stat. Soc. C (Applied Statistics) 47, 299-350. (doi:10.1111/1467-9876.00113)
9. Brunsdon C, Fotheringham AS, Charlton ME. 1996. Geographically weighted regression: a method for exploring spatial nonstationarity. Geograph. Anal. 28, 281-298. (doi:10.1111/j.1538-4632.1996.tb00936.x)
10. Xue Y, Schifano ED, Hu G. 2020. Geographically weighted Cox regression for prostate cancer survival data in Louisiana. Geograph. Anal. 52, 570-587. (doi:10.1111/gean.12223)
11. Chan HSR. 2008. Incorporating the concept of ‘community’ into a spatially-weighted local regression analysis. New Brunswick, Canada: University of New Brunswick.
12. Dormann CF, et al. 2007. Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography 30, 609-628. (doi:10.1111/j.2007.0906-7590.05171.x)
13. Sodikin I, Pramoedyo H, Astutik S. 2017. Geographically weighted regression and Bayesian geographically weighted regression modelling with adaptive Gaussian kernel weight function on the poverty level in west Java province. Int. J. Humanit., Relig. Soc. Sci. 2, 21-30.
14. Gelfand AE, Schliep EM. 2016. Spatial statistics and Gaussian processes: a beautiful marriage. Spat. Stat. 18, 86-104. (doi:10.1016/j.spasta.2016.03.006)
15. LeSage JP. 2004. A family of geographically weighted regression models. In Advances in spatial econometrics: methodology, tools and applications (eds L Anselin, RJGM Florax), pp. 241–264. Berlin, Germany: Springer.
16. Liu Y, Goudie RJB. 2021. Generalized geographically weighted regression model within a modularized Bayesian framework. arXiv. (http://arxiv.org/abs/2106.00996)
17. Kulldorff M, Nagarwalla N. 1995. Spatial disease clusters: detection and inference. Stat. Med. 14, 799-810. (doi:10.1002/sim.4780140809)
18. Li P, Banerjee S, Hanson TA, McBean AM. 2015. Bayesian models for detecting difference boundaries in areal data. Stat. Sinica 25, 385.
19. Neill DB, Moore AW, Sabhnani M, Daniel K. 2005. Detection of emerging space-time clusters. In Proc. of the Eleventh ACM SIGKDD Int. Conf. on Knowledge Discovery in Data Mining, Chicago, pp. 218–227.
20. Wu J. 2012. Cluster analysis and k-means clustering: an introduction. In Advances in K-means clustering: a data mining thinking, pp. 1–16. Berlin, Germany: Springer.
21. McLachlan GJ, Basford KE. 1988. Mixture models: inference and applications to clustering, vol. 38. New York, NY: M. Dekker.
22. UNICEF et al. 1993. Facts for life: a communication challenge. In Facts for life: a communication challenge (eds P Adamson, G Williams), pp. 78–78. London, UK: The Salvation Army Afghan Refugee Assistance Project.
23. Hertzman C, Wiens M. 1996. Child development and long-term outcomes: a population health perspective and summary of successful interventions. Soc. Sci. Med. 43, 1083-1095. (doi:10.1016/0277-9536(96)00028-7)
24. Irwin LG, Siddiqi A, Hertzman G. 2007. Early child development: a powerful equalizer. Vancouver, Canada: Human Early Learning Partnership (HELP).
25. Areed W, Price A, Arnett K, Thompson H, Malseed R, Mengersen K. 2023. Assessing the spatial structure of the association between attendance at preschool and children’s developmental vulnerabilities in Queensland, Australia. PLoS ONE 18, e0285409. (doi:10.1371/journal.pone.0285409)
26. Geluk CAML, van Domburgh L, Doreleijers TAH, Jansen L, Bouwmeester S, Garre FG, Vermeiren R. 2014. Identifying children at risk of problematic development: latent clusters among childhood arrestees. J. Abnorm. Child Psychol. 42, 669-680. (doi:10.1007/s10802-013-9811-3)
27. Ukoumunne OC, et al. 2012. Profiles of language development in pre-school children: a longitudinal latent class analysis of data from the early language in Victoria study. Child Care Health Dev. 38, 341-349. (doi:10.1111/j.1365-2214.2011.01234.x)
28. Páez A, Uchida T, Miyamoto K. 2002. A general framework for estimation and inference of geographically weighted regression models: 1. location-specific kernel bandwidths and a test for locational heterogeneity. Environ. Plann. A 34, 733-754. (doi:10.1068/a34110)
29. Leung Y, Mei CL, Zhang WX. 2000. Statistical tests for spatial nonstationarity based on the geographically weighted regression model. Environ. Plann. A 32, 9-32. (doi:10.1068/a3162)
30. Wheeler D, Tiefelsdorf M. 2005. Multicollinearity and correlation among local regression coefficients in geographically weighted regression. J. Geograph. Syst. 7, 161-187. (doi:10.1007/s10109-005-0155-6)
31. Cho S-H, Lambert DM, Chen Z. 2010. Geographically weighted regression bandwidth selection and spatial autocorrelation: an empirical example using Chinese agriculture data. Appl. Econ. Lett. 17, 767-772. (doi:10.1080/13504850802314452)
32. Yu H, Fotheringham AS, Li Z, Oshan T, Kang W, Wolf LJ. 2020. Inference in multiscale geographically weighted regression. Geograph. Anal. 52, 87-106. (doi:10.1111/gean.12189)
33. Gao X, Xiao B, Tao D, Li X. 2010. A survey of graph edit distance. Pattern Anal. Appl. 13, 113-129. (doi:10.1007/s10044-008-0141-y)
34. Carter C. 2002. Great circle distances. SiRF White Paper. See https://ieeexplore.ieee.org/abstract/document/10114987.
35. Pérez-Elizalde S, Cuevas J, Pérez-Rodríguez P, Crossa J. 2015. Selection of the bandwidth parameter in a Bayesian kernel regression model for genomic-enabled prediction. J. Agricul. Biol. Environ. Stat. 20, 512-532. (doi:10.1007/s13253-015-0229-y)
36. de Valpine P, Turek D, Paciorek CJ, Anderson-Bergman C, Lang DT, Bodik R. 2017. Programming with models: writing statistical algorithms for general model structures with NIMBLE. J. Comput. Graph. Stat. 26, 403-413. (doi:10.1080/10618600.2016.1172487)
37. Mitchell TJ, Beauchamp JJ. 1988. Bayesian variable selection in linear regression. J. Am. Stat. Assoc. 83, 1023-1032. (doi:10.1080/01621459.1988.10478694)
38. Chipman H. 1996. Bayesian variable selection with related predictors. Can. J. Stat. 24, 17-36. (doi:10.2307/3315687)
39. Kuo L, Mallick B. 1998. Variable selection for regression models. Sankhyā: The Indian J. Stat. B 60, 65-81.
40. Green PJ. 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711-732. (doi:10.1093/biomet/82.4.711)
41. Bhattacharya A, Dunson DB. 2011. Sparse Bayesian infinite factor models. Biometrika 98, 291-306. (doi:10.1093/biomet/asr013)
42. Dempster A, Laird N, Rubin D. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B (Methodological) 39, 1-22. (doi:10.1111/j.2517-6161.1977.tb01600.x)
43. He X, Cai D, Shao Y, Bao H, Han J. 2010. Laplacian regularized Gaussian mixture model for data clustering. IEEE Trans. Knowl. Data Eng. 23, 1406-1418. (doi:10.1109/TKDE.2010.259)
44. Reynolds DA. 2009. Gaussian mixture models. Encyclopedia Biometr. 741, 659-663. (doi:10.1007/978-0-387-73003-5_196)
45. Rousseeuw P. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20 , 53-65. ( 10.1016/0377-0427(87)90125-7) [DOI] [Google Scholar]
46. Wan H, Wang H, Scotney B, Liu J. 2019. A novel Gaussian mixture model for classification. In 2019 IEEE Int. Conf. on Systems, Man and Cybernetics (SMC), pp. 3298–3303. New York, NY: IEEE.
47. Liu W, Yang J, Zhao J, Yang L. 2017. A novel method of unsupervised change detection using multi-temporal PoISAR images. Remote Sens. 9 , 1135. ( 10.3390/rs9111135) [DOI] [Google Scholar]
48. Fu W, Perry PO. 2020. Estimating the number of clusters using cross-validation. J. Comput. Graph. Stat. 29 , 162-173. ( 10.1080/10618600.2019.1647846) [DOI] [Google Scholar]
49. Ferguson TS. 1973. A Bayesian analysis of some nonparametric problems. Ann. Stat. 1 , 209-230. [Google Scholar]
50. Sammut C, Webb GI. 2011. Encyclopedia of machine learning. Boston, MA: Springer Science & Business Media. [Google Scholar]
51. Teh YW. 2010. Dirichlet process. Encyclopedia Mach. Learn. 1063 , 280-287. ( 10.1007/978-0-387-30164-8_219) [DOI] [Google Scholar]
52. Neal RM. 2000. Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat. 9 , 249-265. ( 10.1080/10618600.2000.10474879) [DOI] [Google Scholar]
53. Dahl DB. 2006. Model-based clustering for expression data via a Dirichlet process mixture model. Bayesian Inference Gene Expression Proteomics 4 , 201-218. ( 10.1017/CBO9780511584589.011) [DOI] [Google Scholar]
54. Rand WM. 1971. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66 , 846-850. ( 10.1080/01621459.1971.10482356) [DOI] [Google Scholar]
55. Ma Z, Xue Y, Hu G. 2020. Heterogeneous regression models for clusters of spatial dependent data. Spatial Econ. Anal. 15 , 459-475. ( 10.1080/17421772.2020.1784989) [DOI] [Google Scholar]
56. Liu K, He J, Chen Y. A topic-enhanced Dirichlet model for short text stream clustering. Available at SSRN 4346394 .
57. The Australian Early Development Census. 2018. Queensland data snapshot. See https://earlychildhood.qld.gov.au/aboutUs/Documents/queensland-aedc-report-final.pdf .
58. Kowal DR. 2022. Bayesian subset selection and variable importance for interpretable prediction and classification. J. Mach. Learn. Res. 23 , 4661-4698. [PMC free article] [PubMed] [Google Scholar]
59. da Silva AR, Mendes FF. 2018. On comparing some algorithms for finding the optimal bandwidth in geographically weighted regression. Appl. Soft Comput. 73 , 943-957. ( 10.1016/j.asoc.2018.09.033) [DOI] [Google Scholar]
60. Geng J, Bhattacharya A, Pati D. 2019. Probabilistic community detection with unknown number of communities. J. Am. Stat. Assoc. 114 , 893-905. ( 10.1080/01621459.2018.1458618) [DOI] [Google Scholar]
61. Watanabe S, Opper M. 2010. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J. Mach. Learn. Res. 11 , 3571-3594. [Google Scholar]
62. Spiegelhalter DJ, Best NG, Carlin BP, Van Der Linde A. 2002. Bayesian measures of model complexity and fit. J. R. Stat. Soc. B 64 , 583-639. ( 10.1111/1467-9868.00353) [DOI] [Google Scholar]
63. Ma Z, Hu G, Chen M-H. 2021. Bayesian hierarchical spatial regression models for spatial data in the presence of missing covariates with applications. Appl. Stoch. Models Bus. Ind. 37 , 342-359. ( 10.1002/asmb.2568) [DOI] [Google Scholar]

Associated Data

Data Availability Statement

This work did not require ethical approval from a human subject or animal welfare committee.

