Cluster detection of spatial regression coefficients

Junho Lee; Ronald E Gangnon; Jun Zhu

doi:10.1002/sim.7172

Author manuscript; available in PMC: 2018 Jul 31.

Published in final edited form as: Stat Med. 2016 Nov 22;36(7):1118–1133. doi: 10.1002/sim.7172


PMCID: PMC6067680 NIHMSID: NIHMS982821 PMID: 27878838

Abstract

Popular approaches to spatial cluster detection, such as the spatial scan statistic, are defined in terms of the responses. Here, we consider a varying-coefficient regression and spatial clusters in the regression coefficients. For varying-coefficient regression, such as the geographically weighted regression, different regression coefficients are obtained for different spatial units. It is often of interest to the practitioners to identify clusters of spatial units with distinct patterns in a regression coefficient, but there is no formal statistical methodology for that. Rather, cluster identification is often ad-hoc such as by eyeballing the map of fitted regression coefficients and discerning patterns. In this paper, we develop new methodology for spatial cluster detection in the regression setting based on hypotheses testing. We evaluate our methods in terms of power and coverages for true clusters via simulation studies. For illustration, our methodology is applied to a cancer mortality dataset.

Keywords: geographically weighted regression, hypothesis testing, spatial cluster detection, spatial scan statistic, varying coefficient regression

1. Introduction

Cluster detection, the identification of spatial units adjacent in space that are associated with distinctive patterns of data of interest relative to background variation, is an important problem in disciplines such as spatial epidemiology and disease surveillance. For count data, clusters have distinctive risks of an event of interest: typically elevated, but possibly reduced, relative to background variation. For continuous data, clusters show higher or lower mean values than the background.

Spatial scan statistics [1,2] and their variants [3–11] are popular approaches to cluster detection within a frequentist hypothesis testing framework. The scan statistic is the maximum likelihood ratio test statistic based on a large collection of potential clusters of a particular regular geometric form (e.g., circles). Significance is evaluated via Monte Carlo simulation under an assumed null hypothesis, such as a constant risk over the entire spatial domain.

An alternative approach to spatial cluster detection uses Bayesian models for the underlying event rates that incorporate explicit spatial clusters associated with distinctive, either elevated or lowered, risks [12–18]. These models allow for formal inference regarding the number, locations, and risks associated with clusters relative to a model-specified and possibly non-uniform background risk. The aforementioned spatial cluster detection approaches, however, are all defined in terms of the responses. Here, we consider a new problem, namely, cluster detection of spatial regression coefficients.

In a spatial regression framework, it is plausible that a subdomain has a different relationship between the response and a covariate than the background. Such a subdomain can be considered a spatial cluster with different regression coefficients inside and outside the cluster. Alternatively, one can consider varying-coefficient regression such as the geographically weighted regression (GWR) [19, 20]. GWR allows the relationship between a response and covariates to vary geographically by considering locally weighted regression coefficients. Cluster identification can then be carried out by eyeballing the smooth map of fitted regression coefficients, but this method does not directly model clustering of regression coefficients. In addition, Lawson et al. [21] proposed discrete grouping of regression coefficients by considering a prior distribution for spatial grouping in a Bayesian framework. While this method directly provides grouping of regression coefficients, the number of groups needs to be specified in advance. Here, we propose a new approach that enables the detection of an unknown number of spatial clusters in terms of the relationships between the response and the covariate.

In particular, we focus on spatially varying coefficient regression models and develop new methodology for spatial cluster detection with a covariate. For a single cluster, we consider testing potential circular clusters of regression coefficients against the null hypothesis that the regression coefficient is the same over the entire spatial domain by an F statistic. The p-value of our test is obtained via a Monte Carlo simulation. For multiple clusters, we adopt the sequential detection approach as Zhang et al. [22] proposed. Further, we propose two methods to detect multiple clusters sequentially in the regression setting. The first method detects significant clusters in the slopes and the intercepts simultaneously. In the second method, significant clusters in the slopes are detected first, and then in the intercepts. We believe that our method is the first of its kind to cluster the relationship between the response and the covariate in space. With a unified modeling framework for spatial clusters of covariates in relation to the response, it is more rigorous to discern heterogeneity of the relationship in terms of spatial clusters and more intuitive to interpret the spatial patterns than GWR. The main challenge in developing our method is computing time. A large number of matrix manipulations are involved due to the large number of potential clusters. However, we resolve the computational challenge by devising an efficient algorithm that reduces the computational complexity.

The remainder of the paper is organized as follows. In Section 2, we develop a test for spatial cluster effects in a simplified setting and propose a simultaneous detection method in intercepts and slopes. For multiple clusters, we also propose a two-stage method in Section 3. In Section 4, we evaluate these methods in terms of power and coverages for true clusters via simulation studies. For illustration, our proposed methodology is applied to a cancer mortality dataset from the Southeast U.S.A. in Section 5. Details about the computation are given in the Appendix.

2. Simultaneous Spatial Cluster Detection in Intercepts and Slopes

2.1. Test for Spatial Cluster Effects in a Simplified Setting

Let D denote a spatial domain of interest in ℝ². Let N denote the number of cells that partition the spatial domain D and form a spatial lattice. For cell i = 1, …, N, let y_i denote the ith response variable. We model the response variable as y_i = μ_i + ε_i, where ε_i is a random error and the ε_i’s are independently and identically distributed (iid) as N(0, σ²) for a variance component σ² > 0. Let J denote the number of clusters on the spatial lattice and the clusters are denoted C₁, …, C_J such that

C_{j} = \{ i \mid d(s_{i}, c_{j}) \leq r_{j} \},

where j = 1, …, J, s_i = (s_{1i}, s_{2i})^T denotes the coordinates of the geographical centroid of cell i, c_j and r_j are the center and radius of the spatial extent of cluster C_j, and d(·, ·) is the distance between two locations. Then, the mean response μ_i follows a varying-coefficient model

μ_{i} = \begin{cases} β_{0} + β_{1} x_{i} & \text{if } i \notin \cup_{j = 1}^{J} C_{j} \\ (β_{0} + θ_{1, 0}) + (β_{1} + θ_{1, 1}) x_{i} & \text{if } i \in C_{1} \\ \vdots \\ (β_{0} + θ_{J, 0}) + (β_{1} + θ_{J, 1}) x_{i} & \text{if } i \in C_{J} \end{cases},

(1)

where x_i is the ith covariate, β₀ and β₁ are the intercept and the slope for the background (i.e., non-cluster), and θ_{j,0} and θ_{j,1} are the effects of cluster C_j on the intercept and on the slope, respectively. We begin with a single cluster C ≡ C₁ (i.e., J = 1) and assume that the cluster C is known a priori. Then, model (1) can be rewritten as

μ_{i} = \begin{cases} β_{0} + β_{1} x_{i} & \text{if } i \notin C \\ (β_{0} + θ_{0}) + (β_{1} + θ_{1}) x_{i} & \text{if } i \in C \end{cases}.

(2)

Next, we develop hypothesis testing for the cluster effect, which will be extended to test for an unknown cluster in the subsequent sections. For model (2) and a fixed cluster C, we may consider four possible hypotheses: H₀ : θ₀ = θ₁ = 0, H₁ : θ₀ ≠ 0, θ₁ = 0, H₂ : θ₀ ≠ 0, θ₁ ≠ 0, and H₃ : θ₀ = 0, θ₁ ≠ 0. The model under H₀ is the standard constant-coefficient (no cluster) regression; the model under H₁ has different intercepts but a common slope; the model under H₂ has different intercepts and different slopes; and the model under H₃ has a common intercept but different slopes. Among these four possible hypotheses, we will only consider H₀, H₁, and H₂ because, in a regression setting, the inference about slopes is generally of more interest than the intercept when evaluating the patterns of relationships between the response and the covariate relative to the background.

We consider a simultaneous test for the cluster effect in both the slopes and the intercepts:

H_{0} : θ_{0} = θ_{1} = 0 versus H_{2} : θ_{0} \neq 0, θ_{1} \neq 0.

(3)

Define the test statistic F = {(SSE₀ − SSE₂)/2}/{SSE₂/(N − 4)}, where SSE₀ is the sum of squared errors (SSE) under H₀, equal to $\sum_{i = 1}^{N} y_{i}^{2} - (\sum_{i = 1}^{N} x_{i} y_{i})^{T} (\sum_{i = 1}^{N} x_{i} x_{i}^{T})^{-1} (\sum_{i = 1}^{N} x_{i} y_{i})$, and, inside these sums, x_i denotes the ith covariate vector (1, x_i)^T rather than the scalar covariate. Further, SSE₂ is the SSE under H₂, equal to $\sum_{i = 1}^{N} y_{i}^{2} - (\sum_{i \in C} x_{i} y_{i})^{T} (\sum_{i \in C} x_{i} x_{i}^{T})^{-1} (\sum_{i \in C} x_{i} y_{i}) - (\sum_{i \notin C} x_{i} y_{i})^{T} (\sum_{i \notin C} x_{i} x_{i}^{T})^{-1} (\sum_{i \notin C} x_{i} y_{i})$; that is, separate regressions are fitted inside and outside the cluster. Under H₀, the F statistic follows an F distribution with degrees of freedom df₁ = 2 and df₂ = N − 4.
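As a minimal sketch of this test for a fixed, known cluster, the following NumPy code computes SSE₀, SSE₂, and F by ordinary least squares. The function names `sse` and `f_simultaneous`, the simulated data, and the cluster mask are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def sse(x, y):
    """Residual sum of squares from regressing y on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def f_simultaneous(x, y, in_cluster):
    """F statistic for H0 (no cluster effect) versus H2 (cluster-specific
    intercept and slope), for a fixed boolean cluster mask."""
    n = len(y)
    sse0 = sse(x, y)
    # Under H2 the model splits into two separate regressions.
    sse2 = sse(x[in_cluster], y[in_cluster]) + sse(x[~in_cluster], y[~in_cluster])
    return ((sse0 - sse2) / 2) / (sse2 / (n - 4))

# Hypothetical data: 200 cells, the first 30 form a cluster with effect (2, 2).
rng = np.random.default_rng(0)
n = 200
x = rng.standard_normal(n)
in_cluster = np.arange(n) < 30
y = rng.standard_normal(n) + np.where(in_cluster, 2 + 2 * x, 0.0)
f_stat = f_simultaneous(x, y, in_cluster)
```

The value `f_stat` would then be compared with the F distribution with 2 and N − 4 degrees of freedom, as stated above.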

Hypothesis testing involving the three hypotheses H₀, H₁, and H₂ will be further discussed in Section 3.

2.2. Single Cluster

In Section 2.1, a fixed cluster is assumed to be known a priori. Now, we relax this assumption and consider spatial cluster detection in the regression coefficients without assuming a fixed cluster. Let 𝒞 = {C₁, C₂, …} denote the set of all possible clusters. For an unknown single cluster C ∈ 𝒞, let

μ_{i} = \begin{cases} β_{0} + β_{1} x_{i} & \text{if } i \notin C \\ (β_{0} + θ_{C, 0}) + (β_{1} + θ_{C, 1}) x_{i} & \text{if } i \in C \end{cases},

(4)

where θ_{C,0} and θ_{C,1} are the cluster effects in the intercept and in the slope, respectively, of the cluster C.

For C_k ∈ 𝒞, k = 1, 2, …, we first consider the null hypothesis H₀ versus a cluster-specific local alternative hypothesis H_{C_k}:

H_{0} : θ_{C_{k}, 0} = θ_{C_{k}, 1} = 0 versus H_{C_{k}} : θ_{C_{k}, 0} \neq 0, θ_{C_{k}, 1} \neq 0,

(5)

where θ_{C_k,0} and θ_{C_k,1} are the cluster effect in the intercepts and in the slopes, respectively, of the cluster C_k. For a given cluster C_k, this setting is the same as (3). Thus, an F test statistic can be defined as

F(C_{k}) = \frac{({SSE}_{0} - {SSE}_{C_{k}}) / 2}{{SSE}_{C_{k}} / (N - 4)}

and follows an F distribution with degrees of freedom df₁ = 2 and df₂ = N − 4 under H₀, where SSE_{C_k} is the SSE under H_{C_k}.

Next, we consider a global alternative hypothesis for an unknown generic cluster

H_{A} : θ_{C, 0} \neq 0, θ_{C, 1} \neq 0 \text{ for a cluster } C \in \mathcal{C}.

From the F test statistics for all the possible local hypotheses given in (5), we define the test statistic for H₀ versus H_A to be

T = \max_{C \in \mathcal{C}} F(C).

(6)

To compute a p-value, a Monte Carlo method in the spirit of a parametric bootstrap is adopted. First, we compute the unbiased estimates of the parameters under H₀ and obtain β̂₀, β̂₁, and σ̂². Second, we generate Monte Carlo samples $y_{i}^{new} = \hat{β}_{0} + \hat{β}_{1} x_{i} + ε_{i}^{new}$, where $ε_{i}^{new} \sim \text{iid } N(0, \hat{σ}^{2})$ for i = 1, …, N. Third, we compute the test statistic (6) for each Monte Carlo sample. Suppose there are S random Monte Carlo samples. The p-value is R/(S + 1), where R is the rank of the test statistic (6) for the original dataset among the original value and the S Monte Carlo values, with the largest value assigned a rank of 1.
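The three steps can be sketched as follows. This is an illustrative outline, not the authors' implementation: `monte_carlo_p_value`, its arguments, and the convention of counting ties against the observed value are assumptions, and `stat_fn` stands for any function that returns the scan statistic for a given response vector.

```python
import numpy as np

def monte_carlo_p_value(x, y, stat_fn, n_sim=999, seed=0):
    """Parametric-bootstrap p-value R / (S + 1) for a statistic stat_fn(x, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    # Step 1: fit the no-cluster model and estimate the error variance.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma = np.sqrt(resid @ resid / (n - 2))  # unbiased variance estimate
    t_obs = stat_fn(x, y)
    # Steps 2-3: simulate responses under H0 and recompute the statistic.
    t_sim = np.array([
        stat_fn(x, X @ beta + sigma * rng.standard_normal(n))
        for _ in range(n_sim)
    ])
    # Rank of the observed statistic; the largest value has rank 1.
    rank = 1 + int(np.sum(t_sim >= t_obs))
    return rank / (n_sim + 1)
```

With S = 999 samples, the smallest attainable p-value is 1/1000, so S should be chosen with the intended significance level in mind.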

The test statistic (6) is for all the possible clusters in 𝒞 = {C₁, C₂, …}. Among those clusters, the cluster that corresponds to the test statistic T in (6) is considered to be the cluster estimate Ĉ. That is,

\hat{C} = \arg\max_{C \in \mathcal{C}} F(C).

Here, the set of potential clusters, 𝒞 = {C₁, C₂, …, C_K}, is pre-defined as circular clusters centered at the N sites in the data with various radii. We restrict the radius to be between 0 and a maximum radius, say R_max. For the centroid of a particular cell i, the potential clusters centered there are chosen to have radii 0 = r_{i,1} < r_{i,2} < … < r_{i,m_i} ≤ R_max, so there are m_i distinct potential clusters centered at cell i. With $K = \sum_{i = 1}^{N} m_{i} < \infty$, there are a total of K potential clusters for the N cells.
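One possible way to enumerate such a candidate set is sketched below. It takes, for each center, every distinct center-to-centroid distance up to R_max as a radius; this tie-handling rule, the rounding of distances, and the name `circular_clusters` are assumptions, and the enumeration actually used in the paper may differ.

```python
import numpy as np

def circular_clusters(coords, r_max, decimals=9):
    """Return (center index, radius, member indices) for every circle that is
    centered at a site and whose radius is a distinct distance <= r_max."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    # Round so that distances equal up to floating-point error are merged.
    dist = np.round(np.sqrt((diff ** 2).sum(axis=-1)), decimals)
    clusters = []
    for i in range(len(coords)):
        for r in np.unique(dist[i][dist[i] <= r_max]):
            members = np.flatnonzero(dist[i] <= r)
            clusters.append((i, float(r), members))
    return clusters
```

Radius 0 always appears, so every single cell is itself a candidate cluster, in line with r_{i,1} = 0 above.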

The computational complexity and algorithm are described in Appendix A.

2.3. Multiple Clusters

To detect potential additional clusters, we propose a sequential algorithm. First, we estimate the first cluster $\hat{C}_{1} = \arg\max_{C \in \mathcal{C}} F(C)$, where 𝒞 is pre-defined with N cells on the spatial lattice and the maximum radius R_max. To test H₀ : θ_C = 0 for any cluster C ∈ 𝒞 versus H_A : θ_C ≠ 0 for a cluster C ∈ 𝒞, where θ_C = (θ_{C,0}, θ_{C,1})^T, the single cluster method in Section 2.2 is applied. Next, after removing the effect of Ĉ₁ from the data, we estimate the second cluster $\hat{C}_{2} = \arg\max_{C \in \mathcal{C}} F(C)$ and again apply the single cluster method in Section 2.2 to the same pair of hypotheses. Then, after removing the effect of Ĉ₂ from the data as well, we find the third cluster estimate $\hat{C}_{3} = \arg\max_{C \in \mathcal{C}} F(C)$ and test it in the same way, and so on. In the end, a set of cluster estimates, {Ĉ₁, Ĉ₂, Ĉ₃, …}, is obtained. Because these cluster estimates are obtained sequentially, the corresponding p-values are also computed in a sequential fashion. The detailed algorithm has the following steps.

1. Estimate the background coefficients β̂ = (β̂₀, β̂₁)^T under H₀ (no cluster) and compute the residuals $e_{0i} = y_{i} - x_{i}^{T} \hat{β}$.
2. Pre-define 𝒞 with N cells on the spatial lattice and the maximum radius R_max.
3. Obtain the cluster $\hat{C} = \arg\max_{C \in \mathcal{C}} F(C)$ with the residuals as the responses, its p-value, and the corresponding coefficients θ̂_Ĉ = (θ̂_{Ĉ,0}, θ̂_{Ĉ,1})^T.
4. Update the residuals by removing the cluster effect: $e_{ji} = e_{(j - 1)i} - x_{i}^{T} \hat{θ}_{\hat{C}} \cdot I(i \in \hat{C})$, where the e_{ji}'s are the residuals from the model with the jth cluster and I(·) is the indicator function.
5. Repeat steps 3–4 until p-value > α. That is, stop only if the p-value in step 3 is greater than the significance level α.
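The steps above can be sketched in a few dozen lines. This is a hedged illustration rather than the authors' code: candidate clusters are passed as boolean masks, candidates with fewer than three cells inside or outside are skipped, and the `max_clusters` guard and all function names are assumptions introduced here. It also uses a brute-force scan, not the efficient algorithm of Appendix A.

```python
import numpy as np

def _fit(x, y):
    """OLS of y on (1, x): return the coefficients and the residual SSE."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return beta, float(r @ r)

def scan(x, e, clusters):
    """Return (max F, index of the maximizing cluster) over boolean masks."""
    n = len(e)
    _, sse0 = _fit(x, e)
    best_f, best_k = -np.inf, -1
    for k, c in enumerate(clusters):
        if c.sum() < 3 or (~c).sum() < 3:  # assumption: skip tiny candidates
            continue
        sse_c = _fit(x[c], e[c])[1] + _fit(x[~c], e[~c])[1]
        f = ((sse0 - sse_c) / 2) / (sse_c / (n - 4))
        if f > best_f:
            best_f, best_k = f, k
    return best_f, best_k

def sequential_detection(x, y, clusters, alpha=0.05, n_sim=99,
                         max_clusters=5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta, _ = _fit(x, y)
    e = y - X @ beta                          # step 1: background residuals
    found = []
    for _ in range(max_clusters):
        t_obs, k = scan(x, e, clusters)       # step 3: best candidate
        b0, sse0 = _fit(x, e)
        sigma = np.sqrt(sse0 / (n - 2))
        t_sim = [scan(x, X @ b0 + sigma * rng.standard_normal(n), clusters)[0]
                 for _ in range(n_sim)]
        p = (1 + sum(t >= t_obs for t in t_sim)) / (n_sim + 1)
        if p > alpha:                         # step 5: stopping rule
            break
        c = clusters[k]
        # Cluster effect = coefficients inside minus coefficients outside.
        theta = _fit(x[c], e[c])[0] - _fit(x[~c], e[~c])[0]
        e = e - (X @ theta) * c               # step 4: remove cluster effect
        found.append((k, p, theta))
    return found
```

Each entry of the returned list holds the index of a detected cluster, its Monte Carlo p-value, and the estimated effect (θ̂₀, θ̂₁).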

The clusters detected by the sequential method above can overlap with each other. To obtain multiple non-overlapping clusters, we update the set of potential clusters for the jth cluster to be $\mathcal{C}_{j} = \mathcal{C} \setminus \bigcup_{k = 1}^{j - 1} \mathcal{K}_{k}$, where 𝒦_k is the set of clusters that overlap with the kth cluster estimate Ĉ_k.

The previously proposed methodology for multiple clusters, overlapping or not, is based on F tests for the cluster effect in both the slopes (θ_{C,1}) and the intercepts (θ_{C,0}) of each potential cluster C ∈ 𝒞. The detected clusters could have significant cluster effects in the intercepts only, or in both the slopes and the intercepts. Thus, we will refer to this cluster detection as the simultaneous detection to distinguish it from an alternative sequential approach to be developed in the next section.
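A small sketch of this update, assuming candidate and detected clusters are stored as boolean masks over the N cells (the function name `drop_overlapping` is hypothetical):

```python
import numpy as np

def drop_overlapping(clusters, detected):
    """Keep only the candidate masks that share no cell with any detected
    cluster, i.e. remove the union of the overlap sets K_k."""
    taken = np.zeros_like(clusters[0], dtype=bool)
    for c in detected:
        taken |= c
    return [c for c in clusters if not np.any(c & taken)]
```

Note that a detected cluster overlaps with itself, so it is removed from the candidate set along with its overlapping neighbors.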

3. Two-Stage Spatial Cluster Detection in Intercepts and Slopes

In a regression setting, inference about a slope is generally of more interest than the intercept. The test statistic (6) allows the detection of spatial clusters in both the slopes and the intercepts, but it is not straightforward to determine whether the cluster effects are in the slopes or in the intercepts. To study the potential spatial pattern in the slopes, we now develop an alternative, two-stage approach to detecting multiple clusters. In particular, spatial clusters in the slopes will be detected in the first stage regardless of intercept effect. Then, in the second stage, spatial clusters in the intercepts will be detected. Henceforth, this alternative approach will be referred to as the two-stage detection.

3.1. Test for Spatial Cluster Effects in a Simplified Setting

Assume model (2) with a fixed cluster C which is known a priori. We perform hypotheses testing in two steps: first the cluster effect in the slopes and then the cluster effect in the intercepts. That is,

H_{1} : θ_{0} \neq 0, θ_{1} = 0 versus H_{2} : θ_{0} \neq 0, θ_{1} \neq 0,

(7)

H_{0} : θ_{0} = θ_{1} = 0 versus H_{1} : θ_{0} \neq 0, θ_{1} = 0.

(8)

The test statistics for (7) and (8) are, respectively,

F^{slope} = \frac{{SSE}_{1} - {SSE}_{2}}{{SSE}_{2} / (N - 4)}, \quad F^{int} = \frac{{SSE}_{0} - {SSE}_{1}}{{SSE}_{1} / (N - 3)},

where SSE₁ is the SSE under H₁, equal to $\sum_{i = 1}^{N} y_{i}^{2} - (\sum_{i = 1}^{N} w_{i} y_{i})^{T} (\sum_{i = 1}^{N} w_{i} w_{i}^{T})^{-1} (\sum_{i = 1}^{N} w_{i} y_{i})$, and w_i is defined as the column vector (1, x_i, 1)^T for i ∈ C and (1, x_i, 0)^T for i ∉ C. Under H₁, the test statistic F^slope follows an F distribution with degrees of freedom df₁ = 1 and df₂ = N − 4, whereas the test statistic F^int follows an F distribution with degrees of freedom df₁ = 1 and df₂ = N − 3 under H₀.
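For a fixed cluster, both statistics come from three nested least-squares fits. The sketch below is illustrative only; the names `rss` and `two_stage_f` are assumptions, and the third design column plays the role of the indicator entry of w_i.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def two_stage_f(x, y, in_cluster):
    """Return (F_slope, F_int) for a fixed boolean cluster mask."""
    n = len(y)
    one = np.ones(n)
    z = in_cluster.astype(float)                        # indicator I(i in C)
    sse0 = rss(np.column_stack([one, x]), y)            # H0: no cluster effect
    sse1 = rss(np.column_stack([one, x, z]), y)         # H1: intercept shift only
    sse2 = rss(np.column_stack([one, x, z, z * x]), y)  # H2: intercept and slope
    f_slope = (sse1 - sse2) / (sse2 / (n - 4))          # tests (7)
    f_int = (sse0 - sse1) / (sse1 / (n - 3))            # tests (8)
    return f_slope, f_int
```

Because each comparison adds a single parameter, both numerators have one degree of freedom, which matches df₁ = 1 above.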

3.2. First Stage: Spatial Cluster in the Slopes

From now on, we assume model (4). Of interest is the cluster effect in the slopes (θ_{C,1}) for an unknown single cluster C ∈ 𝒞. For C_k ∈ 𝒞, k = 1, 2, …, we first consider the null hypothesis $H_{0}^{slope}$ versus a cluster-specific local alternative hypothesis $H_{C_{k}}^{slope}$ for the slopes:

H_{0}^{slope} : θ_{C_{k}, 1} = 0 versus H_{C_{k}}^{slope} : θ_{C_{k}, 1} \neq 0.

(9)

For a given cluster C_k, this setting is the same as (7). Thus, we define

F^{slope}(C_{k}) = \frac{{SSE}_{0, slope} - {SSE}_{C_{k}, slope}}{{SSE}_{C_{k}, slope} / (N - 4)}.

(10)

The test statistic F^slope(C_k) in (10) follows an F distribution with degrees of freedom df₁ = 1 and df₂ = N − 4 under $H_{0}^{slope}$, where SSE_{0,slope} and SSE_{C_k,slope} are the SSEs under $H_{0}^{slope}$ and $H_{C_{k}}^{slope}$, respectively.

As in the simultaneous method, we consider a global alternative hypothesis

H_{A}^{slope} : θ_{C, 1} \neq 0 \text{ for a cluster } C \in \mathcal{C}

for an unknown generic cluster. From the F test statistics for all the possible local hypotheses given in (9), we define the test statistic for $H_{0}^{slope}$ versus $H_{A}^{slope}$ and the corresponding cluster estimate to be

T^{slope} = \max_{C \in \mathcal{C}} F^{slope}(C),

(11)

\hat{C} = \arg\max_{C \in \mathcal{C}} F^{slope}(C).

(12)

To compute a p-value, a Monte Carlo method is applied in a manner similar to Section 2.2.

To detect potential additional clusters in the slopes, we propose a sequential algorithm with the cluster estimate (12). That is, we estimate the first cluster $\hat{C}_{1} = \arg\max_{C \in \mathcal{C}} F^{slope}(C)$. Then, we iteratively estimate the next cluster $\hat{C}_{j + 1} = \arg\max_{C \in \mathcal{C}} F^{slope}(C)$ after removing the effect of Ĉ_j from the data, where j = 1, 2, …. The iteration continues until no further significant cluster in the slopes is found. Then, we move to the second stage to find clusters in the intercepts.

3.3. Second Stage: Spatial Cluster in the Intercepts

In the second stage, of interest is the cluster effect in the intercepts (θ_{C,0}) for an unknown single cluster C ∈ 𝒞. Thus, a varying-intercept but constant-slope model is considered.

For C_k ∈ 𝒞, k = 1, 2, …, we first consider the null hypothesis $H_{0}^{int}$ versus a cluster specific local alternative hypothesis $H_{C_{k}}^{int}$ for the intercepts:

H_{0}^{int} : θ_{C_{k}, 0} = θ_{C_{k}, 1} = 0 versus H_{C_{k}}^{int} : θ_{C_{k}, 0} \neq 0, θ_{C_{k}, 1} = 0.

(13)

For a given cluster C_k, this setting is the same as (8). Thus, an F test statistic can be defined as F^int(C_k) = (SSE₀ − SSE_{C_k,int})/{SSE_{C_k,int}/(N − 3)} and follows an F distribution with degrees of freedom df₁ = 1 and df₂ = N − 3 under $H_{0}^{int}$ , where SSE_{C_k,int} is the SSE under $H_{C_{k}}^{int}$ .

Next, we consider a global alternative hypothesis for an unknown generic cluster

H_{A}^{int} : θ_{C, 0} \neq 0 \text{ for a cluster } C \in \mathcal{C}.

From the F test statistics for all the possible local hypotheses given in (13), we define the test statistic for $H_{0}^{int}$ versus $H_{A}^{int}$ and the corresponding cluster estimate to be

T^{int} = \max_{C \in \mathcal{C}} F^{int}(C),

(14)

\hat{C} = \arg\max_{C \in \mathcal{C}} F^{int}(C).

(15)

The p-value of the test statistic (14) is again computed via a Monte Carlo method.

Suppose a total of J₁ significant clusters in the slopes are detected in the first stage. Then, in the second stage, we consider a sequential algorithm with the cluster estimate (15) to detect potential additional clusters in the intercepts. That is, after removing the effects of {Ĉ₁, …, Ĉ_{J₁}} from the data, we estimate the (J₁ + 1)th cluster $\hat{C}_{J_{1} + 1} = \arg\max_{C \in \mathcal{C}} F^{int}(C)$. We then estimate the next cluster $\hat{C}_{J_{1} + 2} = \arg\max_{C \in \mathcal{C}} F^{int}(C)$ after removing the effect of Ĉ_{J₁+1}, and so on. In the end, a set of cluster estimates, {Ĉ₁, Ĉ₂, Ĉ₃, …}, is identified, where the first set {Ĉ₁, …, Ĉ_{J₁}} captures effects in the slopes while the second set {Ĉ_{J₁+1}, Ĉ_{J₁+2}, …} captures effects in the intercepts.

For multiple non-overlapping clusters, we update the set of potential clusters for the jth cluster to be $\mathcal{C}_{j} = \mathcal{C} \setminus \bigcup_{k = 1}^{j - 1} \mathcal{K}_{k}$, where 𝒦_k is the set of clusters that overlap with the kth cluster estimate Ĉ_k.

4. Simulation Study

We conducted a simulation study to evaluate the methodology proposed in Sections 2 and 3 for a single cluster or for two clusters that have either overlapping or non-overlapping cells. We consider a 25×25 square grid in the unit square [0, 1] × [0, 1], which is partitioned into 625 cells with 25 rows and 25 columns. The width of each cell is 1/25 = 0.04. The centroids of the cells are {0.02, 0.06, …, 0.98} × {0.02, 0.06, …, 0.98}. The set of potential clusters consists of 41,493 circular clusters centered at the 625 cell centroids with radii ranging from 0 to 0.2. The single covariate, x, follows the standard normal distribution. The regression coefficients in the background are set to be β = (β₀, β₁)^T = (0, 0)^T, and the variance of the random error ε_i is set to be σ² = 1. We evaluate the power of the cluster detection tests in a single-cluster setting and the coverage of the true clusters in a two-cluster setting.

4.1. Evaluation of Power of Tests

For detection of a single true cluster, we define power to be the proportion of simulations in which the global null hypothesis, H₀ : θ_C = (θ_{C,0}, θ_{C,1})^T = (0, 0)^T for any cluster C ∈ 𝒞, is rejected at the significance level α. There are different ways to define power for cluster detection tests in the literature, incorporating different views on how to define a correct cluster identification. However, the different definitions of power do not have much impact on the results [4, 9, 23, 24].

Here, we consider a total of nine different circular clusters, defined by nine centroids and a common radius of 3/25 unit. One centroid is at the center (0.50, 0.50) of the unit square, four centroids are shifted from the center toward the bottom edge, and the other four are shifted from the center toward the lower left corner. A complete circular cluster consists of 29 cells. The half circular cluster with a centroid at the bottom, (0.50, 0.02), has 18 cells, whereas the quarter circular cluster with a centroid at the lower left corner, (0.02, 0.02), has only 11 cells. These cluster settings are illustrated in Figure 1. The cluster effect in the slope is set to be the same as in the intercept. That is, θ_C = (θ, θ)^T, where θ is set to be 2, 1, or 1/2 for a strong, medium, or weak cluster effect, respectively, relative to the error standard deviation σ = 1. We simulated 1000 datasets for each combination of centroid and cluster effect θ.

Figure 1. The nine cluster settings with different centroids and the same radius of 3/25 unit for evaluation of the power of detecting true clusters.
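The cell counts quoted for the complete, half, and quarter circular clusters can be checked directly on the 25 × 25 grid. The short sketch below is only a verification aid; the helper name `cells_in_circle` and the small tolerance added to the radius (to guard against floating-point error for cells lying exactly on the circle) are assumptions.

```python
import numpy as np

side = 25
ticks = (np.arange(side) + 0.5) / side  # centroids 0.02, 0.06, ..., 0.98
gx, gy = np.meshgrid(ticks, ticks)
coords = np.column_stack([gx.ravel(), gy.ravel()])

def cells_in_circle(center, radius, tol=1e-9):
    """Count the cell centroids within `radius` of `center`."""
    d = np.sqrt(((coords - np.asarray(center)) ** 2).sum(axis=1))
    return int(np.sum(d <= radius + tol))

radius = 3 / side
counts = {c: cells_in_circle(c, radius)
          for c in [(0.50, 0.50), (0.50, 0.02), (0.02, 0.02)]}
# counts == {(0.5, 0.5): 29, (0.5, 0.02): 18, (0.02, 0.02): 11}
```

These counts agree with the Cells column of Table I for the central centroid, the bottom-edge centroid (0.50, 0.02), and the corner centroid (0.02, 0.02).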

We obtained the critical value of the test statistic (6) at α = 0.05 from its null distribution, which was generated from 10,000 null simulations with the maximum radius R_max = 1/5 unit. We used this critical value to test the detected cluster in each simulated dataset. The simultaneous detection developed in Section 2 was used to find a significant cluster.

Table I provides the results of the power calculation for each simulation setting. Our cluster detection method has 100% power when the signal-to-noise ratio (SNR: θ/σ) is 2, even for a half or a quarter circular cluster. With SNR 1, the power is around 99% for complete circular clusters, 78% for the half circle with 18 cells, and 49% for the quarter circle with 11 cells. With SNR 1/2, the power drops to about 23% for complete circular clusters, 11% for the half circle, and 8% for the quarter circle.

Table I.

Power in percentage for cluster detection on the 25 × 25 square grid with the max cluster radius R_max = 1/5. The error standard deviation is σ = 1.

Centroid	Cells	SNR (θ/σ) = 2	SNR = 1	SNR = 1/2
(0.50, 0.50)	29	100.0	99.0	23.0
(0.50, 0.38)	29	100.0	99.0	22.5
(0.50, 0.26)	29	100.0	99.0	22.8
(0.50, 0.14)	29	100.0	99.0	24.1
(0.50, 0.02)	18	100.0	77.5	11.1
(0.38, 0.38)	29	100.0	99.0	22.6
(0.26, 0.26)	29	100.0	99.0	23.0
(0.14, 0.14)	29	100.0	99.1	23.4
(0.02, 0.02)	11	100.0	48.9	8.3


4.2. Evaluation of Coverage of the True Clusters

For two true clusters, we evaluated the coverage of detected clusters. We considered a total of three different two-cluster settings. The two circular clusters have the same radius of 3/25 unit. The two clusters are adjacent to each other in the first setting and are apart in the second setting. The third setting has two overlapping clusters. These three cluster settings are illustrated in Figure 2. Further, we set two different scenarios for the cluster effects: in the first, the cluster effects are in both the slopes and the intercepts for each cluster; in the second, the cluster effects are in both the slopes and the intercepts for one cluster but in the intercepts only for the other cluster. The cluster effect is set to be θ = 2. That is, θ_{C₁} = θ_{C₂} = (2, 2)^T in the first scenario, and θ_{C₁} = (2, 2)^T and θ_{C₂} = (2, 0)^T in the second scenario. We simulated 1000 datasets for each of the six combinations of cluster settings and cluster effect scenarios. For each simulated dataset, we estimated the regression coefficients for the detected clusters, and we mapped the mean coefficient estimates in comparison with the true values.

Figure 2. Two clusters that are adjacent to, apart from, and overlapping with each other, respectively, with the same radius of 3/25 unit for evaluation of coverage of true clusters.

To detect clusters, we applied four methods: simultaneous detection or two-stage detection, each with non-overlapping or overlapping clusters. We used the critical values for the test statistics (6), (11), and (14) for testing in each simulated dataset. The critical values at α = 0.05 were taken from the null distribution of each test statistic, which was generated from 10,000 null simulations with the maximum radius 1/5 unit.

Figure 3 provides the maps of the mean coefficient estimates based on each of the four cluster detection methods for the simulated data with two true overlapping clusters. Columns 1 and 3 are for the mean slope estimates, whereas columns 2 and 4 are for the mean intercept estimates. In the first two columns, θ_{C₁} = θ_{C₂} = (2, 2)^T. In the last two columns, θ_{C₁} = (2, 2)^T and θ_{C₂} = (2, 0)^T. Row 1 is the oracle, namely, the true coefficients. Rows 2 and 3 are from the simultaneous detection method with non-overlapping or overlapping clusters. Rows 4 and 5 are from the two-stage detection method with non-overlapping or overlapping clusters. The results for the other two cluster settings, adjacent or apart, are omitted because the findings are similar in the sense that all the cluster detection methods perform well and the corresponding mean coefficient estimates are close to the true clusters and the true regression coefficients.

Figure 3. Maps of the mean coefficient estimates for each cell from the 1000 simulated datasets with two overlapping clusters. In the first two columns, θ_{C₁} = θ_{C₂} = (2, 2)^T; in the last two columns, θ_{C₁} = (2, 2)^T and θ_{C₂} = (2, 0)^T. Row 1 is the oracle. Rows 2 and 3 are simultaneous detection with non-overlapping and overlapping clusters. Rows 4 and 5 are two-stage detection with non-overlapping and overlapping clusters.

Figure 3 shows that, when true clusters overlap with each other, it is hard to identify all of the true clusters under the non-overlapping assumption, while the results under the overlapping assumption indicate clusters that are close to the truth. Thus, detecting clusters under the overlapping assumption seems to be the safer choice for identifying true clusters, whether overlapping or not. However, the overlapping assumption requires more computation to detect multiple clusters than the non-overlapping assumption. The set of potential clusters for the jth cluster is $\mathcal{C} \setminus \{\hat{C}_{1}, \dots, \hat{C}_{j - 1}\}$ when we assume overlapping clusters, while it is $\mathcal{C} \setminus \bigcup_{k = 1}^{j - 1} \mathcal{K}_{k}$ under the non-overlapping assumption, where 𝒦_k is the set of clusters that overlap with the kth cluster estimate Ĉ_k. We therefore have more potential clusters to examine under the overlapping assumption; the difference is $|\mathcal{C} \setminus \{\hat{C}_{1}, \dots, \hat{C}_{j - 1}\}| - |\mathcal{C} \setminus \bigcup_{k = 1}^{j - 1} \mathcal{K}_{k}| = |\bigcup_{k = 1}^{j - 1} \mathcal{K}_{k}| - (j - 1)$, where |·| denotes the cardinality of a set. Further, this difference in the number of potential clusters increases as j increases. That is, the overlapping assumption requires more computation as the number of clusters increases. In our simulation study, identifying clusters under the overlapping assumption is about 10% slower than under the non-overlapping assumption for both the simultaneous detection and the two-stage detection.

Under the overlapping assumption, the simultaneous detection and the two-stage detection perform similarly in terms of identifying true clusters in Figure 3. Because the two-stage detection is more computationally intensive, the simultaneous detection is appealing.
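The difference in the number of candidates follows from two set differences. As a short worked step, assuming (as the definition of 𝒦_k suggests, although the text does not state it explicitly) that each estimate Ĉ_k is itself a member of 𝒦_k and that the estimates are distinct:

```latex
\begin{align*}
\bigl|\mathcal{C} \setminus \{\hat{C}_1, \dots, \hat{C}_{j-1}\}\bigr|
  &= |\mathcal{C}| - (j - 1), \\
\Bigl|\mathcal{C} \setminus \bigcup_{k=1}^{j-1} \mathcal{K}_k\Bigr|
  &= |\mathcal{C}| - \Bigl|\bigcup_{k=1}^{j-1} \mathcal{K}_k\Bigr|, \\
\text{so the difference is}\quad
  &\Bigl|\bigcup_{k=1}^{j-1} \mathcal{K}_k\Bigr| - (j - 1) \;\ge\; 0 .
\end{align*}
```

The inequality holds under the same assumption because the j − 1 estimates are then contained in the union of the 𝒦_k.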

5. Data Example

5.1. Southeast U.S.A Cancer Mortality Data

The dataset comprises 616 counties in seven U.S. states: Alabama, Florida, Georgia, Mississippi, North Carolina, South Carolina, and Tennessee. For each county, the cancer mortality rate is defined as the number of deaths of cancer patients per 100,000 population per year over 2008–2012 and age-adjusted to the 2000 U.S standard population (http://www.statecancerprofiles.cancer.gov/). In addition, the dataset contains information about the extent of urban versus rural areas in terms of the proportion of the population in urban areas in census year 2000 (http://www.census.gov/). We considered regression models with the log cancer mortality rate ( logMortality) as the response variable and the proportion of the population in urban areas ( purban) as the covariate. For y_i = log r_i, where r_i is the rate for the ith county, it can be shown that Var(y_i) ≈ (n_iρ_i)⁻¹ + σ², where n_i is the county population and ρ_i = E(r_i). For county populations in the thousands, the first term (n_iρ_i)⁻¹ is negligible, and thus, we assume a constant variance. In addition, the residuals did not provide evidence of clusters based on spatial scan statistics or nonnormality. Thus, the assumption of independent errors seems reasonable. The slope estimate of the ordinary regression with no cluster is −0.096 with its standard error of 0.018. Thus, the overall trend shows that there is a negative relationship between cancer mortality and proportion of urban area.

The map of the log cancer mortality rate in Figure 4 shows that Union county in northern Florida has the highest log cancer mortality rate of nearly 6.00. In addition, some highly urbanized counties such as Fulton county in northern Georgia and Hillsborough and Miami-Dade counties in southern Florida have relatively low cancer mortality rates. There are no other obvious geographical clusters of the cancer mortality rate in relation to the proportion of urban area.

Figure 4. The proportion of the population in urban areas and the log cancer mortality rate for each county in the states of Alabama, Florida, Georgia, Mississippi, North Carolina, South Carolina, and Tennessee.

The results of GWR are mapped in Figure 5, where the log cancer mortality rate and the proportion of the population in urban areas are the response and the covariate, respectively. It appears that there are several potential geographical clusters in the relationship between cancer mortality and the proportion of urban area: a negative relationship in central Florida, coastal South Carolina, and central Tennessee; a positive relationship in northern Mississippi; and no relationship in southern Florida and North Carolina. Still, it is not clear how to delineate clusters and interpret the corresponding regression coefficient estimates. However, we could identify multiple clusters using our proposed methodology. The covariate is centered to have a zero mean in both GWR and our methods. The set of potential clusters consists of 93,450 circular clusters centered at the 616 county centroids with radii ranging from 0 to 300 km. We detected multiple clusters by the simultaneous detection in Section 2 and the two-stage detection in Section 3 in terms of the relationship between the log cancer mortality rate and the proportion of the population in urban areas. We assumed overlapping clusters because the simulation results in Section 4.2 showed that the coverage of the true clusters under the overlapping assumption is better than that under the non-overlapping assumption, even though its computation is somewhat slower. The p-values were obtained from 1000 Monte Carlo samples. The maximum radius for a potential cluster is set to R_max = 300 km because the largest circular cluster with radius R_max is large enough to cover all or the majority of each of the seven states.
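The construction of circular candidate clusters can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code used for the analysis: the function name `circular_clusters`, the planar coordinates, and the use of Euclidean distance are all assumptions (the text does not state how distances between county centroids were computed), and tied distances are not treated specially.

```python
import math


def circular_clusters(centroids, r_max):
    """Enumerate circular candidate clusters centred at each unit.

    centroids: list of (x, y) planar coordinates (assumed, e.g. in km).
    Returns a list of (centre index, radius, member indices) tuples; for a
    fixed centre the clusters are nested and grow by one unit at a time.
    """
    clusters = []
    for i, (xi, yi) in enumerate(centroids):
        # Distances from centre i to every unit, nearest first.
        # The centre itself is at distance 0, giving the radius-0 cluster.
        dists = sorted(
            (math.hypot(xj - xi, yj - yi), j)
            for j, (xj, yj) in enumerate(centroids)
        )
        members = []
        for d, j in dists:
            if d > r_max:
                break  # all remaining units are farther than r_max
            members.append(j)
            clusters.append((i, d, tuple(members)))
    return clusters


# Hypothetical coordinates for four units on a line.
toy_centroids = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
toy_clusters = circular_clusters(toy_centroids, r_max=2.5)
print(len(toy_clusters))  # prints: 8
```

With real data, `centroids` would hold the 616 county centroids and `r_max` would be 300; the count of 93,450 reported above depends on the actual geography and on how distances are measured, so this sketch is not expected to reproduce it exactly.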

Figure 5. Coefficient estimates from the geographically weighted regression (GWR).

5.2. Simultaneous Detection

Table II’s left panel and Table III’s top panel give the significant clusters and the corresponding coefficient estimates detected via the simultaneous detection method at α = 0.05. There are a total of six clusters, none of which overlap. The maps of the coefficient estimates for the detected clusters are given in Figure 6, and the corresponding scatter plots are shown in Figure 7. Different clusters have different coefficient estimates. Each cluster differs from the background in both slope and intercept, except the third cluster Ĉ₃ that covers Georgia, North Carolina, and South Carolina. This third cluster differs from the background in the intercept but hardly in the slope. The slopes are negative in the background and in the third cluster Ĉ₃ but are positive in the first two clusters, Ĉ₁ and Ĉ₂. Further, the slopes are close to zero in the last three clusters, Ĉ₄, Ĉ₅, and Ĉ₆. The negative slopes, in the background and in Ĉ₃, suggest a negative association between cancer mortality and the proportion of urban area. The clusters with almost zero slopes, namely southern Florida (Ĉ₄), central Georgia (Ĉ₅), and most of North Carolina with several counties of South Carolina (Ĉ₆), have lower intercepts than the background. The cluster in northwestern Mississippi (Ĉ₁) has the distinct pattern of a positive slope and a higher intercept than the background. In this cluster, the least urbanized county is 0% urban (purban = 0), while the most urbanized county is 83% urban (purban = 0.829). The difference in the fitted log cancer mortality rates between these two counties, ŷ(x_max) − ŷ(x_min), is 0.123, whereas the difference is −0.080 under the ordinary regression with no cluster. A small cluster of three counties in northern Florida (Ĉ₂) has a positive and steep slope, possibly due to Union county, which has the highest cancer mortality rate. In Ĉ₂, the least urbanized county is 35% urban (purban = 0.349) and the most urbanized county is 47% urban (purban = 0.474). These two counties differ by 0.639 in the fitted log cancer mortality rates, whereas the difference is −0.012 under the ordinary regression with no cluster.
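The reported differences in fitted values can be reproduced from the estimates in Table III. The sketch below assumes that the slope inside a cluster is the background slope plus the cluster's slope deviation θ̂, which is an interpretation of the table rather than something stated in this section; it does reproduce the reported figures up to rounding. The function name `fitted_difference` is hypothetical.

```python
def fitted_difference(slope, x_min, x_max):
    """Difference in fitted values, y_hat(x_max) - y_hat(x_min), along a line."""
    return slope * (x_max - x_min)


# Estimates from Table III, simultaneous detection, with all six clusters in the model.
beta1_background = -0.113  # background slope
theta_c1_slope = 0.261     # slope deviation for cluster 1
theta_c2_slope = 5.223     # slope deviation for cluster 2
beta1_no_cluster = -0.096  # slope of the ordinary regression with no cluster

# Cluster 1: purban ranges from 0 to 0.829.
diff_c1 = fitted_difference(beta1_background + theta_c1_slope, 0.0, 0.829)
diff_c1_ols = fitted_difference(beta1_no_cluster, 0.0, 0.829)

# Cluster 2: purban ranges from 0.349 to 0.474.
diff_c2 = fitted_difference(beta1_background + theta_c2_slope, 0.349, 0.474)
diff_c2_ols = fitted_difference(beta1_no_cluster, 0.349, 0.474)

print(diff_c1, diff_c1_ols, diff_c2, diff_c2_ols)
```

The four values agree with 0.123, −0.080, 0.639, and −0.012 to within 0.001; the small gaps come from the three-decimal rounding of the tabulated estimates.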

Table II.

Detected clusters (1) via the simultaneous cluster detection method at α = 0.05, and (2) via the two-stage cluster detection method at α = 0.05. The response is the log cancer mortality rate and the covariate is the proportion of the population in urban areas in a county. In the two-stage cluster detection’s result, clusters Ĉ₃ and Ĉ₅ share one common county.

	(1) Simultaneous detection				(2) Two-stage detection
Cluster	Centroid	Radius (km)	Counties	p-value	Centroid	Radius (km)	Counties	p-value	Stage
Ĉ₁	Sunflower, MS	122	22	0.001	Calhoun, MS	176	58	0.008	1st
Ĉ₂	Union, FL	31	3	0.002	Columbia, FL	58	8	0.013	1st
Ĉ₃	Habersham, GA	95	33	0.001	Habersham, GA	95	33	0.001	2nd
Ĉ₄	Glades, FL	214	28	0.002	Glades, FL	214	28	0.001	2nd
Ĉ₅	Peach, GA	128	59	0.001	Monroe, GA	101	39	0.006	2nd
Ĉ₆	Person, NC	251	79	0.003	–	–	–	–	–


Table III.

Coefficient estimates for sequentially detected clusters (1) via the simultaneous cluster detection method at α = 0.05, and (2) via the two-stage cluster detection method at α = 0.05. The response is the log cancer mortality rate and the covariate is the proportion of the population in urban areas in a county.

Number of Ĉ_j		0	1	2	3	4	5	6
(1) Simultaneous detection	β̂₀	5.242	5.234	5.233	5.240	5.246	5.253	5.260
	β̂₁	−0.096	−0.105	−0.106	−0.115	−0.083	−0.096	−0.113
	θ̂_{Ĉ₁,0}	–	0.213	0.213	0.213	0.213	0.213	0.213
	θ̂_{Ĉ₁,1}	–	0.261	0.261	0.261	0.261	0.261	0.261
	θ̂_{Ĉ₂,0}	–	–	0.143	0.143	0.143	0.143	0.143
	θ̂_{Ĉ₂,1}	–	–	5.223	5.223	5.223	5.223	5.223
	θ̂_{Ĉ₃,0}	–	–	–	−0.123	−0.123	−0.123	−0.123
	θ̂_{Ĉ₃,1}	–	–	–	0.002	0.002	0.002	0.002
	θ̂_{Ĉ₄,0}	–	–	–	–	−0.185	−0.185	−0.185
	θ̂_{Ĉ₄,1}	–	–	–	–	0.104	0.104	0.104
	θ̂_{Ĉ₅,0}	–	–	–	–	–	−0.074	−0.074
	θ̂_{Ĉ₅,1}	–	–	–	–	–	0.146	0.146
	θ̂_{Ĉ₆,0}	–	–	–	–	–	–	−0.059
	θ̂_{Ĉ₆,1}	–	–	–	–	–	–	0.167
(2) Two-stage detection	β̂₀	5.242	5.235	5.234	5.240	5.246	5.253	–
	β̂₁	−0.096	−0.115	−0.118	−0.127	−0.094	−0.092	–
	θ̂_{Ĉ₁,0}	–	0.096	0.096	0.096	0.096	0.096	–
	θ̂_{Ĉ₁,1}	–	0.361	0.361	0.361	0.361	0.361	–
	θ̂_{Ĉ₂,0}	–	–	0.274	0.274	0.274	0.274	–
	θ̂_{Ĉ₂,1}	–	–	1.286	1.286	1.286	1.286	–
	θ̂_{Ĉ₃,0}	–	–	–	−0.125	−0.125	−0.125	–
	θ̂_{Ĉ₄,0}	–	–	–	–	−0.135	−0.135	–
	θ̂_{Ĉ₅,0}	–	–	–	–	–	−0.096	–


Figure 6. Coefficient estimates with overlapping clusters that were significant at α = 0.05 via the simultaneous cluster detection.

Figure 7. Scatter plots with fitted regression lines with overlapping clusters that were significant at α = 0.05 via the simultaneous cluster detection.

5.3. Two-Stage Detection

Table II’s right panel and Table III’s bottom panel give the significant clusters and the corresponding coefficient estimates detected via the two-stage detection method at α = 0.05. There are a total of five detected clusters with one overlapping region. The maps of the coefficient estimates for the detected clusters are given in Figure 8, and the corresponding scatter plots are shown in Figure 9. The first two detected clusters, Ĉ₁ and Ĉ₂, are significant in the slopes, and the next three clusters, Ĉ₃–Ĉ₅, are significant in the intercepts only. A big cluster in North Carolina, which was significant in the simultaneous detection, is not identified via the two-stage detection. Other than that, the detected clusters are quite similar to those from the simultaneous detection. The first cluster (Ĉ₁) is centered at a county in Mississippi, and the second cluster (Ĉ₂) is in northern Florida and includes Union county. The third cluster (Ĉ₃) covers Georgia, North Carolina, and South Carolina and shares one county (Oconee county, Georgia) with another cluster in central Georgia (Ĉ₅). There is also a cluster in southern Florida (Ĉ₄). In Figure 8, the first map shows two clusters that have different slopes from the background, while the second map indicates that all the clusters have different intercept estimates. The two clusters in northern Mississippi with several counties of Alabama and Tennessee (Ĉ₁), and in northern Florida with a county of Georgia (Ĉ₂), have positive slopes and higher intercepts than the background. In Ĉ₁, the least urbanized county is 0% urban (purban = 0), while the most urbanized county is 97% urban (purban = 0.967). The difference in the fitted log cancer mortality rates between these two counties, ŷ(x_max) − ŷ(x_min), is 0.260, whereas the difference is −0.093 under the ordinary regression with no cluster. In Ĉ₂, the least urbanized county is 0% urban (purban = 0) and the most urbanized county is 47% urban (purban = 0.474). These two counties differ by 0.566 in the fitted log cancer mortality rates, whereas the difference is −0.045 under the ordinary regression with no cluster. The other three clusters, Ĉ₃, Ĉ₄, and Ĉ₅, have lower intercepts than the background, while they have the same negative slope as the background.

Figure 8. Coefficient estimates with overlapping clusters that were significant at α = 0.05 via the two-stage cluster detection.

Figure 9. Scatter plots with fitted regression lines with overlapping clusters that were significant at α = 0.05 via the two-stage cluster detection. The third and the fifth clusters share one common county, Oconee county in Georgia (OVLP CLS 3,5).

6. Conclusions and Discussion

In this paper, we have developed a new methodology to detect spatial clusters in the regression coefficients. Both the simultaneous detection and the two-stage detection methods can be used to find geographic regions that have a different relationship between a response variable and a covariate in a varying-coefficient regression setting. Although it is common practice to use circular clusters as we have done here, our methods can be modified to consider other shapes, such as ellipses and squares (e.g., [5–7]).

Our simulation study, which evaluated the power and the coverage of true clusters, suggests satisfactory performance of both methods. The simultaneous detection method is faster to compute than the two-stage detection. In the simultaneous cluster detection, the regression coefficient estimates are obtained for both the intercepts and the slopes in any detected cluster. However, some of the slope estimates may not differ significantly from the background. In contrast, the two-stage detection produces slope estimates only for those clusters whose slopes differ significantly from the background. For clusters in which the intercept, but not the slope, differs significantly from the background, only the intercept estimates are reported. Because this latter method consists of two separate stages, it is slower to compute than the simultaneous detection.

The simultaneous cluster detection and the two-stage cluster detection methods provide different results, but qualitatively the interpretation in both the locations and the coefficient estimates of the clusters is similar. Thus, between the two methods, we may choose one that is more suitable for the application.

For further research, we will consider more than one covariate. While the simultaneous detection can be readily extended to a multiple regression model, it is not easy to derive a multiple stage detection method from the two-stage detection, as the computational time increases greatly with more covariates.

Supplementary Material

Supporting Info 1: NIHMS982821-supplement-Supporting_Info_1.R (5.1 KB, R)

Supporting Info 2: NIHMS982821-supplement-Supporting_Info_2.R (10 KB, R)

Supporting Info 3: NIHMS982821-supplement-Supporting_Info_3.R (1.4 KB, R)

Supporting Info 4: NIHMS982821-supplement-Supporting_Info_4.RData (163.5 KB, RData)

Supporting Info 5: NIHMS982821-supplement-Supporting_Info_5.R (12.9 KB, R)

Supporting Info 6: NIHMS982821-supplement-Supporting_Info_6.RData (26.9 KB, RData)

Acknowledgments

The authors thank the editor, an associate editor, and two referees for their insightful and constructive comments. We also thank Maria Kamenetsky for her assistance with the cancer mortality dataset. Funding has been provided by a USDA Cooperative State Research, Education and Extension Service (CSREES) McIntire-Stennis project and a pilot project from the Center for Demography of Health and Aging at the University of Wisconsin-Madison.

Appendix A: Computational Aspects

A.1 Computational Complexity

The test statistic $T = \max_{C \in \mathcal{C}} F(C)$ in (6) is based on F statistics for the local hypotheses for all the possible clusters in $\mathcal{C} = \{C_{1}, C_{2}, \dots\}$. We consider a multiple regression model with (p − 1) covariates such that $x_{i} = (1, x_{1i}, \dots, x_{(p-1)i})^{T}$. Then, for a given cluster $C_{k}$, $F(C_{k})$ is defined as $F(C_{k}) = \{(\mathrm{SSE}_{0} - \mathrm{SSE}_{C_{k}})/p\} / \{\mathrm{SSE}_{C_{k}}/(N - 2p)\}$. A single calculation of $\mathrm{SSE}_{0}$ is enough because $\mathrm{SSE}_{0}$ is identical for all $C_{k}$, but $\mathrm{SSE}_{C_{k}}$ needs to be calculated for every given cluster $C_{k}$, which can be time consuming. Thus, we rewrite $\mathrm{SSE}_{C_{k}}$ as

$$\mathrm{SSE}_{C_{k}} = \sum_{i=1}^{N} y_{i}^{2} - \Big(\sum_{i \in C_{k}} x_{i} y_{i}\Big)^{T} \Big(\sum_{i \in C_{k}} x_{i} x_{i}^{T}\Big)^{-1} \Big(\sum_{i \in C_{k}} x_{i} y_{i}\Big) - \Big(\sum_{i=1}^{N} x_{i} y_{i} - \sum_{i \in C_{k}} x_{i} y_{i}\Big)^{T} \Big(\sum_{i=1}^{N} x_{i} x_{i}^{T} - \sum_{i \in C_{k}} x_{i} x_{i}^{T}\Big)^{-1} \Big(\sum_{i=1}^{N} x_{i} y_{i} - \sum_{i \in C_{k}} x_{i} y_{i}\Big). \tag{A.1}$$
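As an illustration of (A.1), the following sketch computes the cluster-model SSE and the F statistic from sums alone, for the simple case p = 2 (an intercept plus one covariate). It is a hedged sketch and not the authors' R implementation: the function names, the closed-form 2 × 2 inverse, and the toy data are all assumptions made for this example.

```python
def suff_stats(xs, ys, idx):
    """Sums over idx with x_i = (1, x): entries of sum x_i x_i^T and sum x_i y_i."""
    a = float(len(idx))                  # sum of 1 * 1
    b = sum(xs[i] for i in idx)          # sum of x
    d = sum(xs[i] ** 2 for i in idx)     # sum of x^2
    u = sum(ys[i] for i in idx)          # sum of y
    v = sum(xs[i] * ys[i] for i in idx)  # sum of x * y
    return (a, b, d), (u, v)


def quad_form(xtx, xty):
    """(X'y)^T (X'X)^{-1} (X'y) for the symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = xtx
    u, v = xty
    det = a * d - b * b
    return (d * u * u - 2.0 * b * u * v + a * v * v) / det


def sse_null(xs, ys):
    """SSE of the ordinary regression with no cluster."""
    xtx, xty = suff_stats(xs, ys, range(len(xs)))
    return sum(y * y for y in ys) - quad_form(xtx, xty)


def sse_cluster(xs, ys, cluster):
    """SSE of the cluster model via (A.1): inside sums plus totals minus inside sums."""
    # In a full scan the three global sums would be computed once and reused.
    xtx_all, xty_all = suff_stats(xs, ys, range(len(xs)))
    xtx_in, xty_in = suff_stats(xs, ys, sorted(cluster))
    xtx_out = tuple(t - s for t, s in zip(xtx_all, xtx_in))
    xty_out = tuple(t - s for t, s in zip(xty_all, xty_in))
    sum_y2 = sum(y * y for y in ys)
    return sum_y2 - quad_form(xtx_in, xty_in) - quad_form(xtx_out, xty_out)


def f_stat(xs, ys, cluster, p=2):
    """F(C) = {(SSE_0 - SSE_C) / p} / {SSE_C / (N - 2p)}."""
    sse0 = sse_null(xs, ys)
    sse_c = sse_cluster(xs, ys, cluster)
    return ((sse0 - sse_c) / p) / (sse_c / (len(xs) - 2 * p))


# Hypothetical toy data: the first four units follow a different line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.1, 2.9, 5.2, 6.8, 1.2, 0.1, -1.1, -1.9]
cluster = {0, 1, 2, 3}
sse_c = sse_cluster(xs, ys, cluster)
f_value = f_stat(xs, ys, cluster)
print(sse_c, f_value)
```

For these toy data, fitting separate lines inside and outside the cluster by hand gives residual sums of squares of 0.082 and 0.035, so `sse_c` should be 0.117 up to floating-point error.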

The components of $\mathrm{SSE}_{C_{k}}$ in (A.1) for a given cluster $C_{k}$ are $\sum_{i=1}^{N} y_{i}^{2}$, $\sum_{i=1}^{N} x_{i} x_{i}^{T}$, $\sum_{i=1}^{N} x_{i} y_{i}$, $\sum_{i \in C_{k}} x_{i} x_{i}^{T}$, and $\sum_{i \in C_{k}} x_{i} y_{i}$. Among these components, the first three need to be computed just once, but $\sum_{i \in C_{k}} x_{i} x_{i}^{T}$ and $\sum_{i \in C_{k}} x_{i} y_{i}$ need to be calculated for every $C_{k}$. Thus, the last two components are the bottlenecks in the computation of the test statistic T.

The computational complexities of these components are O(N), O(Np²), O(Np), O(|C_k|p²), and O(|C_k|p), respectively, where |·| denotes the cardinality of a set. Thus, the total computational complexity for all the clusters C_k ∈ 𝒞 = {C₁,C₂,…,C_K} is

$$O\Big\{N (1 + p^{2} + p)\Big\} + O\Big\{\sum_{k=1}^{K} |C_{k}| (p^{2} + p)\Big\} = O\Big\{\sum_{k=1}^{K} |C_{k}| (p^{2} + p)\Big\} \tag{A.2}$$

because $\sum_{k=1}^{K} |C_{k}| \gg N$.

A.2 Computational Algorithm

As in Section 2.2, we can consider a total of $m_{i}$ potential clusters centered at site i with radii $r_{i,1}, r_{i,2}, \dots, r_{i,m_{i}}$. Let $C(i, r_{i,q})$ be the cluster centered at site i with radius $r_{i,q}$ for $i = 1, \dots, N$ and $q = 1, \dots, m_{i}$. Then, $|C(i, r_{i,q})| = q$ and $\sum_{k=1}^{K} |C_{k}| = \sum_{i=1}^{N} \sum_{q=1}^{m_{i}} |C(i, r_{i,q})|$. Thus, (A.2) can be expressed as

$$O\Big\{\sum_{i=1}^{N} \sum_{q=1}^{m_{i}} q (p^{2} + p)\Big\} = O\Big\{\sum_{i=1}^{N} m_{i} (m_{i} + 1) (p^{2} + p) / 2\Big\}. \tag{A.3}$$

Because $C(i, r_{i,1}) \subset C(i, r_{i,2}) \subset \dots \subset C(i, r_{i,m_{i}})$ for clusters with the same centroid i, cumulative sums of $x_{i} x_{i}^{T}$ and $x_{i} y_{i}$ can be used to ease the bottleneck. That is,

$$\sum_{i' \in C(i, r_{i,q})} x_{i'} x_{i'}^{T} = \sum_{i' \in C(i, r_{i,q-1})} x_{i'} x_{i'}^{T} + \sum_{i' \in C(i, r_{i,q}) \setminus C(i, r_{i,q-1})} x_{i'} x_{i'}^{T}.$$

For example, suppose $C(1, r_{1,1}) = \{1\}$, $C(1, r_{1,2}) = \{1, 3\}$, $C(1, r_{1,3}) = \{1, 3, 7\}$, …, with the centroid i = 1. Then, $\sum_{i' \in C(1, r_{1,1})} x_{i'} x_{i'}^{T} = x_{1} x_{1}^{T}$ has the computational complexity O(p²). For the next cluster at the centroid i = 1, $\sum_{i' \in C(1, r_{1,2})} x_{i'} x_{i'}^{T} = x_{1} x_{1}^{T} + x_{3} x_{3}^{T}$. However, because $x_{1} x_{1}^{T}$ was already calculated for the previous cluster, only $x_{3} x_{3}^{T}$ needs to be computed, with complexity O(p²). For the next cluster $C(1, r_{1,3}) = \{1, 3, 7\}$, we only need to calculate $x_{7} x_{7}^{T}$, and its complexity is still O(p²). Thus, by using these cumulative sums, the number of mathematical operations for the $C(i, r_{i,q})$'s with the same centroid i can be reduced from $\sum_{q=1}^{m_{i}} |C(i, r_{i,q})| (p^{2} + p) = \sum_{q=1}^{m_{i}} q (p^{2} + p)$ to $\sum_{q=1}^{m_{i}} (p^{2} + p) = m_{i} (p^{2} + p)$. Thus, the total computational complexity for all $C_{k} \in \mathcal{C} = \{C_{1}, C_{2}, \dots, C_{K}\}$ becomes

$$O\Big\{\sum_{i=1}^{N} \sum_{q=1}^{m_{i}} (p^{2} + p)\Big\} = O\Big\{\sum_{i=1}^{N} m_{i} (p^{2} + p)\Big\} = O\{K (p^{2} + p)\}, \tag{A.4}$$

where $K = \sum_{i=1}^{N} m_{i}$.
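The cumulative-sum idea can be sketched for a single centroid as follows. This is an illustrative sketch for p = 2 (intercept plus one covariate), not the authors' code; the function names and toy data are hypothetical, and `ordered_units` is assumed to list the units by increasing distance from the centroid so that the qth cluster consists of its first q entries.

```python
def cumulative_suff_stats(xs, ys, ordered_units):
    """Sums for the nested clusters at one centroid, one unit added per step."""
    a = b = d = u = v = 0.0
    out = []
    for j in ordered_units:
        x, y = xs[j], ys[j]
        # Only the newly added unit contributes: O(p^2 + p) work per cluster.
        a += 1.0
        b += x
        d += x * x
        u += y
        v += x * y
        out.append(((a, b, d), (u, v)))
    return out


def naive_suff_stats(xs, ys, ordered_units):
    """Same sums, recomputed from scratch for every cluster (q units each time)."""
    out = []
    for q in range(1, len(ordered_units) + 1):
        idx = ordered_units[:q]
        xtx = (
            float(q),
            sum(xs[j] for j in idx),
            sum(xs[j] * xs[j] for j in idx),
        )
        xty = (sum(ys[j] for j in idx), sum(xs[j] * ys[j] for j in idx))
        out.append((xtx, xty))
    return out


# Toy data; the nested clusters mirror the example {1}, {1, 3}, {1, 3, 7}.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ordered_units = [1, 3, 7]

fast = cumulative_suff_stats(xs, ys, ordered_units)
slow = naive_suff_stats(xs, ys, ordered_units)

m = len(ordered_units)
unit_visits_fast = m                 # one unit per cluster
unit_visits_slow = m * (m + 1) // 2  # 1 + 2 + ... + m units
print(fast[-1], unit_visits_fast, unit_visits_slow)
```

Both functions return the same sums, but the naive version touches 1 + 2 + 3 = 6 units while the cumulative version touches 3, which is the per-centroid reduction from $\sum_{q} q$ to $m_{i}$ described above.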

The ratio of (A.3) to (A.4) is

$$(1/2) \Big(K^{-1} \sum_{i=1}^{N} m_{i}^{2} + 1\Big) \geq (1/2) \Big(K^{-1} \sum_{i=1}^{N} (K/N)^{2} + 1\Big) = (1/2) (K/N + 1). \tag{A.5}$$

The inequality in (A.5) suggests that the cumulative sums reduce the bottleneck computation by a factor of at least (K/N + 1)/2. For a 25 × 25 square grid in the unit square with a total of N = 625 cells, if we consider circular clusters with the maximum radius R_max = 1/5 unit, there are a total of K = 41,493 potential clusters. Thus, we could reduce the computation by a factor of about 30.
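The bound in (A.5) is easy to evaluate numerically. The sketch below uses hypothetical function names; it computes the lower bound for the grid example and compares the exact ratio with the bound for two made-up sets of cluster counts $m_{i}$.

```python
def speedup_lower_bound(K, N):
    """Right-hand side of (A.5): (K / N + 1) / 2."""
    return 0.5 * (K / N + 1)


def exact_ratio(m):
    """Left-hand side of (A.5) for per-centroid cluster counts m_1, ..., m_N."""
    K = sum(m)
    return 0.5 * (sum(mi * mi for mi in m) / K + 1)


# Grid example from the text: N = 625 cells and K = 41493 potential clusters.
grid_bound = speedup_lower_bound(41493, 625)

# Made-up counts: the bound is attained when all m_i are equal
# and is exceeded when they are unequal.
equal_counts = [4, 4, 4, 4]
unequal_counts = [1, 2, 3, 10]
print(grid_bound)
print(exact_ratio(equal_counts), speedup_lower_bound(sum(equal_counts), 4))
print(exact_ratio(unequal_counts), speedup_lower_bound(sum(unequal_counts), 4))
```

For the grid example the bound evaluates to roughly 33.7, consistent with the reduction by a factor of about 30 quoted above.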

Appendix B: Source Code

The algorithm of our methodology is implemented in R. The source code and an illustrative example are available in the Supporting Information.

Footnotes

Supporting information

Additional supporting information may be found in the online version of this article at the publisher’s web site.

References

1. Kulldorff M, Nagarwalla N. Spatial disease clusters: detection and inference. Statistics in Medicine. 1995;14:799–810. doi: 10.1002/sim.4780140809.
2. Kulldorff M. A spatial scan statistic. Communications in Statistics, Part A. 1997;26:1481–1496.
3. Duczmal L, Assuncao R. A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Computational Statistics and Data Analysis. 2004;45:269–284.
4. Gangnon RE, Clayton MK. Likelihood-based tests for localized detecting spatial clustering of disease. Environmetrics. 2004;15:797–810.
5. Tango T, Takahashi K. A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics. 2005;4:11. doi: 10.1186/1476-072X-4-11.
6. Assuncao R, Costa M, Tavares A, Ferreira S. Fast detection of arbitrarily shaped disease clusters. Statistics in Medicine. 2006;25:723–742. doi: 10.1002/sim.2411.
7. Kulldorff M, Huang L, Pickle L, Duczmal L. An elliptic spatial scan statistic. Statistics in Medicine. 2006;25:3929–3943. doi: 10.1002/sim.2490.
8. Kulldorff M, Huang L, Konty K. A scan statistic for continuous data based on the normal probability model. International Journal of Health Geographics. 2009;8:58. doi: 10.1186/1476-072X-8-58.
9. Gangnon RE. Local multiplicity adjustments for spatial cluster detection. Environmental and Ecological Statistics. 2010;17(1):55–71. doi: 10.1007/s10651-008-0101-0.
10. Neill DB. Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society, Series B. 2012;74(2):337–360.
11. Shu L, Jiang W, Tsui KL. A standardized scan statistic for detecting spatial clusters with estimated parameters. Naval Research Logistics. 2012;59:397–410.
12. Gangnon RE, Clayton MK. Bayesian detection and modeling of spatial disease clustering. Biometrics. 2000;56:922–935. doi: 10.1111/j.0006-341x.2000.00922.x.
13. Gangnon RE, Clayton MK. A hierarchical model for spatially clustered disease rates. Statistics in Medicine. 2003;22:3213–3228. doi: 10.1002/sim.1570.
14. Gangnon RE, Clayton MK. Cluster detection using bayes factors from overparameterized cluster models. Environmental and Ecological Statistics. 2007;14:69–82.
15. Lawson AB. Cluster modelling of disease incidence via rjmcmc methods: a comparative evaluation. Statistics in Medicine. 2000;19:2361–2375. doi: 10.1002/1097-0258(20000915/30)19:17/18<2361::aid-sim575>3.0.co;2-n.
16. Clark AB, Lawson AB. Spatial Cluster Modelling. Chapman and Hall/CRC: Boca Raton, FL; 2002. Spatio-temporal cluster modelling of small area health data; pp. 235–258.
17. Yan P, Clayton MK. A cluster model for space-time disease counts. Statistics in Medicine. 2006;25:867–881. doi: 10.1002/sim.2424.
18. Wakefield J, Kim A. A bayesian model for cluster detection. Biostatistics. 2013;14(4):752–765. doi: 10.1093/biostatistics/kxt001.
19. Brunsdon C, Fotheringham AS, Charlton ME. Geographically weighted regression: a method for exploring spatial nonstationarity. Geographical Analysis. 1996;28:281–298.
20. Fotheringham AS, Brunsdon C, Charlton ME. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Wiley; New York: 2002.
21. Lawson AB, Choi J, Zhang J. Prior choice in discrete latent modeling of spatially referenced cancer survival. Statistical Methods in Medical Research. 2014;23(2):183–200. doi: 10.1177/0962280212447148.
22. Zhang Z, Assuncao R, Kulldorff M. Spatial scan statistic adjusted for multiple clusters. Journal of Probability and Statistics. 2010 Article ID 642379:11.
23. Waller LA, Hill EG, Rudd RA. The geography of power: statistical performance of tests of clusters and clustering in heterogeneous populations. Statistics in Medicine. 2006;25:853–865. doi: 10.1002/sim.2418.
24. Gangnon RE. Local multiplicity adjustment for the spatial scan statistic using the Gumbel distribution. Biometrics. 2012;68:174–182. doi: 10.1111/j.1541-0420.2011.01643.x.


