Comparing Bayesian spatial models: Goodness-of-smoothing criteria for assessing under- and over-smoothing

Earl W Duncan; Kerrie L Mengersen

doi:10.1371/journal.pone.0233019

2020 May 20;15(5):e0233019. doi: 10.1371/journal.pone.0233019

Editor: Qiang Zeng

PMCID: PMC7239453 PMID: 32433653

Abstract

Background

Many methods of spatial smoothing have been developed for both point data and areal data. In Bayesian spatial models, smoothing is achieved by purposefully designed priors or smoothing functions which smooth estimates towards a local or global mean. Smoothing is important for several reasons, not least because it increases predictive robustness and reduces the uncertainty of the estimates. Despite these benefits, smoothing is all but ignored when it comes to model selection. Traditional goodness-of-fit measures focus on model fit and model parsimony, but neglect “goodness-of-smoothing”, and are therefore not necessarily good indicators of model performance. Comparing spatial models while taking into account the degree of spatial smoothing is not straightforward because smoothing and model fit can be viewed as opposing goals. Over- and under-smoothing of spatial data are genuine concerns, but have received very little attention in the literature.

Methods

This paper demonstrates the problem with spatial model selection based solely on goodness-of-fit by proposing several methods for quantifying the degree of smoothing. Several commonly used spatial models are fit to real data, and subsequently compared using the goodness-of-fit and goodness-of-smoothing statistics.

Results

The proposed goodness-of-smoothing statistics show substantial agreement in the task of model selection, and tend to avoid models that over- or under-smooth. Conversely, the traditional goodness-of-fit criteria often do not agree, and can lead to poor model choice. In particular, the well-known deviance information criterion tended to select under-smoothed models.

Conclusions

Some of the goodness-of-smoothing methods may be improved with modifications and better guidelines for their interpretation. However, these proposed goodness-of-smoothing methods offer researchers a solution to spatial model selection which is easy to implement. Moreover, they highlight the danger in relying on goodness-of-fit measures when comparing spatial models.

Introduction

Spatial smoothing is a technique used when modelling the underlying data-generating process of spatial data to account for spatial autocorrelation, as expressed by Tobler’s first law of geography: “… near things are more related than distant things” [1]. Neglecting spatial autocorrelation is akin to ignoring the order of time-series data, leading to greater uncertainty about the model parameters, poorer predictions, and misguided inference. Conversely, when spatial smoothing is applied, it has the benefits of representing the statistical uncertainty of model parameters more appropriately, improving predictions, and providing more insight into the layers of the underlying data-generating process, similar to how the trend and seasonality help explain layers of decomposed time series data [2, 3].

Many methods of spatial smoothing have been developed for both point data and areal data, including linear or non-linear functions based on distances, loess smoothing [4], spline functions [5], kriging [6] and Gaussian process priors [7, 8], and empirical Bayes approaches to spatial smoothing [9]. For an overview of smoothing techniques, see Kafadar [2] and Tiwari and Rushton [10]. Empirical Bayes methods, which smooth estimates of points on a spatial surface towards the global mean based on a distribution whose parameters are fixed a priori, gained popularity as computing power increased and parameter estimation techniques became more widely accessible. Fully Bayes methods have also been proposed [10, 11], where the surface is typically estimated by one or more spatially varying parameters with purposefully chosen prior distributions to account for the spatial autocorrelation.

Spatial smoothing plays an important role in a broad range of applications, including the assessment of feature significance [12] and seafloor classification in geostatistics [13], monitoring of groundwater contaminant plumes [14], image processing [11, 15], the calorific value distributions in coal facies [16], and analysis of traffic accidents [17, 18] to name a few.

Notwithstanding the benefits associated with spatial smoothing, there seems to be a small but growing awareness of the dangers associated with under- and over-smoothing. While over-smoothing causes genuine deviations from the local or global mean to be obscured [13, 19], under-smoothing is equally undesirable as it exaggerates features in the surface, making them indistinguishable from background noise, which defeats the point of spatial smoothing. The negative effects of under-smoothing are far less often voiced in spatial modelling than in time series modelling, where the link between under-smoothing and residual autocorrelation, large prediction errors, and biased hypothesis tests has been articulated [20, 21].

Despite the growing awareness, there is very little guidance in the literature on how to assess the appropriateness of the level of spatial smoothing, and to our knowledge, efforts to account for such smoothing in model selection are non-existent. The latter is evident from the widespread use of model selection criteria such as the Bayesian information criterion (BIC) [22], the deviance and related deviance information criterion (DIC) [23], and the widely applicable information criterion (WAIC) [24] to compare spatial models, ironically even in studies which aimed to assess the presence of under- or over-smoothing (see for example, Rodrigues and Assunção [25] and Law [26]). The problem is that these criteria are designed to quantify goodness-of-fit (GoF), that is, the discrepancy between the observed data and the predicted values from the model, while penalising for over-fitting (model complexity), but they fail to account for the spatial dependencies [27] and the effect that spatial smoothing has on model fit. Put another way, the problem of model selection can be viewed as an optimisation problem with several competing objective functions: in addition to GoF, model parsimony and predictive capability, spatial models necessitate an additional objective function, “goodness-of-smoothing” (GoS). Hence not only should a model which under- or over-smooths be given less preference, but a model with an appropriate amount of smoothing should be preferred over a model without any smoothing, even though it is likely to have a poorer GoF to the observed data.

In the context of Bayesian spatial modelling, spatial smoothing is typically implemented through a prior distribution using spatial weights to define the spatial dependencies; see Cramb et al. [28] for a critical review of popular Bayesian spatial models. One of the most common prior distributions for spatial random effects (SREs) in a Bayesian spatial model is the intrinsic conditional autoregressive (ICAR) prior [19, 29]. The BYM model [11] makes use of the ICAR prior, but also includes unstructured (independent) SREs so that the estimated risks are smoothed towards a local mean as well as a global mean [26]. The two random effects are henceforth referred to as the structured (SSRE) and unstructured spatial random effects (USRE). In response to the complexity of having two sets of SREs, Leroux et al. [30] proposed a model in which the SREs were a weighted mixture of the USRE and SSRE, the latter modelled by the ICAR prior. Although the BYM and Leroux models remain popular, especially in epidemiology [25], some concern about the potential for over-smoothing has been expressed (for example, see Smith et al. [19], Law [26], Kandhasamy and Ghosh [31], Lawson and Clark [32], Best et al. [33] and Cramb et al. [28]).

This paper has three aims: 1) to demonstrate that reliance on common GoF criteria for spatial model selection is inadequate; 2) to propose several methods for quantifying the degree of smoothing; and 3) to compare these methods against GoF statistics on real data. These methods were developed within the context of disease-mapping using areal data in a Bayesian framework. However, some of these methods were inspired by methodology outside this field and will be equally applicable to problems in other contexts, such as geostatistics; other methods are more specific to the disease-mapping context, but could potentially be extended to a broader class of models and problems with little modification.

Without loss of generality, we impose three constraints on our study. The first is the range of models considered. We limit our analysis to the BYM and Leroux models for several reasons: they are well known and widely used; the ICAR model, which underpins both the BYM and Leroux models, has been criticised for being susceptible to over-smoothing; and as the ensuing analysis reveals, a wide range of models with varying degrees of smoothing can be achieved simply via changes to the hyperprior specification. For the purpose of quantifying and comparing different degrees of smoothing, this is adequate. Moreover, given the large influence of the hyperpriors on smoothing, the choice of model seems secondary. More broadly, other approaches such as models based on Gaussian process priors will suffer similar issues with respect to under- and over-smoothing.

The second constraint is that we do not investigate the effect of spatial smoothing parameters or spatial weights on the degree of smoothing. Typically, in models such as the BYM and Leroux, spatial weights are based on first-order adjacency. That is, each pair of spatial units (areas) is assigned a weight of 1 if they are considered (typically geographically) adjacent and zero otherwise. This simplifies the spatial covariance function substantially and improves computation without substantial loss of information. However, many other formulations have been explored (see for example Earnest et al. [27], Law [26], and Duncan et al. [34]). Not only has this issue already received much attention, but the conclusions suggest that binary first-order adjacency weights are often a good choice anyway.

Third, the task of trying to determine the optimal amount of smoothing for a given model is not considered. Again, this has already been addressed in the literature (e.g. Evers et al. [35]), but more importantly, this task is impeded by the lack of guidance on how the degree of smoothing can be quantified.

The structure of this paper is as follows. The Methods section describes the Bayesian spatial models and introduces an important quantity derived from the model parameters which is subsequently used in the analysis. Also described in this section are five approaches to quantifying smoothing and three commonly used GoF measures, as well as the two spatial datasets. The Results section reports the parameter estimates and evaluates the GoF and GoS criteria, which are subsequently used to compare the models. These results and the limitations of this study are examined in the Discussion.

Methods

Bayesian spatial models

For specificity, we consider two spatial models for area-level count data that are commonly used in epidemiological modelling. For each model, the data are assumed to follow a Poisson distribution

y_{i} \sim Pois (E_{i} e^{μ_{i}})

where y_i and E_i are the observed and expected counts respectively, and μ_i is the log relative risk for the i^th area. Assuming k covariates and some weakly informative priors, the Leroux model [30] is specified as

μ_{i} = β^{T} x_{i} + s_{i}

β_{k} \sim N (0, σ^{2})

s_{i} | s_{\setminus i} \sim N (\frac{ρ \sum_{j} w_{i j} s_{j}}{ρ \sum_{j} w_{i j} + 1 - ρ}, \frac{σ_{s}^{2}}{ρ \sum_{j} w_{i j} + 1 - ρ})

ρ \sim Unif (0, 1)

σ_{s}^{2} \sim IG (α, η)

and the BYM model [11] is specified as

μ_{i} = β^{T} x_{i} + s_{i} + u_{i}

β_{k} \sim N (0, σ^{2})

s_{i} | s_{\setminus i} \sim N (\frac{\sum_{j} w_{i j} s_{j}}{\sum_{j} w_{i j}}, \frac{σ_{s}^{2}}{\sum_{j} w_{i j}})

u_{i} \sim N (0, σ_{u}^{2})

σ_{s}^{2} \sim IG (α, η)

σ_{u}^{2} \sim N {(0, 10)}^{+}

where β = (β₀,…,β_k)^T are the k + 1 regression coefficients, $IG$ denotes the inverse-gamma (IG) distribution, parameterised in terms of shape and rate, $N {(\cdot)}^{+}$ denotes a Normal distribution left-truncated at zero, and all Normal distributions including the truncated distribution are parameterised in terms of mean and variance. The spatial weights w_ij were fixed a priori as the binary, first-order adjacency weights and σ² was held fixed at 100, while different combinations of values of α and η were used to fit different models with varying degrees of smoothing.
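To make the role of ρ in the Leroux prior concrete, the following minimal Python sketch evaluates the conditional mean and variance of s_i given the remaining SREs. The function name, the four-area adjacency matrix and the numerical values are hypothetical illustrations and are not taken from the analysis. Setting ρ = 1 reproduces the ICAR conditional used in the BYM model, while ρ = 0 gives independent effects with no spatial smoothing.

```python
def leroux_conditional(i, s, W, rho, sigma2_s):
    """Mean and variance of s_i | s_{-i} under the Leroux prior."""
    others = [j for j in range(len(s)) if j != i]
    weighted_sum = sum(W[i][j] * s[j] for j in others)   # sum_j w_ij s_j
    weight_total = sum(W[i][j] for j in others)          # sum_j w_ij
    denom = rho * weight_total + 1.0 - rho
    return rho * weighted_sum / denom, sigma2_s / denom

# Hypothetical map of four areas arranged in a line: 0 - 1 - 2 - 3.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
s = [0.4, -0.2, 0.1, 0.3]   # hypothetical current values of the SREs

# rho = 1: ICAR conditional (mean of the neighbours, variance sigma2_s / n_i).
mean_icar, var_icar = leroux_conditional(1, s, W, rho=1.0, sigma2_s=0.5)
# rho = 0: independent effect centred on zero with variance sigma2_s.
mean_ind, var_ind = leroux_conditional(1, s, W, rho=0.0, sigma2_s=0.5)
```

Intermediate values of ρ pull the conditional mean only part of the way towards the neighbourhood mean, which is one reason the hyperprior on σ_s² alone does not determine the degree of smoothing.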

Given the sensitivity to the hyperprior for $σ_{s}^{2}$, left-truncated Normal (LTN) distributions, $N {(π, ν)}^{+}$, were also trialled. Other hyperpriors are possible (see Gelman [36] for example), but are not considered here for the sake of brevity. The specific values of α, η, ν, and π are included in S1 Table. It should be stressed that these values are not necessarily sensible from a practical standpoint; they were chosen deliberately to induce a set of maps with varying degrees of smoothing to test the methods for quantifying smoothing described below. Combining the two models (Leroux and BYM) with the two hyperprior families (IG and LTN) yields a total of 4 models, each with 12 model variants labelled A through L. While the relationship between the informativeness of a prior distribution and the impact it will have on smoothing is not straightforward, these model variants are approximately ordered in descending order of smoothing intensity.

Extensions of the standardised incidence ratio

In the disease mapping context, the ratio y_i/E_i is called the observed or ‘raw’ standardised incidence ratio (SIR). This is usually unstable due to low incidence and/or small populations at risk [10, 30, 37], and thus the goal is to provide a better estimate, given by the relative risk exp(μ_i), or posterior SIR.

We introduce a new quantity, the covariate-adjusted SIR (CASIR), which is a key component of the methods below,

\mathrm{CASIR}_{i} = \exp (μ_{i} - β^{T} x_{i})

which is equivalent to exponentiating the SRE, exp(s_i). In the case of the BYM model, the unstructured spatial random effects (USRE) are also subtracted from μ_i before exponentiating. Similarly, we define the covariate-adjusted raw SIR (CARSIR) as

\mathrm{CARSIR}_{i} = \frac{y_{i}}{E_{i}} \exp (- β^{T} x_{i})

where y_i/E_i is the raw SIR. As will become apparent, the smoothed SIR surface, given by exp(μ_i), may not necessarily appear smooth, and paradoxically may appear less smooth when more smoothing is applied, and vice versa. This is because the smoothness exhibited by the SIR depends on the effect of the covariate(s), and their relative contribution to the SIR compared to the SRE. Conversely, the CASIR directly reflects the degree of smoothing.

We justify use of the CASIR over the SRE for two reasons. First, the CASIR is comparable to the SIR, the main parameter of interest in these epidemiological models, by converting the SRE to a ratio scale parameter. Second, it allows a theoretical bound on the potential values of CASIR to be computed, which is a central feature of one of the approaches to quantifying smoothing described below. Taking logarithms of the raw SIR to compute a range for s_i is not reliable since y_i may be zero.
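The two definitions above can be sketched in a few lines of Python. The function names and all numerical values below are hypothetical; in practice μ_i and β would be posterior point estimates from the fitted model. The optional argument u is an illustrative way of removing the USRE for the BYM model, as described in the text.

```python
import math

def linear_predictor(beta, x_row):
    """Covariate contribution beta^T x_i for one area."""
    return sum(b * v for b, v in zip(beta, x_row))

def casir(mu, beta, x, u=None):
    """exp(mu_i - beta^T x_i); also removes u_i when a USRE vector is given (BYM)."""
    u = u if u is not None else [0.0] * len(mu)
    return [math.exp(m - linear_predictor(beta, row) - ui)
            for m, row, ui in zip(mu, x, u)]

def carsir(y, E, beta, x):
    """Raw SIR y_i / E_i with the covariate effect divided out."""
    return [(yi / Ei) * math.exp(-linear_predictor(beta, row))
            for yi, Ei, row in zip(y, E, x)]

# Hypothetical values for three areas (intercept plus one covariate).
beta = [0.1, 0.5]
x = [[1.0, 0.2], [1.0, -0.4], [1.0, 1.0]]
s = [0.3, -0.1, 0.0]                      # structured SREs
mu = [linear_predictor(beta, row) + si for row, si in zip(x, s)]
y = [0, 5, 12]                            # observed counts (note the zero)
E = [2.0, 4.0, 10.0]                      # expected counts

casir_values = casir(mu, beta, x)         # equals exp(s_i) for this Leroux-type example
carsir_values = carsir(y, E, beta, x)     # finite even where y_i = 0
```

The first area illustrates the second justification: its CARSIR is 0, a valid value on the ratio scale, whereas the logarithm of its raw SIR would be undefined.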

Computation

The Leroux model with the IG prior distribution was fit using the R package CARBayes [38] for computational efficiency, while the other three models were fit using WinBUGS [39] via the R package R2WinBUGS [40, 41]. Although CARBayes can fit the BYM model with an IG prior, only the sum of the estimated SREs is provided, whereas separate estimates of the SSRE and USRE are highly valuable for this analysis. Both packages use Markov chain Monte Carlo (MCMC) techniques to estimate the posterior distribution. Although other software is available which should produce very similar results, these packages were chosen for their reliability and convenience in fitting these particular spatial models.

Approaches to quantifying smoothing

There are potentially several ways to quantify the degree of smoothing attained by a given model. To address the second aim of this paper, five ideas are explored. The origins of these ideas and their technical details are described below.

Ratio of variograms

The classical variogram for area i at lag h is given by

γ_{i} (h) = \frac{1}{2 N_{i} (h)} \sum_{j \sim i} {(z_{i} - z_{j})}^{2}

where N_i(h) is the number of areas which are no more distant than the lag h from area i, j ∼ i denotes the areas j which satisfy d_ij < h, d_ij is the distance between areas i and j, and z_i is a measured variable for area i [3, 6]. Instead of using the Great Circle distance between the centroids of each area, we define d_ij as the minimum number of boundaries that must be crossed to move from area i to area j, as proposed by Knorr-Held and Raßer [42]. This appears to be more appropriate for areal data as it tends to provide smoother and more robust estimates of the variogram, especially for small lag values. Additionally, under this construction, adjacency of areas defines the autocorrelation in the variogram as well as the weights matrix in the modelling.

The variogram, averaged over the areas,

γ (h) = \frac{1}{N} \sum_{i = 1}^{N} γ_{i} (h),

provides a succinct visual representation of the spatial continuity of the variable z = (z₁,…,z_N). Plotting the variogram of CASIR against the variogram of the CARSIR may be helpful in assessing the degree of smoothing: a variogram that is too flat indicates over-smoothing, while a variogram that is similar to that for the raw SIR indicates under-smoothing. As a quantitative metric for assessing GoS, we propose the ratio of the variograms for CASIR to CARSIR, averaged over the areas and lag parameter. This can be compared against a user-specified target to determine whether the smoothing is appropriate.
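As a rough illustration of the variogram ratio, the sketch below computes boundary-crossing distances by breadth-first search on an adjacency list, evaluates the area-averaged variogram at each lag, and averages the ratio over the lags. Several choices here are assumptions rather than details given in the text: areas within lag h are taken as those with 0 < d_ij ≤ h (the strict inequality d_ij < h would shift the lags by one), the ratio is formed at each lag before averaging, areas without any neighbour at a given lag are skipped, and the five-area map and its values are hypothetical.

```python
from collections import deque

def boundary_distances(adj):
    """Minimum number of boundaries crossed between every pair of areas (BFS)."""
    n = len(adj)
    dist = []
    for start in range(n):
        d = [None] * n            # None marks areas that cannot be reached
        d[start] = 0
        queue = deque([start])
        while queue:
            a = queue.popleft()
            for b in adj[a]:
                if d[b] is None:
                    d[b] = d[a] + 1
                    queue.append(b)
        dist.append(d)
    return dist

def variogram(z, dist, h):
    """Area-averaged variogram at lag h (assumption: neighbours have 0 < d_ij <= h)."""
    total, used = 0.0, 0
    for i in range(len(z)):
        nbrs = [j for j in range(len(z))
                if j != i and dist[i][j] is not None and dist[i][j] <= h]
        if nbrs:                  # skip areas with no neighbour at this lag
            total += sum((z[i] - z[j]) ** 2 for j in nbrs) / (2 * len(nbrs))
            used += 1
    return total / used

def variogram_ratio(smoothed, raw, dist, lags):
    """Ratio of variograms (smoothed / raw), averaged over the supplied lags."""
    ratios = [variogram(smoothed, dist, h) / variogram(raw, dist, h) for h in lags]
    return sum(ratios) / len(ratios)

# Hypothetical map: five areas in a line, 0 - 1 - 2 - 3 - 4.
adj = [[1], [0, 2], [1, 3], [2, 4], [3]]
raw = [0.5, 1.8, 0.7, 1.6, 0.9]                  # stand-in for CARSIR
smoothed = [1 + 0.5 * (r - 1) for r in raw]      # stand-in for CASIR, shrunk halfway to 1

dist = boundary_distances(adj)
ratio = variogram_ratio(smoothed, raw, dist, lags=[1, 2, 3])
```

Because every value is shrunk halfway towards 1 in this toy example, all pairwise differences are halved and the ratio is one quarter at every lag, which would fall inside the unbiased range of 0.2 to 0.8 listed in Table 1.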

Kurtosis preservation

Drawing on inspiration from developments in time series analysis, we propose a method based on the work of Rong and Bailis [43]. The authors address the issue of over-smoothing in time series analysis by using a simple moving average smoothing function such that the moving average window size minimises the “roughness” (defined as the standard deviation of the first-order difference series) with the constraint that the kurtosis of the smoothed time series must be greater than or equal to the kurtosis of the original, unsmoothed time series. That is, they aim to smooth a time series as much as possible while preserving kurtosis. The result is that the smoothed time series retains rare large-scale deviations while smoothing out more frequent modestly sized deviations.

The methodology presented in Rong and Bailis [43] not only provides a technique for smoothing, but also a statistic for quantifying smoothness. It is the latter development that is of interest here, since the spatial smoothing is performed as part of the Bayesian modelling. However, spatial dependencies differ from longitudinal dependencies in terms of how individual units (areas or time points) are assumed to interact. As an analogy to first-order differences in time, we consider a first-order neighbourhood approach in space, that is, differences between a measure at a given area and the mean of its first-order neighbours. The roughness is the standard deviation of these differences over all areas.

For a generic spatial variable z_i associated with the i^th area, the excess kurtosis is defined as

Kurt (z_{i}) = \frac{E [{(z_{i} - \bar{z} (w_{i}))}^{4}]}{E {[{(z_{i} - \bar{z} (w_{i}))}^{2}]}^{2}} - 3

where $\bar{z} (w_{i})$ is the weighted mean of {z_i; i = 1,…,N}, and w_i is the vector of spatial weights pertaining to the i^th area. The overall measure of kurtosis is given by averaging over all areas, i = 1,…,N. A larger kurtosis implies that the variation is dominated by infrequent and extreme deviations [43].

Note that whether z is defined as the CASIR or SIR, the kurtosis is very similar when compared with their raw counterpart (i.e. CARSIR and raw SIR). However, the roughness can vary substantially, making inference difficult. In our analyses, the SIR was found to be a more reliable measure, which is what is presented here. For consistency, SIR was also used to compute the kurtosis, i.e. z_i = SIR_i.
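The spatial analogues of roughness and kurtosis can be sketched as follows. This is one possible reading of the definitions above, under stated assumptions: the weighted mean is the unweighted mean of the first-order neighbours (binary adjacency weights), the expectations in the kurtosis are replaced by averages over areas, and the sample standard deviation is used for the roughness. The map and the values are hypothetical.

```python
import statistics

def neighbour_deviations(z, adj):
    """z_i minus the mean of its first-order neighbours, for every area."""
    return [z[i] - statistics.fmean([z[j] for j in adj[i]]) for i in range(len(z))]

def roughness(z, adj):
    """Standard deviation of the neighbourhood deviations over all areas."""
    return statistics.stdev(neighbour_deviations(z, adj))

def excess_kurtosis(z, adj):
    """Fourth moment of the deviations over squared second moment, minus 3."""
    d = neighbour_deviations(z, adj)
    m2 = statistics.fmean([v ** 2 for v in d])
    m4 = statistics.fmean([v ** 4 for v in d])
    return m4 / m2 ** 2 - 3.0

# Hypothetical map: five areas in a line, 0 - 1 - 2 - 3 - 4.
adj = [[1], [0, 2], [1, 3], [2, 4], [3]]
raw = [0.5, 1.8, 0.7, 1.6, 0.9]                  # stand-in for the raw SIR
smoothed = [1 + 0.5 * (r - 1) for r in raw]      # uniform shrinkage halfway to 1

rough_raw, rough_smooth = roughness(raw, adj), roughness(smoothed, adj)
kurt_raw, kurt_smooth = excess_kurtosis(raw, adj), excess_kurtosis(smoothed, adj)

# Kurtosis-preservation check: kurtosis not reduced while roughness decreases.
preserved = kurt_smooth >= kurt_raw - 1e-9 and rough_smooth < rough_raw
```

Uniform shrinkage halves the roughness but leaves the kurtosis unchanged, since the ratio of moments is scale-invariant. A smoother that flattened only the rare large deviations would instead lower the kurtosis and fail the check.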

Kappa

Cohen’s kappa statistic [44] has been used previously in the spatial context to compare spatial agreement of patterns and to quantify the magnitude of spatial smoothing (e.g. Sterlacchini et al. [45] and Earnest et al. [27]). The statistic is defined as

κ = \frac{\Pr (A_{o}) - \Pr (A_{e})}{1 - \Pr (A_{e})}

where Pr(A_o) and Pr(A_e) are the observed and expected proportions of agreement between two spatial variables respectively,

\Pr (A_{o}) = \frac{1}{N} \sum_{a} c_{a a}

\Pr (A_{e}) = \frac{1}{N} \sum_{a} \frac{\sum_{j} c_{a j} \times \sum_{i} c_{i a}}{N}

and {c_ij} are the elements of a confusion matrix formed from the cross-tabulation of the categories of nominal variables [44], with each sum running over the categories. To cross-tabulate values of continuous variables like the observed and smoothed SIRs, they must first be categorised by specifying “epidemiologically meaningful” thresholds [27, 46]. Following the suggestions of Earnest et al. [27] and Sterlacchini et al. [45], kappa was computed on the quantiles of CASIR and CARSIR using 3 categories (2 cut-offs: 0.25 and 0.75) as well as 5 categories (4 cut-offs: 0.1, 0.3, 0.7, 0.9).
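A minimal sketch of the kappa computation follows. It assumes that each variable is categorised using its own quantiles and that quantiles are obtained by linear interpolation between order statistics; neither detail is specified in the text, and the helper names are hypothetical.

```python
import bisect

def quantile(values, p):
    """Quantile by linear interpolation between order statistics (an assumption)."""
    s = sorted(values)
    pos = p * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def categorise(values, probs):
    """Category index = number of quantile cut-offs strictly below the value."""
    cuts = [quantile(values, p) for p in probs]
    return [bisect.bisect_left(cuts, v) for v in values]

def cohen_kappa(a, b, n_categories):
    """Cohen's kappa from the confusion matrix of two categorised variables."""
    n = len(a)
    c = [[0] * n_categories for _ in range(n_categories)]
    for ai, bi in zip(a, b):
        c[ai][bi] += 1
    p_obs = sum(c[k][k] for k in range(n_categories)) / n
    p_exp = sum(sum(c[k]) * sum(row[k] for row in c)      # row total x column total
                for k in range(n_categories)) / n ** 2
    return (p_obs - p_exp) / (1.0 - p_exp)   # undefined if p_exp == 1

# Hypothetical CARSIR and CASIR estimates for eight areas, 3 categories.
carsir = [0.4, 2.1, 0.9, 1.5, 0.2, 1.1, 3.0, 0.8]
casir = [0.9, 1.2, 1.0, 1.1, 0.95, 0.8, 1.3, 1.05]
kappa_3 = cohen_kappa(categorise(casir, [0.25, 0.75]),
                      categorise(carsir, [0.25, 0.75]), 3)
```

Two identical categorisations give κ = 1, and a confusion matrix whose cells match the product of its margins gives κ = 0.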

In addition to being designed for categorical data, Cohen’s kappa has attracted several criticisms. Interpretation of kappa is not straightforward since its magnitude can be influenced by multiple factors, and it may not be clear which factor(s) is responsible [45, 46]. While there is no consensus on how to interpret kappa, some guidelines have been suggested in the literature (e.g. Landis and Koch [47]). Broadly, kappa values less than or close to zero indicate a lack of agreement, while kappa values close to 1 indicate substantial agreement [45–47]. However, the difficulty of interpreting kappa is exacerbated in the spatial context. The statistic does not take into account the spatial structure of the two variates being compared, and being symmetric, there is no clear “baseline” for assessing agreement. Consequently, there is no unambiguous connection between kappa and the degree of smoothness exhibited by the spatial variables. This problem is illustrated in Fig 1.

In Fig 1A and 1B, there is perfect agreement between variables A and B. However, the surfaces in Fig 1B are not smooth, so a kappa value close to 1 does not necessarily indicate a high degree of smoothness. In Fig 1C, there is perfect disagreement, yet both surfaces are smooth, so the low kappa value should not be interpreted as a low degree of smoothness. In Fig 1D, kappa is approximately 0.04. Regardless of how this is interpreted, it is not clear how it would apply to surfaces A and B simultaneously.

If one of the two variables being compared is designated as the baseline, then this may help in the interpretation. For example, consider the two variables raw and smoothed SIR. The null hypothesis is that the raw SIR is not smooth. As smoothing increases, the disagreement between these variables will increase, thereby reducing the kappa value. Thus it has been suggested that smaller kappa values indicate greater smoothing [27].

In the absence of more definitive guidelines, the following metric to assess the GoS was devised using the results from the other methods as calibration: $\hat{κ} < 0.05$ indicates over-smoothing, $0.05 < \hat{κ} < 0.95$ indicates a reasonable degree of smoothing, and $\hat{κ} > 0.95$ indicates under-smoothing.

Note that Earnest et al. [27] compute kappa for the raw and smoothed SIR. However, the only covariates included in their models are temporal, not spatial, making these variables more comparable to the CARSIR and CASIR respectively. As explained above when introducing CASIR, it is necessary to remove the effect of spatial covariates when assessing spatial smoothing. Consequently, in this paper, kappa is computed for the estimates of CASIR and CARSIR, treating the latter as the baseline for agreement.

Fraction of spatial variation

Earnest et al. [27] and Law [26] also consider comparing models based on the fraction of spatial variation explained by the model. This is defined as the ratio of the empirical variance captured by the SSRE to the total spatial variation,

ψ = \frac{Var (s)}{Var (s) + Var (u)}

where s and u are the SSRE and USRE in the BYM model respectively, the only model considered by Earnest et al. [27]. As illustrated in Duncan et al. [34], this ratio, albeit using standard deviations rather than variances, is helpful in solving the identifiability issue between s and u by modifying these random effects according to $ψ$, an adjustment which has been applied to the results from all the BYM model variants in this paper. It is not meaningful to compute this ratio again after modification, nor is this ratio applicable to other models which have only one set of SREs, like the Leroux model.

To generalise this concept to all spatial models with a SRE, s, we propose redefining the total spatial variation to be $Var (s) + Var (ε)$ where $ε = {(ε_{1}, \dots, ε_{N})}^{T}$ are the model residuals, which for the BYM model includes the unstructured spatial random effect. That is, the residuals for the BYM model are defined as

ε_{i} = E_{i} e^{μ_{i} - u_{i}} - y_{i}

since the USREs do not contribute to an understanding of the spatial variation but rather represent spatial noise. To compute the ratio, the posterior median for a posterior sample of size M is computed before computing the variance over the areas, i.e.

Var (s) = \frac{1}{N - 1} \sum_{i = 1}^{N} {(s_{i}^{*} - {\bar{s}}^{*})}^{2}

Var (ε) = \frac{1}{N - 1} \sum_{i = 1}^{N} {(ε_{i}^{*} - {\bar{ε}}^{*})}^{2}

where

s_{i}^{*} = \underset{m = 1, \dots, M}{\operatorname{median}} \{s_{i}^{(m)}\}

{\bar{s}}^{*} = \frac{1}{N} \sum_{i = 1}^{N} s_{i}^{*}

and similarly for $ε_{i}^{*}$ and ${\bar{ε}}^{*}$ . Whether a small or large fraction of spatial variation is preferred depends on the reason for modelling the SIR [27]. Moreover, it is not obvious what values would be considered small or large in general or in a particular application. Given the lack of guidelines for interpreting this statistic for the purpose of assessing the degree of spatial smoothing, this criterion was not given further consideration when comparing the models. However, the results are reported below for completeness.
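The generalised fraction of spatial variation can be sketched as below: posterior medians are taken area by area, and the sample variance (divisor N − 1) is then computed over the areas. The input layout (a list of M posterior draws, each a list of N area values) and all numbers are hypothetical.

```python
import statistics

def area_medians(samples):
    """Posterior median for each area from M draws of an N-vector."""
    n_areas = len(samples[0])
    return [statistics.median([draw[i] for draw in samples]) for i in range(n_areas)]

def fraction_spatial_variation(s_samples, eps_samples):
    """Var(s) / (Var(s) + Var(eps)), with variances taken over areas."""
    var_s = statistics.variance(area_medians(s_samples))      # divisor N - 1
    var_eps = statistics.variance(area_medians(eps_samples))
    return var_s / (var_s + var_eps)

# Hypothetical posterior samples: M = 3 draws for N = 3 areas.
s_samples = [[0.1, -0.2, 0.4],
             [0.3, -0.1, 0.2],
             [0.2, -0.3, 0.3]]
eps_samples = [[1.0, -2.0, 0.5],
               [0.0, -1.0, 1.5],
               [2.0, 0.0, 1.0]]

psi = fraction_spatial_variation(s_samples, eps_samples)
```

For the BYM model, the residual draws passed in would be those defined above, that is, computed with the USRE removed from μ_i.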

Relative position of CASIR

The fifth approach to quantifying smoothing begins with the observation that if no smoothing (i.e. no shrinkage) occurs, then the smoothed SIR and CASIR become more similar to their raw counterparts. As the degree of smoothing increases, each estimate of the SIR is smoothed towards the mean of its neighbours, subject to the model constraints and a priori knowledge imposed by the prior distributions. When the maximum amount of smoothing is applied to area i,

\mathrm{CASIR}_{i} \to E (\mathrm{CASIR}_{j \sim i} | y)

which approaches 1 as the SRE tends to zero. This does not imply that all areas will be smoothed towards the global mean, since areas may experience different degrees of smoothing. In fact, some areas will undoubtedly be smoothed away from the global mean. Notwithstanding some small deviations due to the use of posterior point estimates and properties of the posterior sample such as convergence and effective sample size, the CASIR_i estimate will lie somewhere between CARSIR_i and the posterior mean of its neighbours, E(CASIR_{j ∼ i} | y). If the relative positions of CASIR at these two extremes are denoted 0 and 1 respectively, then the relative position quantifies the degree of smoothing exhibited by a given area. To quantify the overall degree of smoothing for a given model, the distribution of these relative positions is compared against a specified cut-off (see Table 1 for some examples).
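One way to make the relative position operational is linear interpolation between the two extremes, sketched below. The interpolation formula, the use of neighbouring CASIR point estimates in place of the posterior mean of the neighbours, and the treatment of a zero-width interval are all assumptions made for illustration; the three-area map and its values are hypothetical.

```python
def relative_positions(casir, carsir, adj):
    """Position of CASIR_i on a scale where 0 = CARSIR_i and 1 = neighbour mean."""
    positions = []
    for i in range(len(casir)):
        target = sum(casir[j] for j in adj[i]) / len(adj[i])   # mean of neighbours
        span = target - carsir[i]
        # A zero-width interval has no defined position; mark it as NaN.
        positions.append((casir[i] - carsir[i]) / span if span != 0 else float("nan"))
    return positions

def passes_cutoff(positions, lower, upper, min_share):
    """True if at least min_share of the areas lie within [lower, upper]."""
    inside = sum(lower <= r <= upper for r in positions)   # NaN counts as outside
    return inside / len(positions) >= min_share

# Hypothetical map: three areas in a line, 0 - 1 - 2.
adj = [[1], [0, 2], [1]]
casir = [1.0, 1.2, 0.9]
carsir = [0.6, 2.0, 0.9]

positions = relative_positions(casir, carsir, adj)
# Unbiased cut-off from Table 1: at least 75% of areas within 0.01 to 0.99.
result = passes_cutoff(positions, 0.01, 0.99, 0.75)
```

In this toy example the third area is not smoothed at all (position 0), so only two of the three areas fall inside the range and the variant does not pass the unbiased criterion.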

Table 1. Cut-offs used to construct the GoS criteria.

Statistic	Cut-off type	Criterion
Variogram ratio	(u)	The ratio, averaged over the lag, is between 0.2 and 0.8.
Variogram ratio	(c)	The ratio, averaged over the lag, is between 0.25 and 0.75.
Variogram ratio	(pu)	The ratio, averaged over the lag, is between 0.1 and 0.4.
Kurtosis preservation	(u)	The kurtosis of CASIR ≥ kurtosis of CARSIR and the roughness of CASIR is less than the minimum roughness + 30%.
Kurtosis preservation	(c)	The kurtosis of CASIR ≥ kurtosis of CARSIR and the roughness of CASIR is less than the minimum roughness + 10%.
Kappa	(u)	Kappa lies between 0.05 and 0.95.
Kappa	(c)	Kappa lies between 0.1 and 0.9.
Kappa	(pu)	Kappa lies between 0.05 and 0.7.
Relative position of CASIR	(u)	At least 75% of the N CASIR point estimates lie within the range 0.01 to 0.99 (inclusive).
Relative position of CASIR	(c)	At least 85% of the N CASIR point estimates lie within the range 0.02 to 0.98 (inclusive).
Relative position of CASIR	(pu)	At least 75% of the N CASIR point estimates lie within the range 0.2 to 0.98 (inclusive).

(u) = unbiased; (c) = conservative (less likely to choose under- or over-smoothed models); (pu) = penalise under-smoothing more heavily than over-smoothing.

Assessing GoS criteria

Several criteria were used to classify the models based on example cut-offs, listed in Table 1. These cut-offs can be adjusted in the same way that different cut-offs for DIC and WAIC can be specified to broaden or narrow the set of models considered “good”. A “PASS” indicates that the model variant is neither under- nor over-smoothing under the given criterion. Note that unlike the other GoS approaches, the kurtosis preservation method only has 2 cut-offs as it is not obvious how this criterion can be adjusted to penalise under-smoothing in favour of models with more smoothing. For the reasons outlined above, the fraction of spatial variation is excluded.

Goodness-of-fit and predictive performance

To address the first and third aims of this paper, we consider the following criteria commonly used to measure GoF and check predictive performance. Many studies involving model selection amongst competing spatial models use DIC, which evaluates the model GoF while penalising for model complexity (e.g. Law [26] and Earnest et al. [27]). The DIC was proposed by Spiegelhalter et al. [23] as a generalisation of Akaike’s information criterion (AIC) [48] using information theoretic justification. The DIC can be defined as

\mathrm{DIC} = 2 p_{D} - 2 \log p (y | \bar{θ})

where p_D is the effective dimension of the model and $p (y | \bar{θ})$ is the likelihood evaluated at the posterior mean of the unknown parameters, θ. The WAIC [24, 49] is a similar criterion, defined as

\mathrm{WAIC} = 2 p_{W} - 2 \log \prod_{i = 1}^{N} E_{θ} [p (y_{i} | θ_{i}) | y_{i}] .

The advantages of WAIC over DIC include that it uses the entire posterior distribution, is invariant to parameterisation, and closely approximates Bayesian cross-validation [49, 50]. Both GoF criteria are considered here for comprehensiveness.

Gelman et al. [49] propose two variants of p_W. Here we use the second variant,

p_{W} = \sum_{i = 1}^{N} var [\log p (y_{i} | θ_{i}) | y_{i}]

which, after simplification, leads to the specific WAIC criterion

\mathrm{WAIC} = 2 \sum_{i = 1}^{N} \left\{ \underset{m = 1, \dots, M}{var} [\log p (y_{i} | θ_{i}^{(m)})] - \log \frac{1}{M} \sum_{m = 1}^{M} p (y_{i} | θ_{i}^{(m)}) \right\}

where $θ_{i}^{(m)}$ is the estimate of the unknown parameter(s) for the i^th area and m^th MCMC iteration. Predictions, or theoretical future observations, denoted $\tilde{y}$ , can be drawn from the posterior predictive distribution

p (\tilde{y} | y) = \int p (\tilde{y} | θ) p (θ | y) d θ

which can be used to assess predictive performance. The idea is that if the model is adequate in describing the data generating process, then the predicted data $\tilde{y}$ will be close to the observed data y. Thus these posterior predictive checks (PPCs) can be viewed as a variation on GoF diagnostics [51].
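The sample-based WAIC and the posterior predictive draws both operate on the same MCMC output. The sketch below evaluates the WAIC from a matrix of pointwise log-likelihoods and draws one replicate $\tilde{y}$ per posterior draw by composition sampling. It assumes a Poisson likelihood, y_i ~ Poisson(E_i exp(μ_i)), uses the sample variance with divisor M − 1, and relies on a simple Poisson sampler suited only to small means. These choices, the function names, and the toy values are illustrative assumptions and are not taken from the paper.

```python
import math
import random

def poisson_logpmf(y, lam):
    """Log of the Poisson probability mass function at count y with mean lam."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def waic(loglik):
    """WAIC from loglik[m][i] = log p(y_i | theta_i^(m)); needs at least 2 draws."""
    n_draws = len(loglik)
    n_areas = len(loglik[0])
    total = 0.0
    for i in range(n_areas):
        column = [loglik[m][i] for m in range(n_draws)]
        mean_ll = sum(column) / n_draws
        # Variance of the log-likelihood over draws (area i's share of p_W)
        var_ll = sum((v - mean_ll) ** 2 for v in column) / (n_draws - 1)
        # log of the average likelihood, computed with log-sum-exp for stability
        peak = max(column)
        log_mean_lik = peak + math.log(sum(math.exp(v - peak) for v in column) / n_draws)
        total += var_ll - log_mean_lik
    return 2.0 * total

def draw_poisson(lam, rng):
    """Knuth's multiplication method; adequate only for small means."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def posterior_predictive(expected, mu_samples, rng):
    """One replicate per posterior draw: y~_i ~ Poisson(E_i * exp(mu_i^(m)))."""
    return [[draw_poisson(e * math.exp(m), rng) for e, m in zip(expected, mu)]
            for mu in mu_samples]

# Hypothetical toy data: 3 areas and 3 posterior draws
y = [3, 0, 7]
expected = [2.5, 1.0, 5.0]
mu_samples = [[0.1, -0.3, 0.4], [0.2, -0.5, 0.3], [0.0, -0.4, 0.35]]
loglik = [[poisson_logpmf(yi, e * math.exp(m)) for yi, e, m in zip(y, expected, mu)]
          for mu in mu_samples]
print("WAIC:", round(waic(loglik), 3))
print("Replicates:", posterior_predictive(expected, mu_samples, random.Random(1)))
```

Comparing summaries of the replicates with the observed counts gives a simple posterior predictive check.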

One specific PPC is the conditional predictive ordinate (CPO) [52] which seeks to re-observe a datum y_i given all other observed data, denoted $y_{\setminus i}$,

\mathrm{CPO}_{i} = p(y_{i} | y_{\setminus i}) = \int p(y_{i} | θ) p(θ | y_{\setminus i}) dθ .

This metric is equivalent to the posterior predictive ordinate (PPO), $p (y_{i} | y)$ , in the sense that the set of leave-one-out marginal distributions $\{p (y_{i} | y_{\setminus i}); i = 1, \dots, N\}$ contains the same information about the predictive performance as the marginal distribution p(y) [51, 53]. However, the CPO avoids double use of the data since it is a leave-one-out cross-validation predictive density. Additionally, unlike the PPO, the literature contains several useful guidelines for interpreting the CPO [51]. For detecting outlying observations, Congdon [54] suggests scaling the CPO values by dividing them by the maximum CPO value. Scaled CPOs less than 0.01 suggest areas for which the model does not fit well. To compare models, several overall measures of fit have been proposed (e.g. Ntzoufras [51] and Congdon [54]). However, the most numerically stable option seems to be the sum of the log CPO values, as suggested by Held et al. [55], which we adopt here. The best model is taken to be the model which minimises

- \sum_{i = 1}^{N} \log (\mathrm{CPO}_{i}) .
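The CPO quantities described above can be sketched from MCMC output as follows. The text does not state how the CPO values were computed, so this sketch assumes the common Monte Carlo estimator in which CPO_i is the harmonic mean of the likelihood p(y_i | θ^(m)) over the posterior draws. The function names and input values are hypothetical.

```python
import math

def log_cpo(loglik):
    """log CPO_i per area from loglik[m][i] = log p(y_i | theta^(m)).

    Assumes the harmonic-mean estimator: CPO_i = 1 / mean_m[1 / p(y_i | theta^(m))].
    """
    n_draws = len(loglik)
    n_areas = len(loglik[0])
    result = []
    for i in range(n_areas):
        neg = [-loglik[m][i] for m in range(n_draws)]
        # log of the mean inverse likelihood, via log-sum-exp for stability
        peak = max(neg)
        log_mean_inv = peak + math.log(sum(math.exp(v - peak) for v in neg) / n_draws)
        result.append(-log_mean_inv)
    return result

def cpo_summary(loglik, threshold=0.01):
    """Return (negative sum of log CPO, scaled CPOs, indices of poorly fitted areas)."""
    log_cpos = log_cpo(loglik)
    max_log = max(log_cpos)
    # Scale by the maximum CPO so that the best-fitting area has value 1
    scaled = [math.exp(v - max_log) for v in log_cpos]
    poorly_fitted = [i for i, s in enumerate(scaled) if s < threshold]
    return -sum(log_cpos), scaled, poorly_fitted

# Hypothetical log-likelihoods: 2 draws (rows) by 3 areas (columns)
loglik = [[math.log(0.20), math.log(0.30), math.log(0.001)],
          [math.log(0.40), math.log(0.30), math.log(0.002)]]
criterion, scaled, poorly_fitted = cpo_summary(loglik)
print("criterion:", round(criterion, 3), "flagged areas:", poorly_fitted)
```

Working on the log scale avoids underflow when individual likelihood values are very small, which is one reason the sum of log CPO values is numerically convenient.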

In addition to these GoF criteria, we use Moran’s I statistic [56] to measure the degree of autocorrelation remaining in the model residuals, checking the model assumption that the residuals are independent and identically distributed.
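A minimal sketch of Moran’s I applied to model residuals is given below. It assumes a binary adjacency matrix as the spatial weights, the standard moments of the statistic under the normality assumption, and a one-sided alternative of positive autocorrelation. The weights, the sidedness of the test, and the toy residuals are assumptions for illustration, since the text does not specify them.

```python
import math

def morans_i(residuals, weights):
    """Moran's I and a one-sided p-value under the normality assumption."""
    n = len(residuals)
    mean = sum(residuals) / n
    z = [r - mean for r in residuals]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    i_stat = (n / s0) * cross / sum(v * v for v in z)
    # Moments of I under the null hypothesis of no autocorrelation
    expected_i = -1.0 / (n - 1)
    s1 = 0.5 * sum((weights[i][j] + weights[j][i]) ** 2
                   for i in range(n) for j in range(n))
    s2 = sum((sum(weights[i]) + sum(weights[j][i] for j in range(n))) ** 2
             for i in range(n))
    variance = ((n * n * s1 - n * s2 + 3 * s0 * s0)
                / ((n * n - 1) * s0 * s0)) - expected_i ** 2
    z_score = (i_stat - expected_i) / math.sqrt(variance)
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z_score / math.sqrt(2))
    return i_stat, p_value

# Hypothetical example: 6 areas arranged in a chain, residuals with a clear trend
n_areas = 6
weights = [[1 if abs(i - j) == 1 else 0 for j in range(n_areas)]
           for i in range(n_areas)]
residuals = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
i_stat, p_value = morans_i(residuals, weights)
print(f"Moran's I = {i_stat:.3f}, p-value = {p_value:.4f}")
```

A small p-value here indicates residual spatial autocorrelation; the statistic is unchanged if all residuals are multiplied by the same constant, so it carries no information about their magnitude.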

To compare models with respect to predictive performance, the minimum DIC and WAIC were determined for each model, indicating the best model fit, and model variants with a DIC or WAIC within 2 or 7 units of the minimum were identified as having reasonable model fit, as per the common rule of thumb [23]. Smaller values of the negative sum of log CPOs indicated better predictive performance, and the model variant with the minimum was flagged. Moran’s I was compared across model variants using p-values from the test assuming normality of the statistic under the null hypothesis of no autocorrelation.
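The rule of thumb for DIC and WAIC can be sketched as a simple classification of each model variant by its distance from the minimum value. The function name, the labels, the treatment of the cut-offs as inclusive, and the criterion values below are all hypothetical and serve only to illustrate the rule.

```python
def classify_by_criterion(values, cutoffs=(2.0, 7.0)):
    """Label each model variant by how far its DIC or WAIC is from the minimum."""
    tight, loose = cutoffs
    best = min(values.values())
    labels = {}
    for variant, value in values.items():
        delta = value - best
        if delta == 0.0:
            labels[variant] = "best"
        elif delta <= tight:      # assumed inclusive cut-off
            labels[variant] = f"within {tight:g}"
        elif delta <= loose:      # assumed inclusive cut-off
            labels[variant] = f"within {loose:g}"
        else:
            labels[variant] = "poorer fit"
    return labels

# Hypothetical DIC values for five variants of one model
dic_values = {"A": 312.4, "B": 310.9, "C": 305.1, "D": 306.3, "E": 311.0}
labels = classify_by_criterion(dic_values)
print(labels)
```

Changing the `cutoffs` argument broadens or narrows the set of variants regarded as having a reasonable fit.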

Data

Two spatial datasets are analysed. The first is the North Carolina sudden infant death syndrome (SIDS) dataset first presented by Atkinson [57], and subsequently augmented and analysed by Cressie and Read [37] and Cressie and Chan [58] amongst others. The observed data represent counts of SIDS aggregated from 1979 to 1983 for each of the 100 counties in North Carolina. The non-white birth rate over the same period is included here as a covariate. The second dataset is the Scottish lip cancer dataset compiled by Kemp et al. [59] and first analysed in Clayton and Kaldor [9]. These data have been previously analysed by Spiegelhalter et al. [23], Leroux et al. [30], and Duncan et al. [34] amongst others. The observed data represent counts of lip cancer across 56 counties of Scotland, and a spatial covariate representing the percentage of the workforce employed in agriculture, fishing, or forestry, acting as a proxy for sun exposure, is included. A graphical summary of the data is shown in Fig 2. To improve visual interpretation, the northeast island counties of Scotland, Shetland and Orkney, are excluded from all maps. This modification is limited to the maps; data from these counties are still used and estimates for these counties are still generated by the models.

Fig 2 — The observed and expected counts are shown in greyscale; the gradient is capped at the maximum observed value (57 and 39 respectively); larger expected values are shown in black. The colour gradient for the raw SIR reflects a ratio scale; darker shades of red indicate a higher raw SIR, while darker shades of blue indicate a lower raw SIR.

These datasets were chosen for the following reasons: they each contain one useful spatial covariate, which is essential in demonstrating the importance of CASIR; each study region contains a sufficient number of areas to enable adequate evaluation of spatial effects; they have been extensively analysed previously, corroborating the plausibility of the model specifications and parameter estimates presented here; and they are publicly available data, facilitating reproducibility. Additionally, these data represent real cases. This is an advantage over simulated data, which may not resemble realistic data and may therefore cast doubt on the authenticity of the model results and the accuracy of the approaches to quantifying smoothing.

Results

For the sake of brevity, the ensuing figures relate mostly to the lip cancer data, with the remaining results presented in S1 Appendix. Key parameter estimates for the BYM model with IG hyperpriors fit to the lip cancer dataset are summarised in Fig 3. The values represent the posterior means. The first four columns correspond to the linear scale parameters: the SSRE (s_i), USRE (u_i), covariate effect (βx_i), and the logarithm of the smoothed SIR (μ_i). The last two columns correspond to the ratio scale parameters, namely the SIR ( $e^{μ_{i}}$ ) and CASIR ( $e^{s_{i}}$ ). The colour gradient is consistent within each of these two classes of variables (i.e. same hues indicate the same values), but the legend reflects the range of values for the specific variable. Note that the degree of smoothing generally decreases as the model variant increases from A to L.

Maps of the key parameter estimates for all the alternative models and variants, for both the lip cancer and SIDS datasets, are provided as supplementary material (see S1 Fig through S8 Fig).

The spatial pattern of the SIR appears similar across model variants, while the CASIR varies considerably. The contrast between the SIR and CASIR is greater when more smoothing is applied, highlighting the value of CASIR when trying to investigate the occurrence of over-smoothing. This is particularly true for this model applied to this dataset, as the maps of the SIR look similar to the map of the raw SIR in Fig 2. A visual inspection of the SIR maps in Fig 3 (and maps of the remaining 7 model variants in S3 Fig) might lead one to conclude that all model variants have under-smoothed, when in fact the majority of the model variants are likely to be over-smoothed, as the subsequent analysis reveals. Moreover, aside from the SIR, these model variants vary considerably in the estimated SSRE, USRE, and covariate effect, each providing different statistical inference.

The smoothing paradox effect on the SIR surface is not readily observed in Fig 3, but is quite noticeable in the results for the Leroux model variants on the lip cancer data (see S1 and S2 Figs), and the BYM model variants on the SIDS data (see S7 and S8 Figs). The extent of this effect depends largely on the contribution of the covariate effect to the log-risk surface and how spatially autocorrelated the covariate is.

Goodness-of-fit criteria

The results for the GoF criteria and Moran’s I p-values are summarised in Figs 4 and 5.

The interpretation of Figs 4 and 5 is the same. For the DIC and WAIC, the model that minimises the respective criterion is highlighted in blue. This is the best model under this criterion. Models with a DIC or WAIC value within 2 or 7 units of the minimum are highlighted in lighter shades of blue, indicating a reasonable model fit. For the CPO, the model which minimises the criterion is highlighted. For Moran’s I, models with small p-values are highlighted in red.

There are two important observations to be made here. First, the GoF criteria DIC, WAIC, and CPO are rarely in agreement, and sometimes identify very different models. For example, in Fig 4, for the Leroux LTN model, the model variants considered “best” under each of the three GoF criteria are L, F, and C. Second, the best model under DIC sometimes coincides with a low Moran’s I p-value. However, Moran’s I p-values should be interpreted cautiously: a high degree of autocorrelation amongst near-zero residuals should not warrant the same concern as highly autocorrelated residuals that are large in magnitude. This is especially true for those models closer to variant L, which have less smoothing and therefore generally have smaller residuals (see S9 and S10 Figs).