Summary.
Statisticians analyzing spatial data often need to detect and model associations based upon distances on the Earth’s surface. Accurate computation of distances is sought for exploratory and interpretation purposes, as well as for developing numerically stable estimation algorithms. When the data come from locations on the spherical Earth, application of Euclidean or planar metrics for computing distances is not straightforward. Yet, planar metrics are desirable because of their easier interpretability, easy availability in software packages, and well-established theoretical properties. While distance computations are indispensable in spatial modeling, their importance and impact upon statistical estimation and prediction have gone largely unaddressed. This article explores the different options in using planar metrics and investigates their impact upon spatial modeling.
Keywords: Correlation functions, Geodetic distances, Geographical Information Systems, Isotropic models, Map projections, Spatial range, Spherical coordinates
1. Introduction
The analysis and modeling of geographically referenced data play an indispensable role in diverse disciplines such as environmental sciences, ecology, and public health. Such data are often obtained from a set of locations referenced by geographical coordinates (longitude and latitude) that form the “spatial domain.” Spatial modeling attempts to detect and model associations between the observed variables as a function of distances (and perhaps angles) between locations.
Distance computations are indispensable in spatial analysis. Precise intersite distance computations are used in variogram analysis to assess the strength of spatial association. They help in specifying priors on the range parameter in Bayesian modeling (Ecker and Gelfand, 1997), and in setting starting values for the nonlinear least-squares algorithms in classical analysis (Cressie, 1993), making them crucial for correct interpretation of spatial range and the convergence of statistical algorithms. Yet, this is not an issue that has received much attention in the existing statistical literature and ambiguity prevails among practicing statisticians about distance metrics. For example, the analysis of the scallops data appearing in Kaluzny et al. (1998, p. 76–79), and in Ecker and Gelfand (1997), uses naive Euclidean distances treating the geographical coordinates as planar. Except when the spatial domain is “small enough” as to have negligible curvature (where how “small” is “small enough” depends upon the specific application), the usual planar metrics for calculating distances are inappropriate. Treating geodetic coordinates as planar can induce deceptive anisotropy in the models because of the difference in differentials in longitude and latitude (a unit increment in degree longitude is not the same length as a unit increment in degree latitude except at the equator). Spurious nonstationarity may be induced as well due to the systematic properties of these differentials.
Nevertheless, Euclidean metrics are popular due to their simplicity and availability in standard software. More importantly, statistical modeling of spatial correlations proceeds from correlation functions that are often valid only with Euclidean metrics. As we demonstrate later, applying these metrics on geographical coordinates requires care, and can otherwise have unattractive consequences on statistical estimation and subsequent interpretation. Note that in geostatistics interest often resides in points that are “closer” together, so the sensitivity of planar metrics may seem irrelevant. However, an important feature of formal spatial modeling (particularly isotropic models) is inference on the effective spatial range, a critical distance beyond which spatial correlation is deemed negligible. The range is relative to the spatial domain and is likely more sensitive to the definition of distance, especially for larger domains.
This article explores options for computing distances and investigates their impact upon statistical modeling, keeping in mind the practicing modeler. While mathematical cartography presents a rich literature (see, e.g., Snyder, 1987) in the study of geodetic distortions and planar projections, such topics focus upon the thematic properties based upon mapping objectives, but are less useful for practicing statisticians seeking appropriate intersite distances for statistical modeling.
Spatial statisticians, however, often need to use cartographic concepts and do so using Geographical Information Systems (GIS). These databases offer versatile interfaces for manipulating and visualizing spatial data and play an indispensable role in spatial statistics that is too extensive to be addressed comprehensively here (see, e.g., Jones, 1997). Focusing upon distance computations, GIS offers a wide array of planar map projections using appropriate coordinate transformations, and more flexible distance computations using polygonal methods. Map projections and polygonal methods both require caution for computing distances. The former always distorts distances and can influence statistical estimation as we discuss later. The polygonal methods treat “distances” informally (e.g., actual roadway distance or rail track distance), rather than as purely geometric concepts. Such intersite distance matrices can be imported from GIS, but they need not be valid arguments for statistical correlation functions, leading to unstable or even infeasible numerical algorithms. Here, we focus upon direct distance computations and do not discuss the polygonal methods further.
We also restrict attention to point-referenced or geostatistical data where the sites are fixed, as opposed to point processes where the sites (and hence intersite distances) are random. The remainder of this article evolves by reviewing a basic framework for spatial modeling, concentrating upon isotropic models, where distances are particularly helpful for interpreting the spatial range. In Section 3, we discuss distance computations using the spherical coordinate system and map projections. Section 4 illustrates the impact of the different metrics on statistical modeling and Section 5 concludes the article with a summary.
2. Review of Spatial Regression Models
There is a growing literature on statistical modeling for point-referenced or geostatistical data. The most common setting assumes a response or a dependent variable $Y(\mathbf{s})$ observed at a generic location $\mathbf{s}$, referenced by its latitude and longitude, along with a vector of covariates $\mathbf{x}(\mathbf{s})$. One seeks to model the dependent variable in a spatial regression setting such as
$$Y(\mathbf{s}) = \mathbf{x}^{T}(\mathbf{s})\boldsymbol{\beta} + w(\mathbf{s}) + \epsilon(\mathbf{s}). \tag{1}$$
The residual is partitioned into a spatial process, $w(\mathbf{s})$, capturing residual spatial association, and an independent process, $\epsilon(\mathbf{s})$, also known as the nugget effect, modeling pure error. Inferential goals include estimation of regression coefficients, spatial and nugget variances, and the strength of spatial association through distances.
When we have observations, $\mathbf{Y} = (Y(\mathbf{s}_1), \ldots, Y(\mathbf{s}_n))^{T}$, from $n$ locations, we treat the data as a partial realization of a spatial process, modeled through $w(\mathbf{s})$. Hence, $w(\mathbf{s})$ is a zero-centered Gaussian process with variance $\sigma^2$ and a valid correlation function $\rho(d; \phi, \nu)$, which depends upon intersite distances $d$ and parameters $\phi$ and $\nu$ quantifying correlation decay and smoothness of process realizations. Also, we assume $\epsilon(\mathbf{s}_i)$ are i.i.d. $N(0, \tau^2)$. Likelihood-based inference proceeds from the distribution of the data, $\mathbf{Y} \sim N(X\boldsymbol{\beta}, \Sigma)$, with $\Sigma = \sigma^2 H + \tau^2 I$, where $X$ is the matrix of covariates (or model matrix) and $H$ is the spatial correlation matrix (corresponding to $\mathbf{w} = (w(\mathbf{s}_1), \ldots, w(\mathbf{s}_n))^{T}$) with $H_{ij} = \rho(d_{ij}; \phi, \nu)$. See Cressie (1993) for details, including maximum likelihood and restricted maximum likelihood methods.
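As a minimal sketch of how an intersite distance matrix enters the likelihood, the covariance matrix (spatial variance times the correlation matrix, plus the nugget variance on the diagonal) can be assembled as follows. The exponential correlation, the function names, and the three-site distance matrix are illustrative assumptions, not the author's code.

```python
import math

def exponential_corr(d, phi):
    """Exponential correlation exp(-phi * d) for a distance d >= 0."""
    return math.exp(-phi * d)

def covariance_matrix(dist, sigma2, tau2, phi):
    """Return Sigma = sigma2 * H + tau2 * I as a list of lists.

    dist is a symmetric n x n list of intersite distances with a zero
    diagonal; H[i][j] is the correlation at distance dist[i][j].
    """
    n = len(dist)
    return [
        [sigma2 * exponential_corr(dist[i][j], phi) + (tau2 if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]

# Hypothetical intersite distances (km) between three sites.
dist = [[0.0, 100.0, 250.0],
        [100.0, 0.0, 180.0],
        [250.0, 180.0, 0.0]]
Sigma = covariance_matrix(dist, sigma2=0.1, tau2=0.01, phi=0.01)
```

Whatever metric produces `dist`, the resulting matrix must be positive definite for the Gaussian likelihood to be usable, which is why the choice of metric matters.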
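Statistical prediction (kriging) at a new location $\mathbf{s}_0$ proceeds from the conditional distribution of $Y(\mathbf{s}_0)$ given the data (for details see, e.g., Banerjee, Carlin, and Gelfand, 2004, p. 48–52). Collecting all the model parameters into $\boldsymbol{\Omega}$, we note that
$$E[Y(\mathbf{s}_0) \mid \mathbf{Y}, \boldsymbol{\Omega}] = \mathbf{x}^{T}(\mathbf{s}_0)\boldsymbol{\beta} + \boldsymbol{\gamma}^{T}\Sigma^{-1}(\mathbf{Y} - X\boldsymbol{\beta}), \qquad \mathrm{Var}[Y(\mathbf{s}_0) \mid \mathbf{Y}, \boldsymbol{\Omega}] = \sigma^2 + \tau^2 - \boldsymbol{\gamma}^{T}\Sigma^{-1}\boldsymbol{\gamma},$$
where $\boldsymbol{\gamma} = (\gamma_1, \ldots, \gamma_n)^{T}$ and $\gamma_i = \sigma^2\rho(d_{0i}; \phi, \nu)$, with $d_{0i}$ the distance between $\mathbf{s}_0$ and $\mathbf{s}_i$, when $\mathbf{s}_0 \neq \mathbf{s}_i$ for all $i$, while the nugget effect $\tau^2$ is added to the $i$th entry if $\mathbf{s}_0 = \mathbf{s}_i$ for some $i$. Classical prediction computes the best linear unbiased predictor (BLUP) by substituting maximum likelihood estimates for the above parameters. A Bayesian solution first computes a posterior distribution $P(\boldsymbol{\Omega} \mid \text{Data}) \propto L(\text{Data} \mid \boldsymbol{\Omega})\,P(\boldsymbol{\Omega})$, where $L(\text{Data} \mid \boldsymbol{\Omega})$ is the normal data likelihood and $P(\boldsymbol{\Omega})$ is the prior distribution for the parameters, and then computes the posterior predictive distribution by marginalizing over the posterior distribution, $P(Y(\mathbf{s}_0) \mid \text{Data}) = \int P(Y(\mathbf{s}_0) \mid \text{Data}, \boldsymbol{\Omega})\,P(\boldsymbol{\Omega} \mid \text{Data})\,d\boldsymbol{\Omega}$.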
The function $\rho(\cdot)$ depends upon the metric used to compute the intersite distances $d_{ij}$ and must ensure that the correlation matrix $H$ is positive definite. Valid classes of correlation functions for Euclidean spaces are generated by Bochner’s theorem (see, e.g., Stein, 1999), highlighting the theoretical importance of Euclidean metrics. Apparently, the parameters most sensitive to the choice of the metric are those associated with the correlation function. Also, note that the correlation function features prominently in both the likelihood and the predictive distribution, suggesting concern regarding inferential sensitivity to the metric.
Focusing upon the correlation function parameters, we consider the flexible Matérn family, where $\rho(d; \phi, \nu)$ involves a smoothness parameter $\nu$ in addition to the correlation decay parameter $\phi$, and is given by
$$\rho(d; \phi, \nu) = \frac{1}{2^{\nu-1}\Gamma(\nu)}\left(2\sqrt{\nu}\,\phi d\right)^{\nu} K_{\nu}\!\left(2\sqrt{\nu}\,\phi d\right), \qquad \phi > 0,\ \nu > 0,$$
where $\Gamma(\cdot)$ is the usual Gamma function and $K_{\nu}$ is the modified Bessel function of the second kind of order $\nu$ (see, e.g., Abramowitz and Stegun, 1965). In particular, with $\nu = 0.5$, we obtain (up to a rescaling of $\phi$) the exponential correlation function $\rho(d; \phi) = \exp(-\phi d)$. Recent interest in the study of smoothness of a spatial process and spatial gradients (Stein, 1999; Banerjee, Gelfand, and Sirmans, 2003) warrants estimation of $\nu$. In our context it is unclear how the distance metrics will affect smoothness, so we investigate the exponential (with fixed $\nu = 0.5$) and the more general Matérn family with unknown $\nu$.
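The Matérn correlation can be evaluated with the standard library alone, as in the hedged sketch below. It assumes the parametrization in which the Bessel argument is $2\sqrt{\nu}\,\phi d$ (other scalings of $\phi$ are also in use), and it computes $K_\nu$ from the integral representation $K_\nu(x) = \int_0^\infty e^{-x\cosh t}\cosh(\nu t)\,dt$ with a plain trapezoid rule. The function names and the step count are illustrative choices.

```python
import math

def bessel_k(nu, x, steps=4000):
    """Modified Bessel function of the second kind K_nu(x), for x > 0.

    Trapezoid rule on the integral of exp(-x cosh t) * cosh(nu t) over
    [0, t_max], where t_max is chosen so the integrand is negligible beyond it.
    """
    t_max = math.acosh(1.0 + 40.0 / x)
    h = t_max / steps
    total = 0.5 * (math.exp(-x)
                   + math.exp(-x * math.cosh(t_max)) * math.cosh(nu * t_max))
    for k in range(1, steps):
        t = k * h
        total += math.exp(-x * math.cosh(t)) * math.cosh(nu * t)
    return h * total

def matern_corr(d, phi, nu):
    """Matern correlation at distance d, assuming the 2*sqrt(nu)*phi*d scaling."""
    if d == 0.0:
        return 1.0
    x = 2.0 * math.sqrt(nu) * phi * d
    return x ** nu * bessel_k(nu, x) / (2.0 ** (nu - 1.0) * math.gamma(nu))

# With nu = 0.5 the family collapses to an exponential in the scaled argument.
rho_half = matern_corr(100.0, 0.01, 0.5)
```

Under this scaling, `matern_corr(d, phi, 0.5)` equals `exp(-sqrt(2) * phi * d)`, which is an exponential correlation with a rescaled decay parameter.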
A Bayesian framework is convenient here, allowing inference by assigning proper and moderately informative priors on the weakly identified correlation function parameters. For example, for the smoothness parameter $\nu$ in the Matérn, we can follow Stein (1999) that the data cannot distinguish between $\nu = 2$ and $\nu > 2$, which suggests placing a $U(0, 2)$ prior on $\nu$. Usually a Markov chain Monte Carlo (MCMC) algorithm is required to obtain the joint posterior distribution of the parameters, but again there are different strategies to opt for. For example, we may work with the marginalized likelihood as above, $\mathbf{Y} \sim N(X\boldsymbol{\beta}, \sigma^2 H + \tau^2 I)$, or we may add a hierarchy with spatial random effects $\mathbf{w}$ such that
$$\mathbf{Y} \mid \mathbf{w} \sim N(X\boldsymbol{\beta} + \mathbf{w}, \tau^2 I), \qquad \mathbf{w} \sim N(\mathbf{0}, \sigma^2 H).$$
In either framework, a Gibbs sampler may be designed, with embedded Metropolis or slice-sampling steps, to obtain the marginal posterior distribution (see, e.g., Banerjee et al., 2004).
Much more complex hierarchical models have been discussed extensively in the spatial literature but, irrespective of their complexity, they typically incorporate a spatial correlation function whose computation involves intersite distance computations. Therefore, although we work with simpler isotropic spatial models, our results will be relevant in a broader context.
3. Computing Distances
We consider a few different approaches for computing distances on the Earth, classifying them as those arising from the classical spherical coordinates, and those arising from planar projections. Our treatment is comparative, eliciting some nontrivial aspects that impact spatial modeling. While spherical geometry may suggest natural metrics (such as the geodetic metric to be discussed shortly), we do not recommend a true distance metric because none may be appropriate for scientific data analysis. Henceforth, $\|\cdot\|$ will denote the Euclidean metric in $\mathbb{R}^2$ or $\mathbb{R}^3$ as the case may be.
Recall that a spherical model of the Earth is divided by parallels of latitude, referencing east–west, and the meridians of longitude that are great-circle arcs (circle passing through the two points with center as the center of the Earth) joining the poles, intersecting the parallels orthogonally. The Earth is not exactly a sphere, but an ellipsoid (surface obtained by revolving an ellipse). For geodetic computations requiring very high degrees of accuracy, the ellipsoidal model of the Earth is used, but for spatial modeling a spherical model suffices. In fact, apart from locations in the polar regions the accuracy of the spherical model is excellent.
3.1. Spherical Coordinates, the Geodetic Formula, and Euclidean Approximations
Figure 1 shows the spherical coordinate system, where $P_1$ and $P_2$ are two points on the surface of the Earth (sphere not shown) with center $O$, given by longitudes $\lambda_1$ and $\lambda_2$, and latitudes $\theta_1$ and $\theta_2$. The geodetic distance is the length of the arc of a great circle joining $P_1$ and $P_2$ and is obtained as $D = R\psi$, where $R$ is the radius of the Earth and $\psi$ is the angle between the vectors $\overrightarrow{OP_1}$ and $\overrightarrow{OP_2}$. A three-dimensional orthogonal coordinate system is set up with the origin at the center $O$, the $z$-axis directed toward the North Pole, the $x$-axis, on the equatorial plane, along the Greenwich meridian (the 0° meridian, passing through Greenwich, England), and the $y$-axis perpendicular to the $x$-axis on the equatorial plane.
Projecting $\overrightarrow{OP_1}$ and $\overrightarrow{OP_2}$ onto the $xy$-plane, we obtain, for a point with longitude $\lambda$ and latitude $\theta$,
$$x = R\cos\theta\cos\lambda, \qquad y = R\cos\theta\sin\lambda, \qquad z = R\sin\theta.$$
Letting $\mathbf{u}_1$ and $\mathbf{u}_2$ be the unit vectors $\overrightarrow{OP_1}/R$ and $\overrightarrow{OP_2}/R$, our desired angle is given by $\cos\psi = \langle \mathbf{u}_1, \mathbf{u}_2 \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the inner product between these vectors. Simple trigonometric identities reveal the geodetic distance as
$$D = R\arccos\left[\sin\theta_1\sin\theta_2 + \cos\theta_1\cos\theta_2\cos(\lambda_1 - \lambda_2)\right]. \tag{2}$$
This is given in Cressie (1993, p. 265) as the great-arc distance. The correct scale is obtained with the angle in radian measure with the distance expressible in kilometers or miles depending upon the unit of $R$. Using $R = 6371$ km results in a sufficiently good approximation. For example, to obtain the geodetic distance between Chicago (87.63W, 41.88N) and Minneapolis (93.22W, 44.89N), we plug in the appropriate values in (2) to obtain the required distance as approximately 562 km.
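The great-arc computation in (2) is short enough to sketch directly. The version below is illustrative only: the function name is hypothetical, coordinates are assumed to be decimal degrees with west longitudes negative, and the Earth radius is taken as 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # spherical Earth radius (assumed value)

def geodetic_distance(lon1, lat1, lon2, lat2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in decimal degrees."""
    lam1, th1, lam2, th2 = map(math.radians, (lon1, lat1, lon2, lat2))
    cos_angle = (math.sin(th1) * math.sin(th2)
                 + math.cos(th1) * math.cos(th2) * math.cos(lam1 - lam2))
    # Guard against rounding pushing the cosine slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return radius * math.acos(cos_angle)

# West longitudes are entered as negative degrees.
chicago = (-87.63, 41.88)
minneapolis = (-93.22, 44.89)
d = geodetic_distance(*chicago, *minneapolis)
```

For the Chicago and Minneapolis coordinates quoted in the text, `d` should come out close to the geodetic entry in Table 1.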
The transcendental nature of equation (2) dispels any misconception that the relationship between the Euclidean distances and the geodetic distances is just a matter of scaling and merits further investigation. A simple scaling of the geographical coordinates results in a “naive Euclidean” metric (as is done by Kaluzny et al., 1998 and Ecker and Gelfand, 1997) obtained directly in degree units, and converted to kilometer units as $\|P_1 - P_2\|\,\pi R/180$, where $P_i = (\lambda_i, \theta_i)$ is expressed in degrees. This metric performs well on small domains but always overestimates the geodetic distance, flattening out the meridians and parallels, and stretching the curved domain onto a plane, thereby stretching distances as well. As the domain increases, the estimation deteriorates.
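For comparison, the “naive Euclidean” computation treats the degree coordinates as planar and rescales by $\pi R/180$. The sketch below is a minimal illustration with a hypothetical function name and an assumed radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # spherical Earth radius (assumed value)

def naive_euclidean_distance(lon1, lat1, lon2, lat2, radius=EARTH_RADIUS_KM):
    """Planar distance in degree units, rescaled to km by pi * R / 180."""
    degrees = math.hypot(lon1 - lon2, lat1 - lat2)
    return degrees * math.pi * radius / 180.0

# Chicago and Minneapolis, west longitudes entered as negative degrees.
d_naive = naive_euclidean_distance(-87.63, 41.88, -93.22, 44.89)
```

For this pair of cities the result is roughly 706 km, well above the geodetic distance, because a degree of longitude at these latitudes is much shorter than a degree of latitude.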
A more natural metric to consider is along the “chord” joining the two points. This is simply the Euclidean metric $\|\overrightarrow{OP_1} - \overrightarrow{OP_2}\|$ in $\mathbb{R}^3$, yielding a “burrowed through the Earth” distance, namely the chordal length between $P_1$ and $P_2$. The slight underestimation of the geodetic distance is expected, because the chord “penetrates” the domain, producing a straight line approximation to the geodetic arc.
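The chordal metric only requires converting each site to three-dimensional Cartesian coordinates and taking the ordinary Euclidean distance. A minimal sketch, with hypothetical function names and an assumed radius of 6371 km:

```python
import math

EARTH_RADIUS_KM = 6371.0  # spherical Earth radius (assumed value)

def to_cartesian(lon, lat, radius=EARTH_RADIUS_KM):
    """Convert longitude and latitude in degrees to (x, y, z) in km."""
    lam, th = math.radians(lon), math.radians(lat)
    return (radius * math.cos(th) * math.cos(lam),
            radius * math.cos(th) * math.sin(lam),
            radius * math.sin(th))

def chordal_distance(lon1, lat1, lon2, lat2, radius=EARTH_RADIUS_KM):
    """Straight-line distance through the Earth between two surface points."""
    return math.dist(to_cartesian(lon1, lat1, radius),
                     to_cartesian(lon2, lat2, radius))

# Chicago and Minneapolis, west longitudes entered as negative degrees.
d_chord = chordal_distance(-87.63, 41.88, -93.22, 44.89)
```

Because the three Cartesian coordinates can be stored once per site, this metric fits directly into any routine that accepts Euclidean coordinates in three dimensions.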
The first three rows of Table 1 compare the geodetic distance with the “naive Euclidean” and chordal metrics. The first column corresponds to the distance between the farthest points in a spatially referenced data set comprising 50 locations in Colorado (more of this in Section 4), while the next two present results for two differently spaced pairs of cities. The overestimation and underestimation of the “naive Euclidean” and “chordal” metrics, respectively, is clear, although the chordal metric excels even for distances over 2000 km (New York and New Orleans).
Table 1.
Methods | Colorado data (farthest) | Chicago-Minneapolis | New York-New Orleans |
---|---|---|---|
Geodetic | 741.7 km | 562.0 km | 1897.2 km |
Naive Euclidean | 933.8 km | 706.0 km | 2172.4 km |
Chord | 741.3 km | 561.8 km | 1890.2 km |
Mercator | 951.8 km | 773.7 km | 2336.5 km |
Sinusoidal | 742.7 km | 562.1 km | 1897.7 km |
Centroid-based | 738.7 km | 562.2 km | 1901.5 km |
This approximation of the chordal metric has an important theoretical implication for the spatial modeler. A troublesome aspect of geodetic distances is that they are not necessarily valid arguments for correlation functions defined on Euclidean spaces (e.g., the exponential, spherical, Matérn, etc.). However, the excellent approximation of the chordal metric (which is Euclidean) ensures that in most practical settings, as in our illustration in Section 4, valid correlation functions in $\mathbb{R}^3$ such as the Matérn and exponential yield positive definite correlation matrices with geodetic distances and enable proper convergence of the statistical estimation algorithms.
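Schoenberg (1942) develops a necessary and sufficient representation for valid positive-definite functions on spheres in terms of normalized Legendre polynomials $P_k$ of the form:
$$\rho(\psi) = \sum_{k=0}^{\infty} a_k P_k(\cos\psi),$$
where $\psi$ is the angle subtended at the center of the sphere by the two locations and the $a_k$’s are positive constants such that $\sum_{k=0}^{\infty} a_k$ converges. An example is given by
$$\rho(\psi) = \frac{1}{\sqrt{1 + \alpha^2 - 2\alpha\cos\psi}}, \qquad \alpha \in (0, 1),$$
which can be easily shown to have the Legendre polynomial expansion $\sum_{k=0}^{\infty} \alpha^k P_k(\cos\psi)$.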
The chordal metric also provides a simpler way to construct valid correlation functions over the sphere using a sinusoidal composition of any valid correlation function on Euclidean space. To see this, consider a unit sphere and note that, for unit vectors $\mathbf{u}_1$ and $\mathbf{u}_2$ separated by an angle $\psi$,
$$\|\mathbf{u}_1 - \mathbf{u}_2\| = \sqrt{2 - 2\langle \mathbf{u}_1, \mathbf{u}_2 \rangle} = 2\sin(\psi/2).$$
Therefore, a correlation function $\rho(d)$ (suppressing the range and smoothness parameters) on the Euclidean space $\mathbb{R}^3$ transforms to $\rho(2\sin(\psi/2))$ on the sphere, thereby inducing a valid correlation function on the sphere. This has several advantages over the Legendre polynomial approach of Schoenberg: (1) we retain the interpretation of the smoothness and decay parameters, (2) it is simpler to construct and compute, and (3) it builds upon a rich legacy of investigations (both theoretical and practical) of correlation functions on Euclidean spaces. We do not explore spherical correlation functions here, restricting ourselves to the Matérn, and its special case, the exponential, correlation functions that are popular in practice.
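The sinusoidal composition can be checked numerically. The sketch below verifies that the chord between two unit vectors separated by an angle equals twice the sine of half that angle, and then composes an exponential correlation with the chord. The exponential choice, the function names, and the test angle are illustrative assumptions.

```python
import math

def chord_on_unit_sphere(psi):
    """Chordal distance between two points on the unit sphere at angle psi."""
    return 2.0 * math.sin(psi / 2.0)

def spherical_exponential_corr(psi, phi):
    """Exponential correlation evaluated at the chordal distance."""
    return math.exp(-phi * chord_on_unit_sphere(psi))

# Two unit vectors separated by an angle psi (radians) in the xy-plane.
psi = 0.7
u1 = (1.0, 0.0, 0.0)
u2 = (math.cos(psi), math.sin(psi), 0.0)
chord_direct = math.dist(u1, u2)
```

Any correlation function that is valid in three-dimensional Euclidean space can be substituted for the exponential in `spherical_exponential_corr` without changing the construction.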
3.2. Map Projections
An alternative approach to using Euclidean metrics is that of a planar projection of the spatial domain. This is particularly popular among GIS users, where several map projections are available, and has the added advantage of working with two-dimensional coordinates, unlike the three-dimensional chordal metric. In fact, currently most existing spatial statistics software (e.g., WinBUGS [Thomas et al., 2002], geoR [Ribeiro and Diggle, 2003]) allow specification of only two-dimensional Euclidean coordinates.
We will restrict ourselves to the purely mathematical map projections that derive a relationship between geographical coordinates $(\lambda, \theta)$ and Cartesian coordinates $(x, y)$ through
$$x = f(\lambda, \theta), \qquad y = g(\lambda, \theta),$$
where $f$ and $g$ are functions that are determined by mapping infinitesimal quadrilaterals with desirable map properties. Ideally, we would seek to preserve all intersite distances but the existence of such a projection is precluded by Gauss’ Theorema Egregium in differential geometry (see, e.g., Guggenheimer, 1977, p. 240–242). Projections such as the gnomonic projection (Snyder, 1987, p. 164–168) give the correct distance from a single reference point, but are less useful for the practicing spatial analyst who needs to obtain complete intersite distance matrices, which would require, not one, but several such maps.
Areas and angles can, however, be preserved and most mathematical projections offered by GIS can be classified as either conformal (preserving angles) or equal-area (preserving areas). These are developed using geometric constructions or differential geometric analysis to provide underdetermined systems of partial differential equations with further map properties leading to the final equations. See Banerjee et al. (2004, p. 12–14) for some heuristic derivations. Distances are always distorted, with the extent of the distortion varying by projection type, but typically conformal projections distort distances much more than equal-area projections.
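We illustrate with two popular projections, one of each type: the Mercator (conformal) and the sinusoidal (equal-area). The Mercator projection is a classical conformal projection where loxodromes (curves that intersect the meridians at a constant angle) are straight lines on the map, a property particularly useful for navigation purposes, derived by requiring the local scale along the meridians to match that along the parallels, that is, $\partial g/\partial\theta = R\sec\theta$. After suitable integration, this leads to the analytical equations (with the 0° meridian as the central meridian, and $\lambda$ and $\theta$ in radians),
$$f(\lambda, \theta) = R\lambda, \qquad g(\lambda, \theta) = R\ln\tan\left(\frac{\pi}{4} + \frac{\theta}{2}\right). \tag{3}$$
The sinusoidal projection yields equally spaced rectilinear parallels (with the 0° meridian as the central meridian), by specifying $\partial g/\partial\theta = R$, with the equal-area requirement then determining $f$:
$$f(\lambda, \theta) = R\lambda\cos\theta, \qquad g(\lambda, \theta) = R\theta. \tag{4}$$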
These and several other projections are routinely available in the GIS software, and in interfaces such as the R package mapproj (McIlroy, 2004), but are simple enough to be computed without accessing such packages.
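Both projections reduce to a few lines of arithmetic, as the hedged sketch below suggests. The central-meridian argument `lon0` is an added convenience and an assumption of this sketch (the equations in the text are stated for the 0° meridian). Mercator distances do not depend on it, but sinusoidal distances do, so the example centers the map on the mean longitude of the two cities.

```python
import math

EARTH_RADIUS_KM = 6371.0  # spherical Earth radius (assumed value)

def mercator(lon, lat, lon0=0.0, radius=EARTH_RADIUS_KM):
    """Mercator (conformal) projection of a point given in degrees."""
    lam = math.radians(lon - lon0)
    th = math.radians(lat)
    return (radius * lam,
            radius * math.log(math.tan(math.pi / 4.0 + th / 2.0)))

def sinusoidal(lon, lat, lon0=0.0, radius=EARTH_RADIUS_KM):
    """Sinusoidal (equal-area) projection of a point given in degrees."""
    lam = math.radians(lon - lon0)
    th = math.radians(lat)
    return (radius * lam * math.cos(th), radius * th)

chicago, minneapolis = (-87.63, 41.88), (-93.22, 44.89)
lon0 = (chicago[0] + minneapolis[0]) / 2.0  # assumed central meridian
d_mercator = math.dist(mercator(*chicago, lon0), mercator(*minneapolis, lon0))
d_sinusoidal = math.dist(sinusoidal(*chicago, lon0),
                         sinusoidal(*minneapolis, lon0))
```

With this choice of central meridian, the two planar distances should land close to the Mercator and sinusoidal entries for Chicago-Minneapolis in Table 1. A central meridian far from the study region would distort the sinusoidal distances considerably.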
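Yet another class of projections are site-adaptive in that they use information on the specific configuration of the sites (the data). One such projection, which we call centroid-based, sets up rectangular axes along the centroid of the observed locations and scales the points according to these axes. Thus, with $n$ locations $(\lambda_i, \theta_i)$, $i = 1, \ldots, n$, we first compute the centroid $(\bar{\lambda}, \bar{\theta})$ (the mean longitude and latitude). Next, two geodetic distances are computed that scale the axes: $d_1$ is the geodetic distance (using (2)) between $(\bar{\lambda}, \theta_{\min})$ and $(\bar{\lambda}, \theta_{\max})$, where $\theta_{\min}$ and $\theta_{\max}$ are the minimum and maximum of the observed latitudes; analogously, $d_2$ is that between $(\lambda_{\min}, \bar{\theta})$ and $(\lambda_{\max}, \bar{\theta})$. This centroid-based projection then defines a two-dimensional, planar coordinate system as scaled displacements with respect to the axes along the centroid:
$$x = \frac{\lambda - \bar{\lambda}}{\lambda_{\max} - \lambda_{\min}}\,d_2, \qquad y = \frac{\theta - \bar{\theta}}{\theta_{\max} - \theta_{\min}}\,d_1. \tag{5}$$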
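The centroid-based construction can be sketched as follows: displacements from the centroid are rescaled so that the full latitude span maps to the north-south geodetic distance through the centroid, and the full longitude span maps to the east-west geodetic distance through the centroid. Function and variable names are hypothetical, the radius is assumed to be 6371 km, and the sites are assumed to span a nonzero range in both coordinates.

```python
import math

EARTH_RADIUS_KM = 6371.0  # spherical Earth radius (assumed value)

def geodetic_distance(lon1, lat1, lon2, lat2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in decimal degrees."""
    lam1, th1, lam2, th2 = map(math.radians, (lon1, lat1, lon2, lat2))
    c = (math.sin(th1) * math.sin(th2)
         + math.cos(th1) * math.cos(th2) * math.cos(lam1 - lam2))
    return radius * math.acos(max(-1.0, min(1.0, c)))

def centroid_projection(sites, radius=EARTH_RADIUS_KM):
    """Map (lon, lat) pairs in degrees to planar (x, y) coordinates in km."""
    lons = [lon for lon, _ in sites]
    lats = [lat for _, lat in sites]
    lon_bar = sum(lons) / len(lons)
    lat_bar = sum(lats) / len(lats)
    lon_min, lon_max = min(lons), max(lons)
    lat_min, lat_max = min(lats), max(lats)
    # Geodetic lengths of the two axes through the centroid.
    d_ns = geodetic_distance(lon_bar, lat_min, lon_bar, lat_max, radius)
    d_ew = geodetic_distance(lon_min, lat_bar, lon_max, lat_bar, radius)
    return [((lon - lon_bar) / (lon_max - lon_min) * d_ew,
             (lat - lat_bar) / (lat_max - lat_min) * d_ns)
            for lon, lat in sites]

# Chicago and Minneapolis treated as a two-site data set.
sites = [(-87.63, 41.88), (-93.22, 44.89)]
xy = centroid_projection(sites)
d_centroid = math.dist(xy[0], xy[1])
```

Because the centroid and the axis lengths depend on the whole set of sites, the projected coordinates must be recomputed whenever sites are added.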
Returning to the bottom half of Table 1, we compare the three projections in equations (3)–(5). We find that the sinusoidal and centroid-based projections seem to be distorting distances much less than the Mercator, which performs even worse than the naive Euclidean. Their impact upon statistical estimation and prediction will be discussed in Section 4.
Note that Table 1 is more pertinent from a geographical or geodetic viewpoint than for the spatial statistician, as its entries do not reflect how statistical estimation is affected, where points that are “closer” together have greater influence on analysis. Nevertheless, the distortion brought about by a poor metric alters the definition of “closeness” and can lead to erroneous statistical estimates (see Section 4).
The centroid-based projection has the potentially unattractive property of being data dependent in that its computation changes with new sites being added. Addition of new sites is particularly common in spatiotemporal settings such as environmental monitoring, and (5) needs to be recomputed every time. Because the sinusoidal does not suffer from this, has comparable accuracy, and is easy to compute, it might be preferred. Nevertheless, being site-adaptive the centroid-based projection is more flexible than the sinusoidal and may perform better for certain configurations. Also, it is inexpensive to compute and, unless the number of sites is huge, presents itself as a viable alternative.
We conclude this section with a brief discussion of the Universal Transverse Mercator (UTM) projection system. Rather than a purely mathematical projection, the UTM is more of a coordinate or grid system (also known as state plane coordinates) using a transverse aspect of the Mercator projection (see, e.g., Snyder, 1987). The projection equations are further transformed into “Easting–Northing” coordinates by overlaying a grid that divides the domain into zones each 6° wide, referencing each point from a zone-specific central meridian. While these UTM grids can be used to adjust for local scale to provide accurate measurements, they are in the same scale as the chordal or the sinusoidal. However, these accrue additional computational complexity (for the grid) and should always be imported from the GIS software or interfaces; yet many GIS interfaces do not provide them (e.g., mapproj). For these reasons, we do not explicitly use them in this article, although their use typically produces accurate results comparable to the geodetic metric.
4. Illustration
We illustrate spatial modeling under different geodetic computations with a weather data set obtained from the National Center for Atmospheric Research (NCAR), Boulder, Colorado, with the mean temperature measurements (in 10°C units) obtained at 50 sites in January 1997 as our dependent variable $Y(\mathbf{s})$. Also supplied is the elevation (in 100 m units) at each site, so the covariate vector $\mathbf{x}(\mathbf{s})$ comprises an intercept and elevation. A univariate spatial model as in (1) explains temperature given elevation, accounting for the spatial correlation in the data. Figure 2 shows an elevation map of the spatial domain with the solid circles indicating our sampling locations. The contours represent a particularly interesting topography where temperature is expected to show rich spatial variation. A detailed spatiotemporal analysis of this data set using coregionalized models is performed in Gelfand, Banerjee, and Gamerman (2005).
We apply each of the six metrics in Table 1 to the exponential and Matérn functions. We also analyzed the data using UTM projections, for which our results were almost identical to those from the geodetic metric and, hence, are not presented. We performed classical likelihood-based as well as Bayesian analysis for the exponential models, but only a Bayesian analysis for the Matérn (see Section 2). Because the classical and Bayesian methods provided extremely consistent answers for the exponential, we present results only from the latter.
We adopted a flat prior for the intercept coefficient, and relatively vague (Inverted-Gamma) priors for $\sigma^2$ and $\tau^2$. We also chose a Gamma prior for the correlation decay parameter $\phi$, specified so that the prior spatial range has a mean of about half of the maximum intersite distance in our data, obtained from the first column of Table 1 for the respective metric. Practical analysis calculates the spatial range by solving $\rho(d; \phi, \nu) = 0.05$ for $d$. In addition, for the Matérn correlation function we use a $U(0, 2)$ prior for the smoothness parameter $\nu$ for our data.
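For the exponential correlation function the effective range has a closed form, since setting $\exp(-\phi d)$ equal to the cutoff and solving for $d$ gives $d = -\ln(\text{cutoff})/\phi$. The sketch below assumes the conventional cutoff of 0.05; the function name is hypothetical.

```python
import math

def exponential_effective_range(phi, cutoff=0.05):
    """Distance at which exp(-phi * d) drops to the cutoff correlation."""
    return -math.log(cutoff) / phi

# Decay parameter of about 1.09e-2 per km (geodetic metric, Table 2).
r = exponential_effective_range(1.09e-2)
```

With a cutoff of 0.05 this is roughly $3/\phi$, so the value of about 275 km obtained here is consistent with the geodetic range reported in Table 2. For the Matérn family no closed form is available and the equation has to be solved numerically.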
Three parallel MCMC chains were run for 10,000 iterations. The CODA package in R was used to diagnose convergence by monitoring mixing, Gelman–Rubin diagnostics, autocorrelations, and cross-correlations. In each case, 5000 iterations were enough for sufficient mixing of the chains, so the remaining 15,000 samples (5000 × 3) were retained for posterior analysis. We used C/C++ code to fit these models with posterior summarization in R. We remark that implementations with the naive Euclidean and projection methods with an exponential correlation function could be performed in WinBUGS and geoR, because they need a two-dimensional coordinate input. The Matérn is accessible only in the latter, but with a fixed smoothness parameter. Classical analysis for the same could also be performed in geoR (see, e.g., Banerjee et al., 2004, p. 64–65).
Tables 2 and 3 show the parameter estimates (medians) with 95% credible intervals for the exponential and Matérn correlation functions, respectively, under different choices of the distance metric. We see that the regression estimates are virtually unaffected by the metric; in each case there is a significantly positive intercept and, quite expectedly, a significantly negative effect of elevation on temperature. The spatial variance $\sigma^2$ seems to explain a substantial portion of the residual variation, dominating the nugget effect $\tau^2$. For example, in the geodetic setting with exponential correlation functions the spatial variance explains about 90% of the residual variation, measured as $\sigma^2/(\sigma^2 + \tau^2)$, while with the Matérn function this is about 96%. This seems to be quite stable across the different metrics. For the Matérn correlation function, the smoothness parameter $\nu$ also seems to be robustly estimated across metrics, with 0.5 included in each of the intervals, but the median seems to shift to slightly higher values for the Mercator and naive Euclidean, indicating slight oversmoothing compared to the other four.
Table 2.
Parameter | Geodetic | Naive Euclidean | Chordal |
---|---|---|---|
Intercept | 1.031 (0.501, 1.518) | 1.020 (0.321, 1.607) | 1.141 (0.750, 1.488) |
Elevation | −0.417 (−0.530, −0.300) | −0.428 (−0.528, −0.331) | −0.419 (−0.512, −0.315) |
$\sigma^2$ | 0.098 (0.031, 0.231) | 0.110 (0.033, 0.341) | 0.128 (0.035, 0.483) |
$\phi$ | 1.09 × 10⁻² (0.69 × 10⁻², 7.63 × 10⁻²) | 0.71 × 10⁻² (0.27 × 10⁻², 5.42 × 10⁻²) | 1.12 × 10⁻² (0.74 × 10⁻², 8.50 × 10⁻²) |
Range (km) | 275.2 (39.3, 434.8) | 422.5 (55.4, 1109.2) | 267.8 (35.3, 405.4) |
$\tau^2$ | 0.011 (0.003, 0.017) | 0.011 (0.004, 0.024) | 0.008 (0.003, 0.024) |

Parameter | Mercator | Sinusoidal | Centroid-based |
---|---|---|---|
Intercept | 1.015 (0.307, 1.551) | 1.035 (0.399, 1.569) | 1.103 (0.601, 1.643) |
Elevation | −0.430 (−0.532, −0.327) | −0.432 (−0.533, −0.329) | −0.426 (−0.521, −0.321) |
$\sigma^2$ | 0.109 (0.031, 0.351) | 0.105 (0.030, 0.348) | 0.093 (0.033, 0.622) |
$\phi$ | 0.66 × 10⁻² (0.19 × 10⁻², 5.24 × 10⁻²) | 1.12 × 10⁻² (0.71 × 10⁻², 8.49 × 10⁻²) | 1.15 × 10⁻² (0.71 × 10⁻², 7.80 × 10⁻²) |
Range (km) | 454.5 (57.25, 1578.9) | 267.9 (35.3, 422.5) | 260.9 (38.5, 422.5) |
$\tau^2$ | 0.011 (0.003, 0.025) | 0.011 (0.004, 0.023) | 0.010 (0.003, 0.018) |
Table 3.
Parameter | Geodetic | Naive Euclidean | Chordal |
---|---|---|---|
Intercept | 1.087 (0.789, 1.410) | 1.031 (0.664, 1.447) | 1.015 (0.707, 1.308) |
Elevation | −0.430 (−0.533, −0.336) | −0.421 (−0.523, −0.322) | −0.422 (−0.515, −0.329) |
$\sigma^2$ | 0.171 (0.043, 1.539) | 0.097 (0.033, 0.505) | 0.093 (0.034, 0.435) |
$\phi$ | 7.47 × 10⁻³ (4.79 × 10⁻³, 51.18 × 10⁻³) | 4.26 × 10⁻³ (2.27 × 10⁻³, 41.42 × 10⁻³) | 7.63 × 10⁻³ (4.85 × 10⁻³, 54.51 × 10⁻³) |
$\nu$ | 0.770 (0.213, 1.413) | 0.819 (0.227, 1.426) | 0.742 (0.199, 1.402) |
Range (km) | 273.7 (39.1, 426.7) | 477.3 (48.2, 895.6) | 268.7 (36.8, 422.8) |
$\tau^2$ | 0.008 (0.003, 0.019) | 0.007 (0.003, 0.018) | 0.008 (0.003, 0.017) |

Parameter | Mercator | Sinusoidal | Centroid-based |
---|---|---|---|
Intercept | 1.015 (0.332, 1.597) | 1.014 (0.377, 1.603) | 1.088 (0.735, 1.511) |
Elevation | −0.426 (−0.527, −0.333) | −0.431 (−0.530, −0.330) | −0.427 (−0.527, −0.324) |
$\sigma^2$ | 0.111 (0.033, 0.363) | 0.106 (0.031, 0.321) | 0.102 (0.035, 0.684) |
$\phi$ | 4.01 × 10⁻³ (1.98 × 10⁻³, 40.01 × 10⁻³) | 7.61 × 10⁻³ (4.89 × 10⁻³, 55.08 × 10⁻³) | 7.57 × 10⁻³ (4.74 × 10⁻³, 53.22 × 10⁻³) |
$\nu$ | 0.843 (0.331, 1.533) | 0.767 (0.225, 1.441) | 0.791 (0.212, 1.421) |
Range (km) | 506.9 (51.7, 1025.5) | 270.3 (38.9, 418.0) | 269.5 (37.4, 430.2) |
$\tau^2$ | 0.011 (0.004, 0.024) | 0.009 (0.004, 0.024) | 0.007 (0.003, 0.017) |
The estimate that seems to be most sensitive to the choice of the metric is $\phi$, the correlation decay parameter, and hence the implied spatial range. While on the one hand this is to be expected, $\phi$ being in some sense “closest” to the distance metric, this effect is interesting because larger distances (where these metrics really differ) are downweighted by the correlation functions. In fact, we see that the spatial ranges estimated using the chordal, sinusoidal, and centroid-based metrics are similar to the range estimated from the geodetic metric. In comparison, the naive Euclidean metric and Mercator’s projection overestimate the spatial range by a factor exceeding 1.5 relative to the geodetic range for the exponential correlation function; this is even more drastic for the Matérn. This discrepancy appears consistent with the purely geographical effects in Table 1, resulting from a spurious expansion of the spatial domain. In the same vein, a benign difference (lower range than the geodetic) is seen with the chordal approximation, not surprising given the minor bias inherent in its definition. The estimates from the sinusoidal and centroid-based projections are also quite close to the geodetic, corroborating their claim as viable alternatives.
Turning next to predictive performance of these models, we assess the models for 10 holdout locations (with known elevation and temperature) and predict using our fitted model. Figure 3 plots the predicted values against the observed values with the dots representing the predictive mean and the bars representing 95% prediction intervals. Because there is no inherent ordering of the sites, we sort them by the (true) observed values arranged along the $y = x$ line to better display discrepancies. While the geodetic, chordal, sinusoidal, and centroid-based projections all seem to predict well for all these sites, three sites (the second, seventh, and eighth in ascending order in each of the panels in Figure 3) seem to be quite sensitive to the choice of the metric with their prediction intervals not including the observed value (along the $y = x$ line). These three sites are indicated by solid triangles in Figure 2 and are relatively isolated (in the eastern part) from the sampling sites, while the remaining seven holdout locations indicated by solid squares in Figure 2 are located amidst the sampling locations and show more robust predictive performance. The results for the exponential correlation model are almost identical and not shown.
Finally, we computed the Deviance Information Criterion (DIC) (Spiegelhalter et al., 2002) for our models to investigate how the model choice criterion captures metric discrepancies. Briefly, we summarize the fit of the model using the posterior expectation of the deviance, $\bar{D} = E_{\theta \mid y}[D(\theta)]$, where $D(\theta) = -2 \log f(y \mid \theta)$ is the deviance statistic. The model is penalized by the effective number of parameters, estimated as $p_D = \bar{D} - D(\bar{\theta})$, where $\bar{\theta}$ is the posterior mean of the parameters. The DIC is then computed as the sum of $\bar{D}$ and $p_D$.
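As a minimal sketch of this computation, the snippet below evaluates the posterior mean deviance, the effective number of parameters, and the DIC from posterior draws. The toy model (normal data with known unit variance and a flat prior on the mean) and all names are assumptions chosen so that the example is self-contained; it is not the spatial model fitted in this article.

```python
import numpy as np


def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (posterior mean deviance, effective number of parameters, DIC)."""
    d_bar = float(np.mean(deviance_draws))
    p_d = d_bar - float(deviance_at_posterior_mean)
    return d_bar, p_d, d_bar + p_d


# Toy data: y_i ~ N(mu, 1) with unknown mean mu (hypothetical example).
rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=50)

# Under a flat prior the posterior of mu is N(mean(y), 1/n); draw from it directly.
mu_draws = rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(y.size), size=20000)


def deviance(mu):
    """Deviance -2 log f(y | mu) for the unit-variance normal likelihood."""
    return y.size * np.log(2.0 * np.pi) + np.sum((y - mu) ** 2)


deviance_draws = np.array([deviance(m) for m in mu_draws])
d_bar, p_d, dic_value = dic(deviance_draws, deviance(mu_draws.mean()))
print(f"D_bar = {d_bar:.2f}, p_D = {p_d:.2f}, DIC = {dic_value:.2f}")
```

With a single free parameter, the effective number of parameters in this toy model should come out close to 1, which is a convenient sanity check before applying the same function to draws from a more complex model.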
Table 4 shows these computations for the exponential and Matérn models under the six different metrics. Generally, the Matérn seems to perform better for this data set under each of the metrics, even under the naive Euclidean metric, where its overestimation of the spatial range (relative to the geodetic) is more drastic than that of the exponential. Apparently, in spite of this overestimation, the Matérn’s flexibility in capturing process smoothness (see, e.g., Stein, 1999) leads to a better fit than the exponential. The criterion also captures the difference between the metrics: the Mercator and naive Euclidean metrics have higher DIC scores (lower is better), while the other four perform much better.
Table 4.
Methods | Exponential | | | Matérn | | |
---|---|---|---|---|---|---|
| | | | DIC | | | DIC |
Geodetic | 7.25 | 11.83 | 18.08 | 7.23 | 9.80 | 17.03 |
Naive Euclidean | 9.52 | 13.15 | 22.67 | 9.11 | 12.28 | 21.39 |
Chordal | 7.18 | 11.92 | 19.10 | 7.29 | 9.86 | 17.15 |
Mercator | 10.21 | 14.65 | 24.86 | 10.14 | 13.71 | 23.85 |
Sinusoidal | 7.23 | 11.86 | 19.09 | 7.31 | 10.01 | 17.32 |
Centroid-based | 7.57 | 11.98 | 19.55 | 7.41 | 10.28 | 17.69 |
5. Discussion
This article explored different options for approximating geodetic distances, providing theoretical clarifications, investigating their impact upon statistical modeling, and focusing upon ease of implementation. While geometrically natural metrics are not necessarily the most appropriate, it is often difficult to ascertain which metric is. Keeping this in mind, the article has tried to convey that (1) the practicing spatial analyst is often confronted with a choice of metrics that is not easily resolved with scientific information, and (2) the choice of metric may influence geostatistical analysis, including the estimation of certain parameters and prediction. We have demonstrated that an uninformed choice of metric can affect estimation of the spatial range, leading to differences in predictive performance. Viable solutions have been proposed and some have been shown to work well. The effects of spurious nonstationarity and anisotropy remain to be investigated. Future work can also focus upon point-process settings, where intersite distances arise as random processes, and examine the sensitivity of spatial tests of randomness to the choice of metric.
Acknowledgements
The author thanks Montserrat Fuentes and Doug Nychka for useful discussions. The author also thanks the editors and referees for several insightful comments that helped improve the article.
References
- Abramowitz M and Stegun IA (1965). Handbook of Mathematical Functions. New York: Dover.
- Banerjee S, Gelfand AE, and Sirmans CF (2003). Directional rates of change under spatial process models. Journal of the American Statistical Association 98, 946–954.
- Banerjee S, Carlin BP, and Gelfand AE (2004). Hierarchical Modeling and Analysis for Spatial Data. Boca Raton, Florida: Chapman and Hall/CRC Press.
- Cressie NAC (1993). Statistics for Spatial Data, 2nd edition. New York: Wiley.
- Ecker MD and Gelfand AE (1997). Bayesian variogram modeling for an isotropic spatial process. Journal of Agricultural, Biological, and Environmental Statistics 2, 347–369.
- Gelfand AE, Banerjee S, and Gamerman D (2005). Spatial process modelling for univariate and multivariate dynamic spatial data. Environmetrics, in press.
- Guggenheimer HW (1977). Differential Geometry. New York: Dover.
- Jones CB (1997). Geographical Information Systems and Computer Cartography. Harlow, Essex, U.K.: Addison Wesley Longman.
- Kaluzny SP, Vega SC, Cardoso TP, and Shelly AA (1998). S+ Spatial Stats. New York: Springer.
- McIllroy D (2004). The Mapproj Package. Available at http://cran.r-project.org.
- Ribeiro PJ and Diggle PJ (2003). The GeoR Package. Available at http://www.est.ufpr.br/geoR.
- Schoenberg IJ (1942). Positive definite functions on spheres. Duke Mathematical Journal 9, 96–108.
- Snyder JP (1987). Map projections: A working manual. United States Geological Survey Professional Paper 1395.
- Spiegelhalter DJ, Best NG, Carlin BP, and van der Linde A (2002). Bayesian measures of model complexity and fit (with discussion and rejoinder). Journal of the Royal Statistical Society, Series B 64, 583–639.
- Stein ML (1999). Interpolation of Spatial Data: Some Theory of Kriging. New York: Springer.
- Thomas A, Best N, Arnold R, and Spiegelhalter D (2002). The GeoBUGS User Manual. Available at http://www.mrc-bsu.cam.ac.uk/bugs.