Author manuscript; available in PMC: 2019 Apr 1.
Published in final edited form as: Ecography. 2017 May 30;41(4):661–672. doi: 10.1111/ecog.02575

Two-scale dispersal estimation for biological invasions via synthetic likelihood

Corentin M Barbu 1,2, Karthik Sethuraman 1, Erica M W Billig 1, Michael Z Levy 1
PMCID: PMC6086346  NIHMSID: NIHMS983086  PMID: 30104817

Abstract

Biological invasions reshape environments and affect the ecological and economic welfare of states and communities. Such invasions advance on multiple spatial scales, complicating their control. When modeling stochastic dispersal processes, intractable likelihoods and autocorrelated data complicate parameter estimation. As with other approaches, the recent synthetic likelihood framework for stochastic models uses summary statistics to reduce this complexity; however, it additionally provides usable likelihoods, facilitating the use of existing likelihood-based machinery. Here, we extend this framework to parameterize multi-scale spatio-temporal dispersal models and compare existing and newly developed spatial summary statistics to characterize dispersal patterns. We provide general methods to evaluate potential summary statistics and present a fitting procedure that accurately estimates dispersal parameters on simulated data. Finally, we apply our methods to quantify the short and long range dispersal of Chagas disease vectors in urban Arequipa, Peru, and assess the feasibility of a purely reactive strategy to contain the invasion.

1. Introduction

The environmental and economic costs of biological invasions are increasing (Pimentel et al. 2005). Invasions by the red imported fire ant (Solenopsis invicta) (Ascunce et al. 2011), the brown marmorated stink bug (Halyomorpha halys) (Zhu et al. 2012), the bed bug (Cimex lectularius) (Wu et al. 2014) and others can cause large economic loss, damage native ecosystems, foment the spread of pathogens, and disrupt agriculture (Jeschke & Strayer 2005, Pimentel et al. 2005). Managing such invasions is problematic in part due to difficulty in inferring the dispersal and migration patterns of invading organisms. Numerous models of invading species (Giometto et al. 2014, Hastings et al. 2005) and epidemics (Keeling et al. 2001, Smith et al. 2002) have been developed and characterized. The reliability of these models depends on an adequate representation of the underlying biological process and accurate estimation of the models’ parameters.

The generalized method of moments, originating from Pearson’s use of expected and observed moments to fit models (Pearson 1894), provides a conceptual framework for parameter estimation (Hansen 1982) based on minimizing a distance metric between observations and model expectations. Within this framework, the definition of an analytical likelihood is classically used to parameterize a variety of models.

However, likelihood functions can be complex or even intractable in the case of intricately autocorrelated data, a concern for time series and spatial data (Hartig et al. 2011, Soubeyrand et al. 2009). For example, sequential surveys generate spatial snapshots of the invading species which are inherently correlated in space and time. Invasions of many organisms, and insects especially, commonly involve passive and active migration inducing both local movements, related to the dispersal abilities of the organism, and long-distance jumps, often human mediated (Hengeveld 1989, Shigesada & Kawasaki 1997, Shigesada et al. 1995, Suarez et al. 2001). Long range dispersals can introduce new fronts of invasion, making the course of insect invasions unpredictable and the likelihood function intractable (Supplementary Materials 1).

Despite the inherent stochasticity of the dispersal and observations, these processes generate reproducible spatial patterns that can be captured using summary statistics (Giometto et al. 2014, Lewis & Pacala 2000). Based on the use of such summary statistics, approximate Bayesian computation (ABC) provides likelihood-free inference for a wide range of phylogenetic, genealogical, and ecological applications (Aeschbacher et al. 2012, Beaumont 2010, Csilléry et al. 2010). ABC methods can handle complex datasets and models but have implementation barriers: summary statistics, distance metrics, and thresholds must all be selected, tuned, and cross-validated. Additionally, they suffer from the “curse of dimensionality”: as the number of summary statistics increases, the number of simulations necessary to parameterize and subsequently cross-validate models explodes (Csilléry et al. 2010). Another approach involves analytical derivations of approximate likelihoods, simplifying the spatial (Varin et al. 2011) or temporal (Ionides et al. 2006) relationships between observations or even defining analytical approximate likelihood on summary statistics (Soubeyrand et al. 2009). We provide an alternative to this last approach using the synthetic likelihood (Fasiolo et al. 2016, Wood 2010), a numerically defined likelihood for summary statistics. The synthetic likelihood avoids complex analytical derivations as well as problems of dimensionality and manual tuning, and can be readily explored using established likelihood based inference methods such as Markov chain Monte Carlo (MCMC).

Here we show how the synthetic likelihood framework can be applied to parameterize spatio-temporal dispersal processes, in particular invasions on multiple scales. Based on prior models of stratified spatial invasions (Shigesada & Kawasaki 1997, Shigesada et al. 1995), we divide dispersal into shorter range movements and longer range migrations. To characterize patterns emerging from such stratified dispersal, we compare several well-known spatial statistics including the Moran’s I (Moran 1950), Geary’s C (Geary 1954), and spatial semivariance (Cliff & Ord 1981), and introduce novel statistics based on distances between invaded locations, landscape partitions of varying granularity, and annular projections around dispersal foci. We show how different types of statistics can be complementary in summarizing biological invasions and how appropriate statistics might be chosen for parameter estimation based on spatial observations at two time points. Finally, we apply our procedure to the dispersal of the Chagas disease vector Triatoma infestans in the city of Arequipa, Peru and discuss how our estimates compare with and extend previous work on this major public health concern in South America (Tarleton et al. 2007).

2. Methods

2.1. Data and model

We study invasive spread on a two dimensional landscape composed of distinct, georeferenced, spatial units, i.e. households. For example, Figure 1 displays a particular landscape in Arequipa, Peru where our survey data were collected. Our experimental data are the state of each unit, occupied (here infested by T. infestans) or empty, at two time points (here in 2009 and 2011, Figure 1). We model the dispersal from the initial set of occupied units and assume no reversion from occupied to empty. Dispersal occurs at two scales: between neighboring units (“hops”) and across long distances (“jumps”). For simplicity, we assume dispersal to all units is equally likely within the hop limits and within the jump limits. We do not attempt to estimate the hop and jump limits but include them as fixed values of our model. We select these limits through model comparison and relevant prior information from the literature (Supplementary Materials 4). Having fixed the hop and jump limits, the dispersal process is uniquely described by the frequency of dispersal events from a single occupied unit (α, units: dispersal events/(occupied units × weeks)) and the probability that a dispersal event is a jump rather than a hop (φ). We simulate the stochastic model in continuous time using the Gillespie algorithm (Gillespie 1975); details are provided in Supplementary Materials 2.
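The two-scale process described above can be sketched as a small continuous-time simulation. This is a minimal illustration, not the authors' implementation (which is described in Supplementary Materials 2): the function name `simulate_dispersal`, its arguments, the default limits, and the treatment of events that land on an already occupied unit are all assumptions made for the example.

```python
import math
import random


def simulate_dispersal(coords, occupied, alpha, phi, weeks,
                       hop_max=30.0, jump_max=120.0, rng=None):
    """Gillespie-style simulation of two-scale dispersal (illustrative sketch).

    coords   : list of (x, y) unit positions in metres
    occupied : indices of the initially occupied units
    alpha    : dispersal events per occupied unit per week
    phi      : probability that an event is a jump rather than a hop
    weeks    : length of the simulated period
    """
    rng = rng or random.Random()
    occ = set(occupied)
    n = len(coords)
    if alpha <= 0:
        return occ
    # Precompute the units reachable from each unit by a hop or by a jump.
    hops = [[] for _ in range(n)]
    jumps = [[] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(coords[i], coords[j])
            if d <= hop_max:
                hops[i].append(j)
            elif d <= jump_max:
                jumps[i].append(j)
    t = 0.0
    while occ and len(occ) < n:
        # Waiting time to the next event: total rate is alpha per occupied unit.
        t += rng.expovariate(alpha * len(occ))
        if t > weeks:
            break
        source = rng.choice(sorted(occ))
        targets = jumps[source] if rng.random() < phi else hops[source]
        if targets:
            # Uniform choice within the limits (assumption: an event landing
            # on an already occupied unit leaves the state unchanged).
            occ.add(rng.choice(targets))
    return occ
```

For example, `simulate_dispersal(coords, [12], 0.05, 0.3, 104, rng=random.Random(1))` would simulate two years of spread from unit 12 on a user-supplied list of coordinates; passing a seeded generator keeps runs reproducible.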

Figure 1:

T. infestans infestation in urban Arequipa, Peru. The map on the left gives the infestation status of households when examined during the first or starting cross-sectional survey in January 2009; the map on the right gives the status when examined during the second or ending cross-sectional survey in March 2011. Black diamonds are households that were free of infestation; yellow triangles are households that were occupied. Data in the map have been altered randomly to prevent identification of households.

2.2. Synthetic Likelihood based parameter estimation

To estimate the dispersal frequency (α) and the proportion of long distance events (φ) that produced the data collected at two time points, we use the Markov chain Monte Carlo procedure detailed below. At each step, the likelihood of the data given the model is determined numerically using the synthetic likelihood.

2.2.1. Synthetic Likelihood calculation

At each iteration of the MCMC, for specific values α0 and φ0 that are in the support of prior distributions, we compute the synthetic likelihood in five steps:

  1. Simulate dispersal from the occupied units from the start to the end time point r = 100 times, generating r simulated datasets.

  2. Apply summary statistics to each simulated dataset, generating simulated summary statistics.

  3. Approximate the joint distribution of the simulated summary statistics with the multivariate normal distribution (Wood 2010).

  4. Calculate the density of the summary statistics on observed data in the approximate joint distribution fit at the previous step. This density is the synthetic likelihood.

  5. Combine the synthetic likelihood with prior probabilities placed on parameters to obtain the posterior probability of each parameter.
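Steps 3 and 4 above amount to fitting a multivariate normal to the simulated summary statistics and evaluating its log density at the observed statistics. The sketch below illustrates this in plain Python; the function names are hypothetical, the covariance is the usual unbiased sample estimate, and the code assumes the estimated covariance matrix is positive definite (the Cholesky step fails with an error otherwise).

```python
import math


def mvn_fit(samples):
    """Mean vector and sample covariance matrix of the rows of `samples`."""
    r, k = len(samples), len(samples[0])
    mean = [sum(row[j] for row in samples) / r for j in range(k)]
    cov = [[sum((row[a] - mean[a]) * (row[b] - mean[b]) for row in samples) / (r - 1)
            for b in range(k)] for a in range(k)]
    return mean, cov


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    k = len(a)
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][m] * low[j][m] for m in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def synthetic_loglik(observed, simulated):
    """Log synthetic likelihood of `observed` given rows of simulated statistics."""
    mean, cov = mvn_fit(simulated)
    low = cholesky(cov)
    k = len(mean)
    # Solve low * z = (observed - mean) by forward substitution.
    z = []
    for i in range(k):
        resid = observed[i] - mean[i] - sum(low[i][m] * z[m] for m in range(i))
        z.append(resid / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(k))
    return -0.5 * (k * math.log(2.0 * math.pi) + log_det + sum(v * v for v in z))
```

Working on the log scale avoids numerical underflow when many statistics are combined; adding the log prior to this value gives the unnormalized log posterior used in step 5.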

When the proposed value of α, the frequency of dispersal, is very small, the simulations through which we fit parameters to the data never take off: there are no simulated positive houses with which to fit a model. As all sufficiently small values of α generate equally empty simulated maps, the MCMC chain can wander through equally unlikely values of α, preventing full consideration of the parameter space. A symmetric issue arises with very large values of α: the simulated dispersals fill up the whole area, and all houses are infested. This saturation similarly allows the MCMC chain to wander among equally unlikely large values of α. The situation is easily remedied by setting a log-normal prior on α (mean 0.02, standard deviation 1 on the log scale). The prior penalizes exploration of these problematic regions, which are poor estimates that do not capture the true pattern (in reality, some dispersal occurs, but too little to fill the map completely). For the proportion of jumps, φ, we simply use a uniform prior on its domain [0,1].

2.2.2. Markov chain Monte Carlo procedure

We use a Metropolis-Hastings Markov chain Monte Carlo (MCMC) to sample over the synthetic likelihood space, and compute posteriors.

Proposals are generated independently for the two parameters. α (the dispersal frequency) proposals are generated using a log-normal distribution and φ (the proportion of jumps) values are proposed on a bounded normal distribution between 0 and 1. As recommended by (Gelman et al. 1996), we use an adaptive sampling procedure to select proposal standard deviations for sampled parameters. Each proposal standard deviation is increased or decreased automatically until the proportion of accepted proposals for all parameters falls consistently between 15% and 40%. In our experience, this step lasts between 100 and 1000 iterations which are later discarded as a burn-in.
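As an illustration of the adaptive step, the sketch below runs a Metropolis-Hastings chain for a single positive parameter such as α, using a log-normal proposal whose standard deviation is rescaled until the acceptance rate falls in the 15% to 40% window. It is a simplified sketch and not the `yamh` implementation: the function name, the batch size, the scaling factors, and the rule of freezing the proposal after a single in-range batch (rather than requiring it "consistently") are assumptions.

```python
import math
import random


def mh_lognormal(log_post, x0, n_iter, sd=0.5, adapt_every=50,
                 target=(0.15, 0.40), rng=None):
    """Metropolis-Hastings for one positive parameter with a log-normal proposal.

    The proposal sd is rescaled every `adapt_every` iterations until the batch
    acceptance rate lies inside `target`, then frozen. Returns
    (samples, final_sd, n_adapt); the first n_adapt samples are burn-in.
    """
    rng = rng or random.Random()
    x, lp = x0, log_post(x0)
    samples, accepted, adapting, n_adapt = [], 0, True, 0
    for it in range(1, n_iter + 1):
        prop = x * math.exp(rng.gauss(0.0, sd))
        lp_prop = log_post(prop)
        # Hastings correction for the asymmetric log-normal proposal.
        log_ratio = lp_prop - lp + math.log(prop) - math.log(x)
        if math.log(1.0 - rng.random()) < log_ratio:
            x, lp = prop, lp_prop
            accepted += 1
        samples.append(x)
        if adapting and it % adapt_every == 0:
            rate = accepted / adapt_every
            if rate < target[0]:
                sd *= 0.7   # too few acceptances: take smaller steps
            elif rate > target[1]:
                sd *= 1.3   # too many acceptances: take larger steps
            else:
                adapting, n_adapt = False, it
            accepted = 0
    if adapting:
        n_adapt = n_iter    # adaptation never settled; nothing to keep
    return samples, sd, n_adapt
```

In the setting of this paper, `log_post` would return the log synthetic likelihood plus the log prior, which is itself a noisy quantity because it depends on fresh simulations at every call.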

After both proposal standard deviations are selected, we iterate through both parameters with routine checks of chain convergence and terminate the chain upon passing of convergence checks and sufficient sampling to estimate 95% credible intervals, as implemented in the R package yamh (Barbu 2016, Geweke 1992, Raftery & Lewis 1996). Convergence was further confirmed by applying Gelman and Rubin’s test of chain convergence (Gelman 2004) over three chains run from distinct starting values. This test, using a diagnostic threshold of 1.05, always confirmed the convergence detected by the procedure described above for individual chains. In our experience, convergence occurred within a few hundred iterations and 20000 iterations were enough to estimate 95% credible intervals thanks to weak autocorrelation of the chains.
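The multi-chain check can be illustrated with the basic form of the Gelman and Rubin potential scale reduction factor for one parameter. This is a minimal sketch of the textbook statistic, assuming chains of equal length with burn-in already removed; it is not taken from the authors' code, and refinements such as chain splitting are omitted.

```python
import math


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains of one parameter."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance B and mean within-chain variance W.
    between = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    within = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
                 for c, mu in zip(chains, means)) / m
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)
```

Values below the chosen threshold (1.05 in the text) indicate that the between-chain spread is small relative to the within-chain spread.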

2.3. Summary statistics

2.3.1. Pairwise summary statistics

Developments in spatial statistics over the last century provide a variety of measures to characterize spatial patterns.

We use here three common spatial statistics: Moran’s I, Geary’s C and the spatial semivariance. Moran’s I (Moran 1950) and Geary’s C (Geary 1954) characterize global and local autocorrelation respectively (Cliff & Ord 1981). The spatial semivariance (Cliff & Ord 1981) characterizes the aggregation of units (Table 1). These statistics are pairwise, meaning that their calculation assesses the similarity between pairs of units depending on the distance between them. To summarize the spatial structure of the invasion we compute each of these statistics for non-overlapping distance intervals between units, e.g. 0–10m, 10–20m, 20–30m, etc. In Table 1, d denotes the non-overlapping distance interval. The indicator variable wd(i, j) is 1 when the Euclidean distance between units i and j falls within that interval, and 0 otherwise.

Table 1:

Definition of pairwise summary statistics used in synthetic likelihood parameter estimation.

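Statistic | Formula
Moran’s I | $I_d = \frac{N}{\sum_{i}\sum_{j=i+1} w_d(i,j)} \cdot \frac{\sum_{i}\sum_{j=i+1} w_d(i,j)\,(X_i-\bar{X})(X_j-\bar{X})}{\sum_{i}(X_i-\bar{X})^2}$
Geary’s C | $C_d = \frac{(N-1)\sum_{i}\sum_{j=i+1} w_d(i,j)\,(X_i-X_j)^2}{2\,\sum_{i}\sum_{j=i+1} w_d(i,j)\;\sum_{i}(X_i-\bar{X})^2}$
Semivariance | $S_d = \frac{1}{2\,\sum_{i}\sum_{j=i+1} w_d(i,j)} \sum_{i}\sum_{j=i+1} w_d(i,j)\,(X_i-X_j)^2$
FPP | $FPP_d = \frac{1}{\sum_{i}\sum_{j=i+1} w_d(i,j)} \sum_{i}\sum_{j=i+1} w_d(i,j)\,(X_i \times X_j)$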

We propose a fourth pairwise statistic focusing on the frequency of co-occurrent positive events at different distances, the frequency of positive pairs (FPP). For each distance interval we calculate the frequency of pairs of units that are both occupied (Table 1). The FPP resembles Ripley’s variance stabilized L estimators which also measure the co-occurrence of events in space (Ripley 1977). However, the FPP limits the possibly occupied space to discrete units in space, naturally accounting for spatial distribution of these units and boundary issues. As with previous statistics, we compute the FPP in non-overlapping distance intervals.

In Table 1, N is the total number of observations, Xi the binary value of observation i (1 for occupied, 0 for not occupied), X¯ the mean value of all observations, and wd(i, j) the indicator of the Euclidean distance between observations i and j being within the distance interval d (1 if yes, 0 if no).
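The four pairwise statistics of Table 1 can be computed in a single pass over the pairs of units. The sketch below is an illustration only: the function name and the half-open interval convention (lo ≤ distance < hi) are assumptions, since the text does not state how interval boundaries are treated.

```python
import math


def pairwise_stats(coords, x, lo, hi):
    """Moran's I, Geary's C, semivariance and FPP for pairs with lo <= distance < hi.

    coords : list of (x, y) unit positions
    x      : list of 0/1 occupancy values, one per unit
    """
    n = len(x)
    xbar = sum(x) / n
    ss = sum((v - xbar) ** 2 for v in x)
    w = cross = sqdiff = both = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if lo <= math.dist(coords[i], coords[j]) < hi:
                w += 1                                   # w_d(i, j) = 1
                cross += (x[i] - xbar) * (x[j] - xbar)
                sqdiff += (x[i] - x[j]) ** 2
                both += x[i] * x[j]
    if w == 0 or ss == 0:
        raise ValueError("no pairs in this interval, or no variation in x")
    return {
        "moran": n / w * cross / ss,
        "geary": (n - 1) * sqdiff / (2.0 * w * ss),
        "semivariance": sqdiff / (2.0 * w),
        "fpp": both / w,
    }
```

Calling the function once per distance interval yields the vector of statistics for that family; the error raised for empty intervals or fully empty or fully occupied maps reflects that Moran's I and Geary's C are undefined in those cases.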

In both the simulation and applied studies, the choice of distances followed these empirical rules: 1) the first radius was set to include the first line of neighbors (mean of the nearest neighbor distance); 2) the largest radius was approximately set to half the width of the data landscape; 3) in between we defined 6 nested annuli, each being larger than the former by a pre-set value.

2.3.2. Annular prevalences around initially occupied units

We also considered a measure based on annuli of increasing radii around each of the initially occupied units. For each annulus, we calculate the occupancy prevalence, defined as the number of occupied units divided by the total number of units within the annulus. We then take the mean and standard deviation of the prevalence across annuli of the same size for all the initially occupied units, thus generating a set of means and standard deviations per distance interval. We use the same distance intervals for the annular prevalence metrics as the ones used for the pairwise statistics.
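A minimal sketch of the annular prevalence statistics follows. The function name and arguments are hypothetical, annuli are taken as lo < distance ≤ hi, the focal unit itself is excluded, and the standard deviation is the sample standard deviation; none of these conventions is specified in the text, so they should be read as assumptions.

```python
import math
from statistics import mean, stdev


def annular_prevalence(coords, occupied_end, foci, breaks):
    """Mean and sd, across foci, of the occupancy prevalence in each annulus.

    coords       : list of (x, y) unit positions
    occupied_end : indices of units occupied at the end time point
    foci         : indices of the initially occupied units
    breaks       : increasing radii, e.g. [0, 15, 23]; annulus k is (breaks[k], breaks[k+1]]
    """
    occ = set(occupied_end)
    out = []
    for lo, hi in zip(breaks[:-1], breaks[1:]):
        prevs = []
        for f in foci:
            inside = [j for j in range(len(coords))
                      if j != f and lo < math.dist(coords[f], coords[j]) <= hi]
            if inside:  # skip annuli that contain no unit for this focus
                prevs.append(sum(j in occ for j in inside) / len(inside))
        m = mean(prevs) if prevs else float("nan")
        s = stdev(prevs) if len(prevs) > 1 else 0.0
        out.append((m, s))
    return out
```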

2.3.3. Partition based L-moments

We summarize the data by partitioning the landscape into discrete cells (Figure 2) using the K-Means algorithm AS 58 (Sparks 1973); details are provided in Supplementary Materials 6. For each MCMC iteration we define seven partitions with an increasing number of cells, the coarsest encompassing ~100 units per cell and the finest ~5 units per cell. Unlike a simple grid, K-Means yields partitions that overlap irregularly so that finer partitions are not nested within coarser ones, thus reducing correlation among measurements at different scales.
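To make the partitioning step concrete, the sketch below assigns units to cells with a plain Lloyd-style k-means and then computes the occupancy prevalence per cell. Note that this is a stand-in for, not a reproduction of, algorithm AS 58 used by the authors; the function names, the random initialization from the data points, and the iteration cap are assumptions.

```python
import math
import random


def kmeans_cells(coords, k, n_iter=50, rng=None):
    """Assign each unit to one of k cells with Lloyd-style k-means (sketch)."""
    rng = rng or random.Random(0)
    centers = rng.sample(coords, k)
    labels = [0] * len(coords)
    for _ in range(n_iter):
        # Assignment step: each unit joins the nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in coords]
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(coords, new_labels) if lab == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def cell_prevalences(labels, occupied, k):
    """Proportion of occupied units in each non-empty cell."""
    occ = set(occupied)
    counts, hits = [0] * k, [0] * k
    for i, lab in enumerate(labels):
        counts[lab] += 1
        hits[lab] += i in occ
    return [h / c for h, c in zip(hits, counts) if c > 0]
```

Running `kmeans_cells` once per partition size, with different numbers of cells, gives the non-nested multiscale partitions described above.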

Figure 2:

Multiscale partition based summary statistics. A spatial pattern, with occupied units in red diamonds and empty units in smaller black dots, is summarized using K-Means partitions. The landscape is partitioned into two scales, ten and twenty cells, with each cell in a different shade of grey. The occupancy prevalence per cell is computed, and its distribution is shown as histogram and empirical cumulative distribution function. The L-variance, L-skewness, and L-kurtosis, our summary statistics of the respective occupancy distributions, are given.

We first calculate the proportion of occupied units in the different cells of a partition. We then integrate the information from each cell into summary statistics. We considered polynomial regression coefficients, which have been suggested previously (Wood 2010), but found that these tend to be highly correlated and sensitive to small changes in underlying distributions (Stimson et al. 1978). Therefore, we use L-moments, which are scaled, linear combinations of order statistics resembling standard sample moments but more robust to outliers. L-moments are extensively used as goodness of fit criteria in geospatial modeling (Hosking 1990). For each partition, we calculate moments of the occupancy prevalence in the different cells across repeated simulations of a given MCMC iteration. These moments are: 1) the L-variance, 2) the L-skewness and 3) the L-kurtosis (L-V, L-S, and L-K in Figure 2). Additionally, because the L-mean is the normalized average proportion occupied, it does not change across different partitions or scales. For this reason, we use the number of occupied units ($n_{\mathrm{occupied}}$) in its place.
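Sample L-moments can be obtained from probability weighted moments of the sorted cell prevalences, following the unbiased estimators of Hosking (1990). The sketch below is illustrative: the function name is hypothetical, and it assumes that the "L-variance" of the text corresponds to the second L-moment and that L-skewness and L-kurtosis are the usual ratios of the third and fourth L-moments to the second.

```python
def l_moments(values):
    """Return (second L-moment, L-skewness, L-kurtosis) of a sample of size >= 4."""
    x = sorted(values)
    n = len(x)
    if n < 4:
        raise ValueError("at least four values are needed")
    # Unbiased probability weighted moments b0..b3 (ranks are 0-based here).
    b0 = sum(x) / n
    b1 = sum(i * x[i] for i in range(n)) / (n * (n - 1))
    b2 = sum(i * (i - 1) * x[i] for i in range(n)) / (n * (n - 1) * (n - 2))
    b3 = sum(i * (i - 1) * (i - 2) * x[i] for i in range(n)) / (
        n * (n - 1) * (n - 2) * (n - 3))
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    if l2 == 0:
        raise ValueError("all values are equal; L-moment ratios are undefined")
    return l2, l3 / l2, l4 / l2
```

For an evenly spaced sample such as `[1, 2, 3, 4, 5]` the second L-moment is 1 and both ratios vanish, as expected for a symmetric, uniform-like sample.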

2.4. Evaluation and validation of alternative summary statistics

In this section we describe two procedures to study the behavior of the synthetic likelihood and proposed summary statistics. The first is an analysis of each summary statistic’s relevancy, designed to be conducted before the estimation procedure. Using a computationally efficient approximation we compare the potential of different summary statistics and their combinations by sampling the synthetic likelihood surface around a given parameter set. The second is a check of credible interval accuracy, designed to be conducted after the estimation by MCMC. This is a computationally intensive procedure to verify that the estimation using the synthetic likelihood generates credible intervals of proper coverage probability, here that the interval contains the true value 95% of the time.

2.4.1. Simulation landscape and specification of the statistics

To evaluate our model and fitting procedure on simulated data, we consider an example landscape consisting of 33 × 33 units on a square grid, spaced 10m apart. In the model, we limit hops to 30m, meaning that the invader can disperse locally to any household within 30m of an occupied household. We limit jumps to between 30 and 120m.

We consider the semivariance, Geary’s C, Moran’s I, and FPP at seven distance intervals (0–15m, 15–23m, 23–39m, 39–63m, 63–95m, 95–135m, and 135–183m). We use the same distance intervals for the annular prevalence metrics.

For the partition based L-moments, we divide our simulation landscape following seven partitions of 10, 30, 50, 75, 100, 135, and 170 cells, going from coarse to fine grained partitions.

2.4.2. Preliminary assessment of the relevancy of summary statistics

To explore the inferential value of the summary statistics for an expected parameter set θ0, we investigate the shape of the normalized synthetic likelihood distribution $L^{s}_{\theta_0}$ around θ0 and the accompanying synthetic likelihood density $d^{s}_{\theta_0}$. By examining the synthetic likelihood surfaces of different summary statistics through simulation, we compare their strengths and select informative statistics for subsequent use in parameter estimation.

In our simulations, we use a fixed set of initially occupied units, S. We generate this set by seeding an invasion at a single unit at the center of the map and then dispersing according to θ0 for two years. Alternatively, this initial household could be chosen at random, likely providing a more stringent basis for the choice of adequate statistics. Next, we simulate dispersion r = 100 times from S, still with θ0, for another two years, obtaining r sets of ending occupied units. We call this collection of endpoints from the simulations $E_{\theta_0}$. We fit a multivariate normal distribution to the summary statistics on $E_{\theta_0}$ to calculate the likelihood distribution $\hat{L}^{s}_{\theta_0}$, which is defined as the estimated likelihood distribution for any set of summary statistics given θ0.

We evaluate this distribution $\hat{L}^{s}_{\theta_0}$ over parameters along an n × n (we use 50 by 50) matrix of θ values surrounding θ0. As above, for each θ, we simulate dispersion r times from S, giving a set of ending occupied units $E_{\theta,1 \ldots r}$. The average density of each $E_{\theta,i}$ in $\hat{L}^{s}_{\theta_0}$ corresponds to the average synthetic likelihood of the endpoints simulated with θ given the model parameterized with θ0. We assume this sampling approximates the likelihood distribution over the entire parameter space and normalize the likelihoods to obtain a synthetic density surface, comparable between statistics. The synthetic density surface $d^{s}_{\theta_0}(\theta)$ is estimated as:

$$ d^{s}_{\theta_0}(\theta) \approx \frac{\hat{L}^{s}_{\theta_0}(\theta)}{\sum_{\theta \in \Theta} \hat{L}^{s}_{\theta_0}(\theta)} $$

where Θ is the ensemble of parameter sets tested, including θ0.

These densities provide information on the power of different summary statistics to discriminate between parameter sets surrounding θ0. To facilitate the comparison, we identify approximate 95% highest posterior density regions (Held 2004), which are the smallest, possibly disjoint, sets of points that contain 95% of the density (detailed in Supplementary Materials 3). Our approach eases testing of summary statistic combinations because once a set of model simulations and statistics has been computed, combinations of these statistics can be rapidly analyzed for their ability to discriminate around θ0.
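On a finite grid of parameter sets, the normalization and the highest density region can be sketched as follows. The function names are hypothetical and the procedure (rank grid points by density and accumulate until 95% of the mass is reached) is one straightforward reading of the definition above; the authors' exact procedure is in Supplementary Materials 3.

```python
def density_surface(likelihoods):
    """Normalize average synthetic likelihoods over the grid so they sum to 1."""
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("likelihoods must have a positive sum")
    return [v / total for v in likelihoods]


def hpd_region(density, mass=0.95):
    """Indices of the smallest set of grid points holding at least `mass` of the density."""
    order = sorted(range(len(density)), key=lambda k: density[k], reverse=True)
    region, cum = [], 0.0
    for k in order:
        region.append(k)
        cum += density[k]
        if cum >= mass:
            break
    return sorted(region)
```

Because points are added in decreasing order of density, the resulting region need not be contiguous, which is how disjoint regions such as the bimodal Moran's I surface can arise.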

2.4.3. Assessment of posterior credible intervals coverage

Concerns have been expressed that approximating the joint distribution of the summary statistics with the multivariate normal distribution may induce suboptimal estimates (Fearnhead & Prangle 2012). We verify numerically that our procedure provides unbiased estimates and proper credible intervals through a series of parameter estimations via MCMC, attempting to recover preset parameters in a manner similar to (Cook et al. 2006).

From a given set of parameters θ0, we simulate n data sets with the same procedure as the preliminary analysis, compute their summary statistics, and use the statistics to parameterize the model on each dataset by MCMC. For each MCMC iteration, we use r simulations to compute the synthetic likelihood. We assess the quality of our estimation through the 95% credible interval width and coverage. The coverage is the proportion of MCMC chains with 95% credible intervals containing θ0, the true parameter set:

$$ \mathrm{coverage}_{0.95} \approx \frac{\sum_{i=1}^{n} I\left(Q_{i,97.5} > \theta_0 > Q_{i,2.5}\right)}{n} $$

where I is the indicator function (0 if false, 1 if true) and Qi,x gives the xth percentile of the posterior sample from the MCMC using the ith simulated dataset.
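The coverage formula can be evaluated directly from the posterior samples of each chain. The sketch below treats one parameter at a time; the function names and the linear-interpolation percentile are assumptions, not the authors' code.

```python
def percentile(values, p):
    """Percentile p (0-100) by linear interpolation between order statistics."""
    xs = sorted(values)
    pos = p / 100.0 * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def coverage(chains, theta0, level=95.0):
    """Share of chains whose central credible interval strictly contains theta0."""
    tail = (100.0 - level) / 2.0
    hits = 0
    for chain in chains:
        if percentile(chain, tail) < theta0 < percentile(chain, 100.0 - tail):
            hits += 1
    return hits / len(chains)
```

With well-calibrated 95% intervals, the returned proportion should be close to 0.95 when the number of simulated datasets is large.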

2.5. Chagas disease vector dataset

As an example of invasive species, we studied the Chagas disease vector T. infestans in Arequipa, the second largest city in Peru (Figure 1). Household level data were collected on the presence or absence of T. infestans from 535 households in January of 2009 and again in March of 2011. In 2009, data were collected by Peruvian Ministry of Health personnel as they surveyed to plan insecticide application. No control activities were conducted between 2009 and 2011 due to an insecticide shortage. In 2011, our team repeated the surveys. Each survey consisted of a search for vectors in the house and adjacent backyards. Detailed survey methods are provided in (Hong et al. 2015).

Participation in the surveys was imperfect. In the first survey, 96 households (18%) could not be entered and in the second survey, 131 households (24%). We have shown previously that households that do not participate in vector surveys are less likely to be infested than those that participate (Hong et al. 2015). Based on this finding, we assume that T. infestans are absent from households we did not enter. The proposed method can nevertheless be modified to include the fact that there are households with missing status by randomly sampling participating households before calculating the summary statistics.

Based on preliminary analysis (Supplementary Materials 4), we set the hop limit to 30m, corresponding to three times the average distance to the closest neighbor, and the jump limit to between 30 and 500m, allowing jumps between all households. We fit the dispersion parameters (α, φ) using the statistics described above. We adjust the partition scales to 5, 10, 25, 40, 60, 90, and 130 cells and the pairwise and annular distance intervals to 0–10m, 10–19m, 19–37m, 37–64m, 64–100m, 100–145m, and 145–199m.

To verify that the posteriors for our parameter estimates (Figure 5) have proper coverage and credible intervals, we perform a check of credible intervals validity on data from Arequipa. We use households occupied in 2009 as our starting point and the respective parameter estimates as the θ0.

Figure 5:

Dispersal parameter estimates for Triatoma infestans in urban Arequipa, Peru. The mean and 95% credible intervals for α, the dispersal frequency (units: dispersal events/(infested households × weeks)), and for φ, the proportion of jumps, obtained from synthetic likelihood analysis using different sets of statistics are given above. Each analysis used one Markov chain Monte Carlo and 150 model simulations per chain iteration.

Statistical analyses were performed using R (R Core Team 2014); the annotated code is freely available as a Bitbucket repository at: https://bitbucket.org/cbarbu/synlikspatial/branch/ecography2017publication.

3. Results

3.1. Preliminary assessment of the relevancy of summary statistics on generated data

We assessed the comparative relevancy of the summary statistics for three distinct parameter sets (α, φ): (0.02, 0.80), (0.04, 0.40), and (0.06, 0.10). In general, summary statistics varied in their ability to recapture known parameter values.

Over all parameter sets, the FPP performs the best of all the pairwise measures (Figure 3, column 4). The high performance of the FPP is consistent with its focus on presence (i.e. occupied households) rather than absence (i.e. empty households), suggesting that it might work readily as an “out of the box” statistic for parameter estimation of invasive processes in the early stages when the invader is relatively rare. The partition-based L-moment statistics perform comparably to the FPP over all parameter sets considered (Figure 3, column 5). Minor differences in identification persist, with the FPP providing homogeneity for the (0.02, 0.80) set but elongated tails for the (0.06, 0.10) set.

Figure 3:

Synthetic likelihood density profiles for Moran’s I, $n_{\mathrm{occupied}}$, Moran’s I and n, FPP, and partition L-moments statistics. Three sets of dispersal parameters, (α, φ), are taken as the true θ0: (0.02, 0.80), (0.04, 0.40), and (0.06, 0.10) in this preliminary analysis. The θ0 are shown as black lines. Synthetic likelihood density profiles are plotted for each selected summary statistic. Areas of high likelihood are shown in green, intermediate likelihood in yellow, and low likelihood in red. The region encircled in black is the 95% credible region, the smallest collection of points containing 95% of the likelihood density. Tighter credible regions centered around the true value indicate better performing summary statistics.

The annular statistics are sensitive to the initial set of households occupied, as they only survey around these foci. As expected, they perform better for the (0.06, 0.10) set which corresponds to many initial occupations (Supplementary Materials 3). For the (0.02, 0.80) parameter set, the marginal 95% credible region of the annular statistic nearly covers the entire space of φ, demonstrating an inability to discriminate between low and high proportions of long range dispersal.

The Moran’s I performs worst in terms of identifying the dispersal frequency (α) (Figure 3, column 1). The Moran’s I distribution appears broad and bimodal on the (0.02, 0.80) set with modes at (0.02, 0.80) and (0.03, 0.95). Bimodality may arise from symmetric consideration of occurrence and absence, where a pattern and its negative generate similar values. The first mode, with lower α, is caused by the patterning of occupied units while the second mode, with higher α, is caused by the patterning of empty units, a nuisance interaction. Through the bimodality, we see that Moran’s I lacks information on the number of occupied households ($n_{\mathrm{occupied}}$), indicating that the first mode, considering occupied households, is the true mode. Conversely, the $n_{\mathrm{occupied}}$ statistic precisely captures the dispersal frequency but entirely fails to discriminate between different levels of jumping (Figure 3, column 2). Once combined, these two statistics perform excellently and demonstrate that combining complementary and differently informative statistics can improve parameter identification (Figure 3, column 3). Geary’s C and the semivariance both contain information on the number of occupied households and perhaps for this reason perform better than Moran’s I alone (Supplementary Materials 3).

All statistics fail to differentiate between large proportions of long range dispersal (φ > 40%), potentially because these parameter sets generate similar results. In addition, for small φ (e.g. at (0.06, 0.10)) (Figure 3, third row), the diagonal orientation of the synthetic likelihood surface suggests that high dispersal frequency (α) and low long range dispersal (φ) can partially be confused with low dispersal frequency and high long range dispersal which is an expected feature as long range dispersal colonizes new and completely susceptible areas.

3.2. Assessment of quality of posterior credible intervals

We study the validity of the credible intervals estimated with our procedure and compare the efficiency of the different summary statistics for the intermediate (0.04, 0.40) parameter set. Concerning the relative performances of summary statistics, we find strong consistency with the preliminary analysis (Figure 4), confirming the validity of the approximation made in this computationally efficient procedure. In the dispersal frequency (α) domain, Moran’s I performs worst with the largest credible interval size but adding the $n_{\mathrm{occupied}}$ statistic greatly shrinks its credible intervals. As before, we can see that combining summary statistics can make up for innate deficits in individual ones. We consider three different sets of partition based L-moments: L-Variance and $n_{\mathrm{occupied}}$ (L−V + N), L-Variance, L-Skewness, and $n_{\mathrm{occupied}}$ (L−VS + N), or L-Variance, L-Skewness, L-Kurtosis, and $n_{\mathrm{occupied}}$ (L−VSK + N). The FPP and these three sets of L-moments perform well in capturing α with small credible intervals. Furthermore, the coverage checks demonstrate that most summary statistics produce 95% credible intervals that accurately recapture θ0, here α = 0.04, 95% (or greater) of the time.

Figure 4:

Credible interval size and coverage depending on statistics. The true dispersal parameters, θ0 (α, φ) are taken as (0.04, 0.40) in this procedure. Bars give the 95% empirical credible intervals for the interval size and 95% binomial credible intervals for the coverage.

All summary statistics provide fairly similar credible interval sizes when estimating φ (the proportion of jumping), though the quality of the credible intervals varies. The combined Moran’s I and noccupied statistics and the FPP recover φ relatively well, with small credible interval sizes and proper coverage. The L-moments statistics only achieve sub-optimal coverage. Their incorrect coverage potentially results from deviation of their summary statistic distribution from the multivariate normal. Removing statistics from L−VSK+N to L−VS+N to L−V+N results in improved coverage, an expected result: as fewer statistics are included, the distribution’s dimensionality decreases, and the summary statistics are more likely to conform to the multivariate normal (Scott 2009). Finally, as expected from the preliminary analysis, the annular statistics have the largest credible interval for φ and report greater than 95% coverage.

3.3. Applications to Chagas disease vector dispersal

The dispersal of T. infestans in Arequipa produces distinctly identifiable spatial patterns. In our data, most occupied households were found within small, distinct clusters (Figure 1). Between the first and second time points, all clusters expanded and a new cluster was established at the bottom right, suggesting that a jump occurred. When the hop radius is fixed at 30m (as identified in Supplementary Materials 4), the dispersal characteristics of these patterns can be readily detected through synthetic likelihood analysis.

Using the FPP summary statistic, we find that hopping is the dominant mode of dispersal, with the proportion of jumps φ estimated at 0.188 (95% CI: 0.0261–0.448). However, there is a nontrivial amount of longer range (>30m) jumping. The mean dispersal frequency (α) is 0.0102 (invasions per occupied household per week) with a tight 95% credible interval (0.00620–0.0158). The dispersal frequency paints a picture of a slowly advancing invasion, with a single occupied household producing a secondary infestation on average (computed as 1/α) in 1.89 years (95% CI: 1.22–3.20).
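The conversion from a weekly dispersal frequency to a mean waiting time can be reproduced with a short calculation. This sketch assumes 52 weeks per year, a convention the text does not state; the function and variable names are illustrative only.

```python
WEEKS_PER_YEAR = 52  # assumption: the conversion factor is not given in the text

def years_to_secondary_infestation(alpha_per_week):
    """Mean waiting time 1/alpha, converted from weeks to years."""
    return 1.0 / alpha_per_week / WEEKS_PER_YEAR

alpha_mean = 0.0102              # invasions per occupied household per week
alpha_ci = (0.00620, 0.0158)     # reported 95% credible interval for alpha

mean_years = years_to_secondary_infestation(alpha_mean)
# 1/alpha is a decreasing transform, so the interval bounds swap places.
ci_years = (years_to_secondary_infestation(alpha_ci[1]),
            years_to_secondary_infestation(alpha_ci[0]))

print(f"{mean_years:.2f} years ({ci_years[0]:.2f} to {ci_years[1]:.2f})")
```

Under this assumption the point estimate and lower bound match the reported 1.89 and 1.22 years. The upper bound obtained from the rounded α interval is about 3.1 years, slightly below the reported 3.20, so the published interval may have been computed from the posterior draws with a slightly different convention.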

The characteristics of the posteriors (medians and 95% credible intervals) estimated with other summary statistics are similar to those obtained with the FPP (Figure 5). The relative performance of the different statistics is consistent with their performance from simulations on the artificial landscape. The weakness of the annuli in identifying the proportion of jumps (φ) is particularly marked here with the credible interval spanning nearly the entire space from 0 to 1.

In addition to narrow credible intervals for both α and φ, the FPP demonstrates good coverage for the 95% CI (Supplementary Materials 5). The partition-based statistics also perform well, closely resembling the FPP for parameter estimates and demonstrating adequate coverage whether including the L-Kurtosis (L−VSK + N) or not (L−VS + N).

These partition-based statistic coverage estimates for φ noticeably differ from those resulting from simulations on the artificial landscape (Figure 4). Two major differences potentially affect the coverage: first, coverage is estimated around the measured parameters, approximately (0.01, 0.19), not around (0.04, 0.40); second, partitioning is done on a heterogeneous landscape, allowing for marked differences between partition cells and across partition statistics. Further details on the results obtained with the Arequipa data are given in Supplementary Materials 5.

4. Discussion

Understanding the parameters of the dispersion of invading organisms is critical to controlling them. We provide a generic framework based on the synthetic likelihood and summary statistics to efficiently parameterize complex spatio-temporal dispersion models. Our methods integrate a rich literature on spatial patterns and the underlying processes from which they emerge (Giometto et al. 2014, Hastings et al. 2005, Keeling et al. 2004, Legendre & Fortin 1989, Lewis & Pacala 2000, Ripley 1977), a growing literature on parameter estimation using summary statistics (Aeschbacher et al. 2012, Beaumont 2010, Beaumont et al. 2002, Csilléry et al. 2010), and a nascent literature on combining the two approaches to make inference on dispersal processes using spatial summary statistics (Jandarov et al. 2014). Our approach allows for efficient calculation of likelihoods even for complex dispersal models and avoids the common simplifications required to derive an analytical likelihood. In our example case on the dispersal of the Chagas disease vector Triatoma infestans in the city of Arequipa, Peru, we successfully estimate the rate of spread of infestations from infested households and show that long range (>30m) dispersal events represent a significant share of overall dispersal events.

The synthetic likelihood provides a generic, convenient, and robust solution for the definition of a likelihood for such statistics. While normal and informative summary statistics must be provided, the synthetic likelihood does not depend on model-dependent analytical calculations (as in the case of the generalized method of moments) or manual “tweaking” (as in the case of approximate Bayesian computation or approximate likelihood methods). The parameter estimation process is reduced to the choice of statistics compatible with the data and relevant to underlying biological processes. We demonstrate that standard, well-known spatial statistics perform adequately for spatial data and propose an efficient and broadly applicable family of statistics, the partition-based L-moments.
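As an illustration of why approximately normal summary statistics are needed, the following is a minimal sketch of a Gaussian synthetic log-likelihood in the spirit of Wood (2010). The function name and the placeholder simulations are assumptions for illustration, not the implementation used in this work: statistics simulated at a candidate parameter value are summarized by their mean and covariance, and the observed statistics are scored under the resulting multivariate normal.

```python
import numpy as np

def synthetic_loglik(observed_stats, simulated_stats):
    """Gaussian synthetic log-likelihood of observed summary statistics.

    simulated_stats has shape (n_sim, n_stats): one row per simulation run
    at a single candidate parameter value.
    """
    s_obs = np.atleast_1d(np.asarray(observed_stats, dtype=float))
    sims = np.asarray(simulated_stats, dtype=float)
    if sims.ndim == 1:                    # a single statistic per simulation
        sims = sims[:, None]
    mu = sims.mean(axis=0)                                # estimated mean vector
    sigma = np.atleast_2d(np.cov(sims, rowvar=False))     # estimated covariance
    diff = s_obs - mu
    _, logdet = np.linalg.slogdet(sigma)
    quad = diff @ np.linalg.solve(sigma, diff)            # Mahalanobis distance
    k = s_obs.size
    return -0.5 * (k * np.log(2.0 * np.pi) + logdet + quad)

# Illustrative use with placeholder simulations (not data from the paper).
rng = np.random.default_rng(0)
sims = rng.normal(loc=[10.0, 0.3], scale=[2.0, 0.05], size=(200, 2))
ll_near = synthetic_loglik([10.5, 0.31], sims)   # statistics close to the simulations
ll_far = synthetic_loglik([20.0, 0.60], sims)    # statistics far from the simulations
print(ll_near > ll_far)
```

No model-specific derivation enters this calculation; only the simulator and the choice of statistics change from one application to another.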

Spatial statistics vary in their ability to recapture the underlying spatio-temporal process. Both autocorrelation measures and the related semivariance perform worse than the Frequency of Positive Pairs statistic (FPP) in our analysis. The FPP, which only considers interactions between occupied units, may be suited to particularities of biological invasions: while the occupation disperses and forms patterns, the emptiness does not, and patterns of emptiness give specious information. Our alternative partition-based L-moment statistics perform comparably to the FPP and generally outperform the remaining pairwise methods. Interestingly, while we noted slight undercoverage in simulations, this disappeared when estimating dispersal parameters in Arequipa. Partition-based statistics probably perform better over real data because multiple sources of spread on a heterogeneous landscape result in greater variation across partition cells and a distribution closer to a multivariate normal. Additionally, these statistics are computationally efficient; only one calculation per unit is required (O(n)), in contrast to pairwise measures (O(n^2)). The partition-based statistics provide a good default summary statistic for spatial datasets because they are not restricted to binary data (unlike the FPP).
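The linear cost of partition-based statistics can be sketched as follows. This example assumes a regular square grid as the partition and uses hypothetical helper names (`occupied_per_cell`, `sample_l_moments`); the partitioning scheme actually used is described in the Methods and may differ. The L-moment estimators follow the standard probability-weighted-moment formulas of Hosking (1990).

```python
import numpy as np

def occupied_per_cell(xy, occupied, cell_size):
    """Count occupied units per grid cell in one pass over the units."""
    xy = np.asarray(xy, dtype=float)
    ix = np.floor((xy[:, 0] - xy[:, 0].min()) / cell_size).astype(int)
    iy = np.floor((xy[:, 1] - xy[:, 1].min()) / cell_size).astype(int)
    key = ix * (iy.max() + 1) + iy                 # one integer label per cell
    units = np.bincount(key)                       # households in each cell
    occ = np.bincount(key, weights=occupied)       # occupied households in each cell
    return occ[units > 0]                          # ignore cells without households

def sample_l_moments(values):
    """Return (L-mean, L-variance, L-skewness, L-kurtosis); needs >= 4 values."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    i = np.arange(n)                               # zero-based ranks
    b0 = x.mean()
    b1 = np.sum(i * x) / (n * (n - 1))
    b2 = np.sum(i * (i - 1) * x) / (n * (n - 1) * (n - 2))
    b3 = np.sum(i * (i - 1) * (i - 2) * x) / (n * (n - 1) * (n - 2) * (n - 3))
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l1, l2, l3 / l2, l4 / l2                # ratios are undefined if l2 == 0

# Hypothetical 4 x 4 landscape with one small cluster of occupied households.
xs, ys = np.meshgrid(np.arange(4), np.arange(4))
xy = np.column_stack([xs.ravel(), ys.ravel()])
occupied = np.zeros(16)
occupied[[0, 1, 4]] = 1
counts = occupied_per_cell(xy, occupied, cell_size=2)
print(counts, sample_l_moments(counts))
```

Each unit is assigned to a cell once, so the work grows linearly with the number of units, whereas a pairwise statistic must visit every pair. The same code accepts abundances instead of 0/1 occupancy in `occupied`, which reflects the point that these statistics are not restricted to binary data.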

Computation time can also be a legitimate concern when considering model fitting using summary statistics (Csilléry et al. 2010, Jandarov et al. 2014). Our preliminary analysis allows quick comparison and selection of potential summary statistics. Achieving proper coverage is also a challenge (Fearnhead & Prangle 2012). We check the coverage for a given set of parameters in a manner similar to the cross validations used with ABC (Csilléry et al. 2010) to help determine whether credible intervals are acceptable. While coverage is often appropriate, we find occasional undercoverage and note that removing statistics can improve coverage in these cases (Figure 4).

Consequently, we suggest a three-part algorithm for the use of the synthetic likelihood for spatio-temporal inference:

  1. Preliminary analysis to identify a relevant, informative set of statistics.

  2. Parameter estimation by MCMC using the identified set of statistics.

  3. Validation of the credible interval coverage for estimated parameter values.
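The second step of this algorithm can be sketched on a toy problem. The example below is an assumption-laden illustration, not the fitting code used here: a Poisson simulator with a single rate parameter stands in for the hop-jump dispersal model, the summary statistics are the sample mean and variance, the prior is flat on positive values, and a plain random-walk Metropolis sampler replaces whatever tuning a real analysis would need.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_stats(theta, n_units=200):
    """Toy simulator (hypothetical stand-in for the dispersal model)."""
    counts = rng.poisson(theta, size=n_units)
    return np.array([counts.mean(), counts.var(ddof=1)])

def synthetic_loglik(theta, s_obs, n_sim=30):
    """Gaussian synthetic log-likelihood, up to an additive constant."""
    sims = np.array([simulate_stats(theta) for _ in range(n_sim)])
    mu = sims.mean(axis=0)
    sigma = np.cov(sims, rowvar=False) + 1e-9 * np.eye(len(mu))  # small ridge
    diff = s_obs - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (logdet + diff @ np.linalg.solve(sigma, diff))

def run_mcmc(s_obs, theta_init, n_iter=400, step=0.3):
    """Random-walk Metropolis on theta with a flat prior on theta > 0."""
    theta = theta_init
    ll = synthetic_loglik(theta, s_obs)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        proposal = theta + step * rng.normal()
        if proposal > 0:                           # outside the prior: reject
            ll_prop = synthetic_loglik(proposal, s_obs)
            if np.log(rng.uniform()) < ll_prop - ll:
                theta, ll = proposal, ll_prop
        chain[t] = theta
    return chain

theta_true = 4.0
s_obs = simulate_stats(theta_true)                 # pseudo-observed statistics
chain = run_mcmc(s_obs, theta_init=6.0)
posterior = chain[100:]                            # discard burn-in
lower, upper = np.percentile(posterior, [2.5, 97.5])
print(posterior.mean(), lower, upper)
```

Repeating this fit on many datasets simulated at the estimated parameter value, and counting how often the 95% interval contains that value, gives the coverage check of the third step.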

Our findings on the predominance of short range dispersal and slow pace of T. infestans invasion are strongly compatible with the existing literature (Cecere et al. 2006, Schofield & Matthews 1985, Vazquez-Prokopec et al. 2004). These previous studies aimed to describe T. infestans dispersal under different conditions and using different experimental and statistical methods. There is no clear consensus among them. However, it is evident that the insects move at different rates across different scales and that they are affected by barriers in the urban landscape, namely streets separating city blocks (Barbu et al. 2013, Khatchikian et al. 2015). These findings are based on smoothing kernels or molecular methods that compare discrete subpopulations; none attempt to quantify the rate of local versus long distance dispersal in the manner that we have done here.

Currently, Chagas disease vector control consists of broad insecticide application within a community and subsequent control of returning and residual vector populations. To conduct surveillance and control returning and residual populations, the Ministry of Health relies on community-based vector reporting, confirmation of reports by trained personnel, and treatment of confirmed infestation in reporting households as well as neighboring households. This strategy of “stamping out fires” is purely reactive; it relies on community reports post-infestation to find and exterminate vector foci.

To succeed using a purely reactive strategy, long range dispersal must be infrequent. Generally, short range dispersal grows existing foci which, once discovered, can be treated quickly. Long range dispersal, however, generates secondary foci that need to be discovered and treated separately. To eliminate an invasion successfully, the average number of long range dispersals before detection and control of a focus must remain under one, in a comparable sense to the R0 (Heesterbeek 2002). Interestingly, while long range dispersal complicates the elimination of invasive species, short range dispersal may facilitate their elimination by increasing the probability of detection as the focus affects more households. Here, we show that the rate of dispersal at short distances is much higher than the rate of dispersal at long distances, suggesting that a purely reactive strategy may succeed if foci are detected and reported quickly and are treated in their entirety.
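A rough back-of-the-envelope version of this threshold argument can be written down under strong simplifying assumptions that are not part of the analysis above: a focus starts from one household, grows deterministically by hops at rate α(1 − φ) per occupied household per week without saturation, emits jumps at rate αφ per occupied household per week, and is eliminated entirely at a fixed time after its foundation. The function names are illustrative.

```python
import math

def expected_jumps_before_control(alpha, phi, weeks_to_control):
    """Expected long range dispersals from one focus before it is treated.

    Assumes exponential growth by hops, m(t) = exp(r t) with r = alpha * (1 - phi),
    and integrates the jump rate alpha * phi * m(t) from 0 to weeks_to_control.
    """
    r = alpha * (1.0 - phi)
    return phi / (1.0 - phi) * math.expm1(r * weeks_to_control)

def max_weeks_to_control(alpha, phi):
    """Delay at which the expected number of jumps per focus reaches one."""
    return math.log(1.0 / phi) / (alpha * (1.0 - phi))

alpha, phi = 0.0102, 0.188      # point estimates reported for Arequipa
t_star = max_weeks_to_control(alpha, phi)
print(f"threshold delay: {t_star:.0f} weeks (about {t_star / 52:.1f} years)")
print(expected_jumps_before_control(alpha, phi, weeks_to_control=52))
```

The numbers this produces are only as good as the assumptions; in particular, stochastic variation, undetected households within a focus, and uncertainty in α and φ would all shorten the delay that a reactive strategy can tolerate.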

There are several caveats to our model and procedures. Summarizing data with statistics results in some loss of information contained within our already fairly limited raw observations. Nevertheless, the information retained is rich enough to provide valuable insight on the spatio-temporal spread process. Additionally, different types of statistics provide similar estimates, hinting at a consistency of parameter estimation. Our procedure currently handles binary data; invasive spread is often more finely described with data on species abundances (Hastings et al. 2005, Lewis & Pacala 2000, Shigesada & Kawasaki 1997, Shigesada et al. 1995). Our procedures can be extended to continuous data and are likely to improve when applied to datasets with additional abundance information.

Our hop-jump dispersal model builds on a rich tradition of flexible stochastic models used to describe invasions and epidemics (Barbu et al. 2010, Keeling et al. 2001, Levy et al. 2011, Smith et al. 2002) and overcomes the limitations of some more traditional diffusion models (Hastings et al. 2005). Nevertheless, the hop-jump model only crudely represents the relationship between dispersal probability and distance. Future work could incorporate additional movement classes or dispersal kernels (Keeling et al. 2001, Levy et al. 2008). In particular for T. infestans in Arequipa, we might consider a third movement class to account for the barrier effect of streets (Barbu et al. 2013). Our model also ignores habitat variation among households. Such variability may affect susceptibility of households to invasion, subsequent duration of occupation, and future rate of emigration. The T. infestans data we use to parameterize the model are imperfect; approximately 20% of the households in the study site did not participate. We made the simplifying assumption that these households were unoccupied, potentially leading to an underestimation of the dispersal rate. Nevertheless, participation in household level vector control campaigns has been studied (Buttenheim et al. 2014), and non-participating households appear to be significantly less infested (Hong et al. 2015).

While the specific stochastic model presented here describes spatio-temporal dispersion, our approach can readily be extended to other stochastic spatio-temporal processes occurring on multiple scales and in complex environments. As complex stochastic models are increasingly prevalent in fields as diverse as population ecology (Barbu et al. 2010, Beaumont 2010), evolution and genetics (Beaumont et al. 2002), and epidemiology (Keeling et al. 2001, Levy et al. 2011), our synthetic likelihood framework augments existing approaches by providing a new and powerful tool to parameterize such models with real world data.

Supplementary Material

Supplemental Materials

6. Acknowledgements

We thank Dr. Jason Roy and Dr. Michelle Ross for their insightful comments and Katty Borrini-Mayori, Maria Luz Hancco, Manuel Burgos, Renzo Salazar, and the laboratory staff at UPCH-LID. We also thank the following organizations for their part in organizing and conducting the Chagas Disease control campaign in Arequipa: Ministerio de Salud del Perú (MINSA), the Dirección General de Salud de las Personas (DGSP), the Estrategia Sanitaria Nacional de Prevención y Control de Enfermedades Metaxénicas y Otras Transmitidas por Vectores (ESNPCEMOTVS), the Dirección General de Salud Ambiental (DIGESA), the Gobierno Regional de Arequipa, the Gerencia Regional de Salud de Arequipa (GRSA), the Pan American Health Organization (PAHO/OPS) and the Canadian International Development Agency (CIDA). This work was supported by National Institutes of Health grant NIH-NIAID R01AI101229.

Footnotes

Publisher's Disclaimer: This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as doi: [10.1111/ecog.02575].

References

  1. Aeschbacher S et al. 2012, A novel approach for choosing summary statistics in approximate Bayesian computation – Genetics 192(3), 1027–1047
  2. Ascunce MS et al. 2011, Global invasion history of the fire ant Solenopsis invicta – Science 331(6020), 1066–1068
  3. Barbu C 2016, ‘Yamh: Yet another Metropolis-Hastings R package’. https://bitbucket.org/cmbce/r-package-yamh
  4. Barbu CM et al. 2010, Characterization of the dispersal of non-domiciliated Triatoma dimidiata through the selection of spatially explicit models – PLoS Negl. Trop. Dis 4(8), e777
  5. Barbu C et al. 2013, The effects of city streets on an urban disease vector – PLoS Comput. Biol 9(1), e1002801
  6. Beaumont MA 2010, Approximate Bayesian computation in evolution and ecology – Annu. Rev. Ecol. Evol. Syst 41, 379–406
  7. Beaumont MA et al. 2002, Approximate Bayesian computation in population genetics – Genetics 162(4), 2025–2035
  8. Buttenheim AM et al. 2014, Is participation contagious? Evidence from a household vector control campaign in urban Peru – J. Epidemiol. Community Health 68, 103–109
  9. Cecere MC et al. 2006, Reinfestation sources for Chagas disease vector, Triatoma infestans, Argentina – Emerg. Infect. Dis 12(7), 1096–1102
  10. Cliff AD & Ord JK 1981, Spatial processes: models & applications, Vol. 44, Pion; London
  11. Cook SR et al. 2006, Validation of software for Bayesian models using posterior quantiles – J. Comput. Graph. Statist 15(3)
  12. Csilléry K et al. 2010, Approximate Bayesian computation (ABC) in practice – Trends Ecol. Evol 25(7), 410–418
  13. Fasiolo M et al. 2016, A comparison of inferential methods for highly nonlinear state space models in ecology and epidemiology – Statist. Sci 31(1), 96–118
  14. Fearnhead P & Prangle D 2012, Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation – J. R. Stat. Soc. Series B Stat. Methodol 74, 419–474
  15. Geary RC 1954, The contiguity ratio and statistical mapping – The Incorporated Statistician 5, 115–127+ 129–146
  16. Gelman A 2004, Bayesian data analysis, CRC Press
  17. Gelman A et al. 1996, Efficient Metropolis jumping rules – Bayesian statistics 5, 599–608
  18. Geweke J 1992, Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments, in Bernardo J et al., eds, ‘Bayesian Statistics 4’, Clarendon Press, Oxford, UK
  19. Gillespie DT 1975, An exact method for numerically simulating the stochastic coalescence process in a cloud – J. Atmos. Sci 32(10), 1977–1989
  20. Giometto A et al. 2014, Emerging predictable features of replicated biological invasion fronts – Proc. Natl. Acad. Sci. U.S.A 111(1), 297–301
  21. Hansen LP 1982, Large sample properties of generalized method of moments estimators – Econometrica 50(4), 1029–1054
  22. Hartig F et al. 2011, Statistical inference for stochastic simulation models: theory and application – Ecol. Lett 14(8), 816–827
  23. Hastings A et al. 2005, The spatial spread of invasions: new developments in theory and evidence – Ecol. Lett 8(1), 91–101
  24. Heesterbeek J 2002, A brief history of R0 and a recipe for its calculation – Acta Biotheor 50(3), 189–204
  25. Held L 2004, Simultaneous posterior probability statements from Monte Carlo output – J. Comput. Graph. Statist 13(1), 20–35
  26. Hengeveld R 1989, Dynamics of biological invasions, Springer Science & Business Media
  27. Hong AE et al. 2015, Mapping the spatial distribution of a disease-transmitting insect in the presence of surveillance error and missing data – J. R. Stat. Soc. Ser. A Stat. Soc 178(3), 641–658
  28. Hosking JR 1990, L-moments: analysis and estimation of distributions using linear combinations of order statistics – J. R. Stat. Soc. Series B Stat. Methodol 52(1), 105–124
  29. Ionides E et al. 2006, Inference for nonlinear dynamical systems – Proc. Natl. Acad. Sci. U.S.A 103(49), 18438–18443
  30. Jandarov R et al. 2014, Emulating a gravity model to infer the spatiotemporal dynamics of an infectious disease – J. R. Stat. Soc. Ser. C Appl. Stat 63(3), 423–444
  31. Jeschke JM & Strayer DL 2005, Invasion success of vertebrates in Europe and North America – Proc. Natl. Acad. Sci. U.S.A 102(20), 7198–7202
  32. Keeling MJ et al. 2001, Dynamics of the 2001 UK foot and mouth epidemic: stochastic dispersal in a heterogeneous landscape – Science 294(5543), 813–817
  33. Keeling MJ et al. 2004, Using conservation of pattern to estimate spatial parameters from a single snapshot – Proc. Natl. Acad. Sci. U.S.A 101(24), 9155–9160
  34. Khatchikian C et al. 2015, Population structure of the Chagas disease vector Triatoma infestans in an urban environment – PLoS Negl. Trop. Dis 9(2), e0003425
  35. Legendre P & Fortin MJ 1989, Spatial pattern and ecological analysis – Plant Ecol 80(2), 107–138
  36. Levy MZ et al. 2008, Impregnated netting slows infestation by Triatoma infestans – Am. J. Trop. Med. Hyg 79(4), 528–534
  37. Levy MZ et al. 2011, Retracing micro-epidemics of Chagas disease using epicenter regression – PLoS Comput. Biol 7(9), e1002146
  38. Lewis M & Pacala S 2000, Modeling and analysis of stochastic invasion processes – J. Math. Biol 41(5), 387–429
  39. Moran PAP 1950, Notes on continuous stochastic phenomena – Biometrika 37, 17–23
  40. Pearson K 1894, Contributions to the mathematical theory of evolution – Philos. Trans. Roy. Soc. London 185, 71–110
  41. Pimentel D et al. 2005, Update on the environmental and economic costs associated with alien-invasive species in the United States – Ecol. Econ 52(3), 273–288
  42. R Core Team 2014, ‘R: A language and environment for statistical computing’. ISBN 3-900051-07-0. http://www.R-project.org
  43. Raftery AE & Lewis SM 1996, Implementing MCMC, in Gilks W et al., eds, ‘Markov chain Monte Carlo in practice’, London: Chapman and Hall, pp. 115–130
  44. Ripley BD 1977, Modelling spatial patterns – J. R. Stat. Soc. Series B Stat. Methodol 39(2), 172–212
  45. Schofield CJ & Matthews JN 1985, Theoretical approach to active dispersal and colonization of houses by Triatoma infestans – J. Trop. Med. Hyg 88(3), 211–222
  46. Scott DW 2009, Multivariate density estimation: theory, practice, and visualization, Wiley.com
  47. Shigesada N & Kawasaki K 1997, Biological Invasions: Theory and Practice, Oxford University Press
  48. Shigesada N et al. 1995, Modeling stratified diffusion in biological invasions – Am. Nat 146, 229–251
  49. Smith DL et al. 2002, Predicting the spatial dynamics of rabies epidemics on heterogeneous landscapes – Proc. Natl. Acad. Sci. U.S.A 99(6), 3668
  50. Soubeyrand S et al. 2009, Inference with a contrast-based posterior distribution and application in spatial statistics – Stat. Methodol 6(5), 466–477
  51. Sparks D 1973, Algorithm AS 58: Euclidean cluster analysis – J. R. Stat. Soc. Ser. C Appl. Stat 22(1), 126–130
  52. Stimson JA et al. 1978, Interpreting polynomial regression – Sociol. Methods Res 6(4), 515–524
  53. Suarez AV et al. 2001, Patterns of spread in biological invasions dominated by long-distance jump dispersal: insights from Argentine ants – Proc. Natl. Acad. Sci. U.S.A 98(3), 1095–1100
  54. Tarleton RL et al. 2007, The challenges of Chagas Disease – grim outlook or glimmer of hope – PLoS Med 4(12), e332
  55. Varin C et al. 2011, An overview of composite likelihood methods – Stat. Sin 21(1), 5–42
  56. Vazquez-Prokopec GM et al. 2004, Active dispersal of natural populations of Triatoma infestans (Hemiptera: Reduviidae) in rural northwestern Argentina – J. Med. Entomol 41(4), 614–621
  57. Wood SN 2010, Statistical inference for noisy nonlinear ecological dynamic systems – Nature 466(7310), 1102–1104
  58. Wu Y et al. 2014, A door-to-door survey of bed bug (Cimex lectularius) infestations in row homes in Philadelphia, Pennsylvania – Am. J. Trop. Med. Hyg 91, 206–210
  59. Zhu G et al. 2012, Potential geographic distribution of brown marmorated stink bug invasion (Halyomorpha halys) – PLoS One 7(2), e31246
