Abstract
Estimating dispersal distances from population genetic data provides an important alternative to logistically taxing methods for directly observing dispersal. While methods for estimating dispersal rates between a modest number of discrete demes are well developed, methods of inference applicable to “isolation-by-distance” models are much less established. Here we present a method for estimating ρσ², the product of population density (ρ) and the variance of the dispersal displacement distribution (σ²). The method is based on the assumption that low-frequency alleles are identical by descent. Hence, the extent of geographic clustering of such alleles, relative to their frequency in the population, provides information about ρσ². We show that a novel likelihood-based method can infer this composite parameter with a modest bias in a lattice model of isolation-by-distance. For calculating the likelihood, we use an importance sampling approach to average over the unobserved intra-allelic genealogies, where the intra-allelic genealogies are modeled as a pure birth process. The approach also leads to a likelihood ratio test of isotropy of dispersal, i.e. whether dispersal distances on two axes are different. We test the performance of our methods using simulations of new mutations in a lattice model and illustrate their use with a data set from Arabidopsis thaliana.
Introduction
Patterns of dispersal have long been recognized as important for evolutionary and ecological dynamics. Nevertheless, accurately quantifying patterns of dispersal via direct observation is difficult in most systems. Instead, patterns of genetic variation can be used to obtain indirect estimates of dispersal tendencies (Slatkin, 1987). For models in which individuals are in distinct demes, a variety of indirect methods are available, for example, cladistic methods (Slatkin and Maddison, 1989; Excoffier et al., 1992) and likelihood-based methods (Rannala and Hartigan, 1995; Tufto et al., 1996; Beerli and Felsenstein, 2001; Nielsen and Wakeley, 2001; Iorio et al., 2005; Hey and Nielsen, 2007).
In comparison, methods for estimating dispersal in continuous isolation-by-distance (CIBD) models are less well developed. In CIBD models, individuals are distributed across a continuous habitat and mate preferentially with nearby individuals. If an individual is born at a location (x0, y0) then the probability distribution of the location where it leaves its offspring is assumed to be a continuous distribution that is time invariant, identical for every individual, and of the form k(x−x0, y−y0) where k(·, ·) is a bivariate distribution, such as a Gaussian distribution with zero mean, variances σx² and σy², and zero covariance. The standard deviations σx and σy are measures of the single-generation dispersal distance (but see Rousset 2004 for discussion of how this value should be interpreted). If σx = σy, this parameterization implies dispersal is isotropic; in these cases, σ² is often defined such that σ² = σx² = σy². Finally, to avoid the formation of unrealistic spatial clumps of individuals (Felsenstein, 1975), population density is typically constrained to be uniform across the habitat. One approach to imposing density regulation is to assume each individual occupies a single node on a large lattice. The resulting models are parameterized by σ² and ρ, the density of the population, although in most cases only the joint parameter ρσ² is identifiable.
Various methods for estimating dispersal in CIBD models have been previously proposed, many of which are moment-based. Rousset (1997, 2000) estimates the product ρσ² by regressing pairwise estimates of F-statistics on geographic distance. While this method has been shown to be robust to various violations of the model assumptions (Leblois et al., 2003, 2004), one limitation is that linearity of the regression holds only over a limited, intermediate range of geographic distances that is not known prior to the study. Another moment-based estimator is that of Neigel et al. (1991), which estimates σ² by assessing the geographic dispersion of mtDNA haplotypes relative to their estimated time of most recent common ancestor (TMRCA). The method has the advantage of being independent of population density, but it has several drawbacks: it relies on data from only a single non-recombining locus; it fails to account for uncertainty in the estimated TMRCA; and it assumes that a molecular clock holds. Wilkins and Wakeley (2002) provide a coalescent-based, method-of-moments approach applicable to sequence data that uses only average pairwise sequence distances among samples.
More recently, efforts have been made to derive maximum-likelihood estimators of dispersal parameters in CIBD models (Rousset and Leblois, 2007; Meligkotsidou and Fearnhead, 2007). Maximum-likelihood estimators have the advantage of making full use of the data but the potential disadvantage of being difficult to implement. Significant computational challenges exist to implementing likelihood-based methods in population genetics. As a result these methods rely on computational approximations, such as importance sampling, to estimate likelihoods (Stephens and Donnelly, 2000).
Here we propose a novel importance-sampling method for maximum likelihood estimation of dispersal. The proposed method differs from existing likelihood methods in focusing on the geographic distribution of low-frequency alleles only. By restricting the analysis to low-frequency alleles, the computational problem of estimating the likelihood is more tractable than for the full data. The computational gains arise because we need consider only the ancestry of the carriers of the low-frequency allele, rather than the ancestry of the whole sample. Further, the geographic distribution of low-frequency alleles contains most of the information about recent dispersal. Low-frequency alleles are typically descendants of recent mutations and are geographically clustered in the geographic area where the mutation occurred initially. For a given allele frequency, the tightness of this spatial clustering indicates the levels of dispersal. From a coalescent perspective, copies of a low-frequency allele have recent pairwise coalescent times, and previous studies show restricted dispersal has a large effect on the distribution of recent pairwise coalescent times (Wilkins, 2004; Fearnhead, 2007). In addition, by focusing on low-frequency alleles, our method only requires that a population be at demographic equilibrium since the low-frequency alleles in question have arisen. This timescale is much shorter than the time to coalescence of the whole sample (Wiuf, 2000; Slatkin, 2003).
Our method estimates the likelihood of ρσx² and ρσy² for each locus separately and then combines information from independent loci. It can be applied to datasets with large numbers of independent loci such as single nucleotide polymorphism (SNP) datasets. The likelihood framework provides approximate confidence intervals and allows a likelihood ratio test for equal dispersal in both directions.
Methods
Model
We begin by considering a population genetic sample of L independent bi-allelic loci from a diploid population of size N that is distributed over a finite area of size A with constant density ρ = N/A. Let n = (n1, . . . , nL) be the number of chromosomes sampled at each of the L loci and let j = (j1, . . . , jL) be the counts of the derived allele at each of the L loci. Let Xl be a 2×jl matrix containing elements Xldk that represent the coordinate along the dth geographic dimension of the kth copy of the derived allele at locus l, and let X = (X1, . . . ,XL). Further, we assume that all copies of the derived allele are descended from a single, unique mutation event (i.e. all copies of the mutation are identical-by-descent). While we focus on bi-allelic loci in our presentation, our method is also applicable to multi-allelic loci (e.g. loci that fit an infinite-alleles model of mutation), as long as each allele considered is at low frequency and identical-by-descent.
For alleles that are identical-by-descent, the history of the ancestral lineages of the allele can be summarized by a single intra-allelic genealogy for each locus which can be decomposed into two portions. First we specify a set of times describing events on the genealogy with time measured in units of N generations from the present day. Let Tl = (Tl1, . . . Tljl) be a vector of times such that Tl1 is the time at which the first copy of the derived allele arose at locus l and let Tli for i ≥ 2 be the time points in the past at which the number of ancestral lineages in the intra-allelic genealogy decreased from i to i−1 as one looks backwards in time. Let T = (T1,. . . , TL). The second portion of the genealogy is the tree topology describing the lineages involved in each of the jl − 1 coalescent events on the intra-allelic genealogy. Let G = (G1, . . . ,GL) be the graphs describing the tree topology of the intra-allelic genealogy at each of the L loci.
To model dispersal, we consider a model with independent dispersal along two perpendicular geographic axes. We consider the per-generation dispersal distribution along each axis to have a mean of 0, with a variance of σx² along one axis and a variance of σy² along the other. We assume the higher moments of the distribution are well-behaved such that after some small duration of time s (measured in units of N generations) the location of a lineage starting at (x, y) is well approximated by a two-dimensional normal distribution with mean (x, y) and variance-covariance matrix diag(sNσx², sNσy²), i.e. with zero covariance between the axes. The precise time-point at which the approximation becomes valid depends on the timescale of coalescent events in the intra-allelic genealogy. In the extreme case where the allele is so rare that intra-allelic coalescent events occur nearly instantaneously, our assumption implies the dispersal distribution must be exactly normal. By assuming that the geographic location of a lineage follows a two-dimensional normal distribution, we are implicitly assuming that the position of each lineage follows a Brownian motion and that boundary effects are negligible. The lack of boundary effects may be reasonable for low-frequency alleles found centrally within large habitats, because such alleles will likely not have dispersed widely enough to have encountered the boundaries of the habitat.
To denote the unobserved geographic position of each of the single mutation events that gave rise to the first copy of each derived allele, let Zl = (Zl0, Zl1) be the coordinates of the location at which the mutation event occurred for locus l and let Z = (Z1, . . . ,ZL). Due to the constant density of the population, the location at which a mutation occurs is equally likely across the whole habitat, so that the marginal distribution P(Zl = z) equals 1/A for all z in the habitat.
Furthermore, we make the approximation that the intra-allelic genealogy is independent of the geographical configuration of the lineages. This approximation is also used by Neigel et al. (1991) and Meligkotsidou and Fearnhead (2007) and implicitly assumes weak population density regulation. For our purposes, we note that, even in density-regulated populations, the approximation may be more accurate for low-frequency alleles. In panmictic populations, the number of copies of a low-frequency allele evolves approximately as a linear birth-death process (Slatkin and Rannala, 1997), so that each copy of the allele leaves an independent number of descendant copies in the next generation. The extension that we assume here is that because copies of the low-frequency allele reproduce independently of each other, they will also reproduce independently of each other’s geographic configuration. Given this approximation, the distribution of topologies and coalescent times for the intra-allelic genealogy are described by the birth-death results obtained by Slatkin and Rannala (1997) for randomly mating populations. Specifically, the probability of the vector Tl can be found by considering Tl as jl ordered samples from the density h(t), where
| (1) |
This distribution arises from the equations in Slatkin and Rannala (1997) by setting and measuring time in units of N generations (see Appendix for more detail). The density implies that:
$$P(T_l = \{t_1, \ldots, t_{j_l}\}) = j_l! \prod_{i=1}^{j_l} h(t_i) \qquad (2)$$
for all possible values {t1, . . . , tjl}. For the topologies Gl, Slatkin and Rannala’s results provide the following simple distribution which reflects equiprobability of all labelled tree topologies:
$$P(G_l = g) = \frac{2^{j_l - 1}}{j_l!\,(j_l - 1)!} \qquad (3)$$
for all possible g. To refer to the model, we use the acronym BBM, as our model is a type of branching Brownian motion.
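As a small numerical check of the equiprobability statement, the sketch below counts topologies under the assumption that "labelled tree topologies" means ranked labelled histories, in which each of the j − 1 coalescent events joins one of the C(i, 2) pairs among i remaining lineages. The function names are illustrative and are not taken from the authors' software.

```python
from math import comb, factorial


def n_labelled_histories(j: int) -> int:
    """Number of ranked, labelled binary histories for j tips.

    Assumption: each coalescent event among i lineages can join any of
    C(i, 2) pairs, so the count is the product of C(i, 2) for i = 2..j.
    """
    count = 1
    for i in range(2, j + 1):
        count *= comb(i, 2)
    return count


def topology_probability(j: int) -> float:
    """Probability of any one history when all histories are equally likely."""
    return 1.0 / n_labelled_histories(j)


if __name__ == "__main__":
    for j in (3, 6, 9):
        closed_form = factorial(j) * factorial(j - 1) // 2 ** (j - 1)
        print(j, n_labelled_histories(j), closed_form, topology_probability(j))
```

The count grows very quickly with j, which is one reason exhaustive summation over topologies is impractical and a sampling approach is used instead.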
Likelihood-based inference
For performing inference on the model described above, we focus on X, the geographic locations of derived alleles for a set of loci. We are interested in inference on σx² and σy², although we can only infer the value of these parameters jointly with ρ; thus the identifiable parameters of the model are ρσx² and ρσy². We chose to assume the area A was known, and so we instead are only interested in inference on Nσx² and Nσy², knowing that we can convert them to ρσx² and ρσy² using the known value of A (because ρ = N/A). We define θ = (Nσx², Nσy²) and use θ to refer to these parameters succinctly. We are also interested in the special case where σx² = σy² = σ². In this case there is only one identifiable parameter of the model, which we denote as Nσ² or in some cases θ* to be more compact. Finally, there are the unobserved quantities that are crucial components of the probability model. To summarize these “missing data” at each locus we let Ml = (Tl, Gl, Zl).
Using the notation Pθ(·) for the probability of an event given θ, the likelihood can then be written as:
$$L(\theta) = P_\theta(X) = \prod_{l=1}^{L} \int P_\theta(X_l \mid M_l = m)\, P(M_l = m)\, dm \qquad (4)$$
The integration over Ml is intractable analytically for realistic sample sizes because the space of Ml is the set of all possible topologies, all possible vectors of intra-allelic coalescent times, and all possible geographic origins of the derived allele for locus l.
To approximate the integral over Ml, we use a set of approximation techniques. We use a straightforward Monte Carlo approach to integrate over Tl, an importance sampling approach to integrate over Gl, and an approximate analytical integration for Zl. The details of each approach are described in appendix A.
From the perspective of how well the whole approximation algorithm performs, the critical part of the algorithm is the importance sampling (IS) over Gl. Here we propose an IS distribution, P*(Gl), that proceeds by randomly constructing a tree sequentially backwards in time such that topologies that join geographically proximal lineages are favored. Our distribution P*(Gl) takes two parameters, θ0 and H. The parameter θ0 defines a “driving value” of θ, such that the IS distribution will perform best when θ has a value close to θ0. In practice we use a single set of m simulated values from P*(Gl) to evaluate the approximation to the likelihood Pθ(X) across a range of values of θ. This allows for significant computational speed-ups because we only need to simulate from P*(Gl) once in order to calculate a series of points on the likelihood surface around θ0. The parameter H defines the extent to which geographical proximity influences the sampled topologies. H can be thought of as a “heat” parameter in that as its value increases, the entropy of the importance sampling distribution increases. More specifically, a value of H = 1 favors topologies in close proportion to the contribution the topology will make to the calculation of Pθ(Xl|Gl = gi), while larger values sample topologies more uniformly. The use of H is designed so that as H approaches ∞, the importance sampler P*(Gl) will converge on the straightforward Monte Carlo sampler P(Gl). The roles of the H and θ0 parameters are described in more detail in the appendix.
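The role of the heat parameter can be illustrated with a generic tempered proposal. The sketch below is not the authors' P*(Gl), whose exact form is given in their appendix; it assumes a hypothetical Gaussian closeness score for each pair of lineages, raises that score to the power 1/H, and normalizes. The `scale` argument is a stand-in for a variance derived from the driving value θ0. The returned proposal probability is what an importance weight (target probability divided by proposal probability) would need.

```python
import math
import random
from itertools import combinations


def pair_proposal_probs(positions, scale, heat):
    """Probability of proposing each pair of lineages as the next to coalesce.

    Assumption: closeness is scored as exp(-d^2 / (2 * scale)) and then
    tempered by the exponent 1 / heat. Larger heat flattens the proposal.
    """
    pairs = list(combinations(range(len(positions)), 2))
    log_scores = []
    for a, b in pairs:
        dx = positions[a][0] - positions[b][0]
        dy = positions[a][1] - positions[b][1]
        log_scores.append(-(dx * dx + dy * dy) / (2.0 * scale) / heat)
    top = max(log_scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return pairs, [w / total for w in weights]


def sample_pair(positions, scale, heat, rng=random):
    """Draw one pair and return it with its proposal probability."""
    pairs, probs = pair_proposal_probs(positions, scale, heat)
    idx = rng.choices(range(len(pairs)), weights=probs, k=1)[0]
    return pairs[idx], probs[idx]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
    for heat in (1.0, 10.0, 1e9):
        print(heat, pair_proposal_probs(pts, scale=1.0, heat=heat)[1])
```

With a small heat the nearest pair is proposed almost always, and with a very large heat every pair is proposed with nearly equal probability, which mirrors the stated limit in which the sampler approaches uniform sampling of topologies.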
Finally, using the approximations to the likelihood, we employ standard optimization routines from the GNU Scientific Library to maximize the likelihood with respect to θ. We denote the maximum likelihood estimate (MLE) of θ as θ̂ and the associated likelihood as L(θ̂). We also maximize the likelihood for the constrained model where dispersal is isotropic so that σx² = σy². The MLE for the constrained case is denoted as θ̂* with likelihood L(θ̂*). Given L(θ̂) and L(θ̂*), we can compute the likelihood ratio test statistic λ for the null hypothesis that σx² = σy² as λ = 2[log L(θ̂) − log L(θ̂*)].
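A minimal sketch of the isotropy test follows, assuming the usual form λ = 2[log L(θ̂) − log L(θ̂*)] and assuming that λ is compared with a chi-square distribution with one degree of freedom (the unconstrained model has two free parameters and the isotropic model has one). The reference distribution is an assumption of this sketch; the function name is illustrative.

```python
import math


def lrt_isotropy(loglik_full, loglik_isotropic):
    """Likelihood ratio statistic and asymptotic p-value for isotropy.

    Assumption: lambda is referred to a chi-square distribution with
    1 degree of freedom, whose upper tail is erfc(sqrt(lambda / 2)).
    """
    lam = 2.0 * (loglik_full - loglik_isotropic)
    lam = max(lam, 0.0)  # guard against small negative values from Monte Carlo noise
    p_value = math.erfc(math.sqrt(lam / 2.0))
    return lam, p_value


if __name__ == "__main__":
    lam, p = lrt_isotropy(loglik_full=-100.0, loglik_isotropic=-103.0)
    print(f"lambda = {lam:.3f}, p = {p:.4f}")
```

Because both log-likelihoods are themselves Monte Carlo approximations, a borderline p-value is best rechecked with more importance sampling iterations.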
Performance evaluation
To formally evaluate the performance of the likelihood-based inference we take a two-part approach. In both parts we focus mainly on characterizing the sampling distributions of θ̂, θ̂*, and λ because it is the sampling behavior of these statistics that is most relevant to biological applications.
First, we evaluate the performance of our estimation method on data simulated under the same BBM model that is used to define the likelihood function. This step allows us to assess the performance of the algorithm for numerically approximating the likelihood function and producing estimates of θ. Given that the model underlying the likelihood approach is identical to the model simulating the data, we expect that if the algorithm is performing well we will have a well-behaved sampling distribution for the statistics of interest, θ̂, θ̂*, and λ.
The second step is to evaluate how the method performs on data from forward simulations from a model of individuals distributed on a lattice. The performance of the method on these simulations will be a result of how well the numerical approximation to the likelihood function performs as well as how accurately the BBM model that the likelihood function is based on summarizes the behavior of the lattice-based model.
For the performance evaluations, we fix the value of H to 2, unless stated otherwise. We also fix the value of θ0 to twice the value of θ used to simulate data, unless stated otherwise. In practice, both the appropriate values of H and θ0 will depend on the dataset in question (see discussion).
Simulation of the Branching Brownian Motion model
To simulate data under this model, we first fix the number of loci L, the sample sizes per locus n, and the number of derived alleles observed per locus j. We then perform the following steps for each locus l:
1. Draw a topology from the distribution defined by P(Gl) (Equation 3).
2. Draw a vector of times from the distribution defined by P(Tl) (Equation 2). See the “Monte Carlo integration over Tl” section of the appendix for more detail on how to simulate from P(Tl).
3. Assuming the mutation occurs at a geographic location (0, 0), simulate two independent Brownian motions along the intra-allelic genealogy defined by Gl and Tl. The resulting set of geographic locations for the lineages at the present day is stored as a simulated value of Xl.
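The steps above can be sketched for a single locus as follows. This is an illustration under stated assumptions rather than the authors' simulator: the event times are passed in by the caller (a sampler for h(t) is not reproduced here), the topology is generated forward in time by splitting a uniformly chosen lineage at each event, and the displacement over a duration s along each axis is taken to be normal with variance s·Nσx² or s·Nσy². All names are hypothetical.

```python
import math
import random


def simulate_bbm_locus(times, theta_x, theta_y, rng=random):
    """Simulate present-day locations of j copies of a derived allele.

    times   : [T1 > T2 > ... > Tj > 0], in units of N generations before
              the present; T1 is the mutation time and Ti (i >= 2) is the
              time at which the number of lineages changes between i - 1 and i.
    theta_x : N * sigma_x^2, variance accumulated per unit time on axis x.
    theta_y : N * sigma_y^2, variance accumulated per unit time on axis y.
    """
    if times[-1] <= 0 or any(a <= b for a, b in zip(times, times[1:])):
        raise ValueError("times must be strictly decreasing and positive")

    def diffuse(points, duration):
        # Independent Brownian displacement on each axis over `duration`.
        sx = math.sqrt(theta_x * duration)
        sy = math.sqrt(theta_y * duration)
        return [(x + rng.gauss(0.0, sx), y + rng.gauss(0.0, sy)) for x, y in points]

    lineages = [(0.0, 0.0)]  # the mutation arises at the origin
    current = times[0]
    for nxt in list(times[1:]) + [0.0]:
        lineages = diffuse(lineages, current - nxt)
        if nxt > 0.0:
            # Split one lineage chosen uniformly at random; both daughters
            # start at the parent's position.
            parent = rng.randrange(len(lineages))
            lineages.append(lineages[parent])
        current = nxt
    return lineages


if __name__ == "__main__":
    rng = random.Random(42)
    for x, y in simulate_bbm_locus([1.0, 0.5, 0.2], 100.0, 100.0, rng=rng):
        print(f"{x:8.2f} {y:8.2f}")
```

Splitting a uniformly chosen lineage forward in time is used here as a convenient way to obtain random ranked topologies; drawing the topology backwards by joining random pairs would be an equivalent alternative.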
Simulation of alleles in a finite lattice model
In order to test our method using an alternative model of dispersal in a continuously distributed population, we simulated from a lattice model in which one individual is at each point in a large lattice. Although this model is still highly idealized, it includes density regulation of the individuals, yet it is simple enough that we could generate sufficient sample data with which to test our method. An alternative approach for simulating from a model with density regulation is the coalescent-based algorithm of Wilkins and Wakeley (Wilkins and Wakeley, 2002; Wilkins, 2004).
We assumed a square lattice of (2l + 1) × (2l + 1) diploid individuals (where l is an arbitrary non-negative integer), and that each generation consisted of two steps: dispersal of an infinitely large migrant pool, followed by random sampling of alleles from that migrant pool at each lattice point. In each replicate, the population was initially fixed for allele a. Then at t = 0, one of the copies of a in the individual at the center of the lattice mutated to A to create a heterozygote. Then, each copy of A contributed to the migrant pool at the lattice point dx and dy steps away in proportion to a discretized and truncated bivariate normal distribution with mean (0, 0), variances σx² and σy², and 0 covariance. We truncated the dispersal distribution for dx and dy larger than 3σx and 3σy in order to speed the computations. We also simulated dispersal according to a modified double-exponential distribution that has been motivated by seed dispersal data (Clark, 1998), with scale parameter α and shape parameter c. For values of c < 1, the tails of this distribution are not exponentially bounded (i.e. the distribution is “fat-tailed”). For these simulations we truncated the dispersal distribution for x and y larger than 10σx and 10σy.
The frequency of A at location (x, y), px,y, is the sum of the contributions to the migrant pool at that location from all extant copies of A. To create the next generation of adults, we assumed individuals are composed of two alleles independently sampled with the frequency of A allele being px,y.
Each replicate continued until A was either lost or fixed. For each set of replicates, we specified a target number of copies, j. Whenever the number of copies of A was exactly j, the locations of those j copies were recorded. At the end of each replicate in which j copies were found at least once, one of the sets of locations was chosen randomly to be the result for that replicate. Replicates were continued until L replicates were obtained in which j copies of A were found at least once. Then the results for that set of replicates were formatted for analysis by our IS program.
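One generation of the lattice simulation might be sketched as below. Several details are assumptions of this sketch and are not stated in the text: the normal kernel is discretized by evaluating the density at integer offsets and renormalizing, each copy of A contributes half a unit to the local frequency (two alleles per diploid individual), and migrants that would land outside the lattice are simply lost. The names `discretized_kernel`, `next_generation` and `half_width` are hypothetical.

```python
import math
import random


def discretized_kernel(sigma, trunc=3.0):
    """Weights for integer steps d with |d| <= trunc * sigma (normalized to sum to 1)."""
    reach = int(math.floor(trunc * sigma))
    raw = {d: math.exp(-d * d / (2.0 * sigma * sigma)) for d in range(-reach, reach + 1)}
    total = sum(raw.values())
    return {d: w / total for d, w in raw.items()}


def next_generation(copies, half_width, sigma_x, sigma_y, rng=random):
    """Advance the A allele by one generation of dispersal and sampling.

    copies     : dict mapping (x, y) -> number of A copies (1 or 2) at that node,
                 with the lattice centred on (0, 0).
    half_width : lattice spans -half_width..half_width on both axes.
    """
    kx = discretized_kernel(sigma_x)
    ky = discretized_kernel(sigma_y)
    freq = {}
    for (x, y), count in copies.items():
        for dx, wx in kx.items():
            for dy, wy in ky.items():
                tx, ty = x + dx, y + dy
                # Assumption: contributions falling off the lattice are discarded.
                if abs(tx) <= half_width and abs(ty) <= half_width:
                    freq[(tx, ty)] = freq.get((tx, ty), 0.0) + 0.5 * count * wx * wy
    new_copies = {}
    for node, p in freq.items():
        # Each individual draws two alleles independently with frequency p of A.
        n_a = sum(rng.random() < p for _ in range(2))
        if n_a:
            new_copies[node] = n_a
    return new_copies


if __name__ == "__main__":
    rng = random.Random(3)
    state = {(0, 0): 1}  # a single heterozygote at the centre
    for generation in range(5):
        state = next_generation(state, half_width=50, sigma_x=2.0, sigma_y=2.0, rng=rng)
        print(generation + 1, sum(state.values()))
```

A full replicate would repeat this step until A is lost or fixed and record the locations whenever the total number of copies equals the target j, as described above.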
Evaluating a single run of the algorithm
A general property of IS algorithms is that their performance can be evaluated by inspecting the distribution of importance sampling weights (Liu, 2002). In particular the variance of the importance sampling weights is useful because in pathological cases, the weights will vary wildly so that the final approximation will be determined by a few large importance sampling weights. A useful summary statistic based on the variance of the importance sample weights is the effective sample size (ESS). Letting g = (g1, . . . , gm), the ESS can be defined as:
| (5) |
where w(·) is defined in the appendix. The ESS statistic can be interpreted as the effective number of independent samples from the target distribution Pθ(Xl|Gl)P(Gl).
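For illustration, the sketch below computes one widely used form of the effective sample size, the squared sum of the weights divided by the sum of squared weights. This particular form is an assumption of the sketch and may differ in detail from Equation 5; the weights are taken as a plain list of non-negative numbers.

```python
def effective_sample_size(weights):
    """Effective sample size from importance sampling weights.

    Assumption: ESS = (sum of w)^2 / (sum of w^2), one common estimator.
    It equals m when all m weights are equal and approaches 1 when a
    single weight dominates.
    """
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0.0:
        raise ValueError("all weights are zero")
    return total * total / total_sq


if __name__ == "__main__":
    print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))      # equal weights
    print(effective_sample_size([100.0, 0.01, 0.01, 0.01]))  # one dominant weight
```

A low value relative to the number of iterations is the kind of warning sign that suggests rerunning with a different driving value or a larger heat parameter.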
Example application to Arabidopsis thaliana
To provide an example application of the method, we analyzed a dataset representing genetic variation from populations of Arabidopsis thaliana. We use a subset of the data presented in Nordborg et al. (2005). Of the 96 accessions presented by Nordborg et al. (2005) we focus on a subset of 49 accessions from Europe (Supplemental Figure 2). The 49 accessions were chosen by first taking the subset of 76 accessions in Europe and then excluding accessions at random that represented multiple samples from a single geographic locale. As a result, the set of 49 accessions represents 49 unique geographic locations across Europe. This last fact is important for application of our method because our model assumes that individuals are sampled randomly from across the habitat, and obtaining multiple individuals from the same location is unlikely under random spatial sampling.
We next filter the sequence data to obtain sites that are bi-allelic. We assume minor alleles are derived and limit ourselves to a fixed low-frequency range, i.e. each site has 6 copies of the minor allele segregating, which corresponds to an allele frequency of 6/(2 · 49) ≈ 6%. The geographic locations of the low-frequency allele at each locus are used to define X. Here we present the results for a simple data set of 8 loci from chromosome 3, chosen to be well-spaced along the chromosome.
Results
Performance of importance sampling approximation
Across a range of exploratory trial values, we found the IS algorithm decreases the Monte Carlo variance relative to using a straightforward Monte Carlo approach. In most cases the IS algorithm outperforms the Monte Carlo sampler by providing estimates of the likelihood surface that are accurate and suffer from little Monte Carlo sampling error (Figure 1); however in some rare cases the IS algorithm produces a likelihood surface that is a clear outlier from the majority of other IS replicates. Typically these aberrant replicates are recognizable by having low values for the ESS statistic and/or importance sampling weights with means that are not approximately 1. These rare replicates likely reflect cases where the importance sampler samples a rare topology with a very large importance sampling weight and so the resulting approximation to the likelihood is dominated by a single replicate. In most cases, increasing the value of H was found to decrease the occurrence of these aberrant runs, although the reduction comes at the cost of increased Monte Carlo variance among the remaining replicates.
Figure 1. Example of the performance of the importance sampling algorithm relative to the straightforward Monte Carlo sampler.
The left panel shows ten replicate estimates of the θ* likelihood surface using 1000 iterations of the importance sampling algorithm. The center panel shows ten replicate estimates using 3000 iterations of the random sampling algorithm. The right panel shows a close approximation to the true likelihood surface (obtained by 10 million replicates of the Monte Carlo sampler). The test case is a simulated sample from 1 locus with 12 minor alleles observed and with θ = (10⁴, 10⁴). For the importance sampling algorithm, H = 1 and θ0 = (2 × 10⁴, 2 × 10⁴).
Performance on data from the birth-process model
To assess the performance of the importance-sampling-based likelihood method we simulated data under the BBM model that underlies the method. Rather than examine the likelihood surface itself we focus on the properties of the estimators that would be used in an application to data. In particular, we are interested in the performance of the point estimates θ̂ and θ̂*, their associated confidence intervals, and the likelihood ratio test based on λ. As mentioned above, this step allows us to investigate whether there are any obvious deficiencies in the importance sampling algorithm and to gain insight on the performance of likelihood-based inference for this problem.
We found the sampling variance of the point estimates θ̂ and θ̂* decreases as either the number of low-frequency alleles observed per locus, j, or the total number of loci, L, increases (Figure 2). For smaller sample sizes the sampling distributions are skewed towards higher values.
Figure 2. Box-plot summaries of the sampling distribution of Nσ̂².
Summaries are plotted across a range for the number of loci and the number of copies of the minor allele observed, and the true value of Nσ² is indicated by a horizontal dashed line in each panel. (A) Brownian Birth Process results: Each summary is based on the results of applying the importance sampling algorithm with M = 2000, θ0 = (200, 200) and H = 2 to 500 datasets obtained by independent simulations from the birth process model with θ = (100, 100). (B) Lattice-model results: Each summary is based on the results of applying the importance sampling algorithm with M = 20000, θ0 = (102 × 10⁴, 102 × 10⁴) and H = 2 to 500 datasets obtained by independent simulations from a 101 × 101 lattice with σx = σy = 5, such that θ = (51 × 10⁴, 51 × 10⁴).
Despite the positive skew in the sampling distribution, the estimators appear to be unbiased. The mean of the estimators is consistently close to the true values used in the simulation, even for values of L and j that represent small sample sizes (j = 3, L = 3). The lack of bias is especially remarkable given the driving value of θ0 was set to twice the true value of θ used in the simulation. Because Nσx², Nσy², and Nσ² are scale parameters, proportionally similar results are found when simulations are performed with different values of Nσx², Nσy², and Nσ² (e.g. the coefficients of variation for each estimator are constant, results not shown).
The sampling distribution of Nσ̂² has a lower sampling variance than that of either Nσ̂x² or Nσ̂y², particularly for small sample sizes (Figure 2 vs. Supplemental Figure 3). This result is not unexpected because for Nσ̂² the geographic positions of alleles in both dimensions are informative, whereas for Nσ̂x² and Nσ̂y², only the positions of the alleles in a single dimension are informative.
Point estimates of the coverage probabilities for the 2 log-likelihood confidence intervals for Nσx² and Nσy² suggest the confidence intervals are slightly too narrow (e.g. Table 1). Across the conditions we investigated, the average coverage probability is 93.5% for both Nσx² and Nσy². No clear patterns with regard to how the coverage probability changes with the number of loci or number of copies of the low-frequency allele were observed, although asymptotic likelihood theory suggests the coverage probability will approach 95% as the number of loci increases.
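As a sketch of how such an interval could be read off an approximated likelihood surface, the code below sums per-locus log-likelihoods evaluated on a common grid of parameter values and reports the range of grid points lying within 2 log-likelihood units of the maximum. Reading "2 log-likelihood confidence interval" in this way is an assumption of the sketch, as are the function names and the toy numbers, which are made up for illustration.

```python
def combine_loci(per_locus_loglik):
    """Sum log-likelihood curves across independent loci (same grid for every locus)."""
    return [sum(column) for column in zip(*per_locus_loglik)]


def loglik_interval(theta_grid, loglik, drop=2.0):
    """Grid points whose log-likelihood is within `drop` units of the maximum.

    Returns the smallest and largest such grid values. The resolution of
    the interval is limited by the spacing of the grid.
    """
    best = max(loglik)
    inside = [t for t, ll in zip(theta_grid, loglik) if ll >= best - drop]
    return min(inside), max(inside)


if __name__ == "__main__":
    grid = [50.0, 100.0, 150.0, 200.0, 250.0]
    locus_curves = [
        [-6.0, -4.0, -4.5, -5.5, -7.0],  # toy values for locus 1
        [-5.0, -3.0, -3.5, -5.0, -7.0],  # toy values for locus 2
    ]
    total = combine_loci(locus_curves)
    print(total, loglik_interval(grid, total))
```

Adding loci sharpens the combined curve, which narrows the interval; if the point estimate is biased, a narrower interval is more likely to exclude the true value.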
Table 1.
Point estimates for the coverage probabilities of the 2 log-likelihood confidence intervals for .
Columns j = 3 to j = 15 give the number of copies of the allele; L is the number of loci.

| Simulation Model | L | j = 3 | j = 6 | j = 9 | j = 12 | j = 15 |
|---|---|---|---|---|---|---|
| Brownian Birth Process | 5 | 0.920 | 0.944 | 0.948 | 0.942 | 0.942 |
| Brownian Birth Process | 10 | 0.938 | 0.946 | 0.924 | 0.928 | 0.934 |
| Brownian Birth Process | 15 | 0.934 | 0.926 | 0.938 | 0.942 | 0.940 |
| Brownian Birth Process | 20 | 0.932 | 0.932 | 0.916 | 0.930 | 0.930 |
| Lattice Model | 5 | 0.800 | 0.836 | 0.876 | 0.888 | 0.924 |
| Lattice Model | 10 | 0.510 | 0.646 | 0.756 | 0.806 | 0.852 |
| Lattice Model | 15 | 0.356 | 0.524 | 0.658 | 0.732 | 0.802 |
| Lattice Model | 20 | 0.218 | 0.460 | 0.560 | 0.664 | 0.756 |
The coverage probabilities for Nσ² likewise show no clear relationship to L or the number of copies of the low-frequency allele observed at each locus; however, one clear difference is that the coverage probabilities are closer to the nominal value of 95% (Table 2). The average coverage probability across the conditions we investigated was 95.2%.
Table 2.
Point estimates for the coverage probabilities of the 2 log-likelihood confidence intervals for Nσ².
Columns j = 3 to j = 15 give the number of copies of the allele; L is the number of loci.

| Simulation Model | L | j = 3 | j = 6 | j = 9 | j = 12 | j = 15 |
|---|---|---|---|---|---|---|
| Brownian Birth Process | 5 | 0.962 | 0.954 | 0.962 | 0.950 | 0.938 |
| Brownian Birth Process | 10 | 0.970 | 0.976 | 0.960 | 0.938 | 0.954 |
| Brownian Birth Process | 15 | 0.956 | 0.962 | 0.958 | 0.962 | 0.954 |
| Brownian Birth Process | 20 | 0.958 | 0.952 | 0.942 | 0.938 | 0.914 |
| Lattice Model | 5 | 0.766 | 0.804 | 0.864 | 0.856 | 0.910 |
| Lattice Model | 10 | 0.432 | 0.530 | 0.680 | 0.706 | 0.778 |
| Lattice Model | 15 | 0.232 | 0.370 | 0.492 | 0.576 | 0.698 |
| Lattice Model | 20 | 0.142 | 0.272 | 0.424 | 0.518 | 0.634 |
These results show that when we simulate data from the same model underlying our method (the BBM model), we observe reasonable performance. The MLEs have low bias, the confidence intervals show approximately the correct coverage, and the likelihood ratio test is well calibrated with respect to nominal p-values. These results indicate that the importance sampling algorithm for computing the likelihood and subsequent methods for maximizing the likelihood are performing well. However, the BBM model is an approximation to the dynamics of a low-frequency allele that ignores population density regulation. To get a sense of how the method will perform on a model with density regulation, we turn to lattice-based simulations.
Performance on data from the lattice-based model
To assess performance of the method on lattice-based simulations, we again focus on the performance of the point estimates , and , their associated confidence intervals, and the likelihood ratio test based on λ.
As with data from the BBM model, the sampling variance of the point estimates , and decreases as either the number of loci or the number of low frequency allelels observed per locus, increases. When the number of low-frequency alleles is low we again see a skew towards high values. One important distinction is that for the lattice simulations, we generally observe a modest bias. For example, when we simulate alleles arising on a 101 × 101 lattice with , we find that the estimated values of , , and are each biased upwards (Figure 2, Supplemental Figure 2). The bias decreases as the number of copies of the low-frequency allele (j) increases, but it appears to be unaffected by the number of loci (L). One consequence of this bias is that the coverage probabilities of the 95% confidence intervals are poorly behaved - for our simulations, often the lower confidence interval is above the true value of the parameter, resulting in very low coverage of the true value (Tables 1, 2). Coverage increases as j increases, but as L increases the coverage becomes worse. Presumably as L increases one is getting tighter confidence intervals around a central value that is biased, resulting in poorer coverage; whereas when j increases the bias is reduced, hence increasing the coverage probabilities.
To further investigate bias, we conducted experiments to see how habitat size might affect the estimates. Because the BBM does not allow for edge effects, one might expect that as dispersal increases substantially relative to the scale of the habitat, the method might be biased towards underestimating the levels of dispersal. The underestimation would arise because the limited habitat size forces alleles to be more geographically clustered than they would be in an infinite habitat, and the excess geographic clustering would appear to the method like reduced dispersal. Our simulations confirm this behavior (Figure 4A). In contrast, when habitat sizes are large relative to the scale of dispersal, we find the estimates are directly proportional to the true underlying values (with a modest bias upwards, Figure 4B).
Figure 4. Lattice model results: Effect of habitat size on bias.
Each panel shows the distribution of MLEs as a function of Nσ2. (A) Lattice of size 101 × 101. 100 replicates per value of Nσ2. (B) Lattice of size 401 × 401. 25 replicates per value of Nσ2. All simulations had L = 10, j = 9.
We considered whether the driving value θ0 could play a role in the upward bias. In nearly all the previous lattice model simulations the driving value was arbitrarily chosen to be twice the value of θ used to simulate the data (as we did in our simulated data from the BBM model). To assess whether the method is sensitive to this driving value and might bias θ̂ towards θ0, we studied how the distribution of θ̂ changes as a function of θ0 (Supplemental Figure 4). The results show that unless θ0 is much less than θ (i.e. log10(θ0/θ) < −1) the results are unaffected by the choice of θ0. Importantly, even when θ0 = θ (recalling that the ideal choice of θ0 is θ) we find an upward bias. Combined with the observation that the inference for the BBM model was not biased, this suggests the bias is not a result of properties of the importance-sampling approximation for calculating the likelihood.
We next investigated the performance of the likelihood ratio test on the lattice model data (Table 3 and Supplemental Figure 5). We find no clear pattern in how the false positive rate depends on L and j, but overall the false positive rates are close to the nominal significance level of 0.05 (the average false positive rate across our simulation conditions was 0.044). As in the simulations of the BBM model, the power of the method is proportional to and the power increases with j.
Table 3.
Point estimates for the false positive rate of the asymptotic likelihood-ratio test of isotropic dispersal. Columns give the number of copies of the allele (j); rows give the number of loci (L) for each simulation model.
| Simulation Model | L | j = 3 | j = 6 | j = 9 | j = 12 | j = 15 |
|---|---|---|---|---|---|---|
| Brownian Birth Process | 5 | 0.044 | 0.054 | 0.062 | 0.042 | 0.056 |
| Brownian Birth Process | 10 | 0.068 | 0.070 | 0.060 | 0.052 | 0.050 |
| Brownian Birth Process | 15 | 0.072 | 0.054 | 0.044 | 0.060 | 0.054 |
| Brownian Birth Process | 20 | 0.060 | 0.046 | 0.058 | 0.048 | 0.040 |
| Lattice Model | 5 | 0.054 | 0.054 | 0.056 | 0.052 | 0.028 |
| Lattice Model | 10 | 0.058 | 0.054 | 0.018 | 0.036 | 0.028 |
| Lattice Model | 15 | 0.072 | 0.046 | 0.040 | 0.048 | 0.038 |
| Lattice Model | 20 | 0.071 | 0.070 | 0.038 | 0.056 | 0.046 |
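As a rough illustration of how a false positive rate of this kind can be tallied, the sketch below converts a pair of maximized log-likelihoods into an asymptotic p-value and counts rejections at the 0.05 level. It assumes that the test statistic is twice the log-likelihood difference and that the unconstrained model has exactly one more free parameter than the isotropic model, so a chi-square distribution with one degree of freedom applies; the function names and inputs are hypothetical.

```python
import math

def lrt_p_value(loglik_null: float, loglik_alt: float) -> float:
    """Asymptotic p-value for a likelihood-ratio test with one extra free
    parameter under the alternative (assumed chi-square, 1 df)."""
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    # Chi-square survival function with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

def false_positive_rate(p_values, alpha: float = 0.05) -> float:
    """Fraction of null-model replicates rejected at level alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

# Hypothetical maximized log-likelihoods (null, alternative) for three replicates.
replicates = [(-120.4, -119.9), (-98.2, -95.7), (-101.0, -100.8)]
p_values = [lrt_p_value(l0, l1) for l0, l1 in replicates]
print(false_positive_rate(p_values))
```

For a well-calibrated test applied to data simulated under the null hypothesis, the returned fraction should be close to `alpha`, which is the comparison the table reports.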
We also considered how the method performs for data generated from an alternative dispersal distribution, with more or less kurtosis than the discretized normal distribution. We used the modified double-exponential distribution, in which the parameter c determines the kurtosis of the distribution. For reference, a normal distribution has a kurtosis value of 3, more fat-tailed (leptokurtic) distributions have higher values, and thinner-tailed (platykurtic) distributions have lower values. Our results show that as kurtosis increases, the estimates decrease, maintaining an upward bias, until the distribution becomes very leptokurtic (c < 0.75, kurtosis > 12), at which point the estimates have a downward bias (Figure 5).
Figure 5. Lattice model results: Effect of kurtosis on bias.
The distribution of MLEs is plotted as a function of the kurtosis observed in the modified double-exponential dispersal distribution described in the text. Simulations used c = (0.5, 0.75, 1, 1.5, 2, 4) to produce the 6 different levels of kurtosis, and in each case α was chosen to result in a constant standard deviation of 5. Kurtosis was calculated based on the distribution produced after discretization and truncation (see main text). The lattice size was 101 × 101 so that the true underlying value of Nσ2 = 51 × 10^4. 25 replicates per value of c. All simulations had L = 10, j = 9.
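The kurtosis of a discretized, truncated dispersal kernel can be computed directly from its probability mass function. The sketch below assumes a kernel with weights proportional to exp(−|k/α|^c) on integer displacements k; this functional form, the truncation point, and the function names are assumptions made for illustration and may differ from the exact definition used in the simulations.

```python
import math

def discretized_kernel(alpha: float, c: float, max_step: int):
    """Probabilities on integer displacements -max_step..max_step with
    weights proportional to exp(-|k / alpha| ** c) (assumed kernel form)."""
    steps = list(range(-max_step, max_step + 1))
    weights = [math.exp(-(abs(k / alpha) ** c)) for k in steps]
    total = sum(weights)
    return steps, [w / total for w in weights]

def kurtosis(steps, probs) -> float:
    """Fourth central moment divided by the squared variance."""
    mean = sum(k * p for k, p in zip(steps, probs))
    m2 = sum((k - mean) ** 2 * p for k, p in zip(steps, probs))
    m4 = sum((k - mean) ** 4 * p for k, p in zip(steps, probs))
    return m4 / m2 ** 2

# With c = 2 and alpha = 5 * sqrt(2) the kernel is a discretized normal with sd near 5.
steps, probs = discretized_kernel(alpha=5 * math.sqrt(2), c=2.0, max_step=50)
print(kurtosis(steps, probs))
```

Note that this sketch does not solve for the value of α that holds the standard deviation at 5 for other values of c; that step would need a separate root-finding pass.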
Finally, we note that for the lattice model one can convert estimates of Nσ2 to estimates of the “neighborhood size”, 4πρσ2 (Wright, 1943), by dividing by the total area of the population (A) and multiplying by 4π. For example, suppose we obtain an estimate of 6 × 10^6 for Nσ2 on a 200 × 200 lattice. In this case the neighborhood size would be estimated as 1885 individuals.
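The conversion to neighborhood size is a one-line calculation, sketched here with the numbers from the example above. The function name is a hypothetical helper; the formula is the one stated in the text (divide by the area A, multiply by 4π).

```python
import math

def neighborhood_size(n_sigma2: float, area: float) -> float:
    """Neighborhood size 4 * pi * rho * sigma^2, where rho * sigma^2 = (N * sigma^2) / A."""
    return 4.0 * math.pi * n_sigma2 / area

# Estimate of N * sigma^2 = 6e6 on a 200 x 200 lattice (A = 40,000).
print(round(neighborhood_size(6e6, 200 * 200)))  # 1885
```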
Application to Arabidopsis thaliana
In the sample Arabidopsis dataset, the copies of the minor allele are distributed across a spatial area of hundreds of kilometers in each dimension (Supplemental Figure 6), although the exact extent varies considerably from locus to locus. Locus 147 and Locus 7014 have the most compact distributions, while Locus 2123 has the most widespread. The variability from locus to locus in the distribution of the minor allele translates to high variability in the likelihood surfaces for Nσ2 at each locus (Figure 6). The curvature of each individual likelihood surface and the variability of the likelihood surface among loci make it clear that precise estimation of Nσ2 is difficult using single loci. The joint likelihood curve (Figure 7) has a much narrower confidence interval ([3.1 × 10^6, 11.27 × 10^6]) than any individual locus and an MLE of 5.9 × 10^6. Assuming N = 50,000, the corresponding value of σ would be roughly 10 km. The joint likelihood surface for the two axis-specific dispersal parameters (Figure 8) shows that the MLE lies close to the line along which dispersal is equal on both axes, and in turn there is no significant support for rejecting the null hypothesis of isotropic dispersal.
Figure 6. The approximate log-likelihood curves for Nσ2 for each of the eight loci in the example dataset.
The log-likelihood curves are translated so that the maximum value across the range of 2.5 × 10^6 km2 to 2.5 × 10^7 km2 is positioned at 0 on the y-axis. In some cases the MLE lies outside of this range (e.g., Locus 147 and Locus 7014).
Figure 7. The joint likelihood curve across all loci for Nσ2.
The horizontal line touches the curve at its maximum value, and the two vertical lines demarcate the support (confidence) interval within two log-likelihood units of the maximum.
Figure 8. The joint likelihood surface across all loci for and .
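A support interval of the kind marked in Figure 7 can be read off a log-likelihood curve evaluated on a grid. The sketch below is a minimal illustration that assumes a single-peaked curve and a drop of two log-likelihood units; the function name and the quadratic example curve are hypothetical, not the Arabidopsis likelihood.

```python
def support_interval(grid, loglik, drop: float = 2.0):
    """Return (lower, upper, mle) for grid points whose log-likelihood is
    within `drop` units of the maximum. Assumes a unimodal curve."""
    best = max(range(len(grid)), key=lambda i: loglik[i])
    threshold = loglik[best] - drop
    inside = [x for x, ll in zip(grid, loglik) if ll >= threshold]
    return min(inside), max(inside), grid[best]

# Hypothetical quadratic log-likelihood with its maximum at 5.0.
grid = [i / 10 for i in range(0, 101)]
loglik = [-((x - 5.0) ** 2) for x in grid]
print(support_interval(grid, loglik))
```

If either returned bound coincides with the edge of the grid, the true limit may lie outside the evaluated range, as can happen for single loci whose MLE falls outside the plotted range.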
Discussion
The importance-sampling approach taken here works well for approximating the likelihood of , , and Nσ2 in the BBM model. Estimates of each parameter are unbiased, confidence intervals for each parameter have roughly the correct coverage probabilities, and the likelihood ratio test of isotropy has the expected false positive rate. These patterns hold even for small datasets, which is particularly noteworthy since the statistical properties of likelihood estimators are assured only as sample sizes become large. In addition, these patterns were observed across a large number of simulated datasets that were analyzed in batch without any special adjustments to the parameters that govern the implementation of the method (H, θ0, and M). The computations also proceed quickly (for a single locus with 9 copies of the low-frequency allele, 20,000 iterations of the importance sampling algorithm complete in a few seconds on a 3 GHz processor with 16 GB RAM). The favorable performance of the inference method under these settings is evidence that the technical challenges of approximating the likelihood function of the BBM and finding its maxima are overcome by the importance-sampling algorithm used here.
Despite the robust performance of the importance-sampling algorithm observed in the BBM simulations, Monte Carlo techniques such as importance sampling should always be used with care. The appropriate settings for H, θ0, and M will necessarily depend on the dataset in question. Here we found that H = 2 works well, and we found similar results for values of H between 2 and 10 (unpublished results; values > 10 were not tested). For θ0, we showed that the choice is not crucial as long as its value is not drastically smaller than the underlying θ for the dataset (Supplemental Figure 5). Because θ is not known a priori, a reasonable approach suggested by Supplemental Figure 5 is to run the algorithm iteratively, using as θ0 the θ̂ value from the previous run, until the estimate θ̂ does not change. Finally, the larger the value of M, the better the approximation will be, and it is useful to re-run the algorithm several times. The ESS statistic will indicate whether the estimates are suffering from large Monte Carlo sampling error because the value of M is too small.
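The two practical checks described above, monitoring the effective sample size and iterating the driving value, can be sketched as follows. The ESS formula used here is the common (Σw)²/Σw² definition for importance weights, which is assumed rather than taken from the text, and `estimate` is a hypothetical callable standing in for one full run of the importance sampler that returns θ̂ for a given θ0.

```python
def effective_sample_size(weights) -> float:
    """Common ESS diagnostic for importance weights: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq

def iterate_driving_value(estimate, theta0: float, rel_tol: float = 0.01,
                          max_rounds: int = 20) -> float:
    """Re-run the (hypothetical) estimator, feeding each estimate back in as
    the next driving value, until successive values agree within rel_tol."""
    for _ in range(max_rounds):
        theta_hat = estimate(theta0)
        if abs(theta_hat - theta0) <= rel_tol * abs(theta_hat):
            break
        theta0 = theta_hat
    return theta_hat

# Stand-in estimator whose output depends weakly on theta0 (illustration only).
print(iterate_driving_value(lambda t0: 0.5 * t0 + 5.0, theta0=1.0))
print(effective_sample_size([0.2, 0.3, 0.1, 0.4]))
```

An ESS far below the number of importance samples M would suggest that a few samples dominate the average and that M should be increased.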
To assess the effects of model mis-specification we also simulated data from a lattice model of dispersal. We found our method still performs well, in that the estimated parameters are proportional to their true values, especially for large habitat sizes where boundary effects are minimized (Figure 4). However, we do find a general upward bias: inferred average dispersal distances tend to be larger than the true values. Given the lack of bias for data simulated under the BBM model, we conclude that the bias is due to discrepancies between the dynamics in the lattice and BBM models, and not to the importance-sampling algorithm itself. The bias appears to be caused by the fact that low-frequency mutations are dispersing slightly farther than expected given the branch lengths of the intra-allelic genealogy. Or, to restate the same conclusion, given the geographic locations, the branch lengths of the intra-allelic genealogy tend to be somewhat shorter than expected under the linear birth-death model.
The observation of upwardly biased estimates of dispersal in a CIBD model is not unique to our study. Meligkotsidou and Fearnhead (2007) observe a similar bias, and both methods use the same approximation; namely, that the probability distribution for events in the genealogical process is independent of the locations of the sampled lineages. This approximation implicitly ignores population density regulation, which is a key feature of the lattice model. Coalescent-based models that explicitly include population density regulation (Wilkins and Wakeley, 2002; Wilkins, 2004; Barton and Depaulis, 2002) are much more challenging computationally, and further work is necessary before they can be adapted for likelihood-based inference. A further consideration is that, in practice, the extent of density regulation may vary across species. For example, Meligkotsidou and Fearnhead (2007) found that the performance of their estimator improved in data derived from populations that have been expanding, and we would expect a similar pattern to hold for our estimator.
We also assessed the effects of habitat size and the kurtosis of the dispersal distribution. If the habitat size is small relative to Nθσ2, dispersal parameters will be under-estimated. This problem arises because of boundary effects, whereby alleles are being constrained in how far they can disperse, an aspect which is not captured by our model. In practice this suggests that loci with low-frequency alleles that are found near the edges of a habitat should be excluded from datasets prior to applying our approach, or at least analyzed separately with the intention of quantifying boundary effects. We also found that if the dispersal distribution is strongly leptokurtic, dispersal parameters may be under-estimated. This perhaps occurs because the long-range migration events that make a dispersal distribution fat-tailed are unlikely to have been sampled in the time since a low-frequency allele arose by mutation. As a result, low-frequency alleles are more clumped than they would be if dispersal distances were normally distributed. Although the inference algorithm could in principle be based on a different dispersal distribution, that would increase the computational cost because our approach exploits properties of the normal distribution for the peeling algorithm (see Appendix).
Given these considerations, we can tentatively conclude that barring boundary effects and fat-tailed dispersal distributions, our method will often over-estimate the dispersal parameter, and thus it provides an upper bound for the average dispersal distances. There are many complicating factors that we have not addressed, including mis-specification of the derived allele, density-dependent dispersal, directional bias in dispersal, and non-random spatial sampling, that will affect the results from using our method, and its true accuracy cannot be established without detailed modeling of dispersal in a study species. Nevertheless, our method does provide a computationally feasible approach for estimation based on the dispersal of low frequency, relatively young mutations, and those estimates are at least of the right order of magnitude for the models of dispersal considered here.
Supplementary Material
Figure 3. Power of the asymptotic likelihood-ratio test to detect departures from the null hypothesis that .
The results are based on inference performed on simulated data from the Brownian birth process model where was fixed at 100 and was varied across the values 10, 50, 75, 100, 125, 200, 1000. All simulations were fixed at L = 20. Power is estimated from 500 replicate simulations. Similar power curves are found for the lattice model simulations (Supplemental Figure 3).
References
- Barton NH, Depaulis F. Neutral evolution in spatially continuous populations. Theor Popul Biol. 2002;61:31–48. doi: 10.1006/tpbi.2001.1557.
- Beerli P, Felsenstein J. Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. PNAS. 2001;98:4563–4568. doi: 10.1073/pnas.081068098.
- Clark JS. Why trees migrate so fast: Confronting theory with dispersal biology and the paleorecord. American Naturalist. 1998;152:204–224. doi: 10.1086/286162.
- Excoffier L, Smouse PE, Quattro JM. Analysis of molecular variance inferred from metric distances among DNA haplotypes - application to human mitochondrial-DNA restriction data. Genetics. 1992;131:479–491. doi: 10.1093/genetics/131.2.479.
- Fearnhead P. On the choice of genetic distance in spatial-genetic studies. Genetics. 2007;177:427–434. doi: 10.1534/genetics.107.072538.
- Felsenstein J. Maximum-likelihood estimation of evolutionary trees from continuous characters. Am J Hum Genet. 1973;25:471–492.
- Felsenstein J. A pain in the torus: some difficulties with models of isolation by distance. American Naturalist. 1975;109:359–368.
- Felsenstein J. Evolutionary trees from gene frequencies and quantitative characters: Finding maximum likelihood estimates. Evolution. 1981;35:1229–1242. doi: 10.1111/j.1558-5646.1981.tb04991.x.
- Hey J, Nielsen R. Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proc Natl Acad Sci U S A. 2007;104:2785–2790. doi: 10.1073/pnas.0611164104.
- Iorio MD, Griffiths RC, Leblois R, Rousset F. Stepwise mutation likelihood computation by sequential importance sampling in subdivided population models. Theoretical Population Biology. 2005;68:41–53. doi: 10.1016/j.tpb.2005.02.001.
- Leblois R, Estoup A, Rousset F. Influence of mutational and sampling factors on the estimation of demographic parameters in a “continuous” population under isolation by distance. Molecular Biology and Evolution. 2003;20:491–502. doi: 10.1093/molbev/msg034.
- Leblois R, Rousset F, Estoup A. Influence of spatial and temporal heterogeneities on the estimation of demographic parameters in a continuous population using individual microsatellite data. Genetics. 2004;166:1081–1092. doi: 10.1534/genetics.166.2.1081.
- Liu JS. Monte Carlo Strategies in Scientific Computing. Springer; 2002.
- Meligkotsidou L, Fearnhead P. Postprocessing of genealogical trees. Genetics. 2007;177:347–358. doi: 10.1534/genetics.107.071910.
- Neigel JE, Ball RM, Avise JC. Estimation of single generation migration distances from geographic-variation in animal mitochondrial-DNA. Evolution. 1991;45:423–432. doi: 10.1111/j.1558-5646.1991.tb04415.x.
- Nielsen R, Wakeley J. Distinguishing migration from isolation: A Markov chain Monte Carlo approach. Genetics. 2001;158:885–896. doi: 10.1093/genetics/158.2.885.
- Nordborg M, Hu TT, Ishino Y, Jhaveri J, Toomajian C, et al. The pattern of polymorphism in Arabidopsis thaliana. PLoS Biol. 2005;3:e196. doi: 10.1371/journal.pbio.0030196.
- Rannala B, Hartigan JA. Identity by descent in island-mainland populations. Genetics. 1995;139:429–437. doi: 10.1093/genetics/139.1.429.
- Rousset F. Genetic differentiation and estimation of gene flow from F-statistics under isolation by distance. Genetics. 1997;145:1219–1228. doi: 10.1093/genetics/145.4.1219.
- Rousset F. Genetic differentiation between individuals. Journal of Evolutionary Biology. 2000;13:58–62.
- Rousset F. Genetic Structure and Selection in Subdivided Populations (MPB-40) (Monographs in Population Biology). Princeton University Press; 2004.
- Rousset F, Leblois R. Likelihood and approximate likelihood analyses of genetic structure in a linear habitat: performance and robustness to model mis-specification. Mol Biol Evol. 2007;24:2730–2745. doi: 10.1093/molbev/msm206.
- Slatkin M. Gene flow and the geographic structure of natural populations. Science. 1987;236:787–792. doi: 10.1126/science.3576198.
- Slatkin M. A vectorized method of importance sampling with applications to models of mutation and migration. Theor Popul Biol. 2002;62:339–348. doi: 10.1016/s0040-5809(02)00007-2.
- Slatkin M. The Age of Alleles. In: Slatkin M, Veuille M, editors. Modern Developments in Theoretical Population Genetics: The Legacy of Gustave Malecot. Oxford University Press; 2003.
- Slatkin M, Maddison WP. A cladistic measure of gene flow inferred from the phylogenies of alleles. Genetics. 1989;123:603–613. doi: 10.1093/genetics/123.3.603.
- Slatkin M, Rannala B. Estimating the age of alleles by use of intraallelic variability. American Journal of Human Genetics. 1997;60:447–458.
- Stephens M, Donnelly P. Inference in molecular population genetics. Journal of the Royal Statistical Society Series B Statistical Methodology. 2000;62:605–635.
- Tufto J, Engen S, Hindar K. Inferring patterns of migration from gene frequencies under equilibrium conditions. Genetics. 1996;144:1911–1921. doi: 10.1093/genetics/144.4.1911.
- Wilkins JF. A separation-of-timescales approach to the coalescent in a continuous population. Genetics. 2004;168:2227–2244. doi: 10.1534/genetics.103.022830.
- Wilkins JF, Wakeley J. The coalescent in a continuous, finite, linear population. Genetics. 2002;161:873–888. doi: 10.1093/genetics/161.2.873.
- Wiuf C. On the genealogy of a sample of neutral rare alleles. Theoretical Population Biology. 2000;58:61–75. doi: 10.1006/tpbi.2000.1469.
- Wright S. Isolation by distance. Genetics. 1943;28:114–138. doi: 10.1093/genetics/28.2.114.