Abstract
Phylogeographic inference allows reconstruction of past geographical spread of pathogens or living organisms by integrating genetic and geographic data. A popular model in continuous phylogeography—with location data provided in the form of latitude and longitude coordinates—describes spread as a Brownian motion (Brownian Motion Phylogeography, BMP) in continuous space and time, akin to similar models of continuous trait evolution. Here, we show that reconstructions using this model can be strongly affected by sampling biases, such as the lack of sampling from certain areas. As an attempt to reduce the effects of sampling bias on BMP, we consider the addition of sequence-free samples from under-sampled areas. While this approach alleviates the effects of sampling bias, in most scenarios this will not be a viable option due to the need for prior knowledge of an outbreak’s spatial distribution. We therefore consider an alternative model, the spatial Λ-Fleming-Viot process (ΛFV), which has recently gained popularity in population genetics. Despite the ΛFV’s robustness to sampling biases, we find that the different assumptions of the ΛFV and BMP models result in different applicabilities, with the ΛFV being more appropriate for scenarios of endemic spread, and BMP being more appropriate for recent outbreaks or colonizations.
Author summary
Phylogeography studies past location and migration using information from current geographic locations of genetic sequences. For example, phylogeography can be used to reconstruct the history of geographical spread of an outbreak using the genetic sequences of the pathogen collected at different times and locations. Here, we investigate the effects of different model assumptions on phylogeographic inference. In particular, we examine the effects of the strategy used to collect samples. We show that sample collection biases can have a strong impact on the quality of phylogeographic reconstruction: geographically biased sampling scheme can be very detrimental for popular continuous phylogeography models. We consider different ways to counter these effects, from utilising alternative phylogeographic models, to the inclusion of partially informative samples (known cases without genetic sequences). While these strategies do alleviate the effects of sampling biases, they also lead to considerable additional computational burden. We also investigate the intrinsic differences of different phylogeographic models, and their effects on reconstructed patterns in different scenarios.
1 Introduction
Genetic data can be very informative of migration histories and spatial patterns of living organisms, and of geographic spread of outbreaks, in particular when combined with information regarding present and past geographic ranges. Phylogeography combines genetic and geographic data to study geographical spread; in the context of geographic spread of outbreaks, which we will focus on in this manuscript, phylogeography often interprets observed genetic sequences as the result of sequence evolution along an evolutionary phylogenetic tree (see [1]), while modeling spatial spread as a separate evolutionary process along the same phylogeny (see e.g. [2–8]).
In recent years, Bayesian phylogeographic inference has gained remarkable popularity, in large part due to convenient implementations such as in the Bayesian phylogenetic inference software package BEAST [9, 10]. Bayesian phylogeography in BEAST allows users to investigate past geographical spread using genetic sequences possibly collected at different times. Genetic data is integrated with geographical and temporal sampling information, and all data is interpreted jointly in terms of evolution along a phylogenetic tree with heterochronous leaves [6, 7, 11–15]. BEAST uses Markov chain Monte Carlo (MCMC) to efficiently sample from the joint parameter space—which can also include parameters related to demographic reconstruction and phenotypic trait evolution—and in doing so, accurately accounts for uncertainty in phylogeny and model parameters, and possibly uncertainty in sampling time and location.
Bayesian phylogeographic approaches in BEAST can be divided into two categories depending on the type of geographical data: discrete space phylogeography and continuous space phylogeography. Discrete space phylogeography is typically used when samples are clustered based on their geographic location; this is appropriate when spread within a geographical unit is more or less free, while spread between units is hindered by geographical or political barriers (such as bodies of water, mountain chains, national borders, etc). In this case, the geographical data for a collected sample consists of a discrete geographical unit (e.g. a country). Oftentimes, the use of discrete phylogeography is one of necessity, e.g. when only the country of origin of the collected samples is known. Evolution of this location over time (e.g. spread between countries) along the phylogeny is usually modeled using a continuous-time Markov chain (see [6, 11]), similarly to popular phylogenetic models of sequence evolution (see [1]).
On the other hand, when the longitudinal and latitudinal coordinates of the samples are known, and when spread is assumed to happen more or less in a geographically homogeneous way over some area (such as on one island, or within one continent), continuous space phylogeography is often employed as an alternative to binning samples into discrete locations, which can be biasing [16]. In continuous phylogeography one typically models geographical spread along the branches of the tree as a Brownian motion process, which can be thought of as consisting of many small movements in random directions over many short time intervals (see [7, 14]). The results of continuous phylogeography can subsequently be used to determine factors causing non-homogeneous spatial spread through space [17, 18].
A problem of discrete space phylogeography is that sampling bias (samples not being collected across locations proportionally to their prevalence) can strongly affect statistical inference [13]. Unbiased sampling can be very hard to achieve, as it requires knowledge of the full geographic range of an outbreak, access to the whole of this range, and extensive sampling and sequencing efforts. An alternative is to use models that are not affected, or less affected, by sampling biases, such as the structured coalescent and its approximations (see [12, 13, 15], although note that these can be adversely affected by unsampled or unknown demes [19–21]). The structured coalescent model, however, is far more computationally demanding than classical discrete space phylogeography and can differ from it on several aspects other than sampling assumptions. For example, the structured coalescent assumes that the migratory process and the distribution of cases across locations are at equilibrium, but these assumptions are rarely met in practice and do not match outbreaks that recently expanded into new areas.
Here, we investigate the effect of sampling biases in continuous space phylogeography. We show that sampling only certain areas of an outbreak can result in strongly inaccurate inference of dispersal history and the related model parameters. A possible alternative to the Brownian motion phylogeography (“BMP”) model used in continuous space phylogeography is the spatial Λ-Fleming-Viot process (“ΛFV”) recently introduced in population genetics (see [16, 22–26]). The ΛFV addresses, among other things, the undesirable equilibrium properties of classical models of geographic spread [27]. The ΛFV represents an alternative to the BMP, which would be expected to be mostly robust to sampling bias. We here show that the BMP and the ΛFV are non-interchangeable models, which are suitable for very different evolutionary scenarios. We also investigate the use of “sequence-free” samples (samples without genetic information) as a means to correct or help diagnose the effects of sampling biases on BMP.
2 Materials and methods
We assume that N samples s1, …, sN have been collected, and each sample si is associated with a genetic sequence Si, a collection time ti, and a location of collection . Location li is made up of longitude and latitude , and represents the location of the sample at the time ti of collection. Sequence Si represents the genome (or part of the genome) of the sample, and usually provides most of the phylogenetic information. We assume that the phylogenetic tree τ is a time-stamped phylogeny, where the dates of the tips are known (corresponding to the collection times ti) and can differ from each other; branch lengths are represented in units of time.
Our main focus is to infer the history of geographical spread, represented in particular here by the reconstruction of the location of the root node of τ, and to infer the parameters of the migration process itself. We use two models to simulate and infer the migration process in continuous space: Brownian motion phylogeography (BMP) and the spatial Λ-Fleming-Viot process (ΛFV). Below we describe both models in detail.
2.1 Brownian motion phylogeography (BMP)
BMP assumes that changes in location happen along branches of τ according to a time-homogeneous Brownian (Wiener) diffusion process [28, 29]. Given any branch b of length t in τ, and assuming that we know the location l = (l(1), l(2)) of the parent node of this branch, then, under the BMP, the distribution of potential locations of the child node of b is centered on l and is multivariate normally distributed with variance tP−1, where P is the precision matrix of the BMP. In other words, conditional on the parent node of b being in position l, the location of the descendant node of the branch has distribution . We assume that the precision matrix P is the same for all branches, and has three free parameters: two marginal precisions, and the correlation coefficient between dimensions. These parameters describe respectively how fast spatial movement happens in each dimension and how correlated the movements are in the two dimensions. For simplicity, we assume no changes in diffusion rates across branches, although we recognize that variation in diffusion rates is important in many real-life scenarios [7].
Under the BMP, the posterior probability of a set of parameters τ (the phylogeny), Θ (the parameters describing sequence evolution along τ), and P (the precision matrix of the BMP) conditional on the data t1, …, tN, S1, …, SN, l1, …, lN is:
(1) |
This means that, given τ and P, the migratory history (and therefore the observed locations) is independent of genetic data and evolution. Similarly, given τ and Θ, sequence evolution (and therefore observed sequences) is independent of geographic data and migratory process. It is usually not feasible to calculate the probability of the data (known as the marginal likelihood, or the normalizing constant), P(t1, …, tN, S1, …, SN, l1, …, lN), which appears in the denominator above. Instead, BEAST employs MCMC to obtain samples from the posterior density of model parameters without the need to calculate this probability. The terms in the numerator are:
the prior P(P) on the precision matrix P (usually a Wishart distribution [7]), and the prior P(Θ) on the substitution model and parameters Θ. For P(Θ), many choices are possible, depending on prior information available regarding the mutational process, and the models considered [30].
the tree prior P(τ, t1, …, tN) which represents the prior probability of observing a given tree and sampling times. Possible priors can be based on birth-death models [31] or coalescent models [32] (note however that for coalescent priors a different notation from Eq 1 is required, conditioning all probabilities on the sampling times t1, …, tN).
the classical phylogenetic likelihood P(S1, …, SN|τ, t1, …, tN, Θ) that depends on a specific substitution model and parameters Θ and that can be calculated using Felsenstein’s pruning algorithm [33].
the geographic likelihood P(l1, …, lN|τ, t1, …, tN, P) is the probability of the geographic locations given the precision matrix P and tree. This can be efficiently calculated by integrating out the location of internal tree nodes, similarly to Felsenstein’s pruning algorithm but for a continuous trait [14, 34]. Some approaches opt for Gibbs sampling the ancestral node locations, for example in the work of [7]; in such cases, the notation of Eq 1 needs to be slightly modified.
There are a number of features that distinguish the BMP from the ΛFV presented in the next section, which are important to keep in mind. In Fig 1A we give a graphical representation of the BMP, and we here provide a short summary of the features of the model:
BMP assumes that the prior probability of the tree τ is not affected by the migration process P. Note however that the posterior probability of the tree might instead be very much affected by the geographic migration model and parameters.
BMP normally does not assume boundaries on possible geographic locations, so sample and ancestral node coordinates can be anywhere in the considered space (including in bodies of water, for example). Prior ancestral root locations can also be specified, see [35], and a normal prior distribution over root location is typically assumed, see [7]).
BMP does not assume that the density of the overall population of cases over space and over time is uniform or at equilibrium, and does not aim to describe, at least explicitly, the migratory and reproductive dynamics of the whole population, but only of the ancestral lineages of the considered samples. It assumes instead that there is no interaction among cases (for example, limited resources or susceptible individuals within one area), so that different lineages evolve and spread independently of each other no matter how close they are in geographic space.
in BMP, sampling locations are considered a result of pathogen spread, and not an arbitrary choice of the investigator. As such, sampling locations, even in the absence of genetic sequences, can be very informative about the process of geographic spread, as it is assumed that sampling locations are representative of the geographic range of the pathogen. This also means that absence of samples from certain areas will be interpreted by the model as evidence of absence of cases from such areas. In practice, if the sampling process is dependent on geography, for example when cases from some areas are more likely to be sampled than cases from other areas, then the inference under BMP can be affected, as we show below. This should not necessarily be considered a negative aspect of the model: if there is no sampling bias, then considering sampling locations as informative of the process of geographic spread can increase the inference power of the model.
Currently, no backward-in-time descriptions of the BMP exist; such a description of a dual process of the BMP could be useful for performing BMP inference while avoiding assumptions about (and therefore biases from) the sampling process.
2.2 Spatial Λ-Fleming-Viot process (ΛFV)
The ΛFV can be used to model migration and evolution of individuals within a population distributed across an area. The geographical area A under consideration is usually a torus (as in the simulator discsim [26]), or a rectangle (as in the phylogeographic inference software PhyREX [16]). Migration is only allowed from and into A, potentially representing, for example, the case of an island or a continental mass. Individuals of the population are assumed to be spread over A with uniform density ρ. Migration and reproduction of individuals are modeled through reproduction-extinction events (from now on, just “events”) which happen at rate λ over time. Each event ei happening at time ti is centered at a location ci taken at random uniformly from A. Individuals in the population are affected by the event according to their distance from ci. For example, in discsim all individuals within a radius r around ci are affected, while in PhyREX individuals are affected with a probability that decreases with their distance from ci (specifically, according to a normal kernel with variance θ2). Individuals affected by ei then die with probability μ, and new individuals are born around ci. In the case of disc events (as in discsim), new individuals are born uniformly within the event disc with density ρμ. In the case of normal kernel events, as in PhyREX, new individuals are similarly placed so to leave the population distribution uniform. Lastly, one (or more in case of recombination [26, 36]) parents for all the individuals born at ei are chosen, again with a probability that decreases as a function of the distance from ci (again, either uniformly on a disc as in discsim or with a normal kernel as in PhyREX, for example).
While the ΛFV is very different from the BMP, some aspects of the two models can be compared. For example, for narrow event kernels (i.e. small θ), the position of one lineage along one dimension after time t is approximately normally distributed with variance tσ2, where σ2 = 4πθ4λμ/|A| and |A| is the area of A [16], and at the limit of very small θ and very large λ, the movements of individuals approach a Brownian motion with diffusion rate σ2. Similarly, in the case of disc events of radius r, the mean per-dimension diffusion rate approaches (see S1 Text).
Despite the fact that for small and frequent events individuals might move almost in a Brownian motion, there are still significant differences between the ΛFV and the BMP. The posterior probability of ΛFV model parameters (which we collectively represent as Λ), of a certain history E of events E = {e1, …, e|E|}, of tree τ, and of substitution model parameters Θ is:
(2) |
Similarly to the BMP, samples from the joint posterior density of model parameters can be obtained using MCMC, as is done by PhyREX. The terms in the numerator are:
the prior P(Λ) on the ΛFV model parameters, and the prior P(Θ) on the substitution model parameters Θ.
the classical phylogenetic likelihood P(S1, …, SN|τ, t1, …, tN, Θ), as in the BMP.
the likelihood of the history of events, and the ancestry and ancestral locations of the samples P(τ, E|Λ, t1, …, tN, l1, …, lN), which can be computed following [16].
In Fig 1B we give a graphical representation of the ΛFV. From Eq 2 and the description of the model, a number of differences with the BMP can be noted, of which we again provide a summary here:
in the ΛFV, the probability of a tree τ can be affected by the spatial dynamics of the model.
the ΛFV is defined over a finite space, and is hence more appropriate at describing migration within a limited area (such as an island or continent).
the ΛFV assumes that the spatial density of the population is homogeneous and at equilibrium. This means that the model describes the case where resources are homogeneously spread across the environment, and the pathogen or species is endemic within an area (this excludes recent colonizations or recent outbreaks where the pathogen has not yet spread across the whole area).
calculating the likelihood of the ΛFV, at least in implementations proposed so far [16, 37], requires the explicit parameterization of individual events. This means that inference under this model is typically going to be more computationally demanding than inference under the BMP, except for scenarios with very few events.
the ΛFV always conditions on sampling times and sampling locations (see Eq 2). This is because, while the population is assumed homogeneously distributed through time and space, the sampling process is assumed to be arbitrary and not reflective or related to the density of the population or the migratory history. As such, the ΛFV should not be affected by any sampling bias.
the ΛFV model has a backward-in-time dual process [16, 26]. This process describes the distribution of past events given data collected later on (see term P(τ, E|l1, …, lN, t1, …, tN) above), thereby naturally accommodating possible spatial sampling biases.
3 Results
3.1 Sampling biases in BMP
To investigate the effect of sampling bias on BMP, we simulated evolution and migration under the same BMP model used for inference, and tested different sampling scenarios. We simulated a Yule phylogenetic tree with birth rate 1.0, and stopped the simulations when 1000 tips were generated. Genetic sequences were assumed 10kb long, and we simulated their evolution using an HKY model (κ = 3 and uniform nucleotide frequencies) and a substitution rate of 0.01 per unit time, ensuring reasonable levels of genetic diversity to allow reliable phylogenetic inference. Trees and sequences were simulated using DendroPy [38]. Using a custom python script, we simulated migration along the tree under the BMP model with two independent dimensions each with diffusion rate equal to 1 unit of square distance per time unit, and we always placed the root in (0.0, 0.0).
Of the 1000 tips in the total tree (representing all the cases in the considered outbreak), we sampled 50 tips (representing the samples collected and sequenced) under four different strategies to simulate different types of sampling bias:
in the first scenario (“Random Sampling”), samples were collected independently of their location, and as such no bias is expected and there is no model mis-specification;
in the second scenario (“Central Sampling”), the closest samples to the source of the outbreak (0.0, 0.0) were collected;
in the third scenario (“Diagonal Sampling”), the samples closest to the x = y diagonal were collected;
in the fourth and last scenario (“One-Sided Sampling”), the samples with the highest X coordinate (the most eastern samples) were collected.
The sampling biases we simulate here can be considered extreme, and may not represent real sampling scenarios, but they showcase the effects that different types of sampling biases can have on phylogeographic inference.
We used BEAST v1.10.4 to perform inference under the classical BMP model [7], assuming the default priors in BEAUti. During inference we did not restrict the two diffusion processes in the two dimensions to be independent or of equal rate, and inferred the correlation in the two diffusion processes and their rates. During both inference and simulations we assumed a constant rate migration process (see [7]). We ran the MCMC for 107 steps and sampled the posterior every 1000 steps, which was sufficient to reach convergence (ESS much higher than 200, checked using Tracer [39]). We ran 100 simulated replicates, and we analysed each replicate four times according to the four sampling scenarios above. Under these four sampling scenarios, we find at least moderate correlation between samples’ geographic distances and genetic distances: averages over 100 simulations of 0.192 for random sampling, 0.042 for central sampling, 0.176 for diagonal sampling, and 0.260 for one-sided sampling. This suggests that, in all scenarios, at least a moderate amount of signal to estimate geographic spread is present in the generated data.
We found that the sampling strategy affects root location inference using BMP (Fig 2A and 2D and S1 Fig). In the absence of sampling bias, inference appears accurate (unbiased and calibrated, Fig 2A). With central or diagonal sampling bias, the uncertainty and error of root location is further reduced (S1(C)–S1(F) Fig), but this probably reflects the fact that samples were collected close to the true origin of the simulated outbreak. When collecting samples at one extreme end of the outbreak, instead, we found that root location inference is strongly biased, with posterior distributions usually not containing the true simulated origin locations (Fig 2D).
The effects of sampling bias on the inference of BMP migration parameters are even more noticeable (Fig 2B, 2C, 2E and 2F and S2 Fig). While inference of diffusion rate with no sampling bias is correct and calibrated (Fig 2B), in every biased sampling scenario it is underestimated. In particular, with central and diagonal sampling the posterior distributions usually do not contain the true value (Fig 2E and S2 Fig). The reason for this is probably that the small sampled range (compared to the actual range of the outbreak) is interpreted as evidence of a small outbreak range, and therefore as low diffusion rate (absence of samples in an area interpreted as absence of cases). In the case of diagonal sampling, BMP also infers a strong correlation between the migration processes in the two dimensions (the true value of 0 covariance is never covered by the posterior distributions, Fig 2F).
To test the effects of tree uncertainty and sequence data, we also ran inference under the scenario that the simulated tree is perfectly known, representing the case in which sufficient genetic information is available so that there is negligible uncertainty in tree inference. We provided no input alignment and specified no phylogenetic likelihood or substitution model, but instead fixed the tree to the simulated one and removed all transition kernels in BEAST that affect the tree. In this case, our analyses required substantially fewer MCMC steps (104), with parameters sampled every 10 steps. We found virtually identical results as those presented in Fig 2 (S3, S4 and S5 Figs).
We also performed an additional set of simulations under the One-Sided Sampling scenario, but this time with different levels of proportions of biased samples, so to investigate the effects of the sampling bias intensity. We simulated a phylogeny of 10,000 cases under BMP and we sampled 100 of them under 5 different intensities of one-sided sampling bias. At 100% intensity, only samples with the highest X coordinate are collected. At 75% intensity, we collected 75 samples with the highest X coordinate and 25 at random. Similarly for 50% and 25% intensities. At 0% intensities, 100 samples are collected uniformly at random. We find that at 75% bias intensity already a large part of the inference bias alreay disappears (S6 Fig). Further lowering the intensity of the bias reduces inference bias, but with diminishing returns. It is surprising to see that, at intermediate sampling bias intensities, diffusion rate on the X dimension is not underestimated (as at 100% bias intensity) but is rather slightly overestimated.
3.2 Compensating the effects of sampling biases using sequence-free samples
The biases shown above originate from the fact that the BMP assumes that samples are collected independently of location, and so the absence of samples from an area is evidence—for the BMP—of absence of cases in that area. Here, we explore the possibility of compensating for the effects of sampling bias in BMP by adding “sequence-free” samples to the analyses. This is representative of the case, for example, that we know that an outbreak has spread into a location, and we know the time and place of some of the cases in that location, but we cannot collect or sequence samples from those cases; so, some of the samples will be “proper”, that is, will encompass genetic sequences, while the other “sequence-free” samples will have sampling location and time, but no genetic sequence (see also [40, 41]).
To recreate this scenario, we used the 100 Yule trees simulated before. As before, from each simulation, we considered 50 tips sampled according to the four sampling scenarios, representing “proper” samples with genetic sequence. Then, we selected another 50 sequence-free tips randomly (and independently of location) from the remaining 950 tips. These other 50 sequences were added to the BEAST analyses (for a total of 100 samples per replicate) without sequence data (or, more precisely, with uninformative sequences made only of gap characters “-”) but with correct sampling location and date.
Adding these extra sequence-free samples greatly reduces the effects of sampling biases, but does not eliminate them (Fig 2G–2I, S7 and S8 Figs). To further investigate how many extra sequence-free samples would be needed to compensate for sampling biases, we simulated a One-Sided Sampling scenario with a reduced number of cases (100, so to make it computationally feasible to add most of them as sequence-free samples) and with 20 samples. We then perform inference either with no added sequence-free samples, or with a number of extra sequence-free samples between 20 and 80. Adding 80 extra samples means considering all cases simulated during inference, but only considering the sequences of the 20 biased samples. We find that, in this scenario, adding 20 extra sequence-free samples removes most of the biases, and adding more than 40 extra sequence-free samples in this scenario brings no clear advantage (S9 Fig) while it considerably increases computational demands. This suggests that, while a considerable number of sequence-free samples is needed to compensate for sampling bias, it generally seems not necessary to add as many samples as to make the geographical sampling unbiased.
3.3 Can the ΛFV correct the sampling bias in BMP?
As mentioned before, the ΛFV has a number of differences from the BMP. One of these differences is that the model does not assume that the sampled locations are representative of the range of the outbreak, but instead the model assumes uniform density of cases over a considered, limited space. For this reason, the ΛFV should not be affected by sampling bias (see “Models” Section). We performed inference using the software PhyREX within the package PhyML v3.3.20190909 [16] downloaded on 4th of January 2020 from https://github.com/stephaneguindon/phyml.git. PhyREX implements the ΛFV model on a rectangular space. We fixed the tree to the simulated true one to greatly reduce the parameter space to be explored and to consider the case in which tree uncertainty is negligible (for example due to abundant genetic data). We used PhyREX to infer the diffusion rate (σ2) of the migration process (see S1 Text) and the migration histories, together with the other parameters of the ΛFV model. We ran each PhyREX replicate analysis for 1 week or a maximum of 2 × 108 MCMC steps, sampling every 2000 steps. This seemed generally sufficient to reach convergence in all scenarios and most replicates; however, we note that achieving convergence under the ΛFV was considerably harder than under the BMP, probably due to the larger number of free parameters and the considerable uncertainty in their values. In 57 out of 600 replicate runs of PhyREX, at least one of the considered parameters had an effective sample size (ESS) below 100. We show below results both including and excluding these non-converged replicates.
First, we considered the same exact 1000-tips simulated Yule trees as described above, with BMP migration, four different sampling bias scenarios and 50 collected tips. The ΛFV model used for inference might now be very different from the BMP model used for simulations, and so model mis-specification could have a considerable impact. One important difference is that the BMP has no spatial boundaries by default, while the ΛFV is defined over a finite space. In PhyREX, we define the geographical space (where the migration process takes place) to be a square with dimension double the maximum coordinate of any simulated outbreak case, and centered in (0.0, 0.0), so that all simulated samples are contained within the considered square. We find results from PhyREX to be very different from those in BEAST. Credible intervals of the root location are now much broader, and always contain the truth (Fig 3A, S10 and S11 Figs). On the other hand, the diffusion rate is highly overestimated, up to hundreds of times, and the corresponding posterior distributions usually do not contain the truth (Fig 3B, S12 and S13 Figs). The large uncertainty in the root location is probably caused by the fact that the ΛFV model uses less information than the BMP (by not assuming that sampling locations are representative of prevalence) and is less affected by sampling bias; however, the high inferred diffusion rates suggest that model mis-specification also plays a strong role in these analyses. We found that setting a prior on the radius parameter so as to mimic BMP (i.e., migration events preferentially taking place over short distances) can reduce this bias.
To further investigate the differences between the BMP and ΛFV models, we simulated trees and migration under the ΛFV model implemented in discsim [26]. The ΛFV models in discsim and PhyREX differ in some aspects. One difference is that discsim assumes that death and recolonization events happen uniformly over discs, while PhyREX uses normal distribution kernels. Another difference is that discsim assumes that migration happens on a torus, while PhyREX uses a rectangle (no migration outside the rectangle allowed, representing, for example, the edges of a continent or island). In discsim, we always assume a torus of length and width L = 100, and in PhyREX we run inference assuming a square space with the same dimensions. We simulated discs of radius r = 0.1, impact u = 0.1, and event rate ; these parameters were chosen so that migration histories are composed of many small migration events, therefore approximating a Brownian motion, with diffusion rate per dimension approximately σ2 = 1.0 (see S1 Text).
We consider two sampling strategies:
“wide sampling”, where 100 samples are collected uniformly at random from the central square of dimensions 50 × 50.
“narrow sampling”, where 100 samples are collected uniformly at random from a central square of dimensions 10 × 10.
In both sampling strategies, differently from BMP simulations, the diffusion rate was not overestimated by PhyREX (Fig 3F, S14 and S15 Figs). Root location inference in PhyREX is accurate, but posterior intervals usually span most of the simulated geographical range (Fig 3E, S16 and S17 Figs).
When we run BEAST inference on the discsim simulations, BMP seems to consistently underestimate the diffusion rate σ2 (Fig 3D and S18 Fig). While usually containing the true values, posterior distributions of root locations are even broader than those inferred by PhyREX, and, in particular, broader than the allowed geographical range (Fig 3C and S19 Fig).
These results suggest that the large discrepancies between the simulations under BMP and inference in PhyREX are due to model mis-specification and the inherent differences between the BMP and ΛFV models. In BMP simulations, the very high diffusion rate inferred by PhyREX is likely because the ΛFV model would usually assume that ancestral lineages traverse the considered geographical space several times, backward in time, before finally coalescing, at least in the limit of small and frequent events. The BMP, instead, not assuming endemicity but a rapid spread from an original location, expects shorter distances traveled before lineages find a common ancestor backward in time.
This seems, conversely, also the most plausible reason why the BMP infers low diffusion rate in ΛFV simulations. It seems harder instead to explain why root location posterior distributions inferred by the BMP are broader than those inferred with the ΛFV in ΛFV simulations, while the opposite is true for BMP simulations. A possible reason is that, because the ΛFV assumes a finite space, inferred root locations have to be contained within this space, even if, as typical, lineages are inferred to travel, backward in time, long distances before reaching the root. Under the BMP, in contrast, geographical space is unlimited, and in ΛFV simulations the simulated tree is very long, suggesting long traveled distances from the root to the tips, and therefore high uncertainty in root locations, which more than offsets the effect of sample locations being concentrated inside the ΛFV finite space of interest.
3.4 Analysis of a West Nile Virus outbreak
To showcase the importance of these observations with respect to practical epidemiological and phylodynamic investigations, we consider a dataset from a recent West Nile Virus outbreak in North America [14]. We choose this particular dataset due to availability of the data and of clear instructions on how to repeat the published analyses in BEAST https://beast.community/workshop_continuous_diffusion_wnv (accessed on August 2019), reducing the chances of errors on our part. As described in the tutorial, we include sampling time, sampling location, and genetic sequence data for each sample. We use a separate HKY model for each of the three codon positions, but assume no variation in substitution rates across codons, and we assumed an uncorrelated relaxed molecular clock model [42] with an underlying lognormal distribution. As the tree prior, we employ an exponential growth coalescent model. We assume homogeneous Brownian motion along tree branches.
To investigate the possible effects of sampling bias, we consider two datasets: the first including all samples, and the second including only the western-most half of the samples. This second scenario artificially recreates sampling bias, such as the case where only cases from one half of the country are accessible or considered. This sampling scenario might seem extreme, but it’s not uncommon for phylogeographic studies to focus, for example, on a single country or area within a continent, see e.g. [43, 44]. We consider the inference of the location of the root (MRCA) of the western half of the samples. The posterior densities of this same ancestor in the two analyses is very different: when using only western samples, this phylogenetic node is confidently placed in western USA, but when using the whole dataset this same node is confidently placed in the eastern USA instead (Fig 4). Another difference between the two analyses is that when restricting to just the western samples diffusion was inferred to be slower (95% HPD interval [166, 284] km/yr versus [339, 498] in the full analysis).
If we adopt a less extreme degree of sampling bias, for example including 5 or 10 eastern samples in addition to the western ones, we see that, similarly to our simulation results, most of the inference bias is removed (S20 and S21 Figs).
Next, we wanted to see whether, in this scenario, including some sequence-free samples from the eastern side of the country could help in the scenario of biased sampling. To do so, we ran an analysis of the 52 western samples with additionally the 52 eastern samples added as sequence-free samples. These sequence-free eastern samples were included with correct location and sampling time data but without sequence data. In this analysis, the inferred location of the considered node (the MRCA of the western samples) is now shifted eastward, but it is still very different from the inferred location of the same node from the full analysis (Fig 4). It is remarkable that in this dataset, unlike in our simulations, the addition of sequence-free samples does not seem to alleviate the effects of sampling bias very much. One possible explanation for this observation is that, unlike in our BMP simulations, in this case the outbreak seems to migrate westward as time progresses [14], a feature that sequence-free samples are insufficient, in this case, to capture, and that a more specific model might be able to address [45]. This is also hinted at by the fact that performing the same analyses as above but removing the western samples from the full dataset instead of the eastern ones shows almost no effects of the artificially introduced sampling bias (S22 Fig).
Analysing the same datasets with PhyREX also shows different estimates after removing the eastern samples, although this time there is considerable overlap between the different ancestral location estimates and different diffusion rate estimates (Fig 5, S23, S24 and S25 Figs). In principle we would not expect to see considerable differences for different subsampling schemes since the ΛFV model should be robust to sampling biases, as shown in our simulations. This further supports the hypothesis that the progressive westward shift of the outbreak plays a major role in the apparent strong effects of sampling bias in this case. A noticeable difference between BEAST and PhyREX results, also observed in simulations, is that the inferred uncertainty in ancestral location is much larger in PhyREX than in BEAST.
3.5 Analysis of a Yellow Fever Virus outbreak
As a second example of real world epidemic analysis, we considered a recent dataset of Yellow Fever Virus (YFV) from Brazil [46]. 65 YFV genomes were collected between January and April 2017, mostly from the Brazilian state of Minas Gerais. Again, we chose this dataset due to availability of data and instructions for repeating the analysis https://beast.community/workshop_continuous_diffusion_yfv (accessed on August 2019). Following the tutorial, we used the same substitution model as for the West Nile Virus dataset, a skygrid coalescent [47] tree prior with 36 grid points, and a Cauchy relaxed random walk model [7].
When recreating sampling bias along a north-south gradient, we find little impact from removing southern samples (S26 Fig), while directional sampling bias seems to have considerable effect in BEAST analyses when removing northern samples (S27 Fig); this bias is greatly reduced by introducing sequence-free samples. PhyREX inference seems, expectedly, mostly unaffected by sampling bias, and shows much broader posterior distributions for ancestral locations (S28 and S29 Figs).
We also observed that many samples of this dataset were collected from few locations: six from Ladainha, five from Novo Cruzeiro, seven from Teófilo Otoni and five from Itambacuri. So, in a second alternative sub-sampling strategy, we reduced the maximum number of samples from any of these locations to two. As before, we aim to artificially recreate different sampling scenarios. We find that, after downsampling, the origin of the outbreak is not inferred anymore to be solely nearby Teófilo Otoni, but also possibly south, close to another cluster of samples near Caratinga (Fig 6 and S30 Fig). A third possible, but low-probability area remains near Belo Horizonte, close to the phylogenetic outgroup location.
These results further suggest that the decision of where to collect samples and which samples to include or exclude from a BMP analysis can significantly impact its conclusions, and that great care should be taken to make sure that the range of samples collected and their density reflect real geographic distributions.
4 Conclusion
We have shown that continuous space phylogeographic inference can be negatively affected by sampling biases, such as sampling efforts being focused in certain areas over others. These biases can lead to strongly mis-inferred ancestral node locations, up to completely excluding the true origin of outbreaks with complete confidence. These biases also usually lead to underestimating the dispersal velocity of pathogens, and can in some cases lead to inference of artificial patterns of correlated spread across space dimensions.
We explored possible ways to tackle these issues. A possible approach is to include sequence-free samples, which correspond to known cases (for which we know date and location) which have no corresponding genetic information. We find that sequence-free samples can considerably improve inference and compensate sampling biases, but that in most scenarios it would be computationally unfeasible or unrealistic to completely eliminate the effects of sampling biases, if possible at all. We confirm these results on real datasets from West Nile Virus and Yellow Fever Virus outbreaks by artificially recreating scenarios of sampling bias.
As an alternative, we investigated the use of an inference model that is in theory not affected by sampling biases: the ΛFV implemented in PhyREX. Similarly to the structured coalescent, in fact, the ΛFV model conditions on sampling locations, and so should not be adversely affected when samples are not collected proportionally to prevalence; note however that the ΛFV might be affected by geographical biases in the case where the true geographical range of the organism being studied is not known, but we don’t focus on this aspect here. We confirm that indeed this model is seemingly unaffected by different sampling strategies, but, more importantly, the model is also very different from the BMP, resulting in very different estimates. The assumptions and applicability of these two models being so different, we would expect few scenarios of common applicability. The BMP, in fact, well-describes the spread of outbreak within a new, unlimited environment, or at least within an area that is large compared to the current range of the outbreak. For example, in BMP simulations, lineages generally spread out from the original source and move in all directions, on average spreading further away from the origin as time progresses. The ΛFV, instead, fits better a scenario where an outbreak (or any population) has become endemic within an area, or at least where lineages are expected to have migrated across the area since their introduction. For example, in ΛFV simulations lineages usually tend to cross the considered geographic space several times before they all find a common ancestor. One practical consequence of this is that inference under the ΛFV of possible root locations can be, on the same data, much broader than the same inference under the BMP. Similarly, diffusion rates estimated under the ΛFV can be much higher than those estimated under the BMP. It is possible, however, that, in some scenarios or with some modifications, these two models would more substantially overlap in applicability. An example could be when restricting the allowed geographic range within the BMP to a limited space, that is, not allowing BMP migration outside of a confined area. In fact, we suspect that simulating migration under such a version of the BMP, and simulating a long phylogeny (in terms of distance traveled from the root before samples are collected) would lead to patterns very close to those simulated under the ΛFV. Another possibility is that considering samples collected at different points in time would reduce uncertainty in the inference under both models.
In the future, it would be of great interest to make the BMP robust to sampling biases by conditioning the BMP geographical likelihood on the location of collected samples, or to consider alternative, possibly approximate models that share this property, which is a topic we are currently working on. While in this manuscript we have only considered a simple model of migration, that is with no directional bias and no variation in diffusion rate over time, location or lineages, it will be interesting in the future to investigate how the relaxation of these assumptions [7, 45] would impact the results presented here. Another issue that we think would be important to address in the future is the elevated computational demand and slow convergence of inference under the ΛFV model; further work on this model, for example in the form of efficient approximations, could greatly improve its applicability in practical scenarios. While the simulation scenarios considered here are quite extreme in terms of sampling biases, it will be important to consider the degree and type of sampling biases in real life outbreaks (see e.g. [48]) in future research.
In conclusion, we report that often the choice of model and of sampling strategy has dramatic effects on the results of a continuous phylogeographic analysis. We therefore recommend attention be paid when deciding a sampling strategy for BMP so that the range and distribution of collected samples would reflect the geographical distribution of the outbreak as much as possible. We also recommend an appropriate phylogeographic model to be used, depending on the history and range of the considered outbreak.
Supporting information
Acknowledgments
We thank the developers of the script newick.py (https://github.com/tyjo/newick.py), of which we used an adaptation in our simulations.
Data Availability
All scripts and data used for simulations and analyses can be found at https://github.com/NicolaDM/Phylogeography.
Funding Statement
AK and YS were supported by the Cambridge Mathematics Placements (CMP, https://www.maths.cam.ac.uk/opportunities/careers-for-mathematicians/summer-research-mathematics/summer-research-mathematics-cmp-and-research-cms). GB was supported by the Interne Fondsen KU Leuven / Internal Funds KU Leuven (https://www.kuleuven.be/english/research/support/if) under grant agreement C14/18/094, and from the Research Foundation -- Flanders (`Fonds voor Wetenschappelijk Onderzoek -- Vlaanderen’, https://www.fwo.be/G0E1420N). SG was supported by the Agence Nationale pour la Recherche https://anr.fr/ through the grant GENOSPACE. AK, UP, YS, NG and NDM were supported by the European Molecular Biology Laboratory. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Yang Z, Rannala B. Molecular phylogenetics: principles and practice. Nature reviews genetics. 2012;13(5):303 10.1038/nrg3186 [DOI] [PubMed] [Google Scholar]
- 2. Schluter D, Price T, Mooers AØ, Ludwig D. Likelihood of ancestor states in adaptive radiation. Evolution. 1997;51(6):1699–1711. 10.1111/j.1558-5646.1997.tb05095.x [DOI] [PubMed] [Google Scholar]
- 3. Lemmon AR, Lemmon EM. A likelihood framework for estimating phylogeographic history on a continuous landscape. Systematic biology. 2008;57(4):544–561. 10.1080/10635150802304761 [DOI] [PubMed] [Google Scholar]
- 4. Ree RH, Moore BR, Webb CO, Donoghue MJ. A likelihood framework for inferring the evolution of geographic range on phylogenetic trees. Evolution. 2005;59(11):2299–2311. 10.1554/05-172.1 [DOI] [PubMed] [Google Scholar]
- 5. Ree RH, Smith SA. Maximum likelihood inference of geographic range evolution by dispersal, local extinction, and cladogenesis. Systematic biology. 2008;57(1):4–14. 10.1080/10635150701883881 [DOI] [PubMed] [Google Scholar]
- 6. Lemey P, Rambaut A, Drummond AJ, Suchard MA. Bayesian phylogeography finds its roots. PLoS computational biology. 2009;5(9):e1000520 10.1371/journal.pcbi.1000520 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Lemey P, Rambaut A, Welch JJ, Suchard MA. Phylogeography takes a relaxed random walk in continuous space and time. Molecular biology and evolution. 2010;27(8):1877–1885. 10.1093/molbev/msq067 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Landis MJ, Matzke NJ, Moore BR, Huelsenbeck JP. Bayesian analysis of biogeography when the number of areas is large. Systematic biology. 2013;62(6):789–804. 10.1093/sysbio/syt040 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Suchard MA, Lemey P, Baele G, Ayres DL, Drummond AJ, Rambaut A. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus evolution. 2018;4(1):vey016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Bouckaert R, Vaughan TG, Barido-Sottani J, Duchêne S, Fourment M, Gavryushkina A, et al. BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis. PLoS computational biology. 2019;15(4):e1006650 10.1371/journal.pcbi.1006650 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Lemey P, Rambaut A, Bedford T, Faria N, Bielejec F, Baele G, et al. Unifying viral genetics and human transportation data to predict the global transmission dynamics of human influenza H3N2. PLoS pathogens. 2014;10(2):e1003932 10.1371/journal.ppat.1003932 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Vaughan TG, Kühnert D, Popinga A, Welch D, Drummond AJ. Efficient Bayesian inference under the structured coalescent. Bioinformatics. 2014;30(16):2272–2279. 10.1093/bioinformatics/btu201 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. De Maio N, Wu CH, O’Reilly KM, Wilson D. New routes to phylogeography: a Bayesian structured coalescent approximation. PLoS genetics. 2015;11(8):e1005421 10.1371/journal.pgen.1005421 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Pybus OG, Suchard MA, Lemey P, Bernardin FJ, Rambaut A, Crawford FW, et al. Unifying the spatial epidemiology and molecular evolution of emerging epidemics. Proceedings of the national academy of sciences. 2012;109(37):15066–15071. 10.1073/pnas.1206598109 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Müller NF, Rasmussen DA, Stadler T. The structured coalescent and its approximations. Molecular biology and evolution. 2017;34(11):2970–2981. 10.1093/molbev/msx186 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Guindon S, Guo H, Welch D. Demographic inference under the coalescent in a spatial continuum. Theoretical population biology. 2016;111:43–50. 10.1016/j.tpb.2016.05.002 [DOI] [PubMed] [Google Scholar]
- 17. Dellicour S, Rose R, Faria NR, Vieira LFP, Bourhy H, Gilbert M, et al. Using viral gene sequences to compare and explain the heterogeneous spatial dynamics of virus epidemics. Molecular biology and evolution. 2017;34(10):2563–2571. 10.1093/molbev/msx176 [DOI] [PubMed] [Google Scholar]
- 18. Dellicour S, Troupin C, Jahanbakhsh F, Salama A, Massoudi S, Moghaddam MK, et al. Using phylogeographic approaches to analyse the dispersal history, velocity and direction of viral lineages?Application to rabies virus spread in Iran. Molecular ecology. 2019;28(18):4335–4350. 10.1111/mec.15222 [DOI] [PubMed] [Google Scholar]
- 19. Beerli P. Effect of unsampled populations on the estimation of population sizes and migration rates between sampled populations. Molecular ecology. 2004;13(4):827–836. 10.1111/j.1365-294X.2004.02101.x [DOI] [PubMed] [Google Scholar]
- 20. Slatkin M. Seeing ghosts: the effect of unsampled populations on migration rates estimated for sampled populations. Molecular ecology. 2005;14(1):67–73. [DOI] [PubMed] [Google Scholar]
- 21. Ewing G, Rodrigo A. Estimating population parameters using the structured serial coalescent with Bayesian MCMC inference when some demes are hidden. Evolutionary Bioinformatics. 2006;2:117693430600200026. [PMC free article] [PubMed] [Google Scholar]
- 22. Etheridge A. Drift, draft and structure: some mathematical models of evolution. Banach center publications. 2008;1(80):121–144. [Google Scholar]
- 23. Berestycki N, Etheridge A, Hutzenthaler M. Survival, extinction and ergodicity in a spatially continuous population model. Markov process related fields. 2009;15(3):265–288. [Google Scholar]
- 24. Barton N, Etheridge A, Véber A, et al. A new model for evolution in a spatial continuum. Electronic journal of probability. 2010;15:162–216. 10.1214/EJP.v15-741 [DOI] [Google Scholar]
- 25. Barton NH, Kelleher J, Etheridge AM. A new model for extinction and recolonization in two dimensions: quantifying phylogeography. Evolution: International journal of organic evolution. 2010;64(9):2701–2715. 10.1111/j.1558-5646.2010.01019.x [DOI] [PubMed] [Google Scholar]
- 26. Kelleher J, Etheridge A, Barton NH. Coalescent simulation in continuous space: Algorithms for large neighbourhood size. Theoretical population biology. 2014;95:13–23. 10.1016/j.tpb.2014.05.001 [DOI] [PubMed] [Google Scholar]
- 27. Felsenstein J. A pain in the torus: some difficulties with models of isolation by distance. The american naturalist. 1975;109(967):359–368. 10.1086/283003 [DOI] [Google Scholar]
- 28. Brown R. XXVII. A brief account of microscopical observations made in the months of June, July and August 1827, on the particles contained in the pollen of plants; and on the general existence of active molecules in organic and inorganic bodies. The philosophical magazine. 1828;4(21):161–173. 10.1080/14786442808674769 [DOI] [Google Scholar]
- 29. Cavalli-Sforza LL, Edwards AW. Phylogenetic analysis: models and estimation procedures. Evolution. 1967;21(3):550–570. 10.1111/j.1558-5646.1967.tb03411.x [DOI] [PubMed] [Google Scholar]
- 30. Baele G, Li WLS, Drummond AJ, Suchard MA, Lemey P. Accurate model selection of relaxed molecular clocks in Bayesian phylogenetics. Molecular biology and evolution. 2012;30(2):239–243. 10.1093/molbev/mss243 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Stadler T. Sampling-through-time in birth–death trees. Journal of theoretical biology. 2010;267(3):396–404. 10.1016/j.jtbi.2010.09.010 [DOI] [PubMed] [Google Scholar]
- 32. Minin VN, Bloomquist EW, Suchard MA. Smooth skyride through a rough skyline: Bayesian coalescent-based inference of population dynamics. Molecular biology and evolution. 2008;25(7):1459–1471. 10.1093/molbev/msn090 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of molecular evolution. 1981;17(6):368–376. 10.1007/BF01734359 [DOI] [PubMed] [Google Scholar]
- 34. Felsenstein J, Felenstein J. Inferring phylogenies. vol. 2 Sinauer associates Sunderland, MA; 2004. [Google Scholar]
- 35. Bouckaert R, Lemey P, Dunn M, Greenhill SJ, Alekseyenko AV, Drummond AJ, et al. Mapping the origins and expansion of the Indo-European language family. Science. 2012;337(6097):957–960. 10.1126/science.1219669 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Kelleher J, Barton NH, Etheridge AM. Coalescent simulation in continuous space. Bioinformatics. 2013;29(7):955–956. 10.1093/bioinformatics/btt067 [DOI] [PubMed] [Google Scholar]
- 37. Joseph T, Hickerson M, Alvarado-Serrano D. Demographic inference under a spatially continuous coalescent model. Heredity. 2016;117(2):94 10.1038/hdy.2016.28 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Sukumaran J, Holder MT. DendroPy: a Python library for phylogenetic computing. Bioinformatics. 2010;26(12):1569–1571. 10.1093/bioinformatics/btq228 [DOI] [PubMed] [Google Scholar]
- 39. Rambaut A, Drummond AJ, Xie D, Baele G, Suchard MA. Posterior summarization in Bayesian phylogenetics using Tracer 1.7. Systematic biology. 2018;67(5):901–904. 10.1093/sysbio/syy032 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. Edwards CJ, Suchard MA, Lemey P, Welch JJ, Barnes I, Fulton TL, et al. Ancient hybridization and an Irish origin for the modern polar bear matriline. Current biology. 2011;21(15):1251–1258. 10.1016/j.cub.2011.05.058 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Duchene S, Di Giallonardo F, Holmes EC, Vaughan T. Inferring infectious disease phylodynamics with notification data. bioRxiv. 2019; p. 596700.
- 42. Drummond AJ, Ho SY, Phillips MJ, Rambaut A. Relaxed phylogenetics and dating with confidence. PLoS biology. 2006;4(5):e88 10.1371/journal.pbio.0040088 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Faria NR, Suchard MA, Abecasis A, Sousa JD, Ndembi N, Bonfim I, et al. Phylodynamics of the HIV-1 CRF02_AG clade in Cameroon. Infection, Genetics and Evolution. 2012;12(2):453–460. 10.1016/j.meegid.2011.04.028 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Dellicour S, Durkin K, Hong SL, Vanmechelen B, Martí-Carreras J, Gill MS, et al. A phylodynamic workflow to rapidly gain insights into the dispersal history and dynamics of SARS-CoV-2 lineages. BioRxiv. 2020. [DOI] [PMC free article] [PubMed]
- 45. Gill MS, Ho T, Si L, Baele G, Lemey P, Suchard MA. A relaxed directional random walk model for phylogenetic trait evolution. Systematic biology. 2017;66(3):299–319. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Faria NR, Kraemer MU, Hill S, De Jesus JG, Aguiar R, Iani FC, et al. Genomic and epidemiological monitoring of yellow fever virus transmission potential. Science. 2018;361(6405):894–899. 10.1126/science.aat7115 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Gill MS, Lemey P, Faria NR, Rambaut A, Shapiro B, Suchard MA. Improving Bayesian population dynamics inference: a coalescent-based model for multiple loci. Molecular biology and evolution. 2012;30(3):713–724. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Lemey P, Hong SL, Hill V, Baele G, Poletto C, Colizza V, et al. Accommodating individual travel history and unsampled diversity in Bayesian phylogeographic inference of SARS-CoV-2. Nature Communications. 2020;11(1):1–14. 10.1038/s41467-020-18877-9 [DOI] [PMC free article] [PubMed] [Google Scholar]