Summary paragraph
Following the first wave of SARS-CoV-2 infections in spring 2020, Europe experienced a resurgence of the virus starting late summer that was deadlier and more difficult to contain 1. Relaxed intervention measures and summer travel have been implicated as drivers of the second wave 2. Here, we build a phylogeographic model to evaluate how newly introduced lineages, as opposed to the rekindling of persistent lineages, contributed to the COVID-19 resurgence in Europe. We inform this model using genomic, mobility and epidemiological data from 10 European countries and estimate that in many countries over half of the lineages circulating in late summer resulted from new introductions since June 15th. The success in onward transmission of newly introduced lineages was negatively associated with local COVID-19 incidence during this period. The pervasive spread of variants in summer 2020 highlights the threat of viral dissemination when restrictions are lifted, and this needs to be carefully considered by strategies to control the current spread of variants that are more transmissible and/or evade immunity. Our findings indicate that more effective and coordinated measures are required to contain spread through cross-border travel even as vaccination begins to reduce disease burden.
Keywords: COVID-19, SARS-CoV-2, Europe, second wave, phylogeography, international mobility
Upon successfully curbing transmission in spring 2020, many European countries witnessed a resurgence in COVID-19 cases in late summer. The number of COVID-19 infections increased rapidly, and by the end of October, it was clear that the continent was deep into a second epidemic wave. This forced governments to reimpose lockdowns and social restrictions in an effort to contain the resurgence. While these measures reduced infection rates across Europe 3, several countries witnessed a stabilization at high levels or even a new surge in infections. The spread of more transmissible variants, in particular B.1.1.7 (Variant of Concern 202012/01 or 20I/501Y.V1 4), which was first identified in the United Kingdom (UK), has considerably exacerbated the challenge to contain COVID-19.
Already early on in the pandemic, modelling studies warned about new waves due to partial relaxation of restrictions 5 or seasonal variations 6. By mid-April, the European Commission constructed a roadmap to lifting coronavirus containment measures 7, recommending a cautious and coordinated manner to revive social and economic activities. However, the early start of the devastating second wave demonstrated that there was insufficient adherence to these measured recommendations. Cross-border travel, and mass tourism in particular, has been implicated as a major instigator of the second wave. Genomic surveillance demonstrated that a new variant (lineage B.1.177 8, 20A.EU1 [nextstrain.org]), which emerged in Spain in early summer, has spread to multiple locations in Europe 2. While this variant quickly grew into the dominant circulating SARS-CoV-2 strain in several countries, it did not appear to be associated with a higher intrinsic transmissibility 2.
Although it appears clear that travel significantly contributed to the second wave in Europe, it remains challenging to assess how it may have restructured and reignited the epidemic in the different European countries. Even without resuming travel, relaxing containment measures when low-level transmission is ongoing risks the proliferation of locally circulating strains. Phylodynamic analyses may provide insights into the relative importance of persistence versus the introduction of new lineages, but such analyses are complicated for SARS-CoV-2 for different reasons. Phylogenetic reconstructions may be poorly resolved due to the relatively limited SARS-CoV-2 sequence diversity 9. This is further confounded by the degree of genetic mixing that can be expected from unrestricted travel prior to the lockdowns in spring 2020.
Mobility data predicts SARS-CoV-2 spread
We analysed SARS-CoV-2 B.1 (20A) genomes from 10 European countries for which a minimal number of genomes from the second wave were already available on November 3rd, 2020. Using a two-step procedure that relied on subsampling relative to country-specific case counts (cfr. Methods), we compiled a data set of close to 4,000 genomes sampled between January 29th and October 30th, 2020 (Extended Data Table 1). In order to achieve maximum resolution in our evolutionary reconstructions, we constructed a Bayesian time-measured phylogeographic model that integrates mobility and epidemiological data. Our approach simultaneously infers phylogenetic history and ancestral movement throughout this history while also identifying the drivers of spatial spread 10. We used the latter functionality to determine the most appropriate mobility or connectivity measure. Specifically, we considered international air transportation data, the Google COVID-19 Aggregated Mobility Research Dataset (also referred to here as ‘mobility data’ for short), as well as Facebook's Social Connectedness Index (SCI), as covariates of phylogeographic spread (Extended Data Figure 1). The Google mobility data contains anonymized mobility flows aggregated over users who have turned on the Location History setting, which is off by default (cfr. Methods). The Social Connectedness Index reflects the structure of social networks and has been suggested to correlate with the geographic spread of COVID-19 11. To help inform the phylogenetic coalescent time distribution, we parameterized the viral population size trajectories through time as a function of epidemiological case count data for the countries under investigation.
Analyses using both time-homogeneous and time-inhomogeneous models offered strong support for mobility data as a predictor of spatial diffusion whereas air transportation data and SCI offered no predictive value (Extended Data Table 2). The fact that mobility data encompassing both air and land-based transport are required to explain COVID-19 spread highlights the need to consider both types of transport in containment strategies. To ensure that containment strategies were accommodated by our reconstructions, we further extended our time-inhomogeneous approach to model bi-weekly variation in the overall rate of spread between countries as a function of mobility (cfr. Methods, Extended Data Table 2).
Dynamic viral transmission through time
We use our probabilistic model of spatial spread informed by genomic data, mobility and epidemiological data to characterize the dynamics of spread throughout the epidemic in Europe. We first focus on the ratio of introductions over the total viral flow in and out of each country over time and the genetic structure of country-specific transmission chains (Figure 1). For the latter, we use a normalized entropy measure that quantifies the degree of phylogenetic interspersion of country-specific transmission chains in the SARS-CoV-2 phylogeny (cfr. Methods). Although estimates for individual dispersal between pairs of countries can also be obtained (Extended Data Figure 2), we remain cautious in interpreting these as direct pathways of spread because the genome sampling only covers a restricted set of European countries. The mobility to/from each country within our 10-country sample covers between 64% and 96% of the mobility of these countries to/from all countries within Europe (Extended Data Table 3, Extended Data Figure 3), except for Norway (27%), for which other Scandinavian countries account for considerable mobility connections (61%), and the UK (49%), for which Ireland accounts for a large fraction of mobility connections (38%).
Figure 1. Mobility, genome sampling, case counts and phylogeographic summaries through time for 10 European countries.
The first panel summarizes the country-specific Google mobility influx in the 10 countries during two-week intervals, while the second panel depicts the weekly genome sampling by country used in the phylogeographic analysis. In the remaining panels, we plot for each country the ratio of introductions over the total viral flow from and to that country (for two-week intervals) and a monthly normalized entropy measure summarizing the phylogenetic structure of country-specific transmission chains. The posterior mean ratios of introductions are depicted with circles that have a size proportional to the total number of transitions from and to that country and the grey surface represents the 95% highest posterior density (HPD) intervals. The posterior mean normalized entropies and 95% HPD intervals are depicted by dotted lines. These normalized entropy measures indicate how phylogenetically structured the epidemic is in each country, and range from 0 (perfectly structured, e.g., a single country-specific cluster) to 1 (unstructured interspersion of country-specific sequences across the entire SARS-CoV-2 phylogeny). The introduction ratios and normalized entropy measures are superimposed over COVID-19 incidence (daily cases/10^6 people) reported for each country through time (coloured density plot). The two vertical dashed lines mark the summer time interval (June 15th to August 15th, 2020) for which we subsequently evaluate introductions versus persistence (cfr. Figure 2).
According to the proportion of introductions, we estimate more viral import than export events for Switzerland, Norway, the Netherlands and Belgium throughout most of the time period under investigation. According to the estimated phylogenetic entropy, these countries also experienced many independent transmission chains since the epidemic started to unfold. This is consistent with country-specific studies; for the first wave in Belgium for example, about 331 individual introductions were estimated in the ancestry of a limited sample of 740 genomes 12. For Portugal, we also estimate higher proportions of introductions early in the first wave but with a subsequent decline to predominantly export events. France, Italy and Spain on the other hand are characterized by a relatively high viral export during the first wave. The proportion of introductions remained relatively low for Italy and Spain following the first wave, while in France these proportions were high from mid-June until the end of July. The absolute number of transitions in our sample is, however, low during this time period. These countries also had comparatively lower entropy values early in the epidemic, with an increase for France by the start of summer and a more gradual increase over time for Italy. In Spain however, the genetic complexity of SARS-CoV-2 transmission chains remained limited. In the UK and Germany, the viral flow in and out of the country was initially relatively balanced. A recent large-scale genomic analysis in the UK indicates that this can imply very high absolute numbers of cross-country transmissions, as more than 2,800 independent introduction events were identified from the analysis of 26,181 genomes 13. Although our sample is limited compared to this analysis, our reconstructions also recover major influx from Spain, France and Italy during the first wave in the UK (Extended Data Figure 2).
We estimate an increase in the proportion of introductions for the UK from mid-June, indicating an important viral import relative to export around this time. The phylogenetic entropy also peaked around this time. In Germany, the proportions increased somewhat later in summer with a concomitant rise in phylogenetic entropy.
Introductions thrive in low incidence
To assess the impact of summer travel on the second wave in the different countries, we use our genomic-mobility reconstruction to estimate both the number of lineages persisting in each country and the number of newly introduced lineages, and how these proliferated early in the second wave. We focus on a two-month time period between June 15th, on which many EU and Schengen-area countries opened their borders to other countries, and August 15th, before which the majority of holiday return travel is expected for many countries. We identify the number of lineages circulating in each country on August 15th, and determine whether they result from a lineage that persisted since June 15th or from a unique introduction after this date (independent of the number of descendants for this lineage on August 15th, Extended Data Figure 4). In Figure 2, we plot i) the ratio of these unique introductions over the total unique lineages (unique introductions and persisting lineages) (p1), ii) the proportion of descendant lineages on August 15th that resulted from the unique introductions over the total descendants circulating on this date (p2), and iii) the proportion of descendant tips (sampled genomes) after August 15th that resulted from the unique introductions over the total number of descendant tips (p3, cfr. Methods and Extended Data Figure 4). We estimate a posterior mean proportion of unique introductions that is close to or higher than 0.5 except for Spain and Portugal. This indicates that by August 15th a relatively large fraction of circulating lineages in each country was spawned by new introductions over summer. Because the B.1.177/20A.EU1 variant that was predominantly disseminated through summer travel does not appear to be more transmissible 2, this was unlikely due to intrinsic advantages of the newly introduced viruses.
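To make the three summaries concrete, the following minimal sketch computes p1, p2 and p3 from lineage counts for a single country and a single tree. The function name, argument names and the example counts are hypothetical; in the actual analysis these quantities are summarized over the posterior tree distribution rather than computed from fixed counts.

```python
def introduction_proportions(n_intro, n_persist,
                             desc_intro, desc_persist,
                             tips_intro, tips_persist):
    """Return (p1, p2, p3) for one country and one tree (illustrative only).

    n_intro / n_persist       : unique introductions vs. persisting lineages
                                (June 15th to August 15th)
    desc_intro / desc_persist : descendant lineages circulating on August 15th
    tips_intro / tips_persist : descendant tips sampled after August 15th
    """
    p1 = n_intro / (n_intro + n_persist)
    p2 = desc_intro / (desc_intro + desc_persist)
    p3 = tips_intro / (tips_intro + tips_persist)
    return p1, p2, p3


# Hypothetical counts: introductions make up 60% of unique lineages,
# 75% of lineages on August 15th, and 25% of the tips sampled afterwards.
p1, p2, p3 = introduction_proportions(6, 4, 30, 10, 20, 60)
```

Comparing p2 and p3 with p1 then indicates whether introduced lineages left more or fewer descendants than expected from their share of unique lineages.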
Figure 2. Posterior estimates for the relative importance of lineage introduction events in 10 European countries and their association with incidence.
We report three summaries (posterior mean and 95% HPD intervals) for each country: the ratio of unique introductions over the total number of unique persisting lineages and unique introductions between June 15th and August 15th, 2020 (p1), the ratio of descendant lineages from these unique introduction events over the total number of descendants circulating on August 15th, 2020 (p2), and the ratio of descendant taxa from these unique introductions over the total number of descendant taxa sampled after August 15th, 2020 (p3) (cfr. Extended Data Figure 4). The dot sizes are proportional to: (1) the total number of unique lineage introductions identified between June 15th and August 15th, 2020, (2) the total number of lineages inferred on August 15th, 2020, and (3) the total number of descendant tips after August 15th, 2020.
The two proportions of descendants from these introductions on August 15th (p2) and after this date (p3) measure the relative success of newly introduced lineages compared to persisting lineages, indicating considerable variation in onward transmission. In Figure 2, the country estimates are ordered according to decreasing average incidence during the June 15th - August 15th time period, suggesting that incidence may shape the outcome of the introductions. In countries that experienced relatively high summer incidence (e.g. Spain, Portugal, Belgium and France), the introductions led to comparatively fewer descendants on August 15th or after. We find a significant overall association between incidence and the difference in the logit-scaled proportion of unique introductions and the logit-scaled proportion of their descendants on August 15th (p = 0.007) as well as between incidence and the difference in the logit-scaled proportion of unique introductions and the logit-scaled proportion of descendant tips after August 15th (p = 0.019, Extended Data Figure 5). With comparatively few descendants from introductions (Figure 2), Norway may to some extent be an outlier because lineages estimated as persisting in this country could in fact be introductions from other Scandinavian countries that are not represented in our genome sample. We recover qualitatively similar, but more variable and statistically unsupported associations between the success of introductions and incidence for the two-month time periods before and after the June 15th - August 15th time period (Extended Data Figure 5). This indicates that the comparatively higher proportion of introductions as well as the more stable and lower incidence between June 15th and August 15th provided the ideal conditions for a process of genetic drift by which introductions were able to fuel transmission.
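The logit-scale comparison used for this association can be sketched as follows. The function names and the sign convention (descendants minus introductions) are assumptions made for illustration; the regression against incidence itself is not reproduced here.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def relative_success(p_intro, p_desc):
    """Difference between the logit-scaled proportion of descendants and the
    logit-scaled proportion of unique introductions (assumed sign convention).
    Positive values mean introductions left more descendants than their share
    of unique lineages would suggest."""
    return logit(p_desc) - logit(p_intro)


# Hypothetical example: introductions are half of the unique lineages
# but account for three quarters of the descendants.
delta = relative_success(0.5, 0.75)
```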
Our estimates show that introductions in the UK particularly benefited from the conditions for successful onward transmission (Figure 2), with a considerable fraction of introductions originating from Spain (Extended Data Figure 6) reflecting the spread of B.1.177/20A.EU1 that rapidly became the most dominant strain in the UK 2. Our analysis captures the expansion of this variant as well as that of B.1.160/20A.EU2, which together account for more than 25% of the genomes in our data set. While Spain was indeed inferred to be the origin of B.1.177/20A.EU1, the UK also considerably contributed to its spread (Figure 3). The earliest introduction from Spain to the UK was estimated around the time Spain opened most EU borders (June 21st, Figure 3). While introductions from Spain to other countries soon followed, we estimate a similar rate and amount of spread from the UK to other countries before these other countries also further disseminated the virus. Although inferred from a limited sample, this illustrates a dynamic pattern of spread and the importance of the early establishment of B.1.177/20A.EU1 in the UK that likely served as an important secondary center of dissemination. We note however that this pattern may be impacted by the intensive and continuous genomic surveillance in the UK, which may also be reflected in our subsample of the available data. While the UK is also involved in the spread of B.1.160/20A.EU2, this variant has been largely disseminated from France. The simple fact that this variant expanded later in France and subsequently also started to spread later compared to B.1.177/20A.EU1 (Extended Data Figure 7) may explain why the latter spread more successfully.
Figure 3. Phylogeographic estimates of SARS-CoV-2 spread in 10 European countries.
The tree on the left represents the maximum clade credibility tree summary of the Bayesian inference. Colours correspond to the countries in the legend. The two clades corresponding to B.1.160/20A.EU2 and B.1.177/20A.EU1 are highlighted in grey. The circular migration flow plots for these variants are based on the posterior expectations of the Markov jumps. In these plots, migration flow out of a particular location starts close to the outer ring and ends with an arrowhead more distant from the destination location. For B.1.177/20A.EU1, we also summarize phylogeographic transitions as posterior mean estimates with 95% HPD intervals over time for four types of Markov jumps: i) from Spain to the UK, ii) from Spain to other countries, iii) from the UK, and iv) from other countries.
Discussion
Our Bayesian phylogeographic approach builds on a rich history of identifying drivers of spatial spread, with applications to various pathogens at different spatial scales, ranging from air transportation for influenza at a global scale 910 to gravity model transmission for Ebola in West Africa 14. Such studies use a relatively limited genomic sample to gain insights into viral transmission dynamics. This is also the case in our application to SARS-CoV-2 in Europe for which we further extend the phylodynamic data integration approach to confront the lack of resolution offered by SARS-CoV-2 genomic data. A concerted effort in containing international spread further sets apart the COVID-19 pandemic from these earlier events. For this reason, we have now incorporated variation in mobility over time to account for the impact of these measures. Our reconstructions show that the composition of lineages circulating towards the end of the summer was to a significant extent shaped by introductions in most of the European countries. The relative success of onward transmission of the introduced lineages appears to be shaped by local COVID-19 incidence during summer.
Our results should be interpreted in light of several important limitations. In addition to a limited overall size, the genome data only cover a selection of European countries, implying that we are missing transmission events that involve unsampled countries. This may be important for Norway for example, which according to our mobility data, is largely connected to other Scandinavian countries. We also lack sampling from eastern Europe, which was to a large extent spared by border controls and lockdowns during the first wave, but witnessed high excess mortality rates during the second wave. The emergence of more transmissible variants has led to more intensified genomic surveillance, so similar phylodynamic reconstructions may now be performed on a wider scale.
The pandemic exit strategy offered by vaccination programs is a source of optimism that also sparked proposals by EU member states to issue vaccine passports in a bid to revive travel and rekindle the economy. In addition to implementation challenges and issues of fairness, there are risks associated with such strategies when immunization is incomplete, as likely will be the case for the European population this summer. A recent modelling study for the United Kingdom suggests that vaccination in adults alone is unlikely to completely halt the spread of COVID-19 cases and that lifting containment measures early and suddenly can lead to a large wave of infections 15. A gradual release of restrictions was shown to be critical for minimizing the infection burden 15. We believe that travel policies may be a key consideration in this respect because similar conditions may arise as the ones we demonstrated to provide fertile ground for viral dissemination and resurgence in 2020. This may now also involve the spread of variants that evade immune responses triggered by vaccines and previous infections. Well-coordinated European strategies will therefore be required to manage the spread of SARS-CoV-2 and reduce future waves of infection, with hopefully a more unified implementation than hitherto observed.
Methods
Sequence data and subsampling
We used a two-step genome data collection procedure. We first evaluated the available genomes from European countries in GISAID 16 on November 3rd, 2020. We selected genomes from Belgium, France, Germany, Italy, Netherlands, Norway, Portugal, Spain, Switzerland and the UK primarily based on the availability of genome data from both the first and second wave at that time but also because of their high ratio of genomes to positive cases. A total of 39,812 genomes were available for these countries on November 3rd, 2020; the available number of genomes by country are listed in Extended Data Table 1. Portugal represented an exception because data for this country were limited to the first wave at that time, but we included genomes from Portugal because of its potential importance as a summer travel location.
We aligned the genomes from each country using MAFFT v7.453 17 and trimmed the 5′ and 3′ ends and only retained unique sequences from each location. To further mitigate the disparities in sampling, we subsampled each country proportionally to the cumulative number of cases on October 21st (the most recently sampled sequence at the time) by setting an arbitrary threshold of 6.5 sequences per 10,000 cases, with a minimum number of 100 sequences per country. To maximize the temporal and spatial coverage in each country, we binned genomes by epi-week and sampled as evenly as possible, sampling from a different region within the country when available. Only sequences from the B.1 lineage with the D614G mutation and exact sampling dates were selected for the analyses. From the final aligned sequence set, we removed 12 potential outliers, based on a root-to-tip regression applying TempEst v1.5.3 18 to a maximum-likelihood tree inferred with IQTREE v2.0.3 19, yielding a data set of 2,909 genomes (Extended Data Table 1).
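The case-proportional quota described above can be sketched as below. The function name, the rounding rule and the cap at the number of available genomes are assumptions; the text only specifies the ratio of sequences per 10,000 cases and the per-country minimum.

```python
import math


def target_genomes(cumulative_cases, available, per_10k=6.5, minimum=100):
    """Number of genomes to select for one country (illustrative sketch).

    per_10k : sequences per 10,000 cumulative cases (6.5 in the first step)
    minimum : lower bound per country (100 in the first step)
    The result is capped at the number of available genomes (assumption).
    """
    proportional = math.ceil(per_10k * cumulative_cases / 10_000)
    return min(available, max(minimum, proportional))


# Hypothetical countries: one with many cases, one with few.
large = target_genomes(1_000_000, available=20_000)  # 650 genomes
small = target_genomes(50_000, available=5_000)      # minimum of 100 applies
```

The same function covers the later update step by passing `per_10k=8.5` and `minimum=200`.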
Because of the nature of genome sequence accumulation, fewer recently sampled genomes were available for most countries on November 3rd (relative to the case counts at this time). Because our primary goal was to assess the persistence and introduction of lineages leading up to the second wave, we sought to augment our data set with more recent genomes, having already performed analyses on the initial data set. In the section on Bayesian evolutionary reconstructions, we outline how we update these analyses accordingly. On January 5th, 2021, we updated our dataset by adding over 1,000 non-identical sequences collected between August 1st and October 31st (out of a total of 56,395 available genomes; the available and selected number of genomes by country are listed in Extended Data Table 1). For Portugal, we extended this period back to June 22nd (the most recent sampling date for the previous Portuguese selection). We downloaded all new B.1 sequences with the D614G mutation collected during the selected time period from GISAID and performed the following subsampling. The number of genomes to add by country was obtained by raising the threshold ratio of sequences/cases to 8.5 and increasing the minimum number of sequences to 200. To bias the temporal coverage towards more recent samples, the genomes from each country were binned by week and sampled such that the number of sequences added by week was proportional to an exponential function of the form e^(t/4), where t=0 represents August 1st and t=13 is October 31st. For Portugal, we did not use this preferential sampling as we needed to include close to all available genomes to raise the number of genomes to 200. The selected sequences were deduplicated and outliers were removed as described in the previous section. With the additional selection of 1,050 genomes, we arrived at a data set of 3,959 genomes (Extended Data Table 1).
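The exponential weighting towards recent weeks can be illustrated with a short sketch. The function name is hypothetical and the allocations are left as real numbers; how fractional counts were rounded or redistributed when a week had too few genomes is not specified here.

```python
import math


def weekly_allocation(total, n_weeks=14, scale=4.0):
    """Split `total` genomes over weeks t = 0..n_weeks-1 in proportion to
    exp(t / scale). With the defaults, t = 0 is the week of August 1st and
    t = 13 the week of October 31st."""
    weights = [math.exp(t / scale) for t in range(n_weeks)]
    norm = sum(weights)
    return [total * w / norm for w in weights]


# Hypothetical country contributing 200 additional genomes.
allocation = weekly_allocation(200)
```

Under this weighting the last week receives e^(13/4), roughly 26 times, as many genomes as the first.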
Mobility data
We analysed four different mobility/connectivity measures: air traffic flows, a social connectedness index provided by Facebook, as well as aggregate Facebook 20 and Google international mobility data. Air traffic flow data were obtained from the International Air Transport Association (http://www.iata.org) and based on the number of origin-destination tickets while also taking into account connections at intermediate airports 21. We used monthly air traffic data between the 10 European countries under investigation for the time period between January 2020 and October 2020. The social connectedness index (SCI) is an anonymized snapshot of active Facebook users and their friendship networks to measure the intensity of social connectedness between countries (https://data.humdata.org/) 22. In practice, the SCI measures the relative probability of a Facebook friendship link between two users of the application in different countries. We used the SCI calculated for the 10 European countries represented in our genomic sample as of August 2020.
The Google COVID-19 Aggregated Mobility Research Dataset contains anonymized mobility flows aggregated over users who have turned on the Location History setting (on a range of platforms 23), which is off by default. To produce this dataset, machine learning is applied to logs data to automatically segment it into semantic trips 24. To provide strong privacy guarantees, all trips were anonymized and aggregated using a differentially private mechanism 25 to aggregate flows over time (see https://policies.google.com/technologies/anonymization). This research was done on the resulting heavily aggregated and differentially private data. No individual user data was ever manually inspected, only heavily aggregated flows of large populations were handled. All anonymized trips were processed in aggregate to extract their origin and destination location and time. For example, if n users traveled from location a to location b within time interval t, the corresponding cell (a, b, t) in the tensor would be n ± η, where η is Laplacian noise. The automated Laplace mechanism adds random noise drawn from a zero-mean Laplace distribution and yields an (ϵ, δ)-differential privacy guarantee of ϵ = 0.66 and δ = 2.1 × 10^-29 per metric. The parameter ϵ controls the noise intensity in terms of its variance, while δ represents the deviation from pure ϵ-privacy. The closer they are to zero, the stronger the privacy guarantees. We used aggregated mobility flows between the 10 European countries and summarized them by two-week or monthly time periods between January 2020 and October 2020.
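As a generic illustration of the Laplace mechanism mentioned above, the sketch below perturbs a single origin-destination count with zero-mean Laplace noise. It is not the mechanism used to produce the dataset: the function names and the noise scale are hypothetical, and the sketch does not by itself reproduce the stated (ϵ, δ) guarantee, which depends on the sensitivity of the aggregated metric.

```python
import random


def laplace_noise(scale, rng):
    """Draw zero-mean Laplace noise with the given scale.
    The difference of two independent exponential variables with mean
    `scale` follows a Laplace(0, scale) distribution."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_flow(count, scale, rng):
    """Return the count for one cell (a, b, t) with Laplace noise added."""
    return count + laplace_noise(scale, rng)


rng = random.Random(0)            # fixed seed so the sketch is repeatable
released = noisy_flow(1200, scale=1.5, rng=rng)  # hypothetical flow and scale
```

Because the noise has zero mean, sums over many cells (for example two-week country-to-country totals) remain close to the true aggregate while individual cells are obscured.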
Finally, we also considered international mobility data from Facebook as an alternative to Google mobility data. These data are based on numbers of Facebook users moving over large distances, like air or train travel. Counts of international travel patterns are updated daily based only on users who have opted to share precise location data from their device with the Facebook mobile app through location services. Also in this case, we used aggregated mobility flows between the 10 European countries and summarized them by month between January 2020 and October 2020. Because international aggregate mobility data obtained from Google and Facebook are highly correlated (monthly Spearman correlation ranging from 0.84 to 0.92; Supplementary Figure 1), we only included the Google aggregate mobility data as a covariate in the phylogeographic analyses. We note that the mobility data are subject to limitations as these may not be representative of the population as a whole and their representativeness may vary by location.
Bayesian evolutionary reconstructions
- Joint sequence-trait inference with a time-homogeneous GLM diffusion model
We performed Bayesian evolutionary reconstruction of timed phylogeographic history using BEAST 1.10 26 incorporating genome sequences, their country and date of sampling, epidemiological and mobility/connectivity data. Because of the relatively low degree of resolution offered by the sequence data, our full probabilistic model specification focuses on i) relatively simple model specifications and ii) informing parameters by additional non-genetic data sources. We modeled sequence evolution using an HKY85 nucleotide substitution model with gamma-distributed rate variation among sites and a strict molecular clock model. Our genome set includes three genomes from an early outbreak in Bavaria, which was caused by an independent introduction from China 27,28. We therefore constrained these genomes as an outgroup in the analysis, which according to root-to-tip regression plots as a function of sampling time resulted in a better correlation coefficient/R^2 compared to the best-fitting root under the heuristic mean residual squared criterion (Supplementary Figure 2) 18.
As a coalescent tree prior, we modeled the effective population size trajectory as a piecewise constant function that changes values at pre-specified times (following 29), with log population sizes modelled as a deterministic function of log COVID-19 case counts (following 30). This reduces the nonparametric skygrid parameterization to a generalized linear model (GLM) formulation with an estimable regression intercept (α) and coefficient (β). In this parameterization, a coefficient estimate centered around 0 would imply constant population size dynamics through time. We specified two-week intervals and summarized as a covariate the total case counts over these time intervals for the 10 countries of sampling (obtained from https://www.ecdc.europa.eu/en/covid-19/data). The earliest interval with non-zero case counts was from 2020-01-14 to 2020-01-28; before 2020-01-14, the log-transformed and standardized case count covariate was set to the equivalent of 1 case. We also tested whether a lag-time was required for the case count covariate using marginal likelihood estimation (MLE) 31. Specifically, we shifted the case counts by 1, 2, 3 and 4 weeks before summarizing them according to two-week intervals and estimated the model fit of these covariates against case counts without lag time (Supplementary Table 1). To mitigate the computational burden associated with the MLE procedure, we performed these analyses on a subset of 1,000 genomes (obtained using the Phylogenetic Diversity Analyzer tool 32). We estimated the highest (log) marginal likelihood for a two-week lag time (Supplementary Table 1) and used this for the case count covariate in our analyses.
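The coalescent GLM described above can be written compactly as follows. The symbols N_e(t_k) for the effective population size in the k-th two-week interval and x_k for the log-transformed, standardized (and lagged) case count in that interval are notation introduced here for illustration; only α and β appear in the text.

```latex
\log N_e(t_k) = \alpha + \beta\, x_k , \qquad k = 1, \dots, K
```

With β = 0 the right-hand side reduces to α for every interval, which corresponds to the constant population size dynamics mentioned above.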
Similar to sequence evolution, we modelled the process of transitioning through discrete location states (countries of sampling) according to a continuous-time Markov chain (CTMC) 33. We employed a parameterization that models the log transition rates as a log linear function of mobility/connectivity covariates 10. The Bayesian implementation of this model simultaneously estimates phylogenetic history, ancestral movement and the contribution of covariates to the movement patterns 10. While we mainly use this approach to obtain well-informed phylodynamic estimates, we also make use of its capacity to identify the most relevant mobility measure to inform our reconstructions. As covariates we considered Facebook’s SCI, air transportation data and mobility data. For the two time-variable mobility measures, we used the average of the log-transformed and standardized monthly mobility measures as a single covariate in our time-homogeneous phylogeographic GLM model. In this GLM formulation, we estimate positive effect sizes for each covariate as well as their inclusion probability through a spike-and-slab procedure 10. Although we subsampled the number of SARS-CoV-2 genomes by country in proportion to case counts, they do not fully correspond because we used a minimum number of genomes for countries with low case counts. We therefore evaluated whether this resulted in signal for sampling bias by including an origin and destination covariate in the GLM based on the residuals for a regression analysis between genomes and case counts (following 14). We performed this analysis using a set of empirical trees (cfr. below) applying both a time-homogeneous and time-inhomogeneous model, but found no support for these additional covariates (Supplementary Table 2).
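The log-linear parameterization of the transition rates can be sketched as follows. This is a minimal illustration under stated assumptions: the three-location covariate values are hypothetical, the function name is ours, and rate normalization as well as the priors on β and δ used in BEAST are omitted. The mobility coefficient is the conditional posterior mean from Extended Data Table 2.

```python
import math


def glm_rates(covariates, betas, deltas):
    """Off-diagonal CTMC rates: lambda_ij = exp(sum_k beta_k * delta_k * x_ijk)."""
    n = len(next(iter(covariates.values())))
    rates = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # diagonal entries are not parameterized by the GLM
            log_rate = sum(
                betas[name] * deltas[name] * matrix[i][j]
                for name, matrix in covariates.items()
            )
            rates[i][j] = math.exp(log_rate)
    return rates


# Hypothetical log-transformed, standardized covariates for three locations.
covariates = {
    "mobility": [[0.0, 1.2, -0.5], [1.0, 0.0, 0.3], [-0.7, 0.4, 0.0]],
    "air_travel": [[0.0, 0.8, 0.1], [0.9, 0.0, -1.1], [0.2, -0.6, 0.0]],
}
betas = {"mobility": 0.358, "air_travel": 0.044}
deltas = {"mobility": 1, "air_travel": 0}  # spike-and-slab indicators (1 = included)
rates = glm_rates(covariates, betas, deltas)
```

Setting an indicator to 0 removes that covariate from every rate, which is how the spike-and-slab procedure expresses exclusion of a predictor.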
We performed inference under the full model specification using Markov chain Monte Carlo (MCMC) sampling and used the BEAGLE library v3 34 to increase computational performance. We specified standard transition kernels on all parameters, except for the regression coefficients of the piecewise-constant coalescent GLM model. For these parameters, we implemented new Hamiltonian Monte Carlo (HMC) transition kernels to improve sampling efficiency. These kernels use principles from Hamiltonian dynamics and their approximate energy conserving properties to reduce correlation between successive sampled states, but require computation of the gradient of the model log-posterior with respect to the parameters of interest, in addition to efficient evaluation of the log-posterior that BEAGLE provides. To accomplish this, we extended our previous analytic derivation of the gradient of the log-density from the skygrid coalescent model with respect to the log-population-sizes 35 to now be with respect to the regression coefficients using the chain rule and their regression design matrix.
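The chain-rule step mentioned above can be written out as follows. The notation is an assumption introduced here for illustration and does not appear in the original text: γ denotes the vector of log population sizes, X the regression design matrix (a column of ones and the case count covariate), θ = (α, β)ᵀ the regression coefficients, and 𝒯 the tree. Only the coalescent density term is shown; the gradient of the log-prior on θ would be added to obtain the log-posterior gradient.

```latex
% Assumed notation: \gamma = log population sizes, X = design matrix,
% \theta = (\alpha, \beta)^{\top}, \mathcal{T} = time-measured tree.
\gamma = X\theta, \qquad
\nabla_{\theta} \log p(\mathcal{T} \mid \theta)
  = \left( \frac{\partial \gamma}{\partial \theta} \right)^{\top}
    \nabla_{\gamma} \log p(\mathcal{T} \mid \gamma)
  = X^{\top} \, \nabla_{\gamma} \log p(\mathcal{T} \mid \gamma)
```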
Due to the data set size, MCMC burn-in takes up considerable computational time. We therefore iterated through a series of BEAST inferences, initially only considering sequence evolution and subsequently adding the location data, to arrive at a tree distribution from which trees were taken as starting trees in our final analyses. The latter was composed of multiple independent MCMC runs that were run sufficiently long to ensure that their combined posterior samples achieved effective sample sizes (ESSs) larger than 100 for all continuous parameters.
- Data augmentation through online BEAST
As we updated our dataset following initial analyses of the 2,909-genome collection using the approach discussed in the previous subsection, we sought to capitalize on these efforts to limit the burn-in for subsequent analyses of the 3,959-genome dataset. Specifically, we adopted the distance-based procedure to insert new taxa into a time-measured phylogenetic tree sample as implemented in the BEAST framework for online inference 36. We subsequently used the augmented tree as the starting tree for the analyses of the updated dataset.
- Time-inhomogeneous reconstructions
To accommodate the time-variability of the mobility measures, we constructed epoch model extensions of the discrete phylogeography approach that allow specifying arbitrary intervals over the evolutionary history and associating them with different model parameterizations 37. As a complement to testing covariates of spatial diffusion using a time-homogeneous model, we used the epoch extension to specify monthly intervals allowing us to incorporate monthly mobility matrices (air transportation data were only available as monthly numbers), but assuming time-homogeneous effect sizes and inclusion probabilities. Monthly covariates were again log-transformed and standardized after adding a pseudo-count to each entry in the monthly matrices.
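The transformation of a monthly mobility matrix can be sketched as below. The pseudo-count value, the per-matrix standardization and the toy matrix are assumptions for illustration; the text does not state the pseudo-count used or whether standardization was applied per month or across all months.

```python
import math
from statistics import mean, pstdev


def standardize_log_matrix(matrix, pseudo_count=1.0):
    """Add a pseudo-count, log-transform the off-diagonal entries and standardize them."""
    n = len(matrix)
    logs = {
        (i, j): math.log(matrix[i][j] + pseudo_count)
        for i in range(n)
        for j in range(n)
        if i != j
    }
    values = list(logs.values())
    mu, sigma = mean(values), pstdev(values)
    return {pair: (value - mu) / sigma for pair, value in logs.items()}


# Hypothetical monthly matrix for three locations; zeros are why a pseudo-count is needed.
june = [[0, 10, 0], [5, 0, 100], [0, 50, 0]]
z = standardize_log_matrix(june)
```

In an epoch model, one such standardized matrix would be associated with each monthly interval while the effect size is shared across intervals.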
In addition, we performed another analysis in which we relaxed the constant-through-time inclusion probability of the covariates. In this model specification, each interval is associated with a specific set of indicator variables to represent the inclusion/exclusion of covariates, but we pool information about predictor inclusion across the intervals using hierarchical graph modelling 38. This approach uses a set of indicator variables to model covariate inclusion at the hierarchical level but allows interval-specific inclusion of predictors to diverge from the hierarchical level with a non-zero probability (with the number of differences modelled as a binomial distribution 38), which was set to 0.10 in our case. We estimated hierarchical and interval-level inclusion using spike-and-slab 38.
Finally, we performed an analysis using the time-inhomogeneous model in which the interval-specific transition rates are modelled as a function of the single covariate that is supported by the analyses above leveraging aggregate mobility. We incorporated more variability through time by specifying two-week intervals (similar to the coalescent GLM interval specification). In addition, we added time-homogeneous random effects to the phylogeographic transition rate parameterization in order to account for potential biases in the ability of mobility to predict phylogeographic spread. While posterior mean estimates for these random effects vary, only very few indicate that individual phylogeographic transition rates significantly deviate from the mobility data (Supplementary Figure 3). The time-inhomogeneous GLM approach we employ allows modelling relative differences in transition rates, but the overall rate of migration between countries also varies through time and, importantly, is strongly impacted by intervention strategies. To accommodate these dynamics, we further extended this model by incorporating a time-inhomogeneous overall CTMC rate scalar and parameterized it as a log-linear function of the total monthly between-country log-transformed and standardized mobility (time-variable rate scalar GLM in Extended Data Table 2). To generate realisations of the discrete location CTMC process and obtain estimates of the transitions (Markov jumps) between states under this model, we employed posterior inference of the complete Markov jump history through time 10,39.
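The time-variable rate scalar can be illustrated with a short sketch. The function names and the standardized mobility values are hypothetical; α and β are the posterior means reported for the time-variable rate scalar GLM in Extended Data Table 2, and the way the scalar multiplies a relative rate is a simplification of the full model.

```python
import math


def rate_scalars(standardized_mobility, alpha, beta):
    """Overall CTMC rate scalar per interval: exp(alpha + beta * mobility)."""
    return [math.exp(alpha + beta * m) for m in standardized_mobility]


def absolute_rate(relative_rate, scalars, interval_index):
    """Scale a relative transition rate by the overall rate of its interval."""
    return scalars[interval_index] * relative_rate


# Hypothetical standardized total between-country mobility for three intervals.
mobility = [-1.0, 0.0, 1.0]
scalars = rate_scalars(mobility, alpha=0.740, beta=0.504)
```

A positive β means that intervals with higher overall mobility receive a higher overall rate of between-country transitions, independently of how the relative rates are distributed among country pairs.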
While the epoch model allows us to flexibly accommodate time-variable spatial dynamics, it considerably increases the computational burden associated with likelihood evaluations. In order to efficiently draw inference under this model for our large data set, we fit the time-inhomogeneous spatial diffusion process to a set of trees inferred under the time-homogeneous GLM diffusion model described above. Although likelihood evaluations remain computationally expensive, even with the speed-up offered by GPU computation with BEAGLE, eliminating simultaneous tree estimation tremendously reduces parameter-space, requiring only modest MCMC chain lengths to adequately explore it. Model and inference specifications for the different analyses are available as BEAST XML input files 40.
- Posterior Summaries
We assessed MCMC mixing (e.g. using ESSs) and summarized continuous parameter estimates using Tracer v1.7.1 41. Credible intervals were computed as 95% HPD intervals. Trees were visualized using FigTree v1.4.4 (available at https://github.com/rambaut/figtree/releases). In terms of phylogeographic estimates, we mainly focused on i) transitions to each location and from each location (based on Markov jump estimates) instead of pairwise transitions, ii) ratios of these transitions and iii) how these transitions structured transmission chains in individual countries. Transitions to each and from each location avoid drawing conclusions about direct migration between countries, which can be tenuous given the incomplete genome coverage of Europe, while their ratios avoid using absolute numbers of transitions, which are highly sample-dependent. Phylogeographic inference is limited to reconstructing the transitions in the ancestral history of a sample of sequences, which will only be a small fraction of the actual migration events, especially when these events result in insufficient onward transmission to be captured in our limited sample. In addition, SARS-CoV-2 genome data can be poorly resolved and identical genomes in different locations are consistent with hypotheses that involve both a sparse and a rich number of virus flows between these locations. As the data hold little information to distinguish these hypotheses, we only consider sparse scenarios by including only unique sequences for each location. A joint inference of sequence evolution and discrete spatial diffusion would err on the side of sparse hypotheses anyway because it will tend to cluster identical sequences that share a location. Despite the general underestimation of spatial dispersal, a phylogeographic inference is still likely to capture the transition events with important onward transmission, and evaluating the importance of such events relative to persistence is a major focus of this study.
Cryptic transmission also complicates the ability to reconstruct spatial dispersal, but we expect this to be equally likely for introductions and persistence and therefore focus on their ratio for each location.
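Summarizing transitions to and from each location, rather than pairwise, can be sketched as follows. The tuple layout (origin, destination, ISO date) and the function names are hypothetical and do not reflect the output format of any BEAST tool; the sketch handles a single jump history, whereas a posterior summary would average such counts over all sampled trees.

```python
from collections import Counter


def summarize_jumps(jumps, start=None, end=None):
    """Count Markov jumps into and out of each location within an optional time window."""
    imports, exports = Counter(), Counter()
    for origin, destination, date in jumps:
        if start is not None and date < start:
            continue
        if end is not None and date > end:
            continue
        exports[origin] += 1
        imports[destination] += 1
    return imports, exports


def import_export_ratio(imports, exports, location):
    """Ratio of transitions into a location over transitions out of it (None if no exports)."""
    if exports[location] == 0:
        return None
    return imports[location] / exports[location]


# Hypothetical jump history; ISO dates compare correctly as strings.
jumps = [
    ("Spain", "UK", "2020-07-10"),
    ("Spain", "France", "2020-07-20"),
    ("France", "UK", "2020-08-01"),
    ("UK", "Spain", "2020-09-01"),
]
imports, exports = summarize_jumps(jumps, start="2020-06-15", end="2020-08-15")
```

Working with the ratio for each location, rather than raw counts, limits the dependence on how many genomes were sampled.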
We provide three new tree sample tools in the BEAST codebase (available at https://github.com/beast-dev/beast-mcmc) to obtain posterior summaries of location transition histories using posterior tree distributions annotated with Markov jumps:
- TreeMarkovJumpHistoryAnalyzer allows collecting Markov jumps and their timings from a posterior tree distribution annotated with Markov jump histories in a .csv file for further analyses.
- TreeStateTimeSummarizer decomposes the total tree time into the times associated with contiguous partitions of a tree estimated to be in a particular location state, with the partitions determined by the Markov jumps. An arbitrary lower- and upper-time boundary can be specified to restrict the summary to a particular time interval in the evolutionary history. We use the time estimates for the separate partitions associated with each state to calculate an entropy measure that summarizes the genetic make-up of country-specific transmission chains. Specifically, we use for each location a normalized Shannon entropy:

$$H = -\frac{\sum_{i=1}^{n} p_i \log p_i}{\log n} \qquad (1)$$

where p_i is the proportion of time associated with that location for partition i of a phylogeographic tree and n represents the number of partitions for that location in the tree.
- PersistenceSummarizer also uses posterior tree distributions annotated with Markov jumps to summarize the number of lineages at a particular point in time (evaluation time, Te, cfr. Extended Figure 5), which location states they are associated with, since what time point in the past they have maintained that state and how many sampled descendants they have after time Te (Extended Figure 5). In addition, it allows summarizing how long these lineages have circulated independently prior to Te, so before sharing common ancestry with other lineages that maintained the same location state. This information allows us to determine how many lineages are circulating at Te that stem either from a unique persistent lineage (maintaining the same location states) or unique introduction event since a particular time prior to Te (Ta in Extended Figure 5).

The association between incidence and the difference in the logit proportion of unique introductions and the logit proportion of their descendants on August 15th was evaluated using a p-value obtained by a linear regression analysis.
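The normalized Shannon entropy and the logit difference described above can be sketched as follows. This is a minimal illustration, assuming the standard normalization of the entropy by log n and an entropy of 0 when a location has a single partition; the function names are ours and this is not the code of the BEAST tools. The example proportions 4/6 and 9/12 are taken from the conceptual illustration of persistent lineages and introductions in the Extended Data.

```python
import math


def normalized_entropy(partition_times):
    """Normalized Shannon entropy of the time spent in each partition of one location."""
    n = len(partition_times)
    if n < 2:
        return 0.0  # assumption: a single partition carries no diversity
    total = sum(partition_times)
    proportions = [t / total for t in partition_times if t > 0]
    return -sum(p * math.log(p) for p in proportions) / math.log(n)


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def logit_difference(p_introductions, p_descendants):
    """Difference between the logit proportion of unique introductions and of their descendants."""
    return logit(p_introductions) - logit(p_descendants)


even = normalized_entropy([1.0, 1.0, 1.0, 1.0])    # time spread evenly over partitions
skewed = normalized_entropy([9.0, 0.5, 0.5])       # one dominant transmission chain
difference = logit_difference(4 / 6, 9 / 12)
```

An entropy close to 1 indicates that the time in a country is spread evenly over many partitions, whereas a value close to 0 indicates that a single partition dominates.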
Extended Data
Extended Data Figure 1.
Monthly international mobility data matrices: international air traffic data (a), international Facebook mobility data (b), and international Google mobility data (c). For Facebook data, we also report the single social connectedness index matrix (SCI, b).
Extended Data Figure 2.
Estimated introductions through time in the 10 European countries and circular migration flow plots summarizing the estimated transitions between the countries for different time intervals throughout the SARS-CoV-2 evolutionary history. (a) The introductions through time serve as an illustration and are based on the Markov jump history in the MCC tree. We note that the posterior distribution of trees is accompanied by considerable uncertainty about the location of origin, destination and timing of the transitions, which is difficult to appropriately visualize. The grey box represents the time period from June 15th to August 15th. (b) The circular migration flow plots are based on the posterior expectations of the Markov jumps. The sizes of the plots reflect the total number of transitions for each period. In these plots, migration flow out of a particular location starts close to the outer ring and ends with an arrowhead more distant from the destination location.
Extended Data Figure 3.
Pairwise mobility data among the 10 countries included in the phylogeographic analysis and other European countries. Heatmap cells are coloured according to international Google mobility data for the time period between January and October 2020.
Extended Data Figure 4.

Conceptual representation of persistent lineages and introductions during the time interval delineated by the evaluation time (Te) and the ancestral time (Ta). At Te, we evaluate how many lineages are circulating in the location of interest, in this case 12 (lineages in other locations are represented by thick grey branches). We subsequently identify whether these lineages maintained this location up to Ta in their ancestry or whether they result from an introduction event in the time interval of interest. By determining whether other lineages circulating in the location of interest at Te are descendants of the same persistent lineage or whether they share an introduction event, we identify the unique persistent lineages or introductions, in this case 2 and 4 respectively. In addition to the proportion of unique introductions (p1 = 4/6), we also summarize the proportion of their descendants at Te (p2 = 9/(9+3) in this case) and the proportion of their descendants in terms of sampled tips after Te (p3). Those tips are not shown here but conceptually represented for both introductions and persistent lineages by ovals.
Extended Data Figure 5.
Scatter plots of the difference in the logit proportion of unique introductions (p1) and the logit proportion of their descendants on August 15th (p2) against incidence and the difference in the logit proportion of unique introductions and the logit proportion of descendant tips after August 15th (p3) against incidence. Both plots are shown for the period between April 15th and June 15th, for the period between June 15th and August 15th, and for the period between August 15th and October 15th, respectively. The p-values in the lower right corner of the plots are the p-values for the hypothesis tests based on the t-statistic evaluating whether the regression coefficient in a linear regression model is different from 0.
Extended Data Figure 6.
Estimated geographic origin of viral influx over the summer (June 15th - August 15th, 2020) in each country. Each bar plot summarizes the posterior Markov jump estimates into a specific country. For the bar representing a low number of introductions into Portugal, a magnified view is provided.
Extended Data Figure 7.
Phylogeographic transitions for lineages B.1.177/20A.EU1 and B.1.160/20A.EU2. Cumulative phylogeographic transitions are summarized as posterior mean estimates with 95% HPD intervals over time for four types of Markov jumps. For B.1.177/20A.EU1: i) from Spain to the UK, ii) from Spain to other countries, iii) from the UK, and iv) from other countries. For B.1.160/20A.EU2: i) from France to the UK, ii) from France to other countries, iii) from the UK, and iv) from other countries.
Extended Data Table 1.
Genome sampling by country, collected on November 3rd, 2020, and updated on January 5th, 2021.
| country | genomes (Nov. 3rd, 2020) | genomes (Jan. 5th, 2021) | total |
|---|---|---|---|
| Belgium | 183 (1,091) | 53 (957) | 236 |
| France | 600 (1,441) | 167 (762) | 767 |
| Germany | 246 (486) | 75 (482) | 321 |
| Italy | 281 (795) | 75 (257) | 356 |
| The Netherlands | 159 (2,387) | 47 (1,032) | 206 |
| Norway | 100 (414) | 92 (482) | 192 |
| Portugal | 100 (1,370) | 100* | 200 |
| Spain | 647 (2,443) | 191 (827) | 838 |
| Switzerland | 100 (3,019) | 98 (1,421) | 198 |
| The United Kingdom | 493 (26,366) | 152 (50,175) | 645 |
| total | 2,909 | 1,050 | 3,959 |
The numbers in brackets represent the number of available genomes from which the included genomes were subsampled.
* For Portugal, almost all available genomes were included to increase the number of genomes to 200.
Extended Data Table 2.
Parameter estimates for the various Bayesian time-measured phylogeographic models applied to the SARS-CoV-2 genome data set.
| Model | GLM component | Parameter estimates |
|---|---|---|
| Time-homogeneous spatial diffusion | coalescent GLM | α = 2.604 [2.487,2.735], β = 1.711 [1.603,1.898] |
| | spatial GLM | air travel: E[δ] = 0.018, β(∣δ=1) = 0.044 [0.001,0.123]; SCI: E[δ] = 0.004, β(∣δ=1) = 0.013 [0.003,0.032]; mobility: E[δ] > 0.999, β(∣δ=1) = 0.358 [0.258,0.456] |
| Time-inhomogeneous spatial diffusion | spatial GLM, constant inclusion probabilities | air travel: E[δ] = 0.018, β(∣δ=1) = 0.029 [0.001,0.105]; SCI: E[δ] = 0.008, β(∣δ=1) = 0.024 [0.001,0.078]; mobility: E[δ] > 0.999, β(∣δ=1) = 0.333 [0.229,0.438] |
| | spatial GLM, time-variable inclusion probabilities | air travel: E[δh] = 0.010, β(∣δh=1) = 0.047 [0.002,0.139]; SCI: E[δh] = 0.012, β(∣δh=1) = 0.018 [0.000,0.062]; mobility: E[δh] = 0.949, β(∣δh=1) = 0.357 [0.230,0.503] |
| | spatial GLM | mobility: β = 0.271 [0.118,0.444] |
| | time-variable rate scalar GLM | mobility: α = 0.740 [0.618,0.856], β = 0.504 [0.350,0.646] |
The coalescent generalized linear model (GLM) parameterizes bi-weekly effective population sizes as a log-linear function of COVID-19 incidence data, with α and β representing the log intercept and log regression coefficient. In the time-inhomogeneous spatial diffusion models, no coalescent prior was used as these models were fitted onto posterior trees inferred from the time-homogeneous model (cfr. Methods). For the spatial GLM model, we report inclusion probability estimates through the expectations of the boolean indicators (δ) associated with each predictor and log conditional effect sizes (the regression coefficient conditional on the predictor being included in the model, β(∣δ=1)). SCI = Social Connectedness Index, based on Facebook data. For the model with time-variable inclusion probabilities, we report the parameters at the hierarchical level (δh and β∣δh, cfr. Methods). In the model with a time-variable rate scalar, we parameterize this rate scalar as a log-linear function of the overall between-country mobility, with α and β representing the log intercept and log regression coefficient.
Using a time-homogeneous model of spatial diffusion, we estimate a maximum inclusion probability for the mobility data whereas air transportation data and SCI offer no predictive value. We also estimate a strong positive association between viral population size change through time and COVID-19 incidence in the coalescent GLM. We further confirm the support for the mobility covariate in a time-inhomogeneous spatial model that incorporates monthly mobility measures, with either constant or time-variable inclusion probabilities. In addition to parameterizing the relative rates of spread between countries according to this covariate, we extend our time-inhomogeneous approach to also model bi-weekly variation in the overall rate of spread between countries as a function of mobility measures (time-variable rate scalar GLM). This approach estimates a positive association between the overall rate of spatial spread and mobility data.
Extended Data Table 3.
Mobility to or from each country within our 10-country sample as the percentage of the total between-country mobility for these countries within Europe.
| country | Mobility percentage |
|---|---|
| Belgium | 87.2 |
| France | 89.5 |
| Germany | 63.9 |
| Italy | 64.8 |
| The Netherlands | 93.2 |
| Norway | 27.1 |
| Portugal | 94.0 |
| Spain | 90.3 |
| Switzerland | 84.8 |
| The United Kingdom | 48.6 |
The pairwise mobility measures summarized in this table are shown in Extended Data Figure 3.
Acknowledgments
We would like to thank all the authors who have kindly shared genome data on GISAID, and we have included a table (Supplementary Table 3) acknowledging the authors and institutes involved.
The research leading to these results has received funding from the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 725422-ReservoirDOCS) and the Bill & Melinda Gates Foundation (OPP1094793 and INV-024911). This study was partially funded by EU grant 874850 MOOD and is catalogued as MOOD 005. The contents of this publication are the sole responsibility of the authors and do not necessarily reflect the views of the European Commission. The Artic Network receives funding from the Wellcome Trust through project 206298/Z/17/Z. PL acknowledges support by the Research Foundation - Flanders (‘Fonds voor Wetenschappelijk Onderzoek - Vlaanderen’, G066215N, G0D5117N and G0B9317N). GB acknowledges support from the 'Interne Fondsen KU Leuven' / Internal Funds KU Leuven under grant agreement C14/18/094, and the Research Foundation – Flanders (‘Fonds voor Wetenschappelijk Onderzoek - Vlaanderen’, G0E1420N, G098321N). MAS acknowledges support from National Institutes of Health grant U19 AI135995 and R01 AI153044. SD is supported by the Fonds National de la Recherche Scientifique (FNRS, Belgium). We gratefully acknowledge support from NVIDIA Corporation with the donation of parallel computing resources used for this research. We would also like to thank AMD for the donation of critical hardware and support resources from its HPC Fund that made this work possible.
Footnotes
Code availability
The code for running BEAST analyses is available in the hmc_develop branch of the BEAST codebase available at https://github.com/beast-dev/beast-mcmc (DOI: 10.5281/zenodo.4895235). The tools TreeMarkovJumpHistoryAnalyzer, TreeStateTimeSummarizer and PersistenceSummarizer are available from the master branch in the same codebase.
Competing Interests
The authors declare no competing interests.
Data availability
BEAST XML input files are available at https://github.com/phylogeography/SARS-CoV-2_EUR_PHYLOGEOGRAPHY (DOI: 10.5281/zenodo.4876442). The SARS-CoV-2 genome data required for running these XML files can be downloaded from https://www.gisaid.org; all GISAID accession numbers are listed in the GISAID acknowledgments table (Supplementary Table 3).
The Google COVID-19 Aggregated Mobility Research Dataset and the Facebook mobility data are not publicly available owing to stringent licensing agreements. Information on the process of requesting access to the Google mobility data is available from A.S. (sadilekadam@google.com) and the COVID-19 Community Mobility Reports that were derived from the Google data are publicly available at https://www.google.com/covid19/mobility/. The Facebook mobility data are made available through the Data for Good program (https://dataforgood.fb.com) under the terms of a data license agreement which defines the allowed terms of use by partners (contact: disastermaps@fb.com). Once a partner institution’s request for access is vetted and an appropriate data license agreement is signed, access is granted through Facebook’s web-based spatial visualization tool called GeoInsight. Air travel data were obtained from the International Air Transport Association (http://www.iata.org). Log-transformed and standardized among-country mobility and air travel data are specified in the available XML files. COVID-19 incidence data were obtained from https://www.ecdc.europa.eu/en/covid-19/data.
References
- 1. Data on 14-day notification rate of new COVID-19 cases and deaths. https://www.ecdc.europa.eu/en/publications-data/data-national-14-day-notification-rate-covid-19 (2021).
- 2. Hodcroft EB et al. Emergence and spread of a SARS-CoV-2 variant through Europe in the summer of 2020. medRxiv (2020) doi: 10.1101/2020.10.25.20219063.
- 3. COVID-19 situation update for the EU/EEA, as of week 3, updated 28 January 2021. https://www.ecdc.europa.eu/en/cases-2019-ncov-eueea.
- 4. Rambaut A, Loman N, Pybus OG, Barclay W, Barrett J, Carabelli A, Connor T, Peacock T, Robertson DL, Volz E, on behalf of COVID-19 Genomics Consortium UK (CoG-UK). Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations. https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov-2-lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/563 (2020).
- 5. Di Domenico L, Pullano G, Sabbatini CE, Boëlle P-Y & Colizza V. Impact of lockdown on COVID-19 epidemic in Île-de-France and possible exit strategies. BMC Med. 18, 1–13 (2020).
- 6. Neher RA, Dyrdak R, Druelle V, Hodcroft EB & Albert J. Potential impact of seasonal forcing on a SARS-CoV-2 pandemic. Swiss Med. Wkly 150, w20224 (2020).
- 7. McKee M. A European roadmap out of the covid-19 pandemic. BMJ 369, m1556 (2020).
- 8. Rambaut A et al. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol 5, 1403–1407 (2020).
- 9. Morel B et al. Phylogenetic analysis of SARS-CoV-2 data is difficult. Molecular Biology and Evolution (2020) doi: 10.1093/molbev/msaa314.
- 10. Lemey P et al. Unifying Viral Genetics and Human Transportation Data to Predict the Global Transmission Dynamics of Human Influenza H3N2. PLoS Pathog. 10, e1003932 (2014).
- 11. Kuchler T, Russel D & Stroebel J. The Geographic Spread of COVID-19 Correlates with the Structure of Social Networks as Measured by Facebook. (2020) doi: 10.3386/w26990.
- 12. Dellicour S et al. A Phylodynamic Workflow to Rapidly Gain Insights into the Dispersal History and Dynamics of SARS-CoV-2 Lineages. Mol. Biol. Evol (2020) doi: 10.1093/molbev/msaa284.
- 13. du Plessis L et al. Establishment and lineage dynamics of the SARS-CoV-2 epidemic in the UK. Science (2021) doi: 10.1126/science.abf2946.
- 14. Dudas G et al. Virus genomes reveal factors that spread and sustained the Ebola epidemic. Nature 544, 309–315 (2017).
- 15. Moore S, Hill EM, Tildesley MJ, Dyson L & Keeling MJ. Vaccination and non-pharmaceutical interventions for COVID-19: a mathematical modelling study. Lancet Infect. Dis (2021) doi: 10.1016/S1473-3099(21)00143-2.
- 16. Shu Y & McCauley J. GISAID: Global initiative on sharing all influenza data - from vision to reality. Euro Surveill. 22, (2017).
- 17. Katoh K, Asimenos G & Toh H. Multiple alignment of DNA sequences with MAFFT. Methods Mol. Biol 537, 39–64 (2009).
- 18. Rambaut A, Lam TT, Max Carvalho L & Pybus OG. Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen). Virus Evol 2, vew007 (2016).
- 19. Minh BQ et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol. Biol. Evol 37, 1530–1534 (2020).
- 20. Maas P. Facebook disaster maps: Aggregate insights for crisis response & recovery. in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (ACM, 2019). doi: 10.1145/3292500.3340412.
- 21. Gilbert M et al. Preparedness and vulnerability of African countries against importations of COVID-19: a modelling study. Lancet 395, 871–877 (2020).
- 22. Bailey M, Cao R, Kuchler T, Stroebel J & Wong A. Social Connectedness: Measurement, Determinants, and Effects. J. Econ. Perspect 32, 259–280 (2018).
- 23. Kraemer MUG et al. Mapping global variation in human mobility. Nat Hum Behav 4, 800–810 (2020).
- 24. Bassolas A et al. Hierarchical organization of urban mobility and its connection with city livability. Nat. Commun 10, 4817 (2019).
- 25. Wilson RJ et al. Differentially Private SQL with Bounded User Contribution. Proceedings on Privacy Enhancing Technologies 2020, 230–250.
- 26. Suchard MA et al. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol 4, vey016 (2018).
- 27. Böhmer MM et al. Investigation of a COVID-19 outbreak in Germany resulting from a single travel-associated primary case: a case series. Lancet Infect. Dis 20, 920–928 (2020).
- 28. Worobey M et al. The emergence of SARS-CoV-2 in Europe and North America. Science 370, 564–570 (2020).
- 29. Gill MS et al. Improving Bayesian population dynamics inference: a coalescent-based model for multiple loci. Mol. Biol. Evol 30, 713–724 (2013).
- 30. Faria NR et al. Genomic and epidemiological monitoring of yellow fever virus transmission potential. Science 361, 894–899 (2018).
- 31. Baele G, Lemey P & Suchard MA. Genealogical Working Distributions for Bayesian Model Testing with Phylogenetic Uncertainty. Systematic Biology vol. 65 250–264 (2016).
- 32. Chernomor O et al. Split diversity in constrained conservation prioritization using integer linear programming. Methods Ecol. Evol 6, 83–91 (2015).
- 33. Lemey P, Rambaut A, Drummond AJ & Suchard MA. Bayesian phylogeography finds its roots. PLoS Comput. Biol 5, e1000520 (2009).
- 34. Ayres DL et al. BEAGLE 3: Improved performance, scaling, and usability for a high-performance computing library for statistical phylogenetics. Syst. Biol 68, 1052–1061 (2019).
- 35. Baele G, Gill MS, Lemey P & Suchard MA. Hamiltonian Monte Carlo sampling to estimate past population dynamics using the skygrid coalescent model in a Bayesian phylogenetics framework. Wellcome Open Res 5, 53 (2020).
- 36. Gill MS, Lemey P, Suchard MA, Rambaut A & Baele G. Online Bayesian Phylodynamic Inference in BEAST with Application to Epidemic Reconstruction. Mol. Biol. Evol 37, 1832–1842 (2020).
- 37. Bielejec F, Lemey P, Baele G, Rambaut A & Suchard MA. Inferring heterogeneous evolutionary processes through time: from sequence substitution to phylogeography. Syst. Biol 63, 493–504 (2014).
- 38. Cybis GB, Sinsheimer JS, Lemey P & Suchard MA. Graph hierarchies for phylogeography. Philos. Trans. R. Soc. Lond. B Biol. Sci 368, 20120206 (2013).
- 39. Minin VN & Suchard MA. Fast, accurate and simulation-free stochastic mapping. Philos. Trans. R. Soc. Lond. B Biol. Sci 363, 2985–2995 (2008).
- 40. BEAST XML files available at https://github.com/phylogeography/SARS-CoV-2_EUR_PHYLOGEOGRAPHY (DOI: 10.5281/zenodo.4876442).
- 41. Rambaut A, Drummond AJ, Xie D, Baele G & Suchard MA. Posterior Summarization in Bayesian Phylogenetics Using Tracer 1.7. Syst. Biol 67, 901–904 (2018).
Data Availability Statement
BEAST XML input files are available at https://github.com/phylogeography/SARS-CoV-2_EUR_PHYLOGEOGRAPHY (DOI: 10.5281/zenodo.4876442). The SARS-CoV-2 genome data required for running these XML files can be downloaded from https://www.gisaid.org; all GISAID accession numbers are listed in the GISAID acknowledgments table (Supplementary Table 3).
The Google COVID-19 Aggregated Mobility Research Dataset and the Facebook mobility data are not publicly available owing to stringent licensing agreements. Information on the process of requesting access to the Google mobility data is available from A.S. (sadilekadam@google.com). The COVID-19 Community Mobility Reports that were derived from the Google data are publicly available at https://www.google.com/covid19/mobility/. The Facebook mobility data are made available through the Data for Good program (https://dataforgood.fb.com) under the terms of a data license agreement that defines the allowed terms of use by partners (contact: disastermaps@fb.com). Once a partner institution’s request for access is vetted and an appropriate data license agreement is signed, access is granted through GeoInsight, Facebook’s web-based spatial visualization tool. Air travel data were obtained from the International Air Transport Association (http://www.iata.org). The log-transformed and standardized among-country mobility and air travel data are specified in the available XML files. COVID-19 incidence data were obtained from https://www.ecdc.europa.eu/en/covid-19/data.









