Abstract
Despite its increasing role in the understanding of infectious disease transmission at the applied and theoretical levels, phylodynamics lacks a well-defined notion of ideal data and optimal sampling. We introduce a method to visualize and quantify the relative impact of pathogen genome sequence and sampling times—two fundamental sources of data for phylodynamics under birth–death-sampling models—to understand how each drives phylodynamic inference. Applying our method to simulated data and real-world SARS-CoV-2 and H1N1 Influenza data, we use this insight to elucidate fundamental trade-offs and guidelines for phylodynamic analyses to draw the most from sequence data. Phylodynamics promises to be a staple of future responses to infectious disease threats globally. Continuing research into the inherent requirements and trade-offs of phylodynamic data and inference will help ensure phylodynamic tools are wielded in ever more targeted and efficient ways.
Keywords: phylodynamics, birth–death model, Bayesian phylogenetics
Introduction
Phylodynamics combines phylogenetic and epidemiological modeling to infer epidemiological dynamics from pathogen genome data (Volz et al. 2013; du Plessis and Stadler 2015; Baele et al. 2018). Analyses are usually conducted within a Bayesian Markov chain Monte Carlo (MCMC) framework, meaning that the output consists of posterior samples for parameters of interest, such as the basic reproductive number, (i.e., the average number of secondary infections from the index case in an otherwise fully susceptible population). Input data usually consist of partial- or whole-genome sequences with associated sample collection dates. In the case of birth–death-sampling models (Stadler 2010), both sequence and date data inform the branching of inferred trees by either temporally clustering lineages or via sequence similarity. Internal nodes are assumed to coincide with transmission events, such that they provide information about patterns of transmission that sampling time data alone cannot (Featherstone et al. 2022). Sampling times, or date data, are similar to standard epidemiological time series data whereas sequence data introduce the evolutionary aspect. The widely used birth–death model uses sampling times to infer a sampling rate which is also informative about transmission rates (Stadler et al. 2012; Boskova et al. 2018).
Application of phylodynamic methods has intensified since the onset of the SARS-CoV-2 pandemic. Moreover, larger and more densely sequenced outbreaks are being studied. Although the value of pathogen genome data is now well established, an increasingly pertinent question is whether the inclusion of more sequence data after a point is of diminishing returns for some densely sequenced outbreaks (Hill et al. 2021; Porter et al. 2022). Since the answer to this question will naturally vary with each data set and pathogen considered, we believe it is most appropriately addressed by a method to quantify the individual effects of date and sequence data. This would help broaden our understanding of the phylodynamic tools that now feature in infectious disease surveillance. Such a method also has the potential to help target sampling efforts of future outbreaks for optimization of knowledge gain against resource expenditure.
Earlier work showed that sequence sampling times, referred to here as “date data,” can drive epidemiological inference under the birth–death model (Volz and Frost 2014; Boskova et al. 2018; Featherstone et al. 2021). However, each stopped short of proposing a formal method to measure this effect in regular application. The birth–death model is most applicable to the question at hand since it includes a rate of sampling. The coalescent is another key phylodynamic model, but it typically conditions on sampling dates which therefore precludes a comparison of date and sequence effects. Some coalescent formulations include a sampling rate (Volz and Frost 2014); however, these are used less often than the birth–death or standard Kingman coalescent (Kingman 1977). The Kingman coalescent also assumes a low sampling proportion relative to population size such that its typical formulation would be inappropriate for many densely sequenced outbreaks (Boskova et al. 2018), where the question of the effect of large amounts of sequence data is most relevant.
Building upon these earlier results, we introduce a theoretical framework and a new method to quantify and visualize the effect of sequence and dates for any parameter under the birth–death with continuous sampling. We focus on continuous sampling because it is most relevant to how emerging outbreak data are collected. It also classifies which data source is driving the inference but crucially also indicates whether a binary classification is meaningful. We believe these observations will form a critical addition to the phylodynamic toolkit used to inform public health decisions because they clearly quantify the added-knowledge acquired from genomes in a given analysis.
New Methods
Isolating Date and Sequence Data
We conduct four analyses for a given data set to contrast the effects of complete data, date data, sequence data, and the absence of both (fig. 1A). We focus on inferring , with all other parameters fixed (the sampling proportion, becoming-uninfectious rate, molecular clock rate, and substitution model parameters), but this new approach is applicable to any parameter under the birth–death with any combination of priors. First, we use complete data to fit a birth–death model and infer the posterior distribution of . This represents the combined effects of dates and sequences. Second, to isolate the effect of date data, we remove sequence information and retain dates, thus integrating over the prior on tree topology. This is traditionally referred to as “sampling from the prior,” but this term should be avoided in the context of models where the sampling times are treated as data, such as the birth–death. Third, to isolate the effect of sequence data, we keep sequence data and remove dates. This requires estimation of all sampling dates, analogously to how removing sequence data causes integration over topology. We use a novel Markov chain Monte Carlo (MCMC) operator to estimate dates which are implemented in the feast v17 package for BEAST 2 (Bouckaert et al. 2019). Briefly, the operator adjusts the time between the final sample date and end of the birth–death process such that dates can rescale relative to each other and in absolute time. See supplementary figure S1, Supplementary Material online for a visual representation. Lastly, and for completeness, we conduct the analysis with both date and sequence data removed. This formally corresponds to the marginal prior conditioned on the number of samples. The resulting Wasserstein metric, , is useful for quantifying whether full data offer information in addition to the prior.
Fig. 1.
Graphical summary of the process to quantify signal and classify signal drivers. (A) Colored boxes give examples of data under each of the four treatments with letters in brackets giving shorthand notation for each. From left to right: Marginal Prior results from the removal of both date and sequence data. Dates Only includes date data while ignoring sequence data through a constant phylogenetic likelihood. This can be represented as converting all sequence characters to “N” (i.e., alignments of entirely missing data). Full data represents the usual combination of both sequence and date data. This produces a reference distribution from which the Wasserstein metric to other posteriors is calculated. Sequence Only corresponds to the removal and re-estimation of dates while sequence data are retained. (B) Example posterior output for with color corresponding to each treatment in A. The Wasserstein metric is calculated as the difference in the inverse distribution function of each posterior from Full Data integrated over 0 to 1. Example values for the Wasserstein metric are given in white boxes. (C) The plane with x and y axes and and shaded classification regions. is the Euclidean distance from the origin to a point , with higher values indicating that one or both of data and sequence data drive differing signals from the reference posterior. is the vertical distance from a point to the line , with points closer to this line corresponding to more similar data and sequence data effects such that classification is less meaningful. In the example, the distance from the posterior under only sequence data to the full data posterior () is the smallest, leading to classification as “Seq-Driven.”
Quantifying Data Signal
We employ the Wasserstein metric in one dimension to measure a “distance” between each of the sequence posterior, date posterior, or marginal prior, and the posterior derived from the complete data. We write these distances as , with being D, S, or N for the date, sequence, and marginal-prior distributions, respectively. For example, the Wasserstein distance from the date data to complete data posterior is:
where and are cumulative distribution functions for posterior under date and complete data, respectively. then maps from cumulative probability to the corresponding value on the domain of the parameter of interest (e.g., ). The units of are equivalent to the units of the parameter of interest (e.g., for the birth rate, becoming uninfectious rate, sampling rate, and unitless for ). See supplementary figure S2, Supplementary Material online for a visual interpretation of the expression. We used the transport R package to calculate the Wasserstein metric (Schuhmacher et al. 2020).
As in figure 1C, we can now consider a plane where the axes are and . We classify the data source with the lowest Wasserstein distance from the complete data posterior as contributing most to the posterior from full data. We refer to this as a classifier, although we emphasize that this is not a classifier as in machine learning literature. There is no prediction or statistical modeling of the driving data source beyond ). In this case, the lines mark the classification boundary as in the shading in figure 1C.
Finally, we can quantify the disagreement in signal between each data source. We define disagreement with respect to the full data posterior, as the magnitude of the vector leading to each point in the plane. This is also the radius from the origin to the point. Values near zero indicate that the posteriors under date data, sequence, and complete data are all near identical and classification as date- or sequence-driven is less meaningful. Larger values signify that one or both data sources drive differing posteriors and classification is more meaningful. We also define disagreement without respect to full data as a quantification of disagreement between date and sequence posteriors without reference to the full data (supplementary fig. S3C, Supplementary Material online). Visually, this corresponds to the vertical distance to the classification boundary () such that smaller values correspond to less meaningful classification. and are similar in that when is near-zero, necessarily is too. also accounts for the case where is high, but both date and sequence data have similarly sized effects. In this case, is higher whereas is lower and classification of one or another as driving analysis is inaccurate.
Results
We simulated 600 alignments to explore the differing signals in date and sequence data using the Wasserstein metric. These derive from 100 simulated outbreaks of 500 cases, sampled with proportion 1, 0.5, or 0.05 (, 250, or 500 accordingly) and used to simulate sequences with an evolutionary rate of or (subs/site/time). Higher evolutionary rates induce more site patterns and therefore more informative sequence data. We estimated under each data treatment with all other parameters fixed using a birth–death tree prior. In all analyses, simulated data provided information in addition to the prior (, supplementary fig. S5, Supplementary Material online). Among the 600 data sets, we observe a mixture of cases where date and sequence data infer similar or dissimilar posterior . This supports the core assumption that date and sequence data can have differing signals concealed in their combination (fig. 2). Classification using the Wasserstein metric results in a mix of date- and sequence-driven classifications, supporting that our proposed method is sensitive to differences between data sets (fig. 3). Most data sets were classified as date driven (372/600), which is consistent with earlier work showing that dates are highly influential under the birth–death (Volz and Frost 2014).
Fig. 2.
Comparison of full data, date data, sequence data, and marginal prior for four of the 600 simulated data sets where the true value is 2.5.
Fig. 3.
(A) Each point represents for one of the 600 simulated data sets with marginal histograms corresponding to the distribution of and , respectively. (B) The number of simulated data sets classified as Date or Seq Driven, stratified by evolutionary rate and sampling proportion. (C) Points colored by number of site patterns. Lower site patterns tend to co-occur with date-driven classification. Coloring appears categorical due to the log scale and shared color scale, but the number of site patterns involves a wide range and intermediate values (supplementary fig. S4, Supplementary Material online). (D) Points colored by date span with no clear patterns corresponding with classification as for site patterns.
Reliability of Wasserstein Metric
We conducted a subsampling analysis to ensure that Wasserstein metric values reflected differences between date and sequence data, rather than noise alone. For each of the 600 simulated data sets, we subsampled posterior distributions corresponding to each of the 4 data treatments 100 times, taking a number of samples equal to half the length of the postburnin posterior. This yielded 60,000 subsampled posteriors for which we recalculated Wasserstein values and reclassified each as in the original simulation study. Of the 60,000, subsampled posteriors, only 99 were misclassified across 99 of the 600 original data sets. In other words, misclassification occurred only once out of 100 replicates for 99 of the 600 simulated data sets.
Of the 600 simulated data sets, those where misclassification occurred had substantially smaller differences between Wasserstein distance to the date-only and sequence-only posteriors () (supplementary fig. S3, Supplementary Material online). Misclassification occurred where in the complete posteriors, with no error above this level. Differences below correspond to a level of difference between data- and sequence-only posteriors where classification is of little significance (i.e., assuming both posteriors have similar uncertainty). Classification is wholly reliable above this threshold, validating the trends observed in the simulation study and that the Wasserstein metric is sensitive to differing signals concealed in date and sequence data.
Observations about the Effects of Sequence and Dates
The distribution of is more diffuse than , meaning the sequence data posterior tends to differ more from full data than date data (fig. 3A). This again aligns with previous results showing that date data usually drive inference under the birth–death.
Low sequence diversity, measured here in the number of site patterns, seems to preclude sequence data from driving inference (fig. 3C, supplementary fig. S4, Supplementary Material online). This matches the expectation that fewer site patterns result in less sequence information to inform the posterior. It appears as a necessary but not sufficient condition for a data set to have at least one site pattern per sample in order for analysis to be sequence driven (supplementary fig. S6, Supplementary Material online). On the other hand, the date span does not follow an equivalent trend with lower diversity coinciding with analyses being sequence driven. Here, the relative date span is the time between the first and last sample, divided by the height of the outbreak tree and thus should be indicative of the information content of the dates through the proportion of the outbreak’s duration that they capture. The distribution of relative date span appears random across classifications, unlike the distribution of site patterns (fig. 3C, supplementary fig. S4, Supplementary Material online).
Sampling proportion appears to play a secondary role in driving the influence of sequence data (fig. 3B, animated here). Lower sampling proportions coincide with a higher relative proportion of sequence-driven data sets, assuming lower sampling still yields a usable sample size. This matches expectation because lower sampling proportion increases the likelihood of sampling more divergent lineages, resulting in more site patterns to drive inference.
One of the core assumptions of phylodynamics is that inferred phylogenetic trees bear a resemblance in their shape and timing to underlying transmission networks (Featherstone et al. 2022). Based on this, we would expect that analyses with less uncertainty in posterior epidemiological estimates have posterior tree distributions that are more divergent. To test this expectation, we calculated tree distance metrics pairwise for each of the 2400 posterior tree distributions, originating from the 600 simulated outbreaks and four data treatments. We sampled 100 trees from each posterior and considered the pairwise topological distance between each. We focus on the mutual clustering information (MCI) introduced by Smith (2020) (supplementary fig. S7, Supplementary Material online), but all observed patterns were repeated when using other metrics including Robinson–Foulds and Information-corrected Robinson–Foulds. The full data treatment consistently had the lowest pairwise MCI since these analyses incorporated the most information to constrain tree space. The marginal prior, reflecting the least information, reported the highest MCI values. Between these two extremes, sequence-only analyses consistently had a lower MCI than date-only treatments (based on visual inspection). This can be explained by sequence data providing information about favored topologies via the phylogenetic likelihood. Conversely, date-only data incorporate no topological information and consistently have higher MCI values. Given that date data tend to drive inference, especially where the evolutionary rate is low and to an extent where sampling proportion is high, this presents strong evidence that the chronology of tips can be equally, if not more informative than the topology of trees. This highlights the importance of curating sampling time data as carefully as sequence data is curated. Moreover, this applies at the phylogenetic level of phylodynamic inference, as distinct from the choice of the phylodynamic model and its parameters.
Empirical Results
Australian SARS-CoV-2 Clusters
We analyzed data from two SARS-CoV-2 transmission clusters from 2020 in Australia to demonstrate that date-driven and sequence-driven analyses can arise in practice (table 1). These clusters were chosen out of the 595 identified by Lane et al. (2021) because they possessed the most complete sampling-time data to compare with sequence data. They are also ideal examples of the size and type of genome data set that public health practitioners may seek phylodynamic insight from as each sample most likely reflects transmission stemming from a single, unknown source as supported by contact tracing.
Table 1.
Wasserstein Distances for Empirical Analyses of SARS-CoV-2 Clusters from Lane et al. (2021).
| Cluster 1 | Cluster 2 | |
|---|---|---|
| n | 112 | 188 |
| Classification | Seq Driven | Date Driven |
| 0.223 | 0.009 | |
| 0.078 | 0.008 | |
| 0.192 | 0.001 | |
| 0.114 | 0.009 | |
| 0.325 | 0.481 |
Analysis of the first cluster is classified as sequence driven, with indicating an appreciable difference between the sequence posterior and complete data. adds that date data also drive an effect of similar size, offering the interpretation that both date and sequence data are influential in this analysis. The second cluster is classified as date driven but with and . The low value indicates a near-negligible difference between date, sequence, and complete data posteriors. Since is low, is necessarily also low. Due to this, classification is effectively meaningless, and it can be concluded that both date and sequence data drive a highly congruent signal. Moreover, for both analyses is more than double each of and , which affirms that both sources of data contribute to the posterior deviating from the prior and are therefore informative with respect to the prior in both analyses.
2009 H1N1
The above simulation study and SARS-CoV-2 empirical data consist of analyses where all parameters are fixed except . However, this extent of prior certainty is rare in practice. In particular, there exists an inherent trade-off between prior certainty in the evolutionary rate and the strength of signal due to sequence data. Moreover, this affects the inference of the sampling proportion, which relies on sequence information to inform the evolutionary distance between samples and tMRCA, in turn informing the proportion of lineages captured in sampling. To explore the effects of these trade-offs, we consider North-American H1N1 data from the 2009 swine flu pandemic originally studied by Hedge et al. (2013). This data set was chosen because it represents a small sample from a much larger outbreak, which helps to ensure to ensure the presence of sequence information with which to test the effects of prior uncertainty in the evolutionary rate.
We again fit a birth–death process, estimating two values before and after early July 2009 ( and , respectively capturing the exponential and postexponential phases of the outbreak) and the sampling proportion (p). Analyses were conducted with either a fixed evolutionary rate, subs/site/year following Hedge et al. (2013), or a uniform prior , and under each of the four data treatments defined above. We placed a noninformative prior to the sampling proportion and fixed the becoming-uninfectious rate (), such that comparing analyses with a fixed or uniform prior on the evolutionary rate reveals the effects of uncertainty in evolutionary rate on and sampling proportion. Posterior distributions for and p are presented in figure 4 and Wasserstein values in table 2.
Fig. 4.
Posterior estimates for and p for the 2009 H1N1 data set. (A) Histogram gives shape of posterior density for the tMRCA from the analysis of the full data under a strict clock. Counts are rescaled by a factor of . Vertical marks along the date axis represent sampling times in the data set. The vertical dashed line marks the change time between and . Asymmetrical violin plots give the posterior density for under each data treatment with the marginal prior omitted. Distributions on the left correspond to analyses with a fixed evolutionary rate and those on the right correspond analyses with a uniform prior on the evolutionary rate. The more diffuse clock prior decreases the influence of date data. (B) An asymmetrical violin plot of the posterior sampling proportion with the left and right corresponding to fixed and uniform clock priors identically to A. Sequence data drive inference for both clock prior configurations.
Table 2.
2009 H1N1 Data.
| Clock Prior | p | |||||
|---|---|---|---|---|---|---|
| 0.069 | 0.168 | 0.05 | 0.039 | 0.239 | 0.255 | |
| 0.380 | 0.145 | 0.160 | 0.135 | 0.001 | 0.002 | |
| 0.071 | 0.170 | 0.947 | 10.9 | 0.891 | 0.884 | |
Note.—Wasserstein distances for estimated and sampling proportion under a fixed and and uniform clock prior.
The patterns in posterior under each treatment verify the trade-off between estimating tip dates and evolutionary rate. This is reflected in higher values when using a prior on the evolutionary rate than when fixing it (table 2), which corresponds to posterior under Full Data overlapping less with the posterior from sequence data. Moreover, the sequence data posterior is wider where the uniform clock prior is used, reflecting that more sequence information is devoted to ascertaining the evolutionary rate where the prior is more diffuse, which would otherwise further ascertain (fig. 4A). There is also a disparity between the posterior tMRCA between each clock prior under Full Data (Only the fixed prior tMRCA is shown in fig. 4A), which is expected given the well-known nonidentifiability between the evolutionary rate and tMRCA.
Estimates of sampling proportion appear to be strongly sequence driven regardless of the evolutionary rate prior (fig. 4B). The posterior sampling proportion is two orders of magnitude lower for the sequence data and full date treatments, both containing sequence information, than for date data alone. These estimates are far more in line with expectations too, given the disparity between the H1N1 sample size (), and scale of the 2009 H1N1 pandemic. This is expected because the diffuse prior placed on p made sequence information essential to informing the evolutionary distance between samples and driving estimates of sampling proportion lower. Conversely, estimates with date-only data are much higher and more uncertain owing to the lack of sequence information. The evolutionary rate prior drives a similar trend as for , but its effect is diminished by the major effect of sequence data. Altogether, this highlights that different parameters can be drawn from date and sequence data to different extents.
We do not present the marginal-prior data in figure 4 because each corresponding posterior is uninformatively broad (i.e., effectively uniform over the prior domain). However, this too supports our assertion of a trade-off between sequence information and posterior certainty in sampling proportion because of its contrast to the simulation studies and SARS-CoV-2 data sets where sampling proportion was fixed. The only information included in the marginal-prior treatment is the sample size, and this is most informative when the sampling proportion is fixed. However, in the H1N1 analyses, the diffuse prior on sampling proportion means sample size alone is insufficient to drive an informatively narrow posterior and sequence is the primary driver of inference.
Taken together, the results of the H1N1 data set demonstrate that uncertainty in the evolutionary rate prior partly determines the strength of effect for sequence data, since any information in sequence data is used in shifting from prior to posterior. In addition, sampling proportion demands sufficient sequence information for inference, and this cannot be substituted by sampling times as readily as has been shown for (Volz and Frost 2014; Featherstone et al. 2021).
Discussion
The results of our simulation study add clarity and practical understanding to previous results showing that sampling times contribute substantially to phylodynamic inference under the birth–death (Volz and Frost 2014; Featherstone et al. 2021). We furthermore demonstrate that sequence data are not always secondary in influence and can drive inference of in some instances. This affirms the sensitivity of the birth–death to the signal encoded in sequence data. We demonstrate that the number of site patterns is key to enabling sequence-driven inference but is not the sole factor. Over-sampling from a pathogen population can also dilute sequence information and pivot analyses to being date driven.
The tendency for date data to drive inference as site patterns decrease and sampling proportion increases can be explained by the reduction in uncertainty that each data source offers. Dates impose hard bounds by restricting tree space to a subset of trees that agree with the chronology of sampling times. Conversely, sequence data inform topology through phylogenetic likelihood but do not definitively constrain tree space in the same way as date data. The net result is that sequence data must explore a larger tree space than date data. Thus, as the information in sequence decreases, such as via intensified sampling lowering the number of site patterns per sample, the disparity between the sequence and date data tree spaces increases and dates more often drive analyses. See supplementary figure S8, Supplementary Material online for visual explanation.
We note that having sufficient sequence information to drive inference is a related but ultimately different concept to the phylodynamic threshold, which is defined as the sampling timespan needed for a pathogen to display temporal signal (Duchene et al. 2020). Our analyses show that surpassing the phylodynamic threshold (i.e., having temporal signal present in the data) is generally not a sufficient condition for sequence-driven inference. Conversely though, we demonstrate with the H1N1 data set that increased prior certainty in the evolutionary rate increases the effect of sequence data in the analysis. This means that an assessment of the phylodynamic threshold is highly useful in practice to promote sequence-driven inference, but only retrospective analysis using the methods introduced here can conclusively appraise the contribution of the sequence data.
Practical Insights
Our central finding is that sampling time data increasingly drive phylodynamic analyses as data sets grow in the proportion of sampled cases. Although information from sample times can be a boon to phylodynamic analyses when the sampling process is properly modeled, dependence on this information can become a vulnerability as sampling processes deviate from the assumptions of phylodynamic models over time and space. Thus, the phylodynamic value of a data set must take into consideration the amount of sequence and date information available to direct inference over tree space, rather than the size of the data set alone.—Bigger is not automatically better. To this end, we recommend a cautious approach to sampling, with the recognition that each additional sample is increasingly costly to the analysis in requiring sequence data to navigate a disproportionately large tree space in comparison to sampling times. Therefore, samples that are both identical in sampling time and sequence, such as from a super spreading event, should be avoided when curating phylodynamic data sets. However, we make a distinction between sampling and data set curation. We do not at all advocate that sampling be reduced to avoid recovering clonal samples since these are undeniably hallmark of many outbreaks. Rather, when larger data sets are being curated specifically for phylodynamic analysis, care should be taken to avoid excessive duplication where possible. As a basic start point, having at least one site pattern per sample greatly increases the likelihood of a sequence-driven analysis (supplementary fig. S6, Supplementary Material online). Other practical considerations are summarized in table 3.
Table 3.
Practical Considerations for Inference Stratified by Relevance to Phylogenetic Inference or the Phylodynamic Model (birth–death here).
| Trade-off/Unidentifiability/Challenge in Inference | Mitigating Action | Effect |
|---|---|---|
| Phylogenetic level | ||
| Ideally ≥1 site patterns per sample | Preference samples over a longer sampling window, potentially with lower sapling intensity when curating data sets. Data-sharing makes this more feasible in general | More site patterns |
| Phylodynamic model level (birth–death) | ||
| Reducing prior uncertainty in one reduces posterior uncertainty in the other | Reduces mutual unidentifiability | |
| Unidentifiability of | Fix at least one parameter (Louca et al. 2021) | Localize inference to correct congruence class of parameters |
| Narrowing prior r | More sequence information available to ascertain | |
| More sequence information available to ascertain p | ||
| Consider if fewer site patterns in sample is inflating posterior p erroneously | ||
The clock rate in subs/site/year.
Effect of sequence data much stronger than uncertainty in prior r.
Repositories of pathogen genome sequences and curated data sets, such as GISAID and Nextstrain are indispensable assets to the phylodynamics research program, but careful curation of data sets with respect to both sampling times and sequence quality is essential to provide valuable phylodynamic results. Just as it is valuable to surpass the phylodynamic threshold in sampling, it is of equal value not to surpass the notional threshold of redundancy in sampling from phylodynamic studies.
Moreover, different parameters require date and sequence information for accurate inference to differing extents. The ability to infer sampling proportion is integral to the value of phylodynamics as it allows for estimation of how many unknown cases there may be in an outbreak. We show sampling proportion is highly sequence driven, which necessitates both sampling to maximize sequence information relative to date information, and measurement of the effect of sequence data in analyses to comment on the reliability of estimated sampling proportions.
Fixing or constraining parameters in analyses where it is justifiable is also critical to ensuring sequence information is reflected in posterior results. For example, fixing the evolutionary rate is ideal if possible. Providing topological constraints between samples would also help concentrate sequence information in the data by reducing the tree space analyses must traverse. These guidelines are complementary to recent work due to Louca et al. (2021), showing that at least one of the unidentifiable parameters of the birth–death model (, , and p) needs to be fixed for accurate inference from the state space of analyses (i.e., the tree and generative parameters).
A final practical message is that a phylodynamically ideal microbe would, for example, evolve at a fixed substitution (subs/genome/time) on an infinitely sized genome with consistent clock-like evolution to uniquely barcode every transmission event. Clearly, no such microbe exists. Phylodynamics instead operates over microbes with genomes between and bases in size and evolution rates from to (subs/site/year) (Biek et al. 2015). In this view, there will never be a perfect data set or pathogen to analyze. The challenge instead becomes sampling and constraining analyses parsimoniously enough to ensure the epidemiological signature encoded in sequence data is returned as well as measuring this effect, such as by the methods developed here.
Application
In general, we recommend that users follow the same workflow as presented here to tease apart date and sequence signals under the birth–death model. That is, separate data into each of the four treatments (marginal prior is optional), perform analyses for each, and compare posteriors using the Wasserstein distance. However, we note that as with all Bayesian phylodynamic analyses, convergence the posterior is never guaranteed, and our workflow may not be feasible such as in the case of for the H1N1 data set. In these cases, the practical guidelines (summarized in table 3) still offer a valuable reference for the design and interpretation of phylodynamic analyses under the birth–death model.
The question remains as to what users should do if analyses appear insufficiently sequence driven. This depends heavily on the use case. For example, work involving ancient DNA or fossil taxa is unlikely to allow much scope for adding or removing data to enrich sequence signal. In these cases, our workflow can still be applied to offer useful insight into the information driving analysis, even if no new data are available to increase sequence information.
Our workflow is especially pertinent to public health applications involving large data sets that inform policy and decision-making. Here, it is especially important to test for the relative impacts of date and sequence as these directly inform or understanding of the data driving evidence-based decisions. For example, if results are heavily date driven, then less weighting may be given to the posterior sampling proportion and by extension, phylodynamic conclusions about the number of total number of cases. Consider the order-of-magnitude difference between date-only and sequence only/full data posterior sampling proportion for the H1N1 data set as an example. In addition, phylodynamic analyses may be employed to garner understanding about the origin of transmission through the posterior distribution of trees. Again the level of credence in results can be judged relative to the impact of sequence in analyses. At this point, researchers may opt to include extra sequence data to enrich sequence information, or exclude duplicate samples to hone in on samples that drive existing sequence signal. Fundamentally though, our workflow offers the chance to develop a more complete picture of the drivers of inference, so analyses and data can be better understood in the context from which they originated. From this new perspective, users can decide to add or remove data depending on their objective and data availability.
Future Directions
The work presents the beginning of deeper interrogations of the drivers of phylodynamic inference. Such an understanding is critical as phylodynamics becomes a mainstay of infectious disease surveillance.
Here, we have presented a method, based on the Wasserstein metric, for post hoc analysis of the drivers of phylodynamic inference. A critical future direction would be to develop predictive methods with explicit mathematical formulae for the information content of phylodynamic data. Such a toolkit would enable a theory of optimal sampling for phylodynamic inference to be explicitly described. However, we stress that a method to assay the drivers of inference, such as the work presented here, needed to be developed first to help measure the effectiveness of any future predictive methods.
Data Archival
All scripts used to simulate and analyze data are available at https://github.com/LeoFeatherstone/phyloDataSignal.git. The Feast package, containing our date estimator, is available at https://github.com/tgvaughan/feast.git.
Materials and Methods
Simulation Study
We simulated 100 outbreaks of 500 cases under a birth–death process using the Tree-Sim R package (Stadler 2019). The birth rate was set to , death rate 1, corresponding to . Sampling probability was set to , resulting in trees with 500 tips. We then extended this to a set of 300 outbreaks by sampling again with probability and , resulting in trees of 25 and 50 tips. We used a consistent seed such that each outbreak with or corresponds to a subsample of another with , allowing us to assess the effect of sampling proportion on inferring . For each outbreak, we used Seq-Gen to sequence alignments of length 20,000, which is roughly average for RNA viruses (Rambaut and Grass 1997; Sanjuán et al. 2010). We set an Jukes–Cantor model with evolutionary rate set to either or subs/site/time using Seq-Gen (Rambaut and Grass 1997). Our choice of evolutionary rates allows us to compare the effects of higher and lower sequence information with the former corresponding to about 20 substitutions per infection, and 0.2 for the latter. The above resulted in 600 alignments to test in the four treatments described above. We analyzed each under a birth–death model using BEAST v2.6 Bouckaert et al. (2019) with a prior for and all other parameters set to the true value for simplicity and to disentangle any impacts of parameter nonidentifiability (Louca et al. 2021).
Empirical Data
All empirical analyses were conducted using BEAST v2.6.6 (Bouckaert et al. 2019) and MCMC chains were run until all parameters of interest had ESS above 200 following burnin. The only exception is the marginal prior for the H1N1 data with a fixed clock. Achieving for required a prohibitively long chain. The lack of convergence in could be attributed to an inconsistent signal for branching to inform . However, this outcome is reflective of our experience that analyses under the no-data, sequence-only, and marginal-prior data treatments are sometimes nonconvergent. In these cases, we assert that the value of this work persists in the practical guidelines elucidated (table 3). Nevertheless, we found that convergent analyses required computation time similar to that for complete data, which naturally varies with the computational platform used. Analyses using date-only data are usually faster since they do not require calculation of the phylogenetic likelihood. Sequence-only and marginal-prior analyses are usually slower, which is consistent with prior knowledge that dates usually drive inference under the birth–death.
SARS-CoV-2
We analyzed two similar SARS-CoV-2 data sets taken from Lane et al. (2021). They consisted of 112 and 188 samples, respectively.
We analyzed each data set under the four conditions above. In each, we placed a prior on and an prior on the becoming-uninfectious rate () following estimates of the duration of infection () in Lauer et al. (2020). We also fixed the sampling proportion to since every known Victorian SARS-CoV-2 case was sequenced at this stage of the pandemic, with a roughly 20% sequencing failure rate. We also placed an prior on the origin, corresponding to a lag of up to 1 week between the index case and the first putative transmission event.
H1N1
We analyzed North American H1N1 swine flu samples taken from the 2009 swine flu pandemic studied by Hedge et al. (2013). We fit a birth–death model with two intervals for , before and after July 07, 2020. This date sits in a breakpoint in sampling intensity and was chosen to separate transmission dynamics between the early and late stages of the sampling timespan.
We fixed the becoming infectious rate to 91, corresponding to a duration of infection of 4 days. We also placed a prior on the sampling proportion and used an HKY with four gamma categories. We compared the effect of two prior configurations for the evolutionary rate, using either a fixed rate a subs/site/year or a uniform prior (). All other priors were left as defaults.
Supplementary Material
Acknowledgments
L.A.F. is grateful to Swissnex for awarding a Student Research Scholarship to foster this research. S.D. and L.A.F. has received funding from the Australian Research Council (DE190100805), the Australian Medical Research Futures Fund (MRF9200006), and the National Health and Medical Research Council (APP1157586). This research was supported by The University of Melbourne’s Research Computing Services and the Petascale Campus Initiative. The authors are also grateful to Art Poon and one other anonymous reviewer for feedback that significantly improved the manuscript.
Contributor Information
Leo A Featherstone, Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, University of Melbourne, Melbourne, VIC, Australia.
Sebastian Duchene, Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, University of Melbourne, Melbourne, VIC, Australia.
Timothy G Vaughan, Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland.
Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.
References
- Baele G, Dellicour S, Suchard MA, Lemey P, Vrancken B. 2018. Recent advances in computational phylodynamics. Curr Opin Virol. 31:24–32. [DOI] [PubMed] [Google Scholar]
- Biek R, Pybus OG, Lloyd-Smith JO, Didelot X. 2015. Measurably evolving pathogens in the genomic era. Trends Ecol Evol. 30(6):306–313. doi: 10.1016/j.tree.2015.03.009 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Boskova V, Stadler T, Magnus C. 2018. The influence of phylodynamic model specifications on parameter estimates of the Zika virus epidemic. Virus Evol. 4(1):vex044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bouckaert R, Vaughan TG, Barido-Sottani J, Duchêne S, Fourment M, Gavryushkina A, Heled J, Jones G, Kühnert D, Maio ND, et al. 2019. BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLoS Comput Biol. 15(4):e1006650. doi: 10.1371/journal.pcbi.1006650 [DOI] [PMC free article] [PubMed] [Google Scholar]
- du Plessis L, Stadler T. 2015. Getting to the root of epidemic spread with phylodynamic analysis of genomic data. Trends Microbiol. 23(7):383–386. [DOI] [PubMed] [Google Scholar]
- Duchene S, Featherstone L, Haritopoulou-Sinanidou M, Rambaut A, Lemey P, Baele G. 2020. Temporal signal and the phylodynamic threshold of SARS-CoV-2. Virus Evol. 6:veaa061. doi: 10.1093/ve/veaa061 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Featherstone LA, Di Giallonardo F, Holmes EC, Vaughan TG, Duchêne S. 2021. Infectious disease phylodynamics with occurrence data. Methods Ecol Evol. 12(8):1498–1507. doi: 10.1111/2041-210X.13620 [DOI] [Google Scholar]
- Featherstone LA, Zhang JM, Vaughan TG, Duchene S. 2022. Epidemiological inference from pathogen genomes: a review of phylodynamic models and applications. Virus Evol. 8(1):veac045. doi: 10.1093/ve/veac045 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hedge J, Lycett SJ, Rambaut A. 2013. Real-time characterization of the molecular epidemiology of an influenza pandemic. Biol Lett. 9(5):20130331. doi: 10.1098/rsbl.2013.0331 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hill V, Ruis C, Bajaj S, Pybus OG, Kraemer MU. 2021. Progress and challenges in virus genomic epidemiology. Trends Parasitol. 37(12):1038–1049. [DOI] [PubMed] [Google Scholar]
- Kingman J. 1977. A note on multidimensional models of neutral mutation. Theor Popul Biol. 11(3):285–290. [DOI] [PubMed] [Google Scholar]
- Lane CR, Sherry NL, Porter AF, Duchene S, Horan K, Andersson P, Wilmot M, Turner A, Dougall S, Johnson SA, et al. 2021. Genomics-informed responses in the elimination of COVID-19 in Victoria, Australia: an observational, genomic epidemiological study. Lancet Public Health. 6(8):e547–e556. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lauer SA, Grantz KH, Bi Q, Jones FK, Zheng Q, Meredith HR, Azman AS, Azman AS, Reich NG, Lessler J. 2020. The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application. Ann Intern Med. 172(9):577–582. doi: 10.7326/M20-0504 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Louca S, McLaughlin A, MacPherson A, Joy JB, Pennell MW. 2021. Fundamental identifiability limits in molecular epidemiology. Mol Biol Evol. 38(9):4010–4024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Porter AF, Sherry N, Andersson P, Johnson SA, Duchene S, Howden BP. 2022. New rules for genomics-informed COVID-19 responses–lessons learned from the first waves of the omicron variant in Australia. PLoS Genet. 18(10):e1010415. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rambaut A, Grass NC. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Bioinformatics. 13(3):235–238. doi: 10.1093/bioinformatics/13.3.235 [DOI] [PubMed] [Google Scholar]
- Sanjuán R, Nebot MR, Chirico N, Mansky LM, Belshaw R. 2010. Viral mutation rates. J Virol. 84(19):9733–9748. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schuhmacher D, Bähre B, Gottschlich C, Hartmann V, Heinemann F, Schmitzer B. 2020. transport: computation of optimal transport plans and Wasserstein distances. Available from: https://cran.r-project.org/package=transport.
- Smith MR. 2020. Information theoretic generalized Robinson–Foulds metrics for comparing phylogenetic trees. Bioinformatics. 36(20):5007–5013. doi: 10.1093/bioinformatics/btaa614 [DOI] [PubMed] [Google Scholar]
- Stadler T. 2019. TreeSim: simulating phylogenetic trees. R package version 2.4. Available from: https://CRAN.R-project.org/package=TreeSim
- Stadler T. 2010. Sampling-through-time in birth–death trees. J Theor Biol. 267(3):396–404. doi: 10.1016/j.jtbi.2010.09.010 [DOI] [PubMed] [Google Scholar]
- Stadler T, Kouyos R, von Wyl V, Yerly S, Böni J, Bürgisser P, Klimkait T, Joos B, Rieder P, Xie D, et al. 2012. Estimating the basic reproductive number from viral sequence data. Mol Biol Evol. 29(1):347–357. [DOI] [PubMed] [Google Scholar]
- Volz EM, Frost SDW. 2014. Sampling through time and phylodynamic inference with coalescent and birth–death models. J R Soc Interface. 11(101):20140945. doi: 10.1098/rsif.2014.0945 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Volz EM, Koelle K, Bedford T. 2013. Viral phylodynamics. PLoS Comput Biol. 9(3):e1002947. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.




