Abstract
Multi-type birth–death (MTBD) models are phylodynamic analogies of compartmental models in classical epidemiology. They serve to infer such epidemiological parameters as the average number of secondary infections $R_e$ and the infectious time from a phylogenetic tree (a genealogy of pathogen sequences). The representatives of this model family focus on various aspects of pathogen epidemics. For instance, the birth–death exposed-infectious (BDEI) model describes the transmission of pathogens featuring an incubation period (when there is a delay between the moment of infection and becoming infectious, as for Ebola and SARS-CoV-2), and permits its estimation along with other parameters. With constantly growing sequencing data, MTBD models should be extremely useful for unravelling information on pathogen epidemics. However, existing implementations of these models in a phylodynamic framework have not yet caught up with the sequencing speed. Computing time and numerical instability issues limit their applicability to medium data sets (≤ 500 samples), while the accuracy of estimations should increase with more data.
We propose a new highly parallelizable formulation of ordinary differential equations for MTBD models. We also extend them to forests to represent situations when a (sub-)epidemic started from several cases (e.g., multiple introductions to a country). We implemented it for the BDEI model in a maximum likelihood framework using a combination of numerical analysis methods for efficient equation resolution. Our implementation estimates epidemiological parameter values and their confidence intervals in two minutes on a phylogenetic tree of 10,000 samples. Comparison to the existing implementations on simulated data shows that it is not only much faster but also more accurate. An application of our tool to the 2014 Ebola epidemic in Sierra-Leone is also convincing, with very fast calculation and precise estimates. As MTBD models are closely related to Cladogenetic State Speciation and Extinction (ClaSSE)-like models, our findings could also be easily transferred to the macroevolution domain.
Keywords: Birth–death model, Ebola, epidemiology, mathematical modelling, ordinary differential equations, phylodynamics
The interaction of epidemiological and evolutionary processes leaves a footprint in pathogen genomes. Phylodynamics leverages this footprint to estimate epidemiological parameters, such as the average number of secondary infections, $R_e$ (Grenfell et al., 2004; Volz et al., 2013). It relies on models that bridge the gap between traditional epidemiology and sequence data. Under these models, the parameter inference is drawn from topology and branch lengths of pathogen phylogenetic trees (i.e., genealogies of the pathogen population, approximating the transmission trees) combined with metadata on the samples. This is particularly useful for emerging epidemics, for which not enough data (e.g., incidence curves) might yet be gathered for accurate estimations with classical epidemiological methods. Rapidly growing genetic data coupled with phylodynamic estimations can provide valuable insights at an early stage of the epidemic spread and help prevent it (e.g., accurate estimation of $R_e$ is crucial for adjusting potential non-pharmaceutical interventions, such as lockdowns).
Phylodynamic models can be classified into two main families: coalescent (Volz et al., 2009; Drummond et al., 2005; Pybus et al., 2000) and birth–death (BD) (Kendall, 1948; Maddison et al., 2007; Stadler, 2009, 2010). Coalescent models are often preferred for estimating deterministic population dynamics; however, BD models are better adapted for highly stochastic processes, such as the dynamics of emerging pathogens (Macpherson et al., 2021). In BD models, births represent pathogen transmission events, while deaths correspond to becoming non-infectious (e.g., due to healing, self-isolation, starting a treatment, or death). Models of the BD family are phylodynamic analogies of compartmental models in classical epidemiology (e.g., SIR, Susceptible-Infectious-Recovered; Hethcote, 2000). Many extensions of the classical BD model with incomplete sampling (BDS; Stadler, 2009) were developed over time, including multi-type birth-death (MTBD) models (Stadler and Bonhoeffer, 2013). They add a population structure to the classical birth-death process by allowing for different types of individuals. A particularly useful representative of the MTBD family is the birth–death exposed-infectious (BDEI) model (Stadler et al., 2014). It was designed for pathogens featuring an incubation period between the moments of infection and of becoming infectious, e.g., Ebola and SARS-CoV-2. It is closely related to the SEIR (Susceptible-Exposed-Infectious-Recovered; Hethcote, 2000) model, widely used in classical epidemiology.
In the MTBD framework, the evolution of a transmission tree is described with a system of master differential equations with respect to global time. The model parameters can be estimated with maximum-likelihood (Stadler and Bonhoeffer, 2013) or Bayesian methods (Bouckaert et al., 2019) by exploring the likelihood (or posterior probability) landscape of trees. However, a closed-form solution of the master equations exists only for the initial BDS model (Stadler, 2009), while for its extensions (like the BDEI model and MTBD models in general) the master equations for likelihood calculation need to be resolved with numerical methods. The complexity of the master equations and their boundary conditions (which recursively depend on the tree evolution later in time) makes their numerical resolution challenging and time-consuming (Scire et al., 2022; Voznica et al., 2022).
The trade-off between the complexity of the biological questions a model can address and the computational speed for its parameter estimation is crucial in phylodynamics. On one hand, denser sampling should improve the accuracy of parameter estimations with complex models. On the other hand, denser sampling leads to larger data sets (thousands of samples), while computational issues often limit model applicability to medium or small ones (hundreds of samples). Calculations become time-consuming and numerically challenging (e.g., due to underflow issues) as tree size increases, resulting in numerical instability and inaccuracy (Scire et al., 2022; Voznica et al., 2022). Existing likelihood-based implementations of MTBD models (Stadler and Bonhoeffer, 2013; Bouckaert et al., 2019; Scire et al., 2022) can handle trees of medium size. In Voznica et al. (2022) we proposed PhyloDeep, a likelihood-free deep-learning-based solution to the numerical instability issue. While being very efficient and accurate at the prediction stage, this approach however requires a computationally heavy training stage: Millions of trees covering a wide parameter range (where the real data is expected to fall) need to be simulated for training the deep learning predictor.
In the macroevolution domain, there exist several models that are closely related to the MTBD models. These are the models of the State Speciation and Extinction (SSE) family, in which the births correspond to speciations and the deaths correspond to extinctions. The main difference is that in epidemiological models sampling happens through time, while in the macroevolution ones it usually occurs at present (at the extant species). Important representatives of the SSE model family include the Binary SSE (BiSSE; Maddison et al., 2007) model, which introduced two compartments with a possibility of anagenetic state change between them (i.e., along the tree branches), its extension to any number of states (multiple SSE, MuSSE; FitzJohn et al., 2009), and the cladogenetic SSE (ClaSSE; Goldberg and Igić, 2012) model, which introduced a possibility of cladogenetic state changes (i.e., when one of the offspring may have a state different from its parent’s right after the speciation event). In a recent work, Louca and Pennell (2020) described a general mathematical framework for efficient likelihood calculation of these types of models, based on the “flow”, and implemented it for the MuSSE-like models but not for the ClaSSE-like ones. However, macroevolutionary analogues of the BDEI and general MTBD models belong to the ClaSSE-like family (as they permit a donor and a recipient to be in different states at the moment of transmission), and an efficient parameter estimator for these models on very large trees is currently lacking.
In this study, we introduce a likelihood-based approach that intends to improve the accuracy and reduce the likelihood computation time of MTBD models. We propose a new formulation of the MTBD master equations that (i) removes the recursive dependency between child and parent nodes in the tree, hence permitting their parallel computation, and (ii) avoids numerical issues that could arise from very small boundary condition values. Under our approach, the master equations are resolved in parallel for different tree nodes, and then combined into the tree likelihood. In the general MTBD case the combination step is performed in a computationally light recursive way. However, we identified a subclass of MTBD models (including the BDEI model) whose likelihood formulae can be expressed in a non-recursive manner, thus allowing for even simpler calculations. Additionally, we extend the applicability of MTBD models from single trees to forests. Forests could correspond to multiple introductions of the epidemic to the region of interest, or to a health policy change, which led to a new epidemic stage starting with several cases.
We applied our findings to the BDEI model and implemented its parameter estimator PyBDEI, employing targeted numerical analysis methods for accurate and fast resolution of its equations. We show the accuracy and speed of PyBDEI on simulated data and compare it to the gold standard Bayesian tool BEAST2 (Bouckaert et al., 2019) and the deep-learning-based tool PhyloDeep (Voznica et al., 2022). We find that our approach outperforms the competitors and makes the BDEI model applicable to very large data sets. Lastly, we apply PyBDEI to infer the epidemiological parameters that shaped the Ebola epidemic in Sierra-Leone in 2014. Our estimator is freely available at https://github.com/evolbioinfo/bdei.
MTBD models and their special case, the BDEI model
In a pathogen transmission tree (approximated by a time-scaled pathogen phylogeny) the tips represent sampled pathogens, patient state transitions occur along the branches, and bifurcations (i.e., internal nodes) correspond to transmissions (Fig. 1). The tree branch lengths are measured in units of time, where $T$ is the time that passed between the beginning of the (sub-)epidemic ($t = 0$, corresponding to the time of the root in Fig. 1) and the last sampled tip.
Figure 1.

A transmission tree with external nodes (i.e., tips, which correspond to sampling events), internal nodes (which correspond to transmissions, including the root) and branches (plus the root branch of zero length). Time starts at the beginning of the (sub-)epidemic (here represented by the root of the tree, $t = 0$) and goes till the last sampled tip. The times of the nodes are shown on the left. $T$ corresponds to the end of the sampling period (when the most recent tip was sampled).
The basic BDS model (Stadler, 2009) has only one state: $I$, infectious. An individual in state $I$ can transmit their pathogen to another individual (whose state will also be $I$) at a constant average rate $\lambda$, or stop being infectious at a constant average rate $\psi$ (due to treatment start, healing, self-isolation or death). After stopping being infectious, the individual and their pathogen exit the study, at which point the pathogen might get sampled with a probability $\rho$. The sampling is incomplete: an infectious individual may be removed from the system without being sampled (i.e., unobserved in the transmission tree), for example due to healing. The BDS model permits inference of such important epidemiological parameters as
effective reproduction number $R_e = \lambda / \psi$, expected number of individuals directly infected by an infectious case;
infectious time $d_I = 1 / \psi$, time during which an infectious individual can further spread the epidemic.
The BDS model is asymptotically unidentifiable (see Remark 3.4 by Stadler, 2009), and requires one of its parameters to be fixed in order to become identifiable.
MTBD models (Stadler and Bonhoeffer, 2013) add population structure by allowing different individual states, transmissions between them and state changes. A general MTBD model with $m$ individual states has the following parameters: an individual in state $k$ ($1 \le k \le m$) can be removed at a constant average rate $\psi_k$ (with pathogen sampling probability $\rho_k$), change their state to state $l$ ($1 \le l \le m$) at a constant average rate $\mu_{kl}$ (where $\mu_{kk} = 0$), and transmit their pathogen to an individual in state $l$ at a constant average rate $\lambda_{kl}$. The time between events of the same type is hence modelled with an exponential distribution.
The BDEI model (Stadler et al., 2014) (Fig. 2), for example, is a special case of MTBD models that adds a second possible state $E$ to the state $I$: exposed, an individual who is already infected but not yet infectious (cannot transmit), and will eventually become infectious.
Figure 2.

The BDEI model. An individual in exposed state $E$ becomes infectious at a rate $\mu$. An infectious individual $I$ transmits the pathogen at a rate $\lambda$ (hence creating a new exposed individual $E$), and gets removed at a rate $\psi$ (decreasing the number of infectious individuals $I$). Upon removal, the individual’s pathogen might be observed with a probability $\rho$. Note that the BDEI model does not include a susceptible state (unlike, for example, the SEIR model) and makes the assumption that the susceptible population is unlimited (as, for example, in the beginning of an epidemic, or when the removed individuals could get reinfected).
Under the BDEI model, the only allowed transmissions are from $I$ to $E$: At the moment of a transmission, the transmitter is always in state $I$, while the recipient is in state $E$. Hence the only non-zero transmission rate is $\lambda_{IE} = \lambda$, while the other 3 transmission rates are trivial: $\lambda_{EE} = \lambda_{EI} = \lambda_{II} = 0$. As we typically do not have the information to distinguish a transmitter from a recipient in a phylogenetic tree (which approximates the transmission tree), we have to consider both possibilities during parameter estimation. For the MTBD models where multiple states can transmit, the number of possibilities increases combinatorially.
Under the BDEI model, we assume that only the individuals in state $I$ can exit the study (at rate $\psi$) and be detected and sampled (with a probability $\rho$): For instance, for many pathogens with an incubation period, the detection is triggered by the onset of symptoms, which in turn happens in the infectious state. Hence all the BDEI tree tips are in state $I$, and $\psi_E = 0$.
In addition to $R_e$ and infectious time, the BDEI model permits inference of a third epidemiological parameter:
incubation period $d_E = 1 / \mu$, time between the infection and becoming infectious.
The incubation period can be expressed via the becoming-infectious rate $\mu$, corresponding to a state transition from $E$ to $I$. The inverse state change is not allowed: $\mu_{IE} = 0$.
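The relation between the BDEI rates and the three epidemiological parameters can be illustrated with a minimal Python sketch. It assumes the usual birth–death parameterization, in which the effective reproduction number is the transmission rate divided by the removal rate, and the infectious time and incubation period are the inverses of the removal and becoming-infectious rates. The names `BDEIRates`, `mu`, `la`, `psi`, `rho` and `epidemiological_parameters` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BDEIRates:
    """Hypothetical container for the four BDEI parameters."""
    mu: float   # becoming-infectious rate (state change E -> I)
    la: float   # transmission rate (an I individual creates a new E individual)
    psi: float  # removal rate of infectious individuals
    rho: float  # probability that a removed individual's pathogen is sampled


def epidemiological_parameters(rates: BDEIRates) -> dict:
    """Convert rates to epidemiological parameters (assumed standard relations)."""
    return {
        "R_e": rates.la / rates.psi,          # expected number of secondary infections
        "infectious_time": 1.0 / rates.psi,   # mean time spent in state I
        "incubation_period": 1.0 / rates.mu,  # mean time spent in state E
    }


example = epidemiological_parameters(BDEIRates(mu=0.5, la=1.5, psi=0.5, rho=0.3))
```

In this sketch the sampling probability does not enter any of the three derived quantities; it only affects how much of the transmission tree is observed.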
MTBD models, as extensions of the BDS model, are asymptotically unidentifiable and require one of their parameters to be fixed in order to become identifiable. In practice, it is often the sampling probability, as it may be approximated from epidemiological data (e.g., the proportion of sampled cases among the declared ones) or the infectious time (estimated from observations of infected cases).
Master Equations
In the standard MTBD master equations proposed by Stadler and Bonhoeffer (2013), time goes backward from the last sampling event (the most recent tip in the tree) till the beginning of the epidemic. These equations permit calculation of the likelihood density of the data (observed tree) given the model parameter values. In the general MTBD case, the observed tree, reconstructed from sampled pathogen genomes, differs from the real transmission tree: the states of its internal nodes (corresponding to transmissions) are unknown, we cannot distinguish between the transmitter and the recipient branches, the moments of state changes are also unknown, and due to incomplete sampling some parts of the real transmission tree are unobserved in the reconstructed tree. We therefore need to integrate over all possibilities while calculating the likelihood. The BDEI model is a slightly simpler case as only infectious individuals can transmit or get sampled, hence all the node states are known ($I$). For the BDEI model .
In System (1) we show the MTBD master equations for a model with $m$ states (for the BDEI model $m = 2$); here, however, we present them with time going forward from the time of the epidemic start ($t = 0$, i.e., tree root) to the time of the last sampled tip ($T$). These equations describe the likelihood density functions (LDFs) of evolving as in the reconstructed tree, starting at time $t$ in state $k$ (for the BDEI model $k \in \{E, I\}$) on a branch connecting a node $i$ to its parent, and till the end of the sampling period. The boundary condition is defined at time $t_i$ (i.e., at the node $i$). To account for incomplete sampling, the system also includes the probabilities of evolving unobserved till the time $T$, starting at time $t$ in state $k$.
From now on, we use the following notation: the id of the root node is $0$; the ids of its children are $1$ and $2$; and, by extension, the children of a node $i$ (if they exist) have ids $2i + 1$ and $2i + 2$ (as in Fig. 1). Additionally, if the state of a node $i$ is known, then we will name it $s_i$. (For example, for the BDEI model $s_i = I$.)
| (1) |
Tree likelihood density
The likelihood density of a tree for given parameter values is then calculated as the LDF at time $t = 0$ on the root, whose id is $0$ and whose state is $s_0$:
| (2) |
So far, we assumed that the epidemic started directly with the first transmission; however, we can relax this assumption. The root of the tree in Fig. 1 is placed at $t = 0$, and does not have a branch (its length is zero). Allowing for a non-zero root branch corresponds to an epidemic start some time before the first transmission. This implies that the state of the individual represented by the root branch at time $t = 0$ is unknown, and all possible states should be considered (Eq. (3)). The same formula applies to cases where the root state is unknown. Assuming that the relative number of individuals in each state is at equilibrium, we can calculate the weight of each possible state (derived by Stadler et al., 2013 for the general MTBD case and in Equation (16) for the BDEI model).
| (3) |
The root’s LDF recursively depends on the LDFs of the child node branches via the boundary condition (System 1), and hence is calculated with a pruning algorithm (Felsenstein, 1973) while climbing the tree from the tips to the root. Therefore, even when parallelized to the maximum, it still requires $h$ consecutive steps, where $h$ stands for the height of the tree and depends on its topology: for a tree with $n$ tips, $\log_2 n$ (balanced tree) $\le h \le n - 1$ (ladder-like tree). At each step System (1) needs to be resolved for the corresponding nodes. Moreover, the values of the LDFs at internal nodes and their boundary conditions progressively become smaller as the calculation moves toward the root, due to successive additions and multiplications of the LDF values. In trees with many tips, this might lead to numerical underflow, and hence such measures as rescaling need to be taken (Berger and Stamatakis, 2009; Defour, 2010; Scire et al., 2022).
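The dependence of the number of consecutive pruning steps on the tree topology can be illustrated with a short Python sketch. It is a toy example, not part of the estimator: trees are hypothetical nested tuples in which `None` stands for a tip, and the height is counted in branches from the root to the deepest tip.

```python
def balanced(n_tips):
    """Build a balanced binary tree with n_tips tips (None marks a tip)."""
    if n_tips == 1:
        return None
    left = n_tips // 2
    return (balanced(left), balanced(n_tips - left))


def ladder(n_tips):
    """Build a ladder-like (caterpillar) tree with n_tips tips."""
    tree = None
    for _ in range(n_tips - 1):
        tree = (tree, None)  # each new root has one tip child and the previous tree
    return tree


def height(tree):
    """Number of branches between the root and the deepest tip (iterative)."""
    deepest = 0
    stack = [(tree, 0)]
    while stack:
        node, depth = stack.pop()
        if node is None:
            deepest = max(deepest, depth)
        else:
            stack.append((node[0], depth + 1))
            stack.append((node[1], depth + 1))
    return deepest


balanced_height = height(balanced(1024))
ladder_height = height(ladder(1024))
```

For 1024 tips the balanced tree needs 10 consecutive levels, whereas the ladder-like tree needs 1023, so a recursive scheme gains little from parallelization on unbalanced trees.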
Extension to Forests
In some cases, the assumption that a (sub-)epidemic started with one infected individual might be too constraining. For instance, there could be multiple pathogen introductions to a country of interest (e.g., while in China the SARS-CoV-2 epidemic is commonly assumed to have started with one case, there were multiple independent introductions to other countries (Zhukova et al., 2020)). This scenario is depicted in Fig. 3b. Another example is a change of health policies leading to a change in parameter values (e.g., sampling). Such a change corresponds to a new stage of the epidemic, starting from several infected cases from the previous stage. This scenario is depicted in Fig. 3a. In Bayesian settings, situations in which the system behaviour (and parameters) change over time are modeled via skyline methods. Stadler et al. (2013) developed the one-state Bayesian birth-death skyline plot that divides the time into intervals and allows for different piece-wise constant rates on them. Kühnert et al. (2016) combined the MTBD model with the skyline to allow for both piece-wise constant rate changes over time and multiple individual types. The skyline approach relies on a single tree, but estimates a separate set of parameters for each time interval, all under the same model. As the number of parameters increases with multiple skyline intervals, MTBD-skyline models require more data and computational time for their accurate estimation, and are more prone to numerical instability than the classical MTBD models.
Figure 3.
Forest representing a (sub-)epidemic that started with multiple infected cases. a) The observed forest trees, corresponding to three different initial infected cases, are shown with solid lines. All the forest trees start at the same time ($t = 0$). This scenario can correspond to a change of health policies leading to a change in parameter values (e.g., sampling). Such a change corresponds to a new stage of the epidemic, starting from several infected cases from the previous stage (shown with dashed lines). b) The observed forest trees, corresponding to two different initial infected cases, are shown with solid lines. They start at different times. This scenario can correspond to multiple introductions to the same country from other countries (shown with dashed lines).
We propose a simpler alternative, where the (sub-)epidemic starts with multiple individuals (not necessarily at the same time) and leads to a forest of observed trees. The forest might also include a certain number of unobserved trees, i.e., individuals who were infected at the beginning of the (sub-)epidemic, but whose trees stayed unobserved as none of their tips got sampled. This can be incorporated in the likelihood calculation. The forest likelihood formula hence combines the likelihoods of observed and hidden trees, and can be represented in logarithmic form (Eq. (4)). The tree likelihood formula is its special case, with one observed tree and no hidden trees.
| (4) |
In Equation (4) we assumed that all the sub-epidemics in the forest started at the same time ($t = 0$). This condition can be easily relaxed by replacing zeros with the corresponding tree starting times for observed and unobserved tree evolutions. As in practice we do not know the starting times of the unobserved trees if the sub-epidemics could start at different times, we approximate the unobserved tree starting times with the mean of the observed tree starting times:
| (5) |
In our parameter estimator implementation for the BDEI model (see “Efficient parameter and CI estimation for the BDEI model” section), the mean can be replaced with the median, maximum or minimum of the observed tree times, via a user-specified parameter.
For given model parameter values we can estimate the number of hidden trees from the number of observed trees as
| (6) |
Hence, working with forests does not add an additional parameter to likelihood estimation. Moreover, as our simulations show (see “Performance on simulated data and comparison to other tools” section), the number of hidden trees has little impact on the parameter estimation. However, in some cases (e.g., change in health policy and thus of parameter values), it might be better to estimate the number of hidden trees based on external data (e.g., number of cases at $t = 0$), rather than assuming that the parameter values predating the trees in the forest were the same as those in the forest. We explore both approaches in the “Application: Ebola in Sierra-Leone” section.
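One simple way to obtain such an estimate can be sketched in Python. This is an illustration under a stated assumption, not necessarily the exact form of Eq. (6): each initial case is assumed to leave no sampled descendants independently and with the same probability `p_unobserved`, so that the observed trees are the "successes" of a Bernoulli process. The function name and its arguments are hypothetical.

```python
def expected_hidden_trees(n_observed, p_unobserved):
    """Expected number of hidden trees given the number of observed ones.

    Assumption: every initial case stays unobserved independently with
    probability p_unobserved, so hidden/observed = p / (1 - p) on average.
    """
    if not 0.0 <= p_unobserved < 1.0:
        raise ValueError("p_unobserved must lie in [0, 1)")
    return n_observed * p_unobserved / (1.0 - p_unobserved)


hidden = expected_hidden_trees(n_observed=30, p_unobserved=0.25)
```

With 30 observed trees and a 25% chance for a tree to stay unobserved, about 10 hidden trees are expected. Because `p_unobserved` itself follows from the model parameters, no extra free parameter is introduced under this assumption.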
Using forests permits estimation of the model parameters on the last skyline interval without the restriction that the epidemic followed the same model before this interval (i.e., the top part of the tree, which includes the common ancestors of the forest roots, Fig. 3a). It reduces the number of parameters to those of the last interval. It also permits estimation of parameters for a (sub-)epidemic that started with several individuals but not at the same time (e.g., due to multiple introductions to a country, Fig. 3b).
Avoiding numerical problems and parallelizing calculations
In this section we introduce a way to rewrite LDFs in System (1) that permits (i) obtaining simpler boundary conditions to avoid potential numerical issues during resolution of equations; and (ii) removing recursion and resolving equations for each tree node in parallel, hence speeding up the calculations.
System (1) has several properties. First of all, its subsystem that defines unobserved probabilities () is self-defined, and hence can be calculated independently from the rest. Secondly, in the subsystem that defines observed LDFs () the right-hand side of the differential equations is a sum where each element is linear with respect to one of the (), and this sum does not contain any free term. This condition implies that if we rescale by a common factor, the differential equations will not change. Moreover, if the boundary conditions for all states but one are zero (), the rescaling will change the boundary condition only for . The latter is the case when tree node states are known, for example for the BDEI model, under which all nodes are in state .
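The independence of the unobserved-probability subsystem can be illustrated with a small Python sketch for the BDEI case. The equations used here are an assumption: they are the forward-time form of the standard BDEI unobserved-probability equations, written for hypothetical names `u_e` and `u_i` (probabilities of leaving no sampled descendants when starting in the exposed or infectious state), with both equal to 1 at the end of the sampling period. A fixed-step Runge–Kutta scheme stands in for whatever solver an actual implementation would use.

```python
def unobserved_rhs(u, mu, la, psi, rho):
    """Assumed forward-time right-hand side for (u_e, u_i) under the BDEI model."""
    u_e, u_i = u
    du_e = mu * (u_e - u_i)
    du_i = (la + psi) * u_i - psi * (1.0 - rho) - la * u_e * u_i
    return (du_e, du_i)


def solve_unobserved(T, mu, la, psi, rho, n_steps=10_000):
    """Integrate from t = T (where u_e = u_i = 1) back to t = 0 with classical RK4."""
    h = -T / n_steps  # negative step: we move from T towards 0
    u = (1.0, 1.0)
    args = (mu, la, psi, rho)

    def shifted(state, slope, factor):
        return tuple(s + factor * k for s, k in zip(state, slope))

    for _ in range(n_steps):
        k1 = unobserved_rhs(u, *args)
        k2 = unobserved_rhs(shifted(u, k1, h / 2), *args)
        k3 = unobserved_rhs(shifted(u, k2, h / 2), *args)
        k4 = unobserved_rhs(shifted(u, k3, h), *args)
        u = tuple(
            s + h / 6 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(u, k1, k2, k3, k4)
        )
    return u  # (u_e, u_i) at time 0


u_e0, u_i0 = solve_unobserved(T=2.0, mu=1.0, la=1.5, psi=0.5, rho=0.5)
```

No tree information enters this computation, so it can be done once per parameter set and reused for every branch. As a sanity check of the assumed equations, setting `rho=0` (nothing is ever sampled) keeps both probabilities at 1.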
Assuming the state $s_i$ of the node $i$ is known, let us define the rescaled LDFs as
| (7) |
Then the differential equations for the rescaled LDFs will only differ from those for the original LDFs in the boundary condition for the known node state (the other boundary conditions stay zero). Conceptually, a rescaled LDF is the probability of an individual evolving as on an observed branch that connects a node $i$ to its parent, starting at time $t$ in state $k$ on this branch and finishing at the node time $t_i$ in the known node state (without taking into account $i$’s subtree and the event at node $i$).
Solving the master equations for the rescaled LDFs instead of the original ones permits us to both (i) remove the recursive dependency between child and parent nodes (during ordinary differential equation (ODE) resolution); and (ii) avoid numerical issues that could arise from very small values of the boundary condition of the original LDFs, which is particularly pertinent for large trees. The calculation of the rescaled LDFs can be done in parallel for each node $i$. Hence, when parallelized to the maximum, the computing time to resolve all these master equations becomes constant (i.e., the problem is “embarrassingly parallelizable” in computer science terminology).
In the general case, where the tree node states are unknown, all the possibilities () for each node can be considered separately and in parallel. Calculating for all the possible combinations of corresponds to the flow matrix calculation in the general formulation recently proposed by Louca and Pennell (2020).
In the case where all node states are known (e.g., from metadata or due to the model itself, as for the BDEI case), using the rescaled LDFs instead of the original ones also permits us to express the tree likelihood in a non-recursive way, and easily transform it to a logarithmic form (Eq. (8)) (to avoid underflow issues while multiplying small numbers at the likelihood combination step). In Materials and Methods we show its equivalence to the recursive representation (Eq. (2)).
| (8) |
For the general case, the combination of different internal node state configurations into the tree likelihood formula needs to be performed with a pruning algorithm (Felsenstein, 1973). The likelihood-combining tree traversal starts from the tips and climbs the tree till the root, while calculating a subtree LDF for each visited node and each possible state:
| (9) |
Note that unlike the known-tree-node-state likelihood (Eq. (8)), the recursive unknown-tree-node-state likelihood (Eq. (9)) does not allow for an easy logarithmic representation, and hence is prone to underflow issues. Its calculation on large trees therefore requires additional small-number rescaling techniques, as recently described by Scire et al. (2022) and common in phylogenetic inference. However, unlike in the original MTBD representation, master equation resolutions can be performed independently, in parallel, and avoiding underflow issues for boundary conditions.
Overall, the PDF reconditioning technique can be applied to any model of the MTBD family, and facilitates its parameter estimation by separating master equation resolution (non-recursive and parallelizable) from likelihood calculation (recursive, but negligible in time cost compared to equation resolution). Recursive likelihood calculation can be performed with a standard pruning algorithm and rescaling techniques to control for potential underflow. For parameter estimation on trees with known node states (e.g., from metadata, or because they were generated by an MTBD process in which only one state can transmit or get sampled, like the BDEI model), tree likelihood can be calculated with a non-recursive formula in a logarithmic form (Eq. (8)), fully avoiding underflow.
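The benefit of the logarithmic form can be seen in a minimal Python sketch. It does not reproduce Eq. (8); it only contrasts multiplying many small per-node factors with summing their logarithms. The per-node values are made up for the illustration.

```python
import math


def likelihood_product(factors):
    """Naive product of per-node factors (prone to underflow)."""
    result = 1.0
    for value in factors:
        result *= value
    return result


def log_likelihood(factors):
    """Sum of logarithms of the same factors (stays within floating-point range)."""
    return math.fsum(math.log(value) for value in factors)


node_factors = [1e-5] * 100  # hypothetical per-node contributions
naive = likelihood_product(node_factors)
logged = log_likelihood(node_factors)
```

Here the true product is 1e-500, which is below the smallest positive double, so `naive` collapses to zero, while `logged` remains a finite number (about -1151.3) that an optimizer can still compare across parameter values.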
Efficient Parameter and CI Estimation for the BDEI Model
We applied our theoretical findings to implement a fast and efficient parameter estimator for the BDEI model (which we called PyBDEI). It estimates the BDEI model parameters for a forest of observed trees in the maximum-likelihood framework, where one of the model parameters is fixed (for identifiability reasons). The number of hidden trees can be either given by the user, or estimated from BDEI model parameters (as in Eq. (6)).
Once the optimal parameter values are found, we calculate their confidence intervals (CIs) using Wilks’ method (Wilks, 1938). For each non-fixed parameter , we calculate its -CI as including the values such that , where is the value of chi-squared distribution with 1 degree of freedom corresponding to the significance level of (i.e., ). corresponds to the maximum-likelihood estimates for the other non-fixed parameters when .
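The likelihood-ratio construction of such an interval can be sketched in Python on a toy problem. This is not the BDEI likelihood: as a stand-in, the example uses the log-likelihood of an exponential rate, which has a single parameter, so no re-optimization of other parameters is needed. The constant `CHI2_1DF_95` is the 0.95 quantile of the chi-squared distribution with 1 degree of freedom; all function names are hypothetical.

```python
import math

CHI2_1DF_95 = 3.841458820694124  # 0.95 quantile of chi-squared with 1 d.o.f.


def log_lik(rate, n, total):
    """Toy log-likelihood: n exponential waiting times summing to `total`."""
    return n * math.log(rate) - rate * total


def wilks_ci(n, total, quantile=CHI2_1DF_95):
    """Return (MLE, lower, upper): values whose log-likelihood is within quantile/2 of the maximum."""
    mle = n / total
    threshold = log_lik(mle, n, total) - quantile / 2.0

    def boundary(inside, outside):
        # Bisection between a point inside the interval and one outside it.
        for _ in range(200):
            mid = (inside + outside) / 2.0
            if log_lik(mid, n, total) >= threshold:
                inside = mid
            else:
                outside = mid
        return inside

    return mle, boundary(mle, mle * 1e-6), boundary(mle, mle * 1e6)


mle, lower, upper = wilks_ci(n=100, total=50.0)
```

For 100 observations summing to 50, the estimate is 2.0 and the interval is roughly (1.63, 2.42). In the multi-parameter case the same threshold would be applied to the profile likelihood, with the other non-fixed parameters re-optimized at each candidate value.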
Performance on Simulated Data and Comparison to Other Tools
To assess the performance of our maximum-likelihood estimator PyBDEI, we used the simulated data from Voznica et al. (2022), where we generated 100 medium trees with 200–500 tips under the BDEI model, with the parameter values sampled uniformly at random within the following boundaries: incubation period , , infectious time , sampling probability . These trees were evaluated with the standard Bayesian method BEAST2 (Bouckaert et al., 2019) and the deep learning-based estimator PhyloDeep (detailed configurations are described in Materials and Methods). Additionally 100 large trees (5000–10,000 tips) were generated for the same parameter values, and assessed with PhyloDeep in Voznica et al. (2022). PhyloDeep’s maximal pre-trained tree size is 500 tips; however, for larger trees it estimates BDEI parameters by (i) extracting the largest non-intersecting set of subtrees of sizes covered by the pre-trained set (50–500 tips), (ii) estimating parameters on each of the subtrees independently, and (iii) averaging each parameter’s estimate over the subtrees (weighted by subtree sizes).
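The size-weighted averaging of step (iii) can be written down in a few lines of Python. This is only an illustration of the averaging rule described above, not PhyloDeep code; the function name and the example numbers are hypothetical.

```python
def size_weighted_average(estimates, subtree_sizes):
    """Average per-subtree estimates of one parameter, weighted by subtree size."""
    if len(estimates) != len(subtree_sizes) or not estimates:
        raise ValueError("need one positive size per estimate")
    total_size = sum(subtree_sizes)
    return sum(e * s for e, s in zip(estimates, subtree_sizes)) / total_size


# Two hypothetical subtrees of 100 and 300 tips with estimates 2.0 and 4.0.
combined = size_weighted_average([2.0, 4.0], [100, 300])
```

The larger subtree dominates the result (3.5 rather than the unweighted 3.0), reflecting that estimates from bigger subtrees are expected to be more reliable.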
To evaluate PyBDEI performance on forests, we additionally generated two types of forests for the large data set. The first type of forests was produced by cutting the oldest (i.e., closest to the root) 25% of each full tree, and keeping the forest of bottom-75% subtrees (in terms of time). We hence obtained 100 forests representing sub-epidemics that all started at the same time ($t = 0$, as in Fig. 3a). These forests contained observed trees each, with a total of tips.
The second type of forests represented epidemics that started with multiple introductions happening at different times (as in Fig. 3b). To generate them, we used the parameter values corresponding to each tree in the large dataset (), and the times between the start of the tree and the time of its last sampled tip. Each (potentially hidden) tree in the forest was then generated under parameters till reaching the time (drawn uniformly and independently for each tree). Trees were added to the forest till their total number of sampled tips reached at least : . The resulting forest included those of the trees that contained at least one sampled tip (i.e., observed trees). These forests contained observed trees each, with a total of tips, and hidden trees.
We applied PyBDEI to these data sets, and compared the results to those reported for BEAST2 and PhyloDeep in Voznica et al. (2022). For the large data set, we applied PyBDEI to full trees, but also to the two types of forests.
We calculated the relative error (normalized distance between the estimated and the target values: ) and the relative bias [(] for each parameter on each tree/forest. Average relative errors for PyBDEI were on the medium trees and on the large trees (hence decreasing with the data set size, as expected), and well centered around zero (i.e., unbiased), as shown in Fig. 4. The relative 95%-CI width [] also decreased: from on the medium data set to on the large one. The target values of rates , and were within the estimated CIs in correspondingly 92%, 89%, and 98% of cases on the medium data set, and in 95%, 90%, and 95% of cases on the large one.
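The three summary statistics can be illustrated with a short sketch. The formulas are not legible in the text above, so the definitions below (absolute or signed difference divided by the target value, and CI width divided by the target value) are assumptions based on the verbal descriptions; the function names and the toy numbers are hypothetical.

```python
def relative_error(estimate, target):
    """Assumed definition: |estimate - target| / target."""
    return abs(estimate - target) / target


def relative_bias(estimate, target):
    """Assumed definition: (estimate - target) / target (signed)."""
    return (estimate - target) / target


def relative_ci_width(ci_lower, ci_upper, target):
    """Assumed definition: (upper - lower) / target."""
    return (ci_upper - ci_lower) / target


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Made-up target values and estimates for three trees
targets = [5.0, 10.0, 2.0]
estimates = [5.5, 9.0, 2.0]
cis = [(4.5, 6.5), (8.0, 11.0), (1.5, 2.5)]

mean_error = mean(relative_error(e, t) for e, t in zip(estimates, targets))
mean_bias = mean(relative_bias(e, t) for e, t in zip(estimates, targets))
mean_ci_width = mean(relative_ci_width(lo, hi, t) for (lo, hi), t in zip(cis, targets))
# Share of trees whose target value falls inside the estimated CI
coverage = mean(lo <= t <= hi for (lo, hi), t in zip(cis, targets))
```

An estimator can have a non-zero mean relative error and still be unbiased, as in this toy example, where over- and under-estimates cancel out.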
Figure 4.
Comparison of inference accuracy of different methods on the a) medium (200–500 tips) and b) large (5000–10000 tips) 100-tree data sets. For the medium data set BEAST2 (in orange), PhyloDeep (in green) and our estimator (PyBDEI, in blue) are compared. Two (out of 100) trees of the medium data set on which BEAST2 did not converge after MCMC steps are excluded from the analysis and not shown in the figure. For large trees (more than 500 tips), PhyloDeep extracts the largest non-intersecting set of subtrees of 50–500 tips, estimates parameters on each of the subtrees independently, and averages each parameter’s estimate over the subtrees (weighted by subtree sizes). We assessed our method on full trees (dark-blue), on forests obtained from the full trees by removing the oldest (closest to the root) 25% (in terms of height) of those trees (forests 1, light-blue), and on forests whose trees were generated using varying sampling period durations (forests 2, pink). We show the swarmplots (colored by method) of relative errors for each test tree/forest and parameter, which are measured as the normalized distance between the median a posteriori estimate by BEAST2 or a point estimate by PhyloDeep/PyBDEI and the real value. Average relative error (and in parentheses average relative bias) are displayed for each parameter and method below their swarmplot. The accuracy of the methods is compared by a paired z-test; significant P-values are shown above each method pair; non-significant P-values are not shown.
To assess method accuracy we calculated P-values based on two-sided z-tests for each parameter and method pair. On the medium data set for all the methods performed in a comparable way. For the infectious time, , PyBDEI was at least as accurate as PhyloDeep and more accurate than BEAST2 (P-value ). For the incubation period, , PyBDEI was more accurate than both other methods (see Fig. 4). On the large data set BEAST2 was inapplicable due to computation times (57 CPU hours were already required for each medium-sized tree, on average), while PyBDEI was more accurate than PhyloDeep, both using full trees and the forests of the first type (i.e., where all the trees started at the same time, P-value for all the parameters, see Fig. 4). On the forests of the second type (i.e., trees starting at different times) PyBDEI’s performance was comparable to that of PhyloDeep for all the parameters. While the mean relative errors were low (), PyBDEI performed worse (P-value ) on forests of type 2 than on full trees or forests of type 1 for the infectious period and incubation time. This can be explained by the fact that the starting times of the hidden trees in forests of type 2 were not known and hence needed to be approximated. Moreover, forests of type 2 contained fewer branches (mean 13155) than forests of type 1 (14896) or full trees (14969), and hence less data for parameter inference.
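A two-sided paired z-test of this kind can be sketched as follows. This is an illustration under assumptions, not the analysis script of the study: the function name `paired_z_test` is hypothetical, the per-tree differences are assumed to be approximately normal with the standard error estimated from the sample, and the error values are made up.

```python
import math


def paired_z_test(errors_a, errors_b):
    """Two-sided paired z-test on per-tree differences of relative errors.

    Returns (z, p_value); a negative z means method A has smaller errors.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    z = mean_diff / math.sqrt(variance / n)
    # Two-sided P-value of a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up relative errors of two methods on the same four trees
method_a = [0.10, 0.12, 0.09, 0.11]
method_b = [0.20, 0.25, 0.18, 0.24]
z, p_value = paired_z_test(method_a, method_b)
```

Pairing by tree removes the between-tree variability (some trees are harder for every method), which is why the test is run on differences rather than on the two error samples separately.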
For two trees in the medium data set, BEAST2 did not converge after Markov Chain Monte Carlo (MCMC) steps; we did not include these two data points in the analysis. PyBDEI performed well on these two trees: real parameters were within estimated confidence intervals, relative errors for , relative errors for incubation period and infectious time .
There were also several trees where BEAST2 seems to have converged to a local optimum (estimates with relative errors close to 1). To investigate this hypothesis further, we calculated the tree likelihoods for the real parameter values and those estimated by the three methods (Table 1). Indeed, BEAST2 had a likelihood lower than the one obtained on real values for 8% of trees in the medium dataset (8 out of 98 trees on which BEAST2 converged), which corresponds to a local optimum. These 8 data points correspond to the high BEAST2 relative error values ( for infectious time and for incubation period) shown in Fig. 4. For all the trees of both data sets, PyBDEI estimates had a likelihood higher than or equal to that of any other method and of the real values, suggesting that PyBDEI reaches the global optimum of the likelihood function. On the medium dataset, PyBDEI estimates had a likelihood equal to that of BEAST2 for 85% of trees, and a higher likelihood for the other 15% of trees. Compared to PhyloDeep, PyBDEI estimates had an equal likelihood for 48% of trees, and a higher likelihood for the other 52% of trees. Interestingly, PhyloDeep, while being a likelihood-free method, performed very well on the medium data set: on 81% of trees it estimated parameters with a likelihood higher than or equal to that of the real parameters. On the large data set it performed worse in terms of likelihood, estimating parameters with a likelihood higher than or equal to that of the real parameters on only 29% of trees. This could, however, be explained by the fact that PhyloDeep estimated parameters on each of the smaller (50–500 tip) subtrees selected by its subtree picker procedure, and averaged the result, instead of being retrained on large 5000–10000-tip trees.
Table 1.
Tree likelihood comparison between the real parameter values and those estimated by different methods on medium and large data sets.
| Method (row) | Relation | Real values medium | Real values large | BEAST2 medium | PhyloDeep medium | PhyloDeep large | PyBDEI medium | PyBDEI large |
|---|---|---|---|---|---|---|---|---|
| Real values | > | | | 0.082 | 0.092 | 0.71 | 0 | 0 |
| Real values | = | | | 0.214 | 0.306 | 0.13 | 0.173 | 0.18 |
| Real values | < | | | 0.704 | 0.602 | 0.16 | 0.827 | 0.82 |
| BEAST2 | > | 0.704 | – | | 0.367 | – | 0 | – |
| BEAST2 | = | 0.214 | – | | 0.541 | – | 0.847 | – |
| BEAST2 | < | 0.082 | – | | 0.092 | – | 0.153 | – |
| PhyloDeep | > | 0.602 | 0.16 | 0.092 | | | 0 | 0 |
| PhyloDeep | = | 0.306 | 0.13 | 0.541 | | | 0.520 | 0.06 |
| PhyloDeep | < | 0.092 | 0.71 | 0.367 | | | 0.480 | 0.94 |
| PyBDEI | > | 0.827 | 0.82 | 0.153 | 0.480 | 0.94 | | |
| PyBDEI | = | 0.173 | 0.18 | 0.847 | 0.520 | 0.06 | | |
| PyBDEI | < | 0 | 0 | 0 | 0 | 0 | | |
The value provided in row i, column j indicates the proportion of trees for which the likelihood with parameters estimated by method i is either higher (sub-row >), equal (sub-row =) or lower (sub-row <) than the likelihood estimated by method j. For example, 0.847 in the row “BEAST2 =” and the column “PyBDEI medium” means that the estimates of BEAST2 had an equal likelihood to those of PyBDEI on 84.7% of trees of the medium data set. According to a sign test, the differences between all the pairs of methods are significant (P-value < 0.01).
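For illustration, an exact two-sided sign test can be computed from the counts behind one pair of table cells. This is a sketch under assumptions: ties (equal likelihoods) are discarded, which is the common convention but is not stated in the text, and the function name `sign_test_p_value` is hypothetical. The example uses the PyBDEI vs. PhyloDeep comparison on the 100 medium trees (proportions 0.480 higher, 0.520 equal, 0 lower).

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Exact two-sided sign test; ties must be excluded before calling."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a result at least as extreme under a fair coin, one tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


n_trees = 100
wins = round(0.480 * n_trees)    # PyBDEI likelihood higher than PhyloDeep
losses = round(0.0 * n_trees)    # PyBDEI likelihood lower than PhyloDeep
p_value = sign_test_p_value(wins, losses)
```

With 48 wins and no losses the P-value is far below the 0.01 threshold, in line with the significance reported for this method pair.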
In terms of time, on the medium data set PyBDEI needed on average seconds per tree on 1 CPU, and converged in 864 iterations (including CI calculation). These times cannot be directly compared to BEAST2 times, as BEAST2 performs a Markov Chain Monte Carlo (MCMC) parameter space exploration instead of looking for the optimum (as PyBDEI does), and hence requires many more steps: for MCMC steps it took on average CPU hours. While the number of MCMC steps could probably be reduced for some runs, for two out of 100 trees it was not sufficient for convergence. Implementing likelihood calculation with our new MTBD formulation and targeted numerical analysis methods could be helpful in a Bayesian context as well: comparing time per iteration (which is roughly time per likelihood calculation), our optimizer required CPU seconds, while BEAST2 took one order of magnitude longer: CPU seconds. This is probably due to the fact that BEAST2 uses the general MTBD model formulation, configured for BDEI, while our implementation uses a BDEI-tailored implementation. Moreover, the ODE reconditioning makes it possible to avoid underflow errors (and rescaling efforts during tree pruning (Scire et al., 2022)), which could also play a role. Using several CPUs would allow for an even larger gain (see the results on the large data set below). PhyloDeep took CPU seconds per tree, which is faster than our method’s time but does not include the training time of deep learning predictors (hundreds of hours). To our knowledge, the only other available maximum-likelihood estimator for BDEI is implemented in the TreePar package (Stadler and Bonhoeffer, 2013). However, as it suffers from underflow issues for BDEI already on trees of small size, its developers suggest using BEAST2 instead (private communication).
The average time of PyBDEI convergence on the large data set was 2 min 28 s on 1 CPU, and required 960 iterations. Parallelization on 2 CPUs reduced it to 1 min 24 s (1.8 times faster). The speed-up is close to the number of cores, which shows the efficiency of parallelization of master equation resolution despite the pre- and post-processing steps (tree reading, distribution of jobs between the threads, combining their results), which are always performed on one CPU. This suggests that our estimator will be easily applicable to much larger trees.
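The reported speed-up follows directly from the two timings. The sketch below also derives a rough serial fraction under Amdahl's law; that part is an illustrative assumption rather than an analysis reported here, and the helper names are hypothetical.

```python
def to_seconds(minutes, seconds):
    return 60 * minutes + seconds


time_1_cpu = to_seconds(2, 28)   # 148 s on 1 CPU
time_2_cpu = to_seconds(1, 24)   # 84 s on 2 CPUs
n_cpus = 2

speedup = time_1_cpu / time_2_cpu      # about 1.76, reported as 1.8
efficiency = speedup / n_cpus          # share of the ideal 2x speed-up

# Amdahl's law (assumption): speedup = 1 / (s + (1 - s) / p),
# solved for the serial fraction s given p CPUs.
serial_fraction = (n_cpus / speedup - 1) / (n_cpus - 1)
```

Under this simple model, roughly 13 to 14% of the single-CPU run time would be spent in steps that do not parallelize, such as tree reading and result combination.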
As the BDEI model requires one of the parameters to be fixed in order to become asymptotically identifiable (Stadler, 2009), we fixed to the real value, both in Voznica et al. (2022) and in the comparison described above. However, to assess PyBDEI performance with other parameters fixed, we estimated parameters for trees in the large data set under three additional settings: with (1) , (2) , or (3) fixed to its real value. The results are shown in Fig. S1 (online Appendix, available at https://doi.org/10.5061/dryad.r7sqv9sgx). Average relative errors were for all parameters when was fixed to the real value, when was fixed, when or were fixed. For estimates of we calculated absolute errors instead of relative ones: their average was for fixed or , and for fixed . Hence, the estimations can be successfully performed with any of the parameters being fixed, but fixing or might be particularly useful. Moreover, these parameters are relatively easy to estimate with real data (e.g., patient observations for , and proportion of sampled cases among the declared ones for ).
Finally, we assessed the impact of the number of hidden trees on the parameter estimation. We estimated parameters with , and being estimated from model parameters on the two types of forests of the large dataset. For forests of type 2 (where trees started at different times), we additionally compared estimations using minimum, mean, median and maximum of observed tree-specific times (see Eq. (5)). Note that the minimum time and represent the two extremes, as the probability of a tree to stay fully unobserved decreases with time. The relative errors for different parameters are shown in Fig. S2 (online Appendix). While estimating using mean, median or maximum times seems to have a slightly smaller relative error for (3% vs. 4% for minimum time or ), these differences are non-significant. Overall, seems to have little impact on parameter estimation.
The assessment of our maximum-likelihood estimator on simulated data shows that it opens new possibilities for fast and accurate analyses of extremely large data sets, while being flexible with respect to parameter settings (e.g., the parameter to be fixed). The use of forests makes it possible to focus on a specific part of a large tree, e.g., the most recent period, the subtrees corresponding to a given region or country, or the origin of the epidemic.
Application: Ebola in Sierra-Leone
Using PyBDEI, we analysed the 2014 Ebola epidemic in Sierra-Leone (SLE). Ebola virus features an incubation period (reported by the World Health Organisation (WHO) to take between 2 and 21 d (WHO, 2021)). Using statistical methods based on time series of reported Ebola cases, the incubation period of Ebola during the 2014 SLE epidemic was previously estimated to be around 10–11 d, the infectious time around 4–5 d, and the reproduction number to decrease from around 2 in the beginning of the epidemic to values close to 1 by late 2014 (due to control measures) (Team, 2014, 2015; Rivers et al., 2014; see also Van Kerkhove et al., 2015 for a review).
Sequence data could improve and complement these estimates. However, the existing phylodynamic study of these parameters was limited by the data set size: Stadler and Bonhoeffer (2013) applied the BDEI model to the early spread of Ebola in SLE by analysing 72 Ebola samples from late May to mid June 2014 (sequences from Gire et al. (2014)). They estimated the expected length of the incubation period to be 4.9 d (median; 95% HPD 2.1–23.2), and the infectious time of 2.6 d (median; 95% HPD 1.2–7).
To show the power of phylodynamic analyses on larger data sets, we took the 1610-sequence alignment and metadata (sampling times and countries) that were used in the study by Dudas et al. (2017), who analysed the factors that spread the 2014–2016 Ebola epidemic in West Africa. Using these data, we reconstructed a time-tree of the Ebola epidemic in West Africa, which we then used to extract a forest of time-subtrees representing the Ebola epidemic in SLE between July 30, 2014 (when the SLE government began to deploy troops to enforce quarantines (News24, 2014)) and September 7, 2015 (the last SLE sample in the data set). This was done to obtain a forest of subtrees with a homogeneous health policy (after July 30). The details on the forest reconstruction are given in Materials and Methods. To check for robustness of the estimates, forest reconstruction was performed 10 times, obtaining slightly different forests.
We estimated the BDEI parameters on these 10 forests. As the BDEI model requires one of the parameters to be fixed for identifiability (Stadler, 2009), we performed the estimations fixing the sampling probability . We estimated as the proportion of cases represented by our forests (853–854) with respect to the total number of SLE Ebola cases reported by the Centers for Disease Control and Prevention (CDC, 2020) between September 8, 2015 (the closest date to the last SLE sample in our data set, 13683 cases) and July 31, 2014 (the day following the quarantine measures start, 533 cases): (calculated independently for each forest). To check the robustness of the predictions with respect to this estimation of , we additionally estimated the parameter values assuming 20% more (15780, ) and 20% less (10520, ) total cases. For each of these settings we performed three estimations: (1) with the number of unobserved trees being estimated, (2) with it being fixed (via setting the parameter ) to the difference between the total number of SLE Ebola cases reported by the CDC on July 31 (533) and the number of trees in the corresponding forest (varying between 143 and 174), and (3) with it being fixed to zero.
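The case-count arithmetic behind these settings can be retraced as follows. The counts are those quoted above; the resulting probabilities are computed here for illustration (using 853 tips, the lower of the two forest sizes) and the variable names are hypothetical.

```python
cases_sep_8_2015 = 13683   # CDC count closest to the last SLE sample
cases_jul_31_2014 = 533    # CDC count on the day after quarantine start
forest_tips = 853          # cases represented by a forest (853-854)

total_cases = cases_sep_8_2015 - cases_jul_31_2014   # cases in the studied period


def sampling_probability(sampled, total):
    """Proportion of sampled cases among the declared ones."""
    return sampled / total


# Robustness settings: reported total, 20% more and 20% less total cases
scenarios = {
    "reported": total_cases,
    "20% more": total_cases * 12 // 10,
    "20% less": total_cases * 8 // 10,
}
rho = {name: sampling_probability(forest_tips, n) for name, n in scenarios.items()}

# Setting (2): hidden trees = cases declared on July 31 minus observed trees
hidden_trees_range = [cases_jul_31_2014 - n_trees for n_trees in (143, 174)]
```

The three scenarios give sampling probabilities of roughly 0.065, 0.054 and 0.081, and the fixed number of unobserved trees ranges from 390 down to 359 depending on the forest.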
The results for different values, estimated vs. fixed , and different trees were compatible, with intersecting CIs (see Table S1 in online Appendix). We estimated the value between 0.95 and 1, suggesting a contained epidemic (which is in good agreement with the quarantine measures and the end of the epidemic in early 2016). The incubation period was estimated between 11 and 14 days. These estimates are fully compatible with the previous studies and allow us to narrow down the WHO incubation period estimate (2–21 d, non-specific to the SLE epidemic). We estimated a very short infectious period hour. While it does not correspond to the epidemiological estimates (4–5 d) reported in previous studies, it makes sense in the setting we are looking at. The BDEI infectious period corresponds to the time interval between the moment when a person becomes infectious and the moment when they cannot transmit anymore. At the beginning of an epidemic, the inability to transmit is typically defined by biological factors, such as healing or death, and corresponds to the epidemiological definition. However, it could also be influenced by logistic reasons, such as self-isolation. As we are looking at the lock-down period, with strict surveillance, it seems likely that a person who develops symptoms (i.e., passes from the exposed to the infectious state) is immediately detected and isolated, hence having very limited time to transmit on average. Our estimate therefore corresponds to this logistic scenario.
Note that, comparing the setting with being fixed according to the CDC-declared case count to the one when it is estimated, estimates were slightly smaller with fixed (while the CIs intersected). As the number of cases is defined by the epidemic preceding the studied period, might not correspond to the number estimated from the studied period parameters, and it might be more accurate to fix it.
Overall, the analysis took h for the reconstruction of forests and min per forest for the BDEI parameter estimation.
This application shows the advantages of PyBDEI not only in terms of calculation times, but also in terms of flexibility of input settings (extracting information from multiple trees).
Discussion
We proposed a highly parallelizable formulation of master equations for MTBD models. We also proposed an extension of the MTBD models to forests, to tackle situations where health policies change over time (providing a flexible alternative to the Bayesian skyline), as well as situations of multiple (not necessarily simultaneous) pathogen introductions to a country of interest. The extension to forests does not introduce additional model parameters; however, it allows external data on the number of infectious cases at the start of the forest to be incorporated when they are available.
The peculiar properties of original MTBD equations permitted us to rewrite them in a branch-specific way. This representation features simple boundary conditions with 0 or 1 values, and avoids numerical and underflow problems that could occur in the original system due to very small positive values of the boundary conditions. Even more importantly, our branch-specific representation removes the recursive dependency between the equations corresponding to parent and child tree nodes, and permits their time-consuming resolution to be performed in parallel and independently. The results can then be combined into tree likelihood with a nearly standard pruning algorithm. While the likelihood-combining step in the most general MTBD case remains recursive, its time cost should be negligible in comparison with the recursive master equation resolution used in previous studies. Moreover, for cases where tree node states are known we obtained an explicit likelihood formula, which can be represented in a logarithmic form (Eq. 8) to avoid potential underflow issues during the likelihood-combining step. Tree node states could be known from metadata, or because they were generated by an MTBD process in which only one state can transmit or get sampled.
We implemented our theoretical findings in a maximum likelihood parameter and CI estimator for the BDEI model, which is a special case of MTBD models. It is also one of the most useful models in epidemiology, being applicable to Ebola, SARS-CoV-1 and SARS-CoV-2, tuberculosis and other pathogens that feature an incubation period. Under this model tree node states are known, as only the infectious individuals can transmit their pathogen or get detected (after symptom onset). Our parameter estimator, PyBDEI, drastically increases parameter optimization performance, accuracy and speed with respect to previously available estimators.
We applied our estimator to the 2014 Ebola epidemic in Sierra Leone, after the introduction of quarantine measures. The analysis took h (the majority of which was the tree reconstruction). The obtained estimates of epidemiological parameters are in agreement with what we now know about this epidemic. In particular, the estimate of , slightly less than 1, suggests a contained epidemic, and indeed Sierra Leone was declared Ebola free in early 2016, a few months after the sampling date (September 7, 2015) of the most recent Ebola sequence in our data set.
The accuracy of estimations improves with the data set size (as expected, see our simulations). In the world of rapidly growing sequencing data sets (Hodcroft et al., 2021), we can gain important insights on epidemic spreads by harvesting all available information. PyBDEI is applicable to very large data sets (2 min on a 10000-tip tree), making parameter and CI estimation instantaneous with respect to phylogenetic tree reconstruction times (hours or even days). Our approach could be easily used in a Bayesian setting as well, and could potentially be implemented in BEAST2.
As the MTBD models are epidemiological analogues of the ClaSSE-like models, our findings could also be easily transferred to the macroevolution domain. Our parallelizable MTBD model formulation is closely related to the general matrix-based flow framework recently proposed by Louca and Pennell (2020). Using this framework, they implemented an efficient parameter estimator, castor, for MuSSE-like models; however, this type of model does not cover the cladogenetic changes possible in ClaSSE-like and MTBD models. Moreover, our approach, thanks to forests, allows for multiple introductions, for example, of an epidemic in a given country, or of species from the same clade within a given ecological realm, which could be useful in the macroevolution domain. With rapidly growing genome sequence data, castor and PyBDEI open the way to fast and accurate parameter estimations for ecology and epidemiology.
Materials and Methods
Reconditioned BDEI master equations
| (10) |
Equivalence between Equations (2) and (8)
The likelihood Equation (2) for a tree is recursive, and when using () needs to be resolved with a pruning algorithm while climbing the tree. However, for a tree with known node states, we can transform it into a non-recursive Equation (8) with , by alternating replacement and unfolding steps. A replacement step consists in replacing with and is followed by an unfolding step. An unfolding step either (if is a tip) unfolds into and stops; or (if is an internal node) unfolds into and proceeds with replacements. In Equation (11) we show the transformation process.
| (11) |
Stationary Distribution
Stationary state distribution corresponds to the ratios of states and at a given time after these ratios stopped changing (assuming that this may happen). , where is the number of individuals of type and is the total number of infected (infectious or not) individuals at time . Hence, the derivative of the number of individuals of type is proportional to the derivative of the total number of infected individuals:
| (12) |
The number of individuals in state increases due to becoming infectious of individuals in state and decreases due to removal, while the number of individuals in state decreases due to becoming infectious and increases due to transmissions:
| (13) |
Becoming infectious changes the corresponding individual’s state but does not affect the total number of infected individuals, transmissions increase the total number, and removal decreases it. Note that only individuals in state can transmit or be removed:
| (14) |
Combining [13] and [14] we rewrite [12] as a system of multivariate algebraic equations:
| (15) |
from which we derive a quadratic equation for : , and the following stationary distribution:
| (16) |
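The derivation above can be checked numerically with a short sketch. The symbols of Equations (12)–(16) are not legible in this text, so the notation is an assumption: `mu` is the rate of becoming infectious, `la` the transmission rate and `psi` the removal rate, with states E (exposed) and I (infectious). Under these assumptions, dividing the rate equations by the total number of infected individuals gives the quadratic (la - psi) * pi_I^2 + (mu + psi) * pi_I - mu = 0, whose root in (0, 1) is computed below. The function name is hypothetical.

```python
import math


def bdei_stationary_distribution(mu, la, psi):
    """Return (pi_E, pi_I), the stationary proportions of exposed and infectious.

    Assumed model: dE/dt = la * I - mu * E, dI/dt = mu * E - psi * I,
    hence dN/dt = (la - psi) * I for N = E + I.
    """
    a = la - psi
    b = mu + psi
    discriminant = b * b + 4.0 * a * mu   # equals (mu - psi)^2 + 4 * mu * la > 0
    # Root (-b + sqrt(D)) / (2a) written in rationalized form, which also
    # covers the degenerate case la == psi without dividing by zero.
    pi_i = 2.0 * mu / (b + math.sqrt(discriminant))
    return 1.0 - pi_i, pi_i


# Made-up parameter values
mu, la, psi = 0.1, 0.5, 0.25
pi_e, pi_i = bdei_stationary_distribution(mu, la, psi)

# Residuals of the two stationarity conditions (both should vanish)
residual_i = mu * pi_e - psi * pi_i - pi_i * (la - psi) * pi_i
residual_e = la * pi_i - mu * pi_e - pi_e * (la - psi) * pi_i
```

The rationalized form is chosen for numerical stability: when the transmission and removal rates are close, the textbook quadratic formula subtracts two nearly equal numbers and divides by a value close to zero.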
PyBDEI, Its Code and Data Availability
Parameter estimation starts with a preprocessing step of reading the input trees, calculating the time at each node and memorising the association between each node and its child nodes: . This step requires one tree traversal, and its time is negligible with respect to the numerical master equation resolution performed in the next steps.
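This preprocessing step can be pictured with a minimal sketch. It is not the PyBDEI code (whose core is in C++): the dictionary-based tree, the function name `preprocess` and the convention that a node's time is its parent's time plus its branch length are assumptions made for illustration.

```python
def preprocess(root, root_start_time=0.0):
    """Single traversal computing node times and the node -> children mapping."""
    node_time = {}
    children_of = {}
    stack = [(root, root_start_time)]
    while stack:
        node, parent_time = stack.pop()
        time = parent_time + node["dist"]      # time at the end of the node's branch
        children = node.get("children", [])
        node_time[node["name"]] = time
        children_of[node["name"]] = [child["name"] for child in children]
        for child in children:
            stack.append((child, time))
    return node_time, children_of


# Toy tree (made-up branch lengths): ((A:1,B:2)n1:2,C:4)root:0
tree = {"name": "root", "dist": 0.0, "children": [
    {"name": "n1", "dist": 2.0, "children": [
        {"name": "A", "dist": 1.0},
        {"name": "B", "dist": 2.0}]},
    {"name": "C", "dist": 4.0}]}

node_time, children_of = preprocess(tree)
```

Each node is visited exactly once, so the cost is linear in the number of nodes; an explicit stack is used instead of recursion so that very deep trees (thousands of tips) do not hit Python's recursion limit.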
The estimation then proceeds with a search for the optimal parameter set , where is the set of admissible parameter values. We use the globally-convergent method of moving asymptotes (Svanberg, 2002) for the optimization. At each optimization step the corresponding likelihood needs to be calculated, which implies calculating for each of the observed internal nodes of , and combining them as in the forest likelihood formula (Eq. (4)). The reconditioned version of BDEI master equations (10) for the parameter values can be resolved in parallel for each of the observed forest nodes (differing in the times of their boundary conditions).
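The structure of one likelihood evaluation (independent per-node resolutions followed by a combination step) can be sketched as below. This shows only the parallel pattern: `solve_branch` is a hypothetical placeholder that returns a simple exponential instead of solving the reconditioned master equations, and the combination by summing logarithms is an assumption standing in for the forest likelihood formula.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def solve_branch(task):
    """Placeholder for the per-node resolution (NOT the BDEI equations)."""
    t_start, t_end, rate = task
    return math.exp(-rate * (t_end - t_start))


def log_likelihood(tasks, threads=2):
    """Resolve all tasks independently, then combine the results in log space."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() preserves the task order, so the sum below is reproducible
        values = list(pool.map(solve_branch, tasks))
    return sum(math.log(value) for value in values)


# Made-up tasks: (start time, end time, rate) for three branches
tasks = [(0.0, 1.0, 0.5), (1.0, 3.0, 0.5), (1.0, 2.5, 0.5)]
result = log_likelihood(tasks)
```

Because no task depends on the result of another, the work can be distributed freely between threads. In pure Python, threads would not speed up such CPU-bound work because of the global interpreter lock; the pattern pays off in a compiled core such as the C++ one used by our estimator.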
To resolve these master equations numerically, we start by separating the self-defined subsystem for the unknowns and from the rest of System (10), and calculate it independently. Taking into account the fact that the equations in System (10) are either linear (for and ) or quadratic (for and ) the use of an implicit scheme is simple. We chose implicit schemes with an automatic computation of the time step, such that the error is less than a given tolerance. In our implementation, we used the implicit Euler scheme (Butcher, 2016) for solving the linear equations and the Crank-Nicolson implicit scheme (Crank and Nicolson, 1947) for the non-linear ones. This permits us to avoid possible stability time restrictions and only choose the time steps for precision.
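The two schemes can be illustrated on scalar test equations. This sketch makes several simplifying assumptions that differ from our implementation: it uses a fixed time step instead of an automatically chosen one, scalar equations instead of System (10), and a Newton iteration for the implicit Crank-Nicolson step. Function names are hypothetical.

```python
import math


def implicit_euler_linear(a, y0, t_end, steps):
    """Solve y' = a * y with the implicit Euler scheme.

    Each step solves y_new = y_old + h * a * y_new, i.e. y_new = y_old / (1 - h * a).
    """
    h = t_end / steps
    y = y0
    for _ in range(steps):
        y = y / (1.0 - h * a)
    return y


def crank_nicolson_quadratic(c0, c1, c2, y0, t_end, steps, newton_iters=20):
    """Solve y' = c0 + c1 * y + c2 * y^2 with the Crank-Nicolson scheme.

    Each step solves z - (h / 2) * f(z) = y + (h / 2) * f(y) for z by Newton's method.
    """
    def f(y):
        return c0 + c1 * y + c2 * y * y

    def df(y):
        return c1 + 2.0 * c2 * y

    h = t_end / steps
    y = y0
    for _ in range(steps):
        rhs = y + 0.5 * h * f(y)
        z = y                                   # initial guess: previous value
        for _ in range(newton_iters):
            step = (z - 0.5 * h * f(z) - rhs) / (1.0 - 0.5 * h * df(z))
            z -= step
            if abs(step) < 1e-14:
                break
        y = z
    return y


# Linear test: y' = -2y, y(0) = 1, exact solution exp(-2t)
linear_value = implicit_euler_linear(-2.0, 1.0, 1.0, 10000)

# Quadratic test: logistic equation y' = y - y^2, y(0) = 0.1
logistic_value = crank_nicolson_quadratic(0.0, 1.0, -1.0, 0.1, 2.0, 200)
logistic_exact = 0.1 * math.exp(2.0) / (0.9 + 0.1 * math.exp(2.0))
```

Both schemes remain stable for any step size on decaying solutions, so the step can be chosen for precision alone; implicit Euler is first-order accurate and Crank-Nicolson second-order, which is why the latter reaches a comparable error here with far fewer steps.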
The core of our estimator is implemented in C++ and uses the NLopt library (Johnson, ) for non-linear optimization. The parallelization is achieved with the C++ thread_pool tools (Williams, 2012). To facilitate the use of our estimator in Python and perform additional validation of input trees, we wrapped the core estimator into a Python 3 library PyBDEI. PyBDEI uses ETE 3 framework for tree manipulation (Huerta-Cepas et al., 2016) and NumPy package for array operations (Harris et al., 2020).
Our estimator is available as a command-line program and a Python 3 library via PyPI (https://pypi.org/project/pybdei), and via Docker/Singularity (https://hub.docker.com/r/evolbioinfo/bdei). Its source code, the simulated and real data used for its assessment, as well as the Snakemake (Köster and Rahmann, 2012) data analysis pipelines, and the installation and usage documentation are available on GitHub at github.com/evolbioinfo/bdei. The simulated data are also available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.r7sqv9sgx.
BEAST2 and PhyloDeep Settings
BEAST2 (v2.6.2 with package bdmm (Scire et al., 2022) v1.0) was configured for MCMC steps with the following priors: , , and fixed to the real value (, different for different trees). The initial values in the MCMC were set to the medians used in the PhyloDeep training set, namely , , and . The tree was fixed to the real tree. For each tree, the Effective Sample Size (ESS) on all parameters was evaluated, and the median of a posteriori values was reported, corresponding to all recorded steps (i.e., actual MCMC steps spaced by 1000) past the 10% burn-in. The simulations for which BEAST2 did not converge after MCMC steps (2%) were discarded from the analyses.
PhyloDeep (v0.2.51) was run with fixed to the real value and Convolutional Neural Networks trained on the Compact Bijective Ladderized Vector full tree representation (CNN-CBLV).
The visualizations of the analyses of simulated data were performed with the Python 3 library seaborn (v0.11.2) (Waskom, 2021; Hunter, 2007).
Tree Reconstruction for Ebola SLE Epidemic Analysis
We reconstructed a maximum-likelihood phylogeny of 1610 tips for the Ebola samples from Dudas et al. (2017) with RAxML-NG (v1.0.2, GTR+G4+FO+IO) (Kozlov et al., 2019), and rooted it based on sampling dates using LSD2 (v2.4.1) (To et al., 2016). As Ebola’s mutation rate is slower than its transmission rate, the initial phylogeny contained polytomies (i.e., multiple transmissions, which happened faster than the virus acquired a mutation, hence making them indistinguishable in the phylogeny). The BDEI model, on the other hand, assumes a binary tree. We therefore resolved these polytomies randomly (10 times, to check for robustness of the estimates) using a coalescent approach.
We then dated each of the 10 trees with LSD2 (To et al., 2016) (v2.4.1: https://github.com/tothuhien/lsd2/tree/v.2.4.1, under strict molecular clock with outlier removal) using tip sampling dates, and reconstructed the ancestral characters for country with PastML (Ishikawa et al., 2019) (v1.9.40, MPPA+F81).
Lastly, we extracted 10 SLE forests from these trees to represent the Ebola epidemic in SLE between July 30, 2014 (when the SLE government began to deploy troops to enforce quarantines (News24, 2014)) and September 7, 2015 (the last SLE sample in our dataset) by (1) cutting each tree on July 30, 2014 to remove the more ancient part (with a different health policy); (2) among the July-31-on trees, picking those whose root’s predicted character state for country was SLE (light-green branches at the level of July 31, 2014 in Fig. S3 in online Appendix); (3) removing the non-SLE subtrees (indicated with other colors in Fig. S3) from the selected July-31-on SLE trees to focus on the epidemic within the country, without further reintroductions.
The reconstruction took 1 hour for the phylogeny, 10 minutes for tree dating, and 1 minute for country ancestral character prediction.
Acknowledgments
The authors thank Dr Jakub Voznica for valuable discussions.
Contributor Information
Anna Zhukova, Unité Bioinformatique Evolutive, Institut Pasteur, Université de Paris, 28 rue du docteur Roux, 75015 Paris, France; Bioinformatics and Biostatistics Hub, Institut Pasteur, Université de Paris, 28 rue du docteur Roux, 75015 Paris, France.
Frédéric Hecht, Sorbonne Université, CNRS, Université Paris Cité, Laboratoire Jacques-Louis Lions (LJLL), 4 place Jussieu, F-75005 Paris, France.
Yvon Maday, Sorbonne Université, CNRS, Université Paris Cité, Laboratoire Jacques-Louis Lions (LJLL), 4 place Jussieu, F-75005 Paris, France; Institut Universitaire de France, 1 rue Descartes, 75231 Paris CEDEX 05, France.
Olivier Gascuel, Unité Bioinformatique Evolutive, Institut Pasteur, Université de Paris, 28 rue du docteur Roux, 75015 Paris, France; Institut de Systématique, Evolution, Biodiversité (ISYEB) - UMR 7205 CNRS, Museum National d’Histoire Naturelle, SU, EPHE & UA, 57 rue Cuvier, CP 50 75005 Paris, France.
Supplementary Material
Supplementary material, including data files and the online-only appendix, can be found in the Dryad data repository at https://doi.org/10.5061/dryad.r7sqv9sgx.
Funding
O.G. was supported by PRAIRIE (ANR-19-P3IA-0001). Y.M. was supported by European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 810367), project EMC2.
Author Contributions
O.G. initiated the project and provided critical assessment of project progression; A.Z. conceived the new master equation representation, extension to forests, applications to Ebola and simulated data, and coordinated the project; F.H. and Y.M. conceived the numerical approach for fast and efficient parameter optimization; F.H. implemented the numerical approach; A.Z. wrote the manuscript with input from all the authors; all authors discussed the intermediate and final results.
REFERENCES
- Berger S.A., Stamatakis A.. 2009. Accuracy and performance of single versus double precision arithmetics for maximum likelihood phylogeny reconstruction. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 6068 LNCS:270–279.
- Bouckaert R., Vaughan T.G., Barido-Sottani J., Duchêne S., Fourment M., Gavryushkina A., Heled J., Jones G., Kühnert D., De Maio N., Matschiner M., Mendes F.K., Müller N.F., Ogilvie H.A., Du Plessis L., Popinga A., Rambaut A., Rasmussen D., Siveroni I., Suchard M.A., Wu C.H., Xie D., Zhang C., Stadler T., Drummond A.J.. 2019. BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 15.
- Butcher J.C. 2016. Numerical methods for ordinary differential equations. 3rd ed. Wiley.
- CDC. 2020. 2014–2016 Ebola outbreak in West Africa: case counts. Available from https://www.cdc.gov/vhf/ebola/history/2014-2016-outbreak/case-counts.html.
- Crank J., Nicolson P.. 1947. A practical method for numerical evaluation of solutions of partial differential equations of the heat-conduction type. Math. Proc. Camb. Philos. Soc. 43:50–67.
- Defour D. 2010. Accuracy of a maximum likelihood phylogeny reconstruction. Technical report. LIRMM. Available from https://hal.archives-ouvertes.fr/hal-00726409.
- Drummond A.J., Rambaut A., Shapiro B., Pybus O.G.. 2005. Bayesian coalescent inference of past population dynamics from molecular sequences. Mol. Biol. Evol. 22:1185–1192.
- Dudas G., Carvalho L.M., Bedford T., Tatem A.J., Baele G., Faria N.R., Park D.J., Ladner J.T., Arias A., Asogun D., Bielejec F., Caddy S.L., Cotten M., D’Ambrozio J., Dellicour S., Di Caro A., Diclaro J.W., Duraffour S., Elmore M.J., Fakoli L.S., Faye O., Gilbert M.L., Gevao S.M., Gire S., Gladden-Young A., Gnirke A., Goba A., Grant D.S., Haagmans B.L., Hiscox J.A., Jah U., Kugelman J.R., Liu D., Lu J., Malboeuf C.M., Mate S., Matthews D.A., Matranga C.B., Meredith L.W., Qu J., Quick J., Pas S.D., Phan M.V.T., Pollakis G., Reusken C.B., Sanchez-Lockhart M., Schaffner S.F., Schieffelin J.S., Sealfon R.S., Simon-Loriere E., Smits S.L., Stoecker K., Thorne L., Tobin E.A., Vandi M.A., Watson S.J., West K., Whitmer S., Wiley M.R., Winnicki S.M., Wohl S., Wölfel R., Yozwiak N.L., Andersen K.G., Blyden S.O., Bolay F., Carroll M.W., Dahn B., Diallo B., Formenty P., Fraser C., Gao G.F., Garry R.F., Goodfellow I., Günther S., Happi C.T., Holmes E.C., Kargbo B., Keïta S., Kellam P., Koopmans M.P.G., Kuhn J.H., Loman N.J., Magassouba N., Naidoo D., Nichol S.T., Nyenswah T., Palacios G., Pybus O.G., Sabeti P.C., Sall A., Ströher U., Wurie I., Suchard M.A., Lemey P., Rambaut A.. 2017. Virus genomes reveal factors that spread and sustained the Ebola epidemic. Nature. 544:309–315.
- Felsenstein J. 1973. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Syst. Biol. 22:240–249.
- FitzJohn R.G., Maddison W.P., Otto S.P.. 2009. Estimating trait-dependent speciation and extinction rates from incompletely resolved phylogenies. Syst. Biol. 58:595–611.
- Gire S.K., Goba A., Andersen K.G., Sealfon R.S., Park D.J., Kanneh L., Jalloh S., Momoh M., Fullah M., Dudas G., Wohl S., Moses L.M., Yozwiak N.L., Winnicki S., Matranga C.B., Malboeuf C.M., Qu J., Gladden A.D., Schaffner S.F., Yang X., Jiang P.P., Nekoui M., Colubri A., Coomber M.R., Fonnie M., Moigboi A., Gbakie M., Kamara F.K., Tucker V., Konuwa E., Saffa S., Sellu J., Jalloh A.A., Kovoma A., Koninga J., Mustapha I., Kargbo K., Foday M., Yillah M., Kanneh F., Robert W., Massally J.L., Chapman S.B., Bochicchio J., Murphy C., Nusbaum C., Young S., Birren B.W., Grant D.S., Schieffelin J.S., Lander E.S., Happi C., Gevao S.M., Gnirke A., Rambaut A., Garry R.F., Khan S.H., Sabeti P.C.. 2014. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science. 345:1369–1372.
- Goldberg E.E., Igić B.. 2012. Tempo and mode in plant breeding system evolution. Evolution. 66:3701–3709.
- Grenfell B.T., Pybus O.G., Gog J.R., Wood J.L.N., Daly J.M., Mumford J.A., Holmes E.C.. 2004. Unifying the epidemiological and evolutionary dynamics of pathogens. Science. 303(5656):327–332.
- Harris C.R., Millman K.J., van der Walt S.J., Gommers R., Virtanen P., Cournapeau D., Wieser E., Taylor J., Berg S., Smith N.J., Kern R., Picus M., Hoyer S., van Kerkwijk M.H., Brett M., Haldane A., del Río J.F., Wiebe M., Peterson P., Gérard-Marchant P., Sheppard K., Reddy T., Weckesser W., Abbasi H., Gohlke C., Oliphant T.E.. 2020. Array programming with NumPy. Nature. 585:357–362.
- Hethcote H.W. 2000. The mathematics of infectious diseases. SIAM Rev. 42:599–653.
- Hodcroft E.B., De Maio N., Lanfear R., MacCannell D.R., Minh B.Q., Schmidt H.A., Stamatakis A., Goldman N., Dessimoz C.. 2021. Want to track pandemic variants faster? Fix the bioinformatics bottleneck. Nature. 591:30–33.
- Huerta-Cepas J., Serra F., Bork P.. 2016. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol. Biol. Evol. 33:1635–1638.
- Hunter J.D. 2007. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 9:90–95.
- Ishikawa S.A., Zhukova A., Iwasaki W., Gascuel O.. 2019. A fast likelihood method to reconstruct and visualize ancestral scenarios. Mol. Biol. Evol. 36:2069–2085.
- Johnson S.G. The NLopt nonlinear-optimization package. Accessed: 2021-01-26.
- Kendall D.G. 1948. On the generalized “birth-and-death” process. Ann. Math. Stat. 19:1–15.
- Köster J., Rahmann S.. 2012. Snakemake-a scalable bioinformatics workflow engine. Bioinformatics. 28:2520–2522.
- Kozlov A.M., Darriba D., Flouri T., Morel B., Stamatakis A.. 2019. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics. 35:4453–4455.
- Kühnert D., Stadler T., Vaughan T.G., Drummond A.J.. 2016. Phylodynamics with migration: a computational framework to quantify population structure from genomic data. Mol. Biol. Evol. 33:2102–2116.
- Louca S., Pennell M.W.. 2020. A general and efficient algorithm for the likelihood of diversification and discrete-trait evolutionary models. Syst. Biol. 69:545–556.
- Macpherson A., Louca S., Mclaughlin A., Joy J.B., Pennell M.W.. 2021. Unifying phylogenetic birth-death models in epidemiology and macroevolution. Syst. Biol. 71(1):172–189.
- Maddison W., Midford P., Otto S.. 2007. Estimating a binary character’s effect on speciation and extinction. Syst. Biol. 56:701–710.
- News24. 2014. Sierra Leone, Liberia deploy troops for Ebola. Available from https://www.news24.com/Africa/News/Sierra-Leone-Liberia-deploy-troops-for-Ebola-20140804.
- Pybus O.G., Rambaut A., Harvey P.H.. 2000. An integrated framework for the inference of viral population history from reconstructed genealogies. Genetics. 155:1429–1437.
- Rivers C.M., Lofgren E.T., Marathe M., Eubank S., Lewis B.L.. 2014. Modeling the impact of interventions on an epidemic of Ebola in Sierra Leone and Liberia. PLoS Curr. 6:ecurrents.outbreaks.4d41fe5d6c05e9df30ddce33c66d084c.
- Scire J., Barido-Sottani J., Kühnert D., Vaughan T.G., Stadler T.. 2022. Robust phylodynamic analysis of genetic sequencing data from structured populations. Viruses. 14.
- Stadler T. 2009. On incomplete sampling under birth-death models and connections to the sampling-based coalescent. J. Theor. Biol. 261:58–66.
- Stadler T. 2010. Sampling-through-time in birth-death trees. J. Theor. Biol. 267:396–404.
- Stadler T., Bonhoeffer S.. 2013. Uncovering epidemiological dynamics in heterogeneous host populations using phylogenetic methods. Philos. Trans. R. Soc. B Biol. Sci. 368:20120198.
- Stadler T., Kühnert D., Bonhoeffer S., Drummond A.J.. 2013. Birth-death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV). Proc. Natl. Acad. Sci. U.S.A. 110:228–233.
- Stadler T., Kühnert D., Rasmussen D.A., du Plessis L.. 2014. Insights into the early epidemic spread of Ebola in Sierra Leone provided by viral sequence data. PLoS Curr. 6.
- Svanberg K. 2002. A class of globally convergent optimization methods based on conservative convex separable approximations. SIAM J. Optimiz. 12:555–573.
- WHO Ebola Response Team. 2014. Ebola virus disease in West Africa — the first 9 months of the epidemic and forward projections. N. Engl. J. Med. 371:1481–1495.
- WHO Ebola Response Team. 2015. West African Ebola epidemic after one year — slowing but not yet under control. N. Engl. J. Med. 372:584–587.
- To T.H., Jung M., Lycett S., Gascuel O.. 2016. Fast dating using least-squares criteria and algorithms. Syst. Biol. 65:82–97.
- Van Kerkhove M.D., Bento A.I., Mills H.L., Ferguson N.M., Donnelly C.A.. 2015. A review of epidemiological parameters from Ebola outbreaks to inform early public health decision-making. Sci. Data. 2:1–10.
- Volz E.M., Koelle K., Bedford T., Bhattacharya T., Delaporte E.. 2013. Viral phylodynamics. PLoS Comput. Biol. 9:e1002947.
- Volz E.M., Kosakovsky Pond S.L., Ward M.J., Leigh Brown A.J., Frost S.D.. 2009. Phylodynamics of infectious disease epidemics. Genetics. 183:1421–1430.
- Voznica J., Zhukova A., Boskova V., Saulnier E., Lemoine F., Moslonka-Lefebvre M., Gascuel O.. 2022. Deep learning from phylogenies to uncover the transmission dynamics of epidemics. Nat. Commun. 13:3896.
- Waskom M.L. 2021. seaborn: statistical data visualization. J. Open Source Software. 6:3021.
- WHO. 2021. Ebola virus disease [fact sheet]. Available from https://www.who.int/news-room/fact-sheets/detail/ebola-virus-disease.
- Wilks S.S. 1938. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Stat. 9:60–62.
- Williams A. 2012. C++ concurrency in action: practical multithreading, 1st ed. Shelter Island, NY: Manning Publications.
- Zhukova A., Blassel L., Lemoine F., Morel M., Voznica J., Gascuel O.. 2020. Origin, evolution and global spread of SARS-CoV-2. C.R. Biol. 344(1):57–75.


