Scalable Bayesian Divergence Time Estimation With Ratio Transformations

Xiang Ji; Alexander A Fisher; Shuo Su; Jeffrey L Thorne; Barney Potter; Philippe Lemey; Guy Baele; Marc A Suchard

doi:10.1093/sysbio/syad039

Syst Biol. 2023 Jul 17;72(5):1136–1153. doi: 10.1093/sysbio/syad039

Xiang Ji ¹,✉, Alexander A Fisher ², Shuo Su ³, Jeffrey L Thorne ⁴,⁵,⁶, Barney Potter ⁷, Philippe Lemey ⁸, Guy Baele ⁹, Marc A Suchard ¹⁰,¹¹,¹²,✉

Editor: Ziheng Yang

¹ Department of Mathematics, School of Science & Engineering, Tulane University, 6823 St. Charles Avenue, New Orleans, LA 70118, USA

² Department of Statistical Science, Duke University, 214 Old Chemistry, Durham, NC 27708, USA

³ MOE International Joint Collaborative Research Laboratory for Animal Health & Food Safety, Jiangsu Engineering Laboratory of Animal Immunology, Institute of Immunology, College of Veterinary Medicine, Nanjing Agricultural University, No. 1 Weigang, Xiaolingwei District, Nanjing, Jiangsu 210095, China

⁴ Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA

⁵ Department of Statistics, North Carolina State University, Raleigh, NC, USA

⁶ Department of Biological Sciences, North Carolina State University, Ricks Hall, 1 Lampe Dr, Raleigh, NC 27607, USA

⁷ Department of Microbiology, Immunology and Transplantation, Rega Institute, Herestraat 49, 3000 Leuven, Belgium

⁸ Department of Microbiology, Immunology and Transplantation, Rega Institute, Herestraat 49, 3000 Leuven, Belgium

⁹ Department of Microbiology, Immunology and Transplantation, Rega Institute, Herestraat 49, 3000 Leuven, Belgium

¹⁰ Department of Biomathematics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA

¹¹ Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA

¹² Department of Biostatistics, Fielding School of Public Health, University of California Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA 90095, USA

✉ Correspondence to be sent to: Xiang Ji, Department of Mathematics, School of Science & Engineering, Tulane University, 6823 St. Charles Avenue, New Orleans, LA 70118, USA; E-mail: xji4@tulane.edu.

✉ Correspondence to be sent to: Marc A. Suchard, Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA 90095, USA; E-mail: msuchard@ucla.edu.

Roles

Ziheng Yang: Associate Editor

PMCID: PMC10636426 PMID: 37458991

Abstract

Divergence time estimation is crucial to provide temporal signals for dating biologically important events from species divergence to viral transmissions in space and time. With the advent of high-throughput sequencing, recent Bayesian phylogenetic studies have analyzed hundreds to thousands of sequences. Such large-scale analyses challenge divergence time reconstruction by requiring inference on highly correlated internal node heights, which often becomes computationally infeasible. To overcome this limitation, we explore a ratio transformation that maps the original $N - 1$ internal node heights into a space of one height parameter and $N - 2$ ratio parameters. To make the analyses scalable, we develop a collection of linear-time algorithms to compute the gradient and Jacobian-associated terms of the log-likelihood with respect to these ratios. We then apply Hamiltonian Monte Carlo sampling with the ratio transform in a Bayesian framework to learn the divergence times in 4 pathogenic viruses (West Nile virus, rabies virus, Lassa virus, and Ebola virus) and the coralline red algae. Our method both resolves a mixing issue in the West Nile virus example and improves inference efficiency by at least 5-fold for the Lassa and rabies virus examples as well as for the algae example. Our method now also makes it computationally feasible to incorporate mixed-effects molecular clock models for the Ebola virus example, confirms the findings from the original study, and reveals clearer multimodal distributions of the divergence times of some clades of interest.

Keywords: Bayesian inference, divergence time estimation, effective sample size, Hamiltonian Monte Carlo, pathogens, phylogenetics, ratio transformation

Since Zuckerkandl and Pauling (1962) proposed the first molecular clock model, the development of more reliable divergence time estimation techniques has thrived. Because evolutionary rate and time are confounded in stochastic models for molecular sequence data, one may improve divergence time inference via advances in the treatment of either rates or times. However, most of the effort has centered on improving the model aspects that describe either how evolutionary rates change across the tree or how divergence events (e.g., coalescent events and/or birth–death events) determine the positions of internal nodes on the tree, while improvement of the estimation machinery has received less attention.

This imbalance is partly due to the constraints on the node heights imposed by the tree structure. Assuming a rooted tree with the root node on the top and tip nodes at the bottom, an internal node must be higher than its descendant nodes but lower than its parent node. These constraints pose a great challenge for inferring internal node heights jointly, so one typically samples or optimizes the height of one node at a time.

Despite this inference difficulty, divergence time estimation is crucial to provide temporal signals for dating biologically important events, from species divergence to viral transmissions in space and time (Erwin et al. 2011; Meredith et al. 2011; Düx et al. 2020; Lemey et al. 2020). Repeated breakthroughs in sequencing technologies have led to molecular data accumulating at an ever-increasing pace. This often results in data sets that contain so many sequences that the desired divergence time analyses become computationally infeasible. When faced with such obstacles, investigators resort to analyzing only a small proportion of the available data and/or sacrificing statistical rigor and biological plausibility by adopting procedures and models that are flawed but computationally convenient (see, e.g., Simion et al. (2020)). There is, therefore, substantial value in reducing the amount of computation necessary for statistically sound divergence time inference.

In Kishino et al. (2001), the authors transform the internal node heights of a phylogeny with contemporaneous data (sampled at the same time) into a collection of ratios that sum to $1$. With a Dirichlet prior distribution, Kishino et al. were then able to jointly sample all proportions at one time. Inspired by their pioneering work, we explore a more general ratio transformation, similar to that used in Fourment and Darling (2019), for the internal node heights that one can apply to either serially sampled or contemporaneous data. The ratio transformation serves as a reparameterization that works with any existing phylogenetic models without the need for any specific prior. In fact, the proposed ratio transformation preserves the topology-imposed constraints by its construction, allowing the ratios to be independent so that they are easy to sample from or optimize on.

We here show that one can calculate the transformation and the determinant of the Jacobian matrix of the transformation in linear time with respect to the number of tips ($N$). With the determinant of the Jacobian matrix, one can set up the phylogenetic model with respect to the untransformed node heights, but sample from the transformed ratio space. To make use of an advanced linear-time gradient of the log-likelihood algorithm (Ji et al. 2020), we show that one can transform the gradient with respect to the untransformed node heights to the gradient with respect to the transformed ratio space with $\mathcal{O}(N)$ calculations. The linear-time gradient transformation enables the application of gradient-based Monte Carlo samplers such as the Hamiltonian Monte Carlo (HMC) method (Neal 2011) in the Bayesian framework. HMC shows great potential for improving computational efficiency in many phylogenetic applications (Dinh et al. 2017; Ji et al. 2020; Baele et al. 2020).

We apply the ratio transformation to simultaneously learn the branch-specific evolutionary rates and the internal node heights of 4 viral examples with serially sampled data and an algae example with contemporaneous samples and fossil-informed calibration priors. Our method significantly improves inference efficiency with a 5- to 8-fold computational performance increase for our Lassa and rabies virus examples and an 11-fold increase for the algae example. More interestingly, the West Nile virus example shows that our sampler better approximates the posterior density than do classic univariable samplers that suffer from Markov chain Monte Carlo (MCMC) mixing issues. For an Ebola virus example, we show that our method makes it computationally feasible to employ a mixed-effects relaxed clock model (Bletsa et al. 2019) to account for both clade- and branch-specific effects that reveal clearer multi-modal distribution of divergence times for clades of interest.

Materials and Methods

New Approach

In this section, we define necessary notation and derive the ratio transformation and its related linear-time algorithms.

Notation.

Assume the root node is on the top of a rooted phylogeny with $N$ tips and $N - 1$ internal nodes. We use numbers $1, 2, \dots, N$ to denote the tip nodes and numbers $N + 1, \dots, 2 N - 1$ for the internal nodes where the root node is always $2 N - 1$. We use notation $pa (i)$ to denote the parent node of node $i$. We denote a branch on the tree by the number of the child node it ends at (i.e., branch $i$ connects node $pa (i)$ to node $i$). We denote the height (i.e., time) of node $i$ with $t_{i}$. When $i$ is a tip node (i.e., $i \leq N$), its height $t_{i}$ is the sampling time. In divergence time estimation, one is interested in estimating the heights of internal nodes.

Without loss of generality, we derive the ratio transform where the tip nodes can be associated with serially sampled data and where the transformation with contemporaneous data is then a special case where all tip node times are identical. We first define epochs such that any internal node belongs to one and only one epoch. We then define a ratio parameter ascribed to each of the internal nodes except for the root.

Epoch construction and the ratio transformation.

For an internal node, we refer to its earliest (i.e., highest) descendant tip node as its anchor node. Therefore, the anchor node of an internal node is its closest descendant tip node. To make the anchor nodes consistent and unique, we assign an arbitrary ordering among tip nodes to distinguish those with the same sampling times. For example, we pick the tip node with the smallest node number as the anchor node from all closest tip nodes sampled at the same time. We group all internal nodes with the same anchor node into an epoch. We refer to an epoch by the number of its anchor node. An epoch is constructed to have a chain structure from its anchor node up to the highest node in the epoch (see Fig. 1a). Except for the epoch to which the root node belongs, we refer to the parent node of the highest node in an epoch as its connecting node such that the connecting node of an epoch belongs to another epoch. We treat the root node as the connecting node for epochs of its immediate descendant nodes.

Figure 1. Epoch construction on an example tree. a) Example tree with serially sampled data. b) One epoch example where epoch $k$ starts from node $k_{1}$ down to its anchor node $k$ and node $k_{0}$ is the connecting node of epoch $k$ that belongs to epoch $ℰ (k_{0})$. In the example tree in a), the epoch that contains the root node is the starting epoch, and some tip nodes do not anchor any epochs (i.e., their parent nodes belong to epochs anchored at other tip nodes).

Let $t_{i}$ denote the height of node $i$ and $ℰ (i)$ be the epoch to which node $i$ belongs. We refer to the epoch to which the root node belongs as the starting epoch and assign it as $ℰ (2 N - 1)$. We abuse notation by referring to the $i$th node of epoch $k$ as $k_{i}$. For epoch $k$ that contains $n_{k}$ internal nodes with strictly positive branch lengths, we have $t_{k_{1}} > t_{k_{2}} > \dots > t_{k_{n_{k}}} > t_{k}$. We refer to the connecting node of an epoch as the $0$th node of an epoch (i.e., $k_{0}$). We define $L_{k} = t_{k_{0}} - t_{k}$ as the length of epoch $k$ (see Fig. 1b). For the $i$th internal node from epoch $k$ (i.e., $k_{i}$), we define its ratio parameter as

r_{k_{i}} = \frac{t_{k_{i}} - t_{k}}{t_{k_{i - 1}} - t_{k}},

(1)

where $t_{k}$ is the height of the anchor node of epoch $k$ and $i = 1, \dots, n_{k}$. Note that the anchor node of epoch $k$ is not necessarily immediately descendant to node $k_{i}$, whereas node $k_{i}$ is always immediately descendant to node $k_{i - 1}$. In fact, the anchor node of epoch $k$ is the highest descendant tip node for all nodes in the epoch (by definition) and is only immediately descendant to the last node $k_{n_{k}}$ of the epoch. Therefore, when $i > 1$, node $k_{i}$ and node $k_{i - 1}$ are both from epoch $k$. And when $i = 1$, node $k_{0}$ is the connecting node of epoch $k$ that belongs to another epoch and the denominator in Equation (1) becomes $L_{k}$ (i.e., the length of epoch $k$). One can write the time of an internal node as a function of the ratios and the epoch lengths as

t_{k_{i}} = L_{k} \prod_{n = 1}^{i} r_{k_{n}} + t_{k} .

(2)

To ease notation, let $S_{k_{i}} = \prod_{n = 1}^{i} r_{k_{n}}$ be the product of ratios for internal node $k_{i}$ of epoch $k$. Equation (2) simplifies to

t_{k_{i}} = L_{k} S_{k_{i}} + t_{k} .

(3)

Interestingly, there is only one degree of freedom for all epoch lengths because

t_{k_{0}} = t_{k} + L_{k} = t_{ℰ (k_{0})} + L_{ℰ (k_{0})} S_{k_{0}},

(4)

such that the length of epoch $k$ is determined by the length of the epoch of its connecting node ($L_{ℰ (k_{0})}$) and the two associated anchor node times ($t_{k}$, $t_{ℰ (k_{0})}$). We arrive at the following recursive relationship for epoch lengths

L_{k} = t_{ℰ (k_{0})} - t_{k} + L_{ℰ (k_{0})} S_{k_{0}}

(5)

Therefore, there is effectively only one degree of freedom for the scale of time, with each ratio denoting the relative height of an internal node using its parent node and the anchor node as reference. There are many choices for modeling this single dimension for time scale (e.g., one may arbitrarily choose one of the epoch lengths). We pick the starting epoch length as the free parameter $L_{ℰ (2 N - 1)}$, which we refer to as the height parameter because it represents the height difference from the root node to its closest tip node (all tip nodes are descendants of the root) and is the only dimension that carries the time scale. We refer to the space of the height and ratio parameters as the ratio space. We refer to the space of all untransformed internal node heights as the height space. We refer to the transformation from the height space into the ratio space as the ratio transform.

Algorithm 1 illustrates the ratio transform through a single post-order traversal that visits every node on the tree in a descendant-first manner. Likewise, one can perform the inverse ratio transform to get node heights from the ratios by reversing Equation (1) through a pre-order traversal.

Algorithm 1 Ratio transform through a single post-order traversal

for node $i$ in a post-order traversal do
- if $i$ is a tip node then
  - Set the anchor tip of epoch $i$ as node $i$.
- else
  - Set the anchor tip of $i$ the same as the highest anchor tip of its immediate descendant nodes.
  - Calculate $r_{i}$ according to Equation (1).
- end if
end for
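The traversal in Algorithm 1 and its inverse can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the 5-tip tree, its node heights, and every function name are hypothetical, and heights are measured so that larger values lie closer to the root.

```python
# Hypothetical toy tree: tips 1-5, internal nodes 6-9, root 9.
children = {6: (1, 2), 7: (3, 4), 8: (6, 7), 9: (8, 5)}
parent = {c: p for p, cs in children.items() for c in cs}
tip_height = {1: 0.0, 2: 1.0, 3: 0.5, 4: 0.2, 5: 2.0}
root = 9


def post_order(node):
    """Yield the nodes of the subtree below `node`, descendants first."""
    for child in children.get(node, ()):
        yield from post_order(child)
    yield node


def find_anchors():
    """Anchor tip of every node: highest descendant tip (ties: smallest number)."""
    anchor = {}
    for node in post_order(root):
        if node in tip_height:
            anchor[node] = node
        else:
            candidates = [anchor[c] for c in children[node]]
            anchor[node] = max(candidates, key=lambda a: (tip_height[a], -a))
    return anchor


def ratio_transform(internal_height):
    """Map internal node heights to (height parameter, ratios) via Equation (1)."""
    anchor = find_anchors()
    height = {**tip_height, **internal_height}
    ratios = {}
    for node in post_order(root):
        if node in tip_height or node == root:
            continue  # tips and the root carry no ratio parameter
        base = tip_height[anchor[node]]
        ratios[node] = (height[node] - base) / (height[parent[node]] - base)
    height_parameter = height[root] - tip_height[anchor[root]]
    return height_parameter, ratios


def inverse_ratio_transform(height_parameter, ratios):
    """Recover internal node heights with a pre-order (parent-first) traversal."""
    anchor = find_anchors()
    height = {root: height_parameter + tip_height[anchor[root]]}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child in children:  # internal child
                base = tip_height[anchor[child]]
                height[child] = ratios[child] * (height[node] - base) + base
                stack.append(child)
    return height


L, r = ratio_transform({6: 2.0, 7: 1.5, 8: 3.0, 9: 5.0})
heights = inverse_ratio_transform(L, r)
```

In this toy tree, nodes 8 and 6 share anchor tip 2 and form one epoch, node 7 forms its own epoch anchored at tip 3 with connecting node 8, and the root sits alone in the starting epoch anchored at tip 5.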

Gradient and Jacobian.

Many modern inference machineries benefit from gradient information to find descending directions of the likelihood surface or to efficiently integrate dynamics along the surface for generating Monte Carlo proposals (e.g., Ji et al. (2020) contains gradient applications in non-linear optimization and Bayesian posterior sampling). When transforming probability densities from their original space into another (e.g., the ratio space in this case), one needs the determinant of the Jacobian matrix to correctly “weight” the transformed density (see Theorem 2.1.5 from Casella and Berger (2001)). In this section, we derive algorithms for transforming the “unweighted” likelihood into the ratio space together with the associated quantities from the log-determinant of the Jacobian matrix to correctly set the “weight.”

In Ji et al. (2020), we introduced a linear-time algorithm for calculating the gradient of the log-likelihood with respect to the branch length $b_{i}$ that is the product of the evolutionary rate and the time duration of branch $i$. To calculate the gradient with respect to node heights, one starts with the gradient with respect to branch lengths and finishes via the chain rule. More specifically, for node $h$ with its two immediate descendant nodes $i$ and $j$, the derivative of the log-likelihood, $\log ℙ (Y)$, with respect to $t_{h}$ is:

\frac{\partial}{\partial t_{h}} \log ℙ (Y) = \begin{cases} \frac{\partial \log ℙ (Y)}{\partial b_{h}} \frac{\partial b_{h}}{\partial t_{h}} + \frac{\partial \log ℙ (Y)}{\partial b_{i}} \frac{\partial b_{i}}{\partial t_{h}} + \frac{\partial \log ℙ (Y)}{\partial b_{j}} \frac{\partial b_{j}}{\partial t_{h}}, & h \neq 2 N - 1 \\ \frac{\partial \log ℙ (Y)}{\partial b_{i}} \frac{\partial b_{i}}{\partial t_{h}} + \frac{\partial \log ℙ (Y)}{\partial b_{j}} \frac{\partial b_{j}}{\partial t_{h}}, & h = 2 N - 1. \end{cases}

(6)
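As a small illustration of Equation (6), the sketch below converts a branch-length gradient into a node-height gradient. It assumes each branch length is $b_{i} = \text{rate}_{i} \, (t_{pa (i)} - t_{i})$, so that raising node $h$ lengthens the two branches below it and shortens the branch above it. The tree, rates, and gradient values are made up for the example, and the function name is hypothetical.

```python
# Hypothetical toy tree and values; only the chain rule follows Equation (6).
children = {6: (1, 2), 7: (3, 4), 8: (6, 7), 9: (8, 5)}
root = 9
rate = {1: 1.0, 2: 2.0, 3: 1.0, 4: 0.5, 5: 1.0, 6: 0.5, 7: 1.5, 8: 2.0}
grad_b = {1: 0.1, 2: -0.2, 3: 0.3, 4: 0.4, 5: -0.1, 6: -0.5, 7: 0.2, 8: 0.6}


def height_gradient(grad_b, rate):
    """Gradient w.r.t. internal node heights from the branch-length gradient."""
    grad_t = {}
    for h, (i, j) in children.items():
        # d b_i / d t_h = rate_i and d b_j / d t_h = rate_j for the two children.
        total = rate[i] * grad_b[i] + rate[j] * grad_b[j]
        if h != root:
            # d b_h / d t_h = -rate_h; the root has no branch above it.
            total -= rate[h] * grad_b[h]
        grad_t[h] = total
    return grad_t


grad_t = height_gradient(grad_b, rate)
```

Each internal node is visited once and touches at most three branches, so the conversion stays linear in the number of tips.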

It is important to recall that a ratio parameter appears explicitly only in the heights of the node it is assigned to and of all that node's descendant nodes by Equation (2). Therefore, we only need the partial derivatives $\partial t_{k} / \partial r_{h}$ from node $h$ and all its descendant nodes $k$ to finish the chain rule

\frac{\partial}{\partial r_{h}} \log ℙ (Y) = \sum_{k} [\frac{\partial}{\partial t_{k}} \log ℙ (Y) \frac{\partial t_{k}}{\partial r_{h}}] .

(7)

To derive the partial derivative $\partial t_{k} / \partial r_{h}$ for any two nodes $k$ and $h$ such that node $k$ is a descendant of node $h$, we separate the node pairs into two cases. The first case considers node $k$ and node $h$ in the same epoch (including the pair where $k = h$, e.g., Equation (3)), such that

\frac{\partial t_{k}}{\partial r_{h}} = L_{ℰ (k)} \frac{\partial S_{k}}{\partial r_{h}} = \frac{t_{k} - t_{ℰ (k)}}{r_{h}} .

(8)

For the other case where node $k$ and node $h$ belong to different epochs, we start with revealing the relationship between the partial derivatives of node $k$’s height and its connecting node $ℰ (k)_{0}$’s height with respect to the same ratio $r_{h}$ (e.g., plug Equation (5) in Equation (3)), such that

\frac{\partial t_{k}}{\partial r_{h}} = S_{k} \frac{\partial (t_{ℰ (ℰ {(k)}_{0})} - t_{ℰ (k)} + L_{ℰ (ℰ {(k)}_{0})} S_{ℰ {(k)}_{0}})}{\partial r_{h}} = S_{k} \frac{\partial t_{ℰ {(k)}_{0}}}{\partial r_{h}} .

(9)

Equation (9) shows that one obtains the partial derivative of a node height $t_{k}$ with respect to ratio $r_{h}$ by multiplying the related ratio product (i.e., $S_{k}$) and the partial derivative of the connecting node height with respect to ratio $r_{h}$ (i.e., $\partial t_{ℰ (k)_{0}} / \partial r_{h}$). Combining Equations (8) and (9), we inductively derive a general expression for the derivatives where node $k$ and node $h$ do not belong to the same epoch. We arrive at this derivation through the existence of a series of connecting nodes (when traveling from node $k$ to node $h$) starting from epoch $ℰ (k)$ such that the last connecting node belongs to the same epoch as node $h$, that is, $ℰ (ℰ (\dots ℰ (k)_{0})_{0}) = ℰ (h)$. The general expression for the derivative becomes

\frac{\partial t_{k}}{\partial r_{h}} = S_{k} S_{ℰ {(k)}_{0}} \dots S_{ℰ {(\dots ℰ {(k)}_{0})}_{0}} \frac{\partial t_{ℰ {(\dots ℰ {(k)}_{0})}_{0}}}{\partial r_{h}} .

(10)

By naively plugging Equations (8) and (10) into Equation (7), we obtain the gradient with respect to the ratio space. However, this operation amounts to $\mathcal{O}(N^{2})$ computations for transforming the gradient. To overcome this computational burden, we develop a linear-time algorithm for transforming the gradient.

Post-order traversal

Consider 3 internal nodes $h$, $i$, and $j$ such that node $h$ is the parent node of node $i$ and node $j$. The linear-time algorithm for transforming the gradient with respect to ratio parameters builds on 2 properties of the ratio transformation. The first property is that any descendant node of node $h$ except node $i$ or node $j$ is a descendant node of either node $i$ or node $j$ (for bifurcating trees). The other property is that node $h$ belongs to the same epoch as either node $i$ or node $j$. As is common in dynamic programming algorithms, we want to derive the relationship of $\partial t_{k} / \partial r_{h}$ with $\partial t_{k} / \partial r_{i}$ and $\partial t_{k} / \partial r_{j}$, where node $k$ is a descendant of node $h$, to reuse quantities cached from evaluating Equation (7) on descendant nodes. More specifically, we want to reuse the summations already determined for $\frac{\partial}{\partial r_{i}} \log ℙ (Y)$ and $\frac{\partial}{\partial r_{j}} \log ℙ (Y)$ when calculating $\frac{\partial}{\partial r_{h}} \log ℙ (Y)$ as in Equation (9).

Without loss of generality, we assume node $h$ belongs to the same epoch as node $i$. The following relationships between derivatives with respect to the three ratio parameters $r_{h}$, $r_{i}$, and $r_{j}$ enable the linear-time algorithm through a single post-order traversal to update the gradient from the height space into the ratio space (except for the height parameter). From Equation (8) and Equation (10), when node $k$ is a descendant of node $i$ (including $k = i$) such that node $h$ and node $i$ are in the same epoch,

\frac{\partial t_{k}}{\partial r_{h}} = \frac{\partial t_{k}}{\partial r_{i}} \frac{r_{i}}{r_{h}} .

(11)

When node $k$ is a descendant of node $j$ (including $k = j$) such that node $h$ is the connecting node to the epoch where node $j$ is the first node,

\frac{\partial t_{k}}{\partial r_{h}} = \frac{\partial t_{k}}{\partial r_{j}} \frac{r_{j}}{L_{ℰ (j)}} \frac{\partial t_{h}}{\partial r_{h}} .

(12)

Note that we model the ratio parameters as independent of each other (i.e., $\partial r_{i} / \partial r_{j} = 0$ for $i \neq j$). Equations (11) and (12) come from the special structure of the transform that the height of an internal node is a product of a series of ratio parameters with one single height parameter. Algorithm 2 illustrates updating the gradient with respect to all ratio parameters (except for the height parameter) where one reuses the derivatives of the log-likelihood with respect to two immediate descendant nodes (i.e., nodes $i$ and $j$) to calculate the derivative of the log-likelihood with respect to the parent node (i.e., node $h$).

Algorithm 2 Transforming the gradient of the log-likelihood with respect to ratio parameters by post-order traversal

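for node $h$ in a post-order traversal do
- if $h$ is a tip node then
  - Set the gradient of $r_{h}$ as 0.
- else
  - Let node $i$ and node $j$ be the two immediate descendant nodes of node $h$ such that node $h$ and node $i$ belong to the same epoch.
  - Set the gradient of $r_{h}$, following Equations (7), (11), and (12), as
  - $\frac{\partial}{\partial r_{h}} \log ℙ (Y) = \frac{\partial \log ℙ (Y)}{\partial t_{h}} \frac{\partial t_{h}}{\partial r_{h}} + \frac{r_{i}}{r_{h}} \frac{\partial \log ℙ (Y)}{\partial r_{i}} + \frac{r_{j}}{L_{ℰ (j)}} \frac{\partial t_{h}}{\partial r_{h}} \frac{\partial \log ℙ (Y)}{\partial r_{j}}$.
- end if
end for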
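The recursion behind Algorithm 2 can be sketched as follows, combining Equation (7) with the reuse rules in Equations (11) and (12). This is an illustrative sketch under stated assumptions, not the authors' code: the 5-tip tree, the hard-coded anchor tips and traversal orders, and all names are hypothetical.

```python
# Hypothetical toy tree: tips 1-5, internal nodes 6-9, root 9.
children = {6: (1, 2), 7: (3, 4), 8: (6, 7), 9: (8, 5)}
parent = {c: p for p, cs in children.items() for c in cs}
tip_height = {1: 0.0, 2: 1.0, 3: 0.5, 4: 0.2, 5: 2.0}
root = 9
anchor = {6: 2, 7: 3, 8: 2, 9: 5}  # anchor tips, found as in Algorithm 1


def heights_from_ratios(L, r):
    """Inverse ratio transform; (8, 6, 7) is a pre-order over non-root nodes."""
    t = dict(tip_height)
    t[root] = L + tip_height[anchor[root]]
    for k in (8, 6, 7):
        base = tip_height[anchor[k]]
        t[k] = r[k] * (t[parent[k]] - base) + base
    return t


def ratio_gradient(grad_t, L, r):
    """Turn a height-space gradient into a ratio-space gradient in one pass."""
    t = heights_from_ratios(L, r)
    grad_r = {}
    for h in (6, 7, 8):  # post-order over non-root internal nodes
        # Equation (8) with k = h: d t_h / d r_h = t_pa(h) - t_anchor(h).
        dth_drh = t[parent[h]] - tip_height[anchor[h]]
        total = grad_t[h] * dth_drh
        for c in children[h]:
            if c not in grad_r:  # tip child: no ratio, nothing to reuse
                continue
            if anchor[c] == anchor[h]:  # same epoch, Equation (11)
                total += grad_r[c] * r[c] / r[h]
            else:  # child starts another epoch, Equation (12)
                epoch_length = t[h] - tip_height[anchor[c]]
                total += grad_r[c] * r[c] / epoch_length * dth_drh
        grad_r[h] = total
    return grad_r


grad_r = ratio_gradient({6: 1.0, 7: -2.0, 8: 0.5, 9: 3.0},
                        3.0, {6: 0.5, 7: 0.4, 8: 0.5})
```

Because each node reuses the cached sums of its two children, the work per node is constant. A finite-difference check through `heights_from_ratios` is a convenient way to validate such a recursion on small trees.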

Pre-order traversal

We now update the gradient of the log-likelihood with respect to the height parameter which is the only dimension left in the ratio transform. We use a pre-order traversal to update the gradient in this dimension because the transformation of all internal node heights depends on it. The update is

\frac{\partial}{\partial L_{ℰ (2 N - 1)}} \log ℙ (Y) = \sum_{k} [\frac{\partial}{\partial t_{k}} \log ℙ (Y) \frac{\partial t_{k}}{\partial L_{ℰ (2 N - 1)}}] .

(13)

Based on Equation (4), we calculate all the partial derivatives $\partial t_{k} / \partial L_{ℰ (2 N - 1)}$ according to Algorithm 3 through a single pre-order traversal.

Algorithm 3 Transforming gradient of the log-likelihood with respect to the height parameter by pre-order traversal

for node $k$ in a pre-order traversal do
- if $k$ is the root node then
  - Set the derivative of node height $t_{k}$ with respect to height parameter as 1 (i.e., $\frac{\partial t_{2 N - 1}}{\partial L_{ℰ (2 N - 1)}} = 1$).
- else
  - Set the derivative of $t_{k}$ as the product of $r_{k}$ and the derivative of its parent node with respect to height parameter (i.e., $\frac{\partial t_{k}}{\partial L_{ℰ (2 N - 1)}} = r_{k} \frac{\partial t_{pa (k)}}{\partial L_{ℰ (2 N - 1)}}$ ).
- end if
end for
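A compact sketch of Algorithm 3 and the sum in Equation (13) is given below. The toy tree, the ratio values, the height-space gradient, and the function name are all hypothetical and serve only to show the parent-first accumulation.

```python
# Hypothetical toy tree with root 9 and non-root internal nodes 6, 7, 8.
parent = {6: 8, 7: 8, 8: 9}
root = 9
pre_order = [9, 8, 6, 7]  # parents are visited before their children
ratios = {6: 0.5, 7: 0.4, 8: 0.5}


def height_parameter_gradient(grad_t, ratios):
    """Derivative of the log-likelihood w.r.t. the single height parameter."""
    d_height = {}
    for k in pre_order:
        if k == root:
            d_height[k] = 1.0  # the root height moves one-for-one with L
        else:
            d_height[k] = ratios[k] * d_height[parent[k]]
    grad_L = sum(grad_t[k] * d_height[k] for k in pre_order)
    return grad_L, d_height


grad_L, d_height = height_parameter_gradient(
    {6: 1.0, 7: -2.0, 8: 0.5, 9: 3.0}, ratios)
```

Each derivative is the product of the ratios on the path from the node up to the root, which the pre-order traversal builds up one factor at a time.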

Determinant of the Jacobian matrix

We now derive the Jacobian matrix associated with the ratio transform whose determinant sets the weight for the transformed density. One derives the full Jacobian matrix for the ratio transform by applying Equation (8) and Equation (10). Note the special structure that has $\partial t_{k} / \partial r_{h} \neq 0$ if and only if $k = h$ or node $k$ is a descendant of node $h$, and also note the independence between the height parameter and the ratio parameters. By ordering the entries in a descendant-node-first fashion that coincides with how nodes are visited in a post-order traversal, the Jacobian matrix becomes triangular (including the height parameter). Because the determinant of a triangular matrix only involves the diagonal entries, the determinant of the Jacobian matrix $J$ becomes

| J | = \prod_{i} \frac{\partial t_{i}}{\partial r_{i}} = \prod_{i} [t_{p a (i)} - t_{ℰ (i)}] .

(14)
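Equation (14) translates directly into a single loop over the non-root internal nodes. The sketch below evaluates the log-determinant on a hypothetical toy tree. The anchor tips are hard-coded, the names are illustrative, and the root is skipped because its diagonal entry (the derivative of the root height with respect to the height parameter) equals 1.

```python
import math

# Hypothetical toy tree: root 9, non-root internal nodes 6, 7, 8.
parent = {6: 8, 7: 8, 8: 9}
tip_height = {1: 0.0, 2: 1.0, 3: 0.5, 4: 0.2, 5: 2.0}
anchor = {6: 2, 7: 3, 8: 2}  # anchor tip of each non-root internal node


def log_det_jacobian(internal_height):
    """log |J| = sum_i log(t_pa(i) - t_anchor(i)), following Equation (14)."""
    return sum(
        math.log(internal_height[parent[i]] - tip_height[anchor[i]])
        for i in parent
    )


log_det = log_det_jacobian({6: 2.0, 7: 1.5, 8: 3.0, 9: 5.0})
```

Working on the log scale keeps the product of many small or large factors numerically stable for trees with thousands of tips.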

Gradient of log-determinant of the Jacobian matrix

We complete this section with a final linear-time algorithm for calculating the gradient of the log-determinant of the Jacobian matrix with respect to the ratio space for applying HMC on this transformed space as described in the next section. This additional gradient component facilitates using HMC to sample all dimensions jointly in the ratio space. Similar to the case of updating the gradient of the log-likelihood from the original space into the ratio space, naively applying Equation (8) and Equation (10) results in an undesired quadratic computational load. One can benefit from the same properties that lead to Algorithm 2 with a modified two-pass linear-time Algorithm 4 that calculates all the derivatives of the log-determinant of the Jacobian matrix with respect to the ratio parameters.

Algorithm 4 Calculating gradient of the log-determinant of the Jacobian matrix with respect to ratio parameters by post-order traversal

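for node $h$ in a post-order traversal do
- if $h$ is a tip node then
  - Set the derivative of the log-determinant with respect to $r_{h}$ as 0.
- else
  - Let node $i$ and node $j$ be the two immediate descendant nodes of node $h$ such that node $h$ and node $i$ belong to the same epoch, and compute the derivative of the log-determinant with respect to $r_{h}$ from the derivatives cached for nodes $i$ and $j$.
- end if
end for
for every internal node $h$ do
- Update the derivative of the log-determinant with respect to $r_{h}$.
end for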

Hamiltonian Monte Carlo.

HMC is a state-of-the-art MCMC method that generates efficient proposals through Hamiltonian dynamics (Neal 2011) for the Metropolis–Hastings algorithm (Metropolis et al. 1953; Hastings 1970). For an arbitrary and unbounded parameter of interest $θ$ with the posterior density $π (θ)$, HMC introduces an auxiliary parameter $p$ and samples from the product density $π (θ, p) = π (θ) π (p)$ through the Hamiltonian dynamics:

\begin{array}{l} \frac{d p}{d t} = - \nabla U (θ) = \nabla \log π (θ) \text{ and} \\ \frac{d θ}{d t} = \nabla K (p) = M^{- 1} p, \end{array}

(15)

where $U (θ)$ is the “potential energy” often set to the negative log-posterior density and $K (p)$ is the “kinetic energy” as the auxiliary parameter $p$ typically follows a multivariate normal distribution with a “mass matrix” $M$ as the covariance matrix. HMC has shown great potential in diverse phylogenetic applications (Dinh et al. 2017; Baele et al. 2020; Ji et al. 2020).
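In practice, the dynamics in Equation (15) are integrated numerically. The sketch below uses the standard leapfrog scheme with a diagonal mass matrix and a Metropolis–Hastings accept step. It is a generic illustration, not the sampler used in the analyses: the function names, step size, number of steps, and the standard normal stand-in target are all assumptions made for the example.

```python
import math
import random


def leapfrog(theta, p, grad_log_post, step_size, n_steps, mass_diag):
    """Leapfrog integration of Equation (15) with a diagonal mass matrix."""
    theta, p = list(theta), list(p)
    grad = grad_log_post(theta)
    for _ in range(n_steps):
        # Half step for the momentum, full step for the position, half step again.
        p = [pi + 0.5 * step_size * gi for pi, gi in zip(p, grad)]
        theta = [ti + step_size * pi / mi
                 for ti, pi, mi in zip(theta, p, mass_diag)]
        grad = grad_log_post(theta)
        p = [pi + 0.5 * step_size * gi for pi, gi in zip(p, grad)]
    return theta, p


def hamiltonian(theta, p, log_post, mass_diag):
    """Potential energy (negative log-posterior) plus Gaussian kinetic energy."""
    kinetic = 0.5 * sum(pi * pi / mi for pi, mi in zip(p, mass_diag))
    return -log_post(theta) + kinetic


def hmc_step(theta, log_post, grad_log_post, step_size, n_steps, mass_diag, rng):
    """One HMC transition: draw a momentum, integrate, then accept or reject."""
    p = [rng.gauss(0.0, math.sqrt(mi)) for mi in mass_diag]
    proposal, p_new = leapfrog(theta, p, grad_log_post,
                               step_size, n_steps, mass_diag)
    delta = (hamiltonian(theta, p, log_post, mass_diag)
             - hamiltonian(proposal, p_new, log_post, mass_diag))
    accept = rng.random() < math.exp(min(0.0, delta))
    return proposal if accept else list(theta)


# Stand-in target: independent standard normals (not a phylogenetic posterior).
def log_post(theta):
    return -0.5 * sum(x * x for x in theta)


def grad_log_post(theta):
    return [-x for x in theta]


rng = random.Random(1)
state = [1.0, -0.5]
for _ in range(100):
    state = hmc_step(state, log_post, grad_log_post, 0.1, 10, [1.0, 1.0], rng)
```

The leapfrog scheme is time-reversible and nearly conserves the Hamiltonian, which is what keeps the acceptance probability high even for distant proposals.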

Naive application of HMC on the space of internal node heights is highly inefficient because of the irregular constraints on these parameters. Instead, the ratio space is trivial to extend such that it is unbounded by applying a logit-transform to each ratio independently and a log-transform to the single height parameter. We apply HMC on the (extended) ratio space for efficient sampling of all internal node heights while fixing the tree topology and other model parameters. Finally, we also apply HMC for jointly sampling the evolutionary rates and times (i.e., divergence time estimation) and explore the additional efficiency gain this affords.
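The extension to an unbounded space can be sketched as a pair of elementwise maps. The example below is illustrative only: the function names and numerical values are hypothetical, and the log-Jacobian shown is the standard change-of-variables term for the logit and log maps, which would be added on top of the log-determinant of the ratio transform itself.

```python
import math


def to_unbounded(height_parameter, ratios):
    """Log-transform the height parameter and logit-transform each ratio."""
    y = math.log(height_parameter)
    x = {k: math.log(r / (1.0 - r)) for k, r in ratios.items()}
    return y, x


def from_unbounded(y, x):
    """Inverse maps: exponential for the height, logistic for each ratio."""
    height_parameter = math.exp(y)
    ratios = {k: 1.0 / (1.0 + math.exp(-v)) for k, v in x.items()}
    return height_parameter, ratios


def log_jacobian(height_parameter, ratios):
    """log |d(L, r) / d(y, x)| = log L + sum_k log(r_k * (1 - r_k))."""
    return math.log(height_parameter) + sum(
        math.log(r * (1.0 - r)) for r in ratios.values()
    )


y, x = to_unbounded(3.0, {6: 0.5, 7: 0.4, 8: 0.5})
L, r = from_unbounded(y, x)
```

Because every ratio lies strictly between 0 and 1 and the height parameter is strictly positive, both maps are well defined and each coordinate can be transformed independently of the others.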

Preconditioning with adaptive variance

The geometric structure of the posterior distribution significantly affects the computational efficiency of HMC. For example, when the scales of the posterior distribution vary among individual parameters, failing to account for such structure may reduce the efficiency of HMC (Neal 2011; Stan Development Team 2017; Ji et al. 2020). We can adapt HMC for such structure by modifying the dynamics in Equation (15) via an appropriately chosen mass matrix $M$. In Ji et al. (2020), we employ a mass matrix informed by the diagonal entries of the Hessian matrix of the log-posterior to account for the variable scales among dimensions. Unfortunately, one needs the full Hessian matrix in the original height space to transform into the Hessian matrix with respect to the ratio space. This strategy is too computationally expensive to adopt.

To incorporate information from the covariance matrix without excessive computational burden, we seek an alternative adaptive MCMC procedure (Haario et al. 1999; Andrieu and Thoms 2008; Roberts and Rosenthal 2009). Adaptive MCMC has previously found its way into Bayesian phylogenetic inference (Baele et al. 2017) and we use this technique here to tune the mass matrix $M$ using the covariance matrix estimated from previous samples in the Markov chain. We further restrict $M$ to remain diagonal and hence to scale the ratio dimensions according to their marginal variances. This restriction is commonly imposed to regularize the estimate, and a diagonal matrix alone can greatly enhance sampling efficiency of HMC in many situations (Stan Development Team 2017; Ji et al. 2020). We start the HMC sampler with an identity matrix as $M$ to collect an initial set of samples (e.g., 200 in our analyses), after which we employ the sample covariance to tune $M$ adaptively. Also, we only update the diagonal mass matrix at a fixed interval of HMC iterations so that the cost of computing the adaptive diagonals remains negligible.

Data

We examine the molecular evolution of West Nile virus (WNV) in North America (1999–2007), rabies virus (RABV) in the United States (1982–2004), the S segment of Lassa virus (LASV) in West Africa (2008–2013), Ebola virus (EBOV) in the Democratic Republic of Congo, Africa (2018–2020), and the coralline red algae subclass Corallinophycidae with contemporaneous data and fossil record informed calibration priors on 6 internal nodes (Biek et al. 2007; Pybus et al. 2012; Andersen et al. 2015; Mbala-Kingebeni et al. 2021; Pena et al. 2020). In all data sets, phylogenetic analyses have revealed a high variation of the evolutionary rates across branches in the underlying phylogeny.

West Nile virus

West Nile virus is a mosquito-borne RNA virus that involves multiple species of mosquitoes and birds where birds are the primary host. WNV first emerged in the Americas in New York in 1999, and quickly spread across the continent, causing an epidemic of human disease accompanied with massive bird deaths. In total, human infections have resulted in over 48,000 reported cases, 24,000 reported neuroinvasive cases, and over 2300 deaths (Hadfield et al. 2019). The molecular sequence data consist of 104 full genomes, with a total alignment length of 11,029 nucleotides, and were collected from infected human plasma samples from 2003 to 2007 as well as near-complete genomes obtained from GenBank (Pybus et al. 2012).

Rabies virus

Rabies virus is an RNA virus that can cause zoonotic disease and is responsible for tens of thousands of human deaths every year. Besides bats, several terrestrial carnivore species such as raccoons are important rabies reservoirs. Before the detection of a raccoon-specific rabies virus variant in the 1970s, there was only limited focus on raccoons as a primary host for rabies in the southeastern United States, specifically Florida. Over the following decades, the virus spread along the mid-Atlantic coast and into the northeastern United States. We analyze the molecular sequences originally described in Biek et al. (2007) that previously served as an example dataset in work on the flexible non-parametric skygrid coalescent model (Gill et al. 2016). The data consist of sequences sampled from rabid raccoons between 1982 and 2004 that contain the complete rabies nucleoprotein gene (1365 bp) with part of a noncoding region (87 bp) immediately following its 3′ end, and a large portion of the glycoprotein gene (1359 bp).

Lassa virus

Lassa virus is the causative agent of Lassa fever, a hemorrhagic fever endemic to parts of West Africa that is responsible for thousands of deaths and tens of thousands of hospitalizations each year (Andersen et al. 2015). The disease is similar to the hemorrhagic fever caused by EBOV. Although fatality rates can exceed 50% among hospitalized patients, an effective vaccine for LASV has yet to be developed and approved. Unlike EBOV (see the next subsection), which passes directly between humans, LASV circulates in a rodent (Mastomys natalensis) reservoir and mainly infects humans through contact with rodent excreta. The LASV genome comprises 2 negative-sense single-stranded RNA segments: the L segment is Inline graphic kilobase pairs (kb) long, and the S segment is kb long. In this paper, we use the S segment of the LASV sequence data set of Andersen et al. (2015) that consists of samples obtained at clinics in both Sierra Leone and Nigeria, rodents in the field, laboratory isolates, and previously sequenced genomes.

Ebola virus

The Ebola virus disease (EVD) outbreak in North Kivu province in the Democratic Republic of Congo (DRC) during 2018–2020 was the world’s second largest Ebola outbreak on record. It led to 3481 total cases with 2299 deaths (World Health Organization 2021). One patient who received the recombinant vesicular stomatitis virus-based vaccine was diagnosed with EVD and recovered within 14 days after treatment. However, 6 months later, the same patient presented again with severe EVD-like illness and EBOV viremia and died (Mbala-Kingebeni et al. 2021). The molecular sequence data consist of Inline graphic sequenced isolates that contain epidemiologically linked cases to the patient’s second infection.

Algae

The coralline red algae (Corallinophycidae) are characterized by the presence of calcite crystals in their cell walls. Corallines, as a group, possess the richest fossil record among marine algae. In their pioneering study, Pena et al. (2020) use a multi-locus dataset with taxon sampling and a comprehensive collection of coralline fossil records to reconstruct a time-calibrated phylogeny of the subclass Corallinophycidae. The algae dataset contains Inline graphic Corallinophycidae taxa and outgroup species with 7 genes (LSU, SSU, 23S, cos1, EF2, psbA, rbcL) concatenated into an alignment of more than bp. We employ the same fossil-informed normal priors on internal nodes as in the original study (Pena et al. 2020). More specifically, we place the same normal priors on the time to most recent common ancestor (tMRCA) with mean Inline graphic Mya (million years ago) and standard deviation Mya for clade A: Harveylithon, mean Mya and standard deviation Mya for clade B: Porolithon, mean Mya and standard deviation Mya for clade C: Lithophyllum pustulatum, mean Mya and standard deviation Mya for clade D: Hydrolithoideae, mean Inline graphic Mya and standard deviation Mya for clade E: Hapalidiales, and mean Mya and standard deviation Mya for clade F: Sporolithales as shown in Figure 7.

Mixed-effects Relaxed Clock Model

We employ mixed-effects relaxed clock models (as detailed in Bletsa et al. (2019)) to learn the evolutionary rates of the 4 viral datasets and the algae dataset. More specifically, we use the same random-effects relaxed clock model detailed in Ji et al. (2020) for the analysis of WNV, RABV, and LASV datasets. For the EBOV example, we use a mixed-effects relaxed clock model with clade-specific fixed-effects to model clade-specific rate variations among the 3 branches leading to 3 clades of interest (relapse clade, MAN14985 clade, and KAT21596 clade). For the algae example, we use a mixed-effects relaxed clock model with clade-specific fixed-effects to model clade-specific rate variations among the 8 clades of interest as in the original study. The use of the clade-specific fixed-effects mimics a local clock model that allows us to model and compare possibly within-clade rate variations but has previously not been computationally feasible.
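To make the structure of a clade-specific mixed-effects clock concrete, the following minimal sketch computes branch rates as the exponential of a shared log-rate, a clade-level fixed effect, and a branch-level random effect. The log-linear form, the function name `branch_rates`, and every numerical value are illustrative assumptions rather than the parameterization or estimates used in the analyses; only the three EBOV clade labels come from the text above.

```python
import math
import random


def branch_rates(base_log_rate, clade_effects, branch_clades, random_effects):
    """Per-branch rates under an assumed log-linear mixed-effects clock.

    log(rate_i) = base_log_rate + fixed effect of branch i's clade + random effect_i
    """
    rates = []
    for clade, eps in zip(branch_clades, random_effects):
        # Branches that lead to no clade of interest receive no fixed effect.
        fixed = clade_effects.get(clade, 0.0)
        rates.append(math.exp(base_log_rate + fixed + eps))
    return rates


rng = random.Random(1)
# Hypothetical assignment of five branches to the three clades of interest.
branch_clades = ["relapse", None, "MAN14985", None, "KAT21596"]
# Hypothetical fixed effects on the log scale (not estimates from the paper).
clade_effects = {"relapse": -0.5, "MAN14985": 0.2, "KAT21596": 0.0}
# Branch-specific random effects drawn from a zero-mean normal distribution.
random_effects = [rng.gauss(0.0, 0.1) for _ in branch_clades]

rates = branch_rates(math.log(1e-3), clade_effects, branch_clades, random_effects)
for clade, rate in zip(branch_clades, rates):
    print(clade, f"{rate:.3e}")
```

In this sketch a negative fixed effect lowers the rate on the branch leading to that clade relative to the background, which is the kind of local-clock-like comparison the clade-specific terms are meant to support.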

Priors

We use the same data partitions, substitution models, and prior distributions as in each example’s original study (Biek et al. 2007; Pybus et al. 2012; Andersen et al. 2015; Pena et al. 2020; Mbala-Kingebeni et al. 2021).

Implementations

We have implemented the algorithms in this manuscript within the development branch of the software package BEAST (SHA 17da204e2d9bdadb6c8284fd092413054f161bdc) (Suchard et al. 2018) with likelihood computations off-loaded to the high-performance BEAGLE library (SHA 3bdb30bd645e15983f8c8cf952564813e306ad83) (Ayres et al. 2019). We provide instructions and the BEAST XML files for reproducing these analyses on GitHub at https://github.com/suchard-group/hmc_divergence_time_manuscript_supplement.

Results

We summarize the computational efficiency improvement with HMC on the ratio space followed by our biological findings on divergence time estimations of the 5 examples.

Computational Performance

We infer the posterior distribution of all internal node heights using 2 different MCMC proposal kernels implemented in BEAST (Suchard et al. 2018) with likelihood computations off-loaded to the high-performance BEAGLE library (Ayres et al. 2019). The first kernel proposes new values for one internal node height at a time from its support. This represents the current best-practice approach used in BEAST and we will refer to this kernel as “univariable.” The other proposal kernel utilizes HMC with a diagonal mass matrix informed by adaptive variance on the ratio space that we will refer to as “HMC.” As is conventional for Bayesian phylogenetics, we employ a Metropolis-within-Gibbs (Tierney 1994; Andrieu et al. 2003) approach that cycles between sampling the tree, the evolutionary rates, and the other phylogenetic modeling parameters, each from their respective full conditional distributions (see, e.g., Equation (6) in Hassler et al. (2023) for more details).
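As a generic illustration of an HMC transition with a diagonal mass matrix, the sketch below samples a toy two-dimensional Gaussian whose coordinates have very different scales. It is not the BEAST implementation: the target, the step size, the number of leapfrog steps, and all function names are assumptions chosen for illustration, the variances are treated as known rather than adapted during sampling, and the ratio transformation and its Jacobian are omitted entirely.

```python
import math
import random


def leapfrog(q, p, grad, step, n_steps, inv_mass):
    """Leapfrog integration; `grad` returns the gradient of the log density."""
    q, p = list(q), list(p)
    p = [pi + 0.5 * step * gi for pi, gi in zip(p, grad(q))]  # initial half step
    for i in range(n_steps):
        q = [qi + step * mi * pi for qi, mi, pi in zip(q, inv_mass, p)]
        scale = step if i < n_steps - 1 else 0.5 * step  # final half step
        p = [pi + scale * gi for pi, gi in zip(p, grad(q))]
    return q, p


def hmc_step(q, log_density, grad, step, n_steps, mass, rng):
    """One HMC transition with diagonal mass matrix `mass`."""
    inv_mass = [1.0 / m for m in mass]
    p0 = [rng.gauss(0.0, math.sqrt(m)) for m in mass]  # momentum ~ N(0, M)
    q1, p1 = leapfrog(q, p0, grad, step, n_steps, inv_mass)

    def hamiltonian(pos, mom):
        kinetic = 0.5 * sum(pi * pi * mi for pi, mi in zip(mom, inv_mass))
        return -log_density(pos) + kinetic

    log_accept = hamiltonian(q, p0) - hamiltonian(q1, p1)
    if rng.random() < math.exp(min(0.0, log_accept)):
        return q1, True
    return list(q), False


# Toy target: independent zero-mean Gaussians with very different variances.
variances = [1.0, 100.0]
mass = [1.0 / v for v in variances]  # mass set to inverse variance per dimension


def log_density(q):
    return -0.5 * sum(qi * qi / v for qi, v in zip(q, variances))


def grad(q):
    return [-qi / v for qi, v in zip(q, variances)]


rng = random.Random(42)
q = [0.0, 0.0]
samples, accepted = [], 0
for _ in range(2000):
    q, ok = hmc_step(q, log_density, grad, 0.5, 10, mass, rng)
    samples.append(q)
    accepted += ok
acceptance_rate = accepted / len(samples)
print(f"acceptance rate: {acceptance_rate:.2f}")
```

Setting each mass to the inverse of the corresponding variance rescales the dynamics so that a single step size suits both coordinates, which is the motivation for informing the diagonal mass matrix with variance estimates.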

As expected, sampling the topology and the high-dimensional rate and time (i.e., node height) parameters is computationally rate-limiting. Therefore, we explore 2 scenarios: 1) we sample divergence times only, while keeping the evolutionary rate and all other parameters fixed in scenario “time”; and 2) we sample evolutionary rate and time jointly, while keeping all other parameters fixed in scenario “rate & time.” We compare the efficiency of these proposal kernels through their effective sample size (ESS) per unit time for divergence time estimations. For each analysis, we run the MCMC iterations with each of the kernels for roughly the same run time (more details regarding chain lengths can be found in the supplementary BEAST XML files). This strategy aims to accommodate the difference in computational cost per MCMC iteration among kernels for fair comparisons. To maintain identifiability of internal nodes, we constrain the comparisons of the WNV, RABV, LASV, and algae examples to a fixed topology that was randomly selected from its posterior distribution. Specifically, we set all parameters, except for those of interest in each scenario, fixed to their realized values from a randomly selected MCMC iteration. The topologies and parameter values of the WNV and LASV examples are the same as in Ji et al. (2020). This topology constraint brings no additional work or difficulty for applying our method to integrate over topology space since one typically cycles between sampling the topology, the divergence times and other parameters, each from their respective full conditional distributions as in a Metropolis-within-Gibbs inference strategy (Tierney 1994; Andrieu et al. 2003). To demonstrate, we relax the topology constraint (i.e., we do not fix the tree topology) for the EBOV example.
We also relax the topology constraint when inferring the maximum clade credible evolutionary trees for all 5 examples and report posterior estimates for the evolutionary rate parameters in this scenario in the following section. We present the computational efficiency improvement with HMC in the ratio space for sampling node heights. The application of HMC on the ratio dimensions greatly improves the mixing of the MCMC chain, whereas the univariable samplers are problematic for learning the height of some internal nodes that are close to the root in the WNV example.
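The efficiency measure used in these comparisons, ESS per unit time, can be sketched as follows. The estimator below truncates the autocorrelation sum at the first non-positive lag; this is a simple textbook choice and an assumption here, not necessarily the estimator used by BEAST or its companion tools. The function names, the autoregressive test chain, and the run time of 10 seconds are hypothetical.

```python
import random


def effective_sample_size(chain):
    """ESS = n / tau, with tau the integrated autocorrelation time.

    The autocorrelation sum is truncated at the first non-positive lag.
    """
    n = len(chain)
    mean = sum(chain) / n
    centered = [x - mean for x in chain]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        return float(n)
    tau = 1.0
    for lag in range(1, n):
        acf = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n * var)
        if acf <= 0.0:
            break
        tau += 2.0 * acf
    return n / tau


def ess_per_second(chain, runtime_seconds):
    """Efficiency measure: effective samples generated per second of run time."""
    return effective_sample_size(chain) / runtime_seconds


rng = random.Random(7)
n = 2000

# A strongly autocorrelated chain (AR(1) with coefficient 0.9) mimics slow mixing.
slow = [0.0]
for _ in range(n - 1):
    slow.append(0.9 * slow[-1] + rng.gauss(0.0, 1.0))

# Independent draws mimic a well-mixing chain.
fast = [rng.gauss(0.0, 1.0) for _ in range(n)]

ess_slow = effective_sample_size(slow)
ess_fast = effective_sample_size(fast)
print(f"ESS slow: {ess_slow:.0f}, ESS fast: {ess_fast:.0f}")
print(f"ESS/s at 10 s run time: {ess_per_second(slow, 10.0):.1f}")
```

Dividing by run time rather than by iteration count is what allows kernels with different per-iteration costs to be compared on equal footing.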

Figure 2 illustrates the posterior sampling efficiency with HMC and univariable samplers in terms of ESS per unit time. Table 1 shows the summary statistics of the efficiency gain of the HMC sampler compared with the univariable samplers for the 3 examples. We exclude the WNV example from the efficiency comparison because the poor mixing with univariable samplers leads to an inflated speed-up for HMC. The HMC sampler yields at least a 5-fold efficiency improvement in terms of the minimum ESS per unit time in the RABV, LASV, and algae examples, for which the univariable sampler has no mixing difficulties (Table 1).

Figure 2. Posterior sampling efficiency on all node height parameters for the WNV, RABV, LASV, and algae examples. We bin parameters by their ESS/s values. The 2 proposal kernels employed in the MCMC are color-coded: a univariable proposal kernel and an HMC proposal kernel with an adaptive mass matrix.

Table 1.

Computational performance of proposal kernels for the RABV, LASV, and algae examples. Computational efficiency is measured in terms of effective sample size per second (ESS/s) and effective sample size per proposal (ESS/N). We compare the performance of our HMC proposal kernel operating on the transformed ratio space with a univariable proposal kernel on the original node height space. We report the minimum and median ESS/s and ESS/N across parameters for each example and method (columns “Univariable” and “HMC”), together with the corresponding speedup. We do not report the unreliably high speed-ups for the WNV dataset because of mixing issues under the “univariable” kernel.

			Univariable		HMC		Speedup
Metric	Source	Scenario	Minimum	Median	Minimum	Median	Minimum	Median
ESS/s	RABV	Time	3.187	12.154	17.358	23.579	5.4	1.9
	RABV	Rate & Time	0.927	4.638	6.324	8.355	6.8	1.8
	LASV	Time	0.008	0.090	0.042	0.104	5.0	1.2
	LASV	Rate & Time	0.002	0.016	0.018	0.040	8.0	2.4
	Algae	Time	2.47E	1.59E	2.72E	1.34E	11.0	8.4
	Algae	Rate & Time	9.01E	7.37E	1.26E	4.26E	14.0	5.8
ESS/N	RABV	Time	2.12E	8.10E	3.39E	4.60E	159.3	56.7
	RABV	Rate & Time	2.26E	1.13E	2.45E	3.24E	108.3	28.6
	LASV	Time	8.68E-6	9.45E-5	1.17E-3	2.92E-3	134.8	30.9
	LASV	Rate & Time	2.70E-6	1.98E-5	1.21E-3	2.63E-3	447.3	132.9
	Algae	Time	4.31E-5	2.77E-4	1.60E-3	7.86E-3	37.2	28.4
	Algae	Rate & Time	2.51E-6	2.05E-5	2.87E-4	9.67E-4	114.2	47.2
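The speedup columns are ratios of the HMC entries to the univariable entries. As a small worked check, the sketch below recomputes the ESS/s speedups for the two RABV rows from the values printed in Table 1; the dictionary layout and the function name `speedup` are illustrative choices. Rows whose entries are printed with few significant digits, such as the LASV ESS/s rows, will not reproduce the reported speedups exactly from the displayed values.

```python
# ESS/s values for the RABV example, as printed in Table 1: (minimum, median).
rabv = {
    "Time": {"univariable": (3.187, 12.154), "HMC": (17.358, 23.579)},
    "Rate & Time": {"univariable": (0.927, 4.638), "HMC": (6.324, 8.355)},
}


def speedup(row):
    """Ratio of HMC to univariable efficiency for the minimum and the median."""
    (u_min, u_med), (h_min, h_med) = row["univariable"], row["HMC"]
    return round(h_min / u_min, 1), round(h_med / u_med, 1)


for scenario, row in rabv.items():
    min_speedup, median_speedup = speedup(row)
    print(f"{scenario}: minimum {min_speedup}, median {median_speedup}")
```

For both RABV scenarios the recomputed ratios agree with the speedup columns of Table 1 to one decimal place.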
