Abstract
Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying character substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. As these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. Thus, we also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of sampling-based inference, namely Bayesian inference via Hamiltonian Monte Carlo, under random-effects substitution models across large trees and state-spaces. Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random-effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is a more adequate model than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences between 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. Simulations reveal that random-effects substitution models can accommodate both negligible and radical departures from the underlying base substitution model. We show that our gradient-based inference approach is over an order of magnitude more time efficient than conventional approaches.
Keywords: Bayesian inference, Hamiltonian Monte Carlo, phylogeography
Along the branches of a phylogenetic tree, discrete characters such as nucleotides, amino acids, or morphologic traits evolve according to some (typically unknown) substitution process. Substitution models are probabilistic representations of the substitution process and are central quantities in phylogenetic and phylodynamic models. Broadly, substitution models describe the relative rates of discrete change from one character state to another.
When inferring phylogenies from character data, the nature of the substitution process is generally not the subject of primary biological interest. Nevertheless, because substitution models stand as the key link between the phylogenetic tree and the observed discrete character data, appropriate modeling remains of paramount importance to avoid bias and their specification has received considerable attention (see, e.g., Tavaré 1986; Suchard et al. 2001, 2003; Woodhams et al. 2015; Abadi et al. 2019; Fabreti and Höhna 2022). There are also cases in which the substitution process is itself of direct interest. In phylogeographic modeling of rapidly evolving pathogens, character states may represent the geographic locations of sampled pathogen sequences and the substitution process describes the spread of the pathogens through geographic space. In this case, inferring an appropriately parameterized substitution model can deliver insight into the factors driving the spread of disease (Lemey et al. 2014, 2020; Dudas et al. 2017). Questions regarding potentially coevolving traits can also be addressed with substitution models by expanding their state-space, such as to pairs of binary characters (Pagel and Meade 2006).
Popular phylogenetic substitution models arecontinuous-time Markov chain (CTMC) models parameterized in terms of one or more infinitesimal rate matrices and branch lengths that measure the expected number of substitutions along each branch in the phylogeny. As the number of possible characters in the data state-space grows, the number of potential parameters in each rate matrix quickly becomes large. When inferring a phylogeny from nucleotide sequences, the rate matrix is small, and relatively parameter-rich models have been considered (Tavaré 1986; Yang 1994a). But rate matrices for other data types can easily grow large: there are 20 amino acids, 64 codons, and phylogeographic analyses can easily encompass many dozens of locations (Lemey et al. 2014; Dudas et al. 2017; Gao et al. 2022). Models which account for heterogeneity of the substitution process along branches, such as Markov-modulated models (Baele et al. 2021) may involve hundreds of parameters. In such cases, inferring the unconstrained model, in which all non-diagonal elements are free parameters, has been historically prohibitive. One reason is because the typical approach to Bayesian inference of substitution models is to use random-walk Metropolis–Hastings-based Markov chain Monte Carlo (MCMC) (Metropolis et al. 1953; Hastings 1970). Such large rate matrices have many parameters which are (potentially) strongly correlated and often only weakly identifiable, rendering random-walk MCMC burdensome.
When confronted with substitution models for large state-spaces, the historical approach has been to find ways to reduce the number of free parameters in the model. Amino acid models are often parameterized empirically (Dayhoff et al. 1978; Whelan and Goldman 2001), requiring no free parameters for inference. Codon models are often represented as combinations of site-level nucleotide models and codon-level processes (Yang et al. 2000), some of which may be measured empirically (Hilton and Bloom 2018). Such approaches reduce the number of parameters that must be inferred to . Another approach is to parameterize the rate matrix in terms of log-linear functions of observed covariates. This generalized linear model (GLM) approach has been successful in phylogeographic inference, where observed covariates include factors like the distance between locations and air travel volumes (Lemey et al. 2014; Dudas et al. 2017). In addition to making inference tractable, the GLM approach can be used to quantify the strength of evidence for which factors do or do not affect the spread of infectious diseases.
In this paper, we demonstrate the utility of random-effects substitution models. These models extend a wide class of CTMC models to incorporate additional rate variation by representing the original (base) model as fixed-effect model parameters and allowing the additional random-effects to capture deviations from the simpler process. We demonstrate the utility of these models in Bayesian inference on a variety of exemplar evolutionary problems. On a dataset of 583 SARS-CoV-2 genomes, an HKY model with random-effects captures known mutational biases in SARS-CoV-2 and is shown to be superior to the richest, general-time reversible (GTR) model. Applied to a phylogeographic analysis of influenza A subtype H3N2, a GLM substitution model with random-effects provides evidence that air travel volume captures the geographic process of dispersal for all except a small set of pairs for which it underpredicts dispersal. As a test for ecologically dependent trait evolution, a random-effects pairwise-dependent substitution model finds no evidence for an effect of arboreality on the swimming mode in hylid tree frogs. We quantify the performance of random-effects substitution models using simulations.
Before we can perform inference under these models, however, there stand two obstacles that must be overcome. First and foremost, the parameter-space of the random effects models can be very large, and these parameters may be strongly correlated. To overcome the dimensionality, we derive an efficient-to-compute approximation of the gradient of the phylogenetic(log-)likelihood with respect to (wrt) all parameters of an arbitrary CTMC substitution model simultaneously. Notably, the exact gradient is often computationally prohibitive. We implement our approximate gradient in the phylogenetic inference software package BEAST 1.10 (Suchard et al. 2018) and the high-performance computational library BEAGLE 3 (Ayres et al. 2019), enabling the use of Hamiltonian Monte Carlo (HMC), a gradient-based alternative to random-walk MCMC (Neal 2011), for efficient parameter inference. HMC leverages gradients to take bold steps through even highly correlated parameter spaces and can greatly increase MCMC efficiency. Second, to avoid identifiability issues with potentially overparameterized models (such as inferring a rate matrix based on a single observed character), we make use of the Bayesian bridge prior (Polson et al. 2014) that is strongly regularizing and allows the data to decide which parameters are important to capture their variability.
The rest of this paper is structured as follows. In the Methods section, we formally introduce the random-effects substitution model and the Bayesian bridge prior distribution. Then we derive our approximate gradient of the phylogenetic log-likelihood with respect to parameters of the substitution model. We also provide an introduction to gradient-based inference. In the Results section, we first investigate the increase in efficiency from using our approximate gradients compared to alternative approaches, both for optimization tasks and full Bayesian inference. Then we apply our random-effects substitution model to a number of real-world examples and to simulated data. We conclude by contemplating future approaches for improving inference efficiency and additional application areas where random-effects substitution models are likely to be useful.
Methods
In this paper, we assume that there is a (possibly unknown) rooted phylogeny with tips that links the observed character sites and internal nodes. We index the branch lengths and nodes such that the edge connecting node to its parent has length . Along each branch of the tree, we assume that characters arise from an alphabet of size and evolve under a CTMC model with instantaneous rate matrix , where for and the diagonal elements are fixed such that row-sums of equal 0. We measure branch length in expected number of substitutions per site according to a probability mass vector over the characters. Often, is taken as the stationary distribution of , but this need not be the case. To account for variation in evolutionary rates across sites in the character alignment, finite mixture models (e.g., the discrete-gamma model of Yang 1994b) modulate the expected number of substitutions along all branches at a specific site. Consider that there are rate categories, the rate scalar in the th category is , and the prior probability of being in any particular mixture category is . Then, the finite-time transition probability matrix along branch in category is given by , where we assume that is normalized wrt . The matrix governs the probability of change from state to state along branch in category . Note that the subscripts on do not denote elements of the matrix but rather which of the distinct transition probability matrices—one for each branch and rate category—is under consideration. In truth, the rate matrix is a function of a vector of estimable parameters , specifically , but we suppress this notation for ease of presentation.
Random-effects substitution models
Random-effects substitution models are extensions of simpler CTMC substitution models. We start with a base model, which could be as simple as Jukes-Cantor (Jukes and Cantor 1969), as complex as a GLM substitution model (Lemey et al. 2014) or anything in between. This base model carries a rate matrix and probability mass vector over the characters. We define the random-effects substitution model rate matrix using the following log-linear formulation,
(1) |
and set . Intuitively, the random-effects are multiplicative real-valued parameters which enable each non-diagonal element to deviate from the values specified by the base model. For example, doubles the rate implied by the base model, .
Random-effects substitution models retain the basic structure of the base model that may be biologically or epidemiologically motivated, while allowing for potentially large deviations from this base process. In our phylogeographic example, we start with an epidemiologically motivated model for the spread of rapidly evolving pathogens parameterized using air travel volume between countries with a GLM substitution model (although we could potentially use many more covariates) and allow the random-effects to capture shortcomings of this description. For convenience, we shorthand the random-effects version of a substitution model , where is the base model (e.g., Hasegawa, Kishino, and Yano model [HKY]; Hasegawa et al. 1985).
Bayesian regularization
Random-effects substitution models are in general overparameterized and as such not identifiable by the likelihood alone. In a Bayesian setting, a prior will provide relief from this and allow for a posterior to be inferred. Nevertheless, such circumstances demand careful thought about the choice of prior, as it will play a key role in determining the posterior. An attractive class of priors for these situations are shrinkage priors such as the Bayesian bridge (Polson et al. 2014) or the horseshoe (Carvalho et al. 2010). Originally developed for handling regression models when there are more parameters than observations, shrinkage priors regularize coefficients (often strongly) pulling them to be near or equal to 0 when the data provide little or no information to the contrary and, otherwise, attempting to impart little bias into the posterior. Shrinkage priors often (though not always) induce sparsity in the model, pulling many coefficients to be effectively 0. Shrinkage priors have also found success in phylogenetic contexts, including models for the rate of evolution of the rate of evolution (Fisher, Ji, Nishimura et al. 2021; Fisher, Ji, Zhang et al. 2021), population sizes over time (Faulkner et al. 2020), and rates of speciation and extinction (Magee et al. 2020). In these cases, the prior pull is toward no change, either from a branch to its descendants or from one time interval to the next. In the context of random-effects substitution models, shrinkage is imposed by pulling the random-effects toward the null value of 0 (at 0, there is no deviation from the base model in the direction and ).
Shrinkage priors also permit us to perform model selection, with the data and prior reconciling which parameters belong in or out of the model. For the random-effects substitution model, a particular random-effect is excluded from the model if it is approximately 0. Where discrete mixture models, such as those used in Bayesian stochastic search variable selection (BSSVS, Lemey et al. 2009), carry a finite probability that a parameter achieves exactly 0, shrinkage priors instead have a large spike of prior density near 0. While this occasionally makes it more difficult to declare if a parameter belongs in the model or not, the use of purely continuous priors usually yields Markov chains that mix more efficiently and, importantly, permits the use of gradient-based inference.
The Bayesian bridge prior on random-effect has density,
(2) |
where the global scale controls the overall spread of the distribution and the exponent controls the shape. The Bayesian bridge is perhaps best thought of as family of distributions, modulated by , some of which fall into the class of sparsity-inducing priors () and some of which do not. At , the density coincides with a standard normal distribution, while at it is the density of the Laplace distribution. At lower exponent values, the distribution becomes increasingly peaked around 0 and induces sparsity. We use , which in practice imposes a useful level of sparsity without compromising MCMC convergence. The global scale also plays an important role in determining the degree of regularization. To both permit the data to inform the strength of regularization and efficient Gibbs sampling procedures, we place a Gamma(shape=1, scale=2) prior on (Nishimura and Suchard 2022).
The Bayesian bridge distribution has particularly fat tails for lower . This can hamper sampling, and it can allow parameter values which are, a priori, unrealistically large (or small). Particularly, large random-effects can also cause numerical instability when exponentiating the substitution-rate matrix. These effects can be ameliorated by the use of the shrunken-shoulder Bayesian bridge (Nishimura and Suchard 2022). This formulation includes a “slab” parameter that controls the tails of the distribution. Specifically, outside of , the tails of the shrunken-shoulder Bayesian bridge become Normal(0,). We set , which a priori specifies that it is unlikely for a particular element of the rate matrix to be more than times larger or smaller than specified by the base model.
The gradient of the phylogenetic log-likelihood
In this paper, we are interested in the gradient of the phylogenetic log-likelihood with respect to the parameters of the substitution model. The data are a collection of homologous sites (columns in a multiple sequence alignment), . We will write the likelihood , and its gradient . The gradient is the collection of derivatives wrt all substitution model parameters,
(3) |
where ′ denotes the transpose operator.
Under the common assumption that sites evolve independently and identically, we can express the log-likelihood as a sum across all sites, and hence derivatives of it as well. We have
(4) |
The denominator is simply the likelihood of a site . For simplicity, we will focus on the computation of for site under rate category .
Partial likelihood vectors and the phylogenetic likelihood
We can, at any node in the tree, compute the likelihood as
(5) |
The post-order partial likelihood vector describes the probability, at node , in rate category , at the th site, of observing the tip-states in all tips which descend from the node, conditioned on the state at the node. The pre-order partial likelihood vector, , describes the joint probability of observing the tip-states in all tips not descended from the node and the state at the node.
The post-order partial likelihood vectors are computed via pruning from the tip to the roots (a post-order traversal), for the tree in Fig. 1, via
(6) |
where denotes the Hadamard (element-wise) product. The pre-order partial likelihood vectors are then computed in a root-to-tip pass through the tree (a pre-order traversal) using the relation
(7) |
We note that is independent of , while is dependent on . At the root , the pre-order partial likelihood vector is simply the root-frequency vector , which may or may not be the same as the frequency vector used to normalize the rate matrix.
A naïve derivative
We can use the multivariable chain rule to obtain the total derivative of the likelihood wrt . To do this, we first envision a simple parameter expansion with branch- and root-specific variables . By setting , we can frame the derivative wrt in terms of a sum across the all the per-branch contributions from the derivatives wrt . First we rewrite the differential as
(8) |
Then, the partial likelihood vector representation of the phylogenetic likelihood allows us to isolate the contribution of each branch and to this total derivative. In doing so, we also recall that is independent of and , such that for all . By summing over all branches and the root, we obtain the total derivative as
(9) |
where the contribution from root-frequency vector is
(10) |
In the third-to-last step, we defined to simplify the notation and focus on the part of the equation which depends on (). In the last step, we employed the matrix chain rule (Petersen et al. 2008). The term is the derivative of the matrix exponential with respect to one of the elements of the rate matrix, which we discuss in more detail in the Section “Efficiently approximating the derivative of the matrix exponential.” We note that the rate matrix is sometimes defined to be unnormalized, such that the transition probability matrix along a branch is instead given by . For simplicity of notation, when normalization is desired we take the rate matrix to be normalized, and allow the term to capture the effect of the normalizing constant on the elements of the rate matrix. We discuss the computation of the gradient of the rate matrix with respect to its parameters and the root contribution in Supplementary Material, “More details on the form of the gradient” section.
As we discuss in Supplementary Material, “Computational complexity of alternative approaches to computing the gradient of a matrix exponential” section, the computational cost of obtaining is . The sum in Equation 9 requires this quantity for all elements in and for each of the branches, making the cost to compute the derivative . Obtaining the gradient requires using Equation 9 for all substitution model parameters, making the cost of the gradient . For random effects models, this is . Such costs are prohibitive for even moderate , so we turn our attention now to improving the computational efficiency of gradient computations.
Reducing the computational complexity
We can reformulate the naïve approach of Equation 9 to produce a more efficient gradient computation. By rearranging the order of summation, we can disentangle the derivative of the rate matrix wrt its elements from the derivative of its elements wrt model parameters. Specifically,
(11) |
where the operator makes the matrix into a column vector by stacking the columns on top of each other, and we obtain the last line by defining two new quantities which we will now discuss. The matrix is a mapping matrix, which stores in each row a vector of the partial derivatives of all elements of wrt ,
(12) |
The matrix contains the contribution of branch to the derivative of the phylogenetic likelihood wrt the entry of in rate category . Specifically,
(13) |
We arrive at the entire gradient (as opposed to a single entry) and increase computational efficiency by replacing with in Equation 11 and rearranging,
(14) |
This approach separates the gradient of the phylogenetic likelihood wrt model parameters into two pieces, a gradient of the phylogenetic likelihood wrt elements of the rate matrix, and a gradient of the elements of the rate matrix wrt the model parameters. The result is the intermediate quantity that can be obtained with only a single computation of the derivative of a matrix exponential per branch. As this quantity can be summed across the tree prior to mapping it to the substitution model parameters, matrix multiplications are avoided. The result is that this approach is rather than . For random-effects substitution models, this is the difference between a computation and a computation. Note that this approach works for branch-specific models as well, by specifying the mapping matrix appropriately.
Efficiently approximating the derivative of the matrix exponential
We now turn our attention to an efficient approximation to the derivative of a matrix exponential. The derivative of a matrix exponential can be represented as a power-series (Najfeld and Havel 1995, equation 103),
(15) |
where is a matrix which is 0 for all but the th entry, which is 1. The matrix commutator power for non-negative integer is defined recursively (Najfeld and Havel 1995), such that and , where is the matrix commutator .
The first-order approximation to Equation 15 is taken by keeping only the term, yielding
(16) |
We can use this first-order approximation to approximate on each branch. Specifically,
(17) |
where we get from line 3 to line 4 by noting that and applying Equation 7. Intuitively, we have the (approximate) derivative with respect to an transition depending on the pre-order partial likelihood in state and the post-order partial likelihood in state .
Equation 17 means that we can write our approximate as an outer product,
(18) |
This means that we can obtain all entries of in , which is much more efficient than the cost of the non-approximate computation. Thus, with this approximation and the mapping approach outlined in the previous section, the (approximate) substitution gradient can be obtained in , rather than the cost suggested by Equation 14 or the cost suggested by Equation 9. We will denote the approximate gradient that comes from using this approximation to in Equation 14 as .
Hamiltonian Monte Carlo with surrogate trajectories
HMC (Duane et al. 1987; Neal 2011) is an advanced MCMC algorithm that falls broadly within the well-known class of Metropolis–Hastings MCMC (MH-MCMC) algorithms (Metropolis et al. 1953; Hastings 1970). By allowing samples to be drawn (sequentially) from arbitrary target distributions, MH-MCMC algorithms like HMC allow users to approximate distributions that do not have known closed-form densities. Unlike many commonly employed random-walk Metropolis–Hastings proposals, however, HMC uses information captured by the log-posterior gradient to traverse a model’s parameter space much more efficiently by proposing (and accepting) states which are farther apart.
HMC constructs an artificial Hamiltonian system by augmenting the parameter space to include an auxiliary Gaussian “momentum” variable MVN that is independent from the target variable by construction. Letting denote the posterior density, the resulting Hamiltonian energy function is the negative logarithm of the joint distribution over and . Ignoring normalizing constants we obtain
and Hamilton’s equations are
On the one hand, one may show that the action of the dynamical system that satisfies these equations leaves the target invariant thanks to the reversibility, volume preservation and energy conservation of Hamiltonian dynamics. On the other hand, closed-form descriptions of these dynamics are rarely available for arbitrary target distributions, leading to the need for computer intensive approximations. In particular, the Störmer–Verlet (velocity Verlet) or leapfrog method (Leimkuhler and Reich 2004) has become the numerical integrator of choice for obtaining discretized trajectories within HMC. Beginning at time and letting be small, a single leapfrog iteration proceeds thus:
(19) |
Trajectories arising from concatenated leapfrog iterations maintain some of the desirable qualities of the exact Hamiltonian dynamics (reversibility, volume preservation) but no longer conserve energy.
A single iteration of the HMC algorithm features three distinct steps. It starts at the current parameter values , setting , and draws an initial momentum from the given Gaussian distribution, setting . Then, it uses steps of the leapfrog integrator for some given step-size , repeatedly applying Equation 19 to generate a proposed parameter state , as well as a final momentum . Lastly, it either accepts or rejects this new value according to the usual Metropolis–Hastings acceptance rule (Metropolis et al. 1953; Hastings 1970) with acceptance probability , where indicates the density of the Gaussian momentum distribution evaluated at . (The final momentum is negated to ensure the proposal is a symmetric (Neal 2011; Vats 2023).) The accept/reject step accounts for integration error and leaves the target distribution invariant.
Indeed, HMC’s Metropolis correction allows for additional deviations from Hamiltonian dynamics over and beyond numerical discretization schemes such as Equation 19. Surrogate HMC methods seek to improve computational performance of HMC by approximating the log-posterior gradient with less expensive surrogate models including, e.g., piecewise-approximations (Zhang et al. 2017b), Gaussian processes (Rasmussen 2003; Lan et al. 2016) or neural networks (Zhang et al. 2017a; Li et al. 2019). Directly relevant to the present work, Li et al. (2019) show the validity of replacing the log-posterior gradient within the leapfrog method (Equation 19) with any vector function . In particular, such an approach maintains the reversibility and volume preservation of Hamiltonian dynamics and, when paired with Metropolis corrections, leaves the target posterior distribution invariant. In the present work, we select , the approximate posterior gradient obtained by using our approximation to the gradient of the phylogenetic log-likelihood and the true gradient for the prior. In the supplementary Materials, we discuss an alternative justification.
Results
C to T bias in SARS-CoV-2 evolution
The mutational profile of SARS-CoV-2 has been intensely scrutinized, one feature in particular which has been noted is a strongly increased rate of CT substitutions over the reverse TC substitutions. We note that while RNA viruses like SARS-CoV-2 use uracil (U) in place of thymine (T), it is generally coded as thymine–the coding of adenosine (A), cytosine (C), and guanine (G) are unchanged. The elevation of one direction of substitution over its reverse is a violation of the common phylogenetic assumption of reversibility made by the GTR (Tavaré 1986) family of substitution models. Random-effects substitution models are suitable for addressing this model violation, in particular we consider an HKY+RE substitution model. In principle, we could choose any GTR-family model. HKY represents a balance between the simplicity of JC+RE (where the random-effects would also have to account for uneven nucleotide frequencies) and the complexity of GTR+RE (where the random-effects only capture nonreversibilities). The rate matrix is
(20) |
where is the HKY parameter governing relative rate of transitions to transversions, indicates that the to change is a transition, and are the HKY stationary frequencies.
We apply this HKY+RE model to infer both the dynamics of molecular substitution and the phylogeny for 583 SARS-CoV-2 sequences from Pekar et al. (2021). (More information about the model and dataset is in Supplementary Table S1.) Consistent with previous studies (e.g. Matyášek and Kovařík 2020; Tonkin-Hill et al. 2021), we find evidence for a greatly elevated rate of CT substitutions, as well as an elevated GT rate (Figure 2). We can test the support for nonreversibilities, for example the difference between the CT and TC rates, with Bayes factors. The fact that a model with the CT and TC rates equal (reversible wrt CT) is nested within the random-effects model allows us to use the Savage-Dickey ratio (e.g. Wagenmakers et al. 2010) to compute the Bayes factor from the posterior distribution of the random-effects model, as we discuss in the Supplemental Section “Assessing the strength of evidence for nonreversibilities.” (There is no need to fit any additional models or estimate marginal likelihoods directly, though for models like these where analytical posterior distributions are unavailable, the results are still not entirely free from the numerical difficulties common in Bayes factor estimation.) The Bayes factor provides “very strong” (Kass and Raftery 1995) support for the nonreversibility of CT and GT rates (over the reversible model). We can also assess the strength of evidence for nonreversibilities via the posterior sign probability. This is the posterior probability that the sign of a variable is the same as the sign of the posterior median (this is one minus the tail probability used by Zhang et al. 2021). The estimated sign probability ranges from 0.5 to 1.0, with larger values indicating increasingly strong support that the parameter is non-zero. Here, as with Bayes factors, we are interested in the sign probabilities of the differences in random-effects rather than the random-effects directly. The sign probabilities agree with the Bayes factors that there is strong evidence for the nonreversibility of CT and GT rates, with both estimated sign probabilities above 0.99.
Given the strong evidence for nonreversibilities, we sought to investigate the issue of the adequacy of reversible models (namely GTR) using posterior predictive model checks. In a posterior predictive framework, a summary of the observed dataset is compared with the distribution of summaries of datasets produced by drawing from the posterior distribution on model parameters. Broadly, if the model fits the data well, we expect that the predicted summaries will match the observed values, while if the fit is poor there will be a mismatch. As our test statistics, we consider all pairwise covariances of the proportion of each nucleotide (A, C, G, and T) across the alignment (we discuss this in more detail in Supplementary Material, “Posterior predictive p-values for proportions” section). These test statistics clearly demonstrate that the HKY+RE model better captures the evolutionary processes at hand (Fig. 3). Compared with inference using GTR, the analysis with HKY+RE produces notably higher support for the root-most divergence (the 95% credible set includes 67 possible resolutions for GTR and 1 for HKY+RE, Supplementary Fig. S8) and infers a root time approximately 5 days earlier.
Phylogeography of influenza from 2002 to 2007
For a larger state-space example of random-effects substitution models, we consider the global spread of human influenza A virus (subtype H3N2) from 2002 to 2007. Lemey et al. (2014) examined the movement patterns between 14 distinct air travel communities using 1529 viral genomes. The authors used a GLM to parameterize the spread of the virus between these communities as a function of a number of covariates, and discovered that the most consistently supported predictor of spread between communities was the volume of air traffic.
We re-analyze this dataset using a GLM substitution model with random-effects. We now briefly review the setup of a GLM substitution model, and our random-effects extension. For each pair of locations and , let be a vector of predictors of the rate of movement from to (these may depend on the source , the destination , or both) with associated coefficients . A GLM substitution model with random-effects defines the rate matrix through
(21) |
This is a log-linear model, in which the GLM defines a substitution rate based on predictors and the estimated coefficients, and the random-effects allow for deviations from the model’s predictions.
In particular, we employ a simple GLM with only air traffic included as a predictor. This approach allows us to determine how well air traffic volume predicts the spread of influenza A virus in the mid-2000s. If most random-effects are negligible, then air traffic volume alone is perhaps adequate for modeling the spread of influenza in this time frame. On the other hand, if many or most random-effects are not negligible, although air traffic volume may be an important model component, it is not sufficient to explain spread, absent random-effects. While Lemey et al. (2014) used spike-and-slab priors on in a Bayesian model averaging approach, since we are using only predictors identified previously to be important, we use a Normal prior instead (corresponding to the slab in the original study). We apply Bayesian bridge priors for the random-effects. To account for phylogenetic uncertainty, we marginalize our inference over the same empirical distribution of phylogenetic trees used by Lemey et al. (2014).
We find that air traffic volume sufficiently explains the viral spread between most communities. That is, for most community pairs, the posterior distribution of the random-effect indicates that the parameter has been declared “insignificant,” and is a spike centered at 0 (Fig. 4). However, for five pairs of communities (from the United States to Japan and South America; from China to the United States and Japan; from Oceania to the United States), the inferred random-effect is clearly significant (all sign probabilities ) and strongly positive, indicating 6- to 12-fold higher dispersal than predicted by travel. There is support for an additional six random-effects (from the United States to Oceania, Russia, and Southeast Asia; from China to Oceania; from Japan to Oceania; from Southeast Asia to Oceania) which have sign probabilities between 0.87 and 0.97 and correspond to 2- to 5-fold higher dispersal than predicted by travel. All other area pairs of sign probabilities are less than 0.78. Given the offset seasons between hemispheres, some of these connections likely do not represent biologically meaningful connections, and may potentially be attributed to sampling biases. A comparison of the number of samples in the dataset to the population sizes of the regions (a rough proxy for the number of infections in the regions) reveals that the United States, Oceania, and Japan are strongly oversampled. Thus, sampling biases likely explains many of the significant random-effects, including the between-hemisphere connections. As China is not particularly oversampled, the elevated rates of transmission from China may represent source-sink dynamics which are not captured by air travel alone, rather than sampling bias.
Analysis of paired macroevolutionary traits
Random-effects substitution models can also be used to test for dependent substitution processes between multiple characters as follows. Let us assume that we have two characters of interest, and . These characters could be morphological, behavioral, or even ecological traits. If these characters evolved independently along the phylogeny , we could model this with two rate matrices, and , a (strict) clock rate which defines the rate of change (in substitutions per year or million years) for , and a relative rate parameter which defines how much faster (or slower) evolves compared with . We can define a composite character from and by considering both states simultaneously. This yields a new character which is the Cartesian product of the two state-spaces, with the combined state-space size . The rate matrix for the combined character is 0 for any double substitution and for any single substitution is defined by or depending on which character changes. Written on the log-scale, the (unnormalized) rate matrix is given by
We can test for departures from independent evolution by allowing the state of one character to modulate the rates of change between states in the other through the addition of random-effects.
We employ this random-effects dependent morphological evolution model on a dataset of 29 species of frogs in the family Hylidae (subfamily Hylinae). Taking the phylogeny inferred by Caviedes-Solis et al. (2020) to be fixed, we focus on two traits described in Caviedes-Solis (2019), one ecological and one behavioral. The ecological trait is the habitat, which is characterized as either arboreal or understory. The behavioral trait is the swimming mode, which is characterized by whether the back legs move in an alternating or simultaneous fashion or whether both types are observed. To determine the structure of the underlying independent-trait models, we first fit the independent model using asymmetric rates for both traits. Bayes factors show no evidence for any model more complex than the Mk (Jukes-Cantor-like) model (Lewis 2001).
In particular, we are interested in whether the degree of arboreality, defined as habitat preference, impacts the swimming mode, as canopy-dwelling species move the back legs in an alternating fashion while climbing. Thus, we place random-effects only in the direction of arboreality affecting swimming mode. Letting be arboreality and be swimming mode, the unnormalized rate matrix for our random-effects substitution model is,
We infer no effect of arboreal habitat on the swimming mode, all posterior sign probabilities are between 0.5 and 0.57, indicating that all random-effects have clearly been deemed insignificant (Fig. 5). We also infer that the rate of habitat evolution is roughly twice that of swimming-mode evolution (, 95% CI 0.23–1.16). There are two important caveats to these results. First, with only 29 species, the power to detect dependent evolution is likely low unless the effect is quite large. Secondly, by only modeling two traits, we are missing out on possible interactions between other aspects of ecology (such as the aquatic environments the species make use of) and morphology (such as the lengths of limbs and digits) which might modulate this relationship.
Performance gains from gradients
For inferring random-effects in nucleotide substitution models, we find a notable improvement in efficiency using HMC with our approximate gradients over using standard random-walk MH-MCMC. For our measure of efficiency, we consider the number of effectively independent samples taken per second (ESS/s). This measure incorporates both the increased ability of HMC to move through parameter space, as well as the increased cost per MCMC move required for repeated evaluation of the gradient. We track the efficiency separately for each random-effect (that is, we use the univariate ESS), and we consider two summaries of efficiency gains from HMC. As an overall measure of efficiency increase, we consider the parameterwise average increase in the efficiency. However, as analyses are constrained by waiting for the slowest-mixing parameter to achieve a sufficiently large ESS, we also consider the improvement in the minimum ESS (regardless of which parameter is slowest-mixing). When applied to nucleotide models (HKY+RE) to infer the tree from sequence data, we observe an average parameterwise increase in efficiency of 6.6-fold, and an increase in the minimum efficiency of 14.8-fold (Fig. 6). For the larger state-space of the flu phylogeographic example (14 discrete areas), where we average across a set of posterior samples of the tree from the original study, we find an average parameterwise increase in efficiency of 20.2-fold and an increase in minimum efficiency of 33.6-fold (Fig. 6). Timing was done on a MacBook Pro with an 8-core CPU M1 Pro chip and 32 GB of memory.
Analyses of simulated data
To assess the performance of random-effects substitution models in estimation of model parameters, we performed a simulation study. We based the simulation setup on our analysis of the SARS-CoV-2 data. In particular, we used the posterior distribution of trees, the HKY parameter, and the shape parameter governing the Gamma-distributed among-site rate variation. For the random-effects, we simulated from three groups: null effects, moderate effects, and strong effects. For each of these we drew values from Normal distributions (Supplementary Fig. S3) chosen to reflect the values observed in the real-data posterior distributions. The strong effects were CT and GT, which were simulated from a Normal(1.50,0.18) distribution. The moderate effects were AT and GA, which were simulated from Normal(0.68,0.37) and Normal(0.68,0.37) distributions, respectively. The remaining 8 random-effects were classified as null and simulated from Normal(0.0,0.11) distributions. We simulated 100 datasets under this model.
Analyses of simulated datasets were conducted following the analysis of the SARS-CoV-2 dataset, with two exceptions. First, we treated the tree as known. Second, we considered several values for the exponent parameter, each simulated dataset was analyzed four times with . These values range from strongly regularizing priors (small ) to the weakly regularizing Laplace prior ().
Overall, we find that random-effects are well-estimated and that random-effects which imply strong deviations from the base model (HKY) can be confidently identified using the posterior sign probability. Overall the posterior means are strongly correlated with the true simulating value, (), although it appears that null and strong effects are generally better-estimated than moderate effects (Fig. 7 and Supplementary Fig. S5). To determine whether a random-effect is significant, one can use a threshold on the sign probability, declaring larger sign probabilities to be evidence for significant effects. Particularly notable deviations from the base model are easy to detect at any chosen threshold. Lower thresholds declare many negligible deviations to be significant, while higher thresholds are somewhat underpowered to detect smaller, but potentially important, deviations. A threshold of around 0.8 (0.75–0.85) provides a good balance between these forces (Supplementary Fig. S4).
Examining the choice of exponent , we find that values on the order of provide a reasonable trade-off between estimation performance and MCMC behavior. The smaller exponents, and , in general produce posteriors which are notably closer to the true values than the larger values (Supplementary Fig. S5). The difference in performance is less notable for the evidence for significance provided by sign probabilities. The smaller exponents perform better for identifying null effects as null, while the larger exponents produce more confident support that moderate effects belong in the model, and all coefficients do well with strong effects (Supplementary Fig. S6). We investigated MCMC efficiency by examining the minimum effective sample size per sample. This efficiency is higher at than any other exponent (Supplementary Fig. S7).
Discussion
In this paper, we demonstrated the versatility and usefulness of random-effects substitution models. By wrapping around a simpler base substitution model, random-effects substitution models enable increased flexibility while retaining the useful structure of the base model. Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY+RE model picks up strong CT and GT mutational biases and is shown by posterior predictive model checks to be an adequate substitution model where reversible models like GTR fail. Used with a GLM substitution model to analyze the phylogeographic pattern of spread of influenza in humans, the random-effects suggest the air traffic volume alone is a powerful explanation for the spread of influenza from 2002 to 2007. In examination of the evolution of ecological and behavioral characters in hylid tree frogs, a random-effects model shows no evidence for an effect of arboreality on the mode of swimming. Simulations show that random-effects can be accurately estimated and provide guidelines for interpreting whether a random-effect is significant or not.
To enable efficient inference of random-effects substitution models, we derived an approximate substitution gradient. The time-complexity of our approximate approach is cubic in the size of the state-space, while “exact” analytical techniques are quintic. For parameter-rich random-effects substitution models, numerical gradients are also quintic, and our approximate gradients enable maximum a posteriori inference of the parameters of an amino-acid substitution model over 50 times faster than numerical gradients (“Inferring the dynamics of amino acid substitution in Metazoa” section in the Supplementary Material available on Dryad). Used in Bayesian inference, we find that HMC using our approximate gradients is 6.6 to 20.2 times more efficient than standard Metropolis–Hastings moves, with yet more substantial gains when comparing the dimension with the most difficult sampling (where the efficiency gains are 14.8- and 33.6-fold). In particular, it appears that the efficiency of HMC with the approximate gradients is roughly invariant to the dimension (Fig. 6). For our SARS-CoV-2 example, with a rate matrix, the average efficiency of HMC is 6.3 effective samples per second, while for the influenza A virus phylogeographic example, with a rate matrix, it is 7.6 effective samples per second. However, the efficiency of Metropolis–Hastings moves decreases from 1.5 effective samplers per second to 0.41. We expect this trend to continue as the size of the state-space increases, and that for sufficiently large models (such as codon models or Markov-modulated amino acid models), HMC will be the only approach capable of inferring random-effects substitution models in any reasonable timeframe.
Although the approximate substitution gradient we derived performed very well in our applications, it cannot be expected to perform ideally in every circumstance. Mathematical analysis and in silico experiments suggest that the error in our approximation grows with the branch length measured in genetic distance (“Error in the approximate gradient” section in the Supplementary Material available on Dryad). Thus, we should expect performance to be best where the tree has few substitutions per site. Wertheim et al. (2022) refer to this as the near-perfect regime, and it is common in viral phylodynamic applications. However, we note reasons for optimism in applying our approximate gradients in regimes with larger numbers of substitutions. The influenza phylogeographic example falls outside the near-perfect regime, and the efficiency of HMC using our approximate gradients is still quite good. Similarly, good inference efficiency is observed in maximum a posteriori inference of an amino acid model on a Metazoan tree which has over 5.5 substitutions per site on average (“Inferring the dynamics of amino acid substitution in Metazoa” section in the Supplementary Material available on Dryad). It is also important to note that when used for HMC, the accept–reject step ensures correctness even in regimes where the approximation gets poor. It is likely the error bound we have obtained is quite conservative. Furthermore, Didier et al. (2023) establish a more rigorous error bound and show that the error decreases with increasingly large state-spaces, suggesting that phylogeographic analyses are well suited to this approximation.
An open question is to define the regimes where the approximation becomes poor enough that inference becomes inefficient such that other techniques would be preferable. We note two such alternative approaches which could be considered and compared with the efficiency of our approximation in future work. An exact gradient can be obtained from a data-augmentation procedure which jointly samples the complete mutational history along the tree, such as the approach adopted by Lartillot (2006). Within the framework of approximate gradients, an affine correction, as Didier et al. (2023) suggest, may yield smaller approximation error when the expected number of substitutions per branch is relatively large.
There are a number of important extensions of this work. Currently, we have implemented the gradient computations (in BEAST 1.10 (Suchard et al. 2018) and BEAGLE 3 (Ayres et al. 2019)) for use on CPUs; however, GPU-based likelihood computations have proven incredibly efficient in many phylogenetic contexts (Suchard and Rambaut 2009; Dudas et al. 2017; Ayres et al. 2019; Baele et al. 2021; Lemey et al. 2021). In particular, Gangavarapu et al. (2023) recently showed minimum increases of 8-fold and 128-fold for nucleotide and codon models, respectively, when computing gradients with respect to branch rate parameters. A GPU implementation of our approximate gradients would likely produce notable speedups in inference, especially for large state-space models. Mathematically, our approximation holds for any case in which there is a single substitution rate matrix on any edge of the phylogeny (although we have currently only implemented the case for a single rate matrix across the whole tree). However, the process of geographic spread may be temporally inhomogeneous while applying consistently across all lineages alive at any given time. In such cases, epoch models (Bielejec et al. 2014; Gao et al. 2022) are needed. The epoch times break branches into multiple regimes, which requires matrix convolutions for likelihood computation and thus an extension of our approach.
Random-effects substitution models are a flexible approach for creating more realistic substitution models, but they are not a panacea. They cannot, for example, address gross violations of the underlying assumptions of the CTMC model, such as memorylessness. Nor can they address dependence between characters without carefully predefining the set of (potentially) coevolving characters and expanding the state space of the model. The Bayesian bridge provides a robust framework for regularization, and HMC an efficient framework for inference. However, the additional complexity of random-effects models may occasionally cause challenges for MCMC which require more active user intervention. Consider, for example, the (likely) APOBEC-induced CT bias observed in our SARS-CoV-2 example, which, in a double-stranded virus, will also lead to a GA bias (Gigante et al. 2022; O’Toole et al. 2023). Application of HKY+RE to such a dataset will lead to multimodality (caused by ridges in the likelihood) jointly involving five substitution model parameters, and the pairs of random-effects and . Such multimodality does not invalidate the model, and it could be mitigated by the use of TN93+RE or avoided entirely by using a simpler model like F81+RE.
Acknowledgments
We gratefully acknowledge support from Advanced Micro Devices, Inc. with the donation of parallel computing resources used for this research. The views expressed in this article are those of the authors and do not necessarily represent the views or policies of the Centers for Disease Control and Prevention.
Contributor Information
Andrew F Magee, Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California - Los Angeles, Los Angeles, CA, USA.
Andrew J Holbrook, Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California - Los Angeles, Los Angeles, CA, USA.
Jonathan E Pekar, Bioinformatics and Systems Biology Graduate Program, University of California - San Diego, La Jolla, CA, USA; Department of Biomedical Informatics, University of California - San Diega, La Jolla, CA, USA.
Itzue W Caviedes-Solis, Department of Biology, Swarthmore College, Swarthmore, PA, USA.
Fredrick A Matsen IV, Howard Hughes Medical Institute, Seattle, Washington, USA; Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA; Department of Genome Sciences, University of Washington, Seattle, Washington, USA; Department of Statistics, University of Washington, Seattle, Washington, USA.
Guy Baele, Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium.
Joel O Wertheim, Department of Medicine, University of California - San Diego, La Jolla, CA, USA.
Xiang Ji, Department of Mathematics, Tulane University, New Orleans, LA, USA.
Philippe Lemey, Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium.
Marc A Suchard, Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California - Los Angeles, Los Angeles, CA, USA; Department of Biomathematics, David Geffen School of Medicine at UCLA, University of California - Los Angeles, Los Angeles, CA, USA; Department of Human Genetics, David Geffen School of Medicine at UCLA, University of California - Los Angeles, Los Angeles, CA, USA.
Supplementary material
Data available from the Dryad Digital Repository: https://doi.org/10.5068/D1709N.
Funding
This work was supported through NSF grants DMS 2152774 and DMS 2236854, as well as NIH grants R01 AI153044, R01 AI162611, and K25 AI153816. J.O.W. was funded by AI135992. J.E.P. was funded by NIH T15LM011271. Dr. Matsen is an Investigator of the Howard Hughes Medical Institute. P.L. acknowledges funding from the European Research Council under the European Union’s Horizon 2020 research and innovation program (grant agreement no. 725422-ReservoirDOCS) and from the European Union’s Horizon 2020 project MOOD (grant agreement no. 874850). G.B. acknowledges funding from the Internal Funds KU Leuven under grant agreement C14/18/094, from the Research Foundation – Flanders (‘Fonds voor Wetenschappelijk Onderzoek – Vlaanderen’, G0E1420N and G098321N) and from the DURABLE EU4Health project 02/2023-01/2027, which is co-funded by the European Union (call EU4H-2021-PJ4) under Grant Agreement No. 101102733.
Data availability
BEAST XML files for the analyses in this paper are available at, as well as the supplementary text, are available on Dryad, https://doi.org/10.5068/D1709N. The approximate gradients have been implemented in the hmc-clock branch of BEAST (https://github.com/beast-dev/beast-mcmc/tree/hmc-clock/) and the v4.0.0 release of BEAGLE (https://github.com/beagle-dev/beagle-lib/releases/tag/v4.0.0). BEAST XML files for the analyses in this paper are additionally available at https://github.com/suchard-group/approximate_substitution_gradient_supplement.
References
- Abadi S., Azouri D., Pupko T., Mayrose I. 2019. Model selection may not be a mandatory step for phylogeny reconstruction. Nat. Commun. 10(1):1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ayres D.L., Cummings M.P., Baele G., Darling A.E., Lewis P.O., Swofford D.L., Huelsenbeck J.P., Lemey P., Rambaut A., Suchard M.A. 2019. BEAGLE 3: improved performance, scaling, and usability for a high-performance computing library for statistical phylogenetics. Syst. Biol. 68(6):1052–1061. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Baele G., Gill M.S., Bastide P., Lemey P., Suchard M.A. 2021. Markov-modulated continuous-time Markov chains to identify site-and branch-specific evolutionary variation in BEAST. Syst. Biol. 70(1):181–189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bielejec F., Lemey P., Baele G., Rambaut A., Suchard M.A. 2014. Inferring heterogeneous evolutionary processes through time: from sequence substitution to phylogeography. Syst. Biol. 63(4):493–504. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Carvalho C.M., Polson N.G., Scott J.G. 2010. The horseshoe estimator for sparse signals. Biometrika 97(2):465–480. [Google Scholar]
- Caviedes-Solis I.W. 2019. Intertwined evolution of swimming, morphology and microhabitat in tree frogs from the subfamily Hylinae. [PhD thesis]. University of Washington. https://digital.lib.washington.edu/researchworks/handle/1773/44728 [Google Scholar]
- Caviedes-Solis I.W., Kim N., Leaché A.D. 2020. Species IUCN threat status level increases with elevation: a phylogenetic approach for neotropical tree frog conservation. Biodivers. Conserv. 29(8):2515–2537. [Google Scholar]
- Dayhoff M., Schwartz R., Orcutt B. 1978. A model of evolutionary change in proteins. In: Dayhoff M.O., editor. Atlas of protein sequence and structure. Washington, (DC): National Biomedical Research Foundation. p. 345–352. [Google Scholar]
- Didier G., Glatt-Holtz N.E., Holbrook A.J., Magee A.F., Suchard M.A. 2024. On the surprising effectiveness of a simple matrix exponential derivative approximation, with application to global SARS-CoV-2. Proc. Natl. Acad. Sci. USA 16;121(3). 10.1073/pnas.2318989121. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Duane S., Kennedy A.D., Pendleton B.J., Roweth D. 1987. Hybrid Monte Carlo. Phys. Lett. B, 195(2):216–222. [Google Scholar]
- Dudas G., Carvalho L.M., Bedford T., Tatem A.J., Baele G., Faria N.R., Park D.J., Ladner J.T., Arias A., Asogun D., Bielejec F., Caddy S.L., Cotten M., D’Ambrozio J., Dellicour S., Di Caro A., Diclaro J.W., Duraffour S., Elmore M.J., Fakoli L.S., Faye O., Gilbert M.L., Gevao S.M., Gire S., Gladden-Young A., Gnirke A., Goba A., Grant D.S., Haagmans B.L., Hiscox J.A., Jah U., Kugelman J.R., Liu D., Lu J., Malboeuf C.M., Mate S., Matthews D.A., Matranga C.B., Meredith L.W., Qu J., Quick J., Pas S.D., Phan M.V.T., Pollakis G., Reusken C.B., Sanchez-Lockhart M., Schaffner S.F., Schieffelin J.S., Sealfon R.S., Simon-Loriere E., Smits S.L., Stoecker K., Thorne L., Tobin E.A.., Vandi M.A., Watson S.J., West K., Shannon W., Wiley M.R., Winnicki S.M., Wohl S., Wölfel R., Yozwiak N.L., Andersen K.G., Blyden S.O., Bolay F., Carroll M.W., Dahn B., Diallo B., Formenty P., Fraser C., Gao G.F., Garry R.F., Goodfellow I., Günther S., Happi C.T., Holmes E.C., Kargbo B., Keïta S., Kellam P., Koopmans M.P.G., Kuhn J.H., Loman N.J., Magassouba N.F., Naidoo D., Nichol S.T., Nyenswah T., Palacios G., Pybus O.G., Sabeti P.C., Sall A., Ströher U., Wurie I., Suchard M.A., Lemey P., Rambaut Andrew. 2017. Virus genomes reveal factors that spread and sustained the Ebola epidemic. Nature 544(7650):309. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fabreti L.G., Höhna S. 2022. Bayesian inference of phylogeny is robust to substitution model over-parameterization. bioRxiv. https://doi.org/10.1101/2022.02.17.480861, preprint: not peer reviewed. [Google Scholar]
- Faulkner J.R., Magee A.F., Shapiro B., Minin V.N. 2020. Horseshoe-based Bayesian nonparametric estimation of effective population size trajectories. Biometrics 76(3):677–690. [DOI] [PubMed] [Google Scholar]
- Fisher A.A., Ji X., Nishimura A., Lemey P., Suchard M.A. 2021. Shrinkage-based random local clocks with scalable inference. Mol Biol Evol 40(11):msad242. 10.1093/molbev/msad242. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fisher A.A., Ji X., Zhang Z., Lemey P., Suchard M.A. 2021. Relaxed random walks at scale. Syst. Biol. 70(2):258–267. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gangavarapu K., Ji X., Baele G., Fourment M., Lemey P., Matsen IV F.A., Suchard M.A. 2024. Many-core algorithms for high-dimensional gradients on phylogenetic trees. Bioinformatics 40(2):btae030. 10.1093/bioinformatics/btae030 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gao J., May M.R., Rannala B., Moore B.R. 2022. New phylogenetic models incorporating interval-specific dispersal dynamics improve inference of disease spread. Mol. Biol. Evol., 39(8):msac159. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gigante C.M., Korber B., Seabolt M.H., Wilkins K., Davidson W., Rao A.K., Zhao H., Smith T.G., Hughes C.M., Minhaj F., Waltenburg M.A., Theiler J., Smole S., Gallagher G.R., Blythe D., Myers R., Schulte J., Stringer J., Lee P., Mendoza R.M., Griffin-Thomas L.A., Crain J., Murray J., Atkinson A., Gonzalez A.H., Nash J., Batra D., Damon I., McQuiston J., Hutson C.L., McCollum A.M., Li Y. 2022. Multiple lineages of monkeypox virus detected in the United States, 2021–2022. Science 378(6619):560–565. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hasegawa M., Kishino H., Yano T. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22(2):160–174. [DOI] [PubMed] [Google Scholar]
- Hastings W.K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1):97–109. [Google Scholar]
- Hilton S.K., Bloom J.D. 2018. Modeling site-specific amino-acid preferences deepens phylogenetic estimates of viral sequence divergence. Virus Evol. 4(2):vey033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jukes T.H., Cantor C.R. 1969. Evolution of protein molecules. In: Munro H.N., editor. Mammalian protein metabolism. New York: Academic Press, p. 21–132. [Google Scholar]
- Kass R.E., Raftery A.E. 1995. Bayes factors. J. Am. Statist. Assoc. 90(430):773–795. [Google Scholar]
- Lan S., Bui-Thanh T., Christie M., Girolami M. 2016. Emulation of higher-order tensors in manifold Monte Carlo methods for Bayesian inverse problems. J. Comput. Phys. 308:81–101. [Google Scholar]
- Lartillot N. 2006. Conjugate Gibbs sampling for Bayesian phylogenetic models. J. Comput. Biol. 13(10):1701–1722. [DOI] [PubMed] [Google Scholar]
- Leimkuhler B., Reich S. 2004. Simulating Hamiltonian dynamics. Number 14 in Cambridge Monographs on Applied and Computational Mathematics. Cambridge, England: Cambridge University Press. [Google Scholar]
- Lemey P., Rambaut A., Drummond A.J., Suchard M.A. 2009. Bayesian phylogeography finds its roots. PLoS Comput. Biol. 5(9):e1000520. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lemey P., Rambaut A., Bedford T., Faria N., Bielejec F., Baele G., Russell C.A., Smith D.J., Pybus O.G., Brockmann D., Suchard M.A. 2014. Unifying viral genetics and human transportation data to predict the global transmission dynamics of human influenza H3N2. PLoS Pathogens 10(2):e1003932. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lemey P., Hong S.L., Hill V., Baele G., Poletto C., Colizza V., O’Toole Á., McCrone J.T., Andersen K.G., Worobey M., Nelson M.I., Rambaut A., Suchard M.A. 2020. Accommodating individual travel history and unsampled diversity in Bayesian phylogeographic inference of SARS-CoV-2. Nat. Commun. 11(1):1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lemey P., Ruktanonchai N., Hong S.L., Colizza V., Poletto C., Van den Broeck F., Gill M.S., Ji X., Levasseur A., Oude Munnink B.B., Koopmans M., Sadilek A., Lai S., Tatem A.J., Baele G., Suchard M.A., Dellicour S. 2021. Untangling introductions and persistence in COVID-19 resurgence in Europe. Nature 595(7869):713–717. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lewis P.O. 2001. A likelihood approach to estimating phylogeny from discrete morphological character data. Syst. Biol. 50(6):913–925. [DOI] [PubMed] [Google Scholar]
- Li L., Holbrook A., Shahbaba B., Baldi P. 2019. Neural network gradient Hamiltonian Monte Carlo. Comput. Statist. 34(1):281–299. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Magee A.F., Höhna S., Vasylyeva T.I., Leaché A.D., Minin V.N. 2020. Locally adaptive Bayesian birth-death model successfully detects slow and rapid rate shifts. PLoS Comput. Biol. 16(10):e1007999. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Matyášek R., Kovařík A. 2020. Mutation patterns of human SARS-CoV-2 and bat RaTG13 coronavirus genomes are strongly biased towards c¿u transitions, indicating rapid evolution in their hosts. Genes 11(7):761. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Metropolis N., Rosenbluth A.W., Rosenbluth M.N., Teller A.H., Teller E. 1953. Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6):1087–1092. [Google Scholar]
- Najfeld I., Havel T.F. 1995. Derivatives of the matrix exponential and their computation. Adv. Appl. Math. 16(3):321–375. [Google Scholar]
- Neal R.M. MCMC using Hamiltonian dynamics. In: Brooks S., Gelman A., Jones G., Meng X.-L. editors. Handbook of Markov chain Monte Carlo. Chapman & Hall/CRC Press. [Google Scholar]
- Nishimura A., Suchard M.A. 2022. Shrinkage with shrunken shoulders: Gibbs sampling shrinkage model posteriors with guaranteed convergence rates. Bayesian Anal. 18(2):367–390. [DOI] [PMC free article] [PubMed] [Google Scholar]
- O’Toole Á., Neher R.A., Ndodo N., Borges V., Gannon B., Gomes J.P., Groves N., King D.J., Maloney D., Lemey P., Lewandowski K., Loman N., Myers R., Suchard M.A., Worobey M., Chand M., Ihekweazu C., Ulaeto D., Adetifa I., Rambaut A. 2023. Putative APOBEC3 deaminase editing in MPXV as evidence for sustained human transmission since at least 2016. BioRXiv, p. 2023–01. https://doi.org/10.1101/2023.01.23.525187, preprint: not peerreviewed. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pagel M., Meade, A. 2006. Bayesian analysis of correlated evolution of discrete characters by reversible-jump Markov chain Monte Carlo. Am. Nat. 167(6):808–825. [DOI] [PubMed] [Google Scholar]
- Pekar J., Worobey M., Moshiri N., Scheffler K., Wertheim J.O. 2021. Timing the SARS-CoV-2 index case in Hubei province. Science 372(6540):412–417. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Petersen K.B., Pedersen M.S. 2008. The matrix cookbook. Technical University of Denmark 7(15):510. [Google Scholar]
- Polson N.G., Scott J.G., Windle J. 2014. The Bayesian bridge. J. R. Statist. Soc. Ser. B Statist. Methodol. 76(4):713–733. [Google Scholar]
- Rasmussen C.E. 2003. Gaussian processes to speed up hybrid Monte Carlo for expensive Bayesian integrals. Seventh Valencia International Meeting, dedicated to Dennis V. Lindley. Oxford, England: Oxford University Press. p. 651–659. [Google Scholar]
- Suchard M.A., Rambaut A. 2009. Many-core algorithms for statistical phylogenetics. Bioinformatics 25(11):1370–1376. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Suchard M.A., Weiss R.E., Sinsheimer J.S. 2001. Bayesian selection of continuous-time Markov chain evolutionary models. Mol. Biol. Evol. 18(6):1001–1013. [DOI] [PubMed] [Google Scholar]
- Suchard M.A., Weiss R.E., Dorman K.S., Sinsheimer J.S. 2003. Inferring spatial phylogenetic variation along nucleotide sequences: a multiple change-point model. J. Am. Statist. Assoc. 98(462):427–437. [Google Scholar]
- Suchard M.A., Lemey P., Baele G., Ayres D.L., Drummond A.J., Rambaut A. 2018. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol. 4(1):vey016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tavaré S. 1986. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 17(2):57–86. [Google Scholar]
- Tonkin-Hill G., Martincorena I., Amato R., Lawson A.R.J., Gerstung M., Johnston I., Jackson D.K., Park N., Lensing S.V., Quail M.A., Gonçalves S., Ariani C., Chapman M.S., Hamilton W.L., Meredith L.W., Hall G., Jahun A.S., Chaudhry Y., Hosmillo M., Pinckert M.L., Georgana I., Yakovleva A., Caller L.G., Caddy S.L., Feltwell T., Khokhar F.A., Houldcroft C.J., Curran M.D., Parmar S.; COVID-19 Genomics UK (COG-UK) Consortium; Alderton A., Nelson R., Harrison E.M., Sillitoe J., Bentley S.D., Barrett J.C., Torok M.E., Goodfellow I.G., Langford C., Kwiatkowski D.; Wellcome Sanger Institute COVID-19 Surveillance Team. 2021. Patterns of within-host genetic diversity in SARS-CoV-2. Elife, 10: e66857. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Vats D. 2023. Hamiltonian Monte Carlo for (physics) dummies. https://dvats.github.io/assets/pdf/HMCforDummies.pdf. [Google Scholar]
- Wagenmakers E.-J., Lodewyckx T., Kuriyal H., Grasman R. 2010. Bayesian hypothesis testing for psychologists: a tutorial on the savage–dickey method. Cognit. Psychol. 60(3):158–189. [DOI] [PubMed] [Google Scholar]
- Wertheim J.O., Steel M., Sanderson M.J. 2022. Accuracy in near-perfect virus phylogenies. Syst. Biol. 71(2):426–438. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Whelan S., Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18(5):691–699. [DOI] [PubMed] [Google Scholar]
- Woodhams M.D., Fernández-Sánchez J., Sumner J.G. 2015. A new hierarchy of phylogenetic models consistent with heterogeneous substitution rates. Syst. Biol. 64(4):638–650. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang Z. 1994a. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39(1):105–111. [DOI] [PubMed] [Google Scholar]
- Yang Z. 1994b. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39(3):306–314. [DOI] [PubMed] [Google Scholar]
- Yang Z., Nielsen R., Goldman N., Pedersen A.M.K. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155(1):431–449. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang C., Shahbaba B., Zhao H. 2017a. Hamiltonian Monte Carlo acceleration using surrogate functions with random bases. Statist. Comput. 27(6):1473–1490. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang C., Shahbaba B., Zhao H. 2017b. Precomputing strategy for Hamiltonian Monte Carlo method based on regularity in parameter space. Comput. Statist. 32(1):253–279. [Google Scholar]
- Zhang Z., Nishimura A., Bastide P., Ji X., Payne R.P., Goulder P., Lemey P., Suchard M.A. 2021. Large-scale inference of correlation among mixed-type biological traits with phylogenetic multivariate probit models. Ann. Appl. Statist. 15(1):230–251. doi: 10.1214/20-AOAS1394. 10.1214/20-AOAS1394. [DOI] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
BEAST XML files for the analyses in this paper are available at, as well as the supplementary text, are available on Dryad, https://doi.org/10.5068/D1709N. The approximate gradients have been implemented in the hmc-clock branch of BEAST (https://github.com/beast-dev/beast-mcmc/tree/hmc-clock/) and the v4.0.0 release of BEAGLE (https://github.com/beagle-dev/beagle-lib/releases/tag/v4.0.0). BEAST XML files for the analyses in this paper are additionally available at https://github.com/suchard-group/approximate_substitution_gradient_supplement.