Skip to main content
Proceedings of the National Academy of Sciences of the United States of America logoLink to Proceedings of the National Academy of Sciences of the United States of America
. 2024 Jan 12;121(3):e2318989121. doi: 10.1073/pnas.2318989121

On the surprising effectiveness of a simple matrix exponential derivative approximation, with application to global SARS-CoV-2

Gustavo Didier a, Nathan E Glatt-Holtz a, Andrew J Holbrook b,1, Andrew F Magee b, Marc A Suchard b,c,d
PMCID: PMC10801879  PMID: 38215186

Significance

Recent work uses a simplistic approximation to the matrix exponential derivative to apply gold-standard models from evolutionary biology to a collection of challenging data analyses. Whereas one may expect the naive approach to break down with increasing model dimensionality, empirical results show no such failure. Here, we 1) develop rigorous error bounds that improve—in a certain sense—as model dimension grows and 2) demonstrate the scalability of the naive approach to a higher-dimensional analysis of the global spread of the virus responsible for the COVID-19 pandemic.

Keywords: continuous-time Markov chains, Hamiltonian Monte Carlo, matrix exponential, molecular epidemiology, random matrix theory

Abstract

The continuous-time Markov chain (CTMC) is the mathematical workhorse of evolutionary biology. Learning CTMC model parameters using modern, gradient-based methods requires the derivative of the matrix exponential evaluated at the CTMC’s infinitesimal generator (rate) matrix. Motivated by the derivative’s extreme computational complexity as a function of state space cardinality, recent work demonstrates the surprising effectiveness of a naive, first-order approximation for a host of problems in computational biology. In response to this empirical success, we obtain rigorous deterministic and probabilistic bounds for the error accrued by the naive approximation and establish a “blessing of dimensionality” result that is universal for a large class of rate matrices with random entries. Finally, we apply the first-order approximation within surrogate-trajectory Hamiltonian Monte Carlo for the analysis of the early spread of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) across 44 geographic regions that comprise a state space of unprecedented dimensionality for unstructured (flexible) CTMC models within evolutionary biology.


Phylogeographic methods (14) model large-scale viral transmission between human populations as a function of the shared evolutionary history of the viral population of interest. Data take the form of dates, locations, and genome sequences associated with individual viral samples. Spatiotemporal structure interfaces with network structure given by the phylogeny, or family tree, describing the viruses’ collective history beginning with the most recent common ancestor. While one cannot directly observe this history, one may statistically reconstruct the phylogenetic tree by positing that changes in the viral genome happen randomly at regular intervals, thereby capturing the intuition that viral samples with more differences between their (aligned) sequences should find themselves further apart on the family tree.

The continuous-time Markov chain (CTMC) (5) represents the gold-standard mathematical model for such evolution of characters (e.g., nucleotides) within a fixed span of evolutionary time. A CTMC defined over a discrete, d-element state space consists of a row vector π0 whose individual components describe the probability of inhabiting each of the possible states at time t=0 as well as a d×d infinitesimal generator (or rate) matrix Q with nonnegative off-diagonal elements qij, ij, and nonpositive diagonal elements qii=jqij. For any lag t0, the matrix exponential (6, 7) provides the Markov chain’s transition probability matrix

Pt:=etQ:=n=0tnQnn!, [1]

which has elements [Pt]ij that dictate the probability of the process jumping from state i to state j after time t. It is straightforward to verify that Pt is a valid transition matrix, having probability vectors for rows: If 1 and 0 are the column vectors of ones and zeros, respectively, then Q1=0 and, therefore, Pt1=1. The law of total probability then provides the marginal probability of the process at any time t0 as πt=π0etQ.

Whether frequentist (8) or Bayesian (912) likelihood-based approaches to phylogenetic reconstruction allow phylogenetic tree branch lengths to parameterize time lags within the CTMC framework. We present the exact statement of the phylogenetic CTMC paradigm below (Section 3). Here, we note that the historical importance of tree-reconstruction from aligned sequences leads to an early emphasis on the sparse specification of Q based on biologically motivated assumptions (1315). Classical Markov chain Monte Carlo (MCMC) procedures (16, 17) work well for such low-dimensional models. But the phylogenetic CTMC framework has applications beyond simple nucleotide substitution models. Within, e.g., Bayesian phylogeography, the work in ref. 1 provides a phylogenetic CTMC model for the spread of avian influenza across d=20 global geographic locations but, for computational reasons, favors a low-dimensional Od parameterization of Q. Similarly, Lemey et al. (2) model the spread of influenza A H1N1 and H3N2 between as many as d=26 geographic regions but—again for computational reasons—fit the model with approximation techniques that provide no inferential guarantees.

Recently, Magee et al. (18) demonstrate the feasibility of approximate gradient-based methods for both maximum a posteriori and full Bayesian inference of flexible and fully-parameterized rate models and apply these methods to a gold-standard Od2 mixed-effect CTMC model for the spread of A H3N2 influenza between d=14 geographic locations. The usual CTMC log-likelihood gradient calculations feature the matrix exponential derivative (Eqs. 13 and 15)

JetQ:=limϵ0et(Q+ϵJ)etQϵ [2]
=etQn=0tn+1(n+1)!l=0n(1)lnlQlJQnl [3]

computed in the direction J of each of the d2 natural basis elements Jij spanning the space of real-valued, d×d matrices M(d)=M(d,R), thereby requiring at least Od5 floating point operations (19).

Within the phylogenetic CTMC models of Section 3, log-likelihood derivative computations that require JetQ balloon to OKNd5, for N the number of biological specimens observed and K the number of parameters parameterizing Q. To address this overwhelming computational cost, Magee et al. (18) leverage the simplistic approximation obtained by setting n=0 within Eq. 3:

~JetQ:=tetQJ. [4]

Ref. 18 show that this approximation helps reduce total cost to OKd2+Nd3 and use this speedup within surrogate-trajectory Hamiltonian Monte Carlo (HMC) (SI Appendix) to obtain a 34-fold improvement in effective sample size per second (ESS/s) over random-walk MCMC within their 14-region phylogeographic example. When trying to explain the remarkable empirical performance of the naive approximation, the authors derive an error upper bound (for an arbitrary matrix norm)

||~JetQJetQ||J||Q||2(e2t2t1) [5]

that fails to leverage the specific forms of Q and J. Notably, this bound explodes as either t or ||Q|| diverges to . Of course, the latter quantity would be expected to grow large with dimension d without more careful structural assumptions, e.g., that Q is a rate matrix belonging to the class

R(d):={QM(d)|Qij0for ji,j=1dQij=0}. [6]

In the following, we use the finer structural properties of Q to obtain more precise bounds on

E(t):=JetQtetQJ. [7]

In Theorem 1, we provide an affine (in t) correction to the approximation Eq. 4 that yields an exponentially tight t asymptotic for the error Eq. 7. Then, in Theorem 3, we establish precise probabilistic bounds in the high-dimensional d limit for a large class of randomly drawn rate matrices QR. Here, we show, for any QR whose off-diagonal elements are determined by independently and identically distributed (iid) draws from a positive, sub-exponential distribution F, that all of the nonzero singular values grow as σj(Q)d along with asymptotically valid almost sure bounds for these rates as d.

In regards to this second result, Theorem 3, note that random rate (or “Laplacian”) matrices have attracted a great deal of attention from the probability research community, especially in regard to their high-dimensional properties; see e.g., refs. 20, 21, 22, 23, 24, 25. In particular, seminal papers such as refs. 26, 27, 28 establish broad characterizations of bulk behavior for the Laplacian eigenspectrum. One contribution of this paper, of independent interest, is a short and self-contained construction of useful bounds for the singular values of QR.

In Theorem 2 and Corollary 1, we show how Theorems 1 and 3 combine to provide a more refined analysis of E for suitable randomly generated QR(d). In Theorem 2, we establish that particular terms appearing in the bound Eq. 24 in Theorem 1 decay with a rate on the order of 1/d in the operator norm topology for large d for certain classes of symmetric generators Q. This class includes the random elements considered in Theorem 3. Here, although Corollary 1 applies only for symmetric matrices composed of sub-exponential draws, we provide strong supplemental numerical evidence that our bounds remain valid well beyond this special symmetric, sub-exponential special case (Remark 5 and Fig. 1).

Fig. 1.

Fig. 1.

Frobenius norm errors obtained by first-order approximation tetQJ and by affine-corrected first-order approximation tetQJt(IQ+Q)JQQ+ (Theorem 2) under increasingly relaxed assumptions. Within each assumption set, we average over 20 independent Monte Carlo simulations of random generator matrices for each dimension. Plot A corresponds to asymmetric generators with off-diagonal elements having independent and iid standard exponential random variables that correspond to the sub-exponential distribution hypothesis. Plot B drops the identical distribution assumption by allowing each row and column of the generator matrix to additively contribute its own mean—itself given by a standard exponential—to its corresponding exponentially distributed entries. Plot C features rate matrices with iid Cauchy entries truncated to be positive. Empirically, the results of Corollary 1 extend beyond the symmetric, iid, and sub-exponential hypotheses, suggesting scope of future work.

One notable practical implication of Theorem 2, Corollary 1 and Remark 5 is the identification of a further correction to Eq. 4. Crucially, this correction has the same Od3 computational cost as Eq. 4 while leading to an asymptotically temporally uniformly accurate approximation of JetQ (Remark 5 and Eq. 46).

Section 2 contains simulation studies comparing accuracy of matrix exponential derivative approximations for different distributional assumptions on the generator matrix; the posterior distributions obtained using surrogate-trajectory HMC and traditional HMC; and parameter identification under different priors on generator matrix elements.

In Section 3, we follow these theoretical and empirical investigations with an application of the naive, first-order gradient approximation Eq. 4 to a challenge in phylogeography requiring Bayesian inference of a rate-matrix of unprecedented dimensionality. Namely, we apply the approximation to a gold-standard mixed-effects, phylogenetic CTMC model that uses 1,897 parameters to describe the spread of SARS-CoV-2 across a d=44 dimensional state space consisting of different global geographic locations. Such an application complements the empirical studies of ref. 18 in a manner that emphasizes the naive approximation’s potential for impact.

1. Rigorous Results

This section lays out our rigorous results Theorems 13 and Corollary 1, the proofs of which appear in SI Appendix.

In what follows, we adopt the following notational conventions. For any AM(d), we list the associated (not necessarily distinct) eigenvalues of A in ascending order according to their real part, namely,

Rλ1(A)Rλd(A). [8]

Similarly, the singular values of A are written in ascending order

σ1(A)σd(A). [9]

We let S(d)=S(d,R) and S(d,C) represent the spaces of d×d symmetric and Hermitian matrices, respectively.

We make use of multiple matrix norms leading to materially different bounds as d (29). Take

AF:=i,j=1dAij2=j=1dσj(A)2 [10]

for the Frobenius norm and

Aop:=λd(AA)=λd(AA)=σd(A) [11]

for the operator norm of A. Finally, note that when we simply write ·; the statement then holds for any valid matrix norm as in our formulation of Theorem 1.

1.1. Deterministic Bounds on Approximation Error in Time.

We begin by deriving a dynamical equation for the error E(t), defined in Eq. 7. Recall that X(t):=etQ for any QM(d) obeys the (matrix-valued) ordinary differential equation

dXdt=QX,X(0)=I. [12]

Setting Yϵ=ϵ1(et(Q+ϵJ)etQ) and taking a limit as ϵ0, we find that Y=JetQ obeys

dYdt=QY+JX=QY+JetQ,Y(0)=0. [13]

Thus, variation of constants yields that, for any t0,

JetQ=etQ0tesQJesQds [14]
=etQk,m=00t(s)kQkJsmQmk!m!ds=etQn=0tn+1(n+1)!l=0n(1)lnlQlJQnl. [15]

Taking the first-order (n=0) approximation in Eq. 15 produces Eq. 4. Note that, from Eq. 14, this approximation ~etQJ is evidently exact in the special case when J and Q commute.

Next notice that, if we differentiate Y~:=tetQJ in t, we find that Y~ obeys

dY~dt=QY~+etQJ,Y~(0)=0. [16]

Thus, taking the error E(t) as in Eq. 7 and combining Eq. 13 with Eq. 16 yields

dEdt=QE+JetQetQJ,E(0)=0. [17]

Hence, again integrating this expression, we find

E(t)=etQ0t(esQJesQJ)ds [18]
=etQn=1tn+1(n+1)!l=0n(1)lnlQlJQnl [19]

as could also be directly deduced from Eqs. 14 to 15.

Given a rate matrix Q in R(d), recall that Rλd1(Q)λd(Q)=0 by the Gershgorin circle theorem (30, Theorem 6.1.1). Imposing a further non-degeneracy assumption (e.g., that Rλd1(Q)<0) we therefore have an exponential decay in QetQ. This starting point suggests that, under fairly general conditions, we may decompose Eq. 18 into a component where QetQ induces an exponential decay in time and a complementary component taking the form of a time-affine correction term.

These observations lead to the following theorem, the proof of which appears in SI Appendix.

Theorem 1.

Suppose that Q,JM(d) for some d1. We assume that we can find an element Q+M(d) such that Q is a generalized inverse of Q+, namely,

Q+QQ+=Q+ [20]

and such that

eτQ(IQ+Q)=IQ+Q,(IQQ+)eτQ=IQQ+, [21]

for any τR. Furthermore, we suppose that Q+ and Q commute

QQ+=Q+Q. [22]

Finally, taking · be any matrix norm, we assume that

QeτQC0eκτ,for all τ0, [23]

where the constants C0>0,κ>0 are independent of τ. Then, under these circumstances,

Q+J(IQQ+)(IQ+Q)JQ++t(IQ+Q)JQQ++JetQtetQJC(1+t)eκt [24]

for any t0. Here, C>0 is a t-independent constant which is given explicitly as

C0((IQ+Q)J(Q+)2+(Q+)2J(IQQ+)+Q+J)+C02Q+JQ+. [25]

Remark 1:

To illuminate the scope of Theorem 1, we have the following three classes of matrices maintaining the conditions Eqs. 2023 as follows.

  • (i)
    Suppose that QM(d) is such that
    Rλd1(Q)<Rλd(Q)0,λd(Q)is simple [26]
    and such that, if λd(Q) has an imaginary component, then its real part is strictly negative. Under these circumstances, writing Q in its Jordan canonical form yields, for some m1,
    Q=Mdiag(J1,,Jm1,λd(Q))M1. [27]
    Here, under Eq. 26 each of these blocks Jj must be invertible and so we may take
    Q+:=Mdiag(J11,,Jm11,λd(Q)+)M1, [28]
    where
    λd(Q)+=0if λd(Q)=0,λd(Q)1otherwise. [29]
  • (ii)
    We next consider the case where QM(d) is diagonalizable and its spectrum lies strictly on the left half plane or at the origin. This time, we can write
    Q=MΛM1where Λ=diag(λ1(Q),,λd(Q)) [30]
    and we set
    Q+=MΛ+M1with Λ+=diag(λ1(Q)+,,λd(Q)+). [31]
    The complex numbers λj(Q)+, j=1,,d are defined as in Eq. 29.
  • (iii)
    Finally, we specialize to the case where QS(d) is symmetric. In this case, Q=UΛU, where U is a unitary matrix and Λ=diag(λ1(Q),,λd(Q)), λj(Q) are its (real) eigenvalues. We suppose that these eigenvalues are all nonpositive,
    λd(Q)0. [32]
    Here, we take Q+ as the Moore-Penrose inverse, namely,
    Q+=UΛ+U, [33]
    where Λ+ is as in Eq. 31.

In anticipation of Theorem 3 and our desired application in Section 3, we are preoccupied with the dimensional dependence of the constants in Eqs. 23, 24, 25 in our formulation of Theorem 1. We next provide some such desirable bounds in case (iii) of Remark 1. Note that analogous results for generators Q in the classes (i) or (ii) would seemingly require a delicate analysis of the associated eigenspaces, i.e., of the structure of M in Eq. 27 or Eq. 30 respectively. However, Remark 4 and Fig. 1 provide numerical evidence of a broader scope for dimensionally improving approximations beyond the symmetric case, at least for certain classes of randomly drawn matrices.

Theorem 2.

Let symmetric QS(d) be nonpositive, i.e., suppose that Eq. 32 holds. Take Q+ as in Eq. 33 and define

d= max{1jd|Rλj(Q)<0}. [34]

Then, for any JM(d),

t(IQ+Q)JQQ++JetQtetQJF(d|λ1(Q)|·[2ddQ+F2+Q+F+d|λ1(Q)|Q+F2](1+t)et|λd(Q)|+2ddQ+F)JF [35]

with Q+F2:=k=1d1λk2(Q), whereas

t(IQ+Q)JQQ++JetQtetQJop(|λ1(Q)|(2+|λd(Q)|+|λ1(Q)|)λd2(Q)(1+t)et|λd(Q)|+2|λd(Q)|)Jop. [36]

Under the further assumption that λd(Q)=0 and

μ1d|λd1(Q)||λ1(Q)|μ2d [37]

for some 0<μ1μ2, we have

t(IQ+Q)JQQ++JetQtetQJF(μ2μ12·[2d+μ1d+μ2d2](1+t)tμ1d+2μ1d)JF [38]

and that

t(IQ+Q)JQQ++JetQtetQJop(μ2μ12·[2d+μ1+μ2](1+t)etμ1d+2μ1d)Jop. [39]

1.2. High-Dimensional Asymptotics via Random Matrix Theory.

We turn to our probabilistic bounds on the singular values of randomly generated rate matrices, Theorem 3. Although interesting in its own right, this result leads to consequences for the bounds in Eqs. 23 and 24 when applied in Theorem 2. Before proceeding, we briefly introduce further mathematical preliminaries associated with the so-called sub-exponential random variables. To avoid confusion, note that the following definition uses the term in the same way as, e.g., ref. 31, but that other definitions that mean quite the opposite (i.e., heavier than exponential tails) appear in the literature (32).

Definition 1:

A random variable X is called sub-exponential if there exists some constant K>0 for which its tails satisfy

P(|X|t)2et/K,t0. [40]

In this case, the sub-exponential norm of X is defined by

Xψ1= inf{s>0:Ee|X|/s2}. [41]

The class of sub-exponential distributions is denoted by

Lψ1={FX(dx):Xψ1<}.

Remark 2:

In fact, by Vershynin (31, Proposition 2.7.1, p. 33), condition Eq. 40 is equivalent to the existence of some s0>0 such that Ee|X|/s02, namely, Xψ1< in Eq. 41. As a notable example, if Xexp(λ), λ>0, then it is easy to see that XLψ1, indeed.

We formulate our second major result as follows.

Theorem 3.

Let FXLψ1 be a distribution such that X0 a.s., EX=μ>0 and VarX=σ2>0. We consider a sequence of random matrices QQ(d)={qij}i,j=1,,dR(d), dN, where either

{qij}i,j=1,,d,ijiidFXandqii=j{1,,d}\{i}qij,i=1,,d [42]

or we impose that QR(d)S(d) as

{qij}i,j=1,,d, i>jiidFX,qij:=qjifor i<jandqii=j{1,,d}\{i}qij,i=1,,d. [43]

Then, in either of these cases, for any dN\{1}, we have

μ+Oa.s.(logdd)σ2(Q)dσd(Q)dμ+Oa.s.(logdd) [44]

almost surely.

Remark 3:

The bounds constructed in Theorem 3 are strongly reminiscent of the bounds for eigenvalues provided in Theorem 1.5 of the seminal paper (28) (see also, for instance, Corollary 1.6 in ref. 26 and Corollary 1.1 in ref. 27).

Finally, let us observe that Theorems 2 and 3 as well as the fact that

σj(Q)=|λdj+1(Q)|,for e.g., any QS(d)R(d) [45]

combine to produce the following immediate corollary.

Corollary 1.

Consider any sequence of random matrices QQ(d)={qij}i,j=1,,dR(d), dN, as in Theorem 3 under the second (symmetric) case Eq. 43. Then, taking μ1=μ1(d) and μ2=μ2(d) as the resulting lower and upper bounds defined by Eq. 44, we have that Q satisfies both Eqs. 38 and 39, cf. Eq. 45 relative to this sequence of μ1,μ2 for any JM(d).

Remark 4:

Our rigorous formulation of Corollary 1 is limited to symmetric random rate matrices whose above diagonal elements are independent and iid draws from a sub-exponential distribution. However, strong numerical evidence suggests that the scope of the approximations Eqs. 38 and 39 reach far beyond the limitations of Corollary 1 in several different ways. Fig. 1 explores the consequences of relaxing various assumptions of Corollary 1. The third plot involves folded Cauchy random variables, the heavy tails of which violate the sub-exponential assumption (Definition 1). There appears to be no significant departure from the idea that the form t(IQ+Q)JQQ+ corresponds increasingly well to the true approximation error [Eq. 7] as the dimension increases.

Remark 5:

Calculation of ~JetQ:=tetQJ requires Od3 operations by, e.g., computing the spectral decomposition Q as in Eq. 30. One may then recycle this decomposition to determine the additional term t(IQ+Q)JQQ+ for little extra cost. In view of Corollary 1 and Fig. 1,

^JetQ:=tetQJt(IQ+Q)JQQ+ [46]

provides an accurate approximation of JetQ for asymptotically for large d. Thus, we anticipate further computational improvements when fitting large, gold-standard models using this refined approximation. That said, we leave the efficient and scalable application of ^JetQ to future work.

2. Empirical Studies

Before applying the naive matrix exponential derivative approximation to the phylogeographic analysis of SARS-CoV-2, we carry out a few targeted studies that illustrate the empirical performance of the approximate derivative and its affine correction Eq. 46 (Fig. 1); agreement between CTMC generator matrix empirical posterior distributions generated by surrogate-trajectory HMC algorithms using approximate derivatives and the truth (Fig. 2); and point estimation of generator matrix element values under different sparsity regimes and different CTMC state space dimensionalities d (Fig. 3).

Fig. 2.

Fig. 2.

Posterior density plots for surrogate-trajectory HMC using the naive approximate derivative, the corrected approximation Eq. 46 and the exact matrix exponential derivative for elements of a 5×5 generator matrix. The near-perfect overlap reflects the fact that each algorithm’s transition kernel leaves the posterior distribution invariant (33). To generate data, we randomly draw standard normal generator entries once and simulate 20 independent initial/final position pairs from a CTMC with time interval t=1. We show the true value in red and negate diagonal elements to simplify presentation.

Fig. 3.

Fig. 3.

Posterior means versus truth for CTMC generator matrix elements within differing sparsity regimes and with different dimensionalities d, holding observation count fixed at 300. We generate posteriors using surrogate-trajectory HMC with the naive matrix exponential derivative. To affect sparsity levels, we generate generator matrix entries according to the Bayesian bridge distribution (34) with different exponents (α{1,1/2,1/4} for plots A, B and C, respectively), normalizing by the largest absolute values to ease comparison. Smaller α values encode more peaked distributions with heavier tails and thus enforce greater sparsity. Plot C reflects the fact that the Bayesian bridge prior with exponent α=1/4 helps identify non-null parameters in small sample contexts (18). With this intuition in mind, we specify such a prior on generator random effects in Section 3.

Here, we fill in remaining simulation details not included in figure captions. In the simulations contributing to Fig. 1, we randomly generate new, independent direction matrices J at each time step within each of the 20 independent runs. The results are not sensitive to J in general. We also note that under Definition 1 the Cauchy distribution is not sub-exponential and therefore represents a deviation from core assumptions of Section 1. Using varying distributions on the elements of generator matrices Q, the simulations contributing to Figs. 2 and 3 both randomly generate N independent initial states y10,,yN0 according to uniform distributions over their respective CTMC state spaces. For each of these initial states y0, we then simulate from the CTMC for time interval t=1 to obtain samples y11,,yN1. The likelihood then takes the form n(yn0)TeQyn1. For the simulation contributing to Fig. 2, we specify standard normal priors and sample from the posterior by generating 100,000 iterations from each algorithm. For the simulation contributing to Fig. 3, we generate 500,000 MCMC samples using surrogate-trajectory HMC with the naive matrix exponential derivative and calculate the posterior mean of each generator matrix element using the final 100,000 samples.

3. Application: Global Spread of SARS-CoV-2

Ref. 18 consider phylogenetic models that involve CTMC priors and show that the first-order approximation Eq. 4 of the matrix exponential derivative Eq. 2 performs remarkably well within surrogate HMC (SI Appendix), achieving an over 30-fold efficiency gain compared to random-walk Metropolis for a 14-state model with over 180 model parameters. Here, we demonstrate similar strong performance of the first-order approximation within surrogate HMC for a 44-state model with almost 1,900 model parameters.

3.1. Phylogenetic CTMC.

Start with a (possibly unknown) rooted and bifurcating phylogenetic tree T consisting of N leaf nodes that correspond to observed biological specimens and N1 internal nodes that correspond to unobserved ancestors. The tree also contains 2N2 branches of length tv connecting each child node v to its parent u.

Given T, we model the evolution of d characters along each branch of the tree according to a CTMC model with d×d generator matrix Q and stationary distribution π=(π1,,πd)=limtπ~etQ, for π~ any arbitrary probability vector. Examples of characters are the d=4 nucleotides within a set of aligned genome sequences (13) and the set of d=15 geographic regions visited by an influenza subtype (2). We may scale tv to be raw time (e.g., years) or the expected number of substitutions with respect to π depending on the given problem. In the former case, one may augment the model with a rate scalar γ that modulates the expected number of substitutions across all branches, and the finite-time transition probability matrix along branch v becomes Pv:=eγtvQ. In the following, we further posit Q=Q(θ) for θ a vector of parameters.

Let data Y be the d×N matrix with columns yn, n{1,,N}, each having a single nonzero entry (set to 1) corresponding to the observed state of the biological specimen. One may use any node v to express the likelihood

p(Y|θ)=pvTqv, [47]

where pv and qv are the post-order and pre-order partial likelihood vectors, respectively (35). The former describes the probability of the observed states for all observed specimens (i.e., leaf nodes) that descend from node v, conditioned on the state at node v. The latter describes analogous probabilities for all observed specimens not descending from node v. For leaf nodes, pn:=yn, n{1,,N}, and one may specify the root node’s pre-order partial likelihood to be any arbitrary probability vector a priori. Let “ᴏ” denote the Hadamard or elementwise product between matrices or vectors of equal dimensions. If we suppose that node u gives rise to two child nodes, v and w, then

pu=Pvpv  Pwpw,qv=PvTquPwpw. [48]

Using the chain rule and the fact that pv does not depend on Pv, one may write the likelihood’s derivative with respect to a single parameter θk thus:

θkp(Y|θ)v=12N2tr(pvTqv)PvPvTθk=v=12N2tr(qu°Pwpw)pvT(eγtvQθk)T=v=12N2pvT(i,j=1deγtvQqijqijθk)T(qu°Pwpw), [49]

where we suppress the dependence of u and w on v.

Whereas the recursions of Eq. 48 facilitate fast likelihood computation, inferring θ using gradient-based approaches such as HMC requires a large number of repeated evaluations of the matrix exponential derivative. These computations become particularly onerous when one opts for a gold-standard mixed-effects model (18) and specifies

logqij=bij+ϵij,ij. [50]

Here, the fixed effects bij are elements of some non-random matrix B, and the random effects ϵij are mutually independent a priori and inferred as model parameters. The dimension K of θ in this model is O(d2), so the O(KNd5) cost of the log-likelihood derivative Eq. 49 becomes a massive O(Nd7). In this context, Magee et al. (18) shows that an approximate log-likelihood derivative based on the first-order approximation to the matrix exponential helps achieve a considerable speedup over the exact derivative, requiring only O(d4+Nd3) floating-point operations yet facilitating high-quality proposals in the context of surrogate HMC.

In the following section, we use this method to analyze the global spread of SARS-CoV-2 and show that the first-order approximation maintains its performance for an even higher-dimensional problem than previously considered.

3.2. Bayesian Analysis of SARS-CoV-2 Contagion.

We use the phylogenetic CTMC framework to model the early spread of SARS-CoV-2—the virus responsible for the ongoing COVID-19 pandemic—based on N=285 observed viral samples collected from 31 regions worldwide between 24 December 2019 and 19 March 2020. These regions comprise 13 provinces within China and 18 countries without. Understanding the manner in which viruses travel between human populations is an object of ongoing study, and phylogeographic analyses point to the central role of travel networks including those measured by airline passenger counts (3) or Google mobility data (36). Here, we include three such predictors of travel in our CTMC model by expanding the fixed effects in regression model Eq. 50 to take the form

B=θ1X1+θ2X2+θ3X3 [51]

for X1, X2, and X3 fixed 44 × 44 matrix predictors and θ1, θ2, and θ3 real-valued regression coefficients. Note we have expanded the number of regions between which viruses may travel to d=44 by including two additional Chinese provinces and 11 additional countries. Fig. 4 presents the three predictors of interest: X1 contains air travel proximities between locations as measured by annualized air passenger counts between airports contained within a region (3); X2 contains intracontinental proximities arising from physical distances when two regions inhabit the same continent and fixed at the minimum otherwise; finally, X3 describes the Hubei asymmetry, i.e., the nonexistence of human travel out of the Hubei province in early 2020.

Fig. 4.

Fig. 4.

Matrix predictors described in Eq. 51 combine in a linear manner to form the fixed-effect matrix B featured within the mixed-effects regression model Eq. 50. Air traffic proximities (A) are proportional to the number of air passengers exchanged between airports within respective regions (3). Intracontinental proximities (B) take values between 1 for regions on different continents to 1 for adjacent regions. The Hubei asymmetry (C) roughly characterizes the Hubei quarantine of early 2020.

In the context of a Bayesian analysis, we specify independent normal priors on θ1, θ2, and θ3 with means of 0 and variances of 2. We also assume that the 1,892 random effects ϵij follow sparsity-inducing Bayesian bridge priors with global scale parameter τ and exponent α=0.25. Here, we follow ref. 37 and specify a Gamma prior on τα with a shape parameter of 1 and a rate parameter of 2. Finally, we place a flat prior on the rate scalar γ. Inferring the posterior distribution of all 1,897 model parameters requires an advanced MCMC strategy. Namely, we adopt an HMC-within-Gibbs approach, updating the scalars γ and τ independently but updating all 1,895 regression parameters (both fixed and random effects) using surrogate HMC accomplished with the first-order approximation

γtveγtvQJijeγtvQqij

within Eq. 49.

We generate 8 million MCMC samples in this manner, saving 1 in 10,000 Markov chain states, in order to guarantee a minimum ESS greater than 100. By far, the worst-mixing parameter is the global scale τ, which obtains an ESS of 185. The three fixed-effects regression coefficients θ1, θ2, and θ3 obtain ESS’s of 721, 721, and 644, respectively. The median and minimum ESS for the 1,892 random effects ϵss are 721 and 190, respectively. We note that, after thinning and removing burn-in, the sample only consists of 721 states, so ESS of 721 implies negligible autocorrelation between samples.

The Left plot of Fig. 5 presents posterior densities for the fixed-effect coefficients from regression Eq. 50. The air traffic proximity, intracontinental proximity and Hubei asymmetry coefficients have posterior means and 95% credible intervals of 0.76 (0.50, 1.03), 0.04 (0.04, 0.12), and 0.27 (0.39, 0.78), respectively. While these results suggest a positive association between each predictor matrix and the infinitesimal generator matrix Q, the only predictor with a statistically significant association is air traffic proximity. This result agrees with a previous phylogeographic analysis of the global spread of influenza (3). The Right plot of Fig. 5 presents the posterior mean for generator Q. As one may expect, the matrix looks similar to that of the air traffic predictor matrix in Fig. 4, but one may also see the influence of the Hubei asymmetry in, e.g., the squares corresponding to travel between Guangdong and Hubei provinces. Finally, we randomly generate regions of unobserved ancestors from their posterior predictive distributions every 100-thousandth MCMC iteration. After collapsing regions into 8 major blocks, Fig. 6 projects the empirical posterior predictive mode of these blocks onto the phylogenetic tree T. The general pattern looks similar to that of figure 1 from ref. 38, although the geographic blocking scheme differs slightly.

Fig. 5.

Fig. 5.

Posterior inference. Posterior densities (A) for the three fixed-effect regression coefficients corresponding to the predictor matrices of Fig. 4 reflect largely positive associations between predictors and infinitesimal rate matrix, although air traffic proximity has the only statistically significant coefficient with posterior mean of 0.76 and 95% credible interval of (0.50, 1.03). The posterior mean (B) infinitesimal rate matrix closely resembles the air traffic predictor, while reflecting the Hubei asymmetry to a lesser extent.

Fig. 6.

Fig. 6.

Posterior predictive modes for (unobserved) ancestral locations color a phylogenetic tree that describes the shared evolutionary history of 285 SARS-CoV-2 samples.

In addition to these scientific questions of interest, we are interested in the performance of the first-order approximation as a surrogate gradient for HMC in such a high-dimensional setting. Whereas we know that the surrogate-trajectory HMC transition kernel leaves its target distribution invariant regardless of the approximation quality (33), transitions that rely on poor gradient approximations result in small acceptance rates, more random walk behavior and high autocorrelation between samples. Since ESS is inversely proportional to a Markov chain’s asymptotic autocorrelation, larger ESS suggests a useful gradient approximation. To isolate the approximation’s performance, we fix the Bayesian bridge global-scale τ at 2.5×105. We generate a Markov chain with 80,000 states, saving every tenth state and removing the first 1,000 states as burn-in. Despite the relatively small number of iterations, we observe large ESS that suggest satisfactory accuracy of the first-order approximation within the context of high-dimensional surrogate HMC. The ESS for the three fixed-effect regression coefficients θ1, θ2, and θ3 are 1,053, 1,343, and 498, respectively. The median and minimum ESS for the 1,892 random effects ϵij are 1,514 and 1,161, respectively.

4. Discussion

We develop tight probabilistic bounds on the error associated with a simplistic approximation to the matrix exponential derivative for a large class of CTMC infinitesimal generator matrices with random entries. Our “blessing of dimensionality” result shows that this error improves for higher-dimensional matrices. We apply the numerically naive approach to the analysis of the global spread of SARS-CoV-2 using a mixed-effects model of unprecedented dimensions. The results obtained herein suggest the further study of CTMCs through the lens of random matrix theory. Furthermore, this analysis suggests a refinement of the first-order approximation to the matrix exponential derivative that may be particularly useful within modern, high-dimensional settings.

Supplementary Material

Appendix 01 (PDF)

Acknowledgments

G.D. was partially supported by the Simons Foundation collaboration grant #714014. N.E.G.-H. received support for this work under NSF DMS 2108790. A.J.H. is supported by a gift from the Karen Toffler Charitable Trust and by grants NIH K25 AI153816, NSF DMS 2152774, and NSF DMS 2236854. A.F.M. and M.A.S. are partially supported through grants NIH R01AI153044 and R01AI162611. The views expressed in this article are those of the authors and do not necessarily reflect the views or policies of the Centers for Disease Control.

Author contributions

A.J.H. and M.A.S. designed research; G.D., N.E.G.-H., and A.J.H. calculated mathematical results; A.J.H. and A.F.M. performed simulations and data analysis; and G.D., N.E.G.-H., and A.J.H. wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

This article is a PNAS Direct Submission.

Data, Materials, and Software Availability

Public data have been deposited in Github (39).

Supporting Information

References

  • 1.Lemey P., Rambaut A., Drummond A. J., Suchard M. A., Bayesian phylogeography finds its roots. PLoS Comput. Biol. 5, e1000520 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Lemey P., et al. , Unifying viral genetics and human transportation data to predict the global transmission dynamics of human influenza H3N2. PLoS Pathog. 10, e1003932 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Holbrook A. J., et al. , Massive parallelization boosts big Bayesian multidimensional scaling. J. Comput. Graph. Stat. 30, 11–24 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Holbrook A. J., Ji X., Suchard M. A., From viral evolution to spatial contagion: A biologically modulated Hawkes model. Bioinformatics 38, 1846–1856 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Norris J. R., Continuous-Time Markov Chains I. Cambridge Series in Statistical and Probabilistic Mathematics (Cambridge University Press, 1997), pp. 60–107. [Google Scholar]
  • 6.Moler C., Van Loan C., Nineteen dubious ways to compute the exponential of a matrix. SIAM Rev. 20, 801–836 (1978). [Google Scholar]
  • 7.Moler C., Van Loan C., Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Rev. 45, 3–49 (2003). [Google Scholar]
  • 8.Felsenstein J., Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mole. Evol. 17, 368–376 (1981). [DOI] [PubMed] [Google Scholar]
  • 9.Sinsheimer J. S., Lake J. A., Little R. J., Bayesian hypothesis testing of four-taxon topologies using molecular sequence data. Biometrics 52, 193–210 (1996). [PubMed] [Google Scholar]
  • 10.Yang Z., Rannala B., Bayesian phylogenetic inference using DNA sequences: A Markov chain Monte Carlo method. Mol. Biol. Evol. 14, 717–724 (1997). [DOI] [PubMed] [Google Scholar]
  • 11.Mau B., Newton M. A., Larget B., Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics 55, 1–12 (1999). [DOI] [PubMed] [Google Scholar]
  • 12.Suchard M. A., Weiss R. E., Sinsheimer J. S., Bayesian selection of continuous-time Markov chain evolutionary models. Mol. Biol. Evol. 18, 1001–1013 (2001). [DOI] [PubMed] [Google Scholar]
  • 13.Jukes T. H., et al. , Evolution of protein molecules. Mamm. Protein Metabol. 3, 21–132 (1969). [Google Scholar]
  • 14.Kimura M., A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16, 111–120 (1980). [DOI] [PubMed] [Google Scholar]
  • 15.Hasegawa M., Kishino H., Yano T., Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22, 160–174 (1985). [DOI] [PubMed] [Google Scholar]
  • 16.Metropolis N., Rosenbluth A. W., Rosenbluth M. N., Teller A. H., Teller E., Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092 (1953). [Google Scholar]
  • 17.Hastings W. K., Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970). [Google Scholar]
  • 18.A. F. Magee et al., Random-effects substitution models for phylogenetics via scalable gradient approximations. arXiv [Preprint] (2023). http://arxiv.org/abs/2303.13642. Accessed 25 September 2023. [DOI] [PMC free article] [PubMed]
  • 19.I. Najfeld, T. F. Havel, Derivatives of the matrix exponential and their computation. Adv. Appl. Math. 16, 321–375 (1995).
  • 20.Y. Takahashi, “Markov chains with random transition matrices” in Kodai Mathematical Seminar Reports (Department of Mathematics, Tokyo Institute of Technology, Tokyo, Japan, 1969), vol. 21, pp. 426–447.
  • 21.Bai Z. D., Methodologies in spectral analysis of large dimensional random matrices, a review. Stat. Sinica 9, 611–677 (1999). [Google Scholar]
  • 22.Chafaï D., The Dirichlet Markov ensemble. J. Multiv. Anal. 101, 555–567 (2010). [Google Scholar]
  • 23.Bordenave C., Caputo P., Chafaï D., Circular law theorem for random Markov matrices. Probab. Theory Related Fields 152, 751–779 (2012). [Google Scholar]
  • 24.Chatterjee A., Hazra R. S., Spectral properties for the Laplacian of a generalized Wigner matrix. Random Matrices: Theory Appl. 11, 2250026 (2022). [Google Scholar]
  • 25.Nakerst G., Denisov S., Haque M., Random sparse generators of Markovian evolution and their spectral properties. Phys. Rev. E 108, 014102 (2023). [DOI] [PubMed] [Google Scholar]
  • 26.Bryc W., Dembo A., Jiang T., Spectral measure of large random Hankel, Markov and Toeplitz matrices. Ann. Probab. 34, 1–38 (2006). [Google Scholar]
  • 27.Ding X., Jiang T., Spectral distributions of adjacency and Laplacian matrices of random graphs. Ann. Appl. Probab. 20, 2086–2117 (2010). [Google Scholar]
  • 28.Bordenave C., Caputo P., Chafaï D., Spectrum of Markov generators on sparse random graphs. Commun. Pure Appl. Math. 67, 621–669 (2014). [Google Scholar]
  • 29.Trefethen L. N., Bau D., Numerical Linear Algebra (Siam, 2022), vol. 181. [Google Scholar]
  • 30.Horn R. A., Johnson C. R., Matrix Analysis (Cambridge University Press, 2012). [Google Scholar]
  • 31.Vershynin R., High-Dimensional Probability: An Introduction with Applications in Data Science (Cambridge University Press, 2018), vol. 47. [Google Scholar]
  • 32.C. M. Goldie, C. Klüppelberg, “Subexponential distributions” in A Practical Guide to Heavy Tails: Statistical Techniques and Applications, R. Adler, R. Feldman, M. Taqqu, Eds. (Birkhäuser, Boston, MA, 1998), pp. 435–459.
  • 33.Glatt-Holtz N. E., Krometis J. A., Mondaini C. F., On the accept-reject mechanism for Metropolis-Hastings algorithms. Ann. Appl. Probab. 33, 5279–5333 (2023). [Google Scholar]
  • 34.Polson N. G., Scott J. G., Windle J., The Bayesian bridge. J. R. Stat. Soc. Ser. B: Stat. Methodol. 76, 713–733 (2014). [Google Scholar]
  • 35.Ji X., et al. , Gradients do grow on trees: A linear-time O(N)-dimensional gradient for statistical phylogenetics. Mol. Biol. Evol. 37, 3047–3060 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Worobey M., et al. , The emergence of SARS-CoV-2 in Europe and North America. Science 370, 564–570 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Nishimura A., Suchard M. A., Shrinkage with shrunken shoulders: Gibbs sampling shrinkage model posteriors with guaranteed convergence rates. Bayes. Anal. 1, 1–24 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Lemey P., et al. , Accommodating individual travel history and unsampled diversity in Bayesian phylogeographic inference of SARS-CoV-2. Nat. Commun. 11, 5110 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.A. J. Holbrook, ExpmDerivative: Source materials for “On the surprising effectiveness of a simple matrix exponential derivative approximation, with application to global SARS-CoV-2”. GitHub. https://github.com/andrewjholbrook/expmDerivative. Deposited 1 March 2023. [DOI] [PMC free article] [PubMed]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Appendix 01 (PDF)

Data Availability Statement

Public data have been deposited in Github (39).


Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences

RESOURCES