Abstract
Exact methods for the exponentiation of matrices of dimension N can be computationally expensive in terms of execution time (N³) and memory requirements (N²), not to mention numerical precision issues. A matrix often exponentiated in the natural sciences is the rate matrix. Here, we explore five methods to exponentiate rate matrices, some of which apply more broadly to other matrix types. Three of the methods leverage a mathematical analogy between computing matrix elements of a matrix exponential process and computing transition probabilities of a dynamical process (technically a Markov jump process, MJP, typically simulated using the Gillespie algorithm). In doing so, we identify a novel MJP-based method relying on restricting the number of “trajectory” jumps that achieves improved computational scaling. We then discuss this method’s downstream implications on mixing properties of Monte Carlo posterior samplers. We also benchmark two other methods of matrix exponentiation valid for any matrix (beyond rate matrices and, more generally, positive definite matrices) related to solving differential equations: Runge–Kutta integrators and Krylov subspace methods. Under conditions where both the largest matrix element and the number of non-vanishing elements scale linearly with N—reasonable conditions for rate matrices often exponentiated—computational time scaling with the most competitive methods (Krylov and one of the MJP-based methods) reduces to N² with total memory requirements of N.
I. INTRODUCTION
Matrix exponentiation is a common operation across scientific computing. As a concrete example of how matrix exponentials arise, we consider a system evolving through a discrete state space whose probability of occupying distinct states follows a linear differential equation of the form1–4
\frac{d}{dt}\vec{\rho}(t) = \vec{\rho}(t)\, A_\theta \qquad (1)
where ρ(t) is a (row) vector of probabilities for the system’s states and Aθ is the generator matrix, also called a transition rate matrix (whose rows each sum to zero). Usually, the elements of Aθ are functions of a small number of kinetic parameters—e.g., the propensity of each chemical reaction modeled. We denote this concise set of unique parameters as θ. To illustrate this with practical applications, Sec. III A provides specific examples.
Systems described by (1) capture diverse phenomena, such as allosteric enzyme control,5 chromatin reorganization,6 metabolic interactions,7,8 transcriptional regulation,9–12 and tumor growth.13
Modeling systems described by (1) typically requires integrating (1), from time 0 to time t, yielding
\vec{\rho}(t) = \vec{\rho}(0)\, e^{A_\theta t} \qquad (2)
where ρ(0) is the vector of state probabilities at time 0.
As (1) is often invoked to describe complicated physical systems, Aθ’s dimensionality (physically interpreted as the number of states the system can occupy), N, quickly grows, and direct matrix exponentiation becomes unwieldy for two reasons: (i) even as a straightforward operation in linear algebra, the dominant exact matrix exponential method’s cost scales as N³. Although approximations (e.g., the Padé approximation and the “scaling and squaring” algorithm14) reduce absolute computational times, they rely on matrix diagonalization or inversion, also scaling as N³, as outlined in Ref. 14; and (ii) for dynamics assumed irreducible, i.e., when it is possible to go from any state to any other state in finite time with non-zero probability, the exponential is fully dense and the memory required to store the matrix exponential scales with N². To give some idea of the concrete limitations posed by (i) and (ii), when N is only 44 722, the theoretical minimum memory allocation for a dense square matrix of double-precision floats exceeds 16 GB (a common memory allocation for multi-core computing systems as of this writing).
There are redeeming features of scientific computing that help alleviate the worst-case scenario N2 memory scaling of storing the generator matrix Aθ.
In fact, in modeling inspired by chemical systems, whose accessible state spaces vastly exceed the number of reactions allowed, we often encounter sparse generator matrices. Indeed, properly exploiting sparse matrices provides an immediate cost benefit since storing them requires memory scaling with the number of non-zero elements, i.e., linear scaling or better with N. Furthermore, since the matrix exponential is fully dense for irreducible dynamics, avoiding dense matrix exponentiation by using only sparse matrix–vector products yields an additional benefit to computational time scaling, reducing the total computational cost to scale with the number of non-zero elements, which, even beyond models of chemical systems, often scales linearly or better with N.
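To make this cost argument concrete, the following sketch (a toy example of our own in NumPy; a production implementation would instead use a compiled sparse matrix library, as in the repository of Ref. 36) compares a coordinate-format sparse matrix–vector product, whose work scales with the number of stored entries, against the dense product, which scales as N²:

```python
import numpy as np

rng = np.random.default_rng(0)
N, nnz = 1000, 3 * 1000  # number of non-zeros scaling linearly with N

# sparse matrix in coordinate (COO) form: O(nnz) memory
rows = rng.integers(0, N, nnz)
cols = rng.integers(0, N, nnz)
vals = rng.random(nnz)

rho = rng.random(N)  # row vector (unnormalized, for illustration only)

# sparse row-vector times matrix product: O(nnz) work
out = np.zeros(N)
np.add.at(out, cols, rho[rows] * vals)

# dense equivalent for comparison: O(N^2) memory and time
dense = np.zeros((N, N))
np.add.at(dense, (rows, cols), vals)  # accumulate duplicate coordinates
```

The sparse product touches each stored entry exactly once, which is the origin of the O([Aθ]) per-iteration costs quoted throughout this paper.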
The benefit of using algorithms tailored to sparse matrices is particularly important since the most common statistical inference methods—iterative methods such as Markov chain Monte Carlo (MCMC)—require repeated calculation of likelihoods obtained by taking the product of many probabilities of the form of (2) at each Monte Carlo iteration,7,9,10,15–22 augmenting the already challenging task of exponentiating a fixed matrix for systems of large N.18,23–26 Taken together, these considerations raise the question: how can we avoid the time (i) and memory (ii) problems involved in matrix exponentiation under a sparse matrix assumption?
To determine the best available solution to problems (i) and (ii), we identify five alternative methods, organized in Fig. 1(c), elaborated in detail in Sec. II B. In particular, the first three methods rely on the mathematics originally formulated to simulate Markov jump processes (MJPs)27 and subsequently adapted to dynamical inference.28–30 We refer to the three methods leveraging the MJP structure underlying (1) as MJP-based methods.
FIG. 1.
Diagram depicting MJP trajectories, probability distributions, and the methods included in our benchmark. In (a), we present sample trajectories of an MJP model of a birth–death process, detailed in Sec. III A 1. In (b), three plots represent the probability distributions over states at times indicated by the dashed lines in (a). In (c), methods 1–5 are categorized. Methods with the best scaling are highlighted in green. For more details, refer to Sec. II B.
The first MJP-based method (method 1), which we refer to as the trajectory MJP-based method (T-MJP), was developed by Rao and Teh in Ref. 28. In T-MJP, the system’s trajectory is sampled through the forward-filtering backward-sampling algorithm.3,15,28,29,31 This method learns the parameters θ in a manner that is statistically equivalent to exactly solving (1). As we will show, T-MJP’s trajectory sampling (as a substitute for matrix exponentiation) has time scaling N² (an improvement over N³ for the naïve matrix exponential) but unfortunately retains the N² memory scaling described in (ii). Even worse, since T-MJP avoids the matrix exponential by introducing a large number of latent random variables (i.e., the trajectories that include the jump times and states visited after each jump), it severely increases the dimensionality of any statistical inference task. As a result, many MCMC iterations are needed to sample both trajectories and parameters. That is, the MCMC chains suffer from “poor mixing,” giving rise to slow convergence. These computational issues may be avoided if trajectory samples are not needed.
Method 2 avoids the matrix exponential by sampling only the number of discrete state jumps (details in Sec. II A 1), both reducing the memory cost scaling to N and improving mixing. Since it samples only jumps, we refer to this method as the jump MJP-based method (J-MJP). As we will see later, although J-MJP exhibits marginally better mixing than T-MJP, it still introduces a large number of latent variables (the number of jumps per trajectory).
Method 3, referred to as remarginalized MJP-based method (R-MJP), improves mixing by entirely marginalizing over trajectories, meaning integrating over all trajectories ending in the same observation. Although marginalizing over trajectories demands an approximation not required by methods 1 and 2, R-MJP provides an upper bound in the total error in the final result, with no additional computation.30 With the same scaling as both previous methods, we will demonstrate R-MJP’s considerably better mixing than methods 1 and 2.
The remaining two methods treat the solution of (1) without leveraging trajectories or, indeed, any facet of MJP structure. We, therefore, refer to them as direct probability calculation methods.
Method 4 simply applies the Runge–Kutta method for solving differential equations32 to avoid matrix exponentiation. Method 4 proceeds by slicing the time evolution into small intervals of size Δt and evolving (1) within each interval; ρ(T) thus arises from the sequence ρ(Δt), ρ(2Δt), …, ρ(T). Naturally, the choice of Δt is arbitrary, making it significantly harder to keep track of the upper bound in the approximation error. We focus, in particular, on the fourth-order Runge–Kutta (RK4) method, an implementation typically chosen as a balance between precision and computational time.32 As usual with RK4, time scaling is N² and memory requirements scale as N.
The final method (method 5) is the Krylov subspace approximation of (1),25,33,34 discretizing time in the same manner as RK4. It approximates the matrix exponential by generating a Krylov subspace of κ vectors, span{ρ, ρAθ, …, ρAθ^(κ−1)}, and exponentiating the projection of AθΔt into the subspace. Here, problems (i) and (ii) associated with calculating the exponential of Aθ are avoided by calculating the exponential of the smaller κ × κ projection rather than the N × N matrix exponential.
In what follows, we implement methods 1–5, comparing their computational time scaling, memory requirements, and mixing. We do so in order to determine which method is best suited for iterative MCMC aimed at estimating parameters and their associated uncertainty. Due to method 1’s excessive memory requirements, we exclude it from the benchmark and proceed by benchmarking the remaining methods (methods 2–5). We show that when compared to RK4 and R-MJP, J-MJP requires a much larger number of matrix exponential equivalent calculations to complete a modeling task, as predicted. From our benchmark, we conclude that either R-MJP (method 3) or Krylov subspace (method 5) attains the fastest overall computational time, depending on both the state space and the dynamics at hand. A guide for when to use each method is given in the conclusion (Sec. IV). A diagram illustrating the methods, and the reasons some are considered unsuitable, is shown in Fig. 1(c).
II. METHODS
This section outlines two key points: first, as an overview, we introduce the necessary background on Markovian dynamics within discrete state spaces in Sec. II A. Next, in Sec. II B, we elaborate on the key aspects required in implementing methods 1–5. While technical details presented herein are essential for those intending to immediately apply these methods, readers primarily interested in application and benchmarking may directly proceed to Sec. III. Moreover, code for the implementation of methods 2–5 (using the Python library for compiled sparse matrices35) can be found in our GitHub repository.36
A. Background on MJPs
Here, we describe a general theory of continuous time systems transitioning between distinct states, σn.1–3,29 We collect any system’s states into a set, termed the state space, σ0:N−1 = {σ0, σ1, …, σN−1}. We describe the transition dynamics within the state space by a transition rate matrix Λ, whose elements λnm, called transition rates from state σn to σm, imply that the probability of the system transitioning from σn to σm in an infinitesimal time interval dt is λnmdt. By definition, all self-transition rates are zero, λnn = 0 ∀n, leaving N(N − 1) potentially independent parameters.
In many problems of scientific interest, of which some examples are provided in Sec. III A, we will see that the remaining λnm are typically described using a much smaller number of unique parameters, collected under θ, from which λnm are determined. For example, in a simple death process, the rates λn,n−1 of going from n members to n − 1 members are all determined by a single parameter (the death rate), irrespective of n. To make θ dependence explicit for matrices, we define Λθ = Λ(θ) and denote the count of non-zero elements in a matrix M as [M]. For example, for a simple death process starting from N members, we have N non-zero elements.
Such systems in discrete space evolving in continuous time with rate matrices are termed MJPs. An MJP’s dynamics in a time interval [0, T], usually thought of as an experimental time course, are described completely by the states the system occupies over time, s0, s1, …, sl, each an element of the state space, and the times at which each transition occurs, t1, t2, …, tl. Together, these describe a trajectory, an equivalent description of the MJP’s dynamics as a function mapping times from the interval [0, T] onto the state space,
s(t) = s_j \quad \text{for } t_j \le t < t_{j+1} \quad (t_0 = 0,\; t_{l+1} = T). \qquad (3)
Examples of many superposed trajectories are plotted in Fig. 1(a).
1. Forward modeling
For now, we are interested in describing the probabilities over states, ρ(t), over this interval [0, T]. To do so, we begin by concretely defining ρ(t) and Aθ in (1). We define ρ(t) as a probability (row) vector, with each element ρn(t) denoting the probability that the system is in state σn at time t,
\rho_n(t) = p(s(t) = \sigma_n). \qquad (4)
The rate matrix, Λ, described at the beginning of Sec. II A is closely related to the generator matrix, Aθ. First appearing in (1), Aθ’s elements anm(θ) are
a_{nm}(\theta) = \begin{cases} \lambda_{nm}(\theta), & n \neq m \\ -\sum_{m' \neq n} \lambda_{nm'}(\theta), & n = m. \end{cases} \qquad (5)
Importantly, as all λnm are non-negative, |ann| ≥ |anm| ∀m; the largest-magnitude element of Aθ thus lies on the diagonal. In addition, as λnn = 0 ∀n, Aθ has N more non-vanishing elements than Λθ, to wit, [Aθ] = [Λθ] + N. With our notation established, we can describe the evolution of ρ(t) by (1).
To compute ρ(t) leveraging the mathematics of MJPs, we define a matrix we call the uniformized transition matrix,
B_\theta = I + \frac{A_\theta}{\Omega_\theta}, \qquad (6)
where Ωθ is an arbitrary number greater than the magnitude of any one element of Aθ such that all elements of Bθ are non-negative. Although it can take any value larger than maxn|ann(θ)|, we choose Ωθ = 2 maxn|ann(θ)|. Note that Aθ and Bθ have the same number of non-zero elements ([Aθ] = [Bθ]), and, from (5), Bθ’s rows sum to one. Using (6), we can rewrite (2) as
\vec{\rho}(t) = \vec{\rho}(0)\, e^{\Omega_\theta t (B_\theta - I)} = \vec{\rho}(0) \sum_{k=0}^{\infty} \left[\frac{(\Omega_\theta t)^k e^{-\Omega_\theta t}}{k!}\right] B_\theta^k. \qquad (7)
We recognize the Poisson density within brackets, Poisson(k|Ωθt),
\text{Poisson}(k|\Omega_\theta t) = \frac{(\Omega_\theta t)^k e^{-\Omega_\theta t}}{k!}, \qquad (8)
and interpret (8) as marginalizing the number of virtual jumps, k, out of ρ(t) such that

\vec{\rho}(t) = \sum_{k=0}^{\infty} \text{Poisson}(k|\Omega_\theta t)\; \vec{\rho}(t|k), \qquad (9)

leaving

\vec{\rho}(t|k) = \vec{\rho}(0)\, B_\theta^k. \qquad (10)
Now, sampling trajectories from (10) yields MJP trajectories mediated by Aθ, as described in Sec. II A. As proven in Ref. 27, the jump times themselves may be sampled uniformly across [0, t], which is why the term “uniformization” describes the use of (10) in the simulation of MJPs.3,27,28 Unlike in Gillespie simulation, k also counts self-transitions, which must be included for uniformization to remain equivalent to its corresponding MJP under uniform jump time statistics. We therefore follow the uniformization literature and say that k counts “virtual jumps” (i.e., it is distinct from the number of transitions l in Sec. II A), indicating that it includes both real and self-transitions. Although uniformization is less popular than Gillespie’s more efficient stochastic simulation algorithm,37 we describe it here since it is the standard for proposing trajectories, conditioned on data, that are subsequently accepted or rejected within Monte Carlo.
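The uniformization identity above can be checked numerically. In this sketch (a toy 3-state generator of our own choosing, with a plain Taylor series standing in for an exact matrix exponential), the Poisson-weighted powers of Bθ reproduce ρ(0)e^(Aθt):

```python
import math
import numpy as np

def expm_taylor(M, terms=80):
    # reference dense matrix exponential via Taylor series (small matrices only)
    out, term = np.eye(M.shape[0]), np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

# toy 3-state generator: non-negative off-diagonal rates, rows summing to zero
A = np.array([[-1.0, 1.0, 0.0],
              [0.5, -1.5, 1.0],
              [0.0, 2.0, -2.0]])
t = 0.7
omega = 2.0 * np.abs(np.diag(A)).max()  # any value > max |a_nn| works
B = np.eye(3) + A / omega               # uniformized transition matrix, Eq. (6)

rho0 = np.array([1.0, 0.0, 0.0])
rho_t = np.zeros(3)
rho_k = rho0.copy()                     # holds rho(0) B^k, updated in place
poisson = math.exp(-omega * t)          # Poisson(k = 0 | omega t)
for k in range(200):                    # truncated sum over virtual jumps
    rho_t = rho_t + poisson * rho_k
    rho_k = rho_k @ B
    poisson *= omega * t / (k + 1)      # Poisson recursion over k

exact = rho0 @ expm_taylor(A * t)
```

Only matrix–vector products with the sparse-structured Bθ appear inside the loop, which is the key to the scaling advantages discussed below.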
2. Inverse modeling
With the details of forward modeling established, we turn to the so-called inverse problem, requiring the probability of states ρn given θ, usually obtained by matrix exponentiation. In the inverse problem, information on θ is to be decoded from observations. Notably, methods 1 and 2 provide stochastic samples of trajectories and jumps, respectively, contributing to the matrix exponential, while methods 3–5, in principle, can be used to compute the vector times matrix exponential directly, as in (2). Later, in Sec. III B, we will embed all methods within an inverse modeling scheme as a means of assessing the efficiency of each method. In order to do so, we immediately consider multiple back-to-back intervals, [0, T]. We collect the end times of each interval into the set T1:I = {T1, T2, …, TI}. The state of a system could be observed at these times, leading to a sequence of state observations s1:I = {s1, s2, …, sI}. Alternately, we could also consider a different scheme involving a collection of systems observed at the times T1:I, where each system’s state is observed at each of the times, also yielding a sequence of state observations s1:I. The latter falls within the realm of what is sometimes called snapshot data.9 The methods we discuss here apply to both types of observations, although, for ease of computation, our benchmarking of methods 1–5 is performed on snapshot data.
Following the notation established in (3), from the state observations, we write the probability p(s1:I|θ), often termed the likelihood. As defined in (4), each factor p(si|θ) = ρsi(Ti), equivalent to what is obtained by the vector times matrix exponential in (2).
In Sec. II B, we will explain different ways to avoid the full matrix exponential calculation, but first, we establish the notation necessary to explain methods 1–3. Learning the kinetic parameters θ from the data means obtaining the posterior p(θ|s1:I), related to the likelihood through Bayes’ theorem,
p(\theta|s_{1:I}) \propto p(s_{1:I}|\theta)\, p(\theta). \qquad (11)
For methods 1 and 2, we do not directly compute p(s1:I|θ) but rather introduce latent variables: the trajectories s(0,T] and the numbers of virtual jumps ki, respectively. That is, we rewrite the posterior we desire as arising from the marginalization over trajectories and virtual jumps, respectively,
p(\theta|s_{1:I}) = \int \mathcal{D}[s_{(0,T]}]\; p(\theta, s_{(0,T]} \,|\, s_{1:I}) \qquad (12a)
and
p(\theta|s_{1:I}) = \sum_{k_{1:I}} p(\theta, k_{1:I} \,|\, s_{1:I}). \qquad (12b)
In other words, we “de-marginalize” the posterior over the trajectories and virtual jumps.
In spite of its importance, a closed-form expression for the posterior is often not attainable. In such cases, we can instead draw samples from the posterior using techniques such as MCMC, which constructs a random walk in parameter space whose stationary distribution is the posterior. When we sample the de-marginalized posteriors in (12a) and (12b), we marginalize out all but the variables of interest (θ) as a post-processing step. In practice, this just means ignoring all sampled values of the latent variables. More details will be given for each of methods 1–5.
B. Avoiding the matrix exponential in methods 1–5
For each method mentioned in the Introduction and described in Fig. 1, we describe how each method allows one to calculate the state probabilities and discuss cost scaling, as summarized in Table I. Although each method deals with the time evolution in its own manner, time scaling with N² appears repeatedly in Table I. This is because each method requires a loop where ρ is multiplied by either Bθ (in methods 1–3) or Aθ (in methods 4 and 5) sequentially within a single calculation. Therefore, we determine the scaling of the total time cost using the cost of each loop’s iteration and the total number of iterations needed to calculate the full vector times matrix exponential product (2).
TABLE I.
Summary of the methods discussed, their number of iterations (first column) and time cost per iteration (second column), with κ being the number of vectors taken into the Krylov subspace. The third column (time scaling) is built by multiplying the first and second columns, assuming conditions (a) and (b) outlined in the second paragraph of Sec. II B. Although J-MJP appears competitive with other methods, we will see in Sec. III B 2 that the latent variables it requires lead to poor MCMC mixing. The variable k* related to the R-MJP method (method 3) is properly defined in Sec. II B 3; here, it is important to know it scales proportionally to ΩθT.
| Method | Number of iterations | Cost per iteration | Time scaling | Memory required | Required latent variables |
|---|---|---|---|---|---|
| T-MJP | O(ΩθT) | O([Aθ]) | O(N²) | N² + [Aθ] | Yes |
| J-MJP | O(ΩθT) | O([Aθ]) | O(N²) | N + [Aθ] | Yes |
| R-MJP | k* | O([Aθ]) | O(N²) | N + [Aθ] | No |
| RK4 | T/Δt | O([Aθ]) | O(N²) | 4N + [Aθ] | No |
| Krylov | T/Δt | O(κ[Aθ] + κ²N + κ³) | O(κ² + κN²) | κ² + κN + [Aθ] | No |
We obtain concise time and memory scaling comparisons between different methods under two assumptions inspired by relevant physical examples: (a) the number of non-zero elements in Aθ scales linearly with N (conveniently guaranteeing sparse Aθ) and (b) the largest-magnitude element of Aθ scales at most linearly with N, since larger rates introduce more jumps in the trajectories of methods 1–3 and reduce the integration time step of methods 4 and 5. To be clear, both conditions are used to describe how the time and memory required to calculate the likelihood scale with the number of particles. They are not, however, required for the methods to work.
Regardless of which method we are discussing, when dealing with chemical reactions, the number of non-zero elements scales at worst linearly with the state space size, O([Aθ]) = O(N), as in condition (a). It is also usually the case that, when molecular species do not interact with themselves, the largest element of the rate matrix scales linearly with the state space size—maxn|ann| = O(N), as in condition (b). As we will see, all methods have some dependence on the largest value in the rate matrix, maxn|ann|. As this quantity represents the fastest time scale at which dynamics occur, all methods must take such a scale into account in order to provide accurate results.
1. Method 1—Trajectory-based MJP method (T-MJP)
Informed by a set of observed states, s1:I, we can alternately sample the trajectory s(0,T] and θ from their respective distributions following a Gibbs sampling scheme,3,38
s_{(0,T]} \sim p(s_{(0,T]} \,|\, \theta, s_{1:I}) \qquad (13a)
and
\theta \sim p(\theta \,|\, s_{(0,T]}). \qquad (13b)
When enough alternating samples are drawn, we obtain a model of the dynamics (that is, samples of θ informed by s1:I), avoiding entirely the matrix exponential itself.
Sampling s(0,T] is detailed elsewhere,3,28,29 but its cost lies mainly in the required forward-filtering backward-sampling algorithm3 within uniformization,27 since it involves storing the intermediate probability distributions (filters) associated with a set of virtual jumps. In the previously defined notation, this means saving the vectors ρ(0)Bθᵏ for all k up to the largest sampled number of virtual jumps, kmax. When condition (b) is met, this memory requirement scales as O(N²) [i.e., problem (ii)].
2. Method 2—Jump-based MJP method (J-MJP)
Avoiding both matrix exponentiation and the memory requirements of method 1, method 2 proceeds by alternating between samples of the numbers of virtual jumps, k1:I, and θ, respectively, also following a Gibbs sampling scheme,
k_{1:I} \sim p(k_{1:I} \,|\, \theta, s_{1:I}) \qquad (14a)
and
\theta \sim p(\theta \,|\, k_{1:I}, s_{1:I}). \qquad (14b)
As with the previous method, for a sufficiently large number of alternate samples, we can avoid matrix exponentiation entirely and still obtain the appropriate probabilities for the states given the kinetic parameters, θ, and the number of virtual jumps for the respective observations, k1:I. In Sec. III B 2, we will discuss practical challenges this poses.
Since the ki are conditionally independent, sampling in (14a) is equivalent to sampling each element ki from
p(k_i \,|\, \theta, s_i) \propto \text{Poisson}(k_i \,|\, \Omega_\theta T_i)\; p(s_i \,|\, k_i, \theta). \qquad (15)
In both (14a) and (14b), the most computationally expensive step is calculating the state probabilities conditioned on ki and θ, p(si|ki, θ); its pseudocode is provided in Algorithm 1.
ALGORITHM 1.
State probabilities for J-MJP. Calculates the state probabilities, p(si|ki, θ).
| Input: The dynamical parameters θ, the initial probability vector ρ(0), the observed states s1:I, and the respective numbers of jumps k1:I |
| Output: An array, φ, of size I (number of observed states) elements, whose ith element is p(si|ki, θ) |
| From θ, calculate Ωθ and Bθ; |
| Set φ as an empty array of I elements; |
| Set k = 0 and ρ = ρ(0); |
| while k ≤ maxi ki do |
| for each i: ki == k do |
| Set φi = ρsi, the element of ρ corresponding to si; |
| Set k += 1; |
| Set ρ = ρBθ; |
| return φ |
Rather than the complete set of vectors ρ(0)Bθᵏ required by T-MJP, Algorithm 1 only requires saving the current such vector, denoted in the pseudocode as ρ. Its outer loop is run maxi ki times. Given that ki is Poisson-distributed at rate TiΩθ (see Sec. II A 1) and Ωθ scales with maxn|ann|, the number of outer loop calls scales with T maxn|ann|, with T being the largest collection time TI. Meanwhile, the most computationally intensive operation within each loop iteration is multiplying ρ by Bθ, which scales in time as O([Bθ]) = O([Aθ]). Therefore, the total time scaling of the state probability computation is O(T maxn|ann|[Aθ]). Under conditions (a) and (b) mentioned in the Introduction, this is equivalent to O(N²).
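As an illustration, Algorithm 1 can be transcribed into NumPy as follows (function and variable names are ours; dense arrays are used for brevity, observed states are taken to be integer indices, and Bθ is assumed precomputed from θ):

```python
import numpy as np

def jmjp_state_probs(B, rho0, obs_states, jump_counts):
    """Return phi with phi[i] = p(s_i | k_i, theta) = [rho0 B^{k_i}]_{s_i}.

    A single running vector rho is advanced through successive powers of B;
    each observation reads off its entry when k reaches that observation's
    sampled number of virtual jumps.  Memory: one length-N vector.
    """
    phi = np.zeros(len(obs_states))
    rho = np.asarray(rho0, dtype=float).copy()
    for k in range(max(jump_counts) + 1):
        for i, (s, ki) in enumerate(zip(obs_states, jump_counts)):
            if ki == k:
                phi[i] = rho[s]
        rho = rho @ B  # one (sparse) matrix-vector product per outer iteration
    return phi
```

With a sparse Bθ, the product in the loop costs O([Bθ]) per iteration, matching the scaling quoted above.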
3. Method 3—Remarginalized MJP method (R-MJP)
Eliminating the numerous samples of k1:I from method 2, method 3 (R-MJP) directly approximates the full sum over k in (7). That is, we obtain the probability vector at the observation time Ti by truncating the sum over k in (7) at a cutoff k*,
\vec{\rho}(t) \approx \sum_{k=0}^{k^*} \text{Poisson}(k|\Omega_\theta t)\; \vec{\rho}(0)\, B_\theta^k, \qquad (16)
where ε = 1 − Σ_{k=0}^{k*} Poisson(k|Ωθt) is the upper bound on the approximation error. In practice, the goal is to calculate the probability vectors at the collection times Ti. For practical reasons, we choose k* to exceed the Poisson mean Ωθt by several standard deviations √(Ωθt), guaranteeing a modest error bound for Ωθt > 10. While implementing the approximation in (16) is straightforward, we provide pseudocode detailing how to efficiently calculate the summation in (16) in Algorithm 2.
ALGORITHM 2.
R-MJP for calculating the final probability vector ρ(t) as in (16).
| Input: The dynamical parameters θ, the initial probability vector ρ(0), and the time t |
| Output: The final probability vector ρ(t) |
| From θ, calculate Ωθ and Bθ; |
| Set k = 0 and PPoisson = e^(−Ωθt); |
| Set PCumulative = PPoisson; |
| Set ρ = ρ(0); |
| Set ρ(t) = PPoisson ρ; |
| while PCumulative < 1 − ε do |
| Set k += 1; |
| Set ρ = ρBθ; |
| Set PPoisson ∗= Ωθt/k; |
| Set PCumulative += PPoisson; |
| Set ρ(t) += PPoisson ρ; |
| return ρ(t) |
In order to calculate the probability vector at the observation time Ti from ρ(0), the outer loop in Algorithm 2 is set to run k* times. Similarly to the previous method, this scales as O(T maxn|ann|). Within this loop, the most computationally expensive task is the multiplication of an array of N elements by Bθ, which scales as O([Bθ]). Consequently, the overall time cost scaling is O(maxn|ann|[Bθ]). Again, when conditions (a) and (b) in Sec. II A 2 are met, this overall cost becomes O(N²), while Algorithm 2 requires only saving a single array of size N besides the matrix Bθ, giving memory scaling O(N).
Note that unlike the previous methods, which require latent variable sampling, this method allows us to directly approximate the solution of (1) while avoiding the matrix exponential in (2). Notably, the number of virtual jumps, k, follows a Poisson distribution with rate ΩθT, allowing us to interpret 1/Ωθ as something like an adaptive time step. Subsequent methods that directly approximate (1) differ from R-MJP by explicitly segmenting time into steps, Δt, which must be chosen by the user. By comparison, method 3 adaptively determines k* based on the system’s dynamics and specifies an error upper bound, ε, with no additional computation.
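Algorithm 2 translates nearly line for line into NumPy. In this sketch (names are ours; ε is the tolerated Poisson tail mass, and for very large Ωθt the leading Poisson weight would underflow, which production code must handle, e.g., by starting the sum near the Poisson mode):

```python
import math
import numpy as np

def rmjp_propagate(B, omega, rho0, t, eps=1e-12):
    """Approximate rho(t) = sum_k Poisson(k | omega t) rho0 B^k, as in
    Eq. (16), truncating once the cumulative Poisson mass exceeds 1 - eps."""
    poisson = math.exp(-omega * t)    # Poisson(0 | omega t); underflows if omega*t is huge
    p_cum = poisson
    rho = np.asarray(rho0, dtype=float).copy()
    out = poisson * rho
    k = 0
    while p_cum < 1.0 - eps:
        k += 1
        rho = rho @ B                 # rho(0) B^k via one matvec per term
        poisson *= omega * t / k      # Poisson recursion over k
        p_cum += poisson
        out = out + poisson * rho
    return out
```

For a two-state toy generator, the result matches the analytic relaxation ρ(t) = π + (ρ(0) − π)e^(−(λ12+λ21)t), with π the stationary distribution.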
4. Method 4—Runge–Kutta
The most conceptually straightforward technique to avoid the matrix exponential in (2) is to numerically integrate the trajectory-marginalized Eq. (1). Runge–Kutta methods, designed for ordinary differential equations,32 balance stability and accuracy flexibly based on a chosen order of expansion and time step Δt. For example, the first-order Runge–Kutta (Euler) method reads
\vec{\rho}(t + \Delta t) = \vec{\rho}(t) + \Delta t\; \vec{\rho}(t)\, A_\theta. \qquad (17)
Thus, we evolve the initial vector, ρ(0), and compute successive points along the solution curve until a final time T, i.e., compute ρ(Δt), ρ(2Δt), …, ρ(T) iteratively. This is equivalent to a first-order Taylor expansion of the exponential at each time step.
Avoiding numerical inaccuracy and instability often warrants higher-order methods. A common choice is the fourth-order Runge–Kutta method (RK4).32
In RK4, we write (1) as
\vec{\rho}(t + \Delta t) = \vec{\rho}(t) + \frac{\Delta t}{6}\left(\vec{\delta}_1 + 2\vec{\delta}_2 + 2\vec{\delta}_3 + \vec{\delta}_4\right), \qquad (18)
where
\vec{\delta}_1 = \vec{\rho}(t)\, A_\theta(t), \quad \vec{\delta}_2 = \left[\vec{\rho}(t) + \tfrac{\Delta t}{2}\vec{\delta}_1\right] A_\theta\!\left(t + \tfrac{\Delta t}{2}\right), \quad \vec{\delta}_3 = \left[\vec{\rho}(t) + \tfrac{\Delta t}{2}\vec{\delta}_2\right] A_\theta\!\left(t + \tfrac{\Delta t}{2}\right), \quad \vec{\delta}_4 = \left[\vec{\rho}(t) + \Delta t\,\vec{\delta}_3\right] A_\theta(t + \Delta t), \qquad (19)
and iteratively evolve from t = 0 toward t = T. This reduces the error to order Δt⁴ (see Ref. 32) at each step. We made the time dependence in (19) explicit [writing Aθ(t)] because this method can be used for transition rates that change in time, unlike the previous methods relying on the expansion in (7), which assumes that Aθ is constant in time.
When implementing RK4, numerical stability considerations dictate that Δt ≲ 1/maxn|ann|. Thus, the number of iterations required in computing the likelihood is given by T/Δt ∼ T maxn|ann|. Since evolving by Δt requires 4 matrix–vector products (19), each iteration scales in time as O([Aθ]). When conditions (a) and (b) mentioned in Sec. II A 2 are met, it follows that the Runge–Kutta solution takes a time of order O(N²). Besides storing Aθ, no more than 5 arrays of size N need to be saved at a time—δ1 to δ4 in (19) and ρ(t)—thus leading to the value found in Table I.
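For time-independent Aθ, the RK4 update of (18) and (19) can be sketched as follows (our function names; dense products for brevity, with sparse matrix–vector products dropping in unchanged):

```python
import numpy as np

def rk4_step(rho, A, dt):
    """One RK4 step for d rho/dt = rho A, with A constant over the step."""
    d1 = rho @ A
    d2 = (rho + 0.5 * dt * d1) @ A
    d3 = (rho + 0.5 * dt * d2) @ A
    d4 = (rho + dt * d3) @ A
    # only five length-N arrays (rho, d1..d4) are ever held in memory
    return rho + (dt / 6.0) * (d1 + 2.0 * d2 + 2.0 * d3 + d4)

def rk4_propagate(rho0, A, T, n_steps):
    """Evolve rho(0) to rho(T) through n_steps equal RK4 steps."""
    rho = np.asarray(rho0, dtype=float).copy()
    dt = T / n_steps
    for _ in range(n_steps):
        rho = rk4_step(rho, A, dt)
    return rho
```

Since each δ vector has zero sum whenever the rows of Aθ sum to zero, total probability is conserved exactly by this update.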
5. Method 5—Krylov subspace
The Krylov subspace technique25,33,34 evolves the initial probability vector through time by approximating the matrix exponential over a short time step Δt, e^(AθΔt). In keeping with the other methods, it still avoids the full matrix exponential e^(Aθt). It does so by constructing a Krylov subspace at each time step, spanned by {ρ(t), ρ(t)Aθ, …, ρ(t)Aθ^(κ−1)}, where the size of the subspace, κ, is a predefined integer, which serves a role similar to the approximation order in Runge–Kutta. Note that, as with methods 1–3, the Krylov subspace requires that Aθ be constant in the interval (t, t + Δt).
Although the vectors spanning the subspace are clearly defined, we require an orthonormal basis {q1, …, qκ} for the subspace in order to project Aθ into it. This projection is given by
H = Q^\dagger A_\theta Q, \qquad (20)
where Q† represents the transpose of Q and Q’s ith column is qi. Both H and the basis are obtained using the Arnoldi algorithm.39 To make its computational cost clear, we describe the Arnoldi algorithm in Algorithm 3.
ALGORITHM 3.
Arnoldi algorithm for obtaining the orthonormal basis and projection of the generator matrix onto the Krylov subspace.
| Input: The generator matrix Aθ, the initial probability vector ρ(t) for the interval [t, t + Δt], and the Krylov subspace size κ |
| Output: Q, whose columns are an orthonormal basis of the Krylov subspace, and H, the projection of Aθ into the Krylov subspace (20) |
| From ρ(t), we obtain q1 = ρ(t)/‖ρ(t)‖2, with ‖ ⋅‖2 representing the 2-norm; |
| Set H as a κ × κ matrix of elements hij all equal to zero; |
| for each i: 2, 3, …, κ do |
| Set q = qi−1 Aθ; |
| for each j: 1, …, i − 1 do |
| Set hj,i−1 = ⟨qj, q⟩, with ⟨⋅, ⋅⟩ denoting the inner product; |
| Set q = q − hj,i−1 qj; |
| Set hi,i−1 = ‖q‖2; |
| Set qi = q/hi,i−1; |
| return Q, a matrix whose ith column is qi, and H |
From (20), we can approximate the matrix exponential in (2) as
e^{A_\theta \Delta t} \approx Q\, e^{H \Delta t}\, Q^\dagger. \qquad (21)
Now, since the columns of Q form an orthonormal basis, Q†Q = I. Using also the fact that the first basis vector is parallel to ρ(t), we can simplify further
\vec{\rho}(t + \Delta t) \approx \|\vec{\rho}(t)\|_2 \left[Q\, e^{H \Delta t}\right]_{[:,1]}, \qquad (22)
with the subscript [:, 1] denoting the first column of the matrix.
Naturally, the choice of κ and Δt contributes to the approximation error and computational cost of method 5. In particular, Krylov allows larger Δt than RK4—we have found that Δt ≲ κ/maxn|ann| leads to a stable integrator. The number of iterations required to compute (2) by sequentially approximating (22) is then given by T/Δt. However, larger values of κ also lead to larger matrices Q and H, leading back to the original problems with the matrix exponential. For our later benchmark, we fix κ = 20.
There are three calculations at each time step in (22). First, compute Q and H through Algorithm 3. Each iteration requires κ matrix times vector products, scaling as O(κ[Aθ]), and dot products, O(κ²N). Next, calculate e^(HΔt), which scales as O(κ³). Finally, perform the matrix multiplication in (22), which scales as O(κN). Thus, the total time scaling becomes O((T/Δt)(κ[Aθ] + κ²N + κ³)).
At each iteration, three objects are stored: Aθ, Q, and H. Therefore, the memory cost scales as O(κ² + κN + [Aθ]). Again, assuming that (a) and (b) mentioned in Sec. II A 2 are satisfied, the total time and the memory requirements scale as O(κ² + κN²) and O(κN), respectively. These together determine the scaling per iteration presented in Table I.
Note that we usually use a fixed κ and observe the scaling with N; as such, we are most interested in the regime κ ≪ N, but we will not take that limit in the scaling explicitly as of now. This is because, in practical terms, the presence of κ leads to overhead in the total time calculation that will be relevant in our benchmarking. However, one could expect that when such a limit is reached, the scaling would become O(N²) for computational time and O(N) for memory, similar to the previous methods.
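One Krylov time step, combining Algorithm 3 with (22), can be sketched in NumPy as follows (our names; row-vector convention as in the text, a plain Taylor series exponentiating the small κ × κ projection, and no guard against the “happy breakdown” where an intermediate norm vanishes):

```python
import numpy as np

def krylov_step(A, rho, dt, kappa):
    """Advance the row vector rho by dt through a kappa-dimensional
    Krylov subspace span{rho, rho A, ..., rho A^(kappa-1)} (Arnoldi)."""
    n = rho.shape[0]
    Q = np.zeros((kappa, n))             # rows hold the orthonormal basis
    H = np.zeros((kappa, kappa))         # projection of A onto the subspace
    beta = np.linalg.norm(rho)
    Q[0] = rho / beta
    for j in range(kappa):
        w = Q[j] @ A
        for i in range(j + 1):           # orthogonalize against earlier q's
            H[i, j] = Q[i] @ w
            w = w - H[i, j] * Q[i]
        if j + 1 < kappa:
            H[j + 1, j] = np.linalg.norm(w)
            Q[j + 1] = w / H[j + 1, j]   # assumes no "happy breakdown"
    # exponentiate the small kappa x kappa projection by a Taylor series
    E, term = np.eye(kappa), np.eye(kappa)
    for k in range(1, 60):
        term = term @ (H * dt) / k
        E = E + term
    # Eq. (22): first column of exp(H dt), expressed back in the basis Q
    return beta * (E[:, 0] @ Q)
```

For κ equal to the full dimension N, the subspace is exact and the step reproduces ρ(t)e^(AΔt) up to the precision of the small exponential.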
III. RESULTS
With all the methods described, we now compare them on the three systems of interest presented in Sec. III A. Benchmarking of the different methods is given in Sec. III B.
A. Examples
As a basis for our benchmark, we describe some examples of dynamics mediated by (1), inspired by selected literature. An illustrative cartoon of each system is found in Fig. 2.
FIG. 2.
Cartoon representation of the dynamical systems described in Sec. III A. Panel (a) represents the birth–death process of R described in Sec. III A 1. Panel (b) represents a system that switches between two states, of which only the active one, σact, produces R, as in Sec. III A 2. Panel (c) represents an autoregulatory system with one state, σact, generating a product R, which, in turn, generates a second product P; P subsequently plays a role in inactivating the system by moving it to a state, σina, where R is not produced, as in Sec. III A 3.
In the subsections dedicated to each example below, we will first motivate the model and describe the required parameters, θ, then build the state space and formulate the transition rate matrix, Λ, and finally describe the simulation of synthetic experiments producing the data to be used in our benchmark in Sec. III B.
1. Birth–death process
The birth–death process is relevant across population dynamics. It models the stochastic production (or birth) of a species R at a constant rate β, where each copy of R can be degraded (or die) at rate γ. Thus, the full set of parameters is simply θ = {β, γ}. A representation of this process is depicted in Fig. 2(a). Expressed in the language of chemical reactions, we have
| ∅ →β R,  R →γ ∅ | (23) |
For example, in molecular biology, we can model an actively transcribing DNA sequence producing one RNA9 at rate β that can be degraded at rate γ. Data for inference on this system would be the population of R at different time points across a number of identical DNA sequences.
We write the elements of the state space σn to represent an R population of n, with N being an upper cutoff in the R number. While selection of the upper cutoff can be worthy of independent investigation—see, for example, Refs. 10, 23, and 24—here, and in all the following examples, we determine N in a data-driven manner. To set the cutoff, we assume that the probability for R counts larger than double the highest count found in the data is effectively zero. In this case, the elements λnm(θ) of the transition rate matrix Λ are given by
| λn,n+1 = β,  λn,n−1 = nγ,  λnm = 0 for all other m ≠ n,  and λnn set so that each row sums to zero | (24) |
We note that Λ is sparse by construction, and each row has, at most, two non-vanishing off-diagonal elements. Since the largest element, Ωθ, was an important quantity when discussing the methods’ time scaling, it is straightforward to see from (5) that Ωθ scales as O(N).
In our synthetic experiments for this process, we fix the time units so that the ground-truth degradation rate is γ = 1; equivalently, time has units of 1/γ. Crucially, this is merely a choice of parameters for the synthetic experiment, and we do not assume any knowledge of γ a priori. All simulations start with zero population of R, and we observe the final state at various time points Ti, evenly spaced from 0.5 to 5. For each of these time points, we conduct 300 individual simulations, yielding a total dataset of I = 3000 observations. We perform synthetic experiments for each value of the production rate β from the set {50, 100, 200, 400, 800}, selected because the population at equilibrium fluctuates around β/γ; this allows us to assess the scaling (in time and memory) of each algorithm with the number of states necessary for accurate inference, spanning a range from hundreds to thousands of states.
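As an illustration, the transition rate matrix in (24) can be assembled in a few lines. This is a sketch only (dense for clarity, while the implementation in the paper's repository uses sparse matrices), and the function name is ours:

```python
import numpy as np

def birth_death_generator(beta, gamma, N):
    """Assemble the (N+1) x (N+1) birth-death rate matrix of (24):
    births n -> n+1 at rate beta, deaths n -> n-1 at rate n*gamma,
    with the diagonal chosen so every row sums to zero."""
    Lam = np.zeros((N + 1, N + 1))
    for n in range(N + 1):
        if n < N:                      # birth, suppressed at the cutoff N
            Lam[n, n + 1] = beta
        if n > 0:                      # death, proportional to the population
            Lam[n, n - 1] = n * gamma
    Lam -= np.diag(Lam.sum(axis=1))    # rate-matrix rows sum to zero
    return Lam
```

Each row carries at most two non-vanishing off-diagonal elements, so a matrix–vector product with this Λ costs O(N) when it is stored sparsely.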
2. Two-state birth–death
Next, we explore a two-state birth–death reaction network as a simple extension of the birth–death process, where the system’s production stochastically deactivates. We reuse β for the production rate in the system’s active state, σact. The product R degrades at a constant rate γ just as before. In addition, the system deactivates—transitions from σact to an inactive state σina—at rate λina and reactivates at rate λact. A representation of this system is depicted in Fig. 2(b).
The full set of kinetic parameters is denoted by θ = {β, γ, λact, λina}, and the system’s state is described by two variables: the population of R, denoted as nR, and the gene state, σG, which can be either active, σG = 0 representing σact, or inactive, σG = 1 representing σina.
This is equivalent to the so-called bursty expression of a gene with one active state—RNA producing—and one inactive state. Expressed in the language of chemical reactions, we write
| (25) |
Since we must choose a linear ordering of states in order to write their probabilities as a vector, we write our state space with a single index n. We index the states using n = nR + σGNR, with NR being the upper cutoff for the R population, and N = 2NR. Similarly, we can recover the population of R and the gene state from n as nR = n mod NR and σG = ⌊n/NR⌋, where x mod q represents the remainder obtained when x is divided by q and ⌊x⌋ represents the largest integer no larger than x. Here, the transition matrix elements are given by
| λn,n+1 = β (for σG = 0),  λn,n−1 = nRγ,  λn,n+NR = λina (for σG = 0),  λn,n−NR = λact (for σG = 1),  λnm = 0 for all other m ≠ n,  and λnn set so that each row sums to zero | (26) |
Since all elements λnm are positive, going back to the definition of anm in (5), we can see that the largest element, Ωθ, again scales as O(N).
Just as we did for the previous system, we set γ = 1 in our synthetic experiments, thereby setting the units. In this case, the production rate β takes values in {100, 200, 400, 500}, enabling us to observe how time and memory scale with the number of states in each method; note that the steady-state amount of product from the previous example, β/γ, is recovered in the limiting case where the system never leaves the active state (i.e., if λina = 0). The activation and deactivation rates, λact and λina, are chosen to be comparable to the inverse of the measurement times, λact = 2 and λina = 1. The observation points, Ti, are evenly spaced from 0.5 to 10, leading to a total of I = 6000 observations. The observations (data) are the population of R, nR, at different time end points. However, unlike in the previous example, there is no direct observation of the σact or σina state. Hence, when calculating the likelihood of an observation, one has to sum the probabilities of states (nR, 0) and (nR, 1).
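The flattening n = nR + σGNR, its inverse, and the summation over the unobserved gene state when computing the likelihood can be sketched as follows (helper names are ours):

```python
import numpy as np

def flatten(n_R, sigma_G, N_R):
    """Single index n = n_R + sigma_G * N_R."""
    return n_R + sigma_G * N_R

def unflatten(n, N_R):
    """Recover (n_R, sigma_G) = (n mod N_R, floor(n / N_R))."""
    return n % N_R, n // N_R

def likelihood_of_count(rho, n_R, N_R):
    """Probability of observing n_R copies of R: marginalize over the
    hidden gene state by summing the probabilities of (n_R, 0) and (n_R, 1)."""
    return rho[flatten(n_R, 0, N_R)] + rho[flatten(n_R, 1, N_R)]
```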
3. Autoregulatory gene network
Another scenario worth studying is the model for an autoregulatory gene network. In such a system, RNA is produced, which is later translated into a protein, which, in turn, suppresses its own production—see Fig. 2(c). As a result, the system is capable of maintaining a controlled level of both transcribed RNA and synthesized protein within cells.18
In order to model this system, we consider an active state, σact, and an inactive state, σina. In the active state, the system produces a component R at rate βR. Meanwhile, each copy of R can produce another component P at rate βP; R and P degrade at rates γR and γP, respectively. The component P, in turn, inhibits the production of the first product R.
Expressed in the language of chemical reactions, we have
| (27) |
Thus, the set of kinetic parameters is denoted by θ = {βR, βP, γR, γP, λact, λina}, and the system’s state is described by three variables: the population of R, nR; the population of P, nP; and the gene state, σG. Similarly to the previous example, we index the state space using n = nR + nPNR + σGNRNP, which is inverted as nR = n mod NR, nP = ⌊n/NR⌋ mod NP, and σG = ⌊n/(NRNP)⌋, with NR and NP being the cutoffs on the populations of R and P, respectively. Thus, N = 2NRNP. The transition matrix elements are given by
| (28) |
Here, calculating the largest element of the transition rate matrix is not as trivial as in the previous examples, but it can be done by plugging the elements λnm into (5), obtaining
| (29) |
Although not as straightforward as in the previous examples, the largest element for this system scales better than linearly with the total size of the state space N. Later, this will mean that the scaling in time is considerably better than expected from Table I.
Here, we use the degradation rate of R to set our time units, γR = 1. We explore the production rates for both R and P as (βR, βP) ∈ {(2.5, 0.125), (5, 0.25), (10, 0.5), (20, 1)}. The activation and deactivation rates are set at λact = 0.1 and λina = 0.05, respectively, and the degradation rate of P is γP = 0.1. Smaller production and degradation rates are chosen for P to represent the greater stability and higher production cost of proteins over RNA. As in the previous example, the collection times are evenly spaced from 0.5 to 10, leading to a total of I = 6000 observations. The data available are now the populations of both R and P, nR and nP, respectively, at the final state. Again, there is no direct observation of the σact or σina state; hence, when calculating the likelihood of an observation, one has to sum the probabilities of states (nR, nP, 0) and (nR, nP, 1).
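The three-variable indexing works the same way, now as a mixed-radix encoding; a sketch (helper names are ours):

```python
def flatten3(n_R, n_P, sigma_G, N_R, N_P):
    """Single index n = n_R + n_P * N_R + sigma_G * N_R * N_P."""
    return n_R + n_P * N_R + sigma_G * N_R * N_P

def unflatten3(n, N_R, N_P):
    """Invert the mixed-radix index: n_R = n mod N_R,
    n_P = floor(n / N_R) mod N_P, sigma_G = floor(n / (N_R * N_P))."""
    return n % N_R, (n // N_R) % N_P, n // (N_R * N_P)
```

As in the two-state example, the likelihood of an observed pair (nR, nP) sums the flattened probabilities over σG ∈ {0, 1}.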
B. Benchmarking
To evaluate the efficiency of methods 2–5 in avoiding the matrix exponential for a variety of physically motivated generator matrices, we embed the matrix exponential as the key computational step in a Bayesian inference scheme. To be precise, we generate synthetic data and perform inference by sampling kinetic parameters, θ, from the posterior with MCMC. The sampling scheme (see Sec. A of the supplementary material for details) is of little relevance to the matrix exponential, but the scheme implemented is a form of adaptive Metropolis40 and we have made the source code publicly available on our GitHub36 repository.
Benchmarking each method serves two primary objectives: first, to evaluate time scaling with the total number of states for the benchmarked methods and compare them with the conventional matrix exponential.41 Since the conventional matrix exponential is dramatically more costly than the other methods even for simple systems, we include it in our benchmark only for the birth–death process. These results are presented and discussed in Sec. III B 1.
Second, to ensure that each method results in an equivalent modeling scheme, we present a comparison between the probability distributions over θ obtained through various methods in Sec. B of the supplementary material. Notably, even though the time needed to generate the samples that make up this distribution with the J-MJP (method 2) is not considerably worse than the other methods, it generates highly correlated samples, a problem whose implications we expand upon in Sec. III B 2.
1. Time scaling comparison
To compare the time each method needs to calculate the final distribution [thus avoiding the matrix exponential in (2)], we measure the wall time required to sample 100 values of all kinetic parameters from the posterior across datasets (each demanding a different total state space size). To obtain some statistics, we repeat this procedure 20 times. The results can be seen in Fig. 3. We focused on comparing the time scaling between methods; to conduct tests under conditions where the MCMC samplers had already converged to the posterior, we initialized each sampler at the simulation’s ground truth.
FIG. 3.
We recorded the wall time, tW, necessary to obtain 100 samples from the posterior as a function of the total number of states. A comparison of J-MJP, R-MJP, RK4, and Krylov (KRY) for all three examples is presented above. The birth–death process (Sec. III A 1) is presented in the top panel, the two-state birth–death (Sec. III A 2) in the middle panel, and the autoregulatory gene model (Sec. III A 3) in the bottom panel. In the birth–death example, we also compare to the conventional implementation of the matrix exponential (ME). When considering overall performance, the contest narrows to R-MJP and Krylov, with Krylov tending to exhibit superior performance, especially as the state spaces increase in size.
Our Python code, available in our repository,36 leverages the Numba compilation tool42 and a package enabling sparse matrix manipulation in Numba-compiled code.35 The benchmark was conducted on a system equipped with an Intel(R) Core(TM) i7-7700K CPU operating at a base frequency of 4.20 GHz. The system is outfitted with 16 GB of RAM.
Figure 3 clearly illustrates the expected quadratic scaling of RK4, J-MJP, and R-MJP relative to the total number of states in each of the three examples under study. In these cases, we observe a slightly smaller exponent due to necessary overhead. Nevertheless, this overhead is anticipated to diminish in significance with an increase in state space size. The Krylov scaling is somewhat less straightforward: the method has a more complex dependence on the number of states, as discussed in Sec. II B 5 and Table I, and as such it needs a considerably larger number of states to reach a regime of quadratic scaling.
When studying the autoregulatory gene system, the exponents of the power law relating wall time to the number of states are smaller due to the less straightforward relationship between the largest element of Aθ and the total number of states, as mentioned in Sec. III A 3. In the same figure, we also show, for the birth–death process in Sec. III A 1, how all these methods dramatically outperform the regular matrix exponential implementation.
The comparison of time scaling between methods 2–5 and the regular matrix exponential was not repeated for the two other systems, as the difference is even more dramatic. The competition for overall best method comes down to R-MJP and Krylov, with Krylov tending to perform better at larger state spaces. This difference in performance can be attributed to the fact that, in smaller state spaces, the time required for allocating memory to store the Krylov subspace (as detailed in Algorithm 3) and for exponentiating the κ × κ projection matrix, H, is significant compared to the time needed for the vector–matrix product computation. However, as the state space expands, the relative time consumed by memory allocation and matrix exponentiation diminishes, making the Krylov method increasingly more efficient. Nevertheless, regardless of the size of the state space, Krylov still requires arbitrary choices of the subspace size, κ, and the time step (with the latter also required in RK4). In contrast, R-MJP (as well as T-MJP and J-MJP) has an effective time step, 1/Ωθ, endowed by the system.
Determining the state space size at which Krylov becomes more efficient than R-MJP depends on the matrix’s sparsity structure and its largest element (related to the effective time step in each of the methods). Moreover, in Sec. III B, we focused on the time scaling of each method, which, when implemented correctly, is independent of the hardware. However, for the examples under investigation, using our code36 and hardware (described above), the critical size at which Krylov becomes more efficient is where the R-MJP and Krylov lines in Fig. 3 intersect.
2. J-MJP is inefficient
As mentioned in Sec. II A 2, the samplers generated by methods 3, 4, and 5 obtain samples from the full posterior over the kinetic parameters θ. Consequently, the samplers from methods 3, 4, and 5 recover nearly identical probability distributions over model parameters.
Meanwhile, method 2 uses the set of numbers of virtual jumps as latent variables. To effectively compare the performance of J-MJP (method 2) to that of the other methods, it is essential to look not only at execution time but also to assess how well an equivalent number of samples describes the posterior. This becomes especially important since the sampler introduces one extra latent variable per observation, dramatically increasing the dimensionality of the sample space. With this increased dimensionality, more samples may be required to properly characterize the joint posterior between θ and the latent variables.
In Fig. 4, the posteriors obtained by J-MJP and the other three benchmarked methods are shown to give inconsistent results in both the learned parameters and their uncertainty (depicted as credible intervals in Fig. 4). This comparison was made for the two-state birth–death example (Sec. III A 2) with ground truth kinetic parameters θ = {β, γ, λact, λina} = {200, 1, 2, 1}. More examples are given in Sec. B of the supplementary material.
FIG. 4.
Histogram of the posterior samples of each kinetic parameter for the two-state birth–death process. The histogram samples are compared with the simulation’s ground truth (parameter values used for generating synthetic data). In each of these histograms, we highlight the estimated expected value of the posterior (corresponding to the sample mean of each parameter) as well as the estimated 50% and 95% credible intervals (corresponding to the samples’ 25th to 75th and 2.5th to 97.5th percentiles). We notice that R-MJP, RK4, and Krylov yield coherent posteriors. Meanwhile, J-MJP, while placing the bulk of its probability density in a similar parameter region, recovers credible intervals inconsistent with the other methods.
In order to demonstrate that the observed discrepancy is not due to a mathematical inconsistency but rather a result of insufficient sampling, Fig. 5 presents the autocorrelation of the MCMC samples, defined by
| ACF(τ) = Σh (θa(h) − θ̄a)(θa(h+τ) − θ̄a) / Σh (θa(h) − θ̄a)2 | (30) |
with θa(h) being the ath parameter of the hth sample and θ̄a its mean across samples. In simpler terms, ACF(τ) quantifies the average correlation between a sample and another sample τ steps away within the Markov chain. Note that the MCMC structure is intrinsic to J-MJP (and T-MJP), while the other three methods (R-MJP, RK4, and Krylov) calculate the vector times matrix exponential (2) identically, whether within an MCMC framework or independently.
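For reference, a sketch of how (30) can be estimated from a chain of samples of a single parameter (the normalization shown, with ACF(0) = 1, is one common convention):

```python
import numpy as np

def acf(chain, max_lag):
    """Empirical autocorrelation of a 1-D chain of MCMC samples:
    ACF(tau) = Cov(theta^(h), theta^(h+tau)) / Var(theta)."""
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    var = np.mean(x * x)
    return np.array([np.mean(x[:len(x) - t] * x[t:]) / var
                     for t in range(max_lag + 1)])
```

A well-mixing chain has ACF(τ) dropping quickly toward zero; a slowly decaying ACF, as observed for J-MJP, signals highly correlated samples.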
FIG. 5.
Auto-correlation function (30) for the kinetic parameters sampled using J-MJP (top) and R-MJP (bottom) within the two-state birth–death example. The results confirm that the sampler generated using J-MJP is inefficient: its autocorrelation does not approach 0 even after tens of thousands of samples, while for R-MJP the autocorrelation falls to near zero after just a few hundred samples (as it does for Krylov and RK4).
An ideal scenario would have ACF rapidly dropping from one to zero, signifying optimal mixing. Nonetheless, Fig. 5 reveals that while this is achieved by R-MJP (as well as RK4 and Krylov), for J-MJP, sampled kinetic parameters remain correlated, even following thousands of iterations. Consequently, at least an order of magnitude more samples are needed to characterize the posterior as proficiently with J-MJP as with RK4, Krylov, and R-MJP. This happens as a consequence of J-MJP having to sample a considerable number of latent variables.
IV. CONCLUSION
Our comparison of different methods to avoid matrix exponentiation reveals that no one method is always preferable. Figure 3 indicates that the most efficient method to generate sufficient samples of the posterior hinges on both the system and the state space: either R-MJP (method 3) or the Krylov subspace (method 5), with the latter typically proving more efficient for significantly larger state spaces and the former appearing to have the advantage for sparser matrices. However, the other methods should not be dismissed entirely, for reasons we discuss below. Figure 6 summarizes this discussion by giving recommendations for when to use each method.
FIG. 6.

Summary of recommendations for when to use each of the methods described in the present article based on the results reported in Sec. III B.
Although RK4 (method 4) performed significantly worse than all other methods, except the complete matrix exponential, it is the only method that generalizes to cases where transition rates change in time—such as induction43—since it does not invoke a time-constant generator matrix.
Although we dismissed method 1 (T-MJP) for large state spaces due to its memory requirements, it is the only method allowing for trajectory inference and, as such, can still find important applications in some problems.29
Regarding method 2 (J-MJP), the result in Fig. 5 indicates that it requires a much larger number of samples to characterize the posterior due to the many latent variables introduced, thereby reducing the mixing for the kinetic parameters in θ. The result is a sampler that mixes significantly worse than those using methods 3, 4, and 5, which all marginalize over these latent variables.
Opting for synthetic data was crucial to demonstrate, in a controlled environment, the wall time scaling across state spaces with size ranging in the hundreds to thousands of states. Our findings suggest that, for practical purposes, the choice between Krylov and R-MJP is highly contingent. As such, we recommend a benchmark of these two methods tailored to the specific dynamics under study for anyone who needs their algorithm to run as efficiently as possible.
Furthermore, although the time benchmark indicates that Krylov outperforms R-MJP in large state spaces, Krylov still demands more memory allocation, associated with the choice of subspace size, κ. In contrast, R-MJP requires no such choices: the effective time step, 1/Ωθ, is determined by the system, and the selection of k* establishes a directly computed upper bound on the error.
SUPPLEMENTARY MATERIAL
The supplementary material addresses two topics: (A) it provides detailed information about the MCMC strategy used to obtain the posterior samples, as shown in Figs. 4 and 5, and (B) it presents the posterior for all examples in Sec. III A in a format akin to that of Fig. 4, thereby demonstrating the consistency of methods 3–5.
ACKNOWLEDGMENTS
We would like to thank Ioannis Sgouralis, Brian Munsky, Tushar Modi, Julio Candanedo, and Lance W. Q. Xu for insightful discussions during the development of the present article. S.P. acknowledges support from the NIH (Grant Nos. R01GM134426, R01GM130745, and R35GM148237). We acknowledge a grant entitled “Mechanisms of Non-equilibrium Transport Processes at Intermediate Scales” from the Department of the Army – Materiel Command.
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose.
Author Contributions
Pedro Pessoa: Conceptualization (lead); Formal analysis (lead); Investigation (equal); Software (lead); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Max Schweiger: Conceptualization (supporting); Investigation (supporting); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Steve Pressé: Project administration (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (lead).
DATA AVAILABILITY
Data sharing is not applicable to this article as no new data were created or analyzed in this study. The code for simulations is available in the GitHub repository: https://github.com/PessoaP/AvoidingMatrixExponential
REFERENCES
- 1. Ross S. M., Stochastic Processes (John Wiley & Sons, Nashville, TN, 1995).
- 2. Kampen N. V., Stochastic Processes in Physics and Chemistry (Elsevier, Amsterdam, Netherlands, 2007).
- 3. Pressé S. and Sgouralis I., Data Modeling for the Sciences (Cambridge University Press, Cambridge, 2023).
- 4. Lee J. and Pressé S., “A derivation of the master equation from path entropy maximization,” J. Chem. Phys. 137, 074103 (2012). 10.1063/1.4743955
- 5. Hung K. Y. S., Klumpe S., Eisele M. R., Elsasser S., Tian G., Sun S., Moroco J. A., Cheng T. C., Joshi T., Seibel T. et al., “Allosteric control of Ubp6 and the proteasome via a bidirectional switch,” Nat. Commun. 13, 838–913 (2022). 10.1038/s41467-022-28186-y
- 6. Fletcher A., Zhao R., and Enciso G., “Non-cooperative mechanism for bounded and ultrasensitive chromatin remodeling,” J. Theor. Biol. 534, 110946 (2022). 10.1016/j.jtbi.2021.110946
- 7. Pressé S., Ghosh K., Phillips R., and Dill K. A., “Dynamical fluctuations in biochemical reactions and cycles,” Phys. Rev. E 82, 031905 (2010). 10.1103/physreve.82.031905
- 8. Rios-Estepa R. and Lange B. M., “Experimental and mathematical approaches to modeling plant metabolic networks,” Phytochemistry 68, 2351–2374 (2007). 10.1016/j.phytochem.2007.04.021
- 9. Kilic Z., Schweiger M., Moyer C., Shepherd D., and Pressé S., “Gene expression model inference from snapshot RNA data using Bayesian non-parametrics,” Nat. Comput. Sci. 3, 174–183 (2023). 10.1038/s43588-022-00392-0
- 10. Munsky B., Li G., Fox Z. R., Shepherd D. P., and Neuert G., “Distribution shapes govern the discovery of predictive models for gene regulation,” Proc. Natl. Acad. Sci. U. S. A. 115, 7533–7538 (2018). 10.1073/pnas.1804060115
- 11. Tiberi S., Walsh M., Cavallaro M., Hebenstreit D., and Finkenstädt B., “Bayesian inference on stochastic gene transcription from flow cytometry data,” Bioinformatics 34, i647–i655 (2018). 10.1093/bioinformatics/bty568
- 12. Pressé S., Ghosh K., and Dill K. A., “Modeling stochastic dynamics in biochemical systems with feedback using maximum caliber,” J. Phys. Chem. B 115, 6202–6212 (2011). 10.1021/jp111112s
- 13. Gatto F., Ferreira R., and Nielsen J., “Pan-cancer analysis of the metabolic reaction network,” Metab. Eng. 57, 51–62 (2020). 10.1016/j.ymben.2019.09.006
- 14. Moler C. and Van Loan C., “Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later,” SIAM Rev. 45, 3–49 (2003). 10.1137/s00361445024180
- 15. Jazani S., Sgouralis I., and Pressé S., “A method for single molecule tracking using a conventional single-focus confocal setup,” J. Chem. Phys. 150, 114108 (2019). 10.1063/1.5083869
- 16. Bryan IV J. S., Sgouralis I., and Pressé S., “Diffraction-limited molecular cluster quantification with Bayesian nonparametrics,” Nat. Comput. Sci. 2, 102–111 (2022). 10.1038/s43588-022-00197-1
- 17. Sgouralis I., Whitmore M., Lapidus L., Comstock M. J., and Pressé S., “Single molecule force spectroscopy at high data acquisition: A Bayesian nonparametric analysis,” J. Chem. Phys. 148, 123320 (2018). 10.1063/1.5008842
- 18. Sukys A., Öcal K., and Grima R., “Approximating solutions of the chemical master equation using neural networks,” iScience 25, 105010 (2022). 10.1016/j.isci.2022.105010
- 19. Saurabh A., Fazel M., Safar M., Sgouralis I., and Pressé S., “Single-photon smFRET. I: Theory and conceptual basis,” Biophys. Rep. 3, 100089 (2023). 10.1016/j.bpr.2022.100089
- 20. Saurabh A., Safar M., Fazel M., Sgouralis I., and Pressé S., “Single-photon smFRET: II. Application to continuous illumination,” Biophys. Rep. 3, 100087 (2023). 10.1016/j.bpr.2022.100087
- 21. Safar M., Saurabh A., Sarkar B., Fazel M., Ishii K., Tahara T., Sgouralis I., and Pressé S., “Single-photon smFRET. III. Application to pulsed illumination,” Biophys. Rep. 2, 100088 (2022). 10.1016/j.bpr.2022.100088
- 22. Kilic Z., Schweiger M., Moyer C., and Pressé S., “Monte Carlo samplers for efficient network inference,” PLoS Comput. Biol. 19, e1011256 (2023). 10.1371/journal.pcbi.1011256
- 23. Munsky B. and Khammash M., “The finite state projection algorithm for the solution of the chemical master equation,” J. Chem. Phys. 124, 044104 (2006). 10.1063/1.2145882
- 24. Sidje R. and Vo H., “Solving the chemical master equation by a fast adaptive finite state projection based on the stochastic simulation algorithm,” Math. Biosci. 269, 10–16 (2015). 10.1016/j.mbs.2015.08.010
- 25. Vo H. and Sidje R., “Implementation of variable parameters in the Krylov-based finite state projection for solving the chemical master equation,” Appl. Math. Comput. 293, 334–344 (2017). 10.1016/j.amc.2016.08.013
- 26. Gupta A., Mikelson J., and Khammash M., “A finite state projection algorithm for the stationary solution of the chemical master equation,” J. Chem. Phys. 147, 044104 (2017). 10.1063/1.5006484
- 27. Grassmann W., “Transient solutions in Markovian queues,” Eur. J. Oper. Res. 1, 396–402 (1977). 10.1016/0377-2217(77)90049-2
- 28. Rao V. and Teh Y. W., “Fast MCMC sampling for Markov jump processes and extensions,” J. Mach. Learn. Res. 14, 3295–3320 (2013).
- 29. Kilic Z., Sgouralis I., and Pressé S., “Generalizing HMMs to continuous time for fast kinetics: Hidden Markov jump processes,” Biophys. J. 120, 409–423 (2021). 10.1016/j.bpj.2020.12.022
- 30. Zhang J., Watson L. T., and Cao Y., “A modified uniformization method for the solution of the chemical master equation,” Comput. Math. Appl. 59, 573–584 (2010). 10.1016/j.camwa.2009.04.021
- 31. Carvalho C. M., Johannes M. S., Lopes H. F., and Polson N. G., “Particle learning and smoothing,” Stat. Sci. 25, 88–106 (2010). 10.1214/10-sts325
- 32. Butcher J., “Numerical methods for ordinary differential equations in the 20th century,” J. Comput. Appl. Math. 125, 1–29 (2000). 10.1016/s0377-0427(00)00455-6
- 33. Gaudreault S., Rainwater G., and Tokman M., “KIOPS: A fast adaptive Krylov subspace solver for exponential integrators,” J. Comput. Phys. 372, 236–255 (2018). 10.1016/j.jcp.2018.06.026
- 34. Vo H. D., Fox Z., Baetica A., and Munsky B., “Bayesian estimation for stochastic gene expression using multifidelity models,” J. Phys. Chem. B 123, 2217–2234 (2019). 10.1021/acs.jpcb.8b10946
- 35. Pessoa P., “Sparse matrices in numba (smn),” https://github.com/PessoaP/smn, 2023.
- 36. Pessoa P., “Avoiding matrix exponential,” https://github.com/PessoaP/AvoidingMatrixExponential, 2023.
- 37. Gillespie D. T., “Exact stochastic simulation of coupled chemical reactions,” J. Phys. Chem. 81, 2340–2361 (1977). 10.1021/j100540a008
- 38. Gelfand A. E. and Smith A. F. M., “Sampling-based approaches to calculating marginal densities,” J. Am. Stat. Assoc. 85, 398–409 (1990). 10.2307/2289776
- 39. Arnoldi W. E., “The principle of minimized iterations in the solution of the matrix eigenvalue problem,” Q. Appl. Math. 9, 17–29 (1951). 10.1090/qam/42792
- 40. Haario H., Saksman E., and Tamminen J., “An adaptive Metropolis algorithm,” Bernoulli 7, 223–242 (2001). 10.2307/3318737
- 41. The SciPy Community, “The scipy.linalg.expm documentation,” https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.expm.html.
- 42. Numba Development Team, “Numba: A just-in-time compiler for numerical functions in Python,” https://numba.pydata.org, 2012.
- 43. Rahman S. and Zenklusen D., “Single-molecule resolution fluorescent in situ hybridization (smFISH) in the yeast S. cerevisiae,” in Imaging Gene Expression (Springer, 2013), pp. 33–46.