Author manuscript; available in PMC: 2025 Aug 16.
Published in final edited form as: J Comput Phys. 2024 Jan;496:112577. doi: 10.1016/j.jcp.2023.112577

Efficient multifidelity likelihood-free Bayesian inference with adaptive computational resource allocation

Thomas P Prescott a,b, David J Warne c, Ruth E Baker d,*
PMCID: PMC7618014  EMSID: EMS207238  PMID: 40822115

Abstract

Likelihood-free Bayesian inference algorithms are popular methods for inferring the parameters of complex stochastic models with intractable likelihoods. These algorithms characteristically rely heavily on repeated model simulations. However, whenever the computational cost of simulation is even moderately expensive, the significant burden incurred by likelihood-free algorithms leaves them infeasible for many practical applications. The multifidelity approach has been introduced in the context of approximate Bayesian computation to reduce the simulation burden of likelihood-free inference without loss of accuracy, by using the information provided by simulating computationally cheap, approximate models in place of the model of interest. In this work we demonstrate that multifidelity techniques can be applied in the general likelihood-free Bayesian inference setting. Analytical results on the optimal allocation of computational resources to simulations at different levels of fidelity are derived, and subsequently implemented practically. We provide an adaptive multifidelity likelihood-free inference algorithm that learns the relationships between models at different fidelities and adapts resource allocation accordingly, and demonstrate that this algorithm produces posterior estimates with near-optimal efficiency.

1. Introduction

Across domains in engineering and science, parametrised mathematical models are often too complex to analyse directly. Instead, many outer-loop applications [1], such as model calibration, optimization, and uncertainty quantification, rely on repeated simulation to understand the relationship between model parameters and behaviour. In time-sensitive and cost-aware applications, the typical computational burden of such simulation-based methods makes them impractical. Multifidelity methods, reviewed by Peherstorfer et al. [1, 2] and Ng and Willcox [3], are a family of approaches that exploit information gathered from simulations, not only of a single model of interest, but also of additional approximate or surrogate models. In this article, the term model refers to the underlying mathematical abstraction of a system in combination with the computer code used to implement simulations. Thus, ‘model approximation’ may refer to mathematical simplifications and/or approximations in numerical methods. The fundamental challenge when implementing multifidelity techniques is the allocation of computational resources between different models, for the purposes of balancing a characteristic trade-off between maintaining accuracy and saving computational burden.

In this work, we consider a specific outer-loop application that arises in Bayesian statistics, the goal of which is to calibrate a parametrised model against observed data. Bayesian inference uses the likelihood of the observed data to update a prior distribution on the model parameters into a posterior distribution, according to Bayes’s rule. In the situation where the likelihood of the data cannot be calculated, we rely on so-called likelihood-free methods that provide estimates of the likelihood by comparing model simulations to data. For example, approximate Bayesian computation (ABC) is a widely-known likelihood-free inference technique [4, 5] where the likelihood is typically estimated as a binary value that records whether or not the distance between a simulation and the observed data falls within a given threshold. Other likelihood-free methods are also available, such as pseudo-marginal methods and Bayesian synthetic likelihoods (BSL) [6]. In this work, we develop a generalised likelihood-free framework for which ABC, pseudo-marginal and BSL can be expressed as specific cases, as described in Section 2.1.

The significant cost of likelihood-free inference has motivated several successful proposals for improving the efficiency of likelihood-free samplers, such as ABC-MCMC [7] and ABC-SMC [8, 9, 10]. These approaches aim to explore parameter space efficiently by avoiding the proposal of low-likelihood parameters, reducing the number of expensive simulations required and improving the ABC acceptance rate. However, an ‘orthogonal’ technique for improving the efficiency of likelihood-free inference is to instead ensure that each simulation-based likelihood estimate is, on average, less computationally expensive to generate.

In previous work, Prescott and Baker [11, 12] investigated multifidelity approaches to likelihood-free Bayesian inference [13], with a specific focus on ABC [4, 5]. Suppose that there exists a low-fidelity approximation to the parametrised model of interest, and that the approximation is relatively cheap to simulate. Monte Carlo estimates of the posterior distribution, with respect to the likelihood of the original high-fidelity model, can be constructed using the simulation outputs of the low-fidelity approximation. Prescott and Baker [11] showed that using the low-fidelity approximation introduces no further bias, so long as, for any parameter proposal, there is a positive probability of simulating the high-fidelity model to check and potentially correct a low-fidelity likelihood estimate. The key to the success of the multifidelity ABC (MF-ABC) approach is to choose this positive probability to be suitably small, thereby simulating the original model as little as possible, while ensuring it is large enough that the variance of the resulting Monte Carlo estimate is suitably small. The result of the multifidelity approach is to reduce the expected cost of estimating the likelihood for each parameter proposal in any Monte Carlo sampling algorithm. In subsequent work, Prescott and Baker [12] showed that this approach integrates with sequential Monte Carlo (SMC) sampling for efficient parameter space exploration [9, 10, 14], and Warne et al. [15] demonstrated its applicability to multilevel Monte Carlo [16, 17, 18]. Thus, the synergistic effect of combining multifidelity with other Monte Carlo schemes to improve the efficiency of ABC has been demonstrated.

Multifidelity ABC can be compared with previous techniques for exploiting model approximation in ABC, such as Preconditioning ABC [19], Lazy ABC (LZ-ABC) [20], and Delayed Acceptance ABC (DA-ABC) [21, 22]. The preconditioning approach seeks to explore parameter space more efficiently, by proposing parameters for high-fidelity simulation with greater low-fidelity posterior mass. In contrast, each of MF-ABC, LZ-ABC, and DA-ABC seeks to make each parameter proposal quicker to evaluate, on average, by using the output of the low-fidelity simulation to directly decide whether to simulate the high-fidelity model. In both LZ-ABC and DA-ABC, a parameter proposal is either (a) rejected early, based on the simulated output of the low-fidelity model, or (b) sent to a high-fidelity simulation, to make a final decision on ABC acceptance or rejection. The distinctive aspect of MF-ABC is that step (a) is different; it is not necessary to reject early to avoid high-fidelity simulation. Instead the low-fidelity simulation can be used to make the accept/reject decision directly. In both DA-ABC and MF-ABC, the decision between (a) or (b) is based solely on whether the low-fidelity simulation would be accepted or rejected. In contrast, LZ-ABC allows for a much more generic decision of whether to simulate the high-fidelity model, requiring an extensive exploration of practical tuning methods.

More generally, there are multifidelity methods that exploit tractable surrogate models and apply subsequent adaptations or transformations to correct for bias in this surrogate. For example, Yan and Zhou [23, 24] adaptively tune surrogate models based on polynomial chaos or deep learning methods, and Bon et al. [25] use a population of surrogates and moment-matching transformations in a similar sense to Warne et al. [19]. While these approaches are of interest, we primarily focus on multifidelity schemes that are strictly simulation-based. However, we note that surrogate likelihood approaches can also be expressed within our theoretical framework.

In this paper, we show that the multifidelity approach can be applied to any simulation-based likelihood-free inference methodology, including but not limited to ABC. We achieve this by developing a generalised framework for likelihood-free inference, and deriving a multifidelity method to operate in this framework. A successful multifidelity likelihood-free inference algorithm requires us to determine how many simulations of the high-fidelity model to perform, based on the parameter value and the simulated output of the low-fidelity model. We provide theoretical results and practical, automated tuning methods to allocate computational resources between two models, designed to optimise the performance of multifidelity likelihood-free importance sampling.

1.1. Outline

In Section 2 we introduce a generalised framework for likelihood-free Bayesian inference, of which standard approaches are shown to be special cases. In Section 3 a general multifidelity likelihood-free importance sampler is constructed, based on the MF-ABC approach of Prescott and Baker [11]. This section also explores how to practically allocate computation between model fidelities, by adaptively evolving the allocation in response to learned relationships between simulations at each fidelity across parameter space. Analysis is presented in Section 4, including the proofs of the main results set out in Section 3.3, in which we determine the optimal allocation of computational resources between the two models to achieve the best possible performance of multifidelity inference. We illustrate adaptive multifidelity inference by applying the algorithm to a fundamental biochemical network motif in Section 5. We show that, using a low-fidelity Michaelis–Menten approximation together with the exact model (both simulated using the exact algorithm of [26]), our adaptive implementation of multifidelity likelihood-free inference can achieve a quantifiable speed-up in constructing posterior estimates to a specified variance and with no additional bias. Code for this example, developed in Julia 1.6.2 [27], is available at github.com/tpprescott/mf-lf. Finally, in Section 6 we discuss how greater improvements may be achieved for more challenging inference tasks.

2. Likelihood-free inference

We consider a stochastic model of the data generating process, defined by a distribution with parametrised probability density function, f (· | θ), where the parameter vector θ takes values in a parameter space Θ. For any θ ∈ Θ, the model induces a probability density, denoted f (y | θ), on observable outputs, with y taking values in an output space 𝒴. We note that the model is usually implemented in computer code to allow simulation, through which outputs y ∈ 𝒴 can be generated. We write y ~ f (· | θ) to denote simulation of the model f given parameter values θ. Taking the experimentally observed data y0 ∈ 𝒴, we define the likelihood function to be a function of θ using the density, ℒ (θ) = f (y0 | θ), of the observed data under this model.

Bayesian inference updates prior knowledge of the parameter values, θ ∈ Θ, which we encode in a prior distribution with density π(θ). The information provided by the experimental data, encoded in the likelihood function, ℒ (θ), is combined with the prior using Bayes’ rule to form a posterior distribution, with density

$$\pi(\theta \mid y_0) = \frac{\mathcal{L}(\theta)\,\pi(\theta)}{Z},$$

where $Z = \int_\Theta \mathcal{L}(\theta)\,\pi(\theta)\,\mathrm{d}\theta$ normalises π(· | y0) to be a probability distribution on Θ. For a given, arbitrary, integrable function G : Θ → ℝ, we take the goal of the inference task to be the production of a Monte Carlo estimate of the posterior expectation,

$$\mathbb{E}(G \mid y_0) = \int_\Theta G(\theta)\,\pi(\theta \mid y_0)\,\mathrm{d}\theta,$$

conditioned on the observed data.

2.1. Approximating the likelihood with simulation

In most practical settings, models tend to be sufficiently complicated that calculating ℒ(θ) = f (y0 | θ) for θ ∈ Θ is intractable. In this case, we exploit the ability to produce independent simulations from the model, y = (y1, …, yK) with yk ~ f (· | θ). In the following, we will slightly abuse notation by using the shorthand y ~ f (· | θ) to represent K independent, identically distributed draws from the parametrised distribution f (· | θ).

Given the observed data, y0, we can define a real-valued function, referred to as a likelihood-free weighting, ω : (θ, y) ↦ ℝ, which varies over the joint space of parameter values and simulation outputs. Here, ω is a function of a parameter value, θ, and a vector, y, of stochastic simulations. For a fixed θ, we can take conditional expectations of ω(θ, y) over the probability density of simulations, f (y | θ), to define an approximate likelihood function,

$$L_\omega(\theta) = \mathbb{E}(\omega \mid \theta) = \int \omega(\theta, y)\, f(y \mid \theta)\,\mathrm{d}y, \tag{1}$$

where ω(θ, y) is chosen such that Lω(θ) is an appropriate approximation to the modelled likelihood function, ℒ(θ). The determination of what is appropriate as an approximation will depend on the implementation or application. For example, weights corresponding to BSL will not be appropriate if the distribution of the summary statistics is highly non-Gaussian. Similarly, within an ABC setting, it is not appropriate to choose a very large discrepancy threshold, as this will lead to Lω(θ) ≈ 1 for any θ due to very few proposals being rejected. For standard likelihood-free methods, such as ABC, BSL and pseudo-marginal methods, the likelihood-free weighting, ω(θ, y), is a random variable (since it is a function of K stochastic simulations) and a Monte Carlo estimate of the approximate likelihood function, Lω(θ). The explicit form of this Monte Carlo estimate is implementation specific (see Appendix A). While we focus here on the Monte Carlo setting, it is worth highlighting that ω(θ, y) may be treated as a deterministic function of θ (and thus independent of any stochastic simulations) in order to implement a likelihood function, surrogate likelihood function, or some alternative loss function. For example, if ω(θ, y) = f (y0 | θ) then Equation (1) reduces to Lω(θ) = ℒ(θ).
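As a concrete illustration of a likelihood-free weighting, the ABC case can be sketched as follows. This is a minimal Python sketch (the paper's own code is in Julia); the Gaussian toy model, the Euclidean distance, and all names here are illustrative assumptions, not the paper's example.

```python
import numpy as np

def abc_weight(theta, ys, y0, epsilon):
    """ABC-style likelihood-free weighting: the fraction of the K simulated
    outputs landing within distance epsilon of the data y0. Averaging over
    simulations gives a Monte Carlo estimate of L_omega(theta)."""
    ys = np.atleast_2d(ys)                      # shape (K, dim)
    dists = np.linalg.norm(ys - y0, axis=1)     # distance of each simulation to y0
    return float(np.mean(dists <= epsilon))

# Toy model (an assumption for illustration): y ~ Normal(theta, 1), data y0 = 0.
rng = np.random.default_rng(1)
theta = 0.5
ys = rng.normal(theta, 1.0, size=(100, 1))
w = abc_weight(theta, ys, np.array([0.0]), epsilon=0.5)
```

As the threshold epsilon grows, every simulation is accepted and the weight saturates at 1, which is the degenerate regime cautioned against above.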

The approximate likelihood function is used to define the likelihood-free approximation to the posterior,

$$\pi_\omega(\theta \mid y_0) = \frac{L_\omega(\theta)\,\pi(\theta)}{Z_\omega},$$

with the normalisation constant $Z_\omega = \int_\Theta L_\omega(\theta)\,\pi(\theta)\,\mathrm{d}\theta$. The likelihood-free approximation to the posterior, πω(θ | y0), subsequently induces a potentially biased approximation of E(G | y0), given by

$$\mathbb{E}_{\pi_\omega}(G \mid y_0) = \int_\Theta G(\theta)\,\pi_\omega(\theta \mid y_0)\,\mathrm{d}\theta.$$

In this situation, the success of likelihood-free inference depends on ensuring that the likelihood-free weighting, ω(θ, y), is chosen such that the squared difference $(\mathbb{E}(G \mid y_0) - \mathbb{E}_{\pi_\omega}(G \mid y_0))^2$ between the posterior expectation, E(G | y0), and its likelihood-free approximation, $\mathbb{E}_{\pi_\omega}(G \mid y_0)$, is as small as possible. Most importantly, the accuracy of the likelihood-free approximation is entirely encoded in the weighting function, and in most, but not all, cases this will be biased. For example, standard likelihood-free methods such as ABC, BSL and pseudo-marginal methods can be implemented using different choices for ω(θ, y) (see Appendix A); however, only ABC and BSL introduce bias.

2.2. Likelihood-free importance sampling

A direct approach to estimating the likelihood-free approximate posterior expectation, Eπω(Gy0), is to use importance sampling. We assume that parameter proposals θi ~ q(·), for i = 1, …, N, can be sampled from a given importance distribution, the support of which must include the prior support, that is, q(θ) ≠ 0 if π(θ) > 0. In practice, we need only know the importance density, q(θ), up to a multiplicative constant. We also assume that we have access to the prior probability density, π(θ).

The likelihood-free importance sampling algorithm is described in Algorithm 1. This algorithm requires the specification of an importance distribution, q, and a likelihood-free weighting, ω(θ, y), with conditional expectation, Lω(θ) = E(ω | θ). The output of Algorithm 1, Ĝ, is an estimate of the likelihood-free approximate posterior expectation, $\mathbb{E}_{\pi_\omega}(G \mid y_0)$. We show (Section 4, Theorem 1) the standard result that Ĝ is a consistent estimate of $\mathbb{E}_{\pi_\omega}(G \mid y_0)$, and quantify the dominant behaviour of the mean-squared error (MSE) in the limit of large sample sizes, N → ∞.

Algorithm 1 Likelihood-free importance sampling.

Require: Prior, π; importance distribution, q; likelihood-free weighting, ω; model f (· | θ); target function, G.

    for i ∈ [1, 2, …, N] do

        Sample θi ~ q(·);

        Simulate yi ~ f (· | θi);

        Calculate weight $w_i = w(\theta_i, y_i) = \frac{\pi(\theta_i)}{q(\theta_i)}\,\omega(\theta_i, y_i)$;

    end for

    Estimate expectation using weighted sum, $\hat{G} = \sum_{i=1}^N w_i G(\theta_i) \Big/ \sum_{j=1}^N w_j$.
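The steps of Algorithm 1 can be sketched in a few lines of Python (the paper's reference implementation is in Julia). Everything below other than the algorithm's structure is an illustrative assumption: the Gaussian toy model, the ABC-style indicator weighting, and the choice prior = importance distribution = N(0, 1).

```python
import numpy as np

def lf_importance_sampling(prior_pdf, q_sample, q_pdf, simulate, weight_fn, G, N, rng):
    """Likelihood-free importance sampling (Algorithm 1), as a sketch.
    `simulate(theta, rng)` draws y ~ f(.|theta); `weight_fn(theta, y)` is omega."""
    ws, gs = np.empty(N), np.empty(N)
    for i in range(N):
        theta = q_sample(rng)                              # theta_i ~ q(.)
        y = simulate(theta, rng)                           # y_i ~ f(.|theta_i)
        ws[i] = prior_pdf(theta) / q_pdf(theta) * weight_fn(theta, y)
        gs[i] = G(theta)
    return np.sum(ws * gs) / np.sum(ws)                    # self-normalised estimate G-hat

# Toy check: f(.|theta) = N(theta, 1), observed y0 = 0, ABC indicator weighting.
rng = np.random.default_rng(0)
sim = lambda th, r: r.normal(th, 1.0)
omega = lambda th, y: float(abs(y - 0.0) <= 0.5)           # epsilon = 0.5
phi = lambda th: np.exp(-th**2 / 2) / np.sqrt(2 * np.pi)   # prior = q = N(0,1)
ghat = lf_importance_sampling(phi, lambda r: r.normal(), phi, sim, omega,
                              lambda th: th, 5000, rng)
```

With G(θ) = θ and symmetric prior and data, the estimated posterior mean should sit near zero.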

In obtaining the consistency result, we also determine the leading-order behaviour of the MSE of the output of Algorithm 1 in terms of sample size,

$$\mathbb{E}\left(\left(\hat{G} - \mathbb{E}_{\pi_\omega}(G \mid y_0)\right)^2\right) = \left[\frac{\mathbb{E}\left(W^2\left(G(\theta) - \mathbb{E}_{\pi_\omega}(G \mid y_0)\right)^2\right)}{\mathbb{E}(W)^2}\right] \frac{1}{N} + O\!\left(\frac{1}{N^2}\right), \tag{2}$$

where W is the random variable for the value of the importance weight, w, as defined in Algorithm 1. For later formulations it is more useful to denote by ci the random computational cost of the K stochastic simulations, yi ~ f (· | θi), and then to consider a fixed computational budget, Ctot, rather than a fixed sample size. That is, we take N as the largest index i for which $\sum_{j=1}^{i} c_j \le C_{\text{tot}}$. We can also quantify the performance of this algorithm in terms of how the MSE decreases as the overall computational budget increases. To leading order, this is given by (Section 4, Corollary 2)

$$\mathbb{E}\left(\left(\hat{G} - \mathbb{E}_{\pi_\omega}(G \mid y_0)\right)^2\right) = \left[\mathbb{E}(C)\,\frac{\mathbb{E}\left(W^2\left(G(\theta) - \mathbb{E}_{\pi_\omega}(G \mid y_0)\right)^2\right)}{\mathbb{E}(W)^2}\right] \frac{1}{C_{\text{tot}}} + O\!\left(\frac{1}{C_{\text{tot}}^2}\right). \tag{3}$$

We can use the leading-order coefficient of $1/C_{\text{tot}}$ in Equation (3) to quantify the performance of likelihood-free importance sampling. Importantly, this expression explicitly depends on the expected computational cost, C, of each iteration of Algorithm 1 and the variance of the weighted errors, $W\left(G(\theta) - \mathbb{E}_{\pi_\omega}(G \mid y_0)\right)$. In the importance sampling context, the optimal importance distribution, q, should minimise this coefficient. This is achieved by minimising the numerator, $\mathbb{E}(C)\,\mathbb{E}\left(W^2\left(G(\theta) - \mathbb{E}_{\pi_\omega}(G \mid y_0)\right)^2\right)$, that is, by trading off a preference for parameter values with lower computational burden against ensuring small variability in the weighted errors. However, in addition to tuning the proposal mechanism, q, Equation (3) also provides insight into how approximations might be used to directly reduce its leading-order coefficient, based on the identified trade-off between decreasing the expected computational burden and controlling the variance of the weighted error. We will not consider the tuning of q any further in this work, but rather highlight that the trade-off presented here motivates the formulation and optimisation of a generalised multifidelity likelihood-free inference scheme.

3. Multifidelity inference

In Equation (3), the performance of Algorithm 1 is quantified explicitly in terms of how the Monte Carlo error between the estimate, Ĝ, and the approximated posterior mean, $\mathbb{E}_{\pi_\omega}(G \mid y_0)$, decays with increasing computational budget, Ctot. It initially appears reasonable to conclude that the linear dependence of the performance on the expected iteration time, E(C), implies that if we can speed up the simulation step of Algorithm 1, then we can significantly reduce the MSE for a given computational budget.

Suppose that there exists an alternative model that we can use in Algorithm 1 in place of the original model, f (· | θ), such that the expected computation time for each iteration, E(C), is significantly reduced. There are two important issues that prevent this being a viable option for improving the efficiency of likelihood-free inference. The first problem is that we need to be able to quantify the effect of the alternative model on the ratio $\mathbb{E}\left(W^2\left(G(\theta) - \mathbb{E}_{\pi_\omega}(G \mid y_0)\right)^2\right)/\mathbb{E}(W)^2$ to ensure that the overall performance of the algorithm is improved. It is not sufficient to show that the computational burden of each iteration is reduced, since it is possible that substantially more iterations are subsequently required to achieve a specified MSE.

The second problem arises from the observation that the limiting value of Ĝ, as output from Algorithm 1, is $\mathbb{E}_{\pi_\omega}(G \mid y_0)$, with residual bias,

$$\lim_{C_{\text{tot}} \to \infty} \mathbb{E}\left(\left(\hat{G} - \mathbb{E}(G \mid y_0)\right)^2\right) = \left(\mathbb{E}_{\pi_\omega}(G \mid y_0) - \mathbb{E}(G \mid y_0)\right)^2 \ge 0,$$

recalling that $\mathbb{E}_{\pi_\omega}(G \mid y_0)$ is the approximate posterior expectation induced by Lω(θ) = E(ω | θ), and the approximand, E(G | y0), is the posterior expectation induced by the likelihood, ℒ(θ) = f (y0 | θ). We will identify this limiting residual squared bias, $\left(\mathbb{E}_{\pi_\omega}(G \mid y_0) - \mathbb{E}(G \mid y_0)\right)^2$, as the fidelity of the model/likelihood-free weighting pair. We emphasise here that the fidelity depends both on the model and the likelihood-free weighting used in Algorithm 1, and is contextual to the target function, G. For a given posterior mean, E(G | y0), a model and likelihood-free weighting pair for which the value of $\left(\mathbb{E}_{\pi_\omega}(G \mid y_0) - \mathbb{E}(G \mid y_0)\right)^2$ is small will be termed high-fidelity, while pairs with larger values will be termed low-fidelity. Thus, if we use an alternative model in place of f in Algorithm 1, the model (and likelihood-free weighting) may be too low-fidelity, in the sense of having too large a residual squared bias versus the posterior expectation of interest, E(G | y0).

The multifidelity framework overcomes both these problems, by removing the need for a binary choice between the expensive model of interest and its cheaper alternative. Instead, we carry out likelihood-free inference using information from both models. In the sections that follow, we will only consider two levels of model fidelity. However, in Section 6, we discuss possible extensions for a truly multifidelity setting with multiple approximations as our approach need not be restricted to only two models.

3.1. Multifidelity likelihood-free importance sampling

We denote the high-fidelity model and likelihood-free weighting as fhi and ωhi, respectively. The likelihood under the high-fidelity model is denoted ℒhi(θ) = fhi(y0 | θ), and is assumed to be intractable. Following the notation introduced in Equation (1), the high-fidelity pair fhi and ωhi induce the approximate likelihood, $L_{\omega_{\text{hi}}}(\theta) = \mathbb{E}(\omega_{\text{hi}} \mid \theta)$, and the corresponding likelihood-free approximation to the posterior expectation, $\mathbb{E}_{\pi_{\omega_{\text{hi}}}}(G \mid y_0)$. We further assume that simulating each yhi ~ fhi(· | θ) is computationally expensive. This computational expense motivates the use of an approximate, low-fidelity model and likelihood-free weighting, denoted flo and ωlo, respectively, inducing the approximate likelihood, $L_{\omega_{\text{lo}}}(\theta) = \mathbb{E}(\omega_{\text{lo}} \mid \theta)$, and corresponding likelihood-free approximation to the posterior expectation, $\mathbb{E}_{\pi_{\omega_{\text{lo}}}}(G \mid y_0)$. We note that the low-fidelity model, flo, induces its own likelihood, ℒlo(θ) = flo(y0 | θ), and assume that this remains intractable, requiring a simulation-based Bayesian approach. However, we assume that simulations of the low-fidelity model, ylo ~ flo(· | θ), are significantly cheaper to produce than simulations of the high-fidelity model, that is, E(Clo)/E(Chi) ≪ 1, where Clo and Chi denote the random computational time of K simulations from the low-fidelity and high-fidelity model, respectively.

Given the models flo and fhi, we will term the joint distribution fmf(ylo, yhi | θ) a multifidelity model when fmf has marginals equal to the low- and high-fidelity densities, flo(ylo | θ) and fhi(yhi | θ). The models may be conditionally independent, such that fmf(ylo, yhi | θ) = flo(ylo | θ) fhi(yhi | θ), in which case simulations at each model fidelity can be carried out independently given θ. Furthermore, if the simulations are conditionally independent, this means that the resulting likelihood-free weights, ωlo(θ, ylo) and ωhi(θ, yhi), are also conditionally independent. However, in the more general definition of the multifidelity model as a joint distribution, we allow for coupling between the two fidelities. Conditioned on the low-fidelity simulations, ylo, and on parameter values, θ, we can produce a coupled simulation, yhi, from the density fhi(yhi | θ, ylo) implied by fmf(ylo, yhi | θ) = fhi(yhi | θ, ylo) flo(ylo | θ). If appropriately constructed, coupling imposes positive correlations between the resulting likelihood-free weights, ωlo(θ, ylo) and ωhi(θ, yhi), to enable values of ωlo to provide more information about unknown values of ωhi, thereby acting as a variance reduction technique [28]. Implementation of such a variance reduction approach results in a construction related to a randomised multilevel Monte Carlo scheme [29, 30, 15]. It should be noted, however, that the success of our multifidelity scheme does not strictly rely on finding an appropriate coupling mechanism. This is an advantage of our approach since such coupling mechanisms can be challenging to construct.
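The coupling idea can be illustrated with common random numbers: if the low- and high-fidelity simulators consume shared random inputs, their outputs (and hence the weights ωlo and ωhi) become positively correlated. The following Python sketch uses two illustrative stand-in models, not the paper's biochemical example; the specific noise structure is an assumption made purely to exhibit the correlation.

```python
import numpy as np

def coupled_mf_simulate(theta, rng):
    """Couple low- and high-fidelity simulators through shared random inputs
    (common random numbers). Both models below are illustrative stand-ins:
    the 'cheap' model uses one shared noise term, the 'expensive' one uses all."""
    u = rng.standard_normal(4)        # shared noise, drawn once
    y_lo = theta + u[0]               # low-fidelity output
    y_hi = theta + np.sum(u) / 2.0    # high-fidelity output, reuses u[0]
    return y_lo, y_hi

rng = np.random.default_rng(2)
pairs = np.array([coupled_mf_simulate(1.0, rng) for _ in range(2000)])
corr = np.corrcoef(pairs[:, 0], pairs[:, 1])[0, 1]   # positive by construction
```

Here the shared term u[0] induces a population correlation of 0.5 between the two outputs; an uncoupled (conditionally independent) pair would have correlation zero.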

We can calculate a multifidelity likelihood-free weighting as follows. Let M be any non-negative integer-valued random variable, with conditional probability mass function p(· | θ, ylo), and with a positive conditional mean, µ(θ, ylo) = E(M | θ, ylo) > 0. Given a parameter value, θ, we define z = (ylo, yhi,1, yhi,2, …, yhi,m), with ylo ~ flo(· | θ), m ~ p(· | θ, ylo), and yhi,i ~ fhi(· | θ, ylo), noting that each yhi,i may be coupled to the low-fidelity simulation ylo ~ flo(· | θ). We further define the multifidelity likelihood-free weighting function,

$$\omega_{\text{mf}}(\theta, z) = \omega_{\text{lo}}(\theta, y_{\text{lo}}) + \frac{1}{\mu(\theta, y_{\text{lo}})} \sum_{i=1}^{m} \left[\omega_{\text{hi}}(\theta, y_{\text{hi},i}) - \omega_{\text{lo}}(\theta, y_{\text{lo}})\right], \tag{4}$$

as the low-fidelity likelihood-free weighting, corrected by a randomly drawn number, M = m, of conditionally independent high-fidelity likelihood-free weightings. We write the density of z as ϕ(z | θ) and take expectations over z to obtain

$$L_{\omega_{\text{mf}}}(\theta) = \mathbb{E}(\omega_{\text{mf}} \mid \theta) = \int \omega_{\text{mf}}(\theta, z)\,\phi(z \mid \theta)\,\mathrm{d}z \tag{5}$$

as the multifidelity approximation to the likelihood.

Given M = m, only m replicates of yhi,i ~ fhi(· | θ, ylo) need to be simulated for ωmf(θ, z) to be evaluated. Thus, whenever m = 0, this means that no high-fidelity simulations need to be completed for ωmf(θ, z) to be calculated, removing the high-fidelity simulation cost from that iteration. Algorithm 2 presents the adaptation of the basic importance sampling method of Algorithm 1 to incorporate the multifidelity weighting function. The simulation step, y ~ f (· | θ), in Algorithm 1 is replaced by the MF-Simulate function in Algorithm 2.

Algorithm 2 Multifidelity likelihood-free importance sampling.

Require: Prior, π; importance distribution, q; likelihood-free weightings, ωhi and ωlo; models fhi(· | θ) and flo(· | θ); conditional probability mass function p(· | θ, ylo) on non-negative integers with mean function µ(θ, ylo); target function, G.

    for i ∈ [1, 2, …, N] do

        Sample θi ~ q(·);

        Generate zi ~ ϕ(· | θi) from MF-Simulate(θi);

        For ωmf in Equation (4), calculate the weight $w_i = w_{\text{mf}}(\theta_i, z_i) = \frac{\pi(\theta_i)}{q(\theta_i)}\,\omega_{\text{mf}}(\theta_i, z_i)$;

    end for

    Estimate expectation using weighted sum, $\hat{G}_{\text{mf}} = \sum_{i=1}^N w_i G(\theta_i) \Big/ \sum_{j=1}^N w_j$.

    function MF-Simulate(θ)

        Simulate ylo ~ flo(· | θ);

        Generate m ~ p(· | θ, ylo) with mean µ(θ, ylo);

        if m = 0 then

            return z = (ylo);

        else

            for i∈ [1, 2, …, m] do

                Simulate yhi,i ~ fhi(· | θ, ylo);

            end for

            return z = (ylo, yhi,1, …, yhi,m);

        end if

    end function
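The MF-Simulate step and the weighting of Equation (4) can be sketched together in Python (the paper's implementation is in Julia). The toy ABC-style weightings (a loose low-fidelity threshold and a tight high-fidelity one), the Gaussian simulators, and the constant choice µ = 0.3 are all illustrative assumptions; the unbiasedness property itself follows from Equation (4).

```python
import numpy as np

def mf_weight(theta, omega_lo, omega_hi, sim_lo, sim_hi, mu_fn, rng):
    """One multifidelity weight, following Equation (4): simulate the
    low-fidelity model, draw m ~ Poisson(mu(theta, y_lo)), and correct the
    low-fidelity weight with m (possibly zero) high-fidelity simulations."""
    y_lo = sim_lo(theta, rng)
    mu = mu_fn(theta, y_lo)
    m = rng.poisson(mu)
    w_lo = omega_lo(theta, y_lo)
    correction = 0.0
    for _ in range(m):                        # skipped entirely when m == 0
        y_hi = sim_hi(theta, y_lo, rng)       # may be coupled to y_lo
        correction += omega_hi(theta, y_hi) - w_lo
    return w_lo + correction / mu

# Unbiasedness check: E[w_mf | theta] should equal E[omega_hi | theta].
rng = np.random.default_rng(3)
o_lo = lambda th, y: float(abs(y) <= 1.0)     # loose ABC threshold (cheap proxy)
o_hi = lambda th, y: float(abs(y) <= 0.5)     # tight ABC threshold
s_lo = lambda th, r: r.normal(th, 1.0)
s_hi = lambda th, y_lo, r: r.normal(th, 1.0)  # conditionally independent here
ws = [mf_weight(0.0, o_lo, o_hi, s_lo, s_hi, lambda th, y: 0.3, rng)
      for _ in range(40000)]
```

With µ = 0.3, roughly three quarters of iterations draw m = 0 and avoid the high-fidelity simulation entirely, yet the average weight still matches the high-fidelity target P(|N(0,1)| ≤ 0.5) ≈ 0.383.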

3.2. Accuracy of multifidelity inference

We observe that using fhi and ωhi in Algorithm 1 produces an estimate of the high-fidelity approximate posterior expectation, $\mathbb{E}_{\pi_{\omega_{\text{hi}}}}(G \mid y_0)$. We show (Section 4, Proposition 3) that the multifidelity approximate likelihood, $L_{\omega_{\text{mf}}}(\theta) = \mathbb{E}(\omega_{\text{mf}}(\theta, z) \mid \theta)$, is equal to the high-fidelity approximate likelihood, $L_{\omega_{\text{hi}}}(\theta) = \mathbb{E}(\omega_{\text{hi}}(\theta, y_{\text{hi}}) \mid \theta)$. As a result, Algorithm 2 also produces a consistent estimate of the high-fidelity approximate posterior expectation, $\mathbb{E}_{\pi_{\omega_{\text{hi}}}}(G \mid y_0)$.

In the limit of infinite computational budgets, the estimate produced by multifidelity importance sampling in Algorithm 2 is as accurate as the estimate produced by high-fidelity importance sampling in Algorithm 1 using fhi and ωhi. However, we still need to show that the performance of Algorithm 2 exceeds that of Algorithm 1 in the practical context of limited computational budgets. In Section 3.3, we introduce a method to quantify the performance of Algorithms 1 and 2 and show that the performance of multifidelity inference is strongly determined by the distribution of M, the random number of high-fidelity simulations required at each iteration.

3.3. Comparing performance

Equation (3) gives the leading-order behaviour of the MSE for Algorithm 1 as the computational budget increases. A similar result applies to the output of Algorithm 2. We compare two settings: first, using Algorithm 1 with the high-fidelity model; and second, using Algorithm 2 with the multifidelity model. In the first setting we have the model, fhi, and likelihood-free weighting, ωhi. Each iteration has computational cost denoted Chi, and produces a weighted Monte Carlo sample with weights wi as independent draws of the random variable Whi. The output of Algorithm 1 is denoted Ĝhi. The MSE for Algorithm 1 has leading-order behaviour

$$\mathbb{E}\left(\left(\hat{G}_{\text{hi}} - \mathbb{E}_{\pi_{\omega_{\text{hi}}}}(G \mid y_0)\right)^2\right) = \left[\mathbb{E}(C_{\text{hi}})\,\frac{\mathbb{E}\left(W_{\text{hi}}^2\left(G(\theta) - \mathbb{E}_{\pi_{\omega_{\text{hi}}}}(G \mid y_0)\right)^2\right)}{\mathbb{E}(W_{\text{hi}})^2}\right] \frac{1}{C_{\text{tot}}} + O\!\left(\frac{1}{C_{\text{tot}}^2}\right), \tag{6}$$

as the total simulation budget Ctot → ∞. In the second setting, we similarly use Algorithm 2 with the multifidelity model fmf and likelihood-free weighting, ωmf. Each iteration has computational cost denoted Cmf, and produces a weighted Monte Carlo sample with weights wi as independent draws of the random variable Wmf. The output of Algorithm 2 is Ĝmf. The MSE for Algorithm 2 has leading-order behaviour

$$\mathbb{E}\left(\left(\hat{G}_{\text{mf}} - \mathbb{E}_{\pi_{\omega_{\text{mf}}}}(G \mid y_0)\right)^2\right) = \left[\mathbb{E}(C_{\text{mf}})\,\frac{\mathbb{E}\left(W_{\text{mf}}^2\left(G(\theta) - \mathbb{E}_{\pi_{\omega_{\text{mf}}}}(G \mid y_0)\right)^2\right)}{\mathbb{E}(W_{\text{mf}})^2}\right] \frac{1}{C_{\text{tot}}} + O\!\left(\frac{1}{C_{\text{tot}}^2}\right), \tag{7}$$

as the total simulation budget Ctot → ∞. Thus the main task is to determine the conditions when

$$\mathbb{E}\left(\left(\hat{G}_{\text{mf}} - \mathbb{E}_{\pi_{\omega_{\text{mf}}}}(G \mid y_0)\right)^2\right) < \mathbb{E}\left(\left(\hat{G}_{\text{hi}} - \mathbb{E}_{\pi_{\omega_{\text{hi}}}}(G \mid y_0)\right)^2\right)$$

for the same value of Ctot.

The main result of the paper (Section 4, Theorem 4) provides such conditions and enables the construction of performance metrics. The performance metrics 𝒥hi and 𝒥mf[µ], for Algorithm 1 and Algorithm 2, respectively, are given by

$$\mathcal{J}_{\text{hi}} = \underbrace{\mathbb{E}(C_{\text{hi}})}_{\text{high-fidelity cost}} \times \underbrace{\mathbb{E}\left(\Delta_q(\theta)^2\, \mathbb{E}\left(\omega_{\text{hi}}^2 \mid \theta\right)\right)}_{\text{high-fidelity variance}}, \tag{8}$$
$$\mathcal{J}_{\text{mf}}[\mu] = \underbrace{\left(\mathbb{E}(C_{\text{lo}}) + \mathbb{E}_\rho\left(\mu(\theta, y_{\text{lo}})\, \mathbb{E}(C_{\text{hi}} \mid \theta, y_{\text{lo}})\right)\right)}_{\text{multifidelity cost}} \times \underbrace{\left(\mathbb{E}\left(\Delta_q(\theta)^2\, \mathbb{E}\left(\mathbb{E}(\omega_{\text{hi}} \mid \theta, y_{\text{lo}})^2 \mid \theta\right)\right) + \mathbb{E}_\rho\left(\frac{\Delta_q(\theta)^2}{\mu(\theta, y_{\text{lo}})}\, \mathbb{E}\left((\omega_{\text{hi}} - \omega_{\text{lo}})^2 \mid \theta, y_{\text{lo}}\right)\right)\right)}_{\text{multifidelity variance}}, \tag{9}$$

where M is a Poisson distributed random variable with mean µ(θ, ylo), $\Delta_q(\theta) = \left(G(\theta) - \mathbb{E}_{\pi_{\omega_{\text{hi}}}}(G \mid y_0)\right)\pi(\theta)/q(\theta)$ is the importance weighted error, and ρ(θ, ylo) = flo(ylo | θ)q(θ) is the joint density of the low-fidelity simulator output and the parameters under the importance distribution. Effectively, both metrics are reformulations of the numerators of the leading-order terms in Equations (6) and (7), respectively.

The condition 𝒥mf[µ] < 𝒥hi is shown to be a necessary and sufficient condition for Algorithm 2 outperforming Algorithm 1 for the same computational budget (Theorem 4). Importantly, the performance depends explicitly on the free choice of the function µ(θ, ylo) that determines the conditional mean of the Poisson-distributed number of high-fidelity simulations required at each iteration. We observe from the first factor in Equation (9) that, when µ is smaller, the total simulation cost is lower. However, the second factor of Equation (9) implies that, as µ decreases, the variability of the likelihood-free weighting can increase without bound, which can severely damage the performance. Thus, Equation (9) illustrates the characteristic multifidelity trade-off between reducing the simulation burden while also controlling the increase in sample variance. In the results that follow, we continue to assume that M is a Poisson distributed random variable; considerations for binomial and geometric distributions are given in Appendix B.

Using classical results from the calculus of variations, it is possible to determine the mean function that achieves optimal performance of Algorithm 2, in the sense of minimising the functional 𝒥mf[µ]. This optimal function, µ*, is given by

$$\mu^*(\theta,y_{\mathrm{lo}})^2 = \frac{E(C_{\mathrm{lo}})\,\Delta_q(\theta)^2\,E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big)}{E(C_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})\,E\!\left(\Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\right)}. \tag{10}$$

This result is derived in Section 4 (Lemma 5). Note that as $E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big)$ increases, µ* increases. This means that if the expected error between the likelihood-free weightings is large, then the high-fidelity model needs to be simulated more frequently to correct for this. Conversely, when $E(C_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})$ is larger, the greater simulation time of the high-fidelity model means that µ* should be smaller, reducing the requirement for the most expensive simulations. Intuitively, µ* acts to balance the trade-off between controlling simulation cost and variance identified in Equation (9) above.
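The trade-off encoded in Equation (9) and the optimiser in Equation (10) can be sanity-checked numerically by collapsing all of the expectations to scalar constants. The values of `c_lo`, `c_hi`, `V_mf` and `eta` below are invented purely for illustration:

```python
import numpy as np

# Illustrative scalar stand-ins for the expectations in Equation (9):
c_lo = 0.01   # E(C_lo): low-fidelity cost
c_hi = 1.0    # E(C_hi | theta, y_lo): coupled high-fidelity cost
V_mf = 1.0    # baseline multifidelity variance term
eta  = 0.04   # Delta_q^2 * E((w_hi - w_lo)^2 | theta, y_lo): weighting discrepancy

def J_mf(mu):
    """Scalar analogue of Equation (9): (cost) x (variance)."""
    return (c_lo + mu * c_hi) * (V_mf + eta / mu)

# Scalar analogue of Equation (10): mu*^2 = c_lo * eta / (c_hi * V_mf)
mu_star = np.sqrt(c_lo * eta / (c_hi * V_mf))

# Confirm against a fine grid search over mu
grid = np.linspace(1e-4, 1.0, 200_000)
mu_grid = grid[np.argmin(J_mf(grid))]
print(mu_star, mu_grid)   # both near mu* = 0.02
```

Decreasing µ below µ* saves cost but inflates the η/µ variance term faster than the cost falls; increasing it does the reverse, exactly the balance discussed above.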

Of course, it need not be the case that the optimal mean function, µ*, in Equation (10) leads to 𝒥mf[µ*] < 𝒥hi. For this to occur we require (see Corollary 6)

$$\left(\frac{E(C_{\mathrm{lo}})}{E(C_{\mathrm{hi}})}\,\frac{E\!\left(\Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\right)}{E\!\left(\Delta_q(\theta)^2\,E(\omega_{\mathrm{hi}}^2 \mid \theta)\right)}\right)^{1/2} + E_{\rho}\!\left(\left(\frac{\Delta_q(\theta)^2\,E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big)}{E\!\left(\Delta_q(\theta)^2\,E(\omega_{\mathrm{hi}}^2 \mid \theta)\right)}\,\frac{E(C_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})}{E(C_{\mathrm{hi}})}\right)^{1/2}\right) < 1. \tag{11}$$

The first term in Equation (11) is small under the key assumption that the average computational cost of the low-fidelity model is much smaller than that of the high-fidelity model, $E(C_{\mathrm{lo}})/E(C_{\mathrm{hi}}) \ll 1$. The second term is a measure of the total detriment to the performance of Algorithm 2 incurred by the inaccuracy of ωlo, relative to ωhi, as a Monte Carlo estimate of $L_{\omega_{\mathrm{hi}}}$, as quantified by $E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big)$. This condition justifies two key criteria for the success of the multifidelity method: that low-fidelity simulations are significantly cheaper than high-fidelity simulations, and that the likelihood-free weightings, ωhi and ωlo, agree sufficiently well on average. More detailed analysis of necessary and sufficient conditions for the existence of a suitable µ is given in Section 4, following the proof of Corollary 6.

These key results establish not only the asymptotic correctness of the multifidelity approach, but also the situations in which a performance improvement can be expected. The details of the analysis can be found in the proofs of Theorem 4, Lemma 5 and Corollary 6, given in Section 4. However, these analytical results are only useful insofar as the various expectations in Equations (8) and (9) are known. Typically these expectations are unknown a priori and must be estimated. In the following section, we describe how the analytical results of Section 3.3 can be used to construct a heuristic for adaptive multifidelity inference that learns a near-optimal mean function, µ(θ, ylo), as simulations at each fidelity are completed.

3.4. Practical implementation

In Section 3.3, we derived the optimal mean function for the Poisson distribution of the number of high-fidelity simulations, M, generated in an iteration of Algorithm 2, conditioned on the parameter value, θ, and low-fidelity simulation, ylo. In this section, we describe a practical approach to determining a near-optimal mean function for use in multifidelity likelihood-free inference. We rely on two approximations, relative to the analytically optimal mean function, µ*, given in Equation (10). First, we constrain the optimisation of 𝒥mf to the space of functions µ𝒟 that are piecewise constant on an arbitrary, given, finite partition, 𝒟, of the global space of (θ, ylo) values. The resulting optimisation problem is therefore finite-dimensional. Although this optimisation can be solved analytically, its estimation, being based on ratios of simulation-based Monte Carlo estimates, is numerically unstable. This motivates a second approximation, which is to follow a gradient-descent approach that allows the mean function to adaptively converge towards the optimum.

We constrain the space of mean functions, µ, to be piecewise constant. Consider an arbitrary, given collection 𝒟 = {Dk | k = 1, …, K} of ρ-integrable sets that partition the global space of (θ, ylo) values. We denote a 𝒟-piecewise constant function, parameterised by the vector ν = (ν1, …, νK), as

$$\mu_{\mathcal{D}}(\theta,y_{\mathrm{lo}};\nu) = \sum_{k=1}^{K} \nu_k\,\mathbb{I}\big((\theta,y_{\mathrm{lo}})\in D_k\big).$$

Substituting this function into Equation (9), we can quantify the performance of Algorithm 2, using the mean defined by µ𝒟(θ, ylo; ν), as the parameterised product

$$\mathcal{J}_{\mathcal{D}}(\nu) = \left(E(C_{\mathrm{lo}}) + \sum_{k=1}^{K}\nu_k C_k\right)\left(E\!\left(\Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\right) + \sum_{k=1}^{K}\frac{V_k}{\nu_k}\right), \tag{12}$$

where, for convenience in this derivation, we denote

$$C_k = E_{\rho}\Big(E(C_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})\,\mathbb{I}\big((\theta,y_{\mathrm{lo}})\in D_k\big)\Big), \qquad V_k = E_{\rho}\Big(\Delta_q(\theta)^2\,E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big)\,\mathbb{I}\big((\theta,y_{\mathrm{lo}})\in D_k\big)\Big),$$

for k = 1, 2, …, K.

Just as in the case of the general µ*, we can optimise the function 𝒥𝒟(ν) across positive vectors, ν, to obtain the result

$$\left(\nu_k^*\right)^2 = \frac{V_k\,E(C_{\mathrm{lo}})}{C_k\,E\!\left(\Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\right)}, \quad \text{for } k = 1, 2, \ldots, K. \tag{13}$$

This result is more convenient, since all the terms on the right-hand side are constants. However, we still need to estimate these expectations, as they are unknown a priori. The standard approach would be to perform trial samples and estimate these values with Monte Carlo. This can be problematic in practice due to the rational form of ν_k*, which means that such estimates can be unstable, particularly for sets Dk ∈ 𝒟 with small volume under ρ.

We now consider a conservative approach to determining values for ν that provides stable estimates of ν*. Rather than directly targeting ν* through ratios of highly variable Monte Carlo estimates, we introduce a gradient-descent approach to updating the vector ν. Taking derivatives of 𝒥𝒟 with respect to log νk for k = 1, …, K gives the gradient

$$\frac{\partial \mathcal{J}_{\mathcal{D}}}{\partial \log \nu_k} = \nu_k C_k\left(E\!\left(\Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\right) + \sum_{j=1}^{K}\frac{V_j}{\nu_j}\right) - \frac{V_k}{\nu_k}\left(E(C_{\mathrm{lo}}) + \sum_{j=1}^{K}C_j\nu_j\right).$$

Thus, if we write ν^{(r)} for the value of ν used in iteration r of Algorithm 2, we update to ν^{(r+1)} in the next iteration using gradient descent, such that

$$\log \nu_k^{(r+1)} = \log \nu_k^{(r)} - \delta\left[\nu_k^{(r)} C_k\left(E\!\left(\Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\right) + \sum_{j=1}^{K}\frac{V_j}{\nu_j^{(r)}}\right) - \frac{V_k}{\nu_k^{(r)}}\left(E(C_{\mathrm{lo}}) + \sum_{j=1}^{K}C_j\nu_j^{(r)}\right)\right]. \tag{14}$$

Note that we express this updating rule in terms of $\log \nu_k^{(r)}$ to ensure that each $\nu_k^{(r)}$ remains positive, since the updates to ν^{(r)} are then multiplicative. As is typical of gradient-descent approaches, Equation (14) requires the specification of a step-size hyperparameter, δ. It is straightforward to show that ν* is the unique positive stationary point of Equation (14). Furthermore, since each derivative ∂𝒥𝒟/∂ log νk is quadratic in the expectations, the numerical instability that arises from estimating Equation (13) as a ratio does not occur. In relatively under-sampled regions Dk ∈ 𝒟 with small ρ-volume, the small values of Ck and Vk ensure that convergence to the corresponding estimated optimal value, ν_k*, is more conservative.
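The multiplicative update in Equation (14) is easy to prototype when the expectations Ck, Vk, E(Clo) and the variance term are held at fixed values (the constants below are invented for illustration); the iteration then converges to the optimum given by Equation (13). A minimal sketch with K = 2:

```python
import numpy as np

# Fixed stand-ins for the (in practice unknown) expectations, K = 2 sets:
c_lo, V_mf = 0.1, 1.0
C = np.array([1.0, 2.0])   # C_k: expected high-fidelity cost contribution in D_k
V = np.array([0.4, 0.1])   # V_k: weighting-discrepancy contribution in D_k

log_nu = np.zeros(2)       # initialise log nu_k = 0, as in Algorithm 3
delta = 0.02               # step-size hyperparameter

for _ in range(100_000):
    nu = np.exp(log_nu)
    cost = c_lo + np.sum(C * nu)            # first factor of Equation (12)
    var = V_mf + np.sum(V / nu)             # second factor of Equation (12)
    grad = nu * C * var - (V / nu) * cost   # dJ_D / d(log nu_k)
    log_nu -= delta * grad                  # the update in Equation (14)

nu_opt = np.sqrt(V * c_lo / (C * V_mf))     # stationary point, Equation (13)
print(np.exp(log_nu), nu_opt)
```

Working in log-space keeps every ν_k positive regardless of the step size, which is exactly the motivation given above.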

We now explicitly set out the Monte Carlo estimates of the unknown expectations, given r iterations of the multifidelity inference scheme (Algorithm 2). First, the cost-related expectations are estimated using

$$E(C_{\mathrm{lo}}) \approx \hat{C}_{\mathrm{lo}}^{(r)} = \frac{1}{r}\sum_{i=1}^{r} c_{\mathrm{lo},i}, \tag{15}$$

and for k = 1, 2, …, K,

$$C_k \approx \hat{C}_k^{(r)} = \frac{1}{r}\sum_{i=1}^{r} \mathbb{I}\big((\theta_i,y_{\mathrm{lo},i})\in D_k\big)\,\frac{1}{\mu_i}\sum_{j=1}^{m_i} c_{\mathrm{hi},i,j}, \tag{16}$$

where, for iteration i, c_lo,i is the observed simulation cost of ylo,i given parameters θi drawn from q(θ), and c_hi,i,j is the observed simulation cost of yhi,i,j for j = 1, …, mi, with mi a random draw from the distribution of M having mean µi. The estimators for the variance-related terms are more complex:

$$E\!\left(\Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\right) \approx \hat{V}_{\mathrm{mf}}^{(r)} = \frac{1}{r}\sum_{i=1}^{r}\left(\frac{\Delta_i^{(r)}}{\mu_i}\right)^2\left[\left(\sum_{j=1}^{m_i}\omega_{\mathrm{hi},i,j}\right)^2 - \sum_{j=1}^{m_i}\omega_{\mathrm{hi},i,j}^2\right], \tag{17}$$

and for k = 1, 2, …, K,

$$V_k \approx \hat{V}_k^{(r)} = \frac{1}{r}\sum_{i=1}^{r}\mathbb{I}_{D_k}(\theta_i,y_{\mathrm{lo},i})\,\frac{1}{\mu_i}\sum_{j=1}^{m_i}\left(\Delta_i^{(r)}\big(\omega_{\mathrm{hi},i,j}-\omega_{\mathrm{lo},i}\big)\right)^2, \tag{18}$$

where $\omega_{\mathrm{lo},i} = \omega_{\mathrm{lo}}(\theta_i, y_{\mathrm{lo},i})$, $\omega_{\mathrm{hi},i,j} = \omega_{\mathrm{hi}}(\theta_i, y_{\mathrm{hi},i,j})$ for j = 1, …, mi, and $\Delta_i^{(r)} = \big[G(\theta_i) - \hat{G}_{\mathrm{hi}}^{(r)}\big]\,\pi(\theta_i)/q(\theta_i)$, with $\hat{G}_{\mathrm{hi}}^{(r)}$ the current Monte Carlo estimate of the target $E_{\pi\omega_{\mathrm{hi}}}(G \mid y_0)$.
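The de-biasing in Equation (17) relies on the identity $E\big[(\sum_{j\le M}\omega_j)^2 - \sum_{j\le M}\omega_j^2\big] = E[M(M-1)]\,\lambda^2 = \mu^2\lambda^2$ for M ~ Poi(µ) and conditionally i.i.d. weightings with mean λ, so that dividing by µ² estimates λ² without bias even when each iteration yields only a handful of high-fidelity weightings. A quick simulation check with invented values µ = 2 and indicator weightings of mean λ = 0.6:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, lam = 2.0, 0.6
n_iter = 200_000

m = rng.poisson(mu, size=n_iter)    # M_i ~ Poi(mu)
s = rng.binomial(m, lam)            # sum of m_i Bernoulli(lam) weightings
# For {0,1}-valued weightings, sum_j w_j^2 = sum_j w_j = s, so the bracket
# in Equation (17) is s^2 - s; dividing by mu^2 de-biases towards lam^2.
est = np.mean((s * s - s) / mu**2)
print(est)   # close to lam**2 = 0.36
```

Note that the naive per-iteration estimate (s/m)² would be biased (and undefined when m = 0), which is why Equation (17) takes this particular form.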

These estimates can then be substituted into Equation (14) to produce an updating rule for $\nu_k^{(r)}$, that is,

$$\log \nu_k^{(r+1)} = \log \nu_k^{(r)} - \delta\left[\nu_k^{(r)}\hat{C}_k^{(r)}\left(\hat{V}_{\mathrm{mf}}^{(r)} + \sum_{j=1}^{K}\frac{\hat{V}_j^{(r)}}{\nu_j^{(r)}}\right) - \frac{\hat{V}_k^{(r)}}{\nu_k^{(r)}}\left(\hat{C}_{\mathrm{lo}}^{(r)} + \sum_{j=1}^{K}\hat{C}_j^{(r)}\nu_j^{(r)}\right)\right], \tag{19}$$

leading to an adaptive multifidelity likelihood-free importance sampling scheme, as outlined in Algorithm 3.

In addition to the specification of the step-size hyperparameter, δ, Algorithm 3 also requires a burn-in phase of N0 iterations to initialise the Monte Carlo estimates in Equations (15)–(18). The partition, 𝒟 = {D1, …, DK}, is also an input to Algorithm 3. We defer an investigation of how to choose this partition to future work. For the purposes of this paper, however, we heuristically construct a partition, 𝒟, by fitting a decision tree. We use the burn-in phase of Algorithm 3, over iterations i ≤ N0, and regress the values of

$$\mu_i = \left|\Delta_i^{(N_0)}\right| \sqrt{\frac{\sum_j \big(\omega_{\mathrm{hi},i,j}-\omega_{\mathrm{lo},i}\big)^2}{\sum_j c_{\mathrm{hi},i,j}}},$$

against the features (θi, ylo,i), using the CART algorithm [31] as implemented in DecisionTrees.jl. This regression is motivated by the form of the true optimal mean function, µ*, given in Equation (10). The resulting decision tree defines a partition, 𝒟 = {D1, …, DK}, used to define the piecewise-constant mean function µ𝒟(θ, ylo; ν) over iterations i > N0.

Algorithm 3 Adaptive multifidelity likelihood-free importance sampling.

Require: Prior, π; importance distribution, q; likelihood-free weightings, ωhi and ωlo; models fhi(· | θ) and flo(· | θ); partition 𝒟 = {D1, …, DK} of (θ, ylo) space; adaptation rate, δ; burn-in period, N0; target estimated function, G.

Initialise $\log \nu_k^{(1)} = 0$ for k = 1, …, K.

for i ∈ [1, 2, …, N] do

      Sample θi ~ q(·);

      Generate zi ~ ϕ(· | θi) from MF-Simulate(θi, ν^{(i)});

      For ωmf in Equation (4), calculate the weight $w_i = w_{\mathrm{mf}}(\theta_i, z_i) = \frac{\pi(\theta_i)}{q(\theta_i)}\,\omega_{\mathrm{mf}}(\theta_i, z_i)$;

      if i > N0 then

            Generate ν(i+1) from Update-Nu(ν(i));

      else

            Set ν(i+1) = ν(i);

      end if

end for

Estimate the expectation using the weighted sum, $\hat{G}_{\mathrm{mf}} = \sum_{i=1}^{N} w_i G(\theta_i) \Big/ \sum_{j=1}^{N} w_j$.

function MF-Simulate(θ, ν)

      Simulate ylo ~ flo(· | θ);

      Find k such that (θ, ylo) ∈ Dk;

      Generate m ~ Poi(νk);

      for i = 1, …, m do

            Simulate yhi,i ~ fhi(· | θ, ylo);

      end for

      return z = (ylo, m, yhi,1, …, yhi,m)

end function

function Update-Nu(ν1, …, νK)

      Update the Monte Carlo estimates defined in Equations (15)–(18);

      for k = 1, …, K do

            Increment log νk according to Equation (19);

      end for

      return ν = (ν1, …, νK)

end function
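To make the inner loop concrete, the following sketch implements MF-Simulate and the multifidelity weighting $\omega_{\mathrm{mf}} = \omega_{\mathrm{lo}} + \frac{1}{\nu}\sum_j(\omega_{\mathrm{hi},j} - \omega_{\mathrm{lo}})$ of Equation (4) for a toy Gaussian pair of simulators with a constant mean ν (i.e. a single partition set). The models, thresholds and ν value are invented for illustration; the point is that averaging ωmf recovers the same value as averaging ωhi directly, as guaranteed by Proposition 3:

```python
import numpy as np

rng = np.random.default_rng(42)

def omega(y):
    return float(abs(y) < 1.0)      # indicator likelihood-free weighting

def mf_simulate(nu):
    """One draw of z = (y_lo, m, y_hi_1..m) and its multifidelity weighting."""
    y_lo = rng.normal()                        # toy low-fidelity simulation
    m = rng.poisson(nu)                        # M ~ Poi(nu) high-fidelity draws
    y_hi = y_lo + 0.1 * rng.normal(size=m)     # coupled high-fidelity draws
    w_lo = omega(y_lo)
    # Equation (4): w_mf = w_lo + (1/nu) * sum_j (w_hi_j - w_lo)
    return w_lo + sum(omega(y) - w_lo for y in y_hi) / nu

N = 200_000
w_mf_mean = np.mean([mf_simulate(nu=0.5) for _ in range(N)])

# Direct high-fidelity Monte Carlo for comparison
y = rng.normal(size=N) + 0.1 * rng.normal(size=N)
w_hi_mean = np.mean(np.abs(y) < 1.0)
print(w_mf_mean, w_hi_mean)   # both approximate the same quantity
```

With ν = 0.5, most iterations require no high-fidelity simulation at all, yet the occasional (1/ν)-weighted corrections keep the estimate consistent with the high-fidelity target.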

4. Analysis of likelihood-free and multifidelity importance sampling

In this section, we detail the theoretical analysis of the general likelihood-free importance sampling scheme and the multifidelity schemes. The results in this section are referred to throughout Sections 2 and 3; however, the notation in this section may differ slightly to keep the analysis concise. Throughout, the relation f(x) ≼ g(x) is taken to mean that there exists a constant, c, such that f(x) ≤ c g(x) as x → ∞. Likewise, f(x) ≺ g(x) is taken to mean that there exists a constant, c, such that f(x) < c g(x) as x → ∞.

The first result, in Theorem 1, establishes the consistency of the general likelihood-free importance sampling scheme (Algorithm 1) and in doing so derives the convergence rate in MSE. For notational simplicity, we define the function $\Delta(\theta) = G(\theta) - E_{\pi\omega}(G \mid y_0)$ to recentre the approximate posterior mean.

Theorem 1

For the weighted sample values (θi, wi) produced in each iteration of Algorithm 1, let W denote the random value of the weight wi, and let θ denote the random value of θi. The mean squared error (MSE) of the estimator, Ĝ, is given to leading order by

$$E\!\left(\left(\hat{G} - E_{\pi\omega}(G \mid y_0)\right)^2\right) \preceq \left[\frac{E(W^2\Delta^2)}{E(W)^2}\right]\frac{1}{N},$$

as N → ∞. Thus, Ĝ is a consistent estimator of Eπω(G|y0).

Proof. The Monte Carlo estimate produced by Algorithm 1, Ĝ = R/S, is the ratio of two random variables: the weighted sum, $R = \sum_{i=1}^{N} w(\theta_i, y_i)\,G(\theta_i)$, and the normalising sum, $S = \sum_{i=1}^{N} w(\theta_i, y_i)$. We write the function $\Phi(r,s) = \big(r/s - E_{\pi\omega}(G \mid y_0)\big)^2$, and note that the MSE is the expected value of $\big(\hat{G} - E_{\pi\omega}(G \mid y_0)\big)^2 = \Phi(R,S)$. Using the delta method, we take expectations of the second-order Taylor expansion of Φ(R, S) about (µR, µS) = (E(R), E(S)), to give

$$E(\Phi(R,S)) = \Phi(\mu_R,\mu_S) + \frac{1}{\mu_S^2}\left[\mathrm{Var}(R) + \left(2 E_{\pi\omega}(G \mid y_0) - \frac{4\mu_R}{\mu_S}\right)\mathrm{Cov}(R,S) + \left(\frac{3\mu_R - 2 E_{\pi\omega}(G \mid y_0)\,\mu_S}{\mu_S^2}\right)\mu_R\,\mathrm{Var}(S)\right] + O\!\left(\frac{E\big(((R-\mu_R)+(S-\mu_S))^3\big)}{\mu_S^3}\right).$$

Taking expectations with respect to N independent draws of (θ, y) with density f (y | θ)q(θ), it is straightforward to write

$$\mu_R = E(R) = N\,E(WG) = N Z_\omega\,E_{\pi\omega}(G \mid y_0), \qquad \mu_S = E(S) = N\,E(W) = N Z_\omega,$$

where we recall that $E(W) = Z_\omega = \int L_\omega(\theta)\,\pi(\theta)\,\mathrm{d}\theta$ is the normalising constant in Section 2.1. We substitute these expectations into the Taylor expansion of $E\big((\hat{G} - E_{\pi\omega}(G \mid y_0))^2\big)$, noting that the leading-order term, Φ(µR, µS), is zero. Thus, we can write the dominant behaviour of the MSE as

$$\begin{aligned}E\!\left(\left(\hat{G} - E_{\pi\omega}(G \mid y_0)\right)^2\right) &= \frac{1}{N^2 Z_\omega^2}\Big[\mathrm{Var}(R) - 2\,E_{\pi\omega}(G \mid y_0)\,\mathrm{Cov}(R,S) + E_{\pi\omega}(G \mid y_0)^2\,\mathrm{Var}(S)\Big] + O\!\left(\frac{E\big(((R-\mu_R)+(S-\mu_S))^3\big)}{N^3}\right)\\ &= \frac{1}{N^2 Z_\omega^2}\,\mathrm{Var}\big(R - E_{\pi\omega}(G \mid y_0)\,S\big) + O\!\left(\frac{E\big(((R-\mu_R)+(S-\mu_S))^3\big)}{N^3}\right),\end{aligned}$$

as N → ∞. Substituting into this expression the definitions of R and S as sums of N independent, identically distributed random variables, we have

$$E\!\left(\left(\hat{G} - E_{\pi\omega}(G \mid y_0)\right)^2\right) = \frac{1}{N^2 Z_\omega^2}\,N\,\mathrm{Var}(W\Delta) + O\!\left(\frac{N}{N^3}\right),$$

and Equation (2) follows, on noting that E(WΔ) = 0 and that E(W) = Zω.
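The 1/N behaviour established in Theorem 1 can be observed directly for a toy target where the posterior expectation is known. In the sketch below (an invented illustration, not the paper's example), the likelihood-free weighting is trivial (ω ≡ 1) and the self-normalised estimator targets E_π(θ²) = 1 for π = N(0, 1) under an over-dispersed importance distribution q = N(0, 2²):

```python
import numpy as np

rng = np.random.default_rng(7)

def g_hat(N):
    """Self-normalised importance sampling estimate of E_pi(theta^2) = 1,
    with pi = N(0,1), q = N(0,2^2) and trivial weighting omega = 1."""
    theta = 2.0 * rng.normal(size=N)         # theta_i ~ q
    w = np.exp(-3.0 * theta**2 / 8.0)        # pi/q up to a constant factor
    return np.sum(w * theta**2) / np.sum(w)  # G-hat = R / S

reps = 40
mse_small = np.mean([(g_hat(2_000) - 1.0) ** 2 for _ in range(reps)])
mse_large = np.mean([(g_hat(32_000) - 1.0) ** 2 for _ in range(reps)])
print(mse_small, mse_large)   # empirical MSE shrinks roughly like 1/N
```

The constant factor of π/q cancels in the ratio R/S, mirroring the cancellation of Z_ω in the proof above.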

The consistency and MSE convergence rate in terms of the sample size, N, are trivially recast in Corollary 2 in terms of a computational budget, Ctot. This form is particularly important in the context of the performance comparison with the multifidelity schemes presented later.

Corollary 2

Let the computational cost of each iteration of Algorithm 1 be denoted by the random variable C. The leading order behaviour of the MSE of Ĝ as an estimate of Eπω(G|y0) is

$$E\!\left(\left(\hat{G} - E_{\pi\omega}(G \mid y_0)\right)^2\right) \preceq \left[E(C)\,\frac{E(W^2\Delta^2)}{E(W)^2}\right]\frac{1}{C_{\mathrm{tot}}},$$

as Ctot → ∞.

Proof. As the given computational budget increases, Ctot → ∞, the Monte Carlo sample size that can be produced with that budget increases on the order of N ~ Ctot/E(C). On substituting this expression into Equation (2), the result follows.

Given that the multifidelity scheme in Algorithm 2 is a special case of Algorithm 1, we know via Theorem 1 that the estimator is consistent with respect to $E_{\pi\omega_{\mathrm{mf}}}(G \mid y_0)$. Proposition 3 further establishes that in the multifidelity case we also have $E_{\pi\omega_{\mathrm{mf}}}(G \mid y_0) = E_{\pi\omega_{\mathrm{hi}}}(G \mid y_0)$.

Proposition 3

The multifidelity approximation to the likelihood, Lmf(θ) = E(ωmf | θ), is equal to the high-fidelity approximation to the likelihood, Lhi(θ) = E(ωhi | θ). Therefore, the estimate Ĝmf produced by Algorithm 2 is a consistent estimate of the high-fidelity approximate posterior expectation, Eπωhi(G|y0).

Proof. We take the expectation of ωmf (Equation (4)) conditional on (θ, ylo, m), to find

$$E(\omega_{\mathrm{mf}} \mid \theta, y_{\mathrm{lo}}, M = m) = \left(1 - \frac{m}{\mu}\right)\omega_{\mathrm{lo}}(\theta, y_{\mathrm{lo}}) + \frac{m}{\mu}\,E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}}).$$

Further taking expectations over the random integer M, which has conditional expected value µ(θ, ylo), gives

$$E(\omega_{\mathrm{mf}} \mid \theta, y_{\mathrm{lo}}) = E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}}).$$

Further taking expectations with respect to ylo, it follows that $L_{\omega_{\mathrm{mf}}}(\theta) = L_{\omega_{\mathrm{hi}}}(\theta)$. Therefore, the likelihood-free approximate posteriors are equal, $\pi_{\omega_{\mathrm{mf}}} = \pi_{\omega_{\mathrm{hi}}}$, and thus Ĝmf is a consistent estimate of $E_{\pi\omega_{\mathrm{mf}}}(G \mid y_0) = E_{\pi\omega_{\mathrm{hi}}}(G \mid y_0)$, as required.

We now arrive at the most important result of this work, which determines the necessary and sufficient conditions under which the multifidelity scheme out-performs direct likelihood-free importance sampling targeting the high-fidelity weighting. For concreteness, Theorem 4 assumes that the random variable M, which determines the required number of high-fidelity simulations in each iteration of Algorithm 2, is Poisson distributed, conditional on the parameter value and low-fidelity simulation output. However, it is important to note that only minor alterations to the form of the performance metrics arise under other parametric assumptions on M; binomial and geometric alternatives are explored in Appendix B. We show that, for a given multifidelity model and likelihood-free weightings, the mean function, µ, for M determines the performance of Algorithm 2 relative to Algorithm 1.

Theorem 4

Assume that the random number of high-fidelity simulations, M, required in each iteration of Algorithm 2 is Poisson distributed with conditional mean µ(θ, ylo). Let chi(θ) [respectively, clo(θ) and chi(θ, ylo)] be the expected time taken to simulate yhi ~ fhi(· | θ) [respectively, to simulate ylo ~ flo(· | θ) and to produce the coupled high-fidelity simulation yhi ~ fhi(· | θ, ylo)]. Further, assume that the computational cost of each iteration of Algorithm 1 and Algorithm 2 can be approximated by the dominant cost of simulation alone, neglecting the costs of the other calculations.

As Ctot → ∞, the MSE of the high-fidelity likelihood-free importance sampling estimator, G^ωhi, asymptotically exceeds that of the multifidelity importance sampling estimator, G^ωmf, that is,

$$E\!\left(\left(\hat{G}_{\omega_{\mathrm{mf}}} - E_{\pi\omega_{\mathrm{hi}}}(G \mid y_0)\right)^2\right) \prec E\!\left(\left(\hat{G}_{\omega_{\mathrm{hi}}} - E_{\pi\omega_{\mathrm{hi}}}(G \mid y_0)\right)^2\right),$$

if and only if 𝒥mf[µ] < 𝒥hi, where

$$\mathcal{J}_{\mathrm{hi}} = c_{\mathrm{hi}} V_{\mathrm{hi}}, \tag{20a}$$
$$\mathcal{J}_{\mathrm{mf}}[\mu] = \Big(c_{\mathrm{lo}} + E_{\rho}\big(\mu(\theta,y_{\mathrm{lo}})\,c_{\mathrm{hi}}(\theta,y_{\mathrm{lo}})\big)\Big) \times \left(V_{\mathrm{mf}} + E_{\rho}\!\left(\frac{\Delta_q(\theta)^2\,\eta(\theta,y_{\mathrm{lo}})}{\mu(\theta,y_{\mathrm{lo}})}\right)\right), \tag{20b}$$

with constants

$$c_{\mathrm{hi}} = \int c_{\mathrm{hi}}(\theta)\,q(\theta)\,\mathrm{d}\theta, \qquad V_{\mathrm{hi}} = \int \Delta_q(\theta)^2\,E(\omega_{\mathrm{hi}}^2 \mid \theta)\,q(\theta)\,\mathrm{d}\theta, \tag{20c}$$
$$c_{\mathrm{lo}} = \int c_{\mathrm{lo}}(\theta)\,q(\theta)\,\mathrm{d}\theta, \qquad V_{\mathrm{mf}} = \int \Delta_q(\theta)^2\,E(\lambda_{\mathrm{hi}}^2 \mid \theta)\,q(\theta)\,\mathrm{d}\theta, \tag{20d}$$

and functions

$$\Delta_q(\theta) = \frac{\pi(\theta)}{q(\theta)}\big(G(\theta) - G_{\mathrm{hi}}\big), \qquad \eta(\theta,y_{\mathrm{lo}}) = E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big), \qquad \lambda_{\mathrm{hi}}(\theta,y_{\mathrm{lo}}) = E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}}), \tag{20e}$$

and the joint density

$$\rho(\theta,y_{\mathrm{lo}}) = f_{\mathrm{lo}}(y_{\mathrm{lo}} \mid \theta)\,q(\theta). \tag{20f}$$

Proof. The leading order performance of each of Algorithm 1 and Algorithm 2 is given in terms of increasing computational budget, Ctot, in Equation (6) and Equation (7), respectively. For the performance of Algorithm 2 to exceed that of Algorithm 1, we compare the leading order coefficients from Equations (6) and (7), requiring

$$E(C_{\mathrm{mf}})\,\frac{E(w_{\mathrm{mf}}^2\Delta^2)}{E(w_{\mathrm{mf}})^2} < E(C_{\mathrm{hi}})\,\frac{E(w_{\mathrm{hi}}^2\Delta^2)}{E(w_{\mathrm{hi}})^2}. \tag{21}$$

We note that $E(w_{\mathrm{mf}} \mid \theta) = \pi(\theta)L_{\omega_{\mathrm{mf}}}(\theta)/q(\theta)$ and $E(w_{\mathrm{hi}} \mid \theta) = \pi(\theta)L_{\omega_{\mathrm{hi}}}(\theta)/q(\theta)$; as shown in Proposition 3, the denominators in Equation (21) are therefore equal. Thus,

$$\mathcal{J}_{\mathrm{mf}} = E(C_{\mathrm{mf}})\,E(w_{\mathrm{mf}}^2\Delta^2) < E(C_{\mathrm{hi}})\,E(w_{\mathrm{hi}}^2\Delta^2) = \mathcal{J}_{\mathrm{hi}},$$

is the condition for Algorithm 2 to outperform Algorithm 1.

Taking the right-hand side of this inequality first, clearly the expected simulation time is E(Chi) = chi, for the constant chi defined in Equation (20c). Similarly, we can write

$$E(w_{\mathrm{hi}}^2\Delta^2) = \int \left(\frac{\pi(\theta)}{q(\theta)}\Delta(\theta)\right)^2\left[\int \omega_{\mathrm{hi}}(\theta,y_{\mathrm{hi}})^2\,f_{\mathrm{hi}}(y_{\mathrm{hi}} \mid \theta)\,\mathrm{d}y_{\mathrm{hi}}\right]q(\theta)\,\mathrm{d}\theta = V_{\mathrm{hi}},$$

as given in Equation (20c). Thus, 𝒥hi = chiVhi.

For the left-hand side of the performance inequality, we take each expectation in 𝒥mf in turn. We first note that the expected iteration cost of Algorithm 2, E(Cmf), is the sum of the expected cost of a single low-fidelity simulation, and the expected cost of M high-fidelity simulations. By definition, the expected cost of a single low-fidelity simulation ylo ~ flo(· | θ) across θ ~ q(·) is given by clo. Thus the remaining cost, E(δCmf) = E(Cmf) − clo, is the expected cost of M high-fidelity simulations. Conditioning on θ, ylo and M = m, the expected remaining cost is, by definition,

$$E(\delta C_{\mathrm{mf}} \mid \theta, y_{\mathrm{lo}}, M = m) = m\,c_{\mathrm{hi}}(\theta, y_{\mathrm{lo}}).$$

Taking expectations over the conditional distribution M ~ Poi(µ(θ, ylo)), we have

$$E(\delta C_{\mathrm{mf}} \mid \theta, y_{\mathrm{lo}}) = \mu(\theta, y_{\mathrm{lo}})\,c_{\mathrm{hi}}(\theta, y_{\mathrm{lo}}).$$

Finally, integrating this expression over the density ρ in Equation (20f) gives the first factor of Equation (9).

It remains to show that

$$E(w_{\mathrm{mf}}^2\Delta^2) = V_{\mathrm{mf}} + \iint \frac{\Delta_q(\theta)^2\,\eta(\theta,y_{\mathrm{lo}})}{\mu(\theta,y_{\mathrm{lo}})}\,\rho(\theta,y_{\mathrm{lo}})\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}}.$$

We first condition on θ, ylo and M = m, to write

$$E(w_{\mathrm{mf}}^2\Delta^2 \mid \theta, y_{\mathrm{lo}}, m) = \Delta_q^2\,E(\omega_{\mathrm{mf}}^2 \mid \theta, y_{\mathrm{lo}}, m) = \Delta_q^2\left[\omega_{\mathrm{lo}}^2 + \frac{2}{\mu}\,\omega_{\mathrm{lo}}\,E(D_m \mid \theta, y_{\mathrm{lo}}) + \frac{1}{\mu^2}\,E(D_m^2 \mid \theta, y_{\mathrm{lo}})\right],$$

for the random variable $D_m = \sum_{i=1}^{m}(\omega_{\mathrm{hi},i} - \omega_{\mathrm{lo}})$. It is straightforward to show that

$$E(D_m \mid \theta, y_{\mathrm{lo}}) = m\,E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}}) - m\,\omega_{\mathrm{lo}}, \qquad E(D_m^2 \mid \theta, y_{\mathrm{lo}}) = m\,E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big) + (m^2 - m)\,E(\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}} \mid \theta, y_{\mathrm{lo}})^2,$$

where we exploit the conditional independence of the high-fidelity simulations yhi,i and yhi, j, for ij. On substitution of these conditional expectations, we then rearrange to write

$$E(w_{\mathrm{mf}}^2\Delta^2 \mid \theta, y_{\mathrm{lo}}, m) = \Delta_q^2\left[\left(1 - \frac{2m}{\mu}\right)\omega_{\mathrm{lo}}^2 + \frac{2m}{\mu}\,\omega_{\mathrm{lo}}\lambda_{\mathrm{hi}} + \frac{m}{\mu^2}\,\mathrm{Var}(\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}} \mid \theta, y_{\mathrm{lo}}) + \left(\frac{m(\lambda_{\mathrm{hi}}-\omega_{\mathrm{lo}})}{\mu}\right)^2\right],$$

where we write the conditional expectation $\lambda_{\mathrm{hi}}(\theta, y_{\mathrm{lo}}) = E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})$. At this point we can take expectations over M and rearrange to give

$$\begin{aligned}E(w_{\mathrm{mf}}^2\Delta^2 \mid \theta, y_{\mathrm{lo}}) &= \Delta_q^2\left[2\omega_{\mathrm{lo}}\lambda_{\mathrm{hi}} - \omega_{\mathrm{lo}}^2 + \frac{1}{\mu}\,\mathrm{Var}(\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}} \mid \theta, y_{\mathrm{lo}}) + \big(\mathrm{Var}(M \mid \theta, y_{\mathrm{lo}}) + \mu^2\big)\frac{(\lambda_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2}{\mu^2}\right]\\ &= \Delta_q^2\left[\lambda_{\mathrm{hi}}^2 + \frac{1}{\mu}\left(\mathrm{Var}(\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}} \mid \theta, y_{\mathrm{lo}}) + \frac{\mathrm{Var}(M \mid \theta, y_{\mathrm{lo}})}{\mu}\,E(\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}} \mid \theta, y_{\mathrm{lo}})^2\right)\right].\end{aligned} \tag{22}$$

Here, we can use the assumption that M conditioned on θ and ylo is Poisson distributed, noting that the statement of Theorem 4 can be adapted for other conditional distributions of M with different conditional variance functions. Under the Poisson assumption, we can substitute Var(M | θ, ylo) = µ(θ, ylo) to give

$$E(w_{\mathrm{mf}}^2\Delta^2 \mid \theta, y_{\mathrm{lo}}) = \Delta_q^2\left[\lambda_{\mathrm{hi}}^2 + \frac{E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big)}{\mu(\theta,y_{\mathrm{lo}})}\right].$$

Finally, we take expectations with respect to the probability density ρ in Equation (20f), and the product in Equation (9) follows.

Given the main result from Theorem 4 on the conditions for improvement using multifidelity importance sampling, the following results (Lemma 5 and Corollary 6) establish the form and existence of a mean function, µ(θ, ylo), such that these conditions are satisfied.

Lemma 5

The functional 𝒥mf[µ], quantifying the performance of Algorithm 2, is optimised by the function µ*, where

$$\mu^*(\theta,y_{\mathrm{lo}})^2 = \Delta_q(\theta)^2\left[\frac{\eta(\theta,y_{\mathrm{lo}})/V_{\mathrm{mf}}}{c_{\mathrm{hi}}(\theta,y_{\mathrm{lo}})/c_{\mathrm{lo}}}\right]. \tag{23}$$

Proof. We write the functional 𝒥mf[µ] = 𝒞[µ]𝒱 [µ] in Equation (9) as the product of functionals

$$\mathcal{C}[\mu] = c_{\mathrm{lo}} + \iint \mu\,c_{\mathrm{hi}}\,\rho\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}}, \tag{24a}$$
$$\mathcal{V}[\mu] = V_{\mathrm{mf}} + \iint \frac{\Delta_q^2\,\eta}{\mu}\,\rho\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}}. \tag{24b}$$

Standard ‘product rule’ results from the calculus of variations allow us to write the functional derivative of 𝒥mf with respect to µ as

$$\frac{\delta \mathcal{J}_{\mathrm{mf}}}{\delta\mu} = \mathcal{V}[\mu]\,\frac{\delta\mathcal{C}}{\delta\mu} + \mathcal{C}[\mu]\,\frac{\delta\mathcal{V}}{\delta\mu} = \mathcal{V}[\mu]\,c_{\mathrm{hi}}\,\rho - \mathcal{C}[\mu]\,\frac{\Delta_q^2\,\eta\,\rho}{\mu^2}.$$

Setting this functional derivative to zero, the optimal function, µ*, satisfies

$$\mu^*(\theta,y_{\mathrm{lo}})^2 = \frac{\mathcal{C}[\mu^*]}{\mathcal{V}[\mu^*]}\,\frac{\Delta_q(\theta)^2\,\eta(\theta,y_{\mathrm{lo}})}{c_{\mathrm{hi}}(\theta,y_{\mathrm{lo}})}. \tag{25}$$

The result in Equation (10) follows on showing that $\mathcal{C}[\mu^*]/\mathcal{V}[\mu^*] = c_{\mathrm{lo}}/V_{\mathrm{mf}}$.

On substituting Equation (25) into Equation (24) we find

$$\mathcal{C}[\mu^*] = c_{\mathrm{lo}} + \sqrt{\frac{\mathcal{C}[\mu^*]}{\mathcal{V}[\mu^*]}}\iint\sqrt{\Delta_q^2\,\eta\,c_{\mathrm{hi}}}\,\rho\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}}, \qquad \mathcal{V}[\mu^*] = V_{\mathrm{mf}} + \sqrt{\frac{\mathcal{V}[\mu^*]}{\mathcal{C}[\mu^*]}}\iint\sqrt{\Delta_q^2\,\eta\,c_{\mathrm{hi}}}\,\rho\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}},$$

from which it follows that

$$\sqrt{\frac{\mathcal{V}[\mu^*]}{\mathcal{C}[\mu^*]}}\,c_{\mathrm{lo}} = \sqrt{\frac{\mathcal{C}[\mu^*]}{\mathcal{V}[\mu^*]}}\,V_{\mathrm{mf}} = \sqrt{\mathcal{C}[\mu^*]\,\mathcal{V}[\mu^*]} - \iint\sqrt{\Delta_q^2\,\eta\,c_{\mathrm{hi}}}\,\rho\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}}.$$

Multiplying this equation by $\sqrt{\mathcal{C}[\mu^*]\,\mathcal{V}[\mu^*]}$, we have $\mathcal{V}[\mu^*]\,c_{\mathrm{lo}} = \mathcal{C}[\mu^*]\,V_{\mathrm{mf}}$, and thus Equation (10) follows from Equation (25).

Corollary 6

There exists a mean function, µ, such that the performance of Algorithm 2 exceeds the performance of Algorithm 1, if and only if

$$\sqrt{\frac{c_{\mathrm{lo}}}{c_{\mathrm{hi}}}\,\frac{V_{\mathrm{mf}}}{V_{\mathrm{hi}}}} + \iint\sqrt{\frac{\Delta_q(\theta)^2\,\eta(\theta,y_{\mathrm{lo}})}{V_{\mathrm{hi}}}\,\frac{c_{\mathrm{hi}}(\theta,y_{\mathrm{lo}})}{c_{\mathrm{hi}}}}\,\rho(\theta,y_{\mathrm{lo}})\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}} < 1. \tag{26}$$

Proof. On substituting Equation (23) into Equation (24), we find that the condition $\mathcal{J}_{\mathrm{mf}}^* = \mathcal{J}_{\mathrm{mf}}[\mu^*] < \mathcal{J}_{\mathrm{hi}} = c_{\mathrm{hi}}V_{\mathrm{hi}}$ is equivalent to

$$\left(\sqrt{c_{\mathrm{lo}}V_{\mathrm{mf}}} + \iint\sqrt{\Delta_q(\theta)^2\,\eta(\theta,y_{\mathrm{lo}})\,c_{\mathrm{hi}}(\theta,y_{\mathrm{lo}})}\,\rho(\theta,y_{\mathrm{lo}})\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}}\right)^2 < c_{\mathrm{hi}}V_{\mathrm{hi}}.$$

A simple rearrangement of this inequality gives the inequality in Equation (26).

To interpret the condition in Equation (26), we note that the first term is determined by (a) our assumption of a significant reduction in simulation burden of the low-fidelity model over the high-fidelity model, clo < chi, and (b) the ratio of the two integrals,

$$\frac{V_{\mathrm{mf}}}{V_{\mathrm{hi}}} = \frac{\int \Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\,q(\theta)\,\mathrm{d}\theta}{\int \Delta_q(\theta)^2\,E(\omega_{\mathrm{hi}}^2 \mid \theta)\,q(\theta)\,\mathrm{d}\theta}.$$

Exploiting the law of total variance, we note that

$$E(\omega_{\mathrm{hi}}^2 \mid \theta) = \mathrm{Var}(\omega_{\mathrm{hi}} \mid \theta) + L_{\omega_{\mathrm{hi}}}(\theta)^2, \qquad E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big) = \mathrm{Var}\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}}) \mid \theta\big) + L_{\omega_{\mathrm{hi}}}(\theta)^2 = E(\omega_{\mathrm{hi}}^2 \mid \theta) - E\big(\mathrm{Var}(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}}) \mid \theta\big).$$

These equalities imply that

$$E(\omega_{\mathrm{hi}} \mid \theta)^2 \le E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big) \le E(\omega_{\mathrm{hi}}^2 \mid \theta),$$

where the lower bound is achieved for yhi independent of ylo, while the upper bound would be achieved if yhi were a deterministic function of ylo. In particular, Vmf/Vhi ≤ 1, and so the first term of Equation (26) is small whenever the low-fidelity model provides significant computational savings versus the high-fidelity model.

The second term in Equation (26) quantifies the detriment to the performance of Algorithm 2 that arises from the inaccuracy of ωlo as an estimate of ωhi. The function $\eta(\theta, y_{\mathrm{lo}}) = E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big)$ is integrated across the density ρ, weighted by the relative computational cost of the high-fidelity simulation, $c_{\mathrm{hi}}(\theta,y_{\mathrm{lo}})/c_{\mathrm{hi}}$, and by the contribution of G(θ) to the variance of the estimated posterior expectation of G. We can conclude that the multifidelity approach requires the low-fidelity model to be accurate in the regions of parameter space where high-fidelity simulations are particularly expensive.

To summarise: if (a) the ratio between average low-fidelity and high-fidelity simulation costs is suitably small, and (b) the average disagreement between likelihood-free weightings, as measured by η, is suitably small, then Equation (26) will be satisfied and thus a mean function, µ, exists such that Algorithm 2 is more efficient than Algorithm 1. The adaptive scheme proposed in Algorithm 3 tunes the mean function µ(θ, ylo) towards this optimum to maximise the computational benefit.

5. Example: Biochemical reaction network

The following example considers the stochastic simulation of a biochemical reaction motif. Readers unfamiliar with these techniques are referred to the detailed expositions by Schnoerr et al. [32], Warne et al. [33] and Erban and Chapman [34]. We model the conversion (over time t ≥ 0) of substrate molecules, labelled S, into molecules of a product, P. The conversion of S into P is catalysed by the presence of enzyme molecules, E, which bind with S to form a substrate–enzyme complex, labelled C. After non-dimensionalising units of time and volume, this network motif is represented by three reactions,

$$S + E \underset{k_2}{\overset{k_1}{\rightleftharpoons}} C \overset{k_3}{\longrightarrow} P + E, \tag{27a}$$

parameterised by the vector θ = (k1, k2, k3) of positive parameters, k1, k2, and k3, and three propensity functions,

$$v_1(t) = k_1 S(t)E(t), \tag{27b}$$
$$v_2(t) = k_2 C(t), \tag{27c}$$
$$v_3(t) = k_3 C(t), \tag{27d}$$

where the integer-valued variables S(t), E(t), C(t) and P(t) represent the molecule numbers at time t > 0. The stoichiometric matrix for this model is

$$h = \begin{bmatrix} -1 & 1 & 0 \\ -1 & 1 & 1 \\ 1 & -1 & -1 \\ 0 & 0 & 1 \end{bmatrix}, \tag{28}$$

where column j records the net change in the molecule numbers (S, E, C, P) due to a single firing of reaction j.

At t = 0, we assume there are no complex or product molecules, but set positive integer numbers S0 = 100 and E0 = 5 of substrate and enzyme molecules, respectively. Given the fixed initial conditions, the parameters in θ are sufficient to specify the dynamics of the model in Equation (27a). The model is stochastic, and induces a distribution, which we denote f(· | θ), on the space of trajectories x : t ⟼ (S(t), E(t), C(t), P(t)) of molecule numbers in ℕ⁴ over time. Such trajectories evolve according to the discrete-state Markov process

$$x_t = x_0 + \sum_{j=1}^{3} \mathcal{P}_j\!\left(\int_0^t v_j(s)\,\mathrm{d}s\right) h_{\cdot,j}, \tag{29}$$

where 𝒫1(·), 𝒫2(·), and 𝒫3(·) are independent unit-rate Poisson processes, so that each time-changed reaction channel fires as an inhomogeneous Poisson process.

For the purposes of this example, the observed data (Figure 1) is given by

$$y_0 = (y_1, \ldots, y_{10}) = (1.73,\ 3.80,\ 5.95,\ 8.10,\ 11.17,\ 12.92,\ 15.50,\ 17.75,\ 20.17,\ 23.67),$$

where the n-th observation represents the hitting time at which the product molecule count reaches 10n, that is, yn = t such that P(t) = 10n. We set a prior π(θ) on the vector θ equal to a product of independent uniform distributions, such that k1, k2 ~ U(10, 100) and k3 ~ U(0.1, 10). We seek the posterior distribution π(θ | y0), focusing on the posterior expectation of the function G(θ) = k3, the rate of conversion of substrate–enzyme complex to product. All code for this example is available at github.com/tpprescott/mf-lf, using stochastic simulations implemented by github.com/tpprescott/ReactionNetworks.jl. We assume that we cannot calculate the likelihood function, L(θ) = f(y0 | θ), and therefore resort to our likelihood-free framework.
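The paper's simulations are implemented in Julia (ReactionNetworks.jl); purely for illustration, a minimal Python version of the exact Gillespie simulation of Equation (27), returning the ten hitting-time summaries yn, might look as follows (using the example parameter values θ = (50, 50, 1)):

```python
import numpy as np

def simulate_hitting_times(theta, S0=100, E0=5, rng=None):
    """Exact SSA for S + E <-> C -> P + E; returns the times at which P
    first reaches 10, 20, ..., 100 (cf. the observed data y_0)."""
    k1, k2, k3 = theta
    rng = np.random.default_rng() if rng is None else rng
    S, E, C, P = S0, E0, 0, 0
    t, hits, target = 0.0, [], 10
    while P < S0:
        v = np.array([k1 * S * E, k2 * C, k3 * C])   # propensities v_1..v_3
        vtot = v.sum()
        t += rng.exponential(1.0 / vtot)             # time to next reaction
        j = rng.choice(3, p=v / vtot)                # which reaction fires
        if j == 0:
            S, E, C = S - 1, E - 1, C + 1            # S + E -> C
        elif j == 1:
            S, E, C = S + 1, E + 1, C - 1            # C -> S + E
        else:
            C, E, P = C - 1, E + 1, P + 1            # C -> P + E
            if P == target:
                hits.append(t)
                target += 10
    return np.array(hits)

y = simulate_hitting_times((50.0, 50.0, 1.0), rng=np.random.default_rng(0))
print(y)   # ten increasing hitting times
```

Because k1 and k2 are large relative to k3, most of the simulated events are binding/unbinding reactions, which is exactly the source of the computational burden discussed in Section 5.1.2.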

Figure 1. Effect of multifidelity coupling.

Figure 1

(a) Example stochastic trajectories from the high and low-fidelity enzyme kinetics models in Equations (27) and (30) for parameters θ = (k1, k2, k3) = (50, 50, 1), compared with data used for inference. For one low-fidelity simulation, we generate five uncoupled simulations and five coupled simulations. (b) Ten-dimensional data summarising simulated trajectories in (a). Black represents observed data, y0; the single low-fidelity simulation ylo ~ flo(· | θ) is in blue; five uncoupled simulations yhi ~ fhi(· | θ) are in orange; five coupled simulations yhi ~ fhi(· | θ, ylo) are in green (almost coincident).

5.1. Multifidelity approximate Bayesian computation

Here we formulate the weighting function to implement ABC importance sampling to compare with an equivalent ABC implementation of adaptive multifidelity importance sampling. See Appendix A for the general formulation of ABC within our framework.

5.1.1. ABC importance sampling

Given θ, the model in Equation (27) can be exactly simulated using the Gillespie stochastic simulation algorithm, to produce draws y ~ f (· | θ) from the exact model [26, 34, 35]. We will use the ABC likelihood-free weighting with threshold value ϵ = 5 on the Euclidean distance of the simulation from y0, such that

$$\omega(\theta, y) = \mathbb{I}\big(\|y - y_0\|_2 < 5\big),$$

to define the likelihood-free approximation to the likelihood, LABC(θ) = E(ω | θ). For simplicity of demonstration we set the importance distribution equal to the prior, that is, q = π; in this instance, Algorithm 1 reduces to a rejection sampling approach. Furthermore, configuring Algorithm 1 and Algorithm 3 to use the same proposal ensures that any performance improvements are due to the multifidelity scheme rather than to tuning of proposals.

5.1.2. Multifidelity ABC

The exact Gillespie stochastic simulation algorithm can incur significant computational burden. In the specific case of the network in Equation (27), if the reaction rates k1 and k2 are large relative to k3, there are large numbers of binding/unbinding reactions S + E ⟷ C that occur in any simulation. In comparison, the reaction C ⟶ P + E can only fire exactly 100 times. Michaelis–Menten dynamics exploit this scale separation to approximate the enzyme kinetics network motif. We approximate the conversion of substrate into product as a single reaction step,

$$S \overset{k_{\mathrm{MM}}(t)}{\longrightarrow} P, \tag{30a}$$

where the time-varying rate of conversion, kMM(t), given by

$$k_{\mathrm{MM}}(t) = \frac{k_3\,\min(S(t), E_0)}{K_{\mathrm{MM}} + S(t)}, \tag{30b}$$
$$K_{\mathrm{MM}} = \frac{k_2 + k_3}{k_1}, \tag{30c}$$

induces the propensity function vMM(t) = kMM(t)S(t). We assume initial conditions S(0) = S0 = 100 and P(0) = 0, and fix the parameter E0 = 5. Thus, the parameter vector, θ = (k1, k2, k3), again fully determines the dynamics of the low-fidelity model in Equation (30). We write ylo ~ flo(· | θ) for the conditional probability density of the Gillespie simulation of the approximate model in Equation (30), where ylo is the vector of ten simulated time points ylo,n at which 10n product molecules have been produced.
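A correspondingly minimal (uncoupled) simulator for the low-fidelity model in Equation (30) only ever fires the single conversion reaction, which is why it is so much cheaper than the full network. Again, this is an illustrative Python sketch rather than the paper's Julia implementation:

```python
import numpy as np

def simulate_mm_hitting_times(theta, S0=100, E0=5, rng=None):
    """Gillespie simulation of the Michaelis-Menten reduction S -> P with
    propensity v_MM(t) = k_MM(t) S(t); returns the ten hitting times."""
    k1, k2, k3 = theta
    K_MM = (k2 + k3) / k1                      # Equation (30c)
    rng = np.random.default_rng() if rng is None else rng
    S, P, t, hits = S0, 0, 0.0, []
    while P < S0:
        k_MM = k3 * min(S, E0) / (K_MM + S)    # Equation (30b)
        t += rng.exponential(1.0 / (k_MM * S)) # single reaction channel
        S, P = S - 1, P + 1
        if P % 10 == 0:
            hits.append(t)
    return np.array(hits)

y_lo = simulate_mm_hitting_times((50.0, 50.0, 1.0), rng=np.random.default_rng(0))
print(y_lo)
```

Exactly 100 events are simulated per run, versus the thousands of binding/unbinding events required by the full network at these parameter values.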

For a biochemical reaction network consisting of R reactions, the Gillespie simulation algorithm is a deterministic transformation of R independent unit-rate Poisson processes, one for each reaction channel. We can couple the models in Equations (27) and (30) by using the same Poisson process for the single reaction in Equation (30) and for the product formation C ⟶ P + E reaction of Equation (27) [11, 36]. Using this coupling approach, we first simulate ylo ~ flo(· | θ) from Equation (30). We then produce the coupled simulation yhi ~ fhi(· | θ, ylo) from the model in Equation (27), using the shared Poisson process. We set the corresponding likelihood-free weightings to

$$\omega_{\mathrm{hi}}(\theta, y_{\mathrm{hi}}) = \mathbb{I}\big(\|y_{\mathrm{hi}} - y_0\|_2 < 5\big), \qquad \omega_{\mathrm{lo}}(\theta, y_{\mathrm{lo}}) = \mathbb{I}\big(\|y_{\mathrm{lo}} - y_0\|_2 < 5\big),$$

noting that E(ωhi | θ) = LABC(θ) is the high-fidelity ABC approximation to the likelihood. Figure 1 illustrates the effect of coupling between low-fidelity and high-fidelity models. The five coupled high-fidelity simulations are significantly less variable than the independent high-fidelity simulations, appearing almost coincident in Figure 1. This ensures a large degree of correlation between the coupled likelihood-free weightings, ωhi and ωlo. Thus, coupling ensures that ωlo is a reliable proxy for ωhi for use in multifidelity likelihood-free inference.
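The variance reduction delivered by sharing a Poisson process between the two simulators can be seen in isolation by time-changing the same unit-rate arrival path with two different integrated propensities. The sketch below (constant propensities, invented values) is a schematic illustration of the coupling principle, not the full coupled simulator:

```python
import numpy as np

rng = np.random.default_rng(1)

def unit_poisson_path(n, rng):
    """Arrival times of a unit-rate Poisson process P(.)."""
    return np.cumsum(rng.exponential(1.0, size=n))

def count(path, integrated_propensity):
    """Firings of the time-changed process P(int_0^t v ds), cf. Equation (29)."""
    return np.searchsorted(path, integrated_propensity)

n_rep = 2_000
d_coupled, d_indep = [], []
for _ in range(n_rep):
    shared = unit_poisson_path(400, rng)
    n_lo = count(shared, 100.0)   # integrated propensity of one channel
    n_hi = count(shared, 110.0)   # slightly different propensity, SAME path
    d_coupled.append(n_hi - n_lo)
    d_indep.append(count(unit_poisson_path(400, rng), 110.0) - n_lo)

print(np.var(d_coupled), np.var(d_indep))  # coupled differences far less variable
```

The coupled counts differ only through the extra arrivals in the shared path, mirroring how the coupled high-fidelity trajectories in Figure 1 cluster tightly around the low-fidelity simulation.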

We implement Algorithm 3 with a burn-in period of N0 = 10,000 iterations, during which we generate mi ~ Poi(1) high-fidelity simulations at each iteration, i ≤ N0. Once the burn-in period is complete, we define the partition 𝒟 by learning a decision tree through a simple regression, as described in Section 3.4. For iterations i > N0 beyond the burn-in period, we set a step size of δ = 10⁻³ for the gradient-descent update in Equation (14).

5.1.3. Results

Algorithm 1 was run four times, with the total number of weighted samples set to N = 10,000, N = 20,000, N = 40,000 and N = 80,000. Similarly, Algorithm 3 was run five times, with N = 40,000, N = 80,000, N = 160,000, N = 320,000 and N = 640,000. Since we do not have access to the true posterior mean, $E_{\pi\omega_{\mathrm{hi}}}(G \mid y_0)$, we use the empirical mean over all high-fidelity runs as a proxy, resulting in an empirical MSE estimate for the multifidelity scheme. Figure 2a shows how the empirical MSE of the estimate, Ĝ, varies with the total simulation cost, Ctot, for each of the two algorithms. The slope of each curve (on a log–log scale) is approximately −1, corresponding to the dominant behaviour of the MSE being reciprocal in total simulation time, as observed in Equation (7). The offset between the two curves corresponds to the inequality 𝒥mf < 𝒥hi in the leading-order coefficient, thereby demonstrating the improved performance of Algorithm 3 over Algorithm 1.

Figure 2. Multifidelity ABC.

Figure 2

(a) Total simulation cost versus the empirical mean-squared error (MSE) of the output estimate, Ĝ, for four runs of Algorithm 1 (ABC) and five runs of Algorithm 3 (MF-ABC). (b) Values of the multifidelity weight, wmf, during the longest run of Algorithm 3, shown for iterations where ωmf ≠ ωlo, such that the low-fidelity likelihood-free weighting is corrected by at least one high-fidelity simulation. (c) The evolution of the values of vk(i) during the longest run of Algorithm 3; different coloured lines distinguish the vk(i) for distinct partition cells, Dk, but the colours themselves have no specific interpretation. (d) A comparison of the adaptive v1(i) with the evolving best estimate of the optimal v1, given by Equation (13), based on the Monte Carlo estimates in Equation (19).

The values in Figure 2b show the multifidelity weights, wi. We show only those weights not equal to zero or one, corresponding to those iterations where ωlo(θi, ylo,i) has been corrected by at least one ωhi(θi, yhi,i,j) ≠ ωlo(θi, ylo,i). Clearly, a significant amount of correction is applied to the low-fidelity weights. However, as demonstrated by the improved performance statistics, Algorithm 3 has learned an allocation of computational budget to the high-fidelity simulations that balances reduced overall simulation time against correcting inaccuracies in the low-fidelity simulations.
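The correction these weights implement can be sketched directly from the multifidelity weighting formula, w_mf = ω_lo + (1/μ) Σ_{j=1}^{M} (ω_hi,j − ω_lo) with M ~ Poi(μ). The sketch below is a minimal illustration, not the authors' implementation; `draw_omega_hi` is a hypothetical callable standing in for one coupled high-fidelity simulation and its weighting.

```python
import numpy as np

def multifidelity_weight(omega_lo, mu, draw_omega_hi, rng):
    """Two-level multifidelity weight: the low-fidelity weighting omega_lo is
    corrected by M ~ Poisson(mu) coupled high-fidelity weightings, so that the
    result is unbiased for the expected high-fidelity weighting."""
    m = rng.poisson(mu)  # random number of high-fidelity corrections
    correction = sum(draw_omega_hi() - omega_lo for _ in range(m))
    return omega_lo + correction / mu
```

If every high-fidelity draw agrees with ω_lo, the correction vanishes and w_mf = ω_lo exactly; unbiasedness, E(w_mf) = E(ω_hi), holds for any μ > 0, while μ controls how often the (expensive) corrections are made.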

Each run of Algorithm 3 includes a burn-in period of 10,000 iterations, at the conclusion of which a partition 𝒟 is created based on decision tree regression. In Appendix C, we show how this decision tree is used to define a piecewise-constant mean function, specifically for the partition 𝒟 used in the final run of Algorithm 3 (i.e. N = 640,000 iterations). In Figure 2c, we show the evolution over iterations i of the values vk(i) used in this mean function. Following the updating rule in Equation (19), the trajectory of vk(i) converges exponentially towards a Monte Carlo estimate of the optimal value vk given in Equation (13). However, we can see from Figure 2c that, as more simulations are completed and the Monte Carlo estimates in Equation (19) evolve, the values of each parameter, vk, track the updated estimates. This is illustrated in Figure 2d for v1, where the estimated optimum v1 evolves as more simulations are completed. We note that the gradient descent update in Equation (19) at iteration i depends on all vk(i) values. Thus, the observed convergence of v1(i) to the evolving estimate of v1 is not necessarily monotonic.

Figure 2d illustrates the motivation for using gradient descent rather than simply adopting the analytically obtained optimum. When very few simulations have been completed, the estimates in Equation (10) are small, and their ratios are numerically unstable and often far from the true optimum. If the vk(i) values are too small in early iterations, then the estimates become still more numerically unstable, since fewer high-fidelity simulations are completed when µ is small. Instead, using gradient descent ensures that enough high-fidelity simulations are completed for each 𝒟k, including those with low volume under the measure ρ, to stabilise the estimates required in Equation (10) and thus stabilise the multifidelity algorithm.

5.2. Multifidelity Bayesian synthetic likelihood

Consider the same model of enzyme kinetics as in Section 5.1. As depicted in Figure 1, this model has low-fidelity (Michaelis–Menten) stochastic dynamics with distribution flo(· | θ), and coupled high-fidelity stochastic dynamics with distribution fhi(· | θ, ylo). We now redefine ωlo and ωhi to be Bayesian synthetic likelihoods, based on K pairs of coupled simulations,

$$y_{\text{lo},k} \sim f_{\text{lo}}(\cdot \mid \theta), \qquad y_{\text{hi},k} \sim f_{\text{hi}}(\cdot \mid \theta, y_{\text{lo},k}),$$

for k = 1, …, K. That is,

$$\omega_{\text{lo}}(\theta, y_{\text{lo}}) = \mathcal{N}\big(y_0 ; \mu(y_{\text{lo}}), \Sigma(y_{\text{lo}})\big), \qquad \omega_{\text{hi}}(\theta, y_{\text{hi}}) = \mathcal{N}\big(y_0 ; \mu(y_{\text{hi}}), \Sigma(y_{\text{hi}})\big),$$

are the Gaussian likelihoods of the observed data, under the empirical mean and covariance of K low-fidelity and (coupled) high-fidelity simulations, respectively. As in the ABC example, for simplicity of demonstration we set the importance distribution equal to the prior, that is, q = π.
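As a concrete sketch of such a weighting, the Gaussian synthetic likelihood of an observed summary y0 can be computed from the empirical mean and covariance of the K simulated summaries. The function below is an illustrative stand-in (plain numpy), not the authors' code; `sims` is assumed to hold one d-dimensional summary per row.

```python
import numpy as np

def synthetic_likelihood(y0, sims):
    """Gaussian (synthetic) likelihood of the observed summary y0 under the
    empirical mean and covariance of the K simulated summaries in `sims`
    (a K x d array): N(y0; mu(sims), Sigma(sims))."""
    K, d = sims.shape
    mean = sims.mean(axis=0)
    cov = np.cov(sims, rowvar=False)           # empirical covariance (K-1 normalisation)
    diff = y0 - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)   # Mahalanobis term
    return np.exp(-0.5 * (d * np.log(2.0 * np.pi) + logdet + quad))
```

In practice the log-likelihood would typically be used directly, since the density itself can underflow when d is large.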

Algorithm 1 was run three times, using ωhi(θ, yhi) dependent on high-fidelity simulations yhi ~ fhi(· | θ) alone, and setting the number of iterations to N = 2,500, N = 5,000 and N = 10,000. Similarly, Algorithm 3 was run four times using the coupled multifidelity model, setting the number of iterations to N = 4,000, N = 8,000, N = 16,000 and N = 32,000, and initialising with a burn-in of size N0 = 2,000. The adaptive step size was set to δ = 10^8. In both algorithms, we set the number of simulations required for each evaluation of ωhi(θ, (yhi,1, …, yhi,K)) or ωlo(θ, (ylo,1, …, ylo,K)) to K = 100.

Figure 3 depicts the performance of multifidelity BSL inference, where Algorithm 3 is applied with BSL likelihood-free weightings, ωlo and ωhi. As with MF-ABC, Figure 3a shows that MF-BSL improves on high-fidelity BSL inference, achieving lower MSE for a given computational budget. We also note in Figure 3a that the curve corresponding to MF-BSL has slope less than −1. This is due to (a) the overhead cost of the initial burn-in period of Algorithm 3, and (b) the conservative convergence of vk(i) to the optimum, as shown in Figure 3c–d. Both observations imply that earlier iterations are produced less efficiently than later iterations, meaning that larger samples show greater improvements than expected from the reciprocal relationship in Equation (7).

Figure 3. Multifidelity BSL.

Figure 3

(a) Total simulation cost versus the empirical MSE of the output estimate, Ĝ, for three runs of Algorithm 1 (BSL) and four runs of Algorithm 3 (MF-BSL). (b) Values of the multifidelity weight, wmf, during the longest run of Algorithm 3, shown for iterations where ωmf ≠ ωlo, such that the low-fidelity likelihood-free weighting is corrected by at least one high-fidelity simulation. (c) The evolution of the values of vk(i) during the longest run of Algorithm 3; different coloured lines distinguish the vk(i) for distinct partition cells, Dk, but the colours themselves have no specific interpretation. (d) A comparison of the adaptive v1(i) with the evolving best estimate of the optimal v1, given by Equation (13), based on the Monte Carlo estimates in Equation (19).

Comparing Figure 3b to Figure 2b, we note that there are very few negative multifidelity weights in MF-BSL, in comparison to MF-ABC. We can conclude that the Bayesian synthetic likelihood constructed using low-fidelity simulations tends to underestimate the likelihood of the observed data relative to that constructed using high-fidelity simulations. We also note in this comparison that the ABC and BSL likelihood-free weightings are on significantly different scales.

6. Discussion

The characteristic computational burden of simulation-based, likelihood-free Bayesian inference methods is often a barrier to their successful implementation. Multifidelity simulation techniques have previously been shown to improve the efficiency of likelihood-free inference in the context of ABC. In this work, we have demonstrated that these techniques can be readily applied to a range of likelihood-free approaches. Furthermore, we have introduced a computational methodology for automating the multifidelity approach, adaptively allocating simulation resources across different fidelities in order to ensure near-optimal efficiency gains from this technique. As parameter space is explored, our methodology, given in Algorithm 3, learns the relationships between simulation accuracy and simulation costs at the different fidelities, and adapts the requirement for high-fidelity simulation accordingly.

The multifidelity approach to likelihood-free inference is one of a number of strategies for speeding up inference, which include MCMC and SMC sampling techniques [7, 8, 9] and methods for variance reduction such as multilevel estimation [29, 16, 18, 17]. A key observation in the previous work of Prescott and Baker [12] and Warne et al. [15] is that applying multifidelity techniques provides ‘orthogonal’ improvements that combine synergistically with these other established approaches to improving efficiency. Similarly, we envision that Algorithm 3 can be adapted into an SMC or multilevel algorithm with minimal difficulty, following the templates set by Prescott and Baker [12] and Warne et al. [15].

The multifidelity approach discussed in this work is a highly flexible generalisation of existing multifidelity techniques, which can be viewed as special cases of Algorithm 2. In each of MF-ABC [11, 12], LZ-ABC [20], and DA-ABC [22], it is assumed that ωhi is an ABC likelihood-free weighting; we relax this assumption in this work. Furthermore, LZ-ABC and DA-ABC both use ωlo ≡ 0, so that parameters are always rejected if no high-fidelity simulation is completed. We relax this assumption to allow for any low-fidelity likelihood-free weighting. In all of MF-ABC, LZ-ABC and DA-ABC, the conditional distribution of M, given a parameter value θ and low-fidelity simulation output ylo, is Bernoulli, with mean µ(θ, ylo) ∈ (0, 1]. In this work we change this distribution to Poisson to simplify the analysis, but any conditional distribution for M can be used. These adaptations are explored further in Appendix B.

In the case of MF-ABC (as originally formulated by [11]) and DA-ABC [21, 22], the mean function, µ(θ, ylo), depends on a single low-fidelity simulation and is assumed to be piecewise constant in the value of the indicator function I(d(ylo, y0) < ϵ). Lazy ABC is more general, defining µ = µ(ϕ(θ, ylo)) to depend on the value of any decision statistic, ϕ. In this work, we consider more general piecewise-constant mean functions, µ𝒟, for heuristically derived partitions 𝒟 of (θ, ylo)-space. We observe that (θ, ylo) may be of very high dimension; in the BSL example in Section 5.2, having K = 100 low-fidelity simulations ylo ∈ ℝ^10 means that the input to µ is of dimension 1003. In this situation, it may be tempting to seek a mean function that depends only on θ. However, we recall that the optimal mean function, µ(θ, ylo), derived in Lemma 5, depends on the conditional expectation E((ωhi − ωlo)² | θ, ylo). Thus, by ignoring ylo, we would discard the information about ωhi provided by the evaluation of ωlo(θ, ylo). Furthermore, the high dimension of the inputs to µ suggests that this function is not necessarily well-approximated by a decision tree. Future work may focus on methods to learn the optimal mean function directly, without resorting to piecewise-constant approximations [37]. The key problem is ensuring the conservatism of any alternative estimate of µ, recalling that the variance of wmf is inversely proportional to µ.
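The piecewise-constant mean functions µ𝒟 can be illustrated with a deliberately tiny stand-in for decision-tree regression: a single-split "stump" on one scalar feature of (θ, ylo), fitted to observed squared discrepancies (ωhi − ωlo)². Everything below (the function names `fit_stump_partition` and `mu_D`, the feature choice, the split criterion) is an illustrative assumption; the paper's partitions come from a full decision tree, as in Section 3.4.

```python
import numpy as np

def fit_stump_partition(x, sq_disc):
    """Learn a single split of a scalar feature x that best predicts the
    squared discrepancies sq_disc = (omega_hi - omega_lo)^2 by a
    piecewise-constant function: minimise the within-leaf sum of squared
    errors over all candidate split points. Returns the threshold and the
    two leaf means."""
    best = None
    order = np.argsort(x)
    xs, ys = x[order], sq_disc[order]
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, xs[i - 1], left.mean(), right.mean())
    _, thresh, v_left, v_right = best
    return thresh, v_left, v_right

def mu_D(x_new, thresh, v_left, v_right):
    # piecewise-constant mean function over the two-cell partition
    return v_left if x_new <= thresh else v_right
```

In the adaptive algorithm the per-cell values vk would then be tuned further by gradient descent rather than fixed at the regression estimates.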

In the example explored in Section 5, we considered the use of Algorithm 3 where ωhi and ωlo were first both ABC likelihood-free weightings, and then both BSL likelihood-free weightings. In principle, the method also allows ωlo to be, for example, an ABC likelihood-free weighting based on a single low-fidelity simulation, and ωhi to be a BSL likelihood-free weighting based on K > 1 high-fidelity simulations. However, the success of the multifidelity method depends explicitly on the function η(θ, ylo) = E((ωhi − ωlo)² | θ, ylo) being sufficiently small, as quantified in Corollary 6. If ωlo and ωhi are on different scales, as is likely when one is an ABC weighting and the other a BSL weighting, then this function is not sufficiently small in general, and so the multifidelity approach fails. We note, however, that we could instead consider a scaled low-fidelity weighting, ω̃lo = γωlo, in place of ωlo in Algorithms 2 and 3, with no change to the target distribution. Here, γ is an additional parameter that can be tuned together with µ when minimising the performance metric, 𝒥mf; its optimal value would need to be learned in parallel with the optimal mean function, µ. We defer this adaptation to future work.

The analysis of the performance improvements of multifidelity importance sampling over high-fidelity likelihood-free importance sampling assumes that the importance distribution, q(θ), is the same in both methods. As mentioned in Section 2, this importance distribution can also be tuned to improve efficiency, in a similar way to the cost/variance trade-off that we optimise in the multifidelity scheme. While we have not analysed this here, we expect that tuning the proposal distribution could provide additional improvements. While the analysis may be non-trivial, standard adaptive importance sampling could be applied to tune a proposal from a fixed parametric family [38] or, more generally, sequential Monte Carlo methods can be used to deal with this problem practically without assuming a parametric form for q(θ) [12]. Future work should consider the performance improvements available when the assumption of a fixed proposal is relaxed.

Finally, this work follows Prescott and Baker [11, 12] in considering only a single low-fidelity model. There is significant scope for further improvements by applying these approaches to suites of low-fidelity approximations [39]. For example, exact stochastic simulations of biochemical networks, such as that simulated in Section 5, may also be approximated by tau-leaping [33, 40], where the time discretisation parameter, τ, is typically chosen to trade off computational savings against accuracy: exactly the trade-off explored in this work. This parameter therefore has important consequences for the success of a multifidelity inference approach using such an approximation strategy. There are several natural extensions that could be applied to include multiple fidelities (such as multiple τ resolutions). As an example, suppose there are three levels of fidelity (low, medium and high) with respective weightings ωlo, ωmed and ωhi. In this setting we can apply the multifidelity weighting (Equation (4)) recursively to obtain

$$\omega_{\text{mf}}(\theta,z) = \omega_{\text{lo}}(\theta,y_{\text{lo}}) + \frac{1}{\mu_{\text{med}}(\theta,y_{\text{lo}})}\sum_{i=1}^{m}\left[\left\{\omega_{\text{med}}(\theta,y_{\text{med},i}) + \frac{1}{\mu_{\text{hi}}(\theta,y_{\text{med},i})}\sum_{j=1}^{n_i}\left[\omega_{\text{hi}}(\theta,y_{\text{hi},j}) - \omega_{\text{med}}(\theta,y_{\text{med},i})\right]\right\} - \omega_{\text{lo}}(\theta,y_{\text{lo}})\right],$$

where m is the random number of medium-fidelity simulations, conditional on the parameters and the low-fidelity simulation, and ni is the random number of high-fidelity simulations, conditional on the parameters and the ith medium-fidelity simulation. The mean functions of these random variables are µmed(θ, ylo) and µhi(θ, ymed,i), respectively. Alternatively, we could consider a set of J low-fidelity models, ωlo^1, …, ωlo^J, that may have different accuracy properties in different partitions of parameter space, D1, …, DJ. The multifidelity weighting can be formulated as

$$\omega_{\text{mf}}(\theta,z) = \sum_{j=1}^{J} I(\theta \in D_j)\left\{\omega_{\text{lo}}^{j}(\theta,y_{\text{lo}}^{j}) + \frac{1}{\mu^{j}(\theta,y_{\text{lo}}^{j})}\sum_{i=1}^{m}\left[\omega_{\text{hi}}(\theta,y_{\text{hi},i}) - \omega_{\text{lo}}^{j}(\theta,y_{\text{lo}}^{j})\right]\right\},$$

assuming that model ωlo^j has the optimal accuracy characteristics in partition Dj. Of course, it may not be known a priori which low-fidelity model performs best in a given partition, and therefore we may choose to include the assignment of models to partitions as part of the adaptive tuning scheme. In future work, a full exploration of the use of multiple low-fidelity model approximations will be vital for the full potential of multifidelity likelihood-free inference to be realised. The theory and practice presented here, along with these clear paths for extension, offer the prospect of substantial performance gains in likelihood-free inference.
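The recursive three-level weighting above can be sketched as follows. For simplicity, the mean functions µmed and µhi are taken as constants here (in the text they depend on (θ, ylo) and (θ, ymed,i), respectively), and the `draw_omega_med` and `draw_omega_hi` callables are hypothetical stand-ins for coupled medium- and high-fidelity simulations and their weightings.

```python
import numpy as np

def three_level_weight(omega_lo, mu_med, mu_hi, draw_omega_med, draw_omega_hi, rng):
    """Recursive multifidelity weight over three fidelities: each of
    m ~ Poisson(mu_med) medium-fidelity weightings is itself corrected by
    n_i ~ Poisson(mu_hi) high-fidelity weightings, and the corrected values
    in turn correct the low-fidelity weighting omega_lo."""
    m = rng.poisson(mu_med)
    total = 0.0
    for _ in range(m):
        omega_med = draw_omega_med()
        n_i = rng.poisson(mu_hi)
        inner = sum(draw_omega_hi() - omega_med for _ in range(n_i))
        corrected_med = omega_med + inner / mu_hi   # inner two-level correction
        total += corrected_med - omega_lo
    return omega_lo + total / mu_med
```

Unbiasedness is preserved at every level: taking expectations first over each ni and then over m recovers the expected high-fidelity weighting.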

Supplementary Material

Appendix

Acknowledgements

REB and TPP acknowledge funding for this work through the BBSRC/UKRI grant BB/R00816/1. TPP is supported by the Alan Turing Institute and by Wave 1 of the UKRI Strategic Priorities Fund, under the “Shocks and Resilience” theme of the EPSRC/UKRI grant EP/W006022/1. DJW thanks the Australian Mathematical Society for a Lift-off Fellowship and the Queensland University of Technology (QUT) for support through the Early Career Researcher Support Scheme. DJW acknowledges continued support from the Centre for Data Science at QUT. REB is supported by a Royal Society Wolfson Research Merit Award.

References

  • [1]. Peherstorfer B, Willcox K, Gunzburger M. Survey of multifidelity methods in uncertainty propagation, inference, and optimization. SIAM Review. 2018;60:550–591.
  • [2]. Peherstorfer B, Willcox K, Gunzburger M. Optimal model management for multifidelity Monte Carlo estimation. SIAM Journal on Scientific Computing. 2016;38:A3163–A3194.
  • [3]. Ng LWT, Willcox KE. Multifidelity approaches for optimization under uncertainty. International Journal for Numerical Methods in Engineering. 2014;100:746–772.
  • [4]. Sisson SA, Fan Y, Beaumont M, editors. Handbook of Approximate Bayesian Computation. CRC Press; 2020.
  • [5]. Sunnåker M, Busetto AG, Numminen E, Corander J, Foll M, Dessimoz C. Approximate Bayesian computation. PLoS Computational Biology. 2013;9:e1002803. doi: 10.1371/journal.pcbi.1002803.
  • [6]. Price LF, Drovandi CC, Lee A, Nott DJ. Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics. 2018;27:1–11.
  • [7]. Marjoram P, Molitor J, Plagnol V, Tavaré S. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences. 2003;100:15324–15328. doi: 10.1073/pnas.0306899100.
  • [8]. Sisson SA, Fan Y, Tanaka MM. Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences. 2007;104:1760–1765. doi: 10.1073/pnas.0607208104.
  • [9]. Toni T, Welch D, Strelkowa N, Ipsen A, Stumpf MP. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface. 2009;6:187–202. doi: 10.1098/rsif.2008.0172.
  • [10]. Del Moral P, Doucet A, Jasra A. An adaptive sequential Monte Carlo method for approximate Bayesian computation. Statistics and Computing. 2011;22:1009–1020.
  • [11]. Prescott TP, Baker RE. Multifidelity approximate Bayesian computation. SIAM/ASA Journal on Uncertainty Quantification. 2020;8:114–138.
  • [12]. Prescott TP, Baker RE. Multifidelity approximate Bayesian computation with sequential Monte Carlo parameter sampling. SIAM/ASA Journal on Uncertainty Quantification. 2021;9:788–817.
  • [13]. Cranmer K, Brehmer J, Louppe G. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences. 2020;117:30055–30062. doi: 10.1073/pnas.1912789117.
  • [14]. Drovandi CC, Pettitt AN. Estimation of parameters for macroparasite population evolution using approximate Bayesian computation. Biometrics. 2011;67:225–233. doi: 10.1111/j.1541-0420.2010.01410.x.
  • [15]. Warne DJ, Prescott TP, Baker RE, Simpson MJ. Multifidelity multilevel Monte Carlo to accelerate approximate Bayesian parameter inference for partially observed stochastic processes. Journal of Computational Physics. 2022;469:111543.
  • [16]. Guha N, Tan X. Multilevel approximate Bayesian approaches for flows in highly heterogeneous porous media and their applications. Journal of Computational and Applied Mathematics. 2017;317:700–717.
  • [17]. Jasra A, Jo S, Nott D, Shoemaker C, Tempone R. Multilevel Monte Carlo in approximate Bayesian computation. Stochastic Analysis and Applications. 2019;37:346–360.
  • [18]. Warne DJ, Baker RE, Simpson MJ. Multilevel rejection sampling for approximate Bayesian computation. Computational Statistics & Data Analysis. 2018;124:71–86.
  • [19]. Warne DJ, Baker RE, Simpson MJ. Rapid Bayesian inference for expensive stochastic models. Journal of Computational and Graphical Statistics. 2021;31:512–528.
  • [20]. Prangle D. Lazy ABC. Statistics and Computing. 2016;26:171–185.
  • [21]. Christen JA, Fox C. Markov chain Monte Carlo using an approximation. Journal of Computational and Graphical Statistics. 2005;14:795–810.
  • [22]. Everitt RG, Rowińska PA. Delayed acceptance ABC-SMC. Journal of Computational and Graphical Statistics. 2021;30:55–66.
  • [23]. Yan L, Zhou T. Adaptive multi-fidelity polynomial chaos approach to Bayesian inference in inverse problems. Journal of Computational Physics. 2019;381:110–128.
  • [24]. Yan L, Zhou T. An adaptive surrogate modeling based on deep neural networks for large-scale Bayesian inverse problems. Communications in Computational Physics. 2020;28:2180–2205.
  • [25]. Bon JJ, Warne DJ, Nott DJ, Drovandi C. Bayesian score calibration for approximate models. 2022. arXiv:2211.05357.
  • [26]. Gillespie DT. Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry. 1977;81:2340–2361.
  • [27]. Bezanson J, Edelman A, Karpinski S, Shah VB. Julia: A fresh approach to numerical computing. SIAM Review. 2017;59:65–98.
  • [28]. Owen AB. Monte Carlo Theory, Methods and Examples. 2013. URL: https://artowen.su.domains/mc/
  • [29]. Giles MB. Multilevel Monte Carlo methods. Acta Numerica. 2015;24:259–328.
  • [30]. Rhee C-H, Glynn PW. Unbiased estimation with square root convergence for SDE models. Operations Research. 2015;63:1026–1043.
  • [31]. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer New York; 2009.
  • [32]. Schnoerr D, Sanguinetti G, Grima R. Approximation and inference methods for stochastic biochemical kinetics—a tutorial review. Journal of Physics A: Mathematical and Theoretical. 2017;50:093001.
  • [33]. Warne DJ, Baker RE, Simpson MJ. Simulation and inference algorithms for stochastic biochemical reaction networks: From basic concepts to state-of-the-art. Journal of The Royal Society Interface. 2019;16:20180943. doi: 10.1098/rsif.2018.0943.
  • [34]. Erban R, Chapman SJ. Stochastic Modelling of Reaction–Diffusion Processes. Cambridge University Press; 2019.
  • [35]. Warne DJ, Baker RE, Simpson MJ. A practical guide to pseudo-marginal methods for computational inference in systems biology. Journal of Theoretical Biology. 2020;496:110255. doi: 10.1016/j.jtbi.2020.110255.
  • [36]. Lester C. Multi-level approximate Bayesian computation. 2019. arXiv:1811.08866.
  • [37]. Levine ME, Stuart AM. A framework for machine learning of model error in dynamical systems. 2021. arXiv:2107.06658.
  • [38]. Cornuet J-M, Marin J-M, Mira A, Robert CP. Adaptive multiple importance sampling. Scandinavian Journal of Statistics. 2012;39:798–812.
  • [39]. Gorodetsky AA, Jakeman JD, Geraci G. MFNets: data efficient all-at-once learning of multifidelity surrogates as directed networks of information sources. Computational Mechanics. 2021;68:741–758.
  • [40]. Gillespie DT. Approximate accelerated stochastic simulation of chemically reacting systems. The Journal of Chemical Physics. 2001;115:1716–1733.
  • [41]. Andrieu C, Roberts GO. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics. 2009;37.
  • [42]. Drovandi C, Everitt RG, Golightly A, Prangle D. Ensemble MCMC: Accelerating pseudo-marginal MCMC for state space models using the ensemble Kalman filter. Bayesian Analysis. 2022;17:223–260.
