Author manuscript; available in PMC: 2025 Aug 16.
Published in final edited form as: J Comput Phys. 2024 Jan;496:112577. doi: 10.1016/j.jcp.2023.112577

Efficient multifidelity likelihood-free Bayesian inference with adaptive computational resource allocation

Thomas P Prescott a,b, David J Warne c, Ruth E Baker d,*
PMCID: PMC7618014  EMSID: EMS207238  PMID: 40822115

Abstract

Likelihood-free Bayesian inference algorithms are popular methods for inferring the parameters of complex stochastic models with intractable likelihoods. These algorithms characteristically rely heavily on repeated model simulations. However, whenever the computational cost of simulation is even moderately expensive, the significant burden incurred by likelihood-free algorithms leaves them infeasible for many practical applications. The multifidelity approach has been introduced in the context of approximate Bayesian computation to reduce the simulation burden of likelihood-free inference without loss of accuracy, by using the information provided by simulating computationally cheap, approximate models in place of the model of interest. In this work we demonstrate that multifidelity techniques can be applied in the general likelihood-free Bayesian inference setting. Analytical results on the optimal allocation of computational resources to simulations at different levels of fidelity are derived, and subsequently implemented practically. We provide an adaptive multifidelity likelihood-free inference algorithm that learns the relationships between models at different fidelities and adapts resource allocation accordingly, and demonstrate that this algorithm produces posterior estimates with near-optimal efficiency.

1. Introduction

Across domains in engineering and science, parametrised mathematical models are often too complex to analyse directly. Instead, many outer-loop applications [1], such as model calibration, optimization, and uncertainty quantification, rely on repeated simulation to understand the relationship between model parameters and behaviour. In time-sensitive and cost-aware applications, the typical computational burden of such simulation-based methods makes them impractical. Multifidelity methods, reviewed by Peherstorfer et al. [1, 2] and Ng and Willcox [3], are a family of approaches that exploit information gathered from simulations, not only of a single model of interest, but also of additional approximate or surrogate models. In this article, the term model refers to the underlying mathematical abstraction of a system in combination with the computer code used to implement simulations. Thus, ‘model approximation’ may refer to mathematical simplifications and/or approximations in numerical methods. The fundamental challenge when implementing multifidelity techniques is the allocation of computational resources between different models, for the purposes of balancing a characteristic trade-off between maintaining accuracy and saving computational burden.

In this work, we consider a specific outer-loop application that arises in Bayesian statistics, the goal of which is to calibrate a parametrised model against observed data. Bayesian inference uses the likelihood of the observed data to update a prior distribution on the model parameters into a posterior distribution, according to Bayes’s rule. In the situation where the likelihood of the data cannot be calculated, we rely on so-called likelihood-free methods that provide estimates of the likelihood by comparing model simulations to data. For example, approximate Bayesian computation (ABC) is a widely-known likelihood-free inference technique [4, 5] where the likelihood is typically estimated as a binary value that records whether or not the distance between a simulation and the observed data falls within a given threshold. Other likelihood-free methods are also available, such as pseudo-marginal methods and Bayesian synthetic likelihoods (BSL) [6]. In this work, we develop a generalised likelihood-free framework for which ABC, pseudo-marginal and BSL can be expressed as specific cases, as described in Section 2.1.

The significant cost of likelihood-free inference has motivated several successful proposals for improving the efficiency of likelihood-free samplers, such as ABC-MCMC [7] and ABC-SMC [8, 9, 10]. These approaches aim to explore parameter space efficiently by avoiding the proposal of low-likelihood parameters, reducing the number of expensive simulations required and improving the ABC acceptance rate. However, an ‘orthogonal’ technique for improving the efficiency of likelihood-free inference is to instead ensure that each simulation-based likelihood estimate is, on average, less computationally expensive to generate.

In previous work, Prescott and Baker [11, 12] investigated multifidelity approaches to likelihood-free Bayesian inference [13], with a specific focus on ABC [4, 5]. Suppose that there exists a low-fidelity approximation to the parametrised model of interest, and that the approximation is relatively cheap to simulate. Monte Carlo estimates of the posterior distribution, with respect to the likelihood of the original high-fidelity model, can be constructed using the simulation outputs of the low-fidelity approximation. Prescott and Baker [11] showed that using the low-fidelity approximation introduces no further bias, so long as, for any parameter proposal, there is a positive probability of simulating the high-fidelity model to check and potentially correct a low-fidelity likelihood estimate. The key to the success of the multifidelity ABC (MF-ABC) approach is to choose this positive probability to be suitably small, thereby simulating the original model as little as possible, while ensuring it is large enough that the variance of the resulting Monte Carlo estimate is suitably small. The result of the multifidelity approach is to reduce the expected cost of estimating the likelihood for each parameter proposal in any Monte Carlo sampling algorithm. In subsequent work, Prescott and Baker [12] showed that this approach integrates with sequential Monte Carlo (SMC) sampling for efficient parameter space exploration [9, 10, 14], and Warne et al. [15] demonstrated its applicability to multilevel Monte Carlo [16, 17, 18]. Thus, the synergistic effect of combining multifidelity with other Monte Carlo schemes to improve the efficiency of ABC has been demonstrated.

Multifidelity ABC can be compared with previous techniques for exploiting model approximation in ABC, such as Preconditioning ABC [19], Lazy ABC (LZ-ABC) [20], and Delayed Acceptance ABC (DA-ABC) [21, 22]. The preconditioning approach seeks to explore parameter space more efficiently, by proposing parameters for high-fidelity simulation with greater low-fidelity posterior mass. In contrast, each of MF-ABC, LZ-ABC, and DA-ABC seeks to make each parameter proposal quicker to evaluate, on average, by using the output of the low-fidelity simulation to directly decide whether to simulate the high-fidelity model. In both LZ-ABC and DA-ABC, a parameter proposal is either (a) rejected early, based on the simulated output of the low-fidelity model, or (b) sent to a high-fidelity simulation, to make a final decision on ABC acceptance or rejection. The distinctive aspect of MF-ABC is that step (a) is different; it is not necessary to reject early to avoid high-fidelity simulation. Instead the low-fidelity simulation can be used to make the accept/reject decision directly. In both DA-ABC and MF-ABC, the decision between (a) or (b) is based solely on whether the low-fidelity simulation would be accepted or rejected. In contrast, LZ-ABC allows for a much more generic decision of whether to simulate the high-fidelity model, requiring an extensive exploration of practical tuning methods.

More generally, there are multifidelity methods that exploit tractable surrogate models and apply subsequent adaptations or transformations to correct for bias in this surrogate. For example, Yan and Zhou [23, 24] adaptively tune surrogate models based on polynomial chaos or deep learning methods, and Bon et al. [25] use a population of surrogates and moment-matching transformations in a similar sense to Warne et al. [19]. While these approaches are of interest, we primarily focus on multifidelity schemes that are strictly simulation-based. However, we note that surrogate likelihood approaches can also be expressed within our theoretical framework.

In this paper, we show that the multifidelity approach can be applied to any simulation-based likelihood-free inference methodology, including but not limited to ABC. We achieve this by developing a generalised framework for likelihood-free inference, and deriving a multifidelity method to operate in this framework. A successful multifidelity likelihood-free inference algorithm requires us to determine how many simulations of the high-fidelity model to perform, based on the parameter value and the simulated output of the low-fidelity model. We provide theoretical results and practical, automated tuning methods to allocate computational resources between two models, designed to optimise the performance of multifidelity likelihood-free importance sampling.

1.1. Outline

In Section 2 we introduce a generalised framework for likelihood-free Bayesian inference, of which standard approaches are shown to be special cases. In Section 3 a general multifidelity likelihood-free importance sampler is constructed, based on the MF-ABC approach of Prescott and Baker [11]. This section also explores how to practically allocate computation between model fidelities, by adaptively evolving the allocation in response to learned relationships between simulations at each fidelity across parameter space. Analysis is presented in Section 4, including the proofs of the main results set out in Section 3.3, in which we determine the optimal allocation of computational resources between the two models to achieve the best possible performance of multifidelity inference. We illustrate adaptive multifidelity inference by applying the algorithm to a fundamental biochemical network motif in Section 5. We show that, using a low-fidelity Michaelis–Menten approximation together with the exact model (both simulated using the exact algorithm of [26]), our adaptive implementation of multifidelity likelihood-free inference can achieve a quantifiable speed-up in constructing posterior estimates to a specified variance and with no additional bias. Code for this example, developed in Julia 1.6.2 [27], is available at github.com/tpprescott/mf-lf. Finally, in Section 6 we discuss how greater improvements may be achieved for more challenging inference tasks.

2. Likelihood-free inference

We consider a stochastic model of the data generating process, defined by a distribution with parametrised probability density function, f (· | θ), where the parameter vector θ takes values in a parameter space Θ. For any θ ∈ Θ, the model induces a probability density, denoted f (y | θ), on observable outputs, with y taking values in an output space 𝒴. We note that the model is usually implemented in computer code to allow simulation, through which outputs y ∈ 𝒴 can be generated. We write y ~ f (· | θ) to denote simulation of the model f given parameter values θ. Taking the experimentally observed data y0 ∈ 𝒴, we define the likelihood function to be a function of θ using the density, ℒ (θ) = f (y0 | θ), of the observed data under this model.

Bayesian inference updates prior knowledge of the parameter values, θ ∈ Θ, which we encode in a prior distribution with density π(θ). The information provided by the experimental data, encoded in the likelihood function, ℒ (θ), is combined with the prior using Bayes’ rule to form a posterior distribution, with density

$$\pi(\theta \mid y_0) = \frac{\mathcal{L}(\theta)\,\pi(\theta)}{Z},$$

where $Z = \int_\Theta \mathcal{L}(\theta)\,\pi(\theta)\,\mathrm{d}\theta$ normalises π(· | y0) to be a probability distribution on Θ. For a given, arbitrary, integrable function G : Θ → ℝ, we take the goal of the inference task to be the production of a Monte Carlo estimate of the posterior expectation,

$$\mathbb{E}(G \mid y_0) = \int_\Theta G(\theta)\,\pi(\theta \mid y_0)\,\mathrm{d}\theta,$$

conditioned on the observed data.

2.1. Approximating the likelihood with simulation

In most practical settings, models tend to be sufficiently complicated that calculating ℒ(θ) = f (y0 | θ) for θ ∈ Θ is intractable. In this case, we exploit the ability to produce independent simulations from the model, y = (y1, …, yK) with yk ~ f (· | θ). In the following, we will slightly abuse notation by using the shorthand y ~ f (· | θ) to represent K independent, identically distributed draws from the parametrised distribution f (· | θ).

Given the observed data, y0, we can define a real-valued function, referred to as a likelihood-free weighting, ω : (θ, y) ↦ ℝ, which varies over the joint space of parameter values and simulation outputs. Here, ω is a function of a parameter value, θ, and a vector, y, of stochastic simulations. For a fixed θ, we can take conditional expectations of ω(θ, y) over the probability density of simulations, f (y | θ), to define an approximate likelihood function,

$$L_\omega(\theta) = \mathbb{E}(\omega \mid \theta) = \int \omega(\theta, y)\, f(y \mid \theta)\,\mathrm{d}y, \tag{1}$$

where ω(θ, y) is chosen such that Lω(θ) is an appropriate approximation to the modelled likelihood function, ℒ(θ). The determination of what is appropriate as an approximation will depend on the implementation or application. For example, weights corresponding to BSL will not be appropriate if the distribution of the summary statistics is highly non-Gaussian. Similarly, within an ABC setting, it is not appropriate to choose a very large discrepancy threshold, as this will lead to Lω(θ) ≈ 1 for any θ due to very few proposals being rejected. For standard likelihood-free methods, such as ABC, BSL and pseudo-marginal methods, the likelihood-free weighting, ω(θ, y), is a random variable (since it is a function of K stochastic simulations) and a Monte Carlo estimate of the approximate likelihood function, Lω(θ). The explicit form of this Monte Carlo estimate is implementation specific (see Appendix A). While we focus here on the Monte Carlo setting, it is worth highlighting that ω(θ, y) may be treated as a deterministic function of θ (and thus independent of any stochastic simulations) in order to implement a likelihood function, surrogate likelihood function, or some alternative loss function. For example, if ω(θ, y) = f (y0 | θ) then Equation (1) reduces to Lω(θ) = ℒ(θ).
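As a concrete illustration of a likelihood-free weighting, the ABC case can be sketched as follows. This is a minimal Python sketch (the paper's own code is in Julia); the Gaussian toy model, the Euclidean distance, and all names here are illustrative assumptions, not the paper's example.

```python
import numpy as np

def abc_weight(theta, ys, y0, epsilon):
    """ABC-style likelihood-free weighting: the fraction of the K simulated
    outputs landing within distance epsilon of the data y0. Averaging over
    simulations gives a Monte Carlo estimate of L_omega(theta)."""
    ys = np.atleast_2d(ys)                      # shape (K, dim)
    dists = np.linalg.norm(ys - y0, axis=1)     # distance of each simulation to y0
    return float(np.mean(dists <= epsilon))

# Toy model (an assumption for illustration): y ~ Normal(theta, 1), data y0 = 0.
rng = np.random.default_rng(1)
theta = 0.5
ys = rng.normal(theta, 1.0, size=(100, 1))
w = abc_weight(theta, ys, np.array([0.0]), epsilon=0.5)
```

As the threshold epsilon grows, every simulation is accepted and the weight saturates at 1, which is the degenerate regime cautioned against above.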

The approximate likelihood function is used to define the likelihood-free approximation to the posterior,

$$\pi_\omega(\theta \mid y_0) = \frac{L_\omega(\theta)\,\pi(\theta)}{Z_\omega},$$

with the normalisation constant $Z_\omega = \int_\Theta L_\omega(\theta)\,\pi(\theta)\,\mathrm{d}\theta$. The likelihood-free approximation to the posterior, πω(θ | y0), subsequently induces a potentially biased approximation of E(G | y0), given by

$$\mathbb{E}_{\pi_\omega}(G \mid y_0) = \int_\Theta G(\theta)\,\pi_\omega(\theta \mid y_0)\,\mathrm{d}\theta.$$

In this situation, the success of likelihood-free inference depends on ensuring that the likelihood-free weighting, ω(θ, y), is chosen such that the squared difference $(\mathbb{E}(G \mid y_0) - \mathbb{E}_{\pi_\omega}(G \mid y_0))^2$ between the posterior expectation, E(G | y0), and its likelihood-free approximation, $\mathbb{E}_{\pi_\omega}(G \mid y_0)$, is as small as possible. Most importantly, the accuracy of the likelihood-free approximation is entirely encoded in the weighting function, and in most, but not all, cases this will be biased. For example, standard likelihood-free methods such as ABC, BSL and pseudo-marginal methods can be implemented using different choices for ω(θ, y) (see Appendix A); however, only ABC and BSL introduce bias.

2.2. Likelihood-free importance sampling

A direct approach to estimating the likelihood-free approximate posterior expectation, Eπω(Gy0), is to use importance sampling. We assume that parameter proposals θi ~ q(·), for i = 1, …, N, can be sampled from a given importance distribution, the support of which must include the prior support, that is, q(θ) ≠ 0 if π(θ) > 0. In practice, we need only know the importance density, q(θ), up to a multiplicative constant. We also assume that we have access to the prior probability density, π(θ).

The likelihood-free importance sampling algorithm is described in Algorithm 1. This algorithm requires the specification of an importance distribution, q, and a likelihood-free weighting, ω(θ, y), with conditional expectation, Lω(θ) = E(ω | θ). The output of Algorithm 1, Ĝ, is an estimate of the likelihood-free approximate posterior expectation, $\mathbb{E}_{\pi_\omega}(G \mid y_0)$. We show (Section 4, Theorem 1) the standard result that Ĝ is a consistent estimate of $\mathbb{E}_{\pi_\omega}(G \mid y_0)$, and quantify the dominant behaviour of the mean-squared error (MSE) in the limit of large sample sizes, N → ∞.

Algorithm 1 Likelihood-free importance sampling.

Require: Prior, π; importance distribution, q; likelihood-free weighting, ω; model f (· | θ); target function, G.

    for i ∈ [1, 2, …, N] do

        Sample θi ~ q(·);

        Simulate yi ~ f (· | θi);

        Calculate weight $w_i = w(\theta_i, y_i) = \frac{\pi(\theta_i)}{q(\theta_i)}\,\omega(\theta_i, y_i)$;

    end for

    Estimate expectation using weighted sum, $\hat{G} = \sum_{i=1}^N w_i G(\theta_i) \Big/ \sum_{j=1}^N w_j$.
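The steps of Algorithm 1 can be sketched in a few lines of Python (the paper's reference implementation is in Julia). Everything below other than the algorithm's structure is an illustrative assumption: the Gaussian toy model, the ABC-style indicator weighting, and the choice prior = importance distribution = N(0, 1).

```python
import numpy as np

def lf_importance_sampling(prior_pdf, q_sample, q_pdf, simulate, weight_fn, G, N, rng):
    """Likelihood-free importance sampling (Algorithm 1), as a sketch.
    `simulate(theta, rng)` draws y ~ f(.|theta); `weight_fn(theta, y)` is omega."""
    ws, gs = np.empty(N), np.empty(N)
    for i in range(N):
        theta = q_sample(rng)                              # theta_i ~ q(.)
        y = simulate(theta, rng)                           # y_i ~ f(.|theta_i)
        ws[i] = prior_pdf(theta) / q_pdf(theta) * weight_fn(theta, y)
        gs[i] = G(theta)
    return np.sum(ws * gs) / np.sum(ws)                    # self-normalised estimate G-hat

# Toy check: f(.|theta) = N(theta, 1), observed y0 = 0, ABC indicator weighting.
rng = np.random.default_rng(0)
sim = lambda th, r: r.normal(th, 1.0)
omega = lambda th, y: float(abs(y - 0.0) <= 0.5)           # epsilon = 0.5
phi = lambda th: np.exp(-th**2 / 2) / np.sqrt(2 * np.pi)   # prior = q = N(0,1)
ghat = lf_importance_sampling(phi, lambda r: r.normal(), phi, sim, omega,
                              lambda th: th, 5000, rng)
```

With G(θ) = θ and symmetric prior and data, the estimated posterior mean should sit near zero.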

In obtaining the consistency result, we also determine the leading-order behaviour of the MSE of the output of Algorithm 1 in terms of sample size,

$$\mathbb{E}\left(\left(\hat{G} - \mathbb{E}_{\pi_\omega}(G \mid y_0)\right)^2\right) = \left[\frac{\mathbb{E}\left(W^2\left(G(\theta) - \mathbb{E}_{\pi_\omega}(G \mid y_0)\right)^2\right)}{\mathbb{E}(W)^2}\right] \frac{1}{N} + O\!\left(\frac{1}{N^2}\right), \tag{2}$$

where W is the random variable for the value of the importance weight, w, as defined in Algorithm 1. For later formulations it is more useful to denote by ci the random computational cost of the K stochastic simulations, yi ~ f (· | θi), and then to consider a fixed computational budget, Ctot, rather than a fixed sample size. That is, we take N as the largest index i for which $\sum_{j=1}^{i} c_j \le C_{\text{tot}}$. We can also quantify the performance of this algorithm in terms of how the MSE decreases as the overall computational budget increases. To leading order, this is given by (Section 4, Corollary 2)

$$\mathbb{E}\left(\left(\hat{G} - \mathbb{E}_{\pi_\omega}(G \mid y_0)\right)^2\right) = \left[\mathbb{E}(C)\,\frac{\mathbb{E}\left(W^2\left(G(\theta) - \mathbb{E}_{\pi_\omega}(G \mid y_0)\right)^2\right)}{\mathbb{E}(W)^2}\right] \frac{1}{C_{\text{tot}}} + O\!\left(\frac{1}{C_{\text{tot}}^2}\right). \tag{3}$$

We can use the leading-order coefficient of $1/C_{\text{tot}}$ in Equation (3) to quantify the performance of likelihood-free importance sampling. Importantly, this expression explicitly depends on the expected computational cost, C, of each iteration of Algorithm 1 and the variance of the weighted errors, $W\left(G(\theta) - \mathbb{E}_{\pi_\omega}(G \mid y_0)\right)$. In the importance sampling context, the optimal importance distribution, q, should minimise this coefficient. This is achieved by minimising the numerator, $\mathbb{E}(C)\,\mathbb{E}\left(W^2\left(G(\theta) - \mathbb{E}_{\pi_\omega}(G \mid y_0)\right)^2\right)$, that is, by trading off a preference for parameter values with lower computational burden against ensuring small variability in the weighted errors. However, in addition to tuning the proposal mechanism, q, Equation (3) also provides insight into how approximations might be used to directly reduce its leading-order coefficient, based on the identified trade-off between decreasing the expected computational burden and controlling the variance of the weighted error. We will not consider the tuning of q any further in this work, but rather highlight that the trade-off presented here motivates the formulation and optimisation of a generalised multifidelity likelihood-free inference scheme.

3. Multifidelity inference

In Equation (3), the performance of Algorithm 1 is quantified explicitly in terms of how the Monte Carlo error between the estimate, Ĝ, and the approximated posterior mean, $\mathbb{E}_{\pi_\omega}(G \mid y_0)$, decays with increasing computational budget, Ctot. It initially appears reasonable to conclude that the linear dependence of the performance on the expected iteration time, E(C), implies that if we can speed up the simulation step of Algorithm 1, then we can significantly reduce the MSE for a given computational budget.

Suppose that there exists an alternative model that we can use in Algorithm 1 in place of the original model, f (· | θ), such that the expected computation time for each iteration, E(C), is significantly reduced. There are two important issues that prevent this being a viable option for improving the efficiency of likelihood-free inference. The first problem is that we need to be able to quantify the effect of the alternative model on the ratio $\mathbb{E}\left(W^2\left(G(\theta) - \mathbb{E}_{\pi_\omega}(G \mid y_0)\right)^2\right)/\mathbb{E}(W)^2$ to ensure that the overall performance of the algorithm is improved. It is not sufficient to show that the computational burden of each iteration is reduced, since it is possible that substantially more iterations are subsequently required to achieve a specified MSE.

The second problem arises from the observation that the limiting value of Ĝ, as output from Algorithm 1, is $\mathbb{E}_{\pi_\omega}(G \mid y_0)$, with residual bias,

$$\lim_{C_{\text{tot}} \to \infty} \mathbb{E}\left(\left(\hat{G} - \mathbb{E}(G \mid y_0)\right)^2\right) = \left(\mathbb{E}_{\pi_\omega}(G \mid y_0) - \mathbb{E}(G \mid y_0)\right)^2 \ge 0,$$

recalling that $\mathbb{E}_{\pi_\omega}(G \mid y_0)$ is the approximate posterior expectation induced by Lω(θ) = E(ω | θ), and the approximand, E(G | y0), is the posterior expectation induced by the likelihood, ℒ(θ) = f (y0 | θ). We will identify this limiting residual squared bias, $\left(\mathbb{E}_{\pi_\omega}(G \mid y_0) - \mathbb{E}(G \mid y_0)\right)^2$, as the fidelity of the model/likelihood-free weighting pair. We emphasise here that the fidelity depends both on the model and the likelihood-free weighting used in Algorithm 1, and is contextual to the target function, G. For a given posterior mean, E(G | y0), a model and likelihood-free weighting pair for which the value of $\left(\mathbb{E}_{\pi_\omega}(G \mid y_0) - \mathbb{E}(G \mid y_0)\right)^2$ is small will be termed high-fidelity, while pairs with larger values will be termed low-fidelity. Thus, if we use an alternative model in place of f in Algorithm 1, the model (and likelihood-free weighting) may be too low-fidelity, in the sense of having too large a residual squared bias versus the posterior expectation of interest, E(G | y0).

The multifidelity framework overcomes both these problems, by removing the need for a binary choice between the expensive model of interest and its cheaper alternative. Instead, we carry out likelihood-free inference using information from both models. In the sections that follow, we will only consider two levels of model fidelity. However, in Section 6, we discuss possible extensions for a truly multifidelity setting with multiple approximations as our approach need not be restricted to only two models.

3.1. Multifidelity likelihood-free importance sampling

We denote the high-fidelity model and likelihood-free weighting as fhi and ωhi, respectively. The likelihood under the high-fidelity model is denoted ℒhi(θ) = fhi(y0 | θ), and is assumed to be intractable. Following the notation introduced in Equation (1), the high-fidelity pair fhi and ωhi induce the approximate likelihood, $L_{\omega_{\text{hi}}}(\theta) = \mathbb{E}(\omega_{\text{hi}} \mid \theta)$, and the corresponding likelihood-free approximation to the posterior expectation, $\mathbb{E}_{\pi_{\omega_{\text{hi}}}}(G \mid y_0)$. We further assume that simulating each yhi ~ fhi(· | θ) is computationally expensive. This computational expense motivates the use of an approximate, low-fidelity model and likelihood-free weighting, denoted flo and ωlo, respectively, inducing the approximate likelihood, $L_{\omega_{\text{lo}}}(\theta) = \mathbb{E}(\omega_{\text{lo}} \mid \theta)$, and corresponding likelihood-free approximation to the posterior expectation, $\mathbb{E}_{\pi_{\omega_{\text{lo}}}}(G \mid y_0)$. We note that the low-fidelity model, flo, induces its own likelihood, ℒlo(θ) = flo(y0 | θ), and assume that this remains intractable, requiring a simulation-based Bayesian approach. However, we assume that simulations of the low-fidelity model, ylo ~ flo(· | θ), are significantly cheaper to produce than simulations of the high-fidelity model, that is, E(Clo)/E(Chi) ≪ 1, where Clo and Chi denote the random computational time of K simulations from the low-fidelity and high-fidelity model, respectively.

Given the models flo and fhi, we will term the joint distribution fmf(ylo, yhi | θ) a multifidelity model when fmf has marginals equal to the low- and high-fidelity densities, flo(ylo | θ) and fhi(yhi | θ). The models may be conditionally independent, such that fmf(ylo, yhi | θ) = flo(ylo | θ) fhi(yhi | θ), in which case simulations at each model fidelity can be carried out independently given θ. Furthermore, if the simulations are conditionally independent, this means that the resulting likelihood-free weights, ωlo(θ, ylo) and ωhi(θ, yhi), are also conditionally independent. However, in the more general definition of the multifidelity model as a joint distribution, we allow for coupling between the two fidelities. Conditioned on the low-fidelity simulations, ylo, and on parameter values, θ, we can produce a coupled simulation, yhi, from the density fhi(yhi | θ, ylo) implied by fmf(ylo, yhi | θ) = fhi(yhi | θ, ylo) flo(ylo | θ). If appropriately constructed, coupling imposes positive correlations between the resulting likelihood-free weights, ωlo(θ, ylo) and ωhi(θ, yhi), to enable values of ωlo to provide more information about unknown values of ωhi, thereby acting as a variance reduction technique [28]. Implementation of such a variance reduction approach results in a construction related to a randomised multilevel Monte Carlo scheme [29, 30, 15]. It should be noted, however, that the success of our multifidelity scheme does not strictly rely on finding an appropriate coupling mechanism. This is an advantage of our approach since such coupling mechanisms can be challenging to construct.
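The coupling idea can be illustrated with common random numbers: if the low- and high-fidelity simulators consume shared random inputs, their outputs (and hence the weights ωlo and ωhi) become positively correlated. The following Python sketch uses two illustrative stand-in models, not the paper's biochemical example; the specific noise structure is an assumption made purely to exhibit the correlation.

```python
import numpy as np

def coupled_mf_simulate(theta, rng):
    """Couple low- and high-fidelity simulators through shared random inputs
    (common random numbers). Both models below are illustrative stand-ins:
    the 'cheap' model uses one shared noise term, the 'expensive' one uses all."""
    u = rng.standard_normal(4)        # shared noise, drawn once
    y_lo = theta + u[0]               # low-fidelity output
    y_hi = theta + np.sum(u) / 2.0    # high-fidelity output, reuses u[0]
    return y_lo, y_hi

rng = np.random.default_rng(2)
pairs = np.array([coupled_mf_simulate(1.0, rng) for _ in range(2000)])
corr = np.corrcoef(pairs[:, 0], pairs[:, 1])[0, 1]   # positive by construction
```

Here the shared term u[0] induces a population correlation of 0.5 between the two outputs; an uncoupled (conditionally independent) pair would have correlation zero.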

We can calculate a multifidelity likelihood-free weighting as follows. Let M be any non-negative integer-valued random variable, with conditional probability mass function p(· | θ, ylo), and with a positive conditional mean, µ(θ, ylo) = E(M | θ, ylo) > 0. Given a parameter value, θ, we define z = (ylo, yhi,1, yhi,2, …, yhi,m), with ylo ~ flo(· | θ), m ~ p(· | θ, ylo), and yhi,i ~ fhi(· | θ, ylo), noting that each yhi,i may be coupled to the low-fidelity simulation ylo ~ flo(· | θ). We further define the multifidelity likelihood-free weighting function,

$$\omega_{\text{mf}}(\theta, z) = \omega_{\text{lo}}(\theta, y_{\text{lo}}) + \frac{1}{\mu(\theta, y_{\text{lo}})} \sum_{i=1}^{m} \left[\omega_{\text{hi}}(\theta, y_{\text{hi},i}) - \omega_{\text{lo}}(\theta, y_{\text{lo}})\right], \tag{4}$$

as the low-fidelity likelihood-free weighting, corrected by a randomly drawn number, M = m, of conditionally independent high-fidelity likelihood-free weightings. We write the density of z as ϕ(z | θ) and take expectations over z to obtain

$$L_{\omega_{\text{mf}}}(\theta) = \mathbb{E}(\omega_{\text{mf}} \mid \theta) = \int \omega_{\text{mf}}(\theta, z)\,\phi(z \mid \theta)\,\mathrm{d}z \tag{5}$$

as the multifidelity approximation to the likelihood.

Given M = m, only m replicates of yhi,i ~ fhi(· | θ, ylo) need to be simulated for ωmf(θ, z) to be evaluated. Thus, whenever m = 0, this means that no high-fidelity simulations need to be completed for ωmf(θ, z) to be calculated, removing the high-fidelity simulation cost from that iteration. Algorithm 2 presents the adaptation of the basic importance sampling method of Algorithm 1 to incorporate the multifidelity weighting function. The simulation step, y ~ f (· | θ), in Algorithm 1 is replaced by the MF-Simulate function in Algorithm 2.

Algorithm 2 Multifidelity likelihood-free importance sampling.

Require: Prior, π; importance distribution, q; likelihood-free weightings, ωhi and ωlo; models fhi(· | θ) and flo(· | θ); conditional probability mass function p(· | θ, ylo) on non-negative integers with mean function µ(θ, ylo); target function, G.

    for i ∈ [1, 2, …, N] do

        Sample θi ~ q(·);

        Generate zi ~ ϕ(· | θi) from MF-Simulate(θi);

        For ωmf in Equation (4), calculate the weight $w_i = w_{\text{mf}}(\theta_i, z_i) = \frac{\pi(\theta_i)}{q(\theta_i)}\,\omega_{\text{mf}}(\theta_i, z_i)$;

    end for

    Estimate expectation using weighted sum, $\hat{G}_{\text{mf}} = \sum_{i=1}^N w_i G(\theta_i) \Big/ \sum_{j=1}^N w_j$.

    function MF-Simulate(θ)

        Simulate ylo ~ flo(· | θ);

        Generate m ~ p(· | θ, ylo) with mean µ(θ, ylo);

        if m = 0 then

            return z = (ylo);

        else

            for i∈ [1, 2, …, m] do

                Simulate yhi,i ~ fhi(· | θ, ylo);

            end for

            return z = (ylo, yhi,1, …, yhi,m);

        end if

    end function
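The MF-Simulate step and the weighting of Equation (4) can be sketched together in Python (the paper's implementation is in Julia). The toy ABC-style weightings (a loose low-fidelity threshold and a tight high-fidelity one), the Gaussian simulators, and the constant choice µ = 0.3 are all illustrative assumptions; the unbiasedness property itself follows from Equation (4).

```python
import numpy as np

def mf_weight(theta, omega_lo, omega_hi, sim_lo, sim_hi, mu_fn, rng):
    """One multifidelity weight, following Equation (4): simulate the
    low-fidelity model, draw m ~ Poisson(mu(theta, y_lo)), and correct the
    low-fidelity weight with m (possibly zero) high-fidelity simulations."""
    y_lo = sim_lo(theta, rng)
    mu = mu_fn(theta, y_lo)
    m = rng.poisson(mu)
    w_lo = omega_lo(theta, y_lo)
    correction = 0.0
    for _ in range(m):                        # skipped entirely when m == 0
        y_hi = sim_hi(theta, y_lo, rng)       # may be coupled to y_lo
        correction += omega_hi(theta, y_hi) - w_lo
    return w_lo + correction / mu

# Unbiasedness check: E[w_mf | theta] should equal E[omega_hi | theta].
rng = np.random.default_rng(3)
o_lo = lambda th, y: float(abs(y) <= 1.0)     # loose ABC threshold (cheap proxy)
o_hi = lambda th, y: float(abs(y) <= 0.5)     # tight ABC threshold
s_lo = lambda th, r: r.normal(th, 1.0)
s_hi = lambda th, y_lo, r: r.normal(th, 1.0)  # conditionally independent here
ws = [mf_weight(0.0, o_lo, o_hi, s_lo, s_hi, lambda th, y: 0.3, rng)
      for _ in range(40000)]
```

With µ = 0.3, roughly three quarters of iterations draw m = 0 and avoid the high-fidelity simulation entirely, yet the average weight still matches the high-fidelity target P(|N(0,1)| ≤ 0.5) ≈ 0.383.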

3.2. Accuracy of multifidelity inference

We observe that using fhi and ωhi in Algorithm 1 produces an estimate of the high-fidelity approximate posterior expectation, $\mathbb{E}_{\pi_{\omega_{\text{hi}}}}(G \mid y_0)$. We show (Section 4, Proposition 3) that the multifidelity approximate likelihood, $L_{\omega_{\text{mf}}}(\theta) = \mathbb{E}(\omega_{\text{mf}}(\theta, z) \mid \theta)$, is equal to the high-fidelity approximate likelihood, $L_{\omega_{\text{hi}}}(\theta) = \mathbb{E}(\omega_{\text{hi}}(\theta, y_{\text{hi}}) \mid \theta)$. As a result, Algorithm 2 also produces a consistent estimate of the high-fidelity approximate posterior expectation, $\mathbb{E}_{\pi_{\omega_{\text{hi}}}}(G \mid y_0)$.

In the limit of infinite computational budgets, the estimate produced by multifidelity importance sampling in Algorithm 2 is as accurate as the estimate produced by high-fidelity importance sampling in Algorithm 1 using fhi and ωhi. However, we still need to show that the performance of Algorithm 2 exceeds that of Algorithm 1 in the practical context of limited computational budgets. In Section 3.3, we introduce a method to quantify the performance of Algorithms 1 and 2 and show that the performance of multifidelity inference is strongly determined by the distribution of M, the random number of high-fidelity simulations required at each iteration.

3.3. Comparing performance

Equation (3) gives the leading-order behaviour of the MSE for Algorithm 1 as the computational budget increases. A similar result applies to the output of Algorithm 2. We compare two settings: first, using Algorithm 1 with the high-fidelity model; and second, using Algorithm 2 with the multifidelity model. In the first setting we have the model, fhi, and likelihood-free weighting, ωhi. Each iteration has computational cost denoted Chi, and produces a weighted Monte Carlo sample with weights wi as independent draws of the random variable Whi. The output of Algorithm 1 is denoted Ĝhi. The MSE for Algorithm 1 has leading-order behaviour

$$\mathbb{E}\left(\left(\hat{G}_{\text{hi}} - \mathbb{E}_{\pi_{\omega_{\text{hi}}}}(G \mid y_0)\right)^2\right) = \left[\mathbb{E}(C_{\text{hi}})\,\frac{\mathbb{E}\left(W_{\text{hi}}^2\left(G(\theta) - \mathbb{E}_{\pi_{\omega_{\text{hi}}}}(G \mid y_0)\right)^2\right)}{\mathbb{E}(W_{\text{hi}})^2}\right] \frac{1}{C_{\text{tot}}} + O\!\left(\frac{1}{C_{\text{tot}}^2}\right), \tag{6}$$

as the total simulation budget Ctot → ∞. In the second setting, we similarly use Algorithm 2 with the multifidelity model fmf and likelihood-free weighting, ωmf. Each iteration has computational cost denoted Cmf, and produces a weighted Monte Carlo sample with weights wi as independent draws of the random variable Wmf. The output of Algorithm 2 is Ĝmf. The MSE for Algorithm 2 has leading-order behaviour

$$\mathbb{E}\left(\left(\hat{G}_{\text{mf}} - \mathbb{E}_{\pi_{\omega_{\text{mf}}}}(G \mid y_0)\right)^2\right) = \left[\mathbb{E}(C_{\text{mf}})\,\frac{\mathbb{E}\left(W_{\text{mf}}^2\left(G(\theta) - \mathbb{E}_{\pi_{\omega_{\text{mf}}}}(G \mid y_0)\right)^2\right)}{\mathbb{E}(W_{\text{mf}})^2}\right] \frac{1}{C_{\text{tot}}} + O\!\left(\frac{1}{C_{\text{tot}}^2}\right), \tag{7}$$

as the total simulation budget Ctot → ∞. Thus the main task is to determine the conditions when

$$\mathbb{E}\left(\left(\hat{G}_{\text{mf}} - \mathbb{E}_{\pi_{\omega_{\text{mf}}}}(G \mid y_0)\right)^2\right) < \mathbb{E}\left(\left(\hat{G}_{\text{hi}} - \mathbb{E}_{\pi_{\omega_{\text{hi}}}}(G \mid y_0)\right)^2\right)$$

for the same value of Ctot.

The main result of the paper (Section 4, Theorem 4) provides such conditions and enables the construction of performance metrics. The performance metrics 𝒥hi and 𝒥mf[µ], for Algorithm 1 and Algorithm 2, respectively, are given by

$$\mathcal{J}_{\text{hi}} = \underbrace{\mathbb{E}(C_{\text{hi}})}_{\text{high-fidelity cost}} \times \underbrace{\mathbb{E}\left(\Delta_q(\theta)^2\, \mathbb{E}\left(\omega_{\text{hi}}^2 \mid \theta\right)\right)}_{\text{high-fidelity variance}}, \tag{8}$$
$$\mathcal{J}_{\text{mf}}[\mu] = \underbrace{\left(\mathbb{E}(C_{\text{lo}}) + \mathbb{E}_\rho\left(\mu(\theta, y_{\text{lo}})\, \mathbb{E}(C_{\text{hi}} \mid \theta, y_{\text{lo}})\right)\right)}_{\text{multifidelity cost}} \times \underbrace{\left(\mathbb{E}\left(\Delta_q(\theta)^2\, \mathbb{E}\left(\mathbb{E}(\omega_{\text{hi}} \mid \theta, y_{\text{lo}})^2 \mid \theta\right)\right) + \mathbb{E}_\rho\left(\frac{\Delta_q(\theta)^2}{\mu(\theta, y_{\text{lo}})}\, \mathbb{E}\left((\omega_{\text{hi}} - \omega_{\text{lo}})^2 \mid \theta, y_{\text{lo}}\right)\right)\right)}_{\text{multifidelity variance}}, \tag{9}$$

where M is a Poisson distributed random variable with mean µ(θ, ylo), $\Delta_q(\theta) = \left(G(\theta) - \mathbb{E}_{\pi_{\omega_{\text{hi}}}}(G \mid y_0)\right)\pi(\theta)/q(\theta)$ is the importance weighted error, and ρ(θ, ylo) = flo(ylo | θ)q(θ) is the joint density of the low-fidelity simulator output and the parameters under the importance distribution. Effectively, both metrics are reformulations of the numerators of the leading-order terms in Equations (6) and (7), respectively.

The condition 𝒥mf[µ] < 𝒥hi is shown to be a necessary and sufficient condition for Algorithm 2 outperforming Algorithm 1 for the same computational budget (Theorem 4). Importantly, the performance depends explicitly on the free choice of the function µ(θ, ylo) that determines the conditional mean of the Poisson-distributed number of high-fidelity simulations required at each iteration. We observe from the first factor in Equation (9) that, when µ is smaller, the total simulation cost is lower. However, the second factor of Equation (9) implies that, as µ decreases, the variability of the likelihood-free weighting can increase without bound, which can severely damage the performance. Thus, Equation (9) illustrates the characteristic multifidelity trade-off between reducing the simulation burden while also controlling the increase in sample variance. In the results that follow, we continue to assume that M is a Poisson distributed random variable; considerations for binomial and geometric distributions are given in Appendix B.

Using classical results from the calculus of variations, it is possible to determine the mean function that achieves optimal performance of Algorithm 2, in the sense of minimising the functional 𝒥mf[µ]. This optimal function, µ*, is given by

$$\mu^*(\theta,y_{\mathrm{lo}})^2 = \frac{E(C_{\mathrm{lo}})\,\Delta_q(\theta)^2\,E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big)}{E(C_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})\,E\!\left(\Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\right)}. \tag{10}$$

This result is derived in Section 4 (Lemma 5). Note that as $E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big)$ increases, µ* increases. This means that if the expected error between the likelihood-free weightings is large, then the high-fidelity model needs to be simulated more frequently to correct for this. Conversely, when $E(C_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})$ is larger, the greater simulation time of the high-fidelity model means that µ* should be smaller, reducing the requirement for the most expensive simulations. Intuitively, µ* acts to balance the trade-off between controlling simulation cost and variance identified in Equation (9) above.
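The trade-off encoded in Equation (9) and the optimiser in Equation (10) can be sanity-checked numerically by collapsing all of the expectations to scalar constants. The values of `c_lo`, `c_hi`, `V_mf` and `eta` below are invented purely for illustration:

```python
import numpy as np

# Illustrative scalar stand-ins for the expectations in Equation (9):
c_lo = 0.01   # E(C_lo): low-fidelity cost
c_hi = 1.0    # E(C_hi | theta, y_lo): coupled high-fidelity cost
V_mf = 1.0    # baseline multifidelity variance term
eta  = 0.04   # Delta_q^2 * E((w_hi - w_lo)^2 | theta, y_lo): weighting discrepancy

def J_mf(mu):
    """Scalar analogue of Equation (9): (cost) x (variance)."""
    return (c_lo + mu * c_hi) * (V_mf + eta / mu)

# Scalar analogue of Equation (10): mu*^2 = c_lo * eta / (c_hi * V_mf)
mu_star = np.sqrt(c_lo * eta / (c_hi * V_mf))

# Confirm against a fine grid search over mu
grid = np.linspace(1e-4, 1.0, 200_000)
mu_grid = grid[np.argmin(J_mf(grid))]
print(mu_star, mu_grid)   # both near mu* = 0.02
```

Decreasing µ below µ* saves cost but inflates the η/µ variance term faster than the cost falls; increasing it does the reverse, exactly the balance discussed above.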

Of course, it need not be the case that the optimal mean function, µ*, in Equation (10) leads to 𝒥mf[µ*] < 𝒥hi. For this to occur we require (see Corollary 6)

$$\left(\frac{E(C_{\mathrm{lo}})}{E(C_{\mathrm{hi}})}\,\frac{E\!\left(\Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\right)}{E\!\left(\Delta_q(\theta)^2\,E(\omega_{\mathrm{hi}}^2 \mid \theta)\right)}\right)^{1/2} + E_{\rho}\!\left(\left(\frac{\Delta_q(\theta)^2\,E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big)}{E\!\left(\Delta_q(\theta)^2\,E(\omega_{\mathrm{hi}}^2 \mid \theta)\right)}\,\frac{E(C_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})}{E(C_{\mathrm{hi}})}\right)^{1/2}\right) < 1. \tag{11}$$

The first term in Equation (11) is small under the key assumption that the average computational cost of the low-fidelity model is much smaller than that of the high-fidelity model, $E(C_{\mathrm{lo}})/E(C_{\mathrm{hi}}) \ll 1$. The second term is a measure of the total detriment to the performance of Algorithm 2 incurred by the inaccuracy of ωlo, relative to ωhi, as a Monte Carlo estimate of $L_{\omega_{\mathrm{hi}}}$, as quantified by $E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big)$. This condition justifies two key criteria for the success of the multifidelity method: that low-fidelity simulations are significantly cheaper than high-fidelity simulations, and that the likelihood-free weightings, ωhi and ωlo, agree sufficiently well on average. More detailed analysis of necessary and sufficient conditions for the existence of a suitable µ is given in Section 4, following the proof of Corollary 6.

These key results establish not only the asymptotic correctness of the multifidelity approach, but also the situations in which a performance improvement can be expected. The details of the analysis can be found in the proofs of Theorem 4, Lemma 5 and Corollary 6, given in Section 4. However, these analytical results are only useful insofar as the various expectations in Equations (8) and (9) are known. Typically these expectations are unknown a priori and must be estimated. In the following section, we describe how the analytical results of Section 3.3 can be used to construct a heuristic for adaptive multifidelity inference that learns a near-optimal mean function, µ(θ, ylo), as simulations at each fidelity are completed.

3.4. Practical implementation

In Section 3.3, we derived the optimal mean function for the Poisson distribution of the number of high-fidelity simulations, M, generated in an iteration of Algorithm 2, conditioned on the parameter value, θ, and low-fidelity simulation, ylo. In this section, we describe a practical approach to determining a near-optimal mean function for use in multifidelity likelihood-free inference. We rely on two approximations, relative to the analytically optimal mean function, µ*, given in Equation (10). First, we constrain the optimisation of 𝒥mf to the space of functions µ𝒟 that are piecewise constant on an arbitrary, given, finite partition, 𝒟, of the global space of (θ, ylo) values. The resulting optimisation problem is therefore finite-dimensional. Although this optimisation can be solved analytically, its estimation, being based on ratios of simulation-based Monte Carlo estimates, is numerically unstable. This motivates a second approximation, which is to follow a gradient-descent approach that allows the mean function to adaptively converge towards the optimum.

We constrain the space of mean functions, µ, to be piecewise constant. Consider an arbitrary, given collection 𝒟 = {Dk | k = 1, …, K} of ρ-integrable sets that partition the global space of (θ, ylo) values. We denote a 𝒟-piecewise constant function, parameterised by the vector ν = (ν1, …, νK), as

$$\mu_{\mathcal{D}}(\theta,y_{\mathrm{lo}};\nu) = \sum_{k=1}^{K} \nu_k\,\mathbb{I}\big((\theta,y_{\mathrm{lo}})\in D_k\big).$$

Substituting this function into Equation (9), we can quantify the performance of Algorithm 2, using the mean defined by µ𝒟(θ, ylo; ν), as the parameterised product

$$\mathcal{J}_{\mathcal{D}}(\nu) = \left(E(C_{\mathrm{lo}}) + \sum_{k=1}^{K}\nu_k C_k\right)\left(E\!\left(\Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\right) + \sum_{k=1}^{K}\frac{V_k}{\nu_k}\right), \tag{12}$$

where, for convenience in this derivation, we denote

$$C_k = E_{\rho}\Big(E(C_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})\,\mathbb{I}\big((\theta,y_{\mathrm{lo}})\in D_k\big)\Big), \qquad V_k = E_{\rho}\Big(\Delta_q(\theta)^2\,E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big)\,\mathbb{I}\big((\theta,y_{\mathrm{lo}})\in D_k\big)\Big),$$

for k = 1, 2, …, K.

Just as in the case of the general µ*, we can optimise the function 𝒥𝒟(ν) across positive vectors, ν, to obtain the result

$$\left(\nu_k^*\right)^2 = \frac{V_k\,E(C_{\mathrm{lo}})}{C_k\,E\!\left(\Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\right)}, \quad \text{for } k = 1, 2, \ldots, K. \tag{13}$$

This result is more convenient, since all the terms on the right-hand side are constants. However, we still need to estimate these expectations, as they are unknown a priori. The standard approach would be to perform trial samples and estimate these values with Monte Carlo. This can be problematic in practice due to the rational form of ν_k*, which means that such estimates can be unstable, particularly for sets Dk ∈ 𝒟 with small volume under ρ.

We now consider a conservative approach to determining values for ν that provides stable estimates of ν*. Rather than directly targeting ν* through ratios of highly variable Monte Carlo estimates, we introduce a gradient-descent approach to updating the vector ν. Taking derivatives of 𝒥𝒟 with respect to log νk for k = 1, …, K gives the gradient

$$\frac{\partial \mathcal{J}_{\mathcal{D}}}{\partial \log \nu_k} = \nu_k C_k\left(E\!\left(\Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\right) + \sum_{j=1}^{K}\frac{V_j}{\nu_j}\right) - \frac{V_k}{\nu_k}\left(E(C_{\mathrm{lo}}) + \sum_{j=1}^{K}C_j\nu_j\right).$$

Thus, if we write ν^{(r)} for the value of ν used in iteration r of Algorithm 2, we update to ν^{(r+1)} in the next iteration using gradient descent, such that

$$\log \nu_k^{(r+1)} = \log \nu_k^{(r)} - \delta\left[\nu_k^{(r)} C_k\left(E\!\left(\Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\right) + \sum_{j=1}^{K}\frac{V_j}{\nu_j^{(r)}}\right) - \frac{V_k}{\nu_k^{(r)}}\left(E(C_{\mathrm{lo}}) + \sum_{j=1}^{K}C_j\nu_j^{(r)}\right)\right]. \tag{14}$$

Note that we express this updating rule in terms of $\log \nu_k^{(r)}$ to ensure that each $\nu_k^{(r)}$ remains positive, since the updates to ν^{(r)} are then multiplicative. As is typical of gradient-descent approaches, Equation (14) requires the specification of a step-size hyperparameter, δ. It is straightforward to show that ν* is the unique positive stationary point of Equation (14). Furthermore, since each derivative ∂𝒥𝒟/∂ log νk is quadratic in the expectations, the numerical instability that arises from estimating Equation (13) as a ratio does not occur. In relatively under-sampled regions Dk ∈ 𝒟 with small ρ-volume, the small values of Ck and Vk ensure that convergence to the corresponding estimated optimal value, ν_k*, is more conservative.
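The multiplicative update in Equation (14) is easy to prototype when the expectations Ck, Vk, E(Clo) and the variance term are held at fixed values (the constants below are invented for illustration); the iteration then converges to the optimum given by Equation (13). A minimal sketch with K = 2:

```python
import numpy as np

# Fixed stand-ins for the (in practice unknown) expectations, K = 2 sets:
c_lo, V_mf = 0.1, 1.0
C = np.array([1.0, 2.0])   # C_k: expected high-fidelity cost contribution in D_k
V = np.array([0.4, 0.1])   # V_k: weighting-discrepancy contribution in D_k

log_nu = np.zeros(2)       # initialise log nu_k = 0, as in Algorithm 3
delta = 0.02               # step-size hyperparameter

for _ in range(100_000):
    nu = np.exp(log_nu)
    cost = c_lo + np.sum(C * nu)            # first factor of Equation (12)
    var = V_mf + np.sum(V / nu)             # second factor of Equation (12)
    grad = nu * C * var - (V / nu) * cost   # dJ_D / d(log nu_k)
    log_nu -= delta * grad                  # the update in Equation (14)

nu_opt = np.sqrt(V * c_lo / (C * V_mf))     # stationary point, Equation (13)
print(np.exp(log_nu), nu_opt)
```

Working in log-space keeps every ν_k positive regardless of the step size, which is exactly the motivation given above.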

We now explicitly set out the Monte Carlo estimates of the unknown expectations, given r iterations of the multifidelity inference scheme (Algorithm 2). First, the cost-related expectations are estimated using

$$E(C_{\mathrm{lo}}) \approx \hat{C}_{\mathrm{lo}}^{(r)} = \frac{1}{r}\sum_{i=1}^{r} c_{\mathrm{lo},i}, \tag{15}$$

and for k = 1, 2, …, K,

$$C_k \approx \hat{C}_k^{(r)} = \frac{1}{r}\sum_{i=1}^{r} \mathbb{I}\big((\theta_i,y_{\mathrm{lo},i})\in D_k\big)\,\frac{1}{\mu_i}\sum_{j=1}^{m_i} c_{\mathrm{hi},i,j}, \tag{16}$$

where, for iteration i, c_lo,i is the observed simulation cost of ylo,i given parameters θi drawn from q(θ), and c_hi,i,j is the observed simulation cost of yhi,i,j for j = 1, …, mi, with mi a random draw from the distribution of M having mean µi. The estimators for the variance-related terms are more complex:

$$E\!\left(\Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\right) \approx \hat{V}_{\mathrm{mf}}^{(r)} = \frac{1}{r}\sum_{i=1}^{r}\left(\frac{\Delta_i^{(r)}}{\mu_i}\right)^2\left[\left(\sum_{j=1}^{m_i}\omega_{\mathrm{hi},i,j}\right)^2 - \sum_{j=1}^{m_i}\omega_{\mathrm{hi},i,j}^2\right], \tag{17}$$

and for k = 1, 2, …, K,

$$V_k \approx \hat{V}_k^{(r)} = \frac{1}{r}\sum_{i=1}^{r}\mathbb{I}_{D_k}(\theta_i,y_{\mathrm{lo},i})\,\frac{1}{\mu_i}\sum_{j=1}^{m_i}\left(\Delta_i^{(r)}\big(\omega_{\mathrm{hi},i,j}-\omega_{\mathrm{lo},i}\big)\right)^2, \tag{18}$$

where $\omega_{\mathrm{lo},i} = \omega_{\mathrm{lo}}(\theta_i, y_{\mathrm{lo},i})$, $\omega_{\mathrm{hi},i,j} = \omega_{\mathrm{hi}}(\theta_i, y_{\mathrm{hi},i,j})$ for j = 1, …, mi, and $\Delta_i^{(r)} = \big[G(\theta_i) - \hat{G}_{\mathrm{hi}}^{(r)}\big]\,\pi(\theta_i)/q(\theta_i)$, with $\hat{G}_{\mathrm{hi}}^{(r)}$ the current Monte Carlo estimate of the target $E_{\pi\omega_{\mathrm{hi}}}(G \mid y_0)$.
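The de-biasing in Equation (17) relies on the identity $E\big[(\sum_{j\le M}\omega_j)^2 - \sum_{j\le M}\omega_j^2\big] = E[M(M-1)]\,\lambda^2 = \mu^2\lambda^2$ for M ~ Poi(µ) and conditionally i.i.d. weightings with mean λ, so that dividing by µ² estimates λ² without bias even when each iteration yields only a handful of high-fidelity weightings. A quick simulation check with invented values µ = 2 and indicator weightings of mean λ = 0.6:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, lam = 2.0, 0.6
n_iter = 200_000

m = rng.poisson(mu, size=n_iter)    # M_i ~ Poi(mu)
s = rng.binomial(m, lam)            # sum of m_i Bernoulli(lam) weightings
# For {0,1}-valued weightings, sum_j w_j^2 = sum_j w_j = s, so the bracket
# in Equation (17) is s^2 - s; dividing by mu^2 de-biases towards lam^2.
est = np.mean((s * s - s) / mu**2)
print(est)   # close to lam**2 = 0.36
```

Note that the naive per-iteration estimate (s/m)² would be biased (and undefined when m = 0), which is why Equation (17) takes this particular form.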

These estimates can then be substituted into Equation (14) to produce an updating rule for $\nu_k^{(r)}$, that is,

$$\log \nu_k^{(r+1)} = \log \nu_k^{(r)} - \delta\left[\nu_k^{(r)}\hat{C}_k^{(r)}\left(\hat{V}_{\mathrm{mf}}^{(r)} + \sum_{j=1}^{K}\frac{\hat{V}_j^{(r)}}{\nu_j^{(r)}}\right) - \frac{\hat{V}_k^{(r)}}{\nu_k^{(r)}}\left(\hat{C}_{\mathrm{lo}}^{(r)} + \sum_{j=1}^{K}\hat{C}_j^{(r)}\nu_j^{(r)}\right)\right], \tag{19}$$

leading to an adaptive multifidelity likelihood-free importance sampling scheme, as outlined in Algorithm 3.

In addition to the specification of the step-size hyperparameter, δ, Algorithm 3 also requires a burn-in phase of N0 iterations to initialise the Monte Carlo estimates in Equations (15)–(18). The partition, 𝒟 = {D1, …, DK}, is also an input to Algorithm 3. We defer an investigation of how to choose this partition to future work. For the purposes of this paper, however, we heuristically construct a partition, 𝒟, by fitting a decision tree. We use the burn-in phase of Algorithm 3, over iterations i ≤ N0, and regress the values of

$$\mu_i = \left|\Delta_i^{(N_0)}\right| \sqrt{\frac{\sum_j \big(\omega_{\mathrm{hi},i,j}-\omega_{\mathrm{lo},i}\big)^2}{\sum_j c_{\mathrm{hi},i,j}}},$$

against the features (θi, ylo,i), using the CART algorithm [31] as implemented in DecisionTrees.jl. This regression is motivated by the form of the true optimal mean function, µ*, given in Equation (10). The resulting decision tree defines a partition, 𝒟 = {D1, …, DK}, used to define the piecewise-constant mean function µ𝒟(θ, ylo; ν) over iterations i > N0.

Algorithm 3 Adaptive multifidelity likelihood-free importance sampling.

Require: Prior, π; importance distribution, q; likelihood-free weightings, ωhi and ωlo; models fhi(· | θ) and flo(· | θ); partition 𝒟 = {D1, …, DK} of (θ, ylo) space; adaptation rate, δ; burn-in period, N0; target estimated function, G.

Initialise $\log \nu_k^{(1)} = 0$ for k = 1, …, K.

for i ∈ [1, 2, …, N] do

      Sample θi ~ q(·);

      Generate zi ~ ϕ(· | θi) from MF-Simulate(θi, ν^{(i)});

      For ωmf in Equation (4), calculate the weight $w_i = w_{\mathrm{mf}}(\theta_i, z_i) = \frac{\pi(\theta_i)}{q(\theta_i)}\,\omega_{\mathrm{mf}}(\theta_i, z_i)$;

      if i > N0 then

            Generate ν(i+1) from Update-Nu(ν(i));

      else

            Set ν(i+1) = ν(i);

      end if

end for

Estimate the expectation using the weighted sum, $\hat{G}_{\mathrm{mf}} = \sum_{i=1}^{N} w_i G(\theta_i) \Big/ \sum_{j=1}^{N} w_j$.

function MF-Simulate(θ, ν)

      Simulate ylo ~ flo(· | θ);

      Find k such that (θ, ylo) ∈ Dk;

      Generate m ~ Poi(νk);

      for i = 1, …, m do

            Simulate yhi,i ~ fhi(· | θ, ylo);

      end for

      return z = (ylo, m, yhi,1, …, yhi,m)

end function

function Update-Nu(ν1, …, νK)

      Update the Monte Carlo estimates defined in Equations (15)–(18);

      for k = 1, …, K do

            Increment log νk according to Equation (19);

      end for

      return ν = (ν1, …, νK)

end function
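To make the inner loop concrete, the following sketch implements MF-Simulate and the multifidelity weighting $\omega_{\mathrm{mf}} = \omega_{\mathrm{lo}} + \frac{1}{\nu}\sum_j(\omega_{\mathrm{hi},j} - \omega_{\mathrm{lo}})$ of Equation (4) for a toy Gaussian pair of simulators with a constant mean ν (i.e. a single partition set). The models, thresholds and ν value are invented for illustration; the point is that averaging ωmf recovers the same value as averaging ωhi directly, as guaranteed by Proposition 3:

```python
import numpy as np

rng = np.random.default_rng(42)

def omega(y):
    return float(abs(y) < 1.0)      # indicator likelihood-free weighting

def mf_simulate(nu):
    """One draw of z = (y_lo, m, y_hi_1..m) and its multifidelity weighting."""
    y_lo = rng.normal()                        # toy low-fidelity simulation
    m = rng.poisson(nu)                        # M ~ Poi(nu) high-fidelity draws
    y_hi = y_lo + 0.1 * rng.normal(size=m)     # coupled high-fidelity draws
    w_lo = omega(y_lo)
    # Equation (4): w_mf = w_lo + (1/nu) * sum_j (w_hi_j - w_lo)
    return w_lo + sum(omega(y) - w_lo for y in y_hi) / nu

N = 200_000
w_mf_mean = np.mean([mf_simulate(nu=0.5) for _ in range(N)])

# Direct high-fidelity Monte Carlo for comparison
y = rng.normal(size=N) + 0.1 * rng.normal(size=N)
w_hi_mean = np.mean(np.abs(y) < 1.0)
print(w_mf_mean, w_hi_mean)   # both approximate the same quantity
```

With ν = 0.5, most iterations require no high-fidelity simulation at all, yet the occasional (1/ν)-weighted corrections keep the estimate consistent with the high-fidelity target.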

4. Analysis of likelihood-free and multifidelity importance sampling

In this section, we detail the theoretical analysis of the general likelihood-free importance sampling scheme and the multifidelity schemes. The results in this section are referred to throughout Sections 2 and 3; however, the notation in this section may differ slightly to keep the analysis concise. Throughout, the relation f(x) ≼ g(x) is taken to mean that there exists a constant, c, such that f(x) ≤ c g(x) as x → ∞. Likewise, f(x) ≺ g(x) is taken to mean that there exists a constant, c, such that f(x) < c g(x) as x → ∞.

The first result, in Theorem 1, establishes the consistency of the general likelihood-free importance sampling scheme (Algorithm 1) and in doing so derives the convergence rate in MSE. For notational simplicity, we define the function $\Delta(\theta) = G(\theta) - E_{\pi\omega}(G \mid y_0)$ to recentre the approximate posterior mean.

Theorem 1

For the weighted sample values (θi, wi) produced in each iteration of Algorithm 1, let W denote the random value of the weight wi, and let θ denote the random value of θi. The mean squared error (MSE) of the estimator, Ĝ, is given to leading order by

$$E\!\left(\left(\hat{G} - E_{\pi\omega}(G \mid y_0)\right)^2\right) \preceq \left[\frac{E(W^2\Delta^2)}{E(W)^2}\right]\frac{1}{N},$$

as N → ∞. Thus, Ĝ is a consistent estimator of Eπω(G|y0).

Proof. The Monte Carlo estimate produced by Algorithm 1, Ĝ = R/S, is the ratio of two random variables: the weighted sum, $R = \sum_{i=1}^{N} w(\theta_i, y_i)\,G(\theta_i)$, and the normalising sum, $S = \sum_{i=1}^{N} w(\theta_i, y_i)$. We write the function $\Phi(r,s) = \big(r/s - E_{\pi\omega}(G \mid y_0)\big)^2$, and note that the MSE is the expected value of $\big(\hat{G} - E_{\pi\omega}(G \mid y_0)\big)^2 = \Phi(R,S)$. Using the delta method, we take expectations of the second-order Taylor expansion of Φ(R, S) about (µR, µS) = (E(R), E(S)), to give

$$E(\Phi(R,S)) = \Phi(\mu_R,\mu_S) + \frac{1}{\mu_S^2}\left[\mathrm{Var}(R) + \left(2 E_{\pi\omega}(G \mid y_0) - \frac{4\mu_R}{\mu_S}\right)\mathrm{Cov}(R,S) + \left(\frac{3\mu_R - 2 E_{\pi\omega}(G \mid y_0)\,\mu_S}{\mu_S^2}\right)\mu_R\,\mathrm{Var}(S)\right] + O\!\left(\frac{E\big(((R-\mu_R)+(S-\mu_S))^3\big)}{\mu_S^3}\right).$$

Taking expectations with respect to N independent draws of (θ, y) with density f (y | θ)q(θ), it is straightforward to write

$$\mu_R = E(R) = N\,E(WG) = N Z_\omega\,E_{\pi\omega}(G \mid y_0), \qquad \mu_S = E(S) = N\,E(W) = N Z_\omega,$$

where we recall that $E(W) = Z_\omega = \int L_\omega(\theta)\,\pi(\theta)\,\mathrm{d}\theta$ is the normalising constant in Section 2.1. We substitute these expectations into the Taylor expansion of $E\big((\hat{G} - E_{\pi\omega}(G \mid y_0))^2\big)$, noting that the leading-order term, Φ(µR, µS), is zero. Thus, we can write the dominant behaviour of the MSE as

$$\begin{aligned}E\!\left(\left(\hat{G} - E_{\pi\omega}(G \mid y_0)\right)^2\right) &= \frac{1}{N^2 Z_\omega^2}\Big[\mathrm{Var}(R) - 2\,E_{\pi\omega}(G \mid y_0)\,\mathrm{Cov}(R,S) + E_{\pi\omega}(G \mid y_0)^2\,\mathrm{Var}(S)\Big] + O\!\left(\frac{E\big(((R-\mu_R)+(S-\mu_S))^3\big)}{N^3}\right)\\ &= \frac{1}{N^2 Z_\omega^2}\,\mathrm{Var}\big(R - E_{\pi\omega}(G \mid y_0)\,S\big) + O\!\left(\frac{E\big(((R-\mu_R)+(S-\mu_S))^3\big)}{N^3}\right),\end{aligned}$$

as N → ∞. Substituting into this expression the definitions of R and S as sums of N independent, identically distributed random variables, we have

$$E\!\left(\left(\hat{G} - E_{\pi\omega}(G \mid y_0)\right)^2\right) = \frac{1}{N^2 Z_\omega^2}\,N\,\mathrm{Var}(W\Delta) + O\!\left(\frac{N}{N^3}\right),$$

and Equation (2) follows, on noting that E(WΔ) = 0 and that E(W) = Zω.
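The 1/N behaviour established in Theorem 1 can be observed directly for a toy target where the posterior expectation is known. In the sketch below (an invented illustration, not the paper's example), the likelihood-free weighting is trivial (ω ≡ 1) and the self-normalised estimator targets E_π(θ²) = 1 for π = N(0, 1) under an over-dispersed importance distribution q = N(0, 2²):

```python
import numpy as np

rng = np.random.default_rng(7)

def g_hat(N):
    """Self-normalised importance sampling estimate of E_pi(theta^2) = 1,
    with pi = N(0,1), q = N(0,2^2) and trivial weighting omega = 1."""
    theta = 2.0 * rng.normal(size=N)         # theta_i ~ q
    w = np.exp(-3.0 * theta**2 / 8.0)        # pi/q up to a constant factor
    return np.sum(w * theta**2) / np.sum(w)  # G-hat = R / S

reps = 40
mse_small = np.mean([(g_hat(2_000) - 1.0) ** 2 for _ in range(reps)])
mse_large = np.mean([(g_hat(32_000) - 1.0) ** 2 for _ in range(reps)])
print(mse_small, mse_large)   # empirical MSE shrinks roughly like 1/N
```

The constant factor of π/q cancels in the ratio R/S, mirroring the cancellation of Z_ω in the proof above.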

The consistency and MSE convergence rate in terms of the sample size, N, are trivially recast in Corollary 2 in terms of a computational budget, Ctot. This form is particularly important in the context of the performance comparison with the multifidelity schemes presented later.

Corollary 2

Let the computational cost of each iteration of Algorithm 1 be denoted by the random variable C. The leading order behaviour of the MSE of Ĝ as an estimate of Eπω(G|y0) is

$$E\!\left(\left(\hat{G} - E_{\pi\omega}(G \mid y_0)\right)^2\right) \preceq \left[E(C)\,\frac{E(W^2\Delta^2)}{E(W)^2}\right]\frac{1}{C_{\mathrm{tot}}},$$

as Ctot → ∞.

Proof. As the given computational budget increases, Ctot → ∞, the Monte Carlo sample size that can be produced with that budget increases on the order of N ~ Ctot/E(C). On substituting this expression into Equation (2), the result follows.

Given that the multifidelity scheme in Algorithm 2 is a special case of Algorithm 1, we know via Theorem 1 that the estimator is consistent with respect to $E_{\pi\omega_{\mathrm{mf}}}(G \mid y_0)$. Proposition 3 further establishes that in the multifidelity case we also have $E_{\pi\omega_{\mathrm{mf}}}(G \mid y_0) = E_{\pi\omega_{\mathrm{hi}}}(G \mid y_0)$.

Proposition 3

The multifidelity approximation to the likelihood, Lmf(θ) = E(ωmf | θ), is equal to the high-fidelity approximation to the likelihood, Lhi(θ) = E(ωhi | θ). Therefore, the estimate Ĝmf produced by Algorithm 2 is a consistent estimate of the high-fidelity approximate posterior expectation, Eπωhi(G|y0).

Proof. We take the expectation of ωmf (Equation (4)) conditional on (θ, ylo, m), to find

$$E(\omega_{\mathrm{mf}} \mid \theta, y_{\mathrm{lo}}, M = m) = \left(1 - \frac{m}{\mu}\right)\omega_{\mathrm{lo}}(\theta, y_{\mathrm{lo}}) + \frac{m}{\mu}\,E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}}).$$

Further taking expectations over the random integer M, which has conditional expected value µ(θ, ylo), gives

$$E(\omega_{\mathrm{mf}} \mid \theta, y_{\mathrm{lo}}) = E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}}).$$

Further taking expectations with respect to ylo, it follows that $L_{\omega_{\mathrm{mf}}}(\theta) = L_{\omega_{\mathrm{hi}}}(\theta)$. Therefore, the likelihood-free approximate posteriors are equal, $\pi_{\omega_{\mathrm{mf}}} = \pi_{\omega_{\mathrm{hi}}}$, and thus Ĝmf is a consistent estimate of $E_{\pi\omega_{\mathrm{mf}}}(G \mid y_0) = E_{\pi\omega_{\mathrm{hi}}}(G \mid y_0)$, as required.

We now arrive at the most important result of this work, which determines the necessary and sufficient conditions under which the multifidelity scheme out-performs direct likelihood-free importance sampling targeting the high-fidelity weighting. For concreteness, Theorem 4 assumes that the random variable M, which determines the required number of high-fidelity simulations in each iteration of Algorithm 2, is Poisson distributed, conditional on the parameter value and low-fidelity simulation output. However, it is important to note that only minor alterations to the form of the performance metrics arise under other parametric assumptions on M; binomial and geometric alternatives are explored in Appendix B. We show that, for a given multifidelity model and likelihood-free weightings, the mean function, µ, for M determines the performance of Algorithm 2 relative to Algorithm 1.

Theorem 4

Assume that the random number of high-fidelity simulations, M, required in each iteration of Algorithm 2 is Poisson distributed with conditional mean µ(θ, ylo). Let chi(θ) [respectively, clo(θ) and chi(θ, ylo)] be the expected time taken to simulate yhi ~ fhi(· | θ) [respectively, to simulate ylo ~ flo(· | θ) and to produce the coupled high-fidelity simulation yhi ~ fhi(· | θ, ylo)]. Further, assume that the computational cost of each iteration of Algorithm 1 and Algorithm 2 can be approximated by the dominant cost of simulation alone, neglecting the costs of the other calculations.

As Ctot → ∞, the MSE of the high-fidelity likelihood-free importance sampling estimator, G^ωhi, asymptotically exceeds that of the multifidelity importance sampling estimator, G^ωmf, that is,

$$E\!\left(\left(\hat{G}_{\omega_{\mathrm{mf}}} - E_{\pi\omega_{\mathrm{hi}}}(G \mid y_0)\right)^2\right) \prec E\!\left(\left(\hat{G}_{\omega_{\mathrm{hi}}} - E_{\pi\omega_{\mathrm{hi}}}(G \mid y_0)\right)^2\right),$$

if and only if 𝒥mf[µ] < 𝒥hi, where

$$\mathcal{J}_{\mathrm{hi}} = c_{\mathrm{hi}} V_{\mathrm{hi}}, \tag{20a}$$
$$\mathcal{J}_{\mathrm{mf}}[\mu] = \Big(c_{\mathrm{lo}} + E_{\rho}\big(\mu(\theta,y_{\mathrm{lo}})\,c_{\mathrm{hi}}(\theta,y_{\mathrm{lo}})\big)\Big) \times \left(V_{\mathrm{mf}} + E_{\rho}\!\left(\frac{\Delta_q(\theta)^2\,\eta(\theta,y_{\mathrm{lo}})}{\mu(\theta,y_{\mathrm{lo}})}\right)\right), \tag{20b}$$

with constants

$$c_{\mathrm{hi}} = \int c_{\mathrm{hi}}(\theta)\,q(\theta)\,\mathrm{d}\theta, \qquad V_{\mathrm{hi}} = \int \Delta_q(\theta)^2\,E(\omega_{\mathrm{hi}}^2 \mid \theta)\,q(\theta)\,\mathrm{d}\theta, \tag{20c}$$
$$c_{\mathrm{lo}} = \int c_{\mathrm{lo}}(\theta)\,q(\theta)\,\mathrm{d}\theta, \qquad V_{\mathrm{mf}} = \int \Delta_q(\theta)^2\,E(\lambda_{\mathrm{hi}}^2 \mid \theta)\,q(\theta)\,\mathrm{d}\theta, \tag{20d}$$

and functions

$$\Delta_q(\theta) = \frac{\pi(\theta)}{q(\theta)}\big(G(\theta) - G_{\mathrm{hi}}\big), \qquad \eta(\theta,y_{\mathrm{lo}}) = E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big), \qquad \lambda_{\mathrm{hi}}(\theta,y_{\mathrm{lo}}) = E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}}), \tag{20e}$$

and the joint density

$$\rho(\theta,y_{\mathrm{lo}}) = f_{\mathrm{lo}}(y_{\mathrm{lo}} \mid \theta)\,q(\theta). \tag{20f}$$

Proof. The leading order performance of each of Algorithm 1 and Algorithm 2 is given in terms of increasing computational budget, Ctot, in Equation (6) and Equation (7), respectively. For the performance of Algorithm 2 to exceed that of Algorithm 1, we compare the leading order coefficients from Equations (6) and (7), requiring

$$E(C_{\mathrm{mf}})\,\frac{E(w_{\mathrm{mf}}^2\Delta^2)}{E(w_{\mathrm{mf}})^2} < E(C_{\mathrm{hi}})\,\frac{E(w_{\mathrm{hi}}^2\Delta^2)}{E(w_{\mathrm{hi}})^2}. \tag{21}$$

We note that $E(w_{\mathrm{mf}} \mid \theta) = \pi(\theta)L_{\omega_{\mathrm{mf}}}(\theta)/q(\theta)$ and $E(w_{\mathrm{hi}} \mid \theta) = \pi(\theta)L_{\omega_{\mathrm{hi}}}(\theta)/q(\theta)$; as shown in Proposition 3, the denominators in Equation (21) are therefore equal. Thus,

$$\mathcal{J}_{\mathrm{mf}} = E(C_{\mathrm{mf}})\,E(w_{\mathrm{mf}}^2\Delta^2) < E(C_{\mathrm{hi}})\,E(w_{\mathrm{hi}}^2\Delta^2) = \mathcal{J}_{\mathrm{hi}},$$

is the condition for Algorithm 2 to outperform Algorithm 1.

Taking the right-hand side of this inequality first, clearly the expected simulation time is E(Chi) = chi, for the constant chi defined in Equation (20c). Similarly, we can write

$$E(w_{\mathrm{hi}}^2\Delta^2) = \int \left(\frac{\pi(\theta)}{q(\theta)}\Delta(\theta)\right)^2\left[\int \omega_{\mathrm{hi}}(\theta,y_{\mathrm{hi}})^2\,f_{\mathrm{hi}}(y_{\mathrm{hi}} \mid \theta)\,\mathrm{d}y_{\mathrm{hi}}\right]q(\theta)\,\mathrm{d}\theta = V_{\mathrm{hi}},$$

as given in Equation (20c). Thus, 𝒥hi = chiVhi.

For the left-hand side of the performance inequality, we take each expectation in 𝒥mf in turn. We first note that the expected iteration cost of Algorithm 2, E(Cmf), is the sum of the expected cost of a single low-fidelity simulation, and the expected cost of M high-fidelity simulations. By definition, the expected cost of a single low-fidelity simulation ylo ~ flo(· | θ) across θ ~ q(·) is given by clo. Thus the remaining cost, E(δCmf) = E(Cmf) − clo, is the expected cost of M high-fidelity simulations. Conditioning on θ, ylo and M = m, the expected remaining cost is, by definition,

$$E(\delta C_{\mathrm{mf}} \mid \theta, y_{\mathrm{lo}}, M = m) = m\,c_{\mathrm{hi}}(\theta, y_{\mathrm{lo}}).$$

Taking expectations over the conditional distribution M ~ Poi(µ(θ, ylo)), we have

$$E(\delta C_{\mathrm{mf}} \mid \theta, y_{\mathrm{lo}}) = \mu(\theta, y_{\mathrm{lo}})\,c_{\mathrm{hi}}(\theta, y_{\mathrm{lo}}).$$

Finally, integrating this expression over the density ρ in Equation (20f) gives the first factor of Equation (9).

It remains to show that

$$E(w_{\mathrm{mf}}^2\Delta^2) = V_{\mathrm{mf}} + \iint \frac{\Delta_q(\theta)^2\,\eta(\theta,y_{\mathrm{lo}})}{\mu(\theta,y_{\mathrm{lo}})}\,\rho(\theta,y_{\mathrm{lo}})\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}}.$$

We first condition on θ, ylo and M = m, to write

$$E(w_{\mathrm{mf}}^2\Delta^2 \mid \theta, y_{\mathrm{lo}}, m) = \Delta_q^2\,E(\omega_{\mathrm{mf}}^2 \mid \theta, y_{\mathrm{lo}}, m) = \Delta_q^2\left[\omega_{\mathrm{lo}}^2 + \frac{2}{\mu}\,\omega_{\mathrm{lo}}\,E(D_m \mid \theta, y_{\mathrm{lo}}) + \frac{1}{\mu^2}\,E(D_m^2 \mid \theta, y_{\mathrm{lo}})\right],$$

for the random variable $D_m = \sum_{i=1}^{m}(\omega_{\mathrm{hi},i} - \omega_{\mathrm{lo}})$. It is straightforward to show that

$$E(D_m \mid \theta, y_{\mathrm{lo}}) = m\,E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}}) - m\,\omega_{\mathrm{lo}}, \qquad E(D_m^2 \mid \theta, y_{\mathrm{lo}}) = m\,E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big) + (m^2 - m)\,E(\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}} \mid \theta, y_{\mathrm{lo}})^2,$$

where we exploit the conditional independence of the high-fidelity simulations yhi,i and yhi, j, for ij. On substitution of these conditional expectations, we then rearrange to write

$$E(w_{\mathrm{mf}}^2\Delta^2 \mid \theta, y_{\mathrm{lo}}, m) = \Delta_q^2\left[\left(1 - \frac{2m}{\mu}\right)\omega_{\mathrm{lo}}^2 + \frac{2m}{\mu}\,\omega_{\mathrm{lo}}\lambda_{\mathrm{hi}} + \frac{m}{\mu^2}\,\mathrm{Var}(\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}} \mid \theta, y_{\mathrm{lo}}) + \left(\frac{m(\lambda_{\mathrm{hi}}-\omega_{\mathrm{lo}})}{\mu}\right)^2\right],$$

where we write the conditional expectation $\lambda_{\mathrm{hi}}(\theta, y_{\mathrm{lo}}) = E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})$. At this point we can take expectations over M and rearrange to give

$$\begin{aligned}E(w_{\mathrm{mf}}^2\Delta^2 \mid \theta, y_{\mathrm{lo}}) &= \Delta_q^2\left[2\omega_{\mathrm{lo}}\lambda_{\mathrm{hi}} - \omega_{\mathrm{lo}}^2 + \frac{1}{\mu}\,\mathrm{Var}(\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}} \mid \theta, y_{\mathrm{lo}}) + \big(\mathrm{Var}(M \mid \theta, y_{\mathrm{lo}}) + \mu^2\big)\frac{(\lambda_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2}{\mu^2}\right]\\ &= \Delta_q^2\left[\lambda_{\mathrm{hi}}^2 + \frac{1}{\mu}\left(\mathrm{Var}(\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}} \mid \theta, y_{\mathrm{lo}}) + \frac{\mathrm{Var}(M \mid \theta, y_{\mathrm{lo}})}{\mu}\,E(\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}} \mid \theta, y_{\mathrm{lo}})^2\right)\right].\end{aligned} \tag{22}$$

Here, we can use the assumption that M conditioned on θ and ylo is Poisson distributed, noting that the statement of Theorem 4 can be adapted for other conditional distributions of M with different conditional variance functions. Under the Poisson assumption, we can substitute Var(M | θ, ylo) = µ(θ, ylo) to give

$$E(w_{\mathrm{mf}}^2\Delta^2 \mid \theta, y_{\mathrm{lo}}) = \Delta_q^2\left[\lambda_{\mathrm{hi}}^2 + \frac{E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big)}{\mu(\theta,y_{\mathrm{lo}})}\right].$$

Finally, we take expectations with respect to the probability density ρ in Equation (20f), and the product in Equation (9) follows.

Given the main result from Theorem 4 on the conditions for improvement using multifidelity importance sampling, the following results (Lemma 5 and Corollary 6) establish the form and existence of a mean function, µ(θ, ylo), such that these conditions are satisfied.

Lemma 5

The functional 𝒥mf[µ], quantifying the performance of Algorithm 2, is optimised by the function µ*, where

$$\mu^*(\theta,y_{\mathrm{lo}})^2 = \Delta_q(\theta)^2\left[\frac{\eta(\theta,y_{\mathrm{lo}})/V_{\mathrm{mf}}}{c_{\mathrm{hi}}(\theta,y_{\mathrm{lo}})/c_{\mathrm{lo}}}\right]. \tag{23}$$

Proof. We write the functional 𝒥mf[µ] = 𝒞[µ]𝒱 [µ] in Equation (9) as the product of functionals

$$\mathcal{C}[\mu] = c_{\mathrm{lo}} + \iint \mu\,c_{\mathrm{hi}}\,\rho\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}}, \tag{24a}$$
$$\mathcal{V}[\mu] = V_{\mathrm{mf}} + \iint \frac{\Delta_q^2\,\eta}{\mu}\,\rho\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}}. \tag{24b}$$

Standard ‘product rule’ results from the calculus of variations allow us to write the functional derivative of 𝒥mf with respect to µ as

$$\frac{\delta \mathcal{J}_{\mathrm{mf}}}{\delta\mu} = \mathcal{V}[\mu]\,\frac{\delta\mathcal{C}}{\delta\mu} + \mathcal{C}[\mu]\,\frac{\delta\mathcal{V}}{\delta\mu} = \mathcal{V}[\mu]\,c_{\mathrm{hi}}\,\rho - \mathcal{C}[\mu]\,\frac{\Delta_q^2\,\eta\,\rho}{\mu^2}.$$

Setting this functional derivative to zero, the optimal function, µ*, satisfies

$$\mu^*(\theta,y_{\mathrm{lo}})^2 = \frac{\mathcal{C}[\mu^*]}{\mathcal{V}[\mu^*]}\,\frac{\Delta_q(\theta)^2\,\eta(\theta,y_{\mathrm{lo}})}{c_{\mathrm{hi}}(\theta,y_{\mathrm{lo}})}. \tag{25}$$

The result in Equation (10) follows on showing that $\mathcal{C}[\mu^*]/\mathcal{V}[\mu^*] = c_{\mathrm{lo}}/V_{\mathrm{mf}}$.

On substituting Equation (25) into Equation (24) we find

$$\mathcal{C}[\mu^*] = c_{\mathrm{lo}} + \sqrt{\frac{\mathcal{C}[\mu^*]}{\mathcal{V}[\mu^*]}}\iint\sqrt{\Delta_q^2\,\eta\,c_{\mathrm{hi}}}\,\rho\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}}, \qquad \mathcal{V}[\mu^*] = V_{\mathrm{mf}} + \sqrt{\frac{\mathcal{V}[\mu^*]}{\mathcal{C}[\mu^*]}}\iint\sqrt{\Delta_q^2\,\eta\,c_{\mathrm{hi}}}\,\rho\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}},$$

from which it follows that

$$\sqrt{\frac{\mathcal{V}[\mu^*]}{\mathcal{C}[\mu^*]}}\,c_{\mathrm{lo}} = \sqrt{\frac{\mathcal{C}[\mu^*]}{\mathcal{V}[\mu^*]}}\,V_{\mathrm{mf}} = \sqrt{\mathcal{C}[\mu^*]\,\mathcal{V}[\mu^*]} - \iint\sqrt{\Delta_q^2\,\eta\,c_{\mathrm{hi}}}\,\rho\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}}.$$

Multiplying this equation by $\sqrt{\mathcal{C}[\mu^*]\,\mathcal{V}[\mu^*]}$, we have $\mathcal{V}[\mu^*]\,c_{\mathrm{lo}} = \mathcal{C}[\mu^*]\,V_{\mathrm{mf}}$, and thus Equation (10) follows from Equation (25).

Corollary 6

There exists a mean function, µ, such that the performance of Algorithm 2 exceeds the performance of Algorithm 1, if and only if

$$\sqrt{\frac{c_{\mathrm{lo}}}{c_{\mathrm{hi}}}\,\frac{V_{\mathrm{mf}}}{V_{\mathrm{hi}}}} + \iint\sqrt{\frac{\Delta_q(\theta)^2\,\eta(\theta,y_{\mathrm{lo}})}{V_{\mathrm{hi}}}\,\frac{c_{\mathrm{hi}}(\theta,y_{\mathrm{lo}})}{c_{\mathrm{hi}}}}\,\rho(\theta,y_{\mathrm{lo}})\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}} < 1. \tag{26}$$

Proof. On substituting Equation (23) into Equation (24), we find that the condition $\mathcal{J}_{\mathrm{mf}}^* = \mathcal{J}_{\mathrm{mf}}[\mu^*] < \mathcal{J}_{\mathrm{hi}} = c_{\mathrm{hi}}V_{\mathrm{hi}}$ is equivalent to

$$\left(\sqrt{c_{\mathrm{lo}}V_{\mathrm{mf}}} + \iint\sqrt{\Delta_q(\theta)^2\,\eta(\theta,y_{\mathrm{lo}})\,c_{\mathrm{hi}}(\theta,y_{\mathrm{lo}})}\,\rho(\theta,y_{\mathrm{lo}})\,\mathrm{d}\theta\,\mathrm{d}y_{\mathrm{lo}}\right)^2 < c_{\mathrm{hi}}V_{\mathrm{hi}}.$$

A simple rearrangement of this inequality gives the inequality in Equation (26).

To interpret the condition in Equation (26), we note that the first term is determined by (a) our assumption of a significant reduction in simulation burden of the low-fidelity model over the high-fidelity model, clo < chi, and (b) the ratio of the two integrals,

$$\frac{V_{\mathrm{mf}}}{V_{\mathrm{hi}}} = \frac{\int \Delta_q(\theta)^2\,E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big)\,q(\theta)\,\mathrm{d}\theta}{\int \Delta_q(\theta)^2\,E(\omega_{\mathrm{hi}}^2 \mid \theta)\,q(\theta)\,\mathrm{d}\theta}.$$

Exploiting the law of total variance, we note that

$$E(\omega_{\mathrm{hi}}^2 \mid \theta) = \mathrm{Var}(\omega_{\mathrm{hi}} \mid \theta) + L_{\omega_{\mathrm{hi}}}(\theta)^2, \qquad E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big) = \mathrm{Var}\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}}) \mid \theta\big) + L_{\omega_{\mathrm{hi}}}(\theta)^2 = E(\omega_{\mathrm{hi}}^2 \mid \theta) - E\big(\mathrm{Var}(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}}) \mid \theta\big).$$

These equalities imply that

$$E(\omega_{\mathrm{hi}} \mid \theta)^2 \le E\big(E(\omega_{\mathrm{hi}} \mid \theta, y_{\mathrm{lo}})^2 \mid \theta\big) \le E(\omega_{\mathrm{hi}}^2 \mid \theta),$$

where the lower bound is achieved for yhi independent of ylo, while the upper bound would be achieved if yhi were a deterministic function of ylo. In particular, Vmf/Vhi ≤ 1, and so the first term of Equation (26) is small whenever the low-fidelity model provides significant computational savings versus the high-fidelity model.

The second term in Equation (26) quantifies the detriment to the performance of Algorithm 2 that arises from the inaccuracy of ωlo as an estimate of ωhi. The function $\eta(\theta, y_{\mathrm{lo}}) = E\big((\omega_{\mathrm{hi}}-\omega_{\mathrm{lo}})^2 \mid \theta, y_{\mathrm{lo}}\big)$ is integrated across the density ρ, weighted by the relative computational cost of the high-fidelity simulation, $c_{\mathrm{hi}}(\theta,y_{\mathrm{lo}})/c_{\mathrm{hi}}$, and by the contribution of G(θ) to the variance of the estimated posterior expectation of G. We can conclude that the multifidelity approach requires the low-fidelity model to be accurate in the regions of parameter space where high-fidelity simulations are particularly expensive.

To summarise: if (a) the ratio between average low-fidelity and high-fidelity simulation costs is suitably small, and (b) the average disagreement between likelihood-free weightings, as measured by η, is suitably small, then Equation (26) will be satisfied and thus a mean function, µ, exists such that Algorithm 2 is more efficient than Algorithm 1. The adaptive scheme proposed in Algorithm 3 tunes the mean function µ(θ, ylo) towards this optimum to maximise the computational benefit.

5. Example: Biochemical reaction network

The following example considers the stochastic simulation of a biochemical reaction motif. Readers unfamiliar with these techniques are referred to the detailed expositions by Schnoerr et al. [32], Warne et al. [33] and Erban and Chapman [34]. We model the conversion (over time t ≥ 0) of substrate molecules, labelled S, into molecules of a product, P. The conversion of S into P is catalysed by the presence of enzyme molecules, E, which bind with S to form a substrate–enzyme complex, labelled C. After non-dimensionalising units of time and volume, this network motif is represented by three reactions,

$$S + E \underset{k_2}{\overset{k_1}{\rightleftharpoons}} C \overset{k_3}{\longrightarrow} P + E, \tag{27a}$$

parameterised by the vector θ = (k1, k2, k3) of positive parameters, k1, k2, and k3, and three propensity functions,

$$v_1(t) = k_1 S(t)E(t), \tag{27b}$$
$$v_2(t) = k_2 C(t), \tag{27c}$$
$$v_3(t) = k_3 C(t), \tag{27d}$$

where the integer-valued variables S(t), E(t), C(t) and P(t) represent the molecule numbers at time t > 0. The stoichiometric matrix for this model is

$$h = \begin{bmatrix} -1 & 1 & 0 \\ -1 & 1 & 1 \\ 1 & -1 & -1 \\ 0 & 0 & 1 \end{bmatrix}, \tag{28}$$

where column j records the net change in the molecule numbers (S, E, C, P) due to a single firing of reaction j.

At t = 0, we assume there are no complex or product molecules, but set positive integer numbers S0 = 100 and E0 = 5 of substrate and enzyme molecules, respectively. Given the fixed initial conditions, the parameters in θ are sufficient to specify the dynamics of the model in Equation (27a). The model is stochastic, and induces a distribution, which we denote f(· | θ), on the space of trajectories x : t ⟼ (S(t), E(t), C(t), P(t)) of molecule numbers in ℕ⁴ over time. Such trajectories evolve according to the discrete-state Markov process

$$x_t = x_0 + \sum_{j=1}^{3} \mathcal{P}_j\!\left(\int_0^t v_j(s)\,\mathrm{d}s\right) h_{\cdot,j}, \tag{29}$$

where 𝒫1(·), 𝒫2(·), and 𝒫3(·) are independent unit-rate Poisson processes, so that each time-changed reaction channel fires as an inhomogeneous Poisson process.

For the purposes of this example, the observed data (Figure 1) is given by

$$y_0 = (y_1, \ldots, y_{10}) = (1.73,\ 3.80,\ 5.95,\ 8.10,\ 11.17,\ 12.92,\ 15.50,\ 17.75,\ 20.17,\ 23.67),$$

where the n-th observation represents the hitting time at which the product molecule count reaches 10n, that is, yn = t such that P(t) = 10n. We set a prior π(θ) on the vector θ equal to a product of independent uniform distributions, such that k1, k2 ~ U(10, 100) and k3 ~ U(0.1, 10). We seek the posterior distribution π(θ | y0), focusing on the posterior expectation of the function G(θ) = k3, the rate of conversion of substrate–enzyme complex to product. All code for this example is available at github.com/tpprescott/mf-lf, using stochastic simulations implemented by github.com/tpprescott/ReactionNetworks.jl. We assume that we cannot calculate the likelihood function, L(θ) = f(y0 | θ), and therefore resort to our likelihood-free framework.
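The paper's simulations are implemented in Julia (ReactionNetworks.jl); purely for illustration, a minimal Python version of the exact Gillespie simulation of Equation (27), returning the ten hitting-time summaries yn, might look as follows (using the example parameter values θ = (50, 50, 1)):

```python
import numpy as np

def simulate_hitting_times(theta, S0=100, E0=5, rng=None):
    """Exact SSA for S + E <-> C -> P + E; returns the times at which P
    first reaches 10, 20, ..., 100 (cf. the observed data y_0)."""
    k1, k2, k3 = theta
    rng = np.random.default_rng() if rng is None else rng
    S, E, C, P = S0, E0, 0, 0
    t, hits, target = 0.0, [], 10
    while P < S0:
        v = np.array([k1 * S * E, k2 * C, k3 * C])   # propensities v_1..v_3
        vtot = v.sum()
        t += rng.exponential(1.0 / vtot)             # time to next reaction
        j = rng.choice(3, p=v / vtot)                # which reaction fires
        if j == 0:
            S, E, C = S - 1, E - 1, C + 1            # S + E -> C
        elif j == 1:
            S, E, C = S + 1, E + 1, C - 1            # C -> S + E
        else:
            C, E, P = C - 1, E + 1, P + 1            # C -> P + E
            if P == target:
                hits.append(t)
                target += 10
    return np.array(hits)

y = simulate_hitting_times((50.0, 50.0, 1.0), rng=np.random.default_rng(0))
print(y)   # ten increasing hitting times
```

Because k1 and k2 are large relative to k3, most of the simulated events are binding/unbinding reactions, which is exactly the source of the computational burden discussed in Section 5.1.2.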

Figure 1. Effect of multifidelity coupling.

Figure 1

(a) Example stochastic trajectories from the high and low-fidelity enzyme kinetics models in Equations (27) and (30) for parameters θ = (k1, k2, k3) = (50, 50, 1), compared with data used for inference. For one low-fidelity simulation, we generate five uncoupled simulations and five coupled simulations. (b) Ten-dimensional data summarising simulated trajectories in (a). Black represents observed data, y0; the single low-fidelity simulation ylo ~ flo(· | θ) is in blue; five uncoupled simulations yhi ~ fhi(· | θ) are in orange; five coupled simulations yhi ~ fhi(· | θ, ylo) are in green (almost coincident).

5.1. Multifidelity approximate Bayesian computation

Here we formulate the weighting function to implement ABC importance sampling to compare with an equivalent ABC implementation of adaptive multifidelity importance sampling. See Appendix A for the general formulation of ABC within our framework.

5.1.1. ABC importance sampling

Given θ, the model in Equation (27) can be exactly simulated using the Gillespie stochastic simulation algorithm, to produce draws y ~ f (· | θ) from the exact model [26, 34, 35]. We will use the ABC likelihood-free weighting with threshold value ϵ = 5 on the Euclidean distance of the simulation from y0, such that

$$\omega(\theta, y) = \mathbb{I}\big(\|y - y_0\|_2 < 5\big),$$

to define the likelihood-free approximation to the likelihood, LABC(θ) = E(ω | θ). For simplicity of demonstration we set the importance distribution equal to the prior, that is, q = π; in this instance, Algorithm 1 reduces to a rejection sampling approach. Furthermore, configuring Algorithm 1 and Algorithm 3 to use the same proposal ensures that any performance improvements are due to the multifidelity scheme rather than to tuning of proposals.

5.1.2. Multifidelity ABC

The exact Gillespie stochastic simulation algorithm can incur significant computational burden. In the specific case of the network in Equation (27), if the reaction rates k1 and k2 are large relative to k3, there are large numbers of binding/unbinding reactions S + E ⟷ C that occur in any simulation. In comparison, the reaction C ⟶ P + E can only fire exactly 100 times. Michaelis–Menten dynamics exploit this scale separation to approximate the enzyme kinetics network motif. We approximate the conversion of substrate into product as a single reaction step,

$$S \overset{k_{\mathrm{MM}}(t)}{\longrightarrow} P, \tag{30a}$$

where the time-varying rate of conversion, kMM(t), given by

$$k_{\mathrm{MM}}(t) = \frac{k_3\,\min(S(t), E_0)}{K_{\mathrm{MM}} + S(t)}, \tag{30b}$$
$$K_{\mathrm{MM}} = \frac{k_2 + k_3}{k_1}, \tag{30c}$$

induces the propensity function vMM(t) = kMM(t)S(t). We assume initial conditions S(0) = S0 = 100 and P(0) = 0, and fix the parameter E0 = 5. Thus, the parameter vector, θ = (k1, k2, k3), again fully determines the dynamics of the low-fidelity model in Equation (30). We write ylo ~ flo(· | θ) for the conditional probability density of the Gillespie simulation of the approximate model in Equation (30), where ylo is the vector of ten simulated time points ylo,n at which 10n product molecules have been produced.
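A correspondingly minimal (uncoupled) simulator for the low-fidelity model in Equation (30) only ever fires the single conversion reaction, which is why it is so much cheaper than the full network. Again, this is an illustrative Python sketch rather than the paper's Julia implementation:

```python
import numpy as np

def simulate_mm_hitting_times(theta, S0=100, E0=5, rng=None):
    """Gillespie simulation of the Michaelis-Menten reduction S -> P with
    propensity v_MM(t) = k_MM(t) S(t); returns the ten hitting times."""
    k1, k2, k3 = theta
    K_MM = (k2 + k3) / k1                      # Equation (30c)
    rng = np.random.default_rng() if rng is None else rng
    S, P, t, hits = S0, 0, 0.0, []
    while P < S0:
        k_MM = k3 * min(S, E0) / (K_MM + S)    # Equation (30b)
        t += rng.exponential(1.0 / (k_MM * S)) # single reaction channel
        S, P = S - 1, P + 1
        if P % 10 == 0:
            hits.append(t)
    return np.array(hits)

y_lo = simulate_mm_hitting_times((50.0, 50.0, 1.0), rng=np.random.default_rng(0))
print(y_lo)
```

Exactly 100 events are simulated per run, versus the thousands of binding/unbinding events required by the full network at these parameter values.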

For a biochemical reaction network consisting of R reactions, the Gillespie simulation algorithm is a deterministic transformation of R independent unit-rate Poisson processes, one for each reaction channel. We can couple the models in Equations (27) and (30) by using the same Poisson process for the single reaction in Equation (30) and for the product formation C ⟶ P + E reaction of Equation (27) [11, 36]. Using this coupling approach, we first simulate ylo ~ flo(· | θ) from Equation (30). We then produce the coupled simulation yhi ~ fhi(· | θ, ylo) from the model in Equation (27), using the shared Poisson process. We set the corresponding likelihood-free weightings to

$$\omega_{\mathrm{hi}}(\theta, y_{\mathrm{hi}}) = \mathbb{I}\big(\|y_{\mathrm{hi}} - y_0\|_2 < 5\big), \qquad \omega_{\mathrm{lo}}(\theta, y_{\mathrm{lo}}) = \mathbb{I}\big(\|y_{\mathrm{lo}} - y_0\|_2 < 5\big),$$

noting that E(ωhi | θ) = LABC(θ) is the high-fidelity ABC approximation to the likelihood. Figure 1 illustrates the effect of coupling between low-fidelity and high-fidelity models. The five coupled high-fidelity simulations are significantly less variable than the independent high-fidelity simulations, appearing almost coincident in Figure 1. This ensures a large degree of correlation between the coupled likelihood-free weightings, ωhi and ωlo. Thus, coupling ensures that ωlo is a reliable proxy for ωhi for use in multifidelity likelihood-free inference.
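The variance reduction delivered by sharing a Poisson process between the two simulators can be seen in isolation by time-changing the same unit-rate arrival path with two different integrated propensities. The sketch below (constant propensities, invented values) is a schematic illustration of the coupling principle, not the full coupled simulator:

```python
import numpy as np

rng = np.random.default_rng(1)

def unit_poisson_path(n, rng):
    """Arrival times of a unit-rate Poisson process P(.)."""
    return np.cumsum(rng.exponential(1.0, size=n))

def count(path, integrated_propensity):
    """Firings of the time-changed process P(int_0^t v ds), cf. Equation (29)."""
    return np.searchsorted(path, integrated_propensity)

n_rep = 2_000
d_coupled, d_indep = [], []
for _ in range(n_rep):
    shared = unit_poisson_path(400, rng)
    n_lo = count(shared, 100.0)   # integrated propensity of one channel
    n_hi = count(shared, 110.0)   # slightly different propensity, SAME path
    d_coupled.append(n_hi - n_lo)
    d_indep.append(count(unit_poisson_path(400, rng), 110.0) - n_lo)

print(np.var(d_coupled), np.var(d_indep))  # coupled differences far less variable
```

The coupled counts differ only through the extra arrivals in the shared path, mirroring how the coupled high-fidelity trajectories in Figure 1 cluster tightly around the low-fidelity simulation.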

We implement Algorithm 3 with a burn-in period of N0 = 10,000 iterations, during which we generate mi ~ Poi(1) high-fidelity simulations at each iteration, i ≤ N0. Once the burn-in period is complete, we define the partition 𝒟 by learning a decision tree through a simple regression, as described in Section 3.4. For iterations i > N0 beyond the burn-in period, we set a step size of δ = 10⁻³ for the gradient-descent update in Equation (14).

5.1.3. Results

Algorithm 1 was run four times, with the total number of weighted samples set to N = 10,000, N = 20,000, N = 40,000 and N = 80,000. Similarly, Algorithm 3 was run five times, with N = 40,000, N = 80,000, N = 160,000, N = 320,000 and N = 640,000. Since we do not have access to the true posterior mean, $E_{\pi\omega_{\mathrm{hi}}}(G \mid y_0)$, we use the empirical mean over all high-fidelity runs as a proxy, resulting in an empirical MSE estimate for the multifidelity scheme. Figure 2a shows how the empirical MSE of the estimate, Ĝ, varies with the total simulation cost, Ctot, for each of the two algorithms. The slope of each curve (on a log–log scale) is approximately −1, corresponding to the dominant behaviour of the MSE being reciprocal in total simulation time, as observed in Equation (7). The offset between the two curves corresponds to the inequality 𝒥mf < 𝒥hi in the leading-order coefficient, thereby demonstrating the improved performance of Algorithm 3 over Algorithm 1.

Figure 2. Multifidelity ABC.

Figure 2

(a) Total simulation cost versus the empirical mean-squared error (MSE) of the output estimate, Ĝ, for four runs of Algorithm 1 (ABC) and five runs of Algorithm 3 (MF-ABC). (b) Values of the multifidelity weight, wmf, during the longest run of Algorithm 3, shown for iterations where ωmf ≠ ωlo, such that the low-fidelity likelihood-free weighting is corrected by at least one high-fidelity simulation. (c) The evolution of the values of vk(i) during the longest run of Algorithm 3; different coloured lines distinguish the vk(i) for distinct partition cells, Dk, but the colours themselves have no specific interpretation. (d) A comparison of the adaptive v1(i) with the evolving best estimate of the optimal v1, given by Equation (13), based on the Monte Carlo estimates in Equation (19).

The values in Figure 2b show the multifidelity weights, wi. We show only those weights not equal to zero or one, corresponding to those iterations where ωlo(θi, ylo,i) has been corrected by at least one ωhi(θi, yhi,i,j) ≠ ωlo(θi, ylo,i). Clearly, a significant amount of correction is applied to the low-fidelity weights. However, as demonstrated by the improved performance statistics, Algorithm 3 has learned an allocation of computational budget to the high-fidelity simulations that balances reduced overall simulation time against correcting inaccuracies in the low-fidelity simulations.
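The correction these weights implement can be sketched directly from the multifidelity weighting formula, w_mf = ω_lo + (1/μ) Σ_{j=1}^{M} (ω_hi,j − ω_lo) with M ~ Poi(μ). The sketch below is a minimal illustration, not the authors' implementation; `draw_omega_hi` is a hypothetical callable standing in for one coupled high-fidelity simulation and its weighting.

```python
import numpy as np

def multifidelity_weight(omega_lo, mu, draw_omega_hi, rng):
    """Two-level multifidelity weight: the low-fidelity weighting omega_lo is
    corrected by M ~ Poisson(mu) coupled high-fidelity weightings, so that the
    result is unbiased for the expected high-fidelity weighting."""
    m = rng.poisson(mu)  # random number of high-fidelity corrections
    correction = sum(draw_omega_hi() - omega_lo for _ in range(m))
    return omega_lo + correction / mu
```

If every high-fidelity draw agrees with ω_lo, the correction vanishes and w_mf = ω_lo exactly; unbiasedness, E(w_mf) = E(ω_hi), holds for any μ > 0, while μ controls how often the (expensive) corrections are made.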

Each run of Algorithm 3 includes a burn-in period of 10,000 iterations, at the conclusion of which a partition 𝒟 is created based on decision tree regression. In Appendix C, we show how this decision tree is used to define a piecewise-constant mean function, specifically for the partition 𝒟 used in the final run of Algorithm 3 (i.e. N = 640,000 iterations). In Figure 2c, we show the evolution over iterations i of the values vk(i) used in this mean function. Following the updating rule in Equation (19), the trajectory of vk(i) converges exponentially towards a Monte Carlo estimate of the optimal value vk given in Equation (13). However, we can see from Figure 2c that, as more simulations are completed and the Monte Carlo estimates in Equation (19) evolve, the values of each parameter, vk, track the updated estimates. This is illustrated in Figure 2d for v1, where the estimated optimum v1 evolves as more simulations are completed. We note that the gradient descent update in Equation (19) at iteration i depends on all vk(i) values. Thus, the observed convergence of v1(i) to the evolving estimate of v1 is not necessarily monotonic.

Figure 2d illustrates the motivation for using gradient descent rather than simply adopting the analytically obtained optimum. When very few simulations have been completed, the estimates in Equation (10) are small, and their ratios are numerically unstable and often far from the true optimum. If the vk(i) values are too small in early iterations, then the estimates become still more numerically unstable, since fewer high-fidelity simulations are completed when µ is small. Instead, using gradient descent ensures that enough high-fidelity simulations are completed for each 𝒟k, including those with low volume under the measure ρ, to stabilise the estimates required in Equation (10) and thus stabilise the multifidelity algorithm.

5.2. Multifidelity Bayesian synthetic likelihood

Consider the same model of enzyme kinetics as in Section 5.1. As depicted in Figure 1, this model has low-fidelity (Michaelis–Menten) stochastic dynamics with distribution flo(· | θ), and coupled high-fidelity stochastic dynamics with distribution fhi(· | θ, ylo). We now redefine ωlo and ωhi to be Bayesian synthetic likelihoods, based on K pairs of coupled simulations,

$$y_{\text{lo},k} \sim f_{\text{lo}}(\cdot \mid \theta), \qquad y_{\text{hi},k} \sim f_{\text{hi}}(\cdot \mid \theta, y_{\text{lo},k}),$$

for k = 1, …, K. That is,

$$\omega_{\text{lo}}(\theta, y_{\text{lo}}) = \mathcal{N}\big(y_0 ; \mu(y_{\text{lo}}), \Sigma(y_{\text{lo}})\big), \qquad \omega_{\text{hi}}(\theta, y_{\text{hi}}) = \mathcal{N}\big(y_0 ; \mu(y_{\text{hi}}), \Sigma(y_{\text{hi}})\big),$$

are the Gaussian likelihoods of the observed data, under the empirical mean and covariance of K low-fidelity and (coupled) high-fidelity simulations, respectively. As in the ABC example, for simplicity of demonstration we set the importance distribution equal to the prior, that is, q = π.
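As a concrete sketch of such a weighting, the Gaussian synthetic likelihood of an observed summary y0 can be computed from the empirical mean and covariance of the K simulated summaries. The function below is an illustrative stand-in (plain numpy), not the authors' code; `sims` is assumed to hold one d-dimensional summary per row.

```python
import numpy as np

def synthetic_likelihood(y0, sims):
    """Gaussian (synthetic) likelihood of the observed summary y0 under the
    empirical mean and covariance of the K simulated summaries in `sims`
    (a K x d array): N(y0; mu(sims), Sigma(sims))."""
    K, d = sims.shape
    mean = sims.mean(axis=0)
    cov = np.cov(sims, rowvar=False)           # empirical covariance (K-1 normalisation)
    diff = y0 - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)   # Mahalanobis term
    return np.exp(-0.5 * (d * np.log(2.0 * np.pi) + logdet + quad))
```

In practice the log-likelihood would typically be used directly, since the density itself can underflow when d is large.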

Algorithm 1 was run three times, using ωhi(θ, yhi) dependent on high-fidelity simulations yhi ~ fhi(· | θ) alone, and setting the number of iterations to N = 2,500, N = 5,000 and N = 10,000. Similarly, Algorithm 3 was run four times using the coupled multifidelity model, setting the number of iterations to N = 4,000, N = 8,000, N = 16,000 and N = 32,000, and initialising with a burn-in of size N0 = 2,000. The adaptive step size was set to δ = 10^8. In both algorithms, we set the number of simulations required for each evaluation of ωhi(θ, (yhi,1, …, yhi,K)) or ωlo(θ, (ylo,1, …, ylo,K)) to K = 100.

Figure 3 depicts the performance of multifidelity BSL inference, where Algorithm 3 is applied with BSL likelihood-free weightings, ωlo and ωhi. As with MF-ABC, Figure 3a shows that MF-BSL improves on high-fidelity BSL inference, achieving lower MSE for a given computational budget. We also note in Figure 3a that the curve corresponding to MF-BSL has slope less than −1. This is due to (a) the overhead cost of the initial burn-in period of Algorithm 3, and (b) the conservative convergence of vk(i) to the optimum, as shown in Figure 3c–d. Both observations imply that earlier iterations are produced less efficiently than later iterations, meaning that larger samples show greater improvements than expected from the reciprocal relationship in Equation (7).

Figure 3. Multifidelity BSL.

Figure 3

(a) Total simulation cost versus the empirical MSE of the output estimate, Ĝ, for three runs of Algorithm 1 (BSL) and four runs of Algorithm 3 (MF-BSL). (b) Values of the multifidelity weight, wmf, during the longest run of Algorithm 3, shown for iterations where ωmf ≠ ωlo, such that the low-fidelity likelihood-free weighting is corrected by at least one high-fidelity simulation. (c) The evolution of the values of vk(i) during the longest run of Algorithm 3; different coloured lines distinguish the vk(i) for distinct partition cells, Dk, but the colours themselves have no specific interpretation. (d) A comparison of the adaptive v1(i) with the evolving best estimate of the optimal v1, given by Equation (13), based on the Monte Carlo estimates in Equation (19).

Comparing Figure 3b to Figure 2b, we note that there are very few negative multifidelity weights in MF-BSL, in comparison to MF-ABC. We can conclude that the Bayesian synthetic likelihood constructed using low-fidelity simulations tends to underestimate the likelihood of the observed data relative to that constructed using high-fidelity simulations. We also note in this comparison that the ABC and BSL likelihood-free weightings are on significantly different scales.

6. Discussion

The characteristic computational burden of simulation-based, likelihood-free Bayesian inference methods is often a barrier to their successful implementation. Multifidelity simulation techniques have previously been shown to improve the efficiency of likelihood-free inference in the context of ABC. In this work, we have demonstrated that these techniques can be readily applied to a range of likelihood-free approaches. Furthermore, we have introduced a computational methodology for automating the multifidelity approach, adaptively allocating simulation resources across different fidelities in order to ensure near-optimal efficiency gains from this technique. As parameter space is explored, our methodology, given in Algorithm 3, learns the relationships between simulation accuracy and simulation costs at the different fidelities, and adapts the requirement for high-fidelity simulation accordingly.

The multifidelity approach to likelihood-free inference is one of a number of strategies for speeding up inference, which include MCMC and SMC sampling techniques [7, 8, 9] and methods for variance reduction such as multilevel estimation [29, 16, 18, 17]. A key observation in the previous work of Prescott and Baker [12] and Warne et al. [15] is that applying multifidelity techniques provides ‘orthogonal’ improvements that combine synergistically with these other established approaches to improving efficiency. Similarly, we envision that Algorithm 3 can be adapted into an SMC or multilevel algorithm with minimal difficulty, following the templates set by Prescott and Baker [12] and Warne et al. [15].

The multifidelity approach discussed in this work is a highly flexible generalisation of existing multifidelity techniques, which can be viewed as special cases of Algorithm 2. In each of MF-ABC [11, 12], LZ-ABC [20], and DA-ABC [22], it is assumed that ωhi is an ABC likelihood-free weighting; we relax this assumption in this work. Furthermore, LZ-ABC and DA-ABC both use ωlo ≡ 0, so that parameters are always rejected if no high-fidelity simulation is completed. We relax this assumption to allow for any low-fidelity likelihood-free weighting. In all of MF-ABC, LZ-ABC and DA-ABC, the conditional distribution of M, given a parameter value θ and low-fidelity simulation output ylo, is Bernoulli, with mean µ(θ, ylo) ∈ (0, 1]. In this work we change this distribution to Poisson to simplify the analysis, but any conditional distribution for M can be used. These adaptations are explored further in Appendix B.

In the case of MF-ABC (as originally formulated by [11]) and DA-ABC [21, 22], the mean function, µ(θ, ylo), depends on a single low-fidelity simulation and is assumed to be piecewise constant in the value of the indicator function I(d(ylo, y0) < ϵ). Lazy ABC is more general, defining µ = µ(ϕ(θ, ylo)) to depend on the value of any decision statistic, ϕ. In this work, we consider more general piecewise-constant mean functions, µ𝒟, for heuristically derived partitions 𝒟 of (θ, ylo)-space. We observe that (θ, ylo) may be of very high dimension; in the BSL example in Section 5.2, having K = 100 low-fidelity simulations ylo ∈ ℝ^10 means that the input to µ is of dimension 1003. In this situation, it may be tempting to seek a mean function that depends only on θ. However, we recall that the optimal mean function, µ(θ, ylo), derived in Lemma 5, depends on the conditional expectation E((ωhi − ωlo)² | θ, ylo). Thus, by ignoring ylo, we would discard the information about ωhi provided by the evaluation of ωlo(θ, ylo). Furthermore, the high dimension of the inputs to µ suggests that this function is not necessarily well-approximated by a decision tree. Future work may focus on methods to learn the optimal mean function directly, without resorting to piecewise-constant approximations [37]. The key problem is ensuring the conservatism of any alternative estimate of µ, recalling that the variance of wmf is inversely proportional to µ.
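The piecewise-constant mean functions µ𝒟 can be illustrated with a deliberately tiny stand-in for decision-tree regression: a single-split "stump" on one scalar feature of (θ, ylo), fitted to observed squared discrepancies (ωhi − ωlo)². Everything below (the function names `fit_stump_partition` and `mu_D`, the feature choice, the split criterion) is an illustrative assumption; the paper's partitions come from a full decision tree, as in Section 3.4.

```python
import numpy as np

def fit_stump_partition(x, sq_disc):
    """Learn a single split of a scalar feature x that best predicts the
    squared discrepancies sq_disc = (omega_hi - omega_lo)^2 by a
    piecewise-constant function: minimise the within-leaf sum of squared
    errors over all candidate split points. Returns the threshold and the
    two leaf means."""
    best = None
    order = np.argsort(x)
    xs, ys = x[order], sq_disc[order]
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, xs[i - 1], left.mean(), right.mean())
    _, thresh, v_left, v_right = best
    return thresh, v_left, v_right

def mu_D(x_new, thresh, v_left, v_right):
    # piecewise-constant mean function over the two-cell partition
    return v_left if x_new <= thresh else v_right
```

In the adaptive algorithm the per-cell values vk would then be tuned further by gradient descent rather than fixed at the regression estimates.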

In the example explored in Section 5, we considered the use of Algorithm 3 where ωhi and ωlo were first both ABC likelihood-free weightings, and then both BSL likelihood-free weightings. In principle, the method also allows ωlo to be, for example, an ABC likelihood-free weighting based on a single low-fidelity simulation, and ωhi to be a BSL likelihood-free weighting based on K > 1 high-fidelity simulations. However, the success of the multifidelity method depends explicitly on the function η(θ, ylo) = E((ωhi − ωlo)² | θ, ylo) being sufficiently small, as quantified in Corollary 6. If ωlo and ωhi are on different scales, as is likely when one is an ABC weighting and the other a BSL weighting, then this function is not sufficiently small in general, and so the multifidelity approach fails. We note, however, that we could instead consider a scaled low-fidelity weighting, ω̃lo = γωlo, in place of ωlo in Algorithms 2 and 3, with no change to the target distribution. Here, γ is an additional parameter that can be tuned together with µ when minimising the performance metric, 𝒥mf; its optimal value would need to be learned in parallel with the optimal mean function, µ. We defer this adaptation to future work.

The analysis of the performance improvements of multifidelity importance sampling over high-fidelity likelihood-free importance sampling assumes that the importance distribution, q(θ), is the same in both methods. As mentioned in Section 2, this importance distribution can also be tuned to improve efficiency, in a similar way to the cost/variance trade-off that we optimise in the multifidelity scheme. While we have not analysed this here, we expect that tuning the proposal distribution could provide additional improvements. While the analysis may be non-trivial, standard adaptive importance sampling could be applied to tune a proposal from a fixed parametric family [38] or, more generally, sequential Monte Carlo methods can be used to deal with this problem practically without assuming a parametric form for q(θ) [12]. Future work should consider the performance improvements available when the assumption of a fixed proposal is relaxed.

Finally, this work follows Prescott and Baker [11, 12] in considering only a single low-fidelity model. There is significant scope for further improvements by applying these approaches to suites of low-fidelity approximations [39]. For example, exact stochastic simulations of biochemical networks, such as that simulated in Section 5, may also be approximated by tau-leaping [33, 40], where the time discretisation parameter, τ, is typically chosen to trade off computational savings against accuracy: exactly the trade-off explored in this work. This parameter therefore has important consequences for the success of a multifidelity inference approach using such an approximation strategy. There are several natural extensions that could be applied to include multiple fidelities (such as multiple τ resolutions). As an example, suppose there are three levels of fidelity (low, medium and high) with respective weightings ωlo, ωmed and ωhi. In this setting we can apply the multifidelity weighting (Equation (4)) recursively to obtain

$$\omega_{\text{mf}}(\theta,z) = \omega_{\text{lo}}(\theta,y_{\text{lo}}) + \frac{1}{\mu_{\text{med}}(\theta,y_{\text{lo}})}\sum_{i=1}^{m}\left[\left\{\omega_{\text{med}}(\theta,y_{\text{med},i}) + \frac{1}{\mu_{\text{hi}}(\theta,y_{\text{med},i})}\sum_{j=1}^{n_i}\left[\omega_{\text{hi}}(\theta,y_{\text{hi},j}) - \omega_{\text{med}}(\theta,y_{\text{med},i})\right]\right\} - \omega_{\text{lo}}(\theta,y_{\text{lo}})\right],$$

where m is the random number of medium-fidelity simulations, conditional on the parameters and the low-fidelity simulation, and ni is the random number of high-fidelity simulations, conditional on the parameters and the ith medium-fidelity simulation. The mean functions of these random variables are µmed(θ, ylo) and µhi(θ, ymed,i), respectively. Alternatively, we could consider a set of J low-fidelity models, ωlo^1, …, ωlo^J, that may have different accuracy properties in different partitions of parameter space, D1, …, DJ. The multifidelity weighting can be formulated as

$$\omega_{\text{mf}}(\theta,z) = \sum_{j=1}^{J} I(\theta \in D_j)\left\{\omega_{\text{lo}}^{j}(\theta,y_{\text{lo}}^{j}) + \frac{1}{\mu^{j}(\theta,y_{\text{lo}}^{j})}\sum_{i=1}^{m}\left[\omega_{\text{hi}}(\theta,y_{\text{hi},i}) - \omega_{\text{lo}}^{j}(\theta,y_{\text{lo}}^{j})\right]\right\},$$

assuming that model ωlo^j has the optimal accuracy characteristics in partition Dj. Of course, it may not be known a priori which low-fidelity model performs best in a given partition, and therefore we may choose to include the assignment of models to partitions as part of the adaptive tuning scheme. In future work, a full exploration of the use of multiple low-fidelity model approximations will be vital for the full potential of multifidelity likelihood-free inference to be realised. The theory and practice presented here, along with these clear paths for extension, offer the prospect of substantial performance gains in likelihood-free inference.
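The recursive three-level weighting above can be sketched as follows. For simplicity, the mean functions µmed and µhi are taken as constants here (in the text they depend on (θ, ylo) and (θ, ymed,i), respectively), and the `draw_omega_med` and `draw_omega_hi` callables are hypothetical stand-ins for coupled medium- and high-fidelity simulations and their weightings.

```python
import numpy as np

def three_level_weight(omega_lo, mu_med, mu_hi, draw_omega_med, draw_omega_hi, rng):
    """Recursive multifidelity weight over three fidelities: each of
    m ~ Poisson(mu_med) medium-fidelity weightings is itself corrected by
    n_i ~ Poisson(mu_hi) high-fidelity weightings, and the corrected values
    in turn correct the low-fidelity weighting omega_lo."""
    m = rng.poisson(mu_med)
    total = 0.0
    for _ in range(m):
        omega_med = draw_omega_med()
        n_i = rng.poisson(mu_hi)
        inner = sum(draw_omega_hi() - omega_med for _ in range(n_i))
        corrected_med = omega_med + inner / mu_hi   # inner two-level correction
        total += corrected_med - omega_lo
    return omega_lo + total / mu_med
```

Unbiasedness is preserved at every level: taking expectations first over each ni and then over m recovers the expected high-fidelity weighting.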

Supplementary Material

Appendix

Acknowledgements

REB and TPP acknowledge funding for this work through the BBSRC/UKRI grant BB/R00816/1. TPP is supported by the Alan Turing Institute and by Wave 1 of the UKRI Strategic Priorities Fund, under the “Shocks and Resilience” theme of the EPSRC/UKRI grant EP/W006022/1. DJW thanks the Australian Mathematical Society for a Lift-off Fellowship and the Queensland University of Technology (QUT) for support through the Early Career Researcher Support Scheme. DJW acknowledges continued support from the Centre for Data Science at QUT. REB is supported by a Royal Society Wolfson Research Merit Award.

References

  • [1]. Peherstorfer B, Willcox K, Gunzburger M. Survey of multifidelity methods in uncertainty propagation, inference, and optimization. SIAM Review. 2018;60:550–591.
  • [2]. Peherstorfer B, Willcox K, Gunzburger M. Optimal model management for multifidelity Monte Carlo estimation. SIAM Journal on Scientific Computing. 2016;38:A3163–A3194.
  • [3]. Ng LWT, Willcox KE. Multifidelity approaches for optimization under uncertainty. International Journal for Numerical Methods in Engineering. 2014;100:746–772.
  • [4]. Sisson SA, Fan Y, Beaumont M, editors. Handbook of Approximate Bayesian Computation. CRC Press; 2020.
  • [5]. Sunnåker M, Busetto AG, Numminen E, Corander J, Foll M, Dessimoz C. Approximate Bayesian computation. PLoS Computational Biology. 2013;9:e1002803. doi: 10.1371/journal.pcbi.1002803.
  • [6]. Price LF, Drovandi CC, Lee A, Nott DJ. Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics. 2018;27:1–11.
  • [7]. Marjoram P, Molitor J, Plagnol V, Tavaré S. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences. 2003;100:15324–15328. doi: 10.1073/pnas.0306899100.
  • [8]. Sisson SA, Fan Y, Tanaka MM. Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences. 2007;104:1760–1765. doi: 10.1073/pnas.0607208104.
  • [9]. Toni T, Welch D, Strelkowa N, Ipsen A, Stumpf MP. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface. 2009;6:187–202. doi: 10.1098/rsif.2008.0172.
  • [10]. Del Moral P, Doucet A, Jasra A. An adaptive sequential Monte Carlo method for approximate Bayesian computation. Statistics and Computing. 2011;22:1009–1020.
  • [11]. Prescott TP, Baker RE. Multifidelity approximate Bayesian computation. SIAM/ASA Journal on Uncertainty Quantification. 2020;8:114–138.
  • [12]. Prescott TP, Baker RE. Multifidelity approximate Bayesian computation with sequential Monte Carlo parameter sampling. SIAM/ASA Journal on Uncertainty Quantification. 2021;9:788–817.
  • [13]. Cranmer K, Brehmer J, Louppe G. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences. 2020;117:30055–30062. doi: 10.1073/pnas.1912789117.
  • [14]. Drovandi CC, Pettitt AN. Estimation of parameters for macroparasite population evolution using approximate Bayesian computation. Biometrics. 2011;67:225–233. doi: 10.1111/j.1541-0420.2010.01410.x.
  • [15]. Warne DJ, Prescott TP, Baker RE, Simpson MJ. Multifidelity multilevel Monte Carlo to accelerate approximate Bayesian parameter inference for partially observed stochastic processes. Journal of Computational Physics. 2022;469:111543.
  • [16]. Guha N, Tan X. Multilevel approximate Bayesian approaches for flows in highly heterogeneous porous media and their applications. Journal of Computational and Applied Mathematics. 2017;317:700–717.
  • [17]. Jasra A, Jo S, Nott D, Shoemaker C, Tempone R. Multilevel Monte Carlo in approximate Bayesian computation. Stochastic Analysis and Applications. 2019;37:346–360.
  • [18]. Warne DJ, Baker RE, Simpson MJ. Multilevel rejection sampling for approximate Bayesian computation. Computational Statistics & Data Analysis. 2018;124:71–86.
  • [19]. Warne DJ, Baker RE, Simpson MJ. Rapid Bayesian inference for expensive stochastic models. Journal of Computational and Graphical Statistics. 2021;31:512–528.
  • [20]. Prangle D. Lazy ABC. Statistics and Computing. 2016;26:171–185.
  • [21]. Christen JA, Fox C. Markov chain Monte Carlo using an approximation. Journal of Computational and Graphical Statistics. 2005;14:795–810.
  • [22]. Everitt RG, Rowińska PA. Delayed acceptance ABC-SMC. Journal of Computational and Graphical Statistics. 2021;30:55–66.
  • [23]. Yan L, Zhou T. Adaptive multi-fidelity polynomial chaos approach to Bayesian inference in inverse problems. Journal of Computational Physics. 2019;381:110–128.
  • [24]. Yan L, Zhou T. An adaptive surrogate modeling based on deep neural networks for large-scale Bayesian inverse problems. Communications in Computational Physics. 2020;28:2180–2205.
  • [25]. Bon JJ, Warne DJ, Nott DJ, Drovandi C. Bayesian score calibration for approximate models. 2022. arXiv:2211.05357.
  • [26]. Gillespie DT. Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry. 1977;81:2340–2361.
  • [27]. Bezanson J, Edelman A, Karpinski S, Shah VB. Julia: A fresh approach to numerical computing. SIAM Review. 2017;59:65–98.
  • [28]. Owen AB. Monte Carlo Theory, Methods and Examples. 2013. URL: https://artowen.su.domains/mc/
  • [29]. Giles MB. Multilevel Monte Carlo methods. Acta Numerica. 2015;24:259–328.
  • [30]. Rhee C-H, Glynn PW. Unbiased estimation with square root convergence for SDE models. Operations Research. 2015;63:1026–1043.
  • [31]. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer New York; 2009.
  • [32]. Schnoerr D, Sanguinetti G, Grima R. Approximation and inference methods for stochastic biochemical kinetics—a tutorial review. Journal of Physics A: Mathematical and Theoretical. 2017;50:093001.
  • [33]. Warne DJ, Baker RE, Simpson MJ. Simulation and inference algorithms for stochastic biochemical reaction networks: From basic concepts to state-of-the-art. Journal of The Royal Society Interface. 2019;16:20180943. doi: 10.1098/rsif.2018.0943.
  • [34]. Erban R, Chapman SJ. Stochastic Modelling of Reaction–Diffusion Processes. Cambridge University Press; 2019.
  • [35]. Warne DJ, Baker RE, Simpson MJ. A practical guide to pseudo-marginal methods for computational inference in systems biology. Journal of Theoretical Biology. 2020;496:110255. doi: 10.1016/j.jtbi.2020.110255.
  • [36]. Lester C. Multi-level approximate Bayesian computation. 2019. arXiv:1811.08866.
  • [37]. Levine ME, Stuart AM. A framework for machine learning of model error in dynamical systems. 2021. arXiv:2107.06658.
  • [38]. Cornuet J-M, Marin J-M, Mira A, Robert CP. Adaptive multiple importance sampling. Scandinavian Journal of Statistics. 2012;39:798–812.
  • [39]. Gorodetsky AA, Jakeman JD, Geraci G. MFNets: data efficient all-at-once learning of multifidelity surrogates as directed networks of information sources. Computational Mechanics. 2021;68:741–758.
  • [40]. Gillespie DT. Approximate accelerated stochastic simulation of chemically reacting systems. The Journal of Chemical Physics. 2001;115:1716–1733.
  • [41]. Andrieu C, Roberts GO. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics. 2009;37.
  • [42]. Drovandi C, Everitt RG, Golightly A, Prangle D. Ensemble MCMC: Accelerating pseudo-marginal MCMC for state space models using the ensemble Kalman filter. Bayesian Analysis. 2022;17:223–260.
