Abstract
Variational Message Passing (VMP) provides an automatable and efficient algorithmic framework for approximating Bayesian inference in factorized probabilistic models that consist of conjugate exponential family distributions. The automation of Bayesian inference tasks is very important since many data processing problems can be formulated as inference tasks on a generative probabilistic model. However, accurate generative models may also contain deterministic and possibly nonlinear variable mappings and non-conjugate factor pairs that complicate the automatic execution of the VMP algorithm. In this paper, we show that executing VMP in complex models relies on the ability to compute the expectations of the statistics of hidden variables. We extend the applicability of VMP by approximating the required expectation quantities in appropriate cases by importance sampling and Laplace approximation. As a result, the proposed Extended VMP (EVMP) approach supports automated efficient inference for a very wide range of probabilistic model specifications. We implemented EVMP in the Julia language in the probabilistic programming package ForneyLab.jl and show by a number of examples that EVMP renders an almost universal inference engine for factorized probabilistic models.
Keywords: Bayesian inference, variational inference, factor graphs, variational message passing, probabilistic programming
1. Introduction
Probabilistic Programming Languages (PPL) and packages [1] have gained strong popularity over recent years since they support fast algorithm development through automating Bayesian inference in probabilistic models. Many of these PPLs [2,3,4,5] are based on numerical approximation methods, which leads to inexact inference results, even if the model comprises conjugate factor pairs and exact inference is achievable. Moreover, although a majority of popular PPLs scale well to processing large data sets due to their stochastic inference settings [6], they tend to execute very slowly for certain types of structured dynamic models, such as state space models. Alternatively, some PPLs that execute inference by message passing in a factor graph [7,8] provide efficient inference performance by exploiting factorization and conjugacy between exponential family-based distribution pairs in the model. In particular the Variational Message Passing (VMP) [9,10] algorithm has gained a good reputation, as it supports efficient inference for conjugate factor pairs in factorized probabilistic models. Unfortunately, non-conjugate factor pairs complicate the automated estimation of posterior distributions, due to intractability of the normalization constants. Likewise, non-linear deterministic relations between model variables often create non-conjugate pairings and thus obstruct the message-passing-based inference mechanism.
This paper proposes an Extended VMP (EVMP) algorithm to support automated efficient inference on a wide class of models that contain both non-conjugate relations between factor pairs and deterministic, possibly non-linear factor nodes. In our solution proposal, the regular VMP algorithm constructs the functional forms of the messages. These functional forms contain expectations of functions of hidden variables. In the case that these expectation quantities cannot be evaluated to a closed-form expression, we estimate them by Importance Sampling (IS) [11], which is a well-known Monte Carlo method that approximates intractable posteriors by a set of weighted samples and estimates expectations over this sample set. We also make use of Laplace approximation ([12], Section 4.4) with support by automatic differentiation tools (autodiff) [13] in appropriate cases to approximate posteriors by normal distributions, which allows us to calculate the expectations over the approximating normal distribution. Our proposal leads to an efficient, automatable message-passing framework that removes most model specification limitations.
In Section 2, we start with a review of factor graphs and the VMP algorithm. Next, we specify the proposed Extended VMP algorithm in Section 3. In order to keep the paper readable, both for the advanced researcher and someone who just needs the results, we defer detailed discussions and derivations of the key equations in EVMP to Appendix A and Appendix B. We implemented EVMP in the Julia package ForneyLab.jl [8,14]. In Section 4 we present several comparative experiments of EVMP in ForneyLab vs. Turing.jl, which is an alternative state-of-the-art Julia-based PPL that focuses on Monte Carlo methods for inference. We show that EVMP transforms ForneyLab into an almost universally applicable inference engine, while retaining computational efficiency, due to its library of closed-form message passing rules. An extensive comparison to related work is presented in Section 5.
2. Problem Statement
Variational Message Passing on Forney-Style Factor Graphs
We assume a probabilistic model $p(y, z)$ with a given set of observations $y$ and a set of latent variables $z$. Bayesian inference in this model relates to evaluating the following posterior:

$p(z \mid y) = \frac{p(y, z)}{p(y)},$

which relies on evaluating the model evidence $p(y) = \int p(y, z)\, \mathrm{d}z$. Unfortunately, due to the computational complexity of evaluating the integral for the evidence, exact Bayesian inference is achievable only for a limited set of probabilistic models. Alternatively, inference can be executed by minimization of a variational objective called free energy,

| $F[q] \triangleq \mathbb{E}_{q}\!\left[\log \frac{q(z)}{p(y, z)}\right] = \mathrm{KL}\!\left[q(z)\,\|\,p(z \mid y)\right] - \log p(y),$ | (1) |

where $\mathbb{E}_{q}[\cdot]$ stands for the expectation with respect to $q(z)$ and $\mathrm{KL}$ denotes a Kullback–Leibler divergence. (In this paper, we denote the expected value of a function $f$ with respect to a distribution $q$ both by $\mathbb{E}_{q}[f]$ and $\langle f \rangle_{q}$.) The KL divergence is greater than or equal to zero for any distribution $q$ and equals zero if and only if $q(z) = p(z \mid y)$. As a result, minimizing the free energy with respect to $q$ leads both to an approximate posterior $q(z) \approx p(z \mid y)$ and an upper bound $F[q] \geq -\log p(y)$ on the negative log-evidence. In practice, minimization of $F[q]$ is often greatly alleviated by assuming a mean-field constraint, i.e., a fully factorized posterior $q(z) = \prod_j q(z_j)$.
Variational inference on factorized models with the mean-field assumption for q leads to an automatable algorithm called Variational Message Passing (VMP) [9,10]. VMP can be visualized by representing the model as a graph and interpreting the VMP update equations as messages.
In this paper, we favor a Forney-style Factor Graph (FFG) representation to visualize the factorization properties of probabilistic models and inference by message passing [15]. FFGs are undirected graph representations of factorized probabilistic models in which the conditional distributions are represented by nodes and the variables are associated with edges that connect the nodes. Besides visualizing the factorization properties of probabilistic models, FFGs also provide a formal framework for message passing-based inference. In the FFGs that we discuss here, we distinguish three types of factors (nodes): soft factors, deterministic factors and equality factors. Throughout the paper, a soft factor represents an Exponential Family (EF) distribution (see (6) for the definition), such as a Gaussian, Bernoulli, Gamma or Categorical distribution. Deterministic factors hold deterministic mappings of variables; in particular, we will use the relation $x = g(z)$, where $g$ is a deterministic function. Lastly, equality factors are used to circumvent the constraint that an edge (representing a variable $z$) can only be connected to at most two factors. In an FFG representation, this problem is resolved by adding copy variables $z'$ and $z''$ and constraining the beliefs over these copies through an equality factor $\delta(z' - z)\,\delta(z'' - z)$. As an example, the FFG for one time step of a hierarchical state-space model is visualized in Figure 1. In this graph, the soft factors are encoded by EF distributions, while the deterministic factor, containing a given function $g$, represents a non-linear deterministic relation. An interesting property of FFGs is hierarchical composition: we can create new "higher level" nodes by enclosing a set of connected nodes in a box and integrating out the internal variables in the box. For instance, the composite node in Figure 1 is created by composing two of the factors and integrating out their shared internal variable.
Figure 1.
An FFG representation of one time step of a state space model. In FFGs, factors represent (conditional) distributions. Here, the soft factors each represent an exponential family distribution, while the deterministic factor holds a deterministic function $g$. It is possible to compose factors and consider them as a single unit. In this example, the composite node, visualized by a dashed box, stands for the composition of two of the factors. It is a notational convention to visualize observed values by a small black node.
For a more detailed introduction to FFGs, we refer to [15,16].
Aside from visualization, FFGs also serve to formalize message-passing-based inference in probabilistic models, and VMP on FFGs realizes coordinate-descent optimization of the free energy functional (1). Coordinate-descent optimization of the free energy refers to iterative updates of the variational factors one at a time while keeping the other factors fixed [12,17]. To illustrate, let us optimize $F[q]$ with respect to $q(z_j)$ for a system in which the edge for $z_j$ connects the factors $f_a$ and $f_b$ (see Figure 2). First, we decompose the free energy as

$F[q] = F_j[q] + C,$

where $C$ holds terms that are not a function of the variable $z_j$. The term $F_j[q]$ can be re-arranged as follows:

| $F_j[q] = \mathbb{E}_{q(z_j)}\!\left[\log \frac{q(z_j)}{\exp\!\big(\mathbb{E}_{q(z \setminus z_j)}[\log f_a + \log f_b]\big)}\right] + \text{const},$ | (2) |

so it follows that $F_j[q]$ is minimized when $q(z_j)$ is set proportional to the denominator in (2) [12,17]. In addition, notice that the terms with the local factors $f_a$ and $f_b$ are uncoupled, which paves the way for a message-passing interpretation of coordinate-descent variational inference. As a result, provided that $f_a$ and $f_b$ are not deterministic factors, the VMP algorithm proceeds by repeating the following four steps until convergence [10] (a toy Julia sketch of this loop is given after the list):
- Choose a variable $z_j$ from the set $z$.
- Compute the incoming messages:

| $\nu_{f_a \to z_j}(z_j) \propto \exp\!\big(\mathbb{E}_{q(z \setminus z_j)}[\log f_a]\big), \qquad \nu_{f_b \to z_j}(z_j) \propto \exp\!\big(\mathbb{E}_{q(z \setminus z_j)}[\log f_b]\big).$ | (3) |

- Update the posterior:

| $q(z_j) \propto \nu_{f_a \to z_j}(z_j)\, \nu_{f_b \to z_j}(z_j).$ | (4) |

- Update the local free energy (for performance tracking), i.e., update all terms in $F[q]$ that are affected by the update (4):

| $F_j[q] = \mathbb{E}_{q}\big[\log q(z_j)\big] - \mathbb{E}_{q}\big[\log f_a\big] - \mathbb{E}_{q}\big[\log f_b\big].$ | (5) |
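To make these steps concrete, the following minimal fragment (plain Julia, not ForneyLab code) runs the four-step loop on a toy model with Gaussian observations, a Gaussian prior on the mean and a Gamma prior on the precision. For this conjugate model, steps 2 and 3 reduce to the closed-form updates shown in the comments; all names are illustrative.

```julia
# Toy coordinate-descent VMP loop for the model
#   y_n ~ Normal(m, 1/w),  m ~ Normal(m0, 1/v0),  w ~ Gamma(a0, rate = b0),
# with mean-field posterior q(m, w) = q(m) q(w).
function toy_vmp(y; m0=0.0, v0=1.0, a0=1e-2, b0=1e-2, iters=20)
    N = length(y)
    mq, vq = m0, v0          # q(m) = Normal(mq, 1/vq), vq is a precision
    aq, bq = a0, b0          # q(w) = Gamma(aq, rate = bq)
    for _ in 1:iters
        Ew = aq / bq                                   # E_q[w]
        vq = v0 + N * Ew                               # update q(m): add natural parameters
        mq = (v0 * m0 + Ew * sum(y)) / vq
        aq = a0 + N / 2                                # update q(w)
        bq = b0 + 0.5 * sum((y .- mq) .^ 2 .+ 1 / vq)  # uses E_q[(y_n - m)^2]
    end
    return (mean_m = mq, var_m = 1 / vq, mean_w = aq / bq)
end

y = 2.0 .+ randn(100) ./ sqrt(4.0)   # synthetic data: true mean 2, true precision 4
println(toy_vmp(y))
```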
Figure 2.
FFG representation for an edge $z_j$ with connected nodes $f_a$ and $f_b$.
As we see in (3), messages flow on edges in both directions. It is common parlance to call one of the messages the forward message (denoted by $\overrightarrow{\nu}$) and the other the backward message ($\overleftarrow{\nu}$).
In this paper, the central problem is how to execute the VMP update Equations (3) through (5) for a wide range of specifications of the factors $f_a$ and $f_b$. In the next section, we specify the proposed Extended VMP (EVMP) solution. A more detailed derivation of the key equations of EVMP is provided in Appendix B.
3. Specification of EVMP Algorithm
Variational Message Passing is a fast, efficient and deterministic approximate inference algorithm. However, the applicability of VMP heavily relies on connected factors being conjugate pairs (see Appendix A). In contrast, Monte Carlo methods (see [18] for a message-passing interpretation) are applicable to a wider range of models with non-conjugate factor pairs. Unfortunately, in comparison to VMP, Monte Carlo methods are considerably slower since they rely on stochastic simulations. As we elaborate in Section 5, recent efforts to combine the best of Monte Carlo methods and variational inference predominantly focus on noisy gradient estimation of the free energy through Monte Carlo sampling and do not take full advantage of deterministic message passing steps in inference.
In this section, we specify the EVMP algorithm, which combines the efficiency of VMP with the flexibility of the Laplace approximation and the universality of Monte Carlo methods. In the proposed EVMP algorithm, VMP constructs the functional forms of the messages while importance sampling and Laplace approximations are used to estimate the required expectations of statistical quantities if they are not available in closed form. We first specify the range of probability distribution types for factors, messages and posteriors. These different types are used to identify the specific calculation rules for updating the messages and posteriors in (3) and (4). We refer the interested reader to Appendix B for detailed derivations.
3.1. Distribution Types
We consider the following representation types for probability distributions over a variable $z$.
(1) The standard Exponential Family (EF) of distributions, i.e., the following:

| $q(z) = h(z)\, \exp\!\big(\eta^{\top} T(z) - A(\eta)\big),$ | (6) |

where $h(z)$ is the base measure, $T(z)$ is the sufficient statistics vector, $\eta$ is the natural parameters vector and $A(\eta)$ is the log-partition function.

(2) Distributions that are of the following exponential form:

| $q(z) \propto h\big(g(z)\big)\, \exp\!\big(\eta^{\top} T\big(g(z)\big)\big),$ | (7) |

where $g$ is a deterministic function. The key characteristic here is that $T(g(z))$ is not recognized as a sufficient statistics vector for any of the standard EF distributions. We call this distribution type a Non-Standard Exponential Family (NEF) distribution. As we show in Section 3.6, this distribution type arises only in backward message calculations.

(3) A List of Weighted Samples (LWS), i.e., the following:

| $q(z) = \sum_{i=1}^{N} w^{(i)}\, \delta\big(z - z^{(i)}\big), \qquad \sum_{i=1}^{N} w^{(i)} = 1.$ | (8) |

(4) Deterministic relations are represented by delta distributions, i.e., the following:

| $p(x \mid z) = \delta\big(x - g(z)\big).$ | (9) |

Technically, the equality factor $f_{=}(z, z', z'') = \delta(z' - z)\,\delta(z'' - z)$ also specifies a deterministic relation between variables.
3.2. Factor Types
Factor types are represented by EF and delta distributions.
In a VMP setting, as discussed in this and previous papers on VMP, conjugate soft factors from the exponential family enjoy some computational advantages. As an extension to VMP, the EVMP algorithm inherits the same computational advantages for conjugate factor pairs. In order to automate and generalize the inference to custom non-conjugate soft factors, we compose a generic soft factor by a delta distribution (to describe a non-linear deterministic function) and a standard EF distribution. This decomposition relieves us from manually deriving VMP messages for each different soft factor specification. For a given composite node (delta + standard EF), the EVMP algorithm uses the predefined VMP messages for the standard EF component to compute messages around the composite node. As we will see, this formulation yields an almost generic inference procedure.
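As an illustration of this decomposition, the sketch below (illustrative names, not ForneyLab API) expresses a Poisson likelihood with a log link as a deterministic node $g(z) = \exp(z)$ feeding a standard EF Poisson factor, so that only the generic rules for deterministic nodes and the stored rules for the Poisson node are needed.

```julia
# Sketch of the composite-node idea: a custom likelihood p(y | z) = Poisson(y; exp(z))
# composed from a deterministic node and a standard exponential-family factor.
using Distributions

g(z) = exp(z)                           # deterministic (nonlinear) component
ef_factor(y, λ) = pdf(Poisson(λ), y)    # standard EF component

composite(y, z) = ef_factor(y, g(z))    # the composed soft factor p(y | z)

println(composite(3, 1.2))              # evaluates Poisson(3; exp(1.2))
```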
3.3. Message Types
Forward messages carry either an EF or an LWS distribution. Backward messages carry either an EF or an NEF distribution. This assignment is a convention; it merely indicates that, in the EVMP algorithm, two colliding messages in a posterior calculation are never both of the LWS type, nor both of the NEF type.
3.4. Posterior Types
The posteriors are represented by either the EF or LWS representations.
To summarize the terminology so far, we defined four distribution types: Standard EF (EF), Non-Standard EF (NEF), List of Weighted Samples (LWS) and delta distributions. The end user of our algorithm can design a model by using EF and delta distributions. Under the hood, messages may carry EF, NEF or LWS distributions to render the inference. As the output, the end user is provided with either the EF or LWS posteriors. Next, we discuss how posteriors, messages and free energies are computed in the EVMP algorithm. The different types can be used to identify which computational recipe applies. As an aside, Julia’s support for multiple dispatch in functions [14] makes this a very elegant mechanism that requires almost no if–then rules.
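The fragment below sketches how such type-based dispatch could look; the type and function names are purely illustrative and do not correspond to ForneyLab internals.

```julia
# Sketch of the dispatch mechanism: each combination of message types
# automatically selects its own posterior-update rule.
abstract type MessageDist end
struct EF  <: MessageDist; η::Vector{Float64}; end                # natural parameters
struct NEF <: MessageDist; logdensity::Function; end              # unnormalized log-density
struct LWS <: MessageDist; s::Vector{Float64}; w::Vector{Float64}; end

posterior(fwd::EF, bwd::EF) = EF(fwd.η .+ bwd.η)                  # conjugate case: add parameters

function posterior(fwd::LWS, bwd::NEF)                            # reweight the forward samples
    w = fwd.w .* exp.(map(bwd.logdensity, fwd.s))
    return LWS(fwd.s, w ./ sum(w))
end

q = posterior(LWS(randn(5), fill(0.2, 5)), NEF(z -> -z^2))        # example call
```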
3.5. Computation of Posteriors
Here, we discuss how EVMP updates the posteriors in (4). In an FFG, computation of the posterior $q(z)$ is realized by a multiplication of the colliding forward and backward messages, $\overrightarrow{\nu}(z)$ and $\overleftarrow{\nu}(z)$, respectively, followed by normalization. We distinguish four types of updates.
(1) In the case that the colliding forward and backward messages both carry EF distributions with the same sufficient statistics $T(z)$, computing the posterior simplifies to a summation of natural parameters:

$q(z) \propto \overrightarrow{\nu}(z)\, \overleftarrow{\nu}(z) \propto \exp\!\big((\overrightarrow{\eta} + \overleftarrow{\eta})^{\top} T(z)\big).$

In this case, the posterior will also be represented by the EF distribution type. This case corresponds to classical VMP with conjugate factor pairs.
(2) The forward message again carries a standard EF distribution. The backward message carries either an NEF distribution or a non-conjugate EF distribution.

(a) If the forward message $\overrightarrow{\nu}(z)$ is Gaussian, we use a Laplace approximation to compute the posterior:

| $q(z) \approx \mathcal{N}(z;\, z^{*},\, V^{*}), \quad z^{*} = \arg\max_{z}\big[\log \overrightarrow{\nu}(z) + \log \overleftarrow{\nu}(z)\big], \quad V^{*} = \Big(-\nabla_{z}^{2}\big[\log \overrightarrow{\nu}(z) + \log \overleftarrow{\nu}(z)\big]\Big|_{z = z^{*}}\Big)^{-1}.$ | (10) |

(b) Otherwise ($\overrightarrow{\nu}(z)$ is not Gaussian), we use Importance Sampling (IS) to compute the posterior (a Julia sketch of this step follows this list):

| $q(z) \approx \sum_{i=1}^{N} w^{(i)}\, \delta\big(z - z^{(i)}\big), \quad z^{(i)} \sim \overrightarrow{\nu}(z), \quad w^{(i)} = \frac{\overleftarrow{\nu}(z^{(i)})}{\sum_{k=1}^{N} \overleftarrow{\nu}(z^{(k)})}.$ | (11) |
(3) The forward message carries an LWS distribution, i.e., the following:

$\overrightarrow{\nu}(z) = \sum_{i=1}^{N} \overrightarrow{w}^{(i)}\, \delta\big(z - z^{(i)}\big),$

and the backward message carries either an EF or NEF distribution. In that case, the posterior computation refers to updating the weights in $\overrightarrow{\nu}(z)$ (see Appendix E):

| $q(z) \approx \sum_{i=1}^{N} w^{(i)}\, \delta\big(z - z^{(i)}\big), \quad \text{where } w^{(i)} = \frac{\overrightarrow{w}^{(i)}\, \overleftarrow{\nu}(z^{(i)})}{\sum_{k} \overrightarrow{w}^{(k)}\, \overleftarrow{\nu}(z^{(k)})}.$ | (12) |
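The importance sampling update of case (2b) can be sketched as follows; the helper names are illustrative, and, as in EVMP, the forward message serves as the proposal distribution.

```julia
# Sketch of the importance-sampling posterior update: the forward message is the
# proposal, the backward message supplies the (unnormalized) weights, and the
# result is a list of weighted samples (LWS).
using Distributions

function is_posterior(fwd::UnivariateDistribution, log_bwd::Function; N=1000)
    s = rand(fwd, N)                       # z⁽ⁱ⁾ ~ forward message
    logw = map(log_bwd, s)                 # unnormalized log-weights from the backward message
    w = exp.(logw .- maximum(logw))        # subtract the max for numerical stability
    return s, w ./ sum(w)
end

# Example: Gamma forward message colliding with a non-conjugate backward message
s, w = is_posterior(Gamma(2.0, 1.0), z -> -z - 3 / z)
posterior_mean = sum(w .* s)               # E_q[z] estimated from the weighted samples
```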
3.6. Computation of Messages
Here, we discuss how EVMP computes the messages in (3). We specify different message calculation rules depending on the type of the factor.
-
(1)If factor is a soft factor of the form (see Figure 3a)
then the outgoing VMP message to is the following EF-distributed message:(13) If rather (or ) than is the output variable of , i.e., if the following is true:
then the outgoing message to is either an EF or an NEF distribution of the following form:(14) In this last expression, we chose to assign a backward arrow to since it is customary to align the message direction with the direction of the factor, which in this case points to .
Note that these message calculation rules require the computation of expectations of statistics of the arguments of the factor. In the update rules shown below, we will see such expectations of statistics of z appear over and over again. In Section 3.8, we detail how we calculate these expectations, and in Appendix A, we further discuss their origins.
-
(2) In the case that the factor is a deterministic factor (see Figure 3b), i.e.,

$f(x \mid z_1, \dots, z_K) = \delta\big(x - g(z_1, \dots, z_K)\big),$

then the forward message from the factor to x is of LWS type and is calculated as follows:

| $x^{(i)} = g\big(z_1^{(i)}, \dots, z_K^{(i)}\big), \quad \text{with } z_k^{(i)} \sim \overrightarrow{\nu}(z_k) \text{ for } i = 1, \dots, N,$ | (15) |
| $\overrightarrow{\nu}(x) = \sum_{i=1}^{N} \tfrac{1}{N}\, \delta\big(x - x^{(i)}\big).$ | (16) |

For the computation of the backward messages toward the input variables, we distinguish two cases (a Julia sketch of the forward and backward rules for a single-input deterministic node follows Figure 3):
-
(a)If all forward incoming messages from the variables are Gaussian, we first use a Laplace approximation to obtain a Gaussian joint posterior ; see Appendix B.1.2 and Appendix B.2.2 for details. Then, we evaluate the posteriors for individual random variables, e.g., . Finally, we send the following Gaussian backward message:
(17) -
(b)Otherwise (the incoming messages from the variables are not all Gaussian), we use Monte Carlo and send a message to as a NEF distribution:
(18) Note that if is a single input deterministic node, i.e., , then the backward message simplifies to (Appendix B.1.1).
(3) The third factor type that leads to a special message computation rule is the equality node; see Figure 3c. The outgoing message from an equality node

$f_{=}(z, z', z'') = \delta(z' - z)\, \delta(z'' - z)$

is computed by following the sum–product rule:

| $\nu_{f_{=} \to z''}(z'') \propto \nu_{z \to f_{=}}(z'')\; \nu_{z' \to f_{=}}(z''),$ | (19) |

i.e., the product of the incoming messages on the other two edges, evaluated at $z''$.
Figure 3.
Different factor types for outgoing message computation rules.
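The following fragment sketches the forward (LWS) and backward (NEF) rules for a single-input deterministic node $x = g(z)$; the function names are illustrative, not ForneyLab API.

```julia
# Sketch of the two EVMP rules around a single-input deterministic node x = g(z):
# the forward message toward x is a list of weighted samples obtained by pushing
# samples of the incoming message through g (cf. (15)-(16)); the backward message
# toward z is the incoming backward log-density composed with g.
using Distributions

g(z) = exp(z)                                        # example nonlinearity

function forward_message(incoming::UnivariateDistribution; N=1000)
    s = rand(incoming, N)                            # z⁽ⁱ⁾ ~ incoming forward message
    return g.(s), fill(1 / N, N)                     # equally weighted samples of x = g(z)
end

backward_message(log_bwd_x::Function) = z -> log_bwd_x(g(z))   # an unnormalized NEF log-density

xs, ws = forward_message(Normal(0.0, 1.0))
logν_z = backward_message(x -> logpdf(Gamma(2.0, 1.0), x))     # example backward message on x
```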
3.7. Computation of Free Energy
Here, we discuss how EVMP computes the FE update from (5). Note that the FE can be decomposed into a subtraction of energy and entropy terms:

| $F[q] = \underbrace{\mathbb{E}_{q}\big[-\log p(y, z)\big]}_{\text{energy}} - \underbrace{\mathbb{E}_{q}\big[-\log q(z)\big]}_{\text{entropy}}.$ | (20) |

These energy and entropy terms can be evaluated because $p(y, z)$ contains only factors that are defined in the generative model and $q(z)$ is also accessible as a result of variational inference. Thus, we evaluate the FE by evaluating the energy and entropy terms separately (a small Julia sketch of this bookkeeping is given at the end of this subsection).
For an EF-encoded soft factor

$f(x \mid z) = h(x)\, \exp\!\big(\eta(z)^{\top} T(x) - A(\eta(z))\big),$

the energy over the factor evaluates to

$\mathbb{E}_{q}\big[-\log f(x \mid z)\big] = -\mathbb{E}_{q}\big[\log h(x)\big] - \mathbb{E}_{q}\big[\eta(z)\big]^{\top} \mathbb{E}_{q}\big[T(x)\big] + \mathbb{E}_{q}\big[A(\eta(z))\big],$

where the factorization of $q$ allows the expectations over $x$ and $z$ to be taken separately.
The entropy terms only need to be evaluated for variables $z$ that are not associated with output edges of deterministic nodes. In that case, we calculate the entropy of $q(z)$ as follows:

- If $q(z)$ is represented by a standard EF distribution, i.e., $q(z) = h(z)\,\exp\!\big(\eta^{\top} T(z) - A(\eta)\big)$, then

$H[q] = -\mathbb{E}_{q}\big[\log q(z)\big] = -\mathbb{E}_{q}\big[\log h(z)\big] - \eta^{\top}\, \mathbb{E}_{q}\big[T(z)\big] + A(\eta).$
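The fragment below sketches this bookkeeping for a single variable: the energy term is estimated over a weighted-sample representation of the posterior, while the entropy of a Gaussian posterior is evaluated in closed form. The helper names are illustrative.

```julia
# Sketch of the free energy in (20): energy E_q[-log f] over an LWS, entropy in
# closed form for a Gaussian q.
using Distributions

energy(logf::Function, s, w) = -sum(w .* map(logf, s))       # E_q[-log f(z)] over an LWS

gauss_entropy(σ²) = 0.5 * log(2π * ℯ * σ²)                   # closed-form Gaussian entropy

q = Normal(1.0, 0.5)                                         # an EF posterior
s = rand(q, 2000)
w = fill(1 / length(s), length(s))
F = energy(z -> logpdf(Normal(0.0, 2.0), z), s, w) - gauss_entropy(var(q))
```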
3.8. Expectations of Statistics
In many of the above computations for messages, posteriors and free energies, we need to compute certain expectations of statistics of $z$; e.g., the computation of the forward message in (13) requires the evaluation of such expectations. Here, we discuss how EVMP evaluates these expectations. Let us denote a statistic of the random variable $z$ by $\phi(z)$ and assume we are interested in the expected value $\mathbb{E}_{q}[\phi(z)]$. The calculation rule depends on the type of $q(z)$ (a Julia sketch of both calculation routes is given at the end of this subsection):
(1) We have two cases when $q(z)$ is coded as an EF distribution, i.e., $q(z) = h(z)\,\exp\!\big(\eta^{\top} T(z) - A(\eta)\big)$:

(a) If $\phi(z)$ is an element of the sufficient statistics vector $T(z)$, then $\mathbb{E}_{q}[\phi(z)]$ is available in closed form as the gradient of the log-partition function (this is worked out in Appendix A.1.1, see (A14) and (A15)):

$\mathbb{E}_{q}\big[T(z)\big] = \nabla_{\eta}\, A(\eta).$

(b) Otherwise ($\phi(z) \notin T(z)$), we evaluate

$\mathbb{E}_{q}\big[\phi(z)\big] \approx \frac{1}{N} \sum_{i=1}^{N} \phi\big(z^{(i)}\big),$

where $z^{(i)} \sim q(z)$ for $i = 1, \dots, N$.
(2) In case $q(z)$ is represented by an LWS, i.e., the following:

$q(z) = \sum_{i=1}^{N} w^{(i)}\, \delta\big(z - z^{(i)}\big),$

then we evaluate the following:

$\mathbb{E}_{q}\big[\phi(z)\big] \approx \sum_{i=1}^{N} w^{(i)}\, \phi\big(z^{(i)}\big).$
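The sketch below illustrates both routes: the gradient-of-log-partition identity for a Gaussian written in natural coordinates, checked with ForwardDiff, and the weighted-sample average for an LWS posterior. The function names are illustrative.

```julia
# Two routes to expectations of statistics:
# (1a) for an EF posterior, E_q[T(z)] = ∇_η A(η);
# (2)  for an LWS posterior, E_q[ϕ(z)] ≈ Σᵢ wᵢ ϕ(sᵢ).
using ForwardDiff

# Gaussian in natural coordinates: η = (μ/σ², -1/(2σ²)), T(z) = (z, z²)
A(η) = -η[1]^2 / (4 * η[2]) - 0.5 * log(-2 * η[2])     # log-partition function

μ, σ² = 1.5, 0.4
η = [μ / σ², -1 / (2σ²)]
ET = ForwardDiff.gradient(A, η)                        # ≈ [E[z], E[z²]] = [1.5, 1.5^2 + 0.4]

lws_expectation(ϕ, s, w) = sum(w .* map(ϕ, s))         # weighted sample average
```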
3.9. Pseudo-Code for the EVMP Algorithm
Section 3.1, Section 3.2, Section 3.3, Section 3.4, Section 3.5, Section 3.6, Section 3.7 and Section 3.8 provide a recipe for almost universal evaluation of variational inference in factor graphs. We use classical VMP with closed-form solutions when possible, and resort to Laplace or IS approximations when needed. We now summarize the EVMP algorithm by a pseudo-code fragment in Algorithm 1. We use the following notation: is the set of factor nodes (vertices), where , , stand for the subsets of soft factor nodes, deterministic nodes and equality nodes, respectively. E is the set of edges that connect the nodes. represents the entire factor graph. is the set of hidden variables, where are the variables at the output edges of deterministic nodes. are also associated with edges in E, but in contrast to , are not output edges of deterministic nodes.
For structured factorizations, the overall structure remains the same, but messages and posteriors are calculated for sub-graphs instead of single random variables.
An example to illustrate the calculation of messages and posteriors in the EVMP algorithm is provided in Appendix F.
| Algorithm 1 Extended VMP (Mean-field assumption) |
4. Experiments
We illustrate EVMP-based inference on three different applications (code for experiments can be found at https://github.com/biaslab/ExtendedVMP (accessed on 25 June 2021)). For each application, we show the favorable features of EVMP together with its shortcomings in comparison to Turing [5], which is a general purpose Julia probabilistic programming package.
4.1. Filtering with the Hierarchical Gaussian Filter
The Hierarchical Gaussian Filter (HGF) [19,20] is a popular generative model in the neuroscience community. The HGF consists of a Gaussian random walk model, where the variance of the Gaussian is a nonlinear function of the state of the next higher layer, which in turn evolves according to a Gaussian random walk, and so on. Due to the nonlinear link between the layers, classical VMP rules do not have a closed-form solution. While, in principle, variational updates through Laplace approximation can be manually derived for the HGF model [19], automatically generated EVMP update rules alleviate the need for cumbersome and error-prone manual derivations.
The 2-layer HGF model is defined as
| (21a) |
| (21b) |
| (21c) |
| (21d) |
For this experiment, we generated data points by the following process. First, we generated noisy hidden states using . Next, we generated observations following model (21a–d) with . The generated data set is visualized in (the lower subgraph of) Figure 4.
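For reference, a minimal Julia sketch of a 2-layer HGF generative process of this kind is shown below. It assumes the common HGF parameterization in which the top-layer state sets the log-variance of the bottom-layer random walk; the parameter values are placeholders and not the ones used in this experiment.

```julia
# Minimal sketch of a 2-layer HGF generative process (assumed parameterization):
#   z_t ~ N(z_{t-1}, τ),  x_t ~ N(x_{t-1}, exp(κ z_t + ω)),  y_t ~ N(x_t, σ²).
function simulate_hgf(T; κ=1.0, ω=-2.0, τ=0.01, σ²=0.1)
    z, x, y = zeros(T), zeros(T), zeros(T)
    for t in 2:T
        z[t] = z[t-1] + sqrt(τ) * randn()                   # slow top-layer random walk
        x[t] = x[t-1] + sqrt(exp(κ * z[t] + ω)) * randn()   # volatility controlled by z
        y[t] = x[t] + sqrt(σ²) * randn()                    # noisy observation
    end
    return z, x, y
end

z, x, y = simulate_hgf(500)
```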
Figure 4.
Above: Hidden states and their estimates (ribbon is one variance). The estimates of ForneyLab’s Extended VMP are designated by blue while the estimates of Turing’s ADVI are marked by red. Below: Observed synthetic data.
Next, we filtered the data set by a second HGF, also given by (21a–d) with priors , and parameters . We used EVMP to track the hidden states and . All inference steps including the message passing schedule for filtering in the HGF are detailed in [19]. For each time step, EVMP was run for 10 iterations at each filtering step.
For comparison, we implemented a similar filtering procedure by Automatic Differentiation Variational Inference (ADVI) [21], executed by Julia’s Turing.jl [5] package. At each time step t, the priors over and are set to Gaussian distributions, the mean and variance parameters of which are determined by sampling from the variational posteriors at . The only difference between the ForneyLab and Turing implementations, in terms of posterior distribution factorization, is that in Turing’s ADVI, we posit a fully factorized posterior distribution. This assumption decreases the number of parameters to be estimated via automatic differentiation and speeds up the inference procedure. On the other hand, pre-defined message passing rules in ForneyLab enable us to retain the dependency structure between and at time step t in exchange for almost no run-time loss. To be more precise, at time step t, we run inference on the following model: where and are the posterior approximations from the previous time step. In ForneyLab, we run the inference with variational distribution with . We plot estimations for in Figure 4. In ADVI, the variational distribution is . Once inference has completed, Turing allows drawing samples from the variational distribution. We then calculate the mean and variance of these samples to fit Gaussian distributions on and .
The estimated tracks of the hidden states are visualized in Figure 4. The EVMP and ADVI estimates largely coincide. However, we observe that both methods capture the periodic character of the true hidden states with a delay. We believe that there are two plausible explanations for the delayed estimates: (1) the data generative process is not fully known to the model; the hidden states were originally generated from a sinusoidal function of the discrete time steps, but the model specification does not use this information; (2) the model specification defines a random walk over the hidden variables, which posits the mean of the current state as the previous state. The random walk prevents a hidden variable from changing drastically relative to its previous value, while the observations force the model to explain the volatility in the process. Reconciling these beliefs, both Extended VMP and ADVI estimate the hidden states with a delay.
In Turing’s ADVI procedure, we used 10 samples per iteration for gradient estimation and set the maximum number of iterations to 4000 per time step to be able to capture this periodic behavior. The overall inference is completed in roughly 1.5 min (this and further experiments were carried out on a machine with the following specs: Julia 1.5.0, Turing v0.15.9, AMD Ryzen 7 3700X 3.6 GHz 8-Core CPU, 16 GB DDR4-3200 MHz RAM). ForneyLab’s EVMP procedure, on the other hand, is able to perform inference in under 7 s on this time series; see Table 1. The speed of ForneyLab stems from the hybrid inference nature of EVMP: EVMP resorts to gradient-based optimization only for a subset of the variables, and the sampling procedure is required only to estimate the statistics that are used in the remaining update steps. In contrast, ADVI requires sampling and employs noisy gradients in the estimation of all the components of the variational distribution. This experiment validates EVMP as a fast automated variational inference solution for filtering in hierarchical dynamic models.
Table 1.
Run-time comparison of EVMP (in ForneyLab.jl) vs. ADVI (in Turing.jl) for the hierarchical Gaussian filter model.
| Algorithm | Run Time (s) |
|---|---|
| EVMP (ForneyLab) | |
| ADVI (Turing) | |
4.2. Parameter Estimation for a Linear Dynamical System
In this experiment, we focused attention on a system identification task in a Linear Dynamical System (LDS) [22,23]. An LDS is generally defined as
| $x_t \sim \mathcal{N}\big(A\, x_{t-1},\, Q\big),$ | (22a) |
| $y_t \sim \mathcal{N}\big(B\, x_t,\, R\big),$ | (22b) |
where $y_t$ are observations and $x_t$ are hidden states.
In this experiment, we are interested in inferring the transition matrix A together with the hidden states from a set of observations. Manually derived closed-form solutions for this system identification task are available both in maximum likelihood estimation [24] and variational Bayesian approximation [25] contexts. Nevertheless, the goal in this and other papers on probabilistic programming packages is to automatically infer posteriors over the hidden states and parameters without resorting to manual derivations. In principle, EVMP supports inferring the hidden states and A, B, Q and R concurrently. Of course, depending on specific circumstances, such as system identifiability and the richness of the observed data, the performance may vary.
In order to execute our experiment, we first extend (22a,b) with a prior on A as follows:
| (23a) |
| (23b) |
| (23c) |
| (23d) |
In (23a–d), a holds the vectorized representation of the transition matrix A, i.e., $a = \mathrm{vec}(A)$. Note that (23b) can be written as

$x_t \sim \mathcal{N}\big(g(a, x_{t-1}),\, Q\big), \quad \text{with } g(a, x_{t-1}) = (x_{t-1}^{\top} \otimes I)\, a = A\, x_{t-1},$

and through this manipulation we identify $\delta\big(\cdot - g(a, x_{t-1})\big)$ as the deterministic factor in (15), as sketched in the code fragment below. As a result, ForneyLab’s EVMP works out-of-the-box for inference of the transition matrix in (23a–d).
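The sketch below illustrates this vectorization trick; the function name g is illustrative.

```julia
# With a = vec(A), the product A x equals (xᵀ ⊗ I) a, so it becomes a deterministic
# function g(a, x) of the two random variables a and x_{t-1}.
using LinearAlgebra

function g(a::AbstractVector, x::AbstractVector)
    n = length(x)
    return kron(transpose(x), Matrix{Float64}(I, n, n)) * a
end

A = [0.9 0.1; -0.1 0.9]
x = randn(2)
@assert g(vec(A), x) ≈ A * x     # the deterministic node reproduces the matrix-vector product
```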
We first generated a data set of number of samples by running model (22a,b) with parameters , , and .
Next, we presented the data set to a second LDS model and aimed to infer posteriors over hidden states and transition matrix A. The prior on a was set to and all other parameters were set to the same values as in the data generation process.
We compared the performance of ForneyLab’s EVMP with Turing’s ADVI and NUTS (No U-Turn Sampler, a Hamiltonian Monte Carlo sampling-based inference method) [26] engines; see Figure 5. EVMP, ADVI and NUTS all successfully converged to almost coinciding estimates of the transition matrix (no notable difference when visualized). We also show free energy tracks for EVMP and ADVI in Figure 5. In this experiment, Turing’s ADVI outperformed ForneyLab’s EVMP in terms of total execution time and free energy minimization. As a mitigating factor in this analysis, the pre-compilation of the message passing schedule in ForneyLab takes about 13 s, while the actual execution of the generated inference algorithm is on par with Turing’s ADVI. Execution time details are shown in Table 2.
Figure 5.
Free energy tracks for EVMP on the LDS transition matrix identification task. Left: (a) Mean estimate EVMP for the transition matrix A after 50 iterations, (b) mean estimate after 300 iterations, (c) true transition matrix A that was used to generate the synthetic data. Right: Free energy tracks by ForneyLab’s EVMP and Turing’s ADVI procedures.
Table 2.
Run-time results for transition matrix estimation in the LDS model.
| Algorithm | Free Energy | Total Time (s) |
|---|---|---|
| EVMP (ForneyLab) | 135.837 | |
| ADVI (Turing) | 90.285 | |
| NUTS (Turing) | - | |
4.3. EVMP for a Switching State Space Model
In this experiment, we went beyond models that only contain continuously valued variables and investigated the capabilities of EVMP on a Switching State Space Model (SSSM) [27], which consists of both continuous and discrete hidden variables. The assumption of constant model parameters in the LDS of Section 4.2 does not account for the regime changes that occur in many dynamical systems of interest. The SSSM does allow for modeling parameter switches, and in this experiment we used the following model:
| (24a) |
| (24b) |
| (24c) |
| (24d) |
In this system, $y_t$ are observations, $x_t$ is a continuously valued hidden state and $z_t$ is a one-hot coded three-dimensional selection variable, i.e., $z_{t,k} \in \{0, 1\}$ and $\sum_k z_{t,k} = 1$. The parameters of the system are the state variances and the concentration parameters of the Dirichlet priors. The concentration parameters are all set to 1, except the element that favors remaining in the current regime, which is set to 100 to disfavor frequent regime switches.
We generated data points from a random walk process (24c) and (24d) with process noise variance parameter . From time step to , we set and consequently . From time step to , we set and between to we set . The generated time series is shown in Figure 6.
Figure 6.
Performance results for automated inference in the SSSM. Top: generated data set. Bottom 4 subgraphs: posterior for the regime selection variable by the MF-EVMP, SMF-EVMP, HMC and NUTS procedures, respectively. In the Turing simulations (HMC and NUTS), the number of particles in the Particle Gibbs sampler was set to 50. In the NUTS sampler, the number of adaptation steps is 1000 and the target accept ratio is 0.65. The HMC sampler was tried with varying step sizes and with leapfrog step numbers 10, 20, and 30. The best results are shown.
The main difficulty in state inference for the SSSM stems from the coupling between x and z. This is because the variational message passing rules around the node are not pre-defined in ForneyLab, although technically they can be worked out to closed-form expressions [27]. If EVMP were not available either, then a ForneyLab end user would be expected to manually derive closed-form update rules and implement these rules in an additional ForneyLab node. This type of manually assisted inference by end user calculations is what we try to avoid with EVMP and with probabilistic programming packages in general. EVMP enables the user to compensate for the lack of stored message-passing rules by introducing an auxiliary variable s in the model with a deterministic relation between s and z:
| (25a) |
| (25b) |
| (25c) |
| (25d) |
After we extend model specification (24a–d) with (25a–d), ForneyLab can run EVMP-based inference out of the box. Note that there is no need for manual inference calculations, but rather a simple manipulation of the generative model that makes the system suited for automated inference; a sketch of the kind of deterministic selection relation meant here is given below.
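The fragment below sketches this kind of auxiliary deterministic relation between s and z; the exact form used in (25a–d) may differ, so the names and values are illustrative.

```julia
# The one-hot regime indicator z_t selects one of the candidate process variances;
# the selected value s_t is what the Gaussian random-walk factor receives.
select_variance(z::AbstractVector, w::AbstractVector) = sum(z .* w)   # equals w[k] when z[k] = 1

w = [1.0, 5.0, 0.1]        # candidate regime variances (placeholder values)
z = [0.0, 1.0, 0.0]        # one-hot regime indicator
s = select_variance(z, w)  # 5.0
```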
We tested the performance of two different constraints on the posterior distribution: (1) a mean-field assumption, i.e., ; (2) a structured mean-field assumption, i.e., , see Figure 6. We observe that the structured factorization, being a less stringent constraint on q, yields a slightly better performance than the mean-field factorization, particularly in estimating the length of the first regime.
We also compared the performance of ForneyLab’s EVMP method to Turing’s inference methods. As opposed to the previous two experiments, we could not solely use ADVI, Hamiltonian Monte Carlo (HMC, [28,29]) or NUTS samplers in this experiment, since these procedures do not support inference for discrete random variables. Turing does provide the option to use a Particle Gibbs (PG) sampler [30,31] for the estimation of the discrete random variables in conjunction with the estimation of the continuous random variables by HMC and NUTS. The performance results for NUTS-PG and HMC-PG are shown in Figure 6. The performance of the NUTS-PG and HMC-PG samplers in estimating the correct regimes is far below the EVMP results, although the HMC-PG sampler correctly identified the third regime. The run-time scores are shown in Table 3.
Table 3.
Experimental results for switching state space model.
| Algorithm | Free Energy | Total Time (s) |
|---|---|---|
| EVMP (Mean-field) | 283.991 | |
| EVMP (Structured) | 273.596 | |
| HMC-PG (Turing) | - | |
| NUTS-PG (Turing) | - |
5. Related Work
Hybrid Monte Carlo variational inference techniques have been studied prior to our work. However, mainstream research predominantly consists of variational methods within Monte Carlo techniques as opposed to our Monte Carlo methods within a variational inference approach.
For instance, ref. [32] casts variational distributions as proposal distributions in a Markov-Chain Monte Carlo (MCMC) procedure. Similarly, ref. [33] employs variational methods to design adaptive proposal distributions for Importance Sampling (IS). In [34], gradient estimates of a variational objective are used to tune the parameters of the proposal distributions for MCMC. On the other hand, Monte-Carlo Co-Ordinate Ascent Variational Inference (MC-CAVI), proposed in [35], differs from the aforementioned methods in that it uses MCMC in the calculation of expectations required within the fixed-point iterations of Coordinate Ascent Variational Inference (CAVI).
In this paper, we follow a similar approach as [35], but we use IS to estimate the expectation quantities required in VMP. Both MCMC and IS have their own merits. IS smoothly interfaces with the message passing interpretation of Bayesian inference, which further leads to an automated design of proposal distributions. We use the Laplace approximation to obtain Gaussian posteriors for variables with Gaussian priors. In the context of dynamical systems, this approach notably overlaps with Gaussian filtering techniques ([36], Section 6), which are often realized by Assumed Density Filtering ([37], Section 8.4).
As we show in Appendix E, in the approach that we propose, it is also possible to run automated bootstrap particle filtering [36,38] rather than Gaussian filtering methods. As shown in [18], particle filtering can also be framed as message passing on a factor graph. The connection between particle filtering and variational optimization was introduced in [39]. Their formalism is based on an extension of Particle Belief Propagation [40] to Tree-Reweighted Belief Propagation [41], while ours revolves around VMP. Similar to our approach, Particle Variational Inference (PVI) [42] aims at optimizing a variational objective by successive IS approximations to the true posterior distributions. While PVI applies well to inference for discrete random variables, our EVMP proposal applies to both continuous and discrete random variables.
Variational inference in the context of deterministic building blocks in probabilistic models was studied in [43]. Whereas [43] allows non-linearities to be placed only after Gaussian nodes, the proposed EVMP method generalizes this concept to EF-distributed factors.
Non-conjugate Variational Message Passing (NC-VMP) [44] addresses the non-conjugate factor issue in VMP. Assuming that the posterior distribution is an EF distribution, NC-VMP projects the messages to the distribution space of the posterior by equating their sufficient statistics. Thus NC-VMP tunes the natural parameters of the messages in such a way that they converge to the stationary points of the KL divergence between the approximated and true posteriors. Ref. [44] also reports that the algorithm necessitates damping for convergence in practice. In response, ref. [45] presents Conjugate-Computation Variational Inference (CVI) as a universal inference method that is based on stochastic optimization techniques. As opposed to alternative stochastic variational inference techniques, such as Black-Box Variational Inference [46] and Automatic Differentiation Variational Inference [21], CVI exploits the conjugacy structure of the probabilistic models, which leads to faster convergence. In CVI, non-conjugate factors are incorporated into coordinate ascent steps of mean-field variational inference (with ELBO objective) through a stochastic optimization procedure to form compact posterior approximations with standard probability distributions. In our EVMP approach, the Laplace approximation entails a similarly nested optimization procedure to form compact approximations with Gaussian distributions. Nevertheless, our particle approximations to the true posteriors obviate the need for additional gradient-based optimizations to estimate the parameters of the posteriors.
Finally, the original VMP paper [9] itself briefly mentions sampling methods to overcome the issues with non-conjugate priors. However, they do not extend this idea to deterministic nodes and rather present it as a fallback method whenever soft factors are tied to non-conjugate soft factor priors. Inspired by their vision of approximating the expectation quantities by sampling techniques, we introduce here a fully automated, very broadly applicable extended VMP procedure.
6. Discussion
In this paper, we present a method for almost universal variational inference on factorized probabilistic models. The core of our method is the locality feature of VMP: the messages at a soft factor are functions of expectations related to arguments of the factor. We employ IS to estimate these expectations or directly approximate posteriors by Laplace approximation if a Gaussian posterior is reasonable. We also extended the Julia package ForneyLab with the proposed EVMP method. In contrast to many alternative PPLs that are solely based on Monte Carlo methods, ForneyLab allows end users to take full advantage of closed-form message passing rules while resorting to small-scale numerical approximations only when needed. We showed that ForneyLab provides an efficient automated variational Bayesian inference tool that in some instances may be preferable to the state-of-the-art Turing package, especially for tasks that include filtering in dynamical models or discrete variables in state space models.
While the experiments support the notion that EVMP is a promising method for inference in non-linear and non-conjugate models, we have not tested our method yet in high-dimensional problems. It is well-known that importance sampling is not efficient in high dimensions [47]. Therefore, we anticipate that for high-dimensional inference tasks with continuous random variables, Hamiltonian Monte Carlo-based methods could outperform EVMP both in terms of run-time and quality of the estimates. Nevertheless, it should be possible to alleviate the deficiencies of EVMP in high dimensions by replacing IS and Laplace approximations by HMC samplers. In essence, HMC is an MCMC method and ref. [35] shows the efficiency of MCMC methods in estimation of the expectations that are required in variational inference. Yet, in lower dimensions, we favor IS and Laplace approximations both because of their promising performance scores in the experiments and also because EVMP relieves users of choosing hyperparameters for the best performance. Recall that in the SSSM experiments in Section 4.3, we tested HMC with various hyperparameters to attain the best performance, and yet EVMP was more successful in detecting the hidden regimes. Moreover, in contrast to EVMP, plain HMC is not applicable to estimate discrete variables and needs to be combined with other samplers to run inference on the models with discrete and continuous variables.
In Appendix C, we introduce a variational free energy estimation method that resorts to approximations only if closed-form expressions for the information-theoretic measures are not available. This differs from alternative automated variational inference techniques, such as Automatic Differentiation Variational Inference (ADVI), which estimate the entire free energy by Monte Carlo summation. Moreover, like HMC, the applicability of ADVI is also limited to continuous variables.
In EVMP, proposal distributions for importance sampling are automatically set to forward messages. Although it is a practical solution with an elegant interpretation in a message passing context, forward messages do not carry information regarding observations. Therefore, we may not acquire useful samples from forward messages if the observations lead to peaky backward messages. In future work, we aim to investigate the effects of alternative proposal distribution design methods.
One major drawback of our ForneyLab implementation is that ForneyLab does not allow loops during the inference procedure. We rarely encounter this problem with soft factors, since the mean-field assumption breaks the loops by imposing additional factorizations in the variational distributions. However, this may not be the case with deterministic nodes, because the input and output variables of deterministic nodes remain tied to each other through a deterministic mapping even after the mean-field assumption. For example, consider a mixture model specification in which a selection variable z is passed through two deterministic nodes whose outputs are connected to the same mixture factor. Although this is a valid model specification with properly defined message passing rules, the EVMP algorithm is precluded due to the loop: the variable z is connected to two deterministic nodes, the outputs of which are connected to the same node. Belief propagation (BP) [48,49] faces a similar problem on loopy graphs. Nonetheless, it has been shown that iteratively running BP on loopy graphs often yields satisfactory approximations, though convergence is not guaranteed ([12], Section 8.4.7), ([37], Section 22.2). Therefore, it is worth investigating the performance of EVMP executed in a loopy setting.
There are similarities between EVMP and Expectation Propagation (EP) [50,51] in the sense that both methods estimate the moment parameters of posteriors. In contrast to EP, which approximates belief propagation (BP) [48,49] messages, EVMP approximates VMP messages, which makes it applicable to a broader range of model specifications. In future work, we aim to investigate and exploit this relation.
7. Conclusions
We developed a hybrid message passing-based approach to variational Bayesian inference that supports deterministic and non-conjugate model segments. The proposed Extended VMP (EVMP) method defaults to analytical updates for conjugate factor pairs and uses a local Laplace approximation or importance sampling when numerical methods are needed. EVMP was implemented in Julia’s ForneyLab package (see Appendix D) and a set of simulations shows very competitive inference performance on various inference tasks, particularly for state and parameter tracking in state-space models.
Acknowledgments
The authors want to extend gratitude to our fellow researchers at BIASlab for interesting discussions on the topics in this paper.
Abbreviations
The following abbreviations are used in this manuscript:
| VMP | Variational Message Passing |
| EVMP | Extended Variational Message Passing |
| BP | Belief propagation |
| EP | Expectation propagation |
| FFG | Forney-style Factor Graph |
| EF | Exponential family |
| NEF | Non-standard exponential family |
| LWS | List of Weighted Samples |
| IS | Importance sampling |
| MCMC | Markov Chain Monte Carlo |
| HMC | Hamiltonian Monte Carlo |
| ADVI | Automatic Differentiation Variational Inference |
| PG | Particle Gibbs |
Appendix A. On the Applicability of VMP
In this section, we show that the applicability of the VMP algorithm relies on connected factors being conjugate pairs in the exponential family of distributions. Non-conjugate connected factors lead to intractable posteriors and messages. Nevertheless, we show that, for a given soft factor, the corresponding VMP messages are locally expressed in terms of some expectation quantities. If these expectations are not available in closed form, then we can estimate them to approximate the VMP messages around the non-conjugate factor pairs.
Let us focus on Figure 2. We postulate the following assumptions:
-
is an element of the exponential family (EF) of distributions, i.e.,
(A1) In this equation, is a base measure, is a vector of natural (or canonical) parameters, are the sufficient statistics, and is the log-partition function, i.e., . It is always possible to write the log-partition function as a function of natural parameters , such that . Throughout the paper, we sometimes prefer the natural parameter parameterization of the log partition.
-
We differentiate a few cases for :
-
is also an element of the EF, given by the following:
and is a conditionally conjugate pair with for . This (conditional conjugacy) property implies that, given , we can modify in such a way that its sufficient statistics will match the sufficient statistics of . Technically, this means we can rewrite as follows:(A2) (A3) The crucial element of this rewrite is that both and are written as exponential functions of the same sufficient statistics function . This case leads to the regular VMP update equations, see Appendix A.1.
Our Extended VMP does not need this assumption and derives approximate VMP update rules for the following extensions.
is an element of the EF, but not amenable to the modification given in (A3), i.e., it cannot be written as an exponential function of sufficient statistics . Therefore, is not a conjugate pair with for .
-
is a composition of a deterministic node with an EF node, see Figure A1. In particular, in this case can be decomposed as follows:
(A4a)
where is a deterministic, possibly nonlinear transformation and is an element of the EF:(A4b) (A5) We assume that the conjugate prior to for random variable x has sufficient statistics vector , and hence (A5) can be modified as follows:
where the last factor collects the terms that do not include x. (A6)
-
Figure A1.
Deterministic conditional distributions often complicate VMP. An example deterministic conditional is visualized for . and together form the composite node .
Appendix A.1. VMP with Conjugate Soft Factor Pairs
The original VMP algorithm arises as an efficient inference procedure in models that solely consist of conjugate factor pairs. This is because conjugate factor pairs yield analytically tractable messages and posterior calculations. Next, we shortly review the effect of conjugate factor pairs on VMP updates.
Appendix A.1.1. Messages and Posteriors
The VMP message from the factor to can easily be evaluated by applying (3) to (A1):
| (A7) |
Since is conjugate to , its functional form can be modified as (A3) and by applying (3) to (A3), we find the VMP message from the factor to :
| (A8) |
Given that the messages and have the same sufficient statistics, the posterior update step reduces to summation of the messages’ natural parameters:
| (A9) |
For the sake of brevity, the distribution subscripts in the expectation notations are dropped. Note that we evaluate the posterior up to a normalization constant. Nevertheless, the log-normalizer function is readily available for EF distributions having sufficient statistics vector . As a consequence, the posterior evaluates to the following:
| (A10) |
Having showed that the conjugate factor pairs lead to a closed-form expression for posterior , we now investigate which expectation quantities related to are required in the outgoing VMP messages from the factors and to, say and :
| (A11a) |
| (A11b) |
In practice, the message is explicitly calculated by isolating the terms with in a sufficient statistics vector as it is done for in (A8). Similarly, is explicitly calculated, analogous to message in (A7). Here, we follow a rather different approach to explicitly show the expectations related to in the message calculations. Substituting and with (A1) and (A3) in (A11a,b), and keeping in mind that the mean-field assumption allows separation of the expectation quantities with distinct random variables, the messages evaluate to the following:
| (A12a) |
| (A12b) |
Notice that both messages require the expectation of the sufficient statistics vector $T(z_j)$. Fortunately, in EF distributions, $\mathbb{E}_{q}[T(z_j)]$ is available in closed form as the gradient of the log-normalizer ([52], Proposition 3.1):

| $\mathbb{E}_{q}\big[T(z_j)\big] = \nabla_{\eta}\, A(\eta).$ | (A13) |
For the sake of completeness, we now show that this equality holds. Recall that . Then,
| (A14) |
Evaluating this gradient at , we reach the following:
| (A15) |
Appendix A.1.2. Free Energy
As in the message and the posterior calculations, conjugacy eases the free energy calculation. We investigate it for (5), the free energy terms that include . is decomposed as follows:
| (A16) |
Substituting , and with (A1), (A3) and (A10) in the above expression:
| (A17) |
The expectation terms related to in are and . The former expectation is available in closed form, (A13). Thus is analytically tractable for those distributions that possess closed-form solution for .
In short, conjugate factor pairs facilitate the VMP procedure by allowing closed-form expressions for updates of messages (3), posteriors (4) and FE (5). Moreover, although exceptions exist, similar to the normalization of the posterior (A10), the messages and can be effortlessly normalized if the required expectations are known. Therefore, we can directly parameterize them with standard probability distributions and draw samples from them. This property of EF distributions plays a pivotal role in our automation of the importance sampling procedure.
Appendix A.2. VMP with Non-Conjugate Soft Factor Pairs
Suppose that the soft factors and are no longer conjugate pairs, i.e., given in (A2) can be written in the following form:
| (A18) |
where crucially . Notice that is the natural parameters after the modification of (A2) to isolate the terms with in the sufficient statistics vector. Therefore, the messages and differ in sufficient statistics:
| (A19a) |
| (A19b) |
In this case, the normalization constant calculation in the posterior update step (5) is not straightforward anymore; and worse, it is often intractable. The term intractable refers to integrals that are not available in closed-form for continuous variables. For discrete variables, it refers to summations that are not achievable in a feasible amount of time. The lack of the normalization constant, , hinders the calculation of the expectations with terms that appear in out-going VMP messages from and to variables and , respectively, e.g., and in
| (A20a) |
| (A20b) |
As a result, non-conjugacies obstruct the VMP procedure by hampering closed-form expectation calculations. Bear in mind that, even though the VMP procedure is obstructed by intractable expectations, the messages are still distinctly fixed for soft factors as functions of certain expectation quantities that are to be calculated over their arguments. We use this property in our Extended VMP method.
Appendix A.3. VMP with Composite Nodes
In this subsection, we shed light on the issues with composite nodes that are constructed by composition of EF distribution soft factors and deterministic conditionals, i.e., the following:
| (A21) |
where is a deterministic conditional distribution. Composite nodes enable us to build almost arbitrary factor nodes. For example, a mixture likelihood distribution with z a one-hot coded selection variable, can be constructed by composing an EF soft factor, a Gaussian, with a deterministic factor as , where and . However, composite nodes impose new challenges on inference procedures.
Now, let us try to calculate . The forward VMP message is given in (A19a). Suppose that the conjugate prior to EF soft factor for x has the sufficient statistics vector (see (A6)). Then, the VMP message from to is as follows:
| (A22) |
Note that the above message reduces to VMP message from to x followed by Belief Propagation (BP) [48,49]. The resulting backward message has the sufficient statistics vector . If , this case reduces to ordinary VMP as discussed in Appendix A.1; otherwise this case is a special case of Appendix A.2 and is not available in closed-form. Hence, the outgoing messages from the factor nodes and :
| (A23a) |
| (A23b) |
are intractable. The last line in the above derivations follows from the transformation of variables [53] and exposes the automatable nature of Variational Message Passing: the VMP message requires expectation quantities that are related to the arguments of the soft factor it is tied to. Therefore, once the VMP message passing rule is defined for the factor as a function of its arguments, we can instantiate the messages by providing the required expectation quantities. For example, the required expectation quantities related to the argument x are contained in the sufficient statistics vector.
Appendix B. Derivation of Extended VMP
Here, we show the details of our solution approach, which is based on importance sampling (IS) and Laplace approximation. First, we address the issues with deterministic mappings of random variables. The resulting technique emerges as a remedy for the non-conjugate soft factor pairs problem as well.
Appendix B.1. Deterministic Mappings with Single Inputs
We first address the issues with single-input deterministic mappings and generalize our solution to multiple inputs later on. Consider the sub-graph given in Figure A1, where the deterministic conditional is defined as . As derived in (A23a,b), we need the expectations and to calculate VMP messages towards edges and , respectively. Suppose that is an element in the sufficient statistic vectors and . Then we need to be able to calculate and . Let us start with evaluating first:
The second equality in the above expression is due to the transformation of variables, i.e., [53].
Substituting with (4) in the above integral yields the following:
| (A24) |
where , as given in (A22). Recall from Appendix A.3 that the normalizer, , is often hard to calculate analytically.
We use importance sampling [11,36] to approximate the integral in (A24):
| (A25) |
where for are drawn from the proposal distribution , i.e., . for are particles and their corresponding weights are denoted by .
The design of a good proposal distribution has a critical role in IS. First, it is supposed to be an easy-to-sample distribution. Secondly, its support is required to be no smaller than the support of the target [36]. Lastly, the proposal distribution is desired to be a good representation of the target to attain fast convergence [47]. In our automated design, the forward message constitutes the proposal distribution. Our choice is not optimal in the sense that information regarding the evidence is most often carried by the backward message and is not incorporated in our proposal design. However, the forward message satisfies the first two conditions, since the messages are parameterized with standard distributions (easy to sample) and it has nonzero probability everywhere the posterior does. Substituting the forward message for the proposal distribution in (A25) yields the following:
(A26)
where for and denotes our estimator for .
Let us summarize the procedure in (A26) to define our first set of rules related to the deterministic nodes. (A26) consists of samples that are drawn from and transformed through deterministic mapping . We cast this process as the forward message calculation. Once the samples are transformed, i.e., , the weights are determined over . We interpret this process as the collision of the forward and the backward messages ; hence, we relate it to the posterior calculation. Setting to , our interpretation of message collision becomes obvious since results in a Monte Carlo estimate for . As a result, we introduce our first set of rules related to deterministic nodes and the posterior approximation at the output edge of the deterministic node:
(A27a)
(A27b)
(A27c)
Here, we introduce the term list of weighted samples (LWS) to refer to the distributions that are represented by a set of samples and corresponding weights. Above, and are represented by LWS distributions.
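A minimal Julia sketch of these rules follows; the deterministic mapping, the forward-message sampler, and the backward-message density are illustrative stand-ins, not ForneyLab rule definitions.

```julia
# Samples from the forward message are pushed through f and weighted by the
# backward message, giving an LWS representation of the posterior on x = f(z).
using Random

f(z) = exp(z)                           # example deterministic mapping
sample_forward(rng) = randn(rng)        # draw from the forward message on z
backward_pdf(x) = exp(-abs(x - 1.0))    # unnormalized backward message on x

function lws_posterior_x(rng; N = 1000)
    z = [sample_forward(rng) for _ in 1:N]   # proposal samples from the forward message
    x = f.(z)                                # forward message: equally weighted samples f(z)
    w = backward_pdf.(x)                     # importance weights from the backward message
    return x, w ./ sum(w)                    # LWS representation of the posterior on x
end

x, w = lws_posterior_x(Random.default_rng())
E_x = sum(w .* x)                            # Monte Carlo estimate of the posterior mean of x
```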
Now, we turn our attention to the calculation of and the expectation quantity . For this task we have two different strategies: if is a Gaussian message, we approximate by a Laplace approximation, which is also automatable thanks to automatic differentiation; otherwise, we follow the IS procedure introduced above. Let us go over them, starting with the latter.
Appendix B.1.1. Non-Gaussian Case
This time we are supposed to evaluate so that the VMP messages toward can be computed. Notice that the procedure is exactly the same as in (A25), except that this time the expectation quantity of interest, , does not involve the deterministic mapping . Therefore, by using as the proposal distribution, we can estimate as follows:
(A28)
This gives us the second set of rules related to deterministic mappings. An element of this new set of rules is that the backward message is directly passed in probability density function (pdf) form:
(A29)
Recall from Appendix A.1 that the messages so far carried standard EF distributions. Now, we make an exception and introduce , which is no longer associated with any of the standard EF distributions. Nonetheless, takes an exponential form since is an EF distribution (see (A22)). Therefore, we call a non-standard exponential family (NEF) distribution. Having defined the backward message, let us evaluate the posterior . Similar to , substituting with in (A28) gives us a Monte Carlo estimate of :
(A30a)
(A30b)
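For illustration, with placeholder densities rather than ForneyLab's rule set, the backward message toward z can be kept as a closure in pdf form, and the posterior over z as an LWS built from forward samples.

```julia
# NEF backward message toward z and LWS posterior on z, with illustrative densities.
using Random

f(z) = exp(z)
backward_pdf_x(x) = exp(-abs(x - 1.0))     # unnormalized backward message on x
sample_forward(rng) = randn(rng)           # draw from the forward message on z

backward_pdf_z(z) = backward_pdf_x(f(z))   # backward message toward z, kept in pdf form

function lws_posterior_z(rng; N = 1000)
    z = [sample_forward(rng) for _ in 1:N] # samples from the forward message
    w = backward_pdf_z.(z)                 # weights from the NEF backward message
    return z, w ./ sum(w)                  # LWS approximation of the posterior on z
end

z, w = lws_posterior_z(Random.default_rng())
E_z2 = sum(w .* z .^ 2)                    # e.g., an estimate of the second moment of z
```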
Appendix B.1.2. Gaussian Case
In FFGs, the models are often constructed in such a way that the most prevalent message type is Gaussian. This is because Gaussian messages facilitate inference by allowing many inference-related operations to be executed in closed form, such as summation, conditioning, and scaling and shifting by constants. In order to retain the computational advantages of the Gaussian distribution, if is a Gaussian message, we take it as an implicit hint that the posterior distribution is Gaussian-like. Then, we use the Laplace approximation ([12], Section 4.4) to approximate with :
(A31a)
(A31b)
where denotes the gradient of f with respect to and refers to the Hessian of f with respect to evaluated at . Note that the gradient and the Hessian respectively reduce to the first and the second derivatives if is scalar. Laplace approximation is a mode-seeking algorithm. We use automatic differentiation (autodiff) [13] to evaluate the gradient and employ it in a gradient-ascent algorithm to seek the mode (we supply the implementation details in Appendix D). Once the mode is reached, we evaluate the Hessian at the mode to fit the variance term for our Gaussian approximation.
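The sketch below illustrates this mode-seeking step for a scalar variable; the ForwardDiff calls are the package's actual derivative interface, but the objective function and the fixed-step gradient ascent are illustrative simplifications of the implementation described in Appendix D.

```julia
# Laplace fit: climb to the mode of the log of the unnormalized posterior by
# gradient ascent, then fit the variance from the curvature at the mode.
using ForwardDiff

# Illustrative stand-in for the log of (forward message) x (backward message).
log_unnorm_posterior(z) = -0.5 * (z - 1.0)^2 - log1p(exp(-z))

function laplace_fit(logp, z0; steps = 500, stepsize = 0.05)
    z = z0
    for _ in 1:steps
        z += stepsize * ForwardDiff.derivative(logp, z)   # gradient ascent
    end
    curvature = ForwardDiff.derivative(w -> ForwardDiff.derivative(logp, w), z)
    return z, -1 / curvature                              # Gaussian mean and variance
end

m, v = laplace_fit(log_unnorm_posterior, 0.0)
```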
The assumption we make here, that implies a Gaussian-like , paves the way for automating many well-known inference procedures achieved through Laplace approximation, such as Bayesian logistic regression ([37], Section 8.4), Laplace-Gaussian filtering and smoothing in state space models [54], and Poisson Linear Dynamical Systems [55]. However, our assumption is not appropriate for all configurations. For example, a Gaussian prior on the rate parameter of a Poisson distribution would result in a poorly matched posterior approximation, since the domain of the rate is the positive real numbers, while the Gaussian-approximated posterior has support on the entire real axis. A better model specification could be achieved by mapping a Gaussian distributed random variable to the rate parameter through an inverse-link function, exp in this example. Likewise, a multi-modal backward message with support on the real numbers often yields a multi-modal posterior, which can be better captured with particle methods. (In Appendix E, we show that it is possible to run particle filtering through Gaussian factor nodes in our technique.)
In summary, our method resorts to Laplace approximation to approximate with a Gaussian distribution whenever is Gaussian. Therefore, the user of our method must keep in mind the consequences of prior choices and build her model accordingly.
The overall procedure for single input deterministic functions is depicted in Figure A2. In the next subsection, we extend this procedure to multiple input deterministic mappings.
Figure A2.
Messages around a deterministic node are visualized together with posterior approximations. In the EVMP algorithm, forward messages from single-input deterministic nodes are approximated by LWS representations. Backward messages, on the other hand, take non-standard exponential family distribution forms.
Appendix B.2. Deterministic Mappings with Multiple Inputs
Consider the deterministic node, , given in Figure A3 where the inputs to the deterministic function are and the output is x.
Figure A3.
A deterministic node with inputs and output x.
Before starting the discussion on the backward messages, let us define the forward message . Analogous to the single input case, we define with an LWS as the following:
(A32)
Once the message is calculated as a set of equally weighted samples, we scale the weights according to the importance score of their corresponding samples to represent :
(A33a)
(A33b)
Now, let us define the backward messages propagated by the deterministic node. The exact backward message towards one of the input variables, say , is the following:
(A34)
Unfortunately, the above integral is often intractable. Even if all the variables are discrete and the integral is replaced by a summation, it becomes intractable in practice as the number of variables increases. Here, we address this issue with two different approximation strategies. As in the previous subsection, the type of approximation depends on the incoming messages to the deterministic node from the input edges: if the messages are all Gaussian, we approximate the joint posterior distribution of by a Gaussian distribution. Then, we calculate the backward messages over the approximated joint posterior and the incoming messages. Otherwise, we use Monte Carlo summation. Let us start with the latter case.
Appendix B.2.1. Monte Carlo Approximation to the Backward Message
The Monte Carlo approximation to the integral in (A34) is
(A35)
where for .
Once the message is approximately calculated and propagated as an NEF distribution, is also approximated either by IS or by Laplace, depending on the message type, as discussed in Appendix B.1.
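A minimal sketch of this Monte Carlo rule for a two-input node follows; the mapping and the densities are illustrative placeholders.

```julia
# Backward message toward z1 for x = f(z1, z2): average the backward density on
# x over samples of the other input, yielding an unnormalized (NEF) density in z1.
using Random

f(z1, z2) = z1 * z2
backward_pdf_x(x) = exp(-0.5 * (x - 2.0)^2)   # unnormalized backward message on x
sample_z2(rng) = 1.0 + randn(rng)             # draw from the incoming message on z2

function backward_message_z1(rng; N = 1000)
    z2 = [sample_z2(rng) for _ in 1:N]
    return z1 -> sum(backward_pdf_x(f(z1, z2i)) for z2i in z2) / N
end

nubar_z1 = backward_message_z1(Random.default_rng())
nubar_z1(2.0)    # evaluate the approximate backward message at a point
```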
Appendix B.2.2. Gaussian Approximation to the Backward Message
The above procedure yields two consecutive approximation processes in the calculation of . Considering that we assumed implies that is Gaussian-like, we can avoid the approximation in (A35) if all the incoming messages are Gaussian. We achieve this by approximating the joint posterior with a Laplace approximation, followed by a marginalization to evaluate and .
More precisely, consider the incoming messages for . Note that these messages carry posterior beliefs on , which can be represented with a joint belief constituted by concatenation of : :
(A36)
Now, we approximate by a Gaussian distribution with a Laplace approximation:
(A37a)
(A37b)
By marginalizing out , we find :
(A38)
Recall that . This yields the following backward message
(A39)
where and . Note that we intentionally parameterize the Gaussian backward message with a precision-weighted mean and precision . The canonical parameterization (weighted mean and precision) brings computational advantages, especially in state space models, by avoiding certain matrix inversions [15]. The approach that we introduced in this section resembles Expectation Propagation (EP) [50,51] in the sense that we first find the posterior, , and then evaluate the backward message by dividing the posterior by the incoming message. As stated in [51], Laplace Propagation [56] proposes an iterative Laplace approximation approach to mitigate the intractable integral issues that sometimes emerge in EP.
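The sketch below walks through this route for two scalar Gaussian inputs; the joint log-density is an illustrative stand-in (standard-normal incoming messages and a logistic backward message on the sum), while the ForwardDiff calls and the canonical-form division of Gaussians follow the steps described above.

```julia
# Joint Laplace fit, marginalization, and backward message in canonical form.
using ForwardDiff, LinearAlgebra

# log of the incoming N(0,1) messages on z1, z2 plus the log of a logistic
# backward message on f(z1, z2) = z1 + z2.
log_joint(z) = -0.5 * sum(abs2, z) - log1p(exp(-(z[1] + z[2])))

function joint_laplace(logp, z0; steps = 1000, stepsize = 0.05)
    z = copy(z0)
    for _ in 1:steps
        z .+= stepsize .* ForwardDiff.gradient(logp, z)   # ascend to the joint mode
    end
    return z, -ForwardDiff.hessian(logp, z)               # mode and joint precision
end

m, W = joint_laplace(log_joint, zeros(2))
Σ = inv(W)                        # joint covariance of the Gaussian fit
m1, v1 = m[1], Σ[1, 1]            # marginal posterior over z1

# Backward message toward z1 = marginal posterior divided by the incoming message,
# both in canonical (precision, precision-weighted mean) form.
W_in, ξ_in = 1.0, 0.0             # incoming N(0,1) message on z1
W_b = 1 / v1 - W_in               # backward precision
ξ_b = m1 / v1 - ξ_in              # backward precision-weighted mean
```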
So far, we have discussed how to extend VMP to models with deterministic conditional distributions. To summarize, the resulting technique approximates the forward messages of deterministic nodes by LWS. Backward messages, on the other hand, are either directly propagated in NEF form or approximated with Gaussian distributions. We also showed the posterior approximations related to these message types. In the next subsection, we address the problem of non-conjugate soft factor pairs.
Appendix B.3. Non-Conjugate Soft Factor Pairs
Next, we address the problems defined in Appendix A.2. Consider the generic edge depicted in Figure 2. Suppose that the messages and differ in sufficient statistics and hence the normalization constant is analytically intractable. Recall that the very same problem emerges in Appendix B.1 while calculating . Therefore, the approximation rules defined in Appendix B.1 apply to non-conjugate factor pairs as well. For completeness, the rules are summarized below.
If is a Gaussian message, apply Laplace to approximate with a Gaussian distribution as in (A31a,b).
Otherwise, use IS as in (A30a,b).
Appendix C. Free Energy Approximation
Recall from Section 2 that variational inference transforms a difficult inference task into an easier optimization problem over a variational bound called the free energy . Considering that VMP converges to a stationary point by updating one posterior factor at a time, we anticipate that our approximations approach a local optimum.
As shown in Appendix A.1.2, the free energy is amenable to analytical calculation for models that are solely comprised of conjugate factor pairs. The models that we address here do not allow the free energy to be calculated analytically. This is because the analytically intractable expectation quantities, which complicate VMP in practice, also appear in the free energy calculation. Therefore, we provide an approximate free energy to the user so that they can track the convergence of the inference and also perform model comparison [9,57].
We introduce our free energy approximation approach over the sub-graph given in Figure A1, where is a standard EF distribution (A1) and is a composite node, i.e., . Recall that is modified as follows:
This sub-graph is a part of a larger FFG. First, we decompose the free energy as the following:
(A40)
where stands for the free energy terms that do not depend on the variables . Explicitly writing the average energy terms, we have the following:
(A41a)
(A41b)
The above derivations closely follow the derivations in (A23a,b). Note that the expectation terms regarding in (A41b) are substituted by the expectations related to x, which are contained in the sufficient statistics vector . These quantities are exactly the same as the ones required to calculate VMP messages towards , and we used IS to estimate them in (A26). Therefore, for the estimation of is readily available.
Next, we investigate the terms related to in . Recall that for , we have two approximation methods: (1) a Gaussian approximation to with Laplace, and (2) an LWS approximation. is approximated with a Gaussian when is a Gaussian message, which is the case if the factor node is a Gaussian distribution. In this case, ( for a multivariate Gaussian), ( for a d-dimensional multivariate Gaussian), and , are available in closed-form. Similarly, the entropy term is available in closed-form for a Gaussian . This completes the calculation of the expectation terms with in .
In the case that is approximated with LWS as in (A30a,b), we approximate and with IS as in (A28). Therefore, the approximations for the average energy terms are straightforward. For an LWS-approximated , the main difficulty in the estimation of stems from the entropy calculation. This is because is not available in functional form. The entropy approximation for distributions represented by weighted samples is often carried out by probability density estimates on the weighted samples [58]. Fortunately, in our case, we do not need to fit a density estimate on the LWS, since the messages and provide the information regarding the density . Let us derive an estimator for the entropy :
(A42)
We estimate the first term with the following Monte Carlo summation:
(A43)
The term with the log in is constant since is integrated out inside the log. Therefore simplifies further:
(A44)
Recall from (A28) that the samples are drawn from the message . Therefore, the Monte Carlo estimate of is as follows:
(A45)
This completes the estimation of the terms with in .
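A compact sketch of this entropy estimator (with placeholder densities and mapping) reads as follows.

```julia
# Entropy of an LWS posterior q(z) ∝ ν(z) ν̄(f(z)): estimate the normalizer from
# forward samples, and the expected log-density from the weighted samples.
using Random

f(z) = exp(z)
log_nu(z) = -0.5 * z^2 - 0.5 * log(2π)    # log of the forward message density
log_nubar(x) = -abs(x - 1.0)              # log of the unnormalized backward message

function lws_entropy(rng; N = 1000)
    z = randn(rng, N)                     # samples from the forward message
    logw = log_nubar.(f.(z))
    Z = sum(exp, logw) / N                # Monte Carlo estimate of the normalizer
    w = exp.(logw) ./ sum(exp.(logw))     # normalized importance weights
    # H[q] ≈ -Σ_i w_i (log ν(z_i) + log ν̄(f(z_i))) + log Z
    return -sum(w .* (log_nu.(z) .+ logw)) + log(Z)
end

lws_entropy(Random.default_rng())
```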
Appendix D. Implementation Details in ForneyLab
Our extensions to VMP are readily available in ForneyLab [8], which is a Julia package for message-passing-based probabilistic programming. In this section, we provide the reader with some of the core implementation details and the automation process of the method. First, the number of particles that propagate through deterministic nodes is set to 1000 by default. The user can change the number of samples during model specification. Similarly, the posteriors of variables that are connected to non-conjugate soft factor pairs are approximated by 1000 samples. For Laplace approximations, gradients are automatically calculated by the automatic differentiation tools of the Julia language. We use the ForwardDiff package [59], since it is a mature, universal automatic differentiation tool that aligns well with the needs of our approach.
Appendix E. Bonus: Bootstrap Particle Filtering
Having implemented importance sampling to get around the complications in VMP, we now show how our technique inherently supports bootstrap particle filtering in state space models [36,38].
Recall that we automate the Laplace approximation to retain the computational convenience of Gaussian filtering and smoothing. Although this choice is reasonable for those cases where we believe that the distributions over hidden states are unimodal, it is not sufficient to capture multi-modal distributions [36]. Similarly, due to non-linearities in the model specification and/or non-Gaussian process noise, a Gaussian distribution might not be a plausible representation of the hidden states. In these cases, Sequential Monte Carlo methods [60] could be appealing because they flexibly recover asymmetric and mixture distributions.
In the VMP setting, our method employs samples and their corresponding weights to construct VMP messages, which are parameterized by exponential family distributions. Alternatively, in the Belief Propagation (BP) setting, a soft factor collects samples to instantiate conditional distributions and then draws samples from these conditionals. This process is depicted in Figure A4 with two samples for ease of visualization.
Figure A4.
Bootstrap particle filtering employs the state transition distributions as proposal distributions, which can easily be supported in our framework by defining BP rules at soft factors for incoming messages that are LWS. The rule is straightforward to implement: the weights stay unchanged; for each incoming sample, instantiate a new conditional distribution and draw a sample from it. The weight update is automatically carried out at the equality node by , which is followed by a normalization: .
Having implemented the BP rule at a soft factor for incoming LWS messages, we have to show how posteriors are approximated through updating the weights. Suppose that is a message that carries an LWS, and is parameterized either by an EF or an NEF distribution. Then, we define the posterior update rule as follows:
(A46a)
(A46b)
In bootstrap particle filtering, these rules update the weights at the equality nodes automatically. A major drawback of sequential importance sampling methods is that the further samples propagate over time steps, the more they lose their ability to recover the underlying process, and many of the weights approach zero. This phenomenon is known as the degeneracy problem and can be alleviated by resampling [36,60]. In our automated setting, at each weight update step, we measure the effectiveness of the existing samples by , as shown in [36]. Then, we resample if [36]. A user can effortlessly execute a particle filtering procedure in our method by putting an LWS prior on the first hidden state of a sequential model and running BP inference on the model. (For demonstration purposes, we implemented BP rules at the Gaussian node for LWS messages. The user can implement the very same rules for other soft factors according to their needs. Visit https://github.com/biaslab/ForneyLab.jl/blob/master/demo/bootstrap_particle_filter.ipynb (accessed on 25 June 2021) for a toy example.)
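For reference, here is a generic bootstrap particle filter sketch along the lines described above; the transition and likelihood functions are illustrative placeholders (not ForneyLab node rules), and the N/10 resampling threshold is an assumed rule of thumb in the spirit of [36], since the exact threshold is not reproduced here.

```julia
# Bootstrap particle filter with an effective-sample-size resampling criterion.
using Random

sample_transition(rng, z) = 0.9 * z + 0.1 * randn(rng)   # draw from p(z_t | z_{t-1})
loglik(y, z) = -0.5 * (y - z)^2                          # log p(y_t | z_t), up to a constant

function bootstrap_pf(rng, ys; N = 1000)
    z = randn(rng, N)                      # LWS prior on the first hidden state
    w = fill(1 / N, N)
    means = Float64[]
    for y in ys
        z = [sample_transition(rng, zi) for zi in z]   # propagate through the soft factor
        w .*= exp.(loglik.(y, z))                      # weight update at the equality node
        w ./= sum(w)
        push!(means, sum(w .* z))                      # filtered mean estimate
        if 1 / sum(abs2, w) < N / 10                   # resample when N_eff drops too low
            c = cumsum(w)
            z = [z[min(searchsortedfirst(c, rand(rng)), N)] for _ in 1:N]
            w = fill(1 / N, N)
        end
    end
    return means
end

bootstrap_pf(Random.default_rng(), randn(Random.default_rng(), 50))
```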
Appendix F. Illustrative Example
Consider the following model, visualized in Figure A5: , , , , with observation y. In [9], VMP messages are provided as an example for a normal node parameterized by mean and precision. Here, we augment their example with a deterministic node to illustrate how Extended VMP operates to approximate the posteriors for x, w and z:
Figure A5.

The model , , , is visualized together with the messages. The EVMP algorithm approximates the backward VMP message towards x by estimating with Monte Carlo summation. Once this VMP message is approximated, the update for is available in closed form. The backward message toward w and z requires and . These expectations can be computed analytically since they are the sufficient statistics of . However, this time, the forward and the backward messages differ in sufficient statistics, which impedes the analytical calculations for and . We approximate them by IS and Laplace, respectively.
- Initiate , by Normal distributions and by an LWS.
- Repeat the following three steps until convergence:
  1. Choose w for updating.
     - Calculate the VMP message by (14). In this case, where is a Gamma distribution with shape and rate .
     - Calculate by the following (16):
     - Update by Section 3.5, rule (3).
  2. Choose z for updating.
     - Calculate by (18), which is an NEF distribution:
     - The forward message is simply the prior:
     - Update by Section 3.5, rule (2)(a).
  3. Choose x for updating.
     - Calculate the VMP message by (14). In this case,
     - The forward message is the prior:
     - Update by Section 3.5, rule (1), i.e., the following:
The expectation quantities , , that appear in the message calculations are computed according to Section 3.8. Therefore, while is estimated via a Monte Carlo summation, and are available in closed form.
Author Contributions
Conceptualization, S.A.; methodology, S.A. and I.B.; software, I.B. and S.A.; validation, I.B. and S.A.; writing—original draft preparation, S.A. and I.B.; writing—review and editing, B.d.V. and S.A.; supervision, B.d.V.; funding acquisition, B.d.V. All authors have read and agreed to the published version of the manuscript.
Funding
This work is partly funded by GN Advanced Science.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
Footnotes
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. van de Meent J.W., Paige B., Yang H., Wood F. An Introduction to Probabilistic Programming. arXiv 2018, arXiv:1809.10756.
- 2. Carpenter B., Gelman A., Hoffman M.D., Lee D., Goodrich B., Betancourt M., Brubaker M., Guo J., Li P., Riddell A. Stan: A Probabilistic Programming Language. J. Stat. Softw. 2017;76:1–32. doi: 10.18637/jss.v076.i01.
- 3. Dillon J.V., Langmore I., Tran D., Brevdo E., Vasudevan S., Moore D., Patton B., Alemi A., Hoffman M., Saurous R.A. TensorFlow Distributions. arXiv 2017, arXiv:1711.10604.
- 4. Bingham E., Chen J.P., Jankowiak M., Obermeyer F., Pradhan N., Karaletsos T., Singh R., Szerlip P., Horsfall P., Goodman N.D. Pyro: Deep Universal Probabilistic Programming. J. Mach. Learn. Res. 2019;20:1–6.
- 5. Ge H., Xu K., Ghahramani Z. Turing: A Language for Flexible Probabilistic Inference. In: International Conference on Artificial Intelligence and Statistics; PMLR; 2018; pp. 1682–1690.
- 6. Titsias M., Lázaro-Gredilla M. Doubly stochastic variational Bayes for non-conjugate inference. In: International Conference on Machine Learning; PMLR; 2014; pp. 1971–1979.
- 7. Minka T., Winn J., Guiver J., Zaykov Y., Fabian D., Bronskill J. Infer.NET 0.3. 2018. Available online: https://dotnet.github.io/infer/ (accessed on 25 June 2021).
- 8. Cox M., van de Laar T., de Vries B. A factor graph approach to automated design of Bayesian signal processing algorithms. Int. J. Approx. Reason. 2019;104:185–204. doi: 10.1016/j.ijar.2018.11.002.
- 9. Winn J., Bishop C.M. Variational message passing. J. Mach. Learn. Res. 2005;6:661–694.
- 10. Dauwels J. On Variational Message Passing on Factor Graphs. In: Proceedings of the IEEE International Symposium on Information Theory; Nice, France, 24–29 June 2007; pp. 2546–2550.
- 11. Tokdar S.T., Kass R.E. Importance sampling: A review. Wiley Interdiscip. Rev. Comput. Stat. 2010;2:54–60. doi: 10.1002/wics.56.
- 12. Bishop C.M. Pattern Recognition and Machine Learning. Springer; Berlin/Heidelberg, Germany: 2006.
- 13. Baydin A.G., Pearlmutter B.A., Radul A.A., Siskind J.M. Automatic differentiation in machine learning: A survey. J. Mach. Learn. Res. 2017;18:5595–5637.
- 14. Bezanson J., Karpinski S., Shah V.B., Edelman A. Julia: A fast dynamic language for technical computing. arXiv 2012, arXiv:1209.5145.
- 15. Loeliger H.A., Dauwels J., Hu J., Korl S., Ping L., Kschischang F.R. The factor graph approach to model-based signal processing. Proc. IEEE 2007;95:1295–1322. doi: 10.1109/JPROC.2007.896497.
- 16. Loeliger H.A. An introduction to factor graphs. IEEE Signal Process. Mag. 2004;21:28–41. doi: 10.1109/MSP.2004.1267047.
- 17. Blei D.M., Kucukelbir A., McAuliffe J.D. Variational Inference: A Review for Statisticians. J. Am. Stat. Assoc. 2017;112:859–877. doi: 10.1080/01621459.2017.1285773.
- 18. Dauwels J., Korl S., Loeliger H.A. Particle methods as message passing. In: Proceedings of the IEEE International Symposium on Information Theory; Seattle, WA, USA, 9–14 July 2006; pp. 2052–2056.
- 19. Şenöz I., de Vries B. Online variational message passing in the hierarchical Gaussian filter. In: Proceedings of the 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP); Aalborg, Denmark, 17–20 September 2018; pp. 1–6.
- 20. Mathys C.D., Lomakina E.I., Daunizeau J., Iglesias S., Brodersen K.H., Friston K.J., Stephan K.E. Uncertainty in perception and the Hierarchical Gaussian Filter. Front. Hum. Neurosci. 2014;8:825. doi: 10.3389/fnhum.2014.00825.
- 21. Kucukelbir A., Tran D., Ranganath R., Gelman A., Blei D.M. Automatic differentiation variational inference. J. Mach. Learn. Res. 2017;18:430–474.
- 22. Kalman R.E. A new approach to linear filtering and prediction problems. J. Basic Eng. 1960;82:35–45. doi: 10.1115/1.3662552.
- 23. Barber D. Bayesian Reasoning and Machine Learning. Cambridge University Press; Cambridge, UK: 2012.
- 24. Ghahramani Z., Hinton G.E. Parameter Estimation for Linear Dynamical Systems. Technical Report CRG-TR-92-2; Department of Computer Science, University of Toronto; Toronto, ON, Canada: 1996.
- 25. Beal M.J. Variational Algorithms for Approximate Bayesian Inference. Ph.D. Thesis, UCL (University College London), London, UK, 2003.
- 26. Hoffman M.D., Gelman A. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 2014;15:1593–1623.
- 27. Ghahramani Z., Hinton G.E. Variational learning for switching state-space models. Neural Comput. 2000;12:831–864. doi: 10.1162/089976600300015619.
- 28. Neal R.M. MCMC using Hamiltonian dynamics. Handb. Markov Chain Monte Carlo 2011;2:113–162.
- 29. Betancourt M. A conceptual introduction to Hamiltonian Monte Carlo. arXiv 2017, arXiv:1701.02434.
- 30. Wood F., Meent J.W., Mansinghka V. A new approach to probabilistic programming inference. In: Artificial Intelligence and Statistics; PMLR; 2014; pp. 1024–1032.
- 31. Andrieu C., Doucet A., Holenstein R. Particle Markov chain Monte Carlo methods. J. R. Stat. Soc. Ser. B 2010;72:269–342. doi: 10.1111/j.1467-9868.2009.00736.x.
- 32. De Freitas N., Højen-Sørensen P., Jordan M.I., Russell S. Variational MCMC. In: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence; Morgan Kaufmann Publishers Inc.; Burlington, MA, USA: 2001; pp. 120–127.
- 33. Wexler Y., Geiger D. Importance sampling via variational optimization. In: Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence; AUAI Press; Arlington, VA, USA: 2007; pp. 426–433.
- 34. Salimans T., Kingma D., Welling M. Markov chain Monte Carlo and variational inference: Bridging the gap. In: International Conference on Machine Learning; PMLR; 2015; pp. 1218–1226.
- 35. Ye L., Beskos A., De Iorio M., Hao J. Monte Carlo co-ordinate ascent variational inference. In: Statistics and Computing; Springer; Berlin/Heidelberg, Germany: 2020; pp. 1–19.
- 36. Särkkä S. Bayesian Filtering and Smoothing. Volume 3. Cambridge University Press; Cambridge, UK: 2013.
- 37. Murphy K.P. Machine Learning: A Probabilistic Perspective. MIT Press; Cambridge, MA, USA: 2012.
- 38. Gordon N.J., Salmond D.J., Smith A.F. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEEE Proc. Radar Signal Process. 1993;140:107–113. doi: 10.1049/ip-f-2.1993.0015.
- 39. Frank A., Smyth P., Ihler A. Particle-based variational inference for continuous systems. Adv. Neural Inf. Process. Syst. 2009;22:826–834.
- 40. Ihler A., McAllester D. Particle belief propagation. In: Artificial Intelligence and Statistics; PMLR; 2009; pp. 256–263.
- 41. Wainwright M.J., Jaakkola T.S., Willsky A.S. A new class of upper bounds on the log partition function. IEEE Trans. Inf. Theory 2005;51:2313–2335. doi: 10.1109/TIT.2005.850091.
- 42. Saeedi A., Kulkarni T.D., Mansinghka V.K., Gershman S.J. Variational particle approximations. J. Mach. Learn. Res. 2017;18:2328–2356.
- 43. Raiko T., Valpola H., Harva M., Karhunen J. Building Blocks for Variational Bayesian Learning of Latent Variable Models. J. Mach. Learn. Res. 2007;8:155–201.
- 44. Knowles D.A., Minka T. Non-conjugate variational message passing for multinomial and binary regression. Adv. Neural Inf. Process. Syst. 2011;24:1701–1709.
- 45. Khan M., Lin W. Conjugate-Computation Variational Inference: Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models. In: Artificial Intelligence and Statistics; PMLR; 2017; pp. 878–887.
- 46. Ranganath R., Gerrish S., Blei D. Black box variational inference. In: Artificial Intelligence and Statistics; PMLR; 2014; pp. 814–822.
- 47. MacKay D.J.C. Introduction to Monte Carlo methods. In: Learning in Graphical Models; Springer; Berlin/Heidelberg, Germany: 1998; pp. 175–204.
- 48. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann; Burlington, MA, USA: 1988.
- 49. MacKay D.J. Information Theory, Inference and Learning Algorithms. Cambridge University Press; Cambridge, UK: 2003.
- 50. Minka T.P. Expectation Propagation for approximate Bayesian inference. In: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence; Morgan Kaufmann Publishers Inc.; San Francisco, CA, USA: 2001; pp. 362–369.
- 51. Vehtari A., Gelman A., Sivula T., Jylänki P., Tran D., Sahai S., Blomstedt P., Cunningham J.P., Schiminovich D., Robert C.P. Expectation Propagation as a Way of Life: A Framework for Bayesian Inference on Partitioned Data. J. Mach. Learn. Res. 2020;21:1–53.
- 52. Wainwright M.J., Jordan M.I. Graphical Models, Exponential Families, and Variational Inference. Found. Trends Mach. Learn. 2008;1:1–305. doi: 10.1561/2200000001.
- 53. Gelman A., Carlin J.B., Stern H.S., Dunson D.B., Vehtari A., Rubin D.B. Bayesian Data Analysis. CRC Press; Boca Raton, FL, USA: 2013.
- 54. Koyama S., Castellanos Pérez-Bolde L., Shalizi C.R., Kass R.E. Approximate methods for state-space models. J. Am. Stat. Assoc. 2010;105:170–180. doi: 10.1198/jasa.2009.tm08326.
- 55. Macke J.H., Buesing L., Cunningham J.P., Yu B.M., Shenoy K.V., Sahani M. Empirical models of spiking in neural populations. Adv. Neural Inf. Process. Syst. 2011;24:1350–1358.
- 56. Smola A.J., Vishwanathan S., Eskin E. Laplace propagation. In: Proceedings of the 16th International Conference on Neural Information Processing Systems; MIT Press; Cambridge, MA, USA: 2004; pp. 441–448.
- 57. Acerbi L. Variational Bayesian Monte Carlo. arXiv 2018, arXiv:1810.05558.
- 58. Ajgl J., Šimandl M. Differential entropy estimation by particles. IFAC Proc. Vol. 2011;44:11991–11996. doi: 10.3182/20110828-6-IT-1002.01404.
- 59. Revels J., Lubin M., Papamarkou T. Forward-Mode Automatic Differentiation in Julia. arXiv 2016, arXiv:1607.07892.
- 60. Doucet A., de Freitas N., Gordon N. Sequential Monte Carlo Methods in Practice. Springer; Berlin/Heidelberg, Germany: 2001.