1. Introduction
This very stimulating article reminded us of the following remark in a review article by Wand and co-authors (Ruppert, Wand, and Carroll 2009):
Interplay with Computer Science is one of the most exciting recent developments in semiparametric regression. We anticipate this to be an increasingly fruitful area of research.
Bringing message passing to bear on semiparametric regression, as Wand has done here, is very much in the spirit of such interplay. The notion of message passing is ubiquitous in some areas of computer science, such as distributed computing and object-oriented programming. More specifically, within the field of artificial intelligence, the influential problem-solving model of Hewitt (1977) is based upon message passing, while Pearl (1982, 1988) proposed the passing of messages among neighboring nodes as a way to update beliefs efficiently in large Bayesian networks.
Against this backdrop it is unsurprising that, whereas variational message passing (VMP) is formulated quite differently in the three papers that Wand cites (Winn and Bishop 2005; Minka 2005; Minka and Winn 2008), in each case the authors find it natural to portray the algorithm as passing messages among the nodes of a network. But for readers like us with a mainstream statistics background, a message passing scheme such as that described by Wand comes across, at least at first, as uncomfortably mysterious (see Wainwright and Jordan 2008, p. 36).
We begin by posing three questions that arose for us as we read Wand’s article. The first crystallizes our unease with the very notion of message passing, and addressing this key question will pave the way toward answering the other two.
Question 1. As presented by Wand, following Minka (2005), VMP works by iteratively updating two types of messages: messages m_{θ_i→f_j}(θ_i) from variables (stochastic nodes) to factors, and messages m_{f_j→θ_i}(θ_i) from factors to variables. What, exactly, is the statistical meaning of these messages?
Question 2. How is the VMP algorithm related to the traditional approach to mean field variational Bayes (MFVB)?
Question 3. Wand's message updates are given in (W7)–(W9) (here and below, to avoid confusion with our own equation numbers, we use (Wx) to denote Wand's equation (x)). How do these reduce to natural parameter updates, as presented from Wand's Section 3.2 onward?
In the following sections, we attempt to answer these questions and thereby, we hope, to shed some light on VMP.
2. A Closer Look at Messages
To address Question 1, we consider first the variable-to-factor messages and then the factor-to-variable messages.
Variable-to-factor messages
Recall that the form of the messages in Wand’s presentation of VMP flows from the factor graph representation. In an article popularizing this representation, Kschischang, Frey, and Loeliger (2001) developed a generic sum–product algorithm in which messages are passed back and forth between factors and variables, as in Wand’s presentation of VMP. Bishop (2006, p. 408) noted that one can eliminate the variable-to-factor messages in the sum–product algorithm, and reformulate it with only factor-to-variable messages. We find it helpful to reformulate VMP in a similar way.
Let us first recall Wand's generic algorithm. In his Section 3.2, following Minka (2005), he presents an iteration loop for VMP that could be stated as (i) choose a factor; (ii) update messages from neighboring stochastic nodes to that factor; (iii) update messages from that factor to neighboring stochastic nodes. While the schedule for updating factors may be flexible in some applications (Winn and Bishop 2005, sec. 3.5), for our purposes we can assume the factors are updated serially in a fixed order. Thus, a single iteration of the VMP algorithm might be written as a loop over j, with each step comprising two subloops:
Loop A. For j = 1, …, N:
- For each i′ ∈ S_j, perform the update

  m_{θ_i′→f_j}(θ_i′) ← ∏_{j′ ≠ j: i′ ∈ S_j′} m_{f_j′→θ_i′}(θ_i′).    (1)

  This is just (W7), but with S_j′ (defined in (W5)) replacing the equivalent “neighbors(j′)”.
- For each i ∈ S_j:
  - Define the density in (W9), which is proportional to

    ∏_{i′ ∈ S_j\{i}} m_{θ_i′→f_j}(θ_i′) m_{f_j→θ_i′}(θ_i′).    (2)

  - Update the factor-to-variable message m_{f_j→θ_i}(θ_i) by (W8), which we repeat for convenience, again using S_j in place of “neighbors(j)”:

    m_{f_j→θ_i}(θ_i) ← exp{ E[ log f_j(θ_{S_j}) ] },    (3)

    where the expectation is with respect to the density in (2).
The messages on the right-hand side of (1) emanate from factors other than f_j, and thus are not updated within the current step of the loop over j. Therefore, the density (2) is unchanged if we substitute the right-hand side of (1) for m_{θ_i′→f_j}(θ_i′) in (2). Doing so renders the first subloop redundant, so that a single iteration of VMP can be rewritten in the following mathematically equivalent form.
Loop B. For j = 1, …, N:
For each i ∈ S_j:
- Define the density proportional to

  ∏_{i′ ∈ S_j\{i}} ∏_{j′: i′ ∈ S_j′} m_{f_j′→θ_i′}(θ_i′).    (4)

- Update the factor-to-variable message m_{f_j→θ_i}(θ_i) using (3), with the expectation taken with respect to the density in (4).
If the jth factor depends on more than two of the θ_i (i.e., |S_j| > 2), then Loop A may save some computation by performing the multiplication (1) just once, whereas Loop B must do the same multiplication, in (4), |S_j| − 1 times. We suspect the savings would typically be small, since |S_j| ≤ 2 for most j and the multiplication reduces to summing natural parameters (see Section 4). At any rate, the conceptual simplicity that Loop B achieves by doing away with variable-to-factor messages will facilitate our development in Sections 3 and 4.
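For readers who find code easier to follow than nested loops described in prose, here is a minimal Python sketch of Loop B. It is our own schematic, not code from Wand's article: the argument names and the two callables that carry all of the model-specific work (multiplying density-like messages, and forming exp of an expected log factor) are illustrative assumptions.

```python
# Schematic of one iteration of Loop B (our own sketch, not Wand's code).
# The model-specific operations are supplied by the caller:
#   multiply(msgs)            -- (normalized) product of density-like messages
#   exp_expected_log(j, i, q) -- exp(E[log f_j]) as a function of theta_i, with the
#                                expectation taken under the densities in q
def vmp_iteration_loop_B(N, S, messages, multiply, exp_expected_log):
    """messages[(j, i)] holds the factor-to-variable message m_{f_j -> theta_i};
    S[j] is the index set S_j of stochastic nodes that are arguments of f_j."""
    for j in range(N):                       # cycle over the factors, as in Loop B
        for i in S[j]:
            # Density (4): for each i' in S_j \ {i}, multiply *all* messages
            # currently sent to theta_{i'}; by (5) this is just q_{i'}*(theta_{i'}).
            q = {ip: multiply([messages[(jp, ip)]
                               for jp in range(N) if ip in S[jp]])
                 for ip in S[j] if ip != i}
            # Update (3): replace m_{f_j -> theta_i} by exp(E[log f_j]), the
            # expectation taken with respect to the density in (4).
            messages[(j, i)] = exp_expected_log(j, i, q)
    return messages
```

Note that no variable-to-factor messages are stored anywhere: the density (4) is assembled directly from the factor-to-variable messages, which is the point of Loop B.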
Factor-to-variable messages
As Wand notes, in MFVB we seek component densities q_1^*(θ_1), …, q_M^*(θ_M) (here and below, unlike Wand, we include subscripts for these densities) such that the product q^*(θ) = ∏_{i=1}^{M} q_i^*(θ_i) minimizes the Kullback–Leibler divergence over all product densities of this form. By (W10), in the VMP implementation of MFVB, we have
q_i^*(θ_i) ∝ ∏_{j: i ∈ S_j} m_{f_j→θ_i}(θ_i)    (5)
upon convergence. Alternatively, one can view the q_i^*(θ_i) as quantities that are being updated throughout the iterative algorithm (Minka 2005 does this). We can then view the factor-to-variable messages as (proportional to) iteratively updated subcomponent densities, where the component q_i^*(θ_i) is divided into one subcomponent for each factor f_j of which θ_i is an argument.
To summarize: we understand the variable-to-factor messages as a bookkeeping device with no independent statistical meaning, such that VMP can be formulated without them; and we interpret the factor-to-variable messages as factor-specific subcomponents of each component density, as in (5). One could, then, jettison the message passing metaphor altogether and replace the notation m_{f_j→θ_i}(θ_i) by notation that reflects its role as the f_j-specific subcomponent of q_i^*(θ_i).
3. Relating VMP to Traditional MFVB
We can now answer Question 2 posed in the Introduction. As explained in, for example, Ormerod and Wand (2010) and Goldsmith, Wand, and Crainiceanu (2011), the traditional MFVB algorithm initializes the component densities q_1^*(θ_1), …, q_M^*(θ_M) and then updates these iteratively via the coordinate descent steps
q_i^*(θ_i) ∝ exp{ E_{∏_{i′ ≠ i} q_{i′}^*} log p(θ, D) },    (6)
for i = 1, …, M.
By (5), the density (4) reduces to ∏_{i′ ∈ S_j\{i}} q_{i′}^*(θ_{i′}), and thus the VMP update (3) can be rewritten as
m_{f_j→θ_i}(θ_i) ← exp{ E_{∏_{i′ ∈ S_j\{i}} q_{i′}^*} log f_j(θ_{S_j}) }.    (7)
This is a factor-specific analogue of the usual MFVB update (6): the ith component q_i^*(θ_i) is replaced by its jth-factor-specific subcomponent m_{f_j→θ_i}(θ_i), the joint density p(θ, D) by its jth factor f_j(θ_{S_j}), and the product of q_{i′}^*(θ_{i′}) over all i′ ≠ i by one restricted to those i′ in S_j (this last is not a real difference, since taking the product over all i′ ≠ i in (7) would be equivalent).
The traditional MFVB algorithm cycles over all i (the variables) to update the component densities. VMP, on the other hand, cycles over j (the factors), and within each factor, cycles over i ∈ Sj to update subcomponent densities.
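One further identity (this verification is ours, not drawn from Wand's article) makes the correspondence exact, under the assumptions that the factors multiply to the joint density, p(θ, D) = ∏_{j=1}^{N} f_j(θ_{S_j}), and that all the expectations involved are evaluated at the same current set of component densities (whereas VMP actually refreshes its messages serially within an iteration). Multiplying the factor-specific updates (7) over all factors having θ_i as an argument gives

∏_{j: i ∈ S_j} m_{f_j→θ_i}(θ_i) = exp{ ∑_{j: i ∈ S_j} E_{∏_{i′ ∈ S_j\{i}} q_{i′}^*} log f_j(θ_{S_j}) } ∝ exp{ E_{∏_{i′ ≠ i} q_{i′}^*} log p(θ, D) },

since the factors with i ∉ S_j contribute only terms that are constant in θ_i to the exponent on the right. By (5), the left-hand side is proportional to q_i^*(θ_i), so accumulating the subcomponent updates (7) over j reproduces the traditional update (6).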
4. Reduction to Natural Parameter Updates
We now turn to Question 3. Throughout his Sections 3 and 4, Wand exploits conjugacy (as defined for factor graphs in Section 3.2.2) to simplify a number of special cases of the VMP algorithm, and in particular, to reduce updates (W7)–(W9) for the messages to updates for the natural parameters of the messages. The following is our attempt to make explicit what Wand’s treatment assumes implicitly.
In the exponential family case, it is natural to write the factor-to-variable messages in the form
m_{f_j→θ_i}(θ_i) ∝ exp{ T_i(θ_i)^⊤ η_{f_j→θ_i} }    (8)
for a natural parameter η_{f_j→θ_i}, where T_i(·) denotes the sufficient statistic vector associated with θ_i. But since these messages are not really defined—only their updates are, by (3)—there is a hint of vagueness in the definition of η_{f_j→θ_i}. It seems to us that the key to avoiding such ambiguity is to start by defining the jth factor as an exponential family density with respect to θ_i, for each i ∈ S_j: namely
f_j(θ_{S_j}) = exp{ T_i(θ_i)^⊤ η_{f_j\θ_i} + (terms not involving θ_i) },    (9)
where η_{f_j\θ_i} does not depend on θ_i. For instance, consider the factor f_j(σ², a) = p(σ² | a) in Wand's linear regression example. In Section S.2.1 of the supplement, the logarithm of this factor is written in the two forms

log f_j(σ², a) = T(σ²)^⊤ η_{f_j\σ²} + (terms not involving σ²) = T(a)^⊤ η_{f_j\a} + (terms not involving a),

where T(·) denotes the inverse chi-squared sufficient statistic vector T(x) = (log x, 1/x)^⊤, and

η_{f_j\σ²} = (−3/2, −1/(2a))^⊤,   η_{f_j\a} = (−1/2, −1/(2σ²))^⊤.    (10)
Inserting (9) into (the log of) (3) yields

log m_{f_j→θ_i}(θ_i) ← E{ T_i(θ_i)^⊤ η_{f_j\θ_i} + terms not involving θ_i } = T_i(θ_i)^⊤ E( η_{f_j\θ_i} ) + const.
This update equation serves as the justification both for writing factor-to-variable messages in the form (8) and for updating their natural parameters by
η_{f_j→θ_i} ← E( η_{f_j\θ_i} ).    (11)
The density in Loop A is proportional to (2), which can be expressed as
∏_{i′ ∈ S_j\{i}} exp{ T_{i′}(θ_{i′})^⊤ η_{f_j↔θ_{i′}} },    (12)
with η_{f_j↔θ_{i′}} generalizing the notation that first appears in (W22). Eliminating variable-to-factor messages as in Loop B above, (12) becomes
∏_{i′ ∈ S_j\{i}} exp{ T_{i′}(θ_{i′})^⊤ ∑_{j′: i′ ∈ S_j′} η_{f_j′→θ_{i′}} }    (13)
(cf. (4)). If, as suggested in Section 3 above, q_{i′}^*(θ_{i′}) is defined using (5) throughout the iterations rather than just at convergence, then (13) can be rewritten as

∏_{i′ ∈ S_j\{i}} q_{i′}^*(θ_{i′}),   where   q_{i′}^*(θ_{i′}) ∝ exp{ T_{i′}(θ_{i′})^⊤ η_{q*(θ_{i′})} }   and   η_{q*(θ_{i′})} = ∑_{j′: i′ ∈ S_j′} η_{f_j′→θ_{i′}}.

This obviates the need for the notation η_{f_j↔θ_{i′}}, which is equivalent to η_{q*(θ_{i′})}.
We can thus summarize the natural-parameter-updating version of VMP as follows. First, we must obtain the right side of (11) as a function of the natural parameters η_{q*(θ_{i′})}, i′ ∈ S_j\{i}, for each i, j. Then, after initializing all the natural parameters η_{f_j→θ_i}, each iteration of the algorithm is a modified, more concrete version of Loop B:
Loop C. For j = 1, …, N:
For each i ∈ S_j:
- Compute η_{q*(θ_{i′})} = ∑_{j′: i′ ∈ S_j′} η_{f_j′→θ_{i′}} for each i′ ∈ S_j\{i};
- Plug these into the right side of (11) to update η_{f_j→θ_i}.
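As a concrete illustration of Loop C, the following self-contained Python sketch runs the natural-parameter updates on a toy Gaussian chain of our own invention (it is not one of Wand's examples): θ_1 ~ N(0, 1), θ_2 | θ_1 ~ N(θ_1, 1), and one observation y | θ_2 ~ N(θ_2, 1), with sufficient statistic T(x) = (x, x²)^⊤ for both variables. Viewed as a function of either of its two arguments, the factor p(θ_2 | θ_1) has natural parameter (other argument, −1/2)^⊤, so the right side of (11) requires only the mean of the other variable under its current q* density, obtained as a row sum of the stored messages.

```python
import numpy as np

# Toy illustration of Loop C (our own example, not from Wand's article):
#   f1(th1) = N(th1; 0, 1),  f2(th1, th2) = N(th2; th1, 1),  f3(th2) = N(y; th2, 1),
# with T(x) = (x, x^2) for each variable, so every message has a two-dimensional
# natural parameter (eta1, eta2), and the corresponding Gaussian mean is -eta1/(2*eta2).
y = 1.0

def mean_from_eta(eta):
    """Mean of a Gaussian with natural parameter eta relative to T(x) = (x, x^2)."""
    return -eta[0] / (2.0 * eta[1])

# eta[(factor, variable)] holds eta_{f_j -> theta_i}, playing the role of the
# table of natural parameters described in Section 5.
eta = {
    ("f1", "th1"): np.array([0.0, -0.5]),  # fixed: f1 has no other stochastic arguments
    ("f2", "th1"): np.array([0.0, -0.5]),  # initial value, updated by Loop C
    ("f2", "th2"): np.array([0.0, -0.5]),  # initial value, updated by Loop C
    ("f3", "th2"): np.array([y,   -0.5]),  # fixed: f3 has no other stochastic arguments
}

def eta_q_star(var):
    """Natural parameter of q*(theta_var): sum of all messages sent to that variable."""
    return sum(v for (f, th), v in eta.items() if th == var)

for _ in range(50):
    # Loop C needs to cycle only over the factor f2, since f1 and f3 send fixed messages.
    for var, other in [("th2", "th1"), ("th1", "th2")]:
        # Right side of (11): as a function of `var`, f2 has natural parameter
        # (other, -1/2), so its expectation under q*(other) is (E[other], -1/2).
        eta[("f2", var)] = np.array([mean_from_eta(eta_q_star(other)), -0.5])

print(mean_from_eta(eta_q_star("th1")), mean_from_eta(eta_q_star("th2")))
# The means converge to y/3 and 2y/3, which here coincide with the exact posterior
# means; the q* variances are, as usual for mean field approximations, too small.
```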
5. Bayesian Linear Regression Revisited
The remarkably simple Loop C can be visualized using a less snazzy alternative to a factor graph: an M × N table of natural parameters for the factor-to-variable messages, with a row for each of the variables θ_1, …, θ_M and a column for each of the factors f_1, …, f_N. Table 1 illustrates this for the linear regression example of Wand's Section 3 (here, like Wand, we dispense with the subscripts in the q_i^*).
Table 1. Natural parameters of the factor-to-variable messages for the Bayesian linear regression example, with a row for each variable and a column for each factor; the final column contains the row sums, that is, the natural parameters of the corresponding q* densities.

|    | p(β) | p(y|β, σ²) | p(σ²|a) | p(a) | row sum |
|---|---|---|---|---|---|
| β | η_{p(β)→β} | η_{p(y|β,σ²)→β} | | | η_{q*(β)} |
| σ² | | η_{p(y|β,σ²)→σ²} | η_{p(σ²|a)→σ²} | | η_{q*(σ²)} |
| a | | | η_{p(σ²|a)→a} | η_{p(a)→a} | η_{q*(a)} |
In this case, only the second and third columns require updates (see (W21)), so each iteration consists of updating these two columns in turn. Consider updating the third column, that is, the natural parameters for the messages from p(σ²|a) to σ² and a. By (11), these updates are expectations of (10) with respect to densities proportional to (13); formulas for these expectations are as given in (W24), using information in Wand's Table S.1. Given these update formulas, in Loop C we simply add the natural parameters in the third row of Table 1 and insert the result (η_{q*(a)}, or η_{p(σ²|a)↔a} in Wand's notation) into the formula for η_{p(σ²|a)→σ²} to update this natural parameter; and then, similarly, we add up the natural parameters in the second row of Table 1 and insert the result into the formula for updating η_{p(σ²|a)→a}.
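The row-summing bookkeeping just described is easy to mechanize. The sketch below is our own illustration, not code accompanying Wand's article: Table 1 becomes a dictionary of dictionaries, η_{q*} is a row sum as in (5), and the two update formulas from (W24), which we do not reproduce here, enter only as caller-supplied functions; the stored vectors are zero-filled placeholders whose dimensions are illustrative only.

```python
import numpy as np

# Our own sketch of the Table 1 bookkeeping (not code from Wand's article).
# table[variable][factor] holds eta_{factor -> variable}; the zero vectors are
# placeholders standing in for the actual natural parameter values.
table = {
    "beta": {"p(beta)":        np.zeros(2), "p(y|beta,sig2)": np.zeros(2)},
    "sig2": {"p(y|beta,sig2)": np.zeros(2), "p(sig2|a)":      np.zeros(2)},
    "a":    {"p(sig2|a)":      np.zeros(2), "p(a)":           np.zeros(2)},
}

def eta_q_star(var):
    """Row sum of Table 1: the natural parameter of q*(theta_var), as in (5)."""
    return sum(table[var].values())

def update_p_sig2_given_a_column(formula_to_sig2, formula_to_a):
    """One update of the third column, following the description in the text.
    The two arguments stand in for Wand's (W24) expressions, which map
    eta_{q*(a)} and eta_{q*(sig2)} to the new messages."""
    table["sig2"]["p(sig2|a)"] = formula_to_sig2(eta_q_star("a"))
    table["a"]["p(sig2|a)"]    = formula_to_a(eta_q_star("sig2"))
```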
6. Summary and Conclusion
In this comment, we have sought to build on the pedagogical component of Wand’s achievement: that is, rendering VMP intelligible to statisticians who, like us and unlike Wand, have not spent years painstakingly reformulating and extending the methodology. Our main proposal, Loop B, is a modified VMP algorithm that is equivalent to the original, mathematically if not computationally, but does away with the variable-to-factor messages. This modification offers a reduced notational load, a clearer connection to the traditional implementation of MFVB, and a streamlined account of the reduction from message updates to natural parameter updates.
Of course, while explaining VMP to statisticians is an important contribution in itself, Wand has gone a great deal further. He has masterfully laid the groundwork for VMP-based semiparametric regression, which should be a huge step forward for flexible modeling with large datasets. Good on him (as they say Down Under) for this major contribution. We look forward to further advances in this area by Wand and coworkers, and by others who will draw inspiration from this landmark article.
Funding
Reiss’s research was supported by award R01 MH095836 from the National Institute of Mental Health, and Goldsmith’s research was supported in part by awards R01HL123407 from the National Heart, Lung, and Blood Institute and R21EB018917 from the National Institute of Biomedical Imaging and Bioengineering.
References
- Bishop, C. M. (2006), Pattern Recognition and Machine Learning, New York: Springer.
- Goldsmith, J., Wand, M. P., and Crainiceanu, C. (2011), "Functional Regression via Variational Bayes," Electronic Journal of Statistics, 5, 572–602. doi:10.1214/11-EJS619.
- Hewitt, C. (1977), "Viewing Control Structures as Patterns of Passing Messages," Artificial Intelligence, 8, 323–364.
- Kschischang, F. R., Frey, B. J., and Loeliger, H.-A. (2001), "Factor Graphs and the Sum–Product Algorithm," IEEE Transactions on Information Theory, 47, 498–519.
- Minka, T. (2005), "Divergence Measures and Message Passing," Technical Report MSR-TR-2005-173, Microsoft Research.
- Minka, T., and Winn, J. (2008), "Gates: A Graphical Notation for Mixture Models," Technical Report MSR-TR-2008-185, Microsoft Research.
- Ormerod, J. T., and Wand, M. P. (2010), "Explaining Variational Approximations," The American Statistician, 64, 140–153.
- Pearl, J. (1982), "Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach," in AAAI-82 Proceedings, pp. 133–136.
- Pearl, J. (1988), Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, San Mateo, CA: Morgan Kaufmann.
- Ruppert, D., Wand, M. P., and Carroll, R. J. (2009), "Semiparametric Regression During 2003–2007," Electronic Journal of Statistics, 3, 1193–1256. doi:10.1214/09-EJS525.
- Wainwright, M. J., and Jordan, M. I. (2008), "Graphical Models, Exponential Families, and Variational Inference," Foundations and Trends in Machine Learning, 1, 1–305.
- Winn, J., and Bishop, C. M. (2005), "Variational Message Passing," Journal of Machine Learning Research, 6, 661–694.