1. Introduction
This very stimulating article reminded us of the following remark in a review article by Wand and co-authors (Ruppert, Wand, and Carroll 2009):
Interplay with Computer Science is one of the most exciting recent developments in semiparametric regression. We anticipate this to be an increasingly fruitful area of research.
Bringing message passing to bear on semiparametric regression, as Wand has done here, is very much in the spirit of such interplay. The notion of message passing is ubiquitous in some areas of computer science, such as distributed computing and object-oriented programming. More specifically, within the field of artificial intelligence, the influential problem-solving model of Hewitt (1977) is based upon message passing, while Pearl (1982, 1988) proposed the passing of messages among neighboring nodes as a way to update beliefs efficiently in large Bayesian networks.
Against this backdrop it is unsurprising that, whereas variational message passing (VMP) is formulated quite differently in the three papers that Wand cites (Winn and Bishop 2005; Minka 2005; Minka and Winn 2008), in each case the authors find it natural to portray the algorithm as passing messages among the nodes of a network. But for readers like us with a mainstream statistics background, a message passing scheme such as that described by Wand comes across, at least at first, as uncomfortably mysterious (see Wainwright and Jordan 2008, p. 36).
We begin by posing three questions that arose for us as we read Wand’s article. The first crystallizes our unease with the very notion of message passing, and addressing this key question will pave the way toward answering the other two.
Question 1. As presented by Wand, following Minka (2005), VMP works by iteratively updating two types of messages: messages m_{θ_i→f_j}(θ_i) from variables (stochastic nodes) to factors, and messages m_{f_j→θ_i}(θ_i) from factors to variables. What, exactly, is the statistical meaning of these messages?
Question 2. How is the VMP algorithm related to the traditional approach to mean field variational Bayes (MFVB)?
Question 3. Wand's message updates are given in (W7)–(W9) (here and below, to avoid confusion with our own equation numbers, we use (Wx) to denote Wand's equation (x)). How do these reduce to natural parameter updates, as presented from Wand's Section 3.2 onward?
In the following sections, we attempt to answer these questions and thereby, we hope, to shed some light on VMP.
2. A Closer Look at Messages
To address Question 1, we consider first the variable-to-factor messages and then the factor-to-variable messages.
Variable-to-factor messages
Recall that the form of the messages in Wand’s presentation of VMP flows from the factor graph representation. In an article popularizing this representation, Kschischang, Frey, and Loeliger (2001) developed a generic sum–product algorithm in which messages are passed back and forth between factors and variables, as in Wand’s presentation of VMP. Bishop (2006, p. 408) noted that one can eliminate the variable-to-factor messages in the sum–product algorithm, and reformulate it with only factor-to-variable messages. We find it helpful to reformulate VMP in a similar way.
Let us first recall Wand's generic algorithm. In his Section 3.2, following Minka (2005), he presents an iteration loop for VMP that could be stated as (i) choose a factor; (ii) update messages from neighboring stochastic nodes to that factor; (iii) update messages from that factor to neighboring stochastic nodes. While the schedule for updating factors may be flexible in some applications (Winn and Bishop 2005, sec. 3.5), for our purposes we can assume the factors are updated serially in a fixed order. Thus, a single iteration of the VMP algorithm might be written as a loop over j, with each step comprising two subloops:
Loop A. For j = 1, …, N:
- For each i′ ∈ S_j, perform the update

  m_{θ_i′→f_j}(θ_i′) ← ∏_{j′ ≠ j: i′ ∈ S_j′} m_{f_j′→θ_i′}(θ_i′).    (1)

  This is just (W7), but with S_j′ (defined in (W5)) replacing the equivalent “neighbors(j′)”.
- For each i ∈ S_j:
  - Define the density in (W9), which is proportional to

    ∏_{i′ ∈ S_j\{i}} m_{θ_i′→f_j}(θ_i′) m_{f_j→θ_i′}(θ_i′).    (2)

  - Update the factor-to-variable message m_{f_j→θ_i}(θ_i) by (W8), which we repeat for convenience, again using S_j in place of “neighbors(j)”:

    m_{f_j→θ_i}(θ_i) ← exp{ E[ log f_j(θ_{S_j}) ] },    (3)

    where the expectation is with respect to the density in (2).
The messages on the right-hand side of (1) emanate from factors other than f_j, and thus are not updated within the current step of the loop over j. Therefore, the density (2) is unchanged if we substitute the right-hand side of (1) for m_{θ_i′→f_j}(θ_i′) in (2). Doing so renders the first subloop redundant, so that a single iteration of VMP can be rewritten in the following mathematically equivalent form.
Loop B. For j = 1, …, N:
For each i ∈ S_j:
- Define the density proportional to

  ∏_{i′ ∈ S_j\{i}} ∏_{j′: i′ ∈ S_j′} m_{f_j′→θ_i′}(θ_i′).    (4)

- Update the factor-to-variable message m_{f_j→θ_i}(θ_i) using (3), with the expectation taken with respect to the density in (4).
If the jth factor depends on more than two of the θ_i (i.e., |S_j| > 2), then Loop A may save some computation by performing the multiplication (1) just once, whereas Loop B must do the same multiplication, in (4), |S_j| − 1 times. We suspect the savings would typically be small, since |S_j| ≤ 2 for most j and the multiplication reduces to summing natural parameters (see Section 4). At any rate, the conceptual simplicity that Loop B achieves by doing away with variable-to-factor messages will facilitate our development in Sections 3 and 4.
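For readers who find code easier to follow than nested loops described in prose, here is a minimal Python sketch of Loop B. It is our own schematic, not code from Wand's article: the argument names and the two callables that carry all of the model-specific work (multiplying density-like messages, and forming exp of an expected log factor) are illustrative assumptions.

```python
# Schematic of one iteration of Loop B (our own sketch, not Wand's code).
# The model-specific operations are supplied by the caller:
#   multiply(msgs)            -- (normalized) product of density-like messages
#   exp_expected_log(j, i, q) -- exp(E[log f_j]) as a function of theta_i, with the
#                                expectation taken under the densities in q
def vmp_iteration_loop_B(N, S, messages, multiply, exp_expected_log):
    """messages[(j, i)] holds the factor-to-variable message m_{f_j -> theta_i};
    S[j] is the index set S_j of stochastic nodes that are arguments of f_j."""
    for j in range(N):                       # cycle over the factors, as in Loop B
        for i in S[j]:
            # Density (4): for each i' in S_j \ {i}, multiply *all* messages
            # currently sent to theta_{i'}; by (5) this is just q_{i'}*(theta_{i'}).
            q = {ip: multiply([messages[(jp, ip)]
                               for jp in range(N) if ip in S[jp]])
                 for ip in S[j] if ip != i}
            # Update (3): replace m_{f_j -> theta_i} by exp(E[log f_j]), the
            # expectation taken with respect to the density in (4).
            messages[(j, i)] = exp_expected_log(j, i, q)
    return messages
```

Note that no variable-to-factor messages are stored anywhere: the density (4) is assembled directly from the factor-to-variable messages, which is the point of Loop B.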
Factor-to-variable messages
As Wand notes, in MFVB we seek component densities q_1^*(θ_1), …, q_M^*(θ_M) (here and below, unlike Wand, we include subscripts for these densities) such that the product q^*(θ) = ∏_{i=1}^{M} q_i^*(θ_i) minimizes the Kullback–Leibler divergence over all product densities of this form. By (W10), in the VMP implementation of MFVB, we have
q_i^*(θ_i) ∝ ∏_{j: i ∈ S_j} m_{f_j→θ_i}(θ_i)    (5)
upon convergence. Alternatively, one can view the q_i^*(θ_i) as quantities that are being updated throughout the iterative algorithm (Minka 2005 does this). We can then view the factor-to-variable messages as (proportional to) iteratively updated subcomponent densities, where the component q_i^*(θ_i) is divided into one subcomponent for each factor f_j of which θ_i is an argument.
To summarize: we understand the variable-to-factor messages as a bookkeeping device with no independent statistical meaning, such that VMP can be formulated without them; and we interpret the factor-to-variable messages as factor-specific subcomponents of each component density, as in (5). One could, then, jettison the message passing metaphor altogether and replace the notation m_{f_j→θ_i}(θ_i) by notation that reflects its role as the f_j-specific subcomponent of q_i^*(θ_i).
3. Relating VMP to Traditional MFVB
We can now answer Question 2 posed in the Introduction. As explained in, for example, Ormerod and Wand (2010) and Goldsmith, Wand, and Crainiceanu (2011), the traditional MFVB algorithm initializes the component densities q_1^*(θ_1), …, q_M^*(θ_M) and then updates these iteratively via the coordinate descent steps
q_i^*(θ_i) ∝ exp{ E_{∏_{i′ ≠ i} q_{i′}^*} log p(θ, D) },    (6)
for i = 1, …, M.
By (5), the density (4) reduces to ∏_{i′ ∈ S_j\{i}} q_{i′}^*(θ_{i′}), and thus the VMP update (3) can be rewritten as
m_{f_j→θ_i}(θ_i) ← exp{ E_{∏_{i′ ∈ S_j\{i}} q_{i′}^*} log f_j(θ_{S_j}) }.    (7)
This is a factor-specific analogue of the usual MFVB update (6): the ith component q_i^*(θ_i) is replaced by its jth-factor-specific subcomponent m_{f_j→θ_i}(θ_i), the joint density p(θ, D) by its jth factor f_j(θ_{S_j}), and the product of q_{i′}^*(θ_{i′}) over all i′ ≠ i by one restricted to those i′ in S_j (this last is not a real difference, since taking the product over all i′ ≠ i in (7) would be equivalent).
The traditional MFVB algorithm cycles over all i (the variables) to update the component densities. VMP, on the other hand, cycles over j (the factors), and within each factor, cycles over i ∈ Sj to update subcomponent densities.
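One further identity (this verification is ours, not drawn from Wand's article) makes the correspondence exact, under the assumptions that the factors multiply to the joint density, p(θ, D) = ∏_{j=1}^{N} f_j(θ_{S_j}), and that all the expectations involved are evaluated at the same current set of component densities (whereas VMP actually refreshes its messages serially within an iteration). Multiplying the factor-specific updates (7) over all factors having θ_i as an argument gives

∏_{j: i ∈ S_j} m_{f_j→θ_i}(θ_i) = exp{ ∑_{j: i ∈ S_j} E_{∏_{i′ ∈ S_j\{i}} q_{i′}^*} log f_j(θ_{S_j}) } ∝ exp{ E_{∏_{i′ ≠ i} q_{i′}^*} log p(θ, D) },

since the factors with i ∉ S_j contribute only terms that are constant in θ_i to the exponent on the right. By (5), the left-hand side is proportional to q_i^*(θ_i), so accumulating the subcomponent updates (7) over j reproduces the traditional update (6).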
4. Reduction to Natural Parameter Updates
We now turn to Question 3. Throughout his Sections 3 and 4, Wand exploits conjugacy (as defined for factor graphs in Section 3.2.2) to simplify a number of special cases of the VMP algorithm, and in particular, to reduce updates (W7)–(W9) for the messages to updates for the natural parameters of the messages. The following is our attempt to make explicit what Wand’s treatment assumes implicitly.
In the exponential family case, it is natural to write the factor-to-variable messages in the form
m_{f_j→θ_i}(θ_i) ∝ exp{ T_i(θ_i)^⊤ η_{f_j→θ_i} }    (8)
for a natural parameter η_{f_j→θ_i}, where T_i(·) denotes the sufficient statistic vector associated with θ_i. But since these messages are not really defined—only their updates are, by (3)—there is a hint of vagueness in the definition of η_{f_j→θ_i}. It seems to us that the key to avoiding such ambiguity is to start by defining the jth factor as an exponential family density with respect to θ_i, for each i ∈ S_j: namely
f_j(θ_{S_j}) = exp{ T_i(θ_i)^⊤ η_{f_j\θ_i} + (terms not involving θ_i) },    (9)
where η_{f_j\θ_i} does not depend on θ_i. For instance, consider the factor f_j(σ², a) = p(σ² | a) in Wand's linear regression example. In Section S.2.1 of the supplement, the logarithm of this factor is written in the two forms

log f_j(σ², a) = T(σ²)^⊤ η_{f_j\σ²} + (terms not involving σ²) = T(a)^⊤ η_{f_j\a} + (terms not involving a),

where T(·) denotes the inverse chi-squared sufficient statistic vector T(x) = (log x, 1/x)^⊤, and

η_{f_j\σ²} = (−3/2, −1/(2a))^⊤,   η_{f_j\a} = (−1/2, −1/(2σ²))^⊤.    (10)
Inserting (9) into (the log of) (3) yields

log m_{f_j→θ_i}(θ_i) ← E{ T_i(θ_i)^⊤ η_{f_j\θ_i} + terms not involving θ_i } = T_i(θ_i)^⊤ E( η_{f_j\θ_i} ) + const.
This update equation serves as the justification both for writing factor-to-variable messages in the form (8) and for updating their natural parameters by
η_{f_j→θ_i} ← E( η_{f_j\θ_i} ).    (11)
The density in Loop A is proportional to (2), which can be expressed as
∏_{i′ ∈ S_j\{i}} exp{ T_{i′}(θ_{i′})^⊤ η_{f_j↔θ_{i′}} },    (12)
with η_{f_j↔θ_{i′}} generalizing the notation that first appears in (W22). Eliminating variable-to-factor messages as in Loop B above, (12) becomes
∏_{i′ ∈ S_j\{i}} exp{ T_{i′}(θ_{i′})^⊤ ∑_{j′: i′ ∈ S_j′} η_{f_j′→θ_{i′}} }    (13)
(cf. (4)). If, as suggested in Section 3 above, q_{i′}^*(θ_{i′}) is defined using (5) throughout the iterations rather than just at convergence, then (13) can be rewritten as

∏_{i′ ∈ S_j\{i}} q_{i′}^*(θ_{i′}),   where   q_{i′}^*(θ_{i′}) ∝ exp{ T_{i′}(θ_{i′})^⊤ η_{q*(θ_{i′})} }   and   η_{q*(θ_{i′})} = ∑_{j′: i′ ∈ S_j′} η_{f_j′→θ_{i′}}.

This obviates the need for the notation η_{f_j↔θ_{i′}}, which is equivalent to η_{q*(θ_{i′})}.
We can thus summarize the natural-parameter-updating version of VMP as follows. First, we must obtain the right side of (11) as a function of the natural parameters η_{q*(θ_{i′})}, i′ ∈ S_j\{i}, for each i, j. Then, after initializing all the natural parameters η_{f_j→θ_i}, each iteration of the algorithm is a modified, more concrete version of Loop B:
Loop C. For j = 1, …, N:
For each i ∈ S_j:
- Compute η_{q*(θ_{i′})} = ∑_{j′: i′ ∈ S_j′} η_{f_j′→θ_{i′}} for each i′ ∈ S_j\{i};
- Plug these into the right side of (11) to update η_{f_j→θ_i}.
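As a concrete illustration of Loop C, the following self-contained Python sketch runs the natural-parameter updates on a toy Gaussian chain of our own invention (it is not one of Wand's examples): θ_1 ~ N(0, 1), θ_2 | θ_1 ~ N(θ_1, 1), and one observation y | θ_2 ~ N(θ_2, 1), with sufficient statistic T(x) = (x, x²)^⊤ for both variables. Viewed as a function of either of its two arguments, the factor p(θ_2 | θ_1) has natural parameter (other argument, −1/2)^⊤, so the right side of (11) requires only the mean of the other variable under its current q* density, obtained as a row sum of the stored messages.

```python
import numpy as np

# Toy illustration of Loop C (our own example, not from Wand's article):
#   f1(th1) = N(th1; 0, 1),  f2(th1, th2) = N(th2; th1, 1),  f3(th2) = N(y; th2, 1),
# with T(x) = (x, x^2) for each variable, so every message has a two-dimensional
# natural parameter (eta1, eta2), and the corresponding Gaussian mean is -eta1/(2*eta2).
y = 1.0

def mean_from_eta(eta):
    """Mean of a Gaussian with natural parameter eta relative to T(x) = (x, x^2)."""
    return -eta[0] / (2.0 * eta[1])

# eta[(factor, variable)] holds eta_{f_j -> theta_i}, playing the role of the
# table of natural parameters described in Section 5.
eta = {
    ("f1", "th1"): np.array([0.0, -0.5]),  # fixed: f1 has no other stochastic arguments
    ("f2", "th1"): np.array([0.0, -0.5]),  # initial value, updated by Loop C
    ("f2", "th2"): np.array([0.0, -0.5]),  # initial value, updated by Loop C
    ("f3", "th2"): np.array([y,   -0.5]),  # fixed: f3 has no other stochastic arguments
}

def eta_q_star(var):
    """Natural parameter of q*(theta_var): sum of all messages sent to that variable."""
    return sum(v for (f, th), v in eta.items() if th == var)

for _ in range(50):
    # Loop C needs to cycle only over the factor f2, since f1 and f3 send fixed messages.
    for var, other in [("th2", "th1"), ("th1", "th2")]:
        # Right side of (11): as a function of `var`, f2 has natural parameter
        # (other, -1/2), so its expectation under q*(other) is (E[other], -1/2).
        eta[("f2", var)] = np.array([mean_from_eta(eta_q_star(other)), -0.5])

print(mean_from_eta(eta_q_star("th1")), mean_from_eta(eta_q_star("th2")))
# The means converge to y/3 and 2y/3, which here coincide with the exact posterior
# means; the q* variances are, as usual for mean field approximations, too small.
```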
5. Bayesian Linear Regression Revisited
The remarkably simple Loop C can be visualized using a less snazzy alternative to a factor graph: an M × N table of natural parameters for the factor-to-variable messages, with a row for each of the variables θ_1, …, θ_M and a column for each of the factors f_1, …, f_N. Table 1 illustrates this for the linear regression example of Wand's Section 3 (here, like Wand, we dispense with the subscripts in the q_i^*).
Table 1. Natural parameters of the factor-to-variable messages for the Bayesian linear regression example, with a row for each variable and a column for each factor; the final column contains the row sums, that is, the natural parameters of the corresponding q* densities.

|    | p(β) | p(y|β, σ²) | p(σ²|a) | p(a) | row sum |
|---|---|---|---|---|---|
| β | η_{p(β)→β} | η_{p(y|β,σ²)→β} | | | η_{q*(β)} |
| σ² | | η_{p(y|β,σ²)→σ²} | η_{p(σ²|a)→σ²} | | η_{q*(σ²)} |
| a | | | η_{p(σ²|a)→a} | η_{p(a)→a} | η_{q*(a)} |
In this case, only the second and third columns require updates (see (W21)), so each iteration consists of updating these two columns in turn. Consider updating the third column, that is, the natural parameters for the messages from p(σ²|a) to σ² and a. By (11), these updates are expectations of (10) with respect to densities proportional to (13); formulas for these expectations are as given in (W24), using information in Wand's Table S.1. Given these update formulas, in Loop C we simply add the natural parameters in the third row of Table 1 and insert the result (η_{q*(a)}, or η_{p(σ²|a)↔a} in Wand's notation) into the formula for η_{p(σ²|a)→σ²} to update this natural parameter; and then, similarly, we add up the natural parameters in the second row of Table 1 and insert the result into the formula for updating η_{p(σ²|a)→a}.
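The row-summing bookkeeping just described is easy to mechanize. The sketch below is our own illustration, not code accompanying Wand's article: Table 1 becomes a dictionary of dictionaries, η_{q*} is a row sum as in (5), and the two update formulas from (W24), which we do not reproduce here, enter only as caller-supplied functions; the stored vectors are zero-filled placeholders whose dimensions are illustrative only.

```python
import numpy as np

# Our own sketch of the Table 1 bookkeeping (not code from Wand's article).
# table[variable][factor] holds eta_{factor -> variable}; the zero vectors are
# placeholders standing in for the actual natural parameter values.
table = {
    "beta": {"p(beta)":        np.zeros(2), "p(y|beta,sig2)": np.zeros(2)},
    "sig2": {"p(y|beta,sig2)": np.zeros(2), "p(sig2|a)":      np.zeros(2)},
    "a":    {"p(sig2|a)":      np.zeros(2), "p(a)":           np.zeros(2)},
}

def eta_q_star(var):
    """Row sum of Table 1: the natural parameter of q*(theta_var), as in (5)."""
    return sum(table[var].values())

def update_p_sig2_given_a_column(formula_to_sig2, formula_to_a):
    """One update of the third column, following the description in the text.
    The two arguments stand in for Wand's (W24) expressions, which map
    eta_{q*(a)} and eta_{q*(sig2)} to the new messages."""
    table["sig2"]["p(sig2|a)"] = formula_to_sig2(eta_q_star("a"))
    table["a"]["p(sig2|a)"]    = formula_to_a(eta_q_star("sig2"))
```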
6. Summary and Conclusion
In this comment, we have sought to build on the pedagogical component of Wand’s achievement: that is, rendering VMP intelligible to statisticians who, like us and unlike Wand, have not spent years painstakingly reformulating and extending the methodology. Our main proposal, Loop B, is a modified VMP algorithm that is equivalent to the original, mathematically if not computationally, but does away with the variable-to-factor messages. This modification offers a reduced notational load, a clearer connection to the traditional implementation of MFVB, and a streamlined account of the reduction from message updates to natural parameter updates.
Of course, while explaining VMP to statisticians is an important contribution in itself, Wand has gone a great deal further. He has masterfully laid the groundwork for VMP-based semiparametric regression, which should be a huge step forward for flexible modeling with large datasets. Good on him (as they say Down Under) for this major contribution. We look forward to further advances in this area by Wand and coworkers, and by others who will draw inspiration from this landmark article.
Funding
Reiss’s research was supported by award R01 MH095836 from the National Institute of Mental Health, and Goldsmith’s research was supported in part by awards R01HL123407 from the National Heart, Lung, and Blood Institute and R21EB018917 from the National Institute of Biomedical Imaging and Bioengineering.
References
- Bishop, C. M. (2006), Pattern Recognition and Machine Learning, New York: Springer.
- Goldsmith, J., Wand, M. P., and Crainiceanu, C. (2011), "Functional Regression via Variational Bayes," Electronic Journal of Statistics, 5, 572–602. doi:10.1214/11-EJS619.
- Hewitt, C. (1977), "Viewing Control Structures as Patterns of Passing Messages," Artificial Intelligence, 8, 323–364.
- Kschischang, F. R., Frey, B. J., and Loeliger, H.-A. (2001), "Factor Graphs and the Sum–Product Algorithm," IEEE Transactions on Information Theory, 47, 498–519.
- Minka, T. (2005), "Divergence Measures and Message Passing," Technical Report MSR-TR-2005-173, Microsoft Research.
- Minka, T., and Winn, J. (2008), "Gates: A Graphical Notation for Mixture Models," Technical Report MSR-TR-2008-185, Microsoft Research.
- Ormerod, J. T., and Wand, M. P. (2010), "Explaining Variational Approximations," The American Statistician, 64, 140–153.
- Pearl, J. (1982), "Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach," in AAAI-82 Proceedings, pp. 133–136.
- Pearl, J. (1988), Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, San Mateo, CA: Morgan Kaufmann.
- Ruppert, D., Wand, M. P., and Carroll, R. J. (2009), "Semiparametric Regression During 2003–2007," Electronic Journal of Statistics, 3, 1193–1256. doi:10.1214/09-EJS525.
- Wainwright, M. J., and Jordan, M. I. (2008), "Graphical Models, Exponential Families, and Variational Inference," Foundations and Trends in Machine Learning, 1, 1–305.
- Winn, J., and Bishop, C. M. (2005), "Variational Message Passing," Journal of Machine Learning Research, 6, 661–694.