Author manuscript; available in PMC: 2019 Dec 19.
Published in final edited form as: Adv Neural Inf Process Syst. 2019 Dec;32:7834–7844.

Stein Variational Gradient Descent with Matrix-Valued Kernels

Dilin Wang 1,*, Ziyang Tang 1,*, Chandrajit Bajaj 1, Qiang Liu 1
PMCID: PMC6923147  NIHMSID: NIHMS1060429  PMID: 31857781

Abstract

Stein variational gradient descent (SVGD) is a particle-based inference algorithm that leverages gradient information for efficient approximate inference. In this work, we enhance SVGD by leveraging preconditioning matrices, such as the Hessian and Fisher information matrix, to incorporate geometric information into SVGD updates. We achieve this by presenting a generalization of SVGD that replaces the scalar-valued kernels in vanilla SVGD with more general matrix-valued kernels. This yields a significant extension of SVGD, and more importantly, allows us to flexibly incorporate various preconditioning matrices to accelerate the exploration in the probability landscape. Empirical results show that our method outperforms vanilla SVGD and a variety of baseline approaches over a range of real-world Bayesian inference tasks.

1. Introduction

Approximate inference of intractable distributions is a central task in probabilistic learning and statistics. An efficient approximate inference algorithm must perform both efficient optimization to explore the high probability regions of the distributions of interest, and reliable uncertainty quantification for evaluating the variation of the given distributions. Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is a deterministic sampling algorithm that achieves both desiderata by optimizing the samples using a procedure similar to gradient-based optimization, while achieving reliable uncertainty estimation using an interacting repulsive mechanism. SVGD has been shown to provide a fast and flexible alternative to traditional methods such as Markov chain Monte Carlo (MCMC) (e.g., Neal et al., 2011; Hoffman & Gelman, 2014) and parametric variational inference (VI) (e.g., Wainwright et al., 2008; Blei et al., 2017) in various challenging applications (e.g., Pu et al., 2017; Wang & Liu, 2016; Kim et al., 2018; Haarnoja et al., 2017).

On the other hand, standard SVGD only uses first-order gradient information, and cannot leverage the advantages of second-order methods, such as Newton's method and natural gradient, to achieve better performance on challenging problems with complex loss landscapes or domains. Unfortunately, due to the special form of SVGD, it is not straightforward to derive second-order extensions of SVGD by simply extending similar ideas from optimization. While this problem has been recently considered (e.g., Detommaso et al., 2018; Liu & Zhu, 2018; Chen et al., 2019), the presented solutions either require heuristic approximations (Detommaso et al., 2018), or lead to complex algorithmic procedures that are difficult to implement in practice (Liu & Zhu, 2018).

Our solution to this problem is through a key generalization of SVGD that replaces the original scalar-valued positive definite kernels in SVGD with a class of more general matrix-valued positive definite kernels. Our generalization includes all previous variants of SVGD (e.g., Wang et al., 2018; Han & Liu, 2018) as special cases. More significantly, it allows us to easily incorporate various structured preconditioning matrices into SVGD updates, including both Hessian and Fisher information matrices, as part of the generalized matrix-valued positive definite kernels. We develop theoretical results that shed insight on optimal design of the matrix-valued kernels, and also propose simple and fast practical procedures. We empirically evaluate both Newton and Fisher based extensions of SVGD on various practical benchmarks, including Bayesian neural regression and sentence classification, on which our methods show significant improvement over vanilla SVGD and other baseline approaches.

Notation and Preliminary

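For notation, we use bold lower-case letters (e.g., $\boldsymbol{x}$) for vectors in $\mathbb{R}^d$, and bold upper-case letters (e.g., $\boldsymbol{Q}$) for matrices. A symmetric function $k\colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is called a positive definite kernel if $\sum_{ij} c_i k(\boldsymbol{x}_i, \boldsymbol{x}_j) c_j \ge 0$ for any $\{c_i\} \subset \mathbb{R}$ and $\{\boldsymbol{x}_i\} \subset \mathbb{R}^d$. Every positive definite kernel $k(\boldsymbol{x}, \boldsymbol{x}')$ is associated with a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$, which consists of the closure of functions of form

$$f(\boldsymbol{x}) = \sum_i c_i k(\boldsymbol{x}, \boldsymbol{x}_i), \quad \forall \{c_i\} \subset \mathbb{R},\ \{\boldsymbol{x}_i\} \subset \mathbb{R}^d, \tag{1}$$

for which the inner product and norm are defined by $\langle f, g \rangle_{\mathcal{H}_k} = \sum_{ij} c_i s_j k(\boldsymbol{x}_i, \boldsymbol{x}_j)$ and $\|f\|_{\mathcal{H}_k}^2 = \sum_{ij} c_i c_j k(\boldsymbol{x}_i, \boldsymbol{x}_j)$, where we assume $g(\boldsymbol{x}) = \sum_i s_i k(\boldsymbol{x}, \boldsymbol{x}_i)$. Denote by $\mathcal{H}_k^d := \mathcal{H}_k \times \cdots \times \mathcal{H}_k$ the vector-valued RKHS consisting of $\mathbb{R}^d$ vector-valued functions $\boldsymbol{\phi} = [\phi^1, \ldots, \phi^d]^\top$ with each $\phi^\ell \in \mathcal{H}_k$. See e.g., Berlinet & Thomas-Agnan (2011) for more rigorous treatment. For notational convenience, we do not distinguish distributions on $\mathbb{R}^d$ and their density functions.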

2. Stein Variational Gradient Descent (SVGD)

We introduce the basic derivation of Stein variational gradient descent (SVGD), which provides a foundation for our new generalization. See Liu & Wang (2016, 2018); Liu (2017) for more details.

Let $p(\boldsymbol{x})$ be a positive and continuously differentiable probability density function on $\mathbb{R}^d$. Our goal is to find a set of points (a.k.a. particles) $\{\boldsymbol{x}_i\}_{i=1}^n \subset \mathbb{R}^d$ to approximate $p$, such that the empirical distribution $q(\boldsymbol{x}) = \sum_i \delta(\boldsymbol{x} - \boldsymbol{x}_i)/n$ of the particles weakly converges to $p$ when $n$ is large. Here $\delta(\cdot)$ denotes the Dirac delta function.

SVGD achieves this by starting from a set of initial particles, and iteratively updating them with a deterministic transformation of form

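$$\boldsymbol{x}_i \leftarrow \boldsymbol{x}_i + \epsilon \boldsymbol{\phi}_k^*(\boldsymbol{x}_i), \quad \forall i = 1, \ldots, n, \qquad \boldsymbol{\phi}_k^* = \arg\max_{\boldsymbol{\phi} \in \mathcal{B}_k} \left\{ -\frac{d}{d\epsilon} \mathrm{KL}(q_{[\epsilon\boldsymbol{\phi}]} \,\|\, p) \Big|_{\epsilon=0} \right\}, \tag{2}$$

where $\epsilon$ is a small step size, $\boldsymbol{\phi}_k^*\colon \mathbb{R}^d \to \mathbb{R}^d$ is an optimal transform function chosen to maximize the decreasing rate of the KL divergence between the distribution of particles and the target $p$, $q_{[\epsilon\boldsymbol{\phi}]}$ denotes the distribution of the updated particles $\boldsymbol{x}' = \boldsymbol{x} + \epsilon\boldsymbol{\phi}(\boldsymbol{x})$ as $\boldsymbol{x} \sim q$, and $\mathcal{B}_k$ is the unit ball of RKHS $\mathcal{H}_k^d := \mathcal{H}_k \times \cdots \times \mathcal{H}_k$ associated with a positive definite kernel $k(\boldsymbol{x}, \boldsymbol{x}')$, that is,

$$\mathcal{B}_k = \{\boldsymbol{\phi} \in \mathcal{H}_k^d \colon \|\boldsymbol{\phi}\|_{\mathcal{H}_k^d} \le 1\}. \tag{3}$$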

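Liu & Wang (2016) showed that the objective in (2) can be expressed as a linear functional of $\boldsymbol{\phi}$,

$$-\frac{d}{d\epsilon} \mathrm{KL}(q_{[\epsilon\boldsymbol{\phi}]} \,\|\, p) \Big|_{\epsilon=0} = \mathbb{E}_{\boldsymbol{x} \sim q}[\mathcal{P}^\top \boldsymbol{\phi}(\boldsymbol{x})], \qquad \mathcal{P}^\top \boldsymbol{\phi}(\boldsymbol{x}) = \nabla_{\boldsymbol{x}} \log p(\boldsymbol{x})^\top \boldsymbol{\phi}(\boldsymbol{x}) + \nabla_{\boldsymbol{x}}^\top \boldsymbol{\phi}(\boldsymbol{x}), \tag{4}$$

where $\mathcal{P}$ is a differential operator called the Stein operator; here we formally view $\mathcal{P}$ and the derivative operator $\nabla_{\boldsymbol{x}}$ as $\mathbb{R}^d$ column vectors, hence $\mathcal{P}^\top \boldsymbol{\phi}$ and $\nabla_{\boldsymbol{x}}^\top \boldsymbol{\phi}$ are viewed as inner products, e.g., $\nabla_{\boldsymbol{x}}^\top \boldsymbol{\phi} = \sum_{\ell=1}^d \nabla_{x^\ell} \phi^\ell$, with $x^\ell$ and $\phi^\ell$ being the $\ell$-th coordinate of vector $\boldsymbol{x}$ and $\boldsymbol{\phi}$, respectively. With (4), it is shown in Liu & Wang (2016) that the solution of (2) is

$$\boldsymbol{\phi}_k^*(\cdot) \propto \mathbb{E}_{\boldsymbol{x} \sim q}[\mathcal{P} k(\boldsymbol{x}, \cdot)] = \mathbb{E}_{\boldsymbol{x} \sim q}[\nabla_{\boldsymbol{x}} \log p(\boldsymbol{x}) k(\boldsymbol{x}, \cdot) + \nabla_{\boldsymbol{x}} k(\boldsymbol{x}, \cdot)]. \tag{5}$$

Such $\boldsymbol{\phi}_k^*$ provides the best update direction for the particles within RKHS $\mathcal{H}_k^d$. By taking $q$ to be the empirical measure of the particles, i.e., $q(\boldsymbol{x}) = \sum_{i=1}^n \delta(\boldsymbol{x} - \boldsymbol{x}_i)/n$, and repeatedly applying this update on the particles, we obtain the SVGD algorithm using equations (2) and (5).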
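The update defined by (2) and (5) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `rbf_kernel` and `svgd_step`, the fixed bandwidth `h`, the step size, and the standard normal target are all assumptions made for the example.

```python
import numpy as np

def rbf_kernel(X, h):
    """Return K[j, i] = k(x_j, x_i) = exp(-||x_j - x_i||^2 / (2h)) and its gradient in x_j."""
    diff = X[:, None, :] - X[None, :, :]          # diff[j, i] = x_j - x_i
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h))
    grad_K = -diff / h * K[:, :, None]            # grad_K[j, i] = d k(x_j, x_i) / d x_j
    return K, grad_K

def svgd_step(X, grad_logp, eps=0.1, h=1.0):
    """One SVGD update x_i <- x_i + eps * phi(x_i), with phi taken from Eq. (5)."""
    n = X.shape[0]
    K, grad_K = rbf_kernel(X, h)
    # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (K.T @ grad_logp(X) + grad_K.sum(axis=0)) / n
    return X + eps * phi

# Assumed toy target: standard normal, so grad log p(x) = -x.
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=0.5, size=(50, 2))
for _ in range(500):
    X = svgd_step(X, lambda Z: -Z)
```

The first term pulls each particle toward high-density regions, while the kernel-gradient term pushes nearby particles apart.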

3. SVGD with Matrix-valued Kernels

Our goal is to extend SVGD to allow efficient incorporation of preconditioning information for better optimization. We achieve this by providing a generalization of SVGD that leverages more general matrix-valued kernels to flexibly incorporate preconditioning information.

The key idea is to observe that the standard SVGD searches for the optimal $\boldsymbol{\phi}$ in RKHS $\mathcal{H}_k^d = \mathcal{H}_k \times \cdots \times \mathcal{H}_k$, a product of $d$ copies of RKHS of scalar-valued functions, which does not allow us to encode potential correlations between different coordinates of $\boldsymbol{\phi}$. This limitation can be addressed by replacing $\mathcal{H}_k^d$ with a more general RKHS of vector-valued functions (called vector-valued RKHS), which uses more flexible matrix-valued positive definite kernels to specify rich correlation structures between different coordinates. In this section, we first introduce the background of vector-valued RKHS with matrix-valued kernels in Section 3.1, and then propose and discuss our generalization of SVGD using matrix-valued kernels in Sections 3.2-3.3.

3.1. Vector-Valued RKHS with Matrix-Valued Kernels

We now introduce the background of matrix-valued positive definite kernels, which provides a most general framework for specifying vector-valued RKHS. We focus on the intuition and key ideas in our introduction, and refer the readers to Alvarez et al. (2012); Carmeli et al. (2006) for mathematical treatment.

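Recall that a standard real-valued RKHS $\mathcal{H}_k$ consists of the closure of the linear span of its kernel function $k(\cdot, \boldsymbol{x})$ as shown in (1). Vector-valued RKHS can be defined in a similar way, but consists of the linear span of a matrix-valued kernel function:

$$\boldsymbol{f}(\boldsymbol{x}) = \sum_i \boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}_i) \boldsymbol{c}_i, \tag{6}$$

for any $\{\boldsymbol{c}_i\} \subset \mathbb{R}^d$ and $\{\boldsymbol{x}_i\} \subset \mathbb{R}^d$, where $\boldsymbol{K}\colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{d \times d}$ is now a matrix-valued kernel function, and $\boldsymbol{c}_i$ are vector-valued weights. Similar to the scalar case, we can define an inner product structure $\langle \boldsymbol{f}, \boldsymbol{g} \rangle_{\mathcal{H}_{\boldsymbol{K}}} = \sum_{ij} \boldsymbol{c}_i^\top \boldsymbol{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) \boldsymbol{s}_j$, where we assume $\boldsymbol{g} = \sum_i \boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}_i) \boldsymbol{s}_i$, and hence a norm $\|\boldsymbol{f}\|_{\mathcal{H}_{\boldsymbol{K}}}^2 = \sum_{ij} \boldsymbol{c}_i^\top \boldsymbol{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) \boldsymbol{c}_j$. In order to make the inner product and norm well defined, the matrix-valued kernel $\boldsymbol{K}$ is required to be symmetric in that $\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{K}(\boldsymbol{x}', \boldsymbol{x})^\top$, and positive definite in that $\sum_{ij} \boldsymbol{c}_i^\top \boldsymbol{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) \boldsymbol{c}_j \ge 0$ for any $\{\boldsymbol{x}_i\} \subset \mathbb{R}^d$ and $\{\boldsymbol{c}_i\} \subset \mathbb{R}^d$.

Mathematically, one can show that the closure of the set of functions in (6), equipped with the inner product defined above, defines an RKHS that we denote by $\mathcal{H}_{\boldsymbol{K}}$. It is "reproducing" because it has the following reproducing property that generalizes the version for scalar-valued RKHS: for any $\boldsymbol{f} \in \mathcal{H}_{\boldsymbol{K}}$ and any $\boldsymbol{c} \in \mathbb{R}^d$, we have

$$\boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{c} = \langle \boldsymbol{f}(\cdot), \boldsymbol{K}(\cdot, \boldsymbol{x}) \boldsymbol{c} \rangle_{\mathcal{H}_{\boldsymbol{K}}}, \tag{7}$$

where it is necessary to introduce $\boldsymbol{c}$ because the result of the inner product of two functions must be a scalar. A simple example of a matrix kernel is $\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}') = k(\boldsymbol{x}, \boldsymbol{x}') \boldsymbol{I}$, where $\boldsymbol{I}$ is the $d \times d$ identity matrix. Its related RKHS is $\mathcal{H}_{\boldsymbol{K}} = \mathcal{H}_k \times \cdots \times \mathcal{H}_k = \mathcal{H}_k^d$, as used in the original SVGD.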

3.2. SVGD with Matrix-Valued Kernels

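It is now natural to leverage matrix-valued kernels to obtain a generalization of SVGD (see Algorithm 1). The idea is simple: we now optimize $\boldsymbol{\phi}$ in the unit ball of a general vector-valued RKHS $\mathcal{H}_{\boldsymbol{K}}$ with a matrix-valued kernel $\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}')$:

$$\boldsymbol{\phi}_{\boldsymbol{K}}^* = \arg\max_{\boldsymbol{\phi} \in \mathcal{H}_{\boldsymbol{K}}} \left\{ \mathbb{E}_{\boldsymbol{x} \sim q}[\mathcal{P}^\top \boldsymbol{\phi}(\boldsymbol{x})], \quad \text{s.t.} \quad \|\boldsymbol{\phi}\|_{\mathcal{H}_{\boldsymbol{K}}} \le 1 \right\}. \tag{8}$$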

This yields a simple closed form solution similar to (5).

Theorem 1.

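Let $\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}')$ be a matrix-valued positive definite kernel that is continuously differentiable in $\boldsymbol{x}$ and $\boldsymbol{x}'$. Then the optimal $\boldsymbol{\phi}_{\boldsymbol{K}}^*$ in (8) is

$$\boldsymbol{\phi}_{\boldsymbol{K}}^*(\cdot) \propto \mathbb{E}_{\boldsymbol{x} \sim q}[\boldsymbol{K}(\cdot, \boldsymbol{x}) \mathcal{P}] = \mathbb{E}_{\boldsymbol{x} \sim q}[\boldsymbol{K}(\cdot, \boldsymbol{x}) \nabla_{\boldsymbol{x}} \log p(\boldsymbol{x}) + \boldsymbol{K}(\cdot, \boldsymbol{x}) \nabla_{\boldsymbol{x}}], \tag{9}$$

Algorithm 1. Stein Variational Gradient Descent with Matrix-valued Kernels (Matrix SVGD)

Input: A (possibly unnormalized) differentiable density function $p(\boldsymbol{x})$ in $\mathbb{R}^d$. A matrix-valued positive definite kernel $\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}')$. Step size $\epsilon$.
Goal: Find a set of particles $\{\boldsymbol{x}_i\}_{i=1}^n$ to represent the distribution $p$.
Initialize a set of particles $\{\boldsymbol{x}_i\}_{i=1}^n$, e.g., by drawing from some simple distribution.
repeat
    $$\boldsymbol{x}_i \leftarrow \boldsymbol{x}_i + \frac{\epsilon}{n} \sum_{j=1}^n \left[ \boldsymbol{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) \nabla_{\boldsymbol{x}_j} \log p(\boldsymbol{x}_j) + \boldsymbol{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) \nabla_{\boldsymbol{x}_j} \right],$$
    where $\boldsymbol{K}(\cdot, \boldsymbol{x}) \nabla_{\boldsymbol{x}}$ is formally defined as the product of matrix $\boldsymbol{K}(\cdot, \boldsymbol{x})$ and vector $\nabla_{\boldsymbol{x}}$. The $\ell$-th element of $\boldsymbol{K}(\cdot, \boldsymbol{x}) \nabla_{\boldsymbol{x}}$ is $(\boldsymbol{K}(\cdot, \boldsymbol{x}) \nabla_{\boldsymbol{x}})_\ell = \sum_{m=1}^d \nabla_{x^m} K_{\ell,m}(\cdot, \boldsymbol{x})$; see also (10).
until convergence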

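where the Stein operator $\mathcal{P}$ and derivative operator $\nabla_{\boldsymbol{x}}$ are again formally viewed as $\mathbb{R}^d$-valued column vectors, and $\boldsymbol{K}(\cdot, \boldsymbol{x}) \mathcal{P}$ and $\boldsymbol{K}(\cdot, \boldsymbol{x}) \nabla_{\boldsymbol{x}}$ are interpreted by the matrix multiplication rule. Therefore, $\boldsymbol{K}(\cdot, \boldsymbol{x}) \mathcal{P}$ is an $\mathbb{R}^d$-valued column vector, whose $\ell$-th element is defined by

$$(\boldsymbol{K}(\cdot, \boldsymbol{x}) \mathcal{P})_\ell = \sum_{m=1}^d \left( K_{\ell,m}(\cdot, \boldsymbol{x}) \nabla_{x^m} \log p(\boldsymbol{x}) + \nabla_{x^m} K_{\ell,m}(\cdot, \boldsymbol{x}) \right), \tag{10}$$

where $K_{\ell,m}(\boldsymbol{x}, \boldsymbol{x}')$ denotes the $(\ell, m)$-element of matrix $\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}')$ and $x^m$ the $m$-th element of $\boldsymbol{x}$.

Similar to the case of standard SVGD, recursively applying the optimal transform $\boldsymbol{\phi}_{\boldsymbol{K}}^*$ on the particles yields a general SVGD algorithm shown in Algorithm 1, which we call matrix SVGD.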
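One way to organize the update of Algorithm 1 in code is to let the kernel return both the matrix values and the divergence-like term from (10). The sketch below is illustrative only: `matrix_svgd_step`, the `kernel(X)` interface, and the example kernel $\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{A}\, k(\boldsymbol{x}, \boldsymbol{x}')$ with a fixed positive definite matrix `A` are hypothetical names and choices, not part of the paper's code.

```python
import numpy as np

def matrix_svgd_step(X, grad_logp, kernel, eps=0.1):
    """One update of Algorithm 1 for a user-supplied matrix-valued kernel.

    kernel(X) must return
      K   with shape (n, n, d, d): K[i, j] = K(x_i, x_j)
      div with shape (n, n, d):    div[i, j, l] = sum_m d K_{l,m}(x_i, x_j) / d x_j^m
    """
    n = X.shape[0]
    K, div = kernel(X)
    G = grad_logp(X)                                  # (n, d), row j is grad log p(x_j)
    drive = np.einsum("ijlm,jm->il", K, G)            # sum_j K(x_i, x_j) grad log p(x_j)
    repulse = div.sum(axis=1)                         # sum_j K(x_i, x_j) grad_{x_j}
    return X + eps * (drive + repulse) / n

def scaled_rbf_kernel(A, h=1.0):
    """Hypothetical example kernel K(x, x') = A * exp(-||x - x'||^2 / (2h))."""
    def kernel(X):
        diff = X[:, None, :] - X[None, :, :]          # diff[i, j] = x_i - x_j
        k = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h))
        grad_k = diff / h * k[:, :, None]             # d k(x_i, x_j) / d x_j
        K = k[:, :, None, None] * A                   # (n, n, d, d)
        div = grad_k @ A.T                            # A @ grad_k[i, j] for every pair
        return K, div
    return kernel
```

With `A` equal to the identity matrix, this reduces to the vanilla SVGD update; storing the full `(n, n, d, d)` array is only practical for small `n` and `d`.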

Parallel to vanilla SVGD, the gradient of matrix SVGD in (9) consists of two parts that account for optimization and diversity, respectively: the first part is a weighted average of the gradient $\nabla_{\boldsymbol{x}} \log p(\boldsymbol{x})$ multiplied by a matrix-valued kernel $\boldsymbol{K}(\cdot, \boldsymbol{x})$; the other part consists of the gradient of the matrix-valued kernel $\boldsymbol{K}$, which, like standard SVGD, serves as a repulsive force to keep the particles away from each other to reflect the uncertainty captured in distribution $p$.

Matrix SVGD includes various previous variants of SVGD as special cases. The vanilla SVGD corresponds to the case when $\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}') = k(\boldsymbol{x}, \boldsymbol{x}') \boldsymbol{I}$, with $\boldsymbol{I}$ as the $d \times d$ identity matrix; the gradient-free SVGD of Han & Liu (2018) can be treated as the case when $\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}') = k(\boldsymbol{x}, \boldsymbol{x}') w(\boldsymbol{x}) w(\boldsymbol{x}') \boldsymbol{I}$, where $w(\boldsymbol{x})$ is an importance weight function; the graphical SVGD of Wang et al. (2018); Zhuo et al. (2018) corresponds to a diagonal matrix-valued kernel: $\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}') = \mathrm{diag}[\{k_\ell(\boldsymbol{x}, \boldsymbol{x}')\}_{\ell=1}^d]$, where each $k_\ell(\boldsymbol{x}, \boldsymbol{x}')$ is a "local" scalar-valued kernel function related to the $\ell$-th coordinate $x^\ell$ of vector $\boldsymbol{x}$.

3.3. Matrix-Valued Kernels and Change of Variables

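It is well known that preconditioned gradient descent can be interpreted as applying standard gradient descent on a reparameterization of the variables. For example, let $\boldsymbol{y} = \boldsymbol{Q}^{1/2} \boldsymbol{x}$, where $\boldsymbol{Q}$ is a positive definite matrix; then $\log p(\boldsymbol{x}) = \log p(\boldsymbol{Q}^{-1/2} \boldsymbol{y})$. Applying gradient descent on $\boldsymbol{y}$ and transforming it back to the updates on $\boldsymbol{x}$ yields a preconditioned gradient update $\boldsymbol{x} \leftarrow \boldsymbol{x} + \epsilon \boldsymbol{Q}^{-1} \nabla_{\boldsymbol{x}} \log p(\boldsymbol{x})$.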
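The equivalence between the reparameterized update and the preconditioned update can be checked numerically. The snippet below is a small sanity check under assumed inputs: the quadratic target, the random positive definite `Q`, and the step size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
B = rng.normal(size=(d, d))
Q = B @ B.T + d * np.eye(d)              # assumed positive definite preconditioner
A = np.diag([1.0, 10.0, 100.0])          # assumed target: log p(x) = -0.5 x^T A x

def grad_logp(x):
    return -A @ x

# Symmetric square root of Q and its inverse via eigendecomposition.
evals, evecs = np.linalg.eigh(Q)
Q_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
Q_half_inv = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T

x = rng.normal(size=d)
eps = 0.05

# Route 1: plain gradient step on log p(Q^{-1/2} y) in y = Q^{1/2} x, then map back to x.
y = Q_half @ x
grad_y = Q_half_inv @ grad_logp(Q_half_inv @ y)   # chain rule
x_via_y = Q_half_inv @ (y + eps * grad_y)

# Route 2: preconditioned step taken directly in x.
x_precond = x + eps * np.linalg.solve(Q, grad_logp(x))

print(np.allclose(x_via_y, x_precond))
```

Both routes give the same point because $\boldsymbol{Q}^{-1/2}\boldsymbol{Q}^{-1/2} = \boldsymbol{Q}^{-1}$.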

We now extend this idea to SVGD, for which matrix-valued kernels show up naturally as a consequence of change of variables. This justifies the use of matrix-valued kernels and provides guidance on the practical choice of matrix-valued kernels. We start with a basic result of how matrix-valued kernels change under change of variables (see Paulsen & Raghupathi (2016)).

Lemma 2.

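Assume $\mathcal{H}_0$ is an RKHS with a matrix kernel $\boldsymbol{K}_0\colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{d \times d}$. Let $\mathcal{H}$ be the set of functions formed by

$$\boldsymbol{\phi}(\boldsymbol{x}) = \boldsymbol{M}(\boldsymbol{x}) \boldsymbol{\phi}_0(\boldsymbol{t}(\boldsymbol{x})), \quad \forall \boldsymbol{\phi}_0 \in \mathcal{H}_0,$$

where $\boldsymbol{M}\colon \mathbb{R}^d \to \mathbb{R}^{d \times d}$ is a fixed matrix-valued function and we assume $\boldsymbol{M}(\boldsymbol{x})$ is an invertible matrix for all $\boldsymbol{x}$, and $\boldsymbol{t}\colon \mathbb{R}^d \to \mathbb{R}^d$ is a fixed continuously differentiable one-to-one transform on $\mathbb{R}^d$.

For $\forall \boldsymbol{\phi}, \boldsymbol{\phi}' \in \mathcal{H}$, we can identify a unique $\boldsymbol{\phi}_0, \boldsymbol{\phi}_0' \in \mathcal{H}_0$ such that $\boldsymbol{\phi}(\boldsymbol{x}) = \boldsymbol{M}(\boldsymbol{x}) \boldsymbol{\phi}_0(\boldsymbol{t}(\boldsymbol{x}))$ and $\boldsymbol{\phi}'(\boldsymbol{x}) = \boldsymbol{M}(\boldsymbol{x}) \boldsymbol{\phi}_0'(\boldsymbol{t}(\boldsymbol{x}))$. Define the inner product on $\mathcal{H}$ via $\langle \boldsymbol{\phi}, \boldsymbol{\phi}' \rangle_{\mathcal{H}} = \langle \boldsymbol{\phi}_0, \boldsymbol{\phi}_0' \rangle_{\mathcal{H}_0}$; then $\mathcal{H}$ is also a vector-valued RKHS, whose matrix-valued kernel is

$$\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{M}(\boldsymbol{x}) \boldsymbol{K}_0(\boldsymbol{t}(\boldsymbol{x}), \boldsymbol{t}(\boldsymbol{x}')) \boldsymbol{M}(\boldsymbol{x}')^\top.$$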

We now present a key result, which characterizes the change of kernels when we apply invertible variable transforms on the SVGD trajectory.

Theorem 3.

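i) Let $p$ and $q$ be two distributions and $p_0$, $q_0$ the distributions of $\boldsymbol{x}_0 = \boldsymbol{t}(\boldsymbol{x})$ when $\boldsymbol{x}$ is drawn from $p$, $q$, respectively, where $\boldsymbol{t}$ is a continuously differentiable one-to-one map on $\mathbb{R}^d$. Assume $p$ is a continuously differentiable density with Stein operator $\mathcal{P}$, and $\mathcal{P}_0$ the Stein operator of $p_0$. We have

$$\mathbb{E}_{\boldsymbol{x} \sim q_0}[\mathcal{P}_0^\top \boldsymbol{\phi}_0(\boldsymbol{x})] = \mathbb{E}_{\boldsymbol{x} \sim q}[\mathcal{P}^\top \boldsymbol{\phi}(\boldsymbol{x})], \quad \text{with} \quad \boldsymbol{\phi}(\boldsymbol{x}) := \nabla \boldsymbol{t}(\boldsymbol{x})^{-1} \boldsymbol{\phi}_0(\boldsymbol{t}(\boldsymbol{x})), \tag{11}$$

where $\nabla \boldsymbol{t}$ is the Jacobian matrix of $\boldsymbol{t}$.

ii) Therefore, in the asymptotics of infinitesimal step size ($\epsilon \to 0^+$), running SVGD with kernel $\boldsymbol{K}_0$ on $p_0$ is equivalent to running SVGD on $p$ with kernel

$$\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}') = \nabla \boldsymbol{t}(\boldsymbol{x})^{-1} \boldsymbol{K}_0(\boldsymbol{t}(\boldsymbol{x}), \boldsymbol{t}(\boldsymbol{x}')) \nabla \boldsymbol{t}(\boldsymbol{x}')^{-\top},$$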

in the sense that the trajectory of these two SVGD can be mapped to each other by the one-to-one map t (and its inverse).

3.4. Practical Choice of Matrix-Valued Kernels

Theorem 3 suggests a conceptual procedure for constructing proper matrix kernels to incorporate desirable preconditioning information: one can construct a one-to-one map $\boldsymbol{t}$ so that the distribution $p_0$ of $\boldsymbol{x}_0 = \boldsymbol{t}(\boldsymbol{x})$ is an easy-to-sample distribution with a simpler kernel, which can be a standard scalar-valued kernel or a simple diagonal kernel. Practical choices of $\boldsymbol{t}$ often involve rotating $\boldsymbol{x}$ with either the Hessian matrix or the Fisher information matrix, allowing us to incorporate this information into SVGD. In the sequel, we first illustrate this idea for simple Gaussian cases and then discuss practical approaches for non-Gaussian cases.

Constant Preconditioning Matrices

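Consider the simple case when $p$ is multivariate Gaussian, e.g., $\log p(\boldsymbol{x}) = -\frac{1}{2} \boldsymbol{x}^\top \boldsymbol{Q} \boldsymbol{x} + \mathrm{const}$, where $\boldsymbol{Q}$ is a positive definite matrix. In this case, the distribution $p_0$ of the transformed variable $\boldsymbol{t}(\boldsymbol{x}) = \boldsymbol{Q}^{1/2} \boldsymbol{x}$ is the standard Gaussian distribution that can be better approximated with a simpler kernel $\boldsymbol{K}_0(\boldsymbol{x}, \boldsymbol{x}')$, which can be chosen to be the standard RBF kernel suggested in Liu & Wang (2016), the graphical kernel suggested in Wang et al. (2018), or the linear kernels suggested in Liu & Wang (2018). Theorem 3 then suggests using a kernel of the form

$$\boldsymbol{K}_{\boldsymbol{Q}}(\boldsymbol{x}, \boldsymbol{x}') := \boldsymbol{Q}^{-1/2} \boldsymbol{K}_0(\boldsymbol{Q}^{1/2} \boldsymbol{x}, \boldsymbol{Q}^{1/2} \boldsymbol{x}') \boldsymbol{Q}^{-1/2}, \tag{12}$$

in which $\boldsymbol{Q}$ is applied on both the input and the output side. As an example, taking $\boldsymbol{K}_0$ to be the scalar-valued Gaussian RBF kernel gives

$$\boldsymbol{K}_{\boldsymbol{Q}}(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{Q}^{-1} \exp\left( -\frac{1}{2h} \|\boldsymbol{x} - \boldsymbol{x}'\|_{\boldsymbol{Q}}^2 \right), \tag{13}$$

where $\|\boldsymbol{x} - \boldsymbol{x}'\|_{\boldsymbol{Q}}^2 := (\boldsymbol{x} - \boldsymbol{x}')^\top \boldsymbol{Q} (\boldsymbol{x} - \boldsymbol{x}')$ and $h$ is a bandwidth parameter. Define $\boldsymbol{K}_{0,\boldsymbol{Q}}(\boldsymbol{x}, \boldsymbol{x}') := \boldsymbol{K}_0(\boldsymbol{Q}^{1/2} \boldsymbol{x}, \boldsymbol{Q}^{1/2} \boldsymbol{x}')$. One can show that the SVGD direction of the kernel in (12) equals

$$\boldsymbol{\phi}_{\boldsymbol{K}_{\boldsymbol{Q}}}^*(\cdot) = \boldsymbol{Q}^{-1} \mathbb{E}_{\boldsymbol{x} \sim q}[\nabla \log p(\boldsymbol{x}) \boldsymbol{K}_{0,\boldsymbol{Q}}(\cdot, \boldsymbol{x}) + \boldsymbol{K}_{0,\boldsymbol{Q}}(\cdot, \boldsymbol{x}) \nabla_{\boldsymbol{x}}] = \boldsymbol{Q}^{-1} \boldsymbol{\phi}_{\boldsymbol{K}_{0,\boldsymbol{Q}}}^*(\cdot), \tag{14}$$

which is a linear transform of the SVGD direction of kernel $\boldsymbol{K}_{0,\boldsymbol{Q}}(\boldsymbol{x}, \boldsymbol{x}')$ with matrix $\boldsymbol{Q}^{-1}$.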
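For the Gaussian RBF case, the direction in (14) can be sketched as follows: compute the vanilla SVGD direction with the $\boldsymbol{Q}$-weighted distance from (13), then left-multiply by $\boldsymbol{Q}^{-1}$. The function name `matrix_svgd_constant_step` and the default step size and bandwidth are assumptions for illustration, not the authors' code.

```python
import numpy as np

def matrix_svgd_constant_step(X, grad_logp, Q, eps=0.1, h=1.0):
    """One matrix SVGD step with K_Q(x, x') = Q^{-1} exp(-||x - x'||_Q^2 / (2h))."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]                    # diff[i, j] = x_i - x_j
    Qdiff = diff @ Q                                        # Q (x_i - x_j); Q is symmetric
    k = np.exp(-np.sum(diff * Qdiff, axis=-1) / (2.0 * h))  # K_{0,Q}(x_i, x_j)
    grad_k = Qdiff / h * k[:, :, None]                      # gradient of K_{0,Q}(x_i, x_j) in x_j
    phi0 = (k @ grad_logp(X) + grad_k.sum(axis=1)) / n      # SVGD direction of K_{0,Q}
    phi = np.linalg.solve(Q, phi0.T).T                      # left-multiply by Q^{-1}, as in Eq. (14)
    return X + eps * phi
```

Because $\boldsymbol{t}(\boldsymbol{x}) = \boldsymbol{Q}^{1/2}\boldsymbol{x}$ is linear, one step of this update coincides with one vanilla SVGD step taken in the transformed variable and mapped back, which offers a convenient way to test an implementation.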

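In practice, when $p$ is non-Gaussian, we can construct $\boldsymbol{Q}$ by averaging over the particles. For example, denoting by $\boldsymbol{H}(\boldsymbol{x}) = -\nabla_{\boldsymbol{x}}^2 \log p(\boldsymbol{x})$ the negative Hessian matrix at $\boldsymbol{x}$, we can construct $\boldsymbol{Q}$ by

$$\boldsymbol{Q} = \sum_{i=1}^n \boldsymbol{H}(\boldsymbol{x}_i)/n, \tag{15}$$

where $\{\boldsymbol{x}_i\}_{i=1}^n$ are the particles from the previous iteration. We may replace $\boldsymbol{H}$ with the Fisher information matrix to obtain a natural-gradient-like variant of SVGD.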
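As a rough sketch of (15), the averaged preconditioner can be computed from any routine that returns the negative Hessian at a point. Here the Hessian is estimated by central finite differences of the score function, which is an assumption made to keep the example self-contained; an exact Hessian or a Fisher matrix would be used in the same way. The names `neg_hessian_fd` and `average_preconditioner` are hypothetical.

```python
import numpy as np

def neg_hessian_fd(grad_logp, x, delta=1e-5):
    """Central finite-difference estimate of H(x) = -grad^2 log p(x) for a single point x."""
    d = x.size
    H = np.zeros((d, d))
    for m in range(d):
        e = np.zeros(d)
        e[m] = delta
        H[:, m] = -(grad_logp(x + e) - grad_logp(x - e)) / (2.0 * delta)
    return 0.5 * (H + H.T)   # symmetrize to remove finite-difference noise

def average_preconditioner(grad_logp, X):
    """Q = (1/n) sum_i H(x_i), as in Eq. (15)."""
    return np.mean([neg_hessian_fd(grad_logp, x) for x in X], axis=0)
```

For a non-log-concave target the averaged negative Hessian is not guaranteed to be positive definite, so its eigenvalues are worth checking before it is used in (13).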

Point-wise Preconditioning

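A constant preconditioning matrix cannot reflect different curvature or geometric information at different points. A simple heuristic to address this limitation is to replace the constant matrix $\boldsymbol{Q}$ with a point-wise matrix function $\boldsymbol{Q}(\boldsymbol{x})$; this motivates a kernel of the form

$$\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{Q}^{-1/2}(\boldsymbol{x}) \boldsymbol{K}_0(\boldsymbol{Q}^{1/2}(\boldsymbol{x}) \boldsymbol{x}, \boldsymbol{Q}^{1/2}(\boldsymbol{x}') \boldsymbol{x}') \boldsymbol{Q}^{-1/2}(\boldsymbol{x}').$$

Unfortunately, this choice may yield expensive computation and difficult implementation in practice, because the SVGD update involves taking the gradient of the kernel $\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}')$, which would need to differentiate through the matrix-valued function $\boldsymbol{Q}(\boldsymbol{x})$. When $\boldsymbol{Q}(\boldsymbol{x})$ equals the Hessian matrix, for example, it involves taking the third-order derivative of $\log p(\boldsymbol{x})$, yielding an unnatural algorithm.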

Mixture Preconditioning

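We instead propose a more practical approach to achieve point-wise preconditioning with a much simpler algorithm. The idea is to use a weighted combination of several constant preconditioning matrices. This involves leveraging a set of anchor points $\{\boldsymbol{z}_\ell\}_{\ell=1}^m \subset \mathbb{R}^d$, each of which is associated with a preconditioning matrix $\boldsymbol{Q}_\ell = \boldsymbol{Q}(\boldsymbol{z}_\ell)$ (e.g., their Hessian or Fisher information matrices). In practice, the anchor points $\{\boldsymbol{z}_\ell\}_{\ell=1}^m$ can be conveniently set to be the same as the particles $\{\boldsymbol{x}_i\}_{i=1}^n$. We then construct a kernel by

$$\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}') = \sum_{\ell=1}^m \boldsymbol{K}_{\boldsymbol{Q}_\ell}(\boldsymbol{x}, \boldsymbol{x}') w_\ell(\boldsymbol{x}) w_\ell(\boldsymbol{x}'), \tag{16}$$

where $\boldsymbol{K}_{\boldsymbol{Q}_\ell}(\boldsymbol{x}, \boldsymbol{x}')$ is defined in (12) or (13), and $w_\ell(\boldsymbol{x})$ is a positive scalar-valued function that decides the contribution of kernel $\boldsymbol{K}_{\boldsymbol{Q}_\ell}$ at point $\boldsymbol{x}$. Here $w_\ell(\boldsymbol{x})$ should be viewed as a mixture probability, and hence should satisfy $\sum_{\ell=1}^m w_\ell(\boldsymbol{x}) = 1$ for all $\boldsymbol{x}$. In our empirical studies, we take $w_\ell(\boldsymbol{x})$ as the Gaussian mixture probability from the anchor points:

$$w_\ell(\boldsymbol{x}) = \frac{\mathcal{N}(\boldsymbol{x}; \boldsymbol{z}_\ell, \boldsymbol{Q}_\ell^{-1})}{\sum_{\ell'=1}^m \mathcal{N}(\boldsymbol{x}; \boldsymbol{z}_{\ell'}, \boldsymbol{Q}_{\ell'}^{-1})}, \qquad \mathcal{N}(\boldsymbol{x}; \boldsymbol{z}_\ell, \boldsymbol{Q}_\ell^{-1}) := \frac{1}{Z_\ell} \exp\left( -\frac{1}{2} \|\boldsymbol{x} - \boldsymbol{z}_\ell\|_{\boldsymbol{Q}_\ell}^2 \right), \tag{17}$$

where $Z_\ell = (2\pi)^{d/2} \det(\boldsymbol{Q}_\ell)^{-1/2}$. In this way, each point $\boldsymbol{x}$ is mostly influenced by the anchor point closest to it, allowing us to apply different preconditioning for different points. Importantly, the SVGD update direction related to the kernel in (16) has a simple and easy-to-implement form:

$$\boldsymbol{\phi}_{\boldsymbol{K}}^*(\cdot) = \sum_{\ell=1}^m w_\ell(\cdot)\, \mathbb{E}_{\boldsymbol{x} \sim q}\left[ (w_\ell(\boldsymbol{x}) \boldsymbol{K}_{\boldsymbol{Q}_\ell}(\cdot, \boldsymbol{x})) \mathcal{P} \right] = \sum_{\ell=1}^m w_\ell(\cdot)\, \boldsymbol{\phi}_{w_\ell \boldsymbol{K}_{\boldsymbol{Q}_\ell}}^*(\cdot), \tag{18}$$

which is a weighted sum of a number of SVGD directions with constant preconditioning matrices (but with an asymmetric kernel $w_\ell(\boldsymbol{x}) \boldsymbol{K}_{\boldsymbol{Q}_\ell}(\cdot, \boldsymbol{x})$).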
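A possible NumPy sketch of (16)-(18) with the Gaussian RBF kernel (13) is given below. It is an illustration under stated assumptions rather than the released implementation: the function names, array layout, and defaults are hypothetical, and the gradient of $w_\ell$ is obtained from the softmax form of (17), where the constant $(2\pi)^{d/2}$ cancels.

```python
import numpy as np

def mixture_weights(X, Z, Qs):
    """w[l, i] = w_l(x_i) from Eq. (17), and grad_w[l, i] = grad_x w_l(x_i)."""
    diff = X[None, :, :] - Z[:, None, :]                     # (m, n, d): x_i - z_l
    Qdiff = np.einsum("lab,lib->lia", Qs, diff)              # Q_l (x_i - z_l)
    _, logdet = np.linalg.slogdet(Qs)                        # log det Q_l, shape (m,)
    log_N = -0.5 * np.sum(diff * Qdiff, axis=-1) + 0.5 * logdet[:, None]
    log_N -= log_N.max(axis=0, keepdims=True)                # stabilize the softmax over l
    w = np.exp(log_N)
    w /= w.sum(axis=0, keepdims=True)
    grad_log_N = -Qdiff                                      # grad_x log N(x_i; z_l, Q_l^{-1})
    mean_grad = np.sum(w[:, :, None] * grad_log_N, axis=0)   # (n, d)
    grad_w = w[:, :, None] * (grad_log_N - mean_grad[None])
    return w, grad_w

def mixture_svgd_step(X, grad_logp, Z, Qs, eps=0.1, h=1.0):
    """One matrix SVGD step with the mixture kernel of Eq. (16), computed via Eq. (18)."""
    n = X.shape[0]
    G = grad_logp(X)
    w, grad_w = mixture_weights(X, Z, Qs)
    diff = X[:, None, :] - X[None, :, :]                     # diff[i, j] = x_i - x_j
    phi = np.zeros_like(X)
    for l, Q in enumerate(Qs):
        Qdiff = diff @ Q
        k = np.exp(-np.sum(diff * Qdiff, axis=-1) / (2.0 * h))   # k_l(x_i, x_j)
        # Stein operator applied to the weighted kernel w_l(x_j) k_l(x_i, x_j):
        drive = (k * w[l][None, :]) @ G
        grad_wk = grad_w[l][None] + w[l][None, :, None] * Qdiff / h
        repulse = np.sum(k[:, :, None] * grad_wk, axis=1)
        phi_l = np.linalg.solve(Q, (drive + repulse).T).T / n    # apply Q_l^{-1}
        phi += w[l][:, None] * phi_l
    return X + eps * phi
```

Setting `Z = X` and `Qs` to the per-particle Hessian or Fisher matrices corresponds to the choice of anchor points described above; with a single anchor the step reduces to the constant-preconditioner update of (14).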

A Remark on Stein Variational Newton (SVN)

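Detommaso et al. (2018) provided a Newton-like variation of SVGD. It is motivated by an intractable functional Newton framework, and arrives at a practical algorithm using a series of approximation steps. The update of SVN is

$$\boldsymbol{x}_i \leftarrow \boldsymbol{x}_i + \epsilon \tilde{\boldsymbol{H}}_i^{-1} \boldsymbol{\phi}_k^*(\boldsymbol{x}_i), \quad \forall i = 1, \ldots, n, \tag{19}$$

where $\boldsymbol{\phi}_k^*(\cdot)$ is the standard SVGD gradient, and $\tilde{\boldsymbol{H}}_i$ is a Hessian-like matrix associated with particle $\boldsymbol{x}_i$, defined by

$$\tilde{\boldsymbol{H}}_i = \mathbb{E}_{\boldsymbol{x} \sim q}\left[ \boldsymbol{H}(\boldsymbol{x}) k(\boldsymbol{x}, \boldsymbol{x}_i)^2 + (\nabla_{\boldsymbol{x}_i} k(\boldsymbol{x}, \boldsymbol{x}_i))^{\otimes 2} \right],$$

where $\boldsymbol{H}(\boldsymbol{x}) = -\nabla_{\boldsymbol{x}}^2 \log p(\boldsymbol{x})$ and $\boldsymbol{w}^{\otimes 2} := \boldsymbol{w}\boldsymbol{w}^\top$. Due to the approximation introduced in the derivation of SVN, it does not correspond to a standard functional gradient flow like SVGD (unless $\tilde{\boldsymbol{H}}_i = \boldsymbol{Q}$ for all $i$, in which case it reduces to using a constant preconditioning matrix on SVGD like (14)). SVN can be heuristically viewed as a "hard" variant of (18), which assigns each particle its own preconditioning matrix with probability one, but the mathematical forms do not match precisely. On the other hand, it is useful to note that the set of fixed points of the SVN update (19) is identical to that of the standard SVGD update with $\boldsymbol{\phi}_k^*(\cdot)$, provided all $\tilde{\boldsymbol{H}}_i$ are positive definite matrices. This is because at the fixed points of (19), we have $\tilde{\boldsymbol{H}}_i^{-1} \boldsymbol{\phi}_k^*(\boldsymbol{x}_i) = 0$ for $\forall i = 1, \ldots, n$, which is equivalent to $\boldsymbol{\phi}_k^*(\boldsymbol{x}_i) = 0$, $\forall i$, when all the $\tilde{\boldsymbol{H}}_i$ are positive definite. Therefore, SVN can be justified as an alternative fixed point iteration method to achieve the same set of fixed points as the standard SVGD.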

4. Experiments

We demonstrate the effectiveness of our matrix SVGD on various practical tasks. We start with a toy example and then proceed to more challenging tasks that involve logistic regression, neural networks and recurrent neural networks. For our method, we take the preconditioning matrices to be either Hessian or Fisher information matrices, depending on the application. For large scale Fisher matrices in (recurrent) neural networks, we leverage the Kronecker-factored (KFAC) approximation by Martens & Grosse (2015); Martens et al. (2018) to enable efficient computation. We use RBF kernel for vanilla SVGD. The kernel K0(x,x) in our matrix SVGD (see (12) and (13)) is also taken to be Gaussian RBF. Following Liu & Wang (2016), we choose the bandwidth of the Gaussian RBF kernels using the standard median trick and use Adagrad (Duchi et al., 2011) for stepsize. Our code is available at https://github.com/dilinwang820/matrix_ssvgd.
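The median trick mentioned above can be sketched as follows. The exact constant differs between implementations; the version here is an assumption that picks $h$ so that the kernel $\exp(-\|\boldsymbol{x} - \boldsymbol{x}'\|^2/(2h))$ equals $1/(n+1)$ at the median pairwise distance, and it takes the median over distinct pairs only. The function name `median_bandwidth` is hypothetical.

```python
import numpy as np

def median_bandwidth(X):
    """Median-heuristic bandwidth h for k(x, x') = exp(-||x - x'||^2 / (2h))."""
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    med_sq = np.median(sq_dists[np.triu_indices(n, k=1)])   # median over distinct pairs
    return med_sq / (2.0 * np.log(n + 1.0))
```

For the preconditioned kernel (13), the same heuristic could be applied to the $\boldsymbol{Q}$-weighted distances instead of the Euclidean ones.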

The algorithms we test are summarized here:

Vanilla SVGD, using the code by Liu & Wang (2016);

Matrix-SVGD (average), using the constant preconditioning matrix kernel in (13), with $\boldsymbol{Q}$ set to either the average of the Hessian matrices or of the Fisher matrices of the particles (see (15));

Matrix-SVGD (mixture), using the mixture preconditioning matrix kernel in (16), where we pick the anchor points to be the particles themselves, that is, $\{\boldsymbol{z}_\ell\}_{\ell=1}^m = \{\boldsymbol{x}_i\}_{i=1}^n$;

Stein variational Newton (SVN), based on the implementation of Detommaso et al. (2018);

Preconditioned Stochastic Langevin Dynamics (pSGLD), which is a variant of SGLD (Li et al., 2016), using a diagonal approximation of Fisher information as the preconditioning matrix.

4.1. Two-Dimensional Toy Examples

Settings

We start with illustrating our method using a Gaussian mixture toy model (Figure 1), with exact Hessian matrices for preconditioning. For fair comparison, we search the best learning rate for all algorithms exhaustively. We use 50 particles for all the cases. We use the same initialization for all methods with the same random seeds.

Figure 1:

Panels (a)-(e) show the particles obtained by various methods at the 30-th iteration. Panel (f) plots the log MMD (Gretton et al., 2012) vs. training iteration starting from the 10-th iteration. We use the standard RBF kernel for evaluating MMD.

Results

Figure 1 shows the results for 2D toy examples. Appendix B shows more visualization and results on more examples. We can see that methods with Hessian information generally converge faster than vanilla SVGD, and Matrix-SVGD (mixture) yields the best performance.
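The MMD metric used in these comparisons can be estimated from two sample sets. The sketch below uses the biased (V-statistic) estimator with an RBF kernel; the fixed bandwidth `h`, the function names, and the use of exact samples from the target as the reference set `Y` are assumptions, since the evaluation details are not spelled out here.

```python
import numpy as np

def rbf_gram(A, B, h):
    """Gram matrix with entries exp(-||a - b||^2 / (2h))."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * h))

def mmd_squared(X, Y, h=1.0):
    """Biased (V-statistic) estimate of MMD^2 between sample sets X and Y."""
    return (rbf_gram(X, X, h).mean()
            + rbf_gram(Y, Y, h).mean()
            - 2.0 * rbf_gram(X, Y, h).mean())
```

Here `X` would hold the particles and `Y` the reference samples; taking `np.log` of the result (or of its square root) gives a curve comparable in spirit to the log MMD plot.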

4.2. Bayesian Logistic Regression

Settings

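We consider the Bayesian logistic regression model for binary classification. Let $\mathcal{D} = \{(\boldsymbol{x}_j, y_j)\}_{j=1}^N$ be a dataset with feature vector $\boldsymbol{x}_j$ and binary label $y_j \in \{0, 1\}$. The distribution of interest is

$$p(\boldsymbol{\theta} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{\theta})\, p_0(\boldsymbol{\theta}) \quad \text{with} \quad p(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{j=1}^N \left[ y_j \sigma(\boldsymbol{\theta}^\top \boldsymbol{x}_j) + (1 - y_j) \sigma(-\boldsymbol{\theta}^\top \boldsymbol{x}_j) \right],$$

where $\sigma(z) := 1/(1 + \exp(-z))$, and $p_0(\boldsymbol{\theta})$ is the prior distribution, which we set to be the standard normal $\mathcal{N}(\boldsymbol{\theta}; \boldsymbol{0}, \boldsymbol{I})$. The goal is to approximate the posterior distribution $p(\boldsymbol{\theta} \mid \mathcal{D})$ with a set of particles $\{\boldsymbol{\theta}_i\}_{i=1}^n$, and then use it to predict the class labels for testing data points. We compare our methods with preconditioned stochastic gradient Langevin dynamics (pSGLD) (Li et al., 2016). Because pSGLD is a sequential algorithm, for fair comparison, we obtain the samples of pSGLD by running $n$ parallel chains of pSGLD for estimation. The preconditioning matrix in both pSGLD and matrix SVGD is taken to be the Fisher information matrix.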
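The log posterior above and the gradient needed by SVGD can be written compactly. The following is a minimal sketch with hypothetical function names; the mini-batch rescaling by `N / batch_size` is a common stochastic-gradient convention assumed here rather than a detail taken from the paper.

```python
import numpy as np

def log_sigmoid(z):
    return -np.logaddexp(0.0, -z)          # log sigma(z), numerically stable

def log_posterior(theta, X, y):
    """Unnormalized log p(theta | D) with a standard normal prior."""
    z = X @ theta
    loglik = np.sum(y * log_sigmoid(z) + (1.0 - y) * log_sigmoid(-z))
    return loglik - 0.5 * theta @ theta

def grad_log_posterior(theta, X, y, N=None):
    """Gradient of log_posterior; if (X, y) is a mini-batch, pass the full data size N."""
    scale = 1.0 if N is None else N / X.shape[0]
    sigma = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return scale * X.T @ (y - sigma) - theta
```

For binary labels, `y * log_sigmoid(z) + (1 - y) * log_sigmoid(-z)` equals the logarithm of the bracketed likelihood term, since exactly one of the two summands is active for each data point.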

We consider the binary Covtype dataset with 581,012 data points and 54 features. We partition the data into 70% for training, 10% for validation and 20% for testing. We use the Adagrad optimizer with a mini-batch size of 256. We choose the best learning rate from [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0] for each method on the validation set. For all the experiments and algorithms, we use n = 20 particles. Results are averaged over 20 random trials.

Results

Figure 2 (a) and (b) show the test accuracy and test log-likelihood of different algorithms. We can see that both Matrix-SVGD (average) and Matrix-SVGD (mixture) converge much faster than both vanilla SVGD and pSGLD, reaching an accuracy of 0.75 in less than 500 iterations.

Figure 2:

(a)-(b) Results of Bayesian Logistic regression on the Covtype dataset. (c)-(d) Average test RMSE and log-likelihood vs. training batches on the Protein dataset for Bayesian neural regression.

4.3. Neural Network Regression

Settings

We apply our matrix SVGD on Bayesian neural network regression on UCI datasets. For all experiments, we use a two-layer neural network with 50 hidden units with ReLU activation functions. We assign isotropic Gaussian priors $\mathcal{N}(\boldsymbol{0}, \alpha \boldsymbol{I})$ to the neural network weights. We set $\alpha = 0.5$. All datasets are randomly partitioned into 90% for training and 10% for testing. All results are averaged over 20 random trials, except for Protein and Year, on which 5 random trials are performed. We use n = 10 particles for all methods. We use the Adam optimizer with a mini-batch size of 100; for large datasets such as Year, we set the mini-batch size to be 1000. We use the Fisher information matrix with Kronecker-factored (KFAC) approximation for preconditioning.

Results

Table 1 shows the performance in terms of the test RMSE and the test log-likelihood. We can see that both Matrix-SVGD (average) and Matrix-SVGD (mixture), which use second-order information, achieve better performance than vanilla SVGD. Matrix-SVGD (mixture) yields the best performance for both test RMSE and test log-likelihood in most cases. Figure 2 (c)-(d) show that both variants of Matrix-SVGD converge much faster than vanilla SVGD and pSGLD on the Protein dataset.

Table 1:

Average test RMSE and log-likelihood in test data for UCI regression benchmarks.

Test RMSE

| Dataset | pSGLD | Vanilla SVGD | Matrix-SVGD (average) | Matrix-SVGD (mixture) |
|---|---|---|---|---|
| Boston | 2.699±0.155 | 2.785±0.169 | 2.898±0.184 | 2.717±0.166 |
| Concrete | 5.053±0.124 | 5.027±0.116 | 4.869±0.124 | 4.721±0.111 |
| Energy | 0.985±0.024 | 0.889±0.024 | 0.795±0.025 | 0.868±0.025 |
| Kin8nm | 0.091±0.001 | 0.093±0.001 | 0.092±0.001 | 0.090±0.001 |
| Naval | 0.002±0.000 | 0.004±0.000 | 0.001±0.000 | 0.000±0.000 |
| Combined | 4.042±0.034 | 4.088±0.033 | 4.056±0.033 | 4.029±0.033 |
| Wine | 0.641±0.009 | 0.645±0.009 | 0.637±0.008 | 0.637±0.009 |
| Protein | 4.300±0.018 | 4.186±0.017 | 3.997±0.018 | 3.852±0.014 |
| Year | 8.630±0.007 | 8.686±0.010 | 8.637±0.005 | 8.594±0.009 |

Test Log-Likelihood

| Dataset | pSGLD | Vanilla SVGD | Matrix-SVGD (average) | Matrix-SVGD (mixture) |
|---|---|---|---|---|
| Boston | −2.847±0.182 | −2.706±0.158 | −2.669±0.141 | −2.861±0.207 |
| Concrete | −3.206±0.056 | −3.064±0.034 | −3.150±0.054 | −3.207±0.071 |
| Energy | −1.395±0.029 | −1.315±0.020 | −1.135±0.026 | −1.249±0.036 |
| Kin8nm | 0.973±0.010 | 0.964±0.012 | 0.956±0.011 | 0.975±0.011 |
| Naval | 4.535±0.093 | 4.312±0.087 | 5.383±0.081 | 5.639±0.048 |
| Combined | −2.821±0.009 | −2.832±0.009 | −2.824±0.009 | −2.817±0.009 |
| Wine | −0.984±0.016 | −0.997±0.019 | −0.980±0.016 | −0.988±0.018 |
| Protein | −2.874±0.004 | −2.846±0.003 | −2.796±0.004 | −2.755±0.003 |
| Year | −3.568±0.002 | −3.577±0.002 | −3.569±0.001 | −3.561±0.002 |

4.4. Sentence Classification With Recurrent Neural Networks (RNN)

Settings

We consider the sentence classification task on four datasets: MR (Pang & Lee, 2005), CR (Hu & Liu, 2004), SUBJ (Pang & Lee, 2004), and MPQA (Wiebe et al., 2005). We use a recurrent neural network (RNN) based model, $p(y \mid x) = \mathrm{softmax}(\boldsymbol{w}_y^\top \boldsymbol{h}_{\mathrm{RNN}}(x, \boldsymbol{v}))$, where $x$ is the input sentence, $y$ is a discrete-valued label of the sentence, and $\boldsymbol{w}_y$ is a weight coefficient related to label class $y$. Here $\boldsymbol{h}_{\mathrm{RNN}}(x, \boldsymbol{v})$ is an RNN function with parameter $\boldsymbol{v}$ using a one-layer bidirectional GRU model (Cho et al., 2014) with 50 hidden units. We apply matrix SVGD to infer the posterior of $\boldsymbol{w} = \{\boldsymbol{w}_y \colon \forall y\}$, while updating the RNN weights $\boldsymbol{v}$ using typical stochastic gradient descent. In all experiments, we use the pre-processed text data provided in Gan et al. (2016). For all the datasets, we conduct 10-fold cross-validation for evaluation. We use n = 10 particles for all the methods. For training, we use a mini-batch size of 50 and run all the algorithms for 20 epochs. We use the Fisher information matrix for preconditioning.

Results

Table 2 shows the results of testing classification errors. We can see that Matrix-SVGD (mixture) generally performs the best among all algorithms.

Table 2:

Sentence classification errors measured with four benchmarks.

| Method | MR | CR | SUBJ | MPQA |
|---|---|---|---|---|
| SGLD | 20.52 | 18.65 | 7.66 | 11.24 |
| pSGLD | 19.75 | 17.50 | 6.99 | 10.80 |
| Vanilla SVGD | 19.73 | 18.07 | 6.67 | 10.58 |
| Matrix-SVGD (average) | 19.22 | 17.29 | 6.76 | 10.79 |
| Matrix-SVGD (mixture) | 19.09 | 17.13 | 6.59 | 10.71 |

5. Conclusion

We present a generalization of SVGD by leveraging general matrix-valued positive definite kernels, which allows us to flexibly incorporate various preconditioning matrices, including Hessian and Fisher information matrices, to improve exploration in the probability landscape. We test our practical algorithms on various practical tasks and demonstrate their efficiency compared to various existing methods.

Acknowledgement

This work is supported in part by NSF CRII 1830161 and NSF CAREER 1846421. We would like to acknowledge Google Cloud and Amazon Web Services (AWS) for their support.

A. Proof

Proof of Theorem 1

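Let $\boldsymbol{e}_\ell$ be the column vector with 1 in the $\ell$-th coordinate and 0 elsewhere. By the RKHS reproducing property (7) we have

$$\begin{aligned}
\mathbb{E}_{\boldsymbol{x} \sim q}[\mathcal{P}^\top \boldsymbol{\phi}(\boldsymbol{x})]
&= \mathbb{E}_{\boldsymbol{x} \sim q}\left[ \nabla_{\boldsymbol{x}} \log p(\boldsymbol{x})^\top \boldsymbol{\phi}(\boldsymbol{x}) + \nabla_{\boldsymbol{x}}^\top \boldsymbol{\phi}(\boldsymbol{x}) \right] \\
&= \mathbb{E}_{\boldsymbol{x} \sim q}\left[ \boldsymbol{\phi}(\boldsymbol{x})^\top \nabla_{\boldsymbol{x}} \log p(\boldsymbol{x}) + \sum_{\ell=1}^d \nabla_{x^\ell}\, \boldsymbol{\phi}(\boldsymbol{x})^\top \boldsymbol{e}_\ell \right] \\
&= \mathbb{E}_{\boldsymbol{x} \sim q}\left[ \langle \boldsymbol{\phi}(\cdot), \boldsymbol{K}(\cdot, \boldsymbol{x}) \nabla_{\boldsymbol{x}} \log p(\boldsymbol{x}) \rangle_{\mathcal{H}_{\boldsymbol{K}}} + \sum_{\ell=1}^d \nabla_{x^\ell} \langle \boldsymbol{\phi}(\cdot), \boldsymbol{K}(\cdot, \boldsymbol{x}) \boldsymbol{e}_\ell \rangle_{\mathcal{H}_{\boldsymbol{K}}} \right] \\
&= \left\langle \boldsymbol{\phi}(\cdot),\ \mathbb{E}_{\boldsymbol{x} \sim q}\left[ \boldsymbol{K}(\cdot, \boldsymbol{x}) \nabla_{\boldsymbol{x}} \log p(\boldsymbol{x}) + \sum_{\ell=1}^d \nabla_{x^\ell} \boldsymbol{K}(\cdot, \boldsymbol{x}) \boldsymbol{e}_\ell \right] \right\rangle_{\mathcal{H}_{\boldsymbol{K}}} \\
&= \left\langle \boldsymbol{\phi}(\cdot),\ \mathbb{E}_{\boldsymbol{x} \sim q}\left[ \boldsymbol{K}(\cdot, \boldsymbol{x}) \nabla_{\boldsymbol{x}} \log p(\boldsymbol{x}) + \boldsymbol{K}(\cdot, \boldsymbol{x}) \nabla_{\boldsymbol{x}} \right] \right\rangle_{\mathcal{H}_{\boldsymbol{K}}} \\
&= \left\langle \boldsymbol{\phi}(\cdot),\ \mathbb{E}_{\boldsymbol{x} \sim q}[\boldsymbol{K}(\cdot, \boldsymbol{x}) \mathcal{P}] \right\rangle_{\mathcal{H}_{\boldsymbol{K}}}.
\end{aligned}$$

The optimization in (8) is hence

$$\max_{\boldsymbol{\phi} \in \mathcal{H}_{\boldsymbol{K}}} \left\langle \boldsymbol{\phi}(\cdot),\ \mathbb{E}_{\boldsymbol{x} \sim q}[\boldsymbol{K}(\cdot, \boldsymbol{x}) \mathcal{P}] \right\rangle_{\mathcal{H}_{\boldsymbol{K}}}, \quad \text{s.t.} \quad \|\boldsymbol{\phi}\|_{\mathcal{H}_{\boldsymbol{K}}} \le 1,$$

whose solution is $\boldsymbol{\phi}^*(\cdot) \propto \mathbb{E}_{\boldsymbol{x} \sim q}[\boldsymbol{K}(\cdot, \boldsymbol{x}) \mathcal{P}]$.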

Proof of Lemma 2

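This is a basic result of RKHS, which can be found in classical textbooks such as Paulsen & Raghupathi (2016). The key idea is to show that $\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}')$ satisfies the reproducing property for $\mathcal{H}$. Recall the reproducing property of $\mathcal{H}_0$:

$$\boldsymbol{\phi}_0(\boldsymbol{x})^\top \boldsymbol{c} = \langle \boldsymbol{\phi}_0, \boldsymbol{K}_0(\cdot, \boldsymbol{x}) \boldsymbol{c} \rangle_{\mathcal{H}_0}, \quad \forall \boldsymbol{c} \in \mathbb{R}^d.$$

Taking $\boldsymbol{\phi}(\boldsymbol{x}) = \boldsymbol{M}(\boldsymbol{x}) \boldsymbol{\phi}_0(\boldsymbol{t}(\boldsymbol{x}))$, we have

$$\boldsymbol{\phi}(\boldsymbol{x})^\top \boldsymbol{c} = \langle \boldsymbol{\phi}_0, \boldsymbol{K}_0(\cdot, \boldsymbol{t}(\boldsymbol{x})) \boldsymbol{M}(\boldsymbol{x})^\top \boldsymbol{c} \rangle_{\mathcal{H}_0} = \langle \boldsymbol{\phi}, \boldsymbol{M}(\cdot) \boldsymbol{K}_0(\boldsymbol{t}(\cdot), \boldsymbol{t}(\boldsymbol{x})) \boldsymbol{M}(\boldsymbol{x})^\top \boldsymbol{c} \rangle_{\mathcal{H}} = \langle \boldsymbol{\phi}, \boldsymbol{K}(\cdot, \boldsymbol{x}) \boldsymbol{c} \rangle_{\mathcal{H}},$$

where the second step follows from $\langle \boldsymbol{\phi}, \boldsymbol{\phi}' \rangle_{\mathcal{H}} = \langle \boldsymbol{\phi}_0, \boldsymbol{\phi}_0' \rangle_{\mathcal{H}_0}$ with $\boldsymbol{\phi}_0'(\cdot) = \boldsymbol{K}_0(\cdot, \boldsymbol{t}(\boldsymbol{x})) \boldsymbol{M}(\boldsymbol{x})^\top \boldsymbol{c}$.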

Proof of Theorem 3

Proof.

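Note that KL divergence is invariant under invertible variable transforms, that is,

$$\mathrm{KL}(q_{[\epsilon\boldsymbol{\phi}]} \,\|\, p) = \mathrm{KL}(q_{[\epsilon\boldsymbol{\phi}]0} \,\|\, p_0), \tag{20}$$

where $p_0$ denotes the distribution of $\boldsymbol{x}_0 = \boldsymbol{t}(\boldsymbol{x})$ when $\boldsymbol{x} \sim p$, and $q_{[\epsilon\boldsymbol{\phi}]0}$ denotes the distribution of $\boldsymbol{x}_0' = \boldsymbol{t}(\boldsymbol{x}')$ when $\boldsymbol{x}' \sim q_{[\epsilon\boldsymbol{\phi}]}$. Recall that $q_{[\epsilon\boldsymbol{\phi}]}$ is defined as the distribution of $\boldsymbol{x}' = \boldsymbol{x} + \epsilon\boldsymbol{\phi}(\boldsymbol{x})$ when $\boldsymbol{x} \sim q$.

Denote by $\boldsymbol{t}^{-1}$ the inverse map of $\boldsymbol{t}$, that is, $\boldsymbol{t}^{-1}(\boldsymbol{t}(\boldsymbol{x})) = \boldsymbol{x}$. We can see that $\boldsymbol{x}_0' \sim q_{[\epsilon\boldsymbol{\phi}]0}$ can be obtained by

$$\begin{aligned}
\boldsymbol{x}_0' &= \boldsymbol{t}(\boldsymbol{x}') && \text{// } \boldsymbol{x}' \sim q_{[\epsilon\boldsymbol{\phi}]} \\
&= \boldsymbol{t}(\boldsymbol{x} + \epsilon\boldsymbol{\phi}(\boldsymbol{x})) && \text{// } \boldsymbol{x} \sim q \\
&= \boldsymbol{t}\big(\boldsymbol{t}^{-1}(\boldsymbol{x}_0) + \epsilon\boldsymbol{\phi}(\boldsymbol{t}^{-1}(\boldsymbol{x}_0))\big) && \text{// } \boldsymbol{x}_0 \sim q_0 \\
&= \boldsymbol{x}_0 + \epsilon \nabla\boldsymbol{t}(\boldsymbol{t}^{-1}(\boldsymbol{x}_0))\, \boldsymbol{\phi}(\boldsymbol{t}^{-1}(\boldsymbol{x}_0)) + O(\epsilon^2) \\
&= \boldsymbol{x}_0 + \epsilon \boldsymbol{\phi}_0(\boldsymbol{x}_0) + O(\epsilon^2),
\end{aligned} \tag{21}$$

where we used the definition that $\boldsymbol{\phi}(\boldsymbol{x}) = \nabla\boldsymbol{t}(\boldsymbol{x})^{-1} \boldsymbol{\phi}_0(\boldsymbol{t}(\boldsymbol{x}))$ in (11), and $O(\cdot)$ is the big-O notation.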

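From Theorem 3.1 of Liu & Wang (2016), we have

$$-\frac{d}{d\epsilon} \mathrm{KL}(q_{[\epsilon\boldsymbol{\phi}]} \,\|\, p) \Big|_{\epsilon=0} = \mathbb{E}_q[\mathcal{P}^\top \boldsymbol{\phi}].$$

Using Equation (21) and a derivation similar to Theorem 3.1 of Liu & Wang (2016), we can show

$$-\frac{d}{d\epsilon} \mathrm{KL}(q_{[\epsilon\boldsymbol{\phi}]0} \,\|\, p_0) \Big|_{\epsilon=0} = \mathbb{E}_{q_0}[\mathcal{P}_0^\top \boldsymbol{\phi}_0].$$

Combining these with (20) proves (11).

Following Lemma 2, when $\boldsymbol{\phi}_0$ is in $\mathcal{H}_0$ with kernel $\boldsymbol{K}_0(\boldsymbol{x}, \boldsymbol{x}')$, $\boldsymbol{\phi}$ is in $\mathcal{H}$ with kernel $\boldsymbol{K}(\boldsymbol{x}, \boldsymbol{x}')$. Therefore, maximizing $\mathbb{E}_q[\mathcal{P}^\top \boldsymbol{\phi}]$ in $\mathcal{H}$ is equivalent to maximizing $\mathbb{E}_{q_0}[\mathcal{P}_0^\top \boldsymbol{\phi}_0]$ in $\mathcal{H}_0$. This suggests the trajectory of SVGD on $p_0$ with $\boldsymbol{K}_0$ and that on $p$ with $\boldsymbol{K}$ are equivalent. □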

B. Toy Examples

Figure 3 and Figure 4 show results of different algorithms on three 2D toy distributions: Star, Double banana and Sine. Detailed information on these distributions and more results are shown in Sections B.1-B.3.

We can see from Figures 3 and 4 that both variants of matrix SVGD consistently outperform SVN and vanilla SVGD. We also find that Matrix SVGD (mixture) tends to outperform Matrix SVGD (average), which is expected since Matrix SVGD (average) uses a constant preconditioning matrix for all the particles, and cannot capture different curvatures at different locations. Matrix SVGD (mixture) yields the best performance in general.

Figure 3:

The particles obtained by various methods at the 30/100/30-th iteration on three toy 2D distributions.

Figure 4:

The MMD vs. training iteration of different algorithms on the three toy distributions.

B.1. Sine

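The density function of the “Sine” distribution is defined by

$$p(x_1,x_2) \propto \exp\left(-\frac{\big(x_2 + \sin(\alpha x_1)\big)^2}{2\sigma_1} - \frac{x_1^2 + x_2^2}{2\sigma_2}\right),$$

where we choose $\alpha = 1$, $\sigma_1 = 0.003$, $\sigma_2 = 1$.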
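For reference, the unnormalized log-density and its gradient (the score that SVGD-type methods require) can be written down directly. This is a minimal sketch, not the authors' code: the function names are hypothetical, and the frequency parameter inside the sine is assumed to be 1, with the remaining defaults taken from the values stated above. The analytic gradient is compared against central finite differences at one test point.

```python
import numpy as np

def sine_log_density(x, alpha=1.0, sigma1=0.003, sigma2=1.0):
    """Unnormalized log p(x1, x2) of the Sine distribution."""
    x1, x2 = x
    r = x2 + np.sin(alpha * x1)
    return -r ** 2 / (2.0 * sigma1) - (x1 ** 2 + x2 ** 2) / (2.0 * sigma2)

def sine_score(x, alpha=1.0, sigma1=0.003, sigma2=1.0):
    """Gradient of the log-density with respect to (x1, x2)."""
    x1, x2 = x
    r = x2 + np.sin(alpha * x1)
    g1 = -r * alpha * np.cos(alpha * x1) / sigma1 - x1 / sigma2
    g2 = -r / sigma1 - x2 / sigma2
    return np.array([g1, g2])

# Compare the analytic score with central finite differences.
x_test = np.array([0.5, -0.3])
h = 1e-6
fd_grad = np.array([
    (sine_log_density(x_test + h * e) - sine_log_density(x_test - h * e)) / (2.0 * h)
    for e in np.eye(2)
])
analytic_grad = sine_score(x_test)
print(analytic_grad, fd_grad)
```

Because $\sigma_1$ is small, the score is large away from the curve $x_2 = -\sin(\alpha x_1)$, which is what makes this target badly conditioned for methods that ignore local curvature.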

Figure 5:

The particles obtained by various methods on the toy Sine distribution.

B.2. Double Banana

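We use the “double banana” distribution constructed in Detommaso et al. (2018), whose probability density function is

$$p(x) \propto \exp\left(-\frac{\|x\|_2^2}{2\sigma_1} - \frac{\big(y - F(x)\big)^2}{2\sigma_2}\right),$$

where $x = [x_1, x_2] \in \mathbb{R}^2$, $F(x) = \log\big((1 - x_1)^2 + 100\,(x_2 - x_1^2)^2\big)$, $y = \log(30)$, $\sigma_1 = 1.0$, and $\sigma_2 = 0.09$.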
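The density above translates directly into a few lines of code. The sketch below is an illustration with hypothetical function names; it evaluates the unnormalized log-density for one point or a batch of points stored along the last axis. Note that $F(x)$ diverges to $-\infty$ at $x = (1, 1)$, so that single point should be avoided when evaluating on a grid.

```python
import numpy as np

SIGMA1, SIGMA2, Y = 1.0, 0.09, np.log(30.0)

def banana_forward(x):
    """F(x) = log((1 - x1)^2 + 100 (x2 - x1^2)^2), applied along the last axis."""
    x1, x2 = x[..., 0], x[..., 1]
    return np.log((1.0 - x1) ** 2 + 100.0 * (x2 - x1 ** 2) ** 2)

def double_banana_log_density(x):
    """Unnormalized log p(x): Gaussian prior term plus the data-misfit term."""
    x = np.asarray(x, dtype=float)
    prior = -np.sum(x ** 2, axis=-1) / (2.0 * SIGMA1)
    misfit = -(Y - banana_forward(x)) ** 2 / (2.0 * SIGMA2)
    return prior + misfit

# At a point with F(x) = y, the misfit term vanishes and only the prior remains.
x_on_ridge = np.array([1.0, 1.0 + np.sqrt(0.3)])
logp_on_ridge = double_banana_log_density(x_on_ridge)

# Batched evaluation over several points at once.
batch = np.array([[0.0, 0.0], [-1.0, 1.5], [1.0, 1.0 + np.sqrt(0.3)]])
logp_batch = double_banana_log_density(batch)
print(logp_on_ridge, logp_batch)
```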

Figure 6:

The particles obtained by various methods on the double banana distribution.

B.3. Star

We construct the “star” distribution with a Gaussian mixture model, whose density function is

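$$p(x) = \frac{1}{K}\sum_{i=1}^{K} \mathcal{N}(x;\, \mu_i, \Sigma_i),$$

with $x \in \mathbb{R}^2$, $\mu_1 = [0;\, 1.5]$, $\Sigma_1 = \mathrm{diag}\big([1;\, \tfrac{1}{100}]\big)$, and the other means and covariance matrices defined by rotating the previous mean and covariance matrix. To be precise,

$$\mu_{i+1} = U\mu_i, \qquad \Sigma_{i+1} = U\,\Sigma_i\,U^\top, \qquad U = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix},$$

with angle $\theta = \frac{2\pi}{K}$. We set the number of components $K$ to be 5.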
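The recursive construction of the mixture components can be sketched as follows. This is an illustrative implementation with hypothetical function names, assuming the standard counter-clockwise rotation matrix for $U$; the log-density is computed with a log-sum-exp over the components for numerical stability.

```python
import numpy as np

def star_components(K=5):
    """Build means and covariances by repeatedly rotating the first component."""
    theta = 2.0 * np.pi / K
    U = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    mus = [np.array([0.0, 1.5])]
    covs = [np.diag([1.0, 1.0 / 100.0])]
    for _ in range(K - 1):
        mus.append(U @ mus[-1])
        covs.append(U @ covs[-1] @ U.T)
    return np.stack(mus), np.stack(covs)

def star_log_density(x, mus, covs):
    """log of (1/K) * sum_i N(x; mu_i, Sigma_i), via log-sum-exp."""
    logs = []
    for mu, cov in zip(mus, covs):
        diff = x - mu
        maha = diff @ np.linalg.solve(cov, diff)
        _, logdet = np.linalg.slogdet(2.0 * np.pi * cov)
        logs.append(-0.5 * (maha + logdet))
    logs = np.array(logs)
    m = logs.max()
    return m + np.log(np.mean(np.exp(logs - m)))

mus, covs = star_components(K=5)
logp_first = star_log_density(mus[0], mus, covs)
logp_second = star_log_density(mus[1], mus, covs)
print(mus, logp_first, logp_second)
```

Because the components are rotated copies of one another, the density is invariant under rotation by $\theta$, so it takes the same value at every component mean.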

Figure 7:

The particles obtained by various methods on the star-shaped distribution.

References

  1. Alvarez MA, Rosasco L, Lawrence ND, et al. Kernels for vector-valued functions: A review. Foundations and Trends® in Machine Learning, 4(3):195–266, 2012.
  2. Berlinet A and Thomas-Agnan C. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
  3. Blei DM, Kucukelbir A, and McAuliffe JD. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  4. Carmeli C, De Vito E, and Toigo A. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4(04):377–408, 2006.
  5. Chen WY, Barp A, Briol F-X, Gorham J, Girolami M, Mackey L, Oates C, et al. Stein point Markov chain Monte Carlo. International Conference on Machine Learning (ICML), 2019.
  6. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, and Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  7. Detommaso G, Cui T, Marzouk Y, Spantini A, and Scheichl R. A Stein variational Newton method. In Advances in Neural Information Processing Systems, pp. 9187–9197, 2018.
  8. Duchi J, Hazan E, and Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(July):2121–2159, 2011.
  9. Gan Z, Li C, Chen C, Pu Y, Su Q, and Carin L. Scalable Bayesian learning of recurrent neural networks for language modeling. ACL, 2016.
  10. Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, and Smola A. A kernel two-sample test. Journal of Machine Learning Research, 13(March):723–773, 2012.
  11. Haarnoja T, Tang H, Abbeel P, and Levine S. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
  12. Han J and Liu Q. Stein variational gradient descent without gradient. In International Conference on Machine Learning, 2018.
  13. Hoffman MD and Gelman A. The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
  14. Hu M and Liu B. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 168–177. ACM, 2004.
  15. Kim T, Yoon J, Dia O, Kim S, Bengio Y, and Ahn S. Bayesian model-agnostic meta-learning. arXiv preprint arXiv:1806.03836, 2018.
  16. Li C, Chen C, Carlson DE, and Carin L. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In AAAI, volume 2, pp. 4, 2016.
  17. Liu C and Zhu J. Riemannian Stein variational gradient descent for Bayesian inference. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  18. Liu Q. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pp. 3115–3123, 2017.
  19. Liu Q and Wang D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378–2386, 2016.
  20. Liu Q and Wang D. Stein variational gradient descent as moment matching. In Advances in Neural Information Processing Systems, pp. 8867–8876, 2018.
  21. Martens J and Grosse R. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.
  22. Martens J, Ba J, and Johnson M. Kronecker-factored curvature approximations for recurrent neural networks. In ICLR, 2018.
  23. Neal RM et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
  24. Pang B and Lee L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics, pp. 271. Association for Computational Linguistics, 2004.
  25. Pang B and Lee L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd annual meeting on Association for Computational Linguistics, pp. 115–124. Association for Computational Linguistics, 2005.
  26. Paulsen VI and Raghupathi M. An introduction to the theory of reproducing kernel Hilbert spaces, volume 152. Cambridge University Press, 2016.
  27. Pu Y, Gan Z, Henao R, Li C, Han S, and Carin L. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems, pp. 4236–4245, 2017.
  28. Wainwright MJ, Jordan MI, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
  29. Wang D and Liu Q. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.
  30. Wang D, Zeng Z, and Liu Q. Stein variational message passing for continuous graphical models. In International Conference on Machine Learning, pp. 5206–5214, 2018.
  31. Wiebe J, Wilson T, and Cardie C. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2–3):165–210, 2005.
  32. Zhuo J, Liu C, Shi J, Zhu J, Chen N, and Zhang B. Message passing Stein variational gradient descent. In International Conference on Machine Learning, pp. 6013–6022, 2018.
