Stein Variational Gradient Descent with Matrix-Valued Kernels

Dilin Wang; Ziyang Tang; Chandrajit Bajaj; Qiang Liu

Author manuscript; available in PMC: 2019 Dec 19.

Published in final edited form as: Adv Neural Inf Process Syst. 2019 Dec;32:7834–7844.

PMCID: PMC6923147 NIHMSID: NIHMS1060429 PMID: 31857781

Abstract

Stein variational gradient descent (SVGD) is a particle-based inference algorithm that leverages gradient information for efficient approximate inference. In this work, we enhance SVGD by leveraging preconditioning matrices, such as the Hessian and Fisher information matrix, to incorporate geometric information into SVGD updates. We achieve this by presenting a generalization of SVGD that replaces the scalar-valued kernels in vanilla SVGD with more general matrix-valued kernels. This yields a significant extension of SVGD, and more importantly, allows us to flexibly incorporate various preconditioning matrices to accelerate the exploration in the probability landscape. Empirical results show that our method outperforms vanilla SVGD and a variety of baseline approaches over a range of real-world Bayesian inference tasks.

1. Introduction

Approximate inference of intractable distributions is a central task in probabilistic learning and statistics. An efficient approximate inference algorithm must perform both efficient optimization to explore the high probability regions of the distributions of interest, and reliable uncertainty quantification for evaluating the variation of the given distributions. Stein variational gradient descent (SVGD) (Liu & Wang, 2016) is a deterministic sampling algorithm that achieves both desiderata by optimizing the samples using a procedure similar to gradient-based optimization, while achieving reliable uncertainty estimation using an interacting repulsive mechanism. SVGD has been shown to provide a fast and flexible alternative to traditional methods such as Markov chain Monte Carlo (MCMC) (e.g., Neal et al., 2011; Hoffman & Gelman, 2014) and parametric variational inference (VI) (e.g., Wainwright et al., 2008; Blei et al., 2017) in various challenging applications (e.g., Pu et al., 2017; Wang & Liu, 2016; Kim et al., 2018; Haarnoja et al., 2017).

On the other hand, standard SVGD only uses first-order gradient information, and cannot leverage the advantages of second-order methods, such as Newton’s method and natural gradient, to achieve better performance on challenging problems with complex loss landscapes or domains. Unfortunately, due to the special form of SVGD, it is not straightforward to derive second-order extensions of SVGD by simply extending similar ideas from optimization. While this problem has been recently considered (e.g., Detommaso et al., 2018; Liu & Zhu, 2018; Chen et al., 2019), the presented solutions either require heuristic approximations (Detommaso et al., 2018), or lead to complex algorithmic procedures that are difficult to implement in practice (Liu & Zhu, 2018).

Our solution to this problem is through a key generalization of SVGD that replaces the original scalar-valued positive definite kernels in SVGD with a class of more general matrix-valued positive definite kernels. Our generalization includes all previous variants of SVGD (e.g., Wang et al., 2018; Han & Liu, 2018) as special cases. More significantly, it allows us to easily incorporate various structured preconditioning matrices into SVGD updates, including both Hessian and Fisher information matrices, as part of the generalized matrix-valued positive definite kernels. We develop theoretical results that shed insight on optimal design of the matrix-valued kernels, and also propose simple and fast practical procedures. We empirically evaluate both Newton and Fisher based extensions of SVGD on various practical benchmarks, including Bayesian neural regression and sentence classification, on which our methods show significant improvement over vanilla SVGD and other baseline approaches.

Notation and Preliminary

For notation, we use bold lower-case letters (e.g., x) for vectors in $ℝ^{d}$ , and bold upper-case letters (e.g., Q) for matrices. A symmetric function k: $ℝ^{d}$ × $ℝ^{d}$ → ℝ is called a positive definite kernel if $\sum_{i j} c_{i} k (x_{i}, x_{j}) c_{j} \geq 0$ for any ${c_{i}} \subset ℝ$ and ${x_{i}} \subset ℝ^{d}$ . Every positive definite kernel k(x, x^′) is associated with a reproducing kernel Hilbert space (RKHS) $H_{k}$ , which consists of the closure of functions of form

f (x) = \sum_{i} c_{i} k (x, x_{i}), \forall {c_{i}} \subset ℝ, {x_{i}} \subset ℝ^{d},

(1)

for which the inner product and norm are defined by ${〈 f, g 〉}_{H_{k}} = \sum_{i j} c_{i} s_{j} k (x_{i}, x_{j})$ , $‖ f ‖_{H_{k}}^{2} = \sum_{i j} c_{i} c_{j} k (x_{i}, x_{j})$ , where we assume $g (x) = \sum_{i} s_{i} k (x, x_{i})$ . Denote by $H_{k}^{d} : = H_{k} \times \dots \times H_{k}$ the vector-valued RKHS consisting of $ℝ^{d}$ vector-valued functions $ϕ = {[ϕ^{1}, \dots, ϕ^{d}]}^{⊤}$ with each $ϕ^{l} \in H_{k}$ . See e.g., Berlinet & Thomas-Agnan (2011) for a more rigorous treatment. For notational convenience, we do not distinguish distributions on $ℝ^{d}$ and their density functions.

2. Stein Variational Gradient Descent (SVGD)

We introduce the basic derivation of Stein variational gradient descent (SVGD), which provides a foundation for our new generalization. See Liu & Wang (2016, 2018); Liu (2017) for more details.

Let p(x) be a positive and continuously differentiable probability density function on $ℝ^{d}$ . Our goal is to find a set of points (a.k.a. particles) ${x_{i}}_{i = 1}^{n} \subset ℝ^{d}$ to approximate p, such that the empirical distribution $q (x) = \sum_{i} δ (x - x_{i}) / n$ of the particles weakly converges to p when n is large. Here $δ (\cdot)$ denotes the Dirac delta function.

SVGD achieves this by starting from a set of initial particles, and iteratively updating them with a deterministic transformation of form

x_{i} \leftarrow x_{i} + ϵ ϕ_{k}^{*} (x_{i}), \forall i = 1, \dots, n, ϕ_{k}^{*} = \underset{ϕ \in B_{k}}{\arg \max} {- {\frac{d}{d ϵ} KL (q_{[ϵ ϕ]} ‖ p) |}_{ϵ = 0}},

(2)

where $ϵ$ is a small step size, $ϕ_{k}^{*} : ℝ^{d} \to ℝ^{d}$ is an optimal transform function chosen to maximize the decreasing rate of the KL divergence between the distribution of particles and the target p, and $q_{[ϵ ϕ]}$ denotes the distribution of the updated particles $x^{'} = x + ϵ ϕ (x)$ as x ∼ q, and $B_{k}$ is the unit ball of RKHS $H_{k}^{d} : = H_{k} \times \dots \times H_{k}$ associated with a positive definite kernel k(x, x^′), that is,

B_{k} = {ϕ \in H_{k}^{d} : ‖ ϕ ‖_{H_{k}^{d}} \leq 1} .

(3)

Liu & Wang (2016) showed that the objective in (2) can be expressed as a linear functional of $ϕ$,

- {\frac{d}{d ϵ} KL (q_{[ϵ ϕ]} ‖ p) |}_{ϵ = 0} = E_{x ~ q} [P^{⊤} ϕ (x)], P^{⊤} ϕ (x) = \nabla_{x} \log p {(x)}^{⊤} ϕ (x) + \nabla_{x}^{⊤} ϕ (x),

(4)

where $P$ is a differential operator called the Stein operator; here we formally view $P$ and the derivative operator $\nabla_{x}$ as $ℝ^{d}$ column vectors, hence $P^{⊤} ϕ$ and $\nabla_{x}^{⊤} ϕ$ are viewed as inner products, e.g., $\nabla_{x}^{⊤} ϕ = \sum_{l = 1}^{d} \nabla_{x^{l}} ϕ^{l}$ , with $x^{l}$ and $ϕ^{l}$ being the $l$ -th coordinate of vector x and ϕ, respectively. With (4), it is shown in Liu & Wang (2016) that the solution of (2) is

ϕ_{k}^{*} (\cdot) \propto E_{x ~ q} [P k (x, \cdot)] = E_{x ~ q} [\nabla_{x} \log p (x) k (x, \cdot) + \nabla_{x} k (x, \cdot)] .

(5)

Such $ϕ_{k}^{*}$ provides the best update direction for the particles within RKHS $H_{k}^{d}$ . By taking q to be the empirical measure of the particles, i.e., $q (x) = \sum_{i = 1}^{n} δ (x - x_{i}) / n$ and repeatedly applying this update on the particles, we obtain the SVGD algorithm using equations (2) and (5).
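The update in (2) and (5) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `svgd_direction` and `run_svgd`, the fixed bandwidth `h`, and the RBF parameterization $k (x, x^{'}) = \exp (- ‖ x - x^{'} ‖^{2} / (2 h))$ are assumptions made for the example.

```python
import numpy as np

def svgd_direction(X, grad_log_p, h=1.0):
    """Empirical version of (5) evaluated at every particle.

    X has shape (n, d); grad_log_p maps (n, d) -> (n, d).
    The RBF kernel exp(-||x - x'||^2 / (2h)) is an illustrative choice.
    """
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]            # diff[j, i] = x_j - x_i
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h))
    scores = grad_log_p(X)                          # rows are grad log p(x_j)
    drive = k.T @ scores                            # sum_j k(x_j, x_i) grad log p(x_j)
    # grad_{x_j} k(x_j, x_i) = -k(x_j, x_i) (x_j - x_i) / h  (repulsive term)
    repulse = -np.einsum("ji,jia->ia", k, diff) / h
    return (drive + repulse) / n

def run_svgd(X0, grad_log_p, eps=0.1, n_iter=500, h=1.0):
    """Repeatedly apply x_i <- x_i + eps * phi(x_i) as in (2)."""
    X = X0.copy()
    for _ in range(n_iter):
        X = X + eps * svgd_direction(X, grad_log_p, h)
    return X
```

With a single particle the repulsive term vanishes and the direction reduces to the plain score $\nabla_{x} \log p (x)$, which is a quick sanity check on any implementation.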

3. SVGD with Matrix-valued Kernels

Our goal is to extend SVGD to allow efficient incorporation of precondition information for better optimization. We achieve this by providing a generalization of SVGD that leverages more general matrix-valued kernels, to flexibly incorporate preconditioning information.

The key idea is to observe that the standard SVGD searches for the optimal ϕ in RKHS $H_{k}^{d} = H_{k} \times \dots \times H_{k}$ , a product of d copies of RKHS of scalar-valued functions, which does not allow us to encode potential correlations between different coordinates of $ϕ$ . This limitation can be addressed by replacing $H_{k}^{d}$ with a more general RKHS of vector-valued functions (called vector-valued RKHS), which uses more flexible matrix-valued positive definite kernels to specify rich correlation structures between different coordinates. In this section, we first introduce the background of vector-valued RKHS with matrix-valued kernels in Section 3.1, and then propose and discuss our generalization of SVGD using matrix-valued kernels in Section 3.2-3.3.

3.1. Vector-Valued RKHS with Matrix-Valued Kernels

We now introduce the background of matrix-valued positive definite kernels, which provides a most general framework for specifying vector-valued RKHS. We focus on the intuition and key ideas in our introduction, and refer the readers to Alvarez et al. (2012); Carmeli et al. (2006) for mathematical treatment.

Recall that a standard real-valued RKHS $H_{k}$ consists of the closure of the linear span of its kernel function $k (\cdot, x)$ as shown in (1). Vector-valued RKHS can be defined in a similar way, but consist of the linear span of a matrix-valued kernel function:

f (x) = \sum_{i} K (x, x_{i}) c_{i},

(6)

for any ${c_{i}} \subset ℝ^{d}$ , where $K : ℝ^{d} \times ℝ^{d} \to ℝ^{d \times d}$ is now a matrix-valued kernel function, and $c_{i}$ are vector-valued weights. Similar to the scalar case, we can define an inner product structure ${〈 f, g 〉}_{H_{K}} = \sum_{i j} c_{i}^{⊤} K (x_{i}, x_{j}) s_{j}$ , where we assume $g = \sum_{i} K (x, x_{i}) s_{i}$ , and hence a norm $‖ f ‖_{H_{K}}^{2} = \sum_{i j} c_{i}^{⊤} K (x_{i}, x_{j}) c_{j}$ . In order to make the inner product and norm well defined, the matrix-valued kernel K is required to be symmetric in that $K (x, x^{'}) = K {(x^{'}, x)}^{⊤}$ , and positive definite in that $\sum_{i j} c_{i}^{⊤} K (x_{i}, x_{j}) c_{j} \geq 0$ for any ${x_{i}} \subset ℝ^{d}$ and ${c_{i}} \subset ℝ^{d}$ .

Mathematically, one can show that the closure of the set of functions in (6), equipped with the inner product defined above, defines a RKHS that we denote by $H_{K}$ . It is “reproducing” because it has the following reproducing property that generalizes the version for scalar-valued RKHS: for any $f \in H_{K}$ and any $c \in ℝ^{d}$ , we have

f {(x)}^{⊤} c = {〈 f (\cdot), K (\cdot, x) c 〉}_{H_{K}},

(7)

where it is necessary to introduce c because the result of the inner product of two functions must be a scalar. A simple example of a matrix kernel is $K (x, x^{'}) = k (x, x^{'}) I$ , where I is the d × d identity matrix. Its related RKHS is $H_{K} = H_{k} \times \dots \times H_{k} = H_{k}^{d}$ , as used in the original SVGD.
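The symmetry and positive definiteness conditions can be checked numerically on a finite set of points. The sketch below assumes a separable kernel $K (x, x^{'}) = k (x, x^{'}) B$ with an RBF $k$ and a fixed symmetric positive semidefinite matrix $B$; the names `separable_matrix_kernel` and `block_gram` are hypothetical helpers for this illustration. Taking $B = I$ recovers the kernel of the original SVGD.

```python
import numpy as np

def separable_matrix_kernel(x, y, B, h=1.0):
    """K(x, y) = k(x, y) * B with an RBF k; B is assumed symmetric PSD."""
    k = np.exp(-np.sum((x - y) ** 2) / (2.0 * h))
    return k * B

def block_gram(X, B, h=1.0):
    """Stack K(x_i, x_j) into an (n*d, n*d) block matrix.

    Positive definiteness of K means c^T G c >= 0 for every stacked
    weight vector c = [c_1; ...; c_n], i.e. G is positive semidefinite.
    """
    n, d = X.shape
    G = np.zeros((n * d, n * d))
    for i in range(n):
        for j in range(n):
            G[i * d:(i + 1) * d, j * d:(j + 1) * d] = separable_matrix_kernel(X[i], X[j], B, h)
    return G
```

Inspecting `np.linalg.eigvalsh(block_gram(X, B))` for a few random particle sets is a cheap way to catch a kernel construction that is not positive definite.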

3.2. SVGD with Matrix-Valued Kernels

It is now natural to leverage matrix-valued kernels to obtain a generalization of SVGD (see Algorithm 1). The idea is simple: we now optimize $ϕ$ in the unit ball of a general vector-valued RKHS $H_{K}$ with a matrix valued kernel $K (x, x^{'})$ .

ϕ_{K}^{*} = \underset{ϕ \in H_{K}}{\arg \max} {E_{x ~ q} [P^{⊤} ϕ (x)], s.t. ‖ ϕ ‖_{H_{K}} \leq 1} .

(8)

This yields a simple closed form solution similar to (5).

Theorem 1.

Let $K (x, x^{'})$ be a matrix-valued positive definite kernel that is continuously differentiable in $x$ and $x^{'}$ . Then the optimal $ϕ_{K}^{*}$ in (8) is

ϕ_{K}^{*} (\cdot) \propto E_{x ~ q} [K (\cdot, x) P] = E_{x ~ q} [K (\cdot, x) \nabla_{x} \log p (x) + K (\cdot, x) \nabla_{x}],

(9)

Algorithm 1. Stein Variational Gradient Descent with Matrix-valued Kernels (Matrix SVGD)

Input: A (possibly unnormalized) differentiable density function p(x) in $ℝ^{d}$ . A matrix-valued positive definite kernel $K (x, x^{'})$ . Step size $ϵ$ .

Goal: Find a set of particles ${x_{i}}_{i = 1}^{n}$ to represent the distribution p.

Initialize a set of particles ${x_{i}}_{i = 1}^{n}$ , e.g., by drawing from some simple distribution.

repeat

x_{i} \leftarrow x_{i} + \frac{ϵ}{n} \sum_{j = 1}^{n} [K (x_{i}, x_{j}) \nabla_{x_{j}} \log p (x_{j}) + K (x_{i}, x_{j}) \nabla_{x_{j}}],

where $K (\cdot, x) \nabla_{x}$ is formally defined as the product of matrix $K (\cdot, x)$ and vector $\nabla_{x}$ . The $l$ -th element of $K (\cdot, x) \nabla_{x}$ is ${(K (\cdot, x) \nabla_{x})}_{l} = \sum_{m = 1}^{d} \nabla_{x^{m}} K_{l, m} (\cdot, x)$ ; see also (10).

until convergence

where the Stein operator $P$ and derivative operator $\nabla_{x}$ are again formally viewed as $ℝ^{d}$ -valued column vectors, and $K (\cdot, x) P$ and $K (\cdot, x) \nabla_{x}$ are interpreted by the matrix multiplication rule. Therefore, $K (\cdot, x) P$ is an $ℝ^{d}$ -valued column vector, whose $l$ -th element is defined by

{(K (\cdot, x) P)}_{l} = \sum_{m = 1}^{d} (K_{l, m} (\cdot, x) \nabla_{x^{m}} \log p (x) + \nabla_{x^{m}} K_{l, m} (\cdot, x)),

(10)

where $K_{l, m} (x, x^{'})$ denotes the $(l, m)$ -element of matrix $K (x, x^{'})$ and $x^{m}$ the $m$ -th element of x.

Similar to the case of standard SVGD, recursively applying the optimal transform $ϕ_{K}^{*}$ on the particles yields a general SVGD algorithm shown in Algorithm 1, which we call matrix SVGD.

Parallel to vanilla SVGD, the gradient of matrix SVGD in (9) consists of two parts that account for optimization and diversity, respectively: the first part is a weighted average of the gradient $\nabla_{x} \log p (x)$ multiplied by a matrix-valued kernel $K (\cdot, x)$ ; the other part consists of the gradient of the matrix-valued kernel K, which, like standard SVGD, serves as a repulsive force to keep the particles away from each other to reflect the uncertainty captured in distribution p.

Matrix SVGD includes various previous variants of SVGD as special cases. The vanilla SVGD corresponds to the case when $K (x, x^{'}) = k (x, x^{'}) I$ , with I as the d × d identity matrix; the gradient-free SVGD of Han & Liu (2018) can be treated as the case when $K (x, x^{'}) = k (x, x^{'}) w (x) w (x^{'}) I$ , where w(x) is an importance weight function; the graphical SVGD of Wang et al. (2018); Zhuo et al. (2018) corresponds to a diagonal matrix-valued kernel: $K (x, x^{'}) = diag [{k_{l} (x, x^{'})}_{l = 1}^{d}]$ , where each $k_{l} (x, x^{'})$ is a “local” scalar-valued kernel function related to the $l$ -th coordinate $x^{l}$ of vector $x$ .

3.3. Matrix-Valued Kernels and Change of Variables

It is well known that preconditioned gradient descent can be interpreted as applying standard gradient descent on a reparameterization of the variables. For example, let $y = Q^{1 / 2} x$ , where Q is a positive definite matrix; then $\log p (x) = \log p (Q^{- 1 / 2} y)$ . Applying gradient descent on y and transforming it back to updates on x yields a preconditioned gradient update $x \leftarrow x + ϵ Q^{- 1} \nabla_{x} \log p (x)$ .
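The equivalence between a plain gradient step in $y$ and a preconditioned step in $x$ can be verified numerically. The following sketch is illustrative only; the helper names `sym_sqrt`, `step_in_y_then_map_back` and `preconditioned_step` are assumptions introduced here.

```python
import numpy as np

def sym_sqrt(Q):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(Q)
    return (V * np.sqrt(w)) @ V.T

def step_in_y_then_map_back(x, grad_log_p, Q, eps):
    """Take one gradient step on y = Q^{1/2} x, then return x = Q^{-1/2} y."""
    Qh = sym_sqrt(Q)
    Qh_inv = np.linalg.inv(Qh)
    y = Qh @ x
    # Chain rule: grad_y log p(Q^{-1/2} y) = Q^{-1/2} grad_x log p(x)
    grad_y = Qh_inv @ grad_log_p(x)
    return Qh_inv @ (y + eps * grad_y)

def preconditioned_step(x, grad_log_p, Q, eps):
    """Direct preconditioned update x + eps * Q^{-1} grad_x log p(x)."""
    return x + eps * np.linalg.solve(Q, grad_log_p(x))
```

Both functions return the same point up to floating-point error. For a Gaussian target with precision Q and $ϵ = 1$, the preconditioned step lands on the mode in one iteration, which is the usual Newton behavior.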

We now extend this idea to SVGD, for which matrix-valued kernels show up naturally as a consequence of change of variables. This justifies the use of matrix-valued kernels and provides guidance on the practical choice of matrix-valued kernels. We start with a basic result of how matrix-valued kernels change under change of variables (see Paulsen & Raghupathi (2016)).

Lemma 2.

Assume $H_{0}$ is an RKHS with a matrix kernel $K_{0} : ℝ^{d} \times ℝ^{d} \to ℝ^{d \times d}$ . Let $H$ be the set of functions formed by

ϕ (x) = M (x) ϕ_{0} (t (x)), \forall ϕ_{0} \in H_{0},

where $M : ℝ^{d} \to ℝ^{d \times d}$ is a fixed matrix-valued function and we assume M (x) is an invertible matrix for all x, and $t : ℝ^{d} \to ℝ^{d}$ is a fixed continuously differentiable one-to-one transform on $ℝ^{d}$ .

For any $ϕ, ϕ^{'} \in H$ , we can identify unique $ϕ_{0}, ϕ_{0}^{'} \in H_{0}$ such that $ϕ (x) = M (x) ϕ_{0} (t (x))$ and $ϕ^{'} (x) = M (x) ϕ_{0}^{'} (t (x))$ . Define the inner product on $H$ via ${〈 ϕ, ϕ^{'} 〉}_{H} = {〈 ϕ_{0}, ϕ_{0}^{'} 〉}_{H_{0}}$ ; then $H$ is also a vector-valued RKHS, whose matrix-valued kernel is

K (x, x^{'}) = M (x) K_{0} (t (x), t (x^{'})) M {(x^{'})}^{⊤} .

We now present a key result, which characterizes the change of kernels when we apply invertible variable transforms on the SVGD trajectory.

Theorem 3.

i) Let p and q be two distributions and p₀, q₀ the distributions of $x_{0} = t (x)$ when x is drawn from p, q, respectively, where t is a continuously differentiable one-to-one map on $ℝ^{d}$ . Assume p is a continuously differentiable density with Stein operator $P$ , and let $P_{0}$ be the Stein operator of p₀. We have

E_{x ~ q_{0}} [P_{0}^{⊤} ϕ_{0} (x)] = E_{x ~ q} [P^{⊤} ϕ (x)], with ϕ (x) : = \nabla t {(x)}^{- 1} ϕ_{0} (t (x)),

(11)

where ∇t is the Jacobian matrix of t.

ii) Therefore, in the asymptotics of infinitesimal step size $(ϵ \to 0^{+})$ , running SVGD with kernel $K_{0}$ on p₀ is equivalent to running SVGD on p with kernel

K (x, x^{'}) = \nabla t {(x)}^{- 1} K_{0} (t (x), t (x^{'})) \nabla t {(x^{'})}^{- ⊤},

in the sense that the trajectory of these two SVGD can be mapped to each other by the one-to-one map t (and its inverse).
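For a linear map $t (x) = A x$ the Jacobian is $\nabla t = A$, and the statement can be checked on a finite particle set: the matrix SVGD direction on p with kernel $A^{- 1} K_{0} (A x, A x^{'}) A^{- ⊤}$ should equal $A^{- 1}$ times the scalar SVGD direction on p₀ evaluated at the mapped particles. The sketch below assumes an RBF $K_{0}$ with fixed bandwidth `h`; the function names are hypothetical.

```python
import numpy as np

def scalar_svgd_direction(Y, score0, h=1.0):
    """SVGD direction with scalar RBF kernel k0 on the transformed target p0."""
    n = Y.shape[0]
    diff = Y[:, None, :] - Y[None, :, :]             # diff[j, i] = y_j - y_i
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h))
    return (k.T @ score0(Y) - np.einsum("ji,jia->ia", k, diff) / h) / n

def transformed_kernel_direction(X, score, A, h=1.0):
    """Matrix SVGD direction on p with K(x, x') = A^{-1} k0(Ax, Ax') A^{-T}."""
    n = X.shape[0]
    B = np.linalg.inv(A.T @ A)                       # equals A^{-1} A^{-T}
    diff = X[:, None, :] - X[None, :, :]
    Adiff = diff @ A.T                               # rows are A (x_j - x_i)
    k = np.exp(-np.sum(Adiff ** 2, axis=-1) / (2.0 * h))
    # sum_j grad_{x_j} k0(A x_j, A x_i) = -sum_j k A^T A (x_j - x_i) / h
    grad_k = -np.einsum("ji,jia->ia", k, Adiff @ A) / h
    return (k.T @ score(X) + grad_k) @ B / n         # B is symmetric
```

Given a score function for p, the score of p₀ follows from the change of variables as $\nabla \log p_{0} (y) = A^{- ⊤} \nabla \log p (A^{- 1} y)$, so the two directions can be compared via `scalar_svgd_direction(X @ A.T, score0) @ np.linalg.inv(A).T`.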

3.4. Practical Choice of Matrix-Valued Kernels

Theorem 3 suggests a conceptual procedure for constructing proper matrix kernels to incorporate desirable preconditioning information: one can construct a one-to-one map t so that the distribution $p_{0}$ of $x_{0} = t (x)$ is an easy-to-sample distribution with a simpler kernel, which can be a standard scalar-valued kernel or a simple diagonal kernel. Practical choices of t often involve rotating x with either the Hessian matrix or the Fisher information matrix, allowing us to incorporate this information into SVGD. In the sequel, we first illustrate this idea for simple Gaussian cases and then discuss practical approaches for non-Gaussian cases.

Constant Preconditioning Matrices

Consider the simple case when p is multivariate Gaussian, e.g., $\log p (x) = - \frac{1}{2} x^{⊤} Q x + const$ , where Q is a positive definite matrix. In this case, the distribution p₀ of the transformed variable $t (x) = Q^{1 / 2} x$ is the standard Gaussian distribution that can be better approximated with a simpler kernel $K_{0} (x, x^{'})$ , which can be chosen to be the standard RBF kernel suggested in Liu & Wang (2016), the graphical kernel suggested in Wang et al. (2018), or the linear kernels suggested in Liu & Wang (2018). Theorem 3 then suggests using a kernel of form

K_{Q} (x, x^{'}) : = Q^{- 1 / 2} K_{0} (Q^{1 / 2} x, Q^{1 / 2} x^{'}) Q^{- 1 / 2},

(12)

in which Q is applied on both the input x and the output side. As an example, taking $K_{0}$ to be the scalar-valued Gaussian RBF kernel gives

K_{Q} (x, x^{'}) = Q^{- 1} \exp (- \frac{1}{2 h} {‖ x - x^{'} ‖}_{Q}^{2}),

(13)

where ${‖ x - x^{'} ‖}_{Q}^{2} : = {(x - x^{'})}^{⊤} Q (x - x^{'})$ and h is a bandwidth parameter. Define $K_{0, Q} (x, x^{'}) : = K_{0} (Q^{1 / 2} x, Q^{1 / 2} x^{'})$ . One can show that the SVGD direction of the kernel in (12) equals

ϕ_{K_{Q}}^{*} (\cdot) = Q^{- 1} E_{x ~ q} [\nabla \log p (x) K_{0, Q} (\cdot, x) + K_{0, Q} (\cdot, x) \nabla_{x}] = Q^{- 1} ϕ_{K_{0, Q}}^{*} (\cdot),

(14)

which is a linear transform of the SVGD direction of kernel $K_{0, Q} (x, x^{'})$ with matrix Q⁻¹.
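The step behind (14) can be written out explicitly. The sketch below assumes that $K_{0}$ is scalar-valued (that is, $K_{0} = k_{0} I$), so that $K_{Q} (x, x^{'}) = Q^{- 1} K_{0, Q} (x, x^{'})$ as in (13); it uses the element-wise definition (10) and the fact that Q does not depend on x.

```latex
\begin{aligned}
\big(K_Q(\cdot, x)\nabla_x\big)_l
  &= \sum_{m=1}^{d} \nabla_{x^m}\big[(Q^{-1})_{l,m}\, K_{0,Q}(\cdot, x)\big]
   = \big(Q^{-1}\nabla_x K_{0,Q}(\cdot, x)\big)_l, \\
\phi^{*}_{K_Q}(\cdot)
  &= \mathbb{E}_{x\sim q}\big[Q^{-1}K_{0,Q}(\cdot,x)\,\nabla_x\log p(x)
     + Q^{-1}\nabla_x K_{0,Q}(\cdot,x)\big] \\
  &= Q^{-1}\,\mathbb{E}_{x\sim q}\big[\nabla_x\log p(x)\,K_{0,Q}(\cdot,x)
     + \nabla_x K_{0,Q}(\cdot,x)\big]
   = Q^{-1}\phi^{*}_{K_{0,Q}}(\cdot).
\end{aligned}
```

For the RBF case, $\nabla_{x} K_{0, Q} (\cdot, x)$ is proportional to $Q$ times a difference of points, so the leading $Q^{- 1}$ cancels in the repulsive term while the driving term is preconditioned by $Q^{- 1}$.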

In practice, when p is non-Gaussian, we can construct Q by averaging over the particles. For example, denoting by $H (x) = - \nabla_{x}^{2} \log p (x)$ the negative Hessian matrix at x, we can construct Q by

Q = \sum_{i = 1}^{n} H (x_{i}) / n,

(15)

where ${x_{i}}_{i = 1}^{n}$ are the particles from the previous iteration. We may replace H with the Fisher information matrix to obtain a natural gradient like variant of SVGD.
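Combining (13), (14) and (15) gives a compact update with one shared preconditioner. The sketch below is an illustration under stated assumptions, not the released code: `neg_hessian` is a user-supplied callable returning $H (x)$ (or a Fisher matrix), the bandwidth `h` is fixed, and Q is treated as a constant when differentiating the kernel.

```python
import numpy as np

def matrix_svgd_average_direction(X, grad_log_p, neg_hessian, h=1.0):
    """Direction (14) with the RBF kernel (13) and the averaged Q from (15).

    X: (n, d) particles; grad_log_p: (n, d) -> (n, d);
    neg_hessian: (d,) -> (d, d), assumed symmetric with a positive definite average.
    """
    n, d = X.shape
    Q = np.mean([neg_hessian(x) for x in X], axis=0)      # eq. (15)
    diff = X[:, None, :] - X[None, :, :]                  # diff[j, i] = x_j - x_i
    Qdiff = diff @ Q                                      # rows are Q (x_j - x_i), Q symmetric
    k = np.exp(-np.einsum("jia,jia->ji", Qdiff, diff) / (2.0 * h))
    drive = k.T @ grad_log_p(X)
    # grad_{x_j} k(x_j, x_i) = -k(x_j, x_i) Q (x_j - x_i) / h
    repulse = -np.einsum("ji,jia->ia", k, Qdiff) / h
    phi0 = (drive + repulse) / n                          # direction of kernel K_{0,Q}
    return np.linalg.solve(Q, phi0.T).T                   # left-multiply by Q^{-1}
```

With a single particle and a Gaussian target the returned direction is the Newton step toward the mode, since the kernel equals one and the repulsive term vanishes.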

Point-wise Preconditioning

A constant preconditioning matrix cannot reflect different curvature or geometric information at different points. A simple heuristic to address this limitation is to replace the constant matrix Q with a point-wise matrix function Q(x); this motivates a kernel of form

K (x, x^{'}) = Q^{- 1 / 2} (x) K_{0} (Q^{1 / 2} (x) x, Q^{1 / 2} (x^{'}) x^{'}) Q^{- 1 / 2} (x^{'}) .

Unfortunately, this choice may yield expensive computation and difficult implementation in practice, because the SVGD update involves taking the gradient of the kernel $K (x, x^{'})$ , which would need to differentiate through matrix valued function Q(x). When Q(x) equals the Hessian matrix, for example, it involves taking the third order derivative of log p(x), yielding an unnatural algorithm.

Mixture Preconditioning

We instead propose a more practical approach to achieve point-wise preconditioning with a much simpler algorithm. The idea is to use a weighted combination of several constant preconditioning matrices. This involves leveraging a set of anchor points ${z_{l}}_{l = 1}^{m} \subset ℝ^{d}$ , each of which is associated with a preconditioning matrix $Q_{l} = Q (z_{l})$ (e.g., their Hessian or Fisher information matrices). In practice, the anchor points ${z_{l}}_{l = 1}^{m}$ can be conveniently set to be the same as the particles ${x_{i}}_{i = 1}^{n}$ . We then construct a kernel by

K (x, x^{'}) = \sum_{l = 1}^{m} K_{Q_{l}} (x, x^{'}) w_{l} (x) w_{l} (x^{'}),

(16)

where $K_{Q_{l}} (x, x^{'})$ is defined in (12) or (13), and $w_{l} (x)$ is a positive scalar-valued function that decides the contribution of kernel $K_{Q_{l}}$ at point x. Here $w_{l} (x)$ should be viewed as a mixture probability, and hence should satisfy $\sum_{l = 1}^{m} w_{l} (x) = 1$ for all x. In our empirical studies, we take $w_{l} (x)$ as the Gaussian mixture probability from the anchor points:

w_{l} (x) = \frac{N (x; z_{l}, Q_{l}^{- 1})}{\sum_{l^{'} = 1}^{m} N (x; z_{l^{'}}, Q_{l^{'}}^{- 1})}, N (x; z_{l}, Q_{l}^{- 1}) : = \frac{1}{Z_{l}} \exp (- \frac{1}{2} {‖ x - z_{l} ‖}_{Q_{l}}^{2}),

(17)

where $Z_{l} = {(2 π)}^{d / 2} \det {(Q_{l})}^{- 1 / 2}$ . In this way, each point x is mostly influenced by the anchor point closest to it, allowing us to apply different preconditioning for different points. Importantly, the SVGD update direction related to the kernel in (16) has a simple and easy-to-implement form:

ϕ_{K}^{*} (\cdot) = \sum_{l = 1}^{m} w_{l} (\cdot) E_{x ~ q} [(w_{l} (x) K_{Q_{l}} (\cdot, x)) P] = \sum_{l = 1}^{m} w_{l} (\cdot) ϕ_{w_{l} K_{Q_{l}}}^{*} (\cdot),

(18)

which is a weighted sum of a number of SVGD directions with constant preconditioning matrices (but with an asymmetric kernel $w_{l} (x) K_{Q_{l}} (\cdot, x)$ ).
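A possible NumPy sketch of (16) to (18) with the RBF kernel (13) follows. Several choices here are assumptions made for illustration: the anchors `Z` and matrices `Qs` are treated as fixed inputs (no gradient flows through them even when the anchors coincide with the particles), the bandwidth `h` is shared across components, and the function names are hypothetical. The derivative term uses the product rule on $w_{l} (x) k_{l} (x, \cdot)$ together with the softmax form of (17).

```python
import numpy as np

def mixture_weights(X, Z, Qs):
    """Return w[l, i] = w_l(x_i) from (17) and grad_w[l, i] = grad_x w_l(x_i)."""
    diff = X[None, :, :] - Z[:, None, :]                      # (m, n, d): x_i - z_l
    Qdiff = np.einsum("lab,lib->lia", Qs, diff)               # Q_l (x_i - z_l)
    log_det = np.linalg.slogdet(Qs)[1]                        # log det Q_l, shape (m,)
    # log N(x; z_l, Q_l^{-1}) up to the constant (2 pi)^{-d/2}, which cancels in (17)
    logits = -0.5 * np.einsum("lia,lia->li", Qdiff, diff) + 0.5 * log_det[:, None]
    logits -= logits.max(axis=0, keepdims=True)               # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=0, keepdims=True)
    grad_logit = -Qdiff
    mean_grad = np.einsum("li,lia->ia", w, grad_logit)
    grad_w = w[:, :, None] * (grad_logit - mean_grad[None])   # softmax gradient
    return w, grad_w

def mixture_svgd_direction(X, Z, Qs, grad_log_p, h=1.0):
    """Direction (18): sum_l w_l(x_i) Q_l^{-1} E_q[w_l k_l grad log p + grad(w_l k_l)]."""
    n, d = X.shape
    scores = grad_log_p(X)
    w, grad_w = mixture_weights(X, Z, Qs)
    pair = X[:, None, :] - X[None, :, :]                      # pair[j, i] = x_j - x_i
    phi = np.zeros_like(X)
    for l, Q in enumerate(Qs):
        Qpair = pair @ Q
        k = np.exp(-np.einsum("jia,jia->ji", Qpair, pair) / (2.0 * h))
        drive = k.T @ (w[l][:, None] * scores)
        # grad_{x_j}[w_l(x_j) k_l(x_j, x_i)] = k grad w_l - w_l k Q_l (x_j - x_i) / h
        deriv = k.T @ grad_w[l] - np.einsum("j,ji,jia->ia", w[l], k, Qpair) / h
        phi_l = np.linalg.solve(Q, (drive + deriv).T).T / n
        phi += w[l][:, None] * phi_l
    return phi
```

With a single anchor the weights are identically one and the direction reduces to the constant-preconditioning form in (14), which gives a simple consistency check.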

A Remark on Stein Variational Newton (SVN)

Detommaso et al. (2018) provided a Newton-like variation of SVGD. It is motivated by an intractable functional Newton framework, and arrives at a practical algorithm using a series of approximation steps. The update of SVN is

x_{i} \leftarrow x_{i} + ϵ {\tilde{H}}_{i}^{- 1} ϕ_{k}^{*} (x_{i}), \forall i = 1, \dots, n,

(19)

where $ϕ_{k}^{*} (\cdot)$ is the standard SVGD gradient, and ${\tilde{H}}_{i}$ is a Hessian-like matrix associated with particle $x_{i}$ , defined by

{\tilde{H}}_{i} = E_{x ~ q} [H (x) k {(x, x_{i})}^{2} + {(\nabla_{x_{i}} k (x, x_{i}))}^{\otimes 2}],

where $H (x) = - \nabla_{x}^{2} \log p (x)$ and $w^{\otimes 2} : = w w^{⊤}$ . Due to the approximation introduced in the derivation of SVN, it does not correspond to a standard functional gradient flow like SVGD (unless ${\tilde{H}}_{i} = Q$ for all i, in which case it reduces to using a constant preconditioning matrix on SVGD like (14)). SVN can be heuristically viewed as a “hard” variant of (18), which assigns each particle its own preconditioning matrix with probability one, but the mathematical forms do not match precisely. On the other hand, it is useful to note that the set of fixed points of the SVN update (19) is identical to that of the standard SVGD update with $ϕ_{k}^{*} (\cdot)$ , provided all ${\tilde{H}}_{i}$ are positive definite matrices. This is because at the fixed points of (19), we have ${\tilde{H}}_{i}^{- 1} ϕ_{k}^{*} (x_{i}) = 0$ for all $i = 1, \dots, n$ , which is equivalent to $ϕ_{k}^{*} (x_{i}) = 0$ for all $i$ when all the ${\tilde{H}}_{i}$ are positive definite (and hence invertible). Therefore, SVN can be justified as an alternative fixed point iteration method to achieve the same set of fixed points as the standard SVGD.

4. Experiments

We demonstrate the effectiveness of our matrix SVGD on various practical tasks. We start with a toy example and then proceed to more challenging tasks that involve logistic regression, neural networks and recurrent neural networks. For our method, we take the preconditioning matrices to be either Hessian or Fisher information matrices, depending on the application. For large scale Fisher matrices in (recurrent) neural networks, we leverage the Kronecker-factored (KFAC) approximation by Martens & Grosse (2015); Martens et al. (2018) to enable efficient computation. We use RBF kernel for vanilla SVGD. The kernel $K_{0} (x, x^{'})$ in our matrix SVGD (see (12) and (13)) is also taken to be Gaussian RBF. Following Liu & Wang (2016), we choose the bandwidth of the Gaussian RBF kernels using the standard median trick and use Adagrad (Duchi et al., 2011) for stepsize. Our code is available at https://github.com/dilinwang820/matrix_ssvgd.
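The median trick mentioned above can be sketched as follows. The exact formula is not stated in this text, so the version below is an assumption: it sets $h = med^{2} / (2 \log (n + 1))$, where $med^{2}$ is the median squared pairwise distance, which adapts the common median heuristic to the $\exp (- ‖ x - x^{'} ‖^{2} / (2 h))$ parameterization of (13). The function name `median_bandwidth` is hypothetical, and Euclidean distances are used for simplicity.

```python
import numpy as np

def median_bandwidth(X):
    """Bandwidth h such that exp(-med^2 / (2h)) = 1 / (n + 1).

    X: (n, d) particles with n >= 2. The scaling by log(n + 1) is an
    assumed convention, chosen so that a particle at the median distance
    contributes a kernel value of 1 / (n + 1).
    """
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    med_sq = np.median(sq_dist[np.triu_indices(n, k=1)])   # distinct pairs only
    return med_sq / (2.0 * np.log(n + 1.0))
```

For the preconditioned kernel (13), one natural variant (also an assumption) is to compute the pairwise distances in the $Q$ -norm instead of the Euclidean norm.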

The algorithms we test are summarized here:

Vanilla SVGD, using the code by Liu & Wang (2016);

Matrix-SVGD (average), using the constant preconditioning matrix kernel in (13), with Q to be either the average of the Hessian matrices or Fisher matrices of the particles (e.g., (15));

Matrix-SVGD (mixture), using the mixture preconditioning matrix kernel in (16), where we pick the anchor points to be particles themselves, that is, ${z_{l}}_{l = 1}^{m} = {x_{i}}_{i = 1}^{n}$ ;

Stein variational Newton (SVN), based on the implementation of Detommaso et al. (2018);

Preconditioned Stochastic Gradient Langevin Dynamics (pSGLD), which is a variant of SGLD (Li et al., 2016), using a diagonal approximation of the Fisher information as the preconditioning matrix.

4.1. Two-Dimensional Toy Examples

Settings

We start with illustrating our method using a Gaussian mixture toy model (Figure 1), with exact Hessian matrices for preconditioning. For fair comparison, we search the best learning rate for all algorithms exhaustively. We use 50 particles for all the cases. We use the same initialization for all methods with the same random seeds.

Results

Figure 1 shows the results for the 2D toy examples. Appendix B shows more visualizations and results on more examples. We can see that methods with Hessian information generally converge faster than vanilla SVGD, and Matrix-SVGD (mixture) yields the best performance.

4.2. Bayesian Logistic Regression

Settings

We consider the Bayesian logistic regression model for binary classification. Let $D = {(x_{j}, y_{j})}_{j = 1}^{N}$ be a dataset with feature vector $x_{j}$ and binary label $y_{j} \in {0, 1}$ . The distribution of interest is

p (θ | D) \propto p (D | θ) p (θ) with p (D | θ) = \prod_{j = 1}^{N} [y_{j} σ (θ^{⊤} x_{j}) + (1 - y_{j}) σ (- θ^{⊤} x_{j})],

where σ(z) := 1/(1 + exp(−z)), and $p (θ)$ is the prior distribution, which we set to be standard normal $N (θ; 0, I)$ . The goal is to approximate the posterior distribution $p (θ | D)$ with a set of particles ${θ_{i}}_{i = 1}^{n}$ , and then use it to predict the class labels for testing data points. We compare our methods with preconditioned stochastic gradient Langevin dynamics (pSGLD) (Li et al., 2016). Because pSGLD is a sequential algorithm, for fair comparison, we obtain the samples of pSGLD by running n parallel chains of pSGLD for estimation. The preconditioning matrix in both pSGLD and matrix SVGD is taken to be the Fisher information matrix.
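The quantities needed by matrix SVGD for this model can be sketched as below. This is an illustration on full-batch data, and two choices are assumptions rather than details given in the text: the prior precision $I$ is added to the Fisher information of the likelihood so that the preconditioner is positive definite, and no mini-batch rescaling is shown. The function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_posterior(theta, X, y):
    """Unnormalized log p(theta | D) with a standard normal prior."""
    z = X @ theta
    # log sigma(z) = -log(1 + exp(-z)); log sigma(-z) = -log(1 + exp(z))
    log_lik = np.sum(-y * np.logaddexp(0.0, -z) - (1.0 - y) * np.logaddexp(0.0, z))
    return log_lik - 0.5 * theta @ theta

def grad_log_posterior(theta, X, y):
    """Gradient of log_posterior: sum_j (y_j - sigma(theta^T x_j)) x_j - theta."""
    return X.T @ (y - sigmoid(X @ theta)) - theta

def fisher_preconditioner(theta, X):
    """Fisher information of the likelihood plus the prior precision I (assumed)."""
    p = sigmoid(X @ theta)
    return (X * (p * (1.0 - p))[:, None]).T @ X + np.eye(X.shape[1])
```

For logistic regression the Fisher information of the likelihood coincides with its negative Hessian, so `fisher_preconditioner` also equals the negative Hessian of `log_posterior` under these assumptions.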

We consider the binary Covtype² dataset with 581,012 data points and 54 features. We partition the data into 70% for training, 10% for validation and 20% for testing. We use the Adagrad optimizer with a mini-batch size of 256. We choose the best learning rate from [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0] for each method on the validation set. For all the experiments and algorithms, we use n = 20 particles. Results are averaged over 20 random trials.

Results

Figure 2 (a) and (b) show the test accuracy and test log-likelihood of different algorithms. We can see that both Matrix-SVGD (average) and Matrix-SVGD (mixture) converge much faster than both vanilla SVGD and pSGLD, reaching an accuracy of 0.75 in less than 500 iterations.

4.3. Neural Network Regression

Settings

We apply our matrix SVGD to Bayesian neural network regression on UCI datasets. For all experiments, we use a two-layer neural network with 50 hidden units with ReLU activation functions. We assign isotropic Gaussian priors $N (0, α I)$ to the neural network weights. We set α = 0.5. All datasets³ are randomly partitioned into 90% for training and 10% for testing. All results are averaged over 20 random trials, except for Protein and Year, on which 5 random trials are performed. We use n = 10 particles for all methods. We use the Adam optimizer with a mini-batch size of 100; for large datasets such as Year, we set the mini-batch size to be 1000. We use the Fisher information matrix with Kronecker-factored (KFAC) approximation for preconditioning.

Results

Table 1 shows the performance in terms of the test RMSE and the test log-likelihood. We can see that both Matrix-SVGD (average) and Matrix-SVGD (mixture), which use second-order information, achieve better performance than vanilla SVGD. Matrix-SVGD (mixture) yields the best performance for both test RMSE and test log-likelihood in most cases. Figure 2 (c)-(d) show that both variants of Matrix-SVGD converge much faster than vanilla SVGD and pSGLD on the Protein dataset.

Table 1:

Average test RMSE and log-likelihood in test data for UCI regression benchmarks.

	Test RMSE				Test Log-Likelihood

Dataset	pSGLD	Vanilla SVGD	Matrix-SVGD (average)	Matrix-SVGD (mixture)	pSGLD	Vanilla SVGD	Matrix-SVGD (average)	Matrix-SVGD (mixture)
Boston	2.699±0.155	2.785±0.169	2.898±0.184	2.717±0.166	−2.847±0.182	−2.706±0.158	−2.669±0.141	−2.861±0.207
Concrete	5.053±0.124	5.027±0.116	4.869±0.124	4.721±0.111	−3.206±0.056	−3.064±0.034	−3.150±0.054	−3.207±0.071
Energy	0.985±0.024	0.889±0.024	0.795±0.025	0.868±0.025	−1.395±0.029	−1.315±0.020	−1.135±0.026	−1.249±0.036
Kin8nm	0.091±0.001	0.093±0.001	0.092±0.001	0.090±0.001	0.973±0.010	0.964±0.012	0.956±0.011	0.975±0.011
Naval	0.002±0.000	0.004±0.000	0.001±0.000	0.000±0.000	4.535±0.093	4.312±0.087	5.383±0.081	5.639±0.048
Combined	4.042±0.034	4.088±0.033	4.056±0.033	4.029±0.033	−2.821±0.009	−2.832±0.009	−2.824±0.009	−2.817±0.009
Wine	0.641±0.009	0.645±0.009	0.637±0.008	0.637±0.009	−0.984±0.016	−0.997±0.019	−0.980±0.016	−0.988±0.018
Protein	4.300±0.018	4.186±0.017	3.997±0.018	3.852±0.014	−2.874±0.004	−2.846±0.003	−2.796±0.004	−2.755±0.003
Year	8.630±0.007	8.686±0.010	8.637±0.005	8.594±0.009	−3.568±0.002	−3.577±0.002	−3.569±0.001	−3.561±0.002


4.4. Sentence Classification With Recurrent Neural Networks (RNN)

Settings

We consider the sentence classification task on four datasets: MR (Pang & Lee, 2005), CR (Hu & Liu, 2004), SUBJ (Pang & Lee, 2004), and MPQA (Wiebe et al., 2005). We use a recurrent neural network (RNN) based model, $p (y | x) = softmax (w_{y}^{⊤} h_{RNN} (x, v))$ , where x is the input sentence, y is a discrete-valued label of the sentence, and $w_{y}$ is a weight coefficient related to label class y. Here $h_{RNN} (x, v)$ is an RNN function with parameter v using a one-layer bidirectional GRU model (Cho et al., 2014) with 50 hidden units. We apply matrix SVGD to infer the posterior of $w = {w_{y} : \forall y}$ , while updating the RNN weights v using typical stochastic gradient descent. In all experiments, we use the pre-processed text data provided in Gan et al. (2016). For all the datasets, we conduct 10-fold cross-validation for evaluation. We use n = 10 particles for all the methods. For training, we use a mini-batch size of 50 and run all the algorithms for 20 epochs. We use the Fisher information matrix for preconditioning.

Results

Table 2 reports the test classification errors. Matrix-SVGD (mixture) attains the lowest error on MR, CR, and SUBJ, and is second to vanilla SVGD on MPQA (10.71 vs. 10.58); overall it performs best among the compared algorithms.

Table 2: Sentence classification errors measured with four benchmarks.

Method	MR	CR	SUBJ	MPQA
SGLD	20.52	18.65	7.66	11.24
pSGLD	19.75	17.50	6.99	10.80
Vanilla SVGD	19.73	18.07	6.67	10.58
Matrix-SVGD (average)	19.22	17.29	6.76	10.79
Matrix-SVGD (mixture)	19.09	17.13	6.59	10.71

5. Conclusion

We present a generalization of SVGD that leverages general matrix-valued positive definite kernels, which allows us to flexibly incorporate various preconditioning matrices, including Hessian and Fisher information matrices, to improve exploration in the probability landscape. We test the resulting practical algorithms on a range of tasks and demonstrate their efficiency compared with various existing methods.

Supplementary Material

Appendix (supplement): NIHMS1060429-supplement-appendix-_supplement.pdf (21.3 MB, PDF)

Acknowledgement

This work is supported in part by NSF CRII 1830161 and NSF CAREER 1846421. We would like to acknowledge Google Cloud and Amazon Web Services (AWS) for their support.

A. Proof

Proof of Theorem 1

Let $e_{l}$ be the column vector with 1 in the $l$-th coordinate and 0 elsewhere. By the RKHS reproducing property (7) we have

$$
\begin{aligned}
E_{x \sim q} [P^{⊤} ϕ(x)]
&= E_{x \sim q} [\nabla_{x} \log p(x)^{⊤} ϕ(x) + \nabla_{x}^{⊤} ϕ(x)] \\
&= E_{x \sim q} [ϕ(x)^{⊤} \nabla_{x} \log p(x) + \sum_{l=1}^{d} \nabla_{x^{l}} ϕ(x)^{⊤} e_{l}] \\
&= E_{x \sim q} [\langle ϕ(\cdot), K(\cdot, x) \nabla_{x} \log p(x) \rangle_{H_{K}} + \sum_{l=1}^{d} \nabla_{x^{l}} \langle ϕ(\cdot), K(\cdot, x) e_{l} \rangle_{H_{K}}] \\
&= \langle ϕ(\cdot), E_{x \sim q} [K(\cdot, x) \nabla_{x} \log p(x) + \sum_{l=1}^{d} \nabla_{x^{l}} K(\cdot, x) e_{l}] \rangle_{H_{K}} \\
&= \langle ϕ(\cdot), E_{x \sim q} [K(\cdot, x) \nabla_{x} \log p(x) + K(\cdot, x) \nabla_{x}] \rangle_{H_{K}} \\
&= \langle ϕ(\cdot), E_{x \sim q} [K(\cdot, x) P] \rangle_{H_{K}}.
\end{aligned}
$$

The optimization in (8) is hence

$$
\max_{ϕ \in H_{K}} \langle ϕ(\cdot), E_{x \sim q} [K(\cdot, x) P] \rangle_{H_{K}}, \quad \text{s.t.} \quad ‖ϕ‖_{H_{K}} \leq 1,
$$

whose solution, by the Cauchy–Schwarz inequality, is $ϕ^{*}(\cdot) \propto E_{x \sim q} [K(\cdot, x) P]$.
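To make the optimal direction $ϕ^{*}(\cdot) \propto E_{x ~ q}[K(\cdot, x) \nabla_{x} \log p(x) + K(\cdot, x) \nabla_{x}]$ concrete, the following sketch evaluates its empirical version on a set of particles. It is an illustration under stated assumptions rather than the authors' code: the kernel is taken to be $K(x, x') = Q^{-1} \exp(-(x - x')^{⊤} Q (x - x') / (2h))$ with a constant diagonal matrix Q, and the names `matrix_svgd_direction`, `q_diag`, and `bandwidth` are hypothetical. For this kernel the term $K(\cdot, x) \nabla_{x}$ reduces to $k (x_i - x_j) / h$, since $Q^{-1}$ cancels the Q produced by differentiating the exponent.

```python
import math


def matrix_svgd_direction(particles, grad_log_p, q_diag, bandwidth=1.0):
    """Empirical phi*(x_i) = (1/n) sum_j [K(x_i, x_j) grad_log_p(x_j) + K(x_i, x_j) nabla_{x_j}]
    for the assumed kernel K(x, x') = Q^{-1} exp(-(x - x')^T Q (x - x') / (2 h)),
    where Q = diag(q_diag) has strictly positive entries."""
    n = len(particles)
    d = len(q_diag)
    grads = [grad_log_p(x) for x in particles]
    phi = []
    for xi in particles:
        direction = [0.0] * d
        for xj, gj in zip(particles, grads):
            diff = [a - b for a, b in zip(xi, xj)]
            sq = sum(q * t * t for q, t in zip(q_diag, diff))
            k = math.exp(-sq / (2.0 * bandwidth))
            for l in range(d):
                # Driving term: Q^{-1} k grad_log_p(x_j).
                # Repulsive term: Q^{-1} grad_{x_j} k = k (x_i - x_j) / h.
                direction[l] += k * gj[l] / q_diag[l] + k * diff[l] / bandwidth
        phi.append([v / n for v in direction])
    return phi


# Assumed example target: zero-mean Gaussian with precision diag(1, 100),
# so grad log p(x) = -Q x, and Q itself serves as the preconditioner.
q_diag = [1.0, 100.0]


def grad_log_p(x):
    return [-q * t for q, t in zip(q_diag, x)]


particles = [[1.0, 0.1], [-0.5, 0.2], [0.3, -0.1]]
step = 0.1
phi = matrix_svgd_direction(particles, grad_log_p, q_diag)
particles = [[a + step * b for a, b in zip(x, p)] for x, p in zip(particles, phi)]
```

With a single particle the repulsive term vanishes and the direction is the preconditioned gradient $Q^{-1} \nabla_{x} \log p(x)$, which for this Gaussian target equals $-x$.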

Proof of Lemma 2

This is a basic result of RKHS, which can be found in classical textbooks such as Paulsen & Raghupathi (2016). The key idea is to show that $K (x, x^{'})$ satisfies the reproducing property for $H$ . Recall the reproducing property of $H_{0}$ :

$$
ϕ_{0}(x)^{⊤} c = \langle ϕ_{0}, K_{0}(\cdot, x) c \rangle_{H_{0}}, \quad \forall c \in ℝ^{d}.
$$

Taking $ϕ(x) = M(x) ϕ_{0}(t(x))$, we have

$$
ϕ(x)^{⊤} c = \langle ϕ_{0}, K_{0}(\cdot, t(x)) M(x)^{⊤} c \rangle_{H_{0}} = \langle ϕ, M(\cdot) K_{0}(t(\cdot), t(x)) M(x)^{⊤} c \rangle_{H} = \langle ϕ, K(\cdot, x) c \rangle_{H},
$$

where the second step follows from $\langle ϕ, ϕ' \rangle_{H} = \langle ϕ_{0}, ϕ_{0}' \rangle_{H_{0}}$ with $ϕ_{0}'(\cdot) = K_{0}(\cdot, t(x)) M(x)^{⊤} c$.

Proof of Theorem 3

Proof.

Note that the KL divergence is invariant under invertible variable transforms, that is,

$$
KL(q_{[ϵϕ]} ‖ p) = KL(q_{[ϵϕ]0} ‖ p_{0}), \tag{20}
$$

where $p_{0}$ denotes the distribution of $x_{0} = t(x)$ when $x \sim p$, and $q_{[ϵϕ]0}$ denotes the distribution of $x_{0}' = t(x')$ when $x' \sim q_{[ϵϕ]}$. Recall that $q_{[ϵϕ]}$ is defined as the distribution of $x' = x + ϵ ϕ(x)$ when $x \sim q$.

Denote by t⁻¹ the inverse map of t, that is, $t^{- 1} (t (x)) = x$ . We can see that $x_{0}^{'} ~ q_{[ϵ ϕ] 0}$ can be obtained by

$$
\begin{aligned}
x_{0}' &= t(x') && \text{with } x' \sim q_{[ϵϕ]} \\
&= t(x + ϵ ϕ(x)) && \text{with } x \sim q \\
&= t(t^{-1}(x_{0}) + ϵ ϕ(t^{-1}(x_{0}))) && \text{with } x_{0} \sim q_{0} \\
&= x_{0} + ϵ \nabla t(t^{-1}(x_{0})) ϕ(t^{-1}(x_{0})) + O(ϵ^{2}) \\
&= x_{0} + ϵ ϕ_{0}(x_{0}) + O(ϵ^{2}),
\end{aligned} \tag{21}
$$

where we used the definition $ϕ(x) = \nabla t(x)^{-1} ϕ_{0}(t(x))$ in (11), $q_{0}$ denotes the distribution of $x_{0} = t(x)$ when $x \sim q$, and $O(\cdot)$ is the big-O notation.
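The first-order expansion in (21) can be checked numerically. The sketch below uses a one-dimensional example chosen purely for illustration: the map $t(x) = x^{3} + x$ and the direction $ϕ_{0}(x_{0}) = -x_{0}$ are assumptions, not quantities from the paper. It confirms that the gap between $t(x + ϵ ϕ(x))$ and $x_{0} + ϵ ϕ_{0}(x_{0})$ shrinks like $ϵ^{2}$.

```python
def t(x):
    """Assumed invertible (strictly increasing) change of variables."""
    return x ** 3 + x


def dt(x):
    """Derivative of t; plays the role of the Jacobian nabla t(x) in 1D."""
    return 3.0 * x ** 2 + 1.0


def phi0(x0):
    """Assumed perturbation direction in the transformed space."""
    return -x0


def phi(x):
    """Direction in the original space: phi(x) = (nabla t(x))^{-1} phi0(t(x))."""
    return phi0(t(x)) / dt(x)


def residual(x, eps):
    """|t(x + eps phi(x)) - (x0 + eps phi0(x0))|, which should be O(eps^2)."""
    x0 = t(x)
    return abs(t(x + eps * phi(x)) - (x0 + eps * phi0(x0)))


r_coarse = residual(0.7, 1e-2)
r_fine = residual(0.7, 1e-3)
# Shrinking eps by a factor of 10 should shrink the residual by roughly 100.
ratio = r_coarse / r_fine
```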

From Theorem 3.1 of Liu & Wang (2016), we have

$$
\frac{d}{dϵ} KL(q_{[ϵϕ]} ‖ p) \Big|_{ϵ = 0} = - E_{q} [P^{⊤} ϕ].
$$

Using Equation (21) and a derivation similar to that of Theorem 3.1 of Liu & Wang (2016), we can show

$$
\frac{d}{dϵ} KL(q_{[ϵϕ]0} ‖ p_{0}) \Big|_{ϵ = 0} = - E_{q_{0}} [P_{0}^{⊤} ϕ_{0}].
$$

Combining these with (20) proves (11).

Following Lemma 2, when $ϕ_{0}$ is in $H_{0}$ with kernel $K_{0}(x, x')$, $ϕ$ is in $H$ with kernel $K(x, x')$. Therefore, maximizing $E_{q} [P^{⊤} ϕ]$ in $H$ is equivalent to maximizing $E_{q_{0}} [P_{0}^{⊤} ϕ_{0}]$ in $H_{0}$. This suggests that the trajectory of SVGD on $p_{0}$ with $K_{0}$ and that on $p$ with $K$ are equivalent. □

B. Toy Examples

Figure 3 and Figure 4 show results of different algorithms on three 2D toy distributions: Star, Double Banana, and Sine. Details of these distributions and further results are given in Sections B.1–B.3.

Figures 3–4 show that both variants of matrix SVGD consistently outperform SVN and vanilla SVGD. Matrix SVGD (mixture) also tends to outperform Matrix SVGD (average), which is expected: Matrix SVGD (average) uses a single constant preconditioning matrix for all particles and therefore cannot capture different curvatures at different locations. Overall, Matrix SVGD (mixture) yields the best performance.

Figure 4: — The MMD vs. training iteration of different algorithms on the three toy distributions.

B.1. Sine

The density function of the “Sine” distribution is defined by

$$
p(x_{1}, x_{2}) \propto \exp\left(- \frac{(x_{2} + \sin(α x_{1}))^{2}}{2 σ_{1}} - \frac{x_{1}^{2} + x_{2}^{2}}{2 σ_{2}}\right),
$$

where we choose α = 1, σ₁ = 0.003, σ₂ = 1.
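For readers who want to experiment with this target, here is a small sketch of the unnormalized log-density and its gradient, which is the quantity SVGD-type updates consume. The function names are illustrative, and the defaults take the frequency as α = 1 together with σ₁ = 0.003 and σ₂ = 1, which is our reading of the settings above.

```python
import math


def sine_log_density(x1, x2, alpha=1.0, sigma1=0.003, sigma2=1.0):
    """Unnormalized log p(x1, x2) of the Sine distribution."""
    ridge = -(x2 + math.sin(alpha * x1)) ** 2 / (2.0 * sigma1)
    envelope = -(x1 ** 2 + x2 ** 2) / (2.0 * sigma2)
    return ridge + envelope


def sine_grad_log_density(x1, x2, alpha=1.0, sigma1=0.003, sigma2=1.0):
    """Gradient of the unnormalized log-density with respect to (x1, x2)."""
    r = x2 + math.sin(alpha * x1)
    g1 = -r * alpha * math.cos(alpha * x1) / sigma1 - x1 / sigma2
    g2 = -r / sigma1 - x2 / sigma2
    return g1, g2
```

Because σ₁ is small, the mass concentrates on the thin curve $x_{2} = -\sin(α x_{1})$, so the curvature changes sharply from one location to another.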

Figure 5: — The particles obtained by various methods on the toy Sine distribution.

B.2. Double Banana

We use the “double banana” distribution constructed in Detommaso et al. (2018), whose probability density function is

$$
p(x) \propto \exp\left(- \frac{‖x‖_{2}^{2}}{2 σ_{1}} - \frac{(y - F(x))^{2}}{2 σ_{2}}\right),
$$

where $x = [x_{1}, x_{2}] \in ℝ^{2}$, $F(x) = \log((1 - x_{1})^{2} + 100 (x_{2} - x_{1}^{2})^{2})$, $y = \log(30)$, $σ_{1} = 1.0$, and $σ_{2} = 0.09$.
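The density above translates directly into code. The following sketch (illustrative function names, parameter defaults as stated in the text) evaluates the unnormalized log-density; note that $F(x)$ is undefined at $x = (1, 1)$, where the argument of the logarithm is zero, so that point should be avoided.

```python
import math


def banana_forward(x1, x2):
    """F(x) = log((1 - x1)^2 + 100 (x2 - x1^2)^2); undefined at (1, 1)."""
    return math.log((1.0 - x1) ** 2 + 100.0 * (x2 - x1 ** 2) ** 2)


def double_banana_log_density(x1, x2, y=math.log(30.0), sigma1=1.0, sigma2=0.09):
    """Unnormalized log p(x) of the double banana distribution."""
    prior = -(x1 ** 2 + x2 ** 2) / (2.0 * sigma1)
    misfit = -(y - banana_forward(x1, x2)) ** 2 / (2.0 * sigma2)
    return prior + misfit
```

On the level set $F(x) = y$ the misfit term vanishes; for example, at $x_{1} = 0$ this happens at both $x_{2} = \sqrt{0.29}$ and $x_{2} = -\sqrt{0.29}$, one point on each of the two banana-shaped branches.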

Figure 6: — The particles obtained by various methods on the double banana distribution.

B.3. Star

We construct the “star” distribution with a Gaussian mixture model, whose density function is

$$
p(x) = \frac{1}{K} \sum_{i=1}^{K} N(x; μ_{i}, Σ_{i}),
$$

with $x \in ℝ^{2}$, $μ_{1} = [0; 1.5]$, $Σ_{1} = diag([1; \frac{1}{100}])$, and the remaining means and covariance matrices obtained by rotating the previous mean and covariance matrix. To be precise,

$$
μ_{i+1} = U μ_{i}, \quad Σ_{i+1} = U Σ_{i} U^{⊤}, \quad U = \begin{bmatrix} \cos(θ) & \sin(θ) \\ - \sin(θ) & \cos(θ) \end{bmatrix},
$$

with angle $θ = \frac{2π}{K}$. We set the number of components K to be 5.
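The recursive construction above can be sketched as follows. This is an illustrative, standard-library-only implementation (the helper names `rotate`, `conjugate`, `star_components`, `gaussian_pdf`, and `star_density` are ours) that builds the K = 5 components by repeated rotation and evaluates the mixture density with a closed-form 2×2 Gaussian.

```python
import math


def rotate(u, v):
    """Matrix-vector product u v for a 2x2 matrix u."""
    return [u[0][0] * v[0] + u[0][1] * v[1], u[1][0] * v[0] + u[1][1] * v[1]]


def conjugate(u, s):
    """Return u s u^T for 2x2 matrices."""
    us = [[sum(u[i][k] * s[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    return [[sum(us[i][k] * u[j][k] for k in range(2)) for j in range(2)] for i in range(2)]


def star_components(k=5):
    """Means and covariances generated by mu_{i+1} = U mu_i, Sigma_{i+1} = U Sigma_i U^T."""
    theta = 2.0 * math.pi / k
    c, s = math.cos(theta), math.sin(theta)
    u = [[c, s], [-s, c]]
    mu = [0.0, 1.5]
    sigma = [[1.0, 0.0], [0.0, 0.01]]
    comps = []
    for _ in range(k):
        comps.append((mu, sigma))
        mu = rotate(u, mu)
        sigma = conjugate(u, sigma)
    return comps


def gaussian_pdf(x, mu, sigma):
    """Density of a 2D Gaussian, using the closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def star_density(x, comps):
    """Equal-weight mixture density p(x) = (1/K) sum_i N(x; mu_i, Sigma_i)."""
    return sum(gaussian_pdf(x, mu, sigma) for mu, sigma in comps) / len(comps)


components = star_components()
```

Since U is a rotation by 2π/K, applying it K times returns to the first component, so the mixture is invariant under rotation by θ.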

Figure 7: — The particles obtained by various methods on the star-shaped distribution.

Footnotes

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

https://archive.ics.uci.edu/ml/datasets.php

References

Alvarez MA, Rosasco L, Lawrence ND, et al. Kernels for vector-valued functions: A review. Foundations and Trends® in Machine Learning, 4(3):195–266, 2012.
Berlinet A and Thomas-Agnan C. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
Blei DM, Kucukelbir A, and McAuliffe JD. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
Carmeli C, De Vito E, and Toigo A. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4(04):377–408, 2006.
Chen WY, Barp A, Briol F-X, Gorham J, Girolami M, Mackey L, Oates C, et al. Stein point Markov chain Monte Carlo. International Conference on Machine Learning (ICML), 2019.
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, and Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
Detommaso G, Cui T, Marzouk Y, Spantini A, and Scheichl R. A Stein variational Newton method. In Advances in Neural Information Processing Systems, pp. 9187–9197, 2018.
Duchi J, Hazan E, and Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(July):2121–2159, 2011.
Gan Z, Li C, Chen C, Pu Y, Su Q, and Carin L. Scalable Bayesian learning of recurrent neural networks for language modeling. ACL, 2016.
Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, and Smola A. A kernel two-sample test. Journal of Machine Learning Research, 13(March):723–773, 2012.
Haarnoja T, Tang H, Abbeel P, and Levine S. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
Han J and Liu Q. Stein variational gradient descent without gradient. In International Conference on Machine Learning, 2018.
Hoffman MD and Gelman A. The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
Hu M and Liu B. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 168–177. ACM, 2004.
Kim T, Yoon J, Dia O, Kim S, Bengio Y, and Ahn S. Bayesian model-agnostic meta-learning. arXiv preprint arXiv:1806.03836, 2018.
Li C, Chen C, Carlson DE, and Carin L. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In AAAI, volume 2, pp. 4, 2016.
Liu C and Zhu J. Riemannian Stein variational gradient descent for Bayesian inference. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Liu Q. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pp. 3115–3123, 2017.
Liu Q and Wang D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378–2386, 2016.
Liu Q and Wang D. Stein variational gradient descent as moment matching. In Advances in Neural Information Processing Systems, pp. 8867–8876, 2018.
Martens J and Grosse R. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.
Martens J, Ba J, and Johnson M. Kronecker-factored curvature approximations for recurrent neural networks. In ICLR, 2018.
Neal RM et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
Pang B and Lee L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics, pp. 271. Association for Computational Linguistics, 2004.
Pang B and Lee L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd annual meeting on Association for Computational Linguistics, pp. 115–124. Association for Computational Linguistics, 2005.
Paulsen VI and Raghupathi M. An introduction to the theory of reproducing kernel Hilbert spaces, volume 152. Cambridge University Press, 2016.
Pu Y, Gan Z, Henao R, Li C, Han S, and Carin L. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems, pp. 4236–4245, 2017.
Wainwright MJ, Jordan MI, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
Wang D and Liu Q. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.
Wang D, Zeng Z, and Liu Q. Stein variational message passing for continuous graphical models. In International Conference on Machine Learning, pp. 5206–5214, 2018.
Wiebe J, Wilson T, and Cardie C. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2–3):165–210, 2005.
Zhuo J, Liu C, Shi J, Zhu J, Chen N, and Zhang B. Message passing Stein variational gradient descent. In International Conference on Machine Learning, pp. 6013–6022, 2018.
