Author manuscript; available in PMC: 2021 Jul 23.
Published in final edited form as: Biometrika. 2020 Jul 13;107(4):997–1004. doi: 10.1093/biomet/asaa029

Extended Stochastic Gradient MCMC for Large-Scale Bayesian Variable Selection

Qifan Song 1, Yan Sun 1, Mao Ye 1, Faming Liang 1
PMCID: PMC8302213  NIHMSID: NIHMS1645699  PMID: 34305153

Summary

Stochastic gradient Markov chain Monte Carlo (MCMC) algorithms have received much attention in Bayesian computing for big data problems, but they are applicable only to a small class of problems for which the parameter space has a fixed dimension and the log-posterior density is differentiable with respect to the parameters. This paper proposes an extended stochastic gradient MCMC algorithm which, by introducing appropriate latent variables, can be applied to more general large-scale Bayesian computing problems, such as those involving dimension jumping and missing data. Numerical studies show that the proposed algorithm is highly scalable and much more efficient than traditional MCMC algorithms, substantially alleviating the computational burden of Bayesian methods in big data problems.

Keywords: Dimension Jumping, Missing Data, Stochastic Gradient Langevin Dynamics, Subsampling

1. Introduction

After six decades of continual development, MCMC has proven to be a powerful and often indispensable computational tool for analyzing data with complex structures. However, for large datasets its computational cost can be prohibitive, as all of the data must be processed at each iteration. To tackle this difficulty, a variety of scalable algorithms have been proposed in the recent literature. According to the strategies they employ, these algorithms can be grouped into a few categories: stochastic gradient MCMC algorithms (Welling & Teh, 2011; Ding et al., 2014; Ahn et al., 2012; Chen et al., 2014; Betancourt, 2015; Ma et al., 2015; Nemeth & Fearnhead, 2019), split-and-merge algorithms (Scott et al., 2016; Srivastava et al., 2018; Xue & Liang, 2019), mini-batch Metropolis-Hastings algorithms (Chen et al., 2016; Korattikara et al., 2014; Bardenet et al., 2014; Maclaurin & Adams, 2014; Bardenet et al., 2017), nonreversible Markov process-based algorithms (Bierkens et al., 2019; Bouchard-Côté et al., 2018), and discrete sampling algorithms based on the multi-armed bandit (Chen & Ghahramani, 2016).

Although scalable algorithms have been developed for both continuous and discrete sampling problems, they are difficult to apply to dimension-jumping problems. A canonical example is variable selection, where the number of parameters changes from iteration to iteration in MCMC simulations. Under their current settings, the stochastic gradient MCMC and nonreversible Markov process-based algorithms are applicable only to problems for which the parameter space has a fixed dimension and the log-posterior density is differentiable with respect to the parameters. For the split-and-merge algorithms, it is unclear how to aggregate samples of different dimensions drawn from the posterior distributions based on different subsets of the data. The multi-armed bandit algorithms are applicable only to problems with a small discrete domain and can be extremely inefficient for high-dimensional variable selection. The mini-batch Metropolis-Hastings algorithms are in principle applicable to dimension-jumping problems, but they are generally difficult to use. For example, the algorithms of Chen et al. (2016), Korattikara et al. (2014) and Bardenet et al. (2014) perform approximate acceptance tests using subsets of the data; the amount of data consumed by each test varies significantly from one iteration to another, which compromises their scalability. The algorithms of Maclaurin & Adams (2014) and Bardenet et al. (2017) perform exact tests but require a lower bound on the parameter distribution across its domain, which is usually difficult to obtain.

This paper proposes an extended stochastic gradient Langevin dynamics algorithm which, by introducing appropriate latent variables, extends the stochastic gradient Langevin dynamics algorithm to more general large-scale Bayesian computing problems such as variable selection and missing data. The extended algorithm is highly scalable and much more efficient than traditional MCMC algorithms. Compared to the mini-batch Metropolis-Hastings algorithms, it is much easier to use: it processes only a fixed amount of data at each iteration and does not require any lower bound on the parameter distribution.

2. A Brief Review of Stochastic Gradient Langevin Dynamics

Let $X_N = (X_1, X_2, \ldots, X_N)$ denote a set of $N$ independent and identically distributed samples drawn from the distribution $f(x\mid\theta)$, where $N$ is the sample size and $\theta$ is the parameter. Let $p(X_N\mid\theta) = \prod_{i=1}^N f(X_i\mid\theta)$ denote the likelihood function, let $\pi(\theta)$ denote the prior distribution of $\theta$, and let $\log\pi(\theta\mid X_N) = \log p(X_N\mid\theta) + \log\pi(\theta) + \text{constant}$ denote the log-posterior density function. If $\theta$ has a fixed dimension and $\log\pi(\theta\mid X_N)$ is differentiable with respect to $\theta$, then the stochastic gradient Langevin dynamics algorithm (Welling & Teh, 2011) can be applied to simulate from the posterior, iterating by

$$\theta_{t+1} = \theta_t + \frac{\epsilon_{t+1}}{2}\,\widehat{\nabla}_\theta \log\pi(\theta_t\mid X_N) + \sqrt{\epsilon_{t+1}\tau}\,\eta_{t+1}, \qquad \eta_{t+1}\sim N(0, I_d), \tag{1}$$

where $d$ is the dimension of $\theta$, $I_d$ is the $d\times d$ identity matrix, $\epsilon_{t+1}$ is the step size (also known as the learning rate), $\tau$ is the temperature, and $\widehat{\nabla}_\theta\log\pi(\theta_t\mid X_N)$ denotes an estimate of $\nabla_\theta\log\pi(\theta_t\mid X_N)$ based on a mini-batch of samples. The learning rate can be decreasing or kept constant. For the former, the convergence of the algorithm was studied in Teh et al. (2016); for the latter, in Sato & Nakagawa (2014) and Dalalyan & Karagulyan (2017). Refer to Nemeth & Fearnhead (2019) for further discussion of the theory, implementation and variants of this algorithm.
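To make the update (1) concrete, the sketch below implements one constant-step SGLD move for a Bayesian linear regression with unit error variance and a Gaussian prior; the model, the prior scale and all function and variable names are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def sgld_step(theta, X, y, N, eps, tau=1.0, prior_var=1.0, rng=None):
    """One SGLD update (1) on a mini-batch (X, y) of size n drawn from N samples.

    Illustrative assumptions: a linear model y ~ N(X theta, 1) with a
    N(0, prior_var * I) prior. The mini-batch log-likelihood gradient is
    rescaled by N/n so that it is unbiased for the full-data gradient.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    grad_loglik = X.T @ (y - X @ theta)            # mini-batch log-likelihood gradient
    grad_logprior = -theta / prior_var             # gradient of the Gaussian log-prior
    grad = (N / n) * grad_loglik + grad_logprior   # estimate of the full-data gradient
    noise = np.sqrt(eps * tau) * rng.standard_normal(theta.shape)
    return theta + 0.5 * eps * grad + noise
```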

3. An Extended Stochastic Gradient Langevin Dynamics Algorithm

To extend the stochastic gradient Langevin dynamics algorithm to varying-dimensional problems such as variable selection and missing data, we first establish an identity for evaluating $\nabla_\theta\log\pi(\theta\mid X_N)$ in the presence of latent variables. As illustrated below, the latent variable can be the model indicator in variable selection problems or the missing values in missing-data problems.

Lemma 1.

For any latent variable ϑ,

$$\nabla_\theta \log\pi(\theta\mid X_N) = \int \nabla_\theta \log\pi(\theta\mid\vartheta, X_N)\,\pi(\vartheta\mid\theta, X_N)\,d\vartheta, \tag{2}$$

where $\pi(\vartheta\mid\theta, X_N)$ and $\pi(\theta\mid\vartheta, X_N)$ denote the conditional distributions of $\vartheta$ and $\theta$, respectively.

Lemma 1 provides a Monte Carlo estimator of $\nabla_\theta\log\pi(\theta\mid X_N)$: average $\nabla_\theta\log\pi(\theta\mid\vartheta, X_N)$ over samples drawn from the conditional distribution $\pi(\vartheta\mid\theta, X_N)$. The identity (2) is similar to Fisher's identity, which has been used to evaluate the gradient of the log-likelihood function in the presence of latent variables; see e.g. Cappé et al. (2005). When $N$ is large, the computation can be accelerated by subsampling. Let $X_n$ denote a subsample, where $n$ denotes the subsample size. Without loss of generality, we assume that $N$ is a multiple of $n$, i.e., $N/n$ is an integer. Let $X_{n,N} = \{X_n, \ldots, X_n\}$ denote the dataset obtained by duplicating the subsample $N/n$ times, whose total sample size is also $N$. It follows from (2) that

$$\nabla_\theta \log\pi(\theta\mid X_{n,N}) = \int \nabla_\theta \log\pi(\theta\mid\vartheta, X_{n,N})\,\pi(\vartheta\mid\theta, X_{n,N})\,d\vartheta. \tag{3}$$

Since $\nabla_\theta\log\pi(\theta\mid X_{n,N}) = \nabla_\theta\log p(X_{n,N}\mid\theta) + \nabla_\theta\log\pi(\theta)$ and $\log p(X_{n,N}\mid\theta)$ is unbiased for $\log p(X_N\mid\theta)$, $\nabla_\theta\log\pi(\theta\mid X_{n,N})$ forms an unbiased estimator of $\nabla_\theta\log\pi(\theta\mid X_N)$. Sampling from $\pi(\vartheta\mid\theta, X_{n,N})$ can be much faster than sampling from $\pi(\vartheta\mid\theta, X_N)$, as for the former the likelihood needs to be evaluated only on a mini-batch of samples.
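In practice the duplicated dataset $X_{n,N}$ never needs to be formed: it contains $N/n$ identical copies of the subsample, so its log-likelihood gradient is simply the mini-batch gradient scaled by $N/n$. The following minimal check illustrates this under the same illustrative Gaussian linear model as in the sketch of Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, p = 1000, 100, 5
X, theta = rng.standard_normal((n, p)), rng.standard_normal(p)
y = X @ theta + rng.standard_normal(n)

# Gradient of log p(X_{n,N} | theta): explicit duplication vs. N/n rescaling.
X_dup, y_dup = np.tile(X, (N // n, 1)), np.tile(y, N // n)
grad_dup = X_dup.T @ (y_dup - X_dup @ theta)
grad_scaled = (N / n) * (X.T @ (y - X @ theta))
assert np.allclose(grad_dup, grad_scaled)
```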

3.1. Bayesian Variable Selection

As an illustrative example, we consider the problem of variable selection for linear regression

$$Y = z^T\beta + \varepsilon, \tag{4}$$

where $\varepsilon$ is a zero-mean Gaussian random error with variance $\sigma^2$, $\beta\in\mathbb{R}^p$ is the vector of regression coefficients, and $z = (z_1, z_2, \ldots, z_p)$ is the vector of explanatory variables. Let $\gamma_S = (\gamma_{S,1}, \ldots, \gamma_{S,p})$ be a binary vector indicating the variables included in model $S$, and let $\beta_S$ be the vector of regression coefficients associated with model $S$. From the Bayesian perspective, we are interested in estimating the posterior probability $\pi(\gamma_S\mid X_N)$ for each model $S\in\mathcal{S}$ and the posterior mean $\pi(\rho) = \int \rho(\beta)\,\pi(\beta\mid X_N)\,d\beta$ for some integrable function $\rho(\cdot)$, where $\mathcal{S}$ comprises $2^p$ models. Both quantities can be estimated using the reversible jump Metropolis-Hastings algorithm (Green, 1995) by sampling from the posterior distribution $\pi(\gamma_S, \beta_S\mid X_N)$. However, when $N$ is large, the algorithm can be extremely slow due to repeated scans of the full dataset.

As mentioned above, the existing stochastic gradient MCMC algorithms cannot be directly applied to simulate from $\pi(\gamma_S, \beta_S\mid X_N)$ because of the dimension jumping involved in model transitions. To address this issue, we introduce an auxiliary variable $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$, which links $\gamma_S$ and $\beta_S$ through

$$\beta_S = \theta * \gamma_S = (\theta_1\gamma_{S,1}, \theta_2\gamma_{S,2}, \ldots, \theta_p\gamma_{S,p}), \tag{5}$$

where $*$ denotes elementwise multiplication. Let $\theta_{[S]} = \{\theta_i : \gamma_{S,i} = 1,\ i = 1, 2, \ldots, p\}$ and $\theta_{[-S]} = \{\theta_i : \gamma_{S,i} = 0,\ i = 1, 2, \ldots, p\}$ be the subvectors of $\theta$ corresponding to the nonzero and zero elements of $\gamma_S$, respectively. Note that $\beta_S$ is sparse, with all elements in $\theta_{[-S]}$ set to zero, while $\theta$ can be dense. Based on the relation (5), we suggest simulating from $\pi(\theta\mid X_N)$ using the stochastic gradient Langevin dynamics algorithm, for which the gradient $\nabla_\theta\log\pi(\theta\mid X_N)$ can be evaluated using Lemma 1 by treating $\gamma_S$ as the latent variable. Let $\pi(\theta)$ denote the prior of $\theta$. To simplify the computation of $\nabla_\theta\log\pi(\theta\mid\gamma_S, X_N)$, we further assume the a priori independence $\pi(\theta\mid\gamma_S) = \pi(\theta_{[S]}\mid\gamma_S)\,\pi(\theta_{[-S]}\mid\gamma_S)$. It is then easy to derive

$$\nabla_\theta \log\pi(\theta\mid\gamma_S, X_N) = \begin{cases} \nabla_{\theta_{[S]}}\log p(X_N\mid\theta_{[S]},\gamma_S) + \nabla_{\theta_{[S]}}\log\pi(\theta_{[S]}\mid\gamma_S), & \text{for the component } \theta_{[S]},\\[4pt] \nabla_{\theta_{[-S]}}\log\pi(\theta_{[-S]}\mid\gamma_S), & \text{for the component } \theta_{[-S]}, \end{cases}$$

which can be used to evaluate $\nabla\log\pi(\theta\mid X_N)$ by Lemma 1. If a mini-batch of data is used, the gradient can be evaluated based on (3). This leads to the following extended stochastic gradient Langevin dynamics algorithm.

Algorithm 1.

[Extended Stochastic Gradient Langevin Dynamics for Bayesian Variable Selection]

  1. (Subsampling) Draw a subsample of size $n$ (with or without replacement) from the full dataset $X_N$ at random, and denote the subsample by $X_n^{(t)}$, where $t$ indexes the iteration.

  2. (Simulating models) Simulate models $\gamma_{S_1,n}^{(t)}, \ldots, \gamma_{S_m,n}^{(t)}$ from the conditional posterior $\pi(\gamma_S\mid\theta^{(t)}, X_{n,N}^{(t)})$ by running a short Markov chain, where $X_{n,N}^{(t)} = \{X_n^{(t)}, \ldots, X_n^{(t)}\}$ and $\theta^{(t)}$ is the sample of $\theta$ at iteration $t$.

  3. (Updating $\theta$) Update $\theta^{(t)}$ by setting $\theta^{(t+1)} = \theta^{(t)} + \frac{\epsilon_{t+1}}{2m}\sum_{k=1}^m \nabla_\theta\log\pi(\theta^{(t)}\mid\gamma_{S_k,n}^{(t)}, X_{n,N}^{(t)}) + \sqrt{\epsilon_{t+1}\tau}\,\eta_{t+1}$, where $\epsilon_{t+1}$ is the learning rate, $\eta_{t+1}\sim N(0, I_p)$, $\tau$ is the temperature, and $p$ is the dimension of $\theta$.
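A minimal sketch of one full iteration of Algorithm 1 for the linear model (4) is given below. The spike-and-slab prior ($\theta_j\mid\gamma_{S,j}=1 \sim N(0, v_1)$, $\theta_j\mid\gamma_{S,j}=0 \sim N(0, v_0)$, $\gamma_{S,j}\sim\mathrm{Bernoulli}(q)$), the single-flip Metropolis sampler for Step 2, and all names and default values are our own illustrative assumptions; the paper's experiments use a hierarchical prior and a reversible jump sampler detailed in its supplementary material.

```python
import numpy as np

def esgld_iteration(theta, data_X, data_y, n, m, eps, tau=1.0,
                    v1=10.0, v0=0.1, q=0.01, flips=50, rng=None):
    """One iteration of Algorithm 1 for the linear model (4), as a sketch.

    Illustrative assumptions (not from the paper): sigma^2 = 1, and
    theta_j | gamma_j=1 ~ N(0, v1), theta_j | gamma_j=0 ~ N(0, v0),
    gamma_j ~ Bernoulli(q) independently.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, p = data_X.shape
    scale = N / n                                  # duplication factor for X_{n,N}

    # Step 1 (subsampling): draw a mini-batch of size n at random.
    idx = rng.choice(N, size=n, replace=False)
    X, y = data_X[idx], data_y[idx]

    def log_post(gamma):
        # log pi(gamma | theta, X_{n,N}) up to an additive constant.
        loglik = -0.5 * scale * np.sum((y - X @ (theta * gamma)) ** 2)
        v = np.where(gamma == 1, v1, v0)
        logprior = np.sum(gamma * np.log(q) + (1 - gamma) * np.log(1 - q)
                          - 0.5 * theta ** 2 / v - 0.5 * np.log(v))
        return loglik + logprior

    # Step 2 (simulating models): a short single-flip Metropolis chain on gamma.
    gamma = (rng.random(p) < q).astype(float)
    cur, draws = log_post(gamma), []
    for _ in range(flips):
        j = rng.integers(p)
        prop = gamma.copy()
        prop[j] = 1.0 - prop[j]
        new = log_post(prop)
        if np.log(rng.random()) < new - cur:
            gamma, cur = prop, new
        draws.append(gamma.copy())
    models = draws[-m:]                            # keep last m models (flips >= m)

    # Step 3 (updating theta): average the m conditional gradients, then move.
    grad = np.zeros(p)
    for g in models:
        v = np.where(g == 1, v1, v0)
        grad += g * scale * (X.T @ (y - X @ (theta * g))) - theta / v
    theta_new = (theta + 0.5 * eps * grad / m
                 + np.sqrt(eps * tau) * rng.standard_normal(p))
    return theta_new, models
```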

Theorem 1 justifies the validity of this algorithm with the proof given in the Appendix.

Theorem 1.

Assume that conditions (A.1)-(A.3) (given in the Appendix) hold, that $m$, $p$, $n$ increase with $N$ such that $N \succ n \succ p$ and $m \succ p^{1/2}$, and that a constant learning rate $\epsilon \prec 1/N$ is used. Then, as $N\to\infty$:

  1. $W_2(\pi_t, \pi^*)\to 0$ as $t\to\infty$, where $\pi_t$ denotes the distribution of $\theta^{(t)}$, $\pi^* = \pi(\theta\mid X_N)$, and $W_2(\cdot,\cdot)$ denotes the second-order Wasserstein distance between two distributions.

  2. If $\rho(\theta)$ is $\alpha$-Lipschitz for some constant $\alpha > 0$, then $\sum_{t=1}^T \rho(\theta^{(t)})/T \overset{p}{\to} \pi^*(\rho)$ as $T\to\infty$, where $\overset{p}{\to}$ denotes convergence in probability and $\pi^*(\rho) = \int_\Theta \rho(\theta)\,\pi(\theta\mid X_N)\,d\theta$.

  3. If (A.4) further holds, then $\sum_{t=1}^T\sum_{i=1}^m I(\gamma_{S_i,n}^{(t)} = \gamma_S)/(mT) - \pi(\gamma_S\mid X_N) \overset{p}{\to} 0$ as $T\to\infty$.

Part (i) establishes the weak convergence of $\theta^{(t)}$: if the total sample size $N$ and the iteration number $t$ are sufficiently large, and the subsample size $n$ and the number of models $m$ simulated at each iteration are reasonably large, then the distribution $\pi_t$ converges to the true posterior $\pi(\theta\mid X_N)$ in the second-order Wasserstein distance. Refer to Gibbs & Su (2002) for discussion of the relation between the Wasserstein distance and other probability metrics. Parts (ii) and (iii) address our general interest in estimating the posterior mean and the posterior probability, respectively, based on the samples simulated by Algorithm 1. For parts (i), (ii) and (iii), explicit convergence rates are given in equations (3), (5) and (10) of the Appendix, respectively.

For the condition $m \succ p^{1/2}$, $p$ can be approximately treated as the maximum size of the models under consideration, which is of the same order as the true model. Therefore, $m$ can be fairly small under the model sparsity assumption. Theorem 1 is established with a constant learning rate. In practice, one may use a decaying learning rate; see e.g. Teh et al. (2016), where it is suggested to set $\epsilon_t = O(1/t^\kappa)$ for some $0 < \kappa \leq 1$. For a decaying learning rate, Teh et al. (2016) recommended weighted averaging estimators for $\pi^*(\rho)$. Theorem 2 shows that the unweighted averaging estimators used above still work if the learning rate decays slowly, at a rate $\epsilon_t = O(1/t^\kappa)$ for $0 < \kappa < 1$; if $\kappa = 1$, the weighted averaging estimators are still needed. The proof of Theorem 2 is given in the supplementary material.
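For reference, such a slowly decaying schedule can be written in one line; eps0 and kappa below are tuning constants chosen by the user, not values prescribed by the paper.

```python
def learning_rate(t, eps0=1e-6, kappa=0.33):
    """Polynomially decaying step size eps_t = O(1/t^kappa) with 0 < kappa < 1."""
    return eps0 / t ** kappa
```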

Theorem 2.

Assume the conditions of Theorem 1 hold. If a decaying learning rate ϵt = O(1/tκ) is used for some 0 < κ < 1, then parts (i), (ii) and (iii) of Theorem 1 are still valid.

3.2. Missing Data

Missing data are ubiquitous in fields ranging from science to technology. However, in the big data setting, how to conduct Bayesian analysis in the presence of missing data remains unclear. The existing data-augmentation algorithm (Tanner & Wong, 1987) is based on the full data and can thus be extremely slow. In this context, we let $X_N$ denote the incomplete data and let $\theta$ denote the model parameters. If we treat the missing values as latent variables, then Lemma 1 can be used to evaluate the gradient $\nabla_\theta\log\pi(\theta\mid X_N)$. However, Algorithm 1 cannot be directly applied to missing data problems, since the imputation of the missing values needs to depend only on the subsample. To address this issue, we propose Algorithm S1 (given in the supplementary material), in which the missing values $\vartheta$ are imputed from $\pi(\vartheta\mid\theta, X_n)$ at each iteration. Theorems 1 and 2 remain applicable to this algorithm.
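Algorithm S1 itself is given in the supplementary material; purely as a hedged illustration of its structure, the sketch below runs one extended SGLD step for a toy bivariate Gaussian model with missing second coordinates, where the imputation step draws the missing values from their conditional distribution $\pi(\vartheta\mid\theta, X_n)$. The model, the flat prior and all names are our own assumptions.

```python
import numpy as np

def esgld_missing_step(theta, X_obs, miss_mask, N, eps, rho=0.5, tau=1.0, rng=None):
    """One extended SGLD step for a bivariate N(theta, Sigma) model with
    Sigma = [[1, rho], [rho, 1]] known and some second coordinates missing.

    Illustrative assumptions: a flat prior on theta, missingness only in
    coordinate 2, and missing entries imputed from their Gaussian
    conditional given coordinate 1 and the current theta.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X_obs.shape[0]
    X = X_obs.copy()

    # Impute missing second coordinates from pi(vartheta | theta, X_n):
    # x2 | x1 ~ N(theta2 + rho*(x1 - theta1), 1 - rho^2).
    k = miss_mask.sum()
    X[miss_mask, 1] = (theta[1] + rho * (X[miss_mask, 0] - theta[0])
                       + np.sqrt(1 - rho ** 2) * rng.standard_normal(k))

    # Complete-data gradient, rescaled by N/n for the mini-batch.
    Sigma_inv = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))
    grad = (N / n) * Sigma_inv @ (X - theta).sum(axis=0)
    return theta + 0.5 * eps * grad + np.sqrt(eps * tau) * rng.standard_normal(2)
```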

4. An Illustrative Example

This section illustrates the performance of Algorithm 1 using a simulated example; more numerical examples are presented in the supplementary material. Ten synthetic datasets were generated from the model (4) with $N = 50{,}000$, $p = 2001$, $\sigma^2 = 1$, $\beta_1 = \cdots = \beta_5 = 1$, $\beta_6 = \beta_7 = \beta_8 = -1$, and $\beta_0 = \beta_9 = \cdots = \beta_p = 0$, where $\sigma^2$ is assumed to be known and the explanatory variables are normally distributed with a mutual correlation coefficient of 0.5. A hierarchical prior was assumed for the model and parameters, with details given in the supplementary material. For each dataset, Algorithm 1 was run for 5000 iterations with $n = 200$, $m = 10$ and the learning rate $\epsilon_t \equiv 10^{-6}$; the first 2000 iterations were discarded as burn-in, and the samples generated in the remaining iterations were used for inference. At each iteration, the reversible jump Metropolis-Hastings algorithm (Green, 1995) was used to simulate the models $\gamma_{S_i,n}^{(t)}$, $i = 1, 2, \ldots, m$, with details given in the supplementary material.

Table 1 summarizes the performance of the algorithm, where the false selection rate (FSR), negative selection rate (NSR), mean squared error for false predictors (MSE0) and mean squared error for true predictors (MSE1) are defined in the supplementary material. Variables were selected according to the median probability rule (Barbieri & Berger, 2004), which selects only the variables whose marginal inclusion probability exceeds 0.5. The Bayesian estimates of the parameters were obtained by averaging over a set of thinned (by a factor of 10) posterior samples. For comparison, some existing algorithms were applied to this example, with the results given in Table 1 and the implementation details given in the supplementary material. The comparison shows that the proposed algorithm substantially reduces the computational burden of Bayesian methods in big data analysis.
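Given the retained draws, the median probability rule and the parameter estimates reported in Table 1 can be computed along the following lines; the array names and shapes are illustrative assumptions about how the output of Algorithm 1 is stored.

```python
import numpy as np

def select_and_estimate(gammas, thetas, thin=10):
    """Apply the median probability rule and form Bayesian parameter estimates.

    Illustrative shapes: gammas is (T*m, p), stacking all gamma draws after
    burn-in; thetas is (T, p), the theta draws after burn-in.
    """
    incl_prob = gammas.mean(axis=0)           # marginal inclusion probabilities
    selected = incl_prob > 0.5                # median probability rule
    beta_hat = thetas[::thin].mean(axis=0) * selected  # thinned posterior mean, masked
    return selected, beta_hat
```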

Table 1.

Bayesian variable selection with the extended stochastic gradient Langevin dynamics (eSGLD), reversible jump Metropolis-Hastings (RJMH), split-and-merge (SaM) and Bayesian Lasso (B-Lasso) algorithms. FSR, NSR, MSE1 and MSE0 are averages over 10 datasets, with standard deviations in parentheses; the CPU time (in minutes) was recorded for one dataset on a Linux machine with an Intel Core i7-3770 CPU @ 3.40GHz.

| Algorithm | FSR | NSR | MSE1 | MSE0 | CPU (min) |
| --- | --- | --- | --- | --- | --- |
| eSGLD | 0 (0) | 0 (0) | $2.91 \times 10^{-3}$ ($1.90 \times 10^{-3}$) | $1.26 \times 10^{-7}$ ($1.18 \times 10^{-8}$) | 3.3 |
| RJMH | 0.50 (0.10) | 0.16 (0.042) | $1.60 \times 10^{-1}$ ($3.89 \times 10^{-2}$) | $2.64 \times 10^{-5}$ ($8.75 \times 10^{-6}$) | 180.1 |
| SaM | 0.05 (0.05) | 0.013 (0.013) | $1.29 \times 10^{-2}$ ($1.27 \times 10^{-2}$) | $1.01 \times 10^{-6}$ ($1.00 \times 10^{-6}$) | 150.4 |
| B-Lasso | 0 (0) | 0 (0) | $2.32 \times 10^{-4}$ ($3.58 \times 10^{-5}$) | $1.40 \times 10^{-7}$ ($5.08 \times 10^{-9}$) | 32.8 |

5. Discussion

This paper has extended the stochastic gradient Langevin dynamics algorithm to general large-scale Bayesian computing problems, such as those involving dimension jumping and missing data. To the best of our knowledge, this paper provides the first Bayesian method and theory for high-dimensional discrete parameter estimation with mini-batch samples; the existing methods work only for continuous parameters or very low-dimensional discrete problems. Beyond generalized linear models, the proposed algorithm has many potential applications in data science. For example, it can be used for sparse deep learning and for accelerating computation in statistical models where latent variables are involved, such as hidden Markov models, random coefficient models and model-based clustering.

Algorithm 1 can be further extended by updating θ using a variant of stochastic gradient Langevin dynamics, such as stochastic gradient Hamiltonian Monte Carlo (Chen et al., 2014), stochastic gradient thermostats (Ding et al., 2014), stochastic gradient Fisher scoring (Ahn et al., 2012), or preconditioned stochastic gradient Langevin dynamics (Li et al., 2016). We expect that the advantages of these variants (over stochastic gradient Langevin dynamics) can be carried over to the extension.


Acknowledgements

This work was partially supported by the grants DMS-1811812, DMS-1818674, and R01-GM126089. The authors thank the editor, associate editor and referees for their insightful comments/suggestions.

Appendix

A.1. Proof of Lemma 1

Proof. Let π(θ) denote the prior density of θ, and let π(ϑ) denote the density of ϑ. Then

$$\begin{aligned}
\nabla_\theta\log\pi(\theta\mid X_N) &= \nabla_\theta\log p(X_N\mid\theta) + \nabla_\theta\log\pi(\theta)\\
&= \frac{1}{p(X_N\mid\theta)}\int \nabla_\theta\, p(X_N,\vartheta\mid\theta)\,d\vartheta + \nabla_\theta\log\pi(\theta)\\
&= \int \frac{p(X_N,\vartheta\mid\theta)}{p(X_N\mid\theta)}\,\nabla_\theta\log p(X_N,\vartheta\mid\theta)\,d\vartheta + \nabla_\theta\log\pi(\theta)\\
&= \int \pi(\vartheta\mid\theta,X_N)\,\nabla_\theta\left[\log p(X_N\mid\theta,\vartheta) + \log\pi(\theta\mid\vartheta) + \log\pi(\vartheta) - \log\pi(\theta)\right]d\vartheta + \nabla_\theta\log\pi(\theta)\\
&= \int \nabla_\theta\log\pi(\theta\mid\vartheta,X_N)\,\pi(\vartheta\mid\theta,X_N)\,d\vartheta,
\end{aligned}$$

where the second and third equalities follow from the relation $\nabla_\theta \log g(\theta) = \nabla_\theta g(\theta)/g(\theta)$ for an appropriate function $g(\theta)$, and the fourth and fifth equalities follow from direct calculation of the conditional distributions. □

A.2. Proof of Theorem 1

Let $\pi^* = \pi(\theta\mid X_N)$ denote the posterior density of $\theta$, and let $\pi_t$ denote the density of $\theta^{(t)}$ generated by Algorithm 1 at iteration $t$. We study the discrepancy between $\pi^*$ and $\pi_t$ in the second-order Wasserstein distance. The following conditions are assumed.

  • (A.1)
    The posterior $\pi^*$ is strongly log-concave and gradient-Lipschitz:
    $$f(\theta') - f(\theta) - \nabla f(\theta)^T(\theta' - \theta) \geq \frac{q_N}{2}\|\theta' - \theta\|_2^2, \quad \forall\, \theta, \theta' \in \Theta, \tag{1}$$
    $$\|\nabla f(\theta) - \nabla f(\theta')\|_2 \leq Q_N\|\theta - \theta'\|_2, \quad \forall\, \theta, \theta' \in \Theta, \tag{2}$$
    where $f(\theta) = -\log\pi(\theta\mid X_N)$, and $c_0 N \leq q_N \leq Q_N \leq c_0' N$ for some positive constants $c_0$ and $c_0'$.
  • (A.2)

    The posterior $\pi^*$ has a bounded second moment: $\int_\Theta \theta^T\theta\,\pi^*(\theta)\,d\theta = O(p)$.

  • (A.3)

    $\max_{S\in\mathcal{S}} E_{X_N}\left[\|\nabla_\theta\log\pi(\theta\mid\gamma_S, X_N)\|^2 \mid \theta\right] = O\{N^2(\|\theta\|^2 + p)\}$, where $E_{X_N}$ denotes expectation with respect to the distribution of $X_N$, and $\mathcal{S}$ denotes the set of all possible models.

  • (A.4)

    Let $L_N(\gamma_S, \theta) = \log p(X_N\mid\gamma_S, \theta)/N$ and let $\{L_N^{(i)}(\theta) : i = 1, 2, \ldots, |\mathcal{S}|\}$ be the descending order statistics of $\{L_N(\gamma_S, \theta) : S\in\mathcal{S}\}$. Assume that there exists a constant $\delta > 0$ such that $\inf_{\theta\in\Theta}\left(L_N^{(1)}(\theta) - L_N^{(2)}(\theta)\right) \geq \delta$.

Proof. Part (i). In Algorithm 1, the gradient $\nabla\log\pi(\theta^{(t)}\mid X_N)$ is estimated by running a short Markov chain with a mini-batch of data. Since the initial distribution of the Markov chain might not coincide with its equilibrium distribution, the resulting gradient estimate can be biased. Let $\zeta^{(t)} = \frac{1}{m}\sum_{k=1}^m \nabla_\theta\log\pi(\theta^{(t)}\mid\gamma_{S_k,n}^{(t)}, X_{n,N}^{(t)}) - \nabla\log\pi(\theta^{(t)}\mid X_N)$. It follows from (A.3) that

$$\|E(\zeta^{(t)}\mid\theta^{(t)})\|^2 = O\left\{\frac{N^2(\|\theta^{(t)}\|^2 + p)}{m^2}\right\}, \qquad E\|\zeta^{(t)} - E(\zeta^{(t)}\mid\theta^{(t)})\|^2 = O\left\{\frac{N^2(\|\theta^{(t)}\|^2 + p)}{mn}\right\}.$$

By Lemma S2 in the supplementary material, if $m \succ p^{1/2}$, $\epsilon \prec 1/N \prec (mn)/(Np)$, and $V = O(p)$ holds, then

$$W_2(\pi_t, \pi^*) = (1 - \omega)^t\, W_2(\pi_0, \pi^*) + O\!\left(\frac{p^{1/2}}{m}\right) + O\{(\epsilon p)^{1/2}\} + O\!\left\{\left(\frac{\epsilon N p}{mn}\right)^{1/2}\right\} \to 0, \quad \text{as } t\to\infty, \tag{3}$$

for some $\omega > 0$, since $q_N \asymp N$ and $Q_N \asymp N$ hold by conditions (A.1) and (A.2).

Part (ii). Since $\rho(\theta)$ is $\alpha$-Lipschitz, we have $|\rho(\theta)| \leq \alpha\|\theta\| + C'$ for some constant $C'$. Further, $\pi^*$ is strongly log-concave, so $\pi^*(|\rho|) < \infty$, i.e., $\rho$ is $\pi^*$-integrable. On the other hand,

$$\left|\int \rho(\theta)\,d\pi^*(\theta) - \int \rho(\tilde\theta)\,d\pi_t(\tilde\theta)\right| = |E\rho(\theta) - E\rho(\tilde\theta)| \leq E|\rho(\theta) - \rho(\tilde\theta)| \leq \alpha E\|\theta - \tilde\theta\|_2 \leq \alpha\{E\|\theta - \tilde\theta\|_2^2\}^{1/2} = \alpha W_2(\pi^*, \pi_t) = o(1) \quad \text{(due to eq. (3))}, \tag{4}$$

where $\theta$ and $\tilde\theta$ are two random variables whose marginal distributions are $\pi^*$ and $\pi_t$ respectively, $E(\cdot)$ denotes expectation with respect to the joint distribution of $\theta$ and $\tilde\theta$, and the coupling is chosen such that $(E\|\theta - \tilde\theta\|_2^2)^{1/2} = W_2(\pi^*, \pi_t)$. This implies that $\rho$ is also $\pi_t$-integrable and $\int\rho(\tilde\theta)\,d\pi_t(\tilde\theta) \to \int\rho(\theta)\,d\pi^*(\theta)$ as $t\to\infty$.

Further, by the Markov property, the weak law of large numbers applies, and thus $\sum_{t=1}^T \rho(\theta^{(t)})/T - \sum_{t=1}^T \int\rho(\tilde\theta)\,d\pi_t(\tilde\theta)/T = O_p(T^{-1/2})$. Combining this with the above result leads to

$$\left|\sum_{t=1}^T \rho(\theta^{(t)})/T - \pi^*(\rho)\right| = O_p(T^{-1/2}) + \alpha\sum_{t=1}^T W_2(\pi^*, \pi_t)/T \to 0. \tag{5}$$

Part (iii). To establish the convergence of the estimator of $\pi(\gamma_S\mid X_N)$, we define $L_N(\gamma_S, \theta^{(t)}) = \log p(X_N\mid\gamma_S, \theta^{(t)})/N$, $L_n(\gamma_S, \theta^{(t)}) = \log p(X_n^{(t)}\mid\gamma_S, \theta^{(t)})/n$, and $\xi_{n,S}^{(t)} = L_n(\gamma_S, \theta^{(t)}) - L_N(\gamma_S, \theta^{(t)})$ for any $S\in\mathcal{S}$. For each $S$, $\xi_{n,S}^{(t)}$ is approximately Gaussian with $E(\xi_{n,S}^{(t)}) = 0$ and $\mathrm{Var}(\xi_{n,S}^{(t)}) = O(1/n)$. Therefore, for any positive $\nu$, with probability $1 - |\mathcal{S}|^{-\nu}$, $\max_S |\xi_{n,S}|$ is bounded by $\delta_n := \{(2\nu + 2)\log|\mathcal{S}|/n\}^{1/2} = O[\{(\nu + 1)p/n\}^{1/2}]$, by the tail probability of the Gaussian distribution. It follows that, with high probability, if $S$ is the most likely model, i.e., $L_N(\gamma_S, \theta^{(t)}) = L_N^{(1)}(\theta^{(t)})$, then

$$\begin{aligned}
&\left|\pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \pi(\gamma_S\mid X_N, \theta^{(t)})\right|\\
&\quad= \left|\frac{1}{1 + \sum_{S'\neq S} e^{N\{L_N(\gamma_{S'},\theta^{(t)}) - L_N(\gamma_S,\theta^{(t)}) + \xi_{n,S'} - \xi_{n,S}\}}} - \frac{1}{1 + \sum_{S'\neq S} e^{N\{L_N(\gamma_{S'},\theta^{(t)}) - L_N(\gamma_S,\theta^{(t)})\}}}\right|\\
&\quad= \frac{\sum_{S'\neq S} e^{N\{L_N(\gamma_{S'},\theta^{(t)}) - L_N(\gamma_S,\theta^{(t)}) + b_{S'}\}}}{\left[1 + \sum_{S'\neq S} e^{N\{L_N(\gamma_{S'},\theta^{(t)}) - L_N(\gamma_S,\theta^{(t)}) + b_{S'}\}}\right]^2}\; N|\xi_{n,S'} - \xi_{n,S}|\\
&\quad\leq (2^p - 1)\, e^{-N(\delta - 2\delta_n)}\, N\, 2\delta_n \leq e^{-N\delta/2} \to 0,
\end{aligned}$$

if $\nu p \prec n$ (i.e., $\delta_n \prec \delta$) and $N \succ p$, where the second equality follows from the mean-value theorem by viewing the $N\{L_N(\gamma_{S'}, \theta^{(t)}) - L_N(\gamma_S, \theta^{(t)})\}$'s as the arguments of $\pi(\gamma_S\mid X_N, \theta^{(t)})$, and $b_{S'}$ denotes a value between $0$ and $\xi_{n,S'} - \xi_{n,S}$. Similarly, if $S$ is not the most likely model, then we denote by $S^*$ the most likely model and, by the mean-value theorem,

$$\begin{aligned}
&\left|\pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \pi(\gamma_S\mid X_N, \theta^{(t)})\right|\\
&\quad= \left|\frac{e^{N\{L_N(\gamma_S,\theta^{(t)}) - L_N(\gamma_{S^*},\theta^{(t)}) + \xi_{n,S} - \xi_{n,S^*}\}}}{1 + \sum_{S'\neq S^*} e^{N\{L_N(\gamma_{S'},\theta^{(t)}) - L_N(\gamma_{S^*},\theta^{(t)}) + \xi_{n,S'} - \xi_{n,S^*}\}}} - \frac{e^{N\{L_N(\gamma_S,\theta^{(t)}) - L_N(\gamma_{S^*},\theta^{(t)})\}}}{1 + \sum_{S'\neq S^*} e^{N\{L_N(\gamma_{S'},\theta^{(t)}) - L_N(\gamma_{S^*},\theta^{(t)})\}}}\right|\\
&\quad\leq \left[1 + (2^p - 1)\, e^{-N(\delta - 2\delta_n)} + e^{2N\delta_n}\right] e^{-N(\delta - 2\delta_n)}\, N\, 2\delta_n \leq e^{-N\delta/2} \to 0.
\end{aligned}$$

In conclusion, with probability $1 - 1/|\mathcal{S}|^\nu$, $|\pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \pi(\gamma_S\mid X_N, \theta^{(t)})| < \exp(-N\delta/2)$ for all $S$, any iteration $t$ and any $\theta^{(t)}\in\Theta$. One can then choose $\nu = (n/p)^{1/2}\to\infty$ such that $|\pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \pi(\gamma_S\mid X_N, \theta^{(t)})|$ is bounded in expectation by

$$\max_S E\left|\pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \pi(\gamma_S\mid X_N, \theta^{(t)})\right| \leq \exp(-N\delta/2) + 1/|\mathcal{S}|^\nu \to 0, \tag{6}$$

for any iteration $t$. Conditioned on $\{\theta^{(t)} : t = 1, 2, \ldots\}$, the differences $\pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \pi(\gamma_S\mid X_N, \theta^{(t)})$ are independent and each is bounded by 1, so the weak law of large numbers applies. Therefore, for any $S\in\mathcal{S}$,

$$\left|\frac{1}{T}\sum_{t=1}^T \pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \frac{1}{T}\sum_{t=1}^T \pi(\gamma_S\mid X_N, \theta^{(t)})\right| = O_p(T^{-1/2}) + \exp(-N\delta/2) + 1/|\mathcal{S}|^\nu \to 0, \tag{7}$$

provided $p \prec n \prec N$. Since $\{\theta^{(t)} : t = 1, 2, \ldots\}$ forms a time-homogeneous Markov chain whose convergence is measured by (3), and the function $\pi(\gamma_S\mid X_N, \theta)$ is bounded and continuous in $\theta$,

$$\left|\frac{1}{T}\sum_{t=1}^T \pi(\gamma_S\mid X_N, \theta^{(t)}) - \pi(\gamma_S\mid X_N)\right| = O_p(T^{-1/2}) \tag{8}$$

holds for any $S\in\mathcal{S}$. Combining (8) with (7) leads to

$$\left|\frac{1}{T}\sum_{t=1}^T \pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \pi(\gamma_S\mid X_N)\right| = O_p(T^{-1/2}) + \exp(-N\delta/2) + 1/|\mathcal{S}|^\nu \to 0. \tag{9}$$

Conditioned on $X_{n,N}^{(t)}$ and $\theta^{(t)}$, by the standard theory of MCMC, $m^{-1}\sum_{i=1}^m I(\gamma_{S_i,n}^{(t)} = \gamma_S)$ forms a consistent estimator of $\pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)})$ with an asymptotic bias of $O(1/m)$. Since $m$ increases with $p$ and $N$, the estimator is asymptotically unbiased. Combining this result with (9) leads to

$$\left|\frac{1}{mT}\sum_{t=1}^T\sum_{i=1}^m I(\gamma_{S_i,n}^{(t)} = \gamma_S) - \pi(\gamma_S\mid X_N)\right| = O_p(T^{-1/2}) + \exp(-N\delta/2) + 1/|\mathcal{S}|^\nu + O_p(m^{-1/2}), \tag{10}$$

which converges to 0 as T → ∞ and N → ∞. □

References

  1. Ahn S, Balan AK & Welling M (2012). Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML.
  2. Barbieri M & Berger J (2004). Optimal predictive model selection. Annals of Statistics 32, 870–897.
  3. Bardenet R, Doucet A & Holmes C (2014). Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In ICML.
  4. Bardenet R, Doucet A & Holmes CC (2017). On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research 18, 47:1–47:43.
  5. Betancourt M (2015). The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In ICML.
  6. Bierkens J, Fearnhead P & Roberts G (2019). The zig-zag process and super-efficient Monte Carlo for Bayesian analysis of big data. Annals of Statistics 47, 1288–1320.
  7. Bouchard-Côté A, Vollmer S & Doucet A (2018). The bouncy particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method. Journal of the American Statistical Association 113, 855–867.
  8. Cappé O, Moulines E & Rydén T (2005). Inference in Hidden Markov Models. New York: Springer.
  9. Chen H, Seita D, Pan X & Canny J (2016). An efficient minibatch acceptance test for Metropolis-Hastings. arXiv:1610.06848.
  10. Chen T, Fox EB & Guestrin C (2014). Stochastic gradient Hamiltonian Monte Carlo. In ICML.
  11. Chen Y & Ghahramani Z (2016). Scalable discrete sampling as a multi-armed bandit problem. In ICML.
  12. Dalalyan AS & Karagulyan AG (2017). User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv:1710.00095.
  13. Ding N, Fang Y, Babbush R, Chen C, Skeel RD & Neven H (2014). Bayesian sampling using stochastic gradient thermostats. In NIPS.
  14. Gibbs A & Su F (2002). On choosing and bounding probability metrics. International Statistical Review 70, 419–435.
  15. Green P (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732.
  16. Korattikara A, Chen Y & Welling M (2014). Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In ICML.
  17. Li C, Chen C, Carlson DE & Carin L (2016). Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In AAAI.
  18. Ma Y-A, Chen T & Fox EB (2015). A complete recipe for stochastic gradient MCMC. In NIPS.
  19. Maclaurin D & Adams RP (2014). Firefly Monte Carlo: Exact MCMC with subsets of data. In IJCAI.
  20. Nemeth C & Fearnhead P (2019). Stochastic gradient Markov chain Monte Carlo. arXiv:1907.06986.
  21. Sato I & Nakagawa H (2014). Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Ito process. In ICML.
  22. Scott SL, Blocker AW, Bonassi FV, Chipman HA, George EI & McCulloch RE (2016). Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management 11, 78–88.
  23. Srivastava S, Li C & Dunson DB (2018). Scalable Bayes via Barycenter in Wasserstein space. Journal of Machine Learning Research 19, 1–35.
  24. Tanner M & Wong W (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association 82, 528–540.
  25. Teh YW, Thiery A & Vollmer S (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research 17, 1–33.
  26. Welling M & Teh YW (2011). Bayesian learning via stochastic gradient Langevin dynamics. In ICML.
  27. Xue J & Liang F (2019). Double-parallel Monte Carlo for Bayesian analysis of big data. Statistics and Computing 29, 23–32.
