Author manuscript; available in PMC: 2021 Jul 23.
Published in final edited form as: Biometrika. 2020 Jul 13;107(4):997–1004. doi: 10.1093/biomet/asaa029

Extended Stochastic Gradient MCMC for Large-Scale Bayesian Variable Selection

Qifan Song 1, Yan Sun 1, Mao Ye 1, Faming Liang 1
PMCID: PMC8302213  NIHMSID: NIHMS1645699  PMID: 34305153

Summary

Stochastic gradient Markov chain Monte Carlo (MCMC) algorithms have received much attention in Bayesian computing for big data problems, but they are applicable only to a small class of problems for which the parameter space has a fixed dimension and the log-posterior density is differentiable with respect to the parameters. This paper proposes an extended stochastic gradient MCMC algorithm which, by introducing appropriate latent variables, can be applied to more general large-scale Bayesian computing problems, such as those involving dimension jumping and missing data. Numerical studies show that the proposed algorithm is highly scalable and much more efficient than traditional MCMC algorithms, substantially alleviating the computational burden of Bayesian methods in big data problems.

Keywords: Dimension Jumping, Missing Data, Stochastic Gradient Langevin Dynamics, Subsampling

1. Introduction

After six decades of continual development, MCMC has proven to be a powerful and often indispensable computational tool for analyzing data with complex structures. However, for large datasets its computational cost can be prohibitive, as all of the data must be processed at each iteration. To tackle this difficulty, a variety of scalable algorithms have been proposed in the recent literature. According to the strategies they employ, these algorithms can be grouped into a few categories: stochastic gradient MCMC algorithms (Welling & Teh, 2011; Ding et al., 2014; Ahn et al., 2012; Chen et al., 2014; Betancourt, 2015; Ma et al., 2015; Nemeth & Fearnhead, 2019), split-and-merge algorithms (Scott et al., 2016; Srivastava et al., 2018; Xue & Liang, 2019), mini-batch Metropolis-Hastings algorithms (Chen et al., 2016; Korattikara et al., 2014; Bardenet et al., 2014; Maclaurin & Adams, 2014; Bardenet et al., 2017), nonreversible Markov process-based algorithms (Bierkens et al., 2019; Bouchard-Côté et al., 2018), and discrete sampling algorithms based on the multi-armed bandit (Chen & Ghahramani, 2016).

Although scalable algorithms have been developed for both continuous and discrete sampling problems, they are difficult to apply to dimension-jumping problems. A canonical example is variable selection, where the number of parameters changes from iteration to iteration in MCMC simulations. Under their current settings, the stochastic gradient MCMC and nonreversible Markov process-based algorithms are applicable only to problems for which the parameter space has a fixed dimension and the log-posterior density is differentiable with respect to the parameters. For the split-and-merge algorithms, it is unclear how to aggregate samples of different dimensions drawn from the posterior distributions based on different subsets of the data. The multi-armed bandit algorithms are applicable only to problems with a small discrete domain and can be extremely inefficient for high-dimensional variable selection. The mini-batch Metropolis-Hastings algorithms are in principle applicable to dimension-jumping problems, but they are generally difficult to use. For example, the algorithms of Chen et al. (2016), Korattikara et al. (2014) and Bardenet et al. (2014) perform approximate acceptance tests using subsets of the data; the amount of data consumed by each test varies significantly from one iteration to another, which compromises their scalability. The algorithms of Maclaurin & Adams (2014) and Bardenet et al. (2017) perform exact tests but require a lower bound on the parameter distribution across its domain, which is usually difficult to obtain.

This paper proposes an extended stochastic gradient Langevin dynamics algorithm which, by introducing appropriate latent variables, extends the stochastic gradient Langevin dynamics algorithm to more general large-scale Bayesian computing problems such as variable selection and missing data. The extended algorithm is highly scalable and much more efficient than traditional MCMC algorithms. Compared to the mini-batch Metropolis-Hastings algorithms, it is much easier to use: it processes only a fixed amount of data at each iteration and does not require any lower bound on the parameter distribution.

2. A Brief Review of Stochastic Gradient Langevin Dynamics

Let $X_N = (X_1, X_2, \ldots, X_N)$ denote a set of $N$ independent and identically distributed samples drawn from the distribution $f(x\mid\theta)$, where $N$ is the sample size and $\theta$ is the parameter. Let $p(X_N\mid\theta) = \prod_{i=1}^N f(X_i\mid\theta)$ denote the likelihood function, let $\pi(\theta)$ denote the prior distribution of $\theta$, and let $\log\pi(\theta\mid X_N) = \log p(X_N\mid\theta) + \log\pi(\theta) + \text{constant}$ denote the log-posterior density function. If $\theta$ has a fixed dimension and $\log\pi(\theta\mid X_N)$ is differentiable with respect to $\theta$, then the stochastic gradient Langevin dynamics algorithm (Welling & Teh, 2011) can be applied to simulate from the posterior, iterating by

$$\theta_{t+1} = \theta_t + \frac{\epsilon_{t+1}}{2}\,\widehat{\nabla}_\theta \log\pi(\theta_t\mid X_N) + \sqrt{\epsilon_{t+1}\tau}\,\eta_{t+1}, \qquad \eta_{t+1}\sim N(0, I_d), \tag{1}$$

where $d$ is the dimension of $\theta$, $I_d$ is the $d\times d$ identity matrix, $\epsilon_{t+1}$ is the step size (also known as the learning rate), $\tau$ is the temperature, and $\widehat{\nabla}_\theta\log\pi(\theta_t\mid X_N)$ denotes an estimate of $\nabla_\theta\log\pi(\theta_t\mid X_N)$ based on a mini-batch of samples. The learning rate can be decreasing or kept constant. For the former, the convergence of the algorithm was studied in Teh et al. (2016); for the latter, in Sato & Nakagawa (2014) and Dalalyan & Karagulyan (2017). Refer to Nemeth & Fearnhead (2019) for further discussion of the theory, implementation and variants of this algorithm.
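To make the update (1) concrete, the sketch below implements one constant-step SGLD move for a Bayesian linear regression with unit error variance and a Gaussian prior; the model, the prior scale and all function and variable names are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def sgld_step(theta, X, y, N, eps, tau=1.0, prior_var=1.0, rng=None):
    """One SGLD update (1) on a mini-batch (X, y) of size n drawn from N samples.

    Illustrative assumptions: a linear model y ~ N(X theta, 1) with a
    N(0, prior_var * I) prior. The mini-batch log-likelihood gradient is
    rescaled by N/n so that it is unbiased for the full-data gradient.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    grad_loglik = X.T @ (y - X @ theta)            # mini-batch log-likelihood gradient
    grad_logprior = -theta / prior_var             # gradient of the Gaussian log-prior
    grad = (N / n) * grad_loglik + grad_logprior   # estimate of the full-data gradient
    noise = np.sqrt(eps * tau) * rng.standard_normal(theta.shape)
    return theta + 0.5 * eps * grad + noise
```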

3. An Extended Stochastic Gradient Langevin Dynamics Algorithm

To extend the stochastic gradient Langevin dynamics algorithm to varying-dimensional problems such as variable selection and missing data, we first establish an identity for evaluating $\nabla_\theta\log\pi(\theta\mid X_N)$ in the presence of latent variables. As illustrated below, the latent variable can be the model indicator in variable selection problems or the missing values in missing-data problems.

Lemma 1.

For any latent variable ϑ,

$$\nabla_\theta \log\pi(\theta\mid X_N) = \int \nabla_\theta \log\pi(\theta\mid\vartheta, X_N)\,\pi(\vartheta\mid\theta, X_N)\,d\vartheta, \tag{2}$$

where $\pi(\vartheta\mid\theta, X_N)$ and $\pi(\theta\mid\vartheta, X_N)$ denote the conditional distributions of $\vartheta$ and $\theta$, respectively.

Lemma 1 provides a Monte Carlo estimator of $\nabla_\theta\log\pi(\theta\mid X_N)$: average $\nabla_\theta\log\pi(\theta\mid\vartheta, X_N)$ over samples drawn from the conditional distribution $\pi(\vartheta\mid\theta, X_N)$. The identity (2) is similar to Fisher's identity, which has been used to evaluate the gradient of the log-likelihood function in the presence of latent variables; see e.g. Cappé et al. (2005). When $N$ is large, the computation can be accelerated by subsampling. Let $X_n$ denote a subsample, where $n$ denotes the subsample size. Without loss of generality, we assume that $N$ is a multiple of $n$, i.e., $N/n$ is an integer. Let $X_{n,N} = \{X_n, \ldots, X_n\}$ denote the dataset obtained by duplicating the subsample $N/n$ times, whose total sample size is also $N$. It follows from (2) that

$$\nabla_\theta \log\pi(\theta\mid X_{n,N}) = \int \nabla_\theta \log\pi(\theta\mid\vartheta, X_{n,N})\,\pi(\vartheta\mid\theta, X_{n,N})\,d\vartheta. \tag{3}$$

Since $\nabla_\theta\log\pi(\theta\mid X_{n,N}) = \nabla_\theta\log p(X_{n,N}\mid\theta) + \nabla_\theta\log\pi(\theta)$ and $\log p(X_{n,N}\mid\theta)$ is unbiased for $\log p(X_N\mid\theta)$, $\nabla_\theta\log\pi(\theta\mid X_{n,N})$ forms an unbiased estimator of $\nabla_\theta\log\pi(\theta\mid X_N)$. Sampling from $\pi(\vartheta\mid\theta, X_{n,N})$ can be much faster than sampling from $\pi(\vartheta\mid\theta, X_N)$, as for the former the likelihood needs to be evaluated only on a mini-batch of samples.
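In practice the duplicated dataset $X_{n,N}$ never needs to be formed: it contains $N/n$ identical copies of the subsample, so its log-likelihood gradient is simply the mini-batch gradient scaled by $N/n$. The following minimal check illustrates this under the same illustrative Gaussian linear model as in the sketch of Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, p = 1000, 100, 5
X, theta = rng.standard_normal((n, p)), rng.standard_normal(p)
y = X @ theta + rng.standard_normal(n)

# Gradient of log p(X_{n,N} | theta): explicit duplication vs. N/n rescaling.
X_dup, y_dup = np.tile(X, (N // n, 1)), np.tile(y, N // n)
grad_dup = X_dup.T @ (y_dup - X_dup @ theta)
grad_scaled = (N / n) * (X.T @ (y - X @ theta))
assert np.allclose(grad_dup, grad_scaled)
```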

3.1. Bayesian Variable Selection

As an illustrative example, we consider the problem of variable selection for linear regression

$$Y = z^T\beta + \varepsilon, \tag{4}$$

where $\varepsilon$ is a zero-mean Gaussian random error with variance $\sigma^2$, $\beta\in\mathbb{R}^p$ is the vector of regression coefficients, and $z = (z_1, z_2, \ldots, z_p)$ is the vector of explanatory variables. Let $\gamma_S = (\gamma_{S,1}, \ldots, \gamma_{S,p})$ be a binary vector indicating the variables included in model $S$, and let $\beta_S$ be the vector of regression coefficients associated with model $S$. From the Bayesian perspective, we are interested in estimating the posterior probability $\pi(\gamma_S\mid X_N)$ for each model $S\in\mathcal{S}$ and the posterior mean $\pi(\rho) = \int \rho(\beta)\,\pi(\beta\mid X_N)\,d\beta$ for some integrable function $\rho(\cdot)$, where $\mathcal{S}$ comprises $2^p$ models. Both quantities can be estimated using the reversible jump Metropolis-Hastings algorithm (Green, 1995) by sampling from the posterior distribution $\pi(\gamma_S, \beta_S\mid X_N)$. However, when $N$ is large, the algorithm can be extremely slow due to repeated scans of the full dataset.

As mentioned above, the existing stochastic gradient MCMC algorithms cannot be directly applied to simulate from $\pi(\gamma_S, \beta_S\mid X_N)$ because of the dimension jumping involved in model transitions. To address this issue, we introduce an auxiliary variable $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$, which links $\gamma_S$ and $\beta_S$ through

$$\beta_S = \theta * \gamma_S = (\theta_1\gamma_{S,1}, \theta_2\gamma_{S,2}, \ldots, \theta_p\gamma_{S,p}), \tag{5}$$

where $*$ denotes elementwise multiplication. Let $\theta_{[S]} = \{\theta_i : \gamma_{S,i} = 1,\ i = 1, 2, \ldots, p\}$ and $\theta_{[-S]} = \{\theta_i : \gamma_{S,i} = 0,\ i = 1, 2, \ldots, p\}$ be the subvectors of $\theta$ corresponding to the nonzero and zero elements of $\gamma_S$, respectively. Note that $\beta_S$ is sparse, with all elements in $\theta_{[-S]}$ set to zero, while $\theta$ can be dense. Based on the relation (5), we suggest simulating from $\pi(\theta\mid X_N)$ using the stochastic gradient Langevin dynamics algorithm, for which the gradient $\nabla_\theta\log\pi(\theta\mid X_N)$ can be evaluated using Lemma 1 by treating $\gamma_S$ as the latent variable. Let $\pi(\theta)$ denote the prior of $\theta$. To simplify the computation of $\nabla_\theta\log\pi(\theta\mid\gamma_S, X_N)$, we further assume the a priori independence $\pi(\theta\mid\gamma_S) = \pi(\theta_{[S]}\mid\gamma_S)\,\pi(\theta_{[-S]}\mid\gamma_S)$. It is then easy to derive

$$\nabla_\theta \log\pi(\theta\mid\gamma_S, X_N) = \begin{cases} \nabla_{\theta_{[S]}}\log p(X_N\mid\theta_{[S]},\gamma_S) + \nabla_{\theta_{[S]}}\log\pi(\theta_{[S]}\mid\gamma_S), & \text{for the component } \theta_{[S]},\\[4pt] \nabla_{\theta_{[-S]}}\log\pi(\theta_{[-S]}\mid\gamma_S), & \text{for the component } \theta_{[-S]}, \end{cases}$$

which can be used to evaluate $\nabla\log\pi(\theta\mid X_N)$ by Lemma 1. If a mini-batch of data is used, the gradient can be evaluated based on (3). This leads to the following extended stochastic gradient Langevin dynamics algorithm.

Algorithm 1.

[Extended Stochastic Gradient Langevin Dynamics for Bayesian Variable Selection]

  1. (Subsampling) Draw a subsample of size $n$ (with or without replacement) from the full dataset $X_N$ at random, and denote the subsample by $X_n^{(t)}$, where $t$ indexes the iteration.

  2. (Simulating models) Simulate models $\gamma_{S_1,n}^{(t)}, \ldots, \gamma_{S_m,n}^{(t)}$ from the conditional posterior $\pi(\gamma_S\mid\theta^{(t)}, X_{n,N}^{(t)})$ by running a short Markov chain, where $X_{n,N}^{(t)} = \{X_n^{(t)}, \ldots, X_n^{(t)}\}$ and $\theta^{(t)}$ is the sample of $\theta$ at iteration $t$.

  3. (Updating $\theta$) Update $\theta^{(t)}$ by setting $\theta^{(t+1)} = \theta^{(t)} + \frac{\epsilon_{t+1}}{2m}\sum_{k=1}^m \nabla_\theta\log\pi(\theta^{(t)}\mid\gamma_{S_k,n}^{(t)}, X_{n,N}^{(t)}) + \sqrt{\epsilon_{t+1}\tau}\,\eta_{t+1}$, where $\epsilon_{t+1}$ is the learning rate, $\eta_{t+1}\sim N(0, I_p)$, $\tau$ is the temperature, and $p$ is the dimension of $\theta$.
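A minimal sketch of one full iteration of Algorithm 1 for the linear model (4) is given below. The spike-and-slab prior ($\theta_j\mid\gamma_{S,j}=1 \sim N(0, v_1)$, $\theta_j\mid\gamma_{S,j}=0 \sim N(0, v_0)$, $\gamma_{S,j}\sim\mathrm{Bernoulli}(q)$), the single-flip Metropolis sampler for Step 2, and all names and default values are our own illustrative assumptions; the paper's experiments use a hierarchical prior and a reversible jump sampler detailed in its supplementary material.

```python
import numpy as np

def esgld_iteration(theta, data_X, data_y, n, m, eps, tau=1.0,
                    v1=10.0, v0=0.1, q=0.01, flips=50, rng=None):
    """One iteration of Algorithm 1 for the linear model (4), as a sketch.

    Illustrative assumptions (not from the paper): sigma^2 = 1, and
    theta_j | gamma_j=1 ~ N(0, v1), theta_j | gamma_j=0 ~ N(0, v0),
    gamma_j ~ Bernoulli(q) independently.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, p = data_X.shape
    scale = N / n                                  # duplication factor for X_{n,N}

    # Step 1 (subsampling): draw a mini-batch of size n at random.
    idx = rng.choice(N, size=n, replace=False)
    X, y = data_X[idx], data_y[idx]

    def log_post(gamma):
        # log pi(gamma | theta, X_{n,N}) up to an additive constant.
        loglik = -0.5 * scale * np.sum((y - X @ (theta * gamma)) ** 2)
        v = np.where(gamma == 1, v1, v0)
        logprior = np.sum(gamma * np.log(q) + (1 - gamma) * np.log(1 - q)
                          - 0.5 * theta ** 2 / v - 0.5 * np.log(v))
        return loglik + logprior

    # Step 2 (simulating models): a short single-flip Metropolis chain on gamma.
    gamma = (rng.random(p) < q).astype(float)
    cur, draws = log_post(gamma), []
    for _ in range(flips):
        j = rng.integers(p)
        prop = gamma.copy()
        prop[j] = 1.0 - prop[j]
        new = log_post(prop)
        if np.log(rng.random()) < new - cur:
            gamma, cur = prop, new
        draws.append(gamma.copy())
    models = draws[-m:]                            # keep last m models (flips >= m)

    # Step 3 (updating theta): average the m conditional gradients, then move.
    grad = np.zeros(p)
    for g in models:
        v = np.where(g == 1, v1, v0)
        grad += g * scale * (X.T @ (y - X @ (theta * g))) - theta / v
    theta_new = (theta + 0.5 * eps * grad / m
                 + np.sqrt(eps * tau) * rng.standard_normal(p))
    return theta_new, models
```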

Theorem 1 justifies the validity of this algorithm with the proof given in the Appendix.

Theorem 1.

Assume that conditions (A.1)-(A.3) (given in the Appendix) hold, that $m$, $p$, $n$ increase with $N$ such that $N \succ n \succ p$ and $m \succ p^{1/2}$, and that a constant learning rate $\epsilon \prec 1/N$ is used. Then, as $N\to\infty$:

  1. $W_2(\pi_t, \pi^*)\to 0$ as $t\to\infty$, where $\pi_t$ denotes the distribution of $\theta^{(t)}$, $\pi^* = \pi(\theta\mid X_N)$, and $W_2(\cdot,\cdot)$ denotes the second-order Wasserstein distance between two distributions.

  2. If $\rho(\theta)$ is $\alpha$-Lipschitz for some constant $\alpha > 0$, then $\sum_{t=1}^T \rho(\theta^{(t)})/T \overset{p}{\to} \pi^*(\rho)$ as $T\to\infty$, where $\overset{p}{\to}$ denotes convergence in probability and $\pi^*(\rho) = \int_\Theta \rho(\theta)\,\pi(\theta\mid X_N)\,d\theta$.

  3. If (A.4) further holds, then $\sum_{t=1}^T\sum_{i=1}^m I(\gamma_{S_i,n}^{(t)} = \gamma_S)/(mT) - \pi(\gamma_S\mid X_N) \overset{p}{\to} 0$ as $T\to\infty$.

Part (i) establishes the weak convergence of $\theta^{(t)}$: if the total sample size $N$ and the iteration number $t$ are sufficiently large, and the subsample size $n$ and the number of models $m$ simulated at each iteration are reasonably large, then the distribution $\pi_t$ converges to the true posterior $\pi(\theta\mid X_N)$ in the second-order Wasserstein distance. Refer to Gibbs & Su (2002) for discussion of the relation between the Wasserstein distance and other probability metrics. Parts (ii) and (iii) address our general interest in estimating the posterior mean and the posterior probability, respectively, based on the samples simulated by Algorithm 1. For parts (i), (ii) and (iii), explicit convergence rates are given in equations (3), (5) and (10) of the Appendix, respectively.

For the condition $m \succ p^{1/2}$, $p$ can be approximately treated as the maximum size of the models under consideration, which is of the same order as the true model. Therefore, $m$ can be fairly small under the model sparsity assumption. Theorem 1 is established with a constant learning rate. In practice, one may use a decaying learning rate; see e.g. Teh et al. (2016), where it is suggested to set $\epsilon_t = O(1/t^\kappa)$ for some $0 < \kappa \leq 1$. For a decaying learning rate, Teh et al. (2016) recommended weighted averaging estimators for $\pi^*(\rho)$. Theorem 2 shows that the unweighted averaging estimators used above still work if the learning rate decays slowly, at a rate $\epsilon_t = O(1/t^\kappa)$ for $0 < \kappa < 1$; if $\kappa = 1$, the weighted averaging estimators are still needed. The proof of Theorem 2 is given in the supplementary material.
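For reference, such a slowly decaying schedule can be written in one line; eps0 and kappa below are tuning constants chosen by the user, not values prescribed by the paper.

```python
def learning_rate(t, eps0=1e-6, kappa=0.33):
    """Polynomially decaying step size eps_t = O(1/t^kappa) with 0 < kappa < 1."""
    return eps0 / t ** kappa
```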

Theorem 2.

Assume the conditions of Theorem 1 hold. If a decaying learning rate ϵt = O(1/tκ) is used for some 0 < κ < 1, then parts (i), (ii) and (iii) of Theorem 1 are still valid.

3.2. Missing Data

Missing data are ubiquitous in fields ranging from science to technology. However, in the big data setting, how to conduct Bayesian analysis in the presence of missing data remains unclear. The existing data-augmentation algorithm (Tanner & Wong, 1987) is based on the full data and can thus be extremely slow. In this context, we let $X_N$ denote the incomplete data and let $\theta$ denote the model parameters. If we treat the missing values as latent variables, then Lemma 1 can be used to evaluate the gradient $\nabla_\theta\log\pi(\theta\mid X_N)$. However, Algorithm 1 cannot be directly applied to missing data problems, since the imputation of the missing values needs to depend only on the subsample. To address this issue, we propose Algorithm S1 (given in the supplementary material), in which the missing values $\vartheta$ are imputed from $\pi(\vartheta\mid\theta, X_n)$ at each iteration. Theorems 1 and 2 remain applicable to this algorithm.
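Algorithm S1 itself is given in the supplementary material; purely as a hedged illustration of its structure, the sketch below runs one extended SGLD step for a toy bivariate Gaussian model with missing second coordinates, where the imputation step draws the missing values from their conditional distribution $\pi(\vartheta\mid\theta, X_n)$. The model, the flat prior and all names are our own assumptions.

```python
import numpy as np

def esgld_missing_step(theta, X_obs, miss_mask, N, eps, rho=0.5, tau=1.0, rng=None):
    """One extended SGLD step for a bivariate N(theta, Sigma) model with
    Sigma = [[1, rho], [rho, 1]] known and some second coordinates missing.

    Illustrative assumptions: a flat prior on theta, missingness only in
    coordinate 2, and missing entries imputed from their Gaussian
    conditional given coordinate 1 and the current theta.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X_obs.shape[0]
    X = X_obs.copy()

    # Impute missing second coordinates from pi(vartheta | theta, X_n):
    # x2 | x1 ~ N(theta2 + rho*(x1 - theta1), 1 - rho^2).
    k = miss_mask.sum()
    X[miss_mask, 1] = (theta[1] + rho * (X[miss_mask, 0] - theta[0])
                       + np.sqrt(1 - rho ** 2) * rng.standard_normal(k))

    # Complete-data gradient, rescaled by N/n for the mini-batch.
    Sigma_inv = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))
    grad = (N / n) * Sigma_inv @ (X - theta).sum(axis=0)
    return theta + 0.5 * eps * grad + np.sqrt(eps * tau) * rng.standard_normal(2)
```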

4. An Illustrative Example

This section illustrates the performance of Algorithm 1 using a simulated example; more numerical examples are presented in the supplementary material. Ten synthetic datasets were generated from the model (4) with $N = 50{,}000$, $p = 2001$, $\sigma^2 = 1$, $\beta_1 = \cdots = \beta_5 = 1$, $\beta_6 = \beta_7 = \beta_8 = -1$, and $\beta_0 = \beta_9 = \cdots = \beta_p = 0$, where $\sigma^2$ is assumed to be known and the explanatory variables are normally distributed with a mutual correlation coefficient of 0.5. A hierarchical prior was assumed for the model and parameters, with details given in the supplementary material. For each dataset, Algorithm 1 was run for 5000 iterations with $n = 200$, $m = 10$ and the learning rate $\epsilon_t \equiv 10^{-6}$; the first 2000 iterations were discarded as burn-in, and the samples generated in the remaining iterations were used for inference. At each iteration, the reversible jump Metropolis-Hastings algorithm (Green, 1995) was used to simulate the models $\gamma_{S_i,n}^{(t)}$, $i = 1, 2, \ldots, m$, with details given in the supplementary material.

Table 1 summarizes the performance of the algorithm, where the false selection rate (FSR), negative selection rate (NSR), mean squared error for false predictors (MSE0) and mean squared error for true predictors (MSE1) are defined in the supplementary material. Variables were selected according to the median probability rule (Barbieri & Berger, 2004), which selects only the variables whose marginal inclusion probability exceeds 0.5. The Bayesian estimates of the parameters were obtained by averaging over a set of thinned (by a factor of 10) posterior samples. For comparison, some existing algorithms were applied to this example, with the results given in Table 1 and the implementation details given in the supplementary material. The comparison shows that the proposed algorithm substantially reduces the computational burden of Bayesian methods in big data analysis.
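Given the retained draws, the median probability rule and the parameter estimates reported in Table 1 can be computed along the following lines; the array names and shapes are illustrative assumptions about how the output of Algorithm 1 is stored.

```python
import numpy as np

def select_and_estimate(gammas, thetas, thin=10):
    """Apply the median probability rule and form Bayesian parameter estimates.

    Illustrative shapes: gammas is (T*m, p), stacking all gamma draws after
    burn-in; thetas is (T, p), the theta draws after burn-in.
    """
    incl_prob = gammas.mean(axis=0)           # marginal inclusion probabilities
    selected = incl_prob > 0.5                # median probability rule
    beta_hat = thetas[::thin].mean(axis=0) * selected  # thinned posterior mean, masked
    return selected, beta_hat
```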

Table 1.

Bayesian variable selection with the extended stochastic gradient Langevin dynamics (eSGLD), reversible jump Metropolis-Hastings (RJMH), split-and-merge (SaM) and Bayesian Lasso (B-Lasso) algorithms. FSR, NSR, MSE1 and MSE0 are averages over 10 datasets, with standard deviations in parentheses; the CPU time (in minutes) was recorded for one dataset on a Linux machine with an Intel Core i7-3770 CPU @ 3.40GHz.

| Algorithm | FSR | NSR | MSE1 | MSE0 | CPU (min) |
| --- | --- | --- | --- | --- | --- |
| eSGLD | 0 (0) | 0 (0) | $2.91 \times 10^{-3}$ ($1.90 \times 10^{-3}$) | $1.26 \times 10^{-7}$ ($1.18 \times 10^{-8}$) | 3.3 |
| RJMH | 0.50 (0.10) | 0.16 (0.042) | $1.60 \times 10^{-1}$ ($3.89 \times 10^{-2}$) | $2.64 \times 10^{-5}$ ($8.75 \times 10^{-6}$) | 180.1 |
| SaM | 0.05 (0.05) | 0.013 (0.013) | $1.29 \times 10^{-2}$ ($1.27 \times 10^{-2}$) | $1.01 \times 10^{-6}$ ($1.00 \times 10^{-6}$) | 150.4 |
| B-Lasso | 0 (0) | 0 (0) | $2.32 \times 10^{-4}$ ($3.58 \times 10^{-5}$) | $1.40 \times 10^{-7}$ ($5.08 \times 10^{-9}$) | 32.8 |

5. Discussion

This paper has extended the stochastic gradient Langevin dynamics algorithm to general large-scale Bayesian computing problems, such as those involving dimension jumping and missing data. To the best of our knowledge, this paper provides the first Bayesian method and theory for high-dimensional discrete parameter estimation with mini-batch samples; the existing methods work only for continuous parameters or very low-dimensional discrete problems. Beyond generalized linear models, the proposed algorithm has many potential applications in data science. For example, it can be used for sparse deep learning and for accelerating computation in statistical models where latent variables are involved, such as hidden Markov models, random coefficient models and model-based clustering.

Algorithm 1 can be further extended by updating θ using a variant of stochastic gradient Langevin dynamics, such as stochastic gradient Hamiltonian Monte Carlo (Chen et al., 2014), stochastic gradient thermostats (Ding et al., 2014), stochastic gradient Fisher scoring (Ahn et al., 2012), or preconditioned stochastic gradient Langevin dynamics (Li et al., 2016). We expect that the advantages of these variants (over stochastic gradient Langevin dynamics) can be carried over to the extension.


Acknowledgements

This work was partially supported by the grants DMS-1811812, DMS-1818674, and R01-GM126089. The authors thank the editor, associate editor and referees for their insightful comments/suggestions.

Appendix

A.1. Proof of Lemma 1

Proof. Let π(θ) denote the prior density of θ, and let π(ϑ) denote the density of ϑ. Then

$$\begin{aligned}
\nabla_\theta\log\pi(\theta\mid X_N) &= \nabla_\theta\log p(X_N\mid\theta) + \nabla_\theta\log\pi(\theta)\\
&= \frac{1}{p(X_N\mid\theta)}\int \nabla_\theta\, p(X_N,\vartheta\mid\theta)\,d\vartheta + \nabla_\theta\log\pi(\theta)\\
&= \int \frac{p(X_N,\vartheta\mid\theta)}{p(X_N\mid\theta)}\,\nabla_\theta\log p(X_N,\vartheta\mid\theta)\,d\vartheta + \nabla_\theta\log\pi(\theta)\\
&= \int \pi(\vartheta\mid\theta,X_N)\,\nabla_\theta\left[\log p(X_N\mid\theta,\vartheta) + \log\pi(\theta\mid\vartheta) + \log\pi(\vartheta) - \log\pi(\theta)\right]d\vartheta + \nabla_\theta\log\pi(\theta)\\
&= \int \nabla_\theta\log\pi(\theta\mid\vartheta,X_N)\,\pi(\vartheta\mid\theta,X_N)\,d\vartheta,
\end{aligned}$$

where the second and third equalities follow from the relation $\nabla_\theta \log g(\theta) = \nabla_\theta g(\theta)/g(\theta)$ for an appropriate function $g(\theta)$, and the fourth and fifth equalities follow from direct calculation of the conditional distributions. □

A.2. Proof of Theorem 1

Let $\pi^* = \pi(\theta\mid X_N)$ denote the posterior density of $\theta$, and let $\pi_t$ denote the density of $\theta^{(t)}$ generated by Algorithm 1 at iteration $t$. We study the discrepancy between $\pi^*$ and $\pi_t$ in the second-order Wasserstein distance. The following conditions are assumed.

  • (A.1)
    The posterior $\pi^*$ is strongly log-concave and gradient-Lipschitz:
    $$f(\theta') - f(\theta) - \nabla f(\theta)^T(\theta' - \theta) \geq \frac{q_N}{2}\|\theta' - \theta\|_2^2, \quad \forall\, \theta, \theta' \in \Theta, \tag{1}$$
    $$\|\nabla f(\theta) - \nabla f(\theta')\|_2 \leq Q_N\|\theta - \theta'\|_2, \quad \forall\, \theta, \theta' \in \Theta, \tag{2}$$
    where $f(\theta) = -\log\pi(\theta\mid X_N)$, and $c_0 N \leq q_N \leq Q_N \leq c_0' N$ for some positive constants $c_0$ and $c_0'$.
  • (A.2)

    The posterior $\pi^*$ has a bounded second moment: $\int_\Theta \theta^T\theta\,\pi^*(\theta)\,d\theta = O(p)$.

  • (A.3)

    $\max_{S\in\mathcal{S}} E_{X_N}\left[\|\nabla_\theta\log\pi(\theta\mid\gamma_S, X_N)\|^2 \mid \theta\right] = O\{N^2(\|\theta\|^2 + p)\}$, where $E_{X_N}$ denotes expectation with respect to the distribution of $X_N$, and $\mathcal{S}$ denotes the set of all possible models.

  • (A.4)

    Let $L_N(\gamma_S, \theta) = \log p(X_N\mid\gamma_S, \theta)/N$ and let $\{L_N^{(i)}(\theta) : i = 1, 2, \ldots, |\mathcal{S}|\}$ be the descending order statistics of $\{L_N(\gamma_S, \theta) : S\in\mathcal{S}\}$. Assume that there exists a constant $\delta > 0$ such that $\inf_{\theta\in\Theta}\left(L_N^{(1)}(\theta) - L_N^{(2)}(\theta)\right) \geq \delta$.

Proof. Part (i). In Algorithm 1, the gradient $\nabla\log\pi(\theta^{(t)}\mid X_N)$ is estimated by running a short Markov chain with a mini-batch of data. Since the initial distribution of the Markov chain might not coincide with its equilibrium distribution, the resulting gradient estimate can be biased. Let $\zeta^{(t)} = \frac{1}{m}\sum_{k=1}^m \nabla_\theta\log\pi(\theta^{(t)}\mid\gamma_{S_k,n}^{(t)}, X_{n,N}^{(t)}) - \nabla\log\pi(\theta^{(t)}\mid X_N)$. It follows from (A.3) that

$$\|E(\zeta^{(t)}\mid\theta^{(t)})\|^2 = O\left\{\frac{N^2(\|\theta^{(t)}\|^2 + p)}{m^2}\right\}, \qquad E\|\zeta^{(t)} - E(\zeta^{(t)}\mid\theta^{(t)})\|^2 = O\left\{\frac{N^2(\|\theta^{(t)}\|^2 + p)}{mn}\right\}.$$

By Lemma S2 in the supplementary material, if $m \succ p^{1/2}$, $\epsilon \prec 1/N \prec (mn)/(Np)$, and $V = O(p)$ holds, then

$$W_2(\pi_t, \pi^*) = (1 - \omega)^t\, W_2(\pi_0, \pi^*) + O\!\left(\frac{p^{1/2}}{m}\right) + O\{(\epsilon p)^{1/2}\} + O\!\left\{\left(\frac{\epsilon N p}{mn}\right)^{1/2}\right\} \to 0, \quad \text{as } t\to\infty, \tag{3}$$

for some $\omega > 0$, since $q_N \asymp N$ and $Q_N \asymp N$ hold by conditions (A.1) and (A.2).

Part (ii). Since $\rho(\theta)$ is $\alpha$-Lipschitz, we have $|\rho(\theta)| \leq \alpha\|\theta\| + C'$ for some constant $C'$. Further, $\pi^*$ is strongly log-concave, so $\pi^*(|\rho|) < \infty$, i.e., $\rho$ is $\pi^*$-integrable. On the other hand,

$$\left|\int \rho(\theta)\,d\pi^*(\theta) - \int \rho(\tilde\theta)\,d\pi_t(\tilde\theta)\right| = |E\rho(\theta) - E\rho(\tilde\theta)| \leq E|\rho(\theta) - \rho(\tilde\theta)| \leq \alpha E\|\theta - \tilde\theta\|_2 \leq \alpha\{E\|\theta - \tilde\theta\|_2^2\}^{1/2} = \alpha W_2(\pi^*, \pi_t) = o(1) \quad \text{(due to eq. (3))}, \tag{4}$$

where $\theta$ and $\tilde\theta$ are two random variables whose marginal distributions are $\pi^*$ and $\pi_t$ respectively, $E(\cdot)$ denotes expectation with respect to the joint distribution of $\theta$ and $\tilde\theta$, and the coupling is chosen such that $(E\|\theta - \tilde\theta\|_2^2)^{1/2} = W_2(\pi^*, \pi_t)$. This implies that $\rho$ is also $\pi_t$-integrable and $\int\rho(\tilde\theta)\,d\pi_t(\tilde\theta) \to \int\rho(\theta)\,d\pi^*(\theta)$ as $t\to\infty$.

Further, by the Markov property, the weak law of large numbers applies, and thus $\sum_{t=1}^T \rho(\theta^{(t)})/T - \sum_{t=1}^T \int\rho(\tilde\theta)\,d\pi_t(\tilde\theta)/T = O_p(T^{-1/2})$. Combining this with the above result leads to

$$\left|\sum_{t=1}^T \rho(\theta^{(t)})/T - \pi^*(\rho)\right| = O_p(T^{-1/2}) + \alpha\sum_{t=1}^T W_2(\pi^*, \pi_t)/T \to 0. \tag{5}$$

Part (iii). To establish the convergence of the estimator of $\pi(\gamma_S\mid X_N)$, we define $L_N(\gamma_S, \theta^{(t)}) = \log p(X_N\mid\gamma_S, \theta^{(t)})/N$, $L_n(\gamma_S, \theta^{(t)}) = \log p(X_n^{(t)}\mid\gamma_S, \theta^{(t)})/n$, and $\xi_{n,S}^{(t)} = L_n(\gamma_S, \theta^{(t)}) - L_N(\gamma_S, \theta^{(t)})$ for any $S\in\mathcal{S}$. For each $S$, $\xi_{n,S}^{(t)}$ is approximately Gaussian with $E(\xi_{n,S}^{(t)}) = 0$ and $\mathrm{Var}(\xi_{n,S}^{(t)}) = O(1/n)$. Therefore, for any positive $\nu$, with probability $1 - |\mathcal{S}|^{-\nu}$, $\max_S |\xi_{n,S}|$ is bounded by $\delta_n := \{(2\nu + 2)\log|\mathcal{S}|/n\}^{1/2} = O[\{(\nu + 1)p/n\}^{1/2}]$, by the tail probability of the Gaussian distribution. It follows that, with high probability, if $S$ is the most likely model, i.e., $L_N(\gamma_S, \theta^{(t)}) = L_N^{(1)}(\theta^{(t)})$, then

$$\begin{aligned}
&\left|\pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \pi(\gamma_S\mid X_N, \theta^{(t)})\right|\\
&\quad= \left|\frac{1}{1 + \sum_{S'\neq S} e^{N\{L_N(\gamma_{S'},\theta^{(t)}) - L_N(\gamma_S,\theta^{(t)}) + \xi_{n,S'} - \xi_{n,S}\}}} - \frac{1}{1 + \sum_{S'\neq S} e^{N\{L_N(\gamma_{S'},\theta^{(t)}) - L_N(\gamma_S,\theta^{(t)})\}}}\right|\\
&\quad= \frac{\sum_{S'\neq S} e^{N\{L_N(\gamma_{S'},\theta^{(t)}) - L_N(\gamma_S,\theta^{(t)}) + b_{S'}\}}}{\left[1 + \sum_{S'\neq S} e^{N\{L_N(\gamma_{S'},\theta^{(t)}) - L_N(\gamma_S,\theta^{(t)}) + b_{S'}\}}\right]^2}\; N|\xi_{n,S'} - \xi_{n,S}|\\
&\quad\leq (2^p - 1)\, e^{-N(\delta - 2\delta_n)}\, N\, 2\delta_n \leq e^{-N\delta/2} \to 0,
\end{aligned}$$

if $\nu p \prec n$ (i.e., $\delta_n \prec \delta$) and $N \succ p$, where the second equality follows from the mean-value theorem by viewing the $N\{L_N(\gamma_{S'}, \theta^{(t)}) - L_N(\gamma_S, \theta^{(t)})\}$'s as the arguments of $\pi(\gamma_S\mid X_N, \theta^{(t)})$, and $b_{S'}$ denotes a value between $0$ and $\xi_{n,S'} - \xi_{n,S}$. Similarly, if $S$ is not the most likely model, then we denote by $S^*$ the most likely model and, by the mean-value theorem,

$$\begin{aligned}
&\left|\pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \pi(\gamma_S\mid X_N, \theta^{(t)})\right|\\
&\quad= \left|\frac{e^{N\{L_N(\gamma_S,\theta^{(t)}) - L_N(\gamma_{S^*},\theta^{(t)}) + \xi_{n,S} - \xi_{n,S^*}\}}}{1 + \sum_{S'\neq S^*} e^{N\{L_N(\gamma_{S'},\theta^{(t)}) - L_N(\gamma_{S^*},\theta^{(t)}) + \xi_{n,S'} - \xi_{n,S^*}\}}} - \frac{e^{N\{L_N(\gamma_S,\theta^{(t)}) - L_N(\gamma_{S^*},\theta^{(t)})\}}}{1 + \sum_{S'\neq S^*} e^{N\{L_N(\gamma_{S'},\theta^{(t)}) - L_N(\gamma_{S^*},\theta^{(t)})\}}}\right|\\
&\quad\leq \left[1 + (2^p - 1)\, e^{-N(\delta - 2\delta_n)} + e^{2N\delta_n}\right] e^{-N(\delta - 2\delta_n)}\, N\, 2\delta_n \leq e^{-N\delta/2} \to 0.
\end{aligned}$$

In conclusion, with probability $1 - 1/|\mathcal{S}|^\nu$, $|\pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \pi(\gamma_S\mid X_N, \theta^{(t)})| < \exp(-N\delta/2)$ for all $S$, any iteration $t$ and any $\theta^{(t)}\in\Theta$. One can then choose $\nu = (n/p)^{1/2}\to\infty$ such that $|\pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \pi(\gamma_S\mid X_N, \theta^{(t)})|$ is bounded in expectation by

$$\max_S E\left|\pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \pi(\gamma_S\mid X_N, \theta^{(t)})\right| \leq \exp(-N\delta/2) + 1/|\mathcal{S}|^\nu \to 0, \tag{6}$$

for any iteration $t$. Conditioned on $\{\theta^{(t)} : t = 1, 2, \ldots\}$, the differences $\pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \pi(\gamma_S\mid X_N, \theta^{(t)})$ are independent and each is bounded by 1, so the weak law of large numbers applies. Therefore, for any $S\in\mathcal{S}$,

$$\left|\frac{1}{T}\sum_{t=1}^T \pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \frac{1}{T}\sum_{t=1}^T \pi(\gamma_S\mid X_N, \theta^{(t)})\right| = O_p(T^{-1/2}) + \exp(-N\delta/2) + 1/|\mathcal{S}|^\nu \to 0, \tag{7}$$

provided $p \prec n \prec N$. Since $\{\theta^{(t)} : t = 1, 2, \ldots\}$ forms a time-homogeneous Markov chain whose convergence is measured by (3), and the function $\pi(\gamma_S\mid X_N, \theta)$ is bounded and continuous in $\theta$,

$$\left|\frac{1}{T}\sum_{t=1}^T \pi(\gamma_S\mid X_N, \theta^{(t)}) - \pi(\gamma_S\mid X_N)\right| = O_p(T^{-1/2}) \tag{8}$$

holds for any $S\in\mathcal{S}$. Combining (8) with (7) leads to

$$\left|\frac{1}{T}\sum_{t=1}^T \pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)}) - \pi(\gamma_S\mid X_N)\right| = O_p(T^{-1/2}) + \exp(-N\delta/2) + 1/|\mathcal{S}|^\nu \to 0. \tag{9}$$

Conditioned on $X_{n,N}^{(t)}$ and $\theta^{(t)}$, by the standard theory of MCMC, $m^{-1}\sum_{i=1}^m I(\gamma_{S_i,n}^{(t)} = \gamma_S)$ forms a consistent estimator of $\pi(\gamma_S\mid X_{n,N}^{(t)}, \theta^{(t)})$ with an asymptotic bias of $O(1/m)$. Since $m$ increases with $p$ and $N$, the estimator is asymptotically unbiased. Combining this result with (9) leads to

$$\left|\frac{1}{mT}\sum_{t=1}^T\sum_{i=1}^m I(\gamma_{S_i,n}^{(t)} = \gamma_S) - \pi(\gamma_S\mid X_N)\right| = O_p(T^{-1/2}) + \exp(-N\delta/2) + 1/|\mathcal{S}|^\nu + O_p(m^{-1/2}), \tag{10}$$

which converges to 0 as T → ∞ and N → ∞. □

References

  1. Ahn S, Balan AK & Welling M (2012). Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML.
  2. Barbieri M & Berger J (2004). Optimal predictive model selection. Annals of Statistics 32, 870–897.
  3. Bardenet R, Doucet A & Holmes C (2014). Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In ICML.
  4. Bardenet R, Doucet A & Holmes CC (2017). On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research 18, 47:1–47:43.
  5. Betancourt M (2015). The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In ICML.
  6. Bierkens J, Fearnhead P & Roberts G (2019). The zig-zag process and super-efficient Monte Carlo for Bayesian analysis of big data. Annals of Statistics 47, 1288–1320.
  7. Bouchard-Côté A, Vollmer S & Doucet A (2018). The bouncy particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method. Journal of the American Statistical Association 113, 855–867.
  8. Cappé O, Moulines E & Rydén T (2005). Inference in Hidden Markov Models. New York: Springer.
  9. Chen H, Seita D, Pan X & Canny J (2016). An efficient minibatch acceptance test for Metropolis-Hastings. arXiv:1610.06848.
  10. Chen T, Fox EB & Guestrin C (2014). Stochastic gradient Hamiltonian Monte Carlo. In ICML.
  11. Chen Y & Ghahramani Z (2016). Scalable discrete sampling as a multi-armed bandit problem. In ICML.
  12. Dalalyan AS & Karagulyan AG (2017). User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv:1710.00095.
  13. Ding N, Fang Y, Babbush R, Chen C, Skeel RD & Neven H (2014). Bayesian sampling using stochastic gradient thermostats. In NIPS.
  14. Gibbs A & Su F (2002). On choosing and bounding probability metrics. International Statistical Review 70, 419–435.
  15. Green P (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732.
  16. Korattikara A, Chen Y & Welling M (2014). Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In ICML.
  17. Li C, Chen C, Carlson DE & Carin L (2016). Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In AAAI.
  18. Ma Y-A, Chen T & Fox EB (2015). A complete recipe for stochastic gradient MCMC. In NIPS.
  19. Maclaurin D & Adams RP (2014). Firefly Monte Carlo: Exact MCMC with subsets of data. In IJCAI.
  20. Nemeth C & Fearnhead P (2019). Stochastic gradient Markov chain Monte Carlo. arXiv:1907.06986.
  21. Sato I & Nakagawa H (2014). Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Ito process. In ICML.
  22. Scott SL, Blocker AW, Bonassi FV, Chipman HA, George EI & McCulloch RE (2016). Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management 11, 78–88.
  23. Srivastava S, Li C & Dunson DB (2018). Scalable Bayes via Barycenter in Wasserstein space. Journal of Machine Learning Research 19, 1–35.
  24. Tanner M & Wong W (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association 82, 528–540.
  25. Teh YW, Thiery A & Vollmer S (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research 17, 1–33.
  26. Welling M & Teh YW (2011). Bayesian learning via stochastic gradient Langevin dynamics. In ICML.
  27. Xue J & Liang F (2019). Double-parallel Monte Carlo for Bayesian analysis of big data. Statistics and Computing 29, 23–32.
