Abstract
In the classical biased sampling problem, we have k densities π1(·), …, πk(·), each known up to a normalizing constant, i.e. for l = 1, …, k, πl(·) = νl(·)/ml, where νl(·) is a known function and ml is an unknown constant. For each l, we have an iid sample from πl, and the problem is to estimate the ratios ml/ms for all l and all s. This problem arises frequently in several situations in both frequentist and Bayesian inference. An estimate of the ratios was developed and studied by Vardi and his co-workers over two decades ago, and there has been much subsequent work on this problem from many different perspectives. In spite of this, there are no rigorous results in the literature on how to estimate the standard error of the estimate. We present a class of estimates of the ratios of normalizing constants that are appropriate for the case where the samples from the πl’s are not necessarily iid sequences, but are Markov chains. We also develop an approach based on regenerative simulation for obtaining standard errors for the estimates of ratios of normalizing constants. These standard error estimates are valid for both the iid case and the Markov chain case.
Key words and phrases: Geometric ergodicity, importance sampling, Markov chain Monte Carlo, ratios of normalizing constants, regenerative simulation, standard errors
1 Introduction
The problem of estimating ratios of normalizing constants of unnormalized densities arises frequently in statistical inference. Here we mention three instances of this problem. In missing data (or latent variable) models, suppose that the data is Xobs, and the likelihood of the data is difficult to write down but Xobs can be augmented with a part Xmis in such a way that the likelihood for (Xmis, Xobs) is easy to write. In this case (using generic notation) we have pθ(Xmis | Xobs) = pθ(Xmis, Xobs)/pθ(Xobs). The denominator, i.e. the likelihood of the observed data at parameter value θ, is precisely a normalizing constant. For the purpose of carrying out likelihood inference, if θ1 is some reference value, knowledge of log(pθ(Xobs)/pθ1(Xobs)) is equivalent to knowledge of log(pθ(Xobs)): for these two functions the maximum occurs at the same point, and the negative second derivative at the maximum (i.e. the observed Fisher information) is the same.
A second example arises when the likelihood has the form pθ(x) = gθ(x)/zθ, where gθ is a known function. This situation arises in exponential family problems, and except for the usual textbook examples, the normalizing constant is analytically intractable. If for some arbitrary point θ1 we know the ratio zθ/zθ1 for every θ, then we would know pθ(x) up to a multiplicative constant and, as before, this would be equivalent to knowing pθ(x) itself. A third example arises in certain hyperparameter selection problems in Bayesian analysis. Suppose that we wish to choose a prior from the family {πh, h ∈ H}, where the πh’s are densities with respect to a dominating measure μ. For any h ∈ H, the marginal likelihood of the data X when the prior is πh is given by mh(X) = ∫ pθ(X)πh(θ) μ(dθ), i.e. it is the normalizing constant in the statement “the posterior is proportional to the likelihood times the prior.” The empirical Bayes choice of h is by definition argmaxh mh(X). Suppose that h1 is some arbitrary point in H. As in the previous two examples, for the purpose of finding the empirical Bayes choice of h, knowing mh(X)/mh1(X) is equivalent to knowing mh(X). (One may also be interested in the closely related problem of estimating the posterior expectation of a function f(θ) when the hyperparameter is h, which is given by Eh(f(θ) | X) = (∫ f(θ)pθ(X)πh(θ) μ(dθ))/mh(X). Estimating Eh(f(θ) | X) as h varies is relevant in Bayesian sensitivity analysis. The scheme for doing this used in Buta and Doss (2011) does not involve estimating mh(X) itself and requires only estimating mh(X)/mh1(X) for some fixed h1.)
Now, estimation of a normalizing constant is generally a difficult problem; for example, the so-called harmonic mean estimator proposed by Newton and Raftery (1994) typically converges at a rate that is much slower than n−1/2 (Wolpert and Schmidler, 2011). On the other hand, estimating a ratio of normalizing constants typically can be done with a √n-consistent estimator. To illustrate this fact, consider the second of the problems described above, and let μ be the measure with respect to which the pθ’s are densities. Suppose that X1, X2, … are a “sample” from pθ1 (iid sample or ergodic Markov chain output). For the simple and well-known estimator on the left side of (1.1) we have
(1/n) Σ_{i=1}^{n} gθ(Xi)/gθ1(Xi) → zθ/zθ1 almost surely,  (1.1)
and under certain moment conditions on the ratio gθ/gθ1 and mixing conditions on the chain, the estimate on the left of (1.1) also satisfies a central limit theorem (CLT). In fact, in all the problems mentioned above, it is not necessary to estimate the normalizing constants themselves, and it is sufficient to estimate ratios of normalizing constants.
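As a concrete illustration of (1.1), here is a minimal sketch in Python; the function names and the Gaussian toy example are ours, not from the paper, and the reported standard error is the iid-based one, valid only when the Xi’s are iid.

```python
import numpy as np

def ratio_estimate(x, g_theta, g_theta1):
    # Importance-sampling estimate of z_theta / z_theta1 from draws x ~ p_theta1,
    # as in (1.1); the iid-based standard error below is not valid for Markov chains.
    w = g_theta(x) / g_theta1(x)
    return w.mean(), w.std(ddof=1) / np.sqrt(len(w))

# Toy check: g_theta(x) = exp(-(x - theta)^2 / 2), so z_theta = sqrt(2 pi) for every
# theta and the true ratio z_theta / z_theta1 is 1.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)          # iid draws from p_{theta_1}, theta_1 = 0
g = lambda th: lambda x: np.exp(-0.5 * (x - th) ** 2)
print(ratio_estimate(x, g(0.5), g(0.0)))        # approximately (1.0, small SE)
```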
If θ is not close to θ1, or more precisely, if gθ and gθ1 are not close over the region where the Xi’s are likely to be, the ratio gθ/gθ1 has high variance, so the estimator above does not work well. It is better to choose θ1, …, θk appropriately spread out in the parameter space Θ, and on the left side of (1.1) replace gθ1(Xi) with Σ_{s=1}^{k} ws gθs(Xi), where ws > 0, s = 1, …, k. The hope is that gθ will be close to at least one of the gθs’s, and so preclude having large variances. To implement this, suppose we know all the ratios zθs/zθt, s, t ∈ {1, …, k}, or equivalently, we know zθs/zθ1, s ∈ {1, …, k}. In this case, if for each l = 1, …, k there is available a sample X1(l), …, Xnl(l) from pθl, then letting n = n1 + ⋯ + nk and al = nl/n, we have
(1/n) Σ_{l=1}^{k} Σ_{i=1}^{nl} gθ(Xi(l)) / [Σ_{s=1}^{k} as gθs(Xi(l)) / (zθs/zθ1)] → zθ/zθ1 almost surely.  (1.2)
When compared with the estimate on the left side of (1.1), the estimate on the left side of (1.2) is accurate over a much bigger range of θ’s. But to use it, it is necessary to be able to estimate the ratios zθs/zθ1, s ∈ {1, …, k}, and it is this problem that is the focus of this paper.
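The sketch below implements the pooled estimator under our reconstruction of display (1.2); all names are ours, and the argument d stands for the vector of ratios zθs/zθ1 whose estimation is the subject of the paper.

```python
import numpy as np

def pooled_ratio_estimate(samples, g_list, g_target, d):
    """Pooled importance-sampling estimate of z_theta / z_theta1, as in (1.2).

    samples  -- list of k arrays; samples[l] is the sample from p_{theta_{l+1}}
    g_list   -- the k unnormalized densities g_{theta_1}, ..., g_{theta_k}
    g_target -- the unnormalized density g_theta of interest
    d        -- d[s] = z_{theta_{s+1}} / z_{theta_1}, with d[0] = 1
    """
    n = sum(len(x) for x in samples)
    total = 0.0
    for x in samples:
        denom = sum((len(samples[s]) / n) * g_list[s](x) / d[s]
                    for s in range(len(g_list)))
        total += np.sum(g_target(x) / denom)
    return total / n
```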
We now state explicitly the version of this problem that we will deal with here, and we change to the notation that we will use for the rest of the paper. We have k densities π1, …, πk with respect to the measure μ, which are known except for normalizing constants, i.e. we have πl = νl/ml, where the νl’s are known functions and the ml’s are unknown constants. For each l we have a Markov chain with invariant density πl, the k chains are independent, and the objective is to estimate all possible ratios mi/mj, i ≠ j or, equivalently, the vector d = (m2/m1, …, mk/m1).
When the samples are iid sequences, this is the biased sampling problem introduced by Vardi (1985), which contains examples that differ in character quite a bit from those considered here.
Suppose we are in the iid case, and consider the pooled sample S consisting of the n = n1 + ⋯ + nk points Xi(l), i = 1, …, nl, l = 1, …, k. Let x ∈ S, and suppose that x came from the lth sample. If we pretend that the only thing we know about x is its value, then the probability that x came from the lth sample is
pl(x, m) = [nl νl(x)/ml] / [Σ_{s=1}^{k} ns νs(x)/ms],  (1.3)
where m = (m1, …, mk). Geyer (1994) proposed to treat the vector m as an unknown parameter and to estimate it by maximizing the quasi-likelihood function
Ln(m) = Π_{l=1}^{k} Π_{i=1}^{nl} pl(Xi(l), m)  (1.4)
with respect to m. Actually, there is a non-identifiability issue regarding Ln: for any constant c > 0, Ln(m) and Ln(cm) are the same. So we can estimate m only up to an overall multiplicative constant, i.e. we can estimate only d. Accordingly, Geyer (1994) proposed to estimate d by maximizing Ln(m) subject to the constraint m1 = 1. (A more detailed discussion of the quasi-likelihood function (1.4) is given in Section 3.) In fact, the resulting estimate, d̂, was originally proposed by Vardi (1985), and studied further by Gill, Vardi and Wellner (1988), who showed that it is consistent and asymptotically normal, and established its optimality properties, all under the assumption that for each l = 1, …, k, X1(l), …, Xnl(l) is an iid sequence. Geyer (1994) extended the consistency and asymptotic normality result to the case where the k sequences X1(l), …, Xnl(l), l = 1, …, k, are Markov chains satisfying certain mixing conditions. The estimate was rederived in Meng and Wong (1996), Kong et al. (2003), and Tan (2004) from completely different perspectives, all under the iid assumption.
As mentioned earlier, for the kinds of problems we have in mind the distributions πl are analytically intractable, and the estimates on the left sides of (1.1) and (1.2) are applicable to a much larger class of problems if we are willing to use Markov chain samples instead of iid samples. The variances of these estimates have a complex form which is difficult to estimate consistently. The variance matrix of d̂ is much harder to estimate since d̂ is not given in closed form, but is given only implicitly as the solution to a constrained optimization problem.
The present paper deals with two issues. First, none of the authors cited above give consistent estimators of the variance matrix of d̂, even in the iid case. (For the iid case, Kong et al. (2003) give an estimate that involves the inverse of a certain Fisher information matrix, but this formal calculation does not establish consistency of the estimate, or even the necessary CLT, nor do the authors make such claims.) As mentioned earlier, the problem of estimating the variance is far more challenging when the samples are Markov chains as opposed to iid sequences. In this paper we give a CLT for the vector d̂ based on regenerative simulation. The main benefit of this result is that it gives, essentially as a free by-product, a simple consistent estimate of the variance matrix in the Markov chain setting. Second, the estimate of d obtained by the aforementioned authors is optimal in the case where the samples are iid. When the samples are Markov chains, the estimate is no longer optimal. We present a method for forming estimators which are suitable in the Markov chain setting. The regeneration-based CLT and estimate of the variance matrix both apply to the class of estimators that we propose.
The rest of this paper is organized as follows. In Section 2 we extend the quasi-likelihood function used by Geyer (1994) to a class of quasi-likelihood functions, each of which gives rise to an estimator of d. The main theoretical developments of this paper are in Section 3, where we use ideas from regenerative simulation to develop CLTs for any of these estimators, and we show that variance estimates emerge as by-products. There are two reasons why we need to be able to estimate the variance of an estimate of d. One is the standard rule that a point estimate should always be accompanied by a measure of uncertainty, an important rule that applies to any context. The other is that when the samples are Markov chains, as opposed to iid samples, the quasi-likelihood function (1.4) gives rise to a non-optimal estimator. So in Section 4 we consider the other quasi-likelihood functions presented in Section 2 and we develop a method for choosing the one which gives rise to the estimator with the smallest variance. It is precisely our ability to estimate the variance that makes this possible. In Section 5 we present a small study that illustrates the gains obtained from using an estimate of d designed for Markov chains, and we illustrate our methodology by showing how it can be used to estimate certain quantities of interest in the Ising model of statistical mechanics. The Appendix provides proofs of the three assertions made by the theorem in Section 3, namely strong consistency of our estimates of d, the CLT for the estimates, and strong consistency of the estimates of the variance matrix.
2 Estimation of the Ratios of Normalizing Constants in the Markov Chain Setting
We begin by considering more carefully the quasi-likelihood function for m given by (1.4), and for the technical development it is much more convenient to work on the log scale. So define the vector ζ by
ζl = log(al/ml), l = 1, …, k,  (2.1)
and rewrite (1.3) as
pl(x, ζ) = νl(x) e^{ζl} / Σ_{s=1}^{k} νs(x) e^{ζs}, l = 1, …, k.  (2.2)
Clearly, ζ determines and is determined by (m1, …, mk), and the log quasi-likelihood function for ζ is
ln(ζ) = Σ_{l=1}^{k} Σ_{i=1}^{nl} log(pl(Xi(l), ζ)).  (2.3)
In (2.1), (m1, …, mk) is an arbitrary vector with strictly positive components, i.e. ml need not correspond to the normalizing constant for νl. We will use ζ(t) to denote the true value of ζ, i.e. the value it takes when the ml’s are the normalizing constants for the νl’s. The non-identifiability issue now is that for any constant c ∈ ℝ, ln(ζ) and ln(ζ + c1k) are the same (here, 1k is the vector of k 1’s), so we can estimate ζ(t) only up to an additive constant. Accordingly, with ζ0 ∈ ℝk defined by ζ0 = ζ(t) − (k−1 Σ_{l=1}^{k} ζl(t)) 1k, Geyer (1994) proposed to estimate ζ0 by ζ̂, the maximizer of ln subject to the linear constraint ζ⊤1k = 0, and thus obtain an estimate of d.
The term pl(x, ζ) in (2.2) has the appearance of a likelihood ratio, and for the denominator, in the term Σ_{s=1}^{k} νs(x) e^{ζs} = Σ_{s=1}^{k} (ns/n) νs(x)/ms, the probability measure νs/ms is given weight proportional to the length of the chain Φs. Now Gill et al.’s (1988) optimality result does not apply to the Markov chain case, in which, among other things, the chains Φ1, …, Φk mix at possibly different rates, and the as’s should in some sense reflect the vague notion of “effective sample sizes” of the different chains. The optimal choice of the vector a = (a1, …, ak) is very difficult to determine theoretically, and in Section 4 we describe an empirical method for choosing a. Accordingly in (2.1) and henceforth, a will not necessarily be given by al = nl/n, but will be an arbitrary probability vector satisfying the condition that al > 0 for l = 1, …, k.
A Quasi-Likelihood Function Designed for the Markov Chain Setting
As mentioned earlier, Geyer (1994) showed that when we take aj = nj/n, the maximizer of the log quasi-likelihood function defined by (2.3) (subject to the constraint ζ⊤1k = 0) is a consistent estimate of the true value ζ0, and also satisfies a CLT, even when the k sequences X1(l), …, Xnl(l), l = 1, …, k are Markov chains. But when the k sequences are Markov chains, the choice aj = nj/n is no longer optimal, and for other choices of a, the (constrained) maximizer of (2.3) is not necessarily even consistent. We will present a new log quasi-likelihood function which does yield consistent asymptotically normal estimates, and before doing this, we give a brief motivating argument.
Suppose that we are in the simple case where we have a parametric family {pθ, θ ∈ Θ} and we observe data Y1, …, Yn iid from pθ0, for some θ0 ∈ Θ. Let ly(θ) = log(pθ(y)), and let Q(θ) = Eθ0(lY1(θ)). The fact that argmaxθ Q(θ) = θ0 is well known (and easy to see via a short argument involving Jensen’s inequality). The log likelihood function based on Y1, …, Yn is ln(θ) = Σ_{i=1}^{n} lYi(θ). By the strong law of large numbers,
n−1 ln(θ) → Q(θ) almost surely, for each θ ∈ Θ,  (2.4)
and assuming sufficient regularity conditions, argmaxθ ln(θ) → argmaxθ Q(θ) = θ0 almost surely, i.e. the maximum likelihood estimator is consistent.
We now return to the present situation, in which for l = 1, …, k, Φl = X1(l), X2(l), … is a Markov chain with invariant density πl. Suppose we use ln(ζ) given by (2.3), with a an arbitrary probability vector (i.e. a is not necessarily given by aj = nj/n), and let λ̃(ζ) denote the almost sure limit of n−1 ln(ζ). The key condition
argmax_{ζ : ζ⊤1k = 0} λ̃(ζ) = ζ0  (2.5)
need not hold, and the constrained maximizer of ln(ζ) may converge, but not to the true value.
With this in mind, suppose that a is an arbitrary probability vector with non-zero entries and define w ∈ ℝk by
wl = al / (nl/n), l = 1, …, k.  (2.6)
The log quasi-likelihood function we will use is
ℓn(ζ) = Σ_{l=1}^{k} wl Σ_{i=1}^{nl} log(pl(Xi(l), ζ)),  (2.7)
instead of ln given by (2.3) [note the slight change of notation from l to ℓ]. As will emerge in our proofs of consistency and asymptotic normality of the constrained maximizer of ℓn(ζ), for this log quasi-likelihood function, the stochastic process (in ζ) n−1 ℓn(ζ) converges almost surely to a function of ζ which is maximized at ζ0, a condition that plays the role of (2.4) and (2.5). Note that if al = nl/n, then wl = 1 and (2.7) reduces to (2.3).
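As a concrete illustration, the following sketch computes the weighted log quasi-likelihood under our reconstruction of displays (2.6) and (2.7) and maximizes it subject to ζ⊤1k = 0; all function names are ours, and centering the argument is just one simple way to impose the constraint.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.optimize import minimize

def neg_ell(zeta, samples, nu_list, a):
    # Minus the log quasi-likelihood (2.7); samples[l] holds the chain for pi_l.
    n = sum(len(x) for x in samples)
    k = len(samples)
    w = [a[l] * n / len(samples[l]) for l in range(k)]       # weights (2.6)
    val = 0.0
    for l in range(k):
        # log p_l(x, zeta) = log nu_l(x) + zeta_l - log sum_s nu_s(x) exp(zeta_s)
        logs = np.stack([np.log(nu_list[s](samples[l])) + zeta[s] for s in range(k)])
        val += w[l] * np.sum(logs[l] - logsumexp(logs, axis=0))
    return -val

def fit_zeta(samples, nu_list, a):
    # Maximize (2.7) subject to zeta'1_k = 0; centering removes the flat direction.
    k = len(samples)
    obj = lambda z: neg_ell(z - z.mean(), samples, nu_list, a)
    res = minimize(obj, np.zeros(k), method="BFGS")
    return res.x - res.x.mean()                              # the constrained maximizer
```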
3 A Regeneration-Based CLT and Estimate of the Variance Matrix
Here we discuss estimation of the variance matrix of the estimator d̂ developed in Section 2. Estimation of the variance matrix is complicated by the fact that d̂ is based on several Markov chains and that it is given only implicitly as the solution of a constrained maximization problem. Before describing our approach for estimating the variance matrix of d̂, we first review what is available in a much simpler setting.
3.1 Estimation of the Variance of the Sample Mean in the Single Markov Chain Setting
Suppose we have a single Markov chain X1, X2, … on the measurable space (X, B), with invariant distribution π, f : X → ℝ is a function whose expectation μ = Eπ(f(X)) we estimate via f̄n = n−1 Σ_{i=1}^{n} f(Xi), and we are interested in estimating the variance of f̄n. Here we describe the commonly used approaches for this simple case, namely those based on batching, regeneration, and spectral methods. We explain how regeneration may be used for the case of the statistic d̂ in Section 3.2 (see also the Appendix), and in Section 6 we explain how batching and spectral methods can be implemented for the case of the statistic d̂ using the theoretical development in Section 3. We then compare and contrast the three methods of estimating the variance matrix and argue that, when it can be carried out, regeneration is the method of choice, but point out that unfortunately regeneration-based methods are not always feasible.
Batching
This method involves breaking up X1, …, Xn into M non-overlapping segments of equal length called batches. For m = 1, …, M, batch m is used to produce an estimate f̄(m) in the obvious way. If we have a CLT that states √n (f̄n − μ) →d N(0, κ2), then we can conclude that for fixed M, √(n/M) (f̄(m) − μ) →d N(0, κ2) for each m. If the batch length is large enough relative to the “mixing time” of the chain, then these estimators are approximately independent. If the independence assumption was exactly true rather than approximately true, then the sample variance of the f̄(m)’s would be a valid estimator of (M/n)κ2. The batch means estimate of κ2 is simply n/M times this sample variance. Under regularity conditions that include M → ∞ at a certain rate, the batch means estimate of κ2 is strongly consistent; see Jones et al. (2006), and also Flegal, Haran and Jones (2008). The method of batching has the advantage that it is trivial to program, although some authors caution that it can be outperformed by spectral methods in terms of mean squared error; see, e.g., Geyer (1992).
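A minimal sketch of the batch means estimate just described, with names of our choosing:

```python
import numpy as np

def batch_means_var(fx, M):
    # Batch means estimate of kappa^2 from the values f(X_1), ..., f(X_n):
    # n/M times the sample variance of the M batch means.
    n = (len(fx) // M) * M        # drop the remainder so batches have equal length
    b = n // M                    # batch length
    means = np.asarray(fx)[:n].reshape(M, b).mean(axis=1)
    return b * means.var(ddof=1)
```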
Regenerative Simulation
A regeneration is a random time at which a stochastic process probabilistically restarts itself. The “tours” made by the chain in between such random times are iid, and this fact makes it much easier to analyze the asymptotic behavior of averages, and of statistics which are functions of several averages. In the discrete state space setting, if x ∈ X is any point to which the chain returns infinitely often, then the times of return to x form a sequence of regenerations. For most of the Markov chains used in MCMC algorithms, the state space is continuous, and there is no point to which the chain returns infinitely often with probability one. Even when the state space is discrete, regenerations based on returns to a point x, as described above, are often not useful, because if x has very small probability under the stationary distribution, then on average it will take a very long time to return to x. Fortunately, Mykland, Tierney and Yu (1995) provided a general technique for identifying a sequence of regeneration times 1 = τ0 < τ1 < τ2 < ⋯ that is based on the construction of a minorization condition. This construction will be reviewed shortly, but we now briefly sketch how having a regeneration sequence enables us to construct a simple estimate of the standard error of f̄n. Define
Yt = Σ_{i=τ_{t−1}}^{τt − 1} f(Xi) and Tt = τt − τ_{t−1}, for t = 1, 2, …,
and note that the pairs (Yt, Tt) form an iid sequence. If we run the chain for ρ regenerations, then the total number of cycles, starting at τ0, is ρ, and the total sample size is given by n = Σ_{t=1}^{ρ} Tt = τρ − 1. We may write f̄n as
f̄n = (ρ−1 Σ_{t=1}^{ρ} Yt) / (ρ−1 Σ_{t=1}^{ρ} Tt) = Ȳ / T̄.  (3.1)
Equation (3.1) expresses f̄n as a ratio of two averages of iid quantities, and this representation enables us to use the delta method to obtain both a CLT for f̄n and a simple standard error estimate for f̄n.
An outline of the argument is as follows. From (3.1) we see that as ρ → ∞ (which implies that n → ∞) we have
Ȳ/T̄ = f̄n → Eπ(f(X)) almost surely and Ȳ/T̄ → E(Y1)/E(T1) almost surely,  (3.2)
where the convergence statement on the left follows from the ergodic theorem, and the convergence statement on the right follows from two applications of the strong law of large numbers. (In (3.2) the subscript π to the expectation indicates that X ∼ π.) From (3.2) we obtain E(Y1) = Eπ(f(X))E(T1). Now the bivariate CLT gives
√ρ ((Ȳ, T̄) − (E(Y1), E(T1))) →d N(0, Σf),  (3.3)
where Σf = Cov ((Y1, T1)⊤). The delta method applied to the function h(y, t) = y/t gives the CLT
√ρ (f̄n − Eπ(f(X))) →d N(0, σf2),
where σf2 = ∇h⊤ Σf ∇h (and ∇h is evaluated at the vector of means in (3.3)). Moreover, it is straightforward to check that for the variance estimator
σ̂f2 = Σ_{t=1}^{ρ} (Yt − f̄n Tt)2 / (ρ T̄2)  (3.4)
we have σ̂f2 → σf2 almost surely. The regularity conditions needed to make this argument rigorous are spelled out when we discuss the case of the more complicated estimator d̂ (Section 3.2 and the Appendix).
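The following sketch implements (3.1) and (3.4) under our reconstruction of those displays; it assumes the chain is stored 0-based, so tau[0] = 0 marks the start of the first tour, and the function names are ours.

```python
import numpy as np

def regen_mean_and_se(fx, tau):
    # fx: values f(X_1), ..., f(X_n); tau: regeneration times with tau[0] = 0,
    # so tour t occupies indices tau[t-1], ..., tau[t] - 1 of fx.
    fx, tau = np.asarray(fx), np.asarray(tau)
    Y = np.array([fx[tau[t]:tau[t + 1]].sum() for t in range(len(tau) - 1)])  # tour sums
    T = np.diff(tau)                                                          # tour lengths
    rho = len(T)
    f_bar = Y.sum() / T.sum()                                # ratio of averages, as in (3.1)
    sigma2 = np.sum((Y - f_bar * T) ** 2) / (rho * T.mean() ** 2)             # (3.4)
    return f_bar, np.sqrt(sigma2 / rho)                      # SE based on rho regenerations
```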
The argument above hinges on being able to arrive at a sequence of regeneration times, and whether these are useful depends on whether the sequence has the property that the length of the tours between regenerations is not very large. We now describe the minorization condition that can sometimes be used to construct useful regeneration sequences. Let Kx(A) be the Markov transition function. The construction described in Mykland et al. (1995) requires the existence of a function s: X → [0, 1), whose expectation with respect to π is strictly positive, and a probability measure Q, such that K satisfies
Kx(A) ≥ s(x) Q(A) for all x ∈ X and all measurable A ⊆ X.  (3.5)
This is called a minorization condition and, as we describe below, it can be used to introduce regenerations into the Markov chain driven by K. Define the Markov transition function R by
Rx(A) = [Kx(A) − s(x) Q(A)] / (1 − s(x)).
Note that for fixed x ∈ X, Rx is a probability measure. We may therefore write
Kx(A) = s(x) Q(A) + (1 − s(x)) Rx(A),
which gives a representation of Kx as a mixture of two probability measures, Q and Rx. This provides an alternative method of simulating from K. Suppose that the current state of the chain is Xn. We generate δn ∼ Bernoulli(s(Xn)). If δn = 1, we draw Xn+1 ∼ Q; otherwise, we draw Xn+1 ∼ RXn. Note that if δn = 1, the next state of the chain is drawn from Q, which does not depend on the current state. Hence the chain “forgets” the current state and we have a regeneration. To be more specific, suppose we start the Markov chain with
X1 ∼ Q,  (3.6)
and then use the method described above to simulate the chain. Each time δn = 1, we have Xn+1 ∼ Q and the process stochastically restarts itself; that is, the process regenerates.
In practice, simulating from R can be extremely difficult. Fortunately, Mykland et al. (1995), following Nummelin (1984, p. 62), noticed a clever way of circumventing the need to draw from R. Instead of making a draw from the conditional distribution of δn given xn and then generating xn+1 given (δn, xn), which would result in a draw from the joint distribution of (δn, xn+1) given xn, we simply draw from the conditional distribution of xn+1 given xn in the usual way (i.e. using K), and then draw δn given (xn, xn+1). This alternative sampling mechanism yields a draw from the same joint distribution, but avoids having to draw from R. Moreover, given (xn, xn+1), δn has a Bernoulli distribution with success probability given simply by
Pr(δn = 1 | xn, xn+1) = r(xn, xn+1),
where r(x′, ·) is the Radon-Nikodym derivative of s(x′)Q with respect to Kx′, whose existence is implied by (3.5).
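A minimal sketch of this retrospective draw, assuming Kx and Q have densities k and q (so that the Radon-Nikodym derivative is the ratio s(x)q(y)/k(x, y)); the function names are ours.

```python
import numpy as np

def regen_indicator(x_curr, x_next, s, q_dens, k_dens, rng):
    # Retrospective draw of delta_n given (x_n, x_{n+1}): Bernoulli with success
    # probability s(x_n) q(x_{n+1}) / k(x_n, x_{n+1}); this assumes the minorization
    # (3.5) holds, which guarantees the probability is at most 1.
    p = s(x_curr) * q_dens(x_next) / k_dens(x_curr, x_next)
    return rng.random() < p
```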
We note that both batching and regenerative simulation involve breaking up the Markov chain into segments. In batching, the segments are only approximately independent, but they are of equal lengths; in regenerative simulation, the segments are exactly independent, but they are not of equal lengths, and in fact the lengths are random.
Spectral Methods
The asymptotic variance, σ2, of √n (f̄n − μ) (when it exists) is the infinite series
σ2 = γ0 + 2 Σ_{j=1}^{∞} γj, where γj = Cov(f(X1), f(X1+j)) is calculated under the assumption that X1 has the stationary distribution. Spectral methods involve truncating the series after Mn terms and estimating the truncated series. In more detail, we consider the sum γ0 + 2 Σ_{j=1}^{Mn} wn(j/Mn) γj, where wn is a decreasing function on [0, 1] satisfying wn(0) = 1 and wn(1) = 0, and is called the lag window. We estimate σ2 via σ̂2 = γ̂0 + 2 Σ_{j=1}^{Mn} wn(j/Mn) γ̂j, where the γ̂j are estimates of γj. To ensure consistency, we must have Mn → ∞, but Mn must increase slowly with n; precise conditions on the truncation point Mn and the window wn are given in Flegal and Jones (2010).
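A sketch of the lag-window estimate just described; the default window w(u) = 1 − u is one admissible choice (it satisfies w(0) = 1 and w(1) = 0), and the choices of M and window are the user’s responsibility, per the conditions in Flegal and Jones (2010).

```python
import numpy as np

def spectral_var(fx, M, window=lambda u: 1.0 - u):
    # Lag-window estimate of sigma^2 = gamma_0 + 2 sum_{j>=1} gamma_j,
    # truncated at lag M, with empirical autocovariances gamma_hat_j.
    x = np.asarray(fx, dtype=float)
    x = x - x.mean()
    n = len(x)
    gamma = [x @ x / n] + [x[:-j] @ x[j:] / n for j in range(1, M + 1)]
    return gamma[0] + 2.0 * sum(window(j / M) * gamma[j] for j in range(1, M + 1))
```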
3.2 A CLT for the Estimate of d Designed for Markov Chains
We assume that for l = 1, …, k, chain l has Markov transition function K(l) which satisfies the minorization condition
Kx(l)(A) ≥ sl(x) Ql(A) for all x ∈ X and all measurable A ⊆ X,  (3.7)
for some probability measure Ql and function sl : X → [0, 1) with ∫ sl(x) πl(x) μ(dx) > 0, and that the chain has been run for ρl regenerations. Let 1 = τ0(l) < τ1(l) < ⋯ < τρl(l) denote the regeneration times of the lth chain, and let Tt(l) = τt(l) − τ_{t−1}(l) be the length of the tth tour of the lth chain. So the length of the lth chain, nl = Σ_{t=1}^{ρl} Tt(l), is random. We will assume that ρ1, …, ρk → ∞ in such a way that ρl/ρ1 → cl ∈ (0, ∞), for l = 1, …, k. We will allow the vector a to depend on ρ = (ρ1, …, ρk), i.e. a = a(ρ) (although we will suppress this dependence in the notation except when this dependence matters), and we will make the minimal assumption that a(ρ) → α as ρ1, …, ρk → ∞, where α is a probability vector with strictly positive entries. The extra generality is needed if we wish to choose a in a data-driven way (cf. Remark 3 of Section 4). The definitions of ζ and pl(x, ζ) given by (2.1) and (2.2), respectively, are still in force, ζ0 is still the centered version of the true value of ζ, but now ζ̂ is the constrained maximizer of the new log quasi-likelihood function (2.7). We will show that ζ̂ is a consistent asymptotically normal estimate of ζ0, and since ζ0 determines and is determined by d, this will produce a corresponding estimate of d. Before proceeding, we mention the fact that difficulties arise if the supports of the distributions π1, …, πk differ (the difficulties are pervasive: for the case where we have a continuum of distributions {πθ, θ ∈ Θ}, the simple estimate (1.1) is not even defined if πθ is not absolutely continuous with respect to πθ1). So for the rest of this paper, we will assume that the k distributions π1, …, πk are mutually absolutely continuous. We do not really need to make an assumption this strong, but the assumption is satisfied for all the classes of problems we are considering, and making it eliminates some technical issues.
In order to state our CLT for the vector d̂, we need to define the quantities that go into the expression for the asymptotic variance matrix. We first consider the vector ζ̂ − ζ0, whose asymptotic variance matrix is singular (since this vector sums to 0). The asymptotic distribution of ζ̂ − ζ0 involves the matrices B and Ω defined below. Let ζα be the vector whose components are [ζα]l = −log(ml) + log(αl), and let B be the k × k matrix given by
B = Σ_{l=1}^{k} αl Eπl[diag(p(X, ζα)) − p(X, ζα) p(X, ζα)⊤], where p(x, ζ) = (p1(x, ζ), …, pk(x, ζ))⊤.  (3.8)
We will be using the natural estimate B̂ defined by
B̂ = n−1 Σ_{l=1}^{k} wl Σ_{i=1}^{nl} [diag(p(Xi(l), ζ̂)) − p(Xi(l), ζ̂) p(Xi(l), ζ̂)⊤].  (3.9)
Let
Zi(l,r) = pr(Xi(l), ζα) − Eπl(pr(X, ζα)), i = 1, …, nl, r = 1, …, k,  (3.10)
and note that both Zi(l,r) and the tour sums Yt(l,r) defined in (3.11) have mean 0. Define
Yt(l,r) = Σ_{i=τ_{t−1}(l)}^{τt(l) − 1} Zi(l,r), t = 1, …, ρl.  (3.11)
Let Ω be the k × k matrix defined by
Ωrs = Σ_{l=1}^{k} αl2 [Σ_{j=1}^{k} cj E(T1(j)) / (cl E(T1(l)))] E(Y1(l,r) Y1(l,s)) / E(T1(l)), r, s = 1, …, k.  (3.12)
To obtain an estimate Ω̂, we let
Ẑi(l,r) = pr(Xi(l), ζ̂) − nl−1 Σ_{j=1}^{nl} pr(Xj(l), ζ̂) and Ŷt(l,r) = Σ_{i=τ_{t−1}(l)}^{τt(l) − 1} Ẑi(l,r),
and define Ω̂ by
Ω̂rs = n Σ_{l=1}^{k} (al/nl)2 Σ_{t=1}^{ρl} Ŷt(l,r) Ŷt(l,s), r, s = 1, …, k.  (3.13)
The function g : ℝk → ℝk−1 that maps ζ0 into d is
g(ζ) = ((a2/a1) e^{ζ1−ζ2}, …, (ak/a1) e^{ζ1−ζk})⊤,  (3.14)
and its gradient at ζ0 (in terms of d) is
D = [(d2, …, dk); −diag(d2, …, dk)], the k × (k − 1) matrix whose first row is (d2, …, dk) and whose remaining k − 1 rows form the matrix −diag(d2, …, dk).  (3.15)
We have d = g(ζ0), and by definition d̂ = g(ζ̂).
The theorem below has three parts, pertaining to the strong consistency of d̂, asymptotic normality of d̂, and a consistent estimate of the asymptotic variance matrix of d̂. For consistency we need only minimal assumptions on the Markov chains Φ1, …, Φk, namely the so-called basic regularity conditions (irreducibility, aperiodicity and Harris recurrence) that are needed for the ergodic theorem (Meyn and Tweedie, 1993, Chapter 17). CLTs and associated results always require a stronger condition, and the one that is most commonly used is geometric ergodicity. The theorem refers to the following conditions, which pertain to each l = 1, …, k.
A1 The Markov chain Φl satisfies the basic regularity conditions.
A2 The Markov chain Φl is geometrically ergodic.
A3 The Markov transition function K(l) satisfies the minorization condition (3.7).
For a square matrix C, C† will denote the Moore-Penrose inverse of C.
Theorem 1
Suppose that for each l = 1, …, k, the Markov chain Φl has invariant distribution πl.
1. Under A1, the log quasi-likelihood function (2.7) has a unique maximizer subject to the constraint ζ⊤1k = 0. Let ζ̂ denote this maximizer, and let d̂ = g(ζ̂). Then as ρ1 → ∞, d̂ → d almost surely.
2. Under A1 and A2, as ρ1 → ∞,
√n (d̂ − d) →d N(0, W), where W = D⊤ B† Ω B† D.  (3.16)
3. Assume A1–A3. Let D̂ be the matrix D in (3.15) with d̂ in place of d, and let B̂ and Ω̂ be defined by (3.9) and (3.13), respectively. Then Ŵ = D̂⊤ B̂† Ω̂ B̂† D̂ is a strongly consistent estimator of W.
4 Choice of the Vector a
As mentioned earlier, the log quasi-likelihood that has been proposed and studied in the literature involves the functions pl(x, ζ) given by (2.2), which have the form
pl(x, ζ) = (nl/n) νl(x)/ml / [Σ_{s=1}^{k} (ns/n) νs(x)/ms],  (4.1)
where in the denominator of (4.1), the probability density νs(x)/ms is given weight proportional to the length of the sth chain. Intuitively, one would want to replace ns with the “effective sample size” for chain s, so that if chain s mixes slowly, the weight that is given to νs(x)/ms is small. Unfortunately, there is really no such thing as an effective sample size because the effect of slow mixing varies quite a bit with the function whose mean is being estimated. Therefore, it is better to take a direct approach that involves replacing the vector (n1/n, …, nk/n) by a probability vector a, and choose a to minimize the variance of the resulting estimator. (It should be emphasized that the estimator is a complicated function of the k chains.)
In more detail, we do the following. Let A = {a ∈ ℝk : al > 0 for l = 1, …, k, Σ_{l=1}^{k} al = 1} be the k-dimensional simplex. For each a ∈ A, in (4.1) replace ns/n by as and form the corresponding log quasi-likelihood function (see equation (2.7)), call it ℓn(a). We let ζ̂a be the constrained maximizer of ℓn(a), and let d̂a be the corresponding estimate of d. Let Wa be the variance matrix of d̂a given by Part 2 of Theorem 1, and let Ŵa be its estimate. We choose â to minimize trace(Ŵa) (this corresponds to the classical “A-optimal design”). It should be noted that we are able to carry out this optimization scheme precisely because Theorem 1 enables us to estimate Wa.
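A sketch of this minimization is below; trace_W_hat is a hypothetical user-supplied function that carries out the pilot computation of Ŵa for a given a and returns its trace, and the softmax reparametrization is simply one convenient way to search the interior of the simplex.

```python
import numpy as np
from scipy.optimize import minimize

def choose_a(trace_W_hat, k):
    # Minimize a -> trace(W_hat_a) over the simplex via the softmax map
    # a = exp(t) / sum exp(t), which keeps every a_l strictly positive.
    def obj(t):
        a = np.exp(t - t.max())
        return trace_W_hat(a / a.sum())
    res = minimize(obj, np.zeros(k), method="Nelder-Mead")
    a = np.exp(res.x - res.x.max())
    return a / a.sum()
```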
Remarks
- It is natural to ask whether in the Markov chain case our procedure gives rise to an optimal estimate of d, and we now address this question. To keep the discussion as simple as possible, we consider the case k = 2. Let B be the set of all “bridge functions” β : X → ℝ satisfying the conditions that 0 < | ∫ β(x)π1(x)π2(x) μ(dx)| < ∞ and β(x) = 0 when either π1(x) = 0 or π2(x) = 0. It is easy to see that when the two sequences X1(l), …, Xnl(l), l = 1, 2 are each iid, for any β ∈ B, the estimate
d̂2(β) = [n1−1 Σ_{i=1}^{n1} ν2(Xi(1)) β(Xi(1))] / [n2−1 Σ_{i=1}^{n2} ν1(Xi(2)) β(Xi(2))]
is a consistent and asymptotically normal estimate of d2. Meng and Wong (1996) show that within B, the function for which the asymptotic variance is minimized is
βopt,iid(x) = 1 / (s1 ν1(x) + s2 ν2(x)/d2),
where sj = nj/n, j = 1, 2. Because this function involves the unknown d2, Meng and Wong (1996) propose an iterative scheme in which we start with, say, d̂2(0) = 1, and at stage m, we form d̂2(m) = d̂2(β(m)) from
β(m)(x) = 1 / (s1 ν1(x) + s2 ν2(x)/d̂2(m−1)).
They show that lim_{m→∞} d̂2(m) exists and is exactly equal to the estimate considered by Geyer (1994), thereby establishing an equivalence between the iterative bridge estimator and the estimate based on maximization of the log quasi-likelihood function. When the sequences X1(l), …, Xnl(l), l = 1, 2 are Markov chains, the optimal bridge function has the form βopt,mcmc(x) = β∗(x)βopt,iid(x), where the correction factor, β∗(x), is the solution to a complicated Fredholm integral equation (Romero, 2003) and reflects the dependence structure of the two chains. In particular, for the case of Markov chains, the optimal bridge function need not have the form
β(x) = 1 / (t1 ν1(x) + t2 ν2(x))  (4.2)
for any t1, t2. Unfortunately, β∗ is very hard to identify, let alone estimate. To conclude, since our procedure is, effectively, searching within the class (4.2), it will not yield an optimal estimate in general, and instead should be viewed as a method for yielding estimates which are practically useful, even if not optimal.
- A crude way to find â is to calculate trace(Ŵa) as a varies over a grid in A and then find the minimizing a. This is inefficient and unnecessary, as there exist efficient algorithms for minimizing real-valued functions of several variables; see, e.g., Robert and Casella (2004, Chapter 5).
- The vector â can be calculated from a small pilot experiment, after which new chains are run and used to form the log quasi-likelihood function ℓn(â), from which we obtain ζ̂â (and hence d̂â).
- If for each l, X1(l), …, Xnl(l) is an iid sequence, then a regeneration occurs at each step. In this case, there is no need to estimate a, since the optimal value is known to be al = nl/n (Meng and Wong, 1996). The wl’s in (2.6) reduce to 1, and the log quasi-likelihood function (2.7) reduces to exactly the log quasi-likelihood function used by Geyer (1994), so our estimate is exactly the estimate introduced by Vardi (1985), who worked in the iid setting.
5 Illustrations
Here we have two goals. In Section 5.1 we provide a simulation study to show the gains in efficiency that are possible if we use the method for choosing the weight vector a described in Section 4. Our illustration involves toy problems. The purpose of Section 5.2 is to demonstrate the applicability of our methodology, and we return to the second of the three classes of problems we discussed in Section 1, where we have a family of probability densities of the form pθ(x) = gθ(x)/zθ, which are intractable because the normalizing constant zθ cannot be computed in closed form. Our focus here is a bit different, in that we are not interested in estimating the family {zθ, θ ∈ Θ}; rather, we are now interested in estimating a family of expectations of the form Eθ(U(X)), θ ∈ Θ, where U is a function, as well as estimating functions of these expectations. Our illustration is in the context of the Ising model of statistical physics, and we show how to estimate the internal energy and specific heat of the system as a function of temperature.
5.1 Gains in Efficiency When Using the Optimal Weight Vector a
Recall that â is calculated from a small pilot experiment. Let d̂â be the corresponding estimate of d. Also, let d̂n/n denote the estimate of d obtained when we use the conventional choice al = nl/n. In this section we demonstrate through a simulation study that significant gains in efficiency are possible if we use d̂â instead of d̂n/n in situations where the Markov chains mix at different rates. We consider a very simple situation where k = 2, so that d is just d2. We take π1 and π2 to be two t distributions, specifically π1 = t5,1 and π2 = t5,0, where tr,μ denotes the t distribution with r degrees of freedom, centered at μ. The representation πl = νl/ml is taken to be trivial: νl = πl and ml = 1 for l = 1, 2. So d2 = m2/m1 is known to be 1, but we proceed to estimate it as if we didn’t know that fact.
In our simulations, chain 1 is an iid sequence from π1. Chain 2 is an independence Metropolis-Hastings (IMH) chain with proposal density t5,μ. That is, if the current state of the chain is x, a proposal Y ~ t5,μ is generated; the chain moves to Y with acceptance probability min {[t5,0(Y)t5,μ (x)]/[t5,0(x)t5,μ(Y)], 1} and stays at x with the remaining probability. We will let μ range over a fine grid in (−3, 3). Note that when μ = 0, the proposal is always accepted, and the chain is an iid sequence from t5,0, but as μ moves away from 0 in either direction, proposals are less likely to be accepted, and the mixing rate of the chain is slower. It is simple to check that infx (t5,μ(x)/t5,0(x)) > 0, which implies that the IMH algorithm is uniformly ergodic (Mengersen and Tweedie, 1996, Theorem 2.1) and hence geometrically ergodic. Moreover, Mykland et al. (1995, Section 4.1) have shown that for IMH chains there is always a scheme for producing minorization conditions and regeneration sequences, and here we use the scheme they described.
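A minimal sketch of the IMH chain just described; the function name is ours, and the regeneration bookkeeping of the Mykland et al. (1995) scheme is omitted here for brevity.

```python
import numpy as np
from scipy.stats import t as tdist

def imh_chain(mu, n, rng):
    # Independence Metropolis-Hastings chain: target t_{5,0}, proposal t_{5,mu}.
    target, prop = tdist(df=5, loc=0.0), tdist(df=5, loc=mu)
    x = prop.rvs(random_state=rng)
    chain = np.empty(n)
    for i in range(n):
        y = prop.rvs(random_state=rng)
        # log of the acceptance ratio [t_{5,0}(y) t_{5,mu}(x)] / [t_{5,0}(x) t_{5,mu}(y)]
        log_alpha = target.logpdf(y) + prop.logpdf(x) - target.logpdf(x) - prop.logpdf(y)
        if np.log(rng.random()) < log_alpha:
            x = y
        chain[i] = x
    return chain
```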
Our simulation study is carried out as follows. For each value of μ, we conduct a pilot study to calculate â, using the method described in Section 4. The pilot study is based on 1000 iid draws from π1 and a number of regenerations of the IMH Markov chain for π2 that gives a sample of approximately the same size. Then we run the main study, in which we form d̂â (where â is obtained in the pilot study), and also form d̂n/n. The main study is 10 times as large as the pilot study. For each μ, the above is replicated 1000 times, and from these replicates we calculate the average squared distance between the d̂â’s and 1, the average squared distance between the d̂n/n’s and 1, and form the ratio, which we take as a measure of the efficiency of d̂â vs. d̂n/n.
Figure 1 gives a plot of the ratio of these estimated mean squared errors as μ varies over (−3, 3), along with 95% confidence bands, valid pointwise (the bands are constructed via the delta method applied to the function f(o, c) = o/c). From the figure we see that, as expected, the efficiency is about 1 when μ is near 0. But it grows rapidly as μ moves away from 0 in either direction, reaching about 15 when μ is 3 or −3, and it is reasonable to believe that the efficiency is unbounded as μ → ∞ or μ → −∞. Figure 2 provides a graphical description of the explanation. The figure gives a plot of â1, the first component of â, as μ varies over (−3, 3). When μ = 0, the two chains are each iid sequences, and the samples are of approximately equal size, so that â ≈ (.5, .5). But when μ moves away from 0 in either direction, chain 2 mixes more slowly, and â1 increases towards 1, so that in the term (2.2) in our quasi-likelihood function, less weight is given to chain 2, which is why d̂â is more efficient than is d̂n/n.
Of course, because the calculation of â requires a pilot study, the comparison above could be viewed as unfair. However, for d̂â to perform well all that is required, both in theory and in practice, is that â consistently estimate the optimal value of a, and for this to occur all that is required is that the size of the pilot study increase to infinity. That is, the size of the pilot study can increase to infinity arbitrarily slowly when compared to the size of the main study so, asymptotically, the amount of time required to compute d̂â and d̂n/n is the same.
Since ultimately our estimates and the standard error estimates given by Theorem 1 are to be used to produce confidence intervals for d2 (more generally, confidence regions for d), we checked the coverage probability of these intervals. Figure 3 gives a graphical display of the observed coverage rates of the nominal 95% confidence intervals over the 1000 repetitions, as μ ranges from −3 to 3. The figure shows that these hover neatly around .95, with no systematic deviation from .95, and no deviation from .95 beyond what is expected from sampling variability in an experiment involving only 1000 repetitions.
5.2 Estimation of the Internal Energy and Specific Heat as Functions of Temperature in the Ising Model
We consider the Ising model on a c × c square lattice with periodic boundary conditions. That is, we have a graph (V, E) where V denotes the set of c2 vertices of the lattice, and E denotes the set of 2c2 edges that connect nearest neighbors on the lattice. Vertices in the first and last rows are also considered neighbors, as are vertices in the first and last columns, so the graph resides on the torus. For each vertex i ∈ V, we have a random variable Xi taking on the values 1 and −1. The random vector X = {Xi, i ∈ V} gives the state of the system, and the state space S contains 2^{c2} states. For x ∈ S, let H(x) = Σ_{i∼j} I(xi ≠ xj), where the notation i ∼ j signifies that i and j are nearest neighbors. For each θ ∈ Θ := [0, ∞), define a probability distribution pθ on S by
pθ(x) = exp(−θH(x))/zθ, where zθ = Σ_{x∈S} exp(−θH(x)) is the normalizing constant, called the partition function in the physics literature, and θ = 1/(κT), where T is the temperature and κ is the Boltzmann constant. See, e.g., Newman and Barkema (1999, sec. 1.2) for an overview.
Important to physicists are the internal energy of the system, defined by
Iθ = Eθ(H(X)),
and the specific heat, which is the derivative of the internal energy with respect to temperature, or equivalently,
Cθ = κ θ2 [Eθ(H2(X)) − (Eθ(H(X)))2],
and interest is focused on how these quantities vary with θ.
and interest is focused on how these quantities vary with θ. Because the size of the state space increases very rapidly as c increases, except for the case c ≤ 5, the quantities above cannot be evaluated, and MCMC must be used. It is simple to implement a Metropolis-Hastings algorithm that randomly chooses a site, proposes to flip its spin, and accepts this proposal with the Metropolis-Hastings probability; however this algorithm converges very slowly. Swendsen and Wang (1987) proposed a data augmentation algorithm in which bond variables are introduced: if i and j are nearest neighbors and Xi = Xj, then with probability 1 − exp(−θ) an edge is placed between vertices i and j. This partitions the state space into connected components, and entire components are flipped. This algorithm converges far more rapidly than the single-site updating algorithm, and it is the algorithm we use here. Mykland et al. (1995, sec. 5.3) developed a simple minorization condition for the Swendsen-Wang algorithm, and we use it here to produce the regenerative chains that are needed to estimate the families {Iθ, θ ∈ Θ} and {Cθ, θ ∈ Θ} via the methods of this paper.
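Below is a minimal sketch of one Swendsen-Wang update, using the bond probability 1 − exp(−θ) quoted above and a union-find structure to track the connected components; flipping each cluster with a fair coin is one common implementation of the component flips described in the text, and all names are ours.

```python
import numpy as np

def swendsen_wang_step(x, theta, rng):
    # One Swendsen-Wang update on a c x c torus of +/-1 spins: open a bond between
    # equal nearest neighbors with probability 1 - exp(-theta), then flip each
    # resulting cluster independently with probability 1/2.
    c = x.shape[0]
    parent = np.arange(c * c)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path halving
            i = parent[i]
        return i

    p_open = 1.0 - np.exp(-theta)
    for i in range(c):
        for j in range(c):
            for ni, nj in (((i + 1) % c, j), (i, (j + 1) % c)):   # torus neighbors
                if x[i, j] == x[ni, nj] and rng.random() < p_open:
                    ri, rj = find(i * c + j), find(ni * c + nj)
                    if ri != rj:
                        parent[ri] = rj
    coin = rng.random(c * c) < 0.5             # one fair coin per cluster root
    roots = np.array([find(i) for i in range(c * c)])
    return np.where(coin[roots].reshape(c, c), -x, x)
```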
We now consider the problem of estimating the families {Iθ, θ ∈ Θ} and {Cθ, θ ∈ Θ}, and as we will see, the issue of obtaining standard errors for our estimates is quite important. We are in the framework of the second of the three classes of problems mentioned in Section 1, and the two-step procedure given there, described in the present context, is as follows:
Step 1 We choose points θ1, …, θk appropriately spread out in the region of Θ of interest, and for l = 1, …, k, we run a Swendsen-Wang chain with invariant distribution pθl for ρl regenerations. Using these k chains, we form d̂, the estimate of the vector d, where dl = zθl/zθ1, l = 2, …, k.
Step 2 For each l = 1, …, k, we generate a new Swendsen-Wang chain with invariant distribution pθl for Rl regenerations, and we use these new chains, together with the estimate d̂ produced in Step 1, to estimate Iθ and Cθ.
We now describe the details involved in Step 2. Denote the lth sample (in Step 2) by X1(l), …, XNl(l), where Nl is the length of the lth chain after Rl regenerations, and let N = N1 + ⋯ + Nk. For each θ ∈ Θ, define gθ(x) = exp[−θH(x)] for x ∈ S. Let
v̂θ = N−1 Σ_{l=1}^{k} Σ_{i=1}^{Nl} gθ(Xi(l)) / [Σ_{s=1}^{k} (Ns/N) gθs(Xi(l))/d̂s],
and let
ûθ = N−1 Σ_{l=1}^{k} Σ_{i=1}^{Nl} H(Xi(l)) gθ(Xi(l)) / [Σ_{s=1}^{k} (Ns/N) gθs(Xi(l))/d̂s].
(These quantities depend on θ, but this dependence is temporarily suppressed in the notation.) Using El to denote expectation with respect to pθl, we have
v̂θ → Σ_{l=1}^{k} λl El(gθ(X) / Σ_{s=1}^{k} λs gθs(X)/ds) = zθ/zθ1 almost surely, where λl = lim Nl/N,
as ρl → ∞ and Rl → ∞ for l = 1, …, k, where the convergence statement follows from ergodicity of the Swendsen-Wang chains and the fact that d̂ → d almost surely. Similarly, we have
ûθ → Eθ(H(X)) zθ/zθ1 almost surely, so that Îθ := ûθ/v̂θ → Iθ; Ĉθ is formed in the same way, with H2 also entering the numerator averages.
Furthermore, Theorem 2 of Tan, Doss and Hobert (2012) deals precisely with the asymptotic distribution of estimates of the form v̂θ and ûθ/v̂θ, in the framework of regenerative Markov chains. This theorem, which relies on Theorem 1 of the present paper, states that if (i) both Stage 1 and Stage 2 chains satisfy A1–A3 of the present paper, (ii) for l = 1, …, k, Rl/R1 and ρl/ρ1 converge to positive finite constants, and (iii) R1/ρ1 converges to a nonnegative finite constant, then Îθ and Ĉθ have asymptotically normal distributions, and the theorem also provides regeneration-based consistent estimates of the asymptotic variances. These are the estimates we use in this section.
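The sketch below carries out the Step-2 estimation of Eθ(U(X)) by pooling the k chains and reusing d̂, under our reconstruction of the displays above; the names (expectation_estimate, U_vals, and so on) are ours, and this self-normalized form is a sketch, not necessarily the exact estimator analyzed in Tan, Doss and Hobert (2012).

```python
import numpy as np

def expectation_estimate(samples, U_vals, g_funcs, d_hat, g_theta):
    # Self-normalized importance-sampling estimate of E_theta(U(X)) pooling the
    # k Step-2 chains; samples[l] is chain l (from p_{theta_l}), U_vals[l] holds
    # U evaluated on it, g_funcs[s] is g_{theta_s}, g_theta is g at the theta of
    # interest, and d_hat comes from Step 1.
    N = sum(len(x) for x in samples)
    num = den = 0.0
    for l, x in enumerate(samples):
        denom = sum((len(samples[s]) / N) * g_funcs[s](x) / d_hat[s]
                    for s in range(len(g_funcs)))
        u = g_theta(x) / denom                 # importance weights
        num += np.sum(U_vals[l] * u)
        den += np.sum(u)
    return num / den                           # -> E_theta(U(X))
```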
We will apply the approach described above in two situations. The first involves the Ising model on a square lattice small enough so that exact calculations can be done. This enables us to check the performance of our estimators and confidence intervals. The second involves the Ising model on a larger lattice, where calculations can be done only through Monte Carlo methods.
We first consider the Ising model on a 5 × 5 lattice, and we focus on the problem of estimating Cθ, the specific heat. Figure 4 was created using our methods. The left panel gives a plot of Ĉθ, together with 95% confidence bands (valid pointwise), and a plot of the exact values. The right panel gives the standard error estimates for Ĉθ. To create the figure, we used the approach described above, with k = 5 and (θ1, …, θ5) = (.3, .4, .5, .6, .7). For each l = 1, …, 5, regenerative Swendsen-Wang chains of (approximate) length 10,000 were run for θl, based on which d̂ and Ŵ from Theorem 1 were calculated. We then ran independent chains for the same five θ values, for as many iterations, to form the estimates Ĉθ on a fine grid of θ values that range from .2 to 1 in increments of .01. The plot in the right panel was obtained from the formula in Theorem 2 of Tan et al. (2012) and the exact values of Cθ were obtained using closed-form expressions from the physics literature.
We mention that Newman and Barkema (1999, sec. 3.7) also considered the problem of estimating the specific heat for the Ising model on a 5 × 5 lattice. They have a plot very similar to ours, but they produced it by running a separate Swendsen-Wang chain for each θ value on a fine grid, and each chain is used solely for the θ value under which it was generated. In contrast, our method requires only k Swendsen-Wang chains, where k is fairly small, and all chains are used to estimate Cθ for every θ. Here, we have considered a simple instance of the Ising model, the so-called one-parameter case. It is common to also consider the situation where there is an external magnetic field, in which case θ has dimension 2 and the exponent in pθ involves both H(x) and the magnetization Σ_{i∈V} xi. Running a separate Swendsen-Wang chain for each θ in a fine subgrid in dimension 2 becomes extremely time consuming, whereas our approach is still easily workable.
In our second example, we consider the Ising model on a 30 × 30 lattice, for which exact calculations of physical quantities are prohibitively expensive, and our interest is now in estimating the internal energy. The left panel of Figure 5 shows a plot of Îθ vs. θ as θ ranges from .35 to 1.5 in increments of .01. To form the plot we carried out the two-step procedure discussed earlier, with k = 5 and reference points (θ1, …, θ5) = (.65, .75, .85, .95, 1.05), and a sample size of 100,000 for each chain in both steps. The left panel also shows 95% bands, valid pointwise, and the right panel shows the estimated standard errors. From the plot, we can see that the standard errors are much larger when θ < θ1 = .65 than they are when θ ≥ θ1. The importance sampling estimates are not stable when we try to extrapolate below the lowest reference θ value, but we can go well above the highest reference value and still get accurate estimates. It is our ability to estimate SE’s through regeneration that makes it possible for us to determine the range of θ’s for which we have reliable estimates. In fact, this range depends in a complicated way on the reference points and the sample sizes, and even for the relatively simple case where k = 1, the range is not simply an interval centered at θ1.
6 Discussion
The main contributions of this paper are the development of estimators of the vector d which are appropriate for the Markov chain setting, and of consistent standard errors for these estimators. Although we have discussed only estimating variances via regenerative simulation, both batching and spectral methods can also be used (and we believe that a rigorous asymptotic theory can be developed for each of these two methods, although we do not attempt to do so here).
There are two ways to do batching. One is trivial, and is described as follows. Suppose we have some method for estimating d, and let d̂ be the estimate produced by this method when we use the sequences X1(l), …, Xnl(l), l = 1, …, k. We break up each of the k chains into M non-overlapping segments of equal length, and we let d̂(1) be the estimate produced from the first segment of each chain, i.e. d̂(1) is produced from the sequences X1(l), …, X_{nl/M}(l), l = 1, …, k (we ignore the problem that the nj’s may not be divisible by M). Similarly define d̂(2), …, d̂(M). Assuming we have established that √n (d̂ − d) →d N(0, V), we have, for m = 1, …, M, the corresponding result √(n/M) (d̂(m) − d) →d N(0, V). Denoting the average of the d̂(m)’s by d̄, we may estimate the variance matrix V by (n/M) (M − 1)−1 Σ_{m=1}^{M} (d̂(m) − d̄)(d̂(m) − d̄)⊤, or what is slightly better, the same expression with d̄ replaced by d̂. This method requires essentially no programming effort and gives ball-park estimates of the variance matrix.
The crude estimates obtained by the method above are badly outperformed by the regeneration-based estimates developed in this paper (this is illustrated by Figure 6 below, which we will discuss shortly). To obtain better batching-based variance estimates, we must make use of the structure of the problem. We now outline the approach for doing this. Notably, the approach applies equally well to spectral methods. As mentioned earlier, d̂ = g(ζ̂) and d = g(ζ0), where g is given by (3.14). So asymptotic normality of d̂ follows from asymptotic normality of ζ̂ by the delta method. As we will see in the Appendix, the proof of asymptotic normality of ζ̂, or that of d̂, is based on representing each component of this vector as a linear combination of standardized averages of functions of the k Markov chains, plus a vanishingly small term. This will be made clear in Appendix A.2; see in particular (A.16) and (A.9). We express this generically as
(component r) = Σ_{j=1}^{k} crj [nj−1 Σ_{i=1}^{nj} frj(Xi(j))] + (a vanishingly small term), r = 1, …, k,
where fr1, …, frk are real-valued functions for which Eπj(frj(X)) = 0, j = 1, …, k, and cr1, …, crk are constants. For each r and j, the variance of nj−1 Σ_{i=1}^{nj} frj(Xi(j)) can be estimated by batching in the usual way, or via spectral methods, and since the k Markov chains are independent, this leads to an estimate of the variance of component r, r = 1, …, k. The covariances between components r and s, r, s = 1, …, k, are handled in a similar way.
Figure 6 compares the distributions of the estimates of the standard deviation of the statistic d̂2 based on regeneration, structure-based batching, and the crude method of batching, for the toy example of Section 5.1 with μ = 1. The length of the chains is n = 10,000 (approximately) and the number of batches is bn ≈ n1/2, as recommended in Flegal and Jones (2010). Here we do not know the true value of the standard deviation of d̂2, so we estimate this quantity via the sample standard deviation based on 1000 independent replications. The horizontal dotted line is positioned at this empirical estimate. The figure gives a dotplot of the 1000 values for each of the three methods. It shows that structure-based batching outperforms crude batching, and both are outperformed by the regeneration-based method developed in this paper, at least in this particular scenario. When we change the parameter values the ordering is maintained, although the magnitude of the differences varies. (We do not display results for the spectral-based estimate because its performance varies quite a bit with the choice of window and truncation point; however we can say that in almost all cases it is outperformed by the regeneration-based methods.)
Flegal and Jones (2010) provide a thorough analysis and comparison of confidence intervals produced by regeneration, batching, and spectral methods, for the case where the statistic is an average. They studied confidence intervals of the form f̄n ± z σ̂n/√n, and they report that when tuning parameters are chosen in a suitable manner all three methods produce acceptable results. The significant differences in performance we report above are not inconsistent with the results in Flegal and Jones (2010)—all that is needed for confidence intervals to be asymptotically valid is convergence of σ̂n to σ in probability.
We now mention some of the advantages of regenerative simulation for our problem.
- Batching is generally acknowledged to be outperformed in terms of accuracy by spectral methods (Geyer, 1992), but estimation by spectral methods is computationally expensive. Typically, the truncation point Mn is of the order nη for some η ∈ (0, 1), and for each j estimation of γj requires O(n) operations, so the number of operations needed to estimate the variance by spectral methods is O(n1+η). By contrast, the number of operations needed to calculate the variance estimate (3.4) is of the order O(n). Consider now the estimate â constructed in Section 4, whose calculation requires us to repeatedly compute trace(Ŵa) for a in A, in order to obtain â. Once we have constructed the k regeneration sequences {τt(l)}, l = 1, …, k, these same sequences may be used in the computation of Ŵa for all a ∈ A. But the analogous minimization procedure applied to spectral methods would come at a significantly greater computational cost.
- Regeneration does not require us to specify tuning parameters such as the batch size for the case of batching or the truncation point and window for the case of spectral methods.
- By (3.6) we start each chain at a regeneration point. Therefore, the issue of burn-in does not even exist.
We now discuss the general applicability of the three methods. Regeneration has been applied successfully to a number of problems in recent years; see for example Mykland et al. (1995), Sahu and Zhigljavsky (2003), Roy and Hobert (2007), Tan and Hobert (2009), and Flegal, Jones and Neath (2012). We believe that, when feasible, regenerative simulation is the method of choice. Unfortunately, its successful implementation is problem-specific, i.e. it cannot be routinely applied (the general-purpose method developed by Mykland et al. (1995) applies only to independence Metropolis-Hastings chains). To use regenerative simulation, one must come up with a minorization condition which gives rise to regeneration times that are not too long, and there does not exist a generic procedure for doing this. In most of the published examples, the successful minorization condition is obtained only after some trial and error.
Acknowledgments
We are grateful to two referees and an associate editor, whose constructive criticism improved the paper. This work was supported by NSF Grants DMS-08-05860 and DMS-11-06395 and NIH Grant P30 AG028740.
Appendix: Proof of Theorem 1
A.1 Proof of Consistency of d̂
We first work in the ζ domain, and at the very end switch to the d domain. As mentioned earlier, in the standard textbook situation in which we have X1, …, Xn iid from pθ0, where θ0 ∈ Θ, ln(θ) is the log likelihood and Q(θ) = Eθ0(log pθ(X1)), the classical proof of consistency (Wald, 1949) is based on the observation that Q(θ) is maximized at θ = θ0, and that for each fixed θ, n−1 ln(θ) → Q(θ) almost surely. The convergence may be non-uniform, and care needs to be exercised in showing that the maximizer of ln(θ) converges to the maximizer of Q(θ). The present situation is simpler in that the log likelihood and its expected value are twice differentiable and concave, but is more complicated in that we have multiple sequences, they are not iid, and we have a non-identifiability issue, so that maximization is carried out subject to a constraint.
We will write ℓρ instead of ℓn to remind ourselves that the ρl’s are given and the nl’s are determined by these ρl’s. Also, we will write ℓρ(X, ζ) instead of ℓρ(ζ) when we need to note the dependence of ℓρ(ζ) on X, where X denotes the collection of the k chains. We define the (scaled) expected log quasi-likelihood by
λ(ζ) = Σ_{l=1}^{k} αl Eπl(log(pl(X, ζ))).
As ρl → ∞, we have nl → ∞, so wl nl/n = al → αl, and so
n−1 ℓρ(X, ζ) → λ(ζ) almost surely, for each ζ.
The structure of our proof is similar to that of Theorem 1 of Geyer (1994), and the outline of our proof is as follows. First, define S = {ζ : ζ⊤1k = 0}, and recall that ζ̂ is defined to be a maximizer of ℓρ(X, ζ) satisfying ζ̂ ∈ S.
We will show that for every X, ℓρ(X, ζ) is everywhere twice differentiable and concave in ζ.
We will show that λ(ζ) is finite, everywhere twice differentiable, and concave. We further show that its Hessian matrix is semi-negative definite, and that its only null eigenvector is 1k.
We will show that ∇λ(ζ0) = 0.
We will note that the two steps above imply that ζ0 is the unique maximizer of λ subject to the condition ζ0 ∈ S.
We will argue that with probability one, for every ζ, ∇2ℓρ(X, ζ) is semi-negative definite, and 1k is its only null eigenvector. This will show that ζ̂ is the unique maximizer of ℓρ(X, ζ) subject to ζ̂ ∈ S.
We will conclude that the convergence of ℓρ(X, ζ) to λ(ζ) implies convergence of their maximizers that reside in S, that is, ζ̂ → ζ0 almost surely.
We now provide the details.
- The differentiability is immediate from the definition of ℓρ (see (2.7)). To show concavity, it is sufficient to show that for every x, log(pl(x, ζ)) is concave in ζ. Now
∇2 log(pl(x, ζ)) = −(diag(p) − p p⊤),  (A.1)
where p = (p1(x, ζ), …, pk(x, ζ))⊤. The matrix inside the parentheses on the right side of (A.1) is the variance matrix for the multinomial distribution with parameter p, so this matrix is positive semi-definite.
- First, λ(ζ) is finite because λ(ζ) ≤ 0, and
λ(ζ) ≥ −Σ_{l=1}^{k} αl Σ_{s≠l} e^{ζs−ζl} Eπl(νs(X)/νl(X)) = −Σ_{l=1}^{k} αl Σ_{s≠l} e^{ζs−ζl} ms/ml > −∞,
which follows from the inequality log(1 + t) ≤ t. We now obtain the first and second derivatives of λ. By a standard argument involving the dominated convergence theorem, we can interchange the order of differentiation and integration. (If v is the vector of length k with a 1 in the rth position and 0’s everywhere else, then for any x, any ζ, and any l ∈ {1, …, k}, [log(pl(x, ζ + υ/m)) − log(pl(x, ζ))]m = ∂ log(pl(x, ζ∗))/∂ζr, where ζ∗ is between ζ + υ/m and ζ, and this partial derivative is uniformly bounded between −1 and 1.) So for r = 1, …, k, we have
∂λ(ζ)/∂ζr = αr − Σ_{l=1}^{k} αl Eπl(pr(X, ζ)).  (A.2)
Consider the integrand on the right side of (A.2), i.e. pr(X, ζ). Its gradient is given by ∂pr/∂ζr = pr(1 − pr) and ∂pr/∂ζl = −prpl for l ≠ r, and these derivatives are uniformly bounded in absolute value by 1. Hence again by the dominated convergence theorem, we can interchange the order of differentiation and integration, and doing this gives
∇2λ(ζ) = −Σ_{l=1}^{k} αl Eπl(diag(p) − p p⊤).  (A.3)
Define the expectation operator EP(·) = Σ_{l=1}^{k} αl Eπl(·). From (A.3) we have −∇2λ(ζ) = EP(J), where J = diag(p) − pp⊤, and as before p = (p1(X, ζ), …, pk(X, ζ))⊤. As before, J is the covariance of the multinomial, so is positive semi-definite, and therefore so is EP(J). We now determine the null eigenvectors of ∇2λ(ζ) (which is −EP(J)). If ∇2λ(ζ)u = 0, then u⊤[∇2λ(ζ)]u = 0, so EP(u⊤Ju) = 0. Since J is positive semi-definite, it has a square root, J1/2. Hence EP(‖J1/2u‖2) = 0, which implies Ju = 0 [P]-a.e. The condition Ju = 0 [P]-a.e. is expressed as
pr(x, ζ)(ur − Σ_{l=1}^{k} pl(x, ζ) ul) = 0 for r = 1, …, k, [P]-a.e.,  (A.4)
and under our assumption that ν1, …, νk are mutually absolutely continuous, (A.4) implies that ur = Σ_{l=1}^{k} pl(x, ζ) ul for r = 1, …, k. So u1 = ⋯ = uk, i.e. u ∝ 1k.
- To show that ∇λ(ζ0) = 0, we write
Σ_{l=1}^{k} αl Eπl(pr(X, ζ0)) = ∫ [αr πr(x) / Σ_{s=1}^{k} αs πs(x)] Σ_{l=1}^{k} αl πl(x) μ(dx) = αr,
which combined with (A.2) gives ∇λ(ζ0) = 0.
- For any ζ satisfying ζ⊤1k = 0, we may write

λ(ζ) = λ(ζ0) + (ζ − ζ0)⊤∇λ(ζ0) + (1/2)(ζ − ζ0)⊤∇2λ(ζ∗)(ζ − ζ0) = λ(ζ0) + (1/2)(ζ − ζ0)⊤∇2λ(ζ∗)(ζ − ζ0),

where ζ∗ is between ζ and ζ0. If ζ ≠ ζ0, i.e. ζ − ζ0 ≠ 0, then since (ζ − ζ0)⊤1k = 0, ζ − ζ0 cannot be a scalar multiple of 1k. Hence by Step 2, (ζ − ζ0)⊤∇2λ(ζ∗)(ζ − ζ0) < 0, so λ(ζ) < λ(ζ0), and ζ0 is the unique maximizer of λ over S. The proof that (i) ∇2ℓρ(X, ζ) is negative semi-definite, (ii) the only null eigenvector of ∇2ℓρ(X, ζ) is 1k, and (iii) ζ̂ is the unique maximizer of ℓρ(X, ζ) subject to the constraint ζ ∈ S, is essentially identical to the proof of these assertions for λ(ζ).
- Since for each fixed ζ we have n−1ℓρ(X, ζ) → λ(ζ) a.s., almost sure convergence occurs simultaneously for all ζ in a countable dense subset of S. Also, the functions involved are all concave in the entire space of ζ’s, hence are concave in S. Therefore, we have a.s. uniform convergence of n−1ℓρ(X, ζ) to λ(ζ) on compact subsets of S. Under concavity, this is enough to imply convergence of the corresponding maximizers over S, i.e. ζ̂ → ζ0 a.s.
- Finally, to see that d̂ → d a.s., we write d̂ − d = g(ζ̂) − g(ζ0) = ∇g(ζ∗)⊤(ζ̂ − ζ0), where ζ∗ is between ζ̂ and ζ0. The function g actually depends on a(ρ), so ζ∗ depends on ρ, but the gradient ∇g(ζ∗) is bounded for large ρ because ζ̂ → ζ0 a.s. and a(ρ) → α. Therefore d̂ → d a.s.
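As a purely numerical complement to Steps 1 and 2, the following sketch checks the two facts about J = diag(p) − pp⊤ that drive the argument: J is positive semi-definite, and when all entries of p are positive its null space is spanned by 1k. Nothing here is specific to the paper; p is an arbitrary probability vector.

```python
# Check that J = diag(p) - p p^T is positive semi-definite and that
# J 1_k = 0, so that -J is negative semi-definite with null eigenvector 1_k.
import numpy as np

rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(5))     # generic probability vector, all entries > 0
J = np.diag(p) - np.outer(p, p)

print(np.linalg.eigvalsh(J))      # all eigenvalues >= 0; exactly one is 0
print(J @ np.ones(5))             # the zero vector: 1_k spans the null space
```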
A.2 Proof of Regeneration-Based CLT for d̂
We begin by considering ζ̂. As in the classical proof of asymptotic normality of maximum likelihood estimators, we expand ∇ℓρ at ζ̂ around ζ0, and using the appropriate scaling factor, we get

n−1/2[∇ℓρ(ζ̂) − ∇ℓρ(ζ0)] = [n−1∇2ℓρ(ζ∗)]√n(ζ̂ − ζ0),  (A.5)

where ζ∗ is between ζ̂ and ζ0. Consider the left side of (A.5), which is just −n−1/2∇ℓρ(ζ0), since ∇ℓρ(ζ̂) = 0. There are several nontrivial components to the proof, so we first give an outline.
We show that each element of the vector n−1∇ℓρ(ζ0) can be represented as a linear combination of mean 0 averages of functions of the k chains plus a vanishingly small term.
Using Step 1 above, we obtain a regeneration-based CLT for the scaled score vector, via a considerably more involved version of the method we used in Section 3.1: we show that n−1/2∇ℓρ(ζ0) →d N(0, Ω), where Ω is given by (3.12).
We argue that Bρ := −n−1∇2ℓρ(ζ∗) converges to B a.s. and that Bρ† → B† a.s., where B is defined in (3.8), using ideas in Geyer (1994).
We conclude that √n(ζ̂ − ζ0) →d N(0, B†ΩB†).
We note the relationships d = g(ζ0) and d̂ = g(ζ̂), where g was defined by (3.14), and apply the delta method to obtain the desired result.
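Before giving the details, we illustrate the basic regeneration-based variance computation that underlies Steps 1–3, in the style of Mykland, Tierney and Yu (1995) and Hobert et al. (2002). The sketch below uses a toy three-state chain in which every visit to state 0 is a regeneration time; the chain, the function g, and the estimator shown are stand-ins for illustration, not the paper’s estimator (3.13).

```python
# Generic regeneration-based CLT computation on a toy finite-state chain:
# visits to state 0 are regeneration times, tours are the segments between
# them, and the asymptotic variance is estimated from the tour sums Y_t.
import numpy as np

rng = np.random.default_rng(3)
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.5, 0.5]])              # toy transition matrix

n, state = 100_000, 0
path = np.empty(n, dtype=int)
for t in range(n):                           # simulate the chain from state 0
    path[t] = state
    state = rng.choice(3, p=P[state])

g = (path == 2).astype(float)                # estimate pi(2) (true value 0.3)
taus = np.flatnonzero(path == 0)             # regeneration times

Y = np.add.reduceat(g, taus)[:-1]            # tour sums Y_t (complete tours only)
N = np.diff(taus)                            # tour lengths N_t
R = len(N)                                   # number of complete tours

mu_hat = Y.sum() / N.sum()                   # ratio estimator of E_pi[g]
# Regenerative estimate of the variance in sqrt(R)(mu_hat - mu) -> N(0, sigma^2):
sigma2_hat = np.mean((Y - mu_hat * N) ** 2) / np.mean(N) ** 2
print(mu_hat, np.sqrt(sigma2_hat / R))       # estimate and its standard error
```

The same tour-based computation, applied to the score components rather than to a single function g, is what Steps 1–3 make rigorous.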
We now provide the details.
- We start by considering n−1∇ℓρ(ζ0). For r = 1, …, k, we have the decomposition (A.6), in which the remainder term e is given by (A.7). We claim that e = 0. To see this, note that (A.7) leads to the string of equalities in (A.8); the last equality in (A.8) holds by the definition of the weights, and using the fact that wlnlar/al = wrnr, we get e = 0. We summarize: because e = 0, (A.6) can be used to view n−1∂ℓρ(ζ0)/∂ζr as a linear combination of mean 0 averages of functions of the k chains. To express these averages in terms of iid quantities, we first recall the definitions of the regeneration quantities given in (3.10) and (3.11), and multiplying by the scaling factor √n, we rewrite (A.6) as the sum of the terms (A.9a)–(A.9e).
- We now apply a more complex and more rigorous version of the argument we used in Section 3.1. We note the following: (i) the k chains are geometrically ergodic by Assumption A2; (ii) since pr(x, ζ) ∈ (0, 1) for all x and all ζ, the yi’s have a finite moment of order 2 + ε for some ε > 0 (in fact for any ε > 0); and (iii) by (3.10) the mean of the yi’s is 0. The usual CLT for iid sequences does not apply to the sequence Y1(a(ρ)), Y2(a(ρ)), … because a = a(ρ) is allowed to change with ρ, so the distribution of the Yt’s changes with ρ. Since r and l are now fixed and play no important role, while the dependence on a(ρ) now needs to be noted, we write yi(a(ρ)) and Yt(a(ρ)) for the quantities previously carrying the indices r and l. We really have a triangular array of random variables, and we will apply the Lindeberg–Feller version of the CLT.
We first need to show that E([Yt(a(ρ))]2) < ∞. (This condition is nontrivial because Yt(a(ρ)) is the sum of a random number of terms.) Note that since pr(x, ζ) ∈ (0, 1), |yi(a(ρ))| ≤ 1, and therefore

|Yt(a(ρ))| ≤ Nt,  (A.10)

where Nt is the length of the t-th tour. Theorem 2 of Hobert, Jones, Presnell and Rosenthal (2002) states that E(Nt2) < ∞ under geometric ergodicity. So E([Yt(a(ρ))]2) < ∞, and we may form the triangular array whose row consists of the variables U1(a(ρ)), …, Uρl(a(ρ)), where

Ut(a(ρ)) = (Yt(a(ρ)) − E[Yt(a(ρ))]) / (ρl Var[Y1(a(ρ))])1/2.

Clearly, E[Ut(a(ρ))] = 0 and Σt Var[Ut(a(ρ))] = 1.
The Lindeberg condition is that for every η > 0,

Σt E([Ut(a(ρ))]2 I[|Ut(a(ρ))| > η]) → 0 as ρl → ∞,

and this is equivalent to the condition

E([Y1(a(ρ))]2 I[|Y1(a(ρ))| > (ρl Var[Y1(a(ρ))])1/2η]) → 0.  (A.11)

To see (A.11), note that as ρl → ∞, by the assumption that a(ρ) → α, where all the components of α are strictly positive, and dominated convergence, we have Y1(a(ρ)) → Y1(α). By (A.10), we have |Yt(a(ρ))| ≤ Nt, and E(Nt2) < ∞ by Theorem 2 of Hobert et al. (2002). Therefore, E([Yt(a(ρ))]2) → E([Yt(α)]2) by (A.10) and dominated convergence, and we also have E[Yt(a(ρ))] → E[Yt(α)], so that Var[Yt(a(ρ))] → Var[Yt(α)]. Since I[|Y1(a(ρ))| > (ρl Var[Y1(a(ρ))])1/2η] = 0 for large ρ, (A.11) follows by dominated convergence.
The Lindeberg–Feller theorem (together with the convergence Var[Y1(a(ρ))] → Var[Y1(α)] established above) now states that the term in the second set of brackets in (A.9e) has an asymptotic normal distribution, with mean 0 and variance Var[Y1(α)]. The term in the first set of brackets converges a.s. to a finite constant. Since the k chains are independent, we conclude that

n−1/2∂ℓρ(ζ0)/∂ζr →d N(0, Ωrr),

where Ω was defined in (3.12). By the Cramér–Wold theorem, we obtain the more general statement involving the asymptotic distribution of the entire gradient vector. The argument is standard and gives

n−1/2∇ℓρ(ζ0) →d N(0, Ω).
- Now, referring to (A.5), denote the matrix −n−1∇2ℓρ(ζ∗) by Bρ. We have

Bρ√n(ζ̂ − ζ0) = n−1/2∇ℓρ(ζ0),  (A.12)

and for later use also define B̄ρ by

B̄ρ = −n−1∇2ℓρ(ζ0).  (A.13)

Note that

Bρ1k = 0.  (A.14)

Hence, multiplying both sides of (A.12) by Bρ†,

Bρ†Bρ√n(ζ̂ − ζ0) = Bρ†n−1/2∇ℓρ(ζ0).  (A.15)

Now from (A.14) and the spectral decomposition of the symmetric matrix Bρ, we have Bρ†Bρ = Ik − k−11k1k⊤, so, because (ζ̂ − ζ0)⊤1k = 0, (A.15) becomes

√n(ζ̂ − ζ0) = Bρ†n−1/2∇ℓρ(ζ0).  (A.16)

We now study the asymptotic behavior of Bρ. From (A.13), the fact that wl = aln/nl by definition, and ergodicity, we have (B̄ρ)rs → Brs a.s. for all r and s. The first part of Theorem 1 states that ζ̂ → ζ0 a.s. as ρl → ∞, and hence ζ∗ → ζ0 a.s. Now since partial derivatives (with respect to ζ) of terms of the form pr(x, ζ)(1 − pr(x, ζ)) or pr(x, ζ)ps(x, ζ) are uniformly bounded by 1 in absolute value, we see that |(Bρ)rs − (B̄ρ)rs| → 0 a.s. for all r and s, and conclude that ‖Bρ − B̄ρ‖1 → 0 a.s. (Here, ‖υ‖1 denotes the L1 norm of the vector υ formed from the entries of the matrix.) Similarly, |(B̄ρ)rs − Brs| → 0 a.s. for all r and s, so ‖Bρ − B‖1 → 0 a.s. Furthermore, from the spectral representations of Bρ† and B† in terms of the spectral decompositions of Bρ and B, and the fact that Bρ1k = 0 and B1k = 0, we have

Bρ† → B† a.s.  (A.17)

The convergence statement √n(ζ̂ − ζ0) →d N(0, B†ΩB†) now follows immediately.
- Finally, we write √n(d̂ − d) = √n(g(ζ̂) − g(ζ0)) = ∇g(ζ∗)⊤√n(ζ̂ − ζ0), where ζ∗ is between ζ̂ and ζ0. Since ∇g(ζ∗) → ∇g(ζ0) a.s., the desired result (3.16) now follows.
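The role of the generalized inverse in the argument above can be illustrated concretely. The sketch below builds a symmetric positive semi-definite matrix with null vector 1k (a stand-in, not the paper’s (3.8)), forms its generalized inverse by inverting only the nonzero eigenvalues of its spectral decomposition, as in (A.17), and checks the construction against the Moore–Penrose pseudo-inverse.

```python
# Spectral construction of the generalized inverse used in (A.15)-(A.17):
# invert a symmetric PSD matrix B on the complement of its null space.
# The matrix B here is a stand-in with B 1_k = 0 and rank k - 1.
import numpy as np

rng = np.random.default_rng(4)
k = 4
p = rng.dirichlet(np.ones(k))
B = np.diag(p) - np.outer(p, p)               # PSD, B @ 1_k = 0

w, V = np.linalg.eigh(B)                      # B = V diag(w) V^T
w_dagger = np.array([1.0 / x if x > 1e-12 else 0.0 for x in w])
B_dagger = V @ np.diag(w_dagger) @ V.T        # invert only nonzero eigenvalues

print(np.allclose(B_dagger, np.linalg.pinv(B)))  # True: matches Moore-Penrose
print(B_dagger @ np.ones(k))                     # ~0: 1_k stays in the null space
```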
A.3 Proof of Consistency of the Estimate of the Asymptotic Variance Matrix
In the proof of the first part of Theorem 1, we showed that ζ̂ → ζ0 a.s. and that the entries of −n−1∇2ℓρ(X, ζ) have partial derivatives (with respect to ζ) that are uniformly bounded. Hence, B̂ → B a.s. In the proof of the second part of Theorem 1 we showed that this kind of convergence carries over to generalized inverses: using the spectral representation of B̂† and of B† (see (A.17)), we see that B̂ → B entails B̂† → B† a.s.
To complete the proof, we need to show that Ω̂ → Ω a.s. Consider the expressions for Ω and Ω̂ given by (3.12) and (3.13), respectively. Since a → α and nl/n converges for each l, to show that Ω̂ → Ω a.s., we need only show that for each l, r, and s,

ρl−1 Σt Zt(r)(ζ̂)Zt(s)(ζ̂) → E[Z1(r)(ζα)Z1(s)(ζα)]  a.s.,  (A.18)

where, for any fixed ζ, Zt(r)(ζ) = Yt(r)(ζ) − ȳ(r)(ζ)Nt is formed from the quantities defined in (3.10) and (3.11).
Now, the left side of (A.18) is an average of quantities that involve Yt(r)(ζ̂) and ȳ(r)(ζ̂), which themselves are a sum and an average, respectively, of a function that involves the random quantity ζ̂. At the risk of making the notation more cumbersome, we will now write Yt(r)(ζ̂) instead of Yt(r) and ȳ(r)(ζ̂) instead of ȳ(r), as in (A.18). Our plan is to introduce Ψ̃, a version of the left side of (A.18) in which ζ̂ is replaced by the non-random quantity ζα, and to show that (i) Ψ̃ → Ψ a.s. and (ii) Ψ̂ − Ψ̃ → 0 a.s.
Define the k × k matrices Ψ, Ψ̂, and Ψ̃ by

Ψrs = E[Z1(r)(ζα)Z1(s)(ζα)],  Ψ̂rs = ρl−1 Σt Zt(r)(ζ̂)Zt(s)(ζ̂),  and  Ψ̃rs = ρl−1 Σt Zt(r)(ζα)Zt(s)(ζα).

Note that Ψrs is simply the right side of (A.18). Here, Ψrs is the population-level quantity (which we wish to estimate), Ψ̂rs is the empirical estimate of this quantity, and Ψ̃rs is an “intermediate” or bridging quantity, used only in our proof. We will show that (i) Ψ̃ → Ψ a.s. and (ii) Ψ̂ − Ψ̃ → 0 a.s.
To show that Ψ̃ → Ψ a.s., we first express Ψ̃rs as a sum of four averages. That the four averages converge to their respective population counterparts follows from the ergodic theorem, together with the fact that E(N12) < ∞.
To show that Ψ̂ − Ψ̃ → 0 a.s., we express Ψ̂rs − Ψ̃rs as the sum of four differences of averages, and show that each of these converges almost surely to 0. Consider the first difference, which is

ρl−1 Σt [Yt(r)(ζ̂)Yt(s)(ζ̂) − Yt(r)(ζα)Yt(s)(ζα)].  (A.19)
The expression inside the brackets in (A.19) is equal to

Σi Σj [pr(Xl,i, ζ̂)ps(Xl,j, ζ̂) − pr(Xl,i, ζα)ps(Xl,j, ζα)],  (A.20)

where both sums extend over the indices of the observations in the t-th tour, and because all partial derivatives with respect to ζ of functions of the form pr(x, ζ)ps(y, ζ) are uniformly bounded by 1 in absolute value, the expression inside the brackets in (A.20) is bounded by ‖ζ̂ − ζα‖1. Since there are Nt2 summands in the double sum in (A.20), the term in (A.19) is bounded in absolute value by (ρl−1 Σt Nt2)‖ζ̂ − ζα‖1, and from the fact that ρl−1 Σt Nt2 → E(N12) < ∞ a.s. while ζ̂ → ζα a.s., we now see that the first difference converges to 0 a.s.
The second difference is

ρl−1 Σt Nt[ȳ(s)(ζ̂)Yt(r)(ζ̂) − ȳ(s)(ζα)Yt(r)(ζα)].  (A.21)

The expression inside the brackets in (A.21) is bounded in absolute value by 2Nt‖ζ̂ − ζα‖1, and reasoning as we did for the case of the first difference, we have that (A.21) is bounded in absolute value by 2(ρl−1 Σt Nt2)‖ζ̂ − ζα‖1, which implies that the second difference converges to 0 a.s. The third difference is handled in a similar way.
The fourth difference is

ρl−1 Σt Nt2[ȳ(r)(ζ̂)ȳ(s)(ζ̂) − ȳ(r)(ζα)ȳ(s)(ζα)].  (A.22)

The expression inside the brackets in (A.22) is bounded in absolute value by 2‖ζ̂ − ζα‖1, and we have |(A.22)| ≤ 2(ρl−1 Σt Nt2)‖ζ̂ − ζα‖1 → 0 a.s., from which we conclude that the fourth difference converges to 0 a.s. This establishes (ii), and hence (A.18), completing the proof.
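The bridging argument used in this subsection has a simple numerical analogue: when an empirical average is evaluated at an estimated parameter and the integrand has ζ-derivatives uniformly bounded by 1, the effect of the estimation error is of the order of ‖ζ̂ − ζα‖. In the toy sketch below, f is a stand-in for the terms of the form pr(x, ζ)ps(y, ζ); the model and values are hypothetical.

```python
# Toy analogue of the bridging step: the difference between an empirical
# average evaluated at an estimated parameter (Psi-hat) and at its limit
# (Psi-tilde) is controlled by the Lipschitz bound on the integrand.
import numpy as np

rng = np.random.default_rng(5)
zeta_alpha = 0.7                              # the limiting parameter value

def f(x, zeta):
    # sigmoid-type integrand; its derivative in zeta is bounded by 1
    return 1.0 / (1.0 + np.exp(-(x + zeta)))

for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)
    zeta_hat = zeta_alpha + rng.normal() / np.sqrt(n)  # mimics zeta-hat -> zeta_alpha
    psi_hat = np.mean(f(x, zeta_hat))
    psi_tilde = np.mean(f(x, zeta_alpha))
    # |psi_hat - psi_tilde| <= |zeta_hat - zeta_alpha|, which is O(n^{-1/2})
    print(n, abs(psi_hat - psi_tilde), abs(zeta_hat - zeta_alpha))
```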
References
- Buta E, Doss H. Computational approaches for empirical Bayes methods and Bayesian sensitivity analysis. The Annals of Statistics. 2011;39:2658–2685.
- Flegal JM, Haran M, Jones GL. Markov chain Monte Carlo: Can we trust the third significant figure? Statistical Science. 2008;23:250–260.
- Flegal JM, Jones GL. Batch means and spectral variance estimators in Markov chain Monte Carlo. The Annals of Statistics. 2010;38:1034–1070.
- Flegal JM, Jones GL, Neath RC. Markov chain Monte Carlo estimation of quantiles. arXiv preprint arXiv:1207.6432; 2012.
- Geyer CJ. Practical Markov chain Monte Carlo (with discussion). Statistical Science. 1992;7:473–511.
- Geyer CJ. Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Tech. Rep. 568r, Department of Statistics, University of Minnesota; 1994.
- Gill RD, Vardi Y, Wellner JA. Large sample theory of empirical distributions in biased sampling models. The Annals of Statistics. 1988;16:1069–1112.
- Hobert JP, Jones GL, Presnell B, Rosenthal JS. On the applicability of regenerative simulation in Markov chain Monte Carlo. Biometrika. 2002;89:731–743.
- Jones GL, Haran M, Caffo BS, Neath R. Fixed-width output analysis for Markov chain Monte Carlo. Journal of the American Statistical Association. 2006;101:1537–1547.
- Kong A, McCullagh P, Meng XL, Nicolae D, Tan Z. A theory of statistical models for Monte Carlo integration (with discussion). Journal of the Royal Statistical Society, Series B. 2003;65:585–618.
- Meng XL, Wong WH. Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica. 1996;6:831–860.
- Mengersen KL, Tweedie RL. Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics. 1996;24:101–121.
- Meyn SP, Tweedie RL. Markov Chains and Stochastic Stability. Springer-Verlag; New York, London: 1993.
- Mykland P, Tierney L, Yu B. Regeneration in Markov chain samplers. Journal of the American Statistical Association. 1995;90:233–241.
- Newman M, Barkema G. Monte Carlo Methods in Statistical Physics. Oxford University Press; 1999.
- Newton M, Raftery A. Approximate Bayesian inference with the weighted likelihood bootstrap (with discussion). Journal of the Royal Statistical Society, Series B. 1994;56:3–48.
- Nummelin E. General Irreducible Markov Chains and Non-negative Operators. Cambridge University Press; London: 1984.
- Robert CP, Casella G. Monte Carlo Statistical Methods. Second edition. Springer-Verlag; New York: 2004.
- Romero M. On Two Topics with no Bridge: Bridge Sampling with Dependent Draws and Bias of the Multiple Imputation Variance Estimator. Ph.D. thesis, University of Chicago; 2003.
- Roy V, Hobert JP. Convergence rates and asymptotic standard errors for MCMC algorithms for Bayesian probit regression. Journal of the Royal Statistical Society, Series B. 2007;69:607–623.
- Sahu SK, Zhigljavsky AA. Self-regenerative Markov chain Monte Carlo with adaptation. Bernoulli. 2003;9:395–422.
- Swendsen R, Wang J. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters. 1987;58:86–88. doi:10.1103/PhysRevLett.58.86.
- Tan A, Doss H, Hobert JP. Honest importance sampling with multiple Markov chains. Tech. rep., Department of Statistics, University of Florida; 2012.
- Tan A, Hobert JP. Block Gibbs sampling for Bayesian random effects models with improper priors: convergence and regeneration. Journal of Computational and Graphical Statistics. 2009;18:861–878.
- Tan Z. On a likelihood approach for Monte Carlo integration. Journal of the American Statistical Association. 2004;99:1027–1036.
- Vardi Y. Empirical distributions in selection bias models. The Annals of Statistics. 1985;13:178–203.
- Wald A. Note on the consistency of the maximum likelihood estimate. Annals of Mathematical Statistics. 1949;20:595–601.
- Wolpert RL, Schmidler SC. α-stable limit laws for harmonic mean estimators of marginal likelihoods. Statistica Sinica. 2011 (in press); preprint at http://ftp.stat.duke.edu/WorkingPapers/10-19.html.