Abstract
Variational inference is a popular method for estimating model parameters and conditional distributions in hierarchical and mixed models, which arise frequently in many settings in the health, social, and biological sciences. Variational inference in a frequentist context works by approximating intractable conditional distributions with a tractable family and optimizing the resulting lower bound on the log-likelihood. The variational objective function is typically less computationally intensive to optimize than the true likelihood, enabling scientists to fit rich models even with extremely large datasets. Despite widespread use, little is known about the general theoretical properties of estimators arising from variational approximations to the log-likelihood, which hinders their use in inferential statistics. In this paper we connect such estimators to profile M-estimation, which enables us to provide regularity conditions for consistency and asymptotic normality of variational estimators. Our theory also motivates three methodological improvements to variational inference: estimation of the asymptotic model-robust covariance matrix, a one-step correction that improves estimator efficiency, and an empirical assessment of consistency. We evaluate the proposed results using simulation studies and data on marijuana use from the National Longitudinal Study of Youth.
Keywords: generalized linear mixed models, profile M-estimation
1. Introduction
Thanks to rapid improvements in data availability and user-friendly tools for data storage and manipulation, researchers from an ever-broader set of scientific disciplines now routinely analyze extremely complex, high-dimensional data. However, computing parameter estimates using stalwart statistical techniques such as maximum likelihood and Markov chain Monte Carlo can be a challenge in these settings and is often a bottleneck in practice. In these situations, researchers often turn to computationally efficient approximations.
Variational approximation is a method of approximating a likelihood function or posterior distribution that is increasingly popular across a range of scientific fields. In public health, for example, Lee and Wand (2016) used a variational approximation to estimate a model for overall and hospital-specific trends in cesarean section rates. In statistical genetics, Raj et al. (2014) used a variational approximation to a multinomial model of allele frequencies across populations of individuals. O’Connor et al. (2010) used a variational approximation to a model of demographics and lexical choice in geo-tagged Twitter data.
Despite their popularity, variational approximations do not typically come with guarantees about the statistical properties of the resulting estimator. This drawback is particularly problematic when a scientist would like to interpret a parameter estimate, in which case estimator consistency is crucial, or report a confidence interval, in which case good coverage rates rely on the ability to accurately estimate the sampling distribution of the estimator. In this paper, we address the problem of inference using variational approximations. We show that, in a wide range of parametric mixture models, well-established theory from profile M-estimation provides an asymptotic lens through which we may understand the large-sample properties of parameter estimates resulting from variational approximations to the log-likelihood. Using the M-estimation framework, we derive conditions for consistency and asymptotic normality of variational estimators.
The theory we establish for variational estimators motivates us to also propose three methodological improvements to these estimators. First, we provide a consistent estimator of the asymptotic covariance matrix of variational estimators. Second, we introduce a one-step correction to the variational estimator that improves large-sample statistical efficiency. Third, we develop an empirical evaluation of estimator consistency for use when the theoretical calculations are intractable. We demonstrate the importance of these methodological advances with two logistic mixture models of marijuana use by age among participants in the National Longitudinal Survey of Youth (NLSY).
The remainder of the paper is organized as follows. This section formally defines the class of models and variational estimators we study. Section 2 connects variational estimation to profile M-estimation and states our theoretical results. Section 3 illustrates the general theoretical results in a few simple models. Section 4 presents our three methodological contributions. Section 5 evaluates our methods using simulated data and demonstrates an application to the NLSY. Section 6 presents a discussion. Technical conditions and proofs of theorems are provided in the supplementary material. Code to replicate all of the empirical analyses in this paper is available at https://github.com/tedwestling/variational_asymptotics.
1.1. Variational estimators
In this paper, we consider inference for a Euclidean parameter θ in a parametric mixture model pθ(x) = ∫ pθ(x, z) dμ(z), where the marginal likelihood pθ(x) is costly to compute. Parametric mixture models have been used in a variety of scientific contexts. For example, mixed-membership models are a type of mixture model that have been used to model text (Blei et al. 2003), social networks (Airoldi et al. 2008), population genetics (Pritchard et al. 2000), and scientific collaborations (Erosheva et al. 2004).
Mixture models have been used in conjunction with both Bayesian and frequentist inferential frameworks. In a frequentist setting, maximum likelihood (ML) estimation comes with guarantees of asymptotic efficiency and methods of conducting inference for many models. These guarantees provide a degree of assurance for scientists that the point estimates and uncertainty intervals will behave in predictable ways.
ML estimation can, however, be computationally burdensome. When the integral in pθ(x) must be approximated numerically, the cost of this computation increases exponentially with the dimension of the domain of the latent variable since pθ(x , z) needs to be evaluated at sufficiently many points to accurately approximate the integral. This computational burden is a significant barrier for researchers who want to develop tailored mixture models to flexibly represent the dependencies in their data. As a result, a variety of approximate methods have been developed as alternatives to maximum likelihood.
Variational inference is an approximate method based on optimizing a lower bound for the original objective function. This lower bound is designed to eliminate the need for, or at least reduce the dimension of, any numerical integrals, thereby improving computational efficiency. Variational inference can be used in a frequentist context to approximate the log-likelihood or in a Bayesian context to approximate the posterior distribution. In this paper we focus on the former. We will refer to estimators of θ resulting from optimizing a variational approximation to the log-likelihood as variational estimators.
Before providing formal definitions, we distinguish between two key aspects of the variational approximation. First, we can evaluate the properties of the optimizer of the variational lower bound. Second, we could consider the tightness of the variational lower bound to the true objective function. These questions are related. Demonstrating tightness of the lower bound is one way to control the difference between the true and variational optimizers, for example. However, a tight variational lower bound is not a necessary condition for good behavior of the variational estimator, and indeed does not hold in many settings where the variational estimator performs well. In this paper, we address the former of these two components, i.e. the properties of the optimizer of the variational lower bound.
We now move to a formal definition of variational estimation. Blei et al. (2017) presents a thorough introduction to variational inference and many relevant references. Let X1,…, Xn be observed p-variate data generated independently and identically from a distribution P0 on a sample space 𝒳. Let 𝒫 = {Pθ : θ ∈ Θ} be a statistical model, where Θ is an open subset of a Euclidean space ℝᵈ and each Pθ has a density pθ(x) = ∫𝒵 pθ(x, z) dμ(z). Here μ is a dominating measure on the latent space 𝒵. We can conceptualize this data-generating process as first drawing independent latent random variables Z1,…, Zn from the marginal distribution pθ,Z, then drawing each Xi given Zi from the conditional distribution pθ,X∣Z(x ∣ Zi) = pθ(x, Zi)/pθ,Z(Zi). Of these, we only observe X1,…, Xn.
We are most interested in cases where pθ(x) cannot be written in closed form in terms of elementary functions, as in many generalized linear mixed models (McCulloch and Neuhaus 2001) and non-linear hierarchical models (Davidian and Giltinan 1995, Goldstein 2011). In these cases, calculating the log-likelihood of the observed data, ℓn(θ) = Σi log pθ(Xi), and its derivatives with respect to θ requires numerical integration. When the dimension of the latent variable is large, these numerical integrals are computationally expensive.
Variational inference parameter estimates are obtained by maximizing a criterion function motivated as follows. Denote by 𝒟 the set of densities on 𝒵 dominated by μ, and by 𝒮 the set of all conditional densities dominated by μ for all x; that is, all s such that s(·∣ x) ∈ 𝒟 for all x ∈ 𝒳. Suppose that P0 ∈ 𝒫, so that p0 = pθ0 for some θ0 ∈ Θ. Then θ0 and the true conditional distribution of the latent variable πθ0(z ∣ x) := pθ0(x, z)/pθ0(x) can be represented as

$$(\theta_0, \pi_{\theta_0}) = \operatorname*{arg\,max}_{(\theta, s) \in \Theta \times \mathcal{S}} \; E_{P_0}\left[ \int_{\mathcal{Z}} \log\left\{ \frac{p_\theta(X, z)}{s(z \mid X)} \right\} s(z \mid X) \, d\mu(z) \right]. \tag{1}$$

To see this, first define

$$f_0(\theta, s) := -\,E_{P_0}\left[ D_{KL}\big( s(\cdot \mid X) \,\big\|\, \pi_\theta(\cdot \mid X) \big) \right] \qquad \text{and} \qquad g_0(\theta) := E_{P_0}\left[ \log p_\theta(X) \right],$$

where DKL denotes the Kullback-Leibler (KL) divergence. Thus, f0(θ, s) is the negative of the expected KL divergence between s(·∣ X) and πθ(·∣ X). By Gibbs’ inequality, f0(θ, s) ≤ 0 = f0(θ0, πθ0) for all (θ, s) ∈ Θ × 𝒮. Next, note that θ0 maximizes g0. Therefore, (θ0, πθ0) maximizes (θ, s) ↦ f0(θ, s) + g0(θ) over Θ × 𝒮, and after some rearranging, we can see that this is equivalent to the representation in (1).
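For completeness, the rearranging rests on a one-line identity decomposing the log-likelihood into the integral term of (1) plus a KL term; we state it here in the notation defined above:

$$\log p_\theta(x) = \int_{\mathcal{Z}} \log\left\{ \frac{p_\theta(x, z)}{s(z \mid x)} \right\} s(z \mid x) \, d\mu(z) + D_{KL}\big( s(\cdot \mid x) \,\big\|\, \pi_\theta(\cdot \mid x) \big).$$

Taking expectations under P0 shows that the objective in (1) equals f0(θ, s) + g0(θ), and since the KL term is non-negative, the integral term is a lower bound on log pθ(x), namely the evidence lower bound (ELBO) that reappears in (2).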
The expectation-maximization (EM) algorithm can be motivated by (1) by replacing the unknown P0 with the empirical distribution and alternating between optimization over θ and s. Using similar reasoning to that presented above, this amounts to alternating between computing θ(t) = argmaxθ (1/n) Σi ∫𝒵 log pθ(Xi, z) πθ(t−1)(z ∣ Xi) dμ(z), where θ(t−1) is the previous value of θ, and computing πθ(t)(·∣ Xi) for each observed Xi. However, if the marginal likelihood pθ(x) cannot be written in terms of elementary functions, then neither can πθ, and hence the EM algorithm requires numerical integration.
To construct a variational approximation to the log-likelihood, we replace the optimization over 𝒮 in (1) with an optimization over 𝒮𝒬 ⊆ 𝒮, where 𝒬 ⊂ 𝒟 is a smaller variational family of distributions, and, as before, 𝒮𝒬 is the set of conditional distributions over 𝒵 such that s(·∣ x) ∈ 𝒬 for each x ∈ 𝒳. For example, 𝒬 could consist of all independent products over each dimension of z (known as mean-field variational inference), all multivariate Gaussian distributions, or all independent Gaussian distributions. For simplicity, we will assume throughout that 𝒬 is indexed by a finite-dimensional Euclidean parameter ψ ∈ Ψ, so that every s ∈ 𝒮𝒬 can be identified with a density s(·∣ x) = q(·; ψ(x)) for some map ψ : 𝒳 → Ψ. We note that in some cases, even when 𝒬 is a semiparametric family, it can be shown that the optimal q lies in a parametric sub-family with a known form, so that our results can still be applied (see, e.g., Section 5.3 of Wainwright and Jordan 2008). For families where this does not apply, our theory could be extended to incorporate semiparametric 𝒬.
Let Ψⁿ denote the n-fold Cartesian product Ψ × … × Ψ. For ψ = (ψ1,…, ψn) ∈ Ψⁿ and i ∈ {1,…, n}, we denote by ψi ∈ Ψ the ith element of ψ. Given the observed data X1,…, Xn, Ψⁿ then parametrizes the set of variational conditional distributions of (Z1,…, Zn) given (X1,…, Xn), and each ψi parametrizes the variational conditional distribution s(·∣ Xi) = q(·; ψi). Given 𝒬 and Ψ, the variational estimator of θ, which we will denote θ̂n, and the variational conditional estimators ψ̂n = (ψ̂1,…, ψ̂n) are the joint maximizers of the following objective function:
$$(\hat{\theta}_n, \hat{\psi}_n) \in \operatorname*{arg\,max}_{(\theta, \psi) \in \Theta \times \Psi^n} \frac{1}{n} \sum_{i=1}^n \nu(\theta, \psi_i; X_i), \qquad \text{where} \quad \nu(\theta, \psi; x) := \int_{\mathcal{Z}} \log\left\{ \frac{p_\theta(x, z)}{q(z; \psi)} \right\} q(z; \psi) \, d\mu(z). \tag{2}$$
We note that we are implicitly assuming that the full variational distribution over (Z1,…, Zn) factors as ∏i q(zi; ψi). However, since the true conditional distribution of (Z1,…, Zn) given (X1,…, Xn) factors as ∏i πθ(zi ∣ Xi), the optimal variational distribution will always factor as well, so this assumption comes with no loss of generality.
A crucial piece of motivation for our work is that, since the variational objective (2) is typically not proportional to the log-likelihood, it is not clear what the asymptotic properties of the variational estimator are. In many circumstances, the variational estimator is used for prediction. In such cases, scientists can evaluate the quality of the variational approximation using cross-validation or another held-out data technique. If, however, a scientist would like to go beyond prediction and interpret the point estimator (or, critically, its uncertainty) produced by a variational approximation, not knowing the properties of the estimator is a substantial hindrance. In particular, we would like to know whether θ̂n is consistent and, if it is consistent, what the asymptotic distribution of θ̂n is.
Asymptotic properties of variational estimators have been studied in depth for certain specific models, yielding positive results regarding the consistency of variational estimators for Gaussian mixture models (Wang and Titterington 2006), exponential family models with missing values (Wang and Titterington 2004), Poisson mixed models as the cluster size and number of clusters both diverge (Hall, Ormerod and Wand 2011, Hall, Pham, Wand and Wang 2011), Markovian models with missing values (Hall et al. 2002), and stochastic block models for social networks (Bickel et al. 2013). Of particular note are Hall, Ormerod and Wand (2011) and Hall, Pham, Wand and Wang (2011), who derive sharp asymptotics for Poisson regression with random cluster intercepts as both the number of clusters and observations per cluster diverge. Our work is distinct from these results in two ways. First, we provide results at a general level rather than for a specific model. Second, we focus on the asymptotic regime where the number of clusters is diverging, but the number of observations per cluster is stochastically bounded.
More recently, researchers have begun developing general theoretical results for variational estimators. For example, Pati et al. (2018) studied finite-sample risk bounds for mean-field variational Bayes estimators in a very general setting, and applied their results to derive the rate of convergence of variational Bayes estimators in Latent Dirichlet Allocation and Gaussian mixture models. Wang and Blei (to appear) provided sufficient conditions for a Bernstein-von Mises result for the variational Bayes posterior distribution. We note that both of these recent works are distinct from our goals here, which are to study the asymptotic properties of frequentist variational estimators.
2. Variational approximations and M-Estimation
In this section, we demonstrate the connection between M-estimators and variational inference. The key for this connection is using a profile version of the variational objective function. Viewing variational inference in this way unlocks a deep and broad set of theoretical results developed for M-estimators. We make this connection explicit in this section and then, in Section 4, demonstrate how these theoretical results can be used to develop new methods for scientific practice.
2.1. Variational estimation as M-Estimation
We will study the general properties of the variational estimator through the lens of M-estimation. An M-estimator of a parameter θ is the maximizer of a data-dependent objective function of the form Mn(θ) = (1/n) Σi m(θ; Xi). From (2) we can see that

$$\big(\hat{\theta}_n, \hat{\psi}_1, \ldots, \hat{\psi}_n\big) \in \operatorname*{arg\,max}_{(\theta, \psi) \in \Theta \times \Psi^n} \; \frac{1}{n} \sum_{i=1}^n \nu(\theta, \psi_i; X_i).$$
Applying the theory of M-estimation to the vector (θ, ψ1,…, ψn) with m(·) = ν(·) is complicated due to the dependence on the ψi, which are known as incidental parameters specific to each data point. The parameter θ, in contrast, is a structural parameter shared across all data (Lancaster 2000). Hall, Ormerod and Wand (2011) dealt with this problem for Poisson mixed models by assuming the cluster size was growing with the number of observations, so that the incidental parameters effectively became structural. In our more general setting we could analogously assume that each observed data point Xi is composed of replicates Xi1,…, Xim and let m grow with n. However, this would limit the applicability of our results to only cases where clusters are very large. Since, in practice, clusters are often small, we instead apply M-estimation to the profiled variational objective.
In order to use the M-estimation framework for the variational estimator θ̂n, we will express the optimization defined in (2) as a two-stage procedure, where first the objective is optimized with respect to ψ ∈ Ψⁿ for each fixed θ, and then this profiled function is optimized with respect to θ. Furthermore, we note that optimizing with respect to ψ for fixed θ is equivalent to optimizing each summand ν(θ, ψi; Xi) with respect to ψi for fixed θ. Therefore, we will assume that for each θ ∈ Θ and P0-a.e. x, the map ψ ↦ ν(θ, ψ; x) possesses a unique point of maximum in Ψ, which we will denote by ψθ(x). We then define the profiled single-data objective function

$$m(\theta; x) := \nu\big(\theta, \psi_\theta(x); x\big) = \max_{\psi \in \Psi} \nu(\theta, \psi; x).$$
Proposition 1 below asserts that, with this assumption, the variational estimator of the model parameters from equation (2) is equal to a maximizer of the profiled criterion function Mn(θ) := (1/n) Σi m(θ; Xi). This result formally establishes the connection between variational inference and M-estimation that we will use throughout this article.
Proposition 1. Suppose that, for all θ ∈ Θ and P0-a.e. x, ψ ↦ ν(θ, ψ; x) possesses a unique maximizer in Ψ, and that θ ↦ Mn(θ) possesses at least one maximizer in Θ. Then θ̂n ∈ argmaxθ∈Θ Mn(θ).
The proofs of all results are provided in the supplementary material.
The representation of θ̂n provided by Proposition 1 now falls within the M-estimation framework. We can therefore use the existing, well-studied asymptotic theory of M-estimators to better understand the asymptotic properties of variational estimators. In the subsequent sections, we show that, using this representation, the theory for M-estimators yields general results for consistency and asymptotic normality of variational estimators.
2.2. Consistency
We first explore consistency using the M-estimator representation of the variational estimator. An important point that we will return to later is that, depending on the model and approximation, the estimator based on the variational lower bound may not be consistent for the truth. Hence, in what follows, we refer to θ* as the limit in probability of θ̂n, so that θ* = θ0 if and only if θ̂n is consistent.
The population objective function M0(θ) = EP0[m(θ; X)] governs the asymptotic properties of the variational estimator θ̂n. Under regularity conditions, θ̂n converges in probability to the maximizer θ* of M0, so that if M0 is uniquely maximized at θ0 then θ̂n is consistent for θ0, as we state below.
Theorem 1. Suppose the function M0(θ) = EP0[m(θ; X)] attains a finite global maximum at θ* ∈ Θ and conditions (A1)–(A3) hold. Then θ̂n →p θ*.

Regularity conditions (A1)–(A3) justify the application of Theorem 5.14 of van der Vaart (2000) and are provided in the supplementary material. Condition (A1) requires that θ ↦ m(θ; x) be upper semi-continuous for a.e. x. This is implied, for instance, if ν is upper semi-continuous in θ and ψ and ψθ(x) is continuous in θ, for a.e. x. Condition (A2) requires that ν have a measurable and integrable local envelope function. Condition (A3) requires that θ̂n be contained in a compact set with probability tending to one. If the parameter space is not compact, (A3) can often be established via a suitable compactification of the parameter space, as in van der Vaart (2000), Example 5.16.
Here and throughout, we define Dψ and Dθ as the derivative operators with respect to ψ and θ, respectively. If ν and ψθ are sufficiently smooth functions of θ for P0-a.e. x (see the supplementary material for additional details), then the Leibniz integral rule implies that DθM0(θ) = EP0[Dθm(θ; X)], and furthermore, since Dψν(θ, ψ; x)∣ψ=ψθ(x) = 0 by definition of ψθ(x) as a maximizer, Dθm(θ; x) = Dθν(θ, ψ; x)∣ψ=ψθ(x) as well. Therefore, a preliminary step in assessing whether θ̂n is consistent is to determine whether DθM0(θ)∣θ=θ0 = 0. If it does not equal zero, then θ̂n cannot be consistent. If it does equal zero, and in addition M0(θ) is strictly concave and regularity conditions (A1)–(A3) hold, then θ̂n is consistent.
In practice, it is often not possible to derive ψθ, and hence m(θ; x), in closed form, which prevents a theoretical assessment of consistency of the variational estimator. This is the situation, for instance, in many generalized linear mixed models. In Section 4.3, we propose an empirical method of assessing consistency that does not require explicit derivation of m(θ; x).
2.3. Asymptotic normality
If the variational estimator θ̂n is consistent for θ* and additional regularity conditions hold, then √n(θ̂n − θ*) converges in distribution to a mean-zero Gaussian vector with covariance matrix V(θ*), where V(θ) is the sandwich covariance defined in Theorem 2 below. Here and throughout, we denote by D²• the second derivative operator with respect to •.
Theorem 2. Suppose θ̂n →p θ*, a point of maximum of M0(θ) = EP0[m(θ; X)], and conditions (B1)–(B4) hold. Then

$$\sqrt{n}\,\big(\hat{\theta}_n - \theta_*\big) \longrightarrow_d N\big(0, V(\theta_*)\big),$$

where V(θ) = A(θ)⁻¹B(θ)A(θ)⁻¹ for

$$A(\theta) = E_{P_0}\left[ D^2_\theta\, m(\theta; X) \right], \tag{3}$$

$$B(\theta) = E_{P_0}\left[ D_\theta\, m(\theta; X)\, D_\theta\, m(\theta; X)^T \right]. \tag{4}$$
In the next section we provide formulas for estimating the matrices A and B regardless of whether m (θ; X) is known explicitly.
Conditions (B1)–(B4), stated in the supplementary material, guarantee that m(θ; X) satisfies the conditions of van der Vaart (2000), Theorem 5.23. Condition (B1) states that ψθ(x) exists for all θ and a.e. x, and (B2) states that it is twice continuously differentiable in θ in a neighborhood of θ* for a.e. x. If, for θ in a neighborhood of θ* and a.e. x, (i) ν is twice continuously differentiable in ψ, (ii) Dψν(θ, ψθ(x); x) = 0, (iii) D²ψν(θ, ψθ(x); x) is invertible, and (iv) Dψν is twice continuously differentiable in θ, then the implicit function theorem implies (B1) and (B2).
Condition (B3) requires that ν be twice continuously differentiable in θ in neighborhoods of θ* and ψθ*(x) for a.e. x. The differentiability of ν required by this condition and the implicit function theorem from the previous paragraph depend on the smoothness of pθ(x, z) and the variational density q(z; ψ). For instance, by the Leibniz integral rule, if pθ(x, z) is twice continuously differentiable in θ at x and for q(·; ψ)-a.e. z, and its second derivative is dominated by a q(·; ψ)-integrable function, then ν is twice continuously differentiable in θ at ψ and x.
Finally, condition (B4) requires that ν and ψθ be Lipschitz functions in neighborhoods of (θ*, ψθ*(x)) and θ*, respectively, for every x, and that their Lipschitz constants be bounded by a square-integrable function of x. The Lipschitz property of ν and ψθ for fixed x is implied by the differentiability required by (B2) and (B3). Square-integrability of the Lipschitz constant as a function of x is not guaranteed, but is a relatively mild requirement since the neighborhoods around θ* and ψθ*(x) may be arbitrarily small.
3. Illustrations of the general theory
In this section, we illustrate the use of our theoretical results for assessing the consistency and asymptotic efficiency of variational estimators in two mixture models. For each model, we highlight the main features necessary to apply our general results, and leave detailed derivations for the supplementary material.
3.1. Consistent and efficient variational estimation
As our first illustration of our general theoretical results, we demonstrate that a variational estimator is consistent and efficient in an exponential mixture model. Suppose that each data unit i consists of a vector of observations Xi = (Xi1,…, Xip). Conditional on independent latent random variables Z1,…, Zn, each distributed as Exp(β), these observations are generated independently as Xij ∼ Exp(Zi). The parameter is θ = β. This could serve, for instance, as a model of the lifetimes of clusters of memoryless units.
The marginal density of Xi is now $p_\theta(x) = \beta\, p! \,\big(\beta + \sum_{j=1}^p x_j\big)^{-(p+1)}$, and hence the true conditional distribution of Zi given Xi is Gamma(p + 1, β + Σj Xij). Therefore, any variational family of conditional distributions that includes the gamma family as a sub-class will yield a variational estimator that is equal to the MLE. However, for the purpose of demonstrating our theoretical method of assessing consistency, it is illustrative to consider a variational class that does not include the true conditional distribution. We will show that in this example, using the mis-specified variational class of log-normal distributions still yields a consistent, and even efficient, variational estimator.
Suppose the variational class 𝒬 is taken to be all log-normal distributions, parametrized by ψ = (μ, log σ), where μ and σ are the mean and standard deviation of log Z under q. Straightforward computation then gives, up to an additive constant c not depending on (θ, ψ),

$$\nu(\theta, \psi; x) = (p + 1)\mu + \log\sigma - \Big(\beta + \sum_{j=1}^p x_j\Big) e^{\mu + \sigma^2/2} + \log\beta + c.$$
This is a smooth function, and by composition laws for concave functions, we can see that ν(θ, ψ; x) is strictly concave in ψ for each fixed θ and x (Boyd and Vandenberghe 2004). Therefore, the unique zero of the gradient of ν with respect to ψ is the unique ψ maximizing ν for fixed θ and x. This gives σθ(x)² = 1/(p + 1) and μθ(x) = log{(p + 1)/(β + Σj xj)} − 1/{2(p + 1)}. Thus, the profile objective function can be written explicitly up to a constant as m(θ; x) = log β − (p + 1) log(β + Σj xj). Condition (A1) is satisfied because m is smooth in θ.
Condition (A2) is satisfied because, locally in θ, ∣m(θ; x)∣ ≤ c{1 + log(1 + Σj xj)} for some c < ∞, and the expectation of this expression is finite. Condition (A3), which requires tightness of θ̂n, can be established either by restricting the parameter space to a compact set, or by extending the parameter space to [0, ∞] equipped with the metric d(β1, β2) = ∣arctan β1 − arctan β2∣, as in van der Vaart (2000), Example 5.16.
Since conditions (A1)–(A3) hold, Theorem 1 implies that θ̂n →p θ*, the point of maximum of θ ↦ M0(θ) = EP0[m(θ; X)]. In this case, since m(θ; x) is equal up to a constant to the log-likelihood of a single observation, by a standard argument involving Jensen’s inequality, M0(θ) is uniquely maximized at θ0. Therefore, θ̂n is consistent even though the variational class does not include the true conditional distribution.
Conditions (B1)–(B3) are satisfied because both ν and ψθ are smooth in θ, and the second derivative of m is bounded by a constant in a neighborhood of θ0, which is trivially P0-integrable. The Lipschitz condition (B4) is also satisfied because ν and ψθ are differentiable with bounded derivatives in neighborhoods of (θ0, ψθ0(x)) and θ0, respectively. The asymptotic variance of θ̂n, as implied by Theorem 2, is equal to the inverse Fisher information β0²(p + 2)/p.
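As a quick check of this analysis, the consistency of θ̂n can be verified by simulation using the closed-form profiled objective derived above. The following is a minimal R sketch; the sample size, seed, and true value β0 = 2 are arbitrary illustrative choices:

```r
set.seed(1)
n <- 5000; p <- 5; beta0 <- 2
Z <- rexp(n, rate = beta0)                               # latent cluster rates
X <- matrix(rexp(n * p, rate = rep(Z, each = p)),        # X_ij | Z_i ~ Exp(Z_i)
            nrow = n, ncol = p, byrow = TRUE)
S <- rowSums(X)

# Profiled variational objective M_n(beta), using
# m(beta; x) = log(beta) - (p + 1) * log(beta + sum_j x_j) up to a constant
Mn <- function(beta) mean(log(beta) - (p + 1) * log(beta + S))
optimize(Mn, interval = c(1e-6, 100), maximum = TRUE)$maximum  # close to beta0
```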
3.2. Inconsistent variational estimation
We now consider an extension of the previous model in which a variational estimator is inconsistent. We keep an identical setup from the previous model, but now, we model the latent variable as Zi ∼ Gamma(α, β) rather than Exp(β). This is a more flexible model indexed by the parameter θ = (α, β).
The marginal density of Xi is now $p_\theta(x) = \beta^\alpha \Gamma(\alpha + p) \,/\, \big\{\Gamma(\alpha)\, (\beta + \sum_j x_j)^{\alpha + p}\big\}$, and hence the true conditional distribution of Zi given Xi is Gamma(α + p, β + Σj Xij). As before, for illustrative purposes we take the variational class to be all log-normal distributions, parametrized by ψ = (μ, log σ). We now have, up to an additive constant c,

$$\nu(\theta, \psi; x) = (p + \alpha)\mu + \log\sigma - \Big(\beta + \sum_{j=1}^p x_j\Big) e^{\mu + \sigma^2/2} + \alpha\log\beta - \log\Gamma(\alpha) + c.$$
Once again, ν is a smooth function, and is strictly concave in ψ for each fixed θ and x. Setting its derivative with respect to μ and σ to zero and solving gives σθ(x)² = 1/(p + α) and μθ(x) = log{(p + α)/(β + Σj xj)} − 1/{2(p + α)}. Thus,

$$m(\theta; x) = \alpha\log\beta - \log\Gamma(\alpha) - (p + \alpha)\log\Big(\beta + \sum_{j=1}^p x_j\Big) + (p + \alpha)\log(p + \alpha) - (p + \alpha) - \tfrac{1}{2}\log(p + \alpha) + c.$$
Conditions (A1)–(A3) can be checked for this example much as in the previous example. Therefore, Theorem 1 again implies that θ̂n tends in probability to the point of maximum θ* of M0(θ). As before, M0 is not available in closed form in terms of elementary functions. However, we have M0(θ) = EP0[log pθ(X)] + f(α), where f(α) = (p + α)log(p + α) − (p + α) − ½ log(p + α) − log Γ(p + α) up to an additive constant, and f′(α) > 0 for all α > 0 by the standard digamma inequality. Since EP0[log pθ(X)] is smooth and maximized at θ0, this implies that DθM0∣θ=θ0 ≠ 0, so that θ0 cannot be the point of maximum of M0. This shows that the variational estimator using a mis-specified log-normal conditional distribution is inconsistent in this example.
While the limit θ* is not available explicitly, we can approximate it using numerical integration and optimization. Figure 1 shows the limits of the variational estimators as a function of the true parameter value for p = 5. The bias is small when α0 and β0 are small, but increases as α0 and β0 grow.
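Because m(θ; x) is available in closed form in this example, the limit θ* can also be approximated by maximizing a Monte Carlo estimate of M0, as in the following minimal R sketch (the sample size, seed, and true values α0 = β0 = 2 are illustrative assumptions):

```r
set.seed(1)
N <- 1e5; p <- 5; alpha0 <- 2; beta0 <- 2
Z <- rgamma(N, shape = alpha0, rate = beta0)             # latent Gamma variables
S <- rowSums(matrix(rexp(N * p, rate = rep(Z, each = p)), N, p, byrow = TRUE))

# Monte Carlo approximation of M_0(theta) = E[m(theta; X)], with m as derived
# above (additive constants dropped); theta = (alpha, beta)
M0 <- function(th) {
  a <- th[1]; b <- th[2]
  mean(a * log(b) - lgamma(a) - (p + a) * log(b + S)) +
    (p + a) * log(p + a) - (p + a) - 0.5 * log(p + a)
}
optim(c(alpha0, beta0), function(th) -M0(th),
      method = "L-BFGS-B", lower = c(0.01, 0.01))$par    # approximates (alpha*, beta*)
```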
Fig. 1.
Limits of the variational parameter estimates in the Exponential-Gamma mixture model using a mis-specified variational conditional distribution. The left display shows the limit α* as a function of the true α0 used to generate the data. Note that α* does not depend on β0. The right display shows the limit β* as a function of the true β0 for four values of α0. The identity function is shown as a solid black line.
4. Practical tools for inference with variational estimators
We now propose three methodological innovations based on the asymptotic results from Section 2. First, we demonstrate how to leverage asymptotic normality to enhance uncertainty estimators. Second, we show that a one-step correction can be applied to improve the efficiency of the variational estimator. Finally, we address the difficulty of theoretical assessment of consistency mentioned in Section 2, providing a way to test the consistency of a variational estimator when theoretical calculations are intractable.
4.1. Sandwich covariance estimation
We now discuss computation of consistent covariance estimators. Recall that in practice, m(θ; X) is often not available in closed form. Fortunately, the derivatives of m(θ; X) can be expressed in terms of the derivatives of ν(θ, ψ; X), which are always available because ν(θ, ψ; X) is determined by the model and variational family used. Thus, using the chain rule, the asymptotic variance can be estimated whether or not m(θ; X) is available explicitly. We denote by Dθν and Dψν the first partial derivatives of ν, and by D²θν, D²ψν, and Dθψν = (Dψθν)ᵀ the second partial derivatives of ν.
Concerning Dθm(θ; x), which appears in equation (4), since ψθ(x) maximizes ν for fixed θ and x, we have Dθm(θ; x) = Dθν(θ, ψθ(x); x). For D²θm(θ; x) in equation (3),

$$D^2_\theta\, m(\theta; x) = \left[\, D^2_\theta \nu - D_{\theta\psi}\nu\, \big(D^2_\psi \nu\big)^{-1} D_{\psi\theta}\nu \,\right]\big(\theta, \psi_\theta(x); x\big),$$

as we show in the supplementary material, where we abbreviate ν(θ, ψθ(x); x) as ν for presentation. Replacing the appropriate derivatives in the definition of V(θ) with the above expressions and the population expectations with empirical ones gives a way to calculate the asymptotic covariance knowing only ν(θ, ψ; X) and its derivatives (which can be calculated numerically), as opposed to m(θ; x), the computation of which involves optimization.
We now have V̂n(θ) = Ân(θ)⁻¹ B̂n(θ) Ân(θ)⁻¹, where

$$\hat{A}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \left[\, D^2_\theta \nu - D_{\theta\psi}\nu\, \big(D^2_\psi \nu\big)^{-1} D_{\psi\theta}\nu \,\right]\big(\theta, \hat{\psi}_i; X_i\big), \tag{5}$$

$$\hat{B}_n(\theta) = \frac{1}{n} \sum_{i=1}^n D_\theta \nu\big(\theta, \hat{\psi}_i; X_i\big)\, D_\theta \nu\big(\theta, \hat{\psi}_i; X_i\big)^T. \tag{6}$$
Equations (5) and (6) provide a formula for constructing an asymptotic covariance matrix for the variational estimator θ̂n. This covariance can be used to construct asymptotically calibrated Wald intervals, regions, and hypothesis tests about θ0 if θ* = θ0. Furthermore, the sandwich covariance is model-robust in the sense that it is valid even if P0 ∉ 𝒫.
For an MLE under correct model specification, A(θ) = B(θ) and the asymptotic covariance reduces to A(θ)⁻¹, the inverse Fisher information matrix. In this case the sandwich covariance is only needed for model-robust uncertainty estimation. However, when m is not proportional to the log-likelihood, as is often true with variational inference, A and B are not necessarily equal even under correct model specification. Therefore the sandwich covariance is necessary even if P0 ∈ 𝒫.
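To make the computation concrete, the following is a minimal R sketch of a generic sandwich estimator built from equations (5) and (6) using numerical differentiation with the numDeriv package. The function names (`nu`, `sandwich_vcov`) and the list-based data layout are our illustrative choices rather than the authors' software, and the sketch assumes the variational fit (θ̂n and the ψ̂i) has already been computed:

```r
library(numDeriv)

# Sandwich covariance from the ELBO nu(theta, psi, x) and a fitted GVA:
# theta_hat is the variational estimate, psi_hat[[i]] the per-unit variational
# parameters, and data[[i]] the i-th observation.
sandwich_vcov <- function(nu, theta_hat, psi_hat, data) {
  d <- length(theta_hat); n <- length(data)
  A <- matrix(0, d, d); B <- matrix(0, d, d)
  for (i in seq_len(n)) {
    xi <- data[[i]]; psii <- psi_hat[[i]]
    H <- hessian(function(tp) nu(tp[1:d], tp[-(1:d)], xi), c(theta_hat, psii))
    Htt <- H[1:d, 1:d, drop = FALSE]           # D^2_theta nu
    Htp <- H[1:d, -(1:d), drop = FALSE]        # D_{theta psi} nu
    Hpp <- H[-(1:d), -(1:d), drop = FALSE]     # D^2_psi nu
    A <- A + (Htt - Htp %*% solve(Hpp, t(Htp))) / n          # term of (5)
    s <- grad(function(th) nu(th, psii, xi), theta_hat)      # D_theta nu
    B <- B + tcrossprod(s) / n                               # term of (6)
  }
  Ainv <- solve(A)
  (Ainv %*% B %*% Ainv) / n    # estimated covariance matrix of theta_hat
}
```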
4.2. One-step correction
The variational estimator is not guaranteed to be asymptotically efficient since the variational objective function need not be proportional to the log-likelihood. Hence, while Wald-type intervals, regions, and tests using the sandwich estimator proposed in the last section will be asymptotically valid, they may be suboptimal since θ̂n may have larger asymptotic variance than the MLE. In these cases, a one-step correction to the variational estimator yields a more efficient estimator.
The one-step estimator is θ̂nOS := θ̂n + In(θ̂n)⁻¹ Sn(θ̂n), where ℓ(θ; x) = log pθ(x) and

$$S_n(\theta) = \frac{1}{n} \sum_{i=1}^n D_\theta\, \ell(\theta; X_i), \qquad I_n(\theta) = -\frac{1}{n} \sum_{i=1}^n D^2_\theta\, \ell(\theta; X_i)$$

are the score and observed information at θ. Under regularity conditions, √n(θ̂nOS − θ0) converges in distribution to N(0, I(θ0)⁻¹) for I(θ0) the Fisher information matrix, which is the same asymptotic distribution as that of the maximum likelihood estimator or posterior mean.
Computing Sn and In requires numerical integration in the same way that computing the MLE would. Indeed, the one-step correction is a single step of a Newton-Raphson algorithm for finding the MLE starting at θ̂n. However, unlike finding the MLE, this one-step procedure only requires a single calculation of these quantities, so it requires less computation than finding the exact MLE. Nevertheless, in some cases the one-step correction may not be computationally feasible for the same reasons that computing the MLE is not.
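The following is a minimal R sketch of the correction, assuming a user-supplied marginal log-likelihood `loglik(theta, x)` evaluated, for example, by numerical quadrature; the function name and data layout are illustrative assumptions:

```r
library(numDeriv)

# One Newton-Raphson step on the log-likelihood, started at the variational
# estimate theta_hat; data is a list of per-unit observations.
one_step <- function(loglik, theta_hat, data) {
  n <- length(data)
  score <- Reduce(`+`, lapply(data, function(x)
    grad(function(th) loglik(th, x), theta_hat))) / n        # S_n(theta_hat)
  info <- -Reduce(`+`, lapply(data, function(x)
    hessian(function(th) loglik(th, x), theta_hat))) / n     # I_n(theta_hat)
  theta_hat + solve(info, score)                             # one-step estimator
}
```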
4.3. An empirical test of the consistency of variational estimators
In many cases, including generalized linear mixed models, none of ψθ, m(θ; x), or M0(θ) is available analytically. This presents a challenge not present in the classical M-estimation scenario and seriously undermines the goal of theoretically evaluating the consistency of variational estimators. Simulation studies could be used to assess consistency for any particular fixed, known truth, but would be computationally burdensome.
Here, we propose a method for evaluating the consistency of a variational estimator at a single fixed parameter value θ* when m(θ; x) is not available explicitly. Suppose that the data were generated from P0 = Pθ*. Then a crucial condition for consistency of the variational estimator at θ*, as stated in Theorem 1, is that M*(θ) := Eθ*[m(θ; X)] be maximized at θ*. If M* is smooth and θ* is in the interior of the parameter space, then M* being maximized at θ* implies that DθM*(θ*) = 0. Furthermore, as long as ∣Dθm(θ; x)∣ ≤ h(x) for all θ in a neighborhood of θ* and Pθ*-a.e. x for a Pθ*-integrable function h, then by the dominated convergence theorem, DθM*(θ*) = Eθ*[Dθm(θ*; X)]. Recall from Section 2.2 that Dθm(θ; x) = Dθν(θ, ψθ(x); x), so this gradient can be evaluated without knowing m explicitly. Our proposed method for numerically evaluating consistency of the variational estimator under Pθ* is therefore based on numerically assessing whether Eθ*[Dθm(θ*; X)] = 0. Our method unfolds in the following steps.
1. Fix θ* and b very large (for instance 10⁴ or 10⁵).
2. For j = 1, …, b:
 - Simulate Xj ∼ Pθ*.
 - Find ψ*j by numerically optimizing ψ ↦ ν(θ*, ψ; Xj).
 - Evaluate Gj := Dθν(θ, ψ*j; Xj)∣θ=θ*.
3. Test the null hypothesis that E[Gj] = 0, either using independent t-tests on each component or Hotelling’s T² test on the entire vector.
If the test rejects the null hypothesis, then the variational estimator cannot be consistent; if it does not, then one can be arbitrarily certain (with large enough b) that the mean profiled score is zero at θ*. If a weakly significant p-value is found and it is unclear what to conclude, the experiment can be repeated with a larger b. A sketch of this procedure in code is given below.
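A minimal R sketch of the procedure; `simulate_data`, `nu`, and `fit_psi` are model-specific user-supplied functions whose names are our illustrative choices, and Hotelling's T² is computed directly rather than through a package:

```r
library(numDeriv)

consistency_check <- function(theta_star, b, simulate_data, nu, fit_psi) {
  d <- length(theta_star)
  G <- t(vapply(seq_len(b), function(j) {
    x <- simulate_data(theta_star)                     # X_j ~ P_{theta*}
    psi <- fit_psi(theta_star, x)                      # maximize nu over psi
    grad(function(th) nu(th, psi, x), theta_star)      # G_j = D_theta m(theta*; X_j)
  }, numeric(d)))
  # marginal t-tests that each component of E[G_j] is zero
  p_marginal <- apply(G, 2, function(g) t.test(g)$p.value)
  # Hotelling's T2 test that E[G_j] = 0 as a vector
  gbar <- colMeans(G)
  T2 <- b * drop(crossprod(gbar, solve(cov(G), gbar)))
  p_hotelling <- pf((b - d) / (d * (b - 1)) * T2, d, b - d, lower.tail = FALSE)
  list(marginal = p_marginal, hotelling = p_hotelling)
}
```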
This method provides a necessary, but not sufficient, test of consistency. As we explain below, asymptotically we expect our method to have few false negatives (indications that the estimator is inconsistent when it is actually consistent) but possibly some false positives (indications that the estimator is consistent when it is actually inconsistent). The first reason for potential false positives is that Eθ*[Dθm(θ*; X)] = 0 is a necessary but not sufficient condition for consistency: even if the gradient is zero at θ*, the point θ* need not be a global maximum of the objective function. The second reason for potential false positives is that the method can only assess consistency at a single parameter value θ* rather than on the entirety of the parameter space. Typically one will first use the variational algorithm to compute θ̂n, then use this method to assess consistency at θ* = θ̂n. If the estimator is consistent for every θ in a neighborhood of θ0, then for n large enough θ̂n will be in that neighborhood and the method will not indicate inconsistency. On the other hand, if θ̂n is inconsistent, then this method is approximately assessing whether the algorithm is consistent near the limit of θ̂n. If the variational algorithm is consistent at this limit but not at θ0, then the method would indicate that the estimator is consistent when in fact it is not. Despite the possibility of false positives, we do not know of any other practical ways to assess consistency of variational estimators when the limit objective is not available in closed form.
5. Numerical studies
In this section, we empirically evaluate the variational estimator, the sandwich covariance, and the one-step correction in mixed effects logistic regression models. In these models, theoretical assessment of the consistency and efficiency of variational inference is challenging because the profiled criterion function is not available in closed form. Hence, we turn to our empirical assessment of consistency and numerical studies to assess the properties of variational estimators.
We consider mixed effects logistic regression models – first with random intercepts, then with random intercepts, slopes, and quadratic terms – using data on marijuana use in adolescents in the United States from the National Longitudinal Survey of Youth 1997 (Bureau of Labor Statistics, U.S. Department of Labor 2013). The data consist of approximately yearly interviews of n = 8660 youth from 1997 to 2012, with the number of interviews per youth ranging from four to sixteen. For youth i’s jth interview, we consider the binary outcome Yij of whether the youth used marijuana in the thirty days preceding the interview. We focus on understanding the relationship between marijuana use, age, and sex. Since our goal is to understand the properties of variational estimators, we use the data, along with “known” parameter values, to simulate outcomes. This way we can assess the accuracy of parameter estimates and coverage of uncertainty intervals. We also use our methods to conduct an analysis of the real NLSY data.
The results indicate that variational estimators are not always consistent: in the first example the estimator is consistent for some parameters and not for others, and in the second example it is not consistent for any parameters. The first example also demonstrates that even when the variational estimator is consistent, it is not necessarily efficient. When the variational estimator is consistent, the sandwich covariance matrix provides good confidence interval coverage rates and the one-step correction improves efficiency. The empirical evaluation of consistency correctly identifies inconsistency of the parameter vector as a whole, but not always inconsistency of individual parameters.
5.1. Logistic regression with random intercepts
First we consider logistic regression with random intercepts. Let Zi be a random intercept controlling each youth’s overall propensity for marijuana use, SEXi be an indicator that youth i is male, and AGEij be youth i’s age at interview j. Denote pij = P(Yij = 1 ∣ Zi, SEXi, AGEij). Our first model for marijuana usage is then

$$\operatorname{logit}(p_{ij}) = Z_i + \beta_0 + \beta_1\,\mathrm{AGE}_{ij} + \beta_2\,\mathrm{AGE}_{ij}^2 + \mathrm{SEX}_i\big(\beta_3 + \beta_4\,\mathrm{AGE}_{ij} + \beta_5\,\mathrm{AGE}_{ij}^2\big).$$
Each youth’s outcomes Yi1, … ,Yini are assumed conditionally independent given Zi, and we model Zi as IID N (0, σ2). The parameter vector is θ = (β, log (σ2)). The inclusion of the quadratic effect of age is important because we expect that marijuana usage peaks some time in young adulthood and decreases thereafter. This model form is similar to that used in the analysis of age-crime curves (Fabio et al. 2011).
To estimate θ we consider a variational class of conditional distributions over Z1,…, Zn consisting of all independent Gaussian distributions. This is known as a Gaussian variational approximation (GVA). The variational parameters are ψi = (mi, log si), with mi the mean and si the standard deviation of the variational conditional distribution of Zi. The variational objective involves one-dimensional numerical integrals; a sketch of the per-youth objective is given below. To optimize the variational objective function we use a variational EM algorithm implemented in the statistical software R (R Core Team 2018), with the R package fastGHQuad (Blocker 2018) for numerical integration. In this case it is not possible to express the profiled objective function explicitly.
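To illustrate why only one-dimensional integrals arise, the following is a minimal R sketch of the per-youth evidence lower bound for this model, with the expectation of the logistic partition term computed by Gauss-Hermite quadrature via fastGHQuad. The function and argument names are ours, and this sketches the objective rather than reproducing our exact implementation:

```r
library(fastGHQuad)
gh <- gaussHermiteData(20)   # 20-point Gauss-Hermite rule

# ELBO contribution of one youth: y is the binary outcome vector, Xmat the
# fixed-effect design matrix, (m, log_s) the variational mean and log-sd of the
# random intercept, beta the fixed effects, log_sigma2 the log of sigma^2.
elbo_i <- function(beta, log_sigma2, m, log_s, y, Xmat) {
  s2 <- exp(2 * log_s); sigma2 <- exp(log_sigma2)
  eta <- drop(Xmat %*% beta)
  z <- m + sqrt(2 * s2) * gh$x                # quadrature nodes for Z ~ N(m, s2)
  # E_q[log(1 + exp(eta_j + Z))] by one-dimensional Gauss-Hermite quadrature
  Elog1pexp <- vapply(eta, function(e) sum(gh$w * log1p(exp(e + z))) / sqrt(pi),
                      numeric(1))
  sum(y * (eta + m) - Elog1pexp) -
    0.5 * ((s2 + m^2) / sigma2 - 1 - log(s2 / sigma2))   # minus KL{q || N(0, sigma2)}
}
```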
To evaluate our methods, we conducted a simulation study based on the NLSY data. For each of 1000 simulations, we drew a bootstrap sample of youth. Conditional on these youths’ ages and sexes, we simulated Y1, …, Yn from the model, treating the variational estimate for the observed data as the true parameter value θ0. We then estimated the model parameters and asymptotic covariance matrix using maximum likelihood with the R package lme4 (Bates et al. 2015), the Gaussian variational approximation, and the one-step correction to the variational estimator. Finally, we used our proposed method to assess consistency of the variational algorithm at the estimated parameter value.
We first examine the accuracy of point estimates for regression fixed effects. All three estimators concentrate on the true values of the fixed effects β0 through β5 (box plots are provided in the supplementary material). Table 1 shows the variance of the estimators for each of the seven model parameters. The variational estimator has slightly larger variance than the MLE, but the one-step correction nearly matches the variance of the MLE. Thus, as we asserted theoretically, the one-step correction is efficient as long as the variational estimator is consistent, even when the variational estimator is inefficient.
Table 1.
Estimator variances in the logistic regression with random intercepts simulation
| | β0 | β1 | β2 | β3 | β4 | β5 | log(σ²) |
|---|---|---|---|---|---|---|---|
| MLE | 0.16 | 1.68 | 1.05 | 0.12 | 1.25 | 0.77 | 8.6 × 10⁻³ |
| GVA | 0.23 | 1.90 | 1.19 | 0.19 | 1.46 | 0.91 | 0.61 |
| One-step correction to GVA | 0.17 | 1.67 | 1.04 | 0.13 | 1.26 | 0.78 | 0.60 |
Moving now to the random intercept variance, the MLE concentrates on the true variance component, log (σ2), while the variational estimator and one-step correction do not. Our empirical assessment of consistency described in Section 4.3 correctly identifies this inconsistency. The multivariate Hotelling test rejected in every simulation with p < 10−16, correctly indicating that the population mean gradient of the entire parameter vector was significantly different from zero. Additionally, no more than 2.5% of the marginal t-tests rejected at the 0.01 level for each of the fixed effects, in line with their apparent consistency, while every one of the 1000 simulations rejected the marginal t-test for the variance parameter with p < 10−16. These results are better than what is guaranteed theoretically, since the theory does not guarantee that the marginal t-tests will accurately reflect the consistency or inconsistency of individual parameters.
We now move to examining uncertainty intervals. Table 2 shows the estimated coverage of marginal 95% Wald-type confidence intervals of the model parameters for each of the three estimators. The coverage of the confidence intervals of the linear and quadratic age fixed effects using the sandwich covariance for the variational estimator and the inverse Fisher information for the one-step correction is within the Monte Carlo error of the nominal 95%. The variational confidence intervals for the sex-specific intercepts β0 and β3 undercover at 90%, likely because of the underestimation of σ², the variance of the random intercept. The coverage of the confidence intervals of log σ² is close to zero for the variational estimator and one-step correction, which is not surprising given that the estimator of this parameter is inconsistent. The coverage of log σ² is not shown for the MLE because lme4 does not provide an interval for this parameter.
Table 2.
Coverage of 95% confidence intervals in the logistic regression with random intercepts simulation.
| | β0 | β1 | β2 | β3 | β4 | β5 | log(σ²) |
|---|---|---|---|---|---|---|---|
| Maximum likelihood | 0.94 | 0.94 | 0.94 | 0.95 | 0.95 | 0.95 | – |
| GVA + sandwich covariance | 0.90 | 0.94 | 0.94 | 0.90 | 0.94 | 0.94 | 0.09 |
| One-step correction to GVA | 0.93 | 0.94 | 0.94 | 0.91 | 0.95 | 0.95 | 0.02 |
5.2. Logistic regression with random quadratics
We now alter the model presented above to include random slopes and quadratic terms for each youth. The random intercept model may not accurately capture the dependence structure of a single subject’s marijuana use over time, since the random intercepts model implies an exchangeable marginal correlation structure, which is unrealistic given the longitudinal nature of the data. A more realistic model allows random slopes and quadratic terms as well, so that the latent variable Zi = (Zi1, Zi2, Zi3) now has three components. For the conditional probability pij = P(Yij = 1 ∣ Zi, SEXi, AGEij) we now have

$$\operatorname{logit}(p_{ij}) = (\beta_0 + Z_{i1}) + (\beta_1 + Z_{i2})\,\mathrm{AGE}_{ij} + (\beta_2 + Z_{i3})\,\mathrm{AGE}_{ij}^2 + \mathrm{SEX}_i\big(\beta_3 + \beta_4\,\mathrm{AGE}_{ij} + \beta_5\,\mathrm{AGE}_{ij}^2\big).$$
Thus β0, β1, and β2 are the coefficients of the quadratic curve for the average female, and analogously for males. We model the random effects Z1,…, Zn as IID mean-zero multivariate Gaussian with covariance matrix Σ. Once again we use maximum likelihood, a Gaussian variational approximation, and a one-step correction to the Gaussian variational approximation to estimate the fixed effects and the covariance matrix of the random effects.
The marginal likelihood for this model involves an intractable integral over ℝ³. The GVA requires only one-dimensional numerical integrals, and is hence less computationally intensive than ML estimation. To compare the computational burden of these methods, we simulated 100 data sets at each of the sample sizes n = 100, 250, and 500, and obtained estimates of the parameters in the logistic regression with random quadratics model using the lme4 package (which uses a Laplace approximation to the log-likelihood), the GVA, the one-step correction, and ML estimation. We used the implementation of the L-BFGS-B algorithm (Byrd et al. 1995) in the optim function in R (R Core Team 2018) to compute the GVA and MLE.
Figure 3 shows box plots of the computation time in minutes of these four algorithms. The MLE was the most computationally expensive – at sample size n = 500, the average computation time was already 70 minutes. GVA and GVA+OS required an average of 6.3 and 9.7 minutes respectively to compute with n = 500 observations. lme4 was the most computationally efficient, requiring an average of 2.4 minutes.
Fig. 3.
Box plots of the computation time of four estimation methods of the logistic regression with random quadratics model. Three sample sizes are shown. “lme4” refers to the lme4 package, which uses a Laplace approximation to the marginal likelihood, “GVA” stands for Gaussian variational approximation, “GVA+OS” refers to the one-step correction, and “MLE” stands for maximum likelihood estimation, performed using L-BFGS-B optimization.
We conducted a simulation study with the same structure as the study in the last section to compare the point estimates and CI coverage of lme4 (Bates et al. 2015), GVA, and GVA+OS using all 8660 observations. Box plots of the three estimators of the mean random effects are shown in Figure 2. The pattern is very different from the random intercept model. The lme4 estimates are slightly biased (for random effects with dimension larger than one, the lme4 package uses a Laplace approximation to the likelihood). The GVA estimates are even more biased than the lme4 estimates. Despite this, it appears that our proposed one-step corrected fixed effects are roughly centered around the true values. This is surprising since our theory does not guarantee that the one-step correction will be consistent when the variational estimate is not. All three estimators performed quite poorly in terms of estimating the covariance matrix of the random effects.
Fig. 2.
Estimates of mean random effects from the logistic regression with random quadratics simulation study. The dotted line indicates the true parameter value. “lme4” corresponds to estimate from the lme4 package, which uses a Laplace approximation. “GVA” stands for Gaussian variational approximation, and “GVA+OS” refers to the one-step correction.
Table 3 shows the estimated coverage of 95% CIs for the mean random effects for the three estimators. The variational sandwich coverage was close to 0 in every case due to the bias in the parameter estimate seen in Figure 2. The lme4 CIs also do not perform well, with substantially lower than desired coverage. The one-step correction coverage is closest to the desired 95%. These intervals are conservative, containing the true value more than 95% of the time.
Table 3.
Coverage of 95% CIs in the logistic regression with random quadratics simulation.
| | β0 | β1 | β2 | β3 | β4 | β5 |
|---|---|---|---|---|---|---|
| Laplace approximation via lme4 | 0.72 | 0.68 | 0.60 | 0.67 | 0.61 | 0.52 |
| GVA + sandwich | 0.01 | 0.01 | 0.00 | 0.013 | 0.01 | 0.00 |
| One-step correction to GVA | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
The multivariate Hotelling test of consistency soundly rejected in every simulation, correctly indicating that the variational parameter estimator is inconsistent. In practice, therefore, while we would not be able to perform the same empirical evaluation as we have here (since we do not know the true model parameters to compute accuracy and coverage), we would have information that calls into question the viability of the GVA procedure for this model. The marginal t-tests of consistency for β did not reject in the majority of simulations. Hence, while the marginal t-tests were an accurate diagnostic tool in the random intercept setting, they were not in the random quadratic setting.
5.3. Analysis of marijuana use in NLSY97
We used our one-step corrected estimator to assess the likelihood of marijuana usage by age and sex in the NLSY. We performed our analysis in R (R Core Team 2018) using the packages numDeriv (Gilbert and Varadhan 2016), parallel (R Core Team 2018), fastGHQuad (Blocker 2018), ggplot2 (Wickham 2016), reshape (Wickham 2007), plyr (Wickham 2011), MASS (Venables and Ripley 2002), and DescTools (Signorell 2019). Figure 4 shows the estimated mean curves as a function of age for both the random intercepts and random quadratics models and for both females and males. The average male and female from the random quadratics model have slightly faster increases, peak at younger ages, and decrease earlier than the average male and female from the random intercepts model. In both models the average male has higher overall probability and slightly later peak usage: in the random intercepts model, the estimated peak female usage probability occurs at 21.3 years (95% CI: [18.1, 24.5]), and peak male usage at 22.2 years (95% CI: [19.9, 24.6]). In the random quadratics model, the estimated peak female usage probability occurs at 17.9 years (95% CI: [15.9, 20.0]), and peak male usage at 19.1 years (95% CI: [17.3, 20.8]).
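For reference, the reported peak ages are the vertices of the fitted quadratic age curves. Under the fixed-effect parametrization of Section 5.1 these are given by the expressions below; the construction of the accompanying confidence intervals is not restated here, and a delta-method interval for these ratios is one standard choice rather than a statement of the exact procedure:

$$\mathrm{AGE}^{\mathrm{peak}}_{\mathrm{female}} = -\frac{\hat{\beta}_1}{2\hat{\beta}_2}, \qquad \mathrm{AGE}^{\mathrm{peak}}_{\mathrm{male}} = -\frac{\hat{\beta}_1 + \hat{\beta}_4}{2\big(\hat{\beta}_2 + \hat{\beta}_5\big)}.$$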
Fig. 4.
One-step correction point estimates and pointwise 95% confidence intervals of probability of having used marijuana in the past month. Curves on the left are for the average female, right are average male. Both the logistic regression with random intercepts and random quadratics are shown.
6. Discussion
We have presented a general framework for understanding the properties of variational estimation for parametric mixture models. The key insight of our work comes from representing the profiled variational objective function as an M-estimator. Once we make this connection, we can leverage a rich toolkit of asymptotic and methodological results available for this context.
The theory does not guarantee that variational estimators are consistent, and it is often difficult to derive the profile objective function necessary to assess consistency. We proposed an empirical test of consistency based on estimating the gradient of the profile objective at a single parameter value. This proposed method worked well in practice, correctly indicating whether the variational estimator was consistent in two generalized linear mixed models.
We also used the asymptotic theory to propose a sandwich covariance estimator to provide calibrated confidence regions of variational estimators and a one-step correction to the variational estimator. Both of these methods work well when the variational estimator is consistent, and in fact the one-step correction exceeded our expectations by correcting some of the bias in fixed-effect variational parameter estimators in a logistic regression model with random quadratics.
Our theory is limited to models which are IID at some level. While this includes many hierarchical and longitudinal models, it excludes models for fully dependent time series, spatial data, and dyadic data. Extending the theory to cover those cases could be a fruitful next step. Additionally, we made the simplifying assumption that the variational class is of fixed and finite dimension, but our theory could be extended to other variational classes.
Our theory only provides results regarding the asymptotic behavior of the variational estimator θ̂n of the structural parameters θ0. It does not cover the behavior of the variational parameters ψ̂i, which govern the unit-specific variational conditional distributions of the latent variable Zi given the observed data Xi. The behavior of these parameters is important in settings where confidence regions with valid coverage are desired for the Zi. This is the case, for instance, when these Zi correspond to specific fixtures of the real world, such as counties or schools. However, we note that, since the variational family typically does not contain the true conditional distribution, it may be difficult to provide regions with good coverage of Zi using variational inference. By contrast, as we have demonstrated, θ̂n may be consistent even when the variational family does not contain the true conditional. This was one reason we chose to focus on the asymptotic behavior of θ̂n. Nevertheless, we would conjecture that the estimated variational conditional distribution q(·; ψ̂i) of Zi given Xi converges in distribution to q(·; ψθ*(Xi)) for each i, where θ* is the limit in probability of the variational estimator θ̂n. We leave further discussion along these lines to future work.
We developed our theory in the context of a fixed variational family of conditional distributions. A natural question is whether our theory provides insight about which variational families yield consistency. Unfortunately, it appears to be difficult to address this question in a general manner. As a simple example, it would intuitively seem that if a given class of variational distributions yields a consistent estimator, then any enlargement of the class should also yield a consistent estimator. However, it is not clear whether this is true based on our theory. This would be an important topic of future research.
Acknowledgments
The authors thank two referees, an associate editor, and the editor for providing constructive feedback that helped them improve this manuscript. The authors also gratefully acknowledge grant 62389-CS-YIP from the United States Army Research Office, grants SES-1559778 and DMS-1737673 from the National Science Foundation, and grant number K01 HD078452 from the National Institute of Child Health and Human Development (NICHD).
Footnotes
Supplementary material
Supplementary material includes proofs of Theorems 1 and 2, an intuitive explanation of the over-concentration of the variational Bayes posterior distribution, derivations related to the exponential mixture model, and additional simulation results.
Contributor Information
T. Westling, Center for Causal Inference, University of Pennsylvania
T. H. McCormick, Departments of Statistics & Sociology, University of Washington
References
- Airoldi EM, Blei DM, Fienberg SE and Xing EP (2008), ‘Mixed membership stochastic blockmodels’, Journal of Machine Learning Research 9, 1981–2014.
- Bates D, Mächler M, Bolker B and Walker S (2015), ‘Fitting linear mixed-effects models using lme4’, Journal of Statistical Software 67(1), 1–48.
- Bickel P, Choi D, Chang X and Zhang H (2013), ‘Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels’, The Annals of Statistics 41(4), 1922–1943.
- Blei DM, Kucukelbir A and McAuliffe JD (2017), ‘Variational inference: A review for statisticians’, Journal of the American Statistical Association 112(518), 859–877.
- Blei DM, Ng AY and Jordan MI (2003), ‘Latent Dirichlet allocation’, Journal of Machine Learning Research 3, 993–1022.
- Blocker AW (2018), fastGHQuad: Fast ‘Rcpp’ Implementation of Gauss-Hermite Quadrature. R package version 1.0. URL: https://CRAN.R-project.org/package=fastGHQuad
- Boyd S and Vandenberghe L (2004), Convex Optimization, Cambridge University Press.
- Bureau of Labor Statistics, U.S. Department of Labor (2013), ‘National Longitudinal Survey of Youth 1997 cohort, 1997–2011 (rounds 1–15)’, produced by the National Opinion Research Center, the University of Chicago, and distributed by the Center for Human Resource Research, The Ohio State University, Columbus, OH.
- Byrd R, Lu P, Nocedal J and Zhu C (1995), ‘A limited memory algorithm for bound constrained optimization’, SIAM Journal on Scientific Computing 16(5), 1190–1208.
- Davidian M and Giltinan DM (1995), Nonlinear Models for Repeated Measurement Data, Vol. 62, CRC Press.
- Erosheva E, Fienberg S and Lafferty J (2004), ‘Mixed-membership models of scientific publications’, Proceedings of the National Academy of Sciences 101, 5220–5227.
- Fabio A, Tu L-C, Loeber R and Cohen J (2011), ‘Neighborhood socioeconomic disadvantage and the shape of the age-crime curve’, American Journal of Public Health 101, S325–S332.
- Gilbert P and Varadhan R (2016), numDeriv: Accurate Numerical Derivatives. R package version 2016.8-1. URL: https://CRAN.R-project.org/package=numDeriv
- Goldstein H (2011), Multilevel Statistical Models, Vol. 922, John Wiley & Sons.
- Hall P, Humphreys K and Titterington D (2002), ‘On the adequacy of variational lower bound functions for likelihood-based inference in Markovian models with missing values’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(3), 549–564.
- Hall P, Ormerod JT and Wand MP (2011), ‘Theory of Gaussian variational approximation for a Poisson mixed model’, Statistica Sinica 21(1), 369–389.
- Hall P, Pham T, Wand MP and Wang SS (2011), ‘Asymptotic normality and valid inference for Gaussian variational approximation’, The Annals of Statistics 39(5), 2502–2532.
- Lancaster T (2000), ‘The incidental parameter problem since 1948’, Journal of Econometrics 95(2), 391–413.
- Lee CYY and Wand MP (2016), ‘Variational methods for fitting complex Bayesian mixed effects models to health data’, Statistics in Medicine 35, 165–188.
- McCulloch CE and Neuhaus JM (2001), Generalized Linear Mixed Models, Wiley Online Library.
- O’Connor B, Eisenstein J, Xing EP and Smith NA (2010), ‘Discovering demographic language variation’, technical report.
- Pati D, Bhattacharya A and Yang Y (2018), ‘On statistical optimality of variational Bayes’, in Storkey A and Perez-Cruz F, eds, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Vol. 84 of Proceedings of Machine Learning Research, PMLR, Playa Blanca, Lanzarote, Canary Islands, pp. 1579–1588.
- Pritchard JK, Stephens M and Donnelly P (2000), ‘Inference of population structure using multilocus genotype data’, Genetics 155(2), 945–959.
- R Core Team (2018), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.
- Raj A, Stephens M and Pritchard JK (2014), ‘fastSTRUCTURE: Variational inference of population structure in large SNP data sets’, Genetics 197(2), 573–589.
- Signorell A (2019), DescTools: Tools for Descriptive Statistics. R package version 0.99.27. URL: https://cran.r-project.org/package=DescTools
- van der Vaart AW (2000), Asymptotic Statistics, Vol. 3, Cambridge University Press.
- Venables WN and Ripley BD (2002), Modern Applied Statistics with S, fourth edn, Springer, New York. ISBN 0-387-95457-0. URL: http://www.stats.ox.ac.uk/pub/MASS4
- Wainwright MJ and Jordan MI (2008), ‘Graphical models, exponential families, and variational inference’, Foundations and Trends in Machine Learning 1(1–2), 1–305.
- Wang B and Titterington DM (2004), ‘Convergence and asymptotic normality of variational Bayesian approximations for exponential family models with missing values’, in Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI ’04, AUAI Press, Arlington, Virginia, pp. 577–584.
- Wang B and Titterington DM (2006), ‘Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model’, Bayesian Analysis 1(3), 625–650.
- Wang Y and Blei DM (to appear), ‘Frequentist consistency of variational Bayes’, Journal of the American Statistical Association.
- Wickham H (2007), ‘Reshaping data with the reshape package’, Journal of Statistical Software 21(12). URL: http://www.jstatsoft.org/v21/i12/paper
- Wickham H (2011), ‘The split-apply-combine strategy for data analysis’, Journal of Statistical Software 40(1), 1–29. URL: http://www.jstatsoft.org/v40/i01/
- Wickham H (2016), ggplot2: Elegant Graphics for Data Analysis, Springer-Verlag, New York. URL: http://ggplot2.org