Summary
Fully Bayesian inference in the presence of unequal probability sampling requires stronger structural assumptions on the data-generating distribution than frequentist semiparametric methods, but offers the potential for improved small-sample inference and convenient evidence synthesis. We demonstrate that the Bayesian exponentially tilted empirical likelihood can be used to combine the practical benefits of Bayesian inference with the robustness and attractive large-sample properties of frequentist approaches. Estimators defined as the solutions to unbiased estimating equations can be used to define a semiparametric model through the set of corresponding moment constraints. We prove Bernstein–von Mises theorems which show that the posterior constructed from the resulting exponentially tilted empirical likelihood becomes approximately normal, centred at the chosen estimator with matching asymptotic variance; thus, the posterior has properties analogous to those of the estimator, such as double robustness, and the frequentist coverage of any credible set will be approximately equal to its credibility. The proposed method can be used to obtain modified versions of existing estimators with improved properties, such as guarantees that the estimator lies within the parameter space. Unlike existing Bayesian proposals, our method does not prescribe a particular choice of prior or require posterior variance correction, and simulations suggest that it provides superior performance in terms of frequentist criteria.
Keywords: Bayesian method of moments, Bernstein–von Mises theorem, Double robustness, Exponentially tilted empirical likelihood, M-estimation, Selection bias
1. Introduction
Consider the problem of estimating a population parameter vector in the presence of unequal probability sampling. We investigate two settings. The first, which we refer to as the design setting, assumes a selection mechanism determined by the data collector, but only partial design information in the form of sampling probabilities for the selected individuals is provided to the analyst. This situation is frequently encountered when analysing public-use survey datasets (Si et al., 2015; Zanganeh & Little, 2015; Wang et al., 2017). The second scenario is an observational setting where the selection mechanism is unknown, but assumed to be ignorable conditional on a set of fully observed covariates. Certain problems in missing data and causal inference, such as estimating an average causal treatment effect with no unmeasured confounders, can be formulated in this framework (Kang & Schafer, 2007).
In both cases, it is common in practice to use semiparametric estimators that incorporate inverse probability weighting. If the selection probabilities are known, weighting methods are simple to implement and require few modelling assumptions for consistency (Horvitz & Thompson, 1952; Hájek, 1971). In the observational setting, inverse probability weights estimated from a selection model can be combined with an imputation model to produce doubly robust estimators that are consistent as long as one of the models is correctly specified (Robins et al., 1994; Scharfstein et al., 1999; Kang & Schafer, 2007; Rotnitzky & Vansteelandt, 2014). However, while the large-sample properties of these estimators are attractive, their reliability for small datasets is less justified theoretically. In particular, the use of inverse probability weighting can lead to disastrous performance in the presence of model misspecification and a practical violation of positivity (Kang & Schafer, 2007). For small-sample inference, the prior distribution in a Bayesian approach can offer both regularization and a systematic way of incorporating informative knowledge into the analysis, with this influence gradually relaxed as the sample size increases. The Bayesian paradigm also provides a convenient framework for evidence synthesis (Ades & Sutton, 2006); repeatedly integrating over the posterior distributions of parameters allows one to propagate uncertainty across multiple data sources within a single analysis.
A significant drawback of Bayesian approaches is the requirement of stronger structural assumptions on the data distribution and the sampling mechanism. In the design setting, one option is to specify a flexible regression imputation model with the sampling probability included as a covariate to adjust for the selection bias. To obtain estimates for the target population, the sampling probability can be integrated out using a sampling probability model conditional on selection (Si et al., 2015; Zanganeh & Little, 2015). Alternatively, one could adopt a sample likelihood approach (Pfeffermann et al., 2006), which truncates the dataset to just the sampled individuals and requires the specification of a conditional selection model. Both approaches involve directly modelling the dependence structure between the incomplete data and the sampling probabilities, rather than using the unavailable design variables specified by the data collector, and this dependence structure is potentially difficult to specify correctly. In the observational setting, Bayesian estimators will generally fail to be doubly robust; either the selection mechanism is ignored or the model parameters are a priori dependent, so that misspecification of just one model can feed back into the other, precluding consistency (Zigler et al., 2013; Robins et al., 2015). Specializing to mean estimation of binary outcomes, Ray & van der Vaart (2019) presented a thorough treatment of assumptions and priors under which fully Bayesian approaches satisfy Bernstein–von Mises theorems, and proposed novel propensity score-dependent priors that incorporate a preliminary estimate of the propensity score into the prior of the outcome regression model.
There have been a number of Bayes-like proposals that aim to resolve these issues. In a survey inference context, Wang et al. (2017) suggested using an approximate normal likelihood centred at a weighted estimator to match the robustness and simplicity of the frequentist approach. McCandless et al. (2010) and Graham et al. (2016) achieved double robustness by cutting the feedback between the models, but this necessitates a post hoc variance correction. This idea was reviewed by Saarela et al. (2016), who also proposed a method for doubly robust estimation using a Bayesian bootstrap model. Although their method does not require variance correction and matches the performance of the frequentist estimators, it prescribes a specific choice of noninformative prior. For the estimation of average treatment effects, with a particular focus on settings where the dimension of the covariates is high relative to the sample size, Antonelli & Dominici (2019) proposed a partially Bayesian method. The resulting posterior has the double robustness property, but also requires a variance correction using a method such as the nonparametric bootstrap.
In this paper we develop an inferential framework that offers the practical benefits of Bayesian statistics described above, along with the attractive asymptotic guarantees of frequentist semiparametric estimators. Central to our method is a novel application of Bayesian exponentially tilted empirical likelihood (Schennach, 2005), an approach which forms a posterior by combining a prior with a likelihood function defined by moment conditions. We specialize to the domain of M-estimation, since many proposed semiparametric estimators (e.g., Hájek, 1971; Robins et al., 1994; Scharfstein et al., 1999; Cao et al., 2009; Rotnitzky et al., 2012) are M-estimators, and the unbiased estimating equations they solve are used to define a set of corresponding moment constraints. We prove Bernstein–von Mises theorems showing that the resulting Bayesian exponentially tilted empirical likelihood posterior becomes approximately normal, centred at the chosen estimator with matching asymptotic variance; the choice of prior is unrestricted, apart from continuity and nonzero mass in a neighbourhood of the probability limit of the estimator. Thus, the posterior shares the analogous properties of the estimator, such as double robustness and local efficiency, and the frequentist coverage of any credible set will be approximately equal to its credibility. In particular, the latter implication extends the large-sample posterior properties proved by Chib et al. (2018), and provides an interpretation of the credible sets as regularized or shrinkage estimators of confidence sets, filling a conceptual gap otherwise left empty due to the procedure not being fully Bayesian.
Additionally, we prove that a separation condition, similar to the requirement in Theorem 1 of Chib et al. (2018), is implied under standard assumptions for the consistency of M-estimators. This allows the user to avoid a potentially difficult verification. Schennach (2005) provided an interpretation of Bayesian exponentially tilted empirical likelihood which justifies its use as a Bayesian procedure. However, the conditions of this result are not satisfied in our design setting in § 2.3. We give an alternative interpretation, connecting the likelihood function with a proper likelihood arising from an exponential family of maximum-entropy distributions, and suggest that this paves the way for future work.
Our approach offers the ability to obtain modified versions of existing estimators with improved properties, even in the absence of informative priors. For example, certain proposed estimators (e.g., Cao et al., 2009) may have a nonzero probability of lying outside the parameter space, potentially leading to suboptimal finite-sample performance (Rotnitzky et al., 2012). This problem can be rectified by simply restricting the support of the prior, yielding a new estimator which is population bounded in accordance with the variation of its predecessor and has identical asymptotic behaviour. Having a posterior distribution also allows the user to have a choice of estimators, such as the mean, median or maximum a posteriori estimator, depending on the situation or the target loss function.
2. Proposal
2.1. Exponentially tilted empirical likelihood
Suppose that D is a random vector drawn from a distribution P0. The objective is to estimate the parameter θ0 ∈ Θ ⊂ ℝm, which is assumed to satisfy the moment condition EP0{g(D, θ0)} = 0, where g is a function mapping into ℝm. Thus, the dimension of the moment condition is assumed to match the dimension of the parameter. The observed data Di (i = 1, …, n) are independent and identically distributed replicates of D with realized values di. An M-estimator θ̂ solves the estimating equation Σ_{i=1}^n g(di, θ) = 0 for θ ∈ Θ. Many proposed estimators for unequal probability sampling problems take this form, accompanied by a set of regularity assumptions similar to the following.
Assumption 1

(i) The parameter space Θ of θ is compact, and θ0 lies in the interior of Θ and is the unique solution to EP0{g(D, θ)} = 0.

(ii) With probability 1, there is a unique solution θ̂ to Σ_{i=1}^n g(di, θ) = 0 for each n.

(iii) The variance Ω0 = varP0{g(D, θ0)} is nonsingular.

(iv) The expectation EP0{sup_{θ∈Θ} ‖g(D, θ)‖} is finite.

(v) With probability 1, g(D, θ) is continuous at each θ ∈ Θ.

(vi) With probability 1, g(D, θ) is continuously differentiable with respect to θ in a neighbourhood N of θ0, and EP0{sup_{θ∈N} ‖∂θ g(D, θ)‖F} is finite, where ∂θ denotes the partial derivative with respect to θ and ‖·‖F refers to the Frobenius norm.

(vii) The matrix G0 = EP0{∂θ g(D, θ0)} is invertible.
Assumption 1 is sufficient for the M-estimator θ̂ to be consistent and asymptotically normally distributed (van der Vaart, 1998),

n^{1/2}(θ̂ − θ0) → N(0, Σ0)

in the sense of convergence in distribution, with Σ0 = G0^{-1} Ω0 (G0^{-1})^T, where G0 and Ω0 can be consistently estimated by

Ĝ0 = n^{-1} Σ_{i=1}^n ∂θ g(di, θ̂),  Ω̂0 = n^{-1} Σ_{i=1}^n g(di, θ̂) g(di, θ̂)^T,   (1)

respectively.
The moment condition g can also define a semiparametric model by restricting to distributions P that satisfy EP{g(D, θ)} = 0 for θ ∈ Θ. For values of θ such that the origin lies in the convex hull of {g(di, θ) : i = 1, …, n}, the exponentially tilted empirical likelihood (Jing & Wood, 1996; Corcoran, 1998; Lee & Young, 1999; Schennach, 2005) is defined, up to a constant factor, as

Ln(θ) = Π_{i=1}^n pi(θ),

where the probabilities p1(θ), …, pn(θ) solve the optimization problem

maximize − Σ_{i=1}^n pi log pi   (2)

subject to

pi ⩾ 0 (i = 1, …, n),  Σ_{i=1}^n pi = 1,  Σ_{i=1}^n pi g(di, θ) = 0.   (3)
For other values of θ, Ln(θ) is set to 0. The function p(θ) = {p1(θ), …, pn(θ)}^T is well-defined since: (i) for each value of θ the constraint set is compact and the objective function is continuous, so if the constraint set is nonempty then the objective function attains its maximum; and (ii) the objective function is strictly concave, so there is a unique maximizer.
One may interpret this likelihood function as being derived from a θ-parameterized set of multinomial distributions supported on the observed data values. For each value of θ, the solution minimizes the Kullback–Leibler divergence from the empirical distribution subject to the constraint EP {g(D, θ)}=0. More precisely, the Kullback–Leibler divergence is minimized with the empirical distribution as the second argument; the opposite direction corresponds to the empirical likelihood (Owen, 2001). This connection mirrors the relationship between variational Bayesian methods and expectation propagation (Gelman et al., 2013). The exponentially tilted empirical likelihood is connected to M-estimation in the following manner.
Proposition 1
The M-estimator maximizes the exponentially tilted empirical likelihood.
Furthermore, we show that Assumption 1 is sufficient for establishing the following separation property, which says that Ln decays superpolynomially to 0 outside of any ball around θ̂.
Theorem 1
If Assumption 1 is satisfied, then for any δ > 0 there exists an ε > 0 such that

sup_{θ∈Θ : ‖θ−θ̂‖⩾δ} Ln(θ)/Ln(θ̂) ⩽ exp(−nε)

with probability approaching 1.
2.2. Bayesian exponentially tilted empirical likelihood
From a Bayesian perspective, Schennach (2005) proposed that the exponentially tilted empirical likelihood could be combined with a prior p(θ) to form a posterior

p(θ | d1, …, dn) ∝ p(θ) Ln(θ),

and referred to this approach as Bayesian exponentially tilted empirical likelihood. Schennach justified this by proving that if all observed data values are distinct, then Ln(θ) can be represented as a limit of marginal likelihoods, which suggests that it has a proper probabilistic interpretation as a likelihood derived from a semiparametric model after marginalizing an infinite-dimensional nuisance parameter. The prior for the nuisance parameter ξB = (ξB,1, …, ξB,B)^T conditional on θ and a positive real number ε is a distribution on a grid of values such that the induced mixture of uniform densities centred on the components of ξB satisfies the moment restrictions to within a tolerance of ε, favouring mixtures with small support. Conditional on ξB, Di is distributed according to the corresponding mixture of uniform densities. As B → ∞, the spacing of the grid of values tends to zero and the range tends to infinity. Chib et al. (2018) further proved Bernstein–von Mises results, showing that the total variation distance between the posterior distribution of n^{1/2}(θ − θ0) and the normal distribution N(0, Σ0) tends to zero under correctly specified moment constraints.
We specialize to the domain of M-estimation and prove a Bernstein–von Mises theorem with centring point the M-estimator θ̂. This implies not only that the posterior is consistent and asymptotically normal, but also that the frequentist coverage of any credible set will be approximately equal to its credibility, extending the properties implied by the results of Chib et al. (2018). We specify a distinct set of further assumptions.
If Ln(θ) is nonzero, the optimization problem specified by (2) and (3) can be solved (Schennach, 2007) by considering the dual problem

pi(θ) = exp{λ̂(θ)^T g(di, θ)} / Σ_{j=1}^n exp{λ̂(θ)^T g(dj, θ)},   (4)

where λ̂(θ) solves

λ̂(θ) = arg min_{λ∈ℝm} n^{-1} Σ_{i=1}^n exp{λ^T g(di, θ)}.   (5)
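To make the tilting concrete, the following sketch (our own, not from the paper, which works in R) solves the dual problem (5) by Newton's method for a scalar moment function and returns the probabilities (4); the function name `etel_weights` is an assumption of this illustration.

```python
import math

def etel_weights(g_vals, tol=1e-10):
    """Solve the dual problem (5) for a scalar moment function by Newton's
    method and return the tilted probabilities (4), with p_i proportional
    to exp(lambda * g_i). Assumes 0 lies strictly inside (min g, max g)."""
    lam = 0.0
    for _ in range(100):
        w = [math.exp(lam * g) for g in g_vals]
        grad = sum(wi * gi for wi, gi in zip(w, g_vals))   # derivative of the dual objective
        hess = sum(wi * gi * gi for wi, gi in zip(w, g_vals))
        step = grad / hess
        lam -= step
        if abs(step) < tol:
            break
    w = [math.exp(lam * g) for g in g_vals]
    total = sum(w)
    return [wi / total for wi in w]

# weights for the mean example g(d, theta) = d - theta at theta = 0.2
data = [0.05, 0.1, 0.4, 0.8]
p = etel_weights([d - 0.2 for d in data])
```

The returned probabilities sum to one and satisfy the moment constraint Σ_i p_i g(d_i, θ) = 0, i.e., the tilted distribution has mean 0.2 in this example.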
Assumption 2
There exists a neighbourhood B of θ0 on which, with probability approaching 1, the exponentially tilted empirical likelihood is nonzero or, equivalently, there exists a function λ̂ such that, for all θ ∈ B, λ̂(θ) solves (5).
Assumption 3
For almost all values of d, g(d, θ) is twice differentiable with respect to θ in a neighbourhood of θ0, and the second derivative satisfies the Lipschitz condition

‖∂θ² g(d, θ1) − ∂θ² g(d, θ2)‖op ⩽ ψ(d) ‖θ1 − θ2‖

for an integrable function ψ, where ‖·‖op refers to the operator norm.
Assumption 4
For almost all values of d, there exists a neighbourhood of (0, θ0) contained in ℝm × Θ in which the function

(λ, θ) ↦ exp{λ^T g(d, θ)}

and all of its first and second partial derivatives are dominated by an integrable function.
These conditions allow us to establish the following intermediate result.
Proposition 2
If Assumptions 3 and 4 are satisfied, then on a neighbourhood of θ0 there exists a unique function λ0 mapping into ℝm such that

EP0[exp{λ0(θ)^T g(D, θ)} g(D, θ)] = 0,

and λ0 is twice Lipschitz differentiable.
Consequently, one can generate an exponential family {Pθ} from P0 via

dPθ/dP0 = exp{λ0(θ)^T g(D, θ) − κ(θ)}

locally around θ0, where κ(θ) = log EP0[exp{λ0(θ)^T g(D, θ)}] and EPθ{g(D, θ)} = 0. The exponentially tilted distribution Pθ is the I-projection (see Csiszár, 1975) of P0 onto the set {P : EP{g(D, θ)} = 0}, i.e., the closest element to P0 in the set in terms of Kullback–Leibler divergence. In this local region of θ0, the exponentially tilted empirical likelihood is approximately equal to the likelihood generated by this exponential family. This suggests that the exponentially tilted empirical likelihood is a plug-in estimate of a least-favourable family of distributions aimed at reducing the original semiparametric model to a parametric model in a minimally informative way. This provides a general interpretation of the Bayesian exponentially tilted empirical likelihood approach that holds even in certain situations to which the Schennach (2005) interpretation does not apply, such as the design setting in §2.3, where the set of observed data values may not be distinct.
Theorem 2
Suppose that Assumptions 1–4 hold. Assume also that the prior p(θ) admits a continuous density with respect to the Lebesgue measure and is positive at θ0. Then

∫_Θ | p(θ | d1, …, dn) − fn(θ) | dθ → 0

in probability, where fn is the density of N(θ̂, n^{-1}Σ0).
By centring and scaling and using an alternative form of the total variation distance (Tsybakov, 2009), we obtain the equivalent representation

sup_B | pr{n^{1/2}(θ − θ̂) ∈ B | d1, …, dn} − pr{N(0, Σ0) ∈ B} | → 0

in probability, where B ranges over all elements of the Borel sigma-algebra on ℝm. Theorem 2 implies both posterior consistency and asymptotically correct frequentist coverage of credible sets. The following result confirms the first-order equivalence between the posterior mean and θ̂, establishing the validity of the methodology as a shrinkage estimation framework that can yield finite-sample gains while matching the asymptotic performance of the standard estimator.
Theorem 3
Suppose that Assumptions 1–4 hold and that ∫ ‖θ‖² p(θ) dθ < ∞. Let θ̄ = ∫ θ p(θ | d1, …, dn) dθ be the Bayesian exponentially tilted empirical likelihood posterior mean. Then

n^{1/2}(θ̄ − θ̂) → 0,  n^{1/2}(θ̄ − θ0) → N(0, Σ0),

where the convergence is in probability and in distribution, respectively.
2.3. Design setting
We first consider the estimation of a population parameter in a design setting where the selection probabilities are known for the sampled individuals. The data Di = (RiZi, Ri, Riπi) (i = 1, …, n) are independent and identically distributed from P0; here Ri is the selection indicator, which is equal to 1 if Zi is observed and 0 otherwise, πi = pr(Ri = 1 | Wi), and Zi and Ri are conditionally independent given Wi. The variables W1, …, Wn are the design variables chosen by the data collector to assign sampling probabilities to individuals in the target population, but they are not included in the dataset. We make the positivity assumption that there exists a δ > 0 such that πi ⩾ δ with probability 1. The target parameter θ0 is the unique solution to EP0{u(Z, θ)} = 0 for a function u and θ ∈ Θ ⊂ ℝm. The full-data estimating function u is adapted below to the estimating function g for the observed data, allowing us to apply Theorem 2.
Example 1 (Outcome mean). We have Z = Y and u(Z, θ) = Y − θ.

Example 2 (Linear regression). We have Z = (Y, X) and u(Z, θ) = X^T(Y − Xθ).
Consider the estimator θ̂ that solves the estimating equation

Σ_{i=1}^n Ri u(zi, θ)/πi = 0.

To address the technicality that the sampling probabilities are provided as Riπi in the notation rather than just πi, we set Ri/(Riπi) = 0 when Ri = 0, so that Ri/(Riπi) is equivalent to Ri/πi. In the case of estimating the population outcome mean, this estimator specializes to the Hájek estimator (Hájek, 1971). For D = (RY, R, Rπ) ∼ P0 and g(D, θ) = Ru(Z, θ)/π,

EP0{g(D, θ)} = EP0[EP0{R | W} EP0{u(Z, θ) | W}/π] = EP0{u(Z, θ)},

where we have used the conditional independence of R and Z given W and the equality of EP0{R | W} and π. This shows that θ0 is the unique solution to EP0{g(D, θ)} = 0. Let Ln(θ) be the exponentially tilted empirical likelihood function corresponding to the moment conditions EP0{g(D, θ)} = 0 for θ ∈ Θ. The likelihood function is combined with a user-specified prior p(θ) to form a posterior

p(θ | d1, …, dn) ∝ p(θ) Ln(θ).
If Assumptions 1–4 are satisfied and p(θ) is continuous and nonzero around θ0, Theorem 2 implies that

∫_Θ | p(θ | d1, …, dn) − fn(θ) | dθ → 0

in probability, where fn is the density of N(θ̂, n^{-1}Σ0), with θ̂ and Σ0 as described in §2.1 and §2.2. Since θ̂ is a consistent estimator of θ0, the posterior will concentrate around θ0 as n gets large. Furthermore, since Σ0 is equal to the asymptotic variance of θ̂, the frequentist coverage of any credible set will be approximately equal to its credibility.
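As a toy illustration of the design-setting estimating equation (our own sketch in Python for brevity; the paper's computations use R, and the data-generating numbers below are arbitrary), the Hájek estimator can be computed from the observed triples (RiYi, Ri, Riπi) alone:

```python
import random

def hajek(ry, r, rpi):
    """Hajek estimator: solves sum_i R_i (y_i - theta) / pi_i = 0, with
    R_i / (R_i pi_i) taken to be 0 when R_i = 0, as in the text."""
    num = sum(ryi / rpii for ryi, ri, rpii in zip(ry, r, rpi) if ri == 1)
    den = sum(1.0 / rpii for ri, rpii in zip(r, rpi) if ri == 1)
    return num / den

# toy population: selection probability increasing in y, so the
# unweighted complete-case mean is biased upward
random.seed(2)
y = [random.random() for _ in range(2000)]
pi = [0.2 + 0.6 * yi for yi in y]
r = [1 if random.random() < p else 0 for p in pi]
ry = [yi * ri for yi, ri in zip(y, r)]       # R_i Y_i, as given to the analyst
rpi = [pii * ri for pii, ri in zip(pi, r)]   # R_i pi_i
theta_hat = hajek(ry, r, rpi)                # estimates E(Y) = 0.5
naive = sum(ry) / sum(r)                     # biased complete-case mean
```

Because selection favours large outcomes, the unweighted complete-case mean overshoots the target while the inverse-probability-weighted estimator corrects for it.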
2.4. Observational setting
We consider estimation of a population parameter when the selection mechanism is unknown. The observed data Di = (RiZi, Ri, Wi) (i = 1, …, n) are independent and identically distributed from P0; Zi and Ri are as before, and Wi is a vector of covariates observed for each i such that Zi and Ri are conditionally independent given Wi. The target parameter γ0 is the unique solution to EP0{u(Z, γ)} = 0 for a function u and values of γ belonging to a compact real subset Γ. In a missing data context, the conditional independence of Zi and Ri is sometimes referred to as a missing-at-random assumption. This set-up may also be viewed as one arm of a point exposure causal inference problem in the potential outcomes framework, with the conditional independence corresponding to an assumption of no unmeasured confounders.
Let π0(W) = pr(R = 1 | W) be the true propensity score and let φ0(W, γ) = EP0{u(Z, γ) | W}. We make the positivity assumption that there exists a δ > 0 such that π0(W) ⩾ δ with probability 1. Solving

Σ_{i=1}^n ( Ri u(zi, γ)/π̂(wi) − {Ri/π̂(wi) − 1} φ̂(wi, γ) ) = 0,

where π̂ and φ̂ are estimators of π0 and φ0, respectively, leads to a doubly robust estimator of γ0; that is, the estimator is consistent and asymptotically normal as long as at least one of π̂ and φ̂ is consistent. There is a significant body of work regarding choices for π̂ and φ̂, particularly for population outcome mean estimation, which lead to various favourable efficiency properties. See Kang & Schafer (2007) and Rotnitzky & Vansteelandt (2014) for comprehensive reviews.
If and are derived from the solutions to unbiased estimating equations, as is often the case in practice, one can exploit this to formulate a set of nested moment constraints for an exponentially tilted empirical likelihood model. We show in Theorem 4 that the resulting marginal posterior distribution of γ is calibrated asymptotically to the behaviour of the selected estimator.
We restrict our attention to parametric working models π(W; α) and φ(W, γ; β) for real-valued parameters α and β. Suppose that (α̂, β̂, ρ̂) solve the unbiased estimating equation

Σ_{i=1}^n U_{α,β,ρ}(di, α, β, ρ) = 0,
where ρ is a set of additional auxiliary parameters, possibly empty (Rotnitzky & Vansteelandt, 2014). The two parameters (α, β) can be estimated either separately or together. For example, in the case of mean estimation, Robins et al. (1994) estimated α with maximum likelihood for a logistic regression model, and estimated β separately using ordinary least squares. Scharfstein et al. (1999) also used maximum likelihood estimation for α, but included the reciprocal of the propensity score as a covariate in the outcome regression model.
Let γ̂ be the solution to

Σ_{i=1}^n h(di, α̂, β̂, γ) = 0.

Let θ = (α, β, ρ, γ) and define g(D, θ) = {U_{α,β,ρ}(D, α, β, ρ)^T, h(D, α, β, γ)^T}^T, where

h(D, α, β, γ) = R u(Z, γ)/π(W; α) − {R/π(W; α) − 1} φ(W, γ; β).
In accordance with Assumption 1(i), we assume that there exists a value θ0 = (α0, β0, ρ0, γ*) which is the unique solution to EP0{g(D, θ)} = 0. We say that the working model for the propensity score is correctly specified if π0(W) = π(W; α0), and similarly we say that the model for φ is correctly specified if φ0(W, γ) = φ(W, γ; β0). If at least one of the models is correctly specified, then γ* = γ0 and the M-estimator γ̂ consistently estimates the true parameter.
Let Ln(θ) be the exponentially tilted empirical likelihood function corresponding to the moment conditions EP{g(D, θ)} = 0. The likelihood function is combined with a user-specified prior p(θ) to form a posterior

p(θ | d1, …, dn) ∝ p(θ) Ln(θ).
Let p(γ | d1, …, dn) be the marginal posterior for γ.
Theorem 4
Suppose Assumptions 1–4 hold and that the prior p(θ) admits a continuous density with respect to Lebesgue measure and is positive at θ0. Then, as n → ∞,

∫ | p(γ | d1, …, dn) − fn(γ) | dγ → 0

in P0-probability, where fn is the density of N(γ̂, n^{-1}Σγ) and Σγ is the asymptotic variance of n^{1/2}(γ̂ − γ*).
As stated earlier, γ̂ is, by construction, consistent for estimating γ0 provided either π0(W) = π(W; α0) or φ0(W, γ) = φ(W, γ; β0) for all γ, or both. Therefore, Theorem 4 implies that the exponentially tilted empirical likelihood posterior shares this double robustness property; the posterior will concentrate around the true value as long as one of the working models is correctly specified. Furthermore, credible sets for γ will asymptotically have nominal frequentist coverage if consistency holds, even if one of the working models is misspecified. If both models are misspecified, the credible sets will have approximately nominal coverage for the probability limit of γ̂, which is possibly different from γ0.
2.5. Implementation
In this subsection we describe how one can compute Ln(θ) for a fixed value of θ. To simplify the notation, write gi = g(Di, θ) for each i = 1, …, n, suppressing the dependence on θ. To check whether the feasible set of the optimization problem specified by (2) and (3) is nonempty, it is sufficient and computationally convenient, for example by using an R (R Development Core Team, 2020) package such as lpSolve (Berkelaar, 2015), to check whether there exists a feasible solution to the linear programming problem

minimize 0 subject to g x = 0, c^T x = 1, x ⩾ 0,   (6)

where g = (g1, …, gn) and c = (1, …, 1)^T. The objective 0 is suggested here for computational simplicity, but can be replaced by b^T x for arbitrary b ∈ ℝn, as we are concerned only with the feasible set. If the feasible set is empty, Ln(θ) is set to zero. Otherwise, assuming that the solution to (2) and (3) lies in the interior of the simplex, i.e., that all the values of pi are nonzero, the optimization problem can be solved by considering the dual problem described by (4) and (5).
Assuming that the Hessian of the objective in (5) is strictly positive definite, a unique solution to (5) exists and can be found via the Newton–Raphson method. This requires specifying a small convergence tolerance value with respect to a norm of choice. Pseudo-code for evaluating Ln(θ) is provided in the Supplementary Material. Once we are able to evaluate Ln pointwise, we can perform posterior inference using standard Bayesian computational machinery such as Markov chain Monte Carlo or importance sampling.
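The whole pipeline of this subsection can be sketched for the scalar mean example g(d, θ) = d − θ, where the linear programming check (6) reduces to verifying that 0 lies strictly between the smallest and largest g value. The sketch below is our own minimal Python version (the paper uses R and lpSolve); the function names and the Un(0, 1) prior are assumptions of this illustration, not prescriptions of the paper.

```python
import math
import random

def log_etel(theta, data, tol=1e-10):
    """Evaluate log L_n(theta) for the scalar moment function
    g(d, theta) = d - theta. In one dimension the feasibility check (6)
    reduces to requiring 0 to lie strictly inside [min g, max g]."""
    g = [d - theta for d in data]
    if min(g) >= 0 or max(g) <= 0:
        return float("-inf")      # L_n(theta) = 0 outside the convex hull
    lam = 0.0
    for _ in range(100):          # Newton-Raphson for the dual problem (5)
        w = [math.exp(lam * gi) for gi in g]
        grad = sum(wi * gi for wi, gi in zip(w, g))
        hess = sum(wi * gi * gi for wi, gi in zip(w, g))
        step = grad / hess
        lam -= step
        if abs(step) < tol:
            break
    w = [math.exp(lam * gi) for gi in g]
    total = sum(w)
    return sum(math.log(wi / total) for wi in w)

def posterior_draws(data, n_iter=2000, scale=0.1, seed=0):
    """Random-walk Metropolis for the BETEL posterior of a mean
    under a Un(0, 1) prior."""
    rng = random.Random(seed)
    theta = sum(data) / len(data)
    lp = log_etel(theta, data)
    draws = []
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, scale)
        lp_prop = log_etel(prop, data) if 0.0 < prop < 1.0 else float("-inf")
        if math.log(rng.random()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        draws.append(theta)
    return draws

data = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
draws = posterior_draws(data)
```

Consistent with Proposition 1, log_etel is maximized at the M-estimator (here the sample mean), and it returns −∞ wherever the convex-hull condition fails.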
3. Simulations
3.1. Mean estimation for binary outcomes
In this simulation, we examine estimation of the population mean of binary outcomes in a design setting. In the notation of §2.3, Z = Y, u(Z, θ) = Y − θ and θ0 = EP0(Y). The design variables Wi (i = 1, …, n) are independent and identically distributed according to the beta distribution Be(1.5, 3.5), and the outcomes Yi | Wi ∼ Ber(Wi), so that θ0 = 0.3. The selection variables Ri (i = 1, …, n) are independent and identically distributed according to Ri | πi ∼ Ber(πi), where logit(πi) = Wi. Thus, Yi and the selection probability πi are positively correlated, and the selection must be adjusted for in order to estimate θ0. The data available for analysis are Di = (RiYi, Ri, Riπi) (i = 1, …, n), so that the design variables are excluded.
Following the approach in §2.3, the M-estimator θ̂ is the Hájek estimator, which solves

Σ_{i=1}^n Ri(yi − θ)/πi = 0.

We use g to define the exponentially tilted empirical likelihood Ln(θ), which we combine with three different priors for θ: θ ∼ Be(0.5, 0.5), θ ∼ Un(0, 1) and θ ∼ Be(1.5, 3.5). The first is the Jeffreys prior. The mean of the Be(1.5, 3.5) prior is equal to θ0, so we consider this prior informative, while the first two are considered noninformative.
We compare this approach with the proposal in Wang et al. (2017, §2). In a survey inference context, Wang et al. (2017) suggested using a Bayesian approach with an approximate normal likelihood

L̃n(θ) ∝ exp{−(θ − θ̂)²/(2V̂)},

where θ̂ is a consistent and asymptotically normal estimator of θ0 and V̂ is a robust estimator of the variance of θ̂. The estimator acts as a summary statistic for the data, such that the posterior is

p(θ | θ̂) ∝ L̃n(θ) p(θ),

where L̃n is defined by the normal model above and p(θ) is a prior for θ. We choose θ̂ to be the Hájek estimator defined above and estimate its variance with the nonparametric bootstrap. We use the same priors as defined above.
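A posterior of this normal-likelihood form is straightforward to approximate by importance sampling with the prior as proposal. The sketch below is our own Python illustration (the function name and inputs are assumptions), not the authors' implementation:

```python
import math
import random

def normal_posterior_mean(theta_hat, var_hat, a, b, n_draws=20000, seed=3):
    """Posterior mean under an approximate normal likelihood centred at
    theta_hat with variance var_hat, combined with a Be(a, b) prior,
    computed by importance sampling with the prior as the proposal."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n_draws):
        theta = rng.betavariate(a, b)  # draw from the prior
        w = math.exp(-(theta - theta_hat) ** 2 / (2.0 * var_hat))
        num += w * theta
        den += w
    return num / den
```

With an informative Be(1.5, 3.5) prior (mean 0.3), the posterior mean is pulled from a hypothetical estimate θ̂ = 0.4 towards 0.3, whereas a uniform Be(1, 1) prior leaves it essentially at θ̂.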
Table 1 compares the frequentist estimator with the Bayesian methods. Each setting was replicated 2000 times. The Bayes point estimators are the posterior means. Coverage rates were computed based on central 95% credible regions. The Bayesian computation was carried out using importance sampling with 5000 particles for each replication, and the tolerance for computing the exponentially tilted empirical likelihood was 10^{-4}. On a standard quad-core laptop computer, a single replication of the Bayesian exponentially tilted empirical likelihood method with n = 100 took 0.25 seconds to run, while the method of Wang et al. (2017) took 0.04 seconds.
Table 1. Bias, root mean squared error and coverage rate from 2000 Monte Carlo simulations using the Hájek estimator, the normal approximation of Wang et al. (2017) and the proposed Bayesian exponentially tilted empirical likelihood approach.
| Sample size | Prior | Method | Bias (×100) | RMSE (×100) | CR (%) |
|---|---|---|---|---|---|
| n = 25 | | Hájek | 0.17 | 11.67 | |
| | Jeffreys | Normal | −1.57 | 13.06 | 88.6 |
| | | BETEL | 1.19 | 11.79 | 92.9 |
| | Uniform | Normal | 0.72 | 11.55 | 91.5 |
| | | BETEL | 2.30 | 11.06 | 94.5 |
| | Be(1.5, 3.5) | Normal | −1.62 | 10.22 | 92.3 |
| | | BETEL | −0.11 | 9.18 | 96.2 |
| n = 50 | | Hájek | 0.08 | 8.27 | |
| | Jeffreys | Normal | −0.69 | 8.92 | 92.1 |
| | | BETEL | 0.88 | 8.29 | 94.7 |
| | Uniform | Normal | 0.39 | 8.28 | 91.7 |
| | | BETEL | 1.45 | 8.18 | 94.6 |
| | Be(1.5, 3.5) | Normal | −1.06 | 7.32 | 92.6 |
| | | BETEL | 0.11 | 7.33 | 95.7 |
| n = 100 | | Hájek | −0.08 | 6.01 | |
| | Jeffreys | Normal | −0.23 | 6.24 | 92.1 |
| | | BETEL | 0.51 | 6.03 | 94.8 |
| | Uniform | Normal | −0.11 | 6.03 | 92.5 |
| | | BETEL | 0.57 | 5.87 | 94.6 |
| | Be(1.5, 3.5) | Normal | −0.75 | 5.75 | 92.8 |
| | | BETEL | −0.08 | 5.56 | 94.9 |
RMSE, root mean squared error; CR, coverage rate of 95% credible regions; BETEL, Bayesian exponentially tilted empirical likelihood.
The Bayesian exponentially tilted empirical likelihood estimators generally have a higher magnitude of bias than the normal approximation when a noninformative prior is used, but have lower bias with the informative prior. This reflects the fact that the exponentially tilted empirical likelihood is less informative than the normal likelihood, resulting in higher shrinkage towards the prior mean. In the case of the two noninformative priors, this causes an upward bias towards the prior mean 0.5. This conservative characteristic leads to the Bayesian exponentially tilted empirical likelihood approach having superior performance in terms of root mean squared error and coverage rate across almost all settings, particularly with the smaller sample sizes for which the normal approximation is less accurate.
3.2. Doubly robust mean estimation with missing data
This simulation is conducted under the observational setting described in §2.4 and follows the design of Kang & Schafer (2007). For each i = 1, …, n, the vector of covariates Wi = (Wi1, Wi2, Wi3, Wi4) ∼ N(0, I4), where I4 is the 4 × 4 identity matrix; the selection indicator Ri | Wi ∼ Ber{π0(Wi)}, where

π0(Wi) = expit(−Wi1 + 0.5Wi2 − 0.25Wi3 − 0.1Wi4);

and Z = Y, the outcome, with Yi | Wi ∼ N{m0(Wi), 1}, where

m0(Wi) = 210 + 27.4Wi1 + 13.7Wi2 + 13.7Wi3 + 13.7Wi4.

Clearly, Yi and Ri are conditionally independent given Wi. The data are Di = (RiYi, Ri, Wi) (i = 1, …, n). In addition to the correctly specified models (a) and (b) based on Wi, we also consider the misspecified models (c) and (d), in which Wi is replaced by the transformed covariates Wi* = (Wi1*, Wi2*, Wi3*, Wi4*) with Wi1* = exp(Wi1/2), Wi2* = Wi2/{1 + exp(Wi1)} + 10, Wi3* = (Wi1Wi3/25 + 0.6)³ and Wi4* = (Wi2 + Wi4 + 20)². The target parameter is μ0 = EP0(Y) = 210. We write m and μ instead of the φ and γ used in §2.4 to match the notation of Kang & Schafer (2007). For the sake of brevity, the estimators and methods described in the rest of this subsection will be expressed in terms of the correct covariates Wi. Under misspecification, the covariates Wi are replaced with Wi* as appropriate.
The doubly robust augmented inverse probability weighted estimator (Robins et al., 1994), sometimes referred to as the standard doubly robust estimator, is

μ̂DR = (1/n) ∑_{i=1}^n [ m(Wi; β̂) + Ri{Yi − m(Wi; β̂)}/π(Wi; α̂) ], | (7) |

with α̂ and β̂ estimated via maximum likelihood estimation or, equivalently, by solving

(1/n) ∑_{i=1}^n Uα(Di; α) = 0,  (1/n) ∑_{i=1}^n Uβ(Di; β) = 0, | (8) |

where Uα and Uβ are the score equations for the logistic and linear regression models, respectively. In this case, the set of additional auxiliary parameters ρ referred to in § 2.4 is empty.
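For concreteness, the estimator in (7)–(8) can be computed from two standard regression fits. The sketch below is illustrative, not the authors' implementation; the logistic model is fitted by Newton–Raphson and the outcome model by least squares on the selected units.

```python
import numpy as np

def fit_logistic(R, X, iters=25):
    # Newton-Raphson for the logistic propensity model pi(W; alpha).
    a = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ a)))
        grad = X.T @ (R - p)                        # score vector
        hess = (X * (p * (1 - p))[:, None]).T @ X   # observed information
        a += np.linalg.solve(hess, grad)
    return a

def aipw_estimate(R, Y, W):
    # Working models: intercept plus the (possibly transformed) covariates.
    X = np.column_stack([np.ones(len(R)), W])
    pi = 1.0 / (1.0 + np.exp(-(X @ fit_logistic(R, X))))
    # Outcome regression by least squares on the selected units (R = 1).
    beta = np.linalg.lstsq(X[R == 1], Y[R == 1], rcond=None)[0]
    m = X @ beta
    # Augmented inverse probability weighted estimator, as in (7).
    return np.mean(m + R * (Y - m) / pi)
```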
Saarela et al. (2016) proposed a Bayesian doubly robust approach using the Bayesian bootstrap (Rubin, 1981). A Dirichlet process model is specified for Di in the limit of the concentration measure tending to 0, and inference for μ is based on a posterior predictive distribution induced by maximizing expected utility functions. Here we follow the approach detailed in § 6.2 of their paper and choose the utility functions to match the specification of the augmented inverse probability weighted estimator. More explicitly, the parameters α and β are linked to the Bayesian bootstrap model as the maximizers of the expected loglikelihoods of, respectively, the propensity score and the outcome regression models under the posterior, and the target parameter μ is defined by the corresponding weighted version of (7). In practice, we sample from the posterior predictive distribution by repeatedly generating uniform Dirichlet weights ω = (ω1, …, ωn) and computing the estimates with the fixed uniform weights (1/n, …, 1/n) replaced by ω in (7) and (8). The posterior predictive mean of μ serves as the point estimator for this method. The Bayesian exponentially tilted empirical likelihood posterior for θ = (α, β, μ) is obtained by setting u(Z, μ) = Y − μ and following the approach described in § 2.4. We compare this with the doubly robust augmented inverse probability weighted estimator and the proposal of Saarela et al. (2016).
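The posterior predictive sampling step can be sketched as follows: each draw re-solves weighted analogues of the estimating equations under uniform Dirichlet weights. This is a hedged illustration, not the authors' code; the weighted regression fits stand in for the expected-utility maximizations.

```python
import numpy as np

def weighted_logistic(R, X, w, iters=25):
    # Solve the weighted score equations sum_i w_i x_i (R_i - pi_i) = 0 by Newton.
    a = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ a)))
        grad = X.T @ (w * (R - p))
        hess = (X * (w * p * (1 - p))[:, None]).T @ X
        a += np.linalg.solve(hess, grad)
    return a

def bayesian_bootstrap_draws(R, Y, W, n_draws, rng):
    n = len(R)
    X = np.column_stack([np.ones(n), W])
    draws = np.empty(n_draws)
    for b in range(n_draws):
        # Uniform Dirichlet weights: the Bayesian bootstrap (Rubin, 1981).
        w = rng.dirichlet(np.ones(n))
        alpha = weighted_logistic(R, X, w)
        pi = 1.0 / (1.0 + np.exp(-(X @ alpha)))
        # Weighted least squares for the outcome model on the selected units.
        wr = w * R
        beta = np.linalg.solve((X * wr[:, None]).T @ X, X.T @ (wr * Y))
        m = X @ beta
        # Weighted analogue of the estimator in (7).
        draws[b] = np.sum(w * (m + R * (Y - m) / pi))
    return draws
```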
In Table 2, for both of the Bayesian exponentially tilted empirical likelihood estimators, we use independent, weakly informative generalized Student-t priors for the working models (a)–(d), all with three degrees of freedom: (a) β ∼ t3{(210, 30, 10, 10, 10)T, I5}; (b) α ∼ t3{(0, 0, 0, 0, 0)T, I5}; (c) β ∼ t3{(35, 50, 0, −135, 0)T, I5}; and (d) α ∼ t3{(0, 0, 0, 0, 0)T, I5}, where the shape matrices are the 5 × 5 identity matrix, so that the prior variance of each component is equal to 3. For the first Bayesian exponentially tilted empirical likelihood estimator, a weakly informative prior μ ∼ t3(210, 1) is specified for the target parameter, while the second is equipped with a more informative prior, μ ∼ N(210, 1). The three parameters (α, β, μ) are a priori independent across all settings. Sampling from the Saarela et al. (2016) posterior can be implemented directly, as described above. The exponentially tilted empirical likelihood was computed with a tolerance of 10^−4, and posterior samples were drawn using a Metropolis–Hastings algorithm with 2000 iterations, with an initial burn-in of 500 iterations. On a standard quad-core laptop computer, a single replication using the Bayesian exponentially tilted empirical likelihood method without parallelization took 55 seconds; computing the frequentist estimator took 0.05 seconds, and the method of Saarela et al. (2016) took 3.71 seconds.
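For readers unfamiliar with the mechanics, evaluating the exponentially tilted empirical likelihood at a fixed parameter value requires only an inner convex optimization for the tilting vector λ, which minimises n⁻¹∑ᵢ exp{λᵀg(Dᵢ, θ)}. The sketch below uses a generic moment matrix; the function name and the Newton solver are ours, not the authors' implementation.

```python
import numpy as np

def etel_logpseudolik(G, iters=50, tol=1e-4):
    # G: n x k matrix whose rows are the moment evaluations g(D_i, theta).
    # Inner problem: minimise f(lam) = (1/n) sum_i exp(lam' g_i),
    # a smooth convex problem, solved here by Newton's method.
    n, k = G.shape
    lam = np.zeros(k)
    for _ in range(iters):
        e = np.exp(G @ lam)
        grad = G.T @ e / n
        hess = (G * e[:, None]).T @ G / n
        step = np.linalg.solve(hess, grad)
        lam -= step
        if np.max(np.abs(step)) < tol:
            break
    # Exponentially tilted probabilities p_i and the log pseudolikelihood.
    e = np.exp(G @ lam)
    p = e / e.sum()
    return np.sum(np.log(p))
```

At a parameter value where the sample moments are exactly zero, λ = 0 is optimal, the tilted probabilities are uniform and the log pseudolikelihood attains its maximum value −n log n; any other value is strictly smaller.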
Table 2. Monte Carlo simulations based on 1000 replications using the standard doubly robust estimator, the method of Saarela et al. (2016) and the Bayesian exponentially tilted empirical likelihood approach.
| | OR correct, PS correct | | | | OR incorrect, PS correct | | | |
| Estimator | Bias | RMSE | MAE | ESD | Bias | RMSE | MAE | ESD |
| DR | −0.01 | 2.55 | 1.73 | 2.55 | 0.27 | 3.61 | 2.32 | 3.60 |
| Sa | 0.01 | 2.57 | 1.71 | 2.57 | 0.57 | 3.44 | 2.31 | 3.39 |
| BETEL (weak prior) | 0.04 | 2.23 | 1.24 | 2.23 | 0.31 | 3.55 | 2.13 | 3.54 |
| BETEL (inf. prior) | −0.05 | 1.95 | 1.18 | 1.95 | 0.29 | 2.87 | 1.81 | 2.85 |
| | OR correct, PS incorrect | | | | OR incorrect, PS incorrect | | | |
| Estimator | Bias | RMSE | MAE | ESD | Bias | RMSE | MAE | ESD |
| DR | −0.01 | 2.59 | 1.73 | 2.59 | −6.44 | 38.52 | 3.64 | 37.97 |
| Sa | −0.09 | 2.60 | 1.73 | 2.60 | −4.81 | 15.41 | 3.38 | 14.64 |
| BETEL (weak prior) | 0.06 | 2.32 | 1.33 | 2.32 | −5.11 | 14.75 | 3.36 | 13.84 |
| BETEL (inf. prior) | −0.09 | 2.10 | 1.29 | 2.10 | −2.37 | 4.20 | 2.51 | 3.47 |
RMSE, root mean squared error; MAE, median of absolute errors; ESD, empirical standard deviation; DR, doubly robust; Sa, the method of Saarela et al. (2016); BETEL, Bayesian exponentially tilted empirical likelihood; OR, outcome regression; PS, propensity score; OR correct, use of the correct outcome regression model (a); OR incorrect, use of the model (c); PS correct, use of the correct propensity score model (b); PS incorrect, use of the model (d).
The results in Table 2 show that in all settings, the Bayesian exponentially tilted empirical likelihood estimator with the informative prior outperforms both the standard doubly robust estimator and the estimator of Saarela et al. (2016) in terms of root mean squared error and median absolute error. This illustrates that when substantial prior knowledge of the target parameter is available, use of this information in our proposed approach leads to better overall performance than the other estimators evaluated. When at least one of the working models is correctly specified, the Bayesian exponentially tilted empirical likelihood estimator with the weakly informative prior exhibits a higher magnitude of bias than the standard doubly robust estimator, but lower root mean squared error and median absolute error, demonstrating the regularization benefits of using weakly informative priors with our proposal. When both working models are misspecified, the improvements are more substantial, suggesting that the use of a prior can help to alleviate the effects of misspecification. The pattern of results for the weakly informative variant compared with the method of Saarela et al. (2016) is less clear for the bias and root mean squared error metrics, but the former has consistently lower median absolute error.
4. Application
We examine the association between blood pressure and sodium and potassium consumption using data from the National Health and Nutrition Examination Survey 2003–2006. The dataset includes 13 957 individuals with complete data on the relevant variables, drawn from the U.S. civilian population, which we assume to be constant at 300 million over the period 2003–2006. Each observation is associated with a weight variable that is assumed to be proportional to the reciprocal of the sampling probability of the individual. This follows the example in Lumley (2010, § 5.2.4).
We work in the design setting described in § 2.3. The aim is to fit a linear regression model for blood pressure Y on sodium consumption X1 and potassium consumption X2; age X3 is also included to adjust for confounding. The moment condition is

0 = E{RWX(Y − XTθ0)},

where R is the selection indicator variable, W is the weight variable and X = (1, X1, X2, X3)T. We consider the frequentist M-estimator with standard errors estimated using the sandwich estimator (1). For our Bayesian exponentially tilted empirical likelihood proposal, each regression parameter is assigned an independent prior: θ0 ∼ t3(100, 1); θ1 and θ2 follow half-normal distributions on the positive and negative real numbers, respectively, each with scale parameter 1; and θ3 ∼ t3(0, 1). The priors for θ1 and θ2 reflect the substantial prior evidence that sodium raises blood pressure in humans while potassium lowers it. The likelihood was computed with a tolerance of 10^−4, and posterior samples were drawn using a Metropolis–Hastings algorithm.
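As an illustration of the frequentist comparator, the weighted M-estimator and its sandwich standard errors can be computed as follows. This is a generic sketch for a weighted linear regression moment condition, evaluated on the selected individuals; the function and variable names are ours.

```python
import numpy as np

def weighted_reg_sandwich(Y, X, w):
    # Solve the weighted moment condition sum_i w_i x_i (y_i - x_i' theta) = 0.
    XtW = X.T * w
    bread = XtW @ X
    theta = np.linalg.solve(bread, XtW @ Y)
    # Sandwich variance B^{-1} M B^{-1}', with M the outer product of the
    # weighted scores evaluated at the estimate.
    resid = Y - X @ theta
    scores = X * (w * resid)[:, None]
    meat = scores.T @ scores
    binv = np.linalg.inv(bread)
    cov = binv @ meat @ binv.T
    return theta, np.sqrt(np.diag(cov))
```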
Table 3 compares the frequentist estimates with the Bayesian exponentially tilted empirical likelihood posterior means. Besides the analysis of the full dataset, an analysis of a random subsample of 300 individuals was also carried out. With the smaller dataset, the frequentist approach yields a positive estimate for the effect of potassium on blood pressure, whereas the Bayesian exponentially tilted empirical likelihood approach gives an estimate much closer to the values obtained from the full dataset, illustrating the potentially substantial impact of using an informative prior. The results of the two approaches converge as the sample size increases, in accordance with our theory.
Table 3. Frequentist estimates and standard errors compared with Bayesian exponentially tilted empirical likelihood posterior means and posterior standard deviations.
| Sample size | Method | θ 0 | θ 1 | θ 2 | θ 3 | |
|---|---|---|---|---|---|---|
| n = 300 | Frequentist | Estimate | 95.10 | 0.39 | 0.85 | 0.54 |
| Standard error | 2.85 | 0.63 | 0.94 | 0.04 | ||
| BETEL | Posterior mean | 99.31 | 0.51 | −0.56 | 0.52 | |
| Posterior s.d. | 1.22 | 0.29 | 0.39 | 0.03 | ||
| n = 13 957 | Frequentist | Estimate | 99.74 | 0.80 | −0.91 | 0.50 |
| Standard error | 0.80 | 0.15 | 0.19 | 0.01 | ||
| BETEL | Posterior mean | 99.82 | 0.78 | -0.89 | 0.49 | |
| Posterior s.d. | 0.39 | 0.09 | 0.12 | 0.01 |
BETEL, Bayesian exponentially tilted empirical likelihood; s.d., standard deviation.
5. Discussion
Empirical likelihood estimators for missing data problems have previously been proposed by Qin & Zhang (2007) and Chan & Yam (2014). Their work provides a convenient framework for integrating multiple working models into a single analysis, extending the doubly robust property to a multiply robust one. The methods are based on maximizing the conditional empirical likelihood of the outcomes and covariates given selection, and thus differ from ours, which uses the marginal exponentially tilted empirical likelihood.
Lazar (2003) provides a justification for using the empirical likelihood for Bayesian inference based on a criterion proposed by Monahan & Boos (1992). We expect a corresponding justification to hold for the exponentially tilted empirical likelihood because of their similarity. Previous theoretical and empirical comparisons suggest that, in addition to the empirical likelihood, other approximate likelihoods such as the bootstrap likelihood (Davison et al., 1992) and the implied likelihood (Efron, 1993) would yield similar performance.
As suggested in §2.1, the empirical distribution may be viewed as an initial estimate of the true data-generating distribution in the exponentially tilted empirical likelihood. From this perspective, it is natural to wonder whether this initial estimate can be improved. While the empirical distribution can be applied very generally, its use may disregard additional known or assumed structure about the data distribution, such as its support, conditional independencies and smoothness. Nonparametric techniques such as density estimation may offer a way to incorporate this information into the initial estimate. Investigating whether such replacements are advantageous is a topic of further research.
Supplementary Material
Supplementary material available at Biometrika online includes pseudo-code for the implementation and proofs of all the theoretical results.
Acknowledgement
We thank Dr Shaun Seaman and the reviewers for helpful comments and suggestions. This work was funded by the U.K. Medical Research Council.
References
- Ades AE, Sutton AJ. Multiparameter evidence synthesis in epidemiology and medical decision-making: current approaches. J R Statist Soc A. 2006;169:5–35.
- Antonelli J, Dominici F. Causal inference in high dimensions: a marriage between Bayesian modeling and good frequentist properties. arXiv: 1805.04899v5. 2019. doi: 10.1111/biom.13417.
- Berkelaar M. lpSolve: Interface to 'Lp_solve' v. 5.5 to Solve Linear/Integer Programs. 2015. R package version 5.6.15, available at https://cran.r-project.org/web/packages/lpSolve/index.html.
- Cao W, Tsiatis AA, Davidian M. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika. 2009;96:723–34. doi: 10.1093/biomet/asp033.
- Chan KCG, Yam SCP. Oracle, multiple robust and multipurpose calibration in a missing response problem. Statist Sci. 2014;29:380–96.
- Chib S, Shin M, Simoni A. Bayesian estimation and comparison of moment condition models. J Am Statist Assoc. 2018;113:1656–68.
- Corcoran SA. Bartlett adjustment of empirical discrepancy statistics. Biometrika. 1998;85:967–72.
- Csiszár I. I-divergence geometry of probability distributions and minimization problems. Ann Prob. 1975;3:146–58.
- Davison AC, Hinkley DV, Worton BJ. Bootstrap likelihoods. Biometrika. 1992;79:113–30.
- Efron B. Bayes and likelihood calculations from confidence intervals. Biometrika. 1993;80:3–26.
- Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. CRC Press; Boca Raton, Florida: 2013.
- Graham DJ, McCoy EJ, Stephens DA. Approximate Bayesian inference for doubly robust estimation. Bayesian Anal. 2016;11:47–69.
- Hájek J. Discussion of 'An essay on the logical foundations of survey sampling, part I'. In: Foundations of Statistical Inference (Proc Sympos, Univ Waterloo, Ontario, 1970). Holt, Rinehart and Winston; Toronto: 1971. p. 236.
- Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Statist Assoc. 1952;47:663–85.
- Jing B-Y, Wood ATA. Exponential empirical likelihood is not Bartlett correctable. Ann Statist. 1996;24:365–9.
- Kang JDY, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data (with Discussion). Statist Sci. 2007;22:523–39. doi: 10.1214/07-STS227.
- Lazar N. Bayesian empirical likelihood. Biometrika. 2003;90:319–26.
- Lee SMS, Young GA. Nonparametric likelihood ratio confidence intervals. Biometrika. 1999;86:107–18.
- Lumley T. Complex Surveys: A Guide to Analysis Using R. Wiley; Hoboken, New Jersey: 2010.
- McCandless LC, Douglas IJ, Evans SJ, Smeeth L. Cutting feedback in Bayesian regression adjustment for the propensity score. Int J Biostatist. 2010;6. doi: 10.2202/1557-4679.1205.
- Monahan JF, Boos DD. Proper likelihoods for Bayesian analysis. Biometrika. 1992;79:271–8.
- Owen AB. Empirical Likelihood. Chapman & Hall/CRC; New York: 2001.
- Pfeffermann D, Moura FAS, Nascimento-Silva PL. Multi-level modelling under informative sampling. Biometrika. 2006;93:943–59.
- Qin J, Zhang B. Empirical-likelihood-based inference in missing response problems and its application in observational studies. J R Statist Soc B. 2007;69:101–22.
- R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2020. ISBN 3-900051-07-0, http://www.R-project.org.
- Ray K, van der Vaart A. Semiparametric Bayesian causal inference. arXiv: 1808.04246v2. 2019.
- Robins JM, Hernán M, Wasserman L. Discussion of 'On Bayesian estimation of marginal structural models'. Biometrics. 2015;71:296–9. doi: 10.1111/biom.12273.
- Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Statist Assoc. 1994;89:846–66.
- Rotnitzky A, Lei Q, Sued M, Robins JM. Improved double-robust estimation in missing data and causal inference models. Biometrika. 2012;99:439–56. doi: 10.1093/biomet/ass013.
- Rotnitzky A, Vansteelandt S. Double-robust methods. In: Molenberghs G, Fitzmaurice G, Kenward MG, Tsiatis A, Verbeke G, editors. Handbook of Missing Data Methodology. CRC Press; Boca Raton, Florida: 2014. pp. 185–212.
- Rubin DB. The Bayesian bootstrap. Ann Statist. 1981;9:130–4.
- Saarela O, Belzile LR, Stephens DA. A Bayesian view of doubly robust causal inference. Biometrika. 2016;103:667–81.
- Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models (with Discussion). J Am Statist Assoc. 1999;94:1096–146.
- Schennach S. Bayesian exponentially tilted empirical likelihood. Biometrika. 2005;92:31–46.
- Schennach S. Point estimation with exponentially tilted empirical likelihood. Ann Statist. 2007;35:634–72.
- Si Y, Pillai N, Gelman A. Bayesian nonparametric weighted sampling inference. Bayesian Anal. 2015;10:605–25.
- Tsybakov AB. Introduction to Nonparametric Estimation. Springer; New York: 2009.
- van der Vaart AW. Asymptotic Statistics. Cambridge University Press; Cambridge: 1998.
- Wang Z, Kim JK, Yang S. Approximate Bayesian inference under informative sampling. Biometrika. 2017;105:91–102.
- Zanganeh SZ, Little RJA. Bayesian inference for the finite population total from a heteroscedastic probability proportional to size sample. J Surv Statist Methodol. 2015;3:91–102.
- Zigler CM, Watts K, Yeh RW, Wang Y, Coull BA. Model feedback in Bayesian propensity score estimation. Biometrics. 2013;69:263–73. doi: 10.1111/j.1541-0420.2012.01830.x.