Summary
Covariate balance is often advocated for objective causal inference since it mimics randomization in observational data. Unlike methods that balance specific moments of covariates, our proposal attains uniform approximate balance for covariate functions in a reproducing-kernel Hilbert space. The corresponding infinite-dimensional optimization problem is shown to have a finite-dimensional representation in terms of an eigenvalue optimization problem. Large-sample results are studied, and numerical examples show that the proposed method achieves better balance with smaller sampling variability than existing methods.
Keywords: Average treatment effect, Eigenvalue optimization, Reproducing-kernel Hilbert space, Sobolev space
1. Introduction
The estimation of average treatment effects is important in the evaluation of an intervention or a treatment, but is complicated by confounding in observational studies where the treatment is not randomly assigned. When treatment assignment is unconfounded conditional on observable covariates, two popular modelling strategies are based on propensity score modelling (Rosenbaum & Rubin, 1983) and outcome regression modelling. Parametric approaches can suffer seriously from model misspecification, and there have been substantial recent efforts to construct more robust estimators within these modelling frameworks; see, for example, Robins et al. (1994), Qin & Zhang (2007), Tan (2010), Graham et al. (2012), and Han & Wang (2013).
Since randomization is a gold standard for identifying average treatment effects, Rubin (2007) advocated mimicking randomization, which balances the covariate distributions among the treated, the controls, and the combined sample, in the analysis of observational data. Based on these considerations, weighting-based covariate balancing methods have been proposed by Qin & Zhang (2007), Hainmueller (2012), Imai & Ratkovic (2014), Zubizarreta (2015) and Chan et al. (2016). A common feature of these methods is that a vector of user-specified functions of covariates is balanced. While balancing low-order moments of the covariates often yields good results, there is no guarantee that there will be sufficient balance over a large class of covariate functions. Matching is another general approach to attaining covariate balance. Exact matching is not feasible for multiple continuous covariates, and a user-specified coarsening of the covariate space is needed (Iacus et al., 2011). In this paper, we shall focus on weighting-based methods.
Instead of balancing prespecified moments of covariates, we propose a method to control the covariate functional balance over a reproducing-kernel Hilbert space (Aronszajn, 1950), which can be chosen large enough to contain any functions with mild smoothness constraints, including nonlinearities and interactions. At a conceptual level, the comparison between covariate balancing with an increasing number of basis functions and kernel-based covariate functional balancing is analogous to the comparison of regression and smoothing splines in conditional mean estimation. Unlike regression splines, smoothing splines do not require preselection of the number of knots and their locations. Although achieving our goal involves a challenge due to an infinite-dimensional optimization problem, we show that it has a finite-dimensional representation and can be solved by eigenvalue optimization. Large-sample properties are derived under minimal smoothness conditions on the outcome regression model. Consistent estimation of average treatment effects is then possible without first guessing or estimating the outcome regression function, and efficient estimation can be attained when the outcome regression function is estimated. Unlike weighting methods that require stringent smoothness conditions for the propensity score function, our method does not require smoothness of the propensity score.
2. Kernel-based covariate functional balancing
2·1. Preliminaries
Let Y(1) and Y(0) be the potential outcomes when an individual is assigned to the treatment group and control group respectively. We are interested in estimating the population average treatment effect τ = E{Y(1) − Y(0)}. In practice, Y(1) and Y(0) are not both observed. With T denoting the binary treatment indicator, we can represent the observed outcome as Y = TY(1) + (1 − T)Y(0). Moreover, we observe a vector of covariates X for every individual, so the observed data are {(Ti, Yi, Xi), i = 1, …, N}, where N is the sample size. We assume that {Ti, Yi(1), Yi(0), Xi} (i = 1, …, N) are independent and identically distributed, and that T is independent of {Y(1), Y(0)} conditional on X.
Note that τ consists of two expectations, E{Y(1)} and E{Y(0)}. In this work, we consider weighted estimation of these expectations. Without loss of generality, we focus on E{Y(1)}. In the following, we consider a weighting estimator of E{Y(1)} that can be represented as N^{−1} Σ_{i=1}^N Ti wi Yi. Hence, for estimation of E{Y(1)}, we only need to specify weights wi (i : Ti = 1) for individuals in the treatment group.
Let π(x) = pr(T = 1 | X = x) be the propensity score. Assuming knowledge of π(Xi) (i : Ti = 1), wi can be chosen as {π(Xi)}−1 to obtain a consistent estimator of E{Y (1)}. In practice, propensity scores are usually unknown. In such scenarios, one can estimate the propensity score function to form a plug-in estimator for E{Y (1)}. However, estimation errors and model misspecification of the propensity score function can lead to significant error in the estimation of E{Y (1)} due to the use of inverse probability weighting. Poor finite-sample performance of such estimators has been reported in the literature (Kang & Schafer, 2007).
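As a point of reference for the weighting form above, the following minimal Python sketch computes the plug-in inverse probability weighting estimate of E{Y(1)}; the logistic working model is only one illustrative choice for estimating π, and the function and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_estimate(X, T, Y):
    """Plug-in inverse probability weighting estimate of E{Y(1)}.

    The propensity score is estimated with a working logistic regression;
    misspecification or small estimated propensities can make the weights
    1 / pi_hat(X_i) unstable, which is the instability discussed in the text.
    """
    pi_hat = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
    w = 1.0 / pi_hat                  # candidate weights for treated units
    return np.mean(T * w * Y)         # N^{-1} sum_i T_i w_i Y_i
```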
Due to this unsatisfactory performance, some attention has been given to choosing wi (i : Ti = 1) via covariate balancing, which mimics randomization directly. To understand this, note that
E{T u(X) / π(X)} = E{u(X)}    (1)
for any measurable function u such that E{u(X)} exists and is finite. Instead of modelling the propensity function, it is therefore natural to choose weights that ensure the validity of the empirical finite-dimensional approximation of (1),
N^{−1} Σ_{i=1}^N Ti wi U(Xi) = N^{−1} Σ_{i=1}^N U(Xi),    (2)
where U(X) = {u1(X), …, uL(X)}^T is an L-variate function of X. Here span{u1, …, uL} can be viewed as a finite-dimensional approximation space of functions in which the balancing is enforced. Practical considerations may suggest a choice of {u1, …, uL}. In this case, we call it parametric covariate balancing. Without assumptions on the outcome regression model, the balancing of fixed and finitely many component functions uj in (1) may not lead to consistent estimation (Hellerstein & Imbens, 1999). To allow consistent estimation in a larger family of outcome regression functions, another possibility is to allow L to increase with N (Chan et al., 2016). This has a nonparametric flavour similar to regression splines, for which the number of knots grows with sample size. However, the choices of L and {u1, …, uL} are not obvious. In this work, we aim to balance covariate functionals nonparametrically via reproducing-kernel Hilbert space modelling of the approximation space.
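To make the parametric balancing condition (2) concrete, the sketch below computes weights for the treated units that satisfy the L moment constraints exactly; the squared-deviation objective and the SLSQP solver are illustrative choices for the sketch, not the proposal of this paper.

```python
import numpy as np
from scipy.optimize import minimize

def moment_balancing_weights(U, T):
    """Weights w_i (i: T_i = 1) with N^{-1} sum_i T_i w_i U(X_i) = N^{-1} sum_i U(X_i).

    U : (N, L) array of basis functions u_1, ..., u_L evaluated at the X_i.
    T : (N,) array of 0/1 treatment indicators.
    Minimizes sum_i T_i (w_i - 1)^2 subject to the L balance constraints and w_i >= 1;
    this stabilizing objective is one simple choice among many.
    """
    N, L = U.shape
    idx = np.where(T == 1)[0]
    target = U.mean(axis=0)                       # N^{-1} sum_i U(X_i)

    def balance_gap(w_t):                         # exact balance, as in (2)
        return U[idx].T @ w_t / N - target

    res = minimize(fun=lambda w_t: np.sum((w_t - 1.0) ** 2),
                   x0=np.full(len(idx), N / len(idx)),
                   constraints=[{"type": "eq", "fun": balance_gap}],
                   bounds=[(1.0, None)] * len(idx),
                   method="SLSQP")
    w = np.zeros(N)
    w[idx] = res.x
    return w
```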
Let m(X) = E{Y(1) | X} and Yi(1) = m(Xi) + εi for i = 1, …, N. Further assume that the εi are independent with E(εi | Xi) = 0 and E(εi² | Xi) = σi² < ∞. All weighting estimators of E{Y(1)} admit the decomposition
N^{−1} Σ_{i=1}^N Ti wi Yi − E{Y(1)} = N^{−1} Σ_{i=1}^N (Ti wi − 1) m(Xi) + N^{−1} Σ_{i=1}^N Ti wi εi + [N^{−1} Σ_{i=1}^N m(Xi) − E{m(X)}],    (3)
which allows a transparent understanding of the terms that have to be controlled. The first term on the right-hand side of (3) poses a challenge since the unknown outcome regression function m is intrinsically related to the outcome data, and could be complex and high-dimensional in general. To connect with covariate balancing, if m ∈ span{u1, …, uL} in (2), we can control the first term. For the second term, the εi (i = 1, …, N) are independent of the choice of wi (i : Ti = 1) if the outcome data are not used to obtain the weights. Some control over the magnitude of wi will lead to convergence of the second term. Corresponding details will be given in § 2.4. The convergence of the third term is ensured by the law of large numbers.
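The decomposition (3) is an algebraic identity once Yi(1) = m(Xi) + εi is substituted. The short numerical check below, with an illustrative m, a design for which E{m(X)} is known, and arbitrary weights, confirms that the three terms on the right-hand side sum exactly to the left-hand side.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
X = rng.uniform(size=N)
m = lambda x: np.sin(2 * np.pi * x)            # illustrative m(.); E{m(X)} = 0 for X ~ U[0,1]
eps = rng.normal(size=N)
T = rng.binomial(1, 0.5, size=N)
Y1 = m(X) + eps                                # Y_i(1) = m(X_i) + eps_i
w = rng.uniform(1, 3, size=N)                  # arbitrary weights; only w_i with T_i = 1 matter
EY1 = 0.0                                      # E{Y(1)} = E{m(X)} = 0 in this toy design

lhs = np.mean(T * w * Y1) - EY1
term1 = np.mean((T * w - 1) * m(X))            # balancing error at m
term2 = np.mean(T * w * eps)                   # weighted noise
term3 = np.mean(m(X)) - EY1                    # law-of-large-numbers term
assert np.isclose(lhs, term1 + term2 + term3)  # the identity (3) holds exactly
```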
2·2. Construction of the method
We consider the following empirical validity measure for any suitable function u:

S_N(w, u) = [N^{−1} Σ_{i=1}^N {(Ti wi − 1) u(Xi)}]²,

where w = (w1, …, wN)^T. In parametric covariate balancing, the weights wi (i : Ti = 1) can be constructed to satisfy

S_N(w, u) = 0  for all u ∈ 𝒰,

where 𝒰 = span{u1, …, uL} with u1, …, uL being suitable basis functions. In this case, the weights attain exact covariate balance as in (2) when the dimension of 𝒰 is small.
Here the overall validity of (1) is instead controlled directly on an approximation space ℋ, a reproducing-kernel Hilbert space with inner product ⟨·, ·⟩_ℋ and norm ‖·‖_ℋ. Ideally, one would want to pick a large enough, possibly infinite-dimensional, space ℋ to guarantee the control of S_N(w, u) on a rich class of functions. Unlike sieve spaces, ℋ is specified without reference to sample size. The matching of nonlinear functions is also automatic if ℋ is large enough to contain such functions, without the need to explicitly introduce particular nonlinear basis functions as in sieve spaces. For any Hilbert space ℋ1 of functions of x1 and any Hilbert space ℋ2 of functions of x2, the tensor product space ℋ1 ⊗ ℋ2 is defined as the completion of the linear span of {f(x1)g(x2) : f ∈ ℋ1, g ∈ ℋ2} under the norm induced by ⟨f1 g1, f2 g2⟩ = ⟨f1, f2⟩_{ℋ1} ⟨g1, g2⟩_{ℋ2}. A popular choice of ℋ is the tensor product reproducing-kernel Hilbert space ℋ = ℋ^(1) ⊗ ⋯ ⊗ ℋ^(d), with ℋ^(j) being a reproducing-kernel Hilbert space of functions of the jth component of X. Suppose the support of the covariate distribution is [0, 1]^d and f^(ℓ) is the ℓth derivative of a function f. Following Wahba (1990), one can choose ℋ^(j) as the ℓth-order Sobolev space {f : f, f^(1), …, f^(ℓ−1) are absolutely continuous, f^(ℓ) ∈ L2[0, 1]} with norm

‖f‖² = Σ_{k=0}^{ℓ−1} {∫_0^1 f^(k)(t) dt}² + ∫_0^1 {f^(ℓ)(t)}² dt.
The second-order Sobolev space is one of the most common choices in practice and will be adopted in all of our numerical illustrations. Another common choice is the space generated by the Gaussian kernel, which will also be used in our numerical studies. If it is desirable to prioritize covariates based on prior beliefs, we can raise the component kernels to different powers to reflect their relative importance. For Gaussian kernels, this is equivalent to using different bandwidth parameters for each covariate. In cases where there are binary or categorical covariates, one can choose the corresponding ℋ^(j) as a reproducing-kernel Hilbert space with kernel R(s, t) = I(s = t), for any levels s and t of such a covariate, as suggested by Gu (2013); here I is an indicator function.
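As an illustration of the tensor-product construction, the sketch below forms the N × N Gram matrix from per-coordinate Gaussian kernels, with bandwidths chosen by one version of the median heuristic; the second-order Sobolev kernel could be substituted for any coordinate, and the function name is hypothetical.

```python
import numpy as np

def tensor_gaussian_gram(X, bandwidths=None):
    """N x N Gram matrix M_{ij} = K(X_i, X_j) for a tensor-product kernel whose jth
    factor is a Gaussian kernel in the jth covariate.  A common bandwidth recovers the
    usual multivariate Gaussian kernel; unequal bandwidths give the relative weighting
    of covariates mentioned above."""
    N, d = X.shape
    M = np.ones((N, N))
    for j in range(d):
        diff = np.abs(X[:, j][:, None] - X[:, j][None, :])
        if bandwidths is None:
            pos = diff[diff > 0]
            bw = np.median(pos) if pos.size else 1.0   # one form of the median heuristic
        else:
            bw = bandwidths[j]
        M *= np.exp(-diff ** 2 / (2.0 * bw ** 2))      # jth factor of the tensor product
    return M
```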
Ideally, we want to control S_N(w, u) for all u ∈ ℋ. However, there are two issues. First, the identity S_N(w, cu) = c² S_N(w, u) for any c > 0 suggests a scale issue of S_N with respect to u. Therefore, in order to use S_N(w, u) to determine the weights w, the magnitude of u should be standardized. To deal with this, we note that

S_N(w, u) ⩽ {N^{−1} Σ_{i=1}^N (Ti wi − 1)²} ‖u‖²_N    (4)
by the Cauchy–Schwarz inequality, where ‖u‖²_N = N^{−1} Σ_{i=1}^N u(Xi)². In view of (4), we restrict our attention to functions u with ‖u‖_N ⩽ 1. Second, similar to many statistical and machine learning frameworks, the optimization of an unpenalized sample objective function will result in overfitting. In our case, the weights become highly unstable. To alleviate this, we control ‖u‖²_ℋ to emphasize the balance on smoother functions. Additionally, we penalize on V_N(w) = N^{−1} Σ_{i=1}^N Ti wi² to control the variabilities of both w and the second term on the right-hand side of (3). Overall, we consider the constrained minimization
ŵ = arg min_{w : wi ⩾ 1} ( sup_{u ∈ ℋ : ‖u‖_N ⩽ 1} [S_N(w, u) − λ1 ‖u‖²_ℋ] + λ2 V_N(w) ),    (5)
where λ1 > 0 and λ2 > 0 are tuning parameters and the above minimization is only taken over wi (i : Ti = 1). The weights wi are restricted to be greater than or equal to 1, as their counterparts, the inverse propensities, satisfy {π(Xi)}^{−1} ⩾ 1. We denote the solution of (5) by ŵ. Further discussion on these tuning parameters will be given in § 2·4 and § 2·5. In particular, we show that the convergence to zero of the first term of (3) can be ensured even when λ2 = 0. This indicates that this extra tuning parameter is mostly needed for our justification of the convergence of the second term in (3).
A small number of recent papers have also considered kernel-based methods for covariate balancing. An unpublished paper by Zhao (2017) considered a dual formulation of the method of Imai & Ratkovic (2014) for the estimation of π(x) under a logistic regression model, and generalized the linear predictor into a nonlinear one using the kernel trick. Since this method aims at estimating π(x), it requires smoothness conditions on π and penalizes roughness of the resulting estimate. Our method does not require smoothness of π and penalizes the roughness of the balancing functions. An unpublished paper by Kallus (2017a) considered weights that minimize the dual norm of a balancing error. Given a reproducing-kernel Hilbert space, this method does not have the ability to adapt to a relevant subset of functions. An external parameter is required to index the function space, such as the dispersion parameter of a Gaussian kernel, and needs to be specified in an ad hoc manner. Due to the lack of an explicit tuning parameter, this method will not work well for the Sobolev space, which does not have extra indexing parameters. Our method works for a given reproducing-kernel Hilbert space by using data-adaptive tuning to promote balancing of smoother functions within the given space. An unpublished paper of Hazlett (2016) proposed an extension of the moment-based balancing method of Hainmueller (2012) to balance the columns of the Gram matrix. Since the Gram matrix is N × N, exact balancing of N moment conditions under additional constraints on the weights is often computationally infeasible. Balancing a low-rank approximation of the Gram matrix may be an ad hoc solution but its theoretical properties have not been studied.
2·3. Finite-dimensional representation
Many common choices of reproducing-kernel Hilbert space, including Sobolev spaces, are infinite-dimensional, so the inner optimization in (5) is essentially infinite-dimensional and appears impractical. Fortunately, we shall show that the solution of (5) enjoys a finite-dimensional representation. First, the inner optimization of (5) can be expressed as

sup_{u ∈ ℋ} [{N^{−1} Σ_{i=1}^N (Ti wi − 1) u(Xi)}² − λ1 ‖u‖²_ℋ]  subject to  N^{−1} Σ_{i=1}^N u(Xi)² ⩽ 1.
Let K be the reproducing kernel of ℋ. By the representer theorem (Wahba, 1990), the solution lies in the finite-dimensional subspace span{K(Xj, ·) : j = 1, …, N}. Writing u = Σ_{j=1}^N αj K(Xj, ·) with α = (α1, …, αN)^T, this optimization is equivalent to

sup_{α ∈ ℝ^N} (N^{−2} α^T M A(w) M α − λ1 α^T M α)  subject to  N^{−1} α^T M² α ⩽ 1,    (6)

where M is the N × N matrix with (i, j)th element K(Xi, Xj), A(w) = a(w) a(w)^T and a(w) = (T1w1 − 1, …, TNwN − 1)^T. The positive semidefinite matrix M is commonly known as the Gram matrix. Let the eigendecomposition of M be

M = (P1, P2) diag(Q1, Q2) (P1, P2)^T,
where Q1 and Q2 are diagonal matrices. In particular, Q2 = 0, so that M = P1 Q1 P1^T. Let r be the rank of Q1. We remark that P2 and Q2 do not exist if r = N, but the following derivation still holds. Moreover, writing β = Q1 P1^T α and v(w) = N^{−1} P1^T a(w), we have

N^{−2} α^T M A(w) M α = {v(w)^T β}²,   α^T M α = β^T Q1^{−1} β,   N^{−1} α^T M² α = N^{−1} β^T β.

The constrained optimization (6) is then equivalent to

sup_{β ∈ ℝ^r} [{v(w)^T β}² − λ1 β^T Q1^{−1} β]  subject to  N^{−1} β^T β ⩽ 1.
Therefore, the target optimization becomes
ŵ = arg min_{w : wi ⩾ 1} [ N σmax{v(w) v(w)^T − λ1 Q1^{−1}} + λ2 V_N(w) ],    (7)
where σmax(M) represents the maximum eigenvalue of a matrix M. Again, the above minimization is only taken over wi (i : Ti = 1). Since v(w) is an affine transformation of w and V_N is a convex function, the objective function of this minimization is convex with respect to w, due to Proposition 1, whose proof is given in the Supplementary Material. Due to convexity and Slater's condition of strict feasibility, a necessary and sufficient condition for a global minimizer of (7) is the corresponding Karush–Kuhn–Tucker condition using subdifferentials.
Proposition 1
Let B ∈ ℝ^{r×r} be a symmetric matrix. The function σmax(vv^T + B) is convex with respect to v ∈ ℝ^r.
For the computation, note that the maximum eigenvalue is evaluated at a rank-one modification of a diagonal matrix, which can be computed efficiently by solving the secular equation (O'Leary & Stewart, 1990) in a linear algebra package such as LAPACK. The objective function is second-order differentiable with respect to the wi when the maximum eigenvalue in (7) has multiplicity 1. Moreover, the corresponding gradient has a closed-form expression. In this case, a common and fast nonlinear optimization method such as the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm with bound constraints can be applied. Nondifferentiability occurs when the largest two eigenvalues coincide. To ensure validity, one could employ the following two-part computational strategy. First, one applies the Broyden–Fletcher–Goldfarb–Shanno algorithm and checks numerically whether the maximum eigenvalue evaluated at the resulting solution is repeated. If not, the objective function is differentiable at this solution and the Karush–Kuhn–Tucker condition is satisfied. Thus, the minimizer is obtained. Otherwise, the nonlinear eigenvalue optimization method of Overton (1992, § 5), which is applicable to the scenario of repeated eigenvalues, is initialized by the former estimate and then applied. In our practical experience, the second step is seldom needed and has negligible effect on the final solution. Therefore, for fast computation, we only apply the first part in our numerical illustrations.
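A minimal sketch of the first part of this strategy is given below, assuming the penalized objective takes the rank-one-plus-diagonal form sketched in § 2·3, with v(w) = N^{−1} P1^T a(w) and V_N(w) = N^{−1} Σ_i Ti wi²; the gradient uses the leading eigenvector and is valid only when the maximum eigenvalue is simple, and the refinement of Overton (1992) for repeated eigenvalues is omitted.

```python
import numpy as np
from scipy.optimize import minimize

def fit_balancing_weights(M, T, lam1, lam2):
    """Sketch of the eigenvalue optimization for the balancing weights.

    M : N x N Gram matrix; T : 0/1 treatment indicators.
    Assumes the objective N * sigma_max(v v^T - lam1 * Q1^{-1}) + lam2 * V_N(w),
    with v(w) = N^{-1} P1^T a(w), a(w) = (T_1 w_1 - 1, ..., T_N w_N - 1)^T and
    V_N(w) = N^{-1} sum_i T_i w_i^2, as in the representation sketched in Sec. 2.3.
    """
    N = len(T)
    evals, evecs = np.linalg.eigh(M)
    keep = evals > 1e-10 * evals.max()              # numerical rank of the Gram matrix
    P1, q1 = evecs[:, keep], evals[keep]
    D = -lam1 / q1                                  # diagonal part, -lam1 * Q1^{-1}
    idx = np.where(T == 1)[0]

    def objective(w_t):
        w = np.zeros(N)
        w[idx] = w_t
        a = T * w - 1.0
        v = P1.T @ a / N
        C = np.outer(v, v) + np.diag(D)             # rank-one modification of a diagonal matrix
        eigval, eigvec = np.linalg.eigh(C)
        top, q = eigval[-1], eigvec[:, -1]
        grad = P1[idx, :] @ (2.0 * (q @ v) * q)     # d/dw_i of N * sigma_max, simple-eigenvalue case
        grad += 2.0 * lam2 * w_t / N                # d/dw_i of lam2 * V_N(w)
        return N * top + lam2 * np.sum(w_t ** 2) / N, grad

    res = minimize(objective, x0=np.full(len(idx), 2.0), jac=True,
                   method="L-BFGS-B", bounds=[(1.0, None)] * len(idx))
    w = np.zeros(N)
    w[idx] = res.x
    return w
```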
2·4. Theoretical properties
For notational simplicity, we shall study the theoretical properties of the proposed estimator with ℋ being the tensor product of ℓth-order Sobolev spaces, as studied extensively in smoothing splines (Wahba, 1990; Gu, 2013). Our results can be extended to other choices of ℋ if an entropy result and a uniform boundedness condition for the unit ball are supplied; see the Supplementary Material. For instance, an entropy result for the Gaussian reproducing-kernel Hilbert space can be obtained from Zhou (2002). As mentioned, we concentrate on E{Y(1)}. Similar conditions are required for E{Y(0)} to obtain results on the average treatment effect τ = E{Y(1) − Y(0)}.
Assumption 1
The propensity π(·) is uniformly bounded away from zero. That is, there exists a constant C such that 1/π(x) ⩽ C < ∞ for all x in the support of X.
Assumption 2
The ratio d/ℓ is less than 2.
Assumption 3
The regression function m(·) belongs to ℋ.
Assumption 4
The errors {εi} are uncorrelated, with E(εi) = 0 and σi² ⩽ σ² < ∞ for all i = 1, …, N. Further, the {εi} are independent of {Ti} and {Xi}.
The above assumptions are very mild. Assumption 1 is the usual overlap condition required for identification. There are no additional smoothness assumptions on π(·), which would typically be required in propensity score or covariate balancing methods (Hirano et al., 2003; Chan et al., 2016). Assumption 2 corresponds to the weakest smoothness assumption on m(·) in smoothing spline regression. We use the notation A_N ≍ B_N to represent A_N = O(B_N) and B_N = O(A_N) for sequences A_N and B_N.
Theorem 1
Suppose Assumptions 1–3 hold. If λ1 ≍ N^{−1} and λ2 = O(N^{−1}), then S_N(ŵ, m) = Op(N^{−1}). If λ1 ≍ N^{−1} and λ2 ≍ N^{−1}, then V_N(ŵ) = Op(1) and there exist constants W > 0 and S2 > 0 such that E{V_N(ŵ)} ⩽ W and E{N S_N(ŵ, m)} ⩽ S2.
Theorem 1 supplies the rate of convergence of the first term in (3), and boundedness of the expectation of the second term in (3). Convergence of S_N(ŵ, m) is guaranteed even if λ2 is chosen to be 0. However, to ensure boundedness of E{V_N(ŵ)}, additional regularization is needed and hence λ2 > 0 is proposed. The following theorem establishes the N^{1/2}-consistency of the weighting estimator. Moreover, we show that the asymptotic distribution has finite variance.
Theorem 2
Suppose Assumptions 1–4 hold. If λ1 ≍ N^{−1} and λ2 ≍ N^{−1}, then

N^{−1} Σ_{i=1}^N Ti ŵi Yi − E{Y(1)} = Op(N^{−1/2}).

Moreover, the limiting distribution of N^{1/2}[N^{−1} Σ_{i=1}^N Ti ŵi Yi − E{Y(1)}] has finite variance.
Although Theorem 2 only gives the rate of convergence of the estimator, it is stronger than recent results for other kernel-based methods for the estimation of average treatment effects. Zhao (2017) and Hazlett (2016) do not provide the rates of convergence of their estimators. To our knowledge, the only paper that contains a rate of convergence for kernel-based methods is Kallus (2017b), who showed a root-N convergence rate under the assumption that m(X) is linear in X and did not give the asymptotic distribution. In fact, when linear assumptions hold, parametric covariate balancing is sufficient for estimating the average treatment effects (Qin & Zhang, 2007). When m(·) is a general function, the difficulty in theoretical development lies in the first term of (3), which is shown to attain the same rate of convergence as the other two terms of (3), but whose asymptotic distribution is not available. For the sieve-based method (Chan et al., 2016), the growth rate of the sieve approximation space can be carefully chosen in a range such that terms analogous to the first term of (3) have a faster convergence rate than the dominating terms. In our case, similar to nonparametric regression, there is only a particular growth rate of λ1 such that the bias and variance of the first term of (3) are balanced. In fact, it is possible that the first term of (3) has an asymptotic bias of order N^{−1/2}. In § 2·6, a modified estimator is studied by debiasing the first term of (3), so that its rate of convergence is faster than N^{−1/2} and is dominated by the other terms. In that case, the asymptotic distribution can be derived. Further discussion of the relationship between Theorem 2 and the literature is given in Remark 3.
2·5. Tuning parameter selection
In Theorems 1 and 2, λ1 and λ2 are required to decrease at the same order N−1, so as to achieve the desired asymptotic results. To reduce the amount of tuning, we choose λ2 = ζλ1 where ζ > 0 is fixed. As explained above, λ2 is chosen to be positive mostly to ensure the boundedness of E{VN(ŵ)}. From our practical experience, the term VN(ŵ) is usually stable and does not take large values even if λ2 is small. Therefore, we are inclined to choose a small ζ. In all of our numerical illustrations, ζ is fixed at 0.01. Now we focus on the choice of λ1. Tuning λ1 is similar to choosing the dimension of the sieve space in Chan et al. (2016), which is a difficult and mostly unsolved problem. In this paper, we do not attempt to solve it rigorously, but just to provide a reasonable solution.
By Lagrange multipliers, the inner supremum in (5) is equivalent, for some γ > 0, to maximizing S_N(w, u) over {u ∈ ℋ : ‖u‖_N ⩽ 1, ‖u‖²_ℋ ⩽ γ}, where there exists a correspondence between γ and λ1. Since a larger regularization parameter corresponds to a stricter constraint, γ decreases with λ1. We use

B_N(w) = sup{S_N(w, u) : u ∈ ℋ, ‖u‖_N ⩽ 1, ‖u‖²_ℋ ⩽ γ}    (8)
as a measure of the balancing error over this class of functions with respect to the weights w. Due to the large subset of functions to balance, B_N(ŵ) is large when γ is large or, equivalently, when λ1 is small. When γ decreases, or equivalently when λ1 increases, B_N(ŵ) typically decreases to approximately zero, as the resulting weight ŵ approximately balances the whole subset of functions. An example is given in § 3·2. When this happens, a further decrease of γ would not lead to any significant decrease in B_N(ŵ). The key idea is to choose the smallest λ1 that achieves such approximate balancing, to ensure the largest subset of functions being well balanced. In practice, we compute our estimator with respect to a grid of values λ1^(1) < ⋯ < λ1^(J). Write ŵ^(j) as the estimator with respect to λ1^(j). We select λ1^(j∗) as our choice of λ1 if j∗ is the smallest j such that

{B_N(ŵ^(j+1)) − B_N(ŵ^(j))} / B_N(ŵ^(j)) ⩾ c,

where c is chosen as a negative constant of small magnitude. In the numerical illustrations, we set c = −10^{−6}.
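One plausible reading of this selection rule is sketched below; fit_weights and balancing_error are hypothetical helpers that solve (7) for a given λ1 and evaluate B_N in (8), and the stopping rule compares relative changes across the grid with the threshold c.

```python
import numpy as np

def select_lambda1(lambda_grid, fit_weights, balancing_error, c=-1e-6):
    """Choose the smallest lambda_1 beyond which B_N stops decreasing appreciably.

    lambda_grid     : increasing candidate values lambda_1^(1) < ... < lambda_1^(J).
    fit_weights     : hypothetical helper, lambda_1 -> fitted weights w_hat.
    balancing_error : hypothetical helper, w_hat -> B_N(w_hat) as in (8).
    """
    B = [balancing_error(fit_weights(lam)) for lam in lambda_grid]
    for j in range(len(B) - 1):
        if B[j] == 0 or (B[j + 1] - B[j]) / B[j] >= c:   # relative decrease below |c|
            return lambda_grid[j]
    return lambda_grid[-1]
```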
2·6. An efficient modified estimator
Since the outcome regression function m(·) is assumed to be in a reproducing-kernel Hilbert space ℋ, a kernel-based estimator m̂, such as smoothing splines (Gu, 2013), can be employed, and N^{−1} Σ_{i=1}^N m̂(Xi) is a natural estimator of E{Y(1)} = E[E{Y(1) | X}] = E{m(X)}. However, since randomization is administered before collecting any outcome data, Rubin (2007) advocated the estimation of treatment effects without using outcome data to avoid data snooping. On the other hand, Chernozhukov et al. (2017) and Athey et al. (2017) advocate the use of an estimated outcome regression function to improve the theoretical results in high-dimensional settings. Inspired by these results, we modify the weighting estimator by subtracting N^{−1} Σ_{i=1}^N (Ti ŵi − 1) m̂(Xi) from both sides of (3), so that the first term in the decomposition becomes N^{−1} Σ_{i=1}^N (Ti ŵi − 1){m(Xi) − m̂(Xi)}, while the remaining two terms are unchanged. It can then be shown that the first term has a rate of convergence faster than N^{−1/2} under mild assumptions, and the asymptotic distribution of the resulting estimator will be derived.
The estimator takes the form

N^{−1} Σ_{i=1}^N m̂(Xi) + N^{−1} Σ_{i=1}^N Ti ŵi {Yi − m̂(Xi)},

which has the same form as the residual balancing estimator proposed in Athey et al. (2017). They considered a different setting of the high-dimensional linear regression model with sparsity assumptions, and showed that their estimator attains the semiparametric efficiency bound.
Our analysis requires the additional technical assumption that max_{i : Ti = 1} ŵi is op(N^{1/2}). To achieve this, we adopt an assumption like the one in Athey et al. (2017), that ŵi ⩽ BN^{1/3} for a prespecified large positive constant B. This can easily be enforced in the optimization (7) together with the constraint ŵi ⩾ 1. We call this the modified estimator.
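A sketch of the modified estimator in the augmented form described above is given below; fit_outcome_regression is a hypothetical helper standing in for any kernel-based outcome regression fit, such as a smoothing spline, and w_hat denotes weights obtained from (7) with the additional cap.

```python
import numpy as np

def modified_estimate(X, T, Y, w_hat, fit_outcome_regression):
    """Modified (augmented) estimate of E{Y(1)}:
         N^{-1} sum_i m_hat(X_i) + N^{-1} sum_i T_i w_hat_i {Y_i - m_hat(X_i)}.
    fit_outcome_regression is a hypothetical helper returning predictions m_hat(X_i)
    at all X_i from a fit to the treated observations.
    """
    m_hat = fit_outcome_regression(X[T == 1], Y[T == 1], X)   # predictions at every X_i
    return np.mean(m_hat) + np.mean(T * w_hat * (Y - m_hat))
```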
Theorem 3
Suppose Assumptions 1, 2 and 4 hold with . Also, let max_i E|εi|³ < ∞. Let h = m̂ − m be such that ‖h‖_N = op(1) and . Further, assume λ1 = o(N^{−1}), , and . Write
where F, G1, …, GN are independent and identically distributed standard normal random variables independent of X1, …, XN, T1, …, TN and ε1, …, εN. Let ψN and be the corresponding characteristic functions of JN and . Then as N → ∞,
where is twice differentiable, and
(9) |
where V = E{1/π(X1)}.
Corollary 1
Under the assumptions of Theorem 3,
converges in distribution to a standard normal variable as N → ∞.
Remark 1
In Theorem 3, the estimand is E{Y(1)}, whereas in Corollary 1 the estimand is a finite-sample conditional average, N^{−1} Σ_{i=1}^N m(Xi). Athey et al. (2017) considered a finite-sample conditional average treatment effect and obtained a result similar to Corollary 1. Normalization is possible in Corollary 1 following a conditional central limit theorem, since the normalizing quantity depends only on (Ti, Xi) (i = 1, …, N) and can be treated as constant upon conditioning. To derive the limiting distribution of JN in Theorem 3, one cannot use a similar normalization because the handling of the extra term N^{−1} Σ_{i=1}^N m(Xi) − E{m(X)} requires averaging across the distribution of X. If the conditional variance converged to a constant in probability, one could use Slutsky's theorem to prove asymptotic normality of JN. Theorem 3 requires a partially conditional central limit theorem, which is proven in the Supplementary Material, and the distribution of JN can be approximated by a weighted sum of independent standard normal random variables. The asymptotic variance is bounded above by the right-hand side of (9), which is the semiparametric efficiency bound (Robins et al., 1994; Hahn, 1998).
Remark 2
Compared to Theorem 2, Theorem 3 requires different conditions on the orders of λ1 and λ2. These order specifications, together with a diminishing ‖h‖_N, allow a direct asymptotic comparison between V_N(ŵ) and V, which is essential for (9). To make sense of the theorem, the conditions λ1 = o(N^{−1}) and the order requirements on λ2 should not lead to a null set of λ1. As an illustration, suppose m̂ achieves the optimal rate ‖h‖_N ≍ N^{−ℓ/(2ℓ+d)}; then one can take λ2 = o{N^{−d/(2ℓ+d)}}, which leaves an admissible range for λ1. Due to Assumption 2, (d² + 4ℓ²)(d² + 2ℓd)^{−1} > 1, since (d² + 4ℓ²) − (d² + 2ℓd) = 2ℓ(2ℓ − d) > 0. Therefore, there exist choices of λ1 and λ2 that fulfil the assumptions of Theorem 3. We found in simulations that the practical performance of the modified estimator is not sensitive to λ1 and λ2, and we thus use the method described in § 2·5 to obtain these tuning parameters.
Remark 3
Most existing efficient methods require explicit or implicit estimation of both π(·) and m(·). Chernozhukov et al. (2017) gave a general result on the required convergence rates of estimators of π(·) and m(·) for efficient estimation. Even though weighting methods do not explicitly estimate m(·), estimating equation-based methods would give implicit estimators of m(·) that attain good rates of convergence (Hirano et al., 2003; Chan et al., 2016). However, weights constructed based on complex optimization problems may not even converge to the true inverse propensities; see Athey et al. (2017) who, under a sparse linear model assumption, proposed an efficient estimator by controlling the balancing error of linear functions and the estimation error for m(·). Although our modified estimator is not a direct kernel-based extension of their method, we have arrived at a similar conclusion. Our method only requires ‖h‖_N = op(1) and does not require smoothness of π(·) or linearity of m(·), and is therefore less vulnerable to the curse of dimensionality. The weighting estimator as described in Theorem 2 corresponds to m̂ ≡ 0, and therefore ‖h‖_N is not op(1) when m(·) is not the zero function. Theorem 2 is interesting because convergence to neither π(·) nor m(·) is established.
3. Numerical examples
3·1. Simulation study
Simulation studies were conducted to evaluate the finite-sample performance of the proposed estimator. We considered settings where the propensity score and outcome regression models are nonlinear functions of the observed covariates, with possibly nonsmooth propensity score functions. For each observation, we generated a ten-dimensional multivariate standard Gaussian random vector Z = (Z1, …, Z10)^T. The observed covariates are X = (X1, …, X10)^T, where X1 = exp(Z1/2), X2 = Z2/{1 + exp(Z1)}, X3 = (Z1Z3/25 + 0.6)^3, X4 = (Z2 + Z4 + 20)^2 and Xj = Zj (j = 5, …, 10). Three propensity score models are studied: model 1 is pr(T = 1 | Z) = exp(−Z1 − 0.1Z4)/{1 + exp(−Z1 − 0.1Z4)}, model 2 is , and model 3 is , where , η2 is the scaling function of the Daubechies 4-tap wavelet (Daubechies, 1992), and η3 is the Weierstrass function with parameters a = 2 and b = 13. The functions η2 and η3 are chosen such that the propensity functions in models 2 and 3 are nonsmooth. Two outcome regression models are studied: model A is Y = 210 + (1.5T − 0.5)(27.4Z1 + 13.7Z2 + 13.7Z3 + 13.7Z4) + ε, and model B is , where ε follows the standard normal distribution. For each scenario, we compared the proposed weighting and modified estimators using the second-order Sobolev kernel and the Gaussian kernel with bandwidth parameter chosen via the median heuristic (Gretton et al., 2005). We also compared the Horvitz–Thompson estimator, where the weights are the inverse of propensity scores estimated by maximum likelihood under a working logistic regression model with X being the predictors; the Hájek estimator, which is a normalized version of the Horvitz–Thompson estimator with weights summing to N; the inverse probability weighting estimator using the covariate balancing propensity score of Imai & Ratkovic (2014) or the stable balancing weights of Zubizarreta (2015); and the nonparametric covariate balancing estimator of Chan et al. (2016) with exponential weights. The first moment of X was balanced explicitly for these methods.

We compared the bias, root mean squared error and covariate balance of the methods, where covariate balance is evaluated at the true conditional mean function. In particular, we calculate S_N(ŵ, m) to evaluate the covariate balance between the treated group and the combined sample, as well as its counterpart for the controls and the combined sample, and report the sum of these two measures. The reason for comparing the covariate balance at the true conditional mean function is that it is the optimal function to balance but is unknown in practice. For each scenario, 1000 independent datasets are generated, and the results for outcome models A and B with sample size N = 200 are given in Tables 1 and 2 respectively.
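For reference, the covariate transformation and the fully specified parts of the design, propensity model 1 and outcome model A, can be generated as in the sketch below; models 2, 3 and B involve the nonsmooth functions described above and are omitted here.

```python
import numpy as np

def generate_data(N, rng):
    """Covariates with propensity model 1 and outcome model A of the simulation design."""
    Z = rng.standard_normal((N, 10))
    X = np.column_stack([
        np.exp(Z[:, 0] / 2),                      # X1
        Z[:, 1] / (1 + np.exp(Z[:, 0])),          # X2
        (Z[:, 0] * Z[:, 2] / 25 + 0.6) ** 3,      # X3
        (Z[:, 1] + Z[:, 3] + 20) ** 2,            # X4
        Z[:, 4:10],                               # X5, ..., X10
    ])
    lin = -Z[:, 0] - 0.1 * Z[:, 3]
    pscore = np.exp(lin) / (1 + np.exp(lin))      # propensity model 1
    T = rng.binomial(1, pscore)
    eps = rng.standard_normal(N)
    Y = (210 + (1.5 * T - 0.5) * (27.4 * Z[:, 0] + 13.7 * Z[:, 1]
                                  + 13.7 * Z[:, 2] + 13.7 * Z[:, 3]) + eps)  # outcome model A
    return X, T, Y
```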
Table 1.
Biases, root mean square errors and overall covariate balancing measures of weighting estimators for outcome model A; the numbers reported are averages obtained from 1000 simulated datasets
 | PS1 | | | PS2 | | | PS3 | | |
---|---|---|---|---|---|---|---|---|---|
 | Bias | RMSE | Bal | Bias | RMSE | Bal | Bias | RMSE | Bal |
Proposed weighting (S) | −192 | 470 | 28 | −114 | 422 | 20 | −86 | 420 | 21 |
Proposed weighting (G) | −362 | 599 | 86 | −232 | 505 | 58 | −201 | 504 | 67 |
Proposed modified (S) | 150 | 512 | 28 | 243 | 486 | 20 | 255 | 480 | 21 |
Proposed modified (G) | −91 | 378 | 86 | 45 | 368 | 58 | −28 | 370 | 67 |
Horvitz–Thompson | >9999 | >9999 | >9999 | >9999 | >9999 | >9999 | 7298 | >9999 | >9999 |
Hájek | 837 | 1792 | 275 | 693 | 1429 | 165 | 662 | 1480 | 175 |
Imai–Ratkovic | −527 | 1720 | 331 | −187 | 1523 | 274 | 224 | 1516 | 253 |
Zubizarreta | 444 | 715 | 28 | 389 | 658 | 23 | 381 | 630 | 21 |
Chan et al. | 262 | 594 | 21 | 226 | 549 | 17 | 244 | 533 | 16 |
The sample sizes were N = 200. The values of Bias and RMSE were multiplied by 100. PS1–PS3, propensity score models 1–3; RMSE, root mean squared error; Bal, overall covariate balancing measure; S, Sobolev kernel; G, Gaussian kernel.
Table 2.
Biases, root mean square errors and overall covariate balancing measures of weighting estimators for outcome model B; the numbers reported are averages obtained from 1000 simulated datasets
 | PS1 | | | PS2 | | | PS3 | | |
---|---|---|---|---|---|---|---|---|---|
 | Bias | RMSE | Bal | Bias | RMSE | Bal | Bias | RMSE | Bal |
Proposed weighting (S) | −10 | 85 | 1 | −7 | 81 | 0 | −8 | 85 | 0 |
Proposed weighting (G) | −7 | 92 | 1 | −3 | 88 | 0 | −5 | 94 | 0 |
Proposed modified (S) | 3 | 82 | 1 | −8 | 82 | 0 | −9 | 85 | 0 |
Proposed modified (G) | −6 | 89 | 1 | −2 | 85 | 0 | −4 | 92 | 0 |
Horvitz–Thompson | 131 | 3629 | 1151 | −1 | 881 | 66 | −35 | 2092 | 451 |
Hájek | 4 | 392 | 14 | −3 | 221 | 4 | −2 | 367 | 12 |
Imai–Ratkovic | −7 | 110 | 1 | −6 | 97 | 1 | −6 | 108 | 1 |
Zubizarreta | −8 | 115 | 1 | −9 | 98 | 1 | −10 | 108 | 1 |
Chan et al. | −8 | 123 | 1 | −9 | 99 | 1 | −8 | 111 | 1 |
The sample sizes were N = 200. The values of Bias and RMSE were multiplied by 100. PS1–PS3, propensity score models 1–3; RMSE, root mean squared error; Bal, overall covariate balancing measure; S, Sobolev kernel; G, Gaussian kernel.
The results show that the empirical performances of the estimators are related to the degree of covariate balancing. Without any explicit covariate balancing, the Horvitz–Thompson estimator can be highly unstable. The Hájek estimator balances the constant function; the Imai–Ratkovic estimator balances X; the estimators of Zubizarreta and Chan et al. balance both the constant function and X. For outcome model A, the balance of both the constant function and X is important and the omission of either constraint can lead to poor performance. For outcome model B, the balance of X often implies approximate balance of the constant, and therefore the estimators of Imai and Ratkovic, Zubizarreta and Chan et al. performed similarly. However, in both cases, the proposed method outperformed the other estimators because it can also control the balance of nonlinear and higher-order moments. We attempted to compute a Horvitz–Thompson estimator using a smoothing spline logistic regression model with the same kernel as the proposed method using the R package gss, but the program did not converge in a reasonable time. We also tried to exactly balance the second moments in addition to the first moments of ten baseline covariates in the existing methods, but the algorithms did not converge in a substantial fraction of simulations. This shortcoming of the existing methods can be circumvented by using the proposed methods.
3·2. Data analysis
We compare the proposed methods with others using a study of the impact of child abduction by a militant group on the future income of abductees who escape later (Blattman & Annan, 2010). The data were collected from 741 males in Uganda during 2005–2006, of whom 462 had been abducted by militant groups before 2005 but had escaped by the time of the study. Covariates include geographical region, age in 1996, father's education, mother's education, whether the parents had died during or before 1996, whether the father is a farmer, and household size in 1996. The investigators chose to collect covariate values in 1996 because it predates most abductions and is also easily recalled as the year of the first election since 1980. The authors discuss the plausibility of the unconfounded treatment assignment, since abduction is mostly due to random night raids on rural homes. The outcome of interest here is the daily wage of the study participants in Ugandan shillings in 2005. We compared the estimators as in the simulation studies in § 3·1. Table 3 shows the point estimates and 95% confidence intervals based on 1000 bootstrap samples. All methods comparing the abducted to the non-abducted group give a small but nonsignificant decrease in income. However, a small difference is observed between the proposed method and other methods, indicating that a mild nonlinear effect is possibly present, especially in the non-abducted group. To further illustrate this point, we compared the maximal balancing error B_N(w) as a function of λ1, which, as defined by (8), measures the balancing error as a function of the size of nested subsets of ℋ, which is chosen as the Sobolev space. The subset is smaller with an increasing λ1, containing smoother functions. We standardize the comparisons by dividing by the balancing error of constant weights that are used in unweighted comparisons. As seen in Fig. 1, the proposed estimator had approximately no balancing error after reaching a data-dependent threshold, so that any smoother function can be approximately balanced. This is not the case for the other estimators, since there will be residual imbalance for nonlinear functions of a given smoothness. The Imai–Ratkovic estimator has less balancing error than the Horvitz–Thompson estimator with maximum likelihood weights, because the former explicitly balances more moments than the latter, including linear and some nonlinear covariate functionals. The Imai–Ratkovic estimator gives the closest result to the proposed estimators.
Table 3.
Dependence of daily wage (Ugandan shillings) of males on abduction as a child, Y(1), and non-abduction, Y(0)
 | E{Y(1)} | E{Y(0)} | τ |
---|---|---|---|
Proposed weighting (S) | 1530 (1219, 1886) | 1851 (1247, 2354) | −321 (−893, 431) |
Proposed weighting (G) | 1516 (1212, 1822) | 1671 (1231, 2204) | −156 (−809, 382) |
Proposed modified (S) | 1532 (1239, 1945) | 1867 (1256, 2489) | −355 (−993, 444) |
Proposed modified (G) | 1536 (1238, 1845) | 1689 (1269, 2249) | −153 (−819, 380) |
Horvitz–Thompson | 1573 (1234, 2033) | 2135 (1478, 3075) | −562 (−1667, 242) |
Hájek | 1573 (1234, 2027) | 2131 (1471, 3064) | −558 (−1614, 241) |
Imai–Ratkovic | 1599 (1256, 2062) | 1998 (1381, 2857) | −399 (−1312, 365) |
Zubizarreta | 1591 (1253, 2073) | 2165 (1492, 3046) | −574 (−1613, 229) |
Chan et al. | 1580 (1246, 2060) | 2144 (1485, 3034) | −564 (−1576, 238) |
Numbers in parentheses are 95% confidence intervals. S, Sobolev kernel; G, Gaussian kernel.
Fig. 1.
Supremum balancing error for the child abduction data. Thick solid curves correspond to the proposed estimator using a Sobolev kernel, thin solid curves correspond to the proposed estimator using a Gaussian kernel, dashed curves correspond to the Horvitz–Thompson estimator, and dotted curves correspond to the Imai–Ratkovic estimator.
Acknowledgments
The authors thank the editor, an associate editor and a reviewer for their helpful comments and suggestions. Wong was partially supported by the U.S. National Science Foundation. Most of this work was conducted while the first author was affiliated with Iowa State University. Chan was partially supported by the National Heart, Lung, and Blood Institute of the U.S. National Institutes of Health and by the U.S. National Science Foundation.
Footnotes
Supplementary material
Supplementary material available at Biometrika online includes the proofs of Proposition 1 and Theorems 1–3.
Contributor Information
RAYMOND K. W. WONG, Department of Statistics, Texas A&M University, 401E Blocker Building, 155 Ireland Street, College Station, Texas 77843, U.S.A
KWUN CHUEN GARY CHAN, Department of Biostatistics, University of Washington, 1959 NE Pacific St., Seattle, Washington 98195, U.S.A.
References
- Aronszajn N. Theory of reproducing kernels. Trans Am Math Soc. 1950;68:337–404.
- Athey S, Imbens G, Wager S. Approximate residual balancing: de-biased inference of average treatment effects in high dimensions. 2017. arXiv: 1604.07125v4.
- Blattman C, Annan J. The consequences of child soldiering. Rev Econ Statist. 2010;92:882–98.
- Chan KCG, Yam SCP, Zhang Z. Globally efficient nonparametric inference of average treatment effects by empirical balancing calibration weighting. J R Statist Soc B. 2016;78:673–700. doi: 10.1111/rssb.12129.
- Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, Robins J. Double/debiased machine learning for treatment and causal parameters. 2017. arXiv: 1608.0060v5.
- Daubechies I. Ten Lectures on Wavelets. Philadelphia: SIAM; 1992.
- Graham BS, Pinto CCDX, Egel D. Inverse probability tilting for moment condition models with missing data. Rev Econ Studies. 2012;79:1053–79.
- Gretton A, Herbrich R, Smola A, Bousquet O, Schölkopf B. Kernel methods for measuring independence. J Mach Learn Res. 2005;6:2075–129.
- Gu C. Smoothing Spline ANOVA Models. New York: Springer; 2013.
- Hahn J. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica. 1998;66:315–32.
- Hainmueller J. Entropy balancing for causal effects: a multivariate reweighting method to produce balanced samples in observational studies. Polit Anal. 2012;20:25–46.
- Han P, Wang L. Estimation with missing data: beyond double robustness. Biometrika. 2013;100:417–30.
- Hazlett C. Kernel balancing: a flexible non-parametric weighting procedure for estimating causal effects. 2016. arXiv: 1605.00155v1.
- Hellerstein JK, Imbens GW. Imposing moment restrictions from auxiliary data by weighting. Rev Econ Statist. 1999;81:1–14.
- Hirano K, Imbens G, Ridder G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica. 2003;71:1161–89.
- Iacus SM, King G, Porro G. Multivariate matching methods that are monotonic imbalance bounding. J Am Statist Assoc. 2011;106:345–61.
- Imai K, Ratkovic M. Covariate balancing propensity score. J R Statist Soc B. 2014;76:243–63.
- Kallus N. A framework for optimal matching for causal inference. 2017a. arXiv: 1606.05188v2.
- Kallus N. Generalized optimal matching methods for causal inference. 2017b. arXiv: 1612.08321v3.
- Kang JD, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data (with Discussion). Statist Sci. 2007;22:523–39. doi: 10.1214/07-STS227.
- O'Leary D, Stewart G. Computing the eigenvalues and eigenvectors of symmetric arrowhead matrices. J Comp Phys. 1990;90:497–505.
- Overton ML. Large-scale optimization of eigenvalues. SIAM J Optimiz. 1992;2:88–120.
- Qin J, Zhang B. Empirical-likelihood-based inference in missing response problems and its application in observational studies. J R Statist Soc B. 2007;69:101–22.
- Robins J, Rotnitzky A, Zhao L. Estimation of regression coefficients when some regressors are not always observed. J Am Statist Assoc. 1994;89:846–66.
- Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55.
- Rubin DB. The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Statist Med. 2007;26:20–36. doi: 10.1002/sim.2739.
- Tan Z. Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika. 2010;97:661–82.
- Wahba G. Spline Models for Observational Data. Philadelphia: SIAM; 1990.
- Zhao Q. Covariate balancing propensity score by tailored loss functions. 2017. arXiv: 1601.05890v3.
- Zhou DX. The covering number in learning theory. J Complex. 2002;18:739–67.
- Zubizarreta JR. Stable weights that balance covariates for estimation with incomplete outcome data. J Am Statist Assoc. 2015;110:910–22.