Combining Multiple Observational Data Sources to Estimate Causal Effects

Shu Yang; Peng Ding

doi:10.1080/01621459.2019.1609973

. Author manuscript; available in PMC: 2021 Jan 1.

Published in final edited form as: J Am Stat Assoc. 2019 Jun 11;115(531):1540–1554. doi: 10.1080/01621459.2019.1609973

Combining Multiple Observational Data Sources to Estimate Causal Effects

Shu Yang ^a, Peng Ding ^b

PMCID: PMC7571608 NIHMSID: NIHMS1050623 PMID: 33088006

Abstract

The era of big data has witnessed an increasing availability of multiple data sources for statistical analyses. We consider estimation of causal effects combining big main data with unmeasured confounders and smaller validation data with supplementary information on these confounders. Under the unconfoundedness assumption with completely observed confounders, the smaller validation data allow for constructing consistent estimators for causal effects, but the big main data can only give error-prone estimators in general. However, by leveraging the information in the big main data in a principled way, we can improve the estimation efficiencies yet preserve the consistencies of the initial estimators based solely on the validation data. Our framework applies to asymptotically normal estimators, including the commonly used regression imputation, weighting, and matching estimators, and does not require a correct specification of the model relating the unmeasured confounders to the observed variables. We also propose appropriate bootstrap procedures, which makes our method straightforward to implement using software routines for existing estimators. Supplementary materials for this article are available online.

Keywords: Calibration, Causal inference, Inverse probability weighting, Missing confounder, Two-phase sampling

1. Introduction

Unmeasured confounding is an important and common problem in observational studies. Many methods have been proposed to deal with unmeasured confounding in causal inference, such as sensitivity analyses (e.g., Rosenbaum and Rubin 1983a), instrumental variable approaches (e.g., Angrist, Imbens, and Rubin 1996). However, sensitivity analyses cannot provide point estimation, and valid instrumental variables are often difficult to find in practice. We consider the setting where external validation data provide additional information on unmeasured confounders. To be more precise, the study includes a large main dataset representing the population of interest with unmeasured confounders and a smaller validation dataset with additional information about these confounders.

Our framework covers two common types of studies. First, we have a large main dataset, and then collect more information on unmeasured confounders for a subset of units, for example, using a two-phase sampling design (Neyman 1938; Cochran 2007; Wang et al. 2009). Second, we have a smaller but carefully designed validation dataset with rich covariates, and then link it to a larger main dataset with fewer covariates. The second type of data is now ubiquitous. In the era of big data, extremely large data have become available for research purposes, such as electronic health records, claims databases, disease data registries, census data, and to name a few (e.g., Imbens and Lancaster 1994; Schneeweiss et al. 2005; Chatterjee et al. 2016). Although these datasets might not contain full confounder information that guarantees consistent causal effect estimation, they can be useful to increase efficiencies of statistical analyses.

In causal inference, Stürmer et al. (2005) proposed a propensity score calibration method when the main data contain the outcome and an error-prone propensity score based on partial confounders, and the validation data supplement a gold standard propensity score based on all confounders. Stürmer et al. (2005) then applied a regression calibration technique to correct for the measurement error from the error-prone propensity score. This approach does not require the validation data to contain the outcome variable. However, this approach relies on the surrogacy property entailing that the outcome variable is conditionally independent of the error-prone propensity score given the gold standard propensity score and treatment. This surrogacy property is difficult to justify in practice, and its violations can lead to substantial biases (Stürmer et al. 2007; Lunt et al. 2012). Under the Bayesian framework, McCandless, Richardson, and Best (2012) specified a full parametric model of the joint distribution for the main and validation data, and treat the gold standard propensity score as a missing variable in the main data. Antonelli, Zigler, and Dominici (2017) combined the ideas of Bayesian model averaging, confounder selection, and missing data imputation into a single framework in this context. Enders et al. (2018) use simulation to show that multiple imputation is more robust than two-phase logistic regression against misspecification of imputation models. Lin and Chen (2014) developed a two-stage calibration method, which summarizes the confounding information through propensity scores and combines the results from the main and validation data. Their two-stage calibration focuses on the regression context with a correctly specified outcome model. Unfortunately, regression parameters, especially in the logistic regression model used by Lin and Chen (2014), may not be the causal parameters of interest in general (Freedman 2008).

In this article, we propose a general framework to estimate causal effects in the setting where the big main data have unmeasured confounders, but the smaller external validation data provide supplementary information on these confounders. Under the assumption of ignorable treatment assignment, causal effects can be identified and estimated from the validation data, using commonly used estimators, such as regression imputation, (augmented) inverse probability weighting (Horvitz and Thompson 1952; Rosenbaum and Rubin 1983b; Robins, Rotnitzky, and Zhao 1994; Bang and Robins 2005; Cao, Tsiatis, and Davidian 2009), and matching (e.g., Rubin 1973; Rosenbaum 1989; Heckman, Ichimura, and Todd 1997; Hirano, Imbens, and Ridder 2003; Hansen 2004; Rubin 2006; Abadie and Imbens 2006; Stuart 2010; Abadie and Imbens 2016). However, these estimators based solely on the validation data may not be efficient. We leverage the correlation between the initial estimator from the validation data and the error-prone estimator from the main data to improve the efficiency over the initial estimator. This idea is similar to the two-stage calibration in Lin and Chen (2014); however, their method focuses only on regression parameters and requires the validation data to be a simple random sample from the main data. Alternatively, the empirical likelihood is also an attractive approach to combine multiple data sources (Chen and Sitter 1999; Qin 2000; Chen, Sitter, and Wu 2002; Chen, Leung, and Qin 2003). However, the empirical likelihood approach needs sophisticated programming, and its computation can be heavy when data become large. Our method is practically simple, because we only need to compute commonly used estimators that can be easily implemented by existing software routines. Moreover, Lin and Chen (2014) and the empirical likelihood approach can only deal with regular and asymptotically linear (RAL) estimators often formulated by moment conditions, but our framework can also deal with non-RAL estimators, such as matching estimators. We also propose a unified bootstrap procedure based on resampling the linear expansions of the estimators, which is simple to implement and works for both RAL and matching estimators.

Furthermore, we relax the assumption that the validation data are a random sample from the study population of interest. We also link the proposed method to existing methods for missing data, viewing the additional confounders as missing values for units outside of the validation data. In contrast to most existing methods in the missing data literature, the proposed method does not need to specify the missing data model relating the unmeasured confounders with the observed variables.

For simplicity of exposition, we use “IID” for “identically and independently distributed,” 1(·) for the indicator function, $ξ^{\otimes 2} = ξ ξ^{T}$ for a vector or matrix $ξ$ “plim” for the probability limit of a random sequence, and $A_{n} ≅ B_{n}$ for two random sequences satisfying $A_{n} = B_{n} + o_{P} (n^{- 1 / 2})$ with n being the generic sample size. We relegate all regularity conditions for asymptotic analyses to the online supplementary material.

2. Basic Setup

2.1. Notation: Causal Effect and Two Data Sources

Following Neyman (1923) and Rubin 1974), we use the potential outcomes framework to define causal effects. Suppose that the treatment is a binary variable A ∈ {0,1}, with 0 and 1 being the labels for control and active treatments, respectively. For each level of treatment a ∈ {0,1}, we assume that there exists a potential outcome Y(a), representing the outcome had the subject, possibly contrary to the fact, been given treatment a. The observed outcome is Y = Y(A) = AY(1) + (1 − A)Y(0). Let a vector of pretreatment covariates be (X, U), where X is observed for all units, but U may not be observed for some units.

Although we can extend our discussion to multiple data sources, for simplicity of exposition, we first consider a study with two data sources. The validation data have observations $O_{2} = {(A_{j}, X_{j}, U_{j}, Y_{j}) : j \in S_{2}}$ with sample size n₂ = | $S_{2}$ |. The main data have observations $O_{1} = {(A_{i}, X_{i}, Y_{i}) : i \in S_{1} \ S_{2}} \cup O_{2}$ with sample size $n_{1} = | S_{1} |$ . In our formulation, we consider the case with $S_{2}$ ⊂ $S_{1}$ , and let $ρ = \lim_{n_{2} \to \infty} n_{2} / n_{1} \in [0,1]$ . If one has two separate main and validation datasets, the main dataset in our context combines these two datasets. Although the main dataset is larger, that is, n₁ > n₂, it does not contain full information on important covariates U. Under a superpopulation model, we assume that {A_i, X_i, U_i, Y_i(0), Y_i (1) : i ∈ $S_{1}$ } are IID for all i ∈ $S_{1}$ , and therefore the observations in $O_{1}$ are also IID. The following assumption links the main and validation data.

Assumption 1.

The index set $S_{2}$ for the validation data of size n₂ is a simple random sample from $S_{1}$ .

Under Assumption 1, {A_j, X_j, U_j, Y_j(0), Y_j(1) : j ∈ $S_{2}$ } and the observations in $O_{2}$ of the validation data are also IID, respectively. We shall relax Assumption 1 to allow $S_{2}$ to be a general probability sample from $S_{1}$ in Section 7. But Assumption 1 makes the presentation simpler.

Example 1.

Two-phase sampling design is an example that results in the observed data structure. In a study, some variables (e.g., A, X, and Y) may be relatively cheaper, while some variables (e.g., U) are more expensive to obtain. A two-phase sampling design (Neyman 1938; Cochran 2007; Wang et al. 2009) can reduce the cost of the study: in the first phase, the easy-to-obtain variables are measured for all units, and in the second phase, additional expensive variables are measured for a selected validation sample.

Example 2.

Another example is highly relevant in the era of big data, where one links small data with full information on (A, X, U, Y) to external big data with only (A, X, Y). Chatterjee et al. (2016) recently consider this scenario for parametric regression analyses.

Without loss of generality, we first consider the average causal effect (ACE)

τ = E {Y (1) - Y (0)},

(1)

and will discuss extensions to other causal estimands in Section 4.1. Because of the IID assumption, we drop the indices i and j in the expectations in (1) and later equations.

In what follows, we define the conditional means of the outcome as

μ_{a} (X, U) = E (Y | A = a, X, U),

μ_{a} (X) = E (Y | A = a, X),

the conditional variances of the outcome as

σ_{a}^{2} (X, U) = var (Y | A = a, X, U),

σ_{a}^{2} (X) = var (Y | A = a, X),

the conditional probabilities of the treatment as

e (X, U) = P (A = 1 | X, U), e (X) = P (A = 1 | X) .

2.2. Identification and Model Assumptions

A fundamental problem in causal inference is that we can observe at most one potential outcome for a unit. Following Rosenbaum and Rubin (1983b), we make the following assumptions to identify causal effects.

Assumption 2 (Ignorability).

$Y (a) ⫫ A | (X, U)$ for a = 0 and 1.

Under Assumption 2, the treatment assignment is ignorable in $O_{2}$ given (X, U). However, the treatment assignment is only “latent” ignorable in $O_{1} \ O_{2}$ given X and the latent variable U (Frangakis and Rubin 1999; Jin and Rubin 2008).

Moreover, we require adequate overlap between the treatment and control covariate distributions, quantified by the following assumption on the propensity score e(X, U).

Assumption 3 (Overlap).

There exist constants c₁ and c₂ such that with probability 1, $0 < c_{1} \leq e (X, U) \leq c_{2} < 1$ .

Under Assumptions 2 and 3, P{A = 1 | X, U, Y(1)} = P{A = 1 | X, U, Y(0)} = e(X, U), and E{Y(a) | X, U} = E{Y(a) | A = a, X, U} = μ_a(X, U). The ACE τ can then be estimated through regression imputation, inverse probability weighting (IPW), augmented inverse probability weighting (AIPW), or matching. See Rosenbaum (2002), Imbens (2004), and Rubin (2006) for surveys of these estimators.

In practice, the outcome distribution and the propensity score are often unknown and therefore need to be modeled and estimated.

Assumption 4 (Outcome model).

The parametric model $μ_{a} (X, U; β_{a})$ is a correct specification for μ_a(X, U),for a = 0, 1; that is, $μ_{a} (X, U) = μ_{a} (X, U; β_{a}^{*})$ , where $β_{a}^{*}$ is the true model parameter, for a = 0, 1.

Assumption 5 (Propensity score model).

The parametric model e(X, U; α) is a correct specification for e(X, U); that is, $e (X, U) = e (X, U; α^{*})$ , where α^* is the true model parameter.

The consistency of different estimators requires different model assumptions.

3. Methodology and Important Estimators

3.1. Review of Commonly Used Estimators Based on Validation Data

The validation data {(A_j, X_j, U_j, Y_j) : j ∈ $S_{2}$ } contain observations of all confounders (X, U). Therefore, under Assumptions 2 and 3, τ is identifiable and can be estimated by some commonly used estimator solely from the validation data, denoted by ${\hat{τ}}_{2}$ . Although the main data do not contain the full confounding information, we leverage the information on the common variables (A, X, Y) as in the main data to improve the efficiency of ${\hat{τ}}_{2}$ . Before presenting the general theory, we first review important estimators that are widely used in practice.

Let μ_a(X, U; β_a) be a working model for μ_a(X, U), for a = 0, 1, and e(X, U; α) be a working model for e(X, U). We construct consistent estimators ${\hat{β}}_{a} (a = 0, 1)$ and $\hat{α}$ based on $O_{2}$ , with probability limits $β_{a}^{*} (a = 0, 1)$ and α^*, respectively. Under Assumption 4, $μ_{a} (X, U; β_{a}^{*}) = μ_{a} (X, U)$ , and under Assumption 5, e(X, U; α^*) = e(X, U).

Example 3 (Regression imputation).

The regression imputation estimator is ${\hat{τ}}_{reg, 2} = n_{2}^{- 1} \sum_{j \in S_{2}} {\hat{τ}}_{reg, 2, j}$ , where

{\hat{τ}}_{reg, 2, j} = μ_{1} (X_{j}, U_{j}; {\hat{β}}_{1}) - μ_{0} (X_{j}, U_{j}; {\hat{β}}_{0}) .

${\hat{τ}}_{reg, 2}$ is consistent for τ under Assumption 4.

Example 4 (Inverseprobability weighting).

The IPW estimator is ${\hat{τ}}_{IPW, 2} = n_{2}^{- 1} \sum_{j \in S_{2}} {\hat{τ}}_{IPW, 2, j}$ , where

{\hat{τ}}_{IPW, 2, j} = \frac{A_{j} Y_{j}}{e (X_{j}, U_{j}; \hat{α})} - \frac{(1 - A_{j}) Y_{j}}{1 - e (X_{j}, U_{j}; \hat{α})} .

${\hat{τ}}_{IPW, 2}$ is consistent for τ under Assumption 5.

The Horvitz-Thompson-type estimator ${\hat{τ}}_{IPW, 2}$ has large variability, and is often inferior to the Hajek-type estimator (Hájek 1971). We do not present the Hajek-type estimator because we can improve it by the AIPW estimator below. The AIPW estimator employs both the propensity score and the outcome models.

Example 5 (Augmented inverse probability weighting).

Define the residual outcome as $R_{j} = Y_{j} - μ_{1} (X_{j}, U_{j}; {\hat{β}}_{1})$ for treated units and $R_{j} = Y_{j} - μ_{0} (X_{j}, U_{j}; {\hat{β}}_{0})$ for control units. The AIPW estimator is ${\hat{τ}}_{AIPW, 2} = n_{2}^{- 1} \sum_{j \in S_{2}} {\hat{τ}}_{AIPW, 2, j}$ , where

{\hat{τ}}_{AIPW, 2, j} = \frac{A_{j} R_{j}}{e (X_{j}, U_{j}; \hat{α})} + μ_{1} (X_{j}, U_{j}; {\hat{β}}_{1}) - \frac{(1 - A_{j}) R_{j}}{1 - e (X_{j}, U_{j}; \hat{α})} - μ_{0} (X_{j}, U_{j}; {\hat{β}}_{0}) .

(2)

${\hat{τ}}_{AIPW, 2}$ is doubly robust in the sense that it is consistent if either Assumption 4 or 5 holds. Moreover, it is locally efficient if both Assumptions 4 and 5 hold (Bang and Robins 2005; Tsiatis 2006; Cao, Tsiatis, and Davidian 2009).

Matching estimators are also widely used in practice. To fix ideas, we consider matching with replacement with the number of matches fixed at M. Matching estimators hinge on imputing the missing potential outcome for each unit. To be precise, for unit j, the potential outcome under A_j is the observed outcome Y_j; the (counterfactual) potential outcome under 1 − A_j is not observed but can be imputed by the average of the observed outcomes of the nearest M units with 1 − A_j. Let these matched units for unit j be indexed by $J_{d, V, j}$ , where the subscripts d and V denote the dataset $O_{d}$ and the matching variable V (e.g., V = (X, U)), respectively. Without loss of generality, we use the Euclidean distance to determine neighbors; the discussion applies to other distances (Abadie and Imbens 2006). Let $K_{d, V, j} = \sum_{l \in S_{d}} 1 (j \in J_{d, V, l})$ be the number of times that unit j is used as a match based on the matching variable V in $O_{d}$ .

Example 6 (Matching).

Define the imputed potential outcomes as

\begin{array}{l} {\hat{Y}}_{j} (1) = {\begin{array}{l} M^{- 1} \sum_{l \in J_{2, (X, U), j}} Y_{l} & if A_{j} = 0, \\ Y_{j} & if A_{j} = 1, \end{array} \\ {\hat{Y}}_{j} (0) = {\begin{array}{l} Y_{j} & if A_{j} = 0, \\ M^{- 1} \sum_{l \in J_{2, (X, U), j}} Y_{l} & if A_{j} = 1. \end{array} \end{array}

Then the matching estimator of τ is

{\hat{τ}}_{mat, 2}^{(0)} = n_{2}^{- 1} \sum_{j \in S_{2}} {{\hat{Y}}_{j} (1) - {\hat{Y}}_{j} (0)} = n_{2}^{- 1} \sum_{j \in S_{2}} (2 A_{j} - 1) (Y_{j} - M^{- 1} \sum_{l \in J_{2, (X, U), j}} Y_{l}) .

Abadie and Imbens (2006) obtained the decomposition

n_{2}^{1 / 2} ({\hat{τ}}_{mat, 2}^{(0)} - τ) = B_{2} + D_{2},

where

B_{2} = n_{2}^{- 1 / 2} \sum_{j \in S_{2}} (2 A_{j} - 1) \times [M^{- 1} \sum_{l \in J_{2}, (X, U), j} {μ_{1 - A_{j}} (X_{j}, U_{j}) - μ_{1 - A_{j}} (X_{l}, U_{l})}],

(3)

D_{2} = n_{2}^{- 1 / 2} \sum_{j \in S_{2}} [μ_{1} (X_{j}, U_{j}) - μ_{0} (X_{j}, U_{j}) - τ + (2 A_{j} - 1) {1 + M^{- 1} K_{2, (X, U), j}} {Y_{j} - μ_{A_{j}} (X_{j}, U_{j})}] .

The difference $μ_{1 - A_{j}} (X_{j}, U_{j}) - μ_{1 - A_{j}} (X_{l}, U_{l})$ in (3) accounts for the matching discrepancy, and therefore B₂ contributes to the asymptotic bias of the matching estimator. Abadie and Imbens (2006) showed that the matching estimators have nonnegligible biases when the dimension of V is greater than one. Let ${\hat{μ}}_{a} (X, U)$ be an estimator for μ_a(X, U), obtained either parametrically, for example, by a linear regression estimator, or nonparametrically, for a = 0, 1. Abadie and Imbens (2006) proposed a bias-corrected matching estimator

{\hat{τ}}_{mat, 2} = {\hat{τ}}_{mat, 2}^{(0)} - n_{2}^{- 1 / 2} {\hat{B}}_{2},

where ${\hat{B}}_{2}$ is an estimator for B₂ by replacing μ_a(X, U) with ${\hat{μ}}_{a} (X, U)$ .

3.2. A General Strategy

We give a general strategy for efficient estimation of the ACE by utilizing both the main and validation data. In Sections 3.3 and 3.4, we will provide examples to elucidate the proposed strategy with specific estimators.

Although the estimators based on the validation data $O_{2}$ are consistent for τ under certain regularity conditions, they are inefficient without using the main data $O_{1}$ . However, the main data $O_{1}$ do not contain important confounders U; if we naively use the estimators in Examples 3–6 with U being empty, then the corresponding estimators can be inconsistent for τ and thus are error-prone in general. Moreover, for robustness consideration, we do not want to impose additional modeling assumptions linking U and (A, X, Y).

Our strategy is straightforward: we apply the same error-prone procedure to both the main and validation data. The key insight is that the difference of the two error-prone estimates is consistent for 0 and can be used to improve efficiency of the initial estimator due to its association with ${\hat{τ}}_{2}$ . Let an error-prone estimator of τ from the main data be ${\hat{τ}}_{1, ep}$ , which converges to some constant τ_ep, not necessarily the same as τ. Applying the same method to the validation data ${(A_{j}, X_{j}, Y_{j}) : j \in S_{2}}$ , we can obtain another error-prone estimator ${\hat{τ}}_{2, ep}$ . More generally, we can consider τ_ep to be an L-dimensional vector of parameters identifiable based on the joint distribution of (A, X, Y), and ${\hat{τ}}_{1, ep}$ and ${\hat{τ}}_{2, ep}$ to be the corresponding estimators from the main and validation data, respectively. For example, ${\hat{τ}}_{d, ep}$ an contain estimators of τ using different methods based on $O_{d}$ .

We consider a class of estimators satisfying

n_{2}^{1 / 2} (\begin{matrix} {\hat{τ}}_{2} - τ \\ {\hat{τ}}_{2, ep} - {\hat{τ}}_{1, ep} \end{matrix}) \to N {0_{L + 1}, (\begin{matrix} v_{2} & Γ^{T} \\ Γ & V \end{matrix})},

(4)

in distribution, as $n_{2} \to \infty$ , which is general enough to include all the estimators reviewed in Examples 3–6. Heuristically, if (4) holds exactly rather than asymptotically, by the multivariate normal theory, we have the following the conditional distribution

n_{2}^{1 / 2} ({\hat{τ}}_{2} - τ) | n_{2}^{1 / 2} ({\hat{τ}}_{2, ep} - {\hat{τ}}_{1, ep}) ~ N {n_{2}^{1 / 2} Γ^{T} V^{- 1} ({\hat{τ}}_{2, ep} - {\hat{τ}}_{1, ep}), v_{2} - Γ^{T} V^{- 1} Γ} .

Let ${\hat{v}}_{2}$ , $\hat{Γ}$ and $\hat{V}$ be consistent estimators for v₂, $Γ$ and V. We set $n_{2}^{1 / 2} ({\hat{τ}}_{2} - τ)$ to equal its estimated conditional mean $n_{2}^{1 / 2} {\hat{Γ}}^{T} {\hat{V}}^{- 1} ({\hat{τ}}_{2, ep} - {\hat{τ}}_{1, ep})$ , leading to an estimating equation for τ:

n_{2}^{1 / 2} ({\hat{τ}}_{2} - τ) = n_{2}^{1 / 2} {\hat{Γ}}^{T} {\hat{V}}^{- 1} ({\hat{τ}}_{2, ep} - {\hat{τ}}_{1, ep}) .

Solving this equation for τ, we obtain the estimator

\hat{τ} = {\hat{τ}}_{2} - {\hat{Γ}}^{T} {\hat{V}}^{- 1} ({\hat{τ}}_{2, ep} - {\hat{τ}}_{1, ep}) .

(5)

Proposition 1.

Under Assumption 1 and certain regularity conditions, if (4) holds, then $\hat{τ}$ is consistent for τ, and

n_{2}^{1 / 2} (\hat{τ} - τ) \to N (0, v_{2} - Γ^{T} V^{- 1} Γ),

(6)

in distribution, as $n_{2} \to \infty$ . Given a nonzero Γ, the asymptotic variance, $v_{2} - Γ^{T} V^{- 1} Γ$ , is smaller than the asymptotic variance of ${\hat{τ}}_{2}, v_{2}$ .

The consistency of $\hat{τ}$ does not require any component in ${\hat{τ}}_{1, ep}$ and ${\hat{τ}}_{2, ep}$ to correctly estimate τ. That is, these estimators can be error prone. The requirement for the error-prone estimators is minimal, as long as they are consistent for the same (finite) parameters. Under Assumption 1, ${\hat{τ}}_{1, ep} - {\hat{τ}}_{2, ep}$ is consistent for a vector of zeros, as $n_{2} \to \infty$ .

We can estimate the asymptotic variance of $\hat{τ}$ by

\hat{v} = ({\hat{v}}_{2} - {\hat{Γ}}^{T} {\hat{V}}^{- 1} \hat{Γ}) / n_{2} .

(7)

Remark 1.

We construct the error prone estimators ${\hat{τ}}_{1, ep}$ and ${\hat{τ}}_{2, ep}$ based on $O_{1}$ and $O_{2}$ , respectively. Another intuitive way is to construct ${\hat{τ}}_{1, ep}$ and ${\hat{τ}}_{2, ep}$ based on $O_{1} \ O_{2}$ and $O_{2}$ , respectively. In general, we can construct the error prone estimators based on different subsets of $O_{1}$ and $O_{2}$ as long as their difference converges in probability to zero. We show in the supplementary material that our construction maximizes the variance reduction for ${\hat{τ}}_{2}, Γ^{T} V^{- 1} Γ$ , given the procedure of the error prone estimators.

Remark 2.

We can view (5) as the best consistent estimator of τ among all linear combinations ${{\hat{τ}}_{2} + λ^{T} ({\hat{τ}}_{2, ep} - {\hat{τ}}_{1, ep}) : λ \in ℝ^{L}}$ , in the sense that (5) achieves the minimal asymptotic variance among this class of consistent estimators. Similar ideas appeared in design-optimal regression estimation in survey sampling (Deville and Särndal 1992; Fuller 2009), regression analyses (Chen and Chen 2000; Chen 2002; Wang and Wang 2015), improved prediction in high dimensional datasets (Boonstra, Taylor, and Mukherjee 2012), and meta-analysis (Collaboration 2009). In the supplementary material, we show that the proposed estimator in (5) is the best estimator of τ among the class of estimators { $\hat{τ} = f ({\hat{τ}}_{2}, {\hat{τ}}_{1, ep}, {\hat{τ}}_{2, ep}) : f (x, y, z)$ is a smooth function of (x, y, z), and $\hat{τ}$ is consistent for τ}, in the sense that (5) achieves the minimal asymptotic variance among this class.

Remark 3.

The choice of the error-prone estimators will affect the efficiency of $\hat{τ}$ . From (6), for a given ${\hat{τ}}_{2}$ , to improve the efficiency of $\hat{τ}$ with a 1-dimensional error-prone estimator, we would like this estimator to have a small variance V and a large correlation with ${\hat{τ}}_{2}$ , Γ. In principle, increasing the dimension of the error-prone estimator would not decrease the asymptotic efficiency gain as shown in the supplementary material. However, it would also increase the complexity of implementation and harm the finite sample properties. To “optimize” the tradeoff, we suggest choosing the error-prone estimator to be the same type as the initial estimator ${\hat{τ}}_{2}$ . For example, if ${\hat{τ}}_{2}$ is an AIPW estimator, we can choose ${\hat{τ}}_{d, ep}$ to be an AIPW estimator without using U in a possibly misspecified propensity score model. The simulation in Section 5 confirms that this choice is reasonable.

To close this subsection, we comment on the existing literature and the advantages of our strategy. The proposed estimator $\hat{τ}$ in (5) utilizes both the main and validation data and improves the efficiency of the estimator based solely on the validation data. In economics, Imbens and Lancaster (1994) proposed to use the generalized method of moments (Hansen 1982) for using the main data which provide moments of the marginal distribution of some economic variables. In survey sampling, calibration is a standard technique to integrate auxiliary information in estimation or handle nonresponse; see, for example, Chen and Chen (2000), Wu and Sitter (2001), Kott (2006), Chang and Kott (2008), and Kim, Kwon, and Paik (2016). An important issue is how to specify optimal calibration equations; see, for example, Deville and Särndal (1992), Robins, Rotnitzky, and Zhao (1994), Wu and Sitter (2001), and Lumley, Shaw, and Dai (2011). Other researchers developed constrained empirical likelihood methods to calibrate auxiliary information from the main data; see, for example, Chen and Sitter (1999), Qin (2000), Chen, Sitter, and Wu (2002), and Chen, Leung, and Qin (2003).

Compared to these methods, the proposed framework is attractive because it is simple to implement which requires only standard software routines for existing methods, and it can deal with estimators that cannot be derived from moment conditions, for example, matching estimators. Moreover, our framework does not require a correct model specification of the relationship between unmeasured covariates U and measured variables (A, X, Y).

3.3. Regular Asymptotically Linear (RAL) Estimators

We first elucidate the proposed method with RAL estimators.

From the validation data, we consider the case when ${\hat{τ}}_{2} - τ$ is RAL; that is, it can be asymptotically approximated by a sum of IID random vectors with mean 0:

{\hat{τ}}_{2} - τ ≅ n_{2}^{- 1} \sum_{j \in S_{2}} ψ (A_{j}, X_{j}, U_{j}, Y_{j}),

(8)

where ${ψ (A_{j}, X_{j}, U_{j}, Y_{j}) : j \in S_{2}}$ are IID with mean 0. The random vector $ψ (A, X, U, Y)$ is called the influence function of ${\hat{τ}}_{2}$ with $E (ψ) = 0$ and $E (ψ^{2}) < \infty$ (e.g., Bickel et al. 1993). Regarding regularity conditions, see, for example, Newey (1990).

Let $e (X; γ)$ be an error-prone propensity score model for, $e (X)$ , and $μ_{a} (X; η_{a})$ be an error-prone outcome regression model for μ_a(X), for a = 0, 1. The corresponding error-prone estimators of the ACE can be obtained from the main data $O_{1}$ and the validation data $O_{2}$ . We consider the case when ${\hat{τ}}_{d, ep}$ is RAL:

{\hat{τ}}_{d, ep} - τ_{ep} ≅ n_{d}^{- 1} \sum_{j \in S_{d}} ϕ (A_{j}, X_{j}, Y_{j}), (d = 1, 2)

(9)

where ${ϕ (A_{j}, X_{j}, Y_{j}) : j \in S_{d}}$ are IID with mean 0.

Theorem 1.

Under certain regularity conditions, (4) holds for the RAL estimators (8) and (9), where $v_{2} = var {ψ (A, X, U, Y)}$ , $Γ = (1 - ρ) cov {ψ (A, X, U, Y), ϕ (A, X, Y)}$ , and $V = (1 - ρ) \times var {ϕ (A, X, Y)}$ .

To derive $\hat{Γ}$ and $\hat{V}$ for RAL estimators, let ${\hat{ϕ}}_{d} (A, X, Y)$ and $\hat{ψ} (A, X, U, Y)$ be estimators of $ϕ (A, X, Y)$ and $ψ (A, X, U, Y)$ by replacing E(·) with the empirical measure and unknown parameters with their corresponding estimators. Note that the subscript d in ${\hat{ϕ}}_{d} (A, X, Y)$ indicates that it is obtained based on $O_{d}$ . Then, we can estimate Γ and V by

\hat{Γ} = \hat{cov} ({\hat{τ}}_{2}, {\hat{τ}}_{2, ep} - {\hat{τ}}_{1, ep}) = (1 - \frac{n_{2}}{n_{1}}) \frac{1}{n_{2}} \sum_{j \in S_{2}} \hat{ψ} (A_{j}, X_{j}, U_{j}, Y_{j}) {\hat{ϕ}}_{2} (A_{j}, X_{j}, Y_{j}),

\hat{V} = \hat{cov} ({\hat{τ}}_{1, ep} - {\hat{τ}}_{2, ep}) = (1 - \frac{n_{2}}{n_{1}}) \frac{1}{n_{1}} \sum_{i \in S_{1}} {{\hat{ϕ}}_{1} (A_{i}, X_{i}, Y_{i})}^{\otimes 2} .

Finally, we can obtain the estimator and its variance estimator by (5) and (7), respectively.

The commonly-used RAL estimators include the regression imputation and (augmented) inverse probability weighting estimators. Because the influence functions for ${\hat{τ}}_{reg, 2}$ and ${\hat{τ}}_{IPW, 2}$ are standard, we present the details in the supplementary material. Below, we state only the influence function for ${\hat{τ}}_{AIPW, 2}$ .

For the outcome model, let S_a (A, X, U, Y; β_a) be the estimating function for β_a, for example,

S_{a} (A, X, U, Y; β_{a}) = \frac{\partial μ_{a} (X, U; β_{a})}{\partial β_{a}} {Y - μ_{a} (X, U; β_{a})},

for a = 0, 1, which is a standard choice for the conditional mean model. For the propensity score model, let S(A, X, U; α) be the estimating function for α, for example,

S (A, X, U; α) = \frac{A - e (X, U; α)}{e (X, U; α) {1 - e (X, U; α)}} \frac{\partial e (X, U; α)}{\partial α},

which is the score function from the likelihood of a binary response model. Moreover, let

Σ_{α α} = E {S^{\otimes 2} (A, X, U; α)} = E [\frac{1}{e (X, U; α^{*}) {1 - e (X, U; α^{*})}} {\frac{\partial e (X, U; α^{*})}{\partial α}}^{\otimes 2}]

be the Fisher information matrix for α in the propensity score model. In addition, let ${\hat{β}}_{a} (a = 0, 1)$ and $\hat{α}$ be the estimators solving the corresponding empirical estimating equations based on $O_{2}$ , with probability limits $β_{a}^{*} (a = 0, 1)$ and α^*, respectively.

Lemma 1 (Augmented inverse probability weighting).

For simplicity, denote $e_{j}^{*} = e (X_{j}, U_{j}; α^{*}), {\dot{e}}_{j}^{*} = \partial e (X_{j}, U_{j}; α^{*}) / \partial α^{T}, S_{j}^{*} = S (A_{j}, X_{j}, U_{j}; α^{*}), μ_{a j}^{*} = μ_{a} (X_{j}, U_{j}; β_{a}^{*}), {\dot{μ}}_{a j}^{*} = \partial μ_{a} (X_{j}, U_{j}; β_{a}^{*}) / \partial β_{a}^{T}, S_{a j}^{*} = S_{a} (A_{j}, X_{j}, U_{j}, Y_{j}; β_{a}^{*})$ , and ${\dot{S}}_{a j}^{*} = \partial S_{a} (A_{j}, X_{j}, U_{j}, Y_{j}; β_{a}^{*}) / \partial β_{a}^{T}$ for a = 0, 1. Under Assumption 4 or 5, ${\hat{τ}}_{AIPW, 2}$ has the influence function

ψ_{AIPW} (A_{j}, X_{j}, U_{j}, Y_{j}) = \frac{A_{j} Y_{j}}{e_{j}^{*}} + (1 - \frac{A_{j}}{e_{j}^{*}}) μ_{1 j}^{*} - \frac{(1 - A_{j}) Y_{j}}{1 - e_{j}^{*}} - (1 - \frac{1 - A_{j}}{1 - e_{j}^{*}}) μ_{0 j}^{*} - τ + H_{AIPW} Σ_{α α}^{- 1} S_{j}^{*} + E {(1 - \frac{1 - A}{1 - e^{*}}) {\dot{μ}}_{0}^{*}} {E ({\dot{S}}_{0}^{*})}^{- 1} S_{0 j}^{*}

(10)

- E {(1 - \frac{A}{e^{*}}) {\dot{μ}}_{1}^{*}} {E ({\dot{S}}_{1}^{*})}^{- 1} S_{1 j}^{*},

(11)

where

H_{AIPW} = E [{\frac{A (Y - μ_{1}^{*})}{{(e^{*})}^{2}} - \frac{(1 - A) (Y - μ_{0}^{*})}{{(1 - e^{*})}^{2}}} {\dot{e}}^{*}] .

Lemma 1 follows from standard asymptotic theory, but as far as we know it has not appeared in the literature. Lunceford and Davidian (2004) suggest a formula without (10) and (11) for $ψ_{AIPW}$ , which, however, works only when both Assumptions 4 and 5 hold. Otherwise, the resulting variance estimator is not consistent if either Assumption 4 or 5 does not hold, as shown by simulation in Funk et al. (2011). The correction terms in (10) and (11) also make the variance estimator doubly robust in the sense that the variance estimator for ${\hat{τ}}_{AIPW, 2}$ is consistent if either Assumption 4 or 5 holds, not necessarily both.

For error-prone estimators, we can obtain the influence functions similarly. The subtlety is that both the propensity score and outcome models can be misspecified. For simplicity of the presentation, we defer the exact formulas to the online supplementary material.

3.4. Matching Estimators

We then elucidate the proposed method with non-RAL estimators. An important class of non-RAL estimators for the ACE are the matching estimators. The matching estimators are not regular estimators because the functional forms are not smooth due to the fixed numbers of matches (Abadie and Imbens 2008). Continuing with Example 6, Abadie and Imbens 2006) express the bias-corrected matching estimator ${\hat{τ}}_{mat, 2}$ in a linear form as

{\hat{τ}}_{mat, 2} - τ ≅ n_{2}^{- 1} \sum_{j \in S_{2}} ψ_{mat, j},

(12)

where

ψ_{mat, j} = μ_{1} (X_{j}, U_{j}) - μ_{0} (X_{j}, U_{j}) - τ + (2 A_{j} - 1) {1 + M^{- 1} K_{2, (X, U), j}} {Y_{j} - μ_{A_{j}} (X_{j}, U_{j})} .

(13)

Similarly, ${\hat{τ}}_{mat, d,ep}$ has a linear form

{\hat{τ}}_{mat, d, ep} - τ_{ep} ≅ n_{d}^{- 1} \sum_{j \in S_{d}} ϕ_{mat, d, j},

(14)

where

ϕ_{mat, d, j} = μ_{1} (X_{j}) - μ_{0} (X_{j}) - τ_{ep} + (2 A_{j} - 1) (1 + M^{- 1} K_{d, X, j}) {Y_{j} - μ_{A_{j}} (X_{j})} .

(15)

Theorem 2.

Under certain regularity conditions, (4) holds for the matching estimators (12) and (14), where

v_{2} = var {τ (X, U)} + plim [n_{2}^{- 1} \sum_{j \in S_{2}} {1 + M^{- 1} K_{2, (X, U), j}}^{2} σ_{A_{j}}^{2} (X_{j}, U_{j})], Γ = (1 - ρ) (cov {μ_{1} (X, U) - μ_{0} (X, U), μ_{1} (X) - μ_{0} (X)} + plim [n_{2}^{- 1} \sum_{j \in S_{2}} {1 + M^{- 1} K_{2, (X, U), j}} \times (1 + M^{- 1} K_{2, X, j}) σ_{A_{j}}^{2} (X_{j}, U_{j})]),

V = (1 - ρ) [var {μ_{1} (X) - μ_{0} (X)} + plim {n_{2}^{- 1} \sum_{j \in S_{2}} {(1 + M^{- 1} K_{2, X, j})}^{2} σ_{A_{j}}^{2} (X_{j})}] .

The existence of the probability limits in Theorem 2 are guaranteed by the regularity conditions specified in the supplementary material (c.f. Abadie and Imbens 2006).

To estimate (v₂, Γ, V) in Theorem 2, we need to estimate the conditional mean and variance functions of the outcome given covariates. Following Abadie and Imbens (2006), we can estimate these functions via matching units with the same treatment level. We will discuss an alternative bootstrap strategy in the next subsection.

3.5. Bootstrap Variance Estimation

The asymptotic results in Theorems 1 and 2 allow for variance estimation of $\hat{τ}$ . In addition, we also consider the bootstrap for variance estimation, which is simpler to implement and often has better finite sample performances (Otsu and Rai 2016). This is particularly important for matching estimators because the analytic variance formulas involve nonparametric estimation of the conditional variances $σ_{a}^{2} (x, u) and σ_{a}^{2} (x)$ .

There are two approaches for obtaining bootstrap observations: (a) the original observations; and (b) the asymptotic linear terms of the proposed estimator. For RAL estimators, bootstrapping the original observations will yield valid variance estimators (Efron and Tibshirani 1986; Shao and Tu 2012). However, for matching estimators, Abadie and Imbens (2008) showed that due to lack of smoothness in their functional form, the bootstrap based on approach (a) does not apply for variance estimation. This is mainly because the bootstrap based on approach (a) cannot preserve the distribution of the numbers of times that the units are used as matches. As a remedy, Otsu and Rai (2016) proposed to construct the bootstrap counterparts by resampling based on approach (b) for the matching estimator.

To unify the notation, let $ψ_{j}$ indicate $ψ (A_{j}, X_{j}, U_{j}, Y_{j})$ for RAL ${\hat{τ}}_{2}$ and $ψ_{mat, j}$ for ${\hat{τ}}_{mat, 2}$ and similar definitions apply to $ϕ_{d, j} (d = 1, 2)$ . Let ${\hat{ψ}}_{j}$ and ${\hat{ϕ}}_{d, j}$ be their estimated version by replacing the population quantities by the estimated quantities (d = 1, 2). Following Otsu and Rai (2016), for b = 1,...,B, we construct the bootstrap replicates for the proposed estimators as follows:

Step 1. Sample n₁ units from $S_{1}$ with replacement as $S_{1}^{* (b)}$ , treating the units with observed U as the bootstrap validation data $S_{2}^{* (b)}$ .

Step 2. Compute the bootstrap replicates of ${\hat{τ}}_{2} - τ and {\hat{τ}}_{d, ep} - τ_{ep}$ as

{\hat{τ}}_{2}^{(b)} - {\hat{τ}}_{2} = n_{2}^{- 1} \sum_{j \in S_{2}^{*} (b)} {\hat{ψ}}_{j}, {\hat{τ}}_{d, ep}^{(b)} - {\hat{τ}}_{d, ep} = n_{d}^{- 1} \sum_{j \in S_{d}^{* (b)}} {\hat{ϕ}}_{d, j}, (d = 1, 2) .

Based on the bootstrap replicates, we estimate Γ, V and v₂ by

\hat{Γ} = {(B - 1)}^{- 1} \sum_{b = 1}^{B} ({\hat{τ}}_{2}^{(b)} - {\hat{τ}}_{2}) ({\hat{τ}}_{2, ep}^{(b)} - {\hat{τ}}_{1, ep}^{(b)} - {\hat{τ}}_{2, ep} + {\hat{τ}}_{1, ep}),

(16)

\hat{V} = {(B - 1)}^{- 1} \sum_{b = 1}^{B} {({\hat{τ}}_{2, ep}^{(b)} - {\hat{τ}}_{1, ep}^{(b)} - {\hat{τ}}_{2, ep} + {\hat{τ}}_{1, ep})}^{\otimes 2},

(17)

{\hat{v}}_{2} = {(B - 1)}^{- 1} \sum_{b = 1}^{B} {({\hat{τ}}_{2}^{(b)} - {\hat{τ}}_{2})}^{2} .

(18)

Finally, we estimate the asymptotic variance of $\hat{τ}$ by (7), that is, $\hat{v} = ({\hat{v}}_{2} - {\hat{Γ}}^{T} {\hat{V}}^{- 1} \hat{Γ}) / n_{2}$ .

Theorem 3.

Under certain regularity conditions, $(\hat{Γ}, \hat{V}, {\hat{v}}_{2}, \hat{v})$ are consistent for ${Γ, V, var ({\hat{τ}}_{2}), var (\hat{τ})}$ .

Remark 4.

If the ratio of n₂ and n₁ is small, the above bootstrap approach may be unstable, because it is likely that some bootstrap validation data contain only a few or even zero observations. In this case, we use an alternative bootstrap approach, where we sample n₂ units from $S_{2}$ with replacement as $S_{2}^{*}$ , sample n₁ − n₂ units from $S_{1}$ \ $S_{2}$ with replacement, combined with $S_{2}^{*}$ , as $S_{1}^{*}$ , and obtain the proposed estimators based on $S_{1}^{*}$ and $S_{2}^{*}$ . This approach guarantees that the bootstrap validation data contain n₂ observations.

Remark 5.

It is worthwhile to comment on a computational issue. When the main data have a substantially large size, the computation for the bootstrap can be demanding if we follow Steps 1 and 2 above. In this case, we can use subsampling (Politis, Romano, and Wolf 1999) or the Bag of Little Bootstraps (Kleiner et al. 2014) to reduce the computational burden. More interestingly, when $n_{1} \to \infty$ and ρ = 0, that is, the validation data contain a small fraction of the main data, Γ and V reduce to $cov ({\hat{τ}}_{2}, {\hat{τ}}_{2, ep})$ and $var ({\hat{τ}}_{2, ep})$ , respectively. That is, when the size of the main data is substantially large, we can ignore the uncertainty of ${\hat{τ}}_{1, ep}$ and treat it as a constant, which is a regime recently considered by Chatterjee et al. (2016). In this case, we need only to bootstrap the validation data, which is computationally simpler.

4. Extensions

4.1. Other Causal Estimands

Our strategy extends to a wide class of causal estimands, as long as (4) holds. For example, we can consider the average causal effects over a subset of population (Crump et al. 2006; Li, Morgan, and Zaslavsky 2016), including the average causal effect on the treated.

We can also consider nonlinear causal estimands. For example, for a binary outcome, the log of the causal risk ratio is

\log CRR = \log \frac{P {Y (1) = 1}}{P {Y (0) = 1}} = \log \frac{E {Y (1)}}{E {Y (0)}},

and the log of the causal odds ratio is

\log COR = \log \frac{P {Y (1) = 1} / P {Y (1) = 0}}{P {Y (0) = 1} / P {Y (0) = 0}} = \log \frac{E {Y (1)} / [1 - E {Y (1)}]}{E {Y (0)} / [1 - E {Y (0)}]} .

We give a brief discussion for the log CRR as an illustration. The key insight is that under Assumptions 2 and 3, we can estimate E{Y(a)} with commonly-used estimators from $O_{2}$ , denoted by $\hat{E} {Y (a)}, for a = 0, 1$ . We can then obtain an estimator for the log CRR as log $[\hat{E} {Y (1)} / \hat{E} {Y (0)}]$ . Similarly, we can obtain error-prone estimators for the log CRR from both $O_{1}$ and $O_{2}$ using only covariates X. By the Taylor expansion, we can linearize these estimators and establish a similar result as (4), which serves as the basis to construct an improved estimator for the log CRR.

4.2. Design Issue: Optimal Sample Size Allocation

As a design issue, we consider planning a study to obtain the data structure in Example 1 in Section 2 subject to a cost constraint. The goal is to find the optimal design, specifically the sample allocation, that minimizes the variance of the proposed estimator subject to a cost constraint, as in the classical two-phase sampling (Cochran 2007).

Suppose that it costs C₁ to collect (A, X, Y) for each unit, and C₂ to collect U for each unit. Thus, the total cost of the study is

C = n_{1} C_{1} + n_{2} C_{2} .

(19)

The variance of the proposed estimator $\hat{τ}$ is of the form

n_{2}^{- 1} v_{2} - (n_{2}^{- 1} - n_{1}^{- 1}) γ,

(20)

for example, for RAL estimators,

γ = cov {ψ (A, X, U, Y), ϕ (A, X, Y)}^{T} {[var {ϕ (A, X, Y)}]}^{- 1} cov {ψ (A, X, U, Y), ϕ (A, X, Y)}

is the variance of the projection of $ψ (A, X, U, Y)$ onto the linear space spanned by $ϕ (A, X, Y)$ . Minimizing (20) with respect to n₁ and n₂ subject to the constraint (19) yields the optimal $n_{1}^{*}$ and $n_{2}^{*}$ , which satisfy

ρ^{*} = \frac{n_{2}^{*}}{n_{1}^{*}} = {(1 - R_{ψ | ϕ}^{2}) \times \frac{C_{1}}{C_{2}}}^{1 / 2},

(21)

where $R_{ψ | ϕ}^{2} = γ / v_{2}$ is the squared multiple correlation coefficient of $ψ (A, X, U, Y) on ϕ (A, X, Y)$ , which measures the association between the initial estimator and the error-prone estimator. We derive (21) using the Lagrange multipliers, and relegate the details to the supplementary material. Not surprisingly, (21) shows that the sizes of the validation data and the main data should be inversely proportional to the square-root of the costs. In addition, from (21), a large size n₂ for the validation data is more desirable when the association between the initial estimator and the error-prone estimator is small.

4.3. Multiple Data Sources

We have considered the setting with two data sources, and we can easily extend the theory to the setting with multiple data sources $O_{1}, \dots, O_{K}$ , where $O_{1}, \dots, O_{K - 1}$ contain partial covariate information, and the validation data, $O_{K}$ , contain full information for (A, X, U, Y). For example, for $d = 1, \dots, K - 1$ , $O_{d}$ contains variables $(A, V_{d}, Y)$ where $V_{d} ⫋ (X, U)$ . Each dataset $O_{d}$ , indexed by $S_{d}$ , has size n_d for d = 1,...,K. This type of data structure arises from a multi-phase sampling as an extension of Example 1 or multiple sources of “big data” as an extension of Example 2.

Let ${\hat{τ}}_{K}$ be the initial estimator for τ from the validation data $O_{K}$ , and ${\hat{τ}}_{d, ep}$ be the error-prone estimator for τ from $O_{d} (d = 1, \dots, K - 1)$ . Let ${\hat{τ}}_{d, K, ep}$ be the estimator obtained by applying the same error-prone estimator for $O_{d}$ to $O_{K}$ , so that ${\hat{τ}}_{d, ep} - {\hat{τ}}_{d, K, ep}$ is consistent for 0, for $d = 1, \dots, K - 1$ . Assume that

n_{K}^{1 / 2} (\begin{matrix} {\hat{τ}}_{K} - τ \\ {\hat{τ}}_{1, ep} - {\hat{τ}}_{1, K, ep} \\ ⋮ \\ {\hat{τ}}_{K - 1, ep} - {\hat{τ}}_{K - 1, K, ep} \end{matrix}) \to N {(\begin{matrix} 0 \\ 0_{L} \end{matrix}), (\begin{matrix} v_{K} & Γ^{T} \\ Γ & V \end{matrix})},

in distribution, as $n_{K} \to \infty$ , where $L = \sum_{d = 1}^{K - 1} \dim ({\hat{τ}}_{d, ep})$ . If Γ and V have consistent estimators $\hat{Γ}$ and $\hat{V}$ , respectively, then, extending the proposed method in Section 3, we can use

\hat{τ} = {\hat{τ}}_{K} - {\hat{Γ}}^{T} {\hat{V}}^{- 1} {({\hat{τ}}_{1, ep}^{T} - {\hat{τ}}_{1, K, ep}^{T}, \dots, {\hat{τ}}_{K - 1, ep}^{T} - {\hat{τ}}_{K - 1, K, ep}^{T})}^{T}

to estimate τ. The estimator $\hat{τ}$ is consistent for τ with the asymptotic variance $v_{K} - Γ^{T} V^{- 1} Γ$ , which is smaller than the asymptotic variance of ${\hat{τ}}_{K}, v_{K}$ , if Γ is nonzero. Similar to the reasoning in Remark 3, using more data sources will improve the asymptotic estimation efficiency of τ.

5. Simulation

In this section, we conduct a simulation study to evaluate the finite sample performance of the proposed estimators. In our data generating model, the covariates are $X_{i} ~ Unif (0, 2)$ and $U_{i} = 0.5 + 0.5 X_{i} - 2 \sin (X_{i}) + 2 sign {\sin (5 X_{i})} + ϵ_{i}$ , where $ϵ_{i} ~ Unif (- 0.5, 0.5)$ . The potential outcomes are $Y_{i} (0) = - X_{i} - U_{i} + ϵ_{i} (0)$ , and $Y_{i} (1) = - X_{i} + 4 U_{i} + ϵ_{i} (1)$ , where $ϵ_{i} (0) ~ N (0, 1), ϵ_{i} (1) ~ N (0, 1), and ϵ_{i} (0) and ϵ_{i} (1)$ are independent. Therefore, the true value of the ACE is τ = E(5U_i). The treatment indicator A_i follows Bernoulli $(π_{i})$ with $logit (π_{i}) = 1 - 0.5 X_{i} - 0.5 U_{i}$ . The main data $O_{1}$ consist of n₁ units, and the validation data $O_{2}$ consist of n₂ units randomly selected from the main data.

The initial estimators are the regression imputation, (A)IPW and matching estimators applied solely to the validation data, denoted by ${\hat{τ}}_{reg, 2}, {\hat{τ}}_{IPW, 2}, {\hat{τ}}_{AIPW, 2}, and {\hat{τ}}_{mat, 2}$ respectively. To distinguish the estimators constructed based on different error-prone methods, we assign each proposed estimator a name with the form ${\hat{τ}}_{method, 2 &methods}$ , where “method,2” indicates the initial estimator applied to the validation data $O_{2}$ and “methods” indicates the error-prone estimator(s) used to improve the efficiency of the initial estimator. For example, ${\hat{τ}}_{reg, 2 & IPW}$ indicates the initial estimator is the regression imputation estimator and the error-prone estimator is the IPW estimator. We compare the proposed estimators with the initial estimators in terms of percentages of reduction of mean squared errors, defined as ${1 - MSE ({\hat{τ}}_{method, 2 & methods}) / MSE ({\hat{τ}}_{method, 2})} \times 100 %$ . To demonstrate the robustness of the proposed estimator against misspecification of the imputation model, we consider the multiple imputation (MI, Rubin 1987) estimator, denoted by ${\hat{τ}}_{mi}$ , which uses a regression model of U given (A, X, Y) for imputation. We implement MI using the “mice” package in R with m = 10.

Based on a point estimate $\hat{τ}$ and a variance estimate $\hat{v}$ obtained by the asymptotic variance formula or the bootstrap method described in Section 3.5, we construct a Wald-type 95% confidence interval $(\hat{τ} - z_{0.975} {\hat{v}}^{1 / 2}, \hat{τ} + z_{0.975} {\hat{v}}^{1 / 2})$ , where $z_{0.975}$ is the 97.5% quantile of the standard normal distribution. We further compare the variance estimators in terms of empirical coverage rates.

Figure 1 shows the simulation results over 2000 Monte Carlo samples for (n₁, n₂) = (1000,200) and (n₁, n₂) = (1000,500). The multiple imputation estimator is biased due to the misspecification of the imputation model. In all scenarios, the proposed estimators are unbiased and improve the initial estimators. Using the error-prone estimator of the same type of the initial estimator achieves a substantial efficiency gain, and the efficiency gain from incorporating additional error-prone estimator is not significantly important. Because of the practical simplicity, we recommend using the same type of error-prone estimator to improve the efficiency of the initial estimator. Confidence intervals constructed from the asymptotic variance formula and the bootstrap method work well, in the sense that the empirical coverage rate of the confidence intervals is close to the nominal coverage rate. In our settings, the matching estimator has the smallest efficiency gain among all types of estimators.

6. Application

We present an analysis to evaluate the effect of chronic obstructive pulmonary disease (COPD) on the development of herpes zoster (HZ). COPD is a chronic inflammatory lung disease that causes obstructed airflow from the lungs, which can cause systematic inflammation and dysregulate a patient’s immune function. The hypothesis is that people with COPD are at increased risk of developing HZ. Yang et al. (2011) find a positive association between COPD and development of HZ; however, they do not control for important counfounders between COPD and HZ, for example, cigarette smoking and alcohol consumption.

We analyze the main data from the 2005 Longitudinal Health Insurance Database (LHID, Yang et al. 2011) and the validation data from the 2005 National Health Interview Survey conducted by the National Health Research Institute and the Bureau of Health Promotion in Taiwan (Lin and Chen 2014). The 2005 LHID consist of 42,430 subjects followed from the date of cohort entry on January 1, 2004 until the development of HZ or December 31, 2006, whichever came first. Among those, there are 8,486 subjects with COPD, denoted by A = 1, and 33,944 subjects without COPD, denoted by A = 0. The outcome Y was the development of HZ during follow up (1, having HZ and 0, not having HZ). The observed prevalence of HZ among COPD and non-COPD subjects are 3.7% and 2.2% in the main data and 2.5% and 0.8% in the validation data.

The confounders X available from the main data were age, sex, diabetes mellitus, hypertension, coronary artery disease, chronic liver disease, autoimmune disease, and cancer. However, important confounders U, including cigarette smoking and alcohol consumption, were not available. The validation data $O_{2}$ use the same inclusion criteria as in the main study and consist of 1,148 subjects who were comparable to the subjects in the main data. Among those, 244 subjects were diagnosed of COPD, and 904 subjects were not. In addition to all variables available from the main data, cigarette smoking and alcohol consumption were measured. In our formulation, the main data $O_{1}$ combine the LHID data and the validation data. Table 4 in Lin and Chen (2014) shows summary statistics on demographic characteristics and comorbid disorders for COPD and Non-COPD subjects in the main and validation data. Because the common covariates in the main and validation data are comparable, it is reasonable to assume that the validation sample is a simple random sample from the main data. Moreover, the difference in distributions of alcohol consumption between COPD and non-COPD subjects is not statistical significant in the validation data. But, the COPD subjects tended to have higher cumulative smoking rates than the non-COPD subjects in the validation data.

We obtain the initial estimators applied solely to the validation data and the proposed estimators applied to both data. As suggested by the simulation in Section 5, we use the same type of the error-prone estimator as the initial estimator. Following Stürmer et al. (2005) and Lin and Chen (2014), we use the propensity score to accommodate the high-dimensional confounders. Specifically, we fit logistic regression models for the propensity score e(X, U; α) and the error-prone propensity score $e (X; γ)$ based on ${(A_{j}, X_{j}, U_{j}) : j \in S_{2}}$ and ${(A_{i}, X_{i}) : j \in S_{1}}$ , respectively. We fit logistic regression models for the outcome mean function μ_a(X, U) based on a linear predictor ${1, e (X, U; \hat{α})}^{T} β_{a}$ , and for $μ_{a} (X)$ based on a linear predictor ${1, e (X; \hat{γ})}^{T} η_{a}$ , for a = 0, 1.

We first estimate the ACE τ. Table 1 shows the results for the average COPD effect on the development of HZ. We find no big differences in the point estimates between our proposed estimators and the corresponding initial estimators, but large reductions in the estimated standard errors of the proposed estimators. As a result, all 95% confidence intervals based on the initial estimators include 0, but the 95% confidence intervals based on the proposed estimators do not include 0, except for ${\hat{τ}}_{mat 2 &mat}$ . As demonstrated by the simulation in Section 5, the variance reduction by utilizing the main data is the smallest for the matching estimator. From the results, on average, COPD increases the percentage of developing HZ by 1.55%.

Table 1.

Point estimate, bootstrapped standard error and 95%Wald-type confidence interval

	Est	SE	95% CI		Est	SE	95% CI
${\hat{τ}}_{reg, 2}$	0.0178	0.0112	(−0.0047, 0.0402)	${\hat{τ}}_{reg, 2 & reg}$	0.0155	0.0023	(0.0109, 0.0200)
${\hat{τ}}_{IPW, 2}$	0.0175	0.0111	(−0.0048, 0.0398)	${\hat{τ}}_{IPW, 2 &IPW}$	0.0155	0.0024	(0.0108, 0.0202)
${\hat{τ}}_{AIPW, 2}$	0.0179	0.0111	(−0.0044, 0.0402)	${\hat{τ}}_{AIPW, 2 & AIPW}$	0.0156	0.0024	(0.0109, 0.0203)
${\hat{τ}}_{mat, 2}$	0.0077	0.0092	(−0.0106, 0.021)	${\hat{τ}}_{mat, 2 & mat}$	0.0079	0.0053	(−0.0027, 0.0184)

Open in a new tab

We also estimate the log of the causal risk ratio of HZ with COPD. The initial IPW estimate from the validation data is log ${\hat{CRR}}_{IPW, 2} = 1.10$ (95% confidence interval: 0.02, 2.18). In contrast, the proposed estimate by using the error-prone IPW estimators is log ${\hat{CRR}}_{IPW, 2 &IPW} = 0.57$ (95% confidence interval: 0.41, 0.72), which is much more accurate than the initial IPW estimate.

7. Relaxing Assumption 1

In previous sections, we invoked Assumption 1 that $S_{2}$ is a random sample from $S_{1}$ . We now relax this assumption and link our framework to existing methods for missing data. Let I_i be the indicator of selecting unit i into the validation data, that is, I_i = 1 if i ∈ $S_{2}$ and I_i = 0 if $i \notin S_{2}$ . Alternatively, I_i can be viewed as the missingness indicator of U_i. Under Assumption 1, $I ⫫ (A, X, U, Y)$ ; that is, U is missing completely at random. We now relax it to $I ⫫ U | (A, X, Y)$ , that is, U is missing at random. In this case, the selection of $S_{2}$ from $S_{1}$ can depend on a probability design, which is common in observational studies, for example, an outcome-dependent two-phase sampling (Breslow, McNeney, and Wellner 2003; Wang et al. 2009).

We assume that each unit in the main data is subjected to an independent Bernoulli trial which determines whether the unit is selected into the validation data. For simplicity, we further assume that the inclusion probability $P (I = 1 | A, X, U, Y) = P (I = 1 | A, X, Y) \equiv π (A, X, Y)$ is known as in two-phase sampling. Otherwise, we need to fit a model for the missing data indicator I given (A, X, Y). We summarize the above in the following assumption.

Assumption 6.

${(I_{i}, A_{i}, X_{i}, U_{i}, Y_{i}) : i \in S_{1}}$ are IID with $I ⫫ U | (A, X, Y) . S_{2}$ is selected from $S_{1}$ with a known inclusion probability $π (A, X, Y) > 0$ .

In what follows, we use π for π(A, X, Y) and π_j for π(A_j, X_j, Y_j) for shorthand. Because of Assumption 6, we drop the indices i and j in the expectations, covariances, and variances, which are taken with respect to both the sampling and superpopulation models.

7.1. RAL estimators

For the illustration of RAL estimators, we focus on the AIPW estimator of the ACE τ, because the regression imputation and inverse probability weighting estimators are its special cases. Let $\hat{α}$ and ${\hat{β}}_{a}$ solve the weighted estimating equations $\sum_{j \in S_{2}} π_{j}^{- 1} S (A_{j}, X_{j}, U_{j}; α) = 0$ and $\sum_{j \in S_{2}} π_{j}^{- 1} S_{a} (A_{j}, X_{j}, U_{j}, Y_{j}; β_{a}) = 0$ , and let $α^{*}$ and $β_{a}^{*}$ satisfy $E {S (A, X, U; α^{*})} = 0$ and $E {S_{a} (A, X, U, Y; β_{a}^{*})} = 0$ . Under suitable regularity condition, $\hat{α} \to α^{*}$ and ${\hat{β}}_{a} \to β_{a}^{*}$ in probability, for a = 0, 1. Let the initial estimator for τ be the Hajek-type estimator (Hájek 1971):

{\hat{τ}}_{2} = \frac{\sum_{j \in S_{2}} π_{j}^{- 1} {\hat{τ}}_{AIPW, 2, j}}{\sum_{j \in S_{2}} π_{j}^{- 1}},

(22)

where ${\hat{τ}}_{AIPW, 2, j}$ has the same form as (2). Under regularity conditions, Assumption 4 or 5, and Assumption 6, we show in the supplementary material that

{\hat{τ}}_{2} - τ ≅ n_{1}^{- 1} \sum_{j \in S_{1}} π_{j}^{- 1} I_{j} ψ (A_{j}, X_{j}, U_{j}, Y_{j}),

(23)

where $ψ (A, X, U, Y)$ is given by (11). Because ${π_{j}^{- 1} I_{j} ψ (A_{j}, X_{j}, U_{j}, Y_{j}) : j \in S_{1}}$ are IID with mean 0, ${\hat{τ}}_{2}$ is consistent for τ.

Similarly, let ${\hat{γ}}_{d}$ and ${\hat{η}}_{d, a}$ solve the weighted estimating equation $\sum_{j \in S_{d}} π_{j}^{- 1} S (A_{j}, X_{j}; γ) = 0$ and $\sum_{j \in S_{d}} π_{j}^{- 1} S_{a} (A_{j}, X_{j}, Y_{j}; η_{a}) = 0$ , and let $γ^{*}$ and $η_{a}^{*}$ satisfy $E {S (A_{j}, X_{j}; γ^{*})} = 0$ and $E {S_{a} (A_{j}, X_{j}, Y_{j}; η_{a}^{*})} = 0$ . Under suitable regularity condition, ${\hat{γ}}_{d} \to γ^{*} and {\hat{η}}_{d, a} \to η_{a}^{*}$ in probability, for a = 0, 1 and d = 1, 2. Let the error-prone estimators be

{\hat{τ}}_{1, ep} = n_{1}^{- 1} \sum_{i \in S_{1}} {\hat{τ}}_{AIPW, 1, ep, i}, {\hat{τ}}_{2, ep} = \frac{\sum_{j \in S_{2}} π_{j}^{- 1} {\hat{τ}}_{AIPW, 2, ep, j}}{\sum_{j \in S_{2}} π_{j}^{- 1}},

(24)

where ${\hat{τ}}_{AIPW, d, ep, j}$ has the same form as (S8) in the supplementary material. Following a similar derivation for (23), we have

{\hat{τ}}_{1, ep} - τ_{ep} ≅ n_{1}^{- 1} \sum_{i \in S_{1}} ϕ (A_{i}, X_{i}, Y_{i}), {\hat{τ}}_{2, ep} - τ_{ep} ≅ n_{1}^{- 1} \sum_{j \in S_{1}} π_{j}^{- 1} I_{j} ϕ (A_{j}, X_{j}, Y_{j}),

(25)

where $ϕ (A, X, Y)$ is given by (S9) in the supplementary material. Because both ${ϕ (A_{i}, X_{i}, Y_{i}) : i \in S_{1}}$ and ${π_{j}^{- 1} I_{j} ϕ (A_{j}, X_{j}, Y_{j}) : j \in S_{1}}$ are IID with mean 0, ${\hat{τ}}_{1, ep} and {\hat{τ}}_{2, ep}$ are consistent for τ_ep.

Theorem 4.

Under certain regularity conditions, (4) holds for the Hajek-type estimators (22) and (24), where $ρ = {plim}_{n_{2} \to \infty} (n_{2} / n_{1}), v_{2} = ρ \times var {π^{- 1} I ψ (A, X, U, Y)}, Γ = ρ \times cov {π^{- 1} I ψ (A, X, U, Y), (π^{- 1} I - 1) ϕ (A, X, Y)} and V = ρ \times var {(π^{- 1} I - 1) ϕ (A, X, Y)}$ .

Similar to Section 3.3, we can construct a consistent variance estimator for $\hat{τ}$ by replacing the variances and covariance in Theorem 4 with their sample analogs.

7.2. Matching Estimators

Recall that $J_{d, V, l}$ is the index set of matches for unit l based on data $O_{d}$ and the matching variable V, which can be (X, U) or X. Define $δ_{d, V, (j, l)} = 1 if j \in J_{d, V, l} and δ_{d, V, (j, l)} = 0$ otherwise. Now, we denote $K_{d, V, j} = π_{j} \sum_{l \in S_{d}} π_{l}^{- 1} 1 {A_{l} = 1 - A_{j}} δ_{d, V, (j, l)}$ as the weighted number of times that unit j is used as a match. If π_j is a constant for all j ∈ $S_{d}$ , then $K_{d, V, j}$ reduces to the number of times that unit j is used as a match defined in Section 3.1, which justifies using the same notation as before.

Let the initial matching estimator for τ be the Hajek-type estimator:

{\hat{τ}}_{mat, 2}^{(0)} = {(\sum_{j \in S_{2}} π_{j}^{- 1})}^{- 1} \times \sum_{j \in S_{2}} π_{j}^{- 1} (2 A_{j} - 1) (Y_{j} - M^{- 1} \sum_{l \in J_{2, (X, U), j}} Y_{l}) = {(\sum_{j \in S_{2}} π_{j}^{- 1})}^{- 1} \times \sum_{j \in S_{2}} π_{j}^{- 1} (2 A_{j} - 1) {1 + M^{- 1} K_{2, (X, U), j}} Y_{j} .

Let a bias-corrected matching estimator be

{\hat{τ}}_{mat, 2} = {\hat{τ}}_{mat, 2}^{(0)} - n_{1}^{- 1 / 2} {\hat{B}}_{2},

(26)

where

{\hat{B}}_{2} = n_{1}^{- 1 / 2} \sum_{j \in S_{2}} π_{j}^{- 1} (2 A_{j} - 1) \times [M^{- 1} \sum_{l \in J_{2, (X, U), j}} {{\hat{μ}}_{1 - A_{j}} (X_{j}, U_{j}) - {\hat{μ}}_{1 - A_{j}} (X_{l}, U_{l})}],

We show in the supplementary material that

{\hat{τ}}_{mat, 2} - τ ≅ n_{1}^{- 1} \sum_{j \in S_{2}} π_{j}^{- 1} ψ_{mat, j},

(27)

where $ψ_{mat, j}$ is defined in (13) with the new definition of $K_{2, (X, U), j}$ .

Similarly, we obtain error-prone matching estimators and express them as

{\hat{τ}}_{mat, 1, ep} - τ_{ep} ≅ n_{1}^{- 1} \sum_{j \in S_{1}} ϕ_{mat, 1, j}, {\hat{τ}}_{mat, 2, ep} - τ_{ep} ≅ n_{1}^{- 1} \sum_{j \in S_{2}} π_{j}^{- 1} ϕ_{mat, 2, j},

(28)

where $ϕ_{mat, d, j}$ is defined in (15) with the new definition of $K_{d, X, j}$ .

From the above decompositions, ${\hat{τ}}_{mat, 2}$ is consistent for τ, and ${\hat{τ}}_{mat, 1, ep} - {\hat{τ}}_{mat, 2, ep}$ is consistent for 0.

Theorem 5.

Under certain regularity conditions, (4) holds for the estimators (26) and ${\hat{τ}}_{mat, d, ep} (d = 1, 2)$ , where $ρ = {plim}_{n_{2} \to \infty} (n_{2} / n_{1})$ ,

v_{2} = ρ \times (E [\frac{1 - π}{π} {τ (X, U) - τ}^{2}] + plim [n_{1}^{- 1} \sum_{j \in S_{1}} \frac{1 - π_{j}}{π_{j}} {1 + M^{- 1} K_{2, (X, U), j}}^{2} σ_{A_{j}}^{2} (X_{j}, U_{j})]),

Γ = ρ \times E [\frac{1 - π}{π} {μ_{1} (X, U) - μ_{0} (X, U) - τ} {μ_{1} (X) - μ_{0} (X) - τ_{e p}}] + ρ \times plim [n_{1}^{- 1} \sum_{j \in S_{1}} \frac{1 - π_{j}}{π_{j}} {1 + M^{- 1} K_{2, (X, U), j}} (1 + M^{- 1} K_{2, X, j}) σ_{A_{j}}^{2} (X_{j}, U_{j})]

V = ρ \times E [\frac{1 - π}{π} {μ_{1} (X) - μ_{0} (X) - τ_{ep}}^{2}] + ρ \times plim [n_{1}^{- 1} \sum_{j \in S_{1}} \frac{1 - π_{j}}{π_{j}} {(1 + M^{- 1} K_{1, X, j})}^{2} σ_{A_{j}}^{2} (X_{j})] .

We can construct variance estimators based on the formulas in Theorem 5. However, this again involves estimating the conditional variances $σ_{0}^{2} (x)$ and $σ_{1}^{2} (x)$ . We recommend using the bootstrap variance estimator in the next subsection.

7.3. A Bootstrap Variance Estimation Procedure

The asymptotic linear forms (23), (25), (27), and (28) are useful for the bootstrap variance estimation. For b = 1,...,B, we construct the bootstrap replicates as follows:

Step 1. Sample n₁ units from $S_{1}$ with replacement as $S_{1}^{* (b)}$ .

Step 2. Compute the bootstrap replicates of ${\hat{τ}}_{2} - τ$ and ${\hat{τ}}_{d, ep} - τ_{ep}$ as

{\hat{τ}}_{2}^{(b)} - {\hat{τ}}_{2} = n_{1}^{- 1} \sum_{i \in S_{1}^{* (b)}} π_{i}^{- 1} I_{i} {\hat{ψ}}_{i}, {\hat{τ}}_{1, ep}^{(b)} - {\hat{τ}}_{1, ep} = n_{1}^{- 1} \sum_{i \in S_{1}^{* (b)}} {\hat{ϕ}}_{1, i}, {\hat{τ}}_{2, ep}^{(b)} - {\hat{τ}}_{2, ep} = n_{1}^{- 1} \sum_{i \in S_{1}^{* (b)}} π_{i}^{- 1} I_{i} {\hat{ϕ}}_{2, i},

where $({\hat{ψ}}_{i}, {\hat{ϕ}}_{d, i})$ are the estimated versions of $(ψ_{i}, ϕ_{i})$ from $O_{d}$ d = 1, 2).

We estimate Γ, V and v₂ by (16)–(18) based on the above bootstrap replicates, and var $(\hat{τ})$ by (7), that is, $\hat{v} = {\hat{v}}_{2} - {\hat{Γ}}^{T} {\hat{V}}^{- 1} \hat{Γ}$ .

Theorem 6.

Under certain regularity conditions, $(\hat{Γ}, \hat{V}, {\hat{v}}_{2}, \hat{v})$ are consistent for ${Γ, V, var ({\hat{τ}}_{2}), var (\hat{τ})}$ .

For RAL estimators, we can also use the classical nonparametric bootstrap based on resampling the IID observations ${(I_{i}, A_{i}, X_{i}, U_{i}, Y_{i}) : i \in S_{1}}$ and repeating the analysis as for the original data. The above bootstrap procedure based on resampling the linear forms are particularly useful for the matching estimator.

7.4. Connection With Missing Data

As a final remark, we express the proposed estimator in a linear form that has appeared in the missing data literature.

Proposition 2.

Under certain regularity conditions and Assumption 6, $\hat{τ}$ has an asymptotic linear form

n_{1}^{1 / 2} (\hat{τ} - τ) = n_{1}^{- 1 / 2} \sum_{i \in S_{1}} {\frac{I_{i}}{π_{i}} ψ_{i} - (\frac{I_{i}}{π_{i}} - 1) Γ V^{- 1} ϕ_{i}} + o_{P} (1)

(29)

where $ψ_{i}$ is $ψ (A_{i}, X_{i}, U_{i}, Y_{i})$ for RAL estimators and $ψ_{mat, i}$ for the matching estimator, and a similar definition applies to $ϕ_{i}$ . Under Assumption 1, π_i ≡ ρ.

Expression (29) is within a class of estimators in the missing data literature with the form

n_{1}^{1 / 2} (\hat{τ} - τ) = n_{1}^{- 1 / 2} \sum_{i \in S_{1}} {\frac{I_{i}}{π_{i}} s (A_{i}, X_{i}, U_{i}, Y_{i}) - (\frac{I_{i}}{π_{i}} - 1) κ (A_{i}, X_{i}, Y_{i})} + o_{P} (1),

(30)

where π = E(I | A, X, U, Y), s(A, X, U, Y) satisfies E{s(A, X, U, Y)} = 0, and s(A, X, U, Y) and $κ (A, X, Y)$ are square integrable. Given $s (A, X, U, Y)$ the optimal choice of $κ (A, X, Y) is κ_{opt} (A, X, Y) = E {s (A, X, U, Y) | A, X, Y}$ , which minimizes the asymptotic variance of (30) (Robins, Rotnitzky, and Zhao 1994; Wang et al. 2009). However, K_opt(A, X, Y) requires a correct specification of the missing data model $f (U | A, X, Y)$ . In our approach, instead of specifying the missing data model, we specify the error-prone estimators and utilize an estimator that is consistent for zero to improve the efficiency of the initial estimator. This is more attractive and closer to empirical practice than calculating K_opt(A, X, Y), because practitioners only need to apply their favorite estimators to the main and validation data using existing software. See also Chen and Chen (2000) for a similar discussion in the regression context under Assumption 1.

8. Discussion

Depending on the roles in statistical inference, there are two types of big data: one with large-sample sizes and the other with richer covariates. In our discussion, the main observational data have a larger sample size, and the validation observational data have more covariates. Although some counterexamples exist (Pearl 2009, 2010; Ding and Miratrix 2015; Ding, Vanderweele, and Robins 2017), it is more reliable to draw causal inference from the validation data. The proposed strategy is applicable even the number of covariates is high in the validation data. In this case, we can consider ${\hat{τ}}_{2}$ to be the double machine learning estimators (Chernozhukov et al. 2018) that use flexible machine learning methods for estimating regression and propensity score functions while retain the property in (4). Our framework allows for more efficient estimators of the causal effects by further combining information in the main data, without imposing any parametric models for the partially observed covariates. Coupled with the bootstrap, our estimators require only software implementations of standard estimators, and thus are attractive for practitioners who want to combine multiple observational data sources.

The key insight is to leverage an estimator of zero to improve the efficiency of the initial estimator. If a certain feature is transportable across datasets (Bareinboim and Pearl 2016), we can construct a consistent estimator of zero. We have shown that if the validation data are simple random samples from the main data, the distribution of (A, X, Y) is transportable from the validation data to the main data. We then construct a consistent estimator of zero by taking the difference of the estimators based on (A, X, Y) from the two datasets. In the presence of heterogeneity between two data sources, the transportability of the whole distribution of (A, X, Y) can be stringent. However, if we are willing to assume the conditional distribution of Y given (A, X) is transportable, we can then take the error prone estimators to be the regression coefficients of Y on (A, X) from the two datasets. As suggested by one of the reviewers, if the subgroups of two samples are comparable, we can construct the error prone estimators based on the subgroups. Similarly, this construction of error prone estimators can adapt to different transportability assumptions based on the subject matter knowledge.

In the worst case, the heterogeneity is intrinsic between the two samples, and we cannot construct two error prone estimators with the same probability limit. We can still conduct a sensitivity analysis combining two data. Instead of (4), we assume

n_{2}^{1 / 2} (\begin{matrix} {\hat{τ}}_{2} - τ \\ {\hat{τ}}_{2, ep} - {\hat{τ}}_{1, ep} - δ \end{matrix}) \to N {0_{L + 1}, (\begin{matrix} v_{2} & Γ^{T} \\ Γ & V \end{matrix})},

(31)

where δ is the sensitivity parameter, quantifying the systematic difference between ${\hat{τ}}_{2, ep}$ and ${\hat{τ}}_{1, ep}$ . The adjusted estimator becomes ${\hat{τ}}_{adj} (δ) = {\hat{τ}}_{2} - {\hat{Γ}}^{T} {\hat{V}}^{- 1} ({\hat{τ}}_{2, ep} - {\hat{τ}}_{1, e p} - δ)$ . With different values of δ, the estimator ${\hat{τ}}_{adj} (δ)$ can provide valuable insight on the impact of the heterogeneity of the two data, allowing an investigator to assess the extent to which the heterogeneity may alter causal inferences.

Supplementary Material

supp

NIHMS1050623-supplement-supp.zip^{(2.2MB, zip)}

Acknowledgments

We thank the editor, the associate editor, and four anonymous reviewers for suggestions which improved the article significantly. We are grateful to Professor Yi-Hau Chen for providing the data and offering help and advice in interpreting the data. Drs. Lo-Hua Yuan and Xinran Li offered helpful comments. Dr. Yang is partially supported by the National Science Foundation grant DMS 1811245, National Cancer Institute grant P01 CA142538, and Oak Ridge Associated Universities. Dr. Ding is partially supported by the National Science Foundation grant DMS 1713152.

Footnotes

Supplementary materials

The online supplementary material contains technical details and proofs. The R package “Integrative CI” is available at https://github.com/shuyang1987/IntegrativeCI to perform the proposed estimators.

References

Abadie A, and Imbens GW (2006), “Large Sample Properties of Matching Estimators for Average Treatment Effects,” Econometrica, 74, 235–267. [2,4,6,7] [Google Scholar]
Abadie A (2008), “On the Failure of the Bootstrap for Matching Estimators,” Econometrica, 76, 1537–1557. [6,7] [Google Scholar]
Abadie A (2016), “Matching on the Estimated Propensity Score,” Econometrica, 84, 781–807. [2] [Google Scholar]
Angrist JD, Imbens GW, and Rubin DB (1996), “Identification of Causal Effects Using Instrumental Variables,” Journal of American Statistical Association, 91, 444–455. [1] [Google Scholar]
Antonelli J, Zigler C, and Dominici F (2017), “Guided Bayesian Imputation to Adjust for Confounding When Combining Heterogeneous Data Sources in Comparative Effectiveness Research,” Biostatistics, 18, 553–568. [1] [DOI] [PMC free article] [PubMed] [Google Scholar]
Bang H, and Robins JM (2005), “Doubly Robust Estimation in Missing Data and Causal Inference Models,” Biometrics, 61, 962–973. [2,3] [DOI] [PubMed] [Google Scholar]
Bareinboim E, and Pearl J (2016), “Causal Inference and the data-fusion problem,” PNAS, 113,7345–7352. [13] [DOI] [PMC free article] [PubMed] [Google Scholar]
Bickel PJ, Klaassen C, Ritov Y, and Wellner J (1993), Efficient and Adaptive Inference in Semiparametric Models, Baltimore: Johns Hopkins University Press; [5] [Google Scholar]
Boonstra PS, Taylor JM, and Mukherjee B (2012), “Incorporating Auxiliary Information for Improved Prediction in High-dimensional Datasets: An Ensemble of Shrinkage Approaches,” Biostatistics, 14, 259–272. [5] [DOI] [PMC free article] [PubMed] [Google Scholar]
Breslow N, McNeney B, and Wellner JA (2003), “Large Sample Theory for Semiparametric Regression Models With Two-phase, Outcome Dependent Sampling,” Annals of Statistics, 31, 1110–1139. [11] [Google Scholar]
Cao W, Tsiatis AA, and Davidian M (2009), “Improving Efficiency and Robustness of the Doubly Robust Estimator for a Population Mean With Incomplete Data,” Biometrika, 96, 723–734. [2,3] [DOI] [PMC free article] [PubMed] [Google Scholar]
Chang T, and Kott PS (2008), “Using Calibration Weighting to Adjust for Nonresponse Under a Plausible Model,” Biometrika, 95, 555–571. [5] [Google Scholar]
Chatterjee N, Chen YH, Maas P, and Carroll RJ (2016), “Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-Level Information From External Big Data Sources,” Journal of American Statistical Association, 111, 107–117. [1,2,7] [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen J, and Sitter R (1999), “A Pseudo Empirical Likelihood Approach to the Effective Use of Auxiliary Information in Complex Surveys,” Statistica Sinica, 9 385–406. [2,5] [Google Scholar]
Chen J, Sitter R, and Wu C (2002), “Using Empirical Likelihood Methods to Obtain Range Restricted Weights in Regression Estimators for Surveys,” Biometrika, 89, 230–237. [2,5] [Google Scholar]
Chen SX, Leung DHY, and Qin J (2003), “Information Recovery in a Study With Surrogate Endpoints,” Journal of American Statistical Association, 98, 1052–1062. [2,5] [Google Scholar]
Chen YH (2002), “Cox Regression in Cohort Studies With Validation Sampling,” Journal of Royal Statistical Society, Series B, 64, 51–62. [5] [Google Scholar]
Chen YH, and Chen H (2000), “A Unified Approach to Regression Analysis Under Double-sampling Designs,” Journal of Royal Statistical Society, Series B, 62, 449–460. [5,13] [Google Scholar]
Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, and Robins J (2018), “Double/debiased Machine Learning for Treatment and Structural Parameters,” The Econometrics Journal, 21, C1–C68. [13] [Google Scholar]
Cochran WG (2007), Sampling Techniques (3rd ed), New York: Wiley; [1,2,8] [Google Scholar]
Collaboration FS (2009), “Systematically Missing Confounders in Individual Participant Data Meta-analysis of Observational Cohort Studies,” Statistics in Medicine, 28, 1218–1237. [5] [DOI] [PMC free article] [PubMed] [Google Scholar]
Crump R, Hotz VJ, Imbens G, and Mitnik O (2006), “Moving the Goalposts: Addressing Limited Overlap in the Estimation of Average Treatment Effects by Changing the Estimand,” Technical Report, 330, Cambridge, MA: National Bureau of Economic Research. Available at http://www.nber.org/papers/T0330 [7] [Google Scholar]
Deville J-C, and Särndal C-E (1992), “Calibration Estimators in Survey Sampling,” Journal of American Statistical Association, 87, 376–382. [5]. [Google Scholar]
Ding P, and Miratrix LW (2015), “To Adjustor Not to Adjust? Sensitivity Analysis of M-bias and Butterfly-bias,” Journal of Causal Inference, 3, 41–57. [13] [Google Scholar]
Ding P, Vanderweele T, and Robins J (2017), “Instrumental Variables as Bias Amplifiers With General Outcome and Confounding,” Biometrika, 104,291–302. [13] [DOI] [PMC free article] [PubMed] [Google Scholar]
Efron B, and Tibshirani R (1986), “Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy,” Statistical Science, 1, 54–75. [7] [Google Scholar]
Enders D, Kollhorst B, Engel S, Linder R, and Pigeot I (2018), “Comparison of Multiple Imputation and Two-phase Logistic Regression to Analyse Two-phase Case-control Studies With Rich Phase 1: A Simulation Study,” Journal of Statistical Computation and Simulation, 88, 2201–2214. [1] [Google Scholar]
Frangakis CE, and Rubin DB (1999), “Addressing Complications of Intention-to-treat Analysis in the Combined Presence of All-or-none Treatment-noncompliance and Subsequent Missing Outcomes,” Biometrika, 86, 365–379. [3] [Google Scholar]
Freedman DA (2008), “Randomization Does Not Justify Logistic Regression,” Statistical Science, 23, 237–249. [2] [Google Scholar]
Fuller WA (2009), Sampling Statistics, Hoboken, NJ: Wiley; [5] [Google Scholar]
Funk MJ, Westreich D, Wiesen C, Stürmer T, Brookhart MA, and Davidian M (2011), “Doubly Robust Estimation of Causal Effects,” American Journal of Epidemiology, 173,761–767. [6] [DOI] [PMC free article] [PubMed] [Google Scholar]
Hájek J (1971), Comment on “An Essay on the Logical Foundations of Survey Sampling, Part One,” by D. Basu, in Foundations of Survey Sampling, eds. Godambe VP and Sprott DA, Toronto: Holt, Rinehart, and Winston, p. 236 [3,11] [Google Scholar]
Hansen BB (2004), “Full Matching in an Observational Study of Coaching for the SAT,” Journal of American Statistical Association, 99, 609–618. [2] [Google Scholar]
Hansen LP (1982), “Large Sample Properties of Generalized Method of Moments Estimators,” Econometrica, 50, 1029–1054. [5] [Google Scholar]
Heckman JJ, Ichimura H and Todd PE (1997), “Matching as an Econometric Evaluation Estimator: Evidence From Evaluating a Job Training Programme,” Rev. Econ. Stud, 64, 605–654. [2] [Google Scholar]
Hirano K, Imbens GW, and Ridder G (2003), “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score,” Econometrica, 71, 1161–1189. [2] [Google Scholar]
Horvitz DG, and Thompson DJ (1952), “A Generalization of Sampling Without Replacement From a Finite Universe,” Journal of American Statistical Association, 47, 663–685. [2] [Google Scholar]
Imbens GW (2004), “Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review,” The Review of Economics and Statistics, 86, 4–29. [3] [Google Scholar]
Imbens GW, and Lancaster T (1994), “Combining Micro and Macro Data in Microeconometric Models,” The Review of Economic Studies, 61,655–680. [1,5] [Google Scholar]
Jin H, and Rubin DB (2008), “Principal Stratification for Causal Inference With Extended Partial Compliance,” Journal of American Statistical Association, 103, 101–111. [3] [Google Scholar]
Kim JK, Kwon Y and Paik MC (2016), “Calibrated Propensity Score Method for Survey Nonresponse in Cluster Sampling,” Biometrika, 103,461–473. [5] [DOI] [PubMed] [Google Scholar]
Kleiner A, Talwalkar A, Sarkar P, and Jordan MI (2014), “A Scalable Bootstrap for Massive Data,” Journal of Royal Statistical Society, Series B, 76, 795–816. [7] [Google Scholar]
Kott PS (2006), “Using Calibration Weighting to Adjust for Nonresponse and Coverage Errors,” Survey Methodology, 32, 133–142. [5] [Google Scholar]
Li F, Morgan KL, and Zaslavsky AM (2016), “Balancing Covariates Via Propensity Score Weighting,” Journal of American Statistical Association, 113,390–400. [7] [Google Scholar]
Lin HW, and Chen YH (2014), “Adjustment for Missing Confounders in Studies Based on Observational Databases: 2-stage Calibration Combining Propensity Scores From Primary and Validation Data,” American Journal of Epidemiology, 180,308–317. [1,2,9] [DOI] [PubMed] [Google Scholar]
Lumley T, Shaw PA, and Dai JY (2011), “Connections Between Survey Calibration Estimators and Semiparametric Models for Incomplete Data,” International Statistical Review, 79, 200–220. [5] [DOI] [PMC free article] [PubMed] [Google Scholar]
Lunceford JK, and Davidian M (2004), “Stratification and Weighting Via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study” Statistical Medicine, 23, 2937–2960. [6] [DOI] [PubMed] [Google Scholar]
Lunt M, Glynn RJ, Rothman KJ, Avorn J, and Stürmer T (2012), “Propensity Score Calibration in the Absence of Surrogacy,” American Journal of Epidemiology, 175, 1294–1302. [1] [DOI] [PMC free article] [PubMed] [Google Scholar]
McCandless LC, Richardson S, and Best N (2012), “Adjustment for Missing Confounders Using External Validation Data and Propensity Scores,” Journal of American Statistical Association, 107, 40–51. [1] [Google Scholar]
Newey WK (1990), “Semiparametric Efficiency Bounds,” Journal of Applied Econometrics, 5, 99–135. [5] [Google Scholar]
Neyman J (1923), Sur les applications de la thar des probabilities aux experiences Agaricales: Essay de principle. English translation of excerpts by Dabrowska, D. and Speed, T., Statistical Science, 5, 465–472. [2] [Google Scholar]
Neyman J (1938), “Contribution to the Theory of Sampling Human Populations,” Journal of American Statistical Association, 33, 101–116. [1,2] [Google Scholar]
Otsu T, and Rai Y (2016), “Bootstrap Inference of Matching Estimators for Average Treatment Effects,” Journal of American Statistical Association, 112, 1720–1732. [7] [Google Scholar]
Pearl J (2009), “Letter to the Editor: Remarks on the Method of Propensity Score,” Statistics in Medicine, 28, 1420–1423. [13] [DOI] [PubMed] [Google Scholar]
Pearl J (2010), “On a Class of Bias-amplifying Variables That Endanger Effect Estimates,” in The Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, eds. Grunwald P and Spirtes P, Corvallis, OR: Association for Uncertainty in Artificial Intelligence, pp. 425–432. [13] [Google Scholar]
Politis DN, Romano JP, and Wolf M (1999), Subsampling, New York: Springer-Verlag; [7] [Google Scholar]
Qin J (2000), “Combining Parametric and Empirical Likelihoods,” Biometrika, 87, 484–490. [2,5] [Google Scholar]
Robins JM, Rotnitzky A, and Zhao LP (1994), “Estimation of Regression Coefficients When Some Regressors Are Not Always Observed,” Journal of American Statistical Association, 89, 846–866. [2,5,13] [Google Scholar]
Rosenbaum PR (1989), “Optimal Matching for Observational Studies,” Journal of American Statistical Association, 84, 1024–1032. [2] [Google Scholar]
Rosenbaum PR (2002), Studies Observational (2nd ed), New York: Springer; [3] [Google Scholar]
Rosenbaum PR, and Rubin DB (1983a), “Assessing Sensitivity to an Unobserved Binary Covariate in an Observational Study With Binary Outcome,” Journal of Royal Statistical Society, Series B, 45, 212–218. [1] [Google Scholar]
Rosenbaum PR (1983b), “The Central Role of the Propensity Score in Observational Studies for Causal Effects, Biometrika, 70, 41–55. [2,3] [Google Scholar]
Rosenbaum PR (1973), “Matching to Remove Bias in Observational Studies,” Biometrics,29, 159–183. [2] [Google Scholar]
Rubin DB (1974). Estimating causal effects of treatments in randomized and nonrandomized studies., J. Educ. Psychol 66: 688–701. [2] [Google Scholar]
Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys, New York: Wiley; [9] [Google Scholar]
Rubin DB (2006). Matched Sampling for Causal Effects, New York: Cambridge University Press; [2,3] [Google Scholar]
Schneeweiss S, Glynn RJ, Tsai EH, Avorn J, and Solomon DH (2005), “Adjusting for Unmeasured Confounders in Pharmacoepidemiologic Claims Data Using External Information: The Example of Cox2 Inhibitors and Myocardial Infarction,” Epidemiology, 16, 17–24. [1] [DOI] [PubMed] [Google Scholar]
Shao J, and Tu D (2012), The Jackknife and Bootstrap, New York: Springer; [7] [Google Scholar]
Stuart EA (2010), “Matching Methods for Causal Inference: A Review and a Look Forward,” Statistical Science, 25, 1–21. [2] [DOI] [PMC free article] [PubMed] [Google Scholar]
Stürmer T, Schneeweiss S, Avorn J and Glynn RJ (2005). Adjusting effect estimates for unmeasured confounding with validation data using propensity score calibration, American Journal of Epidemiology, 162, 279–289. [1,9] [DOI] [PMC free article] [PubMed] [Google Scholar]
Stürmer T, Schneeweiss S, Rothman KJ, Avorn J, and Glynn RJ (2007), “Performance of Propensity Score Calibratio-a Simulation Study, American Journal of Epidemiology, 165, 1110–1118. [1] [DOI] [PMC free article] [PubMed] [Google Scholar]
Tsiatis A (2006), Semiparametric Theory and Missing Data, Springer, New York: [3] [Google Scholar]
Wang W, Scharfstein D, Tan Z, and MacKenzie EJ (2009), “Causal Inference in Outcome-Dependent Two-Phase Sampling Designs,” Journal of Royal Statistical Society, Series B, 71, 947–969. [1,2,11,13] [Google Scholar]
Wang X, and Wang Q (2015), “Semiparametric Linear Transformation Model With Differential Measurement Error and Validation Sampling,” Journal of Multivariate Analysis, 141,67–80. [5] [Google Scholar]
Wu C, and Sitter RR (2001), “A Model-Calibration Approach to Using Complete Auxiliary Information From Survey Data,” Journal of American Statistical Association, 96, 185–193. [5] [Google Scholar]
Yang YW, Chen YH, Wang KH, Wang CY, and Lin HW (2011), “Risk of Herpes Zoster Among Patients With Chronic Obstructive Pulmonary Disease: A Population-based Study,” Canadian Medical Association. Journal, 183, 275–280. [9] [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

supp

NIHMS1050623-supplement-supp.zip^{(2.2MB, zip)}

[R1] Abadie A, and Imbens GW (2006), “Large Sample Properties of Matching Estimators for Average Treatment Effects,” Econometrica, 74, 235–267. [2,4,6,7] [Google Scholar]

[R2] Abadie A (2008), “On the Failure of the Bootstrap for Matching Estimators,” Econometrica, 76, 1537–1557. [6,7] [Google Scholar]

[R3] Abadie A (2016), “Matching on the Estimated Propensity Score,” Econometrica, 84, 781–807. [2] [Google Scholar]

[R4] Angrist JD, Imbens GW, and Rubin DB (1996), “Identification of Causal Effects Using Instrumental Variables,” Journal of American Statistical Association, 91, 444–455. [1] [Google Scholar]

[R5] Antonelli J, Zigler C, and Dominici F (2017), “Guided Bayesian Imputation to Adjust for Confounding When Combining Heterogeneous Data Sources in Comparative Effectiveness Research,” Biostatistics, 18, 553–568. [1] [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] Bang H, and Robins JM (2005), “Doubly Robust Estimation in Missing Data and Causal Inference Models,” Biometrics, 61, 962–973. [2,3] [DOI] [PubMed] [Google Scholar]

[R7] Bareinboim E, and Pearl J (2016), “Causal Inference and the data-fusion problem,” PNAS, 113,7345–7352. [13] [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] Bickel PJ, Klaassen C, Ritov Y, and Wellner J (1993), Efficient and Adaptive Inference in Semiparametric Models, Baltimore: Johns Hopkins University Press; [5] [Google Scholar]

[R9] Boonstra PS, Taylor JM, and Mukherjee B (2012), “Incorporating Auxiliary Information for Improved Prediction in High-dimensional Datasets: An Ensemble of Shrinkage Approaches,” Biostatistics, 14, 259–272. [5] [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] Breslow N, McNeney B, and Wellner JA (2003), “Large Sample Theory for Semiparametric Regression Models With Two-phase, Outcome Dependent Sampling,” Annals of Statistics, 31, 1110–1139. [11] [Google Scholar]

[R11] Cao W, Tsiatis AA, and Davidian M (2009), “Improving Efficiency and Robustness of the Doubly Robust Estimator for a Population Mean With Incomplete Data,” Biometrika, 96, 723–734. [2,3] [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] Chang T, and Kott PS (2008), “Using Calibration Weighting to Adjust for Nonresponse Under a Plausible Model,” Biometrika, 95, 555–571. [5] [Google Scholar]

[R13] Chatterjee N, Chen YH, Maas P, and Carroll RJ (2016), “Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-Level Information From External Big Data Sources,” Journal of American Statistical Association, 111, 107–117. [1,2,7] [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] Chen J, and Sitter R (1999), “A Pseudo Empirical Likelihood Approach to the Effective Use of Auxiliary Information in Complex Surveys,” Statistica Sinica, 9 385–406. [2,5] [Google Scholar]

[R15] Chen J, Sitter R, and Wu C (2002), “Using Empirical Likelihood Methods to Obtain Range Restricted Weights in Regression Estimators for Surveys,” Biometrika, 89, 230–237. [2,5] [Google Scholar]

[R16] Chen SX, Leung DHY, and Qin J (2003), “Information Recovery in a Study With Surrogate Endpoints,” Journal of American Statistical Association, 98, 1052–1062. [2,5] [Google Scholar]

[R17] Chen YH (2002), “Cox Regression in Cohort Studies With Validation Sampling,” Journal of Royal Statistical Society, Series B, 64, 51–62. [5] [Google Scholar]

[R18] Chen YH, and Chen H (2000), “A Unified Approach to Regression Analysis Under Double-sampling Designs,” Journal of Royal Statistical Society, Series B, 62, 449–460. [5,13] [Google Scholar]

[R19] Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, and Robins J (2018), “Double/debiased Machine Learning for Treatment and Structural Parameters,” The Econometrics Journal, 21, C1–C68. [13] [Google Scholar]

[R20] Cochran WG (2007), Sampling Techniques (3rd ed), New York: Wiley; [1,2,8] [Google Scholar]

[R21] Collaboration FS (2009), “Systematically Missing Confounders in Individual Participant Data Meta-analysis of Observational Cohort Studies,” Statistics in Medicine, 28, 1218–1237. [5] [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] Crump R, Hotz VJ, Imbens G, and Mitnik O (2006), “Moving the Goalposts: Addressing Limited Overlap in the Estimation of Average Treatment Effects by Changing the Estimand,” Technical Report, 330, Cambridge, MA: National Bureau of Economic Research. Available at http://www.nber.org/papers/T0330 [7] [Google Scholar]

[R23] Deville J-C, and Särndal C-E (1992), “Calibration Estimators in Survey Sampling,” Journal of American Statistical Association, 87, 376–382. [5]. [Google Scholar]

[R24] Ding P, and Miratrix LW (2015), “To Adjustor Not to Adjust? Sensitivity Analysis of M-bias and Butterfly-bias,” Journal of Causal Inference, 3, 41–57. [13] [Google Scholar]

[R25] Ding P, Vanderweele T, and Robins J (2017), “Instrumental Variables as Bias Amplifiers With General Outcome and Confounding,” Biometrika, 104,291–302. [13] [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] Efron B, and Tibshirani R (1986), “Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy,” Statistical Science, 1, 54–75. [7] [Google Scholar]

[R27] Enders D, Kollhorst B, Engel S, Linder R, and Pigeot I (2018), “Comparison of Multiple Imputation and Two-phase Logistic Regression to Analyse Two-phase Case-control Studies With Rich Phase 1: A Simulation Study,” Journal of Statistical Computation and Simulation, 88, 2201–2214. [1] [Google Scholar]

[R28] Frangakis CE, and Rubin DB (1999), “Addressing Complications of Intention-to-treat Analysis in the Combined Presence of All-or-none Treatment-noncompliance and Subsequent Missing Outcomes,” Biometrika, 86, 365–379. [3] [Google Scholar]

[R29] Freedman DA (2008), “Randomization Does Not Justify Logistic Regression,” Statistical Science, 23, 237–249. [2] [Google Scholar]

[R30] Fuller WA (2009), Sampling Statistics, Hoboken, NJ: Wiley; [5] [Google Scholar]

[R31] Funk MJ, Westreich D, Wiesen C, Stürmer T, Brookhart MA, and Davidian M (2011), “Doubly Robust Estimation of Causal Effects,” American Journal of Epidemiology, 173,761–767. [6] [DOI] [PMC free article] [PubMed] [Google Scholar]

[R32] Hájek J (1971), Comment on “An Essay on the Logical Foundations of Survey Sampling, Part One,” by D. Basu, in Foundations of Survey Sampling, eds. Godambe VP and Sprott DA, Toronto: Holt, Rinehart, and Winston, p. 236 [3,11] [Google Scholar]

[R33] Hansen BB (2004), “Full Matching in an Observational Study of Coaching for the SAT,” Journal of American Statistical Association, 99, 609–618. [2] [Google Scholar]

[R34] Hansen LP (1982), “Large Sample Properties of Generalized Method of Moments Estimators,” Econometrica, 50, 1029–1054. [5] [Google Scholar]

[R35] Heckman JJ, Ichimura H and Todd PE (1997), “Matching as an Econometric Evaluation Estimator: Evidence From Evaluating a Job Training Programme,” Rev. Econ. Stud, 64, 605–654. [2] [Google Scholar]

[R36] Hirano K, Imbens GW, and Ridder G (2003), “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score,” Econometrica, 71, 1161–1189. [2] [Google Scholar]

[R37] Horvitz DG, and Thompson DJ (1952), “A Generalization of Sampling Without Replacement From a Finite Universe,” Journal of American Statistical Association, 47, 663–685. [2] [Google Scholar]

[R38] Imbens GW (2004), “Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review,” The Review of Economics and Statistics, 86, 4–29. [3] [Google Scholar]

[R39] Imbens GW, and Lancaster T (1994), “Combining Micro and Macro Data in Microeconometric Models,” The Review of Economic Studies, 61,655–680. [1,5] [Google Scholar]

[R40] Jin H, and Rubin DB (2008), “Principal Stratification for Causal Inference With Extended Partial Compliance,” Journal of American Statistical Association, 103, 101–111. [3] [Google Scholar]

[R41] Kim JK, Kwon Y and Paik MC (2016), “Calibrated Propensity Score Method for Survey Nonresponse in Cluster Sampling,” Biometrika, 103,461–473. [5] [DOI] [PubMed] [Google Scholar]

[R42] Kleiner A, Talwalkar A, Sarkar P, and Jordan MI (2014), “A Scalable Bootstrap for Massive Data,” Journal of Royal Statistical Society, Series B, 76, 795–816. [7] [Google Scholar]

[R43] Kott PS (2006), “Using Calibration Weighting to Adjust for Nonresponse and Coverage Errors,” Survey Methodology, 32, 133–142. [5] [Google Scholar]

[R44] Li F, Morgan KL, and Zaslavsky AM (2016), “Balancing Covariates Via Propensity Score Weighting,” Journal of American Statistical Association, 113,390–400. [7] [Google Scholar]

[R45] Lin HW, and Chen YH (2014), “Adjustment for Missing Confounders in Studies Based on Observational Databases: 2-stage Calibration Combining Propensity Scores From Primary and Validation Data,” American Journal of Epidemiology, 180,308–317. [1,2,9] [DOI] [PubMed] [Google Scholar]

[R46] Lumley T, Shaw PA, and Dai JY (2011), “Connections Between Survey Calibration Estimators and Semiparametric Models for Incomplete Data,” International Statistical Review, 79, 200–220. [5] [DOI] [PMC free article] [PubMed] [Google Scholar]

[R47] Lunceford JK, and Davidian M (2004), “Stratification and Weighting Via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study” Statistical Medicine, 23, 2937–2960. [6] [DOI] [PubMed] [Google Scholar]

[R48] Lunt M, Glynn RJ, Rothman KJ, Avorn J, and Stürmer T (2012), “Propensity Score Calibration in the Absence of Surrogacy,” American Journal of Epidemiology, 175, 1294–1302. [1] [DOI] [PMC free article] [PubMed] [Google Scholar]

[R49] McCandless LC, Richardson S, and Best N (2012), “Adjustment for Missing Confounders Using External Validation Data and Propensity Scores,” Journal of American Statistical Association, 107, 40–51. [1] [Google Scholar]

[R50] Newey WK (1990), “Semiparametric Efficiency Bounds,” Journal of Applied Econometrics, 5, 99–135. [5] [Google Scholar]

[R51] Neyman J (1923), Sur les applications de la thar des probabilities aux experiences Agaricales: Essay de principle. English translation of excerpts by Dabrowska, D. and Speed, T., Statistical Science, 5, 465–472. [2] [Google Scholar]

[R52] Neyman J (1938), “Contribution to the Theory of Sampling Human Populations,” Journal of American Statistical Association, 33, 101–116. [1,2] [Google Scholar]

[R53] Otsu T, and Rai Y (2016), “Bootstrap Inference of Matching Estimators for Average Treatment Effects,” Journal of American Statistical Association, 112, 1720–1732. [7] [Google Scholar]

[R54] Pearl J (2009), “Letter to the Editor: Remarks on the Method of Propensity Score,” Statistics in Medicine, 28, 1420–1423. [13] [DOI] [PubMed] [Google Scholar]

[R55] Pearl J (2010), “On a Class of Bias-amplifying Variables That Endanger Effect Estimates,” in The Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, eds. Grunwald P and Spirtes P, Corvallis, OR: Association for Uncertainty in Artificial Intelligence, pp. 425–432. [13] [Google Scholar]

[R56] Politis DN, Romano JP, and Wolf M (1999), Subsampling, New York: Springer-Verlag; [7] [Google Scholar]

[R57] Qin J (2000), “Combining Parametric and Empirical Likelihoods,” Biometrika, 87, 484–490. [2,5] [Google Scholar]

[R58] Robins JM, Rotnitzky A, and Zhao LP (1994), “Estimation of Regression Coefficients When Some Regressors Are Not Always Observed,” Journal of American Statistical Association, 89, 846–866. [2,5,13] [Google Scholar]

[R59] Rosenbaum PR (1989), “Optimal Matching for Observational Studies,” Journal of American Statistical Association, 84, 1024–1032. [2] [Google Scholar]

[R60] Rosenbaum PR (2002), Studies Observational (2nd ed), New York: Springer; [3] [Google Scholar]

[R61] Rosenbaum PR, and Rubin DB (1983a), “Assessing Sensitivity to an Unobserved Binary Covariate in an Observational Study With Binary Outcome,” Journal of Royal Statistical Society, Series B, 45, 212–218. [1] [Google Scholar]

[R62] Rosenbaum PR (1983b), “The Central Role of the Propensity Score in Observational Studies for Causal Effects, Biometrika, 70, 41–55. [2,3] [Google Scholar]

[R63] Rosenbaum PR (1973), “Matching to Remove Bias in Observational Studies,” Biometrics,29, 159–183. [2] [Google Scholar]

[R64] Rubin DB (1974). Estimating causal effects of treatments in randomized and nonrandomized studies., J. Educ. Psychol 66: 688–701. [2] [Google Scholar]

[R65] Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys, New York: Wiley; [9] [Google Scholar]

[R66] Rubin DB (2006). Matched Sampling for Causal Effects, New York: Cambridge University Press; [2,3] [Google Scholar]

[R67] Schneeweiss S, Glynn RJ, Tsai EH, Avorn J, and Solomon DH (2005), “Adjusting for Unmeasured Confounders in Pharmacoepidemiologic Claims Data Using External Information: The Example of Cox2 Inhibitors and Myocardial Infarction,” Epidemiology, 16, 17–24. [1] [DOI] [PubMed] [Google Scholar]

[R68] Shao J, and Tu D (2012), The Jackknife and Bootstrap, New York: Springer; [7] [Google Scholar]

[R69] Stuart EA (2010), “Matching Methods for Causal Inference: A Review and a Look Forward,” Statistical Science, 25, 1–21. [2] [DOI] [PMC free article] [PubMed] [Google Scholar]

[R70] Stürmer T, Schneeweiss S, Avorn J and Glynn RJ (2005). Adjusting effect estimates for unmeasured confounding with validation data using propensity score calibration, American Journal of Epidemiology, 162, 279–289. [1,9] [DOI] [PMC free article] [PubMed] [Google Scholar]

[R71] Stürmer T, Schneeweiss S, Rothman KJ, Avorn J, and Glynn RJ (2007), “Performance of Propensity Score Calibratio-a Simulation Study, American Journal of Epidemiology, 165, 1110–1118. [1] [DOI] [PMC free article] [PubMed] [Google Scholar]

[R72] Tsiatis A (2006), Semiparametric Theory and Missing Data, Springer, New York: [3] [Google Scholar]

[R73] Wang W, Scharfstein D, Tan Z, and MacKenzie EJ (2009), “Causal Inference in Outcome-Dependent Two-Phase Sampling Designs,” Journal of Royal Statistical Society, Series B, 71, 947–969. [1,2,11,13] [Google Scholar]

[R74] Wang X, and Wang Q (2015), “Semiparametric Linear Transformation Model With Differential Measurement Error and Validation Sampling,” Journal of Multivariate Analysis, 141,67–80. [5] [Google Scholar]

[R75] Wu C, and Sitter RR (2001), “A Model-Calibration Approach to Using Complete Auxiliary Information From Survey Data,” Journal of American Statistical Association, 96, 185–193. [5] [Google Scholar]

[R76] Yang YW, Chen YH, Wang KH, Wang CY, and Lin HW (2011), “Risk of Herpes Zoster Among Patients With Chronic Obstructive Pulmonary Disease: A Population-based Study,” Canadian Medical Association. Journal, 183, 275–280. [9] [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Combining Multiple Observational Data Sources to Estimate Causal Effects

Shu Yang

Peng Ding

Abstract

1. Introduction

2. Basic Setup

2.1. Notation: Causal Effect and Two Data Sources

Assumption 1.

Example 1.

Example 2.

2.2. Identification and Model Assumptions

Assumption 2 (Ignorability).

Assumption 3 (Overlap).

Assumption 4 (Outcome model).

Assumption 5 (Propensity score model).

3. Methodology and Important Estimators

3.1. Review of Commonly Used Estimators Based on Validation Data

Example 3 (Regression imputation).

Example 4 (Inverseprobability weighting).

Example 5 (Augmented inverse probability weighting).

Example 6 (Matching).

3.2. A General Strategy

Proposition 1.

Remark 1.

Remark 2.

Remark 3.

3.3. Regular Asymptotically Linear (RAL) Estimators

Theorem 1.

Lemma 1 (Augmented inverse probability weighting).

3.4. Matching Estimators

Theorem 2.

3.5. Bootstrap Variance Estimation

Theorem 3.

Remark 4.

Remark 5.

4. Extensions

4.1. Other Causal Estimands

4.2. Design Issue: Optimal Sample Size Allocation

4.3. Multiple Data Sources

5. Simulation

Figure 1.

6. Application

Table 1.

7. Relaxing Assumption 1

Assumption 6.

7.1. RAL estimators

Theorem 4.

7.2. Matching Estimators

Theorem 5.

7.3. A Bootstrap Variance Estimation Procedure

Theorem 6.

7.4. Connection With Missing Data

Proposition 2.

8. Discussion

Supplementary Material

Acknowledgments

Footnotes

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases