Abstract
Integrating data from multiple heterogeneous sources has become increasingly popular to achieve a large sample size and diverse study population. This paper reviews developments in causal inference methods that combine multiple datasets collected by potentially different designs from potentially heterogeneous populations. We summarize recent advances on combining randomized clinical trials with external information from observational studies or historical controls, combining samples when no single sample has all relevant variables with application to two-sample Mendelian randomization, distributed data settings under privacy concerns for comparative effectiveness and safety research using real-world data, Bayesian causal inference, and causal discovery methods.
Keywords: Causal inference, data integration, data fusion, generalizability, transportability
1 |. INTRODUCTION
The availability of multiple datasets collected by different designs from heterogeneous populations has brought emerging challenges and opportunities for causal inference. Integrating data from multiple sources to facilitate causal inference has become increasingly popular. For example, the randomized clinical trial (RCT) has been the gold standard for causal inference but often suffers from insufficient sample size and a homogeneous study population due to inclusion/exclusion criteria. Results from RCTs may not be generalizable to a real-world population. In contrast, an observational study typically offers a diverse sample representative of the target population with a large sample size but often suffers from unmeasured confounding. Combining data from both designs allows one to extend causal inference from an RCT to a target population, to correct for bias in observational studies, and to improve efficiency (Colnet et al., 2020). Another prominent example is when no single dataset contains all relevant variables, that is, there are no complete data for any subject. In this case, identification becomes difficult even for parameters that are straightforward to identify with complete data (Ridder and Moffitt, 2007). This is typical in survey sample combination where variables collected in each survey may differ (Yang and Kim, 2020). This is also the case in two-sample instrumental variable methods, which are widely applied in Mendelian randomization studies where individual-level genetic data are not available due to privacy concerns (Angrist and Krueger, 1992).
In this paper, we review selected literature on data integration methods in causal inference. Recent review studies focused on combining randomized and observational data (Colnet et al., 2020; Degtiar and Rose, 2021) and data combination in survey sampling (Ridder and Moffitt, 2007; Yang and Kim, 2020). We aim to provide a more systematic review and cover a range of research areas. We start with notation and introduce key assumptions and concepts that frequently appear in the literature in Section 2. We then summarize recent methodological advances in integrating data from RCTs and observational studies in Section 3, and combining data when no single sample has all relevant variables in Section 4. We briefly review the literature on data integration for causal discovery, distributed data analysis for privacy protection, and Bayesian methods for integrated causal inference in Section 5. We close with a discussion in Section 6.
2 |. PRELIMINARIES
In this section, we briefly introduce the potential outcome framework and review key concepts in causal inference and data integration. Let A denote a binary treatment (1: treated, 0: untreated), Y denote an observed outcome, and X denote a vector of measured covariates. When all circumstances are the same except for the treatment status, any difference observed in the outcomes has to be attributed to the treatment. Correspondingly, for each subject we define a pair of potential outcomes, (Y (1),Y (0)), that would be observed if the subject had been given treatment, Y (1), and control, Y (0) (Rubin, 1974), under the stable unit treatment value assumption that there is no interference between units and no multiple versions of treatment (Rubin, 1980). As such, the observed outcome is equal to the potential outcome corresponding to the subject’s treatment condition, that is
Assumption 1 (Consistency) Y = Y (a) if A = a, for a = 0 or 1.
A fundamental problem in causal inference is that for each subject, we can only observe one of the potential outcomes. Because it is impossible to compute the difference between Y (1) and Y (0) for a specific subject, we often specify a target population of interest, and study the mean difference in the target population, referred to as the average treatment effect (ATE). In practice, we cannot observe data on all subjects in the prespecified target population but rather data on a sample of subjects referred to as the study sample. Let S be a binary indicator of whether a subject is selected into the study sample (1: sampled, 0: not sampled). It is important to note that the ATE is population-specific. In fact, we can define multiple ATEs each with respect to a different target population as follows:
$$
\tau = E[Y(1) - Y(0)], \qquad \tau_1 = E[Y(1) - Y(0) \mid S = 1], \qquad \tau_0 = E[Y(1) - Y(0) \mid S = 0].
$$
For example, the ATE is τ if the combined S = 1 and S = 0 sample is a random sample of the target population. The ATE estimated based on the study sample, i.e., the S = 1 sample, is an estimate of τ1, which is not necessarily equal to τ because the study sample is not necessarily a representative sample of the target population. Identification of the ATE, which is a function of the potential outcome distribution in a target population, involves expressing it as a function of the observed data distribution, such that distinct data generating mechanisms lead to distinct values.
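To make the population-specific nature of the ATE concrete, the following minimal sketch computes the average effect in the combined population, among the sampled (S = 1), and among the non-sampled (S = 0) for a hypothetical population with one binary covariate. All numbers, and the names `p_x`, `p_s1_given_x`, `cate`, and `average_effect`, are illustrative assumptions rather than anything taken from the reviewed literature.

```python
# Hypothetical population with one binary covariate X (all numbers are made up).
p_x = {0: 0.5, 1: 0.5}            # P(X = x) in the combined population
p_s1_given_x = {0: 0.2, 1: 0.6}   # P(S = 1 | X = x): selection into the study sample
cate = {0: 1.0, 1: 2.0}           # E[Y(1) - Y(0) | X = x], assumed not to depend on S given X


def average_effect(weights):
    """Average the stratum-specific effects over a (possibly unnormalized) distribution of X."""
    total = sum(weights.values())
    return sum(weights[x] / total * cate[x] for x in weights)


# Joint weights P(X = x, S = s), proportional to P(X = x | S = s).
w_s1 = {x: p_x[x] * p_s1_given_x[x] for x in p_x}
w_s0 = {x: p_x[x] * (1.0 - p_s1_given_x[x]) for x in p_x}

tau = average_effect(p_x)     # ATE in the combined population
tau1 = average_effect(w_s1)   # ATE among the sampled (S = 1)
tau0 = average_effect(w_s0)   # ATE among the non-sampled (S = 0)
print(tau, tau1, tau0)
```

Because selection favors the stratum with the larger effect, the three averages differ (roughly 1.5, 1.75, and 1.33 here), even though the stratum-specific effects are the same in every population.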
To identify the ATE, ideally we would like to observe Y (a) of all subjects in the target population to compute E [Y (a)], for a = 0 or 1. However, both the sample selection mechanism and the treatment assignment mechanism lead to missingness in Y (a): generally Y (a)’s are missing for all subjects in the S = 0 sample (sample selection); in the S = 1 sample, Y (a)’s are unobserved for subjects in the other treatment arm with A = a′, a′ ≠ a (treatment assignment). Confounding bias, also referred to as violation of internal validity, occurs when factors that impact treatment assignment also predict the outcome, such that the observed Y (a)’s in the A = a arm cannot represent the missing Y (a)’s in the A = a′ arm, i.e. E [Y (a) | A = a, S = 1] ≠ E [Y (a) | A = a′, S = 1]. Selection bias, also referred to as violation of external validity, occurs when factors that impact sample selection also predict the outcome, such that the observed Y (a)’s in the S = 1 sample cannot represent the missing Y (a)’s in the S = 0 sample. A less stringent condition targeting treatment effect estimation, i.e., the mean difference rather than the mean, defines selection bias as when factors that impact sample selection also modify the treatment effect, i.e., E [Y (1) −Y (0) | S = 1] ≠ E [Y (1) −Y (0) | S = 0] (Lesko et al., 2017; Stuart et al., 2011). Collider-stratification bias may also occur due to conditioning the analysis on the study sample, if S is a common consequence of the treatment (or a predictor of the treatment) and the outcome (or a predictor of the outcome) (Greenland, 2003).
Two key assumptions about the treatment assignment mechanism are often imposed, which we refer to as treatment exchangeability and positivity. The treatment exchangeability assumption states that within a stratum of X, Y (a) of subjects in the A = a arm can be exchanged with Y (a) of subjects in the A = a′ arm:
Assumption 2 (Treatment exchangeability) Y (a) ⫫ A | X, S = 1 for a = 0 or 1.
Assumption 2 allows us to represent the conditional distribution of the unobserved potential outcome using that of the observed potential outcome. We thus have that for a = 0 or 1,
$$
E[Y(a) \mid X, S = 1] = E[Y(a) \mid A = a, X, S = 1]. \tag{1}
$$
Eq. (1) has also been used as a weaker version of Assumption 2. Within each stratum of the covariates sufficient for treatment exchangeability, we also need a nonzero probability of observing subjects in both treatment arms:
Assumption 3 (Treatment positivity) P (A = a |X, S = 1) > 0 for all a almost surely.
Often P (A = 1|X, S = 1) is referred to as the propensity score. Note that Assumptions 2 and 3 are conditional on the study sample, thus the set of covariates X sufficient for Assumptions 2 and 3 to hold may include variables beyond common causes of treatment and outcome, i.e., the typical confounders. For example, a covariate that causes selection S and outcome but is independent of the treatment can become a confounder if the treatment also causes selection. This is a consequence of collider-stratification bias where conditioning on S results in a spurious association between the treatment and the covariate.
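The collider-stratification mechanism described above can be checked with a small exact calculation. The sketch below assumes a hypothetical selection mechanism in which treatment A and a covariate C are independent fair coins and both raise the probability of selection; the function names and all probabilities are illustrative assumptions.

```python
# Hypothetical mechanism: A and C are independent Bernoulli(0.5) variables,
# and both increase the probability of selection into the study sample.
def p_select(a, c):
    """P(S = 1 | A = a, C = c); the coefficients are made up for illustration."""
    return 0.1 + 0.4 * a + 0.4 * c


def p_c1_given_a(a, condition_on_selected):
    """P(C = 1 | A = a), optionally also conditioning on S = 1."""
    num = den = 0.0
    for c in (0, 1):
        weight = 0.5  # P(C = c | A = a) = 0.5 because A and C are independent
        if condition_on_selected:
            weight *= p_select(a, c)  # Bayes' rule: reweight by P(S = 1 | A = a, C = c)
        den += weight
        num += weight * c
    return num / den


marginal = (p_c1_given_a(0, False), p_c1_given_a(1, False))
selected = (p_c1_given_a(0, True), p_c1_given_a(1, True))
print(marginal, selected)
```

In the full population the distribution of C is the same in both arms, but among the selected it differs between arms (about 0.83 versus 0.64 here), so C would need to be included in X for treatment exchangeability to hold conditional on S = 1.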
Besides conditions that ensure internal validity, two further key assumptions about the sample selection mechanism are often imposed to ensure external validity, which we refer to as selection exchangeability and positivity, in analogy to Assumptions 2–3 (Stuart et al., 2011; Lesko et al., 2017; Dahabreh et al., 2020b).
Assumption 4 (Selection exchangeability) Y (a) ⫫ S | X for a = 0 or 1.
Assumption 4 allows us to generalize the conditional distribution of the potential outcome from the study sample to a target population, such as the one represented by the S = 0 sample or the combination of S = 0 and S = 1 sample:
$$
E[Y(a) \mid X] = E[Y(a) \mid X, S = 1] = E[Y(a) \mid X, S = 0]. \tag{2}
$$
Weaker versions of the selection exchangeability assumption include (I) mean conditional exchangeability, i.e. Eq. (2) and (II) all treatment effect modifiers are measured, i.e., E [Y (1) −Y (0) | X] = E [Y (1) −Y (0) | X, S = 1]. We further assume that variables required for selection exchangeability do not serve as study eligibility criteria that completely exclude certain subjects from the study sample.
Assumption 5 (Selection positivity) P (S = s |X) > 0 for all s almost surely.
For example, suppose geographic location restricts study participation such that there is zero probability of selecting subjects in certain areas. Assumption 5 then requires that geographic location is not needed for Assumption 4, i.e., conditional on X, geographic location is not associated with the outcome or does not modify the treatment effect.
3 |. COMBINING A RANDOMIZED CLINICAL TRIAL WITH EXTERNAL DATA
There is a rich literature on combining information from both experimental and non-experimental designs and bridging findings from an RCT to a target population (Cole and Stuart, 2010; O’Muircheartaigh and Hedges, 2014; Tipton, 2013; Hartman et al., 2015; Lesko et al., 2017; Rudolph and van der Laan, 2017; Westreich et al., 2017; Buchanan et al., 2018; Dahabreh and Hernán, 2019; Dahabreh et al., 2019a,b,c,d,e, 2020a,b; Dong et al., 2020). In this setting, S = 1 indicates the sample of trial participants, and we observe (Y, A, X, S = 1) in the RCT data. Due to randomization or stratified randomization, the propensity score P (A = a | X, S = 1) is a known function designed by the investigator, and Assumptions 2–3 naturally hold in an RCT with X being the variables defining the strata.
Two problems are frequently studied: generalizability (Cole and Stuart, 2010; Stuart et al., 2011; Dahabreh et al., 2019d; Buchanan et al., 2018) and transportability (Pearl and Bareinboim, 2014; Bareinboim and Pearl, 2016; Westreich et al., 2017; Rudolph and van der Laan, 2017; Hünermund and Bareinboim, 2019). The distinction between the two concepts is well summarized in Dahabreh and Hernán (2019) and Degtiar and Rose (2021): generalizability focuses on the setting when the study sample is a subset of the target population, and transportability considers the setting when the study sample and the target population are partially- or non-overlapping. An example of the generalizability problem is: suppose the target population is the trial-eligible population, and the combined S = 1 and S = 0 sample is a random sample of the target population, in which trial participants are in the S = 1 sample and non-participants are in the S = 0 sample. In this case, the target ATE is τ and we would like to generalize inference about τ1 obtained from the trial data to τ. An example of the transportability problem is: suppose the target population is a real-world population, and S = 0 sample is a random sample of the target population separately obtained from external data sources such as administrative healthcare databases or survey studies. In this case, the target ATE is τ0 and we would like to transport inference about τ1 to τ0.
Both problems require some information in the S = 0 sample, and often two scenarios are considered: (S1) covariates are measured on all individuals in the S = 0 sample, i.e., we observe (X, S = 0); (S2) covariates are measured on a subsample of the S = 0 sample, i.e., we observe (X, S = 0, D = 1), where D indicates whether we have data on X. In scenario (S2), it is often assumed that D ⫫ (Y, A, X) | S such that P (D = 1 | Y, A, X, S) = P (D = 1 | S). That is, (X, S = 0, D = 1) is a simple random sample of the S = 0 sample with two possibilities: (S2.1) P (D = 1 | S = 0) is a known constant; (S2.2) P (D = 1 | S = 0) is an unknown constant. Dahabreh et al. (2019a) and Dahabreh et al. (2019e) showed that τ is not identifiable under (S2.2), while τ0 is always identifiable in (S2).
3.1 |. Generalizability and transportability methods
In this section, we review three common strategies for identification and estimation of E [Y (a)] (generalizability) and E [Y (a) | S = 0] (transportability) for a = 0 or 1. Correspondingly, the ATE τ and τ0 can be directly obtained based on E [Y (a)] and E [Y (a) | S = 0] by definition. To illustrate the methods, we take scenario (S1) as an example where we observe $\{(Y_i, A_i, X_i, S_i = 1) : i = 1, \ldots, n_1\}$ and $\{(X_i, S_i = 0) : i = n_1 + 1, \ldots, n\}$ from a total of n = n1 + n0 subjects. We summarize the methods under all scenarios in Table 1.
TABLE 1.
Method | Generalizability (E [Y (a)]) | Transportability (E [Y (a) \| S = 0]) |
---|---|---|
(S1) Covariates are measured on all individuals in the S = 0 sample, i.e., we have (X, S = 0) | | |
OR | E {ma (X)} | E {ma (X) \| S = 0} |
IPW | (1) | (2) |
AIPW | | |
(S2.1) Covariates are measured on a subsample of the S = 0 sample and P (D = 1 \| S = 0) is known | | |
OR | (3) | E {ma (X) \| S = 0, D = 1} (4) |
IPW | (5) | (6) |
AIPW | | |
(S2.2) Covariates are measured on a subsample of the S = 0 sample and P (D = 1 \| S = 0) is unknown | | |
OR | Not identifiable (7) | E {ma (X) \| S = 0, D = 1} |
IPW | Not identifiable (8) | (9) |
AIPW | Not identifiable | |

(1) P (S = 1, A = a | X) = P (A = a | S = 1, X)P (S = 1 | X), where P (A = a | S = 1, X) is designed by the investigator in an RCT and can also be estimated based on the S = 1 sample, and P (S = 1 | X) is identified from the combined sample.
(2) …, hence ….
(3) ….
(4) E {ma (X) | S = 0} = E {ma (X) | S = 0, D = 1} because D ⫫ X | S.
(5) P (S = 1 | X) is identified by …. Estimation strategies are proposed in Dahabreh et al. (2019a).
(6) Similar to footnote (5), P (S = 1) is identified by ….
(7) Unlike footnote (3), P (D = 1 | S) is not identifiable because P (D = 1 | S = 0) is unknown.
(8) Unlike footnote (5), P (S = 1 | X) is not identifiable because P (D = 1 | S = 0) is unknown.
3.1.1 |. Outcome regression (OR)
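Let ma (x) = E [Y | A = a, X = x, S = 1] denote the conditional mean outcome in the study sample and let $\hat{m}_a(x)$ denote an estimated model fitted using the trial data $\{(Y_i, A_i, X_i, S_i = 1) : i = 1, \ldots, n_1\}$. Under Assumptions 1–5, we have the following identification result
$$
E[Y(a)] = E\{m_a(X)\} = \int m_a(x) f(x)\, dx, \qquad E[Y(a) \mid S = 0] = E\{m_a(X) \mid S = 0\} = \int m_a(x) f(x \mid S = 0)\, dx. \tag{3}
$$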
Both f (x) and f (x | S = 0) are identifiable in scenario (S1) where we have observed X on all individuals. Therefore, we can marginalize over the empirical distribution of X in the combined sample and the S = 0 sample, respectively, which gives the following outcome regression estimators (Lesko et al., 2017; Dahabreh et al., 2019d,e)
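$$
\hat{E}[Y(a)] = \frac{1}{n} \sum_{i=1}^{n} \hat{m}_a(X_i), \qquad \hat{E}[Y(a) \mid S = 0] = \frac{1}{n_0} \sum_{i=1}^{n} (1 - S_i)\, \hat{m}_a(X_i), \tag{4}
$$
where $n_0 = \sum_{i=1}^{n} (1 - S_i)$. Eq. (3) has been referred to as the g-formula (Greenland and Robins, 1986; Robins, 1986) or standardization (Vansteelandt and Keiding, 2011) in epidemiology, and can also be viewed as imputation in the missing data literature (Cheng, 1994).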
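The outcome regression strategy of this section can be sketched on a small deterministic dataset under scenario (S1). The sketch assumes a single binary covariate, so that the outcome model is simply a stratum mean among trial participants; the data layout, the helper names, and every number are illustrative assumptions.

```python
# Toy data for scenario (S1); every number below is illustrative.
# Record layout: (S, A, X, Y). A and Y are None when S = 0 (only X is observed).
TRUE_MEAN = {(1, 0): 1.0, (1, 1): 3.0, (0, 0): 0.0, (0, 1): 1.0}  # m_a(x), keyed by (a, x)


def make_data():
    data = []
    for x, n_trial, n_external in [(0, 100, 400), (1, 300, 200)]:
        for j in range(n_trial):
            a = j % 2                                   # alternate arms (1:1 allocation)
            noise = 1.0 if (j // 2) % 2 == 0 else -1.0  # balanced +/-1 noise within each arm
            data.append((1, a, x, TRUE_MEAN[(a, x)] + noise))
        data.extend((0, None, x, None) for _ in range(n_external))
    return data


def fit_outcome_model(data):
    """Nonparametric m_a(x): mean of Y among trial participants with A = a and X = x."""
    sums, counts = {}, {}
    for s, a, x, y in data:
        if s == 1:
            sums[(a, x)] = sums.get((a, x), 0.0) + y
            counts[(a, x)] = counts.get((a, x), 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def or_estimates(data, a):
    """Average the fitted m_a(X) over the combined sample and over the S = 0 sample."""
    m_hat = fit_outcome_model(data)
    everyone = [m_hat[(a, x)] for _, _, x, _ in data]
    external = [m_hat[(a, x)] for s, _, x, _ in data if s == 0]
    return sum(everyone) / len(everyone), sum(external) / len(external)


data = make_data()
gen1, trans1 = or_estimates(data, 1)
gen0, trans0 = or_estimates(data, 0)
tau_hat, tau0_hat = gen1 - gen0, trans1 - trans0
print(tau_hat, tau0_hat)
```

With a continuous or high-dimensional X, the stratum means would be replaced by a fitted regression model, and the quality of the estimates would then depend on that model being correctly specified.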
3.1.2 |. Inverse probability weighting (IPW)
Inverse probability weighting is a very commonly used technique (Cole and Stuart, 2010; Lesko et al., 2017; Westreich et al., 2017; Dahabreh et al., 2019d,e). Note that the g-formula in Eq. (3) can be re-expressed as follows
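$$
E[Y(a)] = E\left[ \frac{I(S = 1, A = a)\, Y}{P(S = 1, A = a \mid X)} \right], \qquad E[Y(a) \mid S = 0] = \frac{1}{P(S = 0)} E\left[ \frac{I(S = 1, A = a)\, Y\, P(S = 0 \mid X)}{P(S = 1, A = a \mid X)} \right], \tag{5}
$$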
where P (S = 1, A = a | X) = P (A = a | S = 1, X)P (S = 1 | X). The propensity score, P (A = a | S = 1, X), is a known function designed by the investigator in an RCT, while the trial participation probability P (S = 1 | X) can be estimated in the combined sample because X is fully observed under (S1). We arrive at the following inverse probability weighted estimators
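$$
\hat{E}[Y(a)] = \frac{1}{n} \sum_{i=1}^{n} \frac{I(S_i = 1, A_i = a)\, Y_i}{\hat{P}(S = 1, A = a \mid X_i)}, \qquad \hat{E}[Y(a) \mid S = 0] = \frac{1}{n_0} \sum_{i=1}^{n} \frac{I(S_i = 1, A_i = a)\, Y_i\, \hat{P}(S = 0 \mid X_i)}{\hat{P}(S = 1, A = a \mid X_i)}, \tag{6}
$$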
where $\hat{P}(S = 0 \mid X) = 1 - \hat{P}(S = 1 \mid X)$ and $\hat{P}(S = 1, A = a \mid X) = \hat{P}(A = a \mid S = 1, X)\, \hat{P}(S = 1 \mid X)$ is a product of the estimated treatment and trial participation probabilities. Although the propensity score is known, estimating the model parameters rather than using the true value can improve efficiency (Robins et al., 1994; Hahn, 1998; Lunceford and Davidian, 2004). Comparing Eq. (6) to the traditional IPW estimator using the trial data only, i.e.,
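$$
\hat{E}[Y(a) \mid S = 1] = \frac{1}{n_1} \sum_{i=1}^{n} \frac{I(S_i = 1, A_i = a)\, Y_i}{\hat{P}(A = a \mid S = 1, X_i)}, \tag{7}
$$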
we further weight each subject who participated in the trial by the inverse of the trial participation probability, P (S = 1 | X), to generalize the ATE from the S = 1 sample to the combined sample, while to transport the ATE from the S = 1 sample to the S = 0 sample, trial participants are weighted by the inverse of both the odds of trial participation P (S = 1 | X)/P (S = 0 | X) and P (S = 0).
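The weighting scheme just described can be sketched as follows, again on a deterministic toy dataset under scenario (S1) with one binary covariate. The participation probability is estimated by the stratum-specific fraction of trial participants and the treatment probability is taken as the known value 0.5; the data, the function names, and these modeling choices are illustrative assumptions.

```python
# Toy data for scenario (S1); every number below is illustrative.
# Record layout: (S, A, X, Y). A and Y are None when S = 0 (only X is observed).
TRUE_MEAN = {(1, 0): 1.0, (1, 1): 3.0, (0, 0): 0.0, (0, 1): 1.0}  # m_a(x), keyed by (a, x)


def make_data():
    data = []
    for x, n_trial, n_external in [(0, 100, 400), (1, 300, 200)]:
        for j in range(n_trial):
            a = j % 2                                   # alternate arms (1:1 allocation)
            noise = 1.0 if (j // 2) % 2 == 0 else -1.0  # balanced +/-1 noise within each arm
            data.append((1, a, x, TRUE_MEAN[(a, x)] + noise))
        data.extend((0, None, x, None) for _ in range(n_external))
    return data


def participation_model(data):
    """Estimate P(S = 1 | X = x) by the fraction of trial participants in each stratum."""
    n_x, n_x_trial = {}, {}
    for s, _, x, _ in data:
        n_x[x] = n_x.get(x, 0) + 1
        n_x_trial[x] = n_x_trial.get(x, 0) + s
    return {x: n_x_trial[x] / n_x[x] for x in n_x}


def ipw_estimates(data, a, p_treat=0.5):
    """IPW estimates of E[Y(a)] (combined population) and E[Y(a) | S = 0]."""
    n = len(data)
    n0 = sum(1 for s, _, _, _ in data if s == 0)
    p_s1 = participation_model(data)
    gen = trans = 0.0
    for s, a_i, x, y in data:
        if s == 1 and a_i == a:
            joint = p_treat * p_s1[x]              # P(S = 1, A = a | X = x)
            gen += y / joint                       # inverse joint probability
            trans += y * (1.0 - p_s1[x]) / joint   # inverse odds of participation
    return gen / n, trans / n0


def trial_only_ipw(data, a, p_treat=0.5):
    """Traditional IPW using trial participants only (targets the S = 1 population)."""
    n1 = sum(s for s, _, _, _ in data)
    return sum(y / p_treat for s, a_i, _, y in data if s == 1 and a_i == a) / n1


data = make_data()
gen1, trans1 = ipw_estimates(data, 1)
gen0, trans0 = ipw_estimates(data, 0)
tau1_hat = trial_only_ipw(data, 1) - trial_only_ipw(data, 0)
print(gen1 - gen0, trans1 - trans0, tau1_hat)
```

In this toy example the trial-only estimate (about 1.75) differs from the reweighted estimates for the combined population (about 1.5) and the S = 0 population (about 1.33), because trial participation is more likely in the stratum with the larger effect.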
3.1.3 |. Augmented inverse probability weighting (AIPW)
So far, each of the estimators relies on estimating components of the likelihood such as ma (X) and P (S = 1, A = a | X), which are not necessarily in themselves of scientific interest. Nonparametric estimation may not be feasible when X is of high dimension, while parametric working models may be prone to model misspecification. We can combine the two estimators to gain robustness. A common approach to derive a robust estimator is by constructing an estimating equation from the efficient influence function (EIF) and evaluating it under a working model for the observed data distribution to solve for the parameter of interest, which is widely used in missing data problems (Tsiatis, 2007). Any regular and asymptotically linear estimator is asymptotically equivalent to the sample average of an influence function, which is a function of the observed data with mean zero and finite variance, and the one with the smallest variance is referred to as the EIF (Van der Vaart, 2000; Tsiatis, 2007). The EIFs for E [Y (a)] and E [Y (a) | S = 0] under a nonparametric model where the distribution of the observed data is unrestricted are
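$$
\begin{aligned}
U(Y, A, X, S; E[Y(a)]) &= \frac{I(S = 1, A = a)}{P(S = 1, A = a \mid X)} \{Y - m_a(X)\} + m_a(X) - E[Y(a)], \\
U_0(Y, A, X, S; E[Y(a) \mid S = 0]) &= \frac{1}{P(S = 0)} \left[ \frac{I(S = 1, A = a)\, P(S = 0 \mid X)}{P(S = 1, A = a \mid X)} \{Y - m_a(X)\} + (1 - S) \{m_a(X) - E[Y(a) \mid S = 0]\} \right],
\end{aligned} \tag{8}
$$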
where O = (S×Y, S×A, X, S) denotes the observed data, and E [U (Y, A, X, S; E [Y (a)])] = E [U0 (Y, A, X, S; E [Y (a) | S = 0])] = 0 at the true values. Let $\hat{U}(\cdot)$ and $\hat{U}_0(\cdot)$ respectively denote the evaluation of U (·) and U0 (·) under an estimated working model, then we can obtain the AIPW estimators by solving $\sum_{i=1}^{n} \hat{U}(Y_i, A_i, X_i, S_i; E[Y(a)]) = 0$ and $\sum_{i=1}^{n} \hat{U}_0(Y_i, A_i, X_i, S_i; E[Y(a) \mid S = 0]) = 0$ (Dahabreh et al., 2019d,e). As mentioned in Section 3.1.2, P (A = a | S = 1, X) is guaranteed to be correctly specified in an RCT, therefore P (A = a, S = 1 | X) is correctly specified as long as P (S = 1 | X) is. Hence the above AIPW estimators are doubly robust in the sense that they remain consistent when either the probability of trial participation P (S = 1 | X) or the outcome regression model ma (X) is correctly specified. This can be seen by the following observation: the IPW estimator introduced in Section 3.1.2 can be obtained by misspecifying ma (X) as zero in Eq. (8), while the OR estimator introduced in Section 3.1.1 can be obtained by setting the weight in the first term of both U (·) and U0 (·) to zero in Eq. (8).
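The double robustness property can be illustrated numerically. The sketch below evaluates an AIPW estimator (outcome regression plus an inverse-probability-weighted residual correction) under deliberately wrong working models on a deterministic toy dataset. The dataset, the deliberately misspecified models `bad_m` and `bad_p`, and the function names are illustrative assumptions.

```python
# Toy data for scenario (S1); every number below is illustrative.
# Record layout: (S, A, X, Y). A and Y are None when S = 0 (only X is observed).
TRUE_MEAN = {(1, 0): 1.0, (1, 1): 3.0, (0, 0): 0.0, (0, 1): 1.0}  # m_a(x), keyed by (a, x)


def make_data():
    data = []
    for x, n_trial, n_external in [(0, 100, 400), (1, 300, 200)]:
        for j in range(n_trial):
            a = j % 2                                   # alternate arms (1:1 allocation)
            noise = 1.0 if (j // 2) % 2 == 0 else -1.0  # balanced +/-1 noise within each arm
            data.append((1, a, x, TRUE_MEAN[(a, x)] + noise))
        data.extend((0, None, x, None) for _ in range(n_external))
    return data


def aipw_estimates(data, a, m_hat, p_s1, p_treat=0.5):
    """AIPW estimates of E[Y(a)] and E[Y(a) | S = 0] for given working models.

    m_hat maps (a, x) to a working outcome mean; p_s1 maps x to a working P(S = 1 | X = x).
    """
    n = len(data)
    n0 = sum(1 for s, _, _, _ in data if s == 0)
    gen = trans = 0.0
    for s, a_i, x, y in data:
        m = m_hat[(a, x)]
        gen += m                                         # outcome regression part, everyone
        if s == 0:
            trans += m                                   # outcome regression part, S = 0 only
        elif a_i == a:
            joint = p_treat * p_s1[x]                    # working P(S = 1, A = a | X = x)
            gen += (y - m) / joint                       # weighted residual correction
            trans += (y - m) * (1.0 - p_s1[x]) / joint   # inverse-odds weighted residual
    return gen / n, trans / n0


data = make_data()
good_m, bad_m = TRUE_MEAN, {key: 10.0 for key in TRUE_MEAN}
good_p, bad_p = {0: 0.2, 1: 0.6}, {0: 0.4, 1: 0.4}

only_weights_right = aipw_estimates(data, 1, bad_m, good_p)
only_outcome_right = aipw_estimates(data, 1, good_m, bad_p)
both_wrong = aipw_estimates(data, 1, bad_m, bad_p)
print(only_weights_right, only_outcome_right, both_wrong)
```

In this example E [Y (1)] is 2 and E [Y (1) | S = 0] is 5/3. Both are recovered when either working model is correct, while the estimate with both models wrong is visibly off (2.5 for the generalizability target).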
3.1.4 |. Other methods for combining data from clinical trial and external data
Other doubly robust estimators include a targeted maximum likelihood estimator (Rudolph and van der Laan, 2017) and an augmented calibration weighted estimator (Dong et al., 2020). Sensitivity analysis that replaces Assumption 4 with a pre-specified bias function has also been proposed (Dahabreh et al., 2019f). Meta-analysis is often used to synthesize information about parameters from data collected from multiple trials, which allows for extensions of the above methods to the setting of generalizing or transporting inferences from multiple RCTs to a target population (Dahabreh et al., 2019c; Manski, 2000; Steele et al., 2020; Dahabreh et al., 2020a). Identification under an arbitrary collection of observational and experimental data has been investigated (Lee et al., 2020). Combining probability and non-probability samples with high dimensional data has also been studied (Yang et al., 2020a).
3.2 |. Correcting for bias in observational study using validation or trial data
Internal validity, i.e., Assumptions 2–3, naturally holds in RCTs due to randomization but not necessarily in observational studies due to potential unmeasured confounding. Borrowing strength from the internal validity of RCT data and the large sample size of observational data can mitigate bias and improve efficiency.
In this vein, Yang et al. (2020b) considered estimation of the average treatment effect on the treated (ATT) in the scenario where X = (X1, U), and U is unobserved. Data are obtained from an RCT (Y, A, X1, S = 1) and from an observational study (Y, A, X1, S = 0). In the RCT, X1 is sufficient for Assumption 2, while in the observational study, the unmeasured confounder U leads to bias. A weaker version of Assumption 4 is further assumed. Yang et al. (2020b) proposed to model unmeasured confounding bias via λ(X1; ϕ) = E [Y (0) | A = 1, X1, S = 0; ϕ] − E [Y (0) | A = 0, X1, S = 0; ϕ], which is equal to zero if U = ∅. Modeling this bias function allows one to improve efficiency in estimation of the ATT by combining observational data and RCT data. A similar idea was considered in Kallus et al. (2018) where a confounding bias correction term was learned with interpolation of E [Y | A, X1] between RCT and observational data, and Gui (2020) where RCT data were used to correct bias in an imperfect estimator based on an invalid instrumental variable defined on observational data.
In Athey et al. (2020), it was assumed that we observe data from RCT (W, A, X, S = 1) and from observational study (Y, W, A, X, S = 0), where W denotes a secondary outcome observed in both studies, Y denotes the primary outcome expensive to measure in RCT, and the S = 0 sample is a random sample of the target population. Motivated by the observation that the treatment effects on the secondary outcome should be similar in the RCT and observational data if X is sufficient for Assumption 2, Athey et al. (2020) developed a control function method for using differences in the estimated causal effects on the secondary outcome between the two samples to adjust estimation of the treatment effect on the primary outcome.
Yang and Ding (2019) considered the scenario where a small validation dataset with all confounders (Y, A, X1, U, S = 1) and a big main dataset with unmeasured confounders (Y, A, X1, S = 0) are available. Both are random samples of the target population hence external validity is satisfied. The big main data can improve efficiency and the small validation data can ensure consistency. For each dataset S = s, let $\hat{\tau}_s$, s = 0, 1 denote a consistent estimator of the ATE based on a user-specified estimation strategy adjusting for all confounders (X1, U), and let $\hat{\tau}_{s,\mathrm{ep}}$, s = 0, 1 denote an error-prone estimator using the same estimation strategy but with U uncontrolled. Because U is not observed in the main data, $\hat{\tau}_0$ cannot be obtained. A key insight is that the difference between the two error-prone estimates, $\hat{\tau}_{1,\mathrm{ep}} - \hat{\tau}_{0,\mathrm{ep}}$, should be consistent for zero. By modeling the joint distribution of $\hat{\tau}_1$ and $\hat{\tau}_{1,\mathrm{ep}} - \hat{\tau}_{0,\mathrm{ep}}$, they derived the most efficient consistent estimator of τ among all linear combinations $\hat{\tau}_1 - \lambda (\hat{\tau}_{1,\mathrm{ep}} - \hat{\tau}_{0,\mathrm{ep}})$. Other methods for controlling unmeasured confounding with validation data include the propensity score calibration (Stürmer et al., 2005) and conditional propensity scores (McCandless et al., 2012).
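The idea of combining a consistent estimate with a mean-zero difference of error-prone estimates follows the generic variance-minimizing linear combination (a control variate argument). The sketch below implements only that generic calculation for scalar inputs; it is not presented as the exact estimator of Yang and Ding (2019), and the function name, argument names, and numbers are hypothetical.

```python
def combine_with_control_variate(tau_val, diff_ep, var_tau, var_diff, cov_tau_diff):
    """Best linear combination tau_val - lam * diff_ep when diff_ep has mean zero.

    tau_val:      consistent estimate from the validation data
    diff_ep:      error-prone estimate (validation) minus error-prone estimate (main)
    var_tau:      variance of tau_val
    var_diff:     variance of diff_ep
    cov_tau_diff: covariance between tau_val and diff_ep
    """
    lam = cov_tau_diff / var_diff                     # minimizes Var(tau_val - lam * diff_ep)
    estimate = tau_val - lam * diff_ep
    variance = var_tau - cov_tau_diff ** 2 / var_diff  # never larger than var_tau
    return estimate, variance


# Hypothetical inputs; in practice the variances and covariance would be estimated.
estimate, variance = combine_with_control_variate(
    tau_val=1.0, diff_ep=0.2, var_tau=0.04, var_diff=0.02, cov_tau_diff=0.01
)
print(estimate, variance)
```

The variance reduction is larger when the consistent estimate and the difference of error-prone estimates are more strongly correlated, which is why the large main dataset can improve efficiency even though it lacks U.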
3.3 |. Combining clinical trial with external control
Single-arm clinical trials are typically conducted for rare diseases due to difficulties in recruiting enough patients for an adequately powered two-arm trial, or for diseases with high unmet medical need that raise ethical concerns (Cuffe, 2011; Viele et al., 2014; Abrahami et al., 2021). Historical or contemporaneous information on the control arm is often available from previous RCT or observational studies. Such external controls have been used to emulate the control arm in the setting of single-arm trials, which can decrease costs and duration and improve power.
Formally, the single-arm trial data (Y, A = 1, X, S = 0) are a random sample of the target population, while the external control data contain (Y, A = 0, X, S = 1). Our goal is to estimate E [Y (0) | S = 0] leveraging historical data in order to contrast it with the mean response in the single-arm trial to estimate the treatment effect. Traditional methods to account for differences in patient characteristics between the external control and the target population include meta-analysis (Schmidli et al., 2014; Hasegawa et al., 2017; Weber et al., 2018; Zhang et al., 2019; Schmidli et al., 2020) and matching (Signorovitch et al., 2010; Schmidli et al., 2020). Typically, a form of exchangeability across different studies like Assumption 4 is assumed. Recently, Li and Song (2020) proposed to build an outcome regression model using external control data under exchangeability, and then estimate E [Y (0) | S = 0] by standardization, which is similar to the identification strategy in Eq. (3) with a = 0. Besides single-arm trial data, external controls have also been used to improve efficiency in a traditional RCT with data on both arms available. Li et al. (2020b) showed that the semiparametric efficiency bound for estimating E [Y (1) − Y (0) | S = 0] is reduced by incorporating external control data, and proposed a doubly robust and locally efficient estimator that combines outcome regression and inverse probability of treatment weighting.
4 |. NO SINGLE SAMPLE CONTAINS ALL RELEVANT VARIABLES
The data integration problems described so far have complete data on all relevant variables in at least one sample. A more challenging problem arises when no data source contains complete data. This setting has been referred to as data combination (Ridder and Moffitt, 2007; Shu and Tan, 2020) or data fusion (Evans et al., 2018; Sun and Miao, 2018; Li et al., 2020a) in the literature. In the following, we will first introduce methods applicable to the general data combination problem in Section 4.1. We will use a new set of notation in Section 4.1 while notation in the rest of the paper follows Section 2. We will then overview specific causal inference problems and methods in Sections 4.2–4.3.
4.1 |. General data combination methods
We first introduce some new notation. Suppose for each member from a population of interest, we can define a vector of relevant variables (Y, X, Z). A sample of complete data on (Y, X, Z) is unavailable; instead, two separate samples are available. In one sample we observe variables (Z, Y, S = 1) and in the other sample we observe (Z, X, S = 0), with Z shared by the two datasets. Suppose the S = 1 and S = 0 samples are of size n1 and n0, respectively, with total sample size n = n1 + n0, then a merged sample combining the two samples is an i.i.d. sample containing $\{(S_i Y_i, (1 - S_i) X_i, Z_i, S_i) : i = 1, \ldots, n\}$.
4.1.1 |. Estimation of general parameters defined through moment restrictions
We assume that the S = 1 sample is drawn from the population of interest, while the S = 0 sample is an auxiliary sample independent of the S = 1 sample, which ensures identification that could not be achieved by the S = 1 sample alone. We are often interested in a population parameter defined as the unique solution to the k × 1 vector of population moment conditions E [m (Y, X, Z; θ) | S = 1] = 0, which includes maximum likelihood estimation and the generalized method of moments as special cases. For example, θ is the ATT when S is the binary treatment indicator, (Y, X) are the potential outcomes under treatment and control respectively, Z is a vector of pretreatment covariates, and m(Y, X, Z; θ) = Y − X − θ. Another example is the two-sample instrumental variable (IV) problem, where Z is a vector of IVs, X is the treatment (not necessarily binary), Y is the outcome, and $m(Y, X, Z; \theta) = Z(Y - X\theta)$. We will detail the two-sample IV literature in Section 4.2. Typically selection exchangeability (S ⫫ (Y, X) | Z) and positivity (P (S = s | Z) > 0) are assumed to identify θ by combining the two samples.
Graham et al. (2016) and Shu and Tan (2020) proposed doubly robust and locally efficient estimators of θ extending the semiparametric efficiency theory of Hahn (1998) and Chen et al. (2008). We illustrate the estimation strategies in Shu and Tan (2020) below. When Y = ∅, the moment restriction becomes E [m (X, Z; θ) | S = 1] = 0 in which X is unobserved in the S = 1 sample and we need to combine the two samples for estimation. Shu and Tan (2020) took the EIF in Chen et al. (2008) as the estimating function to obtain an AIPW estimator, which solves $\sum_{i=1}^{n} U(S_i, X_i, Z_i; m(\cdot), \theta) = 0$ where, up to a normalizing constant,
$$
U(S, X, Z; m(\cdot), \theta) = (1 - S) \frac{P(S = 1 \mid Z)}{P(S = 0 \mid Z)} \{ m(X, Z; \theta) - E[m(X, Z; \theta) \mid Z] \} + S\, E[m(X, Z; \theta) \mid Z]. \tag{9}
$$
The AIPW estimator is doubly robust in that it remains consistent when either the propensity score model P (S = 1 | Z) or the outcome regression model E [m(X, Z; θ) | Z] is correctly specified. This can be seen by the following observation: an IPW estimator can be obtained by misspecifying E [m(X, Z; θ) | Z] as zero in Eq. (9), while an outcome regression estimator can be obtained by setting P (S = 1 | Z)/P (S = 0 | Z) to zero in Eq. (9).
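A minimal numerical sketch of this double robustness, for the special case m (X, Z; θ) = X − θ so that θ = E [X | S = 1] while X is observed only in the S = 0 sample. The estimator below combines an odds-weighted residual from the S = 0 sample with the regression prediction in the S = 1 sample. The binary Z, the deterministic data, the deliberately wrong working models, and all names are illustrative assumptions.

```python
# Record layout: (S, Z, X). X is None in the S = 1 sample, where it is not observed.
def make_samples():
    data = []
    for z, n_primary, n_auxiliary, mean_x in [(0, 100, 400, 1.0), (1, 300, 200, 3.0)]:
        data.extend((1, z, None) for _ in range(n_primary))
        for j in range(n_auxiliary):
            noise = 1.0 if j % 2 == 0 else -1.0  # balanced +/-1 noise around E[X | Z = z]
            data.append((0, z, mean_x + noise))
    return data


def aipw_theta(data, odds, mu):
    """Solve sum_i U_i = 0 for theta when m(X, Z; theta) = X - theta.

    odds maps z to a working P(S = 1 | Z = z) / P(S = 0 | Z = z);
    mu maps z to a working E[X | Z = z].
    """
    n1 = sum(s for s, _, _ in data)
    total = 0.0
    for s, z, x in data:
        if s == 1:
            total += mu[z]                    # regression prediction in the S = 1 sample
        else:
            total += odds[z] * (x - mu[z])    # odds-weighted residual from the S = 0 sample
    return total / n1


data = make_samples()
good_odds, bad_odds = {0: 0.25, 1: 1.5}, {0: 1.0, 1: 1.0}
good_mu, bad_mu = {0: 1.0, 1: 3.0}, {0: 2.0, 1: 2.0}

theta_odds_right = aipw_theta(data, good_odds, bad_mu)
theta_mu_right = aipw_theta(data, bad_odds, good_mu)
theta_both_wrong = aipw_theta(data, bad_odds, bad_mu)
print(theta_odds_right, theta_mu_right, theta_both_wrong)
```

Here the target is E [X | S = 1] = 2.5. It is recovered when either working model is correct, whereas using wrong models for both gives 1.5.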
When Y ≠ ∅, Graham et al. (2016) and Shu and Tan (2020) further imposed a key identification assumption that the moment condition is separable in the sense that E [m(Y, X, Z; θ) | S = 1] = E [m1 (Y, Z; θ) − m0 (X, Z; θ) | S = 1], where m1 and m0 only depend on variables observed in one sample. We can see that E [m1 (Y, Z; θ) | S = 1] can be directly estimated from the S = 1 sample, while the challenge is to estimate E [m0 (X, Z; θ) | S = 1] combining both samples. Motivated by the observation that estimation of E [m0 (X, Z; θ) | S = 1] reduces to the Y = ∅ case with m (·) substituted with m0 (·), Shu and Tan (2020) proposed an AIPW estimator that solves
$$
\sum_{i=1}^{n} \{ S_i\, m_1(Y_i, Z_i; \theta) - U(S_i, X_i, Z_i; m_0(\cdot), \theta) \} = 0,
$$
with U (S, X, Z; m0 (·), θ) being the estimating function in Eq. (9) with m (·) substituted with m0 (·).
An alternative assumption often imposed is the conditional independence assumption, i.e., Y ⫫ X | Z (Ridder and Moffitt, 2007; Ogburn et al., 2020). Under this assumption we have f (Y, X, Z) = f (Y | Z)f (X, Z) = f (X | Z)f (Y, Z) where each of f (Y, Z) and f (X, Z) can be estimated from one sample. Therefore, the sample moment conditions can be computed combining the two samples.
4.1.2 |. Statistical matching
Another set of methods in data combination problems is statistical matching, which has been proposed mainly under two scenarios. In the first scenario, a sufficient number of units are shared between the two data sources, i.e., the two samples are partially overlapping. In this case, it is convenient to merge the two samples by linking the records relating to the same unit. There is a rich literature on record linkage which is beyond the scope of this paper (Fellegi and Sunter, 1969; Winkler, 1999; Herzog et al., 2010; Sayers et al., 2016; Deepak and Jurek-Loughrey, 2018; Komarova et al., 2018). In the second scenario, the two samples are selected from the same population but have no common unit. In this case, a statistical matching framework has been proposed in survey studies, which finds a matched pair of units according to the shared variable Z, then imputes the missing value for one unit using the observed value from its matched counterpart (Radner, 1980; D’Orazio et al., 2006; Ridder and Moffitt, 2007; D’Orazio, 2015; Yang and Kim, 2020). Validity of the statistical matching approach depends on the conditional independence assumption that conditional on the shared variable Z, the potentially missing variables Y and X are independent. Under this assumption, matching on Z is sufficient to impute Y in S = 1 sample regardless of whether X are the same. A similar argument holds for imputation in the S = 0 sample.
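The matching-then-imputing step in the second scenario can be sketched as a nearest-neighbor search on the shared variable. The sketch assumes a scalar Z and absolute distance, with ties resolved in favor of the first donor encountered; these choices, the function name, and the toy records are illustrative assumptions.

```python
def nearest_neighbor_impute(recipients, donors):
    """Impute X for each recipient from the donor with the closest shared variable Z.

    recipients: list of (z, y) pairs from the sample where X is missing
    donors:     list of (z, x) pairs from the sample where X is observed
    Returns a list of (z, y, x_imputed) triples.
    """
    completed = []
    for z, y in recipients:
        # min() returns the first donor attaining the smallest distance
        _, donor_x = min(donors, key=lambda donor: abs(donor[0] - z))
        completed.append((z, y, donor_x))
    return completed


recipients = [(1.2, 5.0), (3.5, 7.0), (2.4, 6.0)]   # hypothetical (Z, Y) records
donors = [(1.0, 10.0), (2.0, 20.0), (4.0, 40.0)]    # hypothetical (Z, X) records
completed = nearest_neighbor_impute(recipients, donors)
print(completed)
```

The imputed values are only as good as the conditional independence assumption: the match uses Z alone, so any dependence between Y and X that is not explained by Z is lost in the completed data.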
4.1.3 |. Data combination in regression analysis
Evans et al. (2018) studied a different problem of estimating the regression coefficient of a correctly specified model E [Y | Z, X; θ] when both samples are i.i.d. random samples of the same population. Selection exchangeability and positivity were assumed similar to Section 4.1.1, while no assumption on separable moments (Graham et al., 2016; Shu and Tan, 2020) or conditional independence (Ridder and Moffitt, 2007) introduced in previous sections was made. In this setting, identification of θ can be hard even under linear models, which has been discussed in Pacini (2019), Yang and Kim (2020), and Miao et al. (2020). Evans et al. (2018) proposed a doubly robust estimator for θ that solves where
(10) |
where g (·) is of the same dimension as θ. The doubly robust estimator remains consistent under misspecification of either f (X | Z) or P (S = 1 | Z). Therefore, an IPW estimator can be obtained by misspecifying f (X | Z) as zero, i.e., by substituting E [Y | Z] with zero in Eq. (10), while an imputation estimator can be obtained by substituting P (S = 1 | Z) with 0.5 in Eq. (10).
4.2 |. Two-sample instrumental variable and Mendelian randomization
An important data combination setting is two-sample instrumental variable (IV) estimation. An instrumental variable is an exogenous variable known to satisfy the following three core assumptions: (I) the IV must be associated with the treatment; (II) the IV must not have a direct effect on the outcome that is not mediated by the treatment; (III) the IV must be independent of unmeasured confounders. The IV approach is one of the most frequently used methods to mitigate unmeasured confounding, where the unmeasured confounders are denoted by U. It turns out that the causal effect can be estimated by combining information from two data sources. Let Z denote an instrumental variable. Two-sample IV estimation concerns the scenario in which (Z, A, X, S = 1) are available in one data source and (Z, Y, X, S = 0) are available in a separate data source, with (Z, X) shared by the two datasets. No complete data on all variables (Z, Y, A, X) are available. In the following we suppress the measured covariates X to simplify notation, and all arguments are made implicitly conditional on X.
We first consider the case of a binary treatment. Assuming that U does not modify the causal effect of A at the individual level, i.e., Y = h(ϵ)A + g (U, ϵ), the ATE is identified by ATE = E [h(ϵ)] = cov (Z, Y)/cov (Z, A). Hence common IV methods often estimate the effect of the treatment using the IV-outcome and IV-treatment associations. The numerator and denominator can be separately estimated from two distinct samples if both are random samples of the same target population. In a general case where A is not necessarily binary and could be a vector, the most common IV approach assumes Y = βA + ϵ_Y and A = γZ + ϵ_A, and the IV estimator is given by β̂_IV = (Σ̂_ZA)^{-1} Σ̂_ZY, where Σ̂ denotes the sample covariance matrix. In the one-sample setting, the IV estimator is equivalent to a two-stage least squares (2SLS) estimator obtained by first regressing A on Z, and then regressing Y on Â, the fitted values of A. Angrist and Krueger (1992) and Arellano and Meghir (1992) showed that the IV estimator can be obtained by computing Σ̂_ZA based on the S = 1 sample and computing Σ̂_ZY based on the S = 0 sample, referred to as the two-sample IV estimator. Klevmarken (1982) and Angrist and Krueger (1995) showed that the 2SLS can also be separately carried out using two samples, referred to as the two-sample two-stage least squares (TS2SLS) estimation (Björklund and Jäntti, 1997). In the first stage, A is regressed on Z using the S = 1 sample, and the estimates are then combined with observations on Z in the S = 0 sample to form Â. In the second stage, Y is regressed on Â. Inoue and Solon (2010) pointed out that the equivalence of IV and 2SLS estimation in the one-sample setting does not hold in the two-sample setting. In fact, TS2SLS is more efficient than two-sample IV because it implicitly corrects for differences in the distribution of Z between the two samples.
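To make the distinction between the two estimators concrete, the following Python sketch computes both for a scalar instrument and a scalar treatment. It is a minimal illustration under assumed conditions: the data-generating model, the true effect of 1.5, and the function names `two_sample_iv` and `ts2sls` are our own choices and are not taken from the cited papers. The two-sample IV estimate is the ratio of cov(Z, Y) from the S = 0 sample to cov(Z, A) from the S = 1 sample, while TS2SLS fits the first stage in S = 1, predicts the treatment in S = 0, and regresses Y on the prediction.

```python
import random


def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    """Sample covariance of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def two_sample_iv(z1, a1, z0, y0):
    """cov(Z, Y) from the S = 0 sample divided by cov(Z, A) from the S = 1 sample."""
    return cov(z0, y0) / cov(z1, a1)


def ts2sls(z1, a1, z0, y0):
    """Two-sample 2SLS with an intercept in both stages."""
    # First stage (S = 1 sample): regress A on Z.
    gamma = cov(z1, a1) / cov(z1, z1)
    alpha = mean(a1) - gamma * mean(z1)
    # Combine first-stage estimates with Z from the S = 0 sample.
    a_hat = [alpha + gamma * z for z in z0]
    # Second stage (S = 0 sample): regress Y on the predicted treatment.
    return cov(a_hat, y0) / cov(a_hat, a_hat)


def simulate(n, seed, beta=1.5, gamma=0.8):
    """Assumed linear model with an unmeasured confounder U."""
    rng = random.Random(seed)
    z, a, y = [], [], []
    for _ in range(n):
        zi, u = rng.gauss(0, 1), rng.gauss(0, 1)
        ai = gamma * zi + u + rng.gauss(0, 1)
        yi = beta * ai - 2.0 * u + rng.gauss(0, 1)
        z.append(zi)
        a.append(ai)
        y.append(yi)
    return z, a, y


z1, a1, _ = simulate(20000, seed=1)   # S = 1: only (Z, A) are used
z0, _, y0 = simulate(20000, seed=2)   # S = 0: only (Z, Y) are used
beta_tsiv = two_sample_iv(z1, a1, z0, y0)
beta_ts2sls = ts2sls(z1, a1, z0, y0)
```

In this scalar case the two estimates differ exactly by the ratio of the sample variances of Z in the two samples, which is one way to see the correction for differences in the distribution of Z noted by Inoue and Solon (2010).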
The above classical two-sample IV methods often assume that the two samples are compatible with the same observed data distribution f (Z, Y, A). In practice, however, the common variable, i.e., the IV, can have different distributions in the two samples, i.e., f (Z | S = 1) ≠ f (Z | S = 0). Graham et al. (2016) modeled the selection probability, P (S = 1 | Z), parametrically and developed a doubly robust and locally efficient estimator which can be applied in more general data combination problems. Similar methods proposed in Shu and Tan (2020), detailed in Section 4.1, were also applied to the two-sample IV problem. It is important to note that the estimator proposed by Graham et al. (2016) is based on the EIF derived under a correct model for P (S = 1 | Z) and is therefore doubly robust only under such restricted model specification of nuisance parameters, whereas the estimator of Shu and Tan (2020) is based on the EIF under a nonparametric model for the observed data and is doubly robust without such restrictions. Sun and Miao (2018) established sufficient conditions for nonparametric identification of the ATE allowing for heterogeneous samples, derived the efficiency bound for estimating the ATE, and proposed a multiply robust and locally efficient estimator for estimation and inference.
Using genetic variants as IVs, two-sample Mendelian randomization (MR) methods have also been studied recently, which leverage publicly available summary statistics on genetic instrument-treatment and genetic instrument-outcome associations typically obtained from genome-wide association studies (GWAS) (Davey Smith and Ebrahim, 2003; Pierce and Burgess, 2013; Davey Smith and Hemani, 2014; Lawlor, 2016; Zhu et al., 2018; Spiller et al., 2019). Although simple and convenient, the traditional two-sample MR methods typically rely on valid instruments. Methods robust to invalid instruments have been studied (Bowden et al., 2015, 2016; Hartwig et al., 2017; Li, 2017; Zhao et al., 2020; Sanderson et al., 2021), and extensions to the setting of weak instruments have also been studied (Burgess et al., 2016; Wang and Kang, 2019; Sanderson et al., 2021). Zhao et al. (2019) further considered the scenario when the sample compatibility assumption is violated and proposed methods that are robust to heterogeneous samples.
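As a concrete instance of estimation from summary statistics, the sketch below implements the fixed-effect inverse-variance weighted (IVW) estimator, a standard summary-data estimator that we introduce here only for illustration and that is not singled out in the text above. It assumes that all variants are valid and mutually uncorrelated instruments and that the uncertainty in the instrument-treatment associations can be ignored. The function name and the numerical inputs are hypothetical.

```python
def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted estimate from per-variant summaries.

    beta_exposure: instrument-treatment association estimates.
    beta_outcome:  instrument-outcome association estimates.
    se_outcome:    standard errors of the instrument-outcome estimates.
    Returns (estimate, standard error).
    """
    # Weighted regression of outcome associations on treatment associations
    # through the origin, with weights 1 / se_outcome^2.
    numerator = sum(bx * by / s ** 2
                    for bx, by, s in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / s ** 2
                      for bx, s in zip(beta_exposure, se_outcome))
    return numerator / denominator, (1.0 / denominator) ** 0.5


# Hypothetical summary statistics for three variants (not real GWAS results).
# The outcome associations are exactly half the treatment associations.
estimate, std_error = ivw_estimate(
    beta_exposure=[0.10, 0.20, 0.30],
    beta_outcome=[0.05, 0.10, 0.15],
    se_outcome=[0.01, 0.02, 0.01],
)
```

Because each variant's ratio of outcome to treatment association equals 0.5 in this toy input, the weighted estimate is 0.5 regardless of the weights; with real data the per-variant ratios differ and the weights matter.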
4.3 |. Other causal inference problems
Fan et al. (2014) studied the scenario when the shared variable is the treatment variable. More specifically, (Y, A, X) are partially observed from two separate datasets: the outcome dataset contains (A, Y, S = 1), while the demographics dataset contains (A, X, S = 0). In this case, E [Y | A, X] is not identified from either dataset unless one is willing to make additional identification assumptions. Nevertheless, Fan et al. (2014) established sharp bounds for E [Y (a)] via bounding its inverse probability weighting representation under a continuous version of the classical monotone rearrangement inequality (Hardy et al., 1952; Cambanis et al., 1976). Other related works include Manski (2000), Cross and Manski (2002), and Ridder and Moffitt (2007).
A more general setting is studied in Li et al. (2020a) assuming K + 1 datasets. Specifically, let X = (X1, X2, …, XK), let S ∈ {1, …, K + 1} indicate each dataset, and let Dk, k = 1, …, K + 1, denote the set of observed variables in the k-th dataset, with D1 = (A, Y, X1, S = 1), D2 = (A, Y, X2, S = 2), …, DK = (A, Y, XK, S = K), DK+1 = (X, S = K + 1). Assuming that Y (a) ⫫ A | X, that S is randomly assigned, and that E [Y | A, X; β] is linear and additive, Li et al. (2020a) showed that the coefficient of A, which is the ATE under the linear additive model, is identifiable by combining summary-level statistics obtained from the separate datasets.
5 |. OTHER SETTINGS OF DATA INTEGRATION IN CAUSAL INFERENCE
5.1 |. Distributed data setting
Meta-analysis has a long history in integration of the results from multiple clinical trials with no access to individual-level trial data (DerSimonian and Laird, 1986, 2015). Recently, another widely studied topic is the analysis of distributed data where individual-level observational data are not shareable due to privacy concerns (Toh, 2020). This is increasingly needed in multidatabase or multicenter studies of comparative effectiveness and safety of medical products using real-world data such as electronic health records data. Each data partner can share a summary-level dataset with the analysis center. A few methods have been proposed and we summarize them in order of the amount of information shared. The first method is to reduce the dimension of measured confounders using the propensity score or the prognostic score (Rosenbaum and Rubin, 1983; Hansen, 2008), then share individual-level treatment, outcome, and score with the analysis center to apply propensity score methods (Rassen and Schneeweiss, 2012; Shi et al., 2019). The second method is to aggregate subjects into cells defined by confounders or the propensity score strata, then adjust for confounding based on counts of subjects in each cell (Cook and Goldman, 1989; Rassen et al., 2010; Shu et al., 2020). Propensity score matching within each data partner can be done prior to the aggregation (Toh et al., 2013; Yoshida et al., 2018). The third method is distributed regression (Zhang et al., 2013; Toh et al., 2018), and the fourth is meta-analysis of site-specific results (Toh et al., 2013).
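The idea behind distributed regression can be sketched for the simplest case of a linear model, where pooled least squares depends on the data only through the cross-product summaries X'X and X'y. In the following minimal Python illustration each site computes these summaries locally and the analysis center adds them and solves the normal equations. The function names, the two toy sites, and the noise-free outcome model y = 1 + 2a + 3x are our own assumptions; the cited methods cover more general models and may differ in detail.

```python
def site_summaries(rows):
    """Computed locally at one site; rows are (features, outcome) pairs.

    Only the aggregates X'X and X'y leave the site, not individual records.
    """
    p = len(rows[0][0])
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for x, y in rows:
        for i in range(p):
            xty[i] += x[i] * y
            for j in range(p):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def distributed_ols(summaries):
    """Analysis center: add site-level summaries and solve the normal equations."""
    p = len(summaries[0][1])
    xtx = [[sum(s[0][i][j] for s in summaries) for j in range(p)] for i in range(p)]
    xty = [sum(s[1][i] for s in summaries) for i in range(p)]
    return solve(xtx, xty)


# Toy data: features are [intercept, treatment a, confounder x].
site_a = [([1.0, 0.0, 0.5], 2.5), ([1.0, 1.0, 1.0], 6.0), ([1.0, 0.0, 2.0], 7.0)]
site_b = [([1.0, 1.0, 0.0], 3.0), ([1.0, 1.0, 3.0], 12.0), ([1.0, 0.0, 1.5], 5.5)]
coefficients = distributed_ols([site_summaries(site_a), site_summaries(site_b)])
```

Because the summaries are additive across sites, the result coincides with the least squares fit on the stacked individual-level data; for nonlinear models such as logistic regression, an iterative exchange of summaries would typically be needed instead.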
5.2 |. Bayesian causal inference
The Bayesian framework can naturally facilitate the borrowing of prior information across data sources (Ibrahim and Chen, 2000; Gelman, 2006; Hobbs et al., 2011; Kaizer et al., 2018). Boatman et al. (2020) studied the problem of estimating causal effects from a primary source and borrowing from any number of supplemental sources when data on outcome, treatment, and confounders are available in all data sources. When some confounders are unmeasured in a large main dataset but are available in a small validation dataset, a missing data perspective has been used to impute the missing covariates (Gelman et al., 1998; Jackson et al., 2009; Murray and Reiter, 2016). When the number of missing covariates in the main study is large relative to the sample size of the validation study, Antonelli et al. (2017) proposed a Bayesian approach to estimate the ATE in the main study that combines Bayesian variable selection and missing data imputation, allowing for heterogeneous treatment effects between the main and validation studies. Comment et al. (2019) proposed to use informative priors on quantities related to the unmeasured confounding bias in a range of settings including both static and dynamic treatment regimes as well as treatment-induced mediator-outcome confounding.
5.3 |. Causal discovery
Data integration has also been studied in causal discovery, which aims to learn the causal relations between variables of a system, using multiple heterogeneous datasets that measure the system under different environments or experimental conditions and with different sets of variables. There are two main types of methods. The first type pools data from different experiments to learn a context-independent causal graph of the system (Cooper and Yoo, 1999; Tian and Pearl, 2001; Eaton and Murphy, 2007; Peters et al., 2016; Zhang et al., 2017). For example, Peters et al. (2016) provided an invariant prediction method built on the idea that the conditional distribution of the outcome given the direct causes is invariant across different experimental conditions. Mooij et al. (2020) proposed to take into account context variables that discriminate the different datasets in standard causal discovery methods applied to the pooled data. The second type derives statistics or constraints from each context separately without pooling data and combines them to learn a single graph (Claassen and Heskes, 2010; Tillman and Spirtes, 2011; Triantafillou and Tsamardinos, 2015).
6 |. DISCUSSION
In this paper, we reviewed a collection of data integration methods in causal inference. A common perspective views data integration in causal inference as a missing data problem where the study sample is a subset of the target population. This problem is referred to as generalizability or verify-in-sample. We summarize the missing data patterns of Sections 3–4 in Table 2. Another increasingly recognized setting is when the study sample and the target population are partially overlapping or non-overlapping, in which selection exchangeability requires that the variables that determine study inclusion/exclusion are not predictive of the outcome, or at least do not modify the treatment effect. This problem is referred to as transportability or verify-out-of-sample (Chen et al., 2008; Colnet et al., 2020; Dahabreh et al., 2020b; Degtiar and Rose, 2021). We summarized causal inference methods under both scenarios and their applications in important real-world problems including combining clinical trials with external information, correcting for unmeasured confounding in observational studies using auxiliary or trial data, two-sample Mendelian randomization, and distributed data networks. The majority of the methods rely on some form of exchangeability/homogeneity across different data sources; hence sensitivity analyses for violations of exchangeability assumptions should be routinely conducted. In addition, identification strategies in complex settings, such as when no single sample contains all relevant variables, have not been fully explored, and the connection to the covariate shift problem in machine learning has yet to be fully studied.
TABLE 2.
Funding information
Xu Shi is supported by the NIH/NIGMS grant R01GM139926.
References
- Abrahami D, Pradhan R, Yin H, Honig P, Baumfeld Andre E and Azoulay L (2021) Use of real-world data to emulate a clinical trial and support regulatory decision making: Assessing the impact of temporality, comparator choice, and method of adjustment. Clinical Pharmacology & Therapeutics, 109, 452–461.
- Angrist JD and Krueger AB (1992) The effect of age at school entry on educational attainment: An application of instrumental variables with moments from two samples. Journal of the American Statistical Association, 87, 328–336.
- — (1995) Split-Sample instrumental variables estimates of the return to schooling. Journal of Business & Economic Statistics, 13, 225–235.
- Antonelli J, Zigler C and Dominici F (2017) Guided Bayesian imputation to adjust for confounding when combining heterogeneous data sources in comparative effectiveness research. Biostatistics, 18, 553–568.
- Arellano M and Meghir C (1992) Female labour supply and on-the-job search: An empirical model estimated using complementary data sets. The Review of Economic Studies, 59, 537–559.
- Athey S, Chetty R and Imbens G (2020) Combining experimental and observational data to estimate treatment effects on long term outcomes. arXiv preprint arXiv:2006.09676.
- Bareinboim E and Pearl J (2016) Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113, 7345–7352.
- Björklund A and Jäntti M (1997) Intergenerational income mobility in Sweden compared to the United States. The American Economic Review, 87, 1009–1018.
- Boatman JA, Vock DM and Koopmeiners JS (2020) Borrowing from supplemental sources to estimate causal effects from a primary data source. arXiv preprint arXiv:2003.09680.
- Bowden J, Davey Smith G and Burgess S (2015) Mendelian randomization with invalid instruments: Effect estimation and bias detection through Egger regression. International Journal of Epidemiology, 44, 512–525.
- Bowden J, Davey Smith G, Haycock PC and Burgess S (2016) Consistent estimation in Mendelian randomization with some invalid instruments using a weighted median estimator. Genetic Epidemiology, 40, 304–314.
- Buchanan AL, Hudgens MG, Cole SR, Mollan KR, Sax PE, Daar ES, Adimora AA, Eron JJ and Mugavero MJ (2018) Generalizing evidence from randomized trials using inverse probability of sampling weights. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181, 1193–1209.
- Burgess S, Davies NM and Thompson SG (2016) Bias due to participant overlap in two-sample Mendelian randomization. Genetic Epidemiology, 40, 597–608.
- Cambanis S, Simons G and Stout W (1976) Inequalities for Ek(x, y) when the marginals are fixed. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 36, 285–294.
- Chen X, Hong H and Tarozzi A (2008) Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics, 36, 808–843.
- Cheng PE (1994) Nonparametric estimation of mean functionals with data missing at random. Journal of the American Statistical Association, 89, 81–87.
- Claassen T and Heskes T (2010) Causal discovery in multiple models from different experiments. In Twenty-fourth Annual Conference on Neural Information Processing Systems.
- Cole SR and Stuart EA (2010) Generalizing evidence from randomized clinical trials to target populations: The ACTG 320 trial. American Journal of Epidemiology, 172, 107–115.
- Colnet B, Mayer I, Chen G, Dieng A, Li R, Varoquaux G, Vert J-P, Josse J and Yang S (2020) Causal inference methods for combining randomized trials and observational studies: A review. arXiv preprint arXiv:2011.08047.
- Comment L, Coull BA, Zigler C and Valeri L (2019) Bayesian data fusion for unmeasured confounding. arXiv preprint arXiv:1902.10613.
- Cook EF and Goldman L (1989) Performance of tests of significance based on stratification by a multivariate confounder score or by a propensity score. Journal of Clinical Epidemiology, 42, 317–324.
- Cooper GF and Yoo C (1999) Causal discovery from a mixture of experimental and observational data. In the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 116–125.
- Cross PJ and Manski CF (2002) Regressions, short and long. Econometrica, 70, 357–368.
- Cuffe RL (2011) The inclusion of historical control data may reduce the power of a confirmatory study. Statistics in Medicine, 30, 1329–1338.
- Dahabreh IJ, Haneuse SJ, Robins JM, Robertson SE, Buchanan AL, Stuart EA and Hernán MA (2019a) Study designs for extending causal inferences from a randomized trial to a target population. arXiv preprint arXiv:1905.07764.
- Dahabreh IJ and Hernán MA (2019) Extending inferences from a randomized trial to a target population. European Journal of Epidemiology, 34, 719–722.
- Dahabreh IJ, Hernán MA, Robertson SE, Buchanan A and Steingrimsson JA (2019b) Generalizing trial findings using nested trial designs with sub-sampling of non-randomized individuals. arXiv preprint arXiv:1902.06080.
- Dahabreh IJ, Petito LC, Robertson SE, Hernán MA and Steingrimsson JA (2020a) Toward causally interpretable meta-analysis: Transporting inferences from multiple randomized trials to a new target population. Epidemiology, 31, 334–344.
- Dahabreh IJ, Robertson SE, Petito LC, Hernán MA and Steingrimsson JA (2019c) Efficient and robust methods for causally interpretable meta-analysis: Transporting inferences from multiple randomized trials to a target population. arXiv preprint arXiv:1908.09230.
- Dahabreh IJ, Robertson SE, Steingrimsson JA, Stuart EA and Hernan MA (2020b) Extending inferences from a randomized trial to a new target population. Statistics in Medicine, 39, 1999–2014.
- Dahabreh IJ, Robertson SE, Tchetgen EJ, Stuart EA and Hernán MA (2019d) Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics, 75, 685–694.
- Dahabreh IJ, Robins JM, Haneuse SJ and Hernán MA (2019e) Generalizing causal inferences from randomized trials: Counterfactual and graphical identification. arXiv preprint arXiv:1906.10792.
- Dahabreh IJ, Robins JM, Haneuse SJ, Saeed I, Robertson SE, Stuart EA and Hernán MA (2019f) Sensitivity analysis using bias functions for studies extending inferences from a randomized trial to a target population. arXiv preprint arXiv:1905.10684.
- Davey Smith G and Ebrahim S (2003) ‘Mendelian randomization’: Can genetic epidemiology contribute to understanding environmental determinants of disease? International Journal of Epidemiology, 32, 1–22.
- Davey Smith G and Hemani G (2014) Mendelian randomization: Genetic anchors for causal inference in epidemiological studies. Human Molecular Genetics, 23, R89–R98.
- Deepak P and Jurek-Loughrey A (2018) Linking and Mining Heterogeneous and Multi-view Data. Springer.
- Degtiar I and Rose S (2021) A review of generalizability and transportability. arXiv preprint arXiv:2102.11904.
- DerSimonian R and Laird N (1986) Meta-Analysis in clinical trials. Controlled Clinical Trials, 7, 177–188.
- — (2015) Meta-Analysis in clinical trials revisited. Contemporary Clinical Trials, 45, 139–145.
- Dong L, Yang S, Wang X, Zeng D and Cai J (2020) Integrative analysis of randomized clinical trials with real world evidence studies. arXiv preprint arXiv:2003.01242.
- D’Orazio M, Di Zio M and Scanu M (2006) Statistical Matching: Theory and Practice. John Wiley & Sons.
- D’Orazio M (2015) Integration and imputation of survey data in R: The StatMatch package. Romanian Statistical Review, 63, 57–68.
- Eaton D and Murphy K (2007) Exact Bayesian structure learning from uncertain interventions. In Artificial Intelligence and Statistics, 107–114. PMLR.
- Evans K, Sun B, Robins J and Tchetgen EJT (2018) Doubly robust regression analysis for data fusion. arXiv preprint arXiv:1808.07309.
- Fan Y, Sherman R and Shum M (2014) Identifying treatment effects under data combination. Econometrica, 82, 811–822.
- Fellegi IP and Sunter AB (1969) A theory for record linkage. Journal of the American Statistical Association, 64, 1183–1210.
- Gelman A (2006) Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1, 515–534.
- Gelman A, King G and Liu C (1998) Not asked and not answered: Multiple imputation for multiple surveys. Journal of the American Statistical Association, 93, 846–857.
- Graham BS, Pinto C. C. d. X. and Egel D (2016) Efficient estimation of data combination models by the method of auxiliary-to-study tilting (AST). Journal of Business & Economic Statistics, 34, 288–301.
- Greenland S (2003) Quantifying biases in causal models: Classical confounding vs collider-stratification bias. Epidemiology, 14, 300–306.
- Greenland S and Robins JM (1986) Identifiability, exchangeability, and epidemiological confounding. International Journal of Epidemiology, 15, 413–419.
- Gui G (2020) Combining observational and experimental data using first-stage covariates. arXiv preprint arXiv:2010.05117.
- Hahn J (1998) On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66, 315–331.
- Hansen BB (2008) The prognostic analogue of the propensity score. Biometrika, 95, 481–488.
- Hardy G, Littlewood J and Polya G (1952) Inequalities. Cambridge University Press.
- Hartman E, Grieve R, Ramsahai R and Sekhon JS (2015) From SATE to PATT: Combining experimental with observational studies to estimate population treatment effects. Journal of the Royal Statistical Society: Series A (Statistics in Society), 178, 757–778.
- Hartwig FP, Davey Smith G and Bowden J (2017) Robust inference in summary data Mendelian randomization via the zero modal pleiotropy assumption. International Journal of Epidemiology, 46, 1985–1998.
- Hasegawa T, Claggett B, Tian L, Solomon SD, Pfeffer MA and Wei L-J (2017) The myth of making inferences for an overall treatment efficacy with data from multiple comparative studies via meta-analysis. Statistics in Biosciences, 9, 284–297.
- Herzog TH, Scheuren F and Winkler WE (2010) Record linkage. Wiley Interdisciplinary Reviews: Computational Statistics, 2, 535–543.
- Hobbs BP, Carlin BP, Mandrekar SJ and Sargent DJ (2011) Hierarchical commensurate and power prior models for adaptive incorporation of historical information in clinical trials. Biometrics, 67, 1047–1056.
- Hünermund P and Bareinboim E (2019) Causal inference and data-fusion in econometrics. arXiv preprint arXiv:1912.09104.
- Ibrahim JG and Chen M-H (2000) Power prior distributions for regression models. Statistical Science, 15, 46–60.
- Inoue A and Solon G (2010) Two-Sample instrumental variables estimators. The Review of Economics and Statistics, 92, 557–561.
- Jackson CH, Best NG and Richardson S (2009) Bayesian graphical models for regression on multiple data sets with different variables. Biostatistics, 10, 335–351.
- Kaizer AM, Koopmeiners JS and Hobbs BP (2018) Bayesian hierarchical modeling based on multisource exchangeability. Biostatistics, 19, 169–184.
- Kallus N, Puli AM and Shalit U (2018) Removing hidden confounding by experimental grounding. arXiv preprint arXiv:1810.11646.
- Klevmarken A (1982) Missing variables and two-stage least-squares estimation from more than one data set. Tech. rep., IUI Working Paper.
- Komarova T, Nekipelov D and Yakovlev E (2018) Identification, data combination, and the risk of disclosure. Quantitative Economics, 9, 395–440.
- Lawlor DA (2016) Commentary: Two-sample Mendelian randomization: opportunities and challenges. International Journal of Epidemiology, 45, 908–915.
- Lee S, Correa JD and Bareinboim E (2020) General identifiability with arbitrary surrogate experiments. In Uncertainty in Artificial Intelligence, 389–398. PMLR.
- Lesko CR, Buchanan AL, Westreich D, Edwards JK, Hudgens MG and Cole SR (2017) Generalizing study results: A potential outcomes perspective. Epidemiology, 28, 553–561.
- Li H, Miao W, Cai Z, Liu X, Zhang T, Xue F and Geng Z (2020a) Causal data fusion methods using summary-level statistics for a continuous outcome. Statistics in Medicine, 39, 1054–1067.
- Li S (2017) Mendelian randomization when many instruments are invalid: Hierarchical empirical Bayes estimation. arXiv preprint arXiv:1706.01389.
- Li X, Miao W, Lu F and Zhou X-H (2020b) Improving efficiency of inference in clinical trials with external control data. arXiv preprint arXiv:2011.07234.
- Li X and Song Y (2020) Target population statistical inference with data integration across multiple sources — An approach to mitigate information shortage in rare disease clinical trials. Statistics in Biopharmaceutical Research, 12, 322–333.
- Lunceford JK and Davidian M (2004) Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine, 23, 2937–2960.
- Manski CF (2000) Identification problems and decisions under ambiguity: Empirical analysis of treatment response and normative analysis of treatment choice. Journal of Econometrics, 95, 415–442.
- McCandless LC, Richardson S and Best N (2012) Adjustment for missing confounders using external validation data and propensity scores. Journal of the American Statistical Association, 107, 40–51.
- Miao W, Li W, Hu W, Wang R and Geng Z (2020) Invited Commentary: Estimation and Bounds Under Data Fusion. American Journal of Epidemiology.
- Mooij JM, Magliacane S and Claassen T (2020) Joint causal inference from multiple contexts. Journal of Machine Learning Research, 21, 1–108.
- Murray JS and Reiter JP (2016) Multiple imputation of missing categorical and continuous values via Bayesian mixture models with local dependence. Journal of the American Statistical Association, 111, 1466–1479.
- Ogburn EL, Rudolph KE, Morello-Frosch R, Khan A and Casey JA (2020) A Warning About Using Predicted Values From Regression Models for Epidemiologic Inquiry. American Journal of Epidemiology.
- O’Muircheartaigh C and Hedges LV (2014) Generalizing from unrepresentative experiments: A stratified propensity score approach. Journal of the Royal Statistical Society: Series C: Applied Statistics, 63, 195–210.
- Pacini D (2019) Two-Sample least squares projection. Econometric Reviews, 38, 95–123.
- Pearl J and Bareinboim E (2014) External validity: From do-calculus to transportability across populations. Statistical Science, 29, 579–595.
- Peters J, Bühlmann P and Meinshausen N (2016) Causal inference by using invariant prediction: Identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5), 947–1012. [Google Scholar]
- Pierce BL and Burgess S (2013) Efficient design for Mendelian randomization studies: Subsample and 2-sample instrumental variable estimators. American Journal of Epidemiology, 178, 1177–1184. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Radner D (1980) Report on exact and statistical matching techniques. Washington, D.C.: U.S. Dept. of Commerce, Office of Federal Statistical Policy and Standards : For sale by the Supt. of Docs., U.S. G.P.O., 1980. [Google Scholar]
- Rassen JA and Schneeweiss S (2012) Using high-dimensional propensity scores to automate confounding control in a distributed medical product safety surveillance system. Pharmacoepidemiology and Drug Safety, 21, 41–49. [DOI] [PubMed] [Google Scholar]
- Rassen JA, Solomon DH, Curtis JR, Herrinton L and Schneeweiss S (2010) Privacy-Maintaining propensity score-based pooling of multiple databases applied to a study of biologics. Medical Care, 48, S83–S39. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ridder G and Moffitt R (2007) The econometrics of data combination. Handbook of Econometrics, 6, 5469–5547. [Google Scholar]
- Robins J (1986) A new approach to causal inference in mortality studies with a sustained exposure period — Application to control of the healthy worker survivor effect. Mathematical Modelling, 7, 1393–1512. [Google Scholar]
- Robins JM, Rotnitzky A and Zhao LP (1994) Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846–866.
- Rosenbaum PR and Rubin DB (1983) The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.
- Rubin DB (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688–701.
- Rubin DB (1980) Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75, 591–593.
- Rudolph KE and van der Laan MJ (2017) Robust estimation of encouragement-design intervention effects transported across sites. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 1509–1525.
- Sanderson E, Spiller W and Bowden J (2021) Testing and correcting for weak and pleiotropic instruments in two-sample multivariable Mendelian randomization. Statistics in Medicine, 40(25), 5434–5452.
- Sayers A, Ben-Shlomo Y, Blom AW and Steele F (2016) Probabilistic record linkage. International Journal of Epidemiology, 45, 954–964.
- Schmidli H, Gsteiger S, Roychoudhury S, O’Hagan A, Spiegelhalter D and Neuenschwander B (2014) Robust meta-analytic-predictive priors in clinical trials with historical control information. Biometrics, 70, 1023–1032.
- Schmidli H, Häring DA, Thomas M, Cassidy A, Weber S and Bretz F (2020) Beyond randomized clinical trials: Use of external controls. Clinical Pharmacology & Therapeutics, 107, 806–816.
- Shi X, Wellman R, Heagerty PJ, Nelson JC and Cook AJ (2019) Safety surveillance and the estimation of risk in select populations: Flexible methods to control for confounding while targeting marginal comparisons via standardization. Statistics in Medicine, 39, 369–386.
- Shu D, Yoshida K, Fireman BH and Toh S (2020) Inverse probability weighted Cox model in multi-site studies without sharing individual-level data. Statistical Methods in Medical Research, 29, 1668–1681.
- Shu H and Tan Z (2020) Improved methods for moment restriction models with data combination and an application to two-sample instrumental variable estimation. Canadian Journal of Statistics, 48, 259–284.
- Signorovitch JE, Wu EQ, Yu AP, Gerrits CM, Kantor E, Bao Y, Gupta SR and Mulani PM (2010) Comparative effectiveness without head-to-head trials. Pharmacoeconomics, 28, 935–945.
- Spiller W, Davies NM and Palmer TM (2019) Software application profile: mrrobust — A tool for performing two-sample summary Mendelian randomization analyses. International Journal of Epidemiology, 48, 684–690.
- Steele RJ, Schnitzer ME and Shrier I (2020) Importance of homogeneous effect modification for causal interpretation of meta-analyses. Epidemiology, 31, 353–355.
- Stuart EA, Cole SR, Bradshaw CP and Leaf PJ (2011) The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society: Series A (Statistics in Society), 174, 369–386.
- Stürmer T, Schneeweiss S, Avorn J and Glynn RJ (2005) Adjusting effect estimates for unmeasured confounding with validation data using propensity score calibration. American Journal of Epidemiology, 162, 279–289.
- Sun B and Miao W (2018) On semiparametric instrumental variable estimation of average treatment effects through data fusion. arXiv preprint arXiv:1810.03353.
- Tian J and Pearl J (2001) Causal discovery from changes. In the 17th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 512–521.
- Tillman R and Spirtes P (2011) Learning equivalence classes of acyclic models with latent and selection variables from multiple datasets with overlapping variables. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 3–15. JMLR Workshop and Conference Proceedings.
- Tipton E (2013) Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts. Journal of Educational and Behavioral Statistics, 38, 239–266.
- Toh S (2020) Analytic and data sharing options in real-world multidatabase studies of comparative effectiveness and safety of medical products. Clinical Pharmacology & Therapeutics, 107, 834–842.
- Toh S, Gagne JJ, Rassen JA, Fireman BH, Kulldorff M and Brown JS (2013) Confounding adjustment in comparative effectiveness research conducted within distributed research networks. Medical Care, 51, S4–S10.
- Toh S, Wellman R, Coley RY, Horgan C, Sturtevant J, Moyneur E, Janning C, Pardee R, Coleman KJ, Arterburn D, McTigue K, Anau J and Cook AJ (2018) Combining distributed regression and propensity scores: A doubly privacy-protecting analytic method for multicenter research. Clinical Epidemiology, 10, 1773–1786.
- Triantafillou S and Tsamardinos I (2015) Constraint-based causal discovery from multiple interventions over overlapping variable sets. Journal of Machine Learning Research, 16, 2147–2205.
- Tsiatis A (2007) Semiparametric Theory and Missing Data. Springer Science & Business Media.
- Van der Vaart AW (2000) Asymptotic Statistics, vol. 3. Cambridge University Press.
- Vansteelandt S and Keiding N (2011) Invited commentary: G-computation — Lost in translation? American Journal of Epidemiology, 173, 739–742.
- Viele K, Berry S, Neuenschwander B, Amzal B, Chen F, Enas N, Hobbs B, Ibrahim JG, Kinnersley N, Lindborg S, Micallef S, Roychoudhury S and Thompson L (2014) Use of historical control data for assessing treatment effects in clinical trials. Pharmaceutical Statistics, 13, 41–54.
- Wang S and Kang H (2019) Weak-instrument robust tests in two-sample summary-data Mendelian randomization. arXiv preprint arXiv:1909.06950.
- Weber K, Hemmings R and Koch A (2018) How to use prior knowledge and still give new data a chance? Pharmaceutical Statistics, 17, 329–341.
- Westreich D, Edwards JK, Lesko CR, Stuart E and Cole SR (2017) Transportability of trial results using inverse odds of sampling weights. American Journal of Epidemiology, 186, 1010–1014.
- Winkler WE (1999) The state of record linkage and current research problems. In Statistical Research Division, US Census Bureau.
- Yang S and Ding P (2019) Combining multiple observational data sources to estimate causal effects. Journal of the American Statistical Association, 115(531), 1540–1554.
- Yang S and Kim JK (2020) Statistical data integration in survey sampling: A review. Japanese Journal of Statistics and Data Science, 3, 625–650.
- Yang S, Kim JK and Song R (2020a) Doubly robust inference when combining probability and non-probability samples with high dimensional data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82, 445–465.
- Yang S, Zeng D and Wang X (2020b) Improved inference for heterogeneous treatment effects using real-world data subject to hidden confounding. arXiv preprint arXiv:2007.12922.
- Yoshida K, Gruber S, Fireman BH and Toh S (2018) Comparison of privacy-protecting analytic and data-sharing methods: A simulation study. Pharmacoepidemiology and Drug Safety, 27, 1034–1041.
- Zhang J, Ko C-W, Nie L, Chen Y and Tiwari R (2019) Bayesian hierarchical methods for meta-analysis combining randomized-controlled and single-arm studies. Statistical Methods in Medical Research, 28, 1293–1310.
- Zhang K, Huang B, Zhang J, Glymour C and Schölkopf B (2017) Causal discovery from nonstationary/heterogeneous data: Skeleton estimation and orientation determination. In IJCAI: Proceedings of the Conference, vol. 2017, 1347–1353. NIH Public Access.
- Zhang Y, Duchi J and Wainwright M (2013) Divide and conquer kernel ridge regression. In the 26th Annual Conference on Learning Theory, 592–617. PMLR.
- Zhao Q, Wang J, Hemani G, Bowden J and Small DS (2020) Statistical inference in two-sample summary-data Mendelian randomization using robust adjusted profile score. The Annals of Statistics, 48, 1742–1769.
- Zhao Q, Wang J, Spiller W, Bowden J and Small DS (2019) Two-sample instrumental variable analyses using heterogeneous samples. Statistical Science, 34, 317–333.
- Zhu Z, Zheng Z, Zhang F, Wu Y, Trzaskowski M, Maier R, Robinson MR, McGrath JJ, Visscher PM, Wray NR and Yang J (2018) Causal associations between risk factors and common diseases inferred from GWAS summary data. Nature Communications, 9.