Abstract
Integrating data from multiple heterogeneous sources has become increasingly popular as a way to achieve a large sample size and a diverse study population. This article reviews developments in causal inference methods that combine multiple datasets collected by potentially different designs from potentially heterogeneous populations. We summarize recent advances on combining randomized clinical trials with external information from observational studies or historical controls, combining samples when no single sample has all relevant variables with application to two‐sample Mendelian randomization, distributed data settings under privacy concerns for comparative effectiveness and safety research using real‐world data, Bayesian causal inference, and causal discovery methods.
This article is categorized under:
Statistical Models > Semiparametric Models
Applications of Computational Statistics > Clinical Trials
Keywords: causal inference, data fusion, data integration, generalizability, transportability
Data missing patterns in the major settings discussed in Sections 3 and 4. For each variable in each sample, ✓ stands for observed, empty stands for unobserved, and ✓/✗ indicates different settings considered by different papers.
1. INTRODUCTION
The availability of multiple datasets collected by different designs from heterogeneous populations has brought emerging challenges and opportunities for causal inference. Integrating data from multiple sources to facilitate causal inference has become increasingly popular. For example, the randomized clinical trial (RCT) has been the gold standard for causal inference but often suffers from insufficient sample size and a homogeneous study population due to inclusion/exclusion criteria. Results from RCTs may not be generalizable to a real‐world population. In contrast, an observational study typically offers a large, diverse sample representative of the target population but often suffers from unmeasured confounding. Combining data from both designs allows one to extend causal inference from an RCT to a target population, to correct for bias in observational studies, and to improve efficiency (Colnet et al., 2020). Another prominent example is when no single dataset contains all relevant variables, that is, there are no complete data for any subject. In this case, identification becomes difficult even for parameters that are straightforward to identify with complete data (Ridder & Moffitt, 2007). This is typical in survey sample combination where variables collected in each survey may differ (Yang & Kim, 2020). This is also the case in two‐sample instrumental variable methods, which are widely applied in Mendelian randomization studies where individual‐level genetic data are not available due to privacy concerns (Angrist & Krueger, 1992).
In this article, we review selected literature on data integration methods in causal inference. Recent review studies focused on combining randomized and observational data (Colnet et al., 2020; Degtiar & Rose, 2021) and data combination in survey sampling (Ridder & Moffitt, 2007; Yang & Kim, 2020). We aim to provide a more systematic review and cover a range of research areas. We start with notation and introduce key assumptions and concepts that frequently appeared in the literature in Section 2. We then summarize recent methodological advances in integrating data from RCTs and observational studies in Section 3, and combining data when no single sample has all relevant variables in Section 4. We briefly review the literature on data integration for causal discovery, distributed data analysis for privacy protection, and Bayesian methods for integrated causal inference in Section 5. We close with a discussion in Section 6.
2. PRELIMINARIES
In this section, we briefly introduce the potential outcome framework and review key concepts in causal inference and data integration. Let $A$ denote a binary treatment (1: treated, 0: untreated), $Y$ denote an observed outcome, and $X$ denote a vector of measured covariates. When all circumstances are the same except for the treatment status, any difference observed in the outcomes has to be attributed to the treatment. Correspondingly, for each subject we define a pair of potential outcomes, $(Y_1, Y_0)$, that would be observed if the subject had been given treatment, $A = 1$, and control, $A = 0$ (Rubin, 1974), under the stable unit treatment value assumption that there is no interference between units and no multiple versions of treatment (Rubin, 1980). As such, the observed outcome is equal to the potential outcome corresponding to the subject's treatment condition, as formalized in the following assumption.
Assumption 1
(Consistency) $Y = Y_a$ if $A = a$, for $a = 0$ or $1$.
A fundamental problem in causal inference is that for each subject, we can only observe one of the potential outcomes. Because it is impossible to compute the difference in $Y_1$ and $Y_0$ for a specific subject, we often specify a target population of interest, and study the mean difference in the target population, referred to as the average treatment effect (ATE). In practice, we cannot observe data on all subjects in the prespecified target population but rather data on a sample of subjects referred to as the study sample. Let $S$ be a binary indicator of whether a subject is selected into the study sample (1: sampled, 0: not sampled). It is important to note that the ATE is population‐specific. In fact, we can define multiple ATEs each with respect to a different target population as follows:
$$\tau = E[Y_1 - Y_0], \qquad \tau_1 = E[Y_1 - Y_0 \mid S = 1], \qquad \tau_0 = E[Y_1 - Y_0 \mid S = 0].$$
For example, the ATE is $\tau$ if the combined $S = 1$ and $S = 0$ sample is a random sample of the target population. The ATE estimated based on the study sample, that is, the $S = 1$ sample, is an estimate of $\tau_1$, which is not necessarily equal to $\tau$ because the study sample is not necessarily a representative sample of the target population. Identification of the ATE, which is a function of the potential outcome distribution in a target population, involves expressing it as a function of the observed data distribution, so that the observed data distribution uniquely determines its value.
To identify the ATE, ideally we would like to observe $Y_a$ of all subjects in the target population to compute $E[Y_a]$, for $a = 0$ or $1$. However, both the sample selection mechanism and the treatment assignment mechanism lead to missingness in $Y_a$: generally $Y_a$'s are missing for all subjects in the $S = 0$ sample (sample selection); in the $S = 1$ sample, $Y_a$'s are unobserved for subjects in the other treatment arm with $A = 1 - a$ (treatment assignment). Confounding bias, also referred to as violation of internal validity, occurs when factors that impact treatment assignment also predict the outcome, such that the observed $Y_a$'s in the $A = a$ arm cannot represent the missing $Y_a$'s in the $A = 1 - a$ arm, that is, $E[Y_a \mid A = a, S = 1] \neq E[Y_a \mid A = 1 - a, S = 1]$. Selection bias, also referred to as violation of external validity, occurs when factors that impact sample selection also predict the outcome, such that the observed $Y_a$'s in the $S = 1$ sample cannot represent the missing $Y_a$'s in the $S = 0$ sample. A less stringent condition targeting treatment effect estimation, that is, the mean difference rather than the mean, defines selection bias as occurring when factors that impact sample selection also modify the treatment effect, that is, $E[Y_1 - Y_0 \mid S = 1] \neq E[Y_1 - Y_0 \mid S = 0]$ (Lesko et al., 2017; Stuart et al., 2011). Collider‐stratification bias may also occur due to conditioning the analysis on the study sample, if $S$ is a common consequence of the treatment (or a predictor of the treatment) and the outcome (or a predictor of the outcome; Greenland, 2003).
Two key assumptions about the treatment assignment mechanism are often imposed, which we refer to as treatment exchangeability and positivity. The treatment exchangeability assumption states that within a stratum of $X$, $Y_a$ of subjects in the $A = a$ arm can be exchanged with $Y_a$ of subjects in the $A = 1 - a$ arm:
Assumption 2
(Treatment exchangeability) $Y_a \perp A \mid X, S = 1$ for $a = 0$ or $1$.
Assumption 2 allows us to represent the conditional distribution of the unobserved potential outcome using that of the observed potential outcome. We thus have that for $a = 0$ or $1$,
$$E[Y_a \mid X, S = 1] = E[Y_a \mid X, S = 1, A = a]. \tag{1}$$
Equation (1) has also been used as a weaker version of Assumption 2. Within each stratum of the covariates sufficient for treatment exchangeability, we also need a nonzero probability of observing subjects in both treatment arms:
Assumption 3
(Treatment positivity) $P(A = a \mid X, S = 1) > 0$ for all $a$ almost surely.
Often $P(A = 1 \mid X, S = 1)$ is referred to as the propensity score. Note that Assumptions 2 and 3 are conditional on the study sample, thus the set of covariates sufficient for Assumptions 2 and 3 to hold may include variables beyond common causes of treatment and outcome, that is, the typical confounders. For example, a covariate that causes selection and outcome but is independent of the treatment can become a confounder if the treatment also causes selection. This is a consequence of collider‐stratification bias where conditioning on $S$ results in a spurious association between the treatment and the covariate.
Besides conditions to ensure internal validity, two further key assumptions about the sample selection mechanism are often imposed to ensure external validity, which we refer to as selection exchangeability and positivity, in analogy to Assumptions 2–3 (Dahabreh, Robertson, et al., 2020; Lesko et al., 2017; Stuart et al., 2011).
Assumption 4
(Selection exchangeability) $Y_a \perp S \mid X$ for $a = 0$ or $1$.
Assumption 4 allows us to generalize the conditional distribution of the potential outcome from the study sample to a target population, such as the one represented by the $S = 0$ sample or the combination of the $S = 1$ and $S = 0$ samples:
$$E[Y_a \mid X, S = 1] = E[Y_a \mid X, S = 0] = E[Y_a \mid X]. \tag{2}$$
Weaker versions of the selection exchangeability assumption include (I) mean conditional exchangeability, that is, Equation (2) and (II) all treatment effect modifiers are measured, that is, $E[Y_1 - Y_0 \mid X, S = 1] = E[Y_1 - Y_0 \mid X, S = 0]$. We further assume that variables required for selection exchangeability do not serve as study eligibility criteria that completely exclude certain subjects from the study sample.
Assumption 5
(Selection positivity) $P(S = 1 \mid X) > 0$ almost surely.
For example, suppose geographic location restricts study participation such that there is zero probability of selecting subjects in a certain area. Assumption 5 then requires that geographic location is not needed for Assumption 4, that is, conditional on $X$, geographic location is not associated with the outcome or does not modify the treatment effect.
3. COMBINING A RANDOMIZED CLINICAL TRIAL WITH EXTERNAL DATA
There is a rich literature on combining information from both experimental and nonexperimental designs and bridging findings from an RCT to a target population (Buchanan et al., 2018; Cole & Stuart, 2010; Dahabreh, Haneuse, Robins, Robertson, et al., 2019; Dahabreh & Hernán, 2019; Dahabreh, Hernán, Robertson, Buchanan, & Steingrimsson, 2019; Dahabreh, Petito, et al., 2020; Dahabreh, Robertson, et al., 2020; Dahabreh, Robertson, Petito, Hernán, & Steingrimsson, 2019; Dahabreh, Robertson, Tchetgen, Stuart, & Hernán, 2019; Dahabreh, Robins, Haneuse, & Hernán, 2019; Dong et al., 2020; Hartman et al., 2015; Lesko et al., 2017; O'Muircheartaigh & Hedges, 2014; Rudolph & van der Laan, 2017; Tipton, 2013; Westreich et al., 2017). In this setting, $S = 1$ indicates the sample of trial participants, and we observe $(X, A, Y)$ in RCT data. Due to randomization or stratified randomization, the propensity score, $P(A = 1 \mid X, S = 1)$, is a known function designed by the investigator, and Assumptions 2–3 naturally hold in an RCT with $X$ being the variables defining the strata.
Two problems are frequently studied: generalizability (Buchanan et al., 2018; Cole & Stuart, 2010; Dahabreh, Robertson, Tchetgen, Stuart, & Hernán, 2019; Stuart et al., 2011) and transportability (Bareinboim & Pearl, 2016; Hünermund & Bareinboim, 2019; Pearl & Bareinboim, 2014; Rudolph & van der Laan, 2017; Westreich et al., 2017). The distinction between the two concepts is well summarized in Dahabreh and Hernán (2019) and Degtiar and Rose (2021): generalizability focuses on the setting when the study sample is a subset of the target population, and transportability considers the setting when the study sample and the target population are partially‐ or non‐overlapping. An example of the generalizability problem is as follows: suppose the target population is the trial‐eligible population, and the combined $S = 1$ and $S = 0$ sample is a random sample of the target population, in which trial participants are in the $S = 1$ sample and non‐participants are in the $S = 0$ sample. In this case, the target ATE is $\tau = E[Y_1 - Y_0]$ and we would like to generalize inference about $\tau_1 = E[Y_1 - Y_0 \mid S = 1]$ obtained from the trial data to $\tau$. An example of the transportability problem is as follows: suppose the target population is a real‐world population, and the $S = 0$ sample is a random sample of the target population separately obtained from external data sources such as administrative healthcare databases or survey studies. In this case, the target ATE is $\tau_0 = E[Y_1 - Y_0 \mid S = 0]$ and we would like to transport inference about $\tau_1 = E[Y_1 - Y_0 \mid S = 1]$ to $\tau_0$.
Both problems require some information in the sample, and often two scenarios are considered: (S1) covariates are measured on all individuals in the sample, that is, we observe ; (S2) covariates are measured on a subsample of the sample, that is, we observe , where indicates whether we have data on . In scenario (S2), it is often assumed that such that . That is, is a simple random sample of the sample with two possibilities: (S2.1) is a known constant; (S2.2) is an unknown constant. Dahabreh, Haneuse, Robins, Robertson, et al. (2019) and Dahabreh, Robins, Haneuse, and Hernán (2019) showed that is not identifiable under (S2.2), while is always identifiable in (S2).
3.1. Generalizability and transportability methods
In this section, we review three common strategies for identification and estimation of $\mu_a = E[Y_a]$ (generalizability) and $\mu_{a,0} = E[Y_a \mid S = 0]$ (transportability) for $a = 0$ or $1$. Correspondingly, the ATEs $E[Y_1 - Y_0]$ and $E[Y_1 - Y_0 \mid S = 0]$ can be directly obtained based on $\mu_a$ and $\mu_{a,0}$ by definition. To illustrate the methods, we take scenario (S1) as an example where we observe $(X, A, Y)$ in the $S = 1$ sample and $X$ in the $S = 0$ sample from a total of $n$ subjects. We summarize the methods under all scenarios in Table 1.
TABLE 1.
Scenario | Method | Generalizability | Transportability
(S1) Covariates are measured on all individuals in the $S = 0$ sample | OR | |
 | IPW | |
 | AIPW | |
(S2.1) Covariates are measured on a subsample of the $S = 0$ sample and the subsampling probability is known | OR | |
 | IPW | |
 | AIPW | |
(S2.2) Covariates are measured on a subsample of the $S = 0$ sample and the subsampling probability is unknown | OR | Not identifiable (g) |
 | IPW | Not identifiable (h) |
 | AIPW | Not identifiable |
where is designed by the investigator in an RCT and can also be estimated based on the sample, and identified from the combined sample.
, hence .
because .
identified by . Estimation strategies are proposed in Dahabreh, Haneuse, Robins, Robertson, et al. (2019).
Similar to footnote (e), identified by .
Unlike footnote (c), is not identifiable because is unknown.
Unlike footnote (e), is not identifiable because is unknown.
By footnote (a) and (e), . Although is not identifiable as shown in footnote (h), is identifiable.
3.1.1. Outcome regression
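Let $m_a(X) = E[Y \mid X, S = 1, A = a]$ denote the conditional mean outcome in the study sample and $\hat{m}_a(X)$ denote an estimated model using the trial data. Under Assumptions 1–5, we have the following identification result
$$E[Y_a] = E\{m_a(X)\}, \qquad E[Y_a \mid S = 0] = E\{m_a(X) \mid S = 0\}. \tag{3}$$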
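Both the marginal distribution of $X$ and its conditional distribution given $S = 0$ are identifiable in scenario (S1) where we have observed $X$ on all individuals. Therefore, we can marginalize over the empirical distribution of $X$ in the combined sample and the $S = 0$ sample, respectively, which gives the following outcome regression estimators (Dahabreh et al., 2019,e; Lesko et al., 2017)
$$\hat{\mu}_a^{\mathrm{OR}} = \frac{1}{n}\sum_{i=1}^{n} \hat{m}_a(X_i), \qquad \hat{\mu}_{a,0}^{\mathrm{OR}} = \frac{1}{n_0}\sum_{i=1}^{n} (1 - S_i)\,\hat{m}_a(X_i), \tag{4}$$
where $\hat{m}_a(X)$ is the estimated conditional mean outcome $E[Y \mid X, S = 1, A = a]$ and $n_0 = \sum_{i=1}^{n}(1 - S_i)$ is the size of the $S = 0$ sample. Equation (3) has been referred to as the g‐formula (Greenland & Robins, 1986; Robins, 1986) or standardization (Vansteelandt & Keiding, 2011) in epidemiology, and can also be viewed as imputation in the missing data literature (Cheng, 1994).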
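As a minimal sketch of the outcome regression strategy described above, the following Python snippet simulates a trial nested in a larger sample under scenario (S1), fits a linear outcome model within each trial arm, and averages the predictions over the combined sample and over the non‐participants. The simulated data‐generating process, the linear working model, and all function and variable names are illustrative assumptions and are not taken from the reviewed papers.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares coefficients for the working model y ~ 1 + X."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def or_estimators(X, S, A, Y, a):
    """Outcome-regression estimates of E[Y_a] and E[Y_a | S=0]."""
    arm = (S == 1) & (A == a)            # trial participants in arm a
    coef = fit_linear(X[arm], Y[arm])    # outcome model uses trial data only
    m_hat = np.column_stack([np.ones(len(X)), X]) @ coef
    # Average predictions over the combined sample and over the S=0 sample.
    return m_hat.mean(), m_hat[S == 0].mean()

# Hypothetical simulation (assumed, for illustration only).
rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 1))
S = rng.binomial(1, 1 / (1 + np.exp(-0.5 * X[:, 0])))   # selection depends on X
A = np.where(S == 1, rng.binomial(1, 0.5, size=n), 0)   # randomized within trial
Y1 = 2 + 2 * X[:, 0] + rng.normal(size=n)               # potential outcome under a=1
Y0 = 1 + X[:, 0] + rng.normal(size=n)                   # potential outcome under a=0
Y = np.where(S == 1, np.where(A == 1, Y1, Y0), np.nan)  # outcome seen only in trial

mu1_all, mu1_s0 = or_estimators(X, S, A, Y, a=1)
mu0_all, mu0_s0 = or_estimators(X, S, A, Y, a=0)
ate_all = mu1_all - mu0_all   # generalizability target, E[Y_1 - Y_0]
ate_s0 = mu1_s0 - mu0_s0      # transportability target, E[Y_1 - Y_0 | S=0]
print(round(ate_all, 2), round(ate_s0, 2))
```

In this assumed simulation the treatment effect is $1 + X$ and subjects with larger $X$ are more likely to enter the trial, so the effect among non‐participants is smaller than the effect in the combined population, and the two estimates differ accordingly.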
3.1.2. Inverse probability weighting
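Inverse probability weighting (IPW) is a very commonly used technique (Cole & Stuart, 2010; Dahabreh et al., 2019,e; Lesko et al., 2017; Westreich et al., 2017). Note that the g‐formula in Equation (3) can be re‐expressed as follows
$$E[Y_a] = E\left[\frac{S\,1(A = a)\,Y}{w_a(X)}\right], \qquad E[Y_a \mid S = 0] = \frac{1}{P(S = 0)}\,E\left[\frac{S\,1(A = a)\,Y\,P(S = 0 \mid X)}{w_a(X)}\right], \tag{5}$$
where $w_a(X) = P(A = a \mid X, S = 1)\,P(S = 1 \mid X)$. The propensity score, $P(A = a \mid X, S = 1)$, is a known function designed by the investigator in an RCT, while the trial participation probability $P(S = 1 \mid X)$ can be estimated in the combined sample because $(X, S)$ is fully observed under (S1). We arrive at the following inverse probability‐weighted estimators
$$\hat{\mu}_a^{\mathrm{IPW}} = \frac{1}{n}\sum_{i=1}^{n} \frac{S_i\,1(A_i = a)\,Y_i}{\hat{w}_a(X_i)}, \qquad \hat{\mu}_{a,0}^{\mathrm{IPW}} = \frac{1}{n_0}\sum_{i=1}^{n} \frac{S_i\,1(A_i = a)\,Y_i\,\hat{P}(S = 0 \mid X_i)}{\hat{w}_a(X_i)}, \tag{6}$$
where $n_0 = \sum_{i=1}^{n}(1 - S_i)$, $\hat{P}(S = 0 \mid X) = 1 - \hat{P}(S = 1 \mid X)$, and $\hat{w}_a(X) = \hat{P}(A = a \mid X, S = 1)\,\hat{P}(S = 1 \mid X)$ is a product of the estimated treatment and trial participation probabilities. Although the propensity score is known, estimating the model parameters rather than using the true value can improve efficiency (Hahn, 1998; Lunceford & Davidian, 2004; Robins et al., 1994). Comparing Equation (6) to the traditional IPW estimator using the trial data only, that is,
$$\hat{\mu}_{a,1}^{\mathrm{IPW}} = \frac{1}{n_1}\sum_{i=1}^{n} \frac{S_i\,1(A_i = a)\,Y_i}{\hat{P}(A = a \mid X_i, S = 1)}, \qquad n_1 = \sum_{i=1}^{n} S_i, \tag{7}$$
we further weight each subject who participated in the trial by the inverse of the trial participation probability, $P(S = 1 \mid X)$, to generalize the ATE from the $S = 1$ sample to the combined sample, while to transport the ATE from the $S = 1$ sample to the $S = 0$ sample, trial participants are weighted by the inverse of both the odds of trial participation and $P(S = 0)$.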
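The weighting scheme above can be sketched in a few lines of Python. This is an illustrative sketch only: the simulation, the logistic working model for trial participation (fitted here by a small hand‐written Newton‐Raphson routine), and every function and variable name are assumptions. For simplicity the known randomization probability of 0.5 is plugged in rather than estimated.

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson fit of the working model logit P(y=1 | X) = b0 + X b."""
    D = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(D.shape[1])
    for _ in range(n_iter):
        p = expit(D @ beta)
        grad = D.T @ (y - p)
        hess = (D * (p * (1 - p))[:, None]).T @ D
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def ipw_estimators(X, S, A, Y, a, e_a):
    """IPW estimates of E[Y_a] and E[Y_a | S=0].

    e_a is the known trial randomization probability P(A=a | X, S=1).
    """
    D = np.column_stack([np.ones(len(X)), X])
    p_hat = expit(D @ fit_logistic(X, S))    # estimated P(S=1 | X)
    in_arm = (S == 1) & (A == a)
    y_obs = np.where(in_arm, Y, 0.0)         # outcomes outside the arm never enter
    w = in_arm / (p_hat * e_a)               # inverse of treatment x participation
    mu_all = np.mean(w * y_obs)
    mu_s0 = np.sum(w * (1 - p_hat) * y_obs) / np.sum(S == 0)
    return mu_all, mu_s0

# Hypothetical simulation (assumed, for illustration only).
rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(size=(n, 1))
S = rng.binomial(1, expit(0.5 * X[:, 0]))
A = np.where(S == 1, rng.binomial(1, 0.5, size=n), 0)
Y1 = 2 + 2 * X[:, 0] + rng.normal(size=n)
Y0 = 1 + X[:, 0] + rng.normal(size=n)
Y = np.where(S == 1, np.where(A == 1, Y1, Y0), np.nan)

mu1_all, mu1_s0 = ipw_estimators(X, S, A, Y, a=1, e_a=0.5)
mu0_all, mu0_s0 = ipw_estimators(X, S, A, Y, a=0, e_a=0.5)
ate_all = mu1_all - mu0_all
ate_s0 = mu1_s0 - mu0_s0
print(round(ate_all, 2), round(ate_s0, 2))
```

The estimators here use unnormalized weights, matching the form of the weighted averages discussed above; normalizing the weights to sum to one within each arm is a common variant that is not shown.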
3.1.3. Augmented inverse probability weighting
So far, each of the estimators relies on estimating components of the likelihood such as $m_a(X) = E[Y \mid X, S = 1, A = a]$ and $P(S = 1 \mid X)$, which are not necessarily in themselves of scientific interest. Nonparametric estimation may not be feasible when $X$ is of high dimension, while parametric working models may be prone to model misspecification. We can combine the two estimators to gain robustness. A common approach to derive a robust estimator is by constructing an estimating equation from the efficient influence function (EIF) and evaluating it under a working model for the observed data distribution to solve for the parameter of interest, which is widely used in missing data problems (Tsiatis, 2007). Any regular and asymptotically linear estimator is asymptotically equivalent to the sample average of an influence function, which is a function of the observed data with mean zero and finite variance, and the one with the smallest variance is referred to as the EIF (Tsiatis, 2007; Van der Vaart, 2000). The EIFs for $\mu_a = E[Y_a]$ and $\mu_{a,0} = E[Y_a \mid S = 0]$ under a nonparametric model where the distribution of the observed data is unrestricted are
$$\phi_a(O) = \frac{S\,1(A = a)}{w_a(X)}\{Y - m_a(X)\} + m_a(X) - \mu_a, \qquad \phi_{a,0}(O) = \frac{1}{P(S = 0)}\left[\frac{S\,1(A = a)\,P(S = 0 \mid X)}{w_a(X)}\{Y - m_a(X)\} + (1 - S)\{m_a(X) - \mu_{a,0}\}\right], \tag{8}$$
where $O = (X, S, SA, SY)$ denotes the observed data, $w_a(X) = P(A = a \mid X, S = 1)\,P(S = 1 \mid X)$, and $E[\phi_a(O)] = E[\phi_{a,0}(O)] = 0$ at the true values. Let $\hat{\phi}_a$ and $\hat{\phi}_{a,0}$ respectively denote the evaluation of $\phi_a$ and $\phi_{a,0}$ under an estimated working model, then we can obtain the AIPW estimators by solving $\sum_{i=1}^{n}\hat{\phi}_a(O_i) = 0$ and $\sum_{i=1}^{n}\hat{\phi}_{a,0}(O_i) = 0$ (Dahabreh et al., 2019,e). As mentioned in Section 3.1.2, $P(A = a \mid X, S = 1)$ is guaranteed to be correctly specified in an RCT, therefore $w_a(X)$ is correctly specified as long as $P(S = 1 \mid X)$ is. Hence the above AIPW estimators are doubly robust in the sense that they remain consistent when either the probability of trial participation or the outcome regression model is correctly specified. This can be seen by the following observation: the IPW estimator introduced in Section 3.1.2 can be obtained by misspecifying $m_a(X)$ as zero in Equation (8), while the OR estimator introduced in Section 3.1.1 can be obtained by setting the weight in the first term of both $\phi_a$ and $\phi_{a,0}$ to zero in Equation (8).
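The double robustness property can be illustrated numerically. The sketch below, written under assumed simulation settings with hypothetical function names, computes augmented estimators of $E[Y_a]$ and $E[Y_a \mid S = 0]$ twice: once with a correct linear outcome model but a deliberately wrong (constant) participation model, and once with a correct logistic participation model but a deliberately wrong (constant) outcome model. The randomization probability is taken as known and equal to 0.5.

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def design(X):
    return np.column_stack([np.ones(len(X)), X])

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson fit of the working model logit P(y=1 | X) = b0 + X b."""
    D, beta = design(X), np.zeros(X.shape[1] + 1)
    for _ in range(n_iter):
        p = expit(D @ beta)
        hess = (D * (p * (1 - p))[:, None]).T @ D
        beta = beta + np.linalg.solve(hess, D.T @ (y - p))
    return beta

def aipw_estimators(S, A, Y, a, e_a, p_hat, m_hat):
    """AIPW estimates of E[Y_a] and E[Y_a | S=0] given nuisance estimates."""
    in_arm = (S == 1) & (A == a)
    resid = np.where(in_arm, Y - m_hat, 0.0)   # residuals only where Y is observed
    w = in_arm / (p_hat * e_a)
    mu_all = np.mean(w * resid + m_hat)
    mu_s0 = np.sum(w * (1 - p_hat) * resid + (S == 0) * m_hat) / np.sum(S == 0)
    return mu_all, mu_s0

# Hypothetical simulation (assumed, for illustration only).
rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(size=(n, 1))
S = rng.binomial(1, expit(0.5 * X[:, 0]))
A = np.where(S == 1, rng.binomial(1, 0.5, size=n), 0)
Y1 = 2 + 2 * X[:, 0] + rng.normal(size=n)
Y0 = 1 + X[:, 0] + rng.normal(size=n)
Y = np.where(S == 1, np.where(A == 1, Y1, Y0), np.nan)

def ate_pair(p_hat, outcome_model):
    """ATE estimates (combined sample, S=0 sample) for a choice of working models."""
    est = {}
    for a in (0, 1):
        arm = (S == 1) & (A == a)
        if outcome_model == "linear":          # correctly specified here
            coef, *_ = np.linalg.lstsq(design(X[arm]), Y[arm], rcond=None)
            m_hat = design(X) @ coef
        else:                                  # misspecified: ignores X
            m_hat = np.full(n, Y[arm].mean())
        est[a] = aipw_estimators(S, A, Y, a, 0.5, p_hat, m_hat)
    return est[1][0] - est[0][0], est[1][1] - est[0][1]

p_correct = expit(design(X) @ fit_logistic(X, S))
p_wrong = np.full(n, S.mean())                 # misspecified: ignores X
ate_all_A, ate_s0_A = ate_pair(p_wrong, "linear")
ate_all_B, ate_s0_B = ate_pair(p_correct, "constant")
print(round(ate_all_A, 2), round(ate_s0_A, 2), round(ate_all_B, 2), round(ate_s0_B, 2))
```

In the assumed simulation the true effects are 1 in the combined population and roughly 0.77 among non‐participants, and both combinations of working models should land close to these values, whereas using the two wrong models together would not.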
3.1.4. Other methods for combining data from clinical trial and external data
Other doubly robust estimators include a targeted maximum likelihood estimator (Rudolph & van der Laan, 2017) and an augmented calibration weighted estimator (Dong et al., 2020). A sensitivity analysis that replaces Assumption 4 with a prespecified bias function has also been proposed (Dahabreh, Robins, Haneuse, Saeed, et al., 2019). Meta‐analysis is often used to synthesize information about parameters from data collected from multiple trials, which allows for extensions of the above methods to the setting of generalizing or transporting inferences from multiple RCTs to a target population (Dahabreh, Petito, et al., 2020; Dahabreh, Robertson, Petito, Hernán, & Steingrimsson, 2019; Manski, 2000; Steele et al., 2020). Identification under an arbitrary collection of observational and experimental data has been investigated (Lee et al., 2020). Combining probability and nonprobability samples with high‐dimensional data has also been studied (Yang, Kim, & Song, 2020).
3.2. Correcting for bias in observational study using validation or trial data
Internal validity, that is, Assumptions 2–3, naturally holds in RCTs due to randomization but not necessarily in observational studies due to potential unmeasured confounding. Borrowing strength from the internal validity of RCT data and the large sample size of observational data can mitigate bias and improve efficiency.
In this vein, Yang, Zeng, and Wang (2020) considered estimation of the average treatment effect on the treated (ATT) in the scenario where , and is unobserved. Data are obtained from RCT and from observational study . In RCT, is sufficient for Assumption 2, while in the observational study, the unmeasured confounding leads to bias. A weaker version of Assumption 4 is further assumed. Yang, Zeng, and Wang (2020) proposed to model unmeasured confounding bias via , which is equal to zero if . Modeling this bias function allows one to improve efficiency in estimation of the ATT by combining observational data and RCT data. A similar idea was considered in Kallus et al. (2018) where a confounding bias correction term was learned with interpolation of between RCT and observational data, and Gui (2020) where RCT data were used to correct bias in an imperfect estimator based on an invalid instrumental variable defined on observation data.
In Athey et al. (2020), it was assumed that we observe data from RCT and from an observational study , where denotes a secondary outcome observed in both studies, denotes the primary outcome expensive to measure in RCT, and the sample is a random sample of the target population. Motivated by the observation that the treatment effects on the secondary outcome should be similar in the RCT and observational data if is sufficient for Assumption 2, Athey et al. (2020) developed a control function method for using differences in the estimated causal effects on the secondary outcome between the two samples to adjust estimation of the treatment effect on the primary outcome.
Yang and Ding (2019) considered the scenario where a small validation dataset with all confounders and a big main dataset with unmeasured confounders are available. Both are random samples of the target population hence external validity is satisfied. The big main data can improve efficiency and the small validation data can ensure consistency. For each dataset , let denote a consistent estimator of the ATE based on a user‐specified estimation strategy adjusting for all confounders , and let denote an error‐prone estimator using the same estimation strategy but with uncontrolled. Apparently cannot be obtained. A key insight is that the two error‐prone estimates should be consistent for zero. By modeling the joint distribution of and , they derived the most efficient consistent estimator of among all linear combinations . Other methods for controlling unmeasured confounding with validation data include the propensity score calibration (Stürmer et al., 2005) and conditional propensity scores (McCandless et al., 2012).
3.3. Combining clinical trial with external control
Single‐arm clinical trials are typically conducted for rare diseases due to difficulties in recruiting enough patients for an adequately powered two‐arm trial, or for diseases with high unmet medical need that raise ethical concerns (Abrahami et al., 2021; Cuffe, 2011; Viele et al., 2014). Historical or contemporaneous information on the control arm is often available from previous RCT or observational studies. Such external controls have been used to emulate the control arm in the setting of single‐arm trials, which can decrease costs and duration and improve power.
Formally, the single‐arm trial data are a random sample of the target population, while the external control data contain . Our goal is to estimate leveraging historical data in order to contrast it with the mean response in the single‐arm trial to estimate the treatment effect. Traditional methods to account for differences in patient characteristics between the external control and the target population include meta‐analysis (Hasegawa et al., 2017; Schmidli et al., 2014; Schmidli et al., 2020; Weber et al., 2018; Zhang et al., 2019) and matching (Schmidli et al., 2020; Signorovitch et al., 2010). Typically, a form of exchangeability across different studies like Assumption 4 is assumed. Recently, Li and Song (2020) proposed to build an outcome regression model using external control data under exchangeability, and then estimate by standardization, which is similar to the identification strategy in Equation (3) with . Besides single‐arm trial data, external controls have also been used to improve efficiency in a traditional RCT with data on both arms available. Li, Miao, Lu, and Zhou (2020) showed that the semiparametric efficiency bound for estimating is reduced by incorporating external control data, and proposed a doubly robust and locally efficient estimator that combines outcome regression and inverse probability of treatment weighting.
4. NO SINGLE SAMPLE CONTAINS ALL RELEVANT VARIABLES
The data integration problems described so far have complete data on all relevant variables in at least one sample. A more challenging problem is when there are no complete data at any data source. This setting has been referred to as data combination (Ridder & Moffitt, 2007; Shu & Tan, 2020) or data fusion (Evans et al., 2018; Li, Miao, Cai, et al., 2020; Sun & Miao, 2018) in the literature. In the following, we will first introduce methods applicable to the general data combination problem in Section 4.1. We will use a new set of notation in Section 4.1 while notation in the rest of the article follows Section 2. We will then overview specific causal inference problems and methods in Sections 4.2 and 4.3.
4.1. General data combination methods
We first introduce some new notation. Suppose for each member from a population of interest, we can define a vector of relevant variables . A sample of complete data on is unavailable, instead two separate samples are available. In one sample we observe variables and in the other sample, we observe , with shared by the two datasets. Suppose the and samples are of size and , respectively, with total sample size , then a merged sample combining the two samples is an i.i.d. sample containing .
4.1.1. Estimation of general parameters defined through moment restrictions
We assume that the sample is drawn from the population of interest, while the sample is an auxiliary sample independent of the sample, which ensures identification that could not be achieved by the sample alone. We are often interested in a population parameter defined as the unique solution to the vector of population moment conditions , which includes the maximum likelihood estimation and generalized method of moments as special cases. For example, is the ATT when is the binary treatment indicator, are the potential outcomes under treatment and control respectively, is a vector of pretreatment covariates, and . Another example is the two‐sample instrumental variable (IV) problem, where is a vector of IVs, is the treatment (not necessarily binary), is the outcome, and . We will detail the two‐sample IV literature in Section 4.2. Typically selection exchangeability () and positivity () are assumed to identify by combining the two samples.
Graham et al. (2016) and Shu and Tan (2020) proposed doubly robust and locally efficient estimators of extending the semiparametric efficiency theory of Hahn (1998) and Chen et al. (2008). We illustrate the estimation strategies in Shu and Tan (2020) below. When , the moment restriction becomes in which is unobserved in the sample and we need to combine the two samples for estimation. Shu and Tan (2020) took the EIF in Chen et al. (2008) as the estimating function to obtain an AIPW estimator, which solves where
(9) |
The AIPW estimator is doubly robust in that it remains consistent when either the propensity score model or the outcome regression model is correctly specified. This can be seen by the following observation: an IPW estimator can be obtained by misspecifying as zero in Equation (9), while an outcome regression estimator can be obtained by setting to zero in Equation (9).
When , Graham et al. (2016) and Shu and Tan (2020) further imposed a key identification assumption that the moment condition is separable in the sense that , where and only depend on variables observed in one sample. We can see that can be directly estimated from the sample, while the challenge is to estimate combining both samples. Motivated by the observation that estimation of reduces to the case with substituted with , Shu and Tan (2020) proposed an AIPW estimator that solves where
with being the estimating function in Equation (9) with substituted with .
An alternative assumption often imposed is the conditional independence assumption, that is, (Ogburn et al., 2020; Ridder & Moffitt, 2007). Under this assumption we have where each of and can be estimated from one sample. Therefore, the sample moment conditions can be computed combining the two samples.
4.1.2. Statistical matching
Another set of methods in data combination problems is statistical matching, which has been proposed mainly under two scenarios. In the first scenario, a sufficient number of units are shared between the two data sources, that is, the two samples are partially overlapping. In this case, it is convenient to merge the two samples by linking the records relating to the same unit. There is a rich literature on record linkage which is beyond the scope of this article (Deepak & Jurek‐Loughrey, 2018; Fellegi & Sunter, 1969; Herzog et al., 2010; Komarova et al., 2018; Sayers et al., 2016; Winkler, 1999). In the second scenario, the two samples are selected from the same population but have no common unit. In this case, a statistical matching framework has been proposed in survey studies, which finds a matched pair of units according to the shared variable then imputes the missing value for one unit using the observed value from its matched counterpart (D'Orazio, 2015; D'Orazio et al., 2006; Radner, 1980; Ridder & Moffitt, 2007; Yang & Kim, 2020). Validity of the statistical matching approach depends on the conditional independence assumption that conditional on the shared variable , the potentially missing variables and are independent. Under this assumption, matching on is sufficient to impute in sample regardless of whether are the same. A similar argument holds for imputation in the sample.
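To make the second scenario concrete, here is a small sketch of nearest‐neighbor statistical matching on a scalar shared variable. The names `x`, `y`, and `z`, the simulated data, and the helper `nearest_neighbor_impute` are all hypothetical: one sample is assumed to record `(y, x)`, the other `(z, x)`, and `y` and `z` are generated to be conditionally independent given `x`, which is the assumption the approach relies on.

```python
import numpy as np

def nearest_neighbor_impute(x_recipient, x_donor, v_donor):
    """Impute a variable for each recipient from its nearest donor on scalar x."""
    order = np.argsort(x_donor)
    xs, vs = x_donor[order], v_donor[order]
    pos = np.searchsorted(xs, x_recipient)         # insertion points in sorted donors
    left = np.clip(pos - 1, 0, len(xs) - 1)
    right = np.clip(pos, 0, len(xs) - 1)
    use_left = np.abs(x_recipient - xs[left]) <= np.abs(xs[right] - x_recipient)
    return np.where(use_left, vs[left], vs[right])

# Hypothetical simulation (assumed, for illustration only).
rng = np.random.default_rng(1)
n = 10_000
x1 = rng.normal(size=n)
y1 = x1 + rng.normal(size=n)          # first sample records (y, x)
x2 = rng.normal(size=n)
z2 = 2 * x2 + rng.normal(size=n)      # second sample records (z, x)

# Fill in z for units of the first sample using matched units of the second.
z1_imputed = nearest_neighbor_impute(x1, x2, z2)
cov_yz = np.cov(y1, z1_imputed)[0, 1]
print(round(cov_yz, 2))
```

Under the assumed model the covariance between `y` and `z` equals 2, and the matched file should recover a value near that. If `y` and `z` shared dependence beyond `x`, the imputed file would understate it, which is why the conditional independence assumption matters.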
4.1.3. Data combination in regression analysis
Evans et al. (2018) studied a different problem of estimating the regression coefficient of a correctly specified model when both samples are i.i.d. random samples of the same population. Selection exchangeability and positivity were assumed similar to Section 4.1.1, while no assumption on separable moments (Graham et al., 2016; Shu & Tan, 2020) or conditional independence (Ridder & Moffitt, 2007) introduced in previous sections was made. In this setting, identification of can be hard even under linear models, which has been discussed in Pacini (2019), Yang and Kim (2020), and Miao et al. (2022). Evans et al. (2018) proposed a doubly robust estimator for that solves where
(10) |
where is of the same dimension as . The doubly robust estimator remains consistent under misspecification of either or . Therefore, an IPW estimator can be obtained by misspecifying as zero, that is, by substituting with zero in Equation (10), while an imputation estimator can be obtained by substituting with 0.5 in Equation (10).
4.2. Two‐sample instrumental variable and Mendelian randomization
An important instance of the data combination problem is two‐sample instrumental variable estimation. An instrumental variable (IV) is an exogenous variable known to satisfy the following three core assumptions: (I) the IV must be associated with the treatment; (II) the IV must not have a direct effect on the outcome that is not mediated by the treatment; (III) the IV must be independent of unmeasured confounders. The IV approach is one of the most frequently used methods to mitigate unmeasured confounding, denoted as $U$. It turns out that the causal effect can be estimated by combining information from two data sources. Let $Z$ denote an instrumental variable. The two‐sample IV estimation concerns the scenario when $(Z, A, X)$ are available in one data source and $(Z, Y, X)$ are available in a separate data source, with $(Z, X)$ shared by the two datasets. No complete data on all variables are available. In the following, we will suppress the measured covariates $X$ to simplify notation, and all arguments are made implicitly conditional on $X$.
We first consider the case of a binary treatment. Under an assumption of no effect modification of the individual‐level causal effect of the treatment, the ATE is identified by the ratio of the IV‐outcome association to the IV‐treatment association. Hence common IV methods often estimate the effect of the treatment using the IV‐outcome and IV‐treatment associations. The numerator and denominator can be separately estimated from two distinct samples if both are random samples of the same target population. In the general case where the treatment is not necessarily binary and could be a vector, the most common IV approach is based on linear models, and the IV estimator is computed from the sample covariance of the IV and the outcome and the sample covariance of the IV and the treatment. In the one‐sample setting, the IV estimator is equivalent to a two‐stage least squares (2SLS) estimator obtained by first regressing the treatment on the IV, and then regressing the outcome on the fitted values of the treatment. Angrist and Krueger (1992) and Arellano and Meghir (1992) showed that the IV estimator can be obtained by computing the IV‐treatment covariance from the sample containing the IV and the treatment and the IV‐outcome covariance from the sample containing the IV and the outcome, referred to as the two‐sample IV estimator. Klevmarken (1982) and Angrist (1995) showed that 2SLS can also be carried out separately using two samples, referred to as two‐sample two‐stage least squares (TS2SLS) estimation (Björklund & Jäntti, 1997). In the first stage, the treatment is regressed on the IV using the IV‐treatment sample, and the estimates are then combined with observations on the IV in the IV‐outcome sample to form predicted treatment values. In the second stage, the outcome is regressed on these predicted values. Inoue and Solon (2010) pointed out that the equivalence of IV and 2SLS estimation in the one‐sample setting does not hold in the two‐sample setting. In fact, TS2SLS is more efficient than two‐sample IV because it implicitly corrects for differences in the distribution of the IV between the two samples.
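To make the two estimators concrete, the following Python sketch simulates a scalar IV, a scalar treatment, and an outcome with an unmeasured confounder, then keeps only the IV and treatment in one sample and only the IV and outcome in the other. The data‐generating model, the true effect of 2.0, and the helper names `simulate`, `cov`, and `ols` are assumptions made purely for illustration; they are not taken from the cited papers.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def ols(x, y):
    """Intercept and slope from simple least squares of y on x."""
    slope = cov(x, y) / cov(x, x)
    return mean(y) - slope * mean(x), slope

def simulate(n, rng, beta=2.0):
    """Draw (z, a, y) rows; u confounds the treatment a and the outcome y."""
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                      # instrument
        u = rng.gauss(0.0, 1.0)                      # unmeasured confounder
        a = 0.8 * z + u + rng.gauss(0.0, 1.0)        # treatment
        y = beta * a - 1.5 * u + rng.gauss(0.0, 1.0)  # outcome
        rows.append((z, a, y))
    return rows

rng = random.Random(0)
sample1 = simulate(20000, rng)  # only (z, a) is used from this sample
sample2 = simulate(20000, rng)  # only (z, y) is used from this sample
z1 = [r[0] for r in sample1]
a1 = [r[1] for r in sample1]
z2 = [r[0] for r in sample2]
y2 = [r[2] for r in sample2]

# Two-sample IV: IV-outcome covariance from sample 2 divided by
# IV-treatment covariance from sample 1.
tsiv = cov(z2, y2) / cov(z1, a1)

# TS2SLS: first stage fitted in sample 1, predictions formed in sample 2,
# second stage regresses the outcome on the predicted treatment.
intercept, first_stage_slope = ols(z1, a1)
a_hat2 = [intercept + first_stage_slope * z for z in z2]
_, ts2sls = ols(a_hat2, y2)

print(round(tsiv, 3), round(ts2sls, 3))
```

In this scalar case the two estimates differ only by the ratio of the sample variances of the IV in the two samples, which is one way to see the correction noted by Inoue and Solon (2010).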
The classical two‐sample IV methods above often assume that the two samples are compatible, that is, that they share the same observed data distribution. However, the common variable, that is, the IV, can have different distributions in the two samples. Graham et al. (2016) modeled the selection probability parametrically and developed a doubly robust and locally efficient estimator that can be applied in more general data combination problems. Similar methods proposed in Shu and Tan (2020), detailed in Section 4.1, were also applied to the two‐sample IV problem. It is important to note that the estimator proposed by Graham et al. (2016) is based on the EIF derived under a correct model for the selection probability and is therefore doubly robust only under such restricted specification of the nuisance parameters, whereas the estimator of Shu and Tan (2020) is based on the EIF under a nonparametric model for the observed data and is doubly robust without such restrictions. Sun and Miao (2018) established sufficient conditions for nonparametric identification of the ATE allowing for heterogeneous samples, derived the efficiency bound for estimating the ATE, and proposed a multiply robust and locally efficient estimator for estimation and inference.
Two‐sample Mendelian randomization (MR) methods, which use genetic variants as IVs, have also been studied recently. They leverage publicly available summary statistics on genetic instrument‐treatment and genetic instrument‐outcome associations, typically obtained from genome‐wide association studies (GWAS; Davey Smith & Ebrahim, 2003; Davey Smith & Hemani, 2014; Lawlor, 2016; Pierce & Burgess, 2013; Spiller et al., 2019; Zhu et al., 2018). Although simple and convenient, traditional two‐sample MR methods typically rely on all instruments being valid. Methods robust to invalid instruments have been studied (Bowden et al., 2015, 2016; Hartwig et al., 2017; Li, 2017; Sanderson et al., 2021; Zhao et al., 2020), as have extensions to the setting of weak instruments (Burgess et al., 2016; Sanderson et al., 2021; Wang & Kang, 2019). Zhao et al. (2019) further considered the scenario in which the sample compatibility assumption is violated and proposed methods that are robust to heterogeneous samples.
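As a hedged illustration of how summary statistics can be combined, the sketch below implements the fixed‐effect inverse‐variance weighted (IVW) estimator, a commonly used summary‐data MR estimator that is not described in the text above and is shown here only as one example of a method that assumes all instruments are valid. The summary statistics are simulated rather than real GWAS results, and the function name `ivw_estimate`, the number of variants, and the true effect of 0.5 are assumptions for illustration.

```python
import math
import random

def ivw_estimate(bx, by, se_y):
    """Fixed-effect IVW estimate and standard error from per-variant summaries.

    bx:   variant-treatment association estimates
    by:   variant-outcome association estimates
    se_y: standard errors of the variant-outcome estimates
    """
    num = sum(x * y / s ** 2 for x, y, s in zip(bx, by, se_y))
    den = sum(x ** 2 / s ** 2 for x, s in zip(bx, se_y))
    return num / den, math.sqrt(1.0 / den)

# Simulated (not real) summary statistics for 50 valid, independent variants.
rng = random.Random(1)
true_effect = 0.5
bx, by, se_y = [], [], []
for _ in range(50):
    gamma = rng.uniform(0.05, 0.15)                          # true variant-treatment effect
    bx.append(gamma + rng.gauss(0.0, 0.005))                 # estimated in one sample
    by.append(true_effect * gamma + rng.gauss(0.0, 0.01))    # estimated in another sample
    se_y.append(0.01)

effect, se = ivw_estimate(bx, by, se_y)
print(round(effect, 3), round(se, 4))
```

The estimate is a weighted average of per‐variant ratio estimates, so it inherits their bias when instruments are invalid or weak, which is the motivation for the robust methods cited above.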
4.3. Other causal inference problems
Fan et al. (2014) studied the scenario in which the shared variable is the treatment variable. More specifically, the outcome, the treatment, and the covariates are only partially observed in two separate datasets: the outcome dataset contains the outcome and the treatment, while the demographics dataset contains the treatment and the covariates. In this case, the ATE is not identified from either dataset unless one is willing to make additional identification assumptions. Nevertheless, Fan et al. (2014) established sharp bounds for the ATE via bounding its inverse probability weighting representation under a continuous version of the classical monotone rearrangement inequality (Cambanis et al., 1976; Hardy et al., 1952). Other related works include Manski (2000), Cross and Manski (2002), and Ridder and Moffitt (2007).
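The role of the rearrangement inequality can be seen in its classical discrete form: when two variables are observed only in separate samples, the mean of their product over all possible pairings is smallest when one sorted sample is paired with the other in reverse order and largest when both are sorted in the same order. The sketch below illustrates only this discrete inequality for two equal‐size samples; it is not the bound of Fan et al. (2014), and the function name `coupling_bounds` is hypothetical.

```python
def coupling_bounds(x, y):
    """Smallest and largest mean of x*y over all pairings of two equal-size samples."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes samples of equal size")
    xs = sorted(x)
    ys = sorted(y)
    n = len(xs)
    # Same-order pairing maximizes the sum of products.
    upper = sum(a * b for a, b in zip(xs, ys)) / n
    # Opposite-order pairing minimizes it.
    lower = sum(a * b for a, b in zip(xs, reversed(ys))) / n
    return lower, upper

lower, upper = coupling_bounds([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(lower, upper)
```

Every joint distribution consistent with the two marginal samples yields a mean product between these two values, which is why such bounds are sharp without further assumptions.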
A more general setting was studied in Li, Miao, Cai, et al. (2020) assuming multiple datasets. Specifically, let , indicate each dataset, and , denote the set of observed variables in the ‐th dataset, with . Assuming that , is randomly assigned, and is linear and additive, Li, Miao, Cai, et al. (2020) showed that the coefficient of the treatment, which is the ATE under the linear additive model, is identifiable by combining summary‐level statistics obtained from the separate datasets.
5. OTHER SETTINGS OF DATA INTEGRATION IN CAUSAL INFERENCE
5.1. Distributed data setting
Meta‐analysis has a long history of integrating results from multiple clinical trials without access to individual‐level trial data (DerSimonian, 2015; DerSimonian & Laird, 1986). Recently, another widely studied topic is the analysis of distributed data where individual‐level observational data are not shareable due to privacy concerns (Toh, 2020). This is increasingly needed in multidatabase or multicenter studies of the comparative effectiveness and safety of medical products using real‐world data such as electronic health records. Each data partner can share a summary‐level dataset with the analysis center. A few methods have been proposed, and we summarize them ordered by the amount of information shared. The first method is to reduce the dimension of measured confounders using the propensity score or the prognostic score (Hansen, 2008; Rosenbaum & Rubin, 1983), then share individual‐level treatment, outcome, and score with the analysis center to apply propensity score methods (Rassen & Schneeweiss, 2012; Shi et al., 2019). The second method is to aggregate subjects into cells defined by confounders or the propensity score strata, then adjust for confounding based on counts of subjects in each cell (Cook & Goldman, 1989; Rassen et al., 2010; Shu et al., 2020). Propensity score matching within each data partner can be done prior to the aggregation (Toh et al., 2013; Yoshida et al., 2018). The third method is distributed regression (Toh et al., 2018; Zhang et al., 2013), and the fourth is meta‐analysis of site‐specific results (Toh et al., 2013).
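The fourth approach, in which each site shares only an effect estimate and its variance, can be sketched with the random‐effects estimator of DerSimonian and Laird (1986). The code below is a minimal version of that moment estimator; the function name `dersimonian_laird` and the two site‐level inputs are made‐up values for illustration, not results from any study.

```python
import math

def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate, its standard error, and tau^2."""
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures between-site heterogeneity.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    k = len(effects)
    # Moment estimator of the between-site variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2

# Hypothetical site-specific estimates (e.g., log hazard ratios) and variances.
site_effects = [0.0, 1.0]
site_variances = [0.01, 0.01]
pooled, se, tau2 = dersimonian_laird(site_effects, site_variances)
print(pooled, se, tau2)
```

Only two numbers per site cross the privacy boundary here, which is the least information shared among the four approaches, at the cost of relying on each site's own confounding adjustment.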
5.2. Bayesian causal inference
The Bayesian framework can naturally facilitate the borrowing of prior information across data sources (Gelman, 2006; Hobbs et al., 2011; Ibrahim & Chen, 2000; Kaizer et al., 2018). Boatman et al. (2020) studied the problem of estimating causal effects from a primary source and borrowing from any number of supplemental sources when data on outcome, treatment, and confounders are available in all data sources. When some confounders are unmeasured in a large main dataset but are available in a small validation dataset, a missing data perspective has been used to impute the missing covariates (Gelman et al., 1998; Jackson et al., 2009; Murray & Reiter, 2016). When the number of missing covariates in the main study is large relative to the sample size of the validation study, Antonelli et al. (2017) proposed a Bayesian approach to estimate the ATE in the main study that combines Bayesian variable selection and missing data imputation, allowing for heterogeneous treatment effects between the main and validation studies. Comment et al. (2019) proposed to use informative priors on quantities related to the unmeasured confounding bias in a range of settings including both static and dynamic treatment regimes as well as treatment‐induced mediator‐outcome confounding.
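One simple instance of Bayesian borrowing is the power prior of Ibrahim and Chen (2000), which raises the likelihood of the supplemental data to a power between 0 and 1. The sketch below covers only the simplest conjugate case, a normal mean with known variance, a flat initial prior, and a fixed discount parameter; these simplifications, the function name `power_prior_posterior`, and the numeric inputs are assumptions for illustration rather than the methods of the papers cited above.

```python
def power_prior_posterior(ybar, n, ybar0, n0, sigma2, a0):
    """Posterior mean and variance of a normal mean under a fixed-a0 power prior.

    ybar, n:   mean and size of the primary sample
    ybar0, n0: mean and size of the supplemental sample
    sigma2:    known outcome variance, assumed common to both samples
    a0:        discount in [0, 1]; 0 ignores the supplemental data, 1 pools it fully
    """
    if not 0.0 <= a0 <= 1.0:
        raise ValueError("a0 must lie in [0, 1]")
    effective_n = n + a0 * n0
    post_mean = (n * ybar + a0 * n0 * ybar0) / effective_n
    post_var = sigma2 / effective_n
    return post_mean, post_var

# Hypothetical inputs: a small primary sample and a larger supplemental one.
no_borrowing = power_prior_posterior(1.0, 50, 0.0, 200, 4.0, 0.0)
half_borrowing = power_prior_posterior(1.0, 50, 0.0, 200, 4.0, 0.5)
full_borrowing = power_prior_posterior(1.0, 50, 0.0, 200, 4.0, 1.0)
print(no_borrowing, half_borrowing, full_borrowing)
```

Larger values of the discount shrink the posterior variance but pull the posterior mean toward the supplemental source, which is the bias and variance trade‐off that adaptive borrowing methods try to manage.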
5.3. Causal discovery
Data integration has also been studied in causal discovery, which aims to learn the causal relations between variables of a system, using multiple heterogeneous datasets that measure the system under different environments or experimental conditions and with different sets of variables. There are two main types of methods. The first type pools data from different experiments to learn a context‐independent causal graph of the system (Cooper & Yoo, 1999; Eaton & Murphy, 2007; Peters et al., 2016; Tian & Pearl, 2001; Zhang et al., 2017). For example, Peters et al. (2016) provided an invariant prediction method built on the idea that the conditional distribution of the outcome given the direct causes is invariant across different experimental conditions. Mooij et al. (2020) proposed to take into account context variables that discriminate the different datasets in standard causal discovery methods applied to the pooled data. The second type derives statistics or constraints from each context separately without pooling data and combines them to learn a single graph (Claassen & Heskes, 2010; Tillman & Spirtes, 2011; Triantafillou & Tsamardinos, 2015).
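The invariance idea can be sketched as follows: for every candidate set of predictors, regress the outcome on that set using the pooled data, test whether the residuals look the same across environments, and intersect all sets that are not rejected. The code below is a simplified sketch in the spirit of Peters et al. (2016), not their procedure: it handles two environments only, uses large‐sample tests for equal residual means and equal mean squared residuals with a Bonferroni correction, and runs on simulated data in which the first variable is a cause of the outcome and the second is an effect of it. All function names and the data‐generating model are assumptions.

```python
import itertools
import math
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def pooled_residuals(X, y, cols):
    """Residuals from least squares of y on an intercept and the chosen columns."""
    D = [[1.0] + [row[c] for c in cols] for row in X]
    p = len(cols) + 1
    A = [[sum(d[i] * d[j] for d in D) for j in range(p)] for i in range(p)]
    b = [sum(d[i] * yi for d, yi in zip(D, y)) for i in range(p)]
    coef = solve(A, b)
    return [yi - sum(ci * di for ci, di in zip(coef, d)) for d, yi in zip(D, y)]

def mean_diff_pvalue(u, v):
    """Two-sided large-sample p-value for equal means of two samples."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    vu = sum((a - mu) ** 2 for a in u) / (len(u) - 1)
    vv = sum((a - mv) ** 2 for a in v) / (len(v) - 1)
    z = (mu - mv) / math.sqrt(vu / len(u) + vv / len(v))
    return math.erfc(abs(z) / math.sqrt(2.0))

def invariant_set(X, y, env, alpha=0.05):
    """Intersection of all predictor sets whose residuals look invariant."""
    accepted = []
    d = len(X[0])
    for k in range(d + 1):
        for cols in itertools.combinations(range(d), k):
            res = pooled_residuals(X, y, cols)
            r0 = [r for r, e in zip(res, env) if e == 0]
            r1 = [r for r, e in zip(res, env) if e == 1]
            p_mean = mean_diff_pvalue(r0, r1)
            p_var = mean_diff_pvalue([r * r for r in r0], [r * r for r in r1])
            if 2.0 * min(p_mean, p_var) > alpha:  # Bonferroni over the two tests
                accepted.append(set(cols))
    return set.intersection(*accepted) if accepted else set()

rng = random.Random(3)
X, y, env = [], [], []
for e in (0, 1):
    for _ in range(2000):
        x1 = rng.gauss(2.0 * e, 1.0)              # environment 1 shifts the cause x1
        yi = 1.5 * x1 + rng.gauss(0.0, 1.0)       # mechanism of y is the same in both
        x2 = yi + rng.gauss(0.0, 1.0 + 2.0 * e)   # x2 is an effect of y; its noise changes
        X.append([x1, x2])
        y.append(yi)
        env.append(e)

parents = invariant_set(X, y, env)
print(parents)
```

Because the output is an intersection of accepted sets, the procedure is conservative: it may return the empty set, but under the invariance assumption it includes a non‐cause only with small probability.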
6. DISCUSSION
In this article, we reviewed a collection of data integration methods in causal inference. A common perspective views data integration in causal inference as a missing data problem where the study sample is a subset of the target population. This problem is referred to as generalizability or verify‐in‐sample. We summarize the data missing patterns in Sections 3 and 4 in Table 2. Another increasingly recognized setting is when the study sample and the target population are partially overlapping or nonoverlapping, in which case selection exchangeability requires that the variables that determine study inclusion/exclusion are not predictive of the outcome or at least do not modify the treatment effect. This problem is referred to as transportability or verify‐out‐of‐sample (Chen et al., 2008; Colnet et al., 2020; Dahabreh, Robertson, et al., 2020; Degtiar & Rose, 2021). We summarized causal inference methods under both scenarios and their applications in important real‐world problems, including combining clinical trials with external information, correcting for unmeasured confounding in observational studies using auxiliary or trial data, two‐sample Mendelian randomization, and distributed data networks. The majority of the methods rely on some form of exchangeability/homogeneity across different data sources; hence, sensitivity analysis for violations of exchangeability assumptions should be routinely conducted. In addition, identification strategies in complex settings, such as when no single sample contains all relevant variables, have not been fully explored, and the connection to the covariate shift problem in machine learning has yet to be fully studied.
TABLE 2. Data missing patterns in the major settings discussed in Sections 3 and 4.
Section | 3.1 | 3.2 | 3.3 | 4.1 | 4.2 | 4.3
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓/✗ | ✓ | 0 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
✓ | ✓ | ✓ | ✓ | ✓ | 1 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Note: For each variable in each sample, ✓ stands for observed, empty stands for unobserved, and ✓/✗ indicates different settings considered by different papers.
CONFLICT OF INTEREST
The authors have declared no conflicts of interest for this article.
AUTHOR CONTRIBUTIONS
Xu Shi: Writing – original draft (lead). Ziyang Pan: Writing – original draft (supporting). Wang Miao: Writing – original draft (supporting).
Shi, X., Pan, Z., & Miao, W. (2023). Data integration in causal inference. WIREs Computational Statistics, 15(1), e1581. 10.1002/wics.1581
Funding Information: Xu Shi is supported by the NIH/NIGMS grant R01GM139926.
Edited by: James E. Gentle, Commissioning Editor and Editor‐in‐Chief and David W. Scott, Review Editor and Editor‐in‐Chief
DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
REFERENCES
- Abrahami, D., Pradhan, R., Yin, H., Honig, P., Baumfeld Andre, E., & Azoulay, L. (2021). Use of real‐world data to emulate a clinical trial and support regulatory decision making: Assessing the impact of temporality, comparator choice, and method of adjustment. Clinical Pharmacology & Therapeutics, 109, 452–461.
- Angrist, J. D. (1995). Split‐sample instrumental variables estimates of the return to schooling. Journal of Business & Economic Statistics, 13, 225–235.
- Angrist, J. D., & Krueger, A. B. (1992). The effect of age at school entry on educational attainment: An application of instrumental variables with moments from two samples. Journal of the American Statistical Association, 87, 328–336.
- Antonelli, J., Zigler, C., & Dominici, F. (2017). Guided Bayesian imputation to adjust for confounding when combining heterogeneous data sources in comparative effectiveness research. Biostatistics, 18, 553–568.
- Arellano, M., & Meghir, C. (1992). Female labour supply and on‐the‐job search: An empirical model estimated using complementary data sets. The Review of Economic Studies, 59, 537–559.
- Athey, S., Chetty, R., & Imbens, G. (2020). Combining experimental and observational data to estimate treatment effects on long term outcomes. arXiv preprint arXiv:2006.09676.
- Bareinboim, E., & Pearl, J. (2016). Causal inference and the data‐fusion problem. Proceedings of the National Academy of Sciences of the United States of America, 113, 7345–7352.
- Björklund, A., & Jäntti, M. (1997). Intergenerational income mobility in Sweden compared to the United States. The American Economic Review, 87, 1009–1018.
- Boatman, J. A., Vock, D. M., & Koopmeiners, J. S. (2020). Borrowing from supplemental sources to estimate causal effects from a primary data source. arXiv preprint arXiv:2003.09680.
- Bowden, J., Davey Smith, G., & Burgess, S. (2015). Mendelian randomization with invalid instruments: Effect estimation and bias detection through Egger regression. International Journal of Epidemiology, 44, 512–525.
- Bowden, J., Davey Smith, G., Haycock, P. C., & Burgess, S. (2016). Consistent estimation in Mendelian randomization with some invalid instruments using a weighted median estimator. Genetic Epidemiology, 40, 304–314.
- Buchanan, A. L., Hudgens, M. G., Cole, S. R., Mollan, K. R., Sax, P. E., Daar, E. S., Adimora, A. A., Eron, J. J., & Mugavero, M. J. (2018). Generalizing evidence from randomized trials using inverse probability of sampling weights. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181, 1193–1209.
- Burgess, S., Davies, N. M., & Thompson, S. G. (2016). Bias due to participant overlap in two‐sample Mendelian randomization. Genetic Epidemiology, 40, 597–608.
- Cambanis, S., Simons, G., & Stout, W. (1976). Inequalities for E k(X, Y) when the marginals are fixed. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 36, 285–294.
- Chen, X., Hong, H., & Tarozzi, A. (2008). Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics, 36, 808–843.
- Cheng, P. E. (1994). Nonparametric estimation of mean functionals with data missing at random. Journal of the American Statistical Association, 89, 81–87.
- Claassen, T., & Heskes, T. (2010). Causal discovery in multiple models from different experiments. In Twenty‐fourth Annual Conference on Neural Information Processing Systems.
- Cole, S. R., & Stuart, E. A. (2010). Generalizing evidence from randomized clinical trials to target populations: The ACTG 320 trial. American Journal of Epidemiology, 172, 107–115.
- Colnet, B., Mayer, I., Chen, G., Dieng, A., Li, R., Varoquaux, G., Vert, J.‐P., Josse, J., & Yang, S. (2020). Causal inference methods for combining randomized trials and observational studies: A review. arXiv preprint arXiv:2011.08047.
- Comment, L., Coull, B. A., Zigler, C., & Valeri, L. (2019). Bayesian data fusion for unmeasured confounding. arXiv preprint arXiv:1902.10613.
- Cook, E. F., & Goldman, L. (1989). Performance of tests of significance based on stratification by a multivariate confounder score or by a propensity score. Journal of Clinical Epidemiology, 42, 317–324.
- Cooper, G. F., & Yoo, C. (1999). Causal discovery from a mixture of experimental and observational data. In 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 116–125.
- Cross, P. J., & Manski, C. F. (2002). Regressions, short and long. Econometrica, 70, 357–368.
- Cuffe, R. L. (2011). The inclusion of historical control data may reduce the power of a confirmatory study. Statistics in Medicine, 30, 1329–1338.
- Dahabreh, I. J., Haneuse, S. J., Robins, J. M., Robertson, S. E., Buchanan, A. L., Stuart, E. A., & Hernán, M. A. (2019). Study designs for extending causal inferences from a randomized trial to a target population. arXiv preprint arXiv:1905.07764.
- Dahabreh, I. J., & Hernán, M. A. (2019). Extending inferences from a randomized trial to a target population. European Journal of Epidemiology, 34, 719–722.
- Dahabreh, I. J., Hernán, M. A., Robertson, S. E., Buchanan, A., & Steingrimsson, J. A. (2019). Generalizing trial findings using nested trial designs with sub‐sampling of non‐randomized individuals. arXiv preprint arXiv:1902.06080.
- Dahabreh, I. J., Petito, L. C., Robertson, S. E., Hernán, M. A., & Steingrimsson, J. A. (2020). Toward causally interpretable meta‐analysis: Transporting inferences from multiple randomized trials to a new target population. Epidemiology, 31, 334–344.
- Dahabreh, I. J., Robertson, S. E., Petito, L. C., Hernán, M. A., & Steingrimsson, J. A. (2019). Efficient and robust methods for causally interpretable meta‐analysis: Transporting inferences from multiple randomized trials to a target population. arXiv preprint arXiv:1908.09230.
- Dahabreh, I. J., Robertson, S. E., Steingrimsson, J. A., Stuart, E. A., & Hernán, M. A. (2020). Extending inferences from a randomized trial to a new target population. Statistics in Medicine, 39, 1999–2014.
- Dahabreh, I. J., Robertson, S. E., Tchetgen, E. J., Stuart, E. A., & Hernán, M. A. (2019). Generalizing causal inferences from individuals in randomized trials to all trial‐eligible individuals. Biometrics, 75, 685–694.
- Dahabreh, I. J., Robins, J. M., Haneuse, S. J., & Hernán, M. A. (2019). Generalizing causal inferences from randomized trials: Counterfactual and graphical identification. arXiv preprint arXiv:1906.10792.
- Dahabreh, I. J., Robins, J. M., Haneuse, S. J., Saeed, I., Robertson, S. E., Stuart, E. A., & Hernán, M. A. (2019). Sensitivity analysis using bias functions for studies extending inferences from a randomized trial to a target population. arXiv preprint arXiv:1905.10684.
- Davey Smith, G., & Ebrahim, S. (2003). ‘Mendelian randomization’: Can genetic epidemiology contribute to understanding environmental determinants of disease? International Journal of Epidemiology, 32, 1–22.
- Davey Smith, G., & Hemani, G. (2014). Mendelian randomization: Genetic anchors for causal inference in epidemiological studies. Human Molecular Genetics, 23, R89–R98.
- Deepak, P., & Jurek‐Loughrey, A. (2018). Linking and mining heterogeneous and multi‐view data. Springer.
- Degtiar, I., & Rose, S. (2021). A review of generalizability and transportability. arXiv preprint arXiv:2102.11904.
- DerSimonian, R. (2015). Meta‐analysis in clinical trials revisited. Contemporary Clinical Trials, 45, 139–145.
- DerSimonian, R., & Laird, N. (1986). Meta‐analysis in clinical trials. Controlled Clinical Trials, 7, 177–188.
- Dong, L., Yang, S., Wang, X., Zeng, D., & Cai, J. (2020). Integrative analysis of randomized clinical trials with real world evidence studies. arXiv preprint arXiv:2003.01242.
- D'Orazio, M. (2015). Integration and imputation of survey data in R: The StatMatch package. Romanian Statistical Review, 63, 57–68.
- D'Orazio, M., Di Zio, M., & Scanu, M. (2006). Statistical matching: Theory and practice. John Wiley & Sons.
- Eaton, D., & Murphy, K. (2007). Exact Bayesian structure learning from uncertain interventions. In Artificial intelligence and statistics (pp. 107–114). PMLR.
- Evans, K., Sun, B., Robins, J., & Tchetgen, E. J. T. (2018). Doubly robust regression analysis for data fusion. arXiv preprint arXiv:1808.07309.
- Fan, Y., Sherman, R., & Shum, M. (2014). Identifying treatment effects under data combination. Econometrica, 82, 811–822.
- Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64, 1183–1210.
- Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1, 515–534.
- Gelman, A., King, G., & Liu, C. (1998). Not asked and not answered: Multiple imputation for multiple surveys. Journal of the American Statistical Association, 93, 846–857.
- Graham, B. S., de Xavier Pinto, C. C., & Egel, D. (2016). Efficient estimation of data combination models by the method of auxiliary‐to‐study tilting (AST). Journal of Business & Economic Statistics, 34, 288–301.
- Greenland, S. (2003). Quantifying biases in causal models: Classical confounding vs collider‐stratification bias. Epidemiology, 14, 300–306.
- Greenland, S., & Robins, J. M. (1986). Identifiability, exchangeability, and epidemiological confounding. International Journal of Epidemiology, 15, 413–419.
- Gui, G. (2020). Combining observational and experimental data using first‐stage covariates. arXiv preprint arXiv:2010.05117.
- Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66, 315–331.
- Hansen, B. B. (2008). The prognostic analogue of the propensity score. Biometrika, 95, 481–488.
- Hardy, G., Littlewood, J., & Polya, G. (1952). Inequalities. Cambridge University Press.
- Hartman, E., Grieve, R., Ramsahai, R., & Sekhon, J. S. (2015). From SATE to PATT: Combining experimental with observational studies to estimate population treatment effects. Journal of the Royal Statistical Society: Series A (Statistics in Society), 178, 757–778.
- Hartwig, F. P., Davey Smith, G., & Bowden, J. (2017). Robust inference in summary data Mendelian randomization via the zero modal pleiotropy assumption. International Journal of Epidemiology, 46, 1985–1998.
- Hasegawa, T., Claggett, B., Tian, L., Solomon, S. D., Pfeffer, M. A., & Wei, L.‐J. (2017). The myth of making inferences for an overall treatment efficacy with data from multiple comparative studies via meta‐analysis. Statistics in Biosciences, 9, 284–297.
- Herzog, T. H., Scheuren, F., & Winkler, W. E. (2010). Record linkage. WIREs Computational Statistics, 2, 535–543.
- Hobbs, B. P., Carlin, B. P., Mandrekar, S. J., & Sargent, D. J. (2011). Hierarchical commensurate and power prior models for adaptive incorporation of historical information in clinical trials. Biometrics, 67, 1047–1056.
- Hünermund, P., & Bareinboim, E. (2019). Causal inference and data‐fusion in econometrics. arXiv preprint arXiv:1912.09104.
- Ibrahim, J. G., & Chen, M.‐H. (2000). Power prior distributions for regression models. Statistical Science, 15, 46–60.
- Inoue, A., & Solon, G. (2010). Two‐sample instrumental variables estimators. The Review of Economics and Statistics, 92, 557–561.
- Jackson, C. H., Best, N. G., & Richardson, S. (2009). Bayesian graphical models for regression on multiple data sets with different variables. Biostatistics, 10, 335–351.
- Kaizer, A. M., Koopmeiners, J. S., & Hobbs, B. P. (2018). Bayesian hierarchical modeling based on multisource exchangeability. Biostatistics, 19, 169–184.
- Kallus, N., Puli, A. M., & Shalit, U. (2018). Removing hidden confounding by experimental grounding. arXiv preprint arXiv:1810.11646.
- Klevmarken, A. (1982). Missing variables and two‐stage least‐squares estimation from more than one data set. Technical Report, IUI Working Paper.
- Komarova, T., Nekipelov, D., & Yakovlev, E. (2018). Identification, data combination, and the risk of disclosure. Quantitative Economics, 9, 395–440.
- Lawlor, D. A. (2016). Commentary: Two‐sample Mendelian randomization: Opportunities and challenges. International Journal of Epidemiology, 45, 908–915.
- Lee, S., Correa, J. D., & Bareinboim, E. (2020). General identifiability with arbitrary surrogate experiments. In Uncertainty in artificial intelligence (pp. 389–398). PMLR.
- Lesko, C. R., Buchanan, A. L., Westreich, D., Edwards, J. K., Hudgens, M. G., & Cole, S. R. (2017). Generalizing study results: A potential outcomes perspective. Epidemiology, 28, 553–561.
- Li, H., Miao, W., Cai, Z., Liu, X., Zhang, T., Xue, F., & Geng, Z. (2020). Causal data fusion methods using summary‐level statistics for a continuous outcome. Statistics in Medicine, 39, 1054–1067.
- Li, S. (2017). Mendelian randomization when many instruments are invalid: Hierarchical empirical Bayes estimation. arXiv preprint arXiv:1706.01389.
- Li, X., Miao, W., Lu, F., & Zhou, X.‐H. (2020). Improving efficiency of inference in clinical trials with external control data. arXiv preprint arXiv:2011.07234.
- Li, X., & Song, Y. (2020). Target population statistical inference with data integration across multiple sources — An approach to mitigate information shortage in rare disease clinical trials. Statistics in Biopharmaceutical Research, 12, 322–333.
- Lunceford, J. K., & Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine, 23, 2937–2960.
- Manski, C. F. (2000). Identification problems and decisions under ambiguity: Empirical analysis of treatment response and normative analysis of treatment choice. Journal of Econometrics, 95, 415–442.
- McCandless, L. C., Richardson, S., & Best, N. (2012). Adjustment for missing confounders using external validation data and propensity scores. Journal of the American Statistical Association, 107, 40–51.
- Miao, W., Li, W., Hu, W., Wang, R., & Geng, Z. (2022). Invited commentary: Estimation and bounds under data fusion. American Journal of Epidemiology, 191, 674–678.
- Mooij, J. M., Magliacane, S., & Claassen, T. (2020). Joint causal inference from multiple contexts. Journal of Machine Learning Research, 21, 1–108.
- Murray, J. S. , & Reiter, J. P. (2016). Multiple imputation of missing categorical and continuous values via Bayesian mixture models with local dependence. Journal of the American Statistical Association, 111, 1466–1479. [Google Scholar]
- Ogburn, E. L. , Rudolph, K. E. , Morello‐Frosch, R. , Khan, A. , & Casey, J. A. (2020). A warning about using predicted values from regression models for epidemiologic inquiry. American Journal of Epidemiology, 190(6), 1142–1147. [DOI] [PMC free article] [PubMed] [Google Scholar]
- O'Muircheartaigh, C. , & Hedges, L. V. (2014). Generalizing from unrepresentative experiments: A stratified propensity score approach. Journal of the Royal Statistical Society: Series C: Applied Statistics, 63, 195–210. [Google Scholar]
- Pacini, D. (2019). Two‐sample least squares projection. Econometric Reviews, 38, 95–123. [Google Scholar]
- Pearl, J. , & Bareinboim, E. (2014). External validity: From do‐calculus to transportability across populations. Statistical Science, 29, 579–595. [Google Scholar]
- Peters, J. , Bühlmann, P. , & Meinshausen, N. (2016). Causal inference by using invariant prediction: Identification and confidence intervals. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 78, 947–1012. [Google Scholar]
- Pierce, B. L. , & Burgess, S. (2013). Efficient design for Mendelian randomization studies: Subsample and 2‐sample instrumental variable estimators. American Journal of Epidemiology, 178, 1177–1184. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Radner, D. (1980). Report on exact and statistical matching techniques. US Depatment of Commerce, Office of Federal Statistical Policy and Standards. For sale by the Supt. of Docs., U.S. G.P.O., 1980. [Google Scholar]
- Rassen, J. A. , & Schneeweiss, S. (2012). Using high‐dimensional propensity scores to automate confounding control in a distributed medical product safety surveillance system. Pharmacoepidemiology and Drug Safety, 21, 41–49. [DOI] [PubMed] [Google Scholar]
- Rassen, J. A. , Solomon, D. H. , Curtis, J. R. , Herrinton, L. , & Schneeweiss, S. (2010). Privacy‐maintaining propensity score‐based pooling of multiple databases applied to a study of biologics. Medical Care, 48, S83–S39. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ridder, G. , & Moffitt, R. (2007). The econometrics of data combination. Handbook of Econometrics, 6, 5469–5547. [Google Scholar]
- Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained exposure period—Application to control of the healthy worker survivor effect. Mathematical Modelling, 7, 1393–1512. [Google Scholar]
- Robins, J. M. , Rotnitzky, A. , & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846–866. [Google Scholar]
- Rosenbaum, P. R. , & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55. [Google Scholar]
- Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688–701. [Google Scholar]
- Rubin, D. B. (1980). Randomization analysis of experimental data: The fisher randomization test comment. Journal of the American Statistical Association, 75, 591–593. [Google Scholar]
- Rudolph, K. E., & van der Laan, M. J. (2017). Robust estimation of encouragement‐design intervention effects transported across sites. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 79, 1509–1525.
- Sanderson, E., Spiller, W., & Bowden, J. (2021). Testing and correcting for weak and pleiotropic instruments in two‐sample multivariable Mendelian randomisation. Statistics in Medicine, 40, 5434–5452.
- Sayers, A., Ben‐Shlomo, Y., Blom, A. W., & Steele, F. (2016). Probabilistic record linkage. International Journal of Epidemiology, 45, 954–964.
- Schmidli, H., Gsteiger, S., Roychoudhury, S., O'Hagan, A., Spiegelhalter, D., & Neuenschwander, B. (2014). Robust meta‐analytic‐predictive priors in clinical trials with historical control information. Biometrics, 70, 1023–1032.
- Schmidli, H., Häring, D. A., Thomas, M., Cassidy, A., Weber, S., & Bretz, F. (2020). Beyond randomized clinical trials: Use of external controls. Clinical Pharmacology & Therapeutics, 107, 806–816.
- Shi, X., Wellman, R., Heagerty, P. J., Nelson, J. C., & Cook, A. J. (2019). Safety surveillance and the estimation of risk in select populations: Flexible methods to control for confounding while targeting marginal comparisons via standardization. Statistics in Medicine, 39, 369–386.
- Shu, D., Yoshida, K., Fireman, B. H., & Toh, S. (2020). Inverse probability weighted Cox model in multi‐site studies without sharing individual‐level data. Statistical Methods in Medical Research, 29, 1668–1681.
- Shu, H., & Tan, Z. (2020). Improved methods for moment restriction models with data combination and an application to two‐sample instrumental variable estimation. The Canadian Journal of Statistics, 48, 259–284.
- Signorovitch, J. E., Wu, E. Q., Yu, A. P., Gerrits, C. M., Kantor, E., Bao, Y., Gupta, S. R., & Mulani, P. M. (2010). Comparative effectiveness without head‐to‐head trials. PharmacoEconomics, 28, 935–945.
- Spiller, W., Davies, N. M., & Palmer, T. M. (2019). Software application profile: Mrrobust — A tool for performing two‐sample summary Mendelian randomization analyses. International Journal of Epidemiology, 48, 684–690.
- Steele, R. J., Schnitzer, M. E., & Shrier, I. (2020). Importance of homogeneous effect modification for causal interpretation of meta‐analyses. Epidemiology, 31, 353–355.
- Stuart, E. A., Cole, S. R., Bradshaw, C. P., & Leaf, P. J. (2011). The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society: Series A (Statistics in Society), 174, 369–386.
- Stürmer, T., Schneeweiss, S., Avorn, J., & Glynn, R. J. (2005). Adjusting effect estimates for unmeasured confounding with validation data using propensity score calibration. American Journal of Epidemiology, 162, 279–289.
- Sun, B., & Miao, W. (2018). On semiparametric instrumental variable estimation of average treatment effects through data fusion. arXiv preprint arXiv:1810.03353.
- Tian, J., & Pearl, J. (2001). Causal discovery from changes. In 17th Annual Conference on Uncertainty in Artificial Intelligence (UAI) (pp. 512–521).
- Tillman, R., & Spirtes, P. (2011). Learning equivalence classes of acyclic models with latent and selection variables from multiple datasets with overlapping variables. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (pp. 3–15). JMLR Workshop and Conference Proceedings.
- Tipton, E. (2013). Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts. Journal of Educational and Behavioral Statistics, 38, 239–266.
- Toh, S. (2020). Analytic and data sharing options in real‐world multidatabase studies of comparative effectiveness and safety of medical products. Clinical Pharmacology & Therapeutics, 107, 834–842.
- Toh, S., Gagne, J. J., Rassen, J. A., Fireman, B. H., Kulldorff, M., & Brown, J. S. (2013). Confounding adjustment in comparative effectiveness research conducted within distributed research networks. Medical Care, 51, S4–S10.
- Toh, S., Wellman, R., Coley, R. Y., Horgan, C., Sturtevant, J., Moyneur, E., Janning, C., Pardee, R., Coleman, K. J., Arterburn, D., McTigue, K., Anau, J., & Cook, A. J. (2018). Combining distributed regression and propensity scores: A doubly privacy‐protecting analytic method for multicenter research. Clinical Epidemiology, 10, 1773–1786.
- Triantafillou, S., & Tsamardinos, I. (2015). Constraint‐based causal discovery from multiple interventions over overlapping variable sets. Journal of Machine Learning Research, 16, 2147–2205.
- Tsiatis, A. (2007). Semiparametric theory and missing data. Springer Science & Business Media.
- Van der Vaart, A. W. (2000). Asymptotic statistics (Vol. 3). Cambridge University Press.
- Vansteelandt, S., & Keiding, N. (2011). Invited commentary: G‐computation—Lost in translation? American Journal of Epidemiology, 173, 739–742.
- Viele, K., Berry, S., Neuenschwander, B., Amzal, B., Chen, F., Enas, N., Hobbs, B., Ibrahim, J. G., Kinnersley, N., Lindborg, S., Micallef, S., Roychoudhury, S., & Thompson, L. (2014). Use of historical control data for assessing treatment effects in clinical trials. Pharmaceutical Statistics, 13, 41–54.
- Wang, S., & Kang, H. (2019). Weak‐instrument robust tests in two‐sample summary‐data Mendelian randomization. arXiv preprint arXiv:1909.06950.
- Weber, K., Hemmings, R., & Koch, A. (2018). How to use prior knowledge and still give new data a chance? Pharmaceutical Statistics, 17, 329–341.
- Westreich, D., Edwards, J. K., Lesko, C. R., Stuart, E., & Cole, S. R. (2017). Transportability of trial results using inverse odds of sampling weights. American Journal of Epidemiology, 186, 1010–1014.
- Winkler, W. E. (1999). The state of record linkage and current research problems. Statistical Research Division, US Census Bureau.
- Yang, S., & Ding, P. (2019). Combining multiple observational data sources to estimate causal effects. Journal of the American Statistical Association, 115, 1540–1554.
- Yang, S., & Kim, J. K. (2020). Statistical data integration in survey sampling: A review. Japanese Journal of Statistics and Data Science, 3, 625–650.
- Yang, S., Kim, J. K., & Song, R. (2020). Doubly robust inference when combining probability and non‐probability samples with high dimensional data. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 82, 445–465.
- Yang, S., Zeng, D., & Wang, X. (2020). Improved inference for heterogeneous treatment effects using real‐world data subject to hidden confounding. arXiv preprint arXiv:2007.12922.
- Yoshida, K., Gruber, S., Fireman, B. H., & Toh, S. (2018). Comparison of privacy‐protecting analytic and data‐sharing methods: A simulation study. Pharmacoepidemiology and Drug Safety, 27, 1034–1041.
- Zhang, J., Ko, C.‐W., Nie, L., Chen, Y., & Tiwari, R. (2019). Bayesian hierarchical methods for meta‐analysis combining randomized‐controlled and single‐arm studies. Statistical Methods in Medical Research, 28, 1293–1310.
- Zhang, K., Huang, B., Zhang, J., Glymour, C., & Schölkopf, B. (2017). Causal discovery from nonstationary/heterogeneous data: Skeleton estimation and orientation determination. In Proceedings of the Twenty‐Sixth International Joint Conference on Artificial Intelligence Main track (IJCAI) (Vol. 2017, pp. 1347–1353). NIH Public Access.
- Zhang, Y., Duchi, J., & Wainwright, M. (2013). Divide and conquer kernel ridge regression. In 26th Annual Conference on Learning Theory (pp. 592–617). PMLR.
- Zhao, Q., Wang, J., Hemani, G., Bowden, J., & Small, D. S. (2020). Statistical inference in two‐sample summary‐data Mendelian randomization using robust adjusted profile score. The Annals of Statistics, 48, 1742–1769.
- Zhao, Q., Wang, J., Spiller, W., Bowden, J., & Small, D. S. (2019). Two‐sample instrumental variable analyses using heterogeneous samples. Statistical Science, 34, 317–333.
- Zhu, Z., Zheng, Z., Zhang, F., Wu, Y., Trzaskowski, M., Maier, R., Robinson, M. R., McGrath, J. J., Visscher, P. M., Wray, N. R., & Yang, J. (2018). Causal associations between risk factors and common diseases inferred from GWAS summary data. Nature Communications, 9, 224.
Data Availability Statement
Data sharing is not applicable to this article as no new data were created or analyzed in this study.