Author manuscript; available in PMC: 2023 Dec 1.
Published in final edited form as: Biometrika. 2023 Feb 6;110(4):1041–1054. doi: 10.1093/biomet/asad007

Efficient Estimation under Data Fusion

Sijia Li 1, Alex Luedtke 2
PMCID: PMC10653189  NIHMSID: NIHMS1923270  PMID: 37982010

Summary

We aim to make inferences about a smooth, finite-dimensional parameter by fusing data from multiple sources together. Previous works have studied the estimation of a variety of parameters in similar data fusion settings, including in the estimation of the average treatment effect and average reward under a policy, with the majority of them merging one historical data source with covariates, actions, and rewards and one data source of the same covariates. In this work, we consider the general case where one or more data sources align with each part of the distribution of the target population, for example, the conditional distribution of the reward given actions and covariates. We describe potential gains in efficiency that can arise from fusing these data sources together in a single analysis, which we characterize by a reduction in the semiparametric efficiency bound. We also provide a general means to construct estimators that achieve these bounds. In numerical simulations, we illustrate marked improvements in efficiency from using our proposed estimators rather than their natural alternatives. Finally, we illustrate the magnitude of efficiency gains that can be realized in vaccine immunogenicity studies by fusing data from two HIV vaccine trials.

Some key words: Data fusion, Semiparametric theory, HIV

1. Introduction

The rapid expansion of available data has facilitated the use of data fusion, which allows researchers to combine information from many data sources, each collected on a potentially distinct population at a different time, in order to obtain valid summaries of a target population of interest. In practice, data fusion often renders more relevant information or is less expensive than traditional analyses that only leverage a single data source. For example, technology companies integrate numerous unlabeled data with a small amount of labeled data to make accurate predictions, in a process known as semi-supervised learning (Chakrabortty, 2016). In education, policy-makers leverage multiple datasets generated by different current policies to evaluate a new rule of interest in a fast, inexpensive, and effective way (Kallus et al., 2020). In genomics, integrating expression data, gene sequencing data, and network data gives a heterogeneous description of the gene and a distinct view of the underlying machinery of the cell (Lanckriet et al., 2004). In clinical trials, experimental data can be fused with observational data to evaluate a treatment regime on a different target population than the study population (e.g., Wedam et al., 2020).

There are many recent works introducing statistical methods for particular data fusion problems. Many of them focus on bridging causal conclusions via data fusion as illustrated in the aforementioned clinical trial example. This is true, for example, in works on transportability (Pearl & Bareinboim, 2011; Hernán & VanderWeele, 2011; Bareinboim & Pearl, 2014; Stuart et al., 2015; Rudolph & van der Laan, 2017; Dahabreh & Hernán, 2019; Dahabreh et al., 2019; Dong et al., 2020), re-targeting under covariate shifts (Kallus et al., 2020), using surrogate index to infer long-term outcomes (Athey et al., 2019), and correcting external validity bias (Stuart et al., 2011; Mo et al., 2020), in that all these research areas focus on bridging causal effects from a source population to a different target population. While these works considered merging two datasets only, Dahabreh et al. (2019) and Lu et al. (2021) considered bridging data from multiple trials to a target population and others have studied combining experimental data with multiple observational data sources in the presence of unmeasured confounding (Evans et al., 2018; Sun & Miao, 2018). Moreover, Bareinboim & Pearl (2016) studied the identifiability results for a general causal parameter when multiple heterogeneous data sources are available. Data fusion is also used in non-causal problems. For example, semi-supervised learning (Chapelle et al., 2009; Chakrabortty, 2016) represents another important application of data fusion.

Due to the considerable number of open problems in this area, it is of interest to describe a general framework and approach that allows researchers to tackle data fusion problems in generality without limiting themselves to specific parameters, numbers of datasets, or data structures. In this paper, we will consider a general case where different data sources align with different parts of the distribution of the target population and derive efficient estimators based on all available data. We derive a key object needed to both quantify the best achievable level of statistical efficiency when data from multiple sources are fused together and to construct estimators that achieve these gains. Our results generalize previous works that study estimation of specific parameters in data fusion problems under nonparametric models by allowing for both consideration of general parameters and arbitrary semiparametric or nonparametric models. In addition to an example provided in Section 3, in Supplementary Appendix D we illustrate the wide applicability of the proposed method by presenting five other examples. There we also provide a discussion of implementation, possible extensions, connections to missing data problems, and proofs.

2. Notations and Problem Setup

We begin by defining some notation. For a natural number $m$, we write $[m]$ to denote $\{1,\ldots,m\}$. For a distribution $\nu$, we let $E_\nu$ denote the expectation operator under $\nu$. Throughout we use $Z=(Z_1,\ldots,Z_d)$ to denote a random variable and, for $j\in[d]$, we let $\bar{Z}_j=(Z_1,\ldots,Z_j)$, where we use the convention that $\bar{Z}_0=\emptyset$. We use capital letters, such as $\bar{Z}_j$ and $S$, to denote random variables and the corresponding lowercase letters, such as $\bar{z}_j$ and $s$, to denote their realizations. In an abuse of notation, we condition on a lowercase letter in expectations to indicate conditioning on the corresponding random variable taking a specific value: for example, $E_\nu(Z_2\mid z_1)=E_\nu(Z_2\mid Z_1=z_1)$. For any distribution $Q$ of $Z$ and $j\in[d]$, we will let $Q_j(\cdot\mid\bar{z}_{j-1})$ denote the conditional distribution of $Z_j\mid\bar{Z}_{j-1}=\bar{z}_{j-1}$. Similarly, for any distribution $P$ of $(Z,S)$, we will let $P_j(\cdot\mid\bar{z}_{j-1},s)$ denote the conditional distribution of $Z_j\mid\bar{Z}_{j-1}=\bar{z}_{j-1},S=s$. Here and throughout we suppose sufficient regularity conditions so that all such conditional distributions are well defined and that all discussed distributions of $Z_j\mid\bar{Z}_{j-1}=\bar{z}_{j-1}$ and $Z_j\mid\bar{Z}_{j-1}=\bar{z}_{j-1},S=s$ are defined on some common measurable space. We use $\to$ to specify the domain and codomain of an arbitrary function and use $\mapsto$ to denote the input and output of a function. For example, if $f$ is the standard normal density function, then it would be accurate to write both $f:\mathbb{R}\to\mathbb{R}$ and $f:x\mapsto(2\pi)^{-1/2}e^{-x^2/2}$.

Suppose we have a collection of $k$ data sources and want to estimate an $\mathbb{R}^b$-valued summary $\psi(Q^0)$ of a target distribution $Q^0$ that is known to belong to a collection $\mathcal{Q}$ of distributions of a random variable $Z=(Z_1,\ldots,Z_d)$, where $Z$ takes values in $\mathcal{Z}=\prod_{j=1}^d\mathcal{Z}_j$. The summary $\psi$ may only depend on a subset of the conditional distributions of $Z_j\mid\bar{Z}_{j-1}$. To handle such cases, we let $\mathcal{I}\subset[d]$ denote a set of irrelevant indices $j$ such that $\psi$ is not a function of the distribution of $Z_j\mid\bar{Z}_{j-1}$. More concretely, $\psi(Q)=\psi(Q')$ for all $Q,Q'\in\mathcal{Q}$ such that $Q_j=Q'_j$ for all $j\in[d]\setminus\mathcal{I}$. We do not require that $\mathcal{I}$ be the largest possible set of irrelevant indices. This means that, for any parameter $\psi$, we can take $\mathcal{I}=\emptyset$, while, for certain parameters $\psi$, it will be possible to take $\mathcal{I}$ to be a nonempty set. To ensure that it makes sense to compare the distributions of $Z_j\mid\bar{Z}_{j-1}$ under different distributions $Q$ and $Q'$ in $\mathcal{Q}$, we assume here and throughout that all pairs of distributions in $\mathcal{Q}$ are mutually absolutely continuous. We let $\mathcal{J}=[d]\setminus\mathcal{I}$ denote the set of indices that may be relevant to the evaluation of $\psi$, termed the set of relevant indices.

Rather than observe draws directly from $Q^0$, we see $n$ independent copies of $X=(Z,S)$ drawn from some common distribution $P^0$, where $Z$ takes values in $\mathcal{Z}$ and $S$ is a categorical random variable with support $[k]$ that denotes the data source. The distribution $P^0$ is known to align with $Q^0$ in the sense described below, which makes it possible to relate the conditional distributions $P_j^0(\cdot\mid\bar{z}_{j-1},s)$ and $Q_j^0(\cdot\mid\bar{z}_{j-1})$.

Condition 1. (Sufficient alignment) For each relevant index $j\in\mathcal{J}$, there exists a known set $\mathcal{S}_j\subseteq[k]$ such that, for all $s\in\mathcal{S}_j$, both of the following hold:

  1. (Sufficient overlap) the marginal distribution of $\bar{Z}_{j-1}$ under sampling from $Q^0$ is absolutely continuous with respect to the conditional distribution of $\bar{Z}_{j-1}\mid S=s$ under sampling from $P^0$; and

  2. (Common conditional distributions) $P_j^0(\cdot\mid\bar{z}_{j-1},s)=Q_j^0(\cdot\mid\bar{z}_{j-1})$ $Q^0$-almost everywhere.

We will provide an example in Section 3 and another five examples in Supplementary Appendix D where the above condition is plausible. Three of those examples represent generalizations of existing results, and, in all of those cases, a version of the above alignment condition was previously assumed. A detailed comparison between Condition 1 and identification conditions given in existing works is provided in Supplementary Appendix D. We refer to $\mathcal{S}_j$, $j\in\mathcal{J}$, as fusion sets and suppose they are known and prespecified. As $Q^0$ is unknown beyond its membership to $\mathcal{Q}$, the above implies that $P^0$ is known to belong to the collection $\mathcal{P}$ of distributions $P$ with support on $\mathcal{Z}\times[k]$ for which there exists a $Q\in\mathcal{Q}$ such that, for all $j\in\mathcal{J}$ and $s\in\mathcal{S}_j$, the following analogues of Condition 1 hold: (a) the marginal distribution of $\bar{Z}_{j-1}$ under sampling from $Q$ is absolutely continuous with respect to the conditional distribution of $\bar{Z}_{j-1}\mid S=s$ under sampling from $P$, and (b) $P_j(\cdot\mid\bar{z}_{j-1},s)=Q_j(\cdot\mid\bar{z}_{j-1})$ $Q$-almost everywhere. Hereafter we refer to $\mathcal{P}$ and $\mathcal{Q}$ as models.

Condition 1a ensures that, for $s\in\mathcal{S}_j$, null sets under the distribution of $\bar{Z}_{j-1}\mid S=s$ implied by $P^0$ are also null sets under the marginal distribution of $\bar{Z}_{j-1}$ implied by $Q^0$, which ensures that the conditional distribution $P_j^0(\cdot\mid\bar{z}_{j-1},s)$ appearing in Condition 1b is uniquely defined up to $Q^0$-null sets. It is worth noting that we have not assumed that the conditional distribution of $\bar{Z}_{j-1}\mid S=s$ under sampling from $P^0$ is absolutely continuous with respect to the marginal distribution of $\bar{Z}_{j-1}$ under sampling from $Q^0$, which allows $\bar{Z}_{j-1}$ to take values not seen in the target distribution when sampled from aligning data sources under $P^0$. Condition 1b implies exchangeability over data sources, namely that $Z_j\perp\!\!\!\perp S\mid\bar{Z}_{j-1},S\in\mathcal{S}_j$ for $j\in\mathcal{J}$, which imposes a nontrivial conditional independence condition on the data-generating distribution when $\mathcal{S}_j$ is not a singleton for at least one $j$. This exchangeability condition is testable (Luedtke et al., 2019; Westling, 2021). Nevertheless, we advocate choosing the fusion sets based on outside knowledge rather than via hypothesis testing to avoid challenges associated with post-selection inference.

Previous data fusion works have shown that variants of Condition 1 enable the identification of $\psi(Q^0)$ as a functional $\phi$ of the observed data distribution $P^0$ in particular problems (e.g., Rudolph & van der Laan, 2017; Dahabreh et al., 2019) and in general causal inference problems (e.g., Pearl & Bareinboim, 2011; Bareinboim & Pearl, 2016). We will heavily rely on such an identifiability result to construct estimators of $\psi(Q^0)$, and so we present it explicitly here. To this end, we define a mapping $\theta:\mathcal{P}\to\mathcal{Q}$. In particular, for any $P\in\mathcal{P}$, we let $\theta(P)$ denote an arbitrarily selected distribution from the set $\mathcal{Q}(P)$ of distributions $Q\in\mathcal{Q}$ that are such that, for each $j\in\mathcal{J}$, $Z_j\mid\bar{Z}_{j-1}$ under sampling from $Q$ has the same distribution as $Z_j\mid\bar{Z}_{j-1},S\in\mathcal{S}_j$ under sampling from $P$. Because $P\in\mathcal{P}$, there must be at least one distribution in $\mathcal{Q}(P)$. Moreover, the value in $\mathcal{Q}(P)$ selected when defining $\theta(P)$ is irrelevant for our purposes since, as is evident below, our identifiability result only concerns the value of $\psi\circ\theta(P^0)$, and this value does not depend on the conditional distributions of $Z_j\mid\bar{Z}_{j-1}$ under $\theta(P^0)$ for irrelevant indices $j$.

Theorem 1. Let $\phi=\psi\circ\theta$. Under Condition 1, $\psi(Q^0)=\phi(P^0)$.

Importantly, $\theta(P^0)$ can be evaluated without knowing the value of the true target distribution $Q^0$. Consequently, the above result shows that it is possible to learn the summary $\psi(Q^0)$ of the target distribution based only on the observed data distribution $P^0$. This motivates estimating $\phi(P^0)$, and therefore $\psi(Q^0)$, based on a random sample drawn from $P^0$. Before presenting such estimation strategies, we will exhibit an example that fits within this data fusion framework.
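As a purely illustrative check of the identification result, the following sketch evaluates $\phi(P^0)$ exactly in a toy problem that is not taken from the paper. It assumes $d=2$, a three-level $Z_1$, a binary $Z_2$, the summary $\psi(Q)=E_Q(Z_2)$, and two hypothetical data sources: the first shares the target marginal of $Z_1$ (so $\mathcal{S}_1=\{1\}$, coded as index 0) and the second shares the target conditional distribution of $Z_2\mid Z_1$ (so $\mathcal{S}_2=\{2\}$, coded as index 1). All numerical values are made up.

```python
import numpy as np

# Hypothetical target distribution Q0: marginal of Z1 and P(Z2 = 1 | Z1).
q1 = np.array([0.2, 0.5, 0.3])
q2 = np.array([0.1, 0.4, 0.8])
psi_target = float(q1 @ q2)  # psi(Q0) = E_{Q0}(Z2)

# Hypothetical observed-data distribution P0 over (S, Z1, Z2).
p_s = np.array([0.3, 0.7])                          # P(S = s)
p_z1_given_s = np.array([q1, [0.4, 0.3, 0.3]])      # source 0 matches Q0's Z1 marginal
p_z2_given_z1_s = np.array([[0.5, 0.5, 0.5], q2])   # source 1 matches Q0's Z2 | Z1

# joint[s, z1, z2] = P0(S = s, Z1 = z1, Z2 = z2)
joint = np.empty((2, 3, 2))
joint[:, :, 1] = p_s[:, None] * p_z1_given_s * p_z2_given_z1_s
joint[:, :, 0] = p_s[:, None] * p_z1_given_s * (1 - p_z2_given_z1_s)


def phi(joint, fusion_1, fusion_2):
    """Evaluate phi(P) = psi(theta(P)) for psi(Q) = E_Q(Z2)."""
    # Marginal of Z1 given S in fusion_1.
    z1_marg = joint[fusion_1].sum(axis=(0, 2))
    z1_marg = z1_marg / z1_marg.sum()
    # P(Z2 = 1 | Z1, S in fusion_2).
    cells = joint[fusion_2].sum(axis=0)
    m = cells[:, 1] / cells.sum(axis=1)
    return float(z1_marg @ m)


phi_fused = phi(joint, fusion_1=[0], fusion_2=[1])
pooled_mean = float(joint[:, :, 1].sum())  # naive E_{P0}(Z2), ignoring the data source

print(psi_target, phi_fused, pooled_mean)
```

In this toy setting the fused functional recovers the target summary, whereas the naive pooled mean of $Z_2$ does not, because neither source alone has both the right covariate marginal and the right conditional distribution.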

3. Example: longitudinal treatment effect

While some clinical trials focus on evaluating a fixed treatment at a single time point, others involve longitudinal treatments that can vary over time. Let $X=(U_1,A_1,\ldots,U_{T-1},A_{T-1},Y)$, where indices denote time, $A_t$ denotes the binary treatment at time $t$, and $U_t$ denotes the time-varying variable of interest at time $t$. Under this setup, we have $Z_1=U_1$, $Z_2=A_1$, $Z_3=U_2$, $\ldots$, $Z_{2T-1}=Y$. We suppose that the final outcome of interest, $Y$, is real-valued. For ease of notation, we let $H_t=(U_1,A_1,\ldots,U_t)$ for each $t\in[T-1]$ denote the history up to time $t$. We consider three models $\mathcal{Q}$ for the unknown target distribution. The first is nonparametric, and consists of all distributions with some common support where treatment assignment satisfies the strong positivity condition that, conditionally on the past, each treatment is assigned with probability bounded away from zero. The second is semiparametric, and supposes that there is some unknown function $g:\prod_{j=1}^{2T-2}\mathcal{Z}_j\to\mathbb{R}$ such that the conditional distribution of $Y\mid H_{T-1}=h_{T-1},A_{T-1}=a_{T-1}$ is symmetric about $g(h_{T-1},a_{T-1})$. When considering this semiparametric model, we suppose that, for each $Q\in\mathcal{Q}$, the conditional distribution $Q_{2T-1}$ has a corresponding conditional density $q_{2T-1}$ and that $q_{2T-1}(\cdot\mid H_{T-1},A_{T-1})$ is almost surely differentiable. The third model we consider is also semiparametric and imposes that, under sampling from each $Q\in\mathcal{Q}$, $(H_{T-1},A_{T-1})$ has support in $\mathbb{R}^p$ and there exists some vector of coefficients $\beta\in\mathbb{R}^p$ and error distribution $\tau_\alpha$ belonging to a regular parametric family $\{\tau_{\tilde\alpha}:\tilde\alpha\in\mathbb{R}^c\}$ of conditional distributions of a real-valued error $\epsilon$ given $(H_{T-1},A_{T-1})$ such that $Y=\beta^\top\kappa(H_{T-1},A_{T-1})+\epsilon$, where $\kappa:\mathbb{R}^p\to\mathbb{R}^c$ is a known transformation of the history and treatment through time $T-1$ and $E_{\tau_\alpha}(\epsilon\mid H_{T-1},A_{T-1})=0$ almost surely.

Hernán & Robins (2020) discuss a range of causal parameters under such a longitudinal setting, including the mean outcome under a dynamic treatment regime, the parameters indexing a marginal structural mean model, and the average treatment effect of always being on treatment versus never being on treatment. For simplicity, here we focus on the last of these parameters. Under causal assumptions (Chapter 19.4 of Hernán & Robins, 2020), this causal effect is identified with $\psi(Q)\equiv E_{Q_1}\{L_1^1(H_1)\}-E_{Q_1}\{L_1^0(H_1)\}$, where, for $a\in\{0,1\}$, we define $L_T^a(h_T)=y$ and, recursively from $t=T-1,\ldots,1$, define $L_t^a(h_t)=E_{Q_{2t+1}}\{L_{t+1}^a(H_{t+1})\mid h_t,A_t=a\}$. Because $\psi(Q)$ can be written as a function of $Q_1,Q_3,\ldots,Q_{2T-1}$, we see that we can take $\mathcal{I}=\{2,4,\ldots,2T-2\}$ in this example. This is consistent with the well known fact that the conditional average treatment effect does not depend on the treatment assignment probabilities, namely the distribution $Q_{2t}$ of $A_t\mid H_t$ for $t\in[T-1]$.
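The recursion defining $L_t^a$ can be traced by hand in a small discrete example. The sketch below is an illustration only: it assumes $T=3$, so that $X=(U_1,A_1,U_2,A_2,Y)$, binary $U_1$ and $U_2$, and a hypothetical target distribution whose conditional distributions are made up for this purpose. It computes $\psi(Q)$ exactly by iterating the conditional expectations backwards from $t=T-1$ to $t=1$.

```python
# Hypothetical target distribution Q (all numbers are illustrative assumptions).
def p_u1(u1):
    """Q_1: marginal of U1, Bernoulli(0.5)."""
    return 0.5


def p_u2(u2, u1, a1):
    """Q_3: conditional of U2 given (U1, A1); Bernoulli(0.3 + 0.4 * a1)."""
    p = 0.3 + 0.4 * a1
    return p if u2 == 1 else 1.0 - p


def mean_y(u1, a1, u2, a2):
    """Q_5 enters psi only through E(Y | U1, A1, U2, A2)."""
    return u1 + 2 * u2 + a2


def psi_arm(a):
    """Mean outcome when treatment a is assigned at every time point."""
    # Since L_3^a(h_3) = y, L_2^a(h_2) = E(Y | h_2, A_2 = a).
    def L2(u1, a1, u2):
        return mean_y(u1, a1, u2, a)

    # L_1^a(h_1) = E{L_2^a(H_2) | h_1, A_1 = a}.
    def L1(u1):
        return sum(p_u2(u2, u1, a) * L2(u1, a, u2) for u2 in (0, 1))

    # Final step: average L_1^a over the baseline distribution Q_1.
    return sum(p_u1(u1) * L1(u1) for u1 in (0, 1))


psi = psi_arm(1) - psi_arm(0)
print(psi_arm(1), psi_arm(0), psi)
```

Note that the treatment assignment probabilities never appear in the computation, in line with the choice $\mathcal{I}=\{2,4,\ldots,2T-2\}$ of irrelevant indices.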

We consider the scenario where we obtain data from $k$ sources. Some data sources may contain only information about the target baseline covariate distribution. Others may contain observations from all $T$ time points, but possibly with a baseline covariate distribution that differs from the target distribution, for example, measurements of monthly CD4 count in HIV treatment trials or observational settings. The remaining sources may only contain such measurements up to a time point $t<T$ such that $U_1,A_1,\ldots,U_t,A_t$ is observed and $U_s$ and $A_s$ are missing for all $s>t$. We indicate missingness by setting $U_s$ and $A_s$ equal to a designated missingness symbol in such cases. Such partial observations may still have valuable information, for example, about how longitudinal CD4 count responds to treatment shortly after the initiation of antiretroviral therapy. Under Condition 1, $\psi(Q^0)$ can be identified as $\phi(P^0)=E_{P^0}\{\tilde{L}_1^1(H_1)\mid S\in\mathcal{S}_1\}-E_{P^0}\{\tilde{L}_1^0(H_1)\mid S\in\mathcal{S}_1\}$, where, for $a\in\{0,1\}$, we define $\tilde{L}_T^a(h_T)=y$ and, recursively from $t=T-1,\ldots,1$, define $\tilde{L}_t^a(h_t)=E_{P^0}\{\tilde{L}_{t+1}^a(H_{t+1})\mid h_t,A_t=a,S\in\mathcal{S}_{2t+1}\}$. When $T=1$, this estimand simplifies to the well-studied average treatment effect. Athey et al. (2019) studied a special case of data fusion for inferring the average treatment effect with two post-baseline time points, where a group of surrogate markers are measured post-treatment at $T=2$ and the primary outcome is measured at $T=3$. Assumptions 3 and 4 in Athey et al. (2019) match our Conditions 1b and 1a. In contrast to their setup, ours allows for more time points ($T>3$) and data sources ($k>2$), and also for arbitrary specifications of the relevant data fusion sets $\mathcal{S}_1,\mathcal{S}_3,\ldots,\mathcal{S}_{2T-1}$, provided they are nonempty.

4. Methods

4.1. Review of semiparametric theory

We review some important aspects of semiparametric theory. A more comprehensive review is given in Supplementary Appendix E. An estimator $\hat\phi$ of $\phi(P)$ is called asymptotically linear with influence function $D_P$ if it can be written as $\hat\phi-\phi(P)=n^{-1}\sum_{i=1}^nD_P(X_i)+o_p(n^{-1/2})$, where $E_P\{D_P(X_i)\}=0$ and $\sigma_P^2\equiv E_P\{D_P(X_i)^2\}<\infty$. One reason such estimators are attractive is that they are consistent and asymptotically normal, in the sense that $n^{1/2}\{\hat\phi-\phi(P)\}\to N(0,\sigma_P^2)$ in distribution under sampling $n$ independent draws from $P$. This facilitates the construction of confidence intervals and hypothesis tests. If $\hat\phi$ is regular and asymptotically linear at $P$ (Bickel et al., 1993), then $\phi$ is pathwise differentiable and the influence function $D_P$ is a gradient of $\phi$, in the sense that, for all submodels $\{P^{(\epsilon)}:\epsilon\in[0,\delta)\}\in\mathscr{P}(P,\mathcal{P})$, $\frac{\partial}{\partial\epsilon}\phi(P^{(\epsilon)})\big|_{\epsilon=0}=E_P\{D_P(X)h(X)\}$, where $h$ denotes the score of the submodel at $\epsilon=0$. The tangent set $\mathcal{T}(P,\mathcal{P})$ of $\mathcal{P}$ at $P$ is defined as the set of all scores of submodels in $\mathscr{P}(P,\mathcal{P})$. The canonical gradient $D_P^*$ corresponds to the $L_0^2(P)$-projection of any gradient $D_P$ onto the closure of the linear span of scores in $\mathcal{T}(P,\mathcal{P})$, where $L_0^2(P)$ is the Hilbert space of $P$-mean-zero, finite-variance functions.

One way to construct a regular asymptotically linear estimator with influence function $D_{P^0}$ is through one-step estimation (Bickel, 1982). Given an estimate $\hat{P}$ of $P^0$, the one-step estimator is given by $\hat\phi\equiv\phi(\hat{P})+\sum_{i=1}^nD_{\hat{P}}(X_i)/n$. This estimator will be asymptotically linear with influence function $D_{P^0}$ if the remainder term $R(\hat{P},P^0)\equiv\phi(\hat{P})-\phi(P^0)+E_{P^0}\{D_{\hat{P}}(X)\}$ is $o_p(n^{-1/2})$ and the empirical mean of $D_{\hat{P}}(X)-D_{P^0}(X)$ is within $o_p(n^{-1/2})$ of the mean of this term when $X\sim P^0$. The latter of these requirements will hold under an appropriate empirical process condition (Van Der Vaart et al., 1996). Alternative approaches for constructing asymptotically linear estimators can be found in Van Der Laan & Rubin (2006), Van der Laan & Robins (2003), and Tsiatis (2006).

All of the results in this section extend naturally to the case where $\phi$ is $\mathbb{R}^b$-valued. In such cases, gradients (respectively, the canonical gradient) of $\phi$ are $\mathbb{R}^b$-valued functions whose entries correspond to gradients (respectively, the canonical gradients) of the corresponding coordinate projections of $\phi$. Estimators can similarly be constructed coordinatewise. Due to this straightforward extension from univariate to $b$-variate settings, the theoretical results in the next subsection focus on the special case where $b=1$.

4.2. Derivation of canonical gradient of a general target parameter

In this section, we provide approaches for obtaining the canonical gradient of $\phi$ in the model $\mathcal{P}$ implied by $\mathcal{Q}$ and the data fusion conditions. We will focus on settings where distributions in $\mathcal{Q}$ can be separately defined via their conditional distributions, so that it is possible to modify a conditional distribution $Q_j$ under a distribution $Q\in\mathcal{Q}$, not modify any of the other conditional distributions $Q_{j'}$, $j'\neq j$, and still have it be the case that the resulting distribution belongs to $\mathcal{Q}$. This condition is formalized in the following.

Condition 2. (Variation independence) There exist sets $\mathcal{Q}_j$ of conditional distributions of $Z_j\mid\bar{Z}_{j-1}$, $j\in[d]$, such that $\mathcal{Q}$ is equal to the set of all distributions $Q$ such that, for all $j\in[d]$, the conditional distribution $Q_j$ belongs to $\mathcal{Q}_j$.

The above condition is satisfied by the model $\mathcal{Q}$ described in each of our examples. It is also satisfied in many other interesting semiparametric examples, such as those where $E_{Q^0}(Z_1)$ or $E_{Q^0}(Z_2\mid Z_1)$ is known, but is not satisfied in some others, such as in cases where $E_{Q^0}(Z_2)$ is known. In this last case, knowing the marginal distribution $Q_1$ restricts the values that the conditional distribution $Q_2$ can take.

The upcoming results will provide forms of gradients of $\phi$ at $P^0$ in terms of gradients of $\psi$ at a generic distribution $\underline{Q}^0\in\mathcal{Q}(P^0)$, where we recall that $\mathcal{Q}(P^0)$ is the set of distributions in $\mathcal{Q}$ whose relevant conditional distributions align with $P^0$. Since $Q^0\in\mathcal{Q}(P^0)$, all of these results are valid when $\underline{Q}^0=Q^0$. However, since the distribution $Q^0$ is not generally identifiable from $P^0$ (indeed, there may be no alignment for conditional distributions irrelevant to $\psi$), the particular value of $Q^0$ may be unknowable even given infinite data. In contrast, the set $\mathcal{Q}(P^0)$ would be knowable in such a setting. We, therefore, allow for the specification of an arbitrary distribution from the identifiable set $\mathcal{Q}(P^0)$. For example, it is always possible to take $\underline{Q}^0=\theta(P^0)$, whose value depends only on $P^0$.

We require a strong overlap condition on the chosen $\underline{Q}^0\in\mathcal{Q}(P^0)$ and the relevant conditional distributions of $P^0$. If $\underline{Q}^0=Q^0$, then the upcoming condition strengthens the overlap condition (Condition 1a) that was used to establish identifiability. To state this condition, for each $j\in\mathcal{J}$, we let $\lambda_{j-1}$ denote the Radon-Nikodym derivative of the marginal distribution of $\bar{Z}_{j-1}$ under sampling from $\underline{Q}^0$ relative to the conditional distribution of $\bar{Z}_{j-1}\mid S\in\mathcal{S}_j$ under sampling from $P^0$.

Condition 3. (Strong overlap) For each $j\in\mathcal{J}$, there exists $c_{j-1}>0$ such that $\underline{Q}^0\{c_{j-1}^{-1}\le\lambda_{j-1}(\bar{Z}_{j-1})\le c_{j-1}\}=1$.

Since $\bar{Z}_0=\emptyset$ almost surely under both of these distributions, $\lambda_{j-1}$ is the constant function that returns 1 when $j=1$. Hence, the above condition only imposes a nontrivial requirement on $j\in\mathcal{J}\setminus\{1\}$.

In the upcoming results, we suppose that the tangent set $\mathcal{T}(\underline{Q}^0,\mathcal{Q})$ of $\mathcal{Q}$ at $\underline{Q}^0$ is a closed linear subspace of $L_0^2(\underline{Q}^0)$. We can therefore refer to $\mathcal{T}(\underline{Q}^0,\mathcal{Q})$ as the tangent space without causing any confusion. The following lemma shows that, under the above conditions and the earlier stated data fusion condition, the pathwise differentiability of $\psi$ is equivalent to the pathwise differentiability of $\phi$.

Lemma 1. Suppose that Conditions 1, 2, and 3 hold. Under these conditions, $\psi$ is pathwise differentiable at $Q^0$ relative to $\mathcal{Q}$ if and only if $\phi$ is pathwise differentiable at $P^0$ relative to $\mathcal{P}$.

The next result provides a means to derive gradients of ϕ.

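Theorem 2. Suppose that Conditions 1, 2, and 3 hold and that $\psi$ is pathwise differentiable at $\underline{Q}^0$ relative to $\mathcal{Q}$ with gradient $D_{\underline{Q}^0}$. Under these conditions, the following function is a gradient of $\phi$ at $P^0$ relative to $\mathcal{P}$:

$$D_{P^0}(z,s)\equiv\sum_{j\in\mathcal{J}}\frac{1(s\in\mathcal{S}_j)}{P^0(S\in\mathcal{S}_j)}\,\lambda_{j-1}(\bar{z}_{j-1})\,D_{\underline{Q}^0,j}(\bar{z}_j),\tag{1}$$

where $D_{\underline{Q}^0,j}(\bar{z}_j)\equiv E_{\underline{Q}^0}\{D_{\underline{Q}^0}(Z)\mid\bar{Z}_j=\bar{z}_j\}-E_{\underline{Q}^0}\{D_{\underline{Q}^0}(Z)\mid\bar{Z}_{j-1}=\bar{z}_{j-1}\}$.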
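To make the gradient in (1) concrete, the sketch below simulates a toy data fusion problem and uses the gradient to form a one-step estimator with a Wald-type confidence interval. It is an illustration under stated assumptions rather than the implementation used in the paper: $d=2$, $Z_1$ has three levels, $Z_2$ is binary, $\psi(Q)=E_Q(Z_2)$ in a nonparametric model (so the initial gradient is $D_Q(z)=z_2-\psi(Q)$), source 0 is a sample from the target, source 1 shares only the conditional distribution of $Z_2\mid Z_1$, and the fusion sets are $\mathcal{S}_1=\{0\}$ and $\mathcal{S}_2=\{0,1\}$. In this case (1) reduces to $\frac{1(s\in\mathcal{S}_1)}{P(S\in\mathcal{S}_1)}\{m(z_1)-\psi\}+\frac{1(s\in\mathcal{S}_2)}{P(S\in\mathcal{S}_2)}\lambda_1(z_1)\{z_2-m(z_1)\}$ with $m(z_1)=E(Z_2\mid z_1)$. All variable names and numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Hypothetical target distribution: marginal of Z1 and P(Z2 = 1 | Z1).
q1 = np.array([0.2, 0.5, 0.3])
q2 = np.array([0.1, 0.4, 0.8])
psi_true = float(q1 @ q2)

# Source 0 is drawn from the target; source 1 has a shifted Z1 marginal.
s = rng.choice(2, size=n, p=[0.3, 0.7])
z1 = np.where(s == 0,
              rng.choice(3, size=n, p=q1),
              rng.choice(3, size=n, p=[0.4, 0.3, 0.3]))
z2 = rng.binomial(1, q2[z1])  # common conditional distribution of Z2 | Z1

in_s1 = (s == 0)              # fusion set for j = 1
in_s2 = np.isin(s, [0, 1])    # fusion set for j = 2

# Nuisance estimates based on empirical cell frequencies.
q1_hat = np.bincount(z1[in_s1], minlength=3) / in_s1.sum()
p1_hat = np.bincount(z1[in_s2], minlength=3) / in_s2.sum()
m_hat = (np.bincount(z1[in_s2], weights=z2[in_s2], minlength=3)
         / np.bincount(z1[in_s2], minlength=3))
lam_hat = q1_hat / p1_hat     # estimate of the density ratio lambda_1

psi_plugin = float(q1_hat @ m_hat)

# Estimated gradient (1), evaluated at each observation.
grad = (in_s1 / in_s1.mean() * (m_hat[z1] - psi_plugin)
        + in_s2 / in_s2.mean() * lam_hat[z1] * (z2 - m_hat[z1]))

psi_onestep = psi_plugin + grad.mean()
se_fused = grad.std(ddof=1) / np.sqrt(n)
ci = (psi_onestep - 1.96 * se_fused, psi_onestep + 1.96 * se_fused)

# Comparison: sample mean of Z2 using the target sample only.
se_target_only = z2[in_s1].std(ddof=1) / np.sqrt(in_s1.sum())

print(psi_true, psi_onestep, ci, se_fused, se_target_only)
```

Because the nuisance estimates here are saturated cell frequencies, the one-step correction `grad.mean()` is numerically zero and the one-step estimator coincides with the plug-in estimator; with smoothed or machine-learned nuisance estimates the correction would generally be nonzero. The comparison of standard errors illustrates the kind of efficiency gain that fusing the second source can provide in this toy setting.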

Given any gradient of $\phi$, the canonical gradient can be derived by projecting that gradient onto the tangent space of $\mathcal{P}$ at $P^0$. The form of this projection is provided in Lemma S4 in Supplementary Appendix A.2. Applying this projection to a gradient of the form in (1) provides a form for the canonical gradient. In what follows we use $\Pi_{\underline{Q}^0}\{\cdot\mid\mathcal{A}\}$ to denote the $L_0^2(\underline{Q}^0)$-projection operator onto a subspace $\mathcal{A}$ of $L_0^2(\underline{Q}^0)$.

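Corollary 1. Under the conditions of Theorem 2, the canonical gradient of $\phi$ relative to $\mathcal{P}$ is

$$D_{P^0}^*(z,s)\equiv\sum_{j\in\mathcal{J}}1(\bar{z}_{j-1}\in\bar{\mathcal{Z}}_{j-1})\,\frac{1(s\in\mathcal{S}_j)}{P^0(S\in\mathcal{S}_j)}\,\Pi_{\underline{Q}^0}\{r_j\mid\mathcal{T}(\underline{Q}^0,\mathcal{Q})\}(\bar{z}_j),\tag{2}$$

where $\bar{\mathcal{Z}}_{j-1}$ denotes the support of $\bar{Z}_{j-1}$ under sampling from $\underline{Q}^0$ and $r_j\in L_0^2(\underline{Q}_j^0)$ is such that $r_j(\bar{z}_j)=\lambda_{j-1}(\bar{z}_{j-1})D_{\underline{Q}^0,j}(\bar{z}_j)$.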

Previous data fusion works mainly focus on particular parameters in nonparametric models (Stuart et al., 2015; Rudolph & van der Laan, 2017; Dahabreh et al., 2019; Kallus et al., 2020). Corollary 1 generalizes existing results to arbitrary parameters in both nonparametric and semiparametric models. We propose to construct a one-step estimator using the canonical gradient derived via the procedure above. Under regularity conditions (outlined in Section 4.1), the resulting estimator will be efficient among all regular and asymptotically linear estimators.

Because the canonical gradient is unique, the right-hand side of (2) will be the same regardless of the chosen value of $\underline{Q}^0\in\mathcal{Q}(P^0)$ that satisfies Condition 3. However, the calculations required to simplify that expression may differ. Indeed, that expression depends on $\underline{Q}^0$ through the definitions of $r_j$ and through the projection operator $\Pi_{\underline{Q}^0}\{\cdot\mid\mathcal{T}(\underline{Q}^0,\mathcal{Q})\}$.

Computing the projection in (2) may be challenging in some semiparametric models, though there is substantial existing work providing the form of this projection in a variety of interesting examples (Pfanzagl, 1990; Bickel et al., 1993; Van der Laan & Robins, 2003; Tsiatis, 2006). In contrast, computing this projection is trivial when $\mathcal{Q}$ is locally nonparametric, since in this case the tangent space of $\mathcal{Q}$ at $\underline{Q}^0$ is equal to $L_0^2(\underline{Q}^0)$ and the projection operator is the identity operator. Hence, in this special case, (1) and (2) are equal, and applying Theorem 2 to the one gradient $D_{\underline{Q}^0}$ for $\psi$ that can possibly exist relative to a locally nonparametric model for $\mathcal{Q}$ necessarily yields the canonical gradient relative to $\mathcal{P}$. In semiparametric models, where there is more than one possible initial candidate gradient $D_{\underline{Q}^0}$ to plug into (1), it is natural to wonder whether there is any such candidate for which (1) and (2) coincide. This requires that there be a gradient $D_{\underline{Q}^0}$ of $\psi$ for which $z\mapsto\lambda_{j-1}(\bar{z}_{j-1})D_{\underline{Q}^0,j}(\bar{z}_j)$ belongs to $\mathcal{T}(\underline{Q}^0,\mathcal{Q})$ for all $j\in\mathcal{J}$, which does not hold in general. One example of a case where this typically fails to hold occurs in a model where it is known that $Z_j\perp\!\!\!\perp\bar{Z}_{j-2}\mid Z_{j-1}$ for some $j\in\mathcal{J}$.

In certain cases, there is a close connection between data fusion problems and missing data problems. In particular, if the fusion sets are nested such that $\mathcal{S}_d\subseteq\mathcal{S}_{d-1}\subseteq\cdots\subseteq\mathcal{S}_1$ with $\mathcal{S}_d$ nonempty, then the fusion framework can be framed as a monotone missing data problem. In this problem, the full data structure $Z^F$ is drawn from $Q^0$. The observed variable $Z$ is such that $Z_j=Z_j^F$ for all $j$ such that $S\in\mathcal{S}_j$ and $Z_j$ is coded as missing otherwise. Under Condition 1, this results in a missing data problem that satisfies coarsening at random (Heitjan & Rubin, 1991). Existing works provide general approaches to derive canonical gradients in these models and, when the fusion sets are nested, our results emerge as a special case of them. For example, our Theorem 2 can be derived as a special case of Theorem 1.1 of Van der Laan & Robins (2003) in this case. Moreover, the canonical gradient presented in Corollary 1 can be derived using a general strategy for deriving efficient estimators in monotone missing data problems (Robins et al., 1994; Tsiatis, 2006). To ease notation, when illustrating this we focus on the case where $\mathcal{J}=[d]$ and the support of $\bar{Z}_{j-1}\mid S\in\mathcal{S}_j$ under $P^0$ is equal to $\bar{\mathcal{Z}}_{j-1}$. Letting $C=\max\{j:S\in\mathcal{S}_j\}$, with the maximum over the empty set being equal to 0, it can be verified that the canonical gradient $D_{P^0}^*$ from Corollary 1 satisfies

$$D_{P^0}^*(z,s)=\frac{1(c=d)}{P^0(C\ge d)}\sum_{j=1}^d\Pi_{\underline{Q}^0}\{r_j\mid\mathcal{T}(\underline{Q}^0,\mathcal{Q})\}(\bar{z}_j)+\sum_{\ell=1}^{d-1}\frac{1(c=\ell)-1(c\ge\ell)P^0(C=\ell\mid C\ge\ell)}{P^0(C\ge\ell+1)}\sum_{j=1}^{\ell}\Pi_{\underline{Q}^0}\{r_j\mid\mathcal{T}(\underline{Q}^0,\mathcal{Q})\}(\bar{z}_j).$$

Noting that $f_{\underline{Q}^0}(z):=\sum_{j=1}^d\Pi_{\underline{Q}^0}\{r_j\mid\mathcal{T}(\underline{Q}^0,\mathcal{Q})\}(\bar{z}_j)$ is a gradient of $\psi$ at $\underline{Q}^0$ relative to $\mathcal{Q}$, the above identity shows that $D_{P^0}^*$ is the influence function of an augmented inverse probability weighted complete-case estimator derived from the full-data influence function $f_{\underline{Q}^0}$ (see Theorem 10.1 of Tsiatis, 2006). Consequently, the form of this canonical gradient could be identified or numerically approximated using the general strategies outlined in Chapter 11 of Tsiatis (2006). As is noted in Tsiatis (2006), the strategies outlined in that chapter are quite complex, often involving some form of numerical approximation, and so “the actual implementation needs to be considered on a case-by-case basis.” Corollary 1 demonstrates that, in the special case of data fusion with nested fusion sets, a simple form for the canonical gradient is available. Moreover, this gradient can be computed explicitly whenever the projections used to define it have a known form, which will typically be the case when a closed-form efficient estimator for $\psi(Q^0)$ is known in the non-data fusion setting where a random sample from $Q^0$ is directly observed.
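The equivalence between the fusion-weighted form in (2) and the augmented inverse probability weighted form can be checked term by term. Under nested fusion sets, $S\in\mathcal{S}_j$ if and only if $C\ge j$, so in the simplified setting above the $j$-th projection term carries the weight $1(c\ge j)/P^0(C\ge j)$ in (2), while in the augmented form it appears in the complete-case sum and in every inner sum with $\ell\ge j$. The sketch below verifies with exact rational arithmetic that these two weights agree for every value of $c$ and $j$, for one arbitrary and purely hypothetical distribution of $C$ with $d=4$.

```python
from fractions import Fraction

d = 4
# Hypothetical distribution of C = max{j : S in S_j}; C = 0 means no fusion set contains S.
pmf = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(1, 10),
       3: Fraction(3, 10), 4: Fraction(3, 10)}


def tail(level):
    """P(C >= level)."""
    return sum(p for c, p in pmf.items() if c >= level)


def weight_fusion(c, j):
    """Weight on the j-th projection term in (2): 1(c >= j) / P(C >= j)."""
    return Fraction(int(c >= j)) / tail(j)


def weight_aipw(c, j):
    """Weight on the j-th projection term in the augmented IPW form."""
    w = Fraction(int(c == d)) / tail(d)           # complete-case term
    for level in range(j, d):                     # augmentation terms with l >= j
        hazard = pmf[level] / tail(level)         # P(C = l | C >= l)
        w += (int(c == level) - int(c >= level) * hazard) / tail(level + 1)
    return w


identity_holds = all(weight_fusion(c, j) == weight_aipw(c, j)
                     for c in range(d + 1) for j in range(1, d + 1))
print(identity_holds)
```

The agreement follows from a telescoping argument, since $P^0(C=\ell\mid C\ge\ell)/P^0(C\ge\ell+1)=1/P^0(C\ge\ell+1)-1/P^0(C\ge\ell)$.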

In settings where the fusion sets are not nested, it is not clear that it is even possible to write our data fusion problem as a missing data problem that satisfies coarsening at random. Hence, Corollary 1 presents what is, to our knowledge, the first general strategy for deriving the canonical gradient in data fusion problems. We conclude by noting that non-nested cases are far from the exception. Indeed, many previously studied data fusion problems involve non-nested fusion sets (e.g., Rudolph & van der Laan, 2017; Westreich et al., 2017; Dahabreh & Hernán, 2019; Kallus et al., 2020).

4.3. Canonical Gradient in Our Example

We now derive the canonical gradient of the longitudinal treatment effect in each of the three models described in Section 3. An initial gradient $D_{Q^0}$ to plug into Theorem 2 can be found in Theorem 1 of van der Laan & Gruber (2012) (see also Bang & Robins, 2005). Following the notation introduced in Section 3 and the results from Corollary 1, we can use this initial gradient to show that the canonical gradient of $\phi$ under a locally nonparametric model is $D_{P^0}(x)=D_{P^0}^1(x)-D_{P^0}^0(x)$, where, for $a\in\{0,1\}$ and letting $\bar{\mathcal{U}}_t$ denote the support of $\bar{U}_t$ under sampling from $Q^0$, $D_{P^0}^a=\sum_{t=1}^TD_{P_{2t-1}^0}^a$ with

$$\begin{aligned}D_{P_{2t-1}^0}^a(x)\equiv{}&1(\bar{u}_{t-1}\in\bar{\mathcal{U}}_{t-1})\,\frac{1(s\in\mathcal{S}_{2t-1})}{\mathrm{pr}(S\in\mathcal{S}_{2t-1})}\left\{\prod_{m=1}^{t-1}\frac{1(a_m=a)}{\mathrm{pr}(A_m=a\mid\bar{u}_m,\bar{A}_{m-1}=a,S\in\mathcal{S}_{2t-1})}\right\}\\&\times\left\{\prod_{m=1}^{t-1}\frac{dP^0(u_m\mid\bar{u}_{m-1},\bar{A}_{m-1}=a,S\in\mathcal{S}_{2m-1})}{dP^0(u_m\mid\bar{u}_{m-1},\bar{A}_{m-1}=a,S\in\mathcal{S}_{2t-1})}\right\}\left\{\tilde{L}_t^a(h_t,s)-\tilde{L}_{t-1}^a(h_{t-1},s)\right\},\end{aligned}\tag{3}$$

where we abuse notation and write $\bar{A}_{m-1}=a$ to mean that $A_j=a$ for all $j\in[m-1]$. Compared to the canonical gradient $D_{Q^0}$ under no data fusion, the above uses all available data sources that are valid, as reflected by the indicator terms, and then corrects for the variable shifts between the target population and the observed data, as reflected by the first two terms inside the curly brackets.

For the symmetric location semiparametric model described in Section 3, we provide the canonical gradient of ϕ in Supplementary Appendix C. For the linear semiparametric model described in Section 3, the canonical gradient of ϕ takes the form of DP0=DP01-DP00, where

DP0a(x)=DP0a(x)-DP2T-10a(x)+1uT-1𝒰-T-11s𝒮2T-1P0S𝒮2T-1EQ_0(X)(X)-1EQ_0(X)λ2T-2HT-1,AT-1DQ_2T-10a(X)(x), (4)

where Q_0 is a generic element of 𝒬P0,=β,α with β(z)βlogτ˜αy-βκhT-1,aT-1hT-1,aT-1 and α(z)αlogτ˜αy-βκhT-1,aT-1hT-1,aT-1, where τ˜α denotes the conditional density function of the error distribution τα. The key distinction between (4) and (3) lies in the subspace 𝒯Q_2T-10,𝒬2T-1, where the projection used in (4) can be derived via a score function argument for finite-dimensional parameters. A detailed roadmap is at Supplementary Appendix C.

5. Simulation and data illustration

5.1. Simulation

We simulated $k=9$ data sources with $T=4$ and fixed data sizes as specified in Table S1. The variable $\bar{U}_3$ of the target population follows a multivariate normal distribution as specified in Table S2, while treatments $\bar{A}_3$ are independent Bernoulli(0.5). The outcome variable $Y$ is such that $Y=\beta^\top\kappa(H_3,A_3)+\epsilon$ and the heteroskedastic error $\epsilon$ satisfies $\epsilon\mid H_3=h_3,A_3=a_3\sim\tau_\alpha(\cdot\mid h_3,a_3)$, where, for $\tilde\alpha>0$, $\tau_{\tilde\alpha}(\cdot\mid h_3,a_3)$ denotes the distribution of $\tilde\alpha u_3$ times a random variable following a Student's t-distribution with 3 degrees of freedom. The indexing parameter $\alpha$ equals 0.1 and the values of $\beta$ and the form of $\kappa$ are specified in Supplementary Appendix G. The underlying true distribution belongs to all three models mentioned in Section 3, where, for the second semiparametric model, the error distribution is known to belong to $\{\tau_{\tilde\alpha}:\tilde\alpha>0\}$. We evaluated the treatment effect of always being on treatment versus never being on treatment. Under this setup, data source 9 aligns perfectly with the target population distribution and it is possible to provide valid inferences for $\psi(Q^0)$ using this data source alone. We compared three one-step estimators, each constructed via the canonical gradient under one of these models, under three scenarios: (1) no data fusion with $\mathcal{S}_7=\mathcal{S}_5=\mathcal{S}_3=\mathcal{S}_1=\{9\}$, (2) partial data fusion with $\mathcal{S}_7=\{6,9\}$, $\mathcal{S}_5=\{5,9\}$, $\mathcal{S}_3=\{3,9\}$ and $\mathcal{S}_1=\{1,3,9\}$, and (3) complete data fusion with $\mathcal{S}_7=\{6,8,9\}$, $\mathcal{S}_5=\{5,6,8,9\}$, $\mathcal{S}_3=\{3,5,7,9\}$ and $\mathcal{S}_1=\{1,3,9\}$.

The outcome regressions and the propensity scores were estimated via SuperLearner (Van der Laan et al., 2007) with a library containing a generalized linear model with interaction terms and a generalized additive model, both under their default settings in the SuperLearner R package (Polley & Van Der Laan, 2010). Each density in the ratios that appear in the second line of Equation 3 was estimated via kernel density estimation using a normal scale bandwidth. Details on the estimation of the conditional density of the regression error in the semiparametric model where this density is known to be symmetric are given in Supplementary Appendix G. For the other semiparametric model considered, we evaluated the scores $\ell=(\ell_\beta,\ell_\alpha)$ in (4) numerically via a finite difference approximation. For each simulation study presented in this work, 1000 Monte Carlo replications were conducted.
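The finite difference evaluation of the scores can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a scalar regression coefficient `beta`, a scalar covariate value `u`, a precomputed scalar feature `kappa`, and a central difference with a hypothetical step size of `1e-5`.

```python
import math


def log_t_density(x, df=3.0):
    """Log density of a Student's t-distribution with `df` degrees of freedom."""
    const = (math.lgamma((df + 1.0) / 2.0) - math.lgamma(df / 2.0)
             - 0.5 * math.log(df * math.pi))
    return const - (df + 1.0) / 2.0 * math.log1p(x * x / df)


def log_lik(theta, y, kappa, u):
    """Log conditional density of y in a scaled-t working model.

    theta = (beta, alpha); the residual y - beta * kappa has scale alpha * u.
    """
    beta, alpha = theta
    scale = alpha * u
    return log_t_density((y - beta * kappa) / scale) - math.log(scale)


def fd_score(theta, y, kappa, u, step=1e-5):
    """Central finite-difference approximation to the score in (beta, alpha)."""
    score = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += step
        down[j] -= step
        diff = log_lik(up, y, kappa, u) - log_lik(down, y, kappa, u)
        score.append(diff / (2.0 * step))
    return score


score = fd_score((0.5, 0.1), y=1.2, kappa=2.0, u=1.5)
```

A central difference is used here because its truncation error is second order in the step size; a forward difference would also be a finite difference approximation but with first-order error.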

In the nonparametric case, data fusion yielded efficiency gains of around 10% and 20% for partial and complete fusion, respectively. Compared to the nonparametric estimator constructed using only data source 9, the semiparametric estimators gained approximately 40%, 50%, and 60% efficiency under no, partial, and complete data fusion, respectively. Coverage was near nominal for all estimators (93% to 98%), and the widths of the intervals decreased along the same lines as the mean squared error, with more data fusion and more restrictive statistical models each leading to tighter intervals (Table 1).

Table 1:

Bias, variance and coverage of the estimated longitudinal treatment effect.

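| Model | No data fusion: Bias | No data fusion: Var | No data fusion: Cov % | Partial data fusion: Bias | Partial data fusion: Var | Partial data fusion: Cov % | Complete data fusion: Bias | Complete data fusion: Var | Complete data fusion: Cov % |
|---|---|---|---|---|---|---|---|---|---|
| Nonparametric | −0.003 | 0.170 | 97 | −0.004 | 0.154 | 98 | 0.003 | 0.137 | 98 |
| Symmetric | −0.004 | 0.106 | 93 | 0.064 | 0.087 | 94 | 0.049 | 0.076 | 94 |
| Linear | −0.012 | 0.101 | 96 | 0.066 | 0.079 | 95 | 0.053 | 0.064 | 95 |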
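The approximate efficiency gains quoted in the text can be checked against the variances in Table 1. The sketch below computes the relative variance reduction of each estimator against the nonparametric estimator without data fusion. Treating "efficiency gain" as a variance ratio is an assumption made here for illustration; a comparison based on mean squared error would also account for the bias column.

```python
# Monte Carlo variances copied from Table 1.
variance = {
    "Nonparametric": {"none": 0.170, "partial": 0.154, "complete": 0.137},
    "Symmetric": {"none": 0.106, "partial": 0.087, "complete": 0.076},
    "Linear": {"none": 0.101, "partial": 0.079, "complete": 0.064},
}


def variance_reduction(var, reference):
    """Relative reduction in variance compared with a reference estimator."""
    return 1.0 - var / reference


# Reference: nonparametric estimator using data source 9 only.
reference = variance["Nonparametric"]["none"]
reductions = {
    model: {scenario: variance_reduction(v, reference) for scenario, v in row.items()}
    for model, row in variance.items()
}

for model, row in reductions.items():
    print(model, {scenario: f"{100 * r:.1f}%" for scenario, r in row.items()})
```

Under this variance-based reading, the nonparametric reductions are about 9% and 19%, in line with the "around 10% and 20%" stated above.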

5.2. Data illustration: evaluating the immunogenicity of an HIV vaccine

The STEP study and the Phambili study were two phase IIb trials that evaluated the safety and efficacy of the same HIV vaccine regimen in different populations (Buchbinder et al., 2008; Gray et al., 2011). The STEP study was conducted at 34 sites in multiple continents, whereas the Phambili study tested the same vaccine at 5 sites in South Africa. Although the two studies enrolled in distinct geographic areas, there was overlap in the overall demographic characteristics (Table S3). Both studies suggested that the vaccine did not prevent HIV-1 infection, although most vaccinees developed an HIV-specific immune response to Clade B as measured by interferon-γ ELISpot.

We studied the extent to which baseline covariates predict HIV-specific immune responses. This is important for developing baseline immunogenicity predictors that can be used in subsequent analyses to assess stratified effects (Follmann, 2006). These predictors can subsequently be used to bridge, that is, transport, vaccine efficacy to new settings (Qin et al., 2007; Rolland & Gilbert, 2012; Huang et al., 2013). Specifically, we assessed the best achievable area under the receiver operating characteristic curve for predicting each of the three HIV-specific immune responses using age, sex, body mass index and baseline adenovirus serotype-5 positivity as covariates. This measure assesses the extent to which these baseline covariates predict immune responses (Williamson et al., 2021). We treated each of the two study populations in turn as the target population and compared the estimates obtained using a single dataset with those obtained using both datasets.

The Phambili ELISpot data consist of measurements of Gag, Nef, and Pol immune responses for 93 vaccinees, and the STEP immunogenicity data consist of measurements for 722 vaccinees and 257 placebo participants. The two trials used different sampling schemes. In Phambili, the immunogenicity assessment was conducted on the first 93 vaccinees who were HIV-1 antibody negative at the week 12 visit and had received the second injection (Gray et al., 2011). In STEP, a two-phase sampling scheme was adopted to oversample HIV cases (Huang et al., 2014). To account for this two-phase sampling scheme, we weighted the STEP data by the inverse probability of being sampled given infection status and treatment group. We aimed to evaluate the three HIV-specific immune responses for the vaccine group, namely Gag, Nef, and Pol to clade B. We used the same criteria as Huang et al. (2014) for defining a positive immune response. Regardless of the target population, we assumed that the conditional distribution of immune response given baseline covariates for the vaccine group is the same in STEP and Phambili (Condition 1b). These baseline covariates consisted of baseline adenovirus serotype-5 positivity along with age, body mass index, sex, and circumcision status. We combined sex and circumcision status into a single 3-level categorical variable to differentiate uncircumcised men from circumcised men and women. Data from the HVTN 204 phase II trial support the plausibility of Condition 1b. In particular, they suggest that HIV-specific immune response profiles do not differ by geographic region, whereas baseline adenovirus serotype-5 neutralizing antibodies are strongly associated with HIV-specific immune responses (Churchyard et al., 2011). To examine whether Condition 1b is reasonable, we followed the approach proposed by Luedtke et al. (2019) and found no strong evidence against it (Gag: p=0.31; Nef: p=0.46; Pol: p=0.98). In addition, we observed reasonable overlap between the distributions of covariates (Table S3 and Figure S1).
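The inverse probability weighting used for the two-phase STEP sample can be illustrated with a small sketch. It assumes a hypothetical record layout (keys `id`, `infected`, `arm`) and estimates each sampling probability empirically as the fraction of a stratum that was sampled; the study itself may have used known design probabilities, so this is an illustration of the idea rather than the authors' procedure.

```python
from collections import Counter


def inverse_probability_weights(cohort, sampled_ids):
    """Return {id: weight} for sampled participants.

    Strata are defined by (infection status, treatment arm). The weight is
    the inverse of the empirical sampling fraction within the stratum.
    """
    totals = Counter((p["infected"], p["arm"]) for p in cohort)
    sampled = Counter(
        (p["infected"], p["arm"]) for p in cohort if p["id"] in sampled_ids
    )
    weights = {}
    for p in cohort:
        if p["id"] in sampled_ids:
            stratum = (p["infected"], p["arm"])
            weights[p["id"]] = totals[stratum] / sampled[stratum]
    return weights


# Toy cohort: both infected vaccinees are sampled, but only half of the
# uninfected vaccinees are (hypothetical data for illustration).
cohort = [
    {"id": 1, "infected": True, "arm": "vaccine"},
    {"id": 2, "infected": True, "arm": "vaccine"},
    {"id": 3, "infected": False, "arm": "vaccine"},
    {"id": 4, "infected": False, "arm": "vaccine"},
    {"id": 5, "infected": False, "arm": "vaccine"},
    {"id": 6, "infected": False, "arm": "vaccine"},
]
weights = inverse_probability_weights(cohort, sampled_ids={1, 2, 3, 4})
```

In this toy example the oversampled cases receive weight 1 and the sampled non-cases receive weight 2, so the weights sum to the size of the full cohort.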

We performed 4-fold cross-fitting and trained our immune response prediction algorithm and other nuisance parameters using SuperLearner (Van der Laan et al., 2007; Polley & Van Der Laan, 2010) with a library containing a simple mean, random forest, gradient boosting, a generalized additive model, and elastic net. Results are presented in Table 2. While these baseline covariates predict immune responses reasonably well for the STEP population (area under the curve: 0.64–0.72), they have poor predictive performance for the Phambili population (0.55–0.57). Estimators that make use of data fusion gave estimates close to those of the estimators that only used data from one trial, but the corresponding standard errors were reduced by more than 30% for each immune response when data from Phambili were augmented with the STEP study data. The proposed methods also brought efficiency gains when data from STEP were augmented, though these gains were more modest due to the smaller sample size of the Phambili study.
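The general idea of cross-fitting an area under the curve can be sketched as follows: a predictor is trained on all but one fold and its AUC is evaluated on the held-out fold, and the fold-specific values are averaged. This is not the estimator used in the analysis above, which relies on SuperLearner and an efficient one-step construction. The toy single-covariate learner `fit_centroid_scorer` and all function names are hypothetical, and each fold is assumed to contain both responders and non-responders.

```python
import random


def auc(scores, labels):
    """Mann-Whitney estimate of the AUC; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroid_scorer(xs, ys):
    """Toy learner: signed distance from the midpoint of the two class means."""
    ones = [x for x, y in zip(xs, ys) if y == 1]
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    m1 = sum(ones) / len(ones)
    m0 = sum(zeros) / len(zeros)
    mid = (m0 + m1) / 2.0
    direction = 1.0 if m1 >= m0 else -1.0
    return lambda x: direction * (x - mid)


def cross_fitted_auc(xs, ys, n_folds=4, seed=None):
    """Average held-out AUC over folds; shuffles first if a seed is given."""
    idx = list(range(len(xs)))
    if seed is not None:
        random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    fold_aucs = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        scorer = fit_centroid_scorer([xs[i] for i in train], [ys[i] for i in train])
        fold_aucs.append(
            auc([scorer(xs[i]) for i in held_out], [ys[i] for i in held_out])
        )
    return sum(fold_aucs) / n_folds


# Toy data: a single covariate that separates responders perfectly.
xs = list(range(40))
ys = [1 if x >= 20 else 0 for x in xs]
estimate = cross_fitted_auc(xs, ys)
```

Evaluating each fitted predictor only on data it was not trained on is what keeps the held-out AUC from being optimistically biased by overfitting.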

Table 2:

Estimated area under the receiver operating characteristic curve using the STEP and Phambili study data. Estimation results are presented as estimates (standard errors).

| Immune response | Augmenting STEP: STEP only (N=979) | Augmenting STEP: Both (N=1072) | Augmenting Phambili: Phambili only (N=93) | Augmenting Phambili: Both (N=1072) |
|---|---|---|---|---|
| Gag | 0.718 (0.017) | 0.686 (0.016) | 0.574 (0.036) | 0.574 (0.019) |
| Nef | 0.654 (0.017) | 0.640 (0.016) | 0.547 (0.031) | 0.550 (0.017) |
| Pol | 0.704 (0.017) | 0.717 (0.016) | 0.546 (0.032) | 0.550 (0.015) |
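The reductions in standard error reported in the text can be recomputed directly from the standard errors in Table 2. The short sketch below does this arithmetic; the dictionary layout and function name are illustrative choices.

```python
# Standard errors copied from Table 2: (single trial, both trials).
standard_errors = {
    "Gag": {"STEP": (0.017, 0.016), "Phambili": (0.036, 0.019)},
    "Nef": {"STEP": (0.017, 0.016), "Phambili": (0.031, 0.017)},
    "Pol": {"STEP": (0.017, 0.016), "Phambili": (0.032, 0.015)},
}


def se_reduction(single, fused):
    """Relative reduction in standard error from using both trials."""
    return 1.0 - fused / single


se_reductions = {
    marker: {target: se_reduction(*pair) for target, pair in row.items()}
    for marker, row in standard_errors.items()
}

for marker, row in se_reductions.items():
    print(marker, {target: f"{100 * r:.1f}%" for target, r in row.items()})
```

With these inputs, the reductions for the Phambili target population are roughly 45% to 53%, while those for the STEP target population are about 6%.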

We also studied coefficients in univariate logistic working models for baseline covariates age, sex, body mass index and baseline adenovirus serotype-5 positivity for each of the three HIV-specific immune markers. These results, along with further details of our area under the curve analysis, are shown in Supplementary Section H.

Supplementary Material

Supplementary appendices

Acknowledgements

The authors are grateful to Peter Gilbert for helpful discussions and to the participants, investigators, and sponsors of the STEP and Phambili trials. This work was supported by the NIH through award numbers DP2-LM013340 and 5UM1AI068635-09.

Contributor Information

Sijia Li, Department of Biostatistics, University of Washington, Seattle, Washington 98195.

Alex Luedtke, Department of Statistics, University of Washington, Box 354322, Seattle, Washington 98195.

References

  1. Athey S, Chetty R, Imbens GW & Kang H (2019). The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Tech. rep., National Bureau of Economic Research.
  2. Bang H & Robins JM (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61, 962–973.
  3. Bareinboim E & Pearl J (2014). Transportability from multiple environments with limited experiments: Completeness results. Advances in Neural Information Processing Systems 27, 280–288.
  4. Bareinboim E & Pearl J (2016). Causal inference and the data-fusion problem. Proc. Natl. Acad. Sci. 113, 7345–7352.
  5. Bickel PJ (1982). On adaptive estimation. Ann. Stat., 647–671.
  6. Bickel PJ, Klaassen CA, Ritov Y & Wellner JA (1993). Efficient and adaptive estimation for semiparametric models, vol. 4. Johns Hopkins University Press, Baltimore.
  7. Buchbinder SP, Mehrotra DV, Duerr A, Fitzgerald DW, Mogg R, Li D, Gilbert PB, Lama JR, Marmor M, Del Rio C et al. (2008). Efficacy assessment of a cell-mediated immunity HIV-1 vaccine (the Step Study): a double-blind, randomised, placebo-controlled, test-of-concept trial. Lancet 372, 1881–1893.
  8. Chakrabortty A (2016). Robust Semi-Parametric Inference in Semi-Supervised Settings. Ph.D. thesis, Harvard University.
  9. Chapelle O, Scholkopf B & Zien A (2009). Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Trans. Neural Netw. Learn. Syst. 20, 542–542.
  10. Churchyard GJ, Morgan C, Adams E, Hural J, Graham BS, Moodie Z, Grove D, Gray G, Bekker L-G, McElrath MJ et al. (2011). A phase IIa randomized clinical trial of a multiclade HIV-1 DNA prime followed by a multiclade rAd5 HIV-1 vaccine boost in healthy adults (HVTN204). PLoS One 6, e21225.
  11. Dahabreh IJ & Hernán MA (2019). Extending inferences from a randomized trial to a target population. Eur. J. Epidemiol. 34, 719–722.
  12. Dahabreh IJ, Robertson SE, Petito LC, Hernán MA & Steingrimsson JA (2019). Efficient and robust methods for causally interpretable meta-analysis: transporting inferences from multiple randomized trials to a target population. arXiv preprint arXiv:1908.09230.
  13. Dong L, Yang S, Wang X, Zeng D & Cai J (2020). Integrative analysis of randomized clinical trials with real world evidence studies. arXiv preprint arXiv:2003.01242.
  14. Evans K, Sun B, Robins J & Tchetgen EJT (2018). Doubly robust regression analysis for data fusion. arXiv preprint arXiv:1808.07309.
  15. Follmann D (2006). Augmented designs to assess immune response in vaccine trials. Biometrics 62, 1161–1169.
  16. Gray G, Allen M, Moodie Z, Churchyard G, Bekker L, Nchabeleng M, Mlisana K, Metch B, de Bruyn G, Latka M et al. (2011). Safety and efficacy assessment of the HVTN 503/Phambili study: A double-blind randomized placebo-controlled test-of-concept study of a clade B-based HIV-1 vaccine in South Africa. The Lancet Infectious Diseases 11, 507.
  17. Heitjan DF & Rubin DB (1991). Ignorability and coarse data. Ann. Stat., 2244–2253.
  18. Hernán MA & Robins JM (2020). Causal inference: what if.
  19. Hernán MA & VanderWeele TJ (2011). Compound treatments and transportability of causal inference. Epidemiology (Cambridge, Mass.) 22, 368.
  20. Huang Y, Duerr A, Frahm N, Zhang L, Moodie Z, De Rosa S, McElrath MJ & Gilbert PB (2014). Immune-correlates analysis of an HIV-1 vaccine efficacy trial reveals an association of nonspecific interferon-γ secretion with increased HIV-1 infection risk: a cohort-based modeling study. PLoS One 9, e108631.
  21. Huang Y, Gilbert PB & Wolfson J (2013). Design and estimation for evaluating principal surrogate markers in vaccine trials. Biometrics 69, 301–309.
  22. Kallus N, Saito Y & Uehara M (2020). Optimal off-policy evaluation from multiple logging policies. arXiv preprint arXiv:2010.11002.
  23. Lanckriet GR, De Bie T, Cristianini N, Jordan MI & Noble WS (2004). A statistical framework for genomic data fusion. Bioinformatics 20, 2626–2635.
  24. Lu B, Ben-Michael E, Feller A & Miratrix L (2021). Is it who you are or where you are? Accounting for compositional differences in cross-site treatment variation. arXiv preprint arXiv:2103.14765.
  25. Luedtke A, Carone M & Van der Laan MJ (2019). An omnibus non-parametric test of equality in distribution for unknown functions. J. R. Stat. Soc. 81, 75–99.
  26. Mo W, Qi Z & Liu Y (2020). Learning optimal distributionally robust individualized treatment rules. J. Am. Stat. Assoc., 1–16.
  27. Pearl J & Bareinboim E (2011). Transportability of causal and statistical relations: A formal approach. AAAI 25.
  28. Pfanzagl J (1990). Estimation in semiparametric models. In Estimation in Semiparametric Models. Springer, pp. 17–22.
  29. Polley EC & Van Der Laan MJ (2010). Super learner in prediction.
  30. Qin L, Gilbert PB, Corey L, McElrath MJ & Self SG (2007). A framework for assessing immunological correlates of protection in vaccine trials. The Journal of Infectious Diseases 196, 1304–1312.
  31. Robins JM, Rotnitzky A & Zhao LP (1994). Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 89, 846–866.
  32. Rolland M & Gilbert P (2012). Evaluating immune correlates in HIV type 1 vaccine efficacy trials: what RV144 may provide. AIDS Research and Human Retroviruses 28, 400–404.
  33. Rudolph KE & van der Laan MJ (2017). Robust estimation of encouragement-design intervention effects transported across sites. J. R. Stat. Soc. 79, 1509.
  34. Stuart EA, Bradshaw CP & Leaf PJ (2015). Assessing the generalizability of randomized trial results to target populations. Prevention Science 16, 475–485.
  35. Stuart EA, Cole SR, Bradshaw CP & Leaf PJ (2011). The use of propensity scores to assess the generalizability of results from randomized trials. J. R. Stat. Soc. 174, 369–386.
  36. Sun B & Miao W (2018). On semiparametric instrumental variable estimation of average treatment effects through data fusion. arXiv preprint arXiv:1810.03353.
  37. Tsiatis AA (2006). Semiparametric theory and missing data. Springer.
  38. Van der Laan MJ & Gruber S (2012). Targeted minimum loss based estimation of causal effects of multiple time point interventions. Int. J. Biostat. 8.
  39. Van der Laan MJ, Polley EC & Hubbard AE (2007). Super learner. Stat. Appl. Genet. Mol. Biol. 6.
  40. Van der Laan MJ & Robins JM (2003). Unified methods for censored longitudinal data and causality. Springer Science & Business Media.
  41. Van Der Laan MJ & Rubin D (2006). Targeted maximum likelihood learning. Int. J. Biostat. 2.
  42. Van der Vaart AW & Wellner JA (1996). Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media.
  43. Wedam S, Fashoyin-Aje L, Bloomquist E, Tang S, Sridhara R, Goldberg KB, Theoret MR, Amiri-Kordestani L, Pazdur R & Beaver JA (2020). FDA approval summary: palbociclib for male patients with metastatic breast cancer. Clin. Cancer Res. 26, 1208–1212.
  44. Westling T (2021). Nonparametric tests of the causal null with nondiscrete exposures. J. Am. Stat. Assoc., 1–12.
  45. Westreich D, Edwards JK, Lesko CR, Stuart E & Cole SR (2017). Transportability of trial results using inverse odds of sampling weights. Am. J. Epidemiol. 186, 1010–1014.
  46. Williamson BD, Gilbert PB, Simon NR & Carone M (2021). A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association, 1–14.
