Efficient Estimation under Data Fusion

Sijia Li; Alex Luedtke

doi:10.1093/biomet/asad007

Author manuscript; available in PMC: 2023 Dec 1.

Published in final edited form as: Biometrika. 2023 Feb 6;110(4):1041–1054. doi: 10.1093/biomet/asad007

PMCID: PMC10653189 NIHMSID: NIHMS1923270 PMID: 37982010

Summary

We aim to make inferences about a smooth, finite-dimensional parameter by fusing data from multiple sources together. Previous works have studied the estimation of a variety of parameters in similar data fusion settings, including in the estimation of the average treatment effect and average reward under a policy, with the majority of them merging one historical data source with covariates, actions, and rewards and one data source of the same covariates. In this work, we consider the general case where one or more data sources align with each part of the distribution of the target population, for example, the conditional distribution of the reward given actions and covariates. We describe potential gains in efficiency that can arise from fusing these data sources together in a single analysis, which we characterize by a reduction in the semiparametric efficiency bound. We also provide a general means to construct estimators that achieve these bounds. In numerical simulations, we illustrate marked improvements in efficiency from using our proposed estimators rather than their natural alternatives. Finally, we illustrate the magnitude of efficiency gains that can be realized in vaccine immunogenicity studies by fusing data from two HIV vaccine trials.

Some key words: Data fusion, Semiparametric theory, HIV

1. Introduction

The rapid expansion of available data has facilitated the use of data fusion, which allows researchers to combine information from many data sources, each collected on a potentially distinct population at a different time, in order to obtain valid summaries of a target population of interest. In practice, data fusion often yields more relevant information or is less expensive than traditional analyses that only leverage a single data source. For example, technology companies integrate large amounts of unlabeled data with a small amount of labeled data to make accurate predictions, in a process known as semi-supervised learning (Chakrabortty, 2016). In education, policy-makers leverage multiple datasets generated by different current policies to evaluate a new rule of interest in a fast, inexpensive, and effective way (Kallus et al., 2020). In genomics, integrating expression data, gene sequencing data, and network data gives a heterogeneous description of the gene and a distinct view of the underlying machinery of the cell (Lanckriet et al., 2004). In clinical trials, experimental data can be fused with observational data to evaluate a treatment regime on a different target population than the study population (e.g., Wedam et al., 2020).

There are many recent works introducing statistical methods for particular data fusion problems. Many of them focus on bridging causal conclusions via data fusion, as illustrated in the aforementioned clinical trial example. This is true, for example, in works on transportability (Pearl & Bareinboim, 2011; Hernán & VanderWeele, 2011; Bareinboim & Pearl, 2014; Stuart et al., 2015; Rudolph & van der Laan, 2017; Dahabreh & Hernán, 2019; Dahabreh et al., 2019; Dong et al., 2020), re-targeting under covariate shifts (Kallus et al., 2020), using a surrogate index to infer long-term outcomes (Athey et al., 2019), and correcting external validity bias (Stuart et al., 2011; Mo et al., 2020), in that all these research areas focus on bridging causal effects from a source population to a different target population. While these works considered merging two datasets only, Dahabreh et al. (2019) and Lu et al. (2021) considered bridging data from multiple trials to a target population, and others have studied combining experimental data with multiple observational data sources in the presence of unmeasured confounding (Evans et al., 2018; Sun & Miao, 2018). Moreover, Bareinboim & Pearl (2016) studied identifiability results for a general causal parameter when multiple heterogeneous data sources are available. Data fusion is also used in non-causal problems. For example, semi-supervised learning (Chapelle et al., 2009; Chakrabortty, 2016) represents another important application of data fusion.

Due to the considerable number of open problems in this area, it is of interest to describe a general framework and approach that allows researchers to tackle data fusion problems in generality without limiting themselves to specific parameters, numbers of datasets, or data structures. In this paper, we will consider a general case where different data sources align with different parts of the distribution of the target population and derive efficient estimators based on all available data. We derive a key object needed to both quantify the best achievable level of statistical efficiency when data from multiple sources are fused together and to construct estimators that achieve these gains. Our results generalize previous works that study estimation of specific parameters in data fusion problems under nonparametric models by allowing for both consideration of general parameters and arbitrary semiparametric or nonparametric models. In addition to an example provided in Section 3, in Supplementary Appendix D we illustrate the wide applicability of the proposed method by presenting five other examples. There we also provide a discussion of implementation, possible extensions, connections to missing data problems, and proofs.

2. Notations and Problem Setup

We begin by defining some notation. For a natural number $m$, we write $[m]$ to denote $\{1, \dots, m\}$. For a distribution $ν$, we let $E_{ν}$ denote the expectation operator under $ν$. Throughout we use $Z = (Z_{1}, \dots, Z_{d})$ to denote a random variable and, for $j \in [d]$, we let ${\overline{Z}}_{j} = (Z_{1}, \dots, Z_{j})$, where we use the convention that ${\overline{Z}}_{0} = \emptyset$. We use capital letters, such as ${\overline{Z}}_{j}$ and $S$, to denote random variables and the corresponding lowercase letters, such as ${\overline{z}}_{j}$ and $s$, to denote their realizations. In an abuse of notation, we condition on lowercase letters in expectations to indicate conditioning on the corresponding random variable taking a specific value: for example, $E_{ν} (Z_{2} ∣ z_{1}) = E_{ν} (Z_{2} ∣ Z_{1} = z_{1})$. For any distribution $Q$ of $Z$ and $j \in [d]$, we will let $Q_{j} (\cdot ∣ {\overline{z}}_{j - 1})$ denote the conditional distribution of $Z_{j} ∣ {\overline{Z}}_{j - 1} = {\overline{z}}_{j - 1}$. Similarly, for any distribution $P$ of $(Z, S)$, we will let $P_{j} (\cdot ∣ {\overline{z}}_{j - 1}, s)$ denote the conditional distribution of $Z_{j} ∣ {\overline{Z}}_{j - 1} = {\overline{z}}_{j - 1}, S = s$. Here and throughout we suppose sufficient regularity conditions so that all such conditional distributions are well defined and that all discussed distributions of $Z_{j} ∣ {\overline{Z}}_{j - 1} = {\overline{z}}_{j - 1}$ and $Z_{j} ∣ {\overline{Z}}_{j - 1} = {\overline{z}}_{j - 1}, S = s$ are defined on some common measurable space. We use → to specify the domain and codomain of an arbitrary function and use ↦ to denote the input and output of a function. For example, if $f$ is the standard normal density function, then it would be accurate to write both $f : R \to R$ and $f : x \mapsto (2 π)^{- 1 / 2} e^{- x^{2} / 2}$.

Suppose we have a collection of $k$ data sources and want to estimate an $R^{b}$ -valued summary $ψ (Q^{0})$ of a target distribution $Q^{0}$ that is known to belong to a collection $𝒬$ of distributions of a random variable $Z = (Z_{1}, \dots, Z_{d})$ , where $Z$ takes values in $𝒵 = \prod_{j = 1}^{d} 𝒵_{j}$ . The summary $ψ$ may only depend on a subset of the conditional distributions of $Z_{j} ∣ {\overline{Z}}_{j - 1}$ . To handle such cases, we let $ℐ \subset [d]$ denote a set of irrelevant indices $j$ such that $ψ$ is not a function of the distribution of $Z_{j} ∣ {\overline{Z}}_{j - 1}$ — more concretely, $ψ (Q) = ψ (Q^{'})$ for all $Q, Q^{'} \in 𝒬$ such that $Q_{j} = Q_{j}^{'}$ for all $j \in [d] ∖ ℐ$ . We do not require that $ℐ$ be the largest possible set of irrelevant indices — this means that, for any parameter $ψ$ , we can take $ℐ = \emptyset$ , while, for certain parameters $ψ$ , it will be possible to take $ℐ$ to be a nonempty set. To ensure that it makes sense to compare the distributions of $Z_{j} ∣ {\overline{Z}}_{j - 1}$ under different distributions $Q$ and $Q^{'}$ in $𝒬$ , we assume here and throughout that all pairs of distributions in $𝒬$ are mutually absolutely continuous. We let $𝒥 = [d] ∖ ℐ$ denote the set of indices that may be relevant to the evaluation of $ψ$ , termed the set of relevant indices.

Rather than observe draws directly from $Q^{0}$, we see $n$ independent copies of $X = (Z, S)$ drawn from some common distribution $P^{0}$, where $Z$ takes values in $𝒵$ and $S$ is a categorical random variable that denotes the data source and has support $[k]$. The distribution $P^{0}$ is known to align with $Q^{0}$ in the sense described below, which makes it possible to relate the conditional distributions $P_{j}^{0} (\cdot ∣ {\overline{z}}_{j - 1}, s)$ and $Q_{j}^{0} (\cdot ∣ {\overline{z}}_{j - 1})$.

Condition 1. (Sufficient alignment) For each relevant index $j \in 𝒥$ , there exists a known set $𝒮_{j} \subseteq [k]$ such that, for all $s \in 𝒮_{j}$ , both of the following hold:

(a) (Sufficient overlap) the marginal distribution of ${\overline{Z}}_{j - 1}$ under sampling from $Q^{0}$ is absolutely continuous with respect to the conditional distribution of ${\overline{Z}}_{j - 1} ∣ S = s$ under sampling from $P^{0}$; and
(b) (Common conditional distributions) $P_{j}^{0} (\cdot ∣ {\overline{z}}_{j - 1}, s) = Q_{j}^{0} (\cdot ∣ {\overline{z}}_{j - 1})$ $Q^{0}$-almost everywhere.

We provide an example in Section 3 and five further examples in Supplementary Appendix D where the above condition is plausible. Three of those examples represent generalizations of existing results, and, in all of those cases, a version of the above alignment condition was previously assumed. A detailed comparison between Condition 1 and identification conditions given in existing works is provided in Supplementary Appendix D. We refer to $𝒮_{j}, j \in [d]$, as fusion sets and suppose they are known and prespecified. As $Q^{0}$ is unknown beyond its membership in $𝒬$, the above implies that $P^{0}$ is known to belong to the collection $𝒫$ of distributions $P$ with support on $𝒵 \times [k]$ for which there exists a $Q \in 𝒬$ such that, for all $j \in 𝒥$ and $s \in 𝒮_{j}$, the following analogues of Condition 1 hold: (a) the marginal distribution of ${\overline{Z}}_{j - 1}$ under sampling from $Q$ is absolutely continuous with respect to the conditional distribution of ${\overline{Z}}_{j - 1} ∣ S = s$ under sampling from $P$, and (b) $P_{j} (\cdot ∣ {\overline{z}}_{j - 1}, s) = Q_{j} (\cdot ∣ {\overline{z}}_{j - 1})$ $Q$-almost everywhere. Hereafter we refer to $𝒫$ and $𝒬$ as models.

Condition 1a ensures that, for $s \in 𝒮_{j}$, null sets under the distribution of ${\overline{Z}}_{j - 1} ∣ S = s$ implied by $P^{0}$ are also null sets under the marginal distribution of ${\overline{Z}}_{j - 1}$ implied by $Q^{0}$, which ensures that the conditional distribution $P_{j}^{0} (\cdot ∣ {\overline{z}}_{j - 1}, s)$ appearing in Condition 1b is uniquely defined up to $Q^{0}$-null sets. It is worth noting that we have not assumed that the conditional distribution of ${\overline{Z}}_{j - 1} ∣ S = s$ under sampling from $P^{0}$ is absolutely continuous with respect to the marginal distribution of ${\overline{Z}}_{j - 1}$ under sampling from $Q^{0}$, which allows ${\overline{Z}}_{j - 1}$ to take values not seen in the target distribution when sampled from aligning data sources under $P^{0}$. Condition 1b implies exchangeability over data sources, namely that $Z_{j} ⫫ S ∣ ({\overline{Z}}_{j - 1}, S \in 𝒮_{j})$ for $j \in [d]$, which imposes a nontrivial conditional independence condition on the data-generating distribution when $𝒮_{j}$ is not a singleton for at least one $j$. This exchangeability condition is testable (Luedtke et al., 2019; Westling, 2021). Nevertheless, we advocate choosing the fusion sets based on outside knowledge rather than via hypothesis testing to avoid challenges associated with post-selection inference.

Previous data fusion works have shown that variants of Condition 1 enable the identification of $ψ (Q^{0})$ as a functional $ϕ$ of the observed data distribution $P^{0}$ in particular problems (e.g., Rudolph & van der Laan, 2017; Dahabreh et al., 2019) and in general causal inference problems (e.g., Pearl & Bareinboim, 2011; Bareinboim & Pearl, 2016). We will heavily rely on such an identifiability result to construct estimators of $ψ (Q^{0})$ , and so we present it explicitly here. To this end, we define a mapping $θ : 𝒫 \to 𝒬$ . In particular, for any $P \in 𝒫$ , we let $θ (P)$ denote an arbitrarily selected distribution from the set $𝒬 (P)$ of distributions $Q \in 𝒬$ that are such that, for each $j \in 𝒥, Z_{j} ∣ {\overline{Z}}_{j - 1}$ under sampling from $Q$ has the same distribution as $Z_{j} ∣ {\overline{Z}}_{j - 1}, S \in 𝒮_{j}$ under sampling from $P$ . Because $P \in 𝒫$ , there must be at least one distribution in $𝒬 (P)$ . Moreover, the value in $𝒬 (P)$ selected when defining $θ (P)$ is irrelevant for our purposes since, as is evident below, our identifiability result only concerns the value of $ψ \circ θ (P^{0})$ , and this value does not depend on the conditional distributions of $Z_{j} ∣ {\overline{Z}}_{j - 1}$ under $θ (P^{0})$ for irrelevant indices $j$ .

Theorem 1. Let $ϕ = ψ \circ θ$. Under Condition 1, $ψ (Q^{0}) = ϕ (P^{0})$.

Importantly, $θ (P^{0})$ can be evaluated without knowing the value of the true target distribution $Q^{0}$. Consequently, the above result shows that it is possible to learn the summary $ψ (Q^{0})$ of the target distribution based only on the observed data distribution $P^{0}$. This motivates estimating $ϕ (P^{0})$, and therefore $ψ (Q^{0})$, based on a random sample drawn from $P^{0}$. Before presenting such estimation strategies, we will exhibit an example that fits within this data fusion framework.
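As a minimal numerical illustration of Theorem 1, the sketch below simulates a toy fusion problem and evaluates the identified functional by plug-in. Everything in it is an assumption made for illustration rather than part of the paper: $d = 2$, $k = 2$, a binary $Z_{1}$, the summary $ψ (Q) = E_{Q} (Z_{2})$, fusion sets $𝒮_{1} = \{1\}$ and $𝒮_{2} = \{1, 2\}$, the data-generating values, and the function names `simulate` and `plug_in_phi`.

```python
import numpy as np

def simulate(n, rng):
    """Draw n copies of (Z1, Z2, S) from a hypothetical P^0.

    Source 1 shares the target law of Z1 (P(Z1 = 1) = 0.5); source 2 does not.
    Both sources share the target conditional law of Z2 given Z1.
    """
    s = rng.choice([1, 2], size=n, p=[0.3, 0.7])
    z1 = rng.binomial(1, np.where(s == 1, 0.5, 0.8))
    z2 = 1.0 + 2.0 * z1 + rng.normal(size=n)
    return z1, z2, s

def plug_in_phi(z1, z2, s, S1=(1,), S2=(1, 2)):
    """Plug-in estimate of phi(P) = E[ E(Z2 | Z1, S in S2) | S in S1 ]."""
    in_S1, in_S2 = np.isin(s, S1), np.isin(s, S2)
    # Conditional mean of Z2 given Z1 = v, pooled over the sources in S2.
    m = np.array([z2[in_S2 & (z1 == v)].mean() for v in (0, 1)])
    # Average those conditional means over the law of Z1 among sources in S1.
    return float(m[z1[in_S1]].mean())

rng = np.random.default_rng(0)
z1, z2, s = simulate(200_000, rng)
phi_hat = plug_in_phi(z1, z2, s)
target = 1.0 + 2.0 * 0.5    # psi(Q^0) = E(Z2) when P(Z1 = 1) = 0.5
naive = float(z2.mean())    # pooled mean, which ignores the shift in Z1
print(phi_hat, naive, target)
```

In this toy setting the pooled mean targets $1 + 2 \times 0.71 = 2.42$ rather than $ψ (Q^{0}) = 2$, because source 2 over-represents $Z_{1} = 1$. The plug-in corrects for this by borrowing only the conditional law of $Z_{2}$ given $Z_{1}$ from source 2.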

3. Example: longitudinal treatment effect

While some clinical trials focus on evaluating a fixed treatment at a single time point, others involve longitudinal treatments that can vary over time. Let $Z = (U_{1}, A_{1}, \dots, U_{T - 1}, A_{T - 1}, Y)$, where indices denote time, $A_{t}$ denotes the binary treatment at time $t$, and $U_{t}$ denotes the time-varying variable of interest at time $t$. Under this setup, we have $Z_{1} = U_{1}, Z_{2} = A_{1}, Z_{3} = U_{2}, \dots, Z_{2 T - 1} = Y$. We suppose that the final outcome of interest, $Y$, is real-valued. For ease of notation, we let ${\overline{H}}_{t} = (U_{1}, A_{1}, \dots, U_{t})$ for each $t \in [T - 1]$ denote the history up to time $t$. We consider three models $𝒬$ for the unknown target distribution. The first is nonparametric, and consists of all distributions with some common support where treatment assignment satisfies the strong positivity condition that, conditionally on the past, each treatment is assigned with probability bounded away from zero. The second is semiparametric, and supposes that there is some unknown function $g : \prod_{j = 1}^{2 T - 2} 𝒵_{j} \to R$ such that the conditional distribution $Y ∣ {\overline{H}}_{T - 1} = {\overline{h}}_{T - 1}, A_{T - 1} = a_{T - 1}$ is symmetric about $g ({\overline{h}}_{T - 1}, a_{T - 1})$. When considering this semiparametric model, we suppose that, for each $Q \in 𝒬$, the conditional distribution $Q_{2 T - 1}$ has a corresponding conditional density $q_{2 T - 1}$ and that $q_{2 T - 1} (\cdot ∣ {\overline{H}}_{T - 1}, A_{T - 1})$ is almost surely differentiable.
The third model we consider is also semiparametric and imposes that, under sampling from each $Q \in 𝒬, ({\overline{H}}_{T - 1}, A_{T - 1})$ has support in $R^{p}$ and there exists some vector of coefficients $β \in R^{p}$ and error distribution $τ_{α}$ belonging to a regular parametric family $\{τ_{\tilde{α}} : \tilde{α} \in R^{c}\}$ of conditional distributions of a real-valued error $ϵ$ given $({\overline{H}}_{T - 1}, A_{T - 1})$ such that $Y = β^{⊤} κ ({\overline{H}}_{T - 1}, A_{T - 1}) + ϵ$ , where $κ : R^{p} \to R^{c}$ is a known transformation of the history and treatment through time $T - 1$ and $E_{τ_{α}} [ϵ ∣ {\overline{H}}_{T - 1}, A_{T - 1}] = 0$ almost surely.

Hernán & Robins (2020) discuss a range of causal parameters under such a longitudinal setting, including the mean outcome under a dynamic treatment regime, the parameters indexing a marginal structural mean model, and the average treatment effect of always being on treatment versus never being on treatment. For simplicity, here we focus on the last of these parameters. Under causal assumptions (Chapter 19.4 of Hernán & Robins, 2020), this causal effect is identified with $ψ (Q) \equiv E_{Q_{1}} \{L_{1}^{1} ({\overline{H}}_{1})\} - E_{Q_{1}} \{L_{1}^{0} ({\overline{H}}_{1})\}$, where, for $a \in \{0, 1\}$, we define $L_{T}^{a} ({\overline{h}}_{T}) = y$ and, recursively from $t = T - 1, \dots, 1$, define $L_{t}^{a} ({\overline{h}}_{t}) = E_{Q_{2 t + 1}} \{L_{t + 1}^{a} ({\overline{H}}_{t + 1}) ∣ {\overline{h}}_{t}, A_{t} = a\}$. Because $ψ (Q)$ can be written as a function of $Q_{1}, Q_{3}, \dots, Q_{2 T - 1}$, we see that we can take $ℐ = \{2, 4, \dots, 2 T - 2\}$ in this example. This is consistent with the well-known fact that the conditional average treatment effect does not depend on the treatment assignment probabilities, namely the distribution $Q_{2 t}$ of $A_{t} ∣ {\overline{H}}_{t}$ for $t \in [T - 1]$.

We consider the scenario where we obtain data from $k$ sources. Some data sources may contain only information about the target baseline covariate distribution. Others may contain observations from all $T$ time points, but possibly with a baseline covariate distribution that differs from the target distribution; examples include measurements of monthly CD4 count in HIV treatment trials or observational settings. The remaining sources may only contain such measurements up to a time point $t < T$ such that $(U_{1}, A_{1}, \dots, U_{t}, A_{t})$ is observed and $U_{s}$ and $A_{s}$ are missing for all $s > t$. We indicate missingness by writing that $U_{s} = A_{s} = ⋆$ in such cases. Such partial observations may still have valuable information, for example, about how longitudinal CD4 count responds to treatment shortly after the initiation of antiretroviral therapy. Under Condition 1, $ψ (Q^{0})$ can be identified as $ϕ (P^{0}) = E_{P^{0}} \{{\tilde{L}}_{1}^{1} ({\overline{H}}_{1}) ∣ S \in 𝒮_{1}\} - E_{P^{0}} \{{\tilde{L}}_{1}^{0} ({\overline{H}}_{1}) ∣ S \in 𝒮_{1}\}$, where, for $a \in \{0, 1\}$, we define ${\tilde{L}}_{T}^{a} ({\overline{h}}_{T}) = y$ and, recursively from $t = T - 1, \dots, 1$, define ${\tilde{L}}_{t}^{a} ({\overline{h}}_{t}) = E_{P^{0}} \{{\tilde{L}}_{t + 1}^{a} ({\overline{H}}_{t + 1}) ∣ {\overline{h}}_{t}, A_{t} = a, S \in 𝒮_{2 t + 1}\}$. When $T = 1$, this estimand simplifies to the well-studied average treatment effect. Athey et al. (2019) studied a special case of data fusion for inferring the average treatment effect with two post-baseline time points, where a group of surrogate markers are measured post-treatment at $T = 2$ and the primary outcome is measured at $T = 3$. Assumptions 3 and 4 in Athey et al. (2019) match our Conditions 1b and 1a.
In contrast to their setup, ours allows for more time points $(T > 3)$ and data sources $(k > 2)$ , and also for arbitrary specifications of the relevant data fusion sets $𝒮_{1}, 𝒮_{3}, \dots, 𝒮_{2 T - 1}$ , provided they are nonempty.
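As a worked instance of the identification formula in this example, consider the smallest setting in which $Z = (U_{1}, A_{1}, Y)$, so that $d = 3$, $ℐ = \{2\}$, and the relevant fusion sets are $𝒮_{1}$ and $𝒮_{3}$. This special case is chosen here purely for illustration. The recursion then consists of a single step:

```latex
\begin{aligned}
\tilde{L}_{1}^{a}(u_{1}) &= E_{P^{0}}\{ Y \mid U_{1} = u_{1}, A_{1} = a, S \in \mathcal{S}_{3} \}, \qquad a \in \{0, 1\}, \\
\phi(P^{0}) &= E_{P^{0}}\{ \tilde{L}_{1}^{1}(U_{1}) - \tilde{L}_{1}^{0}(U_{1}) \mid S \in \mathcal{S}_{1} \}.
\end{aligned}
```

The outcome regression pools every source in $𝒮_{3}$, whereas the outer average uses only the sources in $𝒮_{1}$, whose baseline covariate distribution matches that of the target population.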

4. Methods

4.1. Review of semiparametric theory

We review some important aspects of semiparametric theory. A more comprehensive review is given in Supplementary Appendix E. An estimator $\hat{ϕ}$ of $ϕ (P)$ is called asymptotically linear with influence function $D_{P}$ if it can be written as $\hat{ϕ} - ϕ (P) = n^{- 1} \sum_{i = 1}^{n} D_{P} (X_{i}) + o_{p} (n^{- 1 / 2})$, where $E_{P} \{D_{P} (X_{i})\} = 0$ and $σ_{P}^{2} \equiv E_{P} \{D_{P} {(X_{i})}^{2}\} < \infty$. One reason such estimators are attractive is that they are consistent and asymptotically normal, in the sense that $\sqrt{n} \{\hat{ϕ} - ϕ (P)\} \overset{d}{\to} N (0, σ_{P}^{2})$ under sampling $n$ independent draws from $P$. This facilitates the construction of confidence intervals and hypothesis tests. If $\hat{ϕ}$ is regular and asymptotically linear at $P$ (Bickel et al., 1993), then $ϕ$ is pathwise differentiable and the influence function $D_{P}$ is a gradient of $ϕ$, in the sense that, for all submodels $\{P^{(ϵ)} : ϵ \in [0, δ)\} \in 𝒫 (P, 𝒫)$ with score $h$, ${\frac{\partial}{\partial ϵ} ϕ (P^{(ϵ)})|}_{ϵ = 0} = E_{P} \{D_{P} (X) h (X)\}$. The tangent set $𝒯 (P, 𝒫)$ of $𝒫$ at $P$ is defined as the set of all scores of submodels in $𝒫 (P, 𝒫)$. The canonical gradient $D_{P}^{*}$ corresponds to the $L_{0}^{2} (P)$-projection of any gradient $D_{P}$ onto the closure of the linear span of scores in $𝒯 (P, 𝒫)$, where $L_{0}^{2} (P)$ is the Hilbert space of $P$-mean-zero, finite-variance functions.

One way to construct a regular asymptotically linear estimator with influence function $D_{P^{0}}$ is through one-step estimation (Bickel, 1982). Given an estimate $\hat{P}$ of $P^{0}$, the one-step estimator is given by $\hat{ϕ} \equiv ϕ (\hat{P}) + \sum_{i = 1}^{n} D_{\hat{P}} (X_{i}) / n$. This estimator will be asymptotically linear with influence function $D_{P^{0}}$ if the remainder term $R (\hat{P}, P^{0}) \equiv ϕ (\hat{P}) - ϕ (P^{0}) + E_{P^{0}} \{D_{\hat{P}} (X)\}$ is $o_{p} (n^{- 1 / 2})$ and the empirical mean of $D_{\hat{P}} (X) - D_{P^{0}} (X)$ is within $o_{p} (n^{- 1 / 2})$ of the mean of this term when $X \sim P^{0}$. The latter of these requirements will hold under an appropriate empirical process condition (Van Der Vaart et al., 1996). Alternative approaches for constructing asymptotically linear estimators can be found in Van der Laan & Rubin (2006), Van der Laan & Robins (2003), and Tsiatis (2006).
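The one-step construction can be sketched on a deliberately simple toy functional. This is an illustration under stated assumptions, not an estimator from the paper: the functional $ϕ (P) = \{E_{P} (X)\}^{2}$, its nonparametric gradient $D_{P} (x) = 2 E_{P} (X) \{x - E_{P} (X)\}$, the fixed offset used as a stand-in for an initial estimate, and the function names `phi`, `gradient`, and `one_step` are all chosen here for illustration.

```python
import numpy as np

def phi(mu):
    """Toy functional phi(P) = (E_P X)^2, written in terms of the mean mu."""
    return mu ** 2

def gradient(x, mu):
    """Nonparametric gradient of phi at a distribution with mean mu."""
    return 2.0 * mu * (x - mu)

def one_step(x, mu_init):
    """Plug-in at the initial estimate plus the empirical mean of its gradient."""
    return phi(mu_init) + np.mean(gradient(x, mu_init))

rng = np.random.default_rng(1)
mu0 = 1.5
x = rng.normal(loc=mu0, scale=1.0, size=5000)

mu_init = mu0 + 0.1                  # stand-in for an initial estimate off by 0.1
plug_in = phi(mu_init)               # first-order bias in the initial error
corrected = one_step(x, mu_init)     # bias reduced to second order
# Remainder R(P_hat, P0) = phi(P_hat) - phi(P0) + E_{P0}{D_{P_hat}(X)}
remainder = phi(mu_init) - phi(mu0) + 2.0 * mu_init * (mu0 - mu_init)
print(plug_in - phi(mu0), corrected - phi(mu0), remainder)
```

For this functional the remainder equals $- (\hat{μ} - μ_{0})^{2}$ exactly, which shows concretely why the remainder is second order in the error of the initial estimate.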

All of the results in this section extend naturally to the case where $ϕ$ is $R^{b}$-valued. In such cases, gradients (respectively, the canonical gradient) of $ϕ$ are $R^{b}$-valued functions whose $i$-th entry, $i \in [b]$, corresponds to a gradient (respectively, the canonical gradient) of the $i$-th coordinate projection of $ϕ$. Estimators can similarly be constructed coordinatewise. Due to this straightforward extension from univariate to $b$-variate settings, the theoretical results in the next subsection focus on the special case where $b = 1$.

4.2. Derivation of canonical gradient of a general target parameter

In this section, we provide approaches for obtaining the canonical gradient of $ϕ$ in the model $𝒫$ implied by $𝒬$ and the data fusion conditions. We will focus on settings where distributions in $𝒬$ can be separately defined via their conditional distributions, so that it is possible to modify a conditional distribution $Q_{j}$ under a distribution $Q \in 𝒬$, not modify any of the other conditional distributions $Q_{j^{'}}, j^{'} \neq j$, and still have it be the case that the resulting distribution belongs to $𝒬$. This condition is formalized in the following.

Condition 2. (Variation independence) There exist sets $𝒬_{j}$ of conditional distributions of $Z_{j} ∣ {\overline{Z}}_{j - 1}, j \in [d]$ , such that $𝒬$ is equal to the set of all distributions $Q$ such that, for all $j \in [d]$ , the conditional distribution $Q_{j}$ belongs to $𝒬_{j}$ .

The above condition is satisfied by the model $𝒬$ described in each of our examples. It is also satisfied in many other interesting semiparametric examples, such as those where $E_{Q^{0}} (Z_{1})$ or $E_{Q^{0}} (Z_{2} ∣ Z_{1})$ is known, but is not satisfied in some others, such as in cases where $E_{Q^{0}} (Z_{2})$ is known — in this case, knowing the marginal distribution $Q_{1}$ restricts the values that the conditional distribution $Q_{2}$ can take.

The upcoming results will provide forms of gradients of $ϕ$ at $P^{0}$ in terms of gradients of $ψ$ at a generic distribution ${\underline{Q}}^{0} \in 𝒬 (P^{0})$ , where we recall that $𝒬 (P^{0})$ is the set of distributions in $𝒬$ whose relevant conditional distributions align with $P^{0}$ . Since $Q^{0} \in 𝒬 (P^{0})$ , all of these results are valid when ${\underline{Q}}^{0} = Q^{0}$ . However, since the distribution $Q^{0}$ is not generally identifiable from $P^{0}$ — indeed, there may be no alignment for conditional distributions irrelevant to $ψ$ — the particular value of $Q^{0}$ may be unknowable even given infinite data. In contrast, the set $𝒬 (P^{0})$ would be knowable in such a setting. We, therefore, allow for the specification of an arbitrary distribution from the identifiable set $𝒬 (P^{0})$ — for example, it is always possible to take ${\underline{Q}}^{0} = θ (P^{0})$ , whose value depends only on $P^{0}$ .

We require a strong overlap condition on the chosen ${\underline{Q}}^{0} \in 𝒬 (P^{0})$ and the relevant conditional distributions of $P^{0}$ . If ${\underline{Q}}^{0} = Q^{0}$ , then the upcoming condition strengthens the overlap condition (Condition 1a) that was used to establish identifiability. To state this condition, for each $j \in 𝒥$ , we let $λ_{j - 1}$ denote the Radon-Nikodym derivative of the marginal distribution of ${\overline{Z}}_{j - 1}$ under sampling from ${\underline{Q}}^{0}$ relative to the conditional distribution of ${\overline{Z}}_{j - 1} ∣ S \in 𝒮_{j}$ under sampling from $P^{0}$ .

Condition 3. (Strong overlap) For each $j \in 𝒥$ , there exists $c_{j - 1} > 0$ such that ${\underline{Q}}^{0} \{c_{j - 1}^{- 1} \leq λ_{j - 1} ({\overline{Z}}_{j - 1}) \leq c_{j - 1}\} = 1$ .

Since ${\overline{Z}}_{0} = \emptyset$ almost surely under both of these distributions, $λ_{j - 1}$ is the constant function that returns 1 when $j = 1$. Hence, the above condition only imposes a nontrivial requirement on $j \in 𝒥 ∖ \{1\}$.

In the upcoming results, we suppose that the tangent set $𝒯 ({\underline{Q}}^{0}, 𝒬)$ of $𝒬$ at ${\underline{Q}}^{0}$ is a closed linear subspace of $L_{0}^{2} ({\underline{Q}}^{0})$. We can therefore refer to $𝒯 ({\underline{Q}}^{0}, 𝒬)$ as the tangent space without causing any confusion. The following lemma shows that, under the above conditions and the earlier stated data fusion condition, the pathwise differentiability of $ψ$ is equivalent to the pathwise differentiability of $ϕ$.

Lemma 1. Suppose that Conditions 1, 2, and 3 hold. Under these conditions, $ψ$ is pathwise differentiable at $Q^{0}$ relative to $𝒬$ if and only if $ϕ$ is pathwise differentiable at $P^{0}$ relative to $𝒫$ .

The next result provides a means to derive gradients of $ϕ$ .

Theorem 2. Suppose that Conditions 1, 2, and 3 hold and that $ψ$ is pathwise differentiable at ${\underline{Q}}^{0}$ relative to $𝒬$ with gradient $D_{{\underline{Q}}^{0}}$. Under these conditions, the following function is a gradient of $ϕ$ at $P^{0}$ relative to $𝒫$:

$$D_{P^{0}} (z, s) \equiv \sum_{j \in 𝒥} \frac{1 (s \in 𝒮_{j})}{P^{0} (S \in 𝒮_{j})} λ_{j - 1} ({\overline{z}}_{j - 1}) D_{{\underline{Q}}^{0}, j} ({\overline{z}}_{j}), \qquad (1)$$

where $D_{{\underline{Q}}^{0}, j} ({\overline{z}}_{j}) \equiv E_{{\underline{Q}}^{0}} \{D_{{\underline{Q}}^{0}} (Z) ∣ {\overline{Z}}_{j} = {\overline{z}}_{j}\} - E_{{\underline{Q}}^{0}} \{D_{{\underline{Q}}^{0}} (Z) ∣ {\overline{Z}}_{j - 1} = {\overline{z}}_{j - 1}\}$ .
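To see how the pieces of this gradient fit together, the sketch below evaluates it in a toy problem and uses it for a Wald-type interval. The setting is an assumption made for illustration and is not taken from the paper: $d = 2$, a binary $Z_{1}$, $ψ (Q) = E_{Q} (Z_{2})$ in a nonparametric model (so that $D_{Q} (z) = z_{2} - ψ (Q)$), fusion sets $𝒮_{1} = \{1\}$ and $𝒮_{2} = \{1, 2\}$, simulated data, and the function name `fused_gradient`.

```python
import numpy as np

def fused_gradient(z1, z2, s, S1=(1,), S2=(1, 2)):
    """Evaluate the Theorem 2 gradient at the empirical plug-in, psi(Q) = E_Q(Z2).

    With m(z1) = E_Q(Z2 | Z1 = z1): D_{Q,1}(z1) = m(z1) - psi(Q) and
    D_{Q,2}(z) = z2 - m(z1). Z1 must be an integer array coded 0/1.
    """
    in_S1, in_S2 = np.isin(s, S1), np.isin(s, S2)
    # m(v): mean of Z2 given Z1 = v, pooled over the sources in S2.
    m = np.array([z2[in_S2 & (z1 == v)].mean() for v in (0, 1)])
    # Law of Z1 under Q (learned from S1) and under P given S in S2.
    q1 = np.array([np.mean(z1[in_S1] == v) for v in (0, 1)])
    p1 = np.array([np.mean(z1[in_S2] == v) for v in (0, 1)])
    lam = q1 / p1                        # lambda_1; lambda_0 is identically 1
    psi = float(q1 @ m)                  # plug-in value of phi(P)
    term1 = in_S1 / in_S1.mean() * (m[z1] - psi)            # j = 1
    term2 = in_S2 / in_S2.mean() * lam[z1] * (z2 - m[z1])   # j = 2
    return psi, term1 + term2

# Hypothetical data: source 1 matches the target law of Z1, source 2 does not;
# both share the target conditional law of Z2 given Z1.
rng = np.random.default_rng(2)
n = 100_000
s = rng.choice([1, 2], size=n, p=[0.3, 0.7])
z1 = rng.binomial(1, np.where(s == 1, 0.5, 0.8))
z2 = 1.0 + 2.0 * z1 + rng.normal(size=n)

psi_hat, grad = fused_gradient(z1, z2, s)
se = grad.std(ddof=1) / np.sqrt(n)       # standard error from the gradient
ci = (psi_hat - 1.96 * se, psi_hat + 1.96 * se)
print(psi_hat, se, ci)
```

In this saturated discrete setting the empirical mean of the estimated gradient is zero up to rounding, so the one-step correction vanishes and the plug-in coincides with the one-step estimator. With continuous covariates or smoothed nuisance estimates, the correction term would generally be nonzero.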

Given any gradient of $ϕ$, the canonical gradient can be derived by projecting that gradient onto the tangent space of $𝒫$ at $P^{0}$. The form of this projection is provided in Lemma S4 in Supplementary Appendix A.2. Applying this projection to a gradient of the form in (1) provides a form for the canonical gradient. In what follows we use $Π_{{\underline{Q}}^{0}} \{\cdot ∣ 𝒜\}$ to denote the $L_{0}^{2} ({\underline{Q}}^{0})$-projection operator onto a subspace $𝒜$ of $L_{0}^{2} ({\underline{Q}}^{0})$.

Corollary 1. Under the conditions of Theorem 2, the canonical gradient of $ϕ$ relative to $𝒫$ is

$$D_{P^{0}}^{*} (z, s) \equiv \sum_{j \in 𝒥} 1 ({\overline{z}}_{j - 1} \in {\bar{𝒵}}_{j - 1}^{†}) \frac{1 (s \in 𝒮_{j})}{P^{0} (S \in 𝒮_{j})} Π_{{\underline{Q}}^{0}} \{r_{j} ∣ 𝒯 ({\underline{Q}}^{0}, 𝒬)\} ({\overline{z}}_{j}), \qquad (2)$$

where ${\bar{𝒵}}_{j - 1}^{†}$ denotes the support of ${\overline{Z}}_{j - 1}$ under sampling from ${\underline{Q}}^{0}$ and $r_{j} \in L_{0}^{2} ({\underline{Q}}_{j}^{0})$ is such that $r_{j} ({\overline{z}}_{j}) = λ_{j - 1} ({\overline{z}}_{j - 1}) D_{{\underline{Q}}^{0}, j} ({\overline{z}}_{j})$.

Previous data fusion works mainly focus on particular parameters in nonparametric models (Stuart et al., 2015; Rudolph & van der Laan, 2017; Dahabreh et al., 2019; Kallus et al., 2020). Corollary 1 generalizes existing results to arbitrary parameters in both nonparametric and semiparametric models. We propose to construct a one-step estimator using the canonical gradient derived using the procedure above. Under regularity conditions (outlined in Section 4.1), the resulting estimator will be efficient among all regular and asymptotically linear estimators.

Because the canonical gradient is unique, the right-hand side of (2) will be the same regardless of the chosen value of ${\underline{Q}}^{0} \in 𝒬 (P^{0})$ that satisfies Condition 3. However, the calculations required to simplify that expression may differ. Indeed, that expression depends on ${\underline{Q}}^{0}$ through the definitions of $r_{j}$ and through the projection operator $Π_{{\underline{Q}}^{0}} \{\cdot ∣ 𝒯 ({\underline{Q}}^{0}, 𝒬)\}$ .

Computing the projection in (2) may be challenging in some semiparametric models, though there is substantial existing work providing the form of this projection in a variety of interesting examples (Pfanzagl, 1990; Bickel et al., 1993; Van der Laan & Robins, 2003; Tsiatis, 2006). In contrast, computing this projection is necessarily trivial when $𝒬$ is locally nonparametric, since in this case the tangent space of $𝒬$ at ${\underline{Q}}^{0}$ is equal to $L_{0}^{2} ({\underline{Q}}^{0})$ and the projection operator is the identity operator. Hence, in this special case, (1) and (2) are equal, and applying Theorem 2 to the one gradient $D_{{\underline{Q}}^{0}}$ for $ψ$ that can possibly exist relative to a locally nonparametric model for $𝒬$ necessarily yields the canonical gradient relative to $𝒫$. In semiparametric models, where there is more than one possible initial candidate gradient $D_{{\underline{Q}}^{0}}$ to plug into (1), it is natural to wonder whether there is any such candidate for which (1) and (2) coincide. This will fail to hold unless there is a gradient $D_{{\underline{Q}}^{0}}$ of $ψ$ for which $z \mapsto λ_{j - 1} ({\overline{z}}_{j - 1}) D_{{\underline{Q}}^{0}, j} ({\overline{z}}_{j})$ belongs to $𝒯 ({\underline{Q}}^{0}, 𝒬)$ for all $j \in 𝒥$, and such a gradient does not exist in general. One case where this typically fails occurs in a model where it is known that $Z_{j} ⫫ {\overline{Z}}_{j - 2} ∣ Z_{j - 1}$ for some $j \in 𝒥$.

In certain cases, there is a close connection between data fusion problems and missing data problems. In particular, if the fusion sets are nested such that $𝒮_{d} \subseteq 𝒮_{d - 1} \subseteq \dots \subseteq 𝒮_{1}$ with $𝒮_{d}$ nonempty, then the fusion framework can be framed as a monotone missing data problem. In this problem, the full data structure $Z^{F}$ is drawn from $Q^{0}$. The observed variable $Z$ is such that $Z_{j} = Z_{j}^{F}$ for all $j$ such that $S \in 𝒮_{j}$ and $Z_{j}$ is coded as missing $(Z_{j} = ⋆)$ otherwise. Under Condition 1, this results in a missing data problem that satisfies coarsening at random (Heitjan & Rubin, 1991). Existing works provide general approaches to derive canonical gradients in these models and, when the fusion sets are nested, our results emerge as a special case of them. For example, our Theorem 2 can be derived as a special case of Theorem 1.1 of Van der Laan & Robins (2003) in this case. Moreover, the canonical gradient presented in Corollary 1 can be derived using a general strategy for deriving efficient estimators in monotone missing data problems (Robins et al., 1994; Tsiatis, 2006). To ease notation, when illustrating this we focus on the case where $𝒥 = [d]$ and the support of ${\overline{Z}}_{j - 1} ∣ S \in 𝒮_{j}$ under $P^{0}$ is equal to ${\bar{𝒵}}_{j - 1}^{†}$. Letting $C = \max \{j : S \in 𝒮_{j}\}$, with the maximum over the empty set being equal to 0, it can be verified that the canonical gradient $D_{P^{0}}^{*}$ from Corollary 1 satisfies

D_{P^{0}}^{*} (z, s) = \frac{1 (c = d)}{P^{0} (C \geq d)} \sum_{j = 1}^{d} Π_{{\underline{Q}}^{0}} \{r_{j} ∣ 𝒯 ({\underline{Q}}^{0}, 𝒬)\} ({\overline{z}}_{j}) + \sum_{ℓ = 1}^{d - 1} \frac{1 (c = ℓ) - 1 (c \geq ℓ) P^{0} (C = ℓ ∣ C \geq ℓ)}{P^{0} (C \geq ℓ + 1)} \sum_{j = 1}^{ℓ} Π_{{\underline{Q}}^{0}} \{r_{j} ∣ 𝒯 ({\underline{Q}}^{0}, 𝒬)\} ({\overline{z}}_{j}) .

Noting that $f_{{\underline{Q}}^{0}} (z) := \sum_{j = 1}^{d} Π_{{\underline{Q}}^{0}} \{r_{j} ∣ 𝒯 ({\underline{Q}}^{0}, 𝒬)\} ({\overline{z}}_{j})$ is a gradient of $ψ$ at ${\underline{Q}}^{0}$ relative to $𝒬$ , the above identity shows that $D_{P^{0}}^{*}$ is the influence function of an augmented inverse probability weighted complete-case estimator derived from the full-data influence function $f_{{\underline{Q}}^{0}}$ (see Theorem 10.1 of Tsiatis, 2006). Consequently, the form of this canonical gradient could be identified or numerically approximated using the general strategies outlined in Chapter 11 of Tsiatis (2006). As is noted in Tsiatis (2006), the strategies outlined in that chapter are quite complex, often involving some form of numerical approximation, and so “the actual implementation needs to be considered on a case-by-case basis.” Corollary 1 demonstrates that, in the special case of data fusion with nested fusion sets, a simple form for the canonical gradient is available. Moreover, this gradient can be computed explicitly whenever the projections used to define it have a known form, which will typically be the case when a closed-form efficient estimator for $ψ (Q^{0})$ is known in the non-data fusion setting where a random sample from $Q^{0}$ is directly observed.
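The correspondence between nested fusion sets and monotone missingness can be made concrete with a small sketch. The Python snippet below checks whether a list of fusion sets is nested and computes $C = \max \{j : S \in 𝒮_{j}\}$ for a given data source. The function names and the four-source example are hypothetical and serve only as an illustration.

```python
from typing import List, Set


def is_nested(fusion_sets: List[Set[int]]) -> bool:
    """Check S_d ⊆ S_{d-1} ⊆ ... ⊆ S_1 with S_d nonempty.

    fusion_sets[j - 1] holds the data sources in S_j, for j = 1, ..., d.
    """
    if not fusion_sets or not fusion_sets[-1]:
        return False
    return all(later <= earlier
               for earlier, later in zip(fusion_sets, fusion_sets[1:]))


def coarsening_level(source: int, fusion_sets: List[Set[int]]) -> int:
    """C = max{j : source in S_j}; the maximum over the empty set is 0."""
    levels = [j for j, s_j in enumerate(fusion_sets, start=1) if source in s_j]
    return max(levels, default=0)


# Hypothetical example with d = 3 components and data sources labelled 1-4.
example_sets = [{1, 2, 3}, {1, 2}, {1}]
example_levels = {s: coarsening_level(s, example_sets) for s in (1, 2, 3, 4)}
```

When the sets are nested, an observation from a source with $C = c$ contributes exactly the components $Z_{1}, \dots, Z_{c}$, which is the monotone missingness pattern described above.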

In settings where the fusion sets are not nested, it is not clear that it is even possible to write our data fusion problem as a missing data problem that satisfies coarsening at random. Hence, Corollary 1 presents what is, to our knowledge, the first general strategy for deriving the canonical gradient in data fusion problems. We conclude by noting that non-nested cases are far from the exception. Indeed, many previously studied data fusion problems involve non-nested fusion sets (e.g., Rudolph & van der Laan, 2017; Westreich et al., 2017; Dahabreh & Hernán, 2019; Kallus et al., 2020).

4.3. Canonical Gradient in Our Example

We now derive the canonical gradient of the longitudinal treatment effect under each of the three models described in Section 3. An initial gradient $D_{Q^{0}}$ to plug into Theorem 2 can be found in Theorem 1 of van der Laan & Gruber (2012) (see also Bang & Robins, 2005). Following the notation introduced in Section 3 and the results from Corollary 1, we can use this initial gradient to show that the canonical gradient of $ϕ$ under a locally nonparametric model is $D_{P^{0}} (x) = D_{P^{0}}^{1} (x) - D_{P^{0}}^{0} (x)$ , where, for $a^{'} \in \{0,1\}$ and letting ${\bar{𝒰}}_{t}^{†}$ denote the support of ${\overline{U}}_{t}$ under sampling from $Q^{0}$ , $D_{P^{0}}^{a^{'}} = \sum_{t = 1}^{T} D_{P_{2 t - 1}^{0}}^{a^{'}}$ with

D_{P_{2 t - 1}^{0}}^{a^{'}} (x) \equiv 1 ({\overline{u}}_{t - 1} \in {\bar{𝒰}}_{t - 1}^{†}) \frac{1 (s \in 𝒮_{2 t - 1})}{pr (S \in 𝒮_{2 t - 1})} \{\prod_{m = 1}^{t - 1} \frac{1 (a_{m} = a^{'})}{pr (A_{m} = a^{'} ∣ {\overline{u}}_{m}, {\overline{A}}_{m - 1} = a^{'}, S \in 𝒮_{2 t - 1})}\} \cdot \{\prod_{m = 1}^{t - 1} \frac{d P^{0} (u_{m} ∣ {\overline{u}}_{m - 1}, {\overline{A}}_{m - 1} = a^{'}, S \in 𝒮_{2 m - 1})}{d P^{0} (u_{m} ∣ {\overline{u}}_{m - 1}, {\overline{A}}_{m - 1} = a^{'}, S \in 𝒮_{2 t - 1})}\} \{{\tilde{L}}_{t}^{a^{'}} ({\overline{h}}_{t}, s) - {\tilde{L}}_{t - 1}^{a^{'}} ({\overline{h}}_{t - 1}, s)\},

(3)

where we abuse notation and write ${\overline{A}}_{m - 1} = a^{'}$ to mean that $A_{j} = a^{'}$ for all $j \in [m - 1]$ . Compared to the canonical gradient $D_{Q^{0}}$ under no data fusion, the above uses all available data sources that are valid, as reflected by the indicator terms, and then corrects for shifts in the distribution of the variables between the target population and the observed data sources, as reflected by the first two terms inside the curly brackets.

For the symmetric location semiparametric model described in Section 3, we provide the canonical gradient of $ϕ$ in Supplementary Appendix C. For the linear semiparametric model described in Section 3, the canonical gradient of $ϕ$ takes the form of $D_{P^{0}}^{†} = D_{P^{0}}^{† 1} - D_{P^{0}}^{† 0}$ , where

D_{P^{0}}^{† a^{'}} (x) = D_{P^{0}}^{a^{'}} (x) - D_{P_{2 T - 1}^{0}}^{a^{'}} (x) + 1 ({\overline{u}}_{T - 1} \in {\bar{𝒰}}_{T - 1}^{†}) \frac{1 (s \in 𝒮_{2 T - 1})}{P^{0} (S \in 𝒮_{2 T - 1})} \cdot {[E_{{\underline{Q}}^{0}} {\{ℓ (X) ℓ (X)^{⊤}\}}^{- 1} E_{{\underline{Q}}^{0}} \{ℓ (X) λ_{2 T - 2} ({\overline{H}}_{T - 1}, A_{T - 1}) D_{{\underline{Q}}_{2 T - 1}^{0}}^{a^{'}} (X)\}]}^{⊤} ℓ (x),

(4)

where ${\underline{Q}}^{0}$ is a generic element of $𝒬 (P^{0})$ , $ℓ = (ℓ_{β}, ℓ_{α})$ with $ℓ_{β} (z) \equiv \nabla_{β} \log {\tilde{τ}}_{α} \{y - β^{⊤} κ ({\overline{h}}_{T - 1}, a_{T - 1}) ∣ {\overline{h}}_{T - 1}, a_{T - 1}\}$ and $ℓ_{α} (z) \equiv \nabla_{α} \log {\tilde{τ}}_{α} \{y - β^{⊤} κ ({\overline{h}}_{T - 1}, a_{T - 1}) ∣ {\overline{h}}_{T - 1}, a_{T - 1}\}$ , and ${\tilde{τ}}_{α}$ denotes the conditional density function of the error distribution $τ_{α}$ . The key distinction between (4) and (3) lies in the subspace $𝒯 ({\underline{Q}}_{2 T - 1}^{0}, 𝒬_{2 T - 1})$ , where the projection used in (4) can be derived via a score function argument for finite-dimensional parameters. A detailed roadmap is given in Supplementary Appendix C.

5. Simulation and data illustration

5.1. Simulation

We simulated $k = 9$ data sources with $T = 4$ and fixed sample sizes as specified in Table S1. The variable ${\overline{U}}_{3}$ of the target population follows a multivariate normal distribution as specified in Table S2, while the treatments ${\overline{A}}_{3}$ are independent Bernoulli(0.5). The outcome variable $Y$ is such that $Y = β^{⊤} κ ({\overline{H}}_{3}, A_{3}) + ϵ$ and the heteroskedastic error $ϵ$ satisfies $ϵ ∣ {\overline{H}}_{3} = {\overline{h}}_{3}, A_{3} = a_{3} \sim τ_{α} (\cdot ∣ {\overline{h}}_{3}, a_{3})$ , where, for $\tilde{α} > 0$ , $τ_{\tilde{α}} (\cdot ∣ {\overline{h}}_{3}, a_{3})$ denotes the distribution of $\tilde{α} u_{3}$ times a random variable following a Student’s t-distribution with 3 degrees of freedom. The indexing parameter $α$ equals 0.1 and the values of $β$ and the form of $κ$ are specified in Supplementary Appendix G. The underlying true distribution belongs to all three models mentioned in Section 3, where, for the second semiparametric model, the error distribution is known to belong to $\{τ_{\tilde{α}} : \tilde{α} > 0\}$ . We evaluated the treatment effect of always being on treatment versus never being on treatment. Under this setup, data source 9 aligns perfectly with the target population distribution and it is possible to provide valid inferences for $ψ (Q^{0})$ using this data source alone. We compared three one-step estimators, each constructed from the canonical gradient under one of these models, under three scenarios: (1) no data fusion with $𝒮_{7} = 𝒮_{5} = 𝒮_{3} = 𝒮_{1} = \{9\}$ , (2) partial data fusion with $𝒮_{7} = \{6,9\}$ , $𝒮_{5} = \{5,9\}$ , $𝒮_{3} = \{3,9\}$ and $𝒮_{1} = \{1,3,9\}$ , and (3) complete data fusion with $𝒮_{7} = \{6,8,9\}$ , $𝒮_{5} = \{5,6,8,9\}$ , $𝒮_{3} = \{3,5,7,9\}$ and $𝒮_{1} = \{1,3,9\}$ .
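The error mechanism described above can be sketched in a few lines of Python. The snippet below draws $ϵ$ as $\tilde{α} u_{3}$ times a Student's t variable with 3 degrees of freedom. It is an illustration only: it treats $u_{3}$ as a scalar, which is an assumption, the function names are hypothetical, and it is not the authors' simulation code.

```python
import math
import random


def draw_t3(rng: random.Random) -> float:
    """One draw from Student's t with 3 df, built as N(0,1) / sqrt(chi2_3 / 3)."""
    z = rng.gauss(0.0, 1.0)
    # A chi-square variable with 3 df is Gamma(shape=3/2, scale=2).
    chi2 = rng.gammavariate(1.5, 2.0)
    return z / math.sqrt(chi2 / 3.0)


def draw_error(u3: float, alpha: float, rng: random.Random) -> float:
    """Heteroskedastic error: alpha * u3 times a t_3 variable (u3 assumed scalar)."""
    return alpha * u3 * draw_t3(rng)


# Example: 1000 errors at alpha = 0.1 for a hypothetical covariate value u3 = 2.0.
rng = random.Random(2023)
errors = [draw_error(2.0, 0.1, rng) for _ in range(1000)]
```

Because the scale of the error is proportional to $u_{3}$, observations with covariate values near zero are nearly noise-free, while those with large $|u_{3}|$ are noisy and heavy-tailed.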

The outcome regressions and the propensity scores were estimated via SuperLearner (Van der Laan et al., 2007) with a library containing a generalized linear model with interaction terms and a generalized additive model under their default settings in the SuperLearner R package (Polley & Van Der Laan, 2010). Each density in the density ratios in (3) was estimated via kernel density estimation using a normal scale bandwidth. Details on the estimation of the conditional density of the regression error in the semiparametric model where this density is known to be symmetric are given in Supplementary Appendix G. For the other semiparametric model considered, we evaluated the scores $ℓ = (ℓ_{β}, ℓ_{α})$ in (4) numerically via a finite difference approximation. For each simulation study presented in this work, 1000 Monte Carlo replications were conducted.
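A finite difference approximation to the scores can be sketched as follows. The text does not state which difference scheme was used, so the central difference and the step size here are assumptions, as is the treatment of $u_{3}$ as a scalar. The function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def log_t3_density(x: float) -> float:
    """Log density of Student's t with 3 degrees of freedom."""
    log_const = math.lgamma(2.0) - math.lgamma(1.5) - 0.5 * math.log(3.0 * math.pi)
    return log_const - 2.0 * math.log1p(x * x / 3.0)


def log_error_density(theta: Sequence[float], y: float,
                      kappa: Sequence[float], u3: float) -> float:
    """Log density of y - beta'kappa when the error is alpha * u3 * t_3.

    theta stacks (beta_1, ..., beta_p, alpha); u3 is treated as a nonzero scalar.
    """
    *beta, alpha = theta
    resid = y - sum(b * k for b, k in zip(beta, kappa))
    scale = alpha * abs(u3)
    return log_t3_density(resid / scale) - math.log(scale)


def central_difference_gradient(f: Callable[[List[float]], float],
                                theta: Sequence[float],
                                h: float = 1e-5) -> List[float]:
    """Approximate the gradient of f at theta, one coordinate at a time."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad
```

For example, `central_difference_gradient(lambda th: log_error_density(th, y, kappa, u3), theta)` returns the approximate score vector, with the first entries corresponding to $ℓ_{β}$ and the last to $ℓ_{α}$. For this particular error family the scores are also available in closed form, which gives a convenient check on the numerical approximation.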

Using data fusion yielded efficiency gains of around 10% and 20% for partial and complete fusion, respectively, in the nonparametric case. Compared to the nonparametric estimator that was constructed using only data source 9, the semiparametric estimators gained approximately 40%, 50%, and 60% efficiency under no, partial, and complete data fusion, respectively. Coverage was near nominal for all estimators (93%–98%), and the widths of the intervals decreased along the same lines as the mean squared error did, with more data fusion and more restrictive statistical models each leading to tighter intervals (Table 1).

Table 1:

Bias, variance and coverage of the estimated longitudinal treatment effect.

	No Data Fusion			Partial Data Fusion			Complete Data Fusion
	Bias	Var	Cov %	Bias	Var	Cov %	Bias	Var	Cov %
Nonparametric	−0.003	0.170	97	−0.004	0.154	98	0.003	0.137	98
Symmetric	−0.004	0.106	93	0.064	0.087	94	0.049	0.076	94
Linear	−0.012	0.101	96	0.066	0.079	95	0.053	0.064	95
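The approximate efficiency gains quoted in the text can be recovered from the variances in Table 1. The sketch below measures the gain as the relative reduction in variance against the nonparametric estimator without data fusion. This definition is an assumption, since the text refers to mean squared error and does not give a formula.

```python
# Variances copied from Table 1.
variance = {
    "Nonparametric": {"none": 0.170, "partial": 0.154, "complete": 0.137},
    "Symmetric": {"none": 0.106, "partial": 0.087, "complete": 0.076},
    "Linear": {"none": 0.101, "partial": 0.079, "complete": 0.064},
}


def variance_reduction(v: float, baseline: float) -> float:
    """Relative reduction in variance compared with a baseline estimator."""
    return 1.0 - v / baseline


baseline = variance["Nonparametric"]["none"]
gains = {
    model: {scenario: variance_reduction(v, baseline) for scenario, v in row.items()}
    for model, row in variance.items()
}

for model, row in gains.items():
    print(model, {scenario: f"{100 * g:.0f}%" for scenario, g in row.items()})
```

Under this definition, the nonparametric reductions come out at roughly 9% and 19%, and the semiparametric reductions range from roughly 38% to 62%, in line with the rounded figures in the text.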


5.2. Data illustration: evaluating the immunogenicity of an HIV vaccine

The STEP study and the Phambili study were two phase IIb trials that evaluated the safety and efficacy of the same HIV vaccine regimen in different populations (Buchbinder et al., 2008; Gray et al., 2011). The STEP study was conducted at 34 sites on multiple continents, whereas the Phambili study tested the same vaccine at 5 sites in South Africa. Although the two studies enrolled in distinct geographic areas, there was overlap in the overall demographic characteristics (Table S3). Both studies suggested that the vaccine did not prevent HIV-1 infection, although most vaccinees developed an HIV-specific immune response to Clade B as measured by interferon-$γ$ ELISpot.

We studied the extent to which baseline covariates predict HIV-specific immune responses. This is important for developing baseline immunogenicity predictors that can be used in subsequent analyses to assess stratified effects (Follmann, 2006). These predictors can subsequently be used to bridge, that is, transport, vaccine efficacy to new settings (Qin et al., 2007; Rolland & Gilbert, 2012; Huang et al., 2013). Specifically, we assessed the best achievable area under the receiver operating characteristic curve for predicting each of the three HIV-specific immune responses using age, sex, body mass index and baseline adenovirus serotype-5 positivity as covariates. This measure assesses the extent to which these baseline covariates predict immune responses (Williamson et al., 2021). We separately treated each of the study populations from the two studies as the target population and compared the estimates obtained using a single dataset with those obtained using both datasets.

The Phambili ELISpot data consist of measurements of Gag, Nef, and Pol immune responses for 93 vaccinees, and the STEP immunogenicity data consist of measurements for 722 vaccinees and 257 placebo participants. The two trials used different sampling schemes. In Phambili, the immunogenicity assessment was conducted on the first 93 vaccinees who were HIV-1 antibody negative at the week 12 visit and had received the second injection (Gray et al., 2011). In STEP, a two-phase sampling scheme was adopted to oversample HIV cases (Huang et al., 2014). To account for this two-phase sampling scheme, we weighted the STEP data by the inverse probability of being sampled given infection status and treatment group. We aimed to evaluate the three HIV-specific immune responses for the vaccine group, namely Gag, Nef, and Pol to clade-B. We used the same criteria as Huang et al. (2014) for defining a positive immune response. Regardless of the target population, we assumed that the conditional distribution of immune response in the vaccine group given baseline covariates is the same in STEP and Phambili (Condition 1b). These baseline covariates consisted of baseline adenovirus serotype-5 positivity along with age, body mass index, sex, and circumcision status. We combined sex and circumcision status into a single 3-level categorical variable to differentiate uncircumcised men from circumcised men and women. Data from the HVTN 204 phase II trial support the plausibility of Condition 1b. In particular, they suggest that HIV-specific immune response profiles do not differ by geographic region, whereas baseline adenovirus serotype-5 neutralizing antibodies are strongly associated with HIV-specific immune responses (Churchyard et al., 2011). To examine whether Condition 1b is reasonable, we followed the approach proposed by Luedtke et al. (2019) and found no strong evidence against it (Gag: p=0.31; Nef: p=0.46; Pol: p=0.98). In addition, we observed reasonable overlap between the distributions of covariates (Table S3 and Figure S1).
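The weighting step for the two-phase sample can be illustrated with a short sketch. The snippet below assigns each sampled participant the inverse of the empirical sampling fraction within their stratum, where a stratum is a combination of infection status and treatment group. Using empirical fractions rather than known design probabilities is an assumption, and the data shown are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def inverse_sampling_weights(strata: Sequence[Hashable],
                             sampled: Sequence[bool]) -> List[float]:
    """Inverse-probability-of-sampling weights from stratum sampling fractions.

    Sampled participants get (stratum size) / (number sampled in stratum);
    participants who were not sampled get weight 0.
    """
    totals = Counter(strata)
    n_sampled = Counter(s for s, keep in zip(strata, sampled) if keep)
    return [totals[s] / n_sampled[s] if keep else 0.0
            for s, keep in zip(strata, sampled)]


# Hypothetical phase-one cohort: all cases sampled, half of the non-cases sampled.
strata = [("case", "vaccine")] * 2 + [("non-case", "vaccine")] * 4
sampled = [True, True, True, False, False, True]
weights = inverse_sampling_weights(strata, sampled)
```

In this toy cohort the two cases receive weight 1 and the two sampled non-cases receive weight 2, so the weights sum to the size of the full cohort.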

We performed 4-fold cross-fitting and trained our immune response prediction algorithm and other nuisance parameters using SuperLearner (Van der Laan et al., 2007; Polley & Van Der Laan, 2010) with a library containing a simple mean, random forest, gradient boosting, generalized additive model, and elastic net. Results are presented in Table 2. While these baseline covariates predict immune responses reasonably well for the STEP population (area under the curve: 0.64–0.72), they have poor predictive performance for the Phambili population (0.55–0.57). While estimators that make use of data fusion gave estimates that were close to those of the estimators that only used data from one trial, the corresponding standard errors were reduced by more than 30% for each immune response when data from Phambili were augmented with the STEP study data. The proposed methods also brought efficiency gains when data from STEP were augmented, though these gains were more modest due to the smaller sample size of the Phambili study.
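The cross-fitting scheme can be sketched generically as follows: each observation's nuisance prediction comes from a model trained on the other folds. In this sketch a sample mean stands in for the SuperLearner library, and the fold assignment, function names, and seed are hypothetical choices rather than the authors' implementation.

```python
import random
from typing import Callable, List, Sequence


def make_folds(n: int, k: int = 4, seed: int = 0) -> List[List[int]]:
    """Randomly partition indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[i::k]) for i in range(k)]


def fit_mean(x_train: Sequence[float],
             y_train: Sequence[float]) -> Callable[[float], float]:
    """Stand-in learner: predicts the training mean regardless of x."""
    mean = sum(y_train) / len(y_train)
    return lambda x_new: mean


def cross_fit_predictions(x: Sequence[float], y: Sequence[float],
                          fit: Callable = fit_mean,
                          k: int = 4, seed: int = 0) -> List[float]:
    """Predict each observation with a model fitted on the other k - 1 folds."""
    n = len(y)
    out = [0.0] * n
    for fold in make_folds(n, k, seed):
        held_out = set(fold)
        train_idx = [i for i in range(n) if i not in held_out]
        predict = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        for i in fold:
            out[i] = predict(x[i])
    return out
```

Any learner that takes training covariates and outcomes and returns a prediction function can be passed as `fit` in place of the mean.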

Table 2:

Estimated area under the receiver operating characteristic curve using the STEP and Phambili study data. Estimation results are presented as estimates (standard errors).

	Augmenting STEP		Augmenting Phambili
	STEP only (N=979)	Both (N=1072)	Phambili only (N=93)	Both (N=1072)
Gag	0.718 (0.017)	0.686 (0.016)	0.574 (0.036)	0.574 (0.019)
Nef	0.654 (0.017)	0.640 (0.016)	0.547 (0.031)	0.550 (0.017)
Pol	0.704 (0.017)	0.717 (0.016)	0.546 (0.032)	0.550 (0.015)
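The standard error reductions implied by Table 2 can be checked with a few lines of arithmetic. The sketch below computes the relative reduction in standard error from using both datasets rather than one. The variable names are illustrative.

```python
# Standard errors copied from Table 2: (single-trial estimate, fused estimate).
se_step = {"Gag": (0.017, 0.016), "Nef": (0.017, 0.016), "Pol": (0.017, 0.016)}
se_phambili = {"Gag": (0.036, 0.019), "Nef": (0.031, 0.017), "Pol": (0.032, 0.015)}


def relative_reduction(single: float, fused: float) -> float:
    """Fraction by which the standard error shrinks after data fusion."""
    return 1.0 - fused / single


step_reduction = {k: relative_reduction(*v) for k, v in se_step.items()}
phambili_reduction = {k: relative_reduction(*v) for k, v in se_phambili.items()}

for marker in se_phambili:
    print(marker,
          f"STEP: {100 * step_reduction[marker]:.0f}%",
          f"Phambili: {100 * phambili_reduction[marker]:.0f}%")
```

With these values, the reductions for Phambili as the target population are all between roughly 45% and 53%, while those for STEP are about 6%.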


We also studied coefficients in univariate logistic working models for baseline covariates age, sex, body mass index and baseline adenovirus serotype-5 positivity for each of the three HIV-specific immune markers. These results, along with further details of our area under the curve analysis, are shown in Supplementary Section H.

Supplementary Material

Supplementary appendices

NIHMS1923270-supplement-Supplementary_appendices.pdf (369.2KB, pdf)

Acknowledgements

The authors are grateful to Peter Gilbert for helpful discussions and to the participants, investigators, and sponsors of the STEP and Phambili trials. This work was supported by the NIH through award numbers DP2-LM013340 and 5UM1AI068635-09.

Contributor Information

Sijia Li, Department of Biostatistics, University of Washington, Seattle, Washington 98195.

Alex Luedtke, Department of Statistics, University of Washington, Box 354322, Seattle, Washington 98195.

References

Athey S, Chetty R, Imbens GW & Kang H (2019). The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Tech. rep, National Bureau of Economic Research.
Bang H & Robins JM (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61, 962–973.
Bareinboim E & Pearl J (2014). Transportability from multiple environments with limited experiments: Completeness results. Advances in neural information processing systems 27, 280–288.
Bareinboim E & Pearl J (2016). Causal inference and the data-fusion problem. Proc. Natl. Acad. Sci 113, 7345–7352.
Bickel PJ (1982). On adaptive estimation. Ann. Stat, 647–671.
Bickel PJ, Klaassen CA, Ritov Y & Wellner JA (1993). Efficient and adaptive estimation for semiparametric models, vol. 4. Johns Hopkins University Press Baltimore.
Buchbinder SP, Mehrotra DV, Duerr A, Fitzgerald DW, Mogg R, Li D, Gilbert PB, Lama JR, Marmor M, Del Rio C et al. (2008). Efficacy assessment of a cell-mediated immunity HIV-1 vaccine (the Step Study): a double-blind, randomised, placebo-controlled, test-of-concept trial. Lancet 372, 1881–1893.
Chakrabortty A (2016). Robust Semi-Parametric Inference in Semi-Supervised Settings. Ph.D. thesis, Harvard University.
Chapelle O, Scholkopf B & Zien A (2009). Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Trans. Neural Netw. Learn. Syst 20, 542–542.
Churchyard GJ, Morgan C, Adams E, Hural J, Graham BS, Moodie Z, Grove D, Gray G, Bekker L-G, McElrath MJ et al. (2011). A phase IIA randomized clinical trial of a multiclade HIV-1 DNA prime followed by a multiclade rAd5 HIV-1 vaccine boost in healthy adults (HVTN204). PLoS One 6, e21225.
Dahabreh IJ & Hernán MA (2019). Extending inferences from a randomized trial to a target population. Eur. J. Epidemiol 34, 719–722.
Dahabreh IJ, Robertson SE, Petito LC, Hernán MA & Steingrimsson JA (2019). Efficient and robust methods for causally interpretable meta-analysis: transporting inferences from multiple randomized trials to a target population. arXiv preprint arXiv:1908.09230.
Dong L, Yang S, Wang X, Zeng D & Cai J (2020). Integrative analysis of randomized clinical trials with real world evidence studies. arXiv preprint arXiv:2003.01242.
Evans K, Sun B, Robins J & Tchetgen EJT (2018). Doubly robust regression analysis for data fusion. arXiv preprint arXiv:1808.07309.
Follmann D (2006). Augmented designs to assess immune response in vaccine trials. Biometrics 62, 1161–1169.
Gray G, Allen M, Moodie Z, Churchyard G, Bekker L, Nchabeleng M, Mlisana K, Metch B, de Bruyn G, Latka M et al. (2011). Safety and efficacy assessment of the HVTN 503/Phambili study: A double-blind randomized placebo-controlled test-of-concept study of a clade B-based HIV-1 vaccine in South Africa. The Lancet Infectious Diseases 11, 507.
Heitjan DF & Rubin DB (1991). Ignorability and coarse data. Ann. Stat, 2244–2253.
Hernán MA & Robins JM (2020). Causal inference: what if.
Hernán MA & VanderWeele TJ (2011). Compound treatments and transportability of causal inference. Epidemiology (Cambridge, Mass.) 22, 368.
Huang Y, Duerr A, Frahm N, Zhang L, Moodie Z, De Rosa S, McElrath MJ & Gilbert PB (2014). Immune-correlates analysis of an HIV-1 vaccine efficacy trial reveals an association of nonspecific interferon-γ secretion with increased HIV-1 infection risk: a cohort-based modeling study. PLoS One 9, e108631.
Huang Y, Gilbert PB & Wolfson J (2013). Design and estimation for evaluating principal surrogate markers in vaccine trials. Biometrics 69, 301–309.
Kallus N, Saito Y & Uehara M (2020). Optimal off-policy evaluation from multiple logging policies. arXiv preprint arXiv:2010.11002.
Lanckriet GR, De Bie T, Cristianini N, Jordan MI & Noble WS (2004). A statistical framework for genomic data fusion. Bioinformatics 20, 2626–2635.
Lu B, Ben-Michael E, Feller A & Miratrix L (2021). Is it who you are or where you are? Accounting for compositional differences in cross-site treatment variation. arXiv preprint arXiv:2103.14765.
Luedtke A, Carone M & Van der Laan MJ (2019). An omnibus non-parametric test of equality in distribution for unknown functions. J. R. Stat. Soc 81, 75–99.
Mo W, Qi Z & Liu Y (2020). Learning optimal distributionally robust individualized treatment rules. J. Am. Stat. Assoc, 1–16.
Pearl J & Bareinboim E (2011). Transportability of causal and statistical relations: A formal approach. AAAI 25.
Pfanzagl J (1990). Estimation in semiparametric models. In Estimation in Semiparametric Models. Springer, pp. 17–22.
Polley EC & Van Der Laan MJ (2010). Super learner in prediction.
Qin L, Gilbert PB, Corey L, McElrath MJ & Self SG (2007). A framework for assessing immunological correlates of protection in vaccine trials. The Journal of Infectious Diseases 196, 1304–1312.
Robins JM, Rotnitzky A & Zhao LP (1994). Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc 89, 846–866.
Rolland M & Gilbert P (2012). Evaluating immune correlates in HIV type 1 vaccine efficacy trials: what RV144 may provide. AIDS Research and Human Retroviruses 28, 400–404.
Rudolph KE & van der Laan MJ (2017). Robust estimation of encouragement-design intervention effects transported across sites. J. R. Stat. Soc 79, 1509.
Stuart EA, Bradshaw CP & Leaf PJ (2015). Assessing the generalizability of randomized trial results to target populations. Prevention Science 16, 475–485.
Stuart EA, Cole SR, Bradshaw CP & Leaf PJ (2011). The use of propensity scores to assess the generalizability of results from randomized trials. J. R. Stat. Soc 174, 369–386.
Sun B & Miao W (2018). On semiparametric instrumental variable estimation of average treatment effects through data fusion. arXiv preprint arXiv:1810.03353.
Tsiatis AA (2006). Semiparametric theory and missing data. Springer.
Van der Laan MJ & Gruber S (2012). Targeted minimum loss based estimation of causal effects of multiple time point interventions. Int. J. Biostat 8.
Van der Laan MJ, Polley EC & Hubbard AE (2007). Super learner. Stat. Appl. Genet. Mol. Biol 6.
Van der Laan MJ & Robins JM (2003). Unified methods for censored longitudinal data and causality. Springer Science & Business Media.
Van der Laan MJ & Rubin D (2006). Targeted maximum likelihood learning. Int. J. Biostat 2.
Van der Vaart AW & Wellner J (1996). Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media.
Wedam S, Fashoyin-Aje L, Bloomquist E, Tang S, Sridhara R, Goldberg KB, Theoret MR, Amiri-Kordestani L, Pazdur R & Beaver JA (2020). FDA approval summary: palbociclib for male patients with metastatic breast cancer. Clin. Cancer Res 26, 1208–1212.
Westling T (2021). Nonparametric tests of the causal null with nondiscrete exposures. J. Am. Stat. Assoc, 1–12.
Westreich D, Edwards JK, Lesko CR, Stuart E & Cole SR (2017). Transportability of trial results using inverse odds of sampling weights. Am. J. Epidemiol 186, 1010–1014.
Williamson BD, Gilbert PB, Simon NR & Carone M (2021). A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association, 1–14.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary appendices

NIHMS1923270-supplement-Supplementary_appendices.pdf^{(369.2KB, pdf)}

[R1] Athey S, Chetty R, Imbens GW & Kang H (2019). The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Tech. rep, National Bureau of Economic Research. [Google Scholar]

[R2] Bang H & Robins JM (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61, 962–973. [DOI] [PubMed] [Google Scholar]

[R3] Bareinboim E & Pearl J (2014). Transportability from multiple environments with limited experiments: Completeness results. Advances in neural information processing systems 27, 280–288. [Google Scholar]

[R4] Bareinboim E & Pearl J (2016). Causal inference and the data-fusion problem. Proc. Natl. Acad. Sci 113, 7345–7352. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] Bickel PJ (1982). On adaptive estimation. Ann. Stat, 647–671. [Google Scholar]

[R6] Bickel PJ, Klaassen CA, Bickel PJ, Ritov Y, Klaassen J, Wellner JA & Ritov Y (1993) Efficient and adaptive estimation for semiparametric models, vol. 4. Johns Hopkins University Press Baltimore. [Google Scholar]

[R7] Buchbinder SP, Mehrotra DV, Duerr A, Fitzgerald DW, Mogg R, Li D, Gilbert PB, Lama JR, Marmor M, Del Rio C et al. (2008). Efficacy assessment of a cell-mediated immunity hiv-1 vaccine (the step study): a double-blind, randomised, placebo-controlled, test-of-concept trial. Lancet 372, 1881–1893. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] Chakrabortty A (2016). Robust Semi-Parametric Inference in Semi-Supervised Settings. Ph.D. thesis, Havard University. [Google Scholar]

[R9] Chapelle O, Scholkopf B & Zien A (2009). Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews]. IEEE Trans. Neural Netw. Learn. Syst 20, 542–542. [Google Scholar]

[R10] Churchyard GJ, Morgan C, Adams E, Hural J, Graham BS, Moodie Z, Grove D, Gray G, Bekker L-G, McElrath MJ et al. (2011). A phase iia randomized clinical trial of a multiclade hiv-1 dna prime followed by a multiclade rad5 hiv-1 vaccine boost in healthy adults (hvtn204). PloS one 6, e21225. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] Dahabreh IJ & Hernán MA (2019). Extending inferences from a randomized trial to a target population. Eur. J. Epidemiol 34, 719–722. [DOI] [PubMed] [Google Scholar]

[R12] Dahabreh IJ, Robertson SE, Petito LC, Hernán MA & Steingrimsson JA (2019). Efficient and robust methods for causally interpretable meta-analysis: transporting inferences from multiple randomized trials to a target population. arXiv preprint arXiv:1908.09230. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] Dong L, Yang S, Wang X, Zeng D & Cai J (2020). Integrative analysis of randomized clinical trials with real world evidence studies. arXiv preprint arXiv:2003.01242. [Google Scholar]

[R14] Evans K, Sun B, Robins J & Tchetgen EJT (2018). Doubly robust regression analysis for data fusion. arXiv preprint arXiv:1808.07309. [Google Scholar]

[R15] FollmanN D (2006). Augmented designs to assess immune response in vaccine trials. Biometrics 62, 1161–1169. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] Gray G, Allen M, Moodie Z, Churchyard G, Bekker L, Nchabeleng M, Mlisana K, Metch B, de Bruyn G, Latka M et al. (2011). Safety and efficacy assessment of the hvtn 503/phambili study: A double-blind randomized placebo-controlled test-of-concept study of a clade b-based hiv-1 vaccine in south africa. The Lancet infectious diseases 11, 507. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] Heituan DF & Rubin DB (1991). Ignorability and coarse data. Ann. Stat, 2244–2253. [Google Scholar]

[R18] Hernán MA & Robins JM (2020). Causal inference: what if.

[R19] Hernán MA & VanderWeele TJ (2011). Compound treatments and transportability of causal inference. Epidemiology (Cambridge, Mass.) 22, 368. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] Huang Y, Duerr A, Frahm N, Zhang L, Moodie Z, De Rosa S, McElrath MJ & Gilbert PB (2014). Immune-correlates analysis of an hiv-1 vaccine efficacy trial reveals an association of nonspecific interferon-γ secretion with increased hiv-1 infection risk: a cohort-based modeling study. PLoS One 9, e108631. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] Huang Y, Gilbert PB & Wolfson J (2013). Design and estimation for evaluating principal surrogate markers in vaccine trials. Biometrics 69, 301–309. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] Kallus N, Saito Y & Uehara M (2020). Optimal off-policy evaluation from multiple logging policies. arXiv preprint arXiv:2010.11002. [Google Scholar]

[R23] Lanckriet GR, De Bie T, Cristianini N, Jordan MI & Noble WS (2004). A statistical framework for genomic data fusion. Bioinformatics 20, 2626–2635. [DOI] [PubMed] [Google Scholar]

[R24] Lu B, Ben-Michael E, Feller A & Miratrix L (2021). Is it who you are or where you are? accounting for compositional differences in cross-site treatment variation. arXiv preprint arXiv:2103.14765. [Google Scholar]

[R25] Luedtke A, Carone M & Van der Laan MJ (2019). An omnibus non-parametric test of equality in distribution for unknown functions. J. R. Stat. Soc 81, 75–99. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] Mo W, Qi Z & Liu Y (2020). Learning optimal distributionally robust individualized treatment rules. J. Am. Stat. Assoc, 1–16. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] Pearl J & Bareinboim E (2011). Transportability of causal and statistical relations: A formal approach. AAAI 25. [Google Scholar]

[R28] Pfanzagl J (1990). Estimation in semiparametric models. In Estimation in Semiparametric Models. Springer, pp. 17–22. [Google Scholar]

[R29] Polley EC & Van Der Laan MJ (2010). Super learner in prediction.

[R30] Qin L, Gilbert PB, Corey L, McElrath MJ & Self SG (2007). A framework for assessing immunological correlates of protection in vaccine trials. The Journal of infectious diseases 196, 1304–1312. [DOI] [PubMed] [Google Scholar]

[R31] Robins JM, Rotnitzky A & Zhao LP (1994). Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc 89, 846–866. [Google Scholar]

[R32] Rolland M & Gilbert P (2012). Evaluating immune correlates in hiv type 1 vaccine efficacy trials: what rv144 may provide. AIDS research and human retroviruses 28, 400–404. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R33] Rudolph KE & van der Laan MJ (2017). Robust estimation of encouragement-design intervention effects transported across sites. J. R. Stat. Soc 79, 1509. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] Stuart EA, Bradshaw CP & Leaf PJ (2015). Assessing the generalizability of randomized trial results to target populations. Prevention Science 16, 475–485. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R35] Stuart EA, Cole SR, Bradshaw CP & Leaf PJ (2011). The use of propensity scores to assess the generalizability of results from randomized trials. J. R. Stat. Soc 174, 369–386. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R36] Sun B & Miao W (2018). On semiparametric instrumental variable estimation of average treatment effects through data fusion. arXiv preprint arXiv:1810.03353. [Google Scholar]

[R37] Tsiatis AA (2006). Semiparametric theory and missing data. Springer. [Google Scholar]

[R38] Van der Laan MJ & Gruber S (2012). Targeted minimum loss based estimation of causal effects of multiple time point interventions. Int. J. Biostat 8. [DOI] [PubMed] [Google Scholar]

[R39] Van der Laan MJ, Polley EC & Hubbard AE (2007). Super learner. Stat. Appl. Genet. Mol. Biol 6.

[R40] Van der Laan MJ & Robins JM (2003). Unified methods for censored longitudinal data and causality. Springer Science & Business Media.

[R41] Van Der Laan MJ & Rubin D (2006). Targeted maximum likelihood learning. Int. J. Biostat 2.

[R42] Van Der Vaart AW & Wellner J (1996). Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media.

[R43] Wedam S, Fashoyin-Aje L, Bloomquist E, Tang S, Sridhara R, Goldberg KB, Theoret MR, Amiri-Kordestani L, Pazdur R & Beaver JA (2020). FDA approval summary: palbociclib for male patients with metastatic breast cancer. Clin. Cancer Res 26, 1208–1212.

[R44] Westling T (2021). Nonparametric tests of the causal null with nondiscrete exposures. J. Am. Stat. Assoc, 1–12.

[R45] Westreich D, Edwards JK, Lesko CR, Stuart E & Cole SR (2017). Transportability of trial results using inverse odds of sampling weights. Am. J. Epidemiol 186, 1010–1014.

[R46] Williamson BD, Gilbert PB, Simon NR & Carone M (2021). A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association, 1–14.
