Biostatistics (Oxford, England). 2023 Sep 11;25(3):617–632. doi: 10.1093/biostatistics/kxad024

Semi-supervised mixture multi-source exchangeability model for leveraging real-world data in clinical trials

Lillian M F Haine, Thomas A Murry, Raquel Nahra, Giota Touloumi, Eduardo Fernández-Cruz, Kathy Petoumenos, Joseph S Koopmeiners, for the Insight FLU 003 Plus and FLU-IVIG Study Groups
PMCID: PMC11247180  PMID: 37697901

Summary

The traditional trial paradigm is often criticized as being slow, inefficient, and costly. Statistical approaches that leverage external trial data have emerged to make trials more efficient by augmenting the sample size. However, these approaches assume that external data are from previously conducted trials, leaving a rich source of untapped real-world data (RWD) that cannot yet be effectively leveraged. We propose a semi-supervised mixture (SS-MIX) multisource exchangeability model (MEM), a flexible, two-step Bayesian approach for incorporating RWD into randomized controlled trial analyses. The first step is an SS-MIX model on a modified propensity score and the second step is a MEM. The first step identifies a subgroup of the RWD that is representative of the trial population, and the second step avoids borrowing when there are substantial differences in outcomes between the trial sample and the representative observational sample. When comparing the proposed approach to competing borrowing approaches in a simulation study, we find that our approach borrows efficiently when the trial and RWD are consistent, while mitigating bias when the trial and external data differ on either measured or unmeasured covariates. We illustrate the proposed approach with an application to a randomized controlled trial investigating intravenous hyperimmune immunoglobulin in hospitalized patients with influenza, while leveraging data from an external observational study to supplement a subgroup analysis by influenza subtype.

Keywords: Bayesian model averaging, Causal inference, Influenza, Propensity scores, Real-world data

1 Introduction

Randomized controlled trials (RCTs) are the gold standard for studying causal relationships (Hariton and Locascio, 2018), yet are criticized as being inefficient, requiring substantial time, money, and participants (Morgan and others, 2011). These inefficiencies have important implications for clinical research. Many trials struggle to enroll, often terminating before a definitive conclusion can be reached, and subgroup analyses to characterize treatment effect heterogeneity are often underpowered. In contrast, real-world data (RWD) are plentiful but lack the internal validity of RCTs. Statistical methods that leverage the complementary strengths of RCTs and RWD are needed to improve the efficiency of clinical research.

Consider the following motivating example. From 2013 to 2019, the International Network for Strategic Initiatives in Global HIV Trials (INSIGHT) conducted a double-blind, placebo-controlled RCT (FLU-IVIG) investigating anti-influenza intravenous hyperimmune immunoglobulin (hIVIG) as a treatment for individuals hospitalized with either influenza A or B. While FLU-IVIG failed to show a significant benefit for hIVIG when compared with placebo, a subgroup analysis by influenza subtype found a favorable signal in patients with influenza B (Davey and others, 2019). Though promising, this subgroup analysis had a limited sample size and cannot be considered definitive. Since 2009, the INSIGHT network has also conducted an observational study (FLU 003) following individuals hospitalized with influenza for 60 days (Dwyer and INSIGHT Influenza Study Group, 2011). FLU 003 is a promising source of supplementary data because it enrolled individuals from many of the same sites and measured many of the same baseline and outcome variables as FLU-IVIG. However, leveraging FLU 003 data must be done with care to minimize the risk of biasing FLU-IVIG analyses that are protected by randomization.

Leveraging supplemental information in the analysis of RCTs has been studied extensively. These methods can be broadly classified as either static or dynamic (Viele and others, 2014). Static approaches are limited because the degree of borrowing from the external data is defined a priori. In contrast, dynamic approaches empirically evaluate the consistency of the RCT and external data to determine an appropriate degree of borrowing. Examples of dynamic borrowing methods include the commensurate prior approach (Hobbs and others, 2011); the power prior approach (Ibrahim and Chen, 2000); and multisource exchangeability models (MEMs) (Kaizer and others, 2018b).

The above approaches borrow based on the marginal treatment effect, implicitly assuming that the external data come from previous RCTs conducted in the same study population as the current RCT. They are inadequate for leveraging RWD for several reasons. First, RWD and RCT data will likely differ with respect to important prognostic covariates. Methods that do not account for these differences will result in either little-to-no borrowing, due to high between-source heterogeneity, or severely biased treatment effect estimates. Second, RCTs offer strong internal validity due to randomization, whereas RWD do not. If supplemental RWD are available for both the treatment and control arms, then differences between those arms will likely be confounded, and this confounding must be properly controlled to avoid biasing treatment effect estimates through borrowing. Finally, there may be unmeasurable systematic differences between the RCT and RWD that impact outcomes, in which case borrowing should be avoided. Our goal is to develop an approach to dynamic borrowing that accounts for heterogeneity between the RCT and RWD populations, avoids borrowing in the presence of residual between-source differences, and accounts for confounding in the RWD when it contains treated and control information.

Several recent papers have discussed approaches to static and dynamic borrowing that account for population heterogeneity, confounding, or both. Wang and others (2019) proposed the propensity score (PS) composite likelihood approach, which uses a modified PS to mitigate population heterogeneity. While the method will properly account for population heterogeneity, borrowing is done in an outcome blind manner to allow for greater design transparency, and the approach will borrow regardless of the presence or absence of residual between source differences. Kotalik and others (2021) extended MEMs to account for treatment effect heterogeneity, but not confounding. Finally, Boatman and others (2021) proposed an approach that accounts for population heterogeneity and confounding through regression-based estimators and dynamic borrowing via MEMs. However, this approach is inefficient and does not model the target population average treatment effect (ATE) directly.

We propose a novel, two-step Bayesian method for incorporating supplemental RWD into an RCT analysis. Step 1 is a semi-supervised mixture (SS-MIX) model on the PS and Step 2 is a MEM, creating the SS-MIX–MEM approach. The SS-MIX step accounts for population heterogeneity by relying on the balancing property of the PS to identify a subgroup of the RWD that is representative of the RCT participant population, with the aim of balancing the distribution of baseline covariates between the RWD subgroup and the RCT. The MEM step avoids borrowing when substantial differences remain between the outcomes in the RCT and the representative subgroup from the RWD.

The remainder of this article proceeds as follows. In Section 2, we introduce and derive the SS-MIX–MEM approach assuming that supplemental data are only available on the control arm. In Section 3, we present simulation results comparing the SS-MIX–MEM approach to other existing borrowing methods. In Section 4, we apply the SS-MIX–MEM approach to the FLU-IVIG and FLU 003 studies. In Section 5, we illustrate how SS-MIX–MEM can be extended to scenarios where supplemental data are available on both study arms, resulting in confounding in the observational data. Finally, in Section 6, we conclude with a brief discussion.

2 Methods

2.1 Notation and preliminaries

Suppose we have data from a two-armed RCT and supplemental data for the control arm from an untreated observational study or RWD, where each dataset contains equivalently defined binary outcomes and baseline covariates. We use upper case letters to denote random variables and lower case letters for their realized values. Specifically, we use Y for outcomes, X for the vector of baseline covariates, T for treatment group (with 0 indicating control and 1 indicating treated), and S for study source (with 0 indicating observational study membership and 1 indicating trial membership). Let $n_1$ denote the trial sample size and $n_0$ denote the observational study sample size, so that $n=n_1+n_0$ encompasses all available data. Let $d_i=(y_i,x_i,s_i,t_i)$ denote all observed data for individual $i$, $i=1,\ldots,n$.

We frame our discussion using potential outcomes (Rubin, 1974). We use Y(s, t) to denote the potential outcome under study s and treatment t. Under the stable unit treatment value assumption (SUTVA) (Rubin, 1980), the potential outcome under the assigned treatment is the same as the observed outcome such that $Y=\sum_{s=0}^{1}\sum_{t=0}^{1}Y(s,t)\,I(S=s)\,I(T=t)$. We are interested in estimating the ATE in a population of individuals with the same distribution of covariates as those who enrolled in the trial, i.e., $\Delta=E[Y(1,1)-Y(1,0)\mid S=1]$. Due to randomization and SUTVA, $\Delta=E[Y(1,1)-Y(1,0)\mid S=1]=E[Y\mid S=1,T=1]-E[Y\mid S=1,T=0]$ and thus can be estimated using RCT data.

Our goal is to incorporate external observational data into the RCT control arm. However, as the observational and RCT data likely exhibit different covariate distributions, naively pooling the two datasets would bias and confound the ATE estimate. If we assume that there are no unmeasured prognostic covariates, so that X captures all covariates related to the outcome, then $(Y(t,0),Y(t,1))\perp S\mid X$ and thus $f(Y\mid X,T,S)=f(Y\mid X,T)$. To estimate $\Delta$, we can marginalize over the covariate distribution of the individuals enrolled in the RCT: $\Delta=E_Y\left\{\int_X f(Y\mid X,T=1)\,f(X\mid S=1)\,dX-\int_X f(Y\mid X,T=0)\,f(X\mid S=1)\,dX\right\}$. This in turn motivates our use of the PS defined as $e(X)=\Pr(S=1\mid X)$, which reflects the probability of enrolling in the RCT given baseline covariates. Here, the PS reflects the probability of enrolling in the RCT and not the probability of receiving treatment, as is typically the case in observational studies or RWD, but its purpose is similar. Due to the balancing property of the PS, the distribution of observed covariates will be balanced between the RCT and the observational study conditional on the PS (Rubin, 2005; Rosenbaum and Rubin, 1983): $f(X\mid e(X),S=1)=f(X\mid e(X),S=0)$. In practice, the PS must be estimated from the data. We normalize the estimated PS via a logit-transformation and denote $q_i$ as the logit-transformed PS.
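As a rough illustration of estimating the PS for trial membership and taking its logit, the sketch below fits a logistic regression of S on a single covariate by Newton-Raphson. The function names (`fit_logistic`, `logit_ps`), the one-covariate model, and the toy data are all hypothetical simplifications chosen for this example; they are not the authors' implementation.

```python
import math
import random


def fit_logistic(x, s, n_iter=25):
    """Fit Pr(S=1 | x) = expit(b0 + b1 * x) by Newton-Raphson (one covariate)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, si in zip(x, s):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            v = p * (1.0 - p)
            g0 += si - p           # score for the intercept
            g1 += (si - p) * xi    # score for the slope
            h00 += v               # entries of the information matrix
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^{-1} g, with the 2x2 inverse written out
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def logit_ps(x, b0, b1):
    """Logit-transformed PS; for a logistic model this is the linear predictor."""
    return [b0 + b1 * xi for xi in x]


# Toy data (hypothetical): 100 trial participants, 300 observational participants.
random.seed(1)
x = [random.gauss(0.0, 1.0) for _ in range(100)] + \
    [random.gauss(1.0, 1.0) for _ in range(300)]
s = [1] * 100 + [0] * 300
b0, b1 = fit_logistic(x, s)
q = logit_ps(x, b0, b1)  # q_i, the input to the SS-MIX step
```

Because the fitted model is logistic, the logit of the estimated PS is simply the linear predictor, so no separate transformation step is needed in this sketch.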

The PS defined above reflects the probability of participating in the trial, not the probability of being in the RCT population. If the RWD and RCT have the same covariate distribution, then the estimated PS would be constant and equal to the proportion of individuals enrolling in the RCT, even though the entire RWD sample is representative of the RCT participants. As such, to identify the subgroup of individuals in the RWD with the same covariate distribution as those who enrolled in the RCT, we introduce a latent indicator Z for whether the individual is in the RCT participant population. For all RCT individuals Z = 1, whereas Z is unknown for the RWD and must be estimated. We use an SS-MIX model on the transformed PS to identify the individuals with Z = 1 from the observational study, such that $f(e(X)\mid Z=1,S=0)=f(e(X)\mid S=1)$, and thus $f(X\mid S=1)=f(X\mid Z=1)$. The individuals with Z = 1 from the observational study form a representative subgroup, as they have the same distribution of baseline covariates as the RCT participants. Additionally, when $(Y(t,0),Y(t,1))\perp S\mid X$, then $Y(1,t)=Y(0,t)=Y(t)$ and $\Delta=E[Y\mid S=1,T=1]-E[Y\mid Z=1,T=0]$, and we may leverage the subgroup of individuals from the observational study with Z = 1. Note that we condition on the estimated PSs and do not account for uncertainty in the estimated PSs. The estimated PSs can be thought of as both a dimension reduction tool and a means of identifying a subgroup of the RWD that is more similar to the RCT population than the entire observational dataset. Furthermore, conditioning on the estimated PS allows us to consider a broader set of tools for estimating the PSs (e.g., frequentist or machine-learning approaches), rather than restricting to only Bayesian approaches.

Conditioning on the estimated PS will balance the RCT and RWD on observed covariates, but there may still remain systematic differences between the RCT and RWD that are not explained by observed covariates. Therefore, rather than assuming that Y(1,t)=Y(0,t)=Y(t), we will use dynamic borrowing to evaluate this assumption and borrow as a function of the observed similarity between the RCT and RWD.

Accounting for both population differences and potential residual differences between the RCT and RWD motivates a two-step approach for combining the data. In Step 1, the SS-MIX model on the PS separates the observational study into a representative subgroup (Z = 1) and the complement (Z = 0). In Step 2, we use MEMs to dynamically borrow between the RCT data and the representative subgroup. A conceptual diagram of our approach can be found in Figure 1.

Fig. 1. Overview of SS-MIX–MEM approach.

2.2 Likelihoods and models

Step 1: SS-MIX Step. The SS-MIX component for the logit-transformed PS is used to draw from the representative subgroup of the observational study. We assume the logit-transformed PS for the RCT participants, $q_i$, follows some distribution with density $f(\cdot\mid\theta_1)$ with parameter(s) $\theta_1$, while $q_i$ for the observational data follows a two-component mixture distribution with the first component having the same distribution as the RCT, and the second component following some distribution with parameter(s) $\theta_2\neq\theta_1$:

$$Q\mid S=1\sim f(q\mid\theta_1),\qquad Q,Z\mid S=0\sim[w\,f(q\mid\theta_1)]^{Z}\times[(1-w)\,f(q\mid\theta_2)]^{1-Z},$$

where $w$ is the mixture weight of the observational study, which can be interpreted as the proportion of observational study individuals that are from the subgroup which is representative of the population of individuals who enrolled in the RCT. For posterior derivation, we perform data augmentation through the definition of a latent indicator variable Z, which indicates component membership (Chib, 1996). The posterior distribution for $Q,Z\mid S=0$ will be proper only when $w=P(Z=1)$, as we specify here. Our complete joint likelihood for $(\theta_1,\theta_2)$ with observed logit-transformed PS $q$, and unobserved vector $z$, is:

$$L(\theta_1,\theta_2\mid q,z)=\prod_{i=1}^{n}f(q_i\mid\theta_1)^{s_i}\,[w\,f(q_i\mid\theta_1)]^{z_i(1-s_i)}\,[(1-w)\,f(q_i\mid\theta_2)]^{(1-z_i)(1-s_i)}.\tag{2.1}$$
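To make the structure of the complete-data likelihood concrete, the sketch below evaluates its logarithm and the implied conditional probability that an observational individual belongs to the representative component. It assumes Gaussian components parameterized by mean and precision, as the authors later adopt for their simulation and application; the function names are hypothetical and the membership formula is the standard full conditional for a two-component mixture, not a transcription of the supplementary derivations.

```python
import math


def normal_pdf(q, mu, tau):
    """Gaussian density parameterized by mean mu and precision tau."""
    return math.sqrt(tau / (2.0 * math.pi)) * math.exp(-0.5 * tau * (q - mu) ** 2)


def complete_loglik(q, s, z, w, theta1, theta2):
    """Log of the complete-data likelihood: trial rows use f(.|theta1);
    observational rows use the component selected by the latent indicator z."""
    total = 0.0
    for qi, si, zi in zip(q, s, z):
        if si == 1:
            total += math.log(normal_pdf(qi, *theta1))
        elif zi == 1:
            total += math.log(w * normal_pdf(qi, *theta1))
        else:
            total += math.log((1.0 - w) * normal_pdf(qi, *theta2))
    return total


def prob_representative(qi, w, theta1, theta2):
    """Pr(Z_i = 1 | q_i, w, theta1, theta2) for an observational individual."""
    a = w * normal_pdf(qi, *theta1)
    b = (1.0 - w) * normal_pdf(qi, *theta2)
    return a / (a + b)


# Hypothetical parameter values: (mean, precision) for each component.
theta1 = (-1.0, 1.0)
theta2 = (-3.0, 1.0)
p_mid = prob_representative(-2.0, 0.5, theta1, theta2)  # equidistant from both means
```

In a Gibbs-type sampler, each observational individual's indicator would be drawn as a Bernoulli variable with this probability, given the current values of the other parameters.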

Step 2: MEM Step. Let $r_t$, $n_t$ and $r_c$, $n_c$ be the number of events and the sample size of the RCT treated and control arms, respectively. Similarly, let $r_{1,m}=\sum_{j=1}^{n}y_j z_j(1-s_j)$ and $n_{1,m}=\sum_{j=1}^{n}z_j(1-s_j)$ for the $m$th draw, $m=1,\ldots,M_1$, from Step 1 serve as our external data source for Step 2.

To utilize MEMs we start by enumerating all possible exchangeability patterns between the RCT and mth representative subgroup. With only one external data source, this results in two exchangeability patterns: the representative subgroup is assumed exchangeable with the RCT control arm or the representative subgroup is assumed not exchangeable with the RCT control arm. A given draw of the representative subgroup and the RCT control arm is considered exchangeable if they have the same outcome event rate and not exchangeable otherwise. The MEM derivations for a Bernoulli random variable can be found in Kaizer and others (2018a) and are reviewed below.

For model specification, we define E as an indicator for whether or not the representative subgroup is assumed to be exchangeable with the RCT control arm. When E = 1, we assume that the representative subgroup and the RCT control arm have the same event rate. When E = 0, we assume that the representative subgroup and the RCT control arm do not have the same event rate. That is,

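$$\begin{aligned}
R_t&\sim\text{Binomial}(n_t,p_t)\\
R_c&\sim\text{Binomial}(n_c,p_c)\\
R_{1,m}\mid E=1,Z=1&\sim\text{Binomial}(n_{1,m},p_c),\quad m=1,\ldots,M_1\\
R_{1,m}\mid E=0,Z=1&\sim\text{Binomial}(n_{1,m},p_o),\quad m=1,\ldots,M_1.
\end{aligned}$$

Letting $p=(p_c,p_t,p_o)$ and $d=(r_c,r_t,r_{1,m},n_c,n_t,n_{1,m})$, the conditional likelihood function is:

$$L(p\mid d,E)\propto p_t^{r_t}(1-p_t)^{n_t-r_t}\,p_c^{r_c+E\,r_{1,m}}(1-p_c)^{n_c-r_c+E(n_{1,m}-r_{1,m})}\,p_o^{(1-E)\,r_{1,m}}(1-p_o)^{(1-E)(n_{1,m}-r_{1,m})}.\tag{2.2}$$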

2.3 Priors

Note that for the proposed approach there are two mechanisms that will limit borrowing: (1) the mixing parameter $w$ in the SS-MIX model and (2) the prior probability of exchangeability in the MEM. We place a weakly informative prior distribution on the mixing parameter, $w\sim\text{Beta}(1,1)$, to capture all observational study individuals from the RCT population. We place minimally informative independent priors on the following parameters: $p_c\sim\text{Beta}(1,1)$, $p_t\sim\text{Beta}(1,1)$, $p_o\sim\text{Beta}(1,1)$.

For the prior probability of exchangeability, we explore two approaches to prior specification. The first, denoted $\pi_A$, is the agnostic approach, where we assign equal prior probability to exchangeability and non-exchangeability (i.e., $\pi_A(E=1)=0.5$). The second, denoted $\pi_C$, is the capped approach, which is derived such that the maximum effective supplemental sample size (ESSS; defined in Section 3.2) is less than some pre-specified level (Ling and others, 2021). As observational studies are often much larger than RCTs, it may be desirable to limit the information incorporated from the observational study so that it does not overwhelm the RCT data. Here, we set the capping prior such that the maximum ESSS is less than the sample size of the RCT control arm. Using the capped prior will decrease $P(E=1\mid d)$, thereby decreasing the contribution of the observational study. The derivation of the capping prior can be found in the Supplementary Materials.

2.4 Posteriors

Step 1: SS-MIX Step. We approximate the posterior distributions for the SS-MIX Step using a modified Gibbs sampler with $M_1$ draws and a partial likelihood, which ignores the likelihood contribution from the observational study data for the posterior distribution of $\theta_1$. We use a partial likelihood instead of the full likelihood for posterior sampling so that the posterior parameters for the RCT distributions are not altered by the observational data and are only informed by the RCT data. The posterior distributions are derived in the Supplementary Materials. For each draw we compute the sufficient statistics for a Binomial likelihood, namely the number of responses and the number of individuals in the representative subgroup for that draw. In the Supplementary Materials, we compute the standardized mean difference for each covariate across the $m=1,\ldots,M_1$ SS-MIX draws to investigate whether each SS-MIX draw has achieved the desired covariate balance between the representative subgroup and the RCT data.
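The balance check described above can be sketched as follows. The standardized mean difference used here is the common pooled-variance form, which is an assumption about the exact variant the authors use; the function names and the representation of each SS-MIX draw as a 0/1 membership vector are likewise hypothetical.

```python
from statistics import mean, variance


def standardized_mean_difference(x_rct, x_sub):
    """SMD = (mean_rct - mean_sub) / sqrt((var_rct + var_sub) / 2)."""
    pooled_sd = ((variance(x_rct) + variance(x_sub)) / 2.0) ** 0.5
    return (mean(x_rct) - mean(x_sub)) / pooled_sd


def smd_across_draws(x_rct, x_obs, z_draws):
    """Compute the SMD of one covariate for every SS-MIX draw.

    z_draws is a list of 0/1 membership vectors over the observational
    individuals, one vector per draw (hypothetical data layout).
    """
    out = []
    for z in z_draws:
        subgroup = [x for x, zi in zip(x_obs, z) if zi == 1]
        out.append(standardized_mean_difference(x_rct, subgroup))
    return out
```

Values near zero across draws would suggest that the representative subgroup and the RCT are balanced on that covariate.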

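Step 2: MEM Step. Following Kaizer and others (2018a) with assumed independent priors $p_c\sim\text{Beta}(\alpha_c,\beta_c)$, $p_t\sim\text{Beta}(\alpha_t,\beta_t)$, the posterior distributions for $p_t$ and $p_c$ are:

$$q(p_t\mid d)\propto\text{Beta}(\alpha_t+r_t,\ \beta_t+n_t-r_t)\quad\text{and}\tag{2.3}$$
$$q(p_c\mid d)=\Pr(E=1\mid d)\,q(p_c\mid d,E=1)+\Pr(E=0\mid d)\,q(p_c\mid d,E=0).\tag{2.4}$$

The conditional posterior distribution for $p_c$ is the beta distribution

$$q(p_c\mid d,E)\propto\text{Beta}\big(\alpha_c+r_c+r_{1,m}E,\ \beta_c+n_c-r_c+(n_{1,m}-r_{1,m})E\big).$$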

We sample from the posterior distributions via Markov chain Monte Carlo (MCMC) methods. To average across the exchangeability models we use reversible jump MCMC (RJMCMC) to move between parameter spaces with varying dimensionality (Green and Hastie, 2009). The posterior and RJMCMC derivations are in the Supplementary Materials.
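Because the Beta priors are conjugate to the Binomial likelihood, the posterior probability of exchangeability for a single draw of the representative subgroup can also be written in closed form from ratios of Beta functions. The sketch below uses that closed form in place of the RJMCMC sampler described above, so it illustrates the model-averaging idea rather than the authors' implementation. The function names are hypothetical, and the code assumes the same Beta(a, b) prior for the control and observational event rates.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def posterior_prob_exchangeable(rc, nc, r1, n1, prior_e=0.5, a=1.0, b=1.0):
    """Pr(E=1 | d) for one draw of the representative subgroup.

    The treated-arm term and the binomial coefficients are common to both
    models and cancel in the ratio of marginal likelihoods.
    """
    # E = 1: RCT control arm and subgroup share the event rate p_c
    log_m1 = log_beta(a + rc + r1, b + nc - rc + n1 - r1) - log_beta(a, b)
    # E = 0: control arm has p_c, subgroup has its own rate p_o
    log_m0 = (log_beta(a + rc, b + nc - rc) - log_beta(a, b)
              + log_beta(a + r1, b + n1 - r1) - log_beta(a, b))
    log_odds = math.log(prior_e) - math.log(1.0 - prior_e) + log_m1 - log_m0
    return 1.0 / (1.0 + math.exp(-log_odds))


def posterior_mean_pc(rc, nc, r1, n1, prior_e=0.5, a=1.0, b=1.0):
    """Mean of the two-component mixture posterior for p_c."""
    pe = posterior_prob_exchangeable(rc, nc, r1, n1, prior_e, a, b)
    mean_e1 = (a + rc + r1) / (a + b + nc + n1)  # pooled with the subgroup
    mean_e0 = (a + rc) / (a + b + nc)            # RCT control arm only
    return pe * mean_e1 + (1.0 - pe) * mean_e0


# Hypothetical counts: similar event rates favor exchangeability,
# very different event rates push the posterior probability toward 0.
p_similar = posterior_prob_exchangeable(30, 100, 300, 1000)
p_different = posterior_prob_exchangeable(30, 100, 700, 1000)
```

Passing a smaller `prior_e` mimics the effect of a capped prior, which lowers the posterior probability of exchangeability and therefore the amount borrowed.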

For our simulation study and data application, we use a parametric model and assume that $f(q)$ is Gaussian with mean $\mu_s$ and precision $\tau_s$. For our assumed priors, we place minimally informative independent priors on the following parameters: $\mu_s\sim N(0,0.00001)$, $\tau_s\sim\text{Gamma}(0.001,0.001)$, $s=1,2$. Here, we assume normality for the components of the mixture distribution based on visual inspection of density plots of the logit-transformed PSs from our motivating data, which are provided in the Supplementary Materials.

3 Simulation Study

We completed a simulation study to evaluate the operating characteristics of the proposed SS-MIX–MEM approach under various scenarios. We simulate a two-armed RCT and one observational data source. To reflect our data application, we use similar RCT and observational study sample sizes and focus on incorporating observational data into the control arm of the RCT. For all scenarios, the RCT has a total sample size of 200 with a 1:1 allocation to each arm, which results in 90% power to detect an odds ratio (OR) of 1.95. The observational dataset has a sample size of 1000 individuals.

We compare the proposed SS-MIX–MEM method with and without a capped prior on E to the following methods: (1) the reference model, where only the RCT data are used, (2) the marginal MEM from Kaizer and others (2018a) with and without capping on E, and (3) the PS composite likelihood (PSCL) approach proposed by Chen and others (2020), implemented using the psrwe R package (Chenguang Wang, 2021).

3.1 Data generation

For the RCT, we simulate the covariate vectors from a five-dimensional multivariate normal distribution, $X\mid S=1\sim\text{MVN}_5(0_5,I_5)$. For the observational study, we simulate the covariate vectors from a two-component multivariate normal mixture distribution, $X\mid S=0\sim w\,\text{MVN}_5(0_5,I_5)+(1-w)\,\text{MVN}_5(1.5_5,I_5)$. Here, the first component has the same covariate distribution as the RCT participants, while the second component does not. We simulate the binary outcome, $Y_i$, such that

$$\text{logit}(\Pr(Y_i=1\mid X_i,T_i,S_i))=\beta_0+\tau T_i+\beta^{T}X_i+\text{Bias}\,(1-S_i),\tag{3.5}$$

with $\beta=1_5$, choosing $\beta_0$ and $\tau$ so that $E[Y\mid T=0,Z=1]=0.30$ and $E[Y\mid T=1,Z=1]$ depends on the assumed OR. We assume that $\beta_0$, $\tau$, and $\beta$ are the same for the RCT and the observational study. In scenarios with residual between-source differences between the representative subgroup and the RCT data, we add a constant bias term to the observational study.
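The data-generating mechanism described above can be sketched as follows. The function names and the dictionary layout are hypothetical, and the intercept passed in the usage line is a placeholder rather than the authors' calibrated value; in practice the intercept and treatment effect would be tuned to reach the target event rates.

```python
import math
import random


def expit(u):
    return 1.0 / (1.0 + math.exp(-u))


def simulate_covariates(n, shift, rng, dim=5):
    """Draw n vectors from MVN_dim(shift * 1, I): independent N(shift, 1) entries."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


def simulate_outcome(x, t, s, beta0, tau, bias, rng):
    """Binary outcome from the logistic model with beta equal to a vector of ones."""
    lin = beta0 + tau * t + sum(x) + bias * (1 - s)
    return 1 if rng.random() < expit(lin) else 0


def simulate_trial_and_rwd(w, beta0, tau, bias, n_rct=200, n_obs=1000, seed=0):
    rng = random.Random(seed)
    rows = []
    # RCT: 1:1 allocation via a shuffled assignment vector, covariates MVN_5(0, I)
    arms = [0] * (n_rct // 2) + [1] * (n_rct - n_rct // 2)
    rng.shuffle(arms)
    for t, x in zip(arms, simulate_covariates(n_rct, 0.0, rng)):
        y = simulate_outcome(x, t, 1, beta0, tau, bias, rng)
        rows.append({"x": x, "t": t, "s": 1, "z": 1, "y": y})
    # Observational controls: mixture of the RCT population and a shifted one
    for _ in range(n_obs):
        z = 1 if rng.random() < w else 0
        x = simulate_covariates(1, 0.0 if z == 1 else 1.5, rng)[0]
        y = simulate_outcome(x, 0, 0, beta0, tau, bias, rng)
        rows.append({"x": x, "t": 0, "s": 0, "z": z, "y": y})
    return rows


# beta0 = -1.2 is a placeholder value chosen only for illustration.
rows = simulate_trial_and_rwd(w=0.5, beta0=-1.2, tau=0.0, bias=0.0)
```

The membership indicator `z` is stored only so that simulated results can be checked against the truth; an analysis would treat it as unobserved for the observational rows.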

We design our simulation study to systematically evaluate the impact of two major sources of bias: (1) the proportion of observations from the observational data that are drawn from the RCT participant population by varying the mixing parameter between 0 and 1 and (2) unmeasured systematic differences between the RCT and observational data, by varying the bias term between –2 and 2. Varying the mixing parameter will result in population heterogeneity between the RCT and RWD, resulting in bias in the marginal event rate.

3.2 Evaluation criteria

We evaluate the methods using two metrics: (1) relative mean-squared error (relMSE) of the log OR, defined as the MSE of the borrowing approach divided by the MSE of the reference model, and (2) the effective supplemental sample size (ESSS; Hobbs and others, 2011). The ESSS for model $k$ is defined as $\text{ESSS}(k)=n_1\left(\frac{\text{Prec}(k)}{\text{Prec}(R)}-1\right)$, where $n_1$ is the sample size of the RCT, $\text{Prec}(k)$ is the posterior precision from model $k$, and $\text{Prec}(R)$ is the posterior precision from the reference model.
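Both metrics are simple ratios, as the following sketch shows. The function names are hypothetical, and the inputs are assumed to be posterior precisions of the treatment effect and per-replicate estimates of the log OR.

```python
def esss(n_rct, prec_model, prec_reference):
    """Effective supplemental sample size: n_1 * (Prec(k) / Prec(R) - 1)."""
    return n_rct * (prec_model / prec_reference - 1.0)


def mse(estimates, truth):
    """Mean-squared error of replicate estimates around the true value."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)


def rel_mse(est_model, est_reference, truth):
    """MSE of the borrowing approach divided by the MSE of the reference model."""
    return mse(est_model, truth) / mse(est_reference, truth)
```

For example, a borrowing model whose posterior precision is 1.5 times that of the reference model in a trial of 200 participants has an ESSS of 100, while equal precisions give an ESSS of 0.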

We estimate the PS using logistic regression with main effects for all covariates. We complete 3000 MCMC draws with 3000 burn-in draws for the SS-MIX step and 8000 RJMCMC draws with 1000 burn-in draws for the MEM step. For all other methods that use (RJ)MCMC, we perform 10 000 draws with 1000 draws as burn-in. We replicate each scenario 1000 times. Simulations were conducted using R version 4.0.4 (R Core Team, 2021).

3.3 Results

3.3.1 Scenario 1

Results from Scenario 1 are in Figure 2. In Scenario 1, we expect all borrowing methods to perform well. In this scenario, the mixing parameter is equal to 1 (i.e., the RCT and observational populations are identical) and bias is fixed at 0 so there are no residual between source differences. All borrowing approaches exhibit good performance with relMSE values below 1 and ESSS values greater than 0. As the log OR varies from –1 to 1, we observe an increase in the ESSS and a decrease in the relMSE for the borrowing approaches. This is due to an increased treatment arm event rate resulting in a larger variance for the OR and a greater benefit due to borrowing. As expected, SS-MIX–MEM and marginal MEMs with capping priors borrow less than the agnostic priors.

Fig. 2. Simulation results. Scenario 1: the “ideal scenario” where the RCT and observational populations are identical and there is no residual bias. Scenario 2: we vary the proportion of observational study individuals from the RCT population. Scenario 3: we introduce residual differences between the RCT and the representative subgroup.

3.3.2 Scenario 2

Results from Scenario 2 are in Figure 2. In this scenario, we vary the proportion of observational study individuals drawn from the RCT population, while fixing the log OR and bias terms at 0. We see that the marginal MEMs approach is inefficient, avoiding borrowing until the mixing proportion is near 1. The PS weights for the PSCL approach are unstable when the mixing parameter is at or near 0, so PSCL results for those values are not included in the graph. The SS-MIX–MEM approaches effectively borrow from the observational study, as shown by relMSE values below 1 and ESSS values above 0.

3.3.3 Scenario 3

Results for Scenario 3 are in Figure 2. In this scenario, we fix the log OR at 0 and the mixing proportion at 1 but vary the bias term to introduce residual between-source differences between the RCT data and the representative subgroup. The ESSS for the PSCL approach is well above zero regardless of the presence or absence of residual between-source differences, resulting in substantial bias. As the magnitude of the bias term increases, the relMSE for the PSCL approach increases exponentially, driven by increased bias. In contrast, both the SS-MIX–MEM and the marginal MEM approaches borrow when there is minimal bias, but ignore the observational data as the bias increases.

4 Application to FLU-IVIG and FLU 003

We illustrate the SS-MIX–MEM approach through an application to FLU-IVIG (Davey and others, 2019) and FLU 003 (Dwyer and INSIGHT Influenza Study Group, 2011). A subgroup analysis of FLU-IVIG indicated substantially different treatment effects for influenza subtype A and influenza subtype B. The subgroup analyses had limited sample sizes and lacked the precision that would typically be required to confirm a clinically meaningful effect size. By re-analyzing these data with the proposed method, we aim to improve the precision of the results.

FLU-IVIG enrolled 307 participants, with 156 randomized to the treatment arm and 151 randomized to the placebo arm. The FLU 003 study was accessed on May 26, 2020, and is composed of 1790 total individuals. FLU 003 did not administer any treatment, so its participants are not eligible for incorporation into the hIVIG arm of FLU-IVIG. FLU-IVIG enrolled 223 participants with influenza subtype A, with 114 in the treatment arm and 109 in the placebo arm, while FLU 003 enrolled 1502 participants with influenza subtype A. FLU-IVIG enrolled 84 participants with influenza subtype B, with 42 in the treatment arm and 42 in the placebo arm, while FLU 003 enrolled 288 participants with influenza subtype B. Our outcome of interest is a binary indicator of hospitalization or death at day 7. This is different from the FLU-IVIG primary outcome, a six-category ordinal outcome ranging from death to discharged from hospital with resumption of normal activities (Davey and others, 2019).

Table 1 illustrates that for both influenza A and B the FLU-IVIG and FLU 003 studies differ in many key baseline covariates. PS for the SS-MIX–MEM and PSCL approaches were estimated using GLiDeR (Koch and others, 2018). All baseline covariates that were equivalently measured by both FLU 003 and FLU-IVIG were considered, and the resulting PS model included age, race, days of symptoms, enrollment location, region, NEW score, and diabetes. The distributions of the logit-transformed PS are in the Supplementary Materials and indicate that the SS-MIX–MEM normality assumptions are reasonable.

Table 1. All values are measured at study baseline

Influenza subtype A:

| | FLU-IVIG† | FLU 003: All‡ | FLU 003: Z = 1§ | FLU 003: Z = 0§ |
|---|---|---|---|---|
| N | 223.00 | 1502.00 | 748.25 | 753.75 |
| PS | 0.22 | 0.12 | 0.19 | 0.05 |
| Hosp./Dead at day 7 | 0.25 | 0.23 | 0.25 | 0.20 |
| Age (years) | 58.51 | 58.23 | 58.34 | 58.11 |
| Days of symptoms | 3.54 | 3.87 | 3.70 | 4.04 |
| NEW score | 4.33 | 2.95 | 3.87 | 2.02 |
| Race: Black (%) | 18.83 | 7.26 | 12.46 | 2.12 |
| Enrollment location: ICU (%) | 12.11 | 7.32 | 10.56 | 4.12 |
| Diagnosed diabetes (%) | 27.80 | 19.64 | 20.69 | 18.60 |
| Current smoker (%) | 22.42 | 17.84 | 17.54 | 18.15 |
| Gender: M (%) | 49.78 | 45.87 | 43.30 | 48.45 |
| Region: Eur/Aus (%) | 18.39 | 53.06 | 18.70 | 87.27 |
| Region: Thailand (%) | 20.18 | 16.91 | 25.40 | 8.40 |

Influenza subtype B:

| | FLU-IVIG† | FLU 003: All‡ | FLU 003: Z = 1§ | FLU 003: Z = 0§ |
|---|---|---|---|---|
| N | 84.00 | 288.00 | 30.16 | 257.84 |
| PS | 0.38 | 0.18 | 0.35 | 0.16 |
| Hosp./Dead at day 7 | 0.21 | 0.26 | 0.25 | 0.26 |
| Age (years) | 54.12 | 58.36 | 54.46 | 58.83 |
| Days of symptoms | 3.74 | 4.36 | 3.91 | 4.42 |
| NEW score | 3.79 | 2.55 | 3.76 | 2.40 |
| Race: Black (%) | 17.86 | 6.60 | 17.30 | 5.32 |
| Enrollment location: ICU (%) | 9.52 | 5.21 | 7.35 | 4.94 |
| Diagnosed diabetes (%) | 19.05 | 19.44 | 20.04 | 19.39 |
| Current smoker (%) | 25.00 | 12.50 | 20.10 | 11.58 |
| Gender: M (%) | 33.33 | 46.53 | 39.59 | 47.41 |
| Region: Eur/Aus (%) | 16.67 | 55.56 | 18.84 | 60.07 |
| Region: Thailand (%) | 26.19 | 17.71 | 25.95 | 16.65 |

Notes: N, number of samples; PS, GLiDeR estimated PS; Hosp./Dead at day 7, proportion hospitalized/dead at day 7 after study enrollment; Days of symptoms, days of symptoms prior to study enrollment; NEW score, NEW score at study enrollment; Race, self-reported race categorized into black and non-black; Enrollment location, enrollment location either intensive care unit (ICU) or general hospital; Diagnosed diabetes, diagnosed diabetes at study enrollment; Current smoker, current smoker at study enrollment; Gender, gender at study enrollment either Male (M) or Female (F); Region, region of enrollment categorized as Europe/Australia (Eur/Aus), Thailand, or USA/Mexico/South America.

† RCT values are the mean or percent values of FLU-IVIG with treated and control groups pooled.

‡ Values are the mean or percent values of the full observational study.

§ Values are the means of the mean or percent values across all the SS-MIX step draws.

We again consider the SS-MIX–MEM approach, alongside the reference model, which only considers data from FLU-IVIG, a marginal MEM with both a capped and an agnostic prior specification, and the PSCL approach. For the PSCL, SS-MIX–MEM, and marginal MEM approaches with a capped prior specification, we set the effective maximum number of individuals to be borrowed as equal to the sample size of the FLU-IVIG control arm. For the SS-MIX–MEM approach, we use three chains of length $M_1=7000$ and $M_2=6000$, with 4000 burn-in draws for the SS-MIX step and 2000 burn-in draws for the MEM step. For all other sampling approaches we use 10 000 draws with 2000 burn-in draws. We assess convergence for MCMC draws via trace plots and the Gelman–Rubin diagnostic (Gelman and Rubin, 1992), whose values are below 1.002. Assessing convergence for RJMCMC is difficult due to the changing model space (Green and Hastie, 2009), so convergence was evaluated via trace plots.
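For readers unfamiliar with the Gelman–Rubin diagnostic, the sketch below computes the basic potential scale reduction factor from several chains of equal length. It follows the original between-chain and within-chain variance formulation; the function name is hypothetical, and the software used by the authors may apply a refined variant (for example with split chains or additional corrections).

```python
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor for a list of equal-length chains."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    w = mean([variance(c) for c in chains])  # within-chain variance
    b = n * variance(chain_means)            # between-chain variance
    var_plus = (n - 1) / n * w + b / n       # pooled estimate of posterior variance
    return (var_plus / w) ** 0.5
```

Values close to 1 indicate that the chains are sampling from similar distributions, whereas chains stuck in different regions yield values well above 1.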

4.1 Results

Table 1 presents baseline covariates for FLU-IVIG, the entire FLU 003 study, the representative subgroup (Z = 1) of FLU 003, and its complement, stratified by influenza subtype. For subtype A, there are baseline differences between FLU 003 and FLU-IVIG. For instance, the FLU 003 study has a lower mean NEW score and a lower percentage of black participants. The SS-MIX step separates FLU 003 participants into those included in the subgroup that is representative of the FLU-IVIG participant population and those that are not.

Figure 3 presents the subtype A results. All methods produce similar null results with 95% confidence/credible intervals (CIs) for the OR including 1. The SS-MIX–MEM and all other borrowing approaches improve the precision of the treatment effect estimate relative to the reference model. Based on the ESSS, borrowing via the SS-MIX–MEM and marginal MEMs with an agnostic prior, and via the PSCL, improved the efficiency of the influenza A subgroup analysis by an amount analogous to incorporating more than 60 individuals from the FLU 003 study into FLU-IVIG.

Fig. 3.

FLU-IVIG and FLU 003 application results. Subtype A: All methods produce similar null results with 95% confidence/credible intervals (CIs) for the OR including 1. For all borrowing approaches, ESSS values are greater than 0 and the 95% CIs are narrower than those of the reference model, indicating that we gain precision from borrowing. Subtype B: All methods have similar results, with OR estimates well below 1 and none of the 95% CIs for the OR including 1. The borrowing approaches do not leverage the FLU 003 data, as illustrated by ESSS values around 0.

Baseline differences were observed between FLU 003 and FLU-IVIG for influenza subtype B. The FLU 003 study has a lower percentage of black participants and current smokers. After the SS-MIX step, the mean PS, NEW score, and percentage of black participants of the representative subgroup are closer to the means for FLU-IVIG than the complement or overall population.

Figure 3 presents the results for subtype B. All methods have similar results, with OR estimates well below 1 and none of the 95% CIs for the OR including 1, indicating a clear benefit of treatment. The borrowing approaches do not leverage the FLU 003 data. The lack of borrowing can be explained by both baseline covariate differences and outcome differences, since both the outcome-blinded approach (i.e., the PSCL) and the dynamic borrowing approaches (i.e., the SS-MIX–MEM and marginal MEMs) avoid borrowing.

5 Extension to Both Arms

In some settings, supplemental data are available for both treatment arms. In this context, we must account for confounding due to baseline differences between the control and treated population in the observational data, in addition to population differences between the RCT and observational data, i.e., f(X|S=0,T=0) ≠ f(X|S=0,T=1). This can occur if there are systematic differences between the RCT and observational data for one or both of the treatment arms. We propose extending the SS-MIX–MEM approach by borrowing the treatment and control arm event rates separately, then deriving the posterior distribution of the treatment effect. If there is no systematic bias, then the balancing property of the PS accounts for both confounding and population heterogeneity as it is balancing the baseline covariates between the observational data and RCT in each arm. This will result in a treated arm representative subgroup from the observational study treated participants and a control arm representative subgroup from the observational study control participants. As the treatment and control arms in the RCT are not systematically different due to randomization, this will result in the same covariate distributions in the representative subgroups of the treated and control observational participants, thus mitigating bias from both confounding and population heterogeneity.

The posterior distribution of the treatment effect can be approximated via MCMC by calculating the summary of the treatment effect for each MCMC draw. The posterior derivations are in the Supplementary Materials.
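To make the idea of borrowing each arm separately and then summarizing the treatment effect draw by draw more concrete, here is a deliberately simplified Python sketch. It assumes a binary outcome, a single supplemental source per arm, conjugate Beta(1, 1) priors, and a two-model MEM (exchangeable versus not exchangeable) whose weights have a closed form. The event counts and the function name `mem_rate_draws` are hypothetical, and the sketch skips the SS-MIX step and the RJMCMC sampler used in the paper.

```python
import numpy as np
from math import lgamma, log, exp

def betaln(a, b):
    """Log of the Beta function."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def mem_rate_draws(y, n, y_ext, n_ext, rng, prior_exch=0.5, a=1.0, b=1.0, size=4000):
    """Draw an arm's event rate from a two-model MEM posterior (one external source)."""
    # Log marginal likelihoods; binomial coefficients cancel between the two models.
    log_m_exch = betaln(a + y + y_ext, b + (n - y) + (n_ext - y_ext)) - betaln(a, b)
    log_m_sep = (betaln(a + y, b + n - y) - betaln(a, b)
                 + betaln(a + y_ext, b + n_ext - y_ext) - betaln(a, b))
    log_w_exch = log(prior_exch) + log_m_exch
    log_w_sep = log(1.0 - prior_exch) + log_m_sep
    w_exch = exp(log_w_exch - np.logaddexp(log_w_exch, log_w_sep))
    # Each draw first picks a model, then samples from that model's Beta posterior.
    pool = rng.random(size) < w_exch
    a_post = a + y + np.where(pool, y_ext, 0)
    b_post = b + (n - y) + np.where(pool, n_ext - y_ext, 0)
    return rng.beta(a_post, b_post), w_exch

rng = np.random.default_rng(1)

# Hypothetical counts (events, participants): RCT arm, then observational subgroup.
p_trt, w_trt = mem_rate_draws(30, 50, 58, 100, rng)   # sources look consistent
p_ctl, w_ctl = mem_rate_draws(20, 50, 10, 100, rng)   # sources clearly differ

# Treatment effect computed per draw, then summarized.
log_or = np.log(p_trt / (1 - p_trt)) - np.log(p_ctl / (1 - p_ctl))
post_mean = float(log_or.mean())
ci_low, ci_high = np.percentile(log_or, [2.5, 97.5])
```

With these made-up counts, the treated arm receives a high exchangeability weight and pools with its supplemental source, whereas the control arm's weight is near zero and its posterior relies almost entirely on the RCT data.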

5.1 Simulation study

The simulation study set up is the same as in Section 3, with both treated and control supplemental data simulated. We induce confounding on the treatment effect by simultaneously varying the treatment and control arm mixture parameters between 0 and 1 and introduce residual between source bias for the treated individuals in the observational study as specified in Equation (3.5). We fix the log OR at 1 and the bias for the control observational data at 0. We compare the performance of the SS-MIX–MEM to marginal MEMs with agnostic priors, but exclude the PSCL approach because it is not easily extended to borrowing from both treatment and control arms.
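The role of the mixing parameter and the residual bias term can be sketched with a toy data generator in Python. Everything specific here is an assumption made for illustration: a single normal covariate, a shift of 1.5 for the complement population, a covariate coefficient of 0.5, and an additive bias on the log-odds scale. The actual data-generating model is the one given in Section 3 and Equation (3.5), which this sketch does not reproduce.

```python
import numpy as np

def simulate_observational_arm(n, mix, bias, base_log_odds, rng):
    """Toy generator for one observational arm (hypothetical distributions)."""
    # z is True when the individual is drawn from the RCT participant population.
    z = rng.random(n) < mix
    # Assumed covariate: N(0, 1) in the RCT population, N(1.5, 1) in its complement.
    x = rng.normal(np.where(z, 0.0, 1.5), 1.0)
    # Assumed outcome model: logistic in x, with a residual between-source bias.
    log_odds = base_log_odds + 0.5 * x + bias
    y = rng.random(n) < 1.0 / (1.0 + np.exp(-log_odds))
    return x, y.astype(int), z

rng = np.random.default_rng(2)

# Mixing parameter 1: every individual comes from the RCT population.
x1, y1, z1 = simulate_observational_arm(5000, mix=1.0, bias=0.0, base_log_odds=0.0, rng=rng)
# Mixing parameter 0: every individual comes from the complement.
x0, y0, z0 = simulate_observational_arm(5000, mix=0.0, bias=0.0, base_log_odds=0.0, rng=rng)
# Intermediate mixing with residual bias in the treated observational data.
xb, yb, zb = simulate_observational_arm(5000, mix=0.5, bias=2.0, base_log_odds=0.0, rng=rng)
```

Sweeping `mix` over a grid between 0 and 1 for each arm, and `bias` over a grid for the treated arm, would give a scenario layout of the kind described in the text.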

Results for Scenario 1 are shown in the left panel of Figure 4. In Scenario 1, we introduce confounding on the treatment effect by simultaneously varying the treatment and control arm mixture parameters between 0 and 1, and we fix the bias term for all observational study individuals at 0. The x-axis is the true mixing parameter in the treated arm, which goes from 0 to 1, that is, from the maximum amount of covariate-induced confounding (all the treated observational individuals are drawn from the complement of the RCT participant population) to no covariate-induced confounding (all the treated observational individuals are drawn from the RCT participant population). The different line types represent different true mixing parameter values in the control arm, varying from 0 to 1. Here, the SS-MIX–MEM approach borrows consistently from the representative subgroups of both the treated and control observational data, as illustrated by stable relMSE values well below 1 regardless of the true mixing parameter values. In contrast, the marginal MEMs approach has higher relMSE values than the SS-MIX–MEM, with comparable relMSE values only in the ideal scenario where both mixing parameters are 1.
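For readers who want to see how a relMSE curve of this kind could be computed from simulation output, the following Python sketch assumes relMSE is the mean squared error of a borrowing estimator divided by that of the reference (no-borrowing) estimator, so that values below 1 indicate a gain from borrowing. The formal definition is the one given in Section 3; the function name `rel_mse` and the simulated estimates below are hypothetical.

```python
import numpy as np

def rel_mse(estimates, reference_estimates, truth):
    """MSE of a borrowing estimator relative to the reference estimator (assumed definition)."""
    estimates = np.asarray(estimates, dtype=float)
    reference_estimates = np.asarray(reference_estimates, dtype=float)
    mse = np.mean((estimates - truth) ** 2)
    mse_ref = np.mean((reference_estimates - truth) ** 2)
    return float(mse / mse_ref)

# Small hand-checkable case: squared errors of 0.01 versus 0.04.
tiny = rel_mse([1.1, 0.9], [1.2, 0.8], truth=1.0)

# Synthetic replicates around a true log OR of 1 (the value fixed in the simulation).
rng = np.random.default_rng(3)
true_log_or = 1.0
reference = rng.normal(true_log_or, 0.40, size=2000)   # assumed spread without borrowing
borrowing = rng.normal(true_log_or, 0.30, size=2000)   # assumed spread with borrowing
ratio = rel_mse(borrowing, reference, true_log_or)
```

Repeating the calculation at each combination of mixing parameters would produce one point per combination on a plot like the left panel of Figure 4.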

Fig. 4.

Borrowing on both arms simulations results. Scenario 1: there is no bias introduced, confounding is introduced via the treated and control arms mixing parameters, which both vary between 0 and 1. Note that the x-axis is the true mixing parameter of the treated individuals in the observational data and the linetype is the true mixing parameter of the control individuals in the observational data. Scenario 2: the mixing parameter for the treated and control arms is either both 0.5 or both 1, residual bias varies between –2 and 2 for the treated arm observational data, the control arm has no residual bias. Note that the x-axis is the log OR between the treated individuals in the observational data and the treated individuals in the trial.

Results for Scenario 2 are shown in the right panel of Figure 4. In Scenario 2, we introduce bias on the treatment effect by varying the residual bias term for the treated observational data between –2 and 2. (Note: the x-axis is now the log OR between the treated individuals in the observational data and the treated individuals in the trial.) We investigate mixing parameter values of 0.5 and 1 for both the treated and control observational data. As expected, when both mixing parameters are 1, the SS-MIX–MEM and marginal MEM approaches behave almost identically and only borrow from the treated observational data when there is minimal residual bias, ignoring the treated observational data as the bias increases. When the mixing parameters are both 0.5, the SS-MIX–MEM performs similarly to when they are both 1, while the marginal MEMs approach is inefficient, with minimal borrowing.

6 Discussion

We proposed SS-MIX–MEM, a flexible two-step Bayesian approach for incorporating RWD into an RCT analysis while accounting for population heterogeneity, confounding, and residual between source differences. This approach implements two layers of protection from the introduction of bias. The first step accounts for population heterogeneity and confounding via the PS, which is similar to the PSCL approach. In the second step, we account for unmeasured systematic differences between the RWD and RCT data using dynamic borrowing via MEMs, which is a fundamental difference between our approach and the PSCL approach. In simulations, we see that the SS-MIX–MEM is able to effectively leverage a subgroup of the RWD for RCT analyses, increasing precision of treatment effect estimates, while introducing minimal bias.

There are limitations to the method. We assume that all outcome and baseline covariate measurements are equivalently defined, measured, and complete across both the RCT and RWD. This is a stringent assumption that will often not be met in practice, as the RWD and RCT are likely to differ in many respects, including outcome and covariate definitions, durations of follow-up, and quality of data. Future work will involve extending the proposed approach to account for these differences between data sources. We chose to dynamically borrow via MEMs, but Step 2 could be implemented using other approaches to borrowing, such as commensurate priors (Hobbs and others, 2011) or power priors (Ibrahim and Chen, 2000).

While RCTs have strong internal validity, they may lack generalizability to the target population, or transportability to other populations of interest to policy-makers and medical guidelines committees (Dahabreh and others, 2020). Methods for transferring RCT results to a target population typically make a set of assumptions about the similarity and differences between the target population and the RCT (Bareinboim and Pearl, 2013). In contrast, the proposed approach estimates the ATE among the RCT participant population. Future work could involve extending the proposed approach to allow for transporting or generalizing the RCT results to a different target population following the techniques of Stuart and others (2018). Another extension could be to use the g-formula in place of the SS-MIX step to estimate the causal effect of interest within the targeted population and then dynamically borrow using an extension of established approaches.

Non-compliance is a common occurrence in RCTs and it is important to think carefully about the effect of interest when analyzing an RCT in the presence of non-compliance. In most cases, and in our case, in particular, researchers are interested in estimating the intention-to-treat (ITT) effect. Similarly, investigators must think carefully about the information provided by the RWD. In many cases, RWD will provide information about treatment prescribed, which is potentially subject to non-compliance, but in other cases RWD may provide information about treatment received. In our motivating data, the treatment is a 2-h infusion of hIVIG administered in the hospital shortly after randomization. In this case, we expect minimal non-compliance and only a small difference between the ITT and per protocol effect. Observational data are only available for the control arm and we expect these patients to receive treatment similar to the standard of care received in the control arm of the trial. That said, a key advantage of our method is that we are not required to assume the RCT and RWD are consistent; rather, using MEMs we evaluate whether the effect in the two data sources is consistent and only borrow when the effects are suitably similar to support borrowing.

Our methodological development was primarily motivated by the need to leverage RWD for more precise estimates in the secondary analysis of subgroups. Alternatively, investigators could use RWD to increase precision for the primary analysis, in which case the sample size may need to be altered to reflect the potential for borrowing from RWD. In this case, we recommend powering the trial assuming no borrowing, and using group sequential dynamic borrowing methods to stop the trial early if a definitive conclusion is reached (Murray and others, 2021).

Due to recent technological advancements and increases in widespread availability of RWD sources, RWD are becoming increasingly important for healthcare regulation (FDA, 2020). Yet, RCTs remain the gold standard for estimating the causal effect of an intervention on an outcome. RWD have limitations due to the lack of randomization and cannot be viewed as a replacement for RCTs, but as a data source that can be potentially leveraged to improve efficiency. Incorporating external non-randomized data into the analysis of an RCT potentially threatens the internal validity provided by randomization, thus introducing bias into the treatment effect estimate. This magnifies the importance of developing statistical approaches that leverage RWD, while maintaining the strong internal validity of RCTs.

Software

Software in the form of R code can be found at: https://github.com/lillianhaine/GitCode.

Supplementary Material

kxad024_Supplementary_Data

Acknowledgments

The authors acknowledge Medtronic, Inc. for their support of the research reported in this publication through a Faculty Fellowship afforded to the second author. They acknowledge the Minnesota Supercomputing Institute at the University of Minnesota for providing resources that contributed to the research results reported within this article.

Contributor Information

Lillian M F Haine, Division of Biostatistics, University of Minnesota, Minneapolis, MN, 55414, USA.

Thomas A Murray, Division of Biostatistics, University of Minnesota, Minneapolis, MN, 55414, USA.

Raquel Nahra, Cooper Medical School of Rowan University and Medicine, Division of Infectious Diseases, Cooper University Hospital, Camden, New Jersey, 08103, USA.

Giota Touloumi, Department of Hygiene, Epidemiology and Medical Statistics, Medical School, National & Kapodistrian University of Athens, 11527 Athens, Greece.

Eduardo Fernández-Cruz, Department of Immunology, Internal Medicine, and Pathology, Hospital General, Universitario Gregorio Marañón, Madrid, 28007, Spain.

Kathy Petoumenos, The Kirby Institute, University of New South Wales, Sydney, 2052, Australia.

Joseph S Koopmeiners, Division of Biostatistics, University of Minnesota, Minneapolis, MN, 55414, USA.

Supplementary Material

Supplementary material is available at http://biostatistics.oxfordjournals.org.

Funding

This research was supported in part by the National Institutes of Health (NIH). Funding for the influenza studies was from subcontract 13XS134 under Leidos Biomed’s Prime Contract HHSN261200800001E and HHSN-261201500003I, NCI/NIAID and Subcontract 18X107C under Leidos Biomed’s Prime Contract HHSN261200800001E, NIH. See Lancet Respir Med 2019; doi:10.1016/S2213-2600(19)30253-X (Supplementary Appendix) for a complete list of FLU-IVIG investigators and Open Forum Infect Dis 2018; 5(1):ofx228, https://doi.org/10.1093/ofid/ofx228 for a complete list of FLU 003 investigators. Research reported in this publication was supported in part by the National Heart, Lung, and Blood Institute (NHLBI) of the NIH under award number T32HL129956 and in part by the NIH’s National Center for Advancing Translational Sciences (NCATS), grants TL1R002493 and UL1TR002494. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH’s NCATS nor the NIH’s NHLBI.

Conflict of Interest

None declared.

References

  1. Bareinboim E., Pearl J. (2013). A general algorithm for deciding transportability of experimental results. Journal of Causal Inference 1, 107–134.
  2. Boatman J. A., Vock D. M., Koopmeiners J. S. (2021). Borrowing from supplemental sources to estimate causal effects from a primary data source. Statistics in Medicine 40, 5115–5130.
  3. Chen W. C., Wang C., Li H., Lu N., Tiwari R., Xu Y., Yue L. Q. (2020). Propensity score-integrated composite likelihood approach for augmenting the control arm of a randomized controlled trial by incorporating real-world data. Journal of Biopharmaceutical Statistics 30, 508–520.
  4. Wang C. (2021). psrwe: PS-integrated methods for incorporating RWE in clinical studies. R package version 2.2. https://github.com/olssol/psrwe.
  5. Chib S. (1996). Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75, 79–97.
  6. Dahabreh I. J., Robertson S. E., Steingrimsson J. A., Stuart E. A., Hernán M. A. (2020). Extending inferences from a randomized trial to a new target population. Statistics in Medicine 39, 1999–2014.
  7. Davey R. T. Jr, Fernández-Cruz E., Markowitz N., Pett S., Babiker A. G., Wentworth D., Khurana S., Engen N., Gordin F., Jain M. K., and others. (2019). Anti-influenza hyperimmune intravenous immunoglobulin for adults with influenza A or B infection (FLU-IVIG): a double-blind, randomised, placebo-controlled trial. The Lancet Respiratory Medicine 7, 951.
  8. Dwyer D. E.; INSIGHT Influenza Study Group. (2011). Surveillance of illness associated with pandemic (H1N1) 2009 virus infection among adults using a global clinical site network approach: the INSIGHT FLU 002 and FLU 003 Studies. Vaccine 29, B56.
  9. FDA. (2020). US food and drug administration, real-world evidence. https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence [accessed 2020 May 5].
  10. Gelman A., Rubin D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7, 457–472. http://www.jstor.org/stable/2246093.
  11. Green P. J., Hastie D. I. (2009). Reversible jump MCMC. Technical Report.
  12. Hariton E., Locascio J. J. (2018). Randomised controlled trials – the gold standard for effectiveness research: study design: randomised controlled trials. BJOG 125, 1716.
  13. Hobbs B. P., Carlin B. P., Mandrekar S. J., Sargent D. J. (2011). Hierarchical commensurate and power prior models for adaptive incorporation of historical information in clinical trials. Biometrics 67, 1047–1056.
  14. Ibrahim J. G., Chen M. H. (2000). Power prior distributions for regression models. Statistical Science 15, 46–60.
  15. Kaizer A. M., Hobbs B. P., Koopmeiners J. S. (2018a). A multi-source adaptive platform design for testing sequential combinatorial therapeutic strategies. Biometrics 74, 1082–1094.
  16. Kaizer A. M., Koopmeiners J. S., Hobbs B. P. (2018b). Bayesian hierarchical modeling based on multisource exchangeability. Biostatistics 19, 169–184.
  17. Koch B., Vock D. M., Wolfson J. (2018). Covariate selection with group lasso and doubly robust estimation of causal effects. Biometrics 74, 8–17.
  18. Kotalik A., Vock D. M., Donny E. C., Hatsukami D. K., Koopmeiners J. S. (2021). Dynamic borrowing in the presence of treatment effect heterogeneity. Biostatistics 22, 789–804.
  19. Ling S. X., Hobbs B. P., Kaizer A. M., Koopmeiners J. S. (2021). Calibrated dynamic borrowing using capping priors. Journal of Biopharmaceutical Statistics 31, 852–867.
  20. Morgan S., Grootendorst P., Lexchin J., Cunningham C., Greyson D. (2011). The cost of drug development: a systematic review. Health Policy 100, 4–17.
  21. Murray T. A., Thall P. F., Schortgen F., Asfar P., Zohar S., Katsahian S. (2021). Robust adaptive incorporation of historical control data in a randomized trial of external cooling to treat septic shock. Bayesian Analysis 16, 825–844. 10.1214/20-BA1229.
  22. R Core Team. (2021). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
  23. Rosenbaum P. R., Rubin D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55.
  24. Rubin D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66, 688–701.
  25. Rubin D. B. (1980). Randomization analysis of experimental data: the Fisher randomization test comment. Journal of the American Statistical Association 75, 591.
  26. Rubin D. B. (2005). Causal inference using potential outcomes: design, modeling, decisions. Journal of the American Statistical Association 100, 322–331. http://www.jstor.org/stable/27590541.
  27. Stuart E. A., Ackerman B., Westreich D. (2018). Generalizability of randomized trial results to target populations: design and analysis possibilities. Research on Social Work Practice 28, 532–537.
  28. Viele K., Berry S., Neuenschwander B., Amzal B., Chen F., Enas N., Hobbs B., Ibrahim J. G., Kinnersley N., Lindborg S., and others. (2014). Use of historical control data for assessing treatment effects in clinical trials. Pharmaceutical Statistics 13, 41–54. https://doi-org.ezp1.lib.umn.edu/10.1002/pst.1589.
  29. Wang C., Li H., Chen W. C., Lu N., Tiwari R., Xu Y., Yue L. Q. (2019). Propensity score-integrated power prior approach for incorporating real-world evidence in single-arm clinical studies. Journal of Biopharmaceutical Statistics 29, 731–748.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
