Statistics in Medicine. 2025 Sep 10;44(20-22):e70236. doi: 10.1002/sim.70236

Accounting for Misclassification of Binary Outcomes in External Control Arm Studies for Unanchored Indirect Comparisons: Simulations and Applied Example

Mikail Nourredine 1,2,3, Antoine Gavoille 1,2, Côme Lepage 4,5, Behrouz Kassai‐Koupai 2,3, Michel Cucherat 6, Fabien Subtil 1,2
PMCID: PMC12422847  PMID: 40930536

ABSTRACT

Single-arm trials are increasingly proposed as a potential approach for treatment evaluation. However, the limitations of this design restrict its methodological acceptability. Regulatory agencies have raised concerns about this approach, although they are sometimes required to evaluate applications based solely on such studies. Consequently, the need for accurate indirect treatment comparisons has become critical, especially when external control arms are constructed from routinely collected data, in which outcome measurements may differ from those recorded in the single-arm trial, leading to potential misclassification of outcomes. This study aimed to quantify, through simulations, the bias from ignoring misclassification of a binary outcome within unanchored indirect comparisons, and to propose a likelihood-based method to correct this bias (i.e., the outcome-corrected model). Simulations demonstrated that ignoring misclassification results in substantial bias and poor coverage probabilities. In contrast, the outcome-corrected model reduced bias and improved the 95% confidence interval coverage probability and the root mean square error in various scenarios. The methodology was applied to two hepatocellular carcinoma trials, illustrating a practical application. The findings underscore the importance of addressing outcome misclassification in indirect comparisons. The proposed correction method may improve the reliability of unanchored indirect treatment comparisons.

Keywords: external control group, indirect treatment comparison, single-arm study, measurement error, misclassification

1. Introduction

Randomized controlled trials (RCTs) are the gold standard for assessing experimental interventions. Nevertheless, practical or ethical limitations may prevent the implementation of an RCT, resulting in the adoption of single-arm trials, in which all participants are administered the same treatment. Regulatory agencies have raised concerns about this approach [1], although they are sometimes required to evaluate applications based solely on such studies. The Food and Drug Administration is increasingly granting approvals based on the findings of single-arm trials [2], especially in the field of oncology [3, 4]. In addition, the molecular fragmentation of cancers means that more and more subgroups need to be considered, making Phase III trials very difficult to conduct. Single-arm trials require an External Control Arm (ECA) to estimate the treatment effect [5], and are part of the unanchored indirect comparison framework [6]. Studies have shown that indirect comparisons in health technology assessments are predominantly unanchored [7, 8]. For example, a new drug might be tested in a single-arm trial, with the ECA constructed from the standard of care in Electronic Medical Records (EMRs) or a previous clinical trial. When Individual Patient Data (IPD) are available for both studies, causal effects can be estimated using methods for population-adjusted indirect comparisons (PAICs) based on a frequentist approach, such as g-computation, inverse-probability treatment weighting, and targeted maximum likelihood estimation [9, 10, 11, 12]. When researchers only have access to IPD from the single-arm trial and aggregate data (AgD) from the other source, matching-adjusted indirect comparison (MAIC) [13] and simulated treatment comparison (STC) [14] have been proposed. MAIC is based on propensity score weighting, and STC on outcome regression adjustment [6]. STC and MAIC have been evaluated in anchored [15, 16, 17, 18] and unanchored [19, 20] indirect comparisons, without a clear conclusion as to whether MAIC or STC performs better.

PAIC methods can account for imbalance in patient characteristics, but not for differences in outcome measurement between the two studies [21]. Therefore, a commonly made assumption when estimating indirect treatment effects is that there is no difference in the outcome measurement between the single-arm trial and the ECA. However, in routinely collected data (i.e., real-world data), for example, outcome measures may differ from those recorded in clinical trials [22, 23]. A proxy measure Y*, evaluated in the ECA, can substitute for the reference measure Y evaluated in the single-arm trial. Both outcomes Y and Y* may measure the same concept, with one predicting the other on a different scale and possibly at a different level. For example, in a single-arm trial evaluating a new immunotherapy for cancer, radiological progression Y might be assessed using RECIST criteria [24], while the ECA might use clinical progression, Y*, collected in EMRs or a previous clinical trial. Similarly, various scales may be used to assess depression in routinely collected data. Misclassification of the outcome can bias the treatment effect either towards or away from the null value [25, 26, 27, 28]. Thus, to estimate an unbiased indirect treatment effect, one must first address the proxy outcome's misclassification. Misclassification in a binary outcome can be quantified by estimating sensitivity and specificity using ancillary studies [26]. For instance, validation studies involve measuring both the reference outcome Y and its proxy Y* in the same individuals [26, 27]. When conducted within the same population (i.e., the ECA), these are termed internal validation studies; they are external when Y and Y* are assessed in a third study. Methods have already been developed for incorporating validation study data into outcome regression estimation using a Bayesian framework [29, 30] or a parametric frequentist perspective; we refer the reader to Carroll et al. [25] for general expressions with likelihood-based methods and to Lyles et al. [31] for implementation examples. However, these methods have not yet been employed in the context of unanchored indirect comparisons.

The present study examines unanchored indirect comparisons involving two studies: one with AgD or IPD for the experimental treatment, assessing a binary reference outcome Y, and another with IPD for the control treatment, potentially derived from EMR data or a previous clinical trial, which uses a proxy outcome Y*. The objectives are to quantify the bias from ignoring misclassification of a binary outcome in unanchored indirect comparisons and to propose a method to correct this bias (i.e., the outcome-corrected model). We first describe the indirect comparison framework and provide an overview of indirect treatment effect estimation; subsequently, the outcome-corrected model is introduced. Methods with and without correction for misclassification in the binary outcome were evaluated in a simulation analysis and in a practical example using real data from two trials investigating advanced hepatocellular carcinoma: the SHARP trial [32], which evaluated the effect of sorafenib versus placebo on radiological progression measured using RECIST, and the PRODIGE-11 trial [33], which evaluated the effect of sorafenib and pravastatin combined versus sorafenib on clinical progression.

2. Outcome Model for Indirect Comparison

2.1. Notations and Data Structures

There are three studies indexed by S (Table 1): S=1, the study that includes either AgD or IPD, labeled study s1; S=2, the ECA study that includes IPD, labeled study s2; and S=3, the validation study used to estimate the outcome measurement error model, labeled study s3. The outcome Y is the reference binary outcome and Y* is the proxy binary outcome. The Xp are prognostic and effect modifier covariates. Two treatments A and B are considered; the patients from study s1 received experimental treatment A, and the patients from study s2 received control treatment B.

TABLE 1.

Data structure with an external validation study s3.

Study set              S      Prognostic and effect modifier covariates   Treatment   Error-free outcome   Error-prone outcome
Single-arm trial       S = 1  X                                           A           Y                    —
External control arm   S = 2  X                                           B           —                    Y*
Validation study       S = 3  —                                           —           Y                    Y*

2.2. Unanchored Indirect Comparisons Framework

For a binary outcome, the form of indirect comparisons is as follows [6]: within a target population S, the effect of treatment A compared to treatment B, dAB(S), is calculated as the difference in the log odds between each treatment:

$d_{AB}(S) = \operatorname{logit} \mu_A(S) - \operatorname{logit} \mu_B(S)$

The estimand of interest is the population‐average treatment effect of A versus B in the targeted population s1: dAB(S=1), along with its estimate:

$\hat{d}_{AB}(S=1) = \operatorname{logit} \hat{\mu}_A(S=1) - \operatorname{logit} \hat{\mu}_B(S=1) \quad (1)$

The challenge lies in estimating $\operatorname{logit} \hat{\mu}_B(S=1)$, the predicted log odds of the outcome under treatment B in the target population s1, as the participants in s1 received only treatment A.

2.3. Indirect Comparisons Without Measurement Error

Several methods can be used to estimate the indirect treatment effect from Equation (1) [9, 10, 11, 12]. The g-computation method involves fitting an outcome model using the IPD from study s2. This model then uses the prognostic and effect modifier variables from the study s1 population to predict the marginal outcome value under treatment B in the target population s1 [11, 12].

Now, consider a conditional indirect treatment effect as follows:

$d_{AB}^{cond}(S) = \operatorname{logit} \mu_A(S)\,|\,X(S) - \operatorname{logit} \mu_B(S)\,|\,X(S)$, where $X(S)$ is a set of prognostic covariates in population S. To predict $\operatorname{logit} \mu_B(S=1)\,|\,X(S=1)$, the conditional log odds of outcome Y under treatment B in the target population s1, an outcome regression model is fitted to the IPD from study s2:

$\operatorname{logit}\big(Y_{i,B}^{(S=2)}\big) = \beta_0 + \sum_{p=1}^{P} \beta_p \big(X_{i,p}^{(S=2)} - \bar{X}_p^{(S=1)}\big) \quad (2)$

where $Y_{i,B}^{(S=2)}$ is the binary outcome value for patient i from study s2, and $X_{i,p}^{(S=2)}$ denotes the pth prognostic covariate, along with its associated coefficient $\beta_p$. The prediction is centred on the mean covariate value of study s1, $\bar{X}_p^{(S=1)}$. By doing so, $\hat{\beta}_0$ is interpreted as the predicted conditional log odds of outcome Y under treatment B for an average patient sampled from the target population s1, $\operatorname{logit} \hat{\mu}_B(S=1)\,|\,\bar{X}_p^{(S=1)}$.

In the case of AgD for s1, $\operatorname{logit} \mu_A(S=1)$, the marginal log odds of outcome Y under treatment A in the target population s1, is derived from the reported summary estimate in a published article. However, it is unlikely that the conditional estimate (i.e., $\operatorname{logit} \hat{\mu}_A(S=1)\,|\,\bar{X}_p^{(S=1)}$) is available in a published article. Conversely, when IPD are available for study s1, a second outcome model is fitted to the data of s1 with the same prognostic covariates:

$\operatorname{logit}\big(Y_{i,A}^{(S=1)}\big) = \alpha_0 + \sum_{p=1}^{P} \alpha_p \big(X_{i,p}^{(S=1)} - \bar{X}_p^{(S=1)}\big) \quad (3)$

where $\hat{\alpha}_0$ is the estimated conditional log odds of outcome Y under treatment A for an average patient sampled from the target population s1, $\operatorname{logit} \hat{\mu}_A(S=1)\,|\,\bar{X}_p^{(S=1)}$, consistent with $\hat{\beta}_0$, which also has a conditional interpretation.

Finally, a conditional population-adjusted treatment effect is estimated using Equation (1), with $\operatorname{logit} \hat{\mu}_B(S=1)\,|\,\bar{X}_p^{(S=1)} = \hat{\beta}_0$, and with $\operatorname{logit} \hat{\mu}_A(S=1)\,|\,\bar{X}_p^{(S=1)} = \hat{\alpha}_0$ (IPD for study s1) or the log odds of the outcome reported, for example, in a published article (AgD for study s1).
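For illustration, a minimal R sketch of this estimation in the AgD case is given below; the simulated data set, the covariate means, and the published log odds (s2_ipd, xbar_s1, logodds_A_s1) are hypothetical placeholders introduced for this example, not objects from the article's analyses.

# Minimal sketch (toy data) of the conditional indirect comparison in the
# AgD case (Section 2.3); all objects and values are assumptions for
# illustration, not data from the applied example.
set.seed(1)

# Hypothetical IPD for the external control arm (study s2)
s2_ipd <- data.frame(x1 = rnorm(300, 1, 0.5),
                     x2 = rbinom(300, 1, 0.4))
s2_ipd$y <- rbinom(300, 1, plogis(-1 + 0.8 * s2_ipd$x1 + 0.5 * s2_ipd$x2))

# Hypothetical covariate means reported for the target population (study s1)
xbar_s1 <- c(x1 = 0.8, x2 = 0.3)

# Outcome model in study s2 with covariates centred on the study-s1 means
# (Equation 2): the intercept is the conditional log odds under treatment B
# for an average patient of the study-s1 population.
fit_B <- glm(y ~ I(x1 - xbar_s1["x1"]) + I(x2 - xbar_s1["x2"]),
             family = binomial, data = s2_ipd)
logodds_B_s1 <- unname(coef(fit_B)[1])

# Marginal log odds under treatment A in study s1, taken here from a
# hypothetical published aggregate result
logodds_A_s1 <- qlogis(0.35)

# Indirect treatment effect on the log odds ratio scale (Equation 1)
d_AB <- logodds_A_s1 - logodds_B_s1
exp(d_AB)  # odds ratio of A versus B in the study-s1 population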

The variance $s_{AB(S=1)}^2$ of $d_{AB}^{cond}(S=1)$ is:

$s_{AB(S=1)}^2 = \operatorname{var}\big[\operatorname{logit} \hat{\mu}_A(S=1)\,|\,\bar{X}_p^{(S=1)} - \operatorname{logit} \hat{\mu}_B(S=1)\,|\,\bar{X}_p^{(S=1)}\big] = s_{A(S=1)}^2 + s_{B(S=1)}^2 - 2\,\operatorname{cov}\big[\operatorname{logit} \hat{\mu}_A(S=1)\,|\,\bar{X}_p^{(S=1)},\ \operatorname{logit} \hat{\mu}_B(S=1)\,|\,\bar{X}_p^{(S=1)}\big] \quad (4)$

where $s_{A(S=1)}^2$ and $s_{B(S=1)}^2$ are the variances of $\operatorname{logit} \hat{\mu}_A(S=1)\,|\,\bar{X}_p^{(S=1)}$ and $\operatorname{logit} \hat{\mu}_B(S=1)\,|\,\bar{X}_p^{(S=1)}$, respectively. The covariance term in Equation (4) arises because the values of the covariates X in study s1 are used in both outcome models: in study s1 (Equation 3) to predict $\alpha_0$, and in study s2 (Equation 2) to predict $\beta_0$. More precisely, for the pth covariate, the value of $\bar{X}_p^{(S=1)}$ is used. In the case of IPD availability for study s1, a likelihood-based method is proposed to account for this correlation (Section 2.5). When only AgD are available for study s1, the covariance term cannot be evaluated, and $s_{B(S=1)}^2$ is estimated using the model variance in Equation (2) or the bootstrap [6, 15].

2.4. Indirect Comparison With Measurement Error

Now consider the case in which only a proxy outcome $Y_i^*$ is available for study s2. This will bias the estimates of $\beta_0$ and $\beta_p$ from Equation (2):

$\operatorname{logit}\big(Y_{i,B}^{*(S=2)}\big) = \beta_0^* + \sum_{p=1}^{P} \beta_p^* \big(x_{i,p}^{(S=2)} - \bar{x}_p^{(S=1)}\big)$

where $\beta_0^*$ and $\beta_p^*$ are biased estimates of $\beta_0$ and $\beta_p$. The magnitude and direction of the bias depend upon the diagnostic properties of $Y_i^*$ as a substitute for $Y_i$ [26]. Misclassification in a binary outcome can be expressed in terms of the sensitivity $Se = P(Y^*=1\,|\,Y=1)$ and the specificity $Sp = P(Y^*=0\,|\,Y=0)$. Misclassification may be differential if the specificity and sensitivity depend on covariates, or non-differential if they do not [26]. The impact of non-differential misclassification is that $\beta_0^*$ is closer to the null than $\beta_0$ [26]. However, if the misclassification is differential, the bias can be either away from or towards the null value [26].
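To fix ideas, the short R snippet below, with assumed illustrative values (not taken from the simulations or the applied example), shows how a non-differential proxy inflates the apparent event probability in study s2, pulls its log odds towards the null, and thereby distorts the indirect effect:

# Illustration with assumed values: effect of non-differential misclassification
# in study s2 on the log odds used for the indirect comparison.
se <- 0.9; sp <- 0.7              # diagnostic properties of the proxy Y*
p_B_s2 <- 0.20                    # true outcome probability under B in study s2
p_A_s1 <- 0.15                    # outcome probability under A in study s1 (reference outcome)

# Observed probability of the proxy: P(Y* = 1) = Se * P(Y = 1) + (1 - Sp) * P(Y = 0)
p_B_proxy <- se * p_B_s2 + (1 - sp) * (1 - p_B_s2)

d_true  <- qlogis(p_A_s1) - qlogis(p_B_s2)     # indirect effect with the true outcome
d_naive <- qlogis(p_A_s1) - qlogis(p_B_proxy)  # indirect effect ignoring misclassification

c(logodds_B_true  = qlogis(p_B_s2),      # about -1.39
  logodds_B_proxy = qlogis(p_B_proxy),   # about -0.32, pulled towards the null
  d_true = d_true, d_naive = d_naive)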

2.5. Outcome‐Corrected Model

The proposed outcome-corrected model is designed to account for misclassification in binary outcomes when comparing treatments between a single-arm trial and an ECA. The first step is to correct the biased estimates $\beta_0^*$, $\beta_p^*$; the corrected estimates are then employed to perform the indirect treatment comparison. Let the outcome measurement error model be:

$\operatorname{logit} P(Y^*=1\,|\,Y=y, Z_l) = \gamma_0 + \gamma_1 Y + \sum_{l=2}^{L} \gamma_l Z_l \quad (5)$

where the $Z_l$ are covariates inducing a differential measurement error, which form a subset of the $X_p$.

For patients with Y = 1, the sensitivity is

$Se = P(Y^*=1\,|\,Y=1, Z_l=z_l) = \dfrac{1}{1+\exp\big[-\big(\gamma_0+\gamma_1+\sum_{l=2}^{L}\gamma_l z_l\big)\big]}$

and for patients with Y = 0, the specificity is

$Sp = P(Y^*=0\,|\,Y=0, Z_l=z_l) = 1 - \dfrac{1}{1+\exp\big[-\big(\zeta_0+\sum_{l=2}^{L}\zeta_l z_l\big)\big]}$

Note that if $\gamma_l = \zeta_l = 0$ for $l \geq 2$, the outcome measurement error is non-differential, with $Se = P(Y^*=1\,|\,Y=1)$ and $Sp = P(Y^*=0\,|\,Y=0)$. For simplicity, from now on, a single continuous prognostic variable X and a non-differential measurement error will be assumed. To correct the biased estimates $\beta_0^*$, $\beta_p^*$, information on the parameters of the measurement error model is required. Often, there is a lack of a priori information about measurement errors, necessitating ancillary studies to estimate the parameters of the outcome measurement error model (Equation 5). The STRATOS guidance identifies different types of ancillary studies [26]. The following section considers an external validation study in which both the proxy outcome Y* and the reference outcome Y are available in a third study s3 (Table 1).

According to Lyles et al. [31], using Carroll et al.'s [25] general likelihood, an external validation study can be used to correctly estimate $\beta_0$ and $\beta_p$ from Equation (2). The external validation study is used to estimate the joint distribution $P(Y_j, Y_j^*)$. Assuming there are $j=1,\ldots,n_{s3}$ patients in the external validation study s3, each providing data $(y_j^*, y_j)$, and assuming a non-differential measurement error, the individual likelihood $L_{j,s3}$ for the external validation study s3 is:

$L_{j,s3} = P(Y_j^*=y_j^*\,|\,Y_j=y_j) \times P(Y_j=y_j)$

Using $p_j = P(Y_j=1)$,

$L_{j,s3} = \big(Se \times p_j\big)^{y_j^* y_j} \times \big[(1-Sp)(1-p_j)\big]^{y_j^*(1-y_j)} \times \big[(1-Se)\,p_j\big]^{(1-y_j^*)\,y_j} \times \big[Sp\,(1-p_j)\big]^{(1-y_j^*)(1-y_j)}$

The likelihood for the external validation study s3 is $L_{s3} = \prod_{j=1}^{n_{s3}} L_{j,s3}$. Note that when an internal validation study s3 is used, the individual likelihood $L_{j,s3}$ keeps the same form as described above [31].

Then, study s2 is used to estimate the conditional probability $P(Y_i=y_i\,|\,X_i=x_i)$. Assuming there are $i=1,\ldots,n_{s2}$ patients in study s2, each providing data $(y_i^*, x_i)$, and assuming a non-differential outcome measurement error (i.e., $\gamma_l = \zeta_l = 0$ for $l \geq 2$), the individual likelihood $L_{i,s2}$ for study s2 is:

$L_{i,s2} = P(Y_i^*=y_i^*\,|\,X_i=x_i) = \sum_{y_i=0}^{1} P(Y_i^*=y_i^*\,|\,Y_i=y_i, X_i=x_i)\, P(Y_i=y_i\,|\,X_i=x_i)$

When $Y_i^*=1$, this term corresponds to:

$P(Y_i^*=1\,|\,Y_i=0, X_i=x_i) \times P(Y_i=0\,|\,X_i=x_i) + P(Y_i^*=1\,|\,Y_i=1, X_i=x_i) \times P(Y_i=1\,|\,X_i=x_i) = (1-Sp) \times P(Y_i=0\,|\,X_i=x_i) + Se \times P(Y_i=1\,|\,X_i=x_i)$

When $Y_i^*=0$, this term corresponds to:

$P(Y_i^*=0\,|\,Y_i=0, X_i=x_i) \times P(Y_i=0\,|\,X_i=x_i) + P(Y_i^*=0\,|\,Y_i=1, X_i=x_i) \times P(Y_i=1\,|\,X_i=x_i) = Sp \times P(Y_i=0\,|\,X_i=x_i) + (1-Se) \times P(Y_i=1\,|\,X_i=x_i)$

So, the individual likelihood of study s2 is:

$L_{i,s2} = \big[(1-Sp) \times P(Y_i=0\,|\,X_i=x_i) + Se \times P(Y_i=1\,|\,X_i=x_i)\big]^{y_i^*} \times \big[Sp \times P(Y_i=0\,|\,X_i=x_i) + (1-Se) \times P(Y_i=1\,|\,X_i=x_i)\big]^{1-y_i^*}$

where

$P(Y_i=1\,|\,X_i=x_i) = \dfrac{1}{1+\exp\big[-\big(\beta_0 + \beta_1\,(x_i^{(S=2)} - \bar{x}^{(S=1)})\big)\big]} \quad (6)$

The likelihood for study s2 is $L_{s2} = \prod_{i=1}^{n_{s2}} L_{i,s2}$. Finally, the total likelihood is:

$L = L_{s2} \times L_{s3} \quad (7)$

When only AgD for study s1 are available, the likelihood is maximized to estimate the corrected $\{\hat{\beta}_0, \hat{\beta}_p\}$, which are used to predict $\operatorname{logit} \hat{\mu}_B(S=1)\,|\,\bar{X}^{(S=1)}$ (Equation 2).
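A hedged R sketch of this AgD case is given below: toy data for studies s2 and s3 are simulated (all names and values are assumptions for illustration, not the article's code), the total log-likelihood of Equation (7) is assembled, and it is maximized with optim(). Sensitivity, specificity, and the validation-study prevalence are handled on the logit scale so that they remain in (0, 1) during optimization.

# Minimal sketch (assumed toy data) of the outcome-corrected likelihood
# (Equation 7) for the AgD case, with one prognostic variable and
# non-differential misclassification.
set.seed(42)

# Toy external validation study s3: true outcome Y and proxy Y*
n3 <- 500
y3      <- rbinom(n3, 1, 0.3)
y3_star <- rbinom(n3, 1, ifelse(y3 == 1, 0.9, 1 - 0.7))   # Se = 0.9, Sp = 0.7

# Toy external control arm s2: prognostic variable X and proxy outcome Y*
n2 <- 500
x2 <- rnorm(n2, 1, 0.3)
y2 <- rbinom(n2, 1, plogis(-1.5 + 1 * x2))
y2_star <- rbinom(n2, 1, ifelse(y2 == 1, 0.9, 1 - 0.7))
xbar_s1 <- 0.8                                            # covariate mean in study s1

neg_loglik <- function(par) {
  b0 <- par[1]; b1 <- par[2]
  se <- plogis(par[3]); sp <- plogis(par[4])   # logit scale keeps Se, Sp in (0, 1)
  p3 <- plogis(par[5])                         # P(Y = 1) in the validation study s3

  # Likelihood contribution of s3: joint probability of (Y*, Y)
  l_s3 <- (se * p3)^(y3_star * y3) *
          ((1 - sp) * (1 - p3))^(y3_star * (1 - y3)) *
          ((1 - se) * p3)^((1 - y3_star) * y3) *
          (sp * (1 - p3))^((1 - y3_star) * (1 - y3))

  # Likelihood contribution of s2: observed proxy, mixing over the true outcome
  p1 <- plogis(b0 + b1 * (x2 - xbar_s1))       # P(Y = 1 | X), Equation (6)
  l_s2 <- ((1 - sp) * (1 - p1) + se * p1)^y2_star *
          (sp * (1 - p1) + (1 - se) * p1)^(1 - y2_star)

  -(sum(log(l_s3)) + sum(log(l_s2)))
}

fit <- optim(par = c(0, 0, qlogis(0.8), qlogis(0.8), qlogis(0.3)),
             fn = neg_loglik, method = "BFGS", hessian = TRUE)
b0_hat <- fit$par[1]   # corrected conditional log odds under B in the s1 population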

When IPD for study s1 are available, $\operatorname{logit}\big(Y_{m,A}^{(S=1)}\big)$ is modeled with the binomial likelihood of the outcome Y in study s1, which includes $m=1,\ldots,M$ patients, denoted $L_{m,s1}$, using the outcome model in Equation (3) for the probability parameters. As indicated above, the correlation between $\operatorname{logit} \hat{\mu}_A(S=1)$ and $\operatorname{logit} \hat{\mu}_B(S=1)$ stems from the fact that the observed values of $X^{(S=1)}$, the prognostic variable in study s1, are used to predict $\operatorname{logit} \mu_{i,B}(S=1)$. To account for this correlation, the likelihood $L_x$ of the random variable $X^{(S=1)}$ is added. The total likelihood to be maximized is then

$L = L_{s2} \times L_{s3} \times L_{s1} \times L_x \quad (8)$

To estimate $s_{AB(S=1)}^2$, the variance of the indirect treatment effect $d_{AB}(S=1)$ (Equation 4), the Fisher information is used. Since the covariate X in studies s1 and s2 (Equations 3 and 6) is centered on $\mu_x^{(S=1)}$ (i.e., the true mean of the random variable $X^{(S=1)}$), $\hat{\alpha}_0$ from Equation (3) and $\hat{\beta}_0$ from Equation (6) are interpreted as the conditional log odds under treatments A and B, respectively. The indirect treatment effect from Equation (1) is estimated as the difference between $\hat{\alpha}_0$ and $\hat{\beta}_0$; consequently, the variance and covariance are estimated using the Fisher information. Specifically, $s_{A(S=1)}^2$, the variance of $\operatorname{logit} \hat{\mu}_A(S=1)$, is the diagonal term of the inverse Fisher information matrix corresponding to $\hat{\alpha}_0$ (Equation 3); $s_{B(S=1)}^2$, the variance of $\operatorname{logit} \hat{\mu}_B(S=1)$, is the diagonal term corresponding to $\hat{\beta}_0$ (Equation 6); and the covariance term (Equation 4) is the covariance between $\hat{\alpha}_0$ and $\hat{\beta}_0$. Confidence intervals can then be derived assuming a normal distribution of the estimate.
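When the likelihood is maximized numerically, the observed Fisher information can be taken from the Hessian of the negative log-likelihood returned by the optimizer. The short R sketch below is only an illustration: it assumes a fit object from optim(..., hessian = TRUE) on the full likelihood (Equation 8) and that alpha0 and beta0 occupy the first two positions of the parameter vector, both of which are assumptions of this example.

# Hedged sketch: variance of the indirect effect from the observed Fisher
# information; assumes `fit` is the result of optim(..., hessian = TRUE) on
# the full likelihood (Equation 8), with alpha0_hat and beta0_hat as the
# first two elements of the parameter vector (an assumption of this example).
vcov_hat <- solve(fit$hessian)                      # inverse observed information
i_a0 <- 1; i_b0 <- 2                                # assumed positions of alpha0_hat, beta0_hat
d_hat <- fit$par[i_a0] - fit$par[i_b0]              # indirect effect (Equation 1)
var_d <- vcov_hat[i_a0, i_a0] + vcov_hat[i_b0, i_b0] -
         2 * vcov_hat[i_a0, i_b0]                   # Equation (4)
ci_95 <- d_hat + c(-1, 1) * qnorm(0.975) * sqrt(var_d)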

All the PAIC methods presented herein build upon the assumption of conditional exchangeability of treatment effects [6], meaning that no unknown or unmeasured prognostic factors or effect modifiers are missing from the models. The outcome-corrected model additionally assumes a correctly specified outcome measurement error model (Equation 5). Specifically, the outcome measurement error model (Equation 5) must account for all variables that contribute to a differential measurement error. When estimating the outcome measurement error model using an external validation study, it is essential to assume transportability [25, 26], which refers to the applicability of the sensitivity and specificity parameters across both the external validation study s3 and the ECA (study s2).

3. Simulation Study Plan

We adhere to the ADEMP reporting framework [34] by describing the Aim (Section 3.1), the Data‐generating mechanism (Section 3.2), the Estimands (Section 3.3), the Methods under investigation (Section 3.4), and the Performance measures (Section 3.5).

3.1. Aims

This simulation study aimed to quantify the bias resulting from ignoring the misclassification of a binary outcome and to evaluate the performance of three different methods described in Section 2, in the context of unanchored indirect comparison involving a single‐arm trial compared to an ECA.

3.2. Data‐Generating Mechanisms

As a reminder, there are three studies (Table 1): the targeted population study s1, which includes IPD or AgD; the external control group study s2, which includes IPD; and the external validation study s3, which is used to estimate the outcome measurement error model. Two treatments k = A, B are considered; the patients from study s1 received experimental treatment A, and the patients from study s2 received control treatment B.

A binary outcome for individual i is generated under a logistic regression model:

$y_{i,s,k} \sim \operatorname{Bern}(\theta_{i,s,k})$
$\operatorname{logit}\,\theta_{i,s,k} = \beta_{0,k} + \beta_1\, x_{i,s}$

with $x_{i,s}$ a prognostic variable that follows a normal distribution with mean $\mu_x(s)$ and variance $\sigma_x^2$: $x_{i,s} \sim N(\mu_x(s), \sigma_x^2)$. $\beta_1$ is the coefficient of the prognostic variable $x_{i,s}$. As $\beta_1$ is not stratified by treatment k, the variable $x_{i,s}$ is thus only a prognostic variable and not an effect modifier. $\beta_{0,k}$ is the log odds of the outcome under treatment k for $x_{i,s}=0$. The proxy binary outcome Y* is generated as a non-differential error of Y as follows:

$y_{i,s,k}^* \sim \operatorname{Bern}(\theta_{i,s,k}^*)$
$\operatorname{logit}\,\theta_{i,s,k}^* = \gamma_0 + \gamma_1\, y_{i,s,k}$

where $\gamma_0 = \operatorname{logit}(1-Sp)$ and $\gamma_1 = \operatorname{logit}(Se) - \gamma_0$. To set $\beta_1$, the coefficient of the prognostic variable, we express it as a function of the prevalence of the outcome in study s2, denoted $p_y(s2)$. Specifically, $\beta_1$ is calculated as $\beta_1 = \big[\operatorname{logit}\big(p_y(s2)\big) - \beta_{0,B}\big] / \mu_x(s=2)$. Furthermore, the log odds under treatment A in study s1, $\beta_{0,A}$, is determined by the ratio of the standardized treatment effect size $D_1$ to the standardized "total" difference between trials $D_2$ (Supporting Information material). The standardized treatment effect size $D_1$ is, informally, the standardized distance between studies s1 and s2 due to the effect of treatment A versus B. The standardized "total" difference $D_2$ is the standardized distance between studies s1 and s2 due to the treatment effect and to the difference in the distribution of the prognostic variable. As such, the ratio $r = D_1/D_2$ represents the proportion of the difference between studies s1 and s2 that is due to the treatment effect, and $1-r$ the proportion due to differences in the patients' prognostic variables. The r value was set to 0.6, indicating that the standardized treatment effect constitutes 0.6 times the standardized "total" difference between trials. Thus, 40% of the "total" standardized difference in outcome between trials is attributable to the "impact" of the prognostic variable. This impact is determined by the magnitude of $\beta_1$ and the degree of overlap in the distribution of X between the studies.

The values of the fixed parameters used in the simulations were as follows: $\mu_x(s=1)=0.8$, $\mu_x(s=2,3)=1$, $\sigma_x=0.1$, $\beta_{0,B}=0.1$, and prevalence of the outcome in study s2 $p_y(s2)=0.2$, leading to a treatment effect odds ratio (OR) of 1.47. Five thousand simulations per scenario were performed.
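A sketch of this data-generating mechanism for one simulated data set is given below, using the fixed parameter values reported above. Two points are assumptions of this illustration only: beta0_A is set directly from the stated OR of 1.47 rather than through the ratio r (whose derivation is in the Supporting Information), and the treatment generating the outcome in the validation study s3 is taken to mirror study s2.

# Sketch of the data-generating mechanism of Section 3.2 for one data set,
# using the fixed parameter values of the reference scenario.
set.seed(2025)
n1 <- 500; n2 <- 500; n3 <- 500
se <- 0.9; sp <- 0.7
gamma0 <- qlogis(1 - sp)                       # logit(1 - Sp)
gamma1 <- qlogis(se) - gamma0                  # logit(Se) - gamma0

mu_x1 <- 0.8; mu_x23 <- 1; sigma_x <- 0.1
beta0_B <- 0.1                                 # log odds under control B at x = 0
beta1 <- (qlogis(0.2) - beta0_B) / mu_x23      # so that the prevalence in s2 is 0.2
beta0_A <- beta0_B + log(1.47)                 # assumed: reproduces the stated OR of 1.47

gen_outcome <- function(x, beta0) rbinom(length(x), 1, plogis(beta0 + beta1 * x))
gen_proxy   <- function(y) rbinom(length(y), 1, plogis(gamma0 + gamma1 * y))

x1 <- rnorm(n1, mu_x1, sigma_x);  y1 <- gen_outcome(x1, beta0_A)  # study s1, treatment A
x2 <- rnorm(n2, mu_x23, sigma_x); y2 <- gen_outcome(x2, beta0_B)  # study s2, treatment B
x3 <- rnorm(n3, mu_x23, sigma_x); y3 <- gen_outcome(x3, beta0_B)  # validation study s3 (assumed under B)

# Proxy outcome Y*, a non-differential error of Y, observed in s2 and s3
y2_star <- gen_proxy(y2)
y3_star <- gen_proxy(y3)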

Different scenarios were defined according to combinations of the following parameters, with reference values marked with an asterisk:

  1. Specificity Sp: 0.7*, 0.8, 0.9;

  2. Sensitivity Se: 0.7, 0.8, 0.9*;

  3. Overall sample size ns in each trial s: ns1 = 500*; ns2 and ns3 = 200, 500*, 1500.

We specified two additional scenarios with sensitivity and specificity fixed at 0.7, with ns2=ns3=500, and ns2=ns3=1000.

3.3. Estimand

The estimand of interest is the population-average treatment effect of A versus B in the targeted population of the single-arm trial, s1: $d_{AB}(S=1)$, which would represent the treatment effect in a randomized trial.

3.4. Methods Under Investigation

Three methods were evaluated:

  • The reference method (see details in Section 2.3), which uses the true outcome Y in both study s1 and study s2

  • The uncorrected method (see details in Section 2.4), which uses the true outcome Y in study s1 and the proxy outcome Y* in study s2, but ignores the misclassification in the outcome.

  • The outcome-corrected model (see details in Section 2.5), which uses the true outcome Y in study s1 and the proxy outcome Y* in study s2, and corrects for the misclassification in the outcome.

3.5. Performance Measures

The performance measures included absolute and relative bias, the 95% confidence interval (95% CI) coverage probability, the empirical standard error (SE), the root mean square error (RMSE), and the proportion of simulations in which the optimization did not converge. As there is no explicit solution for maximizing the likelihoods in Equations (7) and (8), numerical maximization tools can be used. Specifically, the optim() function was used in R with "BFGS" (a quasi-Newton method) as the method argument [35]. For convergence, the parameter spaces for standard deviations and probabilities were constrained using log and logit transformations. Analyses were performed using R version 4.3.3.
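As an illustration of how these performance measures can be computed from the simulation replicates, a short R helper is sketched below; the input vectors (point estimates, standard errors, and the true value) are hypothetical placeholders, not the article's code.

# Hedged sketch: performance measures computed from vectors of estimates and
# standard errors collected over simulation replicates (placeholder inputs).
perf <- function(est, se_est, truth) {
  abs_bias <- mean(est) - truth
  rel_bias <- 100 * abs_bias / truth
  emp_se   <- sd(est)
  coverage <- 100 * mean(est - qnorm(0.975) * se_est <= truth &
                         truth <= est + qnorm(0.975) * se_est)
  rmse     <- sqrt(mean((est - truth)^2))
  c(abs_bias = abs_bias, rel_bias = rel_bias, emp_se = emp_se,
    coverage = coverage, rmse = rmse)
}

# Example call with hypothetical objects:
# perf(est = d_hat_all, se_est = se_all, truth = log(1.47))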

The simulation results are presented in Section 4 for scenarios where IPD are available for study s1. See Tables S2 and S3 for the results when only AgD for study s1 are available.

4. Simulation Results

The results presented here are based on simulations using IPD for study s1, employing three different methods: the reference method, outcome‐corrected model, and the uncorrected method. Results using AgD for study s1 are outlined in the Supporting Information materials.

4.1. Variation According to Specificity

In this scenario, specificity varied between 0.7 and 0.9 while other parameters were held at their reference values. Table 2 presents the performance measures for each method. Figure S1a illustrates the distribution of absolute bias of the uncorrected method and the outcome‐corrected model.

TABLE 2.

Performance measure according to specificity and sensitivity in an IPD‐IPD setting.

                                    Se = 0.9 (Sp varies)                Sp = 0.7 (Se varies)
Performance measure    Methods      Sp = 0.7   Sp = 0.8   Sp = 0.9      Se = 0.7   Se = 0.8   Se = 0.9
Absolute bias Reference 0.00 0.00 0.00 0.00 0.00 0.00
Uncorrected −0.92 −0.60 −0.26 −0.72 −0.82 −0.92
Outcome‐corrected 0.02 0.00 0.00 −0.02 0.01 0.02
Relative bias (%) Reference 0.53 0.79 0.96 0.27 0.65 0.53
Uncorrected 237.27 156.13 67.66 187.72 212.26 237.27
Outcome‐corrected 4.12 1.04 0.51 5.85 2.32 4.12
Empirical SE Reference 0.26 0.26 0.26 0.26 0.26 0.26
Uncorrected 0.22 0.23 0.25 0.22 0.22 0.22
Outcome‐corrected 0.55 0.45 0.37 0.82 0.66 0.55
95% CI coverage (%) Reference 95.50 95.60 95.50 95.50 95.60 95.50
Uncorrected 1.90 25.50 79.60 10.30 4.50 1.90
Outcome‐corrected 96.10 95.90 95.40 97.20 96.50 96.10
RMSE Reference 0.26 0.26 0.26 0.26 0.26 0.26
Uncorrected 0.94 0.64 0.36 0.76 0.85 0.94
Outcome‐corrected 0.55 0.45 0.37 0.82 0.66 0.55
Non‐convergence (n) Reference 0 0 0 0 0 0
Uncorrected 0 0 0 0 0 0
Outcome‐corrected 9 1 0 143 42 9

Note: The reference method uses the true outcome in both study s1 and s2; the uncorrected method ignores outcome misclassification in study s2; the outcome‐corrected model accounts for outcome misclassification in study s2.

Abbreviations: CI: confidence interval; IPD: individual patient data; RMSE: root mean square error; SE: standard error; Se: sensitivity; Sp: specificity.

Simulations had a relative bias for the uncorrected method of 67% when both specificity and sensitivity were at 0.9, increasing to 156% and 237% for specificities of 0.8 and 0.7, respectively. The 95% CI coverage probability of the uncorrected method was 79.6% when both specificity and sensitivity were at 0.9, decreasing to 26.3% for a specificity of 0.8.

The absolute bias of the outcome-corrected model remained low, close to zero as the specificity approached 0.9, with a relative bias consistently below 5%. The highest absolute bias of the outcome-corrected model was 0.02, at a specificity of 0.7. Reducing specificity did, however, increase the variance of the estimates; the empirical SE increased 1.5-fold between specificities of 0.9 and 0.7. The 95% CI coverage probability for the outcome-corrected model remained close to the nominal 95% level across specificity values. The outcome-corrected model failed to converge in nine instances (0.18%) at a specificity of 0.7.

For specificities of 0.7 and 0.8, the RMSE of the outcome‐corrected model was approximately twice as high as that of the reference method but approximately half that of the uncorrected method. When both sensitivity and specificity were at 0.9, the outcome‐corrected model demonstrated lower bias, better coverage probability, and a similar RMSE to the uncorrected method (Table 2).

4.2. Variation According to Sensitivity

In this scenario, sensitivity varied between 0.7 and 0.9 while other parameters were held at reference values. Table 2 presents the performance measures for each method. Figure S1b illustrates the distribution of absolute bias of the uncorrected method and the outcome‐corrected model.

The uncorrected method exhibited a relative bias that increased as the sensitivity approached 0.9: 187% at a sensitivity of 0.7 and 237% at a sensitivity of 0.9 (Table 2). For a specificity of 0.7, 30% of outcome-free patients were recorded as false positives, which is particularly consequential given the low prevalence of the outcome. Moreover, increasing sensitivity without a corresponding improvement in specificity led to an apparent rise in the frequency of the outcome. This rise can easily be misinterpreted as a treatment effect rather than attributed to the improved identification of true cases. The 95% CI coverage probability of the uncorrected method was 10.3% when both specificity and sensitivity were 0.7.

For the outcome-corrected model, the bias remained roughly constant as sensitivity increased, with a relative bias below 6% (Figure S1b and Table 2). The 95% CI coverage probabilities remained above 95%, reaching 97.2% at a sensitivity of 0.7 and around 96% for sensitivities of 0.8 and 0.9. The outcome-corrected model failed to converge in 143 instances (2.9%) at a sensitivity of 0.7.

For a sensitivity of 0.8, the RMSE of the outcome‐corrected model was three times higher than that of the reference method but around 1.5‐fold lower than that of the uncorrected method. When both specificity and sensitivity were at 0.7, the outcome‐corrected model had high variance, with an empirical SE four times higher than the other methods and a RMSE higher than the uncorrected method (Table 2).

4.3. Variation According to Sample Size in Study s3 and s2

In this scenario, the sample sizes in the study s3 and s2 varied between 200 and 1500, while other parameters were held at their reference values. Table 3 presents the performance measures for each method. Figure S2 illustrates the distribution of absolute bias of the uncorrected method and the outcome‐corrected model. Figure S3 presents the distribution of absolute bias of sensitivity and specificity estimated by the outcome‐corrected model for different sample size in study s3.

TABLE 3.

Performance measure according to sample size n3 and n2 in an IPD‐IPD setting.

                                    n2 = 500 (n3 varies)                 n3 = 500 (n2 varies)
Performance measure    Method       n3 = 200   n3 = 500   n3 = 1500      n2 = 200   n2 = 500   n2 = 1500
Absolute bias Reference 0.00 0.00 0.00 0.00 0.00 0.00
Uncorrected −0.92 −0.92 −0.91 −0.93 −0.92 −0.91
Outcome‐corrected 0.02 0.02 0.02 −0.01 0.02 0.01
Relative bias (%) Reference 0.53 0.53 0.41 0.40 0.53 0.08
Uncorrected 237.71 237.27 236.68 241.26 237.27 236.51
Outcome‐corrected 3.96 4.12 5.56 2.30 4.12 3.39
Empirical SE Reference 0.26 0.26 0.26 0.41 0.26 0.17
Uncorrected 0.22 0.22 0.22 0.33 0.22 0.15
Outcome‐corrected 0.61 0.55 0.54 0.86 0.55 0.35
95% CI coverage (%) Reference 95.60 95.50 95.10 94.90 95.50 95.20
Uncorrected 1.70 1.90 2.00 21.30 1.90 0.00
Outcome‐corrected 96.20 96.10 95.90 96.20 96.10 95.80
RMSE Reference 0.26 0.26 0.26 0.41 0.26 0.17
Uncorrected 0.94 0.94 0.94 0.99 0.94 0.92
Outcome‐corrected 0.61 0.55 0.54 0.86 0.55 0.35
Non‐convergence (n) Reference 0 0 0 0 0 0
Uncorrected 0 0 0 0 0 0
Outcome‐corrected 72 9 8 170 9 0

Note: The reference method uses the true outcome in both study s1 and s2; the uncorrected method ignores outcome misclassification in study s2; the outcome‐corrected model accounts for outcome misclassification in study s2.

Abbreviations: CI: confidence interval; IPD: individual patient data; RMSE: root mean square error; SE: standard error.

Since the uncorrected method does not use the validation study s3, its performance measures did not change with ns3. The relative bias of the uncorrected method remained around 237% when ns3 varied. Because the uncorrected method ignores misclassification, increasing ns2 did not affect the bias either (Table 3 and Figure S2a). As the bias remained the same, the 95% CI coverage probability decreased as ns2 increased, dropping from 21% with 200 patients in study s2 to 0% with 1500 patients in study s2.

For the outcome-corrected model, increasing ns3 did not reduce the bias, with an absolute bias remaining around 0.02 and a 95% CI coverage around 96% (Table 3). A larger sample size ns3 enhanced the precision of the sensitivity and specificity estimates (Figure S3). Increasing ns2 did not decrease the absolute bias either (Table 3 and Figure S2); however, it increased precision, reducing both the empirical SE and the RMSE by a factor of about 2.5 between 200 and 1500 patients. The RMSE improvement was more pronounced with larger ns2 than with larger ns3, dropping from 0.86 to 0.35 for ns2 increasing from 200 to 1500 patients, and from 0.61 to 0.54 for ns3 increasing from 200 to 1500 patients. A low sample size in s2 had a greater impact on non-converged iterations, with 170 (3.4%) non-converged iterations for ns2 = 200 and 72 (1.4%) for ns3 = 200. When both sensitivity and specificity were at 0.7, the same pattern was observed (Table S1).

4.4. Aggregated Data Results

Results using AgD for study s1 are provided in the Supporting Information materials and are generally consistent with those obtained using IPD for study s1.

5. Applied Example: PRODIGE‐11 and SHARP Trials

The estimand was the effect of adding sorafenib to the standard of care in adults with advanced-stage hepatocellular carcinoma. Two RCTs, the SHARP trial [32] as the AgD study s1 and the PRODIGE-11 trial [33] as the IPD study s2, were used. Although SHARP and PRODIGE-11 are comparative trials, only a single arm from each study was extracted for the analysis, focusing on the placebo arm from SHARP (i.e., standard of care) and the sorafenib-alone arm from PRODIGE-11. This estimand was chosen because the treatment effect of sorafenib versus placebo from the SHARP trial served as a reference. The validation study (study s3) was internal, using a subset of patients from the PRODIGE-11 trial who had both the reference outcome Y and the proxy outcome Y* assessed.

The design characteristics of each trial are outlined in Table 4. The SHARP trial [32], a double-blind, placebo-controlled study, included 602 patients with advanced hepatocellular carcinoma, randomized to receive either sorafenib (400 mg twice daily) or a placebo. Radiological progression, defined as the time from randomization to disease progression, was based on independent radiological review according to RECIST criteria. Data on radiological progression at 4 months for the placebo group (i.e., standard of care) were extracted from Kaplan–Meier curves. The PRODIGE-11 trial [33], a randomized, unblinded controlled trial, included 323 patients with advanced hepatocellular carcinoma, randomized to receive either sorafenib (400 mg twice daily) or sorafenib (400 mg twice daily) plus pravastatin (40 mg daily). Radiological progression was assessed according to RECIST criteria every 12 weeks, and clinical progression every 4 weeks. Baseline characteristics of the sorafenib arm in PRODIGE-11 and of the placebo arm in the SHARP trial are presented in Table 5. From a reduced set of prognostic covariates, we included those with a standardized mean difference above 0.1, used as an empirical threshold [36], between the two groups: age, etiology (using alcohol as reference), and extrahepatic metastases, without any interaction terms.

TABLE 4.

Design characteristics of the SHARP and PRODIGE-11 trials.

                              SHARP                                      PRODIGE-11
Data                          Aggregated data                            Individual patient data
Design                        Randomized, double blind                   Randomized, unblinded
Inclusion–exclusion criteria  Adults with advanced-stage hepatocellular carcinoma, as confirmed by pathological analysis; not eligible for, or disease progression after, surgical or locoregional therapies; Child-Pugh liver function class A; life expectancy of 12 weeks or more; ASAT and ALAT < 5 N; ECOG < 3; WHO-PS < 3; CLIP score 0–4
Localisation                  USA, Europe, Asia                          France
Inclusion interval            2005–2006                                  2010–2013
Treatment                     Sorafenib 400 mg twice daily; placebo      Sorafenib 400 mg twice daily; sorafenib 400 mg twice daily plus pravastatin 40 mg daily
Outcome                       Y: Radiological progression according to RECIST assessed every 6 weeks      Y*: Clinical progression assessed every 4 weeks

Abbreviations: ASAT: aspartate amino‐transferase; ALAT: alanine amino‐transferase; CLIP: Cancer of the Liver Italian Program; ECOG: Eastern Cooperative Oncology Group; WHO‐PS: World Health Organization Performance Status.

TABLE 5.

Baseline characteristics for SHARP and PRODIGE‐11 studies.

SHARP trial Placebo‐arm (n = 303) PRODIGE‐11 trial Sorafenib‐arm (n = 161) SMD
Age (year)—mean (SD) 66.3 (0.2) 68 (9) 0.14
Sex—n (%) 0.03
Male 264 (87) 142 (88)
Female 39 (13) 19 (12)
Etiology—n (%) 0.62
Hepatitis (B or C) 137 (45) 31 (20)
Alcohol 80 (26) 81 (50)
Other or unknown 86 (30) 49 (30)
ECOG performance status—n (%) 0.07
0 164 (54) 93 (58)
> 1 139 (46) 68 (42)
Macroscopic vascular invasion—n (%) 123 (41) 60 (39) 0.07
Extrahepatic metastases—n (%) 150 (50) 49 (30) 0.41
Child–Pugh class—n (%) 0.06
A 297 (98) 155 (97)
B 6 (2) 4 (2.5)

Abbreviations: ECOG: Eastern Cooperative Oncology Group; n: number; SD: standard deviation; SMD: standardized mean difference.

The treatment effect between sorafenib and placebo in the SHARP trial [32] on radiological progression at 4 months was OR = 0.52, used as a reference value. Using the radiological progression (true outcome Y) in the PRODIGE-11 trial, the indirect treatment effect of sorafenib (from PRODIGE-11) versus placebo (from SHARP) was OR = 0.6, 95% CI 0.34–0.99 (5000 bootstrap iterations). Using the proxy outcome for the PRODIGE-11 trial (i.e., clinical progression) without accounting for misclassification (i.e., the uncorrected method), the indirect treatment effect was OR = 0.36, 95% CI 0.18–0.59 (5000 bootstrap iterations), and using the outcome-corrected model, OR = 0.55, 95% CI 0.09–5.86 (5000 bootstrap iterations).
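To illustrate how such bootstrap confidence intervals can be obtained, the sketch below runs a 5000-resample percentile bootstrap for the uncorrected indirect comparison on toy data; the data set, the single covariate, and the published log odds are assumptions for illustration, not the PRODIGE-11 or SHARP data.

# Hedged sketch of a 5000-iteration percentile bootstrap for the uncorrected
# indirect comparison; toy data only, not the PRODIGE-11 / SHARP analysis.
set.seed(7)
n <- 161
dat <- data.frame(x = rnorm(n, 68, 9))                 # a single covariate, e.g., age
dat$y_star <- rbinom(n, 1, plogis(-3 + 0.03 * dat$x))  # proxy outcome in the IPD study
xbar_s1 <- 66.3                                        # covariate mean in the AgD study
logodds_A_s1 <- qlogis(0.30)                           # assumed published log odds

boot_or <- replicate(5000, {
  d <- dat[sample(n, replace = TRUE), , drop = FALSE]
  fit <- glm(y_star ~ I(x - xbar_s1), family = binomial, data = d)
  exp(logodds_A_s1 - coef(fit)[1])
})
quantile(boot_or, c(0.025, 0.975))   # percentile 95% CI for the indirect OR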

6. Discussion

The present study addressed the challenges of unanchored indirect treatment comparisons in the presence of misclassification of a binary outcome. The focus was on a single-arm trial compared to an ECA. The aims were to quantify the bias introduced by ignoring misclassification and to introduce a method to correct this bias (i.e., the outcome-corrected model). The simulations served as a proof of concept in relatively straightforward scenarios. The reference method was the indirect conditional treatment effect estimated in a setting without measurement error. The outcome-corrected model was compared to a method without measurement error correction (i.e., the uncorrected method).

The simulations found that ignoring misclassification in binary outcomes leads to substantial bias in estimating indirect treatment effects when a single-arm trial is compared to an ECA. Even with a specificity and sensitivity of 0.9, and an outcome model accounting for all prognostic variables, the uncorrected method had a relative bias of 67% and a 95% CI coverage of 79%. Since the uncorrected method does nothing to reduce the bias, increasing the ECA (study s2) sample size resulted in a dramatic decrease in the 95% CI coverage. The outcome-corrected model had lower bias and better coverage probabilities than the uncorrected method, even when both specificity and sensitivity were at 0.9. When sensitivity and specificity decreased, estimations remained minimally biased, with a relative bias below 6%. However, this reduction in sensitivity or specificity led to an increase in variance. Increasing the sample size of the ECA (study s2) had a greater impact on improving the RMSE than increasing the sample size of the validation study (s3). The validation study s3 is used for estimating the joint probability of the reference and proxy outcomes (Equation 5). With a non-differential error measurement, the model is straightforward, and thus even a small sample size ns3 can provide precise estimates of sensitivity and specificity, quickly exhausting the possible RMSE improvement. In contrast, the sample size of the ECA (study s2) helps to estimate the conditional probability $P(Y_i=y_i\,|\,X_i=x_i)$, which is needed to predict the effect of treatment B in study s1. Since the ECA (study s2) may consist of routinely collected data from EMRs, it is more feasible to achieve a large sample size than for study s3, where both outcomes are measured. Alternatively, study s3 could be an internal validation sample in which a random sample of patients from s2 has both measurements.

All the PAIC methods presented herein build upon the assumption of conditional exchangeability of treatment effects [6]. The outcome-corrected model additionally assumes a correctly specified outcome measurement error model (Equation 5). The simulations used a correctly specified outcome model, and the impact of a misspecified outcome measurement error model (Equation 5) was not evaluated. The sample size of the validation study s3 should have a greater impact when there is differential misclassification. The simulations did not consider measurement errors in the prognostic covariates, which could have a substantial impact on the indirect treatment effect [26]. Simulations were performed in a setting where the transportability assumption [25, 26] holds. Transportability might not be considered plausible if the validation study s3 and study s2 involve non-comparable populations [37]; for instance, if the ECA (study s2) includes patients with advanced-stage cancer from a general hospital, while the external validation study s3 includes patients from a specialized oncology center known for earlier detection and treatment of cancer. The specialized center's patients might have different disease progression patterns, affecting the sensitivity and specificity of the outcome measurements. As a result, applying the sensitivity and specificity estimates from the external validation study s3 to the ECA (study s2) may lead to a biased indirect treatment effect. Furthermore, if one assumes that the treatment impacts the measurement error (i.e., differential error measurement), it would be preferable for all patients in the validation study s3 to receive the same treatment as in the ECA (study s2). For all these reasons, an internal validation study s3 may be more suitable, because it reduces concerns about transportability and provides flexibility to accommodate general patterns of differential misclassification [31]. Additionally, the individual likelihood $L_{j,s3}$ presented in Section 2.5 remains the same with an internal validation study s3 [31]. Another limitation is that the method proposed here estimates a conditional indirect treatment effect, and could thus suffer from non-collapsibility when using a non-linear link function for the outcome model [38, 39]. When AgD are available for study s1, STC also estimates a conditional indirect treatment effect, with bootstrap estimates for the confidence intervals [6], without accounting for the covariance term in Equation (4). When IPD are available for both studies s1 and s2, estimating the conditional indirect treatment effect allowed a likelihood-based method to be used to estimate the variance of the indirect treatment effect; when estimating a marginal indirect effect, the variance of the indirect treatment effect is estimated with the bootstrap or a robust variance [12].

We illustrated the application of these methods using the AgD for the standard of care (i.e., placebo arm) in SHARP [32] and the IPD for the sorafenib‐alone arm in PRODIGE‐11 trials [33]. Leveraging the established efficacy of sorafenib compared to placebo in the SHARP population, we used this treatment effect estimate as reference. The reference effect of sorafenib compared to placebo was OR = 0.52. The point estimates of the indirect treatment effect, using the outcome‐corrected model and the true outcome from the PRODIGE‐11 trial (radiological progression) were close to the reference value (OR = 0.55 and OR = 0.6, respectively). Ignoring outcome misclassification resulted in an overestimation of the indirect treatment effect (OR = 0.36). However, with only 161 patients in the sorafenib arm of the PRODIGE‐11 trial, the 95% CI estimated by the outcome‐corrected model was wide. This is conservative, as it transfers the uncertainty in measurement to the uncertainty in the decision (provided that one refrains from concluding the absence of a difference) but may be inefficient for small sample sizes. These findings align with simulation results, where the empirical SE of the outcome‐corrected model was twice that of the reference for a sample size of 200 patients (Table S3).

In the applied example, the outcome‐corrected model assumes no correlation between prognostic factors. Additional simulations are needed to evaluate the potential reduction in estimation variance when incorporating the correlation between prognostic factors into the likelihood, by adding a multivariate probability distribution to model their joint distribution when all IPD are available. It is crucial to emphasize that all methods will be biased when prognostic factors and effect modifiers are absent [6]. Furthermore, the inclusion of unnecessary prognostic factors may amplify the variance of estimation [40]. These challenges are prevalent in health technology assessment and are compounded by insufficient sample size to adjust for all covariates of interest. Consequently, it may be beneficial to employ quantitative bias methods to investigate how robust treatment effect estimates are to unmeasured confounders [41].

While the proposed correction method may enhance the reliability of unanchored indirect treatment comparisons when outcomes differ from those recorded in single‐arm trials, its implementation demands strict conditions that may be hard to meet. Consequently, our simulation study not only illustrates the method's potential benefits but also serves as a cautionary note against the limitations and risks inherent in single‐arm studies using external control groups with proxy outcomes.

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

Figure S1: Treatment effect absolute bias with different specificity (upper) and sensitivity (lower) values for the uncorrected and the outcome corrected model (i.e., corrected).

SIM-44-0-s002.tiff (50MB, tiff)

Figure S2: Treatment effect absolute bias with different sample size for study s2 (upper) and study s3 (lower) for the uncorrected and the outcome corrected model (i.e., corrected).

SIM-44-0-s001.tiff (50MB, tiff)

Figure S3: Absolute bias for specificity (upper) and sensitivity (lower) estimated by the outcome‐corrected model for varying sample size of study s3.

SIM-44-0-s003.tiff (50MB, tiff)

Data S1: Supporting information.

SIM-44-0-s004.docx (39.5KB, docx)

Acknowledgments

The authors would like to thank the Fédération Francophone de Cancérologie Digestive (FFCD) for providing the data for the PRODIGE‐11 trial.

Nourredine M., Gavoille A., Lepage C., Kassai‐Koupai B., Cucherat M., and Subtil F., “Accounting for Misclassification of Binary Outcomes in External Control Arm Studies for Unanchored Indirect Comparisons: Simulations and Applied Example,” Statistics in Medicine 44, no. 20‐22 (2025): e70236, 10.1002/sim.70236.

Funding: The authors received no specific funding for this work.

Data Availability Statement

The data that support the findings of this study are openly available in OSF at https://osf.io/bqs5t/.

References

  • 1. Food and Drug Administration, HHS , “International Conference on Harmonisation; Choice of Control Group and Related Issues in Clinical Trials; Availability,” Notice Fed Register 66, no. 93 (2001): 24390–24391. [PubMed] [Google Scholar]
  • 2. Zhang A. D., Puthumana J., Downing N. S., Shah N. D., Krumholz H. M., and Ross J. S., “Assessment of Clinical Trials Supporting US Food and Drug Administration Approval of Novel Therapeutic Agents, 1995–2017,” JAMA Network Open 3, no. 4 (2020): e203284, 10.1001/jamanetworkopen.2020.3284. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Hatswell A. J., Baio G., Berlin J. A., Irs A., and Freemantle N., “Regulatory Approval of Pharmaceuticals Without a Randomised Controlled Study: Analysis of EMA and FDA Approvals 1999–2014,” BMJ Open 6, no. 6 (2016): e011666, 10.1136/bmjopen-2016-011666. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Tenhunen O., Lasch F., Schiel A., and Turpeinen M., “Single‐Arm Clinical Trials as Pivotal Evidence for Cancer Drug Approval: A Retrospective Cohort Study of Centralized European Marketing Authorizations Between 2010 and 2019,” Clinical Pharmacology and Therapeutics 108, no. 3 (2020): 653–660, 10.1002/cpt.1965. [DOI] [PubMed] [Google Scholar]
  • 5. Davi R., Mahendraratnam N., Chatterjee A., Dawson C. J., and Sherman R., “Informing Single‐Arm Clinical Trials With External Controls,” Nature Reviews. Drug Discovery 19, no. 12 (2020): 821–822, 10.1038/d41573-020-00146-5. [DOI] [PubMed] [Google Scholar]
  • 6. Phillippo D., Ades A., Dias S., Palmer S., Abrams K., and Welton N., NICE DSU Technical Support Document 18: Methods for Population‐Adjusted Indirect Comparisons in Submission to NICE (National Institute for Health and Care Excellence, 2016). [Google Scholar]
  • 7. Serret‐Larmande A., Zenati B., Dechartres A., Lambert J., and Hajage D., “A Methodological Review of Population‐Adjusted Indirect Comparisons Reveals Inconsistent Reporting and Suggests Publication Bias,” Journal of Clinical Epidemiology 163 (2023): 1–10, 10.1016/j.jclinepi.2023.09.004. [DOI] [PubMed] [Google Scholar]
  • 8. Sultana N. and Ren S., “Review of Methods Used to Estimate Treatment Effects Against Relevant Comparators Using Evidence From Single‐Arm Studies in NICE Single Technology Appraisals,” Value in Health 25, no. 12 (2022): S10, 10.1016/j.jval.2022.09.056. [DOI] [Google Scholar]
  • 9. Chatton A., Le Borgne F., Leyrat C., et al., “G‐Computation, Propensity Score‐Based Methods, and Targeted Maximum Likelihood Estimator for Causal Inference With Different Covariates Sets: A Comparative Simulation Study,” Scientific Reports 10, no. 1 (2020): 9219, 10.1038/s41598-020-65917-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Schuler M. S. and Rose S., “Targeted Maximum Likelihood Estimation for Causal Inference in Observational Studies,” American Journal of Epidemiology 185, no. 1 (2017): 65–73, 10.1093/aje/kww165. [DOI] [PubMed] [Google Scholar]
  • 11. Naimi A. I., Cole S. R., and Kennedy E. H., “An Introduction to G Methods,” International Journal of Epidemiology 46, no. 2 (2017): 756–792, 10.1093/ije/dyw323. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Faria R., Alava M. H., Manca A., and Wailoo A. J., “NICE DSU Technical Support Document 17: The Use of Observational Data to Inform Estimates of Treatment Effectiveness in Technology Appraisal: Methods for Comparative Individual Patient Data,” https://www.sheffield.ac.uk/sites/default/files/2022‐02/TSD17‐DSU‐Observational‐data‐FINAL.pdf.
  • 13. Signorovitch J. E., Sikirica V., Erder M. H., et al., “Matching‐Adjusted Indirect Comparisons: A New Tool for Timely Comparative Effectiveness Research,” Value in Health 15, no. 6 (2012): 940–947, 10.1016/j.jval.2012.05.004. [DOI] [PubMed] [Google Scholar]
  • 14. Caro J. J. and Ishak K. J., “No Head‐To‐Head Trial? Simulate the Missing Arms,” PharmacoEconomics 28, no. 10 (2010): 957–967, 10.2165/11537420-000000000-00000. [DOI] [PubMed] [Google Scholar]
  • 15. Phillippo D. M., Dias S., Ades A. E., and Welton N. J., “Assessing the Performance of Population Adjustment Methods for Anchored Indirect Comparisons: A Simulation Study,” Statistics in Medicine 39, no. 30 (2020): 4885–4911, 10.1002/sim.8759. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Jiang Y. and Ni W., “Performance of Unanchored Matching‐Adjusted Indirect Comparison (MAIC) for the Evidence Synthesis of Single‐Arm Trials With Time‐To‐Event Outcomes,” BMC Medical Research Methodology 20, no. 1 (2020): 241, 10.1186/s12874-020-01124-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Remiro‐Azócar A., Heath A., and Baio G., “Methods for Population Adjustment With Limited Access to Individual Patient Data: A Review and Simulation Study,” Research Synthesis Methods 12, no. 6 (2021): 750–775, 10.1002/jrsm.1511. [DOI] [PubMed] [Google Scholar]
  • 18. Weber D., Jensen K., and Kieser M., “Comparison of Methods for Estimating Therapy Effects by Indirect Comparisons: A Simulation Study,” Medical Decision Making 40, no. 5 (2020): 644–654, 10.1177/0272989X20929309. [DOI] [PubMed] [Google Scholar]
  • 19. Hatswell A. J., Freemantle N., and Baio G., “The Effects of Model Misspecification in Unanchored Matching‐Adjusted Indirect Comparison: Results of a Simulation Study,” Value in Health 23, no. 6 (2020): 751–759, 10.1016/j.jval.2020.02.008. [DOI] [PubMed] [Google Scholar]
  • 20. Ren S., Ren S., Welton N. J., and Strong M., “Advancing Unanchored Simulated Treatment Comparisons: A Novel Implementation and Simulation Study,” Research Synthesis Methods 15, no. 4 (2024): 657–670, 10.1002/jrsm.1718. [DOI] [PubMed] [Google Scholar]
  • 21. Degtiar I. and Rose S., “A Review of Generalizability and Transportability,” Annual Review of Statistics and Its Application 10, no. 1 (2023): 501–524, 10.1146/annurev-statistics-042522-103837. [DOI] [Google Scholar]
  • 22. Dreyer N. A., Hall M., and Christian J. B., “Modernizing Regulatory Evidence With Trials and Real‐World Studies,” Therapeutic Innovation & Regulatory Science 54, no. 5 (2020): 1112–1115, 10.1007/s43441-020-00131-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Monti S., Grosso V., Todoerti M., and Caporali R., “Randomized Controlled Trials and Real‐World Data: Differences and Similarities to Untangle Literature Data,” Rheumatology 57, no. Supplement_7 (2018): vii54–vii58, 10.1093/rheumatology/key109. [DOI] [PubMed] [Google Scholar]
  • 24. Eisenhauer E. A., Therasse P., Bogaerts J., et al., “New Response Evaluation Criteria in Solid Tumours: Revised RECIST Guideline (Version 1.1),” European Journal of Cancer 45, no. 2 (2009): 228–247, 10.1016/j.ejca.2008.10.026. [DOI] [PubMed] [Google Scholar]
  • 25. Carroll R. J., Ruppert D., Stefanski L. A., and Crainiceanu C. M., Measurement Error in Nonlinear Models: A Modern Perspective, 2nd ed. (Chapman and Hall/CRC, 2006), 10.1201/9781420010138. [DOI] [Google Scholar]
  • 26. Keogh R. H., Shaw P. A., Gustafson P., et al., “STRATOS Guidance Document on Measurement Error and Misclassification of Variables in Observational Epidemiology: Part 1‐Basic Theory and Simple Methods of Adjustment,” Statistics in Medicine 39, no. 16 (2020): 2197–2231, 10.1002/sim.8532. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Nab L., Van Smeden M., Keogh R. H., and Groenwold R. H. H., “Mecor: An R Package for Measurement Error Correction in Linear Regression Models With a Continuous Outcome,” Computer Methods and Programs in Biomedicine 208 (2021): 106238, 10.1016/j.cmpb.2021.106238. [DOI] [PubMed] [Google Scholar]
  • 28. Yi G. Y., Statistical Analysis With Measurement Error or Misclassification (Springer, 2017), 479, 10.1007/978-1-4939-6640-0. [DOI] [Google Scholar]
  • 29. Gerlach R. and Stamey J., “Bayesian Model Selection for Logistic Regression With Misclassified Outcomes,” Statistical Modelling 7, no. 3 (2007): 255–273, 10.1177/1471082X0700700303. [DOI] [Google Scholar]
  • 30. Daniel Paulino C., Soares P., and Neuhaus J., “Binomial Regression With Misclassification,” Biometrics 59, no. 3 (2003): 670–675, 10.1111/1541-0420.00077. [DOI] [PubMed] [Google Scholar]
  • 31. Lyles R. H., Tang L., Superak H. M., et al., “Validation Data‐Based Adjustments for Outcome Misclassification in Logistic Regression: An Illustration,” Epidemiology 22, no. 4 (2011): 589–597, 10.1097/EDE.0b013e3182117c85. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Llovet J. M., Hilgard P., de Oliveira A. C., et al., “Sorafenib in Advanced Hepatocellular Carcinoma,” New England Journal of Medicine 359, no. 4 (2008): 378–390. [DOI] [PubMed] [Google Scholar]
  • 33. Jouve J. L., Lecomte T., Bouché O., et al., “Pravastatin Combination With Sorafenib Does Not Improve Survival in Advanced Hepatocellular Carcinoma,” Journal of Hepatology 71, no. 3 (2019): 516–522, 10.1016/j.jhep.2019.04.021. [DOI] [PubMed] [Google Scholar]
  • 34. Morris T. P., White I. R., and Crowther M. J., “Using Simulation Studies to Evaluate Statistical Methods,” Statistics in Medicine 38, no. 11 (2019): 2074–2102, 10.1002/sim.8086. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. R Core Team, “R: A Language and Environment for Statistical Computing,” 2013.
  • 36. Kibuchi E., Sturgis P., Durrant G. B., and Maslovskaya O., “The Efficacy of Propensity Score Matching for Separating Selection and Measurement Effects Across Different Survey Modes,” Journal of Survey Statistics and Methodology 12, no. 3 (2024): 764–789, 10.1093/jssam/smae017. [DOI] [Google Scholar]
  • 37. Shaw P. A., Gustafson P., Carroll R. J., et al., “STRATOS Guidance Document on Measurement Error and Misclassification of Variables in Observational Epidemiology: Part 2—More Complex Methods of Adjustment and Advanced Topics,” Statistics in Medicine 39, no. 16 (2020): 2232–2263, 10.1002/sim.8531. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Greenland S., Pearl J., and Robins J. M., “Confounding and Collapsibility in Causal Inference,” Statistical Science 14 (1999): 29–46, 10.1214/ss/1009211805. [DOI] [Google Scholar]
  • 39. Colnet B., Josse J., Varoquaux G., and Scornet E., “Risk Ratio, Odds Ratio, Risk Difference… Which Causal Measure is Easier to Generalize?” 2023, http://arxiv.org/abs/2303.16008.
  • 40. Colnet B., Josse J., Varoquaux G., and Scornet E., “Re‐Weighting the Randomized Controlled Trial for Generalization: Finite‐Sample Error and Variable Selection,” Journal of the Royal Statistical Society. Series A, Statistics in Society 188, no. 2 (2025): 345–372, 10.1093/jrsssa/qnae043. [DOI] [Google Scholar]
  • 41. Popat S., Liu S. V., Scheuer N., et al., “Addressing Challenges With Real‐World Synthetic Control Arms to Demonstrate the Comparative Effectiveness of Pralsetinib in Non‐Small Cell Lung Cancer,” Nature Communications 13, no. 1 (2022): 3500, 10.1038/s41467-022-30908-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
