ABSTRACT
Single‐arm trials are increasingly proposed as a potential approach for treatment evaluation. However, the limitations of this design restrict its methodological acceptability. Regulatory agencies have raised concerns about this approach, although they sometimes receive applications based solely on such studies. Consequently, the need for accurate indirect treatment comparisons has become critical, especially when external control arms are constructed from routinely collected data, in which outcome measurements may differ from those recorded in the single‐arm trial, leading to potential misclassification of outcomes. This study aimed to quantify, through simulations, the bias from ignoring misclassification of a binary outcome within unanchored indirect comparisons, and to propose a likelihood‐based method to correct this bias (i.e., the outcome‐corrected model). Simulations demonstrated that ignoring misclassification results in substantial bias and poor coverage probabilities. In contrast, the outcome‐corrected model reduced bias and improved the 95% confidence interval coverage probability and root mean square error in various scenarios. The methodology was applied to two hepatocellular carcinoma trials as a practical illustration. The findings underscore the importance of addressing outcome misclassification in indirect comparisons. The proposed correction method may improve the reliability of unanchored indirect treatment comparisons.
Keywords: external control group; indirect treatment comparison; single‐arm study; measurement error; misclassification
1. Introduction
Randomized controlled trials (RCTs) are the gold standard for assessing experimental interventions. Nevertheless, practical or ethical limitations may prevent the implementation of an RCT, resulting in the adoption of single‐arm trials, in which all participants receive the same treatment. Regulatory agencies have raised concerns about this design [1], although they sometimes receive marketing authorization applications based solely on such studies. The Food and Drug Administration is increasingly granting approvals based on the findings of single‐arm trials [2], especially in the field of oncology [3, 4]. In addition, the molecular fragmentation of cancers means that more and more subgroups need to be considered, making Phase III trials very difficult to conduct. Single‐arm trials require an External Control Arm (ECA) to estimate the treatment effect [5], and are part of the unanchored indirect comparison framework [6]. Studies have shown that indirect comparisons in health technology assessments were predominantly unanchored [7, 8]. For example, a new drug might be tested in a single‐arm trial, with the ECA constructed from the standard of care in Electronic Medical Records (EMRs) or from a previous clinical trial. When Individual Patient Data (IPD) are available for both studies, causal effects can be estimated using methods for population‐adjusted indirect comparisons (PAICs) based on a frequentist approach, such as g‐computation, inverse‐probability treatment weighting, and targeted maximum likelihood estimation [9, 10, 11, 12]. When researchers only have access to IPD from the single‐arm trial and aggregate data (AgD) from the other source, matching‐adjusted indirect comparison (MAIC) [13] and simulated treatment comparison (STC) [14] have been proposed. MAIC is based on propensity score weighting and STC on outcome regression adjustment [6]. STC and MAIC have been evaluated in anchored [15, 16, 17, 18] and unanchored [19, 20] indirect comparisons, without a clear conclusion as to whether MAIC or STC performs better.
PAIC methods can account for imbalance in patient characteristics, but not for differences in outcome measurements between the two studies [21]. Therefore, a commonly made assumption for estimating indirect treatment effects is that there is no difference in the outcome measurement between the single‐arm trial and the ECA. However, in routinely collected data (i.e., real‐world data), for example, outcome measures may differ from those recorded in clinical trials [22, 23]. A proxy measure $Y^*$, evaluated in the ECA, can substitute for the reference measure $Y$ evaluated in the single‐arm trial. Both outcomes $Y$ and $Y^*$ could measure the same concept, with one predicting the other using a different scale and possibly a different level. For example, in a single‐arm trial evaluating a new immunotherapy for cancer, radiological progression, $Y$, might be assessed using RECIST criteria [24], while the ECA might use clinical progression, $Y^*$, collected in EMRs or a previous clinical trial. Similarly, various scales may be used to assess depression in routinely collected data. Misclassification of the outcome could bias the treatment effect either towards or away from the null value [25, 26, 27, 28]. Thus, to estimate an unbiased indirect treatment effect, one must first address the proxy outcome's misclassification. Misclassification in a binary outcome can be quantified by estimating sensitivity and specificity using ancillary studies [26]. For instance, validation studies involve measuring both the reference outcome and its proxy in the same individuals [26, 27]. When conducted within the same population (i.e., the ECA), these are termed internal validation studies, and external validation studies when a third study assesses both $Y$ and $Y^*$. Methods have already been developed for incorporating validation study data into outcome regression estimation using a Bayesian framework [29, 30] or a parametric frequentist perspective; we refer the reader to Carroll et al. [25] for general expressions with likelihood‐based methods and to Lyles et al. [31] for implementation examples. However, these methods have not yet been employed in the context of unanchored indirect comparisons.
The present study examines an unanchored indirect comparison involving two studies: one with AgD or IPD for the experimental treatment, assessing a binary reference outcome $Y$, and another with IPD for the control treatment, potentially derived from EMR data or a previous clinical trial, which uses a proxy outcome $Y^*$. The objectives are to quantify the bias from ignoring misclassification of a binary outcome in unanchored indirect comparisons and to propose a method to correct this bias (i.e., the outcome‐corrected model). We first describe the indirect comparison framework and provide an overview of indirect treatment effect estimation; subsequently, the outcome‐corrected model is introduced. Methods with and without correction for misclassification in the binary outcome were evaluated in a simulation analysis and in a practical example using real data from two trials investigating advanced hepatocellular carcinoma: the SHARP trial [32], which evaluated the effect of sorafenib versus placebo on radiological progression measured using RECIST, and the PRODIGE‐11 trial [33], which evaluated the effect of sorafenib and pravastatin combined versus sorafenib alone on clinical progression.
2. Outcome Model for Indirect Comparison
2.1. Notations and Data Structures
There are three studies (Table 1): the study that includes either AgD or IPD, labeled study $S_1$; the ECA study that includes IPD, labeled study $S_2$; and the validation study used to estimate the outcome measurement error model, labeled $S_3$. The outcome $Y$ is the reference binary outcome response and $Y^*$ is the proxy binary outcome response. $X = (X_1, \dots, X_P)$ are prognostic and effect modifier covariates. Two treatments $A$ and $B$ will be considered; the patients from study $S_1$ received the experimental treatment $A$, and the patients from study $S_2$ received the control treatment $B$.
TABLE 1.
Data structure with an external validation study $S_3$.
| Study set | Prognostic and effect modifier covariates | Treatment | Error‐free outcome $Y$ | Error‐prone outcome $Y^*$ |
|---|---|---|---|---|
| $S_1$: Single‐arm trial (IPD or AgD) | $X_1$ | $A$ | $Y_1$ | – |
| $S_2$: External control arm (IPD) | $X_2$ | $B$ | – | $Y^*_2$ |
| $S_3$: Validation study | $X_3$ | – | $Y_3$ | $Y^*_3$ |
2.2. Unanchored Indirect Comparisons Framework
For a binary outcome, the form of indirect comparisons is as follows [6]: within a target population $S_1$, the effect of treatment $A$ compared with treatment $B$, $\Delta_{AB(1)}$, is calculated as the difference in the log odds of the outcome between the two treatments:

$$\Delta_{AB(1)} = \mu_{A(1)} - \mu_{B(1)},$$

where $\mu_{A(1)}$ and $\mu_{B(1)}$ denote the log odds of the outcome under treatments $A$ and $B$ in population $S_1$. The estimand of interest is the population‐average treatment effect of $A$ versus $B$ in the targeted population $S_1$, $\Delta_{AB(1)}$, along with its estimate:

$$\hat{\Delta}_{AB(1)} = \hat{\mu}_{A(1)} - \hat{\mu}_{B(1)}. \quad (1)$$

The challenge lies in estimating $\hat{\mu}_{B(1)}$, the predicted log odds of the outcome under treatment $B$ in target population $S_1$, as the participants in $S_1$ received only treatment $A$.
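For illustration, the following minimal R sketch computes Equation (1) from hypothetical aggregate counts; the numbers are invented, and it is assumed that both quantities already refer to the target population $S_1$:

```r
# Hypothetical aggregate counts (illustrative only)
r_A <- 60; n_A <- 150   # events under treatment A in the target population S1
r_B <- 45; n_B <- 150   # events under treatment B, assumed already adjusted to population S1

mu_A1 <- log(r_A / (n_A - r_A))   # log odds of the outcome under A in S1
mu_B1 <- log(r_B / (n_B - r_B))   # log odds of the outcome under B in S1

delta_AB1 <- mu_A1 - mu_B1        # Equation (1), on the log odds ratio scale
exp(delta_AB1)                    # corresponding odds ratio
```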
2.3. Indirect Comparisons Without Measurement Error
Several methods can be used to estimate the indirect treatment effect from Equation (1) [9, 10, 11, 12]. The g‐computation method involves fitting an outcome model using the IPD from study $S_2$. This model then uses the prognostic and effect modifier variables from the study $S_1$ population to predict the marginal value of the outcome under treatment $B$ in the target population $S_1$ [11, 12].
Now, consider a conditional indirect treatment effect as follows:

$$\Delta_{AB(1)|X} = \mu_{A(1)|X} - \mu_{B(1)|X},$$

where $X$ is a set of prognostic covariates in population $S_1$. To predict $\mu_{B(1)|X}$, the conditional log odds of the outcome under treatment $B$ in the target population $S_1$, an outcome regression model is fitted to the IPD from study $S_2$:

$$\text{logit}\, P(Y_{i2} = 1 \mid X_{i2}) = \mu_{B(1)|X} + \sum_{p=1}^{P} \beta_p \left( X_{ip2} - \bar{X}_{p1} \right), \quad (2)$$

where $Y_{i2}$ is the binary outcome value for patient $i$ from study $S_2$, and $X_{ip2}$ denotes the $p$th prognostic covariate, along with its associated coefficient $\beta_p$. The prediction is centred on the mean covariate value of study $S_1$, $\bar{X}_{p1}$. By doing so, $\mu_{B(1)|X}$ is interpreted as the predicted conditional log odds of the outcome under treatment $B$ for an average patient sampled from the target population $S_1$.
In the case of AgD for $S_1$, $\mu_{A(1)}$, the marginal log odds of the outcome under treatment $A$ in the target population $S_1$, is derived from the reported summary estimate in a published article. However, it is unlikely that the conditional estimate (i.e., $\mu_{A(1)|X}$) is available in a published article. Conversely, when IPD are available for study $S_1$, a second outcome model is fitted to the data of $S_1$ with the same prognostic covariates:

$$\text{logit}\, P(Y_{i1} = 1 \mid X_{i1}) = \mu_{A(1)|X} + \sum_{p=1}^{P} \alpha_p \left( X_{ip1} - \bar{X}_{p1} \right), \quad (3)$$

where $\mu_{A(1)|X}$ is the estimated conditional log odds of the outcome under treatment $A$ for an average patient sampled from the target population $S_1$, to be consistent with $\mu_{B(1)|X}$, which also has a conditional effect interpretation.
Finally, a conditional population‐adjusted treatment effect is estimated using Equation (1), with $\hat{\mu}_{A(1)|X}$ and $\hat{\mu}_{B(1)|X}$ when IPD are available for study $S_1$, or with $\hat{\mu}_{B(1)|X}$ and the marginal log odds of the outcome reported, for example, in a published article when only AgD are available for study $S_1$.
The variance of $\hat{\Delta}_{AB(1)|X}$ is:

$$\operatorname{Var}(\hat{\Delta}_{AB(1)|X}) = \operatorname{Var}(\hat{\mu}_{A(1)|X}) + \operatorname{Var}(\hat{\mu}_{B(1)|X}) - 2 \operatorname{Cov}(\hat{\mu}_{A(1)|X}, \hat{\mu}_{B(1)|X}), \quad (4)$$

where $\operatorname{Var}(\hat{\mu}_{A(1)|X})$ and $\operatorname{Var}(\hat{\mu}_{B(1)|X})$ are the variances of $\hat{\mu}_{A(1)|X}$ and $\hat{\mu}_{B(1)|X}$, respectively. The covariance term in Equation (4) arises because the values of the covariates in study $S_1$ are used in both outcome models: in study $S_1$ (Equation 3) to predict $\hat{\mu}_{A(1)|X}$ and in study $S_2$ (Equation 2) to predict $\hat{\mu}_{B(1)|X}$. More precisely, for the $p$th covariate, the values $\bar{X}_{p1}$ are used. In the case of IPD availability for study $S_1$, a likelihood‐based method is proposed to account for this correlation (Section 2.5). When only AgD are available for study $S_1$, the covariance term cannot be evaluated, and the variance of the indirect treatment effect is approximated using the model variance from Equation (2) or by bootstrap [6, 15].
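A minimal R sketch of this conditional indirect comparison is given below, assuming IPD for both studies, a single prognostic covariate, and hypothetical data frames `ipd_s1` and `ipd_s2` with columns `y` (binary outcome) and `x`; all object names are illustrative:

```r
# Conditional indirect comparison with covariates centred on the study S1 mean.
xbar_s1 <- mean(ipd_s1$x)                       # centring value from the target population S1

fit_s2 <- glm(y ~ I(x - xbar_s1), family = binomial, data = ipd_s2)  # Equation (2), treatment B
fit_s1 <- glm(y ~ I(x - xbar_s1), family = binomial, data = ipd_s1)  # Equation (3), treatment A

mu_B <- coef(fit_s2)[1]   # conditional log odds under B for an average S1 patient
mu_A <- coef(fit_s1)[1]   # conditional log odds under A for an average S1 patient
delta <- mu_A - mu_B      # Equation (1)

# Simple variance ignoring the covariance term of Equation (4) (conservative when
# the covariance is positive); a bootstrap over both data sets can be used instead.
var_delta <- summary(fit_s1)$coefficients[1, 2]^2 +
             summary(fit_s2)$coefficients[1, 2]^2
c(estimate = unname(delta),
  lower    = unname(delta) - 1.96 * sqrt(var_delta),
  upper    = unname(delta) + 1.96 * sqrt(var_delta))
```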
2.4. Indirect Comparison With Measurement Error
Now consider the case in which only a proxy outcome $Y^*$ is available for study $S_2$. This will bias the estimates of $\mu_{B(1)|X}$ and $\beta_p$ from Equation (2):

$$\text{logit}\, P(Y^*_{i2} = 1 \mid X_{i2}) = \mu^*_{B(1)|X} + \sum_{p=1}^{P} \beta^*_p \left( X_{ip2} - \bar{X}_{p1} \right),$$

where $\mu^*_{B(1)|X}$ and $\beta^*_p$ are biased estimates of $\mu_{B(1)|X}$ and $\beta_p$. The magnitude and direction of the bias will depend upon the diagnostic properties of $Y^*$ as a substitute for $Y$ [26]. Misclassification in a binary outcome can be expressed in terms of the sensitivity $Se = P(Y^* = 1 \mid Y = 1)$ and the specificity $Sp = P(Y^* = 0 \mid Y = 0)$. Misclassification may be differential if the sensitivity and specificity depend on covariates, or non‐differential if they do not [26]. The impact of non‐differential misclassification is that the naive estimates are closer to the null value than the true ones [26]. However, if the misclassification is differential, the bias can be either away from or towards the null value [26].
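As a numeric illustration (values invented), the apparent event probability observed with the proxy outcome can be written as $P(Y^* = 1 \mid X) = Se \cdot P(Y = 1 \mid X) + (1 - Sp)\,(1 - P(Y = 1 \mid X))$, so a low specificity inflates the apparent event rate in study $S_2$ and shifts the naive log odds:

```r
# Illustration (invented values) of the distortion induced by imperfect Se and Sp.
se <- 0.9; sp <- 0.7
p_true <- 0.30                                    # hypothetical true P(Y = 1) under treatment B
p_star <- se * p_true + (1 - sp) * (1 - p_true)   # apparent P(Y* = 1) = 0.48
log(p_star / (1 - p_star)) - log(p_true / (1 - p_true))   # shift in the naive log odds for B
```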
2.5. Outcome‐Corrected Model
The proposed outcome‐corrected model is designed to account for misclassification in binary outcomes when comparing treatments between a single‐arm trial and an ECA. The first step is to correct the biased estimates $\mu^*_{B(1)|X}$ and $\beta^*_p$, and then to employ the corrected estimates to perform the indirect treatment comparison. Let the outcome measurement error model be:

$$\text{logit}\, P(Y^* = 1 \mid Y, Z) = \gamma_0 + \gamma_1 Y + \gamma_2^{\top} Z, \quad (5)$$

where $Z$ are covariates for a differential measurement error, which are a subset of $X$, and $\text{expit}(u) = 1 / (1 + e^{-u})$ denotes the inverse logit function.

For patients with $Y = 1$, the sensitivity is

$$Se(Z) = P(Y^* = 1 \mid Y = 1, Z) = \text{expit}(\gamma_0 + \gamma_1 + \gamma_2^{\top} Z),$$

and for patients with $Y = 0$, the specificity is

$$Sp(Z) = P(Y^* = 0 \mid Y = 0, Z) = 1 - \text{expit}(\gamma_0 + \gamma_2^{\top} Z).$$

Note that if $\gamma_2 = 0$, this implies a non‐differential outcome measurement error, with $Se = \text{expit}(\gamma_0 + \gamma_1)$ and $Sp = 1 - \text{expit}(\gamma_0)$. For simplicity, a single continuous prognostic variable $X$ and no differential measurement error will be assumed from now on. To correct the biased estimates $\mu^*_{B(1)|X}$ and $\beta^*$, information on the parameters of the measurement error model is required. Often, a priori information about measurement error is lacking, necessitating ancillary studies to estimate the parameters of the outcome measurement error model (Equation 5). The STRATOS guidance identifies different types of ancillary studies [26]. The following section considers an external validation study $S_3$ where both the proxy outcome $Y^*$ and the reference outcome $Y$ are available in a third study (Table 1).
According to Lyles et al. [31], using Carroll et al.'s [25] general likelihood, an external validation study can be used to correctly estimate $\mu_{B(1)|X}$ and $\beta$ from Equation (2). The external validation study is used to estimate the joint distribution of $(Y^*, Y)$. Assuming there are $n_3$ patients in the external validation study $S_3$, each providing data as $(Y^*_{i3}, Y_{i3})$, and assuming a non‐differential measurement error, the individual likelihood $L_{i3}$ for the external validation study is:

$$L_{i3} = P(Y^*_{i3} \mid Y_{i3}).$$

Using $Se = P(Y^* = 1 \mid Y = 1)$ and $Sp = P(Y^* = 0 \mid Y = 0)$,

$$L_{i3} = \left[ Se^{\,Y^*_{i3}} (1 - Se)^{1 - Y^*_{i3}} \right]^{Y_{i3}} \left[ (1 - Sp)^{Y^*_{i3}} Sp^{\,1 - Y^*_{i3}} \right]^{1 - Y_{i3}}.$$

The likelihood for the external validation study is $L_3 = \prod_{i=1}^{n_3} L_{i3}$. Note that when using an internal validation study, the individual likelihood $L_{i3}$ remains in the same form as described above [31].
Then, study $S_2$ is used to estimate the conditional probability $P(Y = 1 \mid X)$ under treatment $B$. Assuming there are $n_2$ patients in study $S_2$, each providing data as $(Y^*_{i2}, X_{i2})$, and assuming a non‐differential outcome measurement error (i.e., $Se$ and $Sp$ do not depend on covariates), the individual likelihood $L_{i2}$ for study $S_2$ is:

$$L_{i2} = P(Y^*_{i2} \mid X_{i2}) = \sum_{y \in \{0, 1\}} P(Y^*_{i2} \mid Y_{i2} = y)\, P(Y_{i2} = y \mid X_{i2}).$$

When $Y^*_{i2} = 1$, this term corresponds to:

$$Se\, p_{i2} + (1 - Sp)(1 - p_{i2}).$$

When $Y^*_{i2} = 0$, this term corresponds to:

$$(1 - Se)\, p_{i2} + Sp\, (1 - p_{i2}).$$

So, the individual likelihood of study $S_2$ is:

$$L_{i2} = \left[ Se\, p_{i2} + (1 - Sp)(1 - p_{i2}) \right]^{Y^*_{i2}} \left[ (1 - Se)\, p_{i2} + Sp\, (1 - p_{i2}) \right]^{1 - Y^*_{i2}},$$

where

$$p_{i2} = P(Y_{i2} = 1 \mid X_{i2}) = \text{expit}\left( \mu_{B(1)|X} + \beta \left( X_{i2} - \bar{X}_{1} \right) \right). \quad (6)$$

The likelihood for study $S_2$ is $L_2 = \prod_{i=1}^{n_2} L_{i2}$. Finally, the total likelihood is:

$$L = L_2 \times L_3. \quad (7)$$
When only AgD for study $S_1$ are available, the likelihood in Equation (7) is maximized to estimate the corrected $\hat{\mu}_{B(1)|X}$ and $\hat{\beta}$, which are used to predict the conditional log odds of the outcome under treatment $B$ in the target population (Equation 2).
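A minimal R sketch of maximizing the likelihood in Equation (7) with optim() is shown below, assuming a single prognostic covariate, non‐differential misclassification, and hypothetical object names (`y_star2`, `x2` for study $S_2$; `y3`, `y_star3` for the validation study; `xbar1` for the reported mean covariate value of study $S_1$):

```r
expit <- function(u) 1 / (1 + exp(-u))

negloglik_eq7 <- function(par, y_star2, x2, y3, y_star3, xbar1) {
  mu_B <- par[1]; beta <- par[2]
  se <- expit(par[3]); sp <- expit(par[4])        # logit scale keeps Se and Sp in (0, 1)

  # Study S2 contribution: P(Y* | X) mixes Se, Sp and P(Y = 1 | X) from Equation (6)
  p2    <- expit(mu_B + beta * (x2 - xbar1))
  pstar <- se * p2 + (1 - sp) * (1 - p2)
  ll2   <- sum(dbinom(y_star2, 1, pstar, log = TRUE))

  # Validation study S3 contribution: P(Y* | Y)
  ll3 <- sum(dbinom(y_star3, 1, ifelse(y3 == 1, se, 1 - sp), log = TRUE))

  -(ll2 + ll3)
}

fit <- optim(par = c(0, 0, qlogis(0.8), qlogis(0.8)),
             fn  = negloglik_eq7, method = "BFGS", hessian = TRUE,
             y_star2 = y_star2, x2 = x2, y3 = y3, y_star3 = y_star3, xbar1 = xbar1)
mu_B_corrected <- fit$par[1]   # corrected conditional log odds under treatment B
```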
When IPD for study $S_1$ are available, $\mu_{A(1)|X}$ is modeled with the binomial likelihood of the outcome in $S_1$ including $n_1$ patients, $L_1 = \prod_{i=1}^{n_1} p_{i1}^{Y_{i1}} (1 - p_{i1})^{1 - Y_{i1}}$, using as probability parameters the outcome model in Equation (3). As indicated above, the correlation between $\hat{\mu}_{A(1)|X}$ and $\hat{\mu}_{B(1)|X}$ stems from the fact that the observed values of $X_1$, the prognostic variable in study $S_1$, are used to predict $\hat{\mu}_{B(1)|X}$. To account for this correlation, the likelihood $L_X$ of the random variable $X_1$ is added (e.g., a normal likelihood with mean $\mu_{X_1}$ and standard deviation $\sigma_{X_1}$). The total likelihood to be maximized is then:

$$L = L_1 \times L_2 \times L_3 \times L_X. \quad (8)$$
To estimate the variance of the indirect treatment effect (Equation 4), the Fisher information is used. Since the covariate in Equations (3) and (6) is centered on $\mu_{X_1}$ (i.e., the true mean of the random variable $X_1$), $\mu_{A(1)|X}$ from Equation (3) and $\mu_{B(1)|X}$ from Equation (6) are interpreted as the conditional log odds under treatments $A$ and $B$, respectively. The indirect treatment effect from Equation (1) is estimated as the difference between $\hat{\mu}_{A(1)|X}$ and $\hat{\mu}_{B(1)|X}$; consequently, the variance and covariance terms are estimated using the Fisher information. Specifically, the variance of $\hat{\mu}_{A(1)|X}$ is the diagonal term for $\mu_{A(1)|X}$ (Equation 3), the variance of $\hat{\mu}_{B(1)|X}$ is the diagonal term for $\mu_{B(1)|X}$ (Equation 6), and the covariance term (Equation 4) is the covariance between $\hat{\mu}_{A(1)|X}$ and $\hat{\mu}_{B(1)|X}$. Confidence intervals can then be derived assuming a normal distribution of the estimate.
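Under the same notational assumptions, the variance of the indirect effect can be read off the Hessian returned by optim(); the object `fit8` and the parameter ordering below are hypothetical:

```r
# Suppose fit8 <- optim(..., hessian = TRUE) minimized the negative log-likelihood of
# Equation (8), with par[1] = mu_A and par[2] = mu_B (ordering assumed for illustration).
vcov_hat <- solve(fit8$hessian)            # inverse observed Fisher information

delta_hat <- fit8$par[1] - fit8$par[2]     # Equation (1), conditional log odds ratio
var_delta <- vcov_hat[1, 1] + vcov_hat[2, 2] - 2 * vcov_hat[1, 2]   # Equation (4)

ci <- delta_hat + c(-1, 1) * qnorm(0.975) * sqrt(var_delta)
exp(c(or = delta_hat, lower = ci[1], upper = ci[2]))   # on the odds ratio scale
```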
All the PAIC methods presented herein build upon the assumption of conditional exchangeability of treatment effects [6], meaning that no unknown or unmeasured prognostic factors or effect modifiers are missing from the models. The outcome‐corrected model additionally assumes a correctly specified outcome measurement error model (Equation 5). Specifically, the outcome measurement error model (Equation 5) must account for all variables that contribute to a differential measurement error. When estimating the outcome measurement error model using an external validation study, it is essential to assume transportability [25, 26], which refers to the applicability of the sensitivity and specificity parameters across both the external validation study and the ECA (study $S_2$).
3. Simulation Study Plan
We adhere to the ADEMP reporting framework [34] by describing the Aim (Section 3.1), the Data‐generating mechanism (Section 3.2), the Estimands (Section 3.3), the Methods under investigation (Section 3.4), and the Performance measures (Section 3.5).
3.1. Aims
This simulation study aimed to quantify the bias resulting from ignoring the misclassification of a binary outcome and to evaluate the performance of three different methods described in Section 2, in the context of unanchored indirect comparison involving a single‐arm trial compared to an ECA.
3.2. Data‐Generating Mechanisms
As a reminder, there are three studies (Table 1): the targeted population study $S_1$ that includes IPD or AgD, the external control group study $S_2$ that includes IPD, and the external validation study $S_3$ that is used to estimate the outcome measurement error model. Two treatments are considered: the patients from study $S_1$ received the experimental treatment $A$, and the patients from study $S_2$ received the control treatment $B$.
A binary outcome $Y_i$ for individual $i$ is generated under a logistic regression model:

$$\text{logit}\, P(Y_i = 1 \mid X_i) = \mu_T + \beta_1 X_i, \qquad T \in \{A, B\},$$

with a prognostic variable $X_i$ that follows a normal distribution with mean $\mu_X$ and variance $\sigma^2_X$: $X_i \sim N(\mu_X, \sigma^2_X)$. $\beta_1$ is the coefficient for the prognostic variable $X$. As $\beta_1$ is not stratified by treatment, the variable $X$ is thus only a prognostic variable and not an effect modifier. And $\mu_T$ is the log odds of the outcome for treatment $T$. The proxy binary outcome $Y^*_i$ is generated as a non‐differential misclassification of $Y_i$ as follows:

$$P(Y^*_i = 1 \mid Y_i) = Se \cdot Y_i + (1 - Sp)(1 - Y_i),$$

where $Se = P(Y^* = 1 \mid Y = 1)$ and $Sp = P(Y^* = 0 \mid Y = 0)$. To set $\beta_1$, the coefficient for the prognostic variable, we expressed it as a function of the prevalence of the outcome in study $S_2$ (Supporting Information material). Furthermore, the log odds of the outcome under treatment $A$, $\mu_A$, is determined by the ratio of the standardized treatment effect size to the standardized "total" difference between trials (Supporting Information material). The standardized treatment effect size is, informally, the standardized distance between studies $S_1$ and $S_2$ due to the treatment effect of $A$ versus $B$. The standardized "total" difference is the standardized distance between studies $S_1$ and $S_2$ due to both the treatment effect and the difference in the distribution of the prognostic variable. The ratio therefore represents the proportion of the difference between studies $S_1$ and $S_2$ that is due to the treatment effect, the complement being the proportion due to differences in patients' prognostic variables. The value of this ratio was set to 0.6, indicating that the standardized treatment effect constitutes 0.6 times the standardized "total" difference between trials. Thus 40% of the "total" standardized difference in outcome between trials is attributable to the "impact" of the prognostic variable. This impact is determined by the magnitude of $\beta_1$ and the degree of overlap in the distribution of $X$ between the studies.
The values of the fixed parameters used in the simulations, including the prevalence of the outcome in study $S_2$, led to a treatment effect odds ratio (OR) of 1.47. Five thousand simulations per scenario were performed.
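A minimal R sketch of this data‐generating mechanism is given below; the covariate distribution and coefficient values are placeholders chosen for illustration, not the values used in the article:

```r
set.seed(2024)
expit <- function(u) 1 / (1 + exp(-u))

n2 <- 500; se <- 0.9; sp <- 0.7      # reference scenario values (Section 3.2)
beta1 <- 0.5; mu_B <- -1.5           # placeholder coefficients (illustrative only)
x2 <- rnorm(n2, mean = 0, sd = 1)    # prognostic covariate in study S2 (placeholder distribution)

y2 <- rbinom(n2, 1, expit(mu_B + beta1 * x2))           # true outcome Y
y_star2 <- rbinom(n2, 1, ifelse(y2 == 1, se, 1 - sp))   # proxy outcome Y* with the chosen Se, Sp
```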
Different scenarios were defined according to different combinations of the following parameters (reference values in bold):

- Specificity $Sp$: **0.7**, 0.8, 0.9;
- Sensitivity $Se$: 0.7, 0.8, **0.9**;
- Overall sample sizes: $n_1$: 500, and $n_2$, $n_3$: 200, **500**, 1500.

We specified two additional scenarios with sensitivity and specificity fixed at 0.7, varying the sample sizes $n_2$ and $n_3$ (Table S1).
3.3. Estimand
The estimand of interest is the population‐average treatment effect of $A$ versus $B$ in the targeted population of the single‐arm trial, $S_1$: $\Delta_{AB(1)}$, which would represent the treatment effect in a randomized trial conducted in that population.
3.4. Methods Under Investigation
Three methods were evaluated:

- The reference method (see details in Section 2.3), which uses the true outcome $Y$ in both study $S_1$ and study $S_2$;
- The uncorrected method (see details in Section 2.4), which uses the true outcome $Y$ in study $S_1$ and the proxy outcome $Y^*$ in study $S_2$ but ignores the misclassification in the outcome;
- The outcome‐corrected model (see details in Section 2.5), which uses the true outcome $Y$ in study $S_1$ and the proxy outcome $Y^*$ in study $S_2$, and corrects the misclassification in the outcome.
3.5. Performance Measures
The performance measures included absolute and relative bias, the 95% confidence interval (95% CI) coverage probability, the empirical standard error (SE), the root mean square error (RMSE), and the proportion of simulations for which optim() did not converge. As there is no explicit solution for maximizing the likelihoods in Equations (7) and (8), numerical maximization tools can be used. Specifically, the optim() function with "BFGS" (a quasi‐Newton method) as the method argument was used in R [35]. To aid convergence, the standard deviation and probability parameters were transformed using log and logit transformations, respectively. Analyses were performed using R version 4.3.3.
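For illustration, the performance measures can be computed from the converged replicates as follows (a sketch assuming vectors `est` and `se_est` of point estimates and standard errors, and the true value `delta_true`; names are illustrative):

```r
perf <- function(est, se_est, delta_true) {
  lower <- est - qnorm(0.975) * se_est
  upper <- est + qnorm(0.975) * se_est
  c(abs_bias = mean(est) - delta_true,
    rel_bias = 100 * (mean(est) - delta_true) / delta_true,
    emp_se   = sd(est),
    coverage = 100 * mean(lower <= delta_true & delta_true <= upper),
    rmse     = sqrt(mean((est - delta_true)^2)))
}
```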
The simulation results are presented in Section 4 for scenarios where IPD are available for study $S_1$. See Tables S2 and S3 for the results when only AgD for study $S_1$ are available.
4. Simulation Results
The results presented here are based on simulations using IPD for study $S_1$, employing three different methods: the reference method, the outcome‐corrected model, and the uncorrected method. Results using AgD for study $S_1$ are outlined in the Supporting Information materials.
4.1. Variation According to Specificity
In this scenario, specificity varied between 0.7 and 0.9 while other parameters were held at their reference values. Table 2 presents the performance measures for each method. Figure S1a illustrates the distribution of absolute bias of the uncorrected method and the outcome‐corrected model.
TABLE 2.
Performance measure according to specificity and sensitivity in an IPD‐IPD setting.
| Performance measure | Method | Sp = 0.7 (Se = 0.9) | Sp = 0.8 (Se = 0.9) | Sp = 0.9 (Se = 0.9) | Se = 0.7 (Sp = 0.7) | Se = 0.8 (Sp = 0.7) | Se = 0.9 (Sp = 0.7) |
|---|---|---|---|---|---|---|---|
| Absolute bias | Reference | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | Uncorrected | −0.92 | −0.60 | −0.26 | −0.72 | −0.82 | −0.92 |
| | Outcome‐corrected | 0.02 | 0.00 | 0.00 | −0.02 | 0.01 | 0.02 |
| Relative bias (%) | Reference | 0.53 | 0.79 | 0.96 | 0.27 | 0.65 | 0.53 |
| | Uncorrected | 237.27 | 156.13 | 67.66 | 187.72 | 212.26 | 237.27 |
| | Outcome‐corrected | 4.12 | 1.04 | 0.51 | 5.85 | 2.32 | 4.12 |
| Empirical SE | Reference | 0.26 | 0.26 | 0.26 | 0.26 | 0.26 | 0.26 |
| | Uncorrected | 0.22 | 0.23 | 0.25 | 0.22 | 0.22 | 0.22 |
| | Outcome‐corrected | 0.55 | 0.45 | 0.37 | 0.82 | 0.66 | 0.55 |
| 95% CI coverage (%) | Reference | 95.50 | 95.60 | 95.50 | 95.50 | 95.60 | 95.50 |
| | Uncorrected | 1.90 | 25.50 | 79.60 | 10.30 | 4.50 | 1.90 |
| | Outcome‐corrected | 96.10 | 95.90 | 95.40 | 97.20 | 96.50 | 96.10 |
| RMSE | Reference | 0.26 | 0.26 | 0.26 | 0.26 | 0.26 | 0.26 |
| | Uncorrected | 0.94 | 0.64 | 0.36 | 0.76 | 0.85 | 0.94 |
| | Outcome‐corrected | 0.55 | 0.45 | 0.37 | 0.82 | 0.66 | 0.55 |
| Non‐convergence (n) | Reference | 0 | 0 | 0 | 0 | 0 | 0 |
| | Uncorrected | 0 | 0 | 0 | 0 | 0 | 0 |
| | Outcome‐corrected | 9 | 1 | 0 | 143 | 42 | 9 |
Note: The reference method uses the true outcome $Y$ in both studies $S_1$ and $S_2$; the uncorrected method ignores outcome misclassification in study $S_2$; the outcome‐corrected model accounts for outcome misclassification in study $S_2$.
Abbreviations: CI: confidence interval; IPD: individual patient data; RMSE: root mean square error; SE: standard error; Se: sensitivity; Sp: specificity.
The relative bias of the uncorrected method was 67% when both specificity and sensitivity were at 0.9, increasing to 156% and 237% for specificities of 0.8 and 0.7, respectively. The 95% CI coverage probability of the uncorrected method was 79.6% when both specificity and sensitivity were at 0.9, decreasing to approximately 26% for a specificity of 0.8.
The absolute bias of the outcome‐corrected model remained low and close to zero as the specificity approached 0.9, with a relative bias consistently below 5%. The highest absolute bias of the outcome‐corrected model was 0.02, at a specificity of 0.7. Reducing specificity did increase the variance of the estimates; the empirical SE increased 1.5‐fold between specificities of 0.9 and 0.7. The 95% CI coverage probability for the outcome‐corrected model remained around 95.5% across specificity values. The outcome‐corrected model failed to converge in nine instances (0.18%) at a specificity of 0.7.
For specificities of 0.7 and 0.8, the RMSE of the outcome‐corrected model was approximately twice as high as that of the reference method but approximately half that of the uncorrected method. When both sensitivity and specificity were at 0.9, the outcome‐corrected model demonstrated lower bias, better coverage probability, and a similar RMSE to the uncorrected method (Table 2).
4.2. Variation According to Sensitivity
In this scenario, sensitivity varied between 0.7 and 0.9 while other parameters were held at reference values. Table 2 presents the performance measures for each method. Figure S1b illustrates the distribution of absolute bias of the uncorrected method and the outcome‐corrected model.
The relative bias of the uncorrected method increased as the sensitivity approached 0.9: 187% at a sensitivity of 0.7 and 237% at a sensitivity of 0.9 (Table 2). For a specificity of 0.7, 30% of the true negative cases were recorded as false positives, which is particularly consequential given the low prevalence of the outcome. Moreover, increasing sensitivity without a corresponding improvement in specificity led to an apparent rise in the frequency of the outcome. This rise can easily be misinterpreted as a treatment effect rather than being attributed to the improved identification of true cases. The 95% CI coverage probability of the uncorrected method was 10.3% when both specificity and sensitivity were 0.7.
For the outcome‐corrected model, the bias remained roughly constant as sensitivity increased, with a relative bias below 6% (Figure S1b and Table 2). The 95% CI coverage probabilities remained above 95%, reaching 97.2% at a sensitivity of 0.7 and 96% for sensitivities of 0.8 and 0.9. The outcome‐corrected model failed to converge in 143 instances (2.9%) at a sensitivity of 0.7.
For a sensitivity of 0.8, the RMSE of the outcome‐corrected model was three times higher than that of the reference method but around 1.5‐fold lower than that of the uncorrected method. When both specificity and sensitivity were at 0.7, the outcome‐corrected model had high variance, with an empirical SE four times higher than the other methods and a RMSE higher than the uncorrected method (Table 2).
4.3. Variation According to Sample Size in Studies $S_2$ and $S_3$
In this scenario, the sample sizes $n_2$ and $n_3$ of studies $S_2$ and $S_3$ varied between 200 and 1500, while other parameters were held at their reference values. Table 3 presents the performance measures for each method. Figure S2 illustrates the distribution of absolute bias of the uncorrected method and the outcome‐corrected model. Figure S3 presents the distribution of the absolute bias of the sensitivity and specificity estimated by the outcome‐corrected model for different sample sizes of study $S_3$.
TABLE 3.
Performance measure according to sample sizes $n_2$ and $n_3$ in an IPD‐IPD setting.
| Performance measure | Method | $n_3$ = 200 ($n_2$ = 500) | $n_3$ = 500 ($n_2$ = 500) | $n_3$ = 1500 ($n_2$ = 500) | $n_2$ = 200 ($n_3$ = 500) | $n_2$ = 500 ($n_3$ = 500) | $n_2$ = 1500 ($n_3$ = 500) |
|---|---|---|---|---|---|---|---|
| Absolute bias | Reference | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | Uncorrected | −0.92 | −0.92 | −0.91 | −0.93 | −0.92 | −0.91 |
| | Outcome‐corrected | 0.02 | 0.02 | 0.02 | −0.01 | 0.02 | 0.01 |
| Relative bias (%) | Reference | 0.53 | 0.53 | 0.41 | 0.40 | 0.53 | 0.08 |
| | Uncorrected | 237.71 | 237.27 | 236.68 | 241.26 | 237.27 | 236.51 |
| | Outcome‐corrected | 3.96 | 4.12 | 5.56 | 2.30 | 4.12 | 3.39 |
| Empirical SE | Reference | 0.26 | 0.26 | 0.26 | 0.41 | 0.26 | 0.17 |
| | Uncorrected | 0.22 | 0.22 | 0.22 | 0.33 | 0.22 | 0.15 |
| | Outcome‐corrected | 0.61 | 0.55 | 0.54 | 0.86 | 0.55 | 0.35 |
| 95% CI coverage (%) | Reference | 95.60 | 95.50 | 95.10 | 94.90 | 95.50 | 95.20 |
| | Uncorrected | 1.70 | 1.90 | 2.00 | 21.30 | 1.90 | 0.00 |
| | Outcome‐corrected | 96.20 | 96.10 | 95.90 | 96.20 | 96.10 | 95.80 |
| RMSE | Reference | 0.26 | 0.26 | 0.26 | 0.41 | 0.26 | 0.17 |
| | Uncorrected | 0.94 | 0.94 | 0.94 | 0.99 | 0.94 | 0.92 |
| | Outcome‐corrected | 0.61 | 0.55 | 0.54 | 0.86 | 0.55 | 0.35 |
| Non‐convergence (n) | Reference | 0 | 0 | 0 | 0 | 0 | 0 |
| | Uncorrected | 0 | 0 | 0 | 0 | 0 | 0 |
| | Outcome‐corrected | 72 | 9 | 8 | 170 | 9 | 0 |
Note: The reference method uses the true outcome $Y$ in both studies $S_1$ and $S_2$; the uncorrected method ignores outcome misclassification in study $S_2$; the outcome‐corrected model accounts for outcome misclassification in study $S_2$.
Abbreviations: CI: confidence interval; IPD: individual patient data; RMSE: root mean square error; SE: standard error.
Since the uncorrected method does not use the validation study $S_3$, its performance measures did not change with $n_3$. The relative bias of the uncorrected method remained around 237% when $n_3$ varied. Because the uncorrected method ignores misclassification, increasing $n_2$ did not affect the bias either (Table 3 and Figure S2a). As the bias remained the same, the 95% CI coverage probability decreased as $n_2$ increased, dropping from 21% with 200 patients in study $S_2$ to 0% with 1500 patients in study $S_2$.
For the outcome‐corrected model, increasing $n_3$ did not reduce the bias, with an absolute bias remaining around 0.01 and the 95% CI coverage around 96% (Table 3). Larger validation samples enhanced the precision of the sensitivity and specificity estimates (Figure S3). Increasing $n_2$ did not decrease the absolute bias either (Table 3 and Figure S2); however, it increased precision, reducing both the empirical SE and the RMSE by a factor of about 2.5 between 200 and 1500 patients. The RMSE improvement was greater with larger $n_2$ than with larger $n_3$, dropping from 0.86 to 0.35 as $n_2$ increased from 200 to 1500 patients, versus from 0.61 to 0.54 as $n_3$ increased from 200 to 1500 patients. A low sample size in $S_2$ had a greater impact on non‐convergence, with 170 (3.4%) non‐converged iterations for $n_2$ = 200 and 72 (1.4%) for $n_3$ = 200. When both sensitivity and specificity were at 0.7, the same pattern was observed (Table S1).
4.4. Aggregated Data Results
Results using AgD for study $S_1$ are provided in the Supporting Information materials and are generally consistent with those obtained using IPD for study $S_1$.
5. Applied Example: PRODIGE‐11 and SHARP Trials
The estimand was the effect of adding sorafenib to the standard of care in adults with advanced‐stage hepatocellular carcinoma. Two RCTs were used: the SHARP trial [32] as the AgD study $S_1$ and the PRODIGE‐11 trial [33] as the IPD study $S_2$. Although SHARP and PRODIGE‐11 are comparative trials, only a single arm from each study was extracted for the analysis, focusing on the placebo arm from SHARP (i.e., standard of care) and the sorafenib‐alone arm from PRODIGE‐11. This estimand was chosen because the treatment effect of sorafenib versus placebo from the SHARP trial served as a reference. The validation study (study $S_3$) was internal, using a subset of patients from the PRODIGE‐11 trial who had both the reference outcome $Y$ (radiological progression) and the proxy outcome $Y^*$ (clinical progression) assessed.
The design characteristics of each trial are outlined in Table 4. The SHARP trial [32], a double‐blind, placebo‐controlled study, included 602 patients with advanced hepatocellular carcinoma, randomized to receive either sorafenib (400 mg twice daily) or a placebo. Radiological progression, defined as the time from randomization to disease progression, was based on independent radiological review according to RECIST criteria. Data on radiological progression at 4 months for the placebo group (i.e., standard of care) were extracted from Kaplan–Meier curves. The PRODIGE‐11 trial [33], a randomized, unblinded controlled trial, included 323 patients with advanced hepatocellular carcinoma, randomized to receive either sorafenib (400 mg twice daily) or sorafenib (400 mg twice daily) plus pravastatin (40 mg daily). Radiological progression was assessed according to RECIST criteria every 12 weeks, and clinical progression every 4 weeks. Baseline characteristics of the sorafenib arm in PRODIGE‐11 and the placebo arm in SHARP are presented in Table 5. From a reduced set of prognostic covariates, we included those with a standardized mean difference above 0.1 between the two groups, taken as an empirical threshold [36]: age, etiology (using alcohol as reference), and extrahepatic metastases, without any interaction terms.
TABLE 4.
Design characteristics of the SHARP and PRODIGE‐11 trials.
| | SHARP | PRODIGE‐11 |
|---|---|---|
| Data | Aggregated data | Individual patient data |
| Design | Randomized, double blind | Randomized, unblinded |
| Inclusion–exclusion criteria (both trials) | Adults with advanced‐stage hepatocellular carcinoma, confirmed by pathological analysis; not eligible for, or disease progression after, surgical or locoregional therapies; Child‐Pugh liver function class A; life expectancy of 12 weeks or more; ASAT and ALAT < 5 N | |
| Inclusion–exclusion criteria (trial‐specific) | ECOG < 3 | WHO‐PS < 3; CLIP score 0–4 |
| Localisation | USA, Europe, Asia | France |
| Inclusion interval | 2005–2006 | 2010–2013 |
| Treatment | Sorafenib 400 mg twice daily; placebo | Sorafenib 400 mg twice daily; sorafenib 400 mg twice daily plus pravastatin 40 mg daily |
| Outcome | Radiological progression according to RECIST assessed every 6 weeks | Clinical progression assessed every 4 weeks |
Abbreviations: ASAT: aspartate amino‐transferase; ALAT: alanine amino‐transferase; CLIP: Cancer of the Liver Italian Program; ECOG: Eastern Cooperative Oncology Group; WHO‐PS: World Health Organization Performance Status.
TABLE 5.
Baseline characteristics for SHARP and PRODIGE‐11 studies.
| | SHARP trial, placebo arm (n = 303) | PRODIGE‐11 trial, sorafenib arm (n = 161) | SMD |
|---|---|---|---|
| Age (year)—mean (SD) | 66.3 (0.2) | 68 (9) | 0.14 |
| Sex—n (%) | | | 0.03 |
| Male | 264 (87) | 142 (88) | |
| Female | 39 (13) | 19 (12) | |
| Etiology—n (%) | | | 0.62 |
| Hepatitis (B or C) | 137 (45) | 31 (20) | |
| Alcohol | 80 (26) | 81 (50) | |
| Other or unknown | 86 (30) | 49 (30) | |
| ECOG performance status—n (%) | | | 0.07 |
| 0 | 164 (54) | 93 (58) | |
| ≥ 1 | 139 (46) | 68 (42) | |
| Macroscopic vascular invasion—n (%) | 123 (41) | 60 (39) | 0.07 |
| Extrahepatic metastases—n (%) | 150 (50) | 49 (30) | 0.41 |
| Child–Pugh class—n (%) | | | 0.06 |
| A | 297 (98) | 155 (97) | |
| B | 6 (2) | 4 (2.5) | |
Abbreviations: ECOG: Eastern Cooperative Oncology Group; n: number; SD: standard deviation; SMD: standardized mean difference.
The treatment effect of sorafenib versus placebo in the SHARP trial [32] on radiological progression at 4 months was OR = 0.52, used as a reference value. Using radiological progression (the true outcome $Y$) in the PRODIGE‐11 trial, the indirect treatment effect of sorafenib (from PRODIGE‐11) versus placebo (from SHARP) was OR = 0.6, 95% CI 0.34–0.99 (5000 bootstrap iterations). Using the proxy outcome $Y^*$ (i.e., clinical progression) in the PRODIGE‐11 trial without accounting for misclassification (i.e., the uncorrected method), the indirect treatment effect was OR = 0.36, 95% CI 0.18–0.59 (5000 bootstrap iterations); using the outcome‐corrected model, it was OR = 0.55, 95% CI 0.09–5.86 (5000 bootstrap iterations).
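For reference, a sketch of the nonparametric bootstrap used for these confidence intervals, assuming a hypothetical data frame `prodige` with the PRODIGE‐11 IPD and a hypothetical function `indirect_or()` that re‐estimates the indirect OR on a resampled data set (the SHARP aggregate data being held fixed); both names are illustrative:

```r
set.seed(123)
boot_or <- replicate(5000, {
  idx <- sample(nrow(prodige), replace = TRUE)   # resample PRODIGE-11 patients with replacement
  indirect_or(prodige[idx, ])                    # re-estimate the indirect odds ratio
})
quantile(boot_or, c(0.025, 0.975))               # percentile 95% confidence interval
```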
6. Discussion
The present study addressed the challenges of unanchored indirect treatment comparisons in the presence of misclassification in binary outcomes. The focus was on a single‐arm trial compared with an ECA. The aims were to quantify the bias introduced by ignoring misclassification and to introduce a method to correct this bias (i.e., the outcome‐corrected model). The simulations served as a proof of concept in relatively straightforward scenarios. The reference method was the indirect conditional treatment effect estimated in a setting without measurement error. The outcome‐corrected model was compared to a method without measurement error correction (i.e., the uncorrected method).
The simulations found that ignoring misclassification in binary outcomes leads to substantial bias in estimating indirect treatment effects when a single‐arm trial is compared with an ECA. Even with a specificity and sensitivity of 0.9, and an outcome model accounting for all prognostic variables, the uncorrected method had a relative bias of 67% and a 95% CI coverage of 79%. Since the uncorrected method fails to reduce bias, increasing the ECA (study $S_2$) sample size resulted in a dramatic decrease in the 95% CI coverage. The outcome‐corrected model had lower bias and better coverage probabilities than the uncorrected method, even when both specificity and sensitivity were at 0.9. When sensitivity and specificity decreased, estimates remained minimally biased, with a relative bias below 6%. However, this reduction in sensitivity or specificity led to an increase in variance. Increasing the sample size of the ECA (study $S_2$) had a greater impact on improving the RMSE than increasing the sample size of the validation study ($n_3$). The validation study is used for estimating the joint probability of the reference and proxy outcomes (Equation 5). With a non‐differential measurement error, the model is straightforward, and thus even a small sample size can provide precise estimates of sensitivity and specificity, so that the RMSE improvement quickly reaches a plateau. In contrast, the sample size of the ECA (study $S_2$) helps to estimate the conditional probability $P(Y = 1 \mid X)$, which is needed to predict the effect of treatment $B$ in study $S_1$. Since the ECA (study $S_2$) could consist of routinely collected data from EMRs, it is more feasible to achieve a large sample size than in study $S_3$, where both outcomes are measured. Alternatively, study $S_3$ could be an internal validation sample, in which a random sample of patients from $S_2$ has both measurements.
All the PAIC methods presented herein build upon the assumption of conditional exchangeability of treatment effects [6]. The outcome‐corrected model additionally assumes a correctly specified outcome measurement error model (Equation 5). The simulations used a correctly specified outcome model, and the impact of a misspecified outcome measurement error model (Equation 5) was not evaluated. The sample size of the validation study should have a greater impact when there is differential misclassification. The simulations did not consider measurement errors in the prognostic covariates, which could have a substantial impact on the indirect treatment effect [26]. Simulations were performed in a setting where the transportability assumption [25, 26] holds. Transportability might not be considered feasible if the validation study and study $S_2$ involve non‐comparable populations [37]; for instance, if the ECA (study $S_2$) includes patients with advanced‐stage cancer from a general hospital, while the external validation study includes patients from a specialized oncology center known for earlier detection and treatment of cancer. The specialized center's patients might have different disease progression patterns, affecting the sensitivity and specificity of the outcome measurements. As a result, applying the sensitivity and specificity estimates from the external validation study to the ECA (study $S_2$) may lead to a biased indirect treatment effect. Furthermore, if one assumes that the treatment impacts the measurement error (i.e., differential measurement error), it would be preferable that all patients in the validation study receive the same treatment as in the ECA (study $S_2$). For all these reasons, an internal validation study may be more suitable, because it reduces concerns about transportability and provides flexibility to accommodate general patterns of differential misclassification [31]. Additionally, the individual likelihood presented in Section 2.5 remains the same with an internal validation study [31]. Another limitation is that the method proposed here estimates a conditional indirect treatment effect, and thus could suffer from non‐collapsibility when using a non‐linear link function for the outcome model [38, 39]. When AgD are available for study $S_1$, STC also estimates a conditional indirect treatment effect, with bootstrap estimates for confidence intervals [6], without accounting for the covariance term in Equation (4). When IPD are available for both studies $S_1$ and $S_2$, estimating the conditional indirect treatment effect allows a likelihood‐based method to be used to estimate the variance of the indirect treatment effect; when estimating a marginal indirect effect, the variance of the indirect treatment effect is estimated with bootstrap or robust variance estimators [12].
We illustrated the application of these methods using the AgD for the standard of care (i.e., the placebo arm) in SHARP [32] and the IPD for the sorafenib‐alone arm in the PRODIGE‐11 trial [33]. Leveraging the established efficacy of sorafenib compared to placebo in the SHARP population, we used this treatment effect estimate as a reference. The reference effect of sorafenib compared to placebo was OR = 0.52. The point estimates of the indirect treatment effect using the outcome‐corrected model and using the true outcome from the PRODIGE‐11 trial (radiological progression) were both close to the reference value (OR = 0.55 and OR = 0.6, respectively). Ignoring outcome misclassification resulted in an overestimation of the indirect treatment effect (OR = 0.36). However, with only 161 patients in the sorafenib arm of the PRODIGE‐11 trial, the 95% CI estimated by the outcome‐corrected model was wide. This is conservative, as it transfers the uncertainty in measurement to the uncertainty in the decision (provided that one refrains from concluding the absence of a difference), but it may be inefficient for small sample sizes. These findings align with the simulation results, where the empirical SE of the outcome‐corrected model was twice that of the reference method for a sample size of 200 patients (Table S3).
In the applied example, the outcome‐corrected model assumes no correlation between prognostic factors. Additional simulations are needed to evaluate the potential reduction in estimation variance when incorporating the correlation between prognostic factors into the likelihood, by adding a multivariate probability distribution to model their joint distribution when all IPD are available. It is crucial to emphasize that all methods will be biased when prognostic factors and effect modifiers are absent [6]. Furthermore, the inclusion of unnecessary prognostic factors may amplify the variance of estimation [40]. These challenges are prevalent in health technology assessment and are compounded by insufficient sample size to adjust for all covariates of interest. Consequently, it may be beneficial to employ quantitative bias methods to investigate how robust treatment effect estimates are to unmeasured confounders [41].
While the proposed correction method may enhance the reliability of unanchored indirect treatment comparisons when outcomes differ from those recorded in single‐arm trials, its implementation demands strict conditions that may be hard to meet. Consequently, our simulation study not only illustrates the method's potential benefits but also serves as a cautionary note against the limitations and risks inherent in single‐arm studies using external control groups with proxy outcomes.
Conflicts of Interest
The authors declare no conflicts of interest.
Supporting information
Figure S1: Treatment effect absolute bias for different specificity (upper) and sensitivity (lower) values, for the uncorrected method and the outcome‐corrected model (i.e., corrected).
Figure S2: Treatment effect absolute bias for different sample sizes in study $S_2$ (upper) and study $S_3$ (lower), for the uncorrected method and the outcome‐corrected model (i.e., corrected).
Figure S3: Absolute bias of specificity (upper) and sensitivity (lower) estimated by the outcome‐corrected model for varying sample sizes of study $S_3$.
Data S1: Supporting information.
Acknowledgments
The authors would like to thank the Fédération Francophone de Cancérologie Digestive (FFCD) for providing the data for the PRODIGE‐11 trial.
Nourredine M., Gavoille A., Lepage C., Kassai‐Koupai B., Cucherat M., and Subtil F., “Accounting for Misclassification of Binary Outcomes in External Control Arm Studies for Unanchored Indirect Comparisons: Simulations and Applied Example,” Statistics in Medicine 44, no. 20‐22 (2025): e70236, 10.1002/sim.70236.
Funding: The authors received no specific funding for this work.
Data Availability Statement
The data that support the findings of this study are openly available in OSF at https://osf.io/bqs5t/.
References
- 1. Food and Drug Administration, HHS, "International Conference on Harmonisation; Choice of Control Group and Related Issues in Clinical Trials; Availability," Federal Register 66, no. 93 (2001): 24390–24391.
- 2. Zhang A. D., Puthumana J., Downing N. S., Shah N. D., Krumholz H. M., and Ross J. S., "Assessment of Clinical Trials Supporting US Food and Drug Administration Approval of Novel Therapeutic Agents, 1995–2017," JAMA Network Open 3, no. 4 (2020): e203284, 10.1001/jamanetworkopen.2020.3284.
- 3. Hatswell A. J., Baio G., Berlin J. A., Irs A., and Freemantle N., "Regulatory Approval of Pharmaceuticals Without a Randomised Controlled Study: Analysis of EMA and FDA Approvals 1999–2014," BMJ Open 6, no. 6 (2016): e011666, 10.1136/bmjopen-2016-011666.
- 4. Tenhunen O., Lasch F., Schiel A., and Turpeinen M., "Single‐Arm Clinical Trials as Pivotal Evidence for Cancer Drug Approval: A Retrospective Cohort Study of Centralized European Marketing Authorizations Between 2010 and 2019," Clinical Pharmacology and Therapeutics 108, no. 3 (2020): 653–660, 10.1002/cpt.1965.
- 5. Davi R., Mahendraratnam N., Chatterjee A., Dawson C. J., and Sherman R., "Informing Single‐Arm Clinical Trials With External Controls," Nature Reviews. Drug Discovery 19, no. 12 (2020): 821–822, 10.1038/d41573-020-00146-5.
- 6. Phillippo D., Ades A., Dias S., Palmer S., Abrams K., and Welton N., NICE DSU Technical Support Document 18: Methods for Population‐Adjusted Indirect Comparisons in Submission to NICE (National Institute for Health and Care Excellence, 2016).
- 7. Serret‐Larmande A., Zenati B., Dechartres A., Lambert J., and Hajage D., "A Methodological Review of Population‐Adjusted Indirect Comparisons Reveals Inconsistent Reporting and Suggests Publication Bias," Journal of Clinical Epidemiology 163 (2023): 1–10, 10.1016/j.jclinepi.2023.09.004.
- 8. Sultana N. and Ren S., "Review of Methods Used to Estimate Treatment Effects Against Relevant Comparators Using Evidence From Single‐Arm Studies in NICE Single Technology Appraisals," Value in Health 25, no. 12 (2022): S10, 10.1016/j.jval.2022.09.056.
- 9. Chatton A., Le Borgne F., Leyrat C., et al., "G‐Computation, Propensity Score‐Based Methods, and Targeted Maximum Likelihood Estimator for Causal Inference With Different Covariates Sets: A Comparative Simulation Study," Scientific Reports 10, no. 1 (2020): 9219, 10.1038/s41598-020-65917-x.
- 10. Schuler M. S. and Rose S., "Targeted Maximum Likelihood Estimation for Causal Inference in Observational Studies," American Journal of Epidemiology 185, no. 1 (2017): 65–73, 10.1093/aje/kww165.
- 11. Naimi A. I., Cole S. R., and Kennedy E. H., "An Introduction to G Methods," International Journal of Epidemiology 46, no. 2 (2017): 756–792, 10.1093/ije/dyw323.
- 12. Faria R., Alava M. H., Manca A., and Wailoo A. J., "NICE DSU Technical Support Document 17: The Use of Observational Data to Inform Estimates of Treatment Effectiveness in Technology Appraisal: Methods for Comparative Individual Patient Data," https://www.sheffield.ac.uk/sites/default/files/2022‐02/TSD17‐DSU‐Observational‐data‐FINAL.pdf.
- 13. Signorovitch J. E., Sikirica V., Erder M. H., et al., "Matching‐Adjusted Indirect Comparisons: A New Tool for Timely Comparative Effectiveness Research," Value in Health 15, no. 6 (2012): 940–947, 10.1016/j.jval.2012.05.004.
- 14. Caro J. J. and Ishak K. J., "No Head‐To‐Head Trial? Simulate the Missing Arms," PharmacoEconomics 28, no. 10 (2010): 957–967, 10.2165/11537420-000000000-00000.
- 15. Phillippo D. M., Dias S., Ades A. E., and Welton N. J., "Assessing the Performance of Population Adjustment Methods for Anchored Indirect Comparisons: A Simulation Study," Statistics in Medicine 39, no. 30 (2020): 4885–4911, 10.1002/sim.8759.
- 16. Jiang Y. and Ni W., "Performance of Unanchored Matching‐Adjusted Indirect Comparison (MAIC) for the Evidence Synthesis of Single‐Arm Trials With Time‐To‐Event Outcomes," BMC Medical Research Methodology 20, no. 1 (2020): 241, 10.1186/s12874-020-01124-6.
- 17. Remiro‐Azócar A., Heath A., and Baio G., "Methods for Population Adjustment With Limited Access to Individual Patient Data: A Review and Simulation Study," Research Synthesis Methods 12, no. 6 (2021): 750–775, 10.1002/jrsm.1511.
- 18. Weber D., Jensen K., and Kieser M., "Comparison of Methods for Estimating Therapy Effects by Indirect Comparisons: A Simulation Study," Medical Decision Making 40, no. 5 (2020): 644–654, 10.1177/0272989X20929309.
- 19. Hatswell A. J., Freemantle N., and Baio G., "The Effects of Model Misspecification in Unanchored Matching‐Adjusted Indirect Comparison: Results of a Simulation Study," Value in Health 23, no. 6 (2020): 751–759, 10.1016/j.jval.2020.02.008.
- 20. Ren S., Ren S., Welton N. J., and Strong M., "Advancing Unanchored Simulated Treatment Comparisons: A Novel Implementation and Simulation Study," Research Synthesis Methods 15, no. 4 (2024): 657–670, 10.1002/jrsm.1718.
- 21. Degtiar I. and Rose S., "A Review of Generalizability and Transportability," Annual Review of Statistics and Its Application 10, no. 1 (2023): 501–524, 10.1146/annurev-statistics-042522-103837.
- 22. Dreyer N. A., Hall M., and Christian J. B., "Modernizing Regulatory Evidence With Trials and Real‐World Studies," Therapeutic Innovation & Regulatory Science 54, no. 5 (2020): 1112–1115, 10.1007/s43441-020-00131-5.
- 23. Monti S., Grosso V., Todoerti M., and Caporali R., "Randomized Controlled Trials and Real‐World Data: Differences and Similarities to Untangle Literature Data," Rheumatology 57, no. Supplement_7 (2018): vii54–vii58, 10.1093/rheumatology/key109.
- 24. Eisenhauer E. A., Therasse P., Bogaerts J., et al., "New Response Evaluation Criteria in Solid Tumours: Revised RECIST Guideline (Version 1.1)," European Journal of Cancer 45, no. 2 (2009): 228–247, 10.1016/j.ejca.2008.10.026.
- 25. Carroll R. J., Ruppert D., Stefanski L. A., and Crainiceanu C. M., Measurement Error in Nonlinear Models: A Modern Perspective, 2nd ed. (Chapman and Hall/CRC, 2006), 10.1201/9781420010138.
- 26. Keogh R. H., Shaw P. A., Gustafson P., et al., "STRATOS Guidance Document on Measurement Error and Misclassification of Variables in Observational Epidemiology: Part 1-Basic Theory and Simple Methods of Adjustment," Statistics in Medicine 39, no. 16 (2020): 2197–2231, 10.1002/sim.8532.
- 27. Nab L., Van Smeden M., Keogh R. H., and Groenwold R. H. H., "Mecor: An R Package for Measurement Error Correction in Linear Regression Models With a Continuous Outcome," Computer Methods and Programs in Biomedicine 208 (2021): 106238, 10.1016/j.cmpb.2021.106238.
- 28. Yi G. Y., Statistical Analysis With Measurement Error or Misclassification (Springer, 2017), 10.1007/978-1-4939-6640-0.
- 29. Gerlach R. and Stamey J., "Bayesian Model Selection for Logistic Regression With Misclassified Outcomes," Statistical Modelling 7, no. 3 (2007): 255–273, 10.1177/1471082X0700700303.
- 30. Daniel Paulino C., Soares P., and Neuhaus J., "Binomial Regression With Misclassification," Biometrics 59, no. 3 (2003): 670–675, 10.1111/1541-0420.00077.
- 31. Lyles R. H., Tang L., Superak H. M., et al., "Validation Data‐Based Adjustments for Outcome Misclassification in Logistic Regression: An Illustration," Epidemiology 22, no. 4 (2011): 589–597, 10.1097/EDE.0b013e3182117c85.
- 32. Llovet J. M., Hilgard P., de Oliveira A. C., et al., "Sorafenib in Advanced Hepatocellular Carcinoma," New England Journal of Medicine 359, no. 4 (2008): 378–390.
- 33. Jouve J. L., Lecomte T., Bouché O., et al., "Pravastatin Combination With Sorafenib Does Not Improve Survival in Advanced Hepatocellular Carcinoma," Journal of Hepatology 71, no. 3 (2019): 516–522, 10.1016/j.jhep.2019.04.021.
- 34. Morris T. P., White I. R., and Crowther M. J., "Using Simulation Studies to Evaluate Statistical Methods," Statistics in Medicine 38, no. 11 (2019): 2074–2102, 10.1002/sim.8086.
- 35. R Core Team, "R: A Language and Environment for Statistical Computing," 2013.
- 36. Kibuchi E., Sturgis P., Durrant G. B., and Maslovskaya O., "The Efficacy of Propensity Score Matching for Separating Selection and Measurement Effects Across Different Survey Modes," Journal of Survey Statistics and Methodology 12, no. 3 (2024): 764–789, 10.1093/jssam/smae017.
- 37. Shaw P. A., Gustafson P., Carroll R. J., et al., "STRATOS Guidance Document on Measurement Error and Misclassification of Variables in Observational Epidemiology: Part 2—More Complex Methods of Adjustment and Advanced Topics," Statistics in Medicine 39, no. 16 (2020): 2232–2263, 10.1002/sim.8531.
- 38. Greenland S., Pearl J., and Robins J. M., "Confounding and Collapsibility in Causal Inference," Statistical Science 14 (1999): 29–46, 10.1214/ss/1009211805.
- 39. Colnet B., Josse J., Varoquaux G., and Scornet E., "Risk Ratio, Odds Ratio, Risk Difference… Which Causal Measure Is Easier to Generalize?," 2023, http://arxiv.org/abs/2303.16008.
- 40. Colnet B., Josse J., Varoquaux G., and Scornet E., "Re‐Weighting the Randomized Controlled Trial for Generalization: Finite‐Sample Error and Variable Selection," Journal of the Royal Statistical Society. Series A, Statistics in Society 188, no. 2 (2025): 345–372, 10.1093/jrsssa/qnae043.
- 41. Popat S., Liu S. V., Scheuer N., et al., "Addressing Challenges With Real‐World Synthetic Control Arms to Demonstrate the Comparative Effectiveness of Pralsetinib in Non‐Small Cell Lung Cancer," Nature Communications 13, no. 1 (2022): 3500, 10.1038/s41467-022-30908-1.