Abstract
As the use of electronic health records (EHR) to estimate treatment effects has become widespread, concern about bias introduced by error in EHR-derived covariates has also grown. While methods exist to address measurement error in individual covariates, little prior research has investigated the implications of using propensity scores for confounder control when the propensity scores are constructed from a combination of accurate and error-prone covariates. We reviewed approaches to account for error in propensity scores and used simulation studies to compare their performance. These comparisons were conducted across a range of scenarios featuring variation in outcome type, validation sample size, main sample size, strength of confounding, and structure of the error in the mismeasured covariate. We then applied these approaches to a real-world EHR-based comparative effectiveness study of alternative treatments for metastatic bladder cancer. This head-to-head comparison of measurement error correction methods in the context of a propensity score-adjusted analysis demonstrated that multiple imputation for propensity scores performs best when the outcome is continuous and regression calibration-based methods perform best when the outcome is binary.
2. Introduction
Data from Electronic Health Records (EHR) have the potential to facilitate research on exposures and outcomes that would be difficult to study using designed observational or experimental studies due to feasibility or ethical considerations. However, because EHR were developed for clinical care and administrative purposes, the data elements needed for research are often incomplete or error-prone. Many recent studies have highlighted data quality challenges arising in EHR-based research (Hersh et al. 2013; Weiskopf and Weng 2013; Rusanov et al. 2014). Statistical methods that account for missing and error-prone data are needed to obtain valid research results from the available EHR. Without appropriate analytic approaches, treatment effect estimates obtained from error-prone EHR data may be biased, and standard errors are likely to be incorrect.
While EHR databases can provide extensive information for large study populations and many outcomes, causal conclusions can be difficult to draw, and confounding can lead to biased treatment effect estimates. Unlike in randomized trials, the factors influencing whether a patient receives a treatment of interest are uncertain. The propensity score, the probability of receiving treatment conditional on observed covariates, is therefore often estimated and used for confounder control (Rosenbaum and Rubin 1983).
Error in confounder variables presents a particularly serious challenge to the validity of EHR-based research. Prior statistical methods research has shown that the use of error-prone confounders results in biased treatment effect estimates due to residual confounding (Rosner, D. Spiegelman, and Willett 1990; Carroll et al. 2006; Guo, R. A. Little, and McConnell 2012). Thus, while propensity scores, under identifying assumptions, offer the possibility of deriving causal conclusions from EHR, error in the confounder variables used to construct the propensity score potentially undermines the effectiveness of this approach. Although a number of methods exist that account for measurement error and misclassification in covariates (Rosner, D. Spiegelman, and Willett 1990; Carroll et al. 2006; Cole, Chu, and Greenland 2006; Guo, R. A. Little, and McConnell 2012; Webb-Vargas et al. 2017), few of these methods have been evaluated in the context of a propensity score estimated from error-prone covariate information (Steiner, Cook, and Shadish 2011).
A variety of methods to address measurement error are available, although they are infrequently used in analyses using EHR-derived propensity scores. Regression calibration accounts for measurement error by regressing the true covariate on its error-prone analog in a validation sample and using the fitted model to generate predictions of the true covariate in the full sample. Because of its frequent use in some research settings, it has served as a benchmark against which other measurement error correction approaches are compared (Rosner, D. Spiegelman, and Willett 1990). One would expect this method to perform well when the error-prone version of the covariate differs from the true covariate in a systematic way, for example when the error-prone covariate inflates the true covariate by a fixed factor (Carroll et al. 2006). When the error-prone covariate is, on average, a good estimate of the true covariate but is subject to random error, the performance of regression calibration will depend on the amount of noise in the error-prone covariate. Propensity score calibration, which is regression calibration applied to a propensity score instead of a covariate, has been proposed to account for error in propensity scores arising due to omission of key covariates from the propensity score model (Sturmer, Schneeweiss, Avorn, et al. 2005).
A modification of regression calibration is Monte Carlo regression calibration. Instead of treating the coefficients in the calibration model as fixed and known, they are drawn from a multivariate normal distribution and the outcome model is fit multiple times, combining results across imputations (Cole, Chu, and Greenland 2006). This method has been compared to standard regression calibration, but only in the context of error-prone covariates, not propensity scores (Cole, Chu, and Greenland 2006; Freedman et al. 2008; Messer and Natarajan 2008).
Efficient regression calibration is a modification of regression calibration that uses a weighted combination of the regression calibration estimator and the estimate of the treatment effect obtained from the validation subset (Donna Spiegelman, Carroll, and Kipnis 2001). The efficient regression calibration estimator is expected to perform well in the same kinds of scenarios in which standard regression calibration performs well. However, the performance of efficient regression calibration has not yet been evaluated in the context of propensity scores estimated using error-prone covariates.
Similar to efficient regression calibration, two-stage calibration uses a weighted combination of the treatment effect estimated from a validation subset in which true covariate data are available and the estimate based on the error-prone covariate available in the full sample (H.-W. Lin and Chen 2014). This method is expected to perform well when the estimate of the treatment effect in the validation subset is a good estimate of the treatment effect in the full sample, had the true value been available for all subjects (H.-W. Lin and Chen 2014). While this method has been evaluated in analyses using propensity scores constructed from error-prone covariates, it was only compared head-to-head against regression calibration (H.-W. Lin and Chen 2014).
Multiple imputation can also be used if one reframes the measurement error problem as a missing data problem (R. J. A. Little 1988). Multiple imputation for measurement error is conceptually similar to regression calibration but uses multiple imputation of the true covariate, as opposed to a single imputation based on the estimated regression relationship, to capture uncertainty in the predicted true covariate value (Cole, Chu, and Greenland 2006).
This paper examines the performance of methods for bias reduction in the context of EHR-based comparative effectiveness analyses in which an error-prone covariate is included in the estimation of a propensity score. We investigated the scenario where gold-standard covariate data are available for a validation subset of the population. These methods were evaluated through simulation studies that mimic the general structure of an EHR-based analysis and were also applied to an analysis of the comparative effectiveness of common treatment strategies (e.g., chemotherapy or immunotherapy) for metastatic bladder cancer (mUC) using EHR data from a national sample of cancer clinics curated by Flatiron Health (USFDA 2018). The overarching objective of this work is to provide guidance on best practices for accounting for measurement error in propensity score-adjusted analyses.
3. Methods
We consider a scenario in which the objective of the analysis is to estimate the association between treatment and outcome using propensity score adjustment to account for confounders, one of which is measured with error. Several measurement error methods that are commonly applied to error-prone covariates will be compared. We explore the case where the measurement error correction methods are applied directly to the estimated propensity score, as opposed to the error-prone covariate.
3.1. Notation
Let the full sample be of size N and contain information on X∗, the error-prone covariate(s); Z, additional covariates measured without error; T, the binary treatment; and Y, the outcome, for each subject. Let X∗ be a matrix of size N by p, where p is the number of error-prone covariates, and let Z be a matrix of size N by m, where m is the number of covariates measured without error. X∗ has the form X∗ = ωX + ϵ, where ω is an unknown scaling factor and ϵ ∼ N(0, σϵ²). The error-prone covariate, X∗, may be a scalar or a vector of error-prone covariates. We assume we additionally have access to a validation subset, consisting of a simple random sample of the full sample of size n, with n ≤ N, that contains information on X, the true covariate(s), as well as X∗, Z, T, and Y for each subject, denoted XV, X∗V, ZV, TV, and YV. Therefore, for subjects 1, …, n we have complete information, and for subjects (n + 1), …, N we are missing at least one of the true covariate values, X. Note that if n = N the true covariate is available for the full sample and no adjustment is necessary. The assumption that the validation sample is a simple random sample from the full sample, and therefore has the same conditional distribution of X | T, X∗, Z as the larger sample, is crucial for the adjustment methods discussed below. We assume that the true data generating model takes the general form

g(E[Y | T, X, Z]) = θ0 + θ1T + Xθ2 + Zθ3,    (1)

where θ2 is a p by 1 matrix and θ3 is an m by 1 matrix. We explore the performance of alternative measurement error methods for two link functions, g(x) = x and g(x) = log(x).
To account for confounding of the treatment effect by X and Z, we fit a regression model adjusted for the logit of the propensity score, referred to as the outcome model,
g(E[Y | T, e(Q)]) = β0 + β1T + β2 logit(e(Q)),    (2)
where e(Q) is the propensity score, defined as P(T = 1|Q), and Q = {X, Z}. The outcome model is presented in scalar form, as opposed to the data generating model, which is presented in matrix form, because the propensity score reduces the dimensionality of the covariates from p + m to one. When the propensity score fulfills positivity, strong ignorability, and the stable unit treatment value assumption, the propensity score is also a balancing score (Rosenbaum and Rubin 1983). The logit of the propensity score is used here because many of the adjustment methods rely on an assumed linear relationship between the error-prone and true covariate. This assumption is more likely to be approximately satisfied for the logit-transformed propensity score, which is unbounded, as compared to the propensity score itself, which is constrained to lie in the interval (0, 1).
The treatment effect, θ1, is the estimand of interest that we seek to estimate with β1. For subjects that do not have information on X, X∗ can be used in the construction of the propensity score instead, producing e∗(Q∗), where Q∗ = {X∗, Z}. In general, e∗(Q∗) ≠ e(Q).
3.2. True Propensity Score (TPS)
In the true propensity score model the propensity score is estimated from perfectly measured covariates for all subjects (under the assumption of no unmeasured confounders). This approach is included as a positive control so that it can be seen how the other methods perform in comparison to a method using the best possible information for all subjects. In practice, this will not be possible since the true covariate values are not available for all subjects in the observed data set. Let β̂1^TPS denote the estimate of β1 from the outcome model (equation 2).
3.3. Error-Prone Propensity Score (EPPS)
Similarly, the error-prone propensity score model is also not an adjustment method, but in contrast to TPS, EPPS serves as a negative control to which the performance of the other methods may be compared. Under this approach we fit the model
g(E[Y | T, e∗(Q∗)]) = β0∗ + β1∗T + β2∗ logit(e∗(Q∗)),    (3)

and let β̂1^EPPS = β̂1∗ from this model be our estimate of β1.
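To fix ideas, the following minimal sketch (our illustration, not the authors' code; all variable names and parameter values are hypothetical) contrasts the TPS benchmark with the EPPS negative control on simulated data. The later method sketches build on the objects defined here.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
N = 10_000
Z = rng.normal(size=N)                        # covariate measured without error
X = rng.normal(size=N)                        # true covariate
X_star = 0.8 * X + rng.normal(size=N)         # error-prone version of X
T = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * X - 0.3 * Z))))
Y = 1.0 + 2.0 * T + 1.5 * X + 0.5 * Z + rng.normal(size=N)

def logit_ps(covs, t):
    """Fit a logistic propensity score model; return the logit of the fitted PS."""
    design = sm.add_constant(np.column_stack(covs))
    return design @ sm.Logit(t, design).fit(disp=0).params

def outcome_fit(t, lps, y):
    """Linear outcome model adjusted for the logit of the propensity score."""
    return sm.OLS(y, sm.add_constant(np.column_stack([t, lps]))).fit()

lps_true = logit_ps([X, Z], T)       # TPS: true covariates (positive control)
lps_ep = logit_ps([X_star, Z], T)    # EPPS: error-prone covariate (negative control)
print(outcome_fit(T, lps_true, Y).params[1])  # near the true effect, 2.0
print(outcome_fit(T, lps_ep, Y).params[1])    # shows residual confounding bias
```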
3.4. Propensity Score Regression Calibration (RC)
Regression calibration assumes that an error-prone covariate available for the full sample is related to the true covariate and that this relationship can be described by a linear model (Rosner, D. Spiegelman, and Willett 1990; Carroll et al. 2006). This allows the true value to be predicted based on the error-prone value. We consider application of the regression calibration approach to a propensity score.
In the validation subset, an estimate of the logit of the propensity score based on the true covariate values is regressed on an estimate of the logit of the propensity score based on the error-prone covariates and the treatment assignment. Specifically, we assume

logit(e(QV)) = λ0 + λ1 logit(e∗(Q∗V)) + λ2TV + δ,    (4)

where δ is a mean-zero error term. Fitted values for the logit of the propensity score from this model are then included as a covariate in the outcome model, producing an estimate for the treatment effect, β̂1^RC. It can be shown that β̂1^RC is a consistent estimator of β1 under the assumed calibration model (Rosner, D. Spiegelman, and Willett 1990; Carroll et al. 2006). The variance of the regression calibration estimator follows from a delta-method argument; the specific derivation of the variance of β̂1^RC can be found in the supplemental materials, and the general derivation in matrix form can be found in Rosner et al. (Rosner, D. Spiegelman, and Willett 1990).
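Continuing the running sketch (illustrative only; for brevity the validation logit propensity score is taken from the full-sample fit, whereas in practice it would be estimated from the validation data alone), propensity score regression calibration might look like:

```python
n = 100
val = rng.choice(N, size=n, replace=False)   # simple random validation subset

# Calibration model (equation 4), fit in the validation subset: regress the
# logit of the true-covariate PS on the logit of the error-prone PS and T.
calib_design = sm.add_constant(np.column_stack([lps_ep[val], T[val]]))
calib = sm.OLS(lps_true[val], calib_design).fit()

# Predict the logit of the true PS for all N subjects and substitute the
# fitted values into the outcome model.
full_design = sm.add_constant(np.column_stack([lps_ep, T]))
lps_rc = full_design @ calib.params
fit_rc = outcome_fit(T, lps_rc, Y)
beta1_rc = fit_rc.params[1]
# Note: fit_rc.bse[1] ignores the calibration-model uncertainty; the paper's
# delta-method variance (see supplement) accounts for it.
```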
3.5. Efficient Regression Calibration (ERC)
Efficient regression calibration uses a weighted combination of two estimates of β1: the regression calibration estimate, β̂1^RC, and the estimate of the treatment effect in the validation subset only, β̂1^V, obtained from the following model (Donna Spiegelman, Carroll, and Kipnis 2001):

g(E[YV | TV, e(QV)]) = β0^V + β1^V TV + β2^V logit(e(QV)).    (5)

The efficient regression calibration estimate of β1 is denoted by β̂1^ERC, where

β̂1^ERC = ŵ β̂1^RC + (1 − ŵ) β̂1^V,  with  ŵ = Var(β̂1^V) / (Var(β̂1^RC) + Var(β̂1^V)).

Additionally, Var(β̂1^ERC) = (Var(β̂1^RC)⁻¹ + Var(β̂1^V)⁻¹)⁻¹, so each component estimate is weighted inversely to its variance. Spiegelman et al. show that these weights result in the most efficient estimator in the class of unbiased estimators (Donna Spiegelman, Carroll, and Kipnis 2001).
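A sketch of the inverse-variance weighting follows, continuing the earlier illustrative objects and using model-based variances as stand-ins for the formal variance expressions (in particular, it ignores that the validation subjects also appear in the full sample, which the formal derivation handles):

```python
# Validation-only outcome model (equation 5).
fit_val = outcome_fit(T[val], lps_true[val], Y[val])
beta1_val, var_val = fit_val.params[1], fit_val.bse[1] ** 2
var_rc = fit_rc.bse[1] ** 2    # stand-in for the delta-method variance of RC

w = (1 / var_rc) / (1 / var_rc + 1 / var_val)   # inverse-variance weight
beta1_erc = w * beta1_rc + (1 - w) * beta1_val
var_erc = 1 / (1 / var_rc + 1 / var_val)
```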
3.6. Monte Carlo Regression Calibration (MCRC)
Multiple imputation for measurement error, herein referred to as Monte Carlo regression calibration (MCRC), is similar to regression calibration but takes into account the uncertainty in the relationship between the true and observed covariate values (Cole, Chu, and Greenland 2006). Just as in regression calibration, the relationship between the logit of the propensity score estimated from the true covariate values and the logit of the propensity score estimated from the observed covariate values is obtained via a regression of the former on the latter and the treatment assignment in the validation subset (equation 4). However, instead of using fitted values from this model as covariates in the outcome model, the variance-covariance matrix of the regression coefficients is used to generate k draws (e.g., k = 10) from the sampling distribution of the coefficients (Cole, Chu, and Greenland 2006).
The predicted values for the logit of the propensity score estimated from the true covariate values are calculated for each draw from the distribution of the coefficients,

logit(e(Q))^(d) = λ̂0^(d) + λ̂1^(d) logit(e∗(Q∗)) + λ̂2^(d)T,  d = 1, …, k.

The outcome model (equation 6) is fit k times, once for each of the k data sets,

g(E[Y | T]) = β0 + β1T + β2 logit(e(Q))^(d).    (6)
The MCRC estimator is given by β̂1^MCRC = (1/k) Σd β̂1^(d), where β̂1^(d) is the estimate for β1 obtained in the dth sample; its variance combines the average within-sample variance with the between-sample variance of the β̂1^(d) using standard multiple imputation combining rules (Cole, Chu, and Greenland 2006). The MCRC approach produces an estimated treatment effect that has been calibrated to account for error in the error-prone covariate and that also takes into account the uncertainty introduced by estimating the coefficients in the regression calibration model.
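A minimal MCRC sketch, continuing the earlier illustrative objects (`calib`, `full_design`); the combining rules are the standard multiple imputation rules:

```python
k = 10
# Draw calibration coefficients from their estimated sampling distribution.
draws = rng.multivariate_normal(calib.params, calib.cov_params(), size=k)

ests, wvars = [], []
for coef in draws:
    fit_d = outcome_fit(T, full_design @ coef, Y)   # equation 6, draw d
    ests.append(fit_d.params[1])
    wvars.append(fit_d.bse[1] ** 2)

beta1_mcrc = np.mean(ests)
# Combining rules: within-imputation + inflated between-imputation variance.
var_mcrc = np.mean(wvars) + (1 + 1 / k) * np.var(ests, ddof=1)
```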
3.7. Multiple Imputation of Propensity Scores (MIPS)
Multiple imputation of propensity scores transforms the measurement error problem for the logit of the error-prone propensity score into a missing data problem. Under this approach, we impute the logit of the true propensity score using information on the relationship among the logit of the error-prone propensity score, the logit of the true propensity score, the treatment indicator, and the outcome, derived from the validation subset. In the numerical examples below, multiple imputation is performed by predictive mean matching (PMM) (R. J. A. Little 1988). PMM has advantages over other imputation methods, such as ensuring that all imputed values are plausible and performing well even when the structural form of the imputation model is misspecified (Buuren and Groothuis-Oudshoorn 2011).
The logit of the propensity score is imputed for each subject k times, and the imputed value in the dth imputed sample is denoted logit(e(Q))^(d), d = 1, …, k. The outcome model (equation 6) is fit k times, and the MIPS estimator and its variance are obtained by combining estimates across imputations, just as for the MCRC estimator.
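A hand-rolled, single-nearest-donor PMM step is sketched below to make the mechanics concrete; packaged implementations (e.g., the mice package in R) sample among several nearest donors and are what one would use in practice. This continues the earlier illustrative objects and is an assumption-laden simplification, not the authors' implementation.

```python
# Imputation model in the validation subset: logit of the true PS given the
# logit of the error-prone PS, treatment, and outcome.
imp_design_val = sm.add_constant(np.column_stack([lps_ep[val], T[val], Y[val]]))
imp = sm.OLS(lps_true[val], imp_design_val).fit()
imp_design_full = sm.add_constant(np.column_stack([lps_ep, T, Y]))

k, ests, wvars = 10, [], []
for d in range(k):
    # Perturb the imputation-model coefficients to propagate their uncertainty.
    coef = rng.multivariate_normal(imp.params, imp.cov_params())
    pred_full = imp_design_full @ coef
    pred_val = imp_design_val @ coef
    # PMM: each subject receives the observed validation value whose
    # predicted mean is nearest to their own (single nearest donor here).
    donors = np.abs(pred_full[:, None] - pred_val[None, :]).argmin(axis=1)
    lps_imp = lps_true[val][donors]
    lps_imp[val] = lps_true[val]        # observed values are retained
    fit_d = outcome_fit(T, lps_imp, Y)
    ests.append(fit_d.params[1])
    wvars.append(fit_d.bse[1] ** 2)

beta1_mips = np.mean(ests)
var_mips = np.mean(wvars) + (1 + 1 / k) * np.var(ests, ddof=1)
```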
3.8. Two-Stage Calibration (TSC)
Two-stage calibration uses two models already introduced: the relationship between the logit of the propensity score based on the error-prone covariates and the outcome in the entire sample (equation 3), and the relationship between the logit of the propensity score based on the true covariate values and the outcome in the validation subset only (equation 5). In addition, it uses the relationship between the logit of the propensity score based on the error-prone covariates and the outcome in the validation subset only,

g(E[YV | TV, e∗(Q∗V)]) = γ0 + γ1TV + γ2 logit(e∗(Q∗V)).    (7)

The two-stage calibration estimator utilizes the difference between the treatment effect in the validation subset when the error-prone covariate is used, γ̂1, and when the true covariate is used, β̂1^V, as a measure of how different the treatment effect is likely to be between these two versions of the covariate in the main sample.

The two-stage calibration estimator is given by β̂1^TSC = β̂1^EPPS + (β̂1^V − γ̂1), and its variance is derived from the joint distribution of the component estimates (H.-W. Lin and Chen 2014).
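A sketch of the simple difference form described above, continuing the earlier illustrative objects; Lin and Chen (2014) additionally derive a weighted version and the associated variance, which this sketch omits.

```python
fit_ep_full = outcome_fit(T, lps_ep, Y)                      # equation 3
fit_ep_val = outcome_fit(T[val], lps_ep[val], Y[val])        # equation 7
fit_true_val = outcome_fit(T[val], lps_true[val], Y[val])    # equation 5

# Correct the full-sample error-prone estimate by the validation contrast.
beta1_tsc = fit_ep_full.params[1] + (fit_true_val.params[1]
                                     - fit_ep_val.params[1])
```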
3.9. Directly adjusting for covariate measurement error
Adjustment of the regression analysis to reduce bias due to the error-prone covariate can be carried out at two different stages of the analysis. An intuitive option is to directly adjust the error-prone covariate using the above approaches and then estimate the propensity score based on the adjusted covariate. This method is appealing because the error-prone covariate is corrected to more closely resemble the true covariate, and the analysis is then carried out using a standard propensity score adjustment. However, this method does not provide any benefit relative to using the error-prone covariate alone. This can be seen by relating the error-prone covariate, X∗, to the true covariate value, X, and the other observed covariates, Z, in the validation subset,

X = α0 + α1X∗ + α2Z + ν,

where ν is a mean-zero error term, yielding the calibrated covariate X̂ = α̂0 + α̂1X∗ + α̂2Z.

Following the regression calibration approach, X̂ can be substituted into the logistic regression for treatment assignment as follows:

logit(P(T = 1)) = κ0 + κ1X̂ + κ2Z = (κ0 + κ1α̂0) + κ1α̂1X∗ + (κ2 + κ1α̂2)Z.

If we had simply performed a logistic regression for the treatment regressed on the error-prone covariate and other covariates, we would have

logit(P(T = 1)) = τ0 + τ1X∗ + τ2Z,

implying that τ0 = κ0 + κ1α̂0, τ1 = κ1α̂1, and τ2 = κ2 + κ1α̂2. The two linear predictors span the same space, so the fitted propensity scores coincide and no new information for the treatment assignment model is gained by using the error-prone covariate after adjusting it to more closely resemble the true covariate value.
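This equivalence can be checked numerically. The following sketch (continuing the earlier illustrative simulation; all names are ours) calibrates the covariate, refits the treatment model, and confirms that the fitted propensity scores match the error-prone fit:

```python
# Calibrate the covariate in the validation subset, predict for all subjects.
cov_calib = sm.OLS(
    X[val], sm.add_constant(np.column_stack([X_star[val], Z[val]]))
).fit()
X_hat = sm.add_constant(np.column_stack([X_star, Z])) @ cov_calib.params

# PS model with the calibrated covariate vs. with the error-prone covariate.
ps_adj = sm.Logit(T, sm.add_constant(np.column_stack([X_hat, Z]))).fit(disp=0)
ps_ep = sm.Logit(T, sm.add_constant(np.column_stack([X_star, Z]))).fit(disp=0)
print(np.allclose(ps_adj.predict(), ps_ep.predict(), atol=1e-6))  # True
```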
The alternative is to estimate the propensity score based on the error-prone covariate and then use an adjustment method to correct the imperfect propensity score itself. This is the basis of our work presented above.
4. Simulation Study
We conducted a simulation study to investigate the relative bias and efficiency of the alternative measurement error methods described in the previous section under a range of scenarios resembling those likely to be encountered in real-world studies based on EHR data. Data were simulated for four covariates (Z) in addition to the true covariate (X) and the error-prone covariate (X∗) (Table 1).
Table 1:
Data generation scheme and parameter values used for simulation study.
| Variable | Distribution | Analogous EHR Variable |
|---|---|---|
| Z1 | Uniform(30, 60) | Age |
| Z2 | Bernoulli(0.8) | Gender |
| Z3 | Bernoulli(0.6) | Race |
| Z4 | Bernoulli(0.3) | Medications |
| X | X = 8 + 0.025Z1 + ϵ; ϵ ∼ N(0, 1) | 1-year Comorbidity Score |
| X∗ | X∗ = ωX + ϵ; (ω, σϵ) as in Table 2 | 1-month Comorbidity Score |
| T | logit(P(T = 1)) = λi − 0.01Z1 + 0.2Z2 − 0.05Z3 − 0.25Z4 + ηiX; i = 1, 2, 3; λ = {−0.21, −4, −6.6}; η = {ln(1.1), ln(1.5), ln(2)} | Immunotherapy |
| Y1 | log(P(Y1 = 1)) = −9.9 + 0.005Z1 − 0.025Z2 − 0.05Z3 + 0.075Z4 + ln(2.2)X + 0.22T | 6-month Mortality |
| Y2 | Y2 = 100 + Z1 − 2Z2 − 3Z3 + Z4 + 4X + 10T | Albumin Lab Value |
Two different outcome models were investigated in the simulation study: a log-linear model (binary outcome) and a linear model (continuous outcome). We simulated error under six different scenarios (Table 2). Additionally, we investigated four combinations of full sample and validation sample size (n = 50, N = 1,000; n = 100, N = 1,000; n = 50, N = 10,000; and n = 100, N = 10,000). Three different strengths of confounding between the true covariate value and the treatment assignment were explored. Ten imputations were used in the MCRC and MIPS methods. The relationships among the simulated variables, the complete set of distributions and parameter values used to simulate data, and examples of EHR-derived covariates used to motivate the simulation study are provided in Tables 1 and 2.
Table 2:
Error structures for constructing the error-prone covariate from the true covariate for the simulation study. Error structures A-C are non-differential, while error structures D-F are differential with respect to treatment assignment.
| Error Structure | Treated Subjects: ω | Treated Subjects: σϵ | Control Subjects: ω | Control Subjects: σϵ |
|---|---|---|---|---|
| A | 1 | 1 | 1 | 1 |
| B | 0.8 | 0 | 0.8 | 0 |
| C | 0.8 | 1 | 0.8 | 1 |
| D | 0.8 | 0 | 0.8 | 1 |
| E | 0.8 | 1 | 0.9 | 1 |
| F | 0.8 | 0 | 0.9 | 1 |
Each simulation scenario was repeated 10,000 times. We report bias, 95% confidence interval coverage probabilities, empirical standard errors, and model-based standard errors for each method and scenario.
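A minimal sketch (not the authors' code) of one replicate of this design, following Tables 1 and 2 for moderate confounding (λ = −4, η = ln(1.5)) and error structure C; the noise term on Y2 and the clipping guard on P(Y1 = 1) are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(2023)
N = 10_000
Z1 = rng.uniform(30, 60, N)
Z2 = rng.binomial(1, 0.8, N)
Z3 = rng.binomial(1, 0.6, N)
Z4 = rng.binomial(1, 0.3, N)
X = 8 + 0.025 * Z1 + rng.normal(size=N)          # true comorbidity score

lam, eta = -4.0, np.log(1.5)                     # moderate confounding
lin = lam - 0.01 * Z1 + 0.2 * Z2 - 0.05 * Z3 - 0.25 * Z4 + eta * X
T = rng.binomial(1, 1 / (1 + np.exp(-lin)))      # treatment assignment

omega, sigma_eps = 0.8, 1.0                      # error structure C
X_star = omega * X + rng.normal(scale=sigma_eps, size=N)

# Binary outcome (log link); probabilities clipped at 1 as a guard.
p1 = np.minimum(np.exp(-9.9 + 0.005 * Z1 - 0.025 * Z2 - 0.05 * Z3
                       + 0.075 * Z4 + np.log(2.2) * X + 0.22 * T), 1.0)
Y1 = rng.binomial(1, p1)
# Continuous outcome; unit-variance noise added by assumption.
Y2 = 100 + Z1 - 2 * Z2 - 3 * Z3 + Z4 + 4 * X + 10 * T + rng.normal(size=N)

val = rng.choice(N, size=100, replace=False)     # validation subset
```

Repeating this generation 10,000 times, applying each estimator, and summarizing bias, coverage, and empirical versus model-based standard errors reproduces the structure of the reported results.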
4.1. Simulation Results
The five methods (RC, ERC, MCRC, MIPS, and TSC), along with EPPS as the negative control, were applied under 144 scenarios arising from the factorial combination of the two link functions, two main sample sizes, two validation subset sizes, three values for the magnitude of the regression parameter relating the true covariate value to treatment assignment, and six error structures. Results for the other three main sample size and validation subset combinations were nearly identical to the results for N = 10,000, n = 100 and thus are not presented.
4.1.1. Continuous Outcome
In the scenario with a continuous outcome, a validation subsample of 100, and a full sample of size 10,000, the overall pattern of relative performance of the methods was consistent across values for the strength of confounding (Figure 1).
Figure 1:
Bias for the continuous outcome (true value = 10) with validation size 100 and full sample size 10,000 over three strengths of confounding and six error structures. Intervals displayed are from the 2.5th percentile to the 97.5th percentile of the replications with less than 500% bias, out of 10,000 total replications.
Under error structure A (ω = 1, σϵ = 1) all five methods had low bias, with MIPS and TSC having the smallest bias across confounding levels. Notably, TSC was less efficient than the other methods (Table 7). Additionally, the regression calibration-based methods (RC, ERC, MCRC) achieved near-nominal coverage probabilities (~90%–95%), while the coverage probabilities for MIPS and TSC were above nominal for weak confounding, approximately nominal for moderate confounding, and below nominal for strong confounding (Table 3).
Table 3:
95% confidence interval coverage probabilities of valid simulations with continuous outcome, validation size 100, and full sample size 10,000
| Confounding | Error Structure | EPPS | RC | ERC | MCRC | MIPS | TSC |
|---|---|---|---|---|---|---|---|
| Weak Confounding | A | 97.15% | 92.05% | 91.12% | 90.87% | 99.36% | 99.35% |
| | B | 100% | 99.98% | 99.96% | 99.95% | 97.52% | 99.96% |
| | C | 93.23% | 88.89% | 87.54% | 87.35% | 99.34% | 99.12% |
| | D | 28.9% | 81.98% | 80.89% | 79.64% | 98.38% | 97.47% |
| | E | 0% | 98.38% | 94.39% | 97.1% | 97.22% | 84.36% |
| | F | 0% | 93.83% | 91.84% | 91.94% | 97.66% | 81.38% |
| Moderate Confounding | A | 0.42% | 90.63% | 90.91% | 89.45% | 92.17% | 94.79% |
| | B | 99.91% | 99.57% | 99.53% | 99.44% | 97.9% | 97.62% |
| | C | 0.02% | 82.7% | 83% | 81.71% | 91.94% | 94.91% |
| | D | 0% | 78.79% | 79.12% | 78.66% | 94.04% | 93.28% |
| | E | 0% | 93.05% | 92.66% | 90.06% | 88.55% | 85.29% |
| | F | 0% | 93.23% | 93.19% | 90.41% | 91.67% | 78.73% |
| Strong Confounding | A | 0% | 94.71% | 94.35% | 92.13% | 89.2% | 89.96% |
| | B | 99.82% | 99.65% | 99.58% | 99.38% | 97.48% | 94.94% |
| | C | 0% | 95.16% | 94.93% | 92.12% | 88.2% | 89.8% |
| | D | 0% | 95.75% | 95.8% | 93.22% | 92.68% | 86.84% |
| | E | 0% | 27.21% | 27.86% | 30.98% | 86.12% | 81.02% |
| | F | 0% | 38.79% | 39.43% | 41.28% | 89.66% | 77.13% |
The performance of the five methods was comparable to one another and to EPPS when the error-prone covariate was a percentage of the true covariate (error structure B), though here MIPS was less efficient than the other methods (Table 7). In the case of error structure B, even EPPS had minimal bias and greater than 95% coverage, matching the larger than nominal coverage seen for all the other methods across all three levels of confounding (Table 3).
When the error-prone covariate was a percentage of the true covariate plus some random error (error structure C), the results were nearly identical to those of error structure A (Figure 1).
Under error structure D (treatment-dependent difference in ϵ), bias was similar to that seen under error structures A and C, though the regression calibration-based methods (RC, ERC, MCRC) did have residual bias in the weak confounding case and, as a result, had confidence intervals with less than nominal coverage in the weak and moderate confounding cases (Figure 1, Table 3).
The two remaining differential error structures, E (treatment-dependent difference in ω) and F (treatment-dependent difference in ϵ and ω), displayed more notable differences in bias across methods, while their results were remarkably similar to one another. Under both of these error structures EPPS had high bias regardless of the strength of confounding (Figure 1). The three regression calibration-based methods over-corrected the estimate given by EPPS, though their absolute bias was less than that of EPPS. Additionally, the empirical standard errors for the regression calibration-based methods were very large in the weak confounding case (Figure 1). Both TSC and MIPS performed well across all levels of confounding, having both better point estimates and smaller intervals than the other adjustment methods (Figure 1). However, while the regression calibration-based methods and MIPS were able to achieve near nominal coverage in the weak and moderate confounding cases, TSC had low coverage (Table 3). Interestingly, in the strong confounding case the coverage of the regression calibration-based methods was poor and the best coverage was achieved by MIPS and TSC, although for MIPS this was not due to an inflated variance estimate (Table 7).
EPPS achieved nominal coverage only under error structure B for all three levels of confounding, and additionally under error structures A and C for weak confounding (Table 3).
Additional simulations were conducted where both the true data-generating model and the propensity score model contained an interaction between X and Z. The results of these simulations were largely similar to the results presented above, with the following differences. Nearly all the methods had poorer coverage than in the results presented above, and all methods except TSC had larger model-based and empirical standard errors (Supplemental Figure 4, Supplemental Tables 9 and 11).
4.1.2. Binary Outcome
In the scenario with a binary outcome, a validation sample of 100, and a full sample of size 10,000, the overall pattern of relative performance of the methods was consistent across values for the strength of confounding (Figure 2).
Figure 2:
Bias for the binary outcome (true value = 0.22) with validation size 100 and full sample size 10,000 over three strengths of confounding and six error structures. Intervals displayed are from the 2.5th percentile to the 97.5th percentile of the replications with less than 500% bias, out of 10,000 total replications.
Under error structure A (ω = 1, σϵ = 1) all five methods had low bias, with MIPS and TSC having the smallest bias under weak confounding, and TSC along with the regression calibration-based methods having the smallest bias under strong confounding (Figure 2). Confidence interval coverage was approximately nominal for all adjustment methods under weak confounding. Coverage was approximately nominal for the regression calibration-based methods under moderate and strong confounding, while MIPS and TSC had slightly lower coverage (Table 4). Additionally, TSC and MIPS were less efficient than the other methods (Table 8).
Table 4:
95% confidence interval coverage probabilities of valid simulations with binary outcome, validation size 100, and full sample size 10,000
| Confounding | Error Structure | EPPS | RC | ERC | MCRC | MIPS | TSC |
|---|---|---|---|---|---|---|---|
| Weak Confounding | A | 92.46% | 93.49% | 93.56% | 93.41% | 96.83% | 94.79% |
| | B | 97.51% | 97.39% | 97.14% | 97.39% | 94.87% | 95.36% |
| | C | 89.31% | 90.88% | 90.82% | 90.82% | 96.77% | 94.94% |
| | D | 60.26% | 64.51% | 65.18% | 65.66% | 96.02% | 94.84% |
| | E | 0% | 99.53% | 98.42% | 98.16% | 97.76% | 80.26% |
| | F | 0% | 98.68% | 97.95% | 97.64% | 97.49% | 74.35% |
| Moderate Confounding | A | 15.12% | 94.48% | 94.52% | 92.97% | 90.38% | 91.61% |
| | B | 97.59% | 97.41% | 97.28% | 97.12% | 95.64% | 94.3% |
| | C | 4.26% | 92.75% | 92.62% | 91.98% | 89.11% | 91.22% |
| | D | 0.41% | 91.71% | 91.84% | 90.07% | 91.32% | 88.34% |
| | E | 0% | 96.67% | 96.23% | 93.58% | 87.55% | 83.66% |
| | F | 0% | 96.95% | 96.82% | 94.47% | 88.83% | 71.99% |
| Strong Confounding | A | 0.05% | 95.29% | 95.1% | 93.51% | 87.06% | 90.92% |
| | B | 97.2% | 97.99% | 97.92% | 97.8% | 95.67% | 94.08% |
| | C | 0% | 94.58% | 94.6% | 92.88% | 86.54% | 90.16% |
| | D | 0% | 95.43% | 95.55% | 93.86% | 91.1% | 86.93% |
| | E | 0% | 97.95% | 97.38% | 95.72% | 82.15% | 86.06% |
| | F | 0% | 97.26% | 96.85% | 94.64% | 87.78% | 76.37% |
As seen in the continuous outcome case, the performance of the five methods was extremely similar under error structure B, and MIPS again had larger confidence intervals than the other methods (Table 8). In the case of error structure B, even EPPS had minimal bias and over 95% coverage, matching the larger than nominal coverage seen for all the other methods across all three levels of confounding (Table 4).
Also similar to the continuous outcome scenario, the binary outcome scenario had nearly identical results for error structures C and A (Figure 2).
Under error structure D (treatment-dependent difference in ϵ), bias was similar to that seen under error structures A and C, though the regression calibration-based methods (RC, ERC, MCRC) did have residual bias in the weak confounding case and, as a result, had confidence intervals with less than nominal coverage in the weak and moderate confounding cases (Figure 2, Table 4).
The two remaining differential error structures, E (treatment-dependent difference in ω) and F (treatment-dependent difference in ϵ and ω), displayed more notable differences in bias across methods, and again their results were remarkably similar to one another. Under both of these error structures, EPPS had high bias regardless of the strength of confounding (Figure 2). The three regression calibration-based methods over-corrected the estimate given by EPPS, though their absolute bias was less than that of EPPS. Additionally, the empirical standard errors for the regression calibration-based methods were very large in the weak confounding case (Figure 2). Both TSC and MIPS performed well across all levels of confounding, having both better point estimates and smaller intervals than the other adjustment methods (Figure 2). However, while the regression calibration-based methods and MIPS were able to achieve near nominal coverage in the weak and moderate confounding cases, TSC had low coverage (Table 4).
Similar to the continuous outcome scenarios, EPPS achieved nominal coverage only under error structure B for all three levels of confounding, and additionally under error structures A and C for weak confounding (Table 4).
Additional simulations were conducted where both the true data-generating model and the propensity score model contained an interaction between X and Z. The results of these simulations had some important differences from the simulation results presented above. The regression calibration-based methods (RC, ERC, MCRC) had considerably more bias across all confounding strengths and error structures (Supplemental Figure 5). Additionally, both MIPS and TSC had much larger variability in the amount of bias present than the regression calibration-based methods. Furthermore, the regression calibration-based methods had much poorer coverage than in the results presented above, while MIPS and TSC experienced only a small decrease in coverage probabilities (Supplemental Table 10). Finally, although the model-based and empirical standard errors were similar for the regression calibration-based methods, MIPS and TSC had larger standard errors relative to the main results (Supplemental Table 12).
5. Immunotherapy vs Chemotherapy for Metastatic Urothelial Cancer
The objective of this analysis was to illustrate the performance of alternative statistical methods for measurement error applied to a propensity score estimated using an error-prone covariate in the specific context of a real-world analysis of EHR-derived data. We analyzed data from the Flatiron Health database, a nationwide EHR-derived de-identified database consisting of data from over 265 cancer clinics, including over 2 million active U.S. cancer patients available for analysis (Berger et al. 2016; Abernethy et al. 2017; Miksad and Abernethy 2018; Curtis et al. 2018; Presley et al. 2018). We compared the effectiveness of carboplatin-based chemotherapy with that of immunotherapy with checkpoint inhibitors, using the binary outcome of 6-month mortality, in a sample of patients with metastatic bladder carcinoma (mUC), with propensity score adjustment to account for systematic differences between chemotherapy and immunotherapy initiators.
In the context of the comparative effectiveness of immunotherapy and chemotherapy for treatment of mUC, the overall health of the patient, as assessed via a comorbidity score, is a key confounder to include in the propensity score. Lin et al. showed that comorbidity distributions differ between patients with low versus high proportions of their visit information captured in the EHR (Joshua Lin et al. 2017; K. J. Lin et al. 2018). It is thus likely that comorbidity scores for some patients will not reflect the true degree of comorbid disease burden and that this error could be differential according to the intensity of the patient’s interaction with the healthcare system. We constructed the Elixhauser comorbidity score based on one year of EHR data prior to the start of treatment, which we considered the “true” comorbidity score (Elixhauser et al. 1998). Two error-prone versions of this comorbidity score were artificially constructed by the addition of error, to illustrate the case where patients vary in the amount of comorbidity information captured in their EHR: one non-differential and one differential. In the non-differential case, we introduced error with ω = 0.25 and σϵ = 4.5; in the differential case, we introduced error with ω = 0.9 and σϵ = 0.5 for patients on immunotherapy and ω = 0.3 and σϵ = 4.5 for patients on chemotherapy. We assumed that a validation sample of 100 patients with information on the true propensity score was available. All other covariates included in the propensity score (age, gender, race/ethnicity, calendar year, primary site of cancer, smoking status, opioid prescriptions, and steroid prescriptions) were assumed to be measured without error (Figure 3). The propensity score was estimated by regressing treatment (immunotherapy or chemotherapy) on all of the aforementioned variables using logistic regression.
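A hypothetical sketch of this error-injection step is below; the data frame `df`, its columns ("elix", "immuno"), and all generated values are illustrative stand-ins, not the Flatiron Health schema.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "elix": rng.poisson(2, 2187).astype(float),   # stand-in "true" score
    "immuno": rng.binomial(1, 587 / 2187, 2187),  # stand-in treatment flag
})

def add_error(score, trt, omega_trt, sd_trt, omega_ctl, sd_ctl):
    """Apply treatment-specific scaling and noise to a comorbidity score."""
    omega = np.where(trt == 1, omega_trt, omega_ctl)
    sd = np.where(trt == 1, sd_trt, sd_ctl)
    return omega * score + rng.normal(scale=sd)

# Non-differential: same error structure in both treatment arms.
df["elix_nondiff"] = add_error(df["elix"], df["immuno"], 0.25, 4.5, 0.25, 4.5)
# Differential: error structure depends on the treatment arm.
df["elix_diff"] = add_error(df["elix"], df["immuno"], 0.9, 0.5, 0.3, 4.5)
```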
Figure 3:
DAG describing the relationships among variables in the Flatiron Health mUC study; blue indicates variables observed only in the validation sample and black indicates variables observed for all patients. ELIX is the weighted Elixhauser comorbidity score.
The study sample included patients with mUC who began first-line therapy with carboplatin-based chemotherapy or immunotherapy between January 1, 2011 and July 31, 2018. Patients on therapies that combined immunotherapy and chemotherapy, contained clinical trial drugs, included cisplatin, or included non-chemotherapy-based drugs were excluded from the sample. Patients starting treatment after July 31, 2018 were excluded so that all patients had at least six months of follow-up for the outcome of six-month mortality. Information about opioid and steroid prescriptions was collected for the two-month period prior to treatment start. This resulted in a cohort of 2187 patients, 1600 of whom were treated with carboplatin-based chemotherapy and 587 of whom were treated with immunotherapy (Table 5).
Table 5:
Characteristics of patients in the Flatiron Health mUC cohort by treatment type.
| Characteristic | | Carboplatin-Based Chemotherapy (N = 1600) | Immunotherapy (N = 587) |
|---|---|---|---|
| Age, mean (SD) | | 72.56 (8.48) | 74.74 (8.75) |
| Gender, N (% Male) | | 1171 (73.2) | 427 (72.7) |
| Race/Ethnicity, N (%) | White | 1213 (75.8) | 437 (74.4) |
| | Asian | 23 (1.4) | 4 (0.7) |
| | Black or African American | 75 (4.7) | 21 (3.6) |
| | Hispanic or Latino | 71 (4.4) | 23 (3.9) |
| | Other Race | 99 (6.2) | 35 (6.0) |
| | Unknown | 119 (7.4) | 67 (11.4) |
| Baseline Year, N (%) | 2011 | 110 (6.9) | 0 (0.0) |
| | 2012 | 165 (10.3) | 0 (0.0) |
| | 2013 | 224 (14.0) | 0 (0.0) |
| | 2014 | 238 (14.9) | 0 (0.0) |
| | 2015 | 283 (17.7) | 8 (1.4) |
| | 2016 | 308 (19.2) | 101 (17.2) |
| | 2017 | 195 (12.2) | 283 (48.2) |
| | 2018 | 77 (4.8) | 195 (33.2) |
| Primary Site, N (%) | Bladder | 1185 (74.1) | 444 (75.6) |
| | Renal Pelvis | 263 (16.4) | 75 (12.8) |
| | Ureter | 140 (8.8) | 65 (11.1) |
| | Urethra | 12 (0.8) | 3 (0.5) |
| Smoking Status, N (%) | History of smoking | 1145 (71.6) | 434 (73.9) |
| | No history of smoking | 419 (26.2) | 147 (25.0) |
| | Unknown/not documented | 36 (2.2) | 6 (1.0) |
| 1-year comorbidity score, mean (SD) | | 0.90 (2.67) | 1.86 (4.03) |
5.1. Immunotherapy vs Chemotherapy for Metastatic Urothelial Cancer Results
The relative risk of death in the first six months after the start of treatment for patients on immunotherapy as compared to patients on carboplatin-based chemotherapy was 1.42 (95% CI: 1.20–1.68) when the true propensity score was used for all patients to account for confounding (Table 6). Under non-differential error, several point estimates from the adjustment methods were very similar to one another; however, ERC was able to improve upon EPPS in this scenario. Both MIPS and TSC over-corrected as compared to EPPS, providing point estimates smaller than that of TPS, with larger absolute bias than EPPS (Table 6). Under differential error, ERC and TSC had point estimates closest to the true propensity score method, both with estimates of 1.43, compared to the truth of 1.42. However, TSC again had a tighter confidence interval around this estimate than TPS, which indicates an artificially high level of precision. MIPS had an estimate with slightly larger bias than ERC and a wider confidence interval than that obtained with ERC, demonstrating the loss of efficiency when the MIPS estimator is used. Further, the other two regression calibration-based estimators (RC and MCRC) provided estimates that were less biased than EPPS (Table 6).
Table 6:
Relative risk of 6-month mortality and 95% confidence intervals (CI) for patients on immunotherapy compared to carboplatin-based chemotherapy under differential error (left) and non-differential error (right) in a confounder. LB = lower bound, UB = upper bound
| Method | Differential Error: Relative Risk | Differential Error: 95% CI | Non-Differential Error: Relative Risk | Non-Differential Error: 95% CI |
|---|---|---|---|---|
| TPS | 1.42 | 1.20–1.68 | 1.42 | 1.20–1.68 |
| EPPS | 1.47 | 1.24–1.73 | 1.45 | 1.23–1.71 |
| RC | 1.46 | 1.24–1.73 | 1.45 | 1.22–1.71 |
| ERC | 1.43 | 1.21–1.68 | 1.43 | 1.21–1.69 |
| MCRC | 1.46 | 1.24–1.73 | 1.45 | 1.22–1.71 |
| MIPS | 1.39 | 1.11–1.73 | 1.30 | 1.07–1.57 |
| TSC | 1.43 | 1.24–1.66 | 1.37 | 1.19–1.58 |

(TPS does not depend on the error structure, so the same estimate applies under both error scenarios.)
6. Discussion
Error in confounder variables presents a particularly serious challenge to the validity of EHR-based research. Prior research has shown that the use of error-prone covariates results in biased treatment effect estimates due to residual confounding (Rosner, D. Spiegelman, and Willett 1990; Carroll et al. 2006; Guo, R. A. Little, and McConnell 2012). Our simulation studies explored the effect of study sample size, validation subset size, degree of confounding, and error structure for both continuous and binary outcomes on five different methods for correcting measurement error.
The method that performed best for continuous outcomes was MIPS, due to its very small bias across all error scenarios, its smaller standard errors, and its nominal or near-nominal coverage probabilities. Two-stage calibration often had bias similar to MIPS but was consistently less efficient, making it a less reliable adjustment method. The regression calibration-based methods (RC, ERC, and MCRC) had consistently large bias with relatively small standard errors, reflecting undesirably high certainty about a biased result, as evidenced by low coverage probabilities; in two of the differential error scenarios their bias was considerable and their coverage probabilities extremely low, making them less suitable for continuous outcomes.
The recommendation of MIPS holds when both the data-generating model and the propensity score model contain an interaction between X and Z.
On the other hand, the methods that performed best for binary outcomes were the regression calibration-based methods (RC, ERC, and MCRC). These methods had very small biases across the different error structures and achieved nearly nominal coverage while having reasonably sized standard errors. Both MIPS and TSC also performed well in the binary outcome scenario, having slightly smaller bias than the regression calibration-based methods, though this advantage was offset by the combination of poorer coverage and less efficient estimation for both MIPS and TSC.
In the scenario with an interaction between X and Z in both the data-generating model and the propensity score model, the greatly reduced bias and higher coverage probabilities of MIPS and TSC relative to the regression calibration-based methods outweigh their larger standard errors. The performance of MIPS and TSC was quite comparable across the metrics mentioned, so either method is recommended when an interaction term between an error-prone and a correctly measured covariate is present.
Stürmer et al. previously evaluated the performance of regression calibration applied to an error-prone propensity score. Their evaluation of “propensity score calibration” differed from that considered here in several ways. First, they focused on error in a propensity score arising from omitted variables rather than error-prone covariates (Sturmer, Schneeweiss, Avorn, et al. 2005; Sturmer, Schneeweiss, Rothman, et al. 2007). Our work examines the scenario where the same set of covariates is available for the full and validation samples, but the values are known to be error-prone for most subjects, and only a subset has data on both the error-prone and true covariates. Additionally, Stürmer et al. performed regression calibration on the untransformed propensity score (Sturmer, Schneeweiss, Avorn, et al. 2005; Sturmer, Schneeweiss, Rothman, et al. 2007). We found that the relationship between two propensity scores is rarely linear, instead often following an “s”-shaped curve due to the bounded range of the propensity score. Therefore, we carried out measurement error correction on the propensity score transformed to the log-odds scale. Despite these differences, our findings support those of Stürmer et al. that regression calibration on a propensity score can be a useful bias-correction method (Sturmer, Schneeweiss, Avorn, et al. 2005; Sturmer, Schneeweiss, Rothman, et al. 2007).
Additionally, Lin and Chen previously evaluated the performance of their two-stage calibration method in the context of propensity scores (H.-W. Lin and Chen 2014). Their evaluation of TSC was similar to our exploration, with the distinction that they used either the untransformed propensity score or a spline of the propensity score in the model, as opposed to the logit of the propensity score (H.-W. Lin and Chen 2014). Lin and Chen compared TSC only to RC and found that point estimates were less biased using TSC than RC, which supports our conclusions (H.-W. Lin and Chen 2014). However, the model-based standard errors that Lin and Chen obtained for each simulation scenario were very similar between TSC and RC, while in the scenarios we investigated TSC was substantially more efficient than RC (H.-W. Lin and Chen 2014).
Hong, et al. explored the relative performance of calibration methods when more than one covariate is error-prone and how the performance varies when the error-prone covariates are either highly correlated or have a low correlation (Hong et al. 2019). This has not yet been explored when the covariates are used to estimate a propensity score. This represents an interesting direction for future work as measurement error correction for a single adjustment variable, such as the propensity score, may be easier to implement than adjustment for multiple and potentially correlated covariates.
A limitation of the work presented here is that all compared methods rely on a validation sample that is a simple random sample of the full study population. Throughout our simulation study, a simple random sample was taken in order to form our validation subset. For low prevalence exposure variables, this is an unrealistic sampling strategy because few or no exposed individuals might be encountered in a small simple random sample. In the EHR literature, validation sampling is often carried out conditional on the surrogate measure itself. Additional research is needed to explore measurement error correction methods in the context of a validation sample that is drawn conditional on the value of the error-prone covariate.
In conclusion, bias induced by measurement error is not mitigated by the use of propensity scores. Additionally, while many of the methods investigated perform well in the case of non-differential error, the true error model is unlikely to be known and therefore methods that can accommodate differential error should be used in practice. Using appropriate analytic methods to address error in EHR-derived covariates is key to obtaining valid results from these data sources.
Supplementary Material
7. Acknowledgements
The authors would like to thank Flatiron Health for providing us with the data for patients with metastatic bladder cancer.
8. Funding
Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under award numbers R21CA227613 and K23CA187185. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Footnotes
Declaration of conflicting interests
Dr. Mamtani reports having served as a consultant for Seattle Genetics/Astellas.
The author(s) declared no other potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
References
- Abernethy Amy P et al. (2017). “Use of electronic health record data for quality reporting”. In: Journal of Oncology Practice 13.8, pp. 530–534.
- Berger Marc L et al. (2016). “Opportunities and challenges in leveraging electronic health record data in oncology”. In: Future Oncology 12.10, pp. 1261–1274.
- Buuren Stef van and Groothuis-Oudshoorn Catharina (2011). “MICE: Multivariate Imputation by Chained Equations in R”. In: Journal of Statistical Software 45. doi: 10.18637/jss.v045.i03.
- Carroll Raymond J. et al. (2006). Measurement Error in Nonlinear Models: A Modern Perspective. Chapman & Hall/CRC.
- Cole Stephen R, Chu Haitao, and Greenland Sander (2006). “Multiple-Imputation for Measurement Error Correction”. In: International Journal of Epidemiology 35, pp. 1074–1081.
- Curtis Melissa D et al. (2018). “Development and validation of a high-quality composite real-world mortality endpoint”. In: Health Services Research 53.6, pp. 4460–4476.
- Elixhauser Anne et al. (1998). “Comorbidity Measures for Use with Administrative Data”. In: Medical Care 36.1, pp. 8–27.
- Freedman Laurence S. et al. (2008). “A Comparison of Regression Calibration, Moment Reconstruction and Imputation for Adjusting for Covariate Measurement Error in Regression”. In: Statistics in Medicine 27, pp. 5195–5216.
- Guo Ying, Little Roderick A, and McConnell Daniel S (2012). “On Using Summary Statistics From an External Calibration Sample to Correct for Covariate Measurement Error”. In: Epidemiology 23.1, pp. 165–174.
- Hersh William R. et al. (2013). “Caveats for the Use of Operational Electronic Health Record Data in Comparative Effectiveness Research”. In: Medical Care 51.8, S30–S37. doi: 10.1097/MLR.0b013e31829b1dbd.
- Hong Hwanhee et al. (2019). “Propensity Score-Based Estimators with Multiple Error-Prone Covariates”. In: American Journal of Epidemiology.
- Lin Kueiyu Joshua et al. (2017). “Identifying Patients With High Data Completeness to Improve Validity of Comparative Effectiveness Research in Electronic Health Records Data”. In: Clinical Pharmacology & Therapeutics 103. doi: 10.1002/cpt.861.
- Lin Hui-Wen and Chen Yi-Hau (2014). “Adjustment for Missing Confounders in Studies Based on Observational Databases: 2-Stage Calibration Combining Propensity Scores From Primary and Validation Data”. In: American Journal of Epidemiology 180. doi: 10.1093/aje/kwu130.
- Lin Kueiyu Joshua et al. (2018). “Out-of-system care and recording of patient characteristics critical for comparative effectiveness research”. In: Epidemiology 29, pp. 356–363.
- Little Roderick J. A. (1988). “Missing-Data Adjustments in Large Surveys”. In: Journal of Business & Economic Statistics 6.3, pp. 287–296. doi: 10.1080/07350015.1988.10509663.
- Messer Karen and Natarajan Loki (2008). “Maximum likelihood, multiple imputation and regression calibration for measurement error adjustment”. In: Statistics in Medicine 27, pp. 6332–6350. doi: 10.1002/sim.3458.
- Miksad Rebecca A and Abernethy Amy P (2018). “Harnessing the power of real-world evidence (RWE): a checklist to ensure regulatory-grade data quality”. In: Clinical Pharmacology & Therapeutics 103.2, pp. 202–205.
- Presley Carolyn J et al. (2018). “Association of broad-based genomic sequencing with survival among patients with advanced non-small cell lung cancer in the community oncology setting”. In: JAMA 320.5, pp. 469–477.
- Rosenbaum Paul R. and Rubin Donald B. (1983). “The Central Role of the Propensity Score in Observational Studies for Causal Effects”. In: Biometrika 70.1, pp. 41–55.
- Rosner B, Spiegelman D, and Willett WC (1990). “Correction of Logistic Regression Relative Risk Estimates and Confidence Intervals for Measurement Error: The Case of Multiple Covariates Measured With Error”. In: American Journal of Epidemiology 132.4, pp. 734–745.
- Rusanov A et al. (2014). “Hidden in plain sight: bias towards sick patients when sampling patients with sufficient electronic health record data for research”. In: BMC Medical Informatics and Decision Making 14. doi: 10.1186/1472-6947-14-51.
- Spiegelman Donna, Carroll Raymond J., and Kipnis Victor (2001). “Efficient regression calibration for logistic regression in main study/internal validation study designs with an imperfect reference instrument”. In: Statistics in Medicine 20, pp. 139–160.
- Steiner Peter M., Cook Thomas D., and Shadish William R. (2011). “On the Importance of Reliable Covariate Measurement in Selection Bias Adjustments Using Propensity Scores”. In: Journal of Educational and Behavioral Statistics 36.2, pp. 213–236.
- Sturmer Til, Schneeweiss Sebastian, Avorn Jerry, et al. (2005). “Adjusting Effect Estimates for Unmeasured Confounding with Validation Data using Propensity Score Calibration”. In: American Journal of Epidemiology 162.3, pp. 279–289.
- Sturmer Til, Schneeweiss Sebastian, Rothman Kenneth J, et al. (2007). “Performance of Propensity Score Calibration: A Simulation Study”. In: American Journal of Epidemiology 165, pp. 1110–1118. doi: 10.1093/aje/kwm074.
- USFDA (2018). Framework for FDA’s Real-World Evidence Program.
- Webb-Vargas Yenny et al. (2017). “An Imputation-Based Solution to Using Mismeasured Covariates in Propensity Score Analysis”. In: Statistical Methods in Medical Research 26.4, pp. 1824–1837.
- Weiskopf NG and Weng C (2013). “Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research”. In: Journal of the American Medical Informatics Association 20.1, pp. 144–151. doi: 10.1136/amiajnl-2011-000681.