Pharmacoepidemiol Drug Saf. 2022 Jan 7;31(4):424–433. doi: 10.1002/pds.5403

A comparison of confounder selection and adjustment methods for estimating causal effects using large healthcare databases

Imane Benasseur 1,2, Denis Talbot 2,3, Madeleine Durand 4,5, Anne Holbrook 6, Alexis Matteau 4,5, Brian J Potter 4,5, Christel Renoux 7,8,9, Mireille E Schnitzer 10,11,8, Jean-Éric Tarride 12,13, Jason R Guertin 2,3
PMCID: PMC9304306  PMID: 34953160

Abstract

Purpose

Confounding adjustment is required to estimate the effect of an exposure on an outcome in observational studies. However, variable selection and unmeasured confounding are particularly challenging when analyzing large healthcare data. Machine learning methods may help address these challenges. The objective was to evaluate the capacity of such methods to select confounders and reduce unmeasured confounding bias.

Methods

A simulation study with known true effects was conducted. Completely synthetic and partially synthetic data incorporating real large healthcare data were generated. We compared Bayesian adjustment for confounding (BAC), generalized Bayesian causal effect estimation (GBCEE), Group Lasso and doubly robust estimation, high-dimensional propensity score (hdPS), and scalable collaborative targeted maximum likelihood algorithms. For the hdPS, two adjustment approaches targeting the effect in the whole population were considered: full matching and inverse probability weighting.

Results

In scenarios without hidden confounders, most methods were essentially unbiased. The bias and variance of the hdPS varied considerably according to the number of variables selected by the algorithm. In scenarios with hidden confounders, substantial bias reduction was achieved by using machine-learning methods to identify proxies as compared to adjusting only for observed confounders. hdPS and Group Lasso performed poorly in the partially synthetic simulation. BAC, GBCEE, and scalable collaborative targeted maximum likelihood algorithms performed particularly well.

Conclusions

Machine learning methods can help identify measured confounders in large healthcare databases. They can also capitalize on proxies of unmeasured confounders to substantially reduce residual confounding bias.

Keywords: algorithms, biostatistics, confounding factors, machine learning, pharmacoepidemiology, propensity score

1. INTRODUCTION

Large healthcare databases (LHDs) are frequently used to estimate treatment effects in a real-world setting. Such data have many advantages, including the possibility of obtaining a sufficient sample size to investigate rare events, 1 , 2 , 3 and population representativeness. 2 , 3 Despite these advantages, because treatment is not randomized, the treatment-outcome association is susceptible to confounding bias. 1 , 2 , 3 Various adjustment methods can be employed to control this bias, such as propensity score matching or inverse probability of treatment weighting (IPTW).

The application of adjustment methods in LHD studies faces particular challenges. First, hundreds of variables are available in LHDs. Identifying true confounders based on substantive knowledge alone can be difficult. Omitting a true confounder may produce biased results, whereas including nonconfounders can increase the variance. In addition, confounders, such as lifestyle habits, are often missing from LHDs.

Machine learning algorithms may help address these challenges. 4 Indeed, several algorithms have been developed for performing confounder selection. It has also been proposed that machine-learning algorithms could identify proxies for unmeasured confounders within the rich information available in LHDs (see Figure 1). 5

FIGURE 1. Directed acyclic graph illustrating the problem of unmeasured confounders. Double-arrows between variables are notational shorthand meaning that unobserved common causes may exist, resulting in correlations. True confounders affect both the exposure and the outcome, but only measured confounders can be included in the analysis (box). Proxies for unmeasured confounders are variables that are affected by or correlated with the unmeasured confounders but are not confounders themselves.

Some studies suggest that machine learning can be useful for controlling confounding in LHDs. In a few studies, estimates closer to those of randomized trials were observed when using the high-dimensional propensity score (hdPS) than when adjusting only for user-defined covariates. 5 , 6 Moreover, the hdPS has been observed to produce treatment groups that are balanced with respect to clinically identified confounders that were excluded from the algorithm, 7 suggesting that proxies for unmeasured confounders can be identified by machine learning. Conversely, another study indicated that estimates obtained using the hdPS on data typically available in LHDs may substantially differ from those obtained when clinical data are additionally available, 8 suggesting that machine learning is sometimes unable to compensate for unmeasured confounders.

Overall, there is currently contradictory evidence concerning the usefulness of machine learning algorithms for controlling unmeasured confounding in LHDs. This may be because the evidence arises from "case studies" in which one or a few real datasets are analyzed. A first limitation of such studies is that the true effect is unknown. Even when a benchmark, such as a randomized trial, is available, it is unclear whether the true effect in the population covered by the LHD is the same as in the benchmark. In addition, results from "case studies" may reflect random fluctuations rather than the true properties of the methods investigated. Computer simulation studies can alleviate these challenges because the true effect is known and comparisons can be replicated multiple times to reduce random variability.

The goal of the present paper is to investigate the ability of different machine learning algorithms to select variables among potential confounders and to compensate for unmeasured confounders, and to compare different confounding adjustment methods.

2. METHODS

An overview of the simulation framework is provided in Figure 2. The effect of interest is the risk difference between the exposed and unexposed groups among the whole population. Simulations were conducted in R. 9

FIGURE 2. Flowchart of the simulation study.

2.1. Synthetic data generation

We considered four synthetic simulation scenarios, described hereafter, inspired by those presented in Shortreed and Ertefaie (2017). 10 Scenarios 1 and 3 represent situations with no unmeasured confounders, whereas Scenarios 2 and 4 feature an unmeasured confounder. Moreover, to explore the role of correlations between covariates in the ability of machine learning methods to identify proxies for unmeasured confounders, Scenarios 1 and 2 feature weaker correlations than Scenarios 3 and 4. These scenarios lack features of real LHDs but were explored to better understand the properties of each algorithm in a simple setting.

A total of 1000 replications of each scenario were generated. Each replicate consisted of 1000 independent observations, where each observation comprised 100 potential confounding covariates (X = (X1, X2, …, X100)), the exposure (A) and the outcome (Y), all binary. The covariates Xj were divided into four sub-groups to mimic the structure of LHDs (Table 1), where variables are commonly grouped in "dimensions" such as inpatient diagnoses. To generate the potential confounders, we first simulated 100 variables, X1*, X2*, …, X100*, from a multivariate normal distribution with mean 0, then dichotomized the values around 0 (Xj = 1 if Xj* > 0 and Xj = 0 otherwise, for j = 1, …, 100) such that the prevalence of each covariate was 50%. The variance of each variable Xj* was 1, but the correlations differed between scenarios (see Table 1).

TABLE 1.

Summary of the parameters for simulating the synthetic data

All four scenarios share the same sub-groups and the same covariate associations:

Sub-groups: (X1, …, X40), (X41, …, X70), (X71, …, X90), (X91, …, X100)

Exposure-covariate associations (α): (α1, α2, α5, α6, α71, α72, α75, α76) = 1; (α41, α42, α45, α46, α91, α92, α95, α96) = 1; all other α = 0

Outcome-covariate associations (β): (β1, β2, β3, β4, β71, β72, β73, β74) = 0.6; (β41, β42, β43, β44, β91, β92, β93, β94) = −0.6; all other β = 0

Scenario  Correlations        Hidden variable
1         W. = 0.2, B. = 0.1  None
2         W. = 0.2, B. = 0.1  X1
3         W. = 0.4, B. = 0.2  None
4         W. = 0.4, B. = 0.2  X1

Abbreviations: W. = within dimensions, B. = between dimensions.

The exposure and the outcome were generated from Bernoulli distributions with logit P(A = 1) = Σ(j=1 to 100) αjXj and logit P(Y = 1) = 2A + Σ(j=1 to 100) βjXj. The prevalence of the outcome was approximately 65%. The true confounders (related to both exposure and outcome) were X1, X2, X41, X42, X71, X72, X91, X92; the risk factors for the outcome (related to the outcome only) were X3, X4, X43, X44, X73, X74, X93, X94; the instruments (related to the exposure only) were X5, X6, X45, X46, X75, X76, X95, X96; and the remaining variables were superfluous (related to neither exposure nor outcome). The coefficients are provided in Table 1. Adjusting for risk factors of the outcome, in addition to true confounders, allows unbiased estimation with increased precision, 10 , 11 , 12 whereas including instruments increases the variance and may also increase bias. 10 , 12 , 13 , 14 , 15 In Scenarios 1 and 3, all Xj, j = 1, …, 100, were supplied to the machine learning algorithms, whereas X1 was hidden from the algorithms in Scenarios 2 and 4.
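
To make the generating mechanism concrete, the following is a minimal R sketch of one replicate of Scenario 1 (weak correlations, no hidden confounder). The mvtnorm package, the seed, and the object names are our illustrative choices, not the study's actual code (which is available from the corresponding author).

# Minimal sketch of one replicate of synthetic Scenario 1.
# Dimension sizes and coefficients follow Table 1.
library(mvtnorm)
set.seed(1)
n <- 1000; p <- 100
dim_id <- rep(1:4, times = c(40, 30, 20, 10))   # the four "dimensions"
within <- 0.2; between <- 0.1                   # Scenario 1 correlations

Sigma <- matrix(between, p, p)                  # between-dimension correlation
for (d in 1:4) Sigma[dim_id == d, dim_id == d] <- within
diag(Sigma) <- 1

Xstar <- rmvnorm(n, mean = rep(0, p), sigma = Sigma)
X <- (Xstar > 0) * 1                            # dichotomize at 0 (50% prevalence)
colnames(X) <- paste0("X", 1:p)

alpha <- beta <- rep(0, p)
# exposure coefficients: confounders and instruments
alpha[c(1, 2, 5, 6, 41, 42, 45, 46, 71, 72, 75, 76, 91, 92, 95, 96)] <- 1
# outcome coefficients: confounders and risk factors
beta[c(1, 2, 3, 4, 71, 72, 73, 74)] <- 0.6
beta[c(41, 42, 43, 44, 91, 92, 93, 94)] <- -0.6

A <- rbinom(n, 1, plogis(X %*% alpha))          # exposure
Y <- rbinom(n, 1, plogis(2 * A + X %*% beta))   # outcome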

2.2. Plasmode data generation

The plasmode simulation used data from an ongoing real-world study comparing the use of direct oral anticoagulants (DOACs) and warfarin as treatments for nonvalvular atrial fibrillation in Quebec (Canada). Briefly, we received data from the Régie de l'assurance maladie du Québec (RAMQ) on 60 093 patients with nonvalvular atrial fibrillation who newly initiated either warfarin (N = 21 514) or DOACs (N = 38 579) between January 1st 2010 and March 31st 2017. We used four of the datasets provided to us by the RAMQ:

  • Patient Demographics (patients' date of birth and biological sex);

  • Inpatient Diagnoses and Clinical Interventions (hospital lengths of stay, primary and secondary diagnoses during a hospitalization);

  • Inpatient and Outpatient Physician Billings (billing dates, physician billing codes, physician specialty);

  • Outpatient Drug Dispensations (date of the drug dispensation, class of molecule, dosage and number of pills dispensed).

Patients were linked across the different datasets via a unique anonymized identification number. Cohort entry was the date of first dispensation of DOACs or warfarin. The outcome was death within 5 years of cohort entry. Except for the demographic variables, the covariates represent the number of occurrences of a given code (e.g., a drug dispensation) in the 12 months preceding cohort entry.

Two different plasmode scenarios were considered. Baseline covariates with fewer than 2% (Scenario 1) or 1% (Scenario 2) of nonzero values were first excluded to avoid numerical problems. The 336 and 573 remaining covariates (out of 12 465) were then divided into five dimensions: One for each of the four datasets that form the original data, and one for the clinically important covariates. The median prevalence of the remaining variables was 4.7% (interquartile range = 2.8%-9.6%). Of note, the hdPS may create up to three times as many empirical covariates by categorizing the frequency of occurrence of codes. Among the original variables, 18 were randomly chosen to be related to the outcome (potential "unknown" confounders) in Scenario 1 and 38 in Scenario 2. Age and sex were selected as "known" potential confounders. The observed outcome was then modeled according to the treatment and potential confounders using a random forest procedure. 16 Then, to create each plasmode dataset, 10 000 and 20 000 observations were randomly sampled with replacement from the original dataset in Scenarios 1 and 2, respectively. A new synthetic outcome was generated using the previously fitted outcome model; the observed outcome was discarded. The prevalence of the simulated outcome was around 16%. Finally, one of the potential "unknown" confounders was randomly selected to be excluded from the analysis in Scenario 1, and five were excluded in Scenario 2. Only 400 and 110 plasmode datasets were generated for Scenarios 1 and 2, respectively, because of the greater computational burden.
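
To illustrate, here is a hedged sketch of the outcome-modeling and resampling steps of the plasmode procedure. The object names (ramq, covars) are hypothetical placeholders, and the randomForest call is only one plausible rendering of the random forest procedure cited above.

# Hedged sketch of the plasmode loop. `ramq` (the analytic cohort with
# treatment A, observed outcome Y, and covariates) and `covars` (the
# covariate names) are hypothetical placeholders, not objects from the study.
library(randomForest)

# Step 1: model the observed outcome given treatment and potential confounders.
rf <- randomForest(x = ramq[, c("A", covars)], y = factor(ramq$Y))

# Step 2: resample with replacement and regenerate a synthetic outcome;
# the observed outcome is discarded.
make_plasmode <- function(data, n_boot) {
  boot <- data[sample(nrow(data), n_boot, replace = TRUE), ]
  p1 <- predict(rf, newdata = boot[, c("A", covars)], type = "prob")[, "1"]
  boot$Y <- rbinom(n_boot, 1, p1)               # synthetic outcome
  boot
}
plasmode1 <- make_plasmode(ramq, 10000)         # Scenario 1: N = 10 000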

2.3. Statistical analysis

We first searched the literature for machine-learning algorithms developed for the identification of confounders. We excluded algorithms for which no R software code was available, those that were not adapted to the LHD setting, and those that did not allow for the estimation of a risk difference. We selected five algorithms among those that met our criteria: Bayesian adjustment for confounding (BAC) 17 , 18 ; generalized Bayesian causal effect estimation (GBCEE) 19 ; Group Lasso and doubly robust estimation (GLiDeR) 20 ; the hdPS 5 ; and scalable collaborative targeted maximum likelihood estimation (SC-TMLE). 21 We also considered a modified version of the hdPS (m-hdPS), in which Step 2 of the algorithm, 5 which selects potential confounders based on their prevalence, was omitted. 22 In the analysis of plasmode data, age and sex were forced into the models in software that allowed this option (hdPS, m-hdPS, and GBCEE; for SC-TMLE, only into the outcome model). An overview of the algorithms as well as details concerning their implementation are provided in Web Appendix 1. The parameters of hdPS and m-hdPS (e.g., final number of covariates to include) were adapted to the number of covariates available in each scenario. In synthetic Scenarios 1-4, the n = 8 most prevalent covariates of each dimension were initially selected for hdPS and k = 10 variables were retained at the end for both hdPS and m-hdPS. In the plasmode simulations, the parameters were n = 25 and k = 50 in Scenario 1, and n = 100 and k = 200 in Scenario 2. Additional parameter values were explored in Web Appendix 4. A sketch of the prevalence-based selection step appears below.
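
The following sketch gives one plausible rendering of that prevalence-based Step 2, which the m-hdPS omits. The function name and the use of min(prevalence, 1 − prevalence) as the ranking criterion are our assumptions for illustration, not code from the hdPS software.

# One plausible rendering of hdPS Step 2 (omitted by m-hdPS): within each
# data dimension, keep the n candidate codes whose prevalence is closest
# to 50%, i.e., with the largest min(prev, 1 - prev).
prevalence_filter <- function(codes, dim_id, n = 8) {
  # codes: binary indicator matrix (patients x candidate codes)
  # dim_id: data dimension label for each column of `codes`
  prev <- colMeans(codes)
  score <- pmin(prev, 1 - prev)
  keep <- unlist(lapply(split(seq_len(ncol(codes)), dim_id), function(idx)
    idx[order(score[idx], decreasing = TRUE)][seq_len(min(n, length(idx)))]))
  codes[, keep, drop = FALSE]
}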

BAC adjusts for confounding using an outcome‐model‐based standardization procedure (g‐computation), GBCEE and SC‐TMLE use a TMLE estimator, GLiDeR uses an augmented inverse probability of treatment weighting estimator, and hdPS and m‐hdPS yield a propensity score. For hdPS and m‐hdPS, we employed two different adjustment methods: IPTW 23 and full matching with replacement based on the logit of the propensity score with a 0.2 SD caliper (matching). 24 , 25 We chose this specific matching algorithm because it estimates the average treatment effect in the whole population, as the other estimators do.
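
As a sketch of the IPTW step, given a hypothetical vector ps of fitted propensity scores P(A = 1 | covariates) and the exposure and outcome vectors A and Y, the weighted risk difference targeting the whole population can be computed as follows:

# Sketch of IPTW for the risk difference in the whole population;
# `ps` is a hypothetical vector of fitted propensity scores.
w <- ifelse(A == 1, 1 / ps, 1 / (1 - ps))       # inverse probability weights
rd_iptw <- weighted.mean(Y[A == 1], w[A == 1]) -
           weighted.mean(Y[A == 0], w[A == 0])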

Numerical methods allowed us to determine that the true risk differences were 0.32 in all scenarios of the synthetic simulation, −0.18 in plasmode Scenario 1 and −0.12 in plasmode Scenario 2 (more details in Web Appendix 2).
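
For the synthetic scenarios, this computation amounts to contrasting the two potential-outcome means under the generating model. A sketch, reusing p, Sigma and beta from the Scenario 1 sketch above (the simulated sample size is an arbitrary precision choice):

# Sketch of the numerical computation of the true risk difference:
# contrast E[Y(1)] and E[Y(0)] in a large simulated sample.
Xbig <- (rmvnorm(2e5, mean = rep(0, p), sigma = Sigma) > 0) * 1
true_rd <- mean(plogis(2 + Xbig %*% beta)) -    # E[Y(1)]
           mean(plogis(Xbig %*% beta))          # E[Y(0)]; approx. 0.32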

The risk difference was estimated in each simulated dataset using BAC, GBCEE, GLiDeR, and SC-TMLE, as well as hdPS and m-hdPS combined with IPTW and matching. A crude unadjusted difference and a "true" logistic regression model adjusting for all available true confounders and risk factors for the outcome (but not for hidden variables) were also employed as benchmarks. The risk difference was estimated from the output of this "true" model by standardization. 26 This "true" model served two purposes. First, in scenarios without hidden variables, it allowed us to determine the "cost" of learning the role of covariates from the data. Second, in scenarios with hidden covariates, it permitted evaluating the benefit of identifying proxies for unmeasured confounders.
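
A sketch of this benchmark, reusing the objects from the Scenario 1 sketch above; the covariate list matches the confounders and risk factors defined in Section 2.1:

# Sketch of the "true" benchmark: a logistic outcome model on the known
# confounders and outcome risk factors, followed by standardization
# (g-computation) to obtain the risk difference.
dat <- data.frame(Y = Y, A = A, X)
true_vars <- paste0("X", c(1:4, 41:44, 71:74, 91:94))
fit <- glm(reformulate(c("A", true_vars), response = "Y"),
           family = binomial, data = dat)
rd_true <- mean(predict(fit, transform(dat, A = 1), type = "response")) -
           mean(predict(fit, transform(dat, A = 0), type = "response"))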

For each scenario, we computed the bias as the difference between the average estimate and the true risk difference. We also computed the SD of the estimates, the root mean squared error (RMSE = √(Bias² + SD²)), the average of the estimated standard errors (ESE) and the proportion of replicates in which the 95% confidence interval included the true risk difference (CP). The ratio of the RMSE of each method over that of the true model (Rel. RMSE) is reported below. BAC, GBCEE and SC-TMLE directly yield confidence intervals. For GLiDeR, we estimated the variance as the sample variance of the empirical efficient influence function, scaled by a factor 1/n. 27 For IPTW, the ESE was obtained using a robust variance estimator. 28 For matching, we used Abadie and Imbens's variance estimator of the risk difference. 25 We note that, except for BAC, GBCEE and SC-TMLE, these variance estimators lack theoretical support, notably because they do not account for the variability attributable to variable selection. The proportion of inclusion of observed confounders, risk factors, instruments and superfluous variables was also computed for the synthetic scenarios. In the plasmode simulation, the exact role of each variable is unknown since the relationship between treatment and covariates is not determined by the simulation model.
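
These metrics can be computed in a few lines of R. In this sketch, est and se are hypothetical vectors of point estimates and estimated standard errors across replicates, and coverage is evaluated with Wald-type intervals, an assumption since the methods' actual intervals may be constructed differently:

# Sketch of the performance metrics over replications.
perf <- function(est, se, true_rd) {
  bias <- mean(est) - true_rd
  c(Bias = bias,
    SD   = sd(est),
    ESE  = mean(se),                                    # average estimated SE
    RMSE = sqrt(bias^2 + sd(est)^2),
    CP   = 100 * mean(abs(est - true_rd) <= 1.96 * se)) # Wald-type coverage, %
}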

3. RESULTS

The results of the simulations are summarized in Tables 2, 3, 4, 5, 6, 7, Figures 3 and 4, and Web Tables 1-10. In Scenarios 1 and 3, where there was no hidden confounder (Tables 2 and 4, Figure 3, and Web Tables 5 and 7), most estimators almost eliminated the bias. The bias of hdPS and m-hdPS was high when few variables were included and essentially null when more variables were included. Most methods had similar SD. The SD of hdPS and m-hdPS was comparable to that of the other methods when few variables were included, but much larger otherwise. The variance estimators of SC-TMLE and GLiDeR underestimated the true variability (ESE < SD). The RMSEs of hdPS IPTW and hdPS Match were much greater than those of the other methods. All methods except BAC and GBCEE yielded confidence intervals that included the true effect much less often (<90%) than the expected 95%. The coverage of hdPS and m-hdPS improved when more variables were included.

TABLE 2.

Results of simulation Scenario 1 (weak correlations, no hidden confounder)

Method Bias SD ESE Rel. RMSE CP
True 0.003 0.030 0.030 1.00 94.4
Crude 0.155 0.027 0.026 5.14 0.0
BAC 0.008 0.035 0.032 1.19 90.8
GBCEE 0.005 0.037 0.036 1.24 92.5
GLiDeR 0.026 0.034 0.025 1.40 72.6
SC‐TMLE 0.005 0.037 0.023 1.23 77.8
hdPS IPTW (n = 8, k = 10) 0.094 0.039 0.033 3.33 23.6
hdPS Match (n = 8, k = 10) 0.094 0.042 0.041 3.38 26.3
m‐hdPS IPTW (k = 10) 0.018 0.042 0.041 1.49 91.4
m‐hdPS Match (k = 10) 0.018 0.046 0.039 1.61 88.9

Abbreviations: CP, Coverage of 95% confidence intervals; ESE, estimated standard error; Rel. RMSE, relative root‐mean squared error (compared to true model); SD, standard deviation.

TABLE 3.

Results of simulation Scenario 2 (weak correlations, one hidden confounder)

Method Bias SD ESE Rel. RMSE CP
True 0.091 0.029 0.029 1.00 13.0
Crude 0.153 0.025 0.026 1.63 0.0
BAC 0.029 0.034 0.032 0.47 84.0
GBCEE 0.026 0.036 0.035 0.46 87.8
GLiDeR 0.042 0.032 0.025 0.55 58.1
SC‐TMLE 0.025 0.035 0.024 0.45 70.6
hdPS IPTW (n = 8, k = 10) 0.095 0.035 0.033 1.06 20.1
hdPS Match (n = 8, k = 10) 0.094 0.039 0.035 1.07 25.3
m‐hdPS IPTW (k = 10) 0.032 0.039 0.040 0.52 83.8
m‐hdPS Match (k = 10) 0.031 0.042 0.039 0.55 85.7

Abbreviations: CP, Coverage of 95% confidence intervals; ESE, estimated standard error; Rel. RMSE, relative root‐mean squared error (compared to true model); SD, standard deviation.

TABLE 4.

Results of simulation Scenario 3 (strong correlations, no hidden confounder)

Method Bias SD ESE Rel. RMSE CP
True 0.002 0.031 0.031 1.00 94.4
Crude 0.189 0.026 0.026 6.14 0.0
BAC 0.009 0.036 0.034 1.18 92.6
GBCEE 0.005 0.040 0.039 1.30 93.3
GLiDeR 0.037 0.035 0.024 1.64 61.2
SC‐TMLE 0.004 0.037 0.023 1.19 77.1
hdPS IPTW (n = 8, k = 10) 0.109 0.040 0.035 3.73 17.1
hdPS Match (n = 8, k = 10) 0.108 0.044 0.035 3.74 19.9
m‐hdPS IPTW (k = 10) 0.040 0.048 0.044 2.00 78.2
m‐hdPS Match (k = 10) 0.041 0.049 0.039 2.05 77.2

Abbreviations: CP, Coverage of 95% confidence intervals; ESE, estimated standard error; Rel. RMSE, relative root‐mean squared error (compared to true model); SD, standard deviation.

TABLE 5.

Results of simulation Scenario 4 (strong correlations, one hidden confounder)

Method Bias SD ESE Rel. RMSE CP
True 0.087 0.029 0.030 1.00 18.5
Crude 0.188 0.025 0.026 2.06 0.0
BAC 0.027 0.036 0.034 0.49 85.3
GBCEE 0.023 0.040 0.038 0.50 88.7
GLiDeR 0.051 0.035 0.024 0.66 46.0
SC‐TMLE 0.022 0.038 0.023 0.47 72.5
hdPS IPTW (n = 8, k = 10) 0.110 0.039 0.035 1.27 16.6
hdPS Match (n = 8, k = 10) 0.110 0.041 0.035 1.27 17.7
m‐hdPS IPTW (k = 10) 0.050 0.044 0.043 0.72 71.7
m‐hdPS Match (k = 10) 0.052 0.046 0.039 0.75 70.2

Abbreviations: CP, Coverage of 95% confidence intervals; ESE, estimated standard error; Rel. RMSE, relative root‐mean squared error (compared to true model); SD, standard deviation.

TABLE 6.

Results of plasmode simulation Scenario 1 based on electronic health record data from Quebec, Canada, public insurance (N = 10 000; 336 covariates, one hidden confounder)

Method Bias SD ESE Rel. RMSE CP
True −0.009 0.008 0.007 1.00 76.8
Crude −0.059 0.009 0.009 5.14 0.0
BAC 0.000 0.009 0.008 0.78 91.7
GBCEE 0.002 0.009 0.009 0.81 93.8
GLiDeR* 0.001 0.095 0.035 8.16 91.5
SC‐TMLE 0.002 0.009 0.008 0.81 86.5
hdPS IPTW (n = 25, k = 50) 0.019 0.017 0.014 2.23 91.0
hdPS Match (n = 25, k = 50) 0.016 0.011 0.010 1.70 86.5
m‐hdPS IPTW (k = 50) 0.028 0.028 0.019 3.40 89.8
m‐hdPS Match (k = 50) 0.018 0.011 0.010 1.82 82.5

Note: *12 replications were dropped due to estimates lying outside the range of possible values (RD < −1 or RD > 1).

Abbreviations: CP, Coverage of 95% confidence intervals; ESE, estimated standard error; Rel. RMSE, relative root‐mean squared error (compared to true model); SD, standard deviation.

TABLE 7.

Results of the plasmode Scenario 2 based on electronic health record data from Quebec, Canada, public insurance (N = 20 000; 573 covariates, one hidden confounder)

Method Bias SD ESE Rel. RMSE CP
True −0.016 0.008 0.006 1.00 26.4
Crude −0.072 0.006 0.006 3.88 0.0
BAC 0.001 0.007 0.003 0.36 59.1
GBCEE 0.003 0.008 0.007 0.47 92.7
GLiDeR* NA NA NA NA NA
SC‐TMLE 0.002 0.008 0.007 0.42 88.2
hdPS IPTW (n = 100, k = 200) 0.102 0.215 0.067 12.83 61.8
hdPS Match (n = 100, k = 200) 0.016 0.010 0.010 1.04 71.8
m‐hdPS IPTW (k = 200) 0.107 0.202 0.069 12.35 63.6
m‐hdPS Match (k = 200) 0.017 0.011 0.010 1.06 65.5

Note: *Most GLiDeR estimates (88/110) lay outside the range of possible values (RD < −1 or RD > 1).

Abbreviations: CP, Coverage of 95% confidence intervals; ESE, estimated standard error; Rel. RMSE, relative root‐mean squared error (compared to true model); SD, standard deviation.

FIGURE 3. Bias (squares) and SD (bars) of the estimates according to simulation scenario. Scenario 1: Weak correlations, no hidden confounder; Scenario 2: Weak correlations, one hidden confounder; Scenario 3: Strong correlations, no hidden confounder; Scenario 4: Strong correlations, one hidden confounder.

FIGURE 4. Bias and SD (bars) of the different estimators in the plasmode simulation based on electronic health record data from Quebec, Canada, public insurance. In Scenario 1, n = 10 000, 336 covariates are considered, and one confounder is hidden. In Scenario 2, n = 20 000, 573 covariates are considered, and five confounders are hidden.

In Scenarios 2 and 4 (Tables 3 and 5, Figure 3, and Web Tables 6 and 8), where a confounder was hidden, adjusting only for observed confounders was insufficient to eliminate confounding, as illustrated by the bias of the “true” model (≈0.09). Most estimators reduced the bias considerably and had somewhat similar SD. As in Scenarios 1 and 3, the bias and SD of hdPS and m‐hdPS varied considerably according to the number of variables included. Again, the true variability of SC‐TMLE and GLiDeR was underestimated (ESE < SD). The methods that had the lowest RMSE were SC‐TMLE, BAC and GBCEE in both scenarios. The coverage of 95% confidence intervals of all methods was below 90%, but BAC and GBCEE had the coverage closest to the desired value (>86%).

In the plasmode scenarios (Tables 6 and 7, Web Tables 9 and 10, Figure 4), where potential confounders were hidden, a bias was present when using the "true" model that excluded the hidden potential confounders. All methods managed to reduce this bias, except hdPS and m-hdPS. GLiDeR failed to produce admissible results in many replications. BAC, GBCEE and SC-TMLE performed similarly in terms of bias, SD and RMSE. BAC, GLiDeR and GBCEE had close to appropriate coverage of their 95% confidence intervals in Scenario 1 (91.7%, 91.5% and 93.8%, respectively); only GBCEE did in Scenario 2 (92.7%). The coverage of the other methods was poor (below 90%).

The results concerning the probability of variable selection are reported in Web Appendix 3. BAC, GBCEE and GLiDeR had a high probability of including confounders and outcome risk factors. Although the m-hdPS had a high probability of including confounders (>80%), its probability of including risk factors was close to 0%. BAC had a probability of including instruments close to 100%, unlike the other methods, whose probability of including instruments was low or moderate. All methods had a low probability of including superfluous variables (<10%).

The computing time of each method was evaluated in a single replication of plasmode Scenario 2: 32 s for hdPS, 1.2 min for SC‐TMLE, 3.1 h for GBCEE, 4.8 h for GLiDeR and 10.6 h for BAC.

4. DISCUSSION

We investigated the ability of machine learning confounder selection methods to control bias from measured and unmeasured confounders in LHDs. The hypothesis was that proxies for unmeasured confounders could be identified and may help reduce bias. Under the scenarios we generated, our results support this hypothesis, since a substantial reduction of the bias was observed when using some of the machine learning methods as compared to a model that included only the observed confounders and outcome risk factors. In terms of bias and RMSE, BAC, GBCEE and SC-TMLE all performed similarly well. In comparison, the hdPS and m-hdPS performed worse, especially in the plasmode simulation. Regarding adjustment methods, full matching and IPTW performed similarly, except in plasmode Scenario 2, where matching outperformed IPTW.

Most methods, including SC-TMLE, produced 95% confidence intervals that included the true effect substantially less than 95% of the time. Except for BAC, GBCEE and SC-TMLE, this was expected, since the variance estimators did not account for variable selection. Post-selection inference is challenging. 29 , 30 Unfortunately, the usual bootstrap is inappropriate. 31 Alternative bootstrap procedures could perhaps be employed, 31 but this would have excessively increased the computational burden. GBCEE offers an option to employ the bootstrap in a suitable manner for variance estimation; in previous work, this yielded adequate inferences when its theoretical variance estimator did not. 19

Using a simulation study allowed us to overcome several limitations of previous studies. However, simulation studies are also subject to limitations. Notably, only a limited number of settings were explored. Our synthetic simulation scenarios were arguably simplistic, but they were helpful for understanding how the methods compare in situations where the data-generating mechanism is fully user-specified. Our plasmode simulation allowed us to investigate a more realistic setting. However, many real LHD applications feature much larger samples and many more covariates. In addition, we chose to allow only a limited number of covariates to affect the synthetic outcome, which may be unrealistic. Only a binary outcome was considered, but time-to-event outcomes are also frequent in LHD studies. Among the algorithms we considered, only the hdPS currently accommodates such outcomes. Additional simulations are thus required to assess the generalizability of our findings. Another limitation is that it was possible to include all covariates in the outcome model when fitting the SC-TMLE, which may not be possible in all applications and would affect the performance of the algorithm. In addition, we did not consider methods that combine the hdPS with other machine learning methods such as Super Learner. 32 , 33 , 34 , 35 Finally, our results do not allow us to determine whether the better performance of certain algorithms is due to their adjustment methods or to their variable selection algorithms, since most methods differed on both accounts.

CONFLICT OF INTEREST

MES received speaker fees from Biogen. The other authors declare no conflict of interest.

ETHICS STATEMENT

This study was approved by the CHUM ethics committee (decision number MP-02-2016-5920). No patient consent was required since this research only used de-identified administrative data.

Supporting information

Appendix S1: Supporting Information

ACKNOWLEDGMENTS

This work was supported by a grant from the Canadian Institutes of Health Research (CIHR, #PJT‐156292). IB was supported by a scholarship from the Faculté de sciences et de génie de l'Université Laval. DT received grants supporting this work from the National Sciences and Engineering Research Council of Canada (NSERC, #2016‐06295), the Fonds de recherche du Québec – Santé (FRQS, #265385) and is an FRQS chercheur‐boursier Junior 1. JRG is an FRQS chercheur‐boursier Junior 1. BJP, MD and CR are all FRQS clinical scholars. MES holds a CIHR Canada Research Chair in causal inference and machine learning in health sciences and received grants from NSERC related to this work. Supporting sources had no involvement in the study design, data collection, analysis and interpretation of data, writing of the report or in the decision to submit the report for publication.

Parts of this work have been presented at the following meetings/conferences:

  • Benasseur I, Guertin JR, Durand M, Holbrook A, Matteau A, Potter BJ, Renoux C, Schnitzer M, Tarride JE, Talbot D. A simulation‐based comparison of confounding adjustment methods in observational studies of administrative data. 35th Annual Conference of the International Society for Pharmacoepidemiology, Philadelphia, USA (abstract published in Pharmacoepidemiol Drug Saf 2019; 28:71‐72).

  • Benasseur I, Guertin JR, Durand M, Holbrook A, Matteau A, Potter BJ, Renoux C, Schnitzer M, Tarride JE, Talbot D. A simulation‐based comparison of confounding adjustment methods in observational studies of administrative data. Canadian Society of Epidemiology and Biostatistics national conference, Ottawa, Canada.

  • Benasseur I, Guertin JR, Durand M, Holbrook A, Matteau A, Potter BJ, Renoux C, Schnitzer M, Tarride JE, Talbot D. A simulation-based comparison of confounding adjustment methods in observational studies of administrative data. Journée de la recherche des étudiants de l'axe Santé des populations et pratiques optimales en santé du CHU de Québec - Université Laval, Québec, Canada.

Benasseur I, Talbot D, Durand M, et al. A comparison of confounder selection and adjustment methods for estimating causal effects using large healthcare databases. Pharmacoepidemiol Drug Saf. 2022;31(4):424-433. doi: 10.1002/pds.5403

Imane Benasseur and Denis Talbot are considered joint first authors.

Funding information Canadian Institutes of Health Research; Fonds de Recherche du Québec ‐ Santé; Natural Sciences and Engineering Research Council of Canada; Université Laval

DATA AVAILABILITY

The code for performing all simulations is available from the corresponding author upon request.

REFERENCES

  • 1. Gavrielov-Yusim N, Friger M. Use of administrative medical databases in population-based research. J Epidemiol Community Health. 2014;68:283-287.
  • 2. Mazzali C, Duca P. Use of administrative data in healthcare research. Intern Emerg Med. 2015;10:517-524.
  • 3. Nguyen LL, Barshes NR. Analysis of large databases in vascular surgery. J Vasc Surg. 2010;52:768-774.
  • 4. Schneeweiss S. Automated data-adaptive analytics for electronic healthcare data to study causal treatment effects. Clin Epidemiol. 2018;10:771.
  • 5. Schneeweiss S, Rassen JA, Glynn RJ, et al. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009;20:512-522. doi: 10.1097/EDE.0b013e3181a663cc
  • 6. Garbe E, Kloss S, Suling M, et al. High-dimensional versus conventional propensity scores in a comparative effectiveness study of coxibs and reduced upper gastrointestinal complications. Eur J Clin Pharmacol. 2013;69:549-557.
  • 7. Guertin JR, Rahme E, LeLorier J. Performance of the high-dimensional propensity score in adjusting for unmeasured confounders. Eur J Clin Pharmacol. 2016;72:1497-1505.
  • 8. Austin PC, Wu CF, Lee DS, et al. Comparing the high-dimensional propensity score for use with administrative data with propensity scores derived from high-quality clinical data. Stat Methods Med Res. 2019;29:962280219842362.
  • 9. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; 2021.
  • 10. Shortreed SM, Ertefaie A. Outcome-adaptive lasso: variable selection for causal inference. Biometrics. 2017;73:1111-1122. doi: 10.1111/biom.12679
  • 11. Brookhart MA, Schneeweiss S, Rothman KJ, et al. Variable selection for propensity score models. Am J Epidemiol. 2006;163:1149-1156.
  • 12. Myers JA, Rassen JA, Gagne JJ, et al. Effects of adjusting for instrumental variables on bias and precision of effect estimates. Am J Epidemiol. 2011;174:1213-1222.
  • 13. De Luna X, Waernbaum I, Richardson TS. Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika. 2011;98:861-875.
  • 14. Pearl J. Invited commentary: understanding bias amplification. Am J Epidemiol. 2011;174:1223-1227.
  • 15. Patrick AR, Schneeweiss S, Brookhart MA, et al. The implications of propensity score variable selection strategies in pharmacoepidemiology: an empirical illustration. Pharmacoepidemiol Drug Saf. 2011;20:551-559.
  • 16. Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2(3):18-22.
  • 17. Wang C, Dominici F, Parmigiani G, et al. Accounting for uncertainty in confounder and effect modifier selection when estimating average causal effects in generalized linear models. Biometrics. 2015;71:654-665.
  • 18. Wang C, Parmigiani G, Dominici F. Bayesian effect estimation accounting for adjustment uncertainty. Biometrics. 2012;68:661-671. doi: 10.1111/j.1541-0420.2011.01731.x
  • 19. Talbot D, Beaudoin C. A generalized double-robust Bayesian model averaging approach to causal effect estimation with application to the study of osteoporotic fractures. arXiv preprint. 2020:1-27.
  • 20. Koch B, Vock DM, Wolfson J. Covariate selection with group lasso and doubly robust estimation of causal effects. Biometrics. 2018;74:8-17.
  • 21. Ju C, Gruber S, Lendle SD, et al. Scalable collaborative targeted learning for high-dimensional data. Stat Methods Med Res. 2019;28:532-554.
  • 22. Schuster T, Pang M, Platt RW. On the role of marginal confounder prevalence - implications for the high-dimensional propensity score algorithm. Pharmacoepidemiol Drug Saf. 2015;24:1004-1007.
  • 23. Hernán MA, Robins JM. Estimating causal effects from epidemiological data. J Epidemiol Community Health. 2006;60:578-586.
  • 24. Austin PC. Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharm Stat. 2011;10:150-161.
  • 25. Abadie A, Imbens GW. Large sample properties of matching estimators for average treatment effects. Econometrica. 2006;74:235-267.
  • 26. Snowden JM, Rose S, Mortimer KM. Implementation of G-computation on a simulated data set: demonstration of a causal inference technique. Am J Epidemiol. 2011;173:731-738.
  • 27. Luque-Fernandez MA, Schomaker M, Rachet B, et al. Targeted maximum likelihood estimation for a binary treatment: a tutorial. Stat Med. 2018;37:2530-2546.
  • 28. Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550-560.
  • 29. Leeb H, Pötscher BM. Can one estimate the unconditional distribution of post-model-selection estimators? Econ Theory. 2008;24:338-376.
  • 30. Leeb H, Pötscher BM. Model selection and inference: facts and fiction. Econ Theory. 2005;21:21-59.
  • 31. Efron B. Estimation and accuracy after model selection. J Am Stat Assoc. 2014;109:991-1007.
  • 32. Franklin JM, Eddings W, Glynn RJ, et al. Regularized regression versus the high-dimensional propensity score for confounding adjustment in secondary database analyses. Am J Epidemiol. 2015;182:651-659.
  • 33. Wyss R, Schneeweiss S, van der Laan M, et al. Using super learner prediction modeling to improve high-dimensional propensity score estimation. Epidemiology. 2018;29:96-106.
  • 34. Karim ME, Pang M, Platt RW. Can we train machine learning methods to outperform the high-dimensional propensity score algorithm? Epidemiology. 2018;29:191-198.
  • 35. Ju C, Combs M, Lendle SD, et al. Propensity score prediction for electronic healthcare databases using super learner and high-dimensional propensity score methods. J Appl Stat. 2019;46:2216-2236.


