Abstract
Valid causal inference from observational pharmacoepidemiologic studies relies on adequately adjusting for confounding. In this article we elucidate two important components of making valid inference from observational data: measuring the necessary set of variables at the design / data collection phase (measured confounding) and properly accounting for confounding at the modeling / analysis phase (accounted for confounding). For the latter concept, we contrast parametric modeling approaches, which are susceptible to model misspecification bias, with data adaptive approaches. The goal of this article is to provide clarity and guidance on issues related to confounding and provide motivation for using more flexible models for causal inference in pharmacoepidemiology.
Keywords: causal inference, confounding, observational studies
INTRODUCTION
In the design and analysis of observational pharmacoepidemiology studies it is important to both measure and account for confounding. Commonly held beliefs regarding confounding and obtaining valid causal inference in such studies include: (1) in order to minimize bias, one should attempt to measure as many potential confounders (variables that are associated both with the exposure of interest and the outcome) as possible, and (2) it is important to adjust for confounders via appropriate methods such as regression adjustment or inverse probability of treatment weighting. Practitioners might feel, then, that once they have measured and adjusted for these confounders, they have dealt with bias in the treatment effect estimate due to confounding. However, researchers might not realize that measuring all confounders is not the same as fully addressing confounding. In this paper we use causal diagrams and toy data examples to help dispel possible misunderstandings about confounders, confounding (both measured and unmeasured), and approaches to account for confounding. Specifically, we focus on the following issues: the difference between confounding (collectively) and confounders (individual variables); how one can tell from a causal graph if the adjustment variables are sufficient (i.e., that all variables have been measured); how, even if one has measured all of the confounders, there is still a risk of not fully accounting for confounding if parametric models are used because they are susceptible to model misspecification; and finally, arguably, there is no reason to fit parametric models, given the wide array of flexible, data adaptive methods and software that are available. We hope the simple illustrations in this paper will provide insight for those designing and analyzing pharmacoepidemiology studies.
CAUSAL EFFECT IDENTIFICATION AND CONFOUNDING
Brief Review of Causal Effects and Causal Assumptions
For illustrative purposes, we focus on the common setting where there is a binary treatment A (1 if treated, 0 if control), an outcome Y (this can be continuous, binary, counts, etc.), and pre-treatment variables L. As a hypothetical example, consider a population of diabetic patients who are on metformin monotherapy. We are interested in whether adding second line therapy, when there is lack of glycemic efficacy, reduces the risk of major adverse cardiovascular events (MACE). The index date could be the date when hemoglobin A1C first exceeds 7.5%. Treatment variable A would be the addition of second line therapy (yes/no). We define Ya as the potential outcome (e.g. MACE) if treatment is set to value a, a = 0, 1.1 We will focus discussion on the average causal effect (ACE), which is given by E(Y1) - E(Y0). The ACE can be thought of as the difference in expected value of the outcome if everyone in the population was treated versus if no one was treated. The key ideas of this paper are not unique to the ACE, but we focus on it to make our presentation clearer.
Since potential outcomes are not observed directly, we need identifying assumptions to express the ACE in terms of the observed data. Throughout, we make the consistency assumption, which is that Y = Ya whenever A = a.2 In other words, we assume that the potential outcomes are defined uniquely by a subject's own treatment levels and are not affected by other subjects. Consistency allows us to directly map observed data to potential outcomes.
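The potential-outcomes setup and the consistency mapping can be illustrated with a minimal simulation sketch. All numbers here are hypothetical and chosen only for illustration:

```python
import numpy as np

# Hypothetical potential outcomes for n subjects: y0 is the outcome if
# untreated, y1 the outcome if treated.  The true ACE is built in as -0.10.
rng = np.random.default_rng(0)
n = 100_000

baseline = rng.normal(0.5, 0.1, n)
y0 = baseline              # potential outcome under a = 0
y1 = baseline - 0.10       # potential outcome under a = 1

true_ace = y1.mean() - y0.mean()        # E(Y1) - E(Y0)

# If treatment were randomized, consistency maps observed data to
# potential outcomes (Y = Y1 when A = 1, Y = Y0 when A = 0), and a
# simple difference in observed means recovers the ACE.
a = rng.binomial(1, 0.5, n)
y = np.where(a == 1, y1, y0)
est_ace = y[a == 1].mean() - y[a == 0].mean()
```

In real data only one of y0 and y1 is ever observed per subject; the simulation is useful precisely because it lets us see both.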
In observational studies, treatment is not experimentally assigned. There is concern that some variables that affect the treatment decision also affect the outcome (confounding). One common identifying assumption is known as ignorability. Ignorability suggests that treatment is effectively randomized, conditional on a set of pre-treatment variables L. Formally, A is independent of Y0,Y1 given L. In our hypothetical example involving second line antidiabetic therapy, L might include demographics, medication history, laboratory data, and so on. If ignorability holds, we can identify E(Ya). An important component of any study is determining which variables make up L. We next describe a criterion for doing this based on causal graphs.
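When ignorability holds given L, E(Ya) is identified by standardization: E(Ya) = Σl E(Y | A = a, L = l) Pr(L = l). The following sketch, with a single hypothetical binary covariate L and made-up coefficients, shows the naive difference in means failing while standardization recovers the truth:

```python
import numpy as np

# Toy data: treatment depends on L, and the true effect of A on Y is 1.0.
rng = np.random.default_rng(1)
n = 200_000

L = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.2 + 0.5 * L)          # confounded treatment
Y = 1.0 * A + 2.0 * L + rng.normal(0, 1, n)

# Naive comparison is biased because L differs between treatment arms.
naive = Y[A == 1].mean() - Y[A == 0].mean()

# Standardization: E(Ya) = sum over l of E(Y | A = a, L = l) Pr(L = l).
def standardized_mean(a):
    return sum(
        Y[(A == a) & (L == l)].mean() * np.mean(L == l)
        for l in (0, 1)
    )

ace_std = standardized_mean(1) - standardized_mean(0)
```

The standardized estimate is close to the true effect of 1, while the naive contrast is roughly doubled by confounding.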
Confounders and Confounding
A confounder is defined as a variable that affects both treatment and the outcome, either directly or indirectly. For example, consider the two directed acyclic graphs (DAGs)3 in Figure 1. For DAG 1, L is a confounder because it affects both treatment and the outcome (notice the arrows coming out of L and into both A and Y). In DAG 2, the only confounder is X. X affects A indirectly (it affects W, which in turn affects A). X also affects V, which in turn affects Y. This figure illustrates how one can easily identify confounders from a DAG. We can answer the question “is there confounding?” simply by looking at the DAG to see if there is a confounder present.
Figure 1.
Two hypothetical directed acyclic graphs. A is treatment and Y is the outcome. DAG 1 is the classic case of confounding, where L affects A and Y. DAG 2 is more complicated, where variable X could be thought of as a confounder, but variables U, W, and V, are on various paths involving the treatment or outcome.
However, it is important to note that confounders cannot be identified empirically. Without a DAG (subject matter knowledge) we cannot determine if a variable is a confounder solely from the study data. As a simple example, from our data alone we would not be able to distinguish between DAG 1 and a DAG with the same layout, but where the arrow between A and L was in the opposite direction. However, if we knew that L was a pre-treatment variable (based on subject matter knowledge), then we would know that A could not cause L.
Practitioners have often been taught to test for confounding using two models: one model to test for an association between the exposure and X and another model to test for an association between the outcome and X. If both associations are statistically significant, X is deemed to be a confounder. Here we demonstrate why that approach can lead to erroneous conclusions. Consider Figure 2. Suppose we observe {A,X,Y}, but not U. Clearly the DAGs on the left and right are different. In DAG 3, X is a confounder. In DAG 4, X is not a confounder, because it does not affect Y. In this situation, one may find an association between X and Y based on the data, because U is unmeasured. If we were to conduct the empirical test of confounding described above, we would conclude that X is a confounder. Hence, we would not be able to distinguish between these two DAGs empirically.
Figure 2.
Two hypothetical directed acyclic graphs. A is treatment and Y is the outcome. The variable U is unmeasured and X is observed.
Consider again DAG 4, but now assume U is measured. Using the common two step empirical approach for identifying confounding, we would find the following. First, we would fit a model for E(A|X,U). There, we would find that only X is associated with A. This is because U only affects A through its effect on X. Next, we would fit a model for E(Y|A,X,U). There, we would find that A is not associated with Y. So, we would not identify any confounders, even though we know U is a confounder. This DAG illustrates how the two model empirical approach can be misleading and why we do not recommend this practice.
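A small simulation of DAG 4 (with made-up parameter values) makes the failure of the two-model test concrete. Here U → X → A and U → Y, with no arrow from A or X into Y:

```python
import numpy as np

# DAG 4: U -> X -> A, U -> Y.  A has no effect on Y, and X is not a
# confounder, yet the two-model empirical test flags X when U is unmeasured.
rng = np.random.default_rng(2)
n = 100_000

U = rng.normal(0, 1, n)
X = U + rng.normal(0, 1, n)
A = rng.binomial(1, 1 / (1 + np.exp(-X)))
Y = U + rng.normal(0, 1, n)                 # no arrow from A (or X) into Y

def ols(y, *cols):
    """Least-squares coefficients on cols (intercept dropped)."""
    Z = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(Z, y, rcond=None)[0][1:]

# With U unmeasured, X looks like a confounder: it is associated with
# both A and Y (the X-Y association flows entirely through U).
corr_xy = np.corrcoef(X, Y)[0, 1]

# With U measured, the two-model approach fails the other way: U predicts
# A only through X, and A is (correctly) found unrelated to Y, so neither
# model flags U even though U is the true confounder.
coef_x_a, coef_u_a = ols(A, X, U)
coef_a_y, coef_x_y, coef_u_y = ols(Y, A, X, U)
```

The simulated associations match the verbal argument: X is associated with both A and Y in the observed data, while U shows no association with A given X.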
Sufficient to Control for Confounding
As mentioned above, we can identify the ACE if the ignorability assumption holds. Ignorability involves conditional independence between treatment and the potential outcomes given some set of pre-treatment variables L. We can determine from a DAG whether our collection of variables L satisfies the ignorability assumption. To do this, one can check what is known as the backdoor path criterion.3,4 We review the criterion in the supplemental material. However, the key point is that what matters is whether the collection of variables L, as a whole, are sufficient to control for confounding (i.e., ignorability holds).3 As we will see in the next section, this does not necessarily mean that L is a collection of all confounders.
MEASURED CONFOUNDING
In this section we focus on the distinction between confounding and confounders. In causal inference the focus is on controlling for confounding. The phrase “no unmeasured confounding” does not necessarily mean all confounders have been measured. It simply means that at the design stage, we have measured (collected information on) and plan to control for a set of variables that are sufficient to control for confounding. Consider the following examples.
Classic Case of Confounding
DAG 1 above is the classic case of confounding, where a variable or variables L directly affects both A and Y. In that case, L is sufficient to control for confounding (the backdoor path criterion is satisfied by conditioning on L).
You Can Induce Confounding When There Are No Confounders
Now consider the example in Figure 3. The variables U1 and U2 are unmeasured, but X is observed. Note that if we do not control for anything in this situation, the backdoor path from A to Y is blocked. There are no confounders. However, if we control for X, we essentially unblock the backdoor path from A to Y. Conditional on X, U1 and U2 are no longer independent of each other. This is because X is a collider on the path from U1 to U2. A collider is a variable on a path that has more than one arrow going into it (i.e., more than one parent). Conditioning on a collider induces an association between the parents. Therefore, in the world depicted by Figure 3, the ignorability assumption does not hold, given X. This situation is sometimes known as collider bias or M-bias.5,6 Essentially, if we condition on X we have induced unmeasured confounding, even though there were no confounders to begin with. This is because in this new world where we look within levels of X, U1 and U2 are dependent: information can flow from U1 to both A and Y.
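The M-structure can be simulated directly (hypothetical values throughout; a continuous exposure is used here purely to keep the regressions simple). The true effect of A on Y is zero, and only the X-adjusted analysis gets it wrong:

```python
import numpy as np

# Figure 3 structure: U1 -> X <- U2, U1 -> A, U2 -> Y, no A -> Y arrow.
rng = np.random.default_rng(3)
n = 100_000

U1 = rng.normal(0, 1, n)
U2 = rng.normal(0, 1, n)
X = U1 + U2                       # collider: two parents on the path
A = U1 + rng.normal(0, 1, n)
Y = U2 + rng.normal(0, 1, n)      # true effect of A on Y is zero

def slope_on_a(*cols):
    Z = np.column_stack([np.ones(n), A, *cols])
    return np.linalg.lstsq(Z, Y, rcond=None)[0][1]

unadjusted = slope_on_a()         # ~ 0: the backdoor path is blocked
adjusted = slope_on_a(X)          # biased: conditioning on X opens it
```

With these coefficients, the X-adjusted slope converges to -1/3 rather than 0, a pure artifact of conditioning on a collider.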
Figure 3.
A directed acyclic graphic illustrating M-bias. A is treatment, Y is the outcome, U1 and U2 are unmeasured.
You Can Achieve Ignorability Without Actually Measuring Any Confounders
Now consider the DAG in Figure 4. The variable G is unmeasured. The variable G is also the only confounder. It affects A indirectly and Y directly. So, the only confounder is not actually measured. However, we can still control for confounding. If we control for P, which is not a confounder, we can still block the backdoor path from A to Y.
Figure 4.
A hypothetical directed acyclic graphic. A is treatment, Y is the outcome, P is observed, and G is unmeasured.
To better understand how this can happen, let us consider a hypothetical example where G represents genotype (binary) and P represents some phenotype (binary) that is affected by the genotype. One could imagine situations where the clinicians making the treatment decisions do not know the genotype, but have a measurement of the phenotype. They therefore might make treatment decisions, in part, based on the phenotype. However, in this example, it is the genotype, not the phenotype, that affects the outcome. In the Supplementary Materials, we illustrate this with an example and show that you get the same value of the causal effect, whether you condition on P, G, or both.
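A toy version of the genotype/phenotype story (all numbers hypothetical) shows this numerically: adjusting for the non-confounder P recovers the causal effect even though the only confounder, G, is never used.

```python
import numpy as np

# Figure 4 structure: G -> P -> A, G -> Y, A -> Y, with true effect 1.0.
# G (genotype) is unmeasured; P (phenotype) is what clinicians observe.
rng = np.random.default_rng(4)
n = 200_000

G = rng.binomial(1, 0.5, n)
P = rng.binomial(1, np.where(G == 1, 0.9, 0.1))   # phenotype tracks genotype
A = rng.binomial(1, np.where(P == 1, 0.7, 0.3))   # treatment decided from P
Y = 1.0 * A + 2.0 * G + rng.normal(0, 1, n)

naive = Y[A == 1].mean() - Y[A == 0].mean()       # biased upward

# Standardizing over P alone blocks the backdoor path A <- P <- G -> Y.
def std_mean(a):
    return sum(
        Y[(A == a) & (P == p)].mean() * np.mean(P == p)
        for p in (0, 1)
    )

ace_p = std_mean(1) - std_mean(0)
```

The key is that, given P, treatment assignment carries no further information about G, so conditioning on P satisfies the backdoor criterion.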
If the ignorability assumption holds, what this means is that one has the opportunity to control for confounding. However, it is just an opportunity and not a guarantee. We next discuss the process of going from measuring the right variables to accounting for them properly.
ACCOUNTED-FOR CONFOUNDING
In practice, it is common to account for confounding using regression models. For instance, a propensity score may be estimated using a parametric model such as logistic regression, and then inverse probability of treatment weighting may be used to estimate the causal effect. The modeling step is important to adequately account or adjust for the collection of variables that are, in principle, sufficient to control for confounding.
Parametric Models
Parametric models, such as linear and logistic models, are often used to account for confounding, but they rely on correct specification of the functional form. Even if an investigator feels that they have measured an exhaustive list of important covariates, if the model is not correct, there could be bias due to model misspecification. In this context, we think of model misspecification bias as unaccounted for confounding. For example, in a logistic regression model, suppose one only includes the confounders X1 and X2. If the true underlying relationship involves an interaction effect between X1 and X2, then confounding due to X1 × X2 will be unaccounted for. In this example, you could think of X3 = X1 × X2 as a confounder. Even though X3 was ‘measured’ in that it is observable in our data set, if we do not include it in the model it is not accounted for. In this example, we had the opportunity to fully account for confounding because we had measured X1 and X2 (and hence X3). However, we misspecified the model and essentially ignored X3. Model misspecification in this example is analogous to leaving out a confounder. This issue may be compounded if the set of variables needed to satisfy the backdoor criterion is large. In that case, the true model could involve complex non-linear, higher order interactions that would be difficult for the researcher to specify.
Figure 5 shows a complex relationship that is not captured by a linear regression despite inclusion of all measured covariates (in this case, just X). Here, the true relationship between X and the mean of Y|X is a sine curve, and the grey line is the least squares estimate from linear regression. Notably, collecting more data (increasing the sample size) does not alleviate unaccounted for confounding in a misspecified parametric model, a common misconception.
In this example, collecting more data would just stabilize the estimated line (i.e., we would estimate the best line through the curve more accurately).
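The interaction example can be made concrete with a small simulation (hypothetical coefficients; linear rather than logistic regression is used to keep the algebra transparent). Both X1 and X2 are "measured," yet omitting their product leaves confounding unaccounted for:

```python
import numpy as np

# Outcome and treatment both depend on X3 = X1 * X2, so X3 is the
# confounder; a main-effects-only model leaves it unadjusted.
rng = np.random.default_rng(5)
n = 100_000

X1 = rng.normal(0, 1, n)
X2 = rng.normal(0, 1, n)
X3 = X1 * X2
A = (X3 + rng.normal(0, 1, n) > 0).astype(float)
Y = 1.0 * A + X1 + X2 + 2.0 * X3 + rng.normal(0, 1, n)   # true effect 1.0

def coef_on_a(*cols):
    Z = np.column_stack([np.ones(n), A, *cols])
    return np.linalg.lstsq(Z, Y, rcond=None)[0][1]

misspecified = coef_on_a(X1, X2)        # omits X1*X2: badly biased
correct = coef_on_a(X1, X2, X1 * X2)    # includes it: close to 1.0
```

Increasing n only makes the misspecified estimate converge more precisely to the wrong value.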
Figure 5.
A complex relationship that is difficult to capture with a parametric model (least squares regression line) but well-captured by a data adaptive approach (BART).
Data Adaptive and Machine Learning Methods
In pharmacoepidemiological studies, a large number of variables are often collected, such as clinical, demographic, environmental, and genetic data. In these studies, flexible nonparametric data adaptive approaches may be desirable because of their ability to capture complex nonlinear, higher order, and interaction effects that a researcher may find difficult to specify in a parametric model. These data adaptive methods can be used as part of causal inference methodology.7 For example, if a researcher plans to estimate the causal effect using inverse probability of treatment weighting (IPTW), data adaptive methods can be used to estimate the propensity score. If instead the researcher plans to use standardization, data adaptive methods can be used for the outcome model.
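As a sketch of IPTW with a data adaptive propensity model (hypothetical data; a simple binned estimator stands in here for the machine learning methods discussed below), suppose the true propensity score is a nonlinear function of L that a main-effects logistic model would miss:

```python
import numpy as np

# True propensity is a sine-shaped function of L; true effect of A is 1.0.
rng = np.random.default_rng(6)
n = 200_000

L = rng.uniform(-2, 2, n)
true_ps = 1 / (1 + np.exp(-2 * np.sin(2 * L)))
A = rng.binomial(1, true_ps)
Y = 1.0 * A + 3.0 * np.sin(2 * L) + rng.normal(0, 1, n)

naive = Y[A == 1].mean() - Y[A == 0].mean()     # confounded comparison

# Data adaptive propensity estimate: mean of A within narrow bins of L
# (a crude nonparametric regression; no functional form assumed).
edges = np.linspace(-2, 2, 51)
bins = np.clip(np.digitize(L, edges) - 1, 0, 49)
ps_hat = np.array([A[bins == b].mean() for b in range(50)])[bins]

# Hajek-style IPTW estimate of the ACE.
w1, w0 = A / ps_hat, (1 - A) / (1 - ps_hat)
ace_iptw = np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)
```

The weighted contrast recovers the true effect of 1 without the analyst ever specifying the sine-shaped propensity function.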
There are many popular data adaptive methods that have been used for causal inference in pharmacoepidemiological studies, such as random forests,8 support vector machines, and Dirichlet process mixture models.9 As one example, Bayesian additive regression trees (BART),10,11 which can roughly be thought of as a Bayesian version of boosting (ensembles of decision trees), are expected to perform better as the sample size increases. BART is illustrated in Figure 5. Notice that a linear regression model does a very poor job in capturing this bumpy sine wave type relationship, while BART (blue curve and credible band) is able to uncover this complex relationship quite well. Various forms of BART have been used for causal inference.12–14
Oftentimes, as with parametric regression, it is difficult to choose the correct data adaptive approach or algorithm from a long menu of choices. Recently, ensemble machine learning approaches, such as super learner, have become popular15,16 for analyzing pharmacoepidemiological data. Ensemble learning approaches combine multiple approaches or “learners,” via cross-validation, into a single prediction model with optimal properties (i.e., one that minimizes a pre-specified loss function such as the mean squared error). One approach for estimating causal effects that was designed to take advantage of super learner is targeted minimum loss-based estimation (TMLE). The take-home message here is that misspecification of a parametric model can lead to unsuspected confounding even if the sample size is large; data adaptive approaches can help to address this problem. Importantly, recent work has shown that singly robust machine learning approaches may be as biased as misspecified parametric models; hence, combining doubly robust estimators with suitably chosen machine learning algorithms is recommended.17
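The cross-validated selection step can be sketched with a stripped-down "discrete" super learner on hypothetical sine-curve data like that of Figure 5: two candidate learners are compared by cross-validated mean squared error and the winner is selected.

```python
import numpy as np

# Candidates: a (misspecified) linear learner and a flexible binned-mean
# learner.  Five-fold cross-validation picks whichever predicts better.
rng = np.random.default_rng(7)
n = 20_000

X = rng.uniform(-3, 3, n)
Y = np.sin(2 * X) + rng.normal(0, 0.3, n)

def fit_linear(xtr, ytr):
    b1, b0 = np.polyfit(xtr, ytr, 1)
    return lambda x: b0 + b1 * x

def fit_binned(xtr, ytr, k=30):
    edges = np.linspace(-3, 3, k + 1)
    idx = np.clip(np.digitize(xtr, edges) - 1, 0, k - 1)
    means = np.array([ytr[idx == b].mean() if np.any(idx == b)
                      else ytr.mean() for b in range(k)])
    return lambda x: means[np.clip(np.digitize(x, edges) - 1, 0, k - 1)]

def cv_mse(fit, folds=5):
    perm = rng.permutation(n)
    mse = 0.0
    for chunk in np.array_split(perm, folds):
        train = np.setdiff1d(perm, chunk)
        pred = fit(X[train], Y[train])(X[chunk])
        mse += np.mean((Y[chunk] - pred) ** 2) / folds
    return mse

mse_linear, mse_binned = cv_mse(fit_linear), cv_mse(fit_binned)
best = "binned" if mse_binned < mse_linear else "linear"
```

The full super learner goes further by taking an optimal weighted combination of the candidates rather than simply selecting one, but the cross-validation logic is the same.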
Unmeasured Confounding
It is important to note that bias from unmeasured confounding may also be addressed at the analysis stage. Sensitivity analysis approaches, for example, allow a researcher to assess how robust their results are to different degrees of potential unmeasured confounding. There is an extensive literature on sensitivity analysis approaches, including Rosenbaum and Rubin (1983),18 Díaz and van der Laan (2013),19 Ding and VanderWeele (2016),20 Zhang and Tchetgen Tchetgen (2019),21 and Bonvini and Kennedy (2020),22 just to name a few. Further, Genbäck and de Luna (2019)23 provide a novel approach to account for uncertainty due to unmeasured confounders by obtaining bounds on the average causal effect based on doubly robust and outcome regression estimators.
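As one concrete example, the E-value of Ding and VanderWeele (2016)20 summarizes, for a risk ratio estimate RR, the minimum strength of association (on the risk-ratio scale) that an unmeasured confounder would need with both treatment and outcome to fully explain away the observed association:

```python
import math

def e_value(rr):
    """E-value for a risk ratio point estimate (Ding & VanderWeele 2016):
    RR + sqrt(RR * (RR - 1)), after converting protective RRs (< 1)
    to their reciprocal so that RR > 1."""
    rr = max(rr, 1 / rr)
    return rr + math.sqrt(rr * (rr - 1))

# An observed risk ratio of 2 could only be explained away by an
# unmeasured confounder associated with both treatment and outcome by
# risk ratios of about 3.41 each.
ev = e_value(2.0)
```

Larger E-values indicate results that are more robust to unmeasured confounding; an E-value near 1 means even weak unmeasured confounding could account for the finding.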
DISCUSSION
Obtaining valid causal inference is often a goal of pharmacoepidemiological studies. In order to do this, confounding must be measured and accounted for. In other words, careful attention should be given to both the design stage and the analysis stage of a study. At the design stage of the study, simply collecting and measuring a large number of variables may not be enough to fully account for confounding. We hope researchers will use DAGs to carefully think about confounding. In fact, as we have demonstrated, without a DAG we cannot identify a confounder merely from the collected data. A DAG is based on a detailed understanding of the study components and relies on robust knowledge of the scientific domain being studied. Interdisciplinary scientific teams are often needed to develop a fully realized DAG. Once sources of confounding are determined, they must be accounted for in the analysis stage of the study. Simply including confounders in a parametric model, as we have shown, may result in unaccounted for confounding. Because parametric models are difficult to specify correctly, and because the effect of model mis-specification is not mitigated by increasing the sample size, we recommend the use of data adaptive approaches.
Supplementary Material
Key Messages:
Data alone cannot be used to determine if a variable is a confounder.
Ignorability is an assumption that has to do with whether a collection of measured variables gives the analyst a chance to control for confounding.
Even if confounding has been adequately measured, mis-specified models may lead to unaccounted for confounding.
With mis-specified parametric models, increasing the sample size will not help.
Data adaptive methods can account for more confounding (less bias due to model misspecification) as the sample size increases.
Acknowledgments
FUNDING
Jason Roy was supported by the National Center for Advancing Translational Sciences (NCATS), a component of the National Institutes of Health (NIH), under Award Number UL1TR0030117.
REFERENCES
- 1. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66(5):688–701. doi:10.1037/h0037350
- 2. Cole SR, Frangakis CE. The consistency statement in causal inference: a definition or an assumption? Epidemiology. 2009;20(1):3–5. doi:10.1097/EDE.0b013e31818ef366
- 3. Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology. 1999;10(1):37–48.
- 4. Pearl J. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press; 2009.
- 5. Greenland S. Quantifying Biases in Causal Models: Classical Confounding vs Collider-Stratification Bias. Epidemiology. 2003;14(3):300–306. doi:10.1097/01.EDE.0000042804.12056.6C
- 6. Liu W, Brookhart MA, Schneeweiss S, Mi X, Setoguchi S. Implications of M Bias in Epidemiologic Studies: A Simulation Study. Am J Epidemiol. 2012;176(10):938–948. doi:10.1093/aje/kws165
- 7. Blakely T, Lynch J, Simons K, Bentley R, Rose S. Reflection on modern methods: when worlds collide—prediction, machine learning and causal inference. Int J Epidemiol. 2019:dyz132. doi:10.1093/ije/dyz132
- 8. Wager S, Athey S. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. J Am Stat Assoc. 2018;113(523):1228–1242. doi:10.1080/01621459.2017.1319839
- 9. Roy J, Lum KJ, Zeldow B, Dworkin JD, Lo Re V III, Daniels MJ. Bayesian nonparametric generative models for causal inference with missing at random covariates. Biometrics. 2018;74(4):1193–1202. doi:10.1111/biom.12875
- 10. Chipman HA, George EI, McCulloch RE. BART: Bayesian additive regression trees. Ann Appl Stat. 2010;4(1):266–298. doi:10.1214/09-AOAS285
- 11. Tan YV, Roy J. Bayesian additive regression trees and the General BART model. Stat Med. 2019;38(25):5048–5069. doi:10.1002/sim.8347
- 12. Hill JL. Bayesian Nonparametric Modeling for Causal Inference. J Comput Graph Stat. 2011;20(1):217–240. doi:10.1198/jcgs.2010.08162
- 13. Zeldow B, Lo Re V III, Roy J. A semiparametric modeling approach using Bayesian Additive Regression Trees with an application to evaluate heterogeneous treatment effects. Ann Appl Stat. 2019;13(3):1989–2010. doi:10.1214/19-AOAS1266
- 14. Hahn PR, Murray JS, Carvalho CM. Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects. Bayesian Anal. Published online 2020. doi:10.1214/19-BA1195
- 15. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007;6:Article25. doi:10.2202/1544-6115.1309
- 16. Wyss R, Schneeweiss S, van der Laan M, Lendle SD, Ju C, Franklin JM. Using Super Learner Prediction Modeling to Improve High-dimensional Propensity Score Estimation. Epidemiology. 2018;29(1):96–106. doi:10.1097/EDE.0000000000000762
- 17. Naimi AI, Mishler A, Kennedy EH. Challenges in obtaining valid causal effect estimates with machine learning algorithms. Am J Epidemiol. Published online 2020.
- 18. Rosenbaum PR, Rubin DB. Assessing Sensitivity to an Unobserved Binary Covariate in an Observational Study with Binary Outcome. J R Stat Soc Ser B Methodol. 1983;45(2):212–218.
- 19. Díaz I, van der Laan MJ. Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems. Int J Biostat. 2013;9(2):149–160. doi:10.1515/ijb-2013-0004
- 20. Ding P, VanderWeele TJ. Sensitivity Analysis Without Assumptions. Epidemiology. 2016;27(3):368–377. doi:10.1097/EDE.0000000000000457
- 21. Zhang B, Tchetgen Tchetgen EJ. A Semiparametric Approach to Model-based Sensitivity Analysis in Observational Studies. Published online October 30, 2019. Accessed December 10, 2020. https://arxiv.org/abs/1910.14130v1
- 22. Bonvini M, Kennedy EH. Sensitivity Analysis via the Proportion of Unmeasured Confounding. Published online December 5, 2019. Accessed December 10, 2020. https://arxiv.org/abs/1912.02793v1
- 23. Genbäck M, de Luna X. Causal inference accounting for unobserved confounding after outcome regression and doubly robust estimation. Biometrics. 2019;75(2):506–515. doi:10.1111/biom.13001