Epidemiology. 2023 Sep 14;34(6):865–872. doi: 10.1097/EDE.0000000000001660

A Potential Outcomes Approach to Selection Bias

Eben Kenah
PMCID: PMC11340682  PMID: 37708480

Abstract

We propose a novel definition of selection bias in analytic epidemiology using potential outcomes. This definition captures selection bias under both the structural approach (where conditioning on selection into the study opens a noncausal path from exposure to disease in a directed acyclic graph) and the traditional definition (where a given measure of association differs between the study sample and the population eligible for inclusion). This approach is nonparametric, and selection bias under the approach can be analyzed using single-world intervention graphs both under and away from the null hypothesis. It allows the simultaneous analysis of confounding and selection bias, it explicitly links the selection of study participants to the estimation of causal effects using study data, and it can be adapted to handle selection bias in descriptive epidemiology. Through examples, we show that this approach provides a novel perspective on the variety of mechanisms that can generate selection bias and simplifies the analysis of selection bias in matched studies and case–cohort studies.

Keywords: Causal inference, Epidemiologic methods, Measures of association, Potential outcomes, Selection bias, Single-world intervention graphs


Along with confounding, selection bias is one of the fundamental threats to the validity of epidemiologic research. Traditionally, selection bias has been defined as a systematic difference (i.e., a difference beyond random variation) between measures of an exposure–disease association in a study sample and the underlying eligible population, where the study sample consists of the individuals included in the study and the eligible population consists of all individuals eligible for inclusion.1 For example, Berkson2 showed that two diseases can be positively correlated among hospitalized patients even when they are independent in the source population. This definition can depend on the parameterization of the exposure–disease association,3 and it provides little or no guidance about identifying and controlling selection bias in the design and analysis of a study. These problems resemble those that arise when confounding is defined as a change in an exposure–disease association upon adjustment for a covariate.4–7
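The Berkson example can be made concrete with a small exact calculation. The sketch below is illustrative only: the disease prevalences and admission probabilities are hypothetical numbers chosen so that having both diseases makes admission much more likely, and the function name `berkson_odds_ratios` is ours, not taken from the cited work. With other admission probabilities the induced association can point in the opposite direction.

```python
from itertools import product

# Hypothetical inputs (assumptions for illustration only).
P_D1 = 0.1                                   # prevalence of disease 1
P_D2 = 0.2                                   # prevalence of disease 2
P_ADMIT = {(0, 0): 0.05, (1, 0): 0.10,       # Pr(hospitalized | d1, d2)
           (0, 1): 0.10, (1, 1): 0.50}

def berkson_odds_ratios():
    """Return the D1-D2 odds ratio in the source population and among the hospitalized."""
    population = {}
    hospital = {}
    for d1, d2 in product((0, 1), repeat=2):
        # The diseases are independent in the source population by construction.
        pr = (P_D1 if d1 else 1 - P_D1) * (P_D2 if d2 else 1 - P_D2)
        population[d1, d2] = pr
        hospital[d1, d2] = pr * P_ADMIT[d1, d2]

    def odds_ratio(table):
        return (table[1, 1] * table[0, 0]) / (table[1, 0] * table[0, 1])

    return odds_ratio(population), odds_ratio(hospital)

or_population, or_hospital = berkson_odds_ratios()
print(f"source population OR = {or_population:.3f}")
print(f"hospitalized OR      = {or_hospital:.3f}")
```

Under these assumed numbers the odds ratio is 1 in the source population and 2.5 among hospitalized patients, so restricting to the hospital sample creates an association that does not exist in the population it was drawn from.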

The structural approach of Hernán et al.8 showed that selection bias occurs when conditioning on selection into the study opens a noncausal path from an exposure X to a disease D in a causal directed acyclic graph (DAG)9,10 for the eligible population. This approach is nonparametric, and its use of DAGs to incorporate background knowledge in the identification and control of selection bias is a practical advantage over the traditional definition.

However, the structural approach only captures selection bias that can occur under the null hypothesis (i.e., no causal path from X to D) and no confounding (i.e., no open backdoor path from X to D).3 An example of selection bias that escapes this approach was given by Greenland11: In a hypothetical cohort study where right censoring was not associated with exposure, there was no selection bias under the null. Away from the null, the risk ratio in the full cohort (censored and uncensored) differed from the risk ratio in the observed cohort (uncensored only).

Here, we use potential outcomes12 to propose a novel definition of selection bias in analytic epidemiology, where the goal is to infer the causal effect of a treatment or exposure on the risk of disease. The proposed definition allows the simultaneous analysis of confounding and selection bias, and it captures all selection bias under the structural approach and the traditional definition. It is nonparametric, it can be analyzed using single-world intervention graphs (SWIGs) of Richardson and Robins13,14 both under and away from the null, and it explicitly links the selection of study participants to the measures of association that can be estimated using study data. We show how it can be adapted to handle selection bias in descriptive epidemiology, where the goal is to estimate the joint distribution of disease and covariates (e.g., demographics or exposures). Through examples, we show how selection bias can be generated by colliders at X and D, how it can arise in randomized clinical trials, and how the potential outcomes approach simplifies the analysis of selection bias in matched studies and case–cohort studies.

Confounding, Exchangeability, and Backdoor Paths

Unlike selection bias, confounding has a standard definition in terms of potential outcomes. Let X be an exposure or treatment, D be a disease outcome, and Dx denote the outcome that occurs if we intervene to set X=x. In the notation of Dawid,15 there is no unmeasured confounding when

$$D^{x} \perp\!\!\!\perp X \mid C \tag{1}$$

(i.e., Dx is conditionally independent of X given C), where C is a set of measured nondescendants of X.16 The conditional independence in equation (1) is called exchangeability.5 This definition of confounding is difficult to use directly as a guide to study design and analysis.

Confounding can also be defined as an open backdoor path from X to D in a causal DAG,10 which provides an intuitive way to use background knowledge to identify and control confounding. SWIGs are transformations of causal DAGs that explicitly represent potential outcomes.13 A SWIG represents the intervention of setting X=x by splitting the node X into two new nodes: One node represents the realized value of X and inherits all incoming edges from the node X. The other node represents the intervention X=x and inherits all outgoing edges from the node X. The two new nodes are not connected by an edge, and all paths through the node representing the intervention are blocked. Any node Y that is a descendant of X in the DAG is written Yx to show that it is a potential outcome that can be observed only in individuals with X=x. If Y is a nondescendant of X, then Yx=Y and can be observed in all individuals.

In a SWIG, the rules of d-separation9 can be used to evaluate conditional independence. This can be used to show that the backdoor path criterion and exchangeability are equivalent.14 Figure 1 shows how exchangeability is guaranteed by no open backdoor path from X to D in a simple example: Conditioning on C blocks the backdoor path from X to D in the DAG, and it d-separates Dx and X in the SWIG. Therefore, confounding has a nonparametric definition in terms of potential outcomes that can be analyzed using causal graphs both under and away from the null hypothesis. The potential outcomes approach to selection bias achieves something similar.

FIGURE 1.


A causal directed acyclic graph (DAG) and single-world intervention graph (SWIG) showing no unmeasured confounding given C. The arrow from X to D is dashed because confounding does not depend on whether X has a causal effect on D.
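A toy calculation can illustrate what exchangeability in equation (1) asserts. The sketch below assumes a hypothetical structure in which a binary C affects both X and the potential outcome under exposure, with no other common cause. All probabilities and the function name `risk_if_exposed` are illustrative assumptions.

```python
from itertools import product

P_C1 = 0.4                          # Pr(C = 1), hypothetical
P_X1 = {0: 0.2, 1: 0.7}             # Pr(X = 1 | C), hypothetical
P_D1 = {0: 0.3, 1: 0.6}             # Pr(D^1 = 1 | C): risk if everyone were exposed

def risk_if_exposed(given_x, given_c=None):
    """Exact Pr(D^1 = 1 | X = given_x), optionally also conditioning on C = given_c."""
    num = den = 0.0
    for c, x, d1 in product((0, 1), repeat=3):
        if x != given_x or (given_c is not None and c != given_c):
            continue
        # X and D^1 each depend on C only, so they are independent given C.
        pr = P_C1 if c else 1 - P_C1
        pr *= P_X1[c] if x else 1 - P_X1[c]
        pr *= P_D1[c] if d1 else 1 - P_D1[c]
        den += pr
        num += pr * d1
    return num / den

for c in (0, 1):
    print(c, risk_if_exposed(1, c), risk_if_exposed(0, c))   # equal within each stratum
print(risk_if_exposed(1), risk_if_exposed(0))                # unequal without conditioning
```

Within each stratum of C, the potential risk under exposure is the same for those who were and were not actually exposed, which is exchangeability given C. Without conditioning on C the two groups differ (0.51 versus 0.36 under these assumed numbers), which is confounding.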

SELECTION BIAS VIA POTENTIAL OUTCOMES

Let X be an exposure or treatment, D be a disease outcome, and S indicate selection into the study out of a specified eligible population. The study sample is the subset of the eligible population with S=1. Assume that X is not a descendant of D in a causal DAG for the eligible population. We propose the following definition of selection bias in analytic epidemiology:

Definition 1 (Analytic Selection Bias)

There is no unmeasured selection bias for the causal effect of X on D if and only if at least one of the following conditions holds:

  1. Analytic cohort condition. If we intervene to set exposure X=x, selection into the study is conditionally independent of the disease outcome given X and C1:

$$S^{x} \perp\!\!\!\perp D^{x} \mid (X, C_1), \tag{2}$$

where C1 is a (possibly empty) set of measured nondescendants of X such that exchangeability holds in equation (1).

  2. Analytic case–control condition. If we intervene to set disease outcome D=d, selection into the study is conditionally independent of exposure given D, C1, and C2:

$$S^{d} \perp\!\!\!\perp X \mid (D, C_1, C_2), \tag{3}$$

where C1 is defined above and C2 is a (possibly empty) set of measured nondescendants of D such that

$$C_2^{x} \perp\!\!\!\perp D^{x} \mid (X, C_1). \tag{4}$$

This conditional independence implies that C2 cannot contain variables on a causal path from X to D.

The analytic cohort condition in equation (2) is relevant to studies where exposure is a cause of selection. It is similar to the conditions given by Daniel et al.17 for the identifiability of a causal effect with missing data, which are based on the do-operator9 rather than potential outcomes. In the DAG at the top of Figure 2, we must condition on A to close the backdoor path from X to D, so C1 must contain A. In the SWIG for setting X=x (panel B of Figure 2), we must condition on B to d-separate Sx and Dx, so C1 must contain B. The cohort condition holds because A and B are nondescendants of X, exchangeability holds given C1={A,B}, and C1 d-separates Sx and Dx. The case–control condition fails because of the arrow from X to Sd=S in the SWIG for setting D=d (panel C of Figure 2).

FIGURE 2.


Causal directed acyclic graph (DAG) (A) and single-world intervention graphs (SWIGs) for the eligible population in a cohort study. SWIG (B) shows that the cohort condition holds given {A,B}. SWIG (C) shows that the case–control condition fails.

The analytic case–control condition in equation (3) is relevant to studies where disease outcome is a cause of selection. It is based on the principle that controls should be individuals who could become cases if they had a disease onset while under observation.18,19 In Figure 3, the cohort condition fails because of the arrow from Dx to Sx in the SWIG for setting X=x (panel B of Figure 3). Because conditioning on A is necessary to block the backdoor path from X to D, C1 must contain A. In the SWIG for setting D=d (panel C of Figure 3), conditioning on B and V is needed to d-separate X and Sd. The variable V cannot be in C1, but it can be in C2 because $V^{x} \perp\!\!\!\perp D^{x} \mid (X, A)$. The variable B can be in C1 because it is a nondescendant of X and exchangeability holds given {A,B}. It can also be in C2 because $B^{x} = B \perp\!\!\!\perp D^{x} \mid (X, A)$. Thus, the case–control condition holds given {A,B,V} in two ways: C1={A} and C2={B,V} or C1={A,B} and C2={V}.

FIGURE 3.


Causal directed acyclic graph (DAG) (A) and single-world intervention graphs (SWIGs) for the eligible population in a case–control study. SWIG (B) shows that the cohort condition fails. SWIG (C) shows that the case–control condition holds given {A,B,V}.

Evaluation of the analytic cohort and case–control conditions should consider all stable conditional independencies20 implied by the SWIGs derived from a causal DAG for the eligible population that includes S. The examples above show that neither condition implies the other. To estimate a conditional causal effect of X on D given a set V of nondescendants of X, we must find a C1 that contains all variables in V. The conditions for no unmeasured analytic selection bias are similar to the identifiability conditions for the causal odds ratio in Bareinboim and Pearl,21 but they guarantee the identifiability of a greater variety of causal effect measures.

When there is no unmeasured analytic selection bias, a study has external validity in the sense that adjustment for measured variables is sufficient to generalize a causal effect estimate from the study sample to the eligible population or a subset of it defined by measured variables.1 As in Greenland and Pearl,22 adjustment for C means a measure of association standardized to a specified joint distribution of the variables in C, a set of conditional measures of association within strata of C, or a common conditional measure of association given C.

Selection Bias Due to a Collider at S

The structural approach of Hernán et al.8 identifies selection bias when conditioning on S opens a noncausal path from X to D. This occurs when S is a collider on a path from X to D in a causal DAG or a descendant of such a collider. The DAG in Figure 4 represents a path from X to D on which S is a collider. As in Greenland et al.,10 the undirected dashed edges represent open paths whose structure is not specified. Thus, there is an open path from X to S that ends with an arrow pointing into S and an open path from S to D that starts with an arrow pointing into S.

FIGURE 4.


Causal directed acyclic graph showing a variable S as a collider on an open path from X to D in the eligible population. Selection into the study is S or a descendant of S. The dashed lines indicate open paths whose structure is not specified.

Theorem 1. Selection bias under the structural approach implies analytic selection bias under the potential outcomes approach.

Proof. There is selection bias under the structural approach if and only if at least one path matching the pattern in Figure 4 exists such that the paths from X to S and from S to D cannot be blocked. If we condition on S or a descendant of S, then:

  • If we set X=x, the open path from D to S implies that the analytic cohort condition fails in the SWIG for setting X=x. This holds whether or not S is a descendant of X.

  • If we set D=d, the open path from X to S implies that the analytic case–control condition fails in the SWIG for setting D=d. This holds whether or not S is a descendant of D.

Because the analytic cohort and case–control conditions fail, we have analytic selection bias under the potential outcomes approach. □

In each example from Hernán et al.,8 the potential outcomes approach and the structural approach reach identical conclusions about the presence and control of selection bias. As noted by Hernán,3 all of these examples have no causal path and no open backdoor path from X to D. Selection bias under the structural approach compromises both the internal and external validity of a study: The causal effect estimate within the study sample is biased, so it cannot generalize to the eligible population.

Selection Bias in Descriptive Epidemiology

The traditional definition of selection bias applies to a measure of association between X and D whether or not it represents a causal effect. To define selection bias for descriptive epidemiology, we can drop the requirement that exchangeability holds given C1 and remove all restrictions on descendants of X and D so there is no need to distinguish between C1 and C2:

Definition 2 (Descriptive Selection Bias)

There is no unmeasured selection bias for the association between X and D if and only if at least one of the following conditions holds:

  1. Descriptive cohort condition. Selection into the study is conditionally independent of disease outcome given X and C:

$$S \perp\!\!\!\perp D \mid (X, C), \tag{5}$$

where C is a (possibly empty) set of measured covariates.

  2. Descriptive case–control condition. Selection into the study is conditionally independent of exposure given D and C:

$$S \perp\!\!\!\perp X \mid (D, C), \tag{6}$$

where C is defined above.

When at least one of these conditions holds, the conditional association between X and D given C in the study sample matches the same conditional association in the eligible population (up to random variation). In the cohort study from Figure 2, conditioning on B controls descriptive selection bias but not analytic selection bias. The same is true of conditioning on {B,V} in the case–control study from Figure 3. There are two primary differences between the control of descriptive and analytic selection bias: Analytic selection bias must be controlled using a set of variables sufficient to control confounding, and conditioning on causal descendants of X and D is constrained.

The descriptive cohort and case–control conditions can be assessed on any DAG—causal or not—that represents the joint distribution of X, D, S, and C in the eligible population. This definition of no unmeasured descriptive selection bias matches the necessary and sufficient conditions given in Didelez et al.23 for the X-D odds ratio given C to be collapsible over S.

Selection and Estimation

The selection bias example of Greenland11 was analyzed by Hernán3 using a DAG similar to that in Figure 5, where S is not a collider or a descendant of a collider. Howe et al.24 considered a similar causal structure for selection bias caused by loss to follow-up. In the SWIG for setting X=x, the open path S-C-Dx must be blocked by conditioning on C. In the SWIG for setting D=d, the path S-C-D-X is opened by conditioning on the collider at D, so it must be closed by conditioning on C. In both SWIGs, the potential outcomes approach correctly identifies selection bias and shows that it can be controlled by adjusting for C.

FIGURE 5.


Causal directed acyclic graph (DAG) (A) adapted from Hernán3 and corresponding single-world intervention graphs (SWIGs) for the example in Greenland.11 The box around X in SWIG (B) represents conditioning on X when checking the cohort condition, and the box around D in SWIG (C) represents conditioning on D when checking the case–control condition.
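The behavior described for this example can be reproduced with an exact calculation. The sketch below assumes a structure consistent with the text (C affects both S and D, X affects D, and S does not depend on X), but the specific probabilities, the helper names, and the use of binary variables are illustrative assumptions rather than the numbers used by Greenland or Hernán.

```python
from itertools import product

P_C1 = 0.5                       # Pr(C = 1), hypothetical
P_X1 = 0.5                       # Pr(X = 1), independent of C
P_S1 = {0: 0.9, 1: 0.3}          # Pr(S = 1 | C): selection depends on C only

def joint(risk):
    """Yield (c, x, d, s, probability), where risk[x, c] = Pr(D = 1 | X = x, C = c)."""
    for c, x, d, s in product((0, 1), repeat=4):
        pr = P_C1 if c else 1 - P_C1
        pr *= P_X1 if x else 1 - P_X1
        pr *= risk[x, c] if d else 1 - risk[x, c]
        pr *= P_S1[c] if s else 1 - P_S1[c]
        yield c, x, d, s, pr

def risk_in(risk, x, c=None, selected=False):
    """Pr(D = 1 | X = x), optionally within C = c and/or within the study sample."""
    num = den = 0.0
    for ci, xi, d, s, pr in joint(risk):
        if xi != x or (c is not None and ci != c) or (selected and s != 1):
            continue
        den += pr
        num += pr * d
    return num / den

def crude_rr(risk, selected):
    """Unadjusted risk ratio in the eligible population or in the study sample."""
    return risk_in(risk, 1, selected=selected) / risk_in(risk, 0, selected=selected)

def standardized_rr(risk):
    """Risk ratio from the study sample, standardized to Pr(C) in the eligible population."""
    def std(x):
        return sum((P_C1 if c else 1 - P_C1) * risk_in(risk, x, c, selected=True)
                   for c in (0, 1))
    return std(1) / std(0)

null = {(0, 0): 0.1, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}
non_null = {(0, 0): 0.1, (1, 0): 0.2, (0, 1): 0.5, (1, 1): 0.6}

print(crude_rr(null, False), crude_rr(null, True))          # both 1 under the null
print(crude_rr(non_null, False), crude_rr(non_null, True))  # differ away from the null
print(standardized_rr(non_null))                            # adjustment for C recovers the full-cohort value
```

Under the null, the risk ratio is 1 in both the eligible population and the study sample. Away from the null, the assumed risks give 4/3 in the eligible population and 1.5 in the study sample, and standardizing the sample's stratum-specific risks to the population distribution of C returns 4/3. The standardization step presumes that the distribution of C in the eligible population is known, for example from baseline data.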

Theorem 2. Selection bias under the traditional definition implies selection bias under the potential outcomes approach.

Proof. We will prove the equivalent statement that no selection bias under the potential outcomes approach implies no selection bias under the traditional definition. If there is no selection bias under the potential outcomes approach, then at least one of the cohort and case–control conditions holds. In each case, we will consider both analytic and descriptive selection bias.

When the analytic cohort condition holds and we intervene to set X=x, the conditional risk of disease given C1=c1 in the eligible population equals the conditional risk given C1=c1 and X=x in the study sample:

$$\begin{aligned}
\Pr(D^{x}=d \mid C_1=c_1) &= \Pr(D^{x}=d \mid C_1=c_1, X=x) \\
&= \Pr(D^{x}=d \mid C_1=c_1, X=x, S^{x}=1) \\
&= \Pr(D=d \mid C_1=c_1, X=x, S=1)
\end{aligned} \tag{7}$$

by exchangeability, the cohort condition, and consistency. Any measure of causal effect based on conditional risks of disease given C1 in the study sample equals the same measure based on potential outcomes in the eligible population (up to random variation).

Under the descriptive cohort condition, we have

$$\Pr(D=d \mid C=c, X=x) = \Pr(D=d \mid C=c, X=x, S=1). \tag{8}$$

Any measure of association based on conditional risks of disease given C in the study sample equals the same measure in the eligible population (up to random variation).

For the case–control conditions, assume D is binary and that we are comparing two levels of exposure that we call X=1 and X=0 without loss of generality. If the analytic case–control condition holds and we intervene to set X=x, the conditional odds of disease given C1 is

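$$\begin{aligned}
\frac{\Pr(D^{x}=1 \mid C_1=c_1)}{\Pr(D^{x}=0 \mid C_1=c_1)}
&= \frac{\Pr(D^{x}=1 \mid C_1=c_1, X=x)}{\Pr(D^{x}=0 \mid C_1=c_1, X=x)} \\
&= \frac{\Pr(D^{x}=1 \mid C_1=c_1, X=x, C_2^{x}=c_2)}{\Pr(D^{x}=0 \mid C_1=c_1, X=x, C_2^{x}=c_2)} \\
&= \frac{\Pr(D=1 \mid C_1=c_1, X=x, C_2=c_2)}{\Pr(D=0 \mid C_1=c_1, X=x, C_2=c_2)}
\end{aligned} \tag{9}$$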

by exchangeability, the conditional independence in equation (4), and consistency. By Bayes’s rule, the final odds equals

$$\frac{\Pr(X=x \mid C_1=c_1, C_2=c_2, D=1)}{\Pr(X=x \mid C_1=c_1, C_2=c_2, D=0)} \times \frac{\Pr(D=1 \mid C_1=c_1, C_2=c_2)}{\Pr(D=0 \mid C_1=c_1, C_2=c_2)}. \tag{10}$$

Because the second term does not depend on X, it cancels out of the conditional causal odds ratio for disease given C1. This causal odds ratio equals the conditional odds ratio for exposure given C1 and C2 in the eligible population:

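$$\frac{\Pr(D^{1}=1 \mid C_1=c_1) \,/\, \Pr(D^{1}=0 \mid C_1=c_1)}{\Pr(D^{0}=1 \mid C_1=c_1) \,/\, \Pr(D^{0}=0 \mid C_1=c_1)} = \frac{\Pr(X=1 \mid C_1=c_1, C_2=c_2, D=1) \,/\, \Pr(X=0 \mid C_1=c_1, C_2=c_2, D=1)}{\Pr(X=1 \mid C_1=c_1, C_2=c_2, D=0) \,/\, \Pr(X=0 \mid C_1=c_1, C_2=c_2, D=0)}. \tag{11}$$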

Each component of the conditional odds ratio for exposure in the eligible population can be estimated using data from the study sample because

$$\begin{aligned}
\Pr(X=x \mid C_1=c_1, C_2=c_2, D=d) &= \Pr(X=x \mid C_1=c_1, C_2=c_2, D=d, S^{d}=1) \\
&= \Pr(X=x \mid C_1=c_1, C_2=c_2, D=d, S=1)
\end{aligned} \tag{12}$$

by the case–control condition and consistency. Thus, the causal odds ratio for disease given C1 in the eligible population equals the conditional odds ratio for exposure given C1 and C2 in the study sample (up to random variation).

Under the descriptive case–control condition,

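$$\frac{\Pr(D=1 \mid X=1, C=c) \,/\, \Pr(D=0 \mid X=1, C=c)}{\Pr(D=1 \mid X=0, C=c) \,/\, \Pr(D=0 \mid X=0, C=c)} = \frac{\Pr(X=1 \mid D=1, C=c) \,/\, \Pr(X=0 \mid D=1, C=c)}{\Pr(X=1 \mid D=0, C=c) \,/\, \Pr(X=0 \mid D=0, C=c)}. \tag{13}$$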

The latter odds ratio can be estimated using the study sample because

$$\Pr(X=x \mid C=c, D=d) = \Pr(X=x \mid C=c, D=d, S=1) \tag{14}$$

by the descriptive case–control condition. Thus, the conditional odds ratio for disease given C in the eligible population equals the conditional odds ratio for exposure given C in the study sample (up to random variation). □
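The descriptive case–control argument in equations (13) and (14) can be checked numerically. The sketch below assumes a hypothetical eligible population with binary C, X, and D in which selection depends on D and C but not on X. The probabilities and helper names are illustrative assumptions.

```python
from itertools import product

P_C1 = 0.3                                     # Pr(C = 1), hypothetical
P_X1 = {0: 0.2, 1: 0.5}                        # Pr(X = 1 | C)
RISK = {(0, 0): 0.02, (1, 0): 0.06,            # Pr(D = 1 | X, C), keyed (x, c)
        (0, 1): 0.10, (1, 1): 0.20}
P_S1 = {(1, 0): 0.90, (1, 1): 0.80,            # Pr(S = 1 | D, C), keyed (d, c):
        (0, 0): 0.05, (0, 1): 0.10}            # cases are sampled far more often than controls

def table(c, selected):
    """2x2 table of joint probabilities keyed (x, d) within C = c."""
    cells = {}
    for x, d in product((0, 1), repeat=2):
        pr = P_C1 if c else 1 - P_C1
        pr *= P_X1[c] if x else 1 - P_X1[c]
        pr *= RISK[x, c] if d else 1 - RISK[x, c]
        if selected:
            pr *= P_S1[d, c]                   # selection depends on D and C, not on X
        cells[x, d] = pr
    return cells

def odds_ratio(c, selected):
    n = table(c, selected)
    return (n[1, 1] * n[0, 0]) / (n[1, 0] * n[0, 1])

def risk_ratio(c, selected):
    n = table(c, selected)
    risk1 = n[1, 1] / (n[1, 1] + n[1, 0])
    risk0 = n[0, 1] / (n[0, 1] + n[0, 0])
    return risk1 / risk0

for c in (0, 1):
    print(c, odds_ratio(c, False), odds_ratio(c, True))   # identical within each stratum
print(risk_ratio(0, False), risk_ratio(0, True))           # risk ratio is not preserved
```

Within each stratum of C, the odds ratio in the study sample equals the odds ratio in the eligible population because the sampling fractions cancel. The conditional risk ratio does not survive selection (3.0 versus about 1.99 in the C=0 stratum under these assumed numbers), which is why risks from such a sample require external information about the eligible population.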

The cohort and case–control conditions have different implications for the estimation of causal effects or associations. The cohort condition allows any measure based on conditional risks of disease to be calculated, including marginal measures based on standardization. The case–control condition allows only conditional odds ratios to be estimated. With case–control or case–cohort data, calculating conditional risks of disease requires external information about the eligible population.25

Selection Bias and Adjustment

Selection bias under the potential outcomes approach does not imply selection bias under the traditional definition. Under certain conditions, adjustment can leave a measure of association unchanged.22 Adjustment for C will alter a measure of association if there is effect modification by C or noncollapsibility. When we must condition on C to avoid selection bias, it is important to measure the covariates in C to check these conditions even if adjustment proves unnecessary.
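Noncollapsibility can be seen in a short calculation. The sketch below uses hypothetical risks in which X is independent of C, so there is no confounding by C, and the conditional odds ratio is the same in both strata of C. All numbers are illustrative assumptions.

```python
P_C1 = 0.5                                   # Pr(C = 1), hypothetical; X is independent of C
RISK = {(0, 0): 0.2, (1, 0): 0.5,            # Pr(D = 1 | X = x, C = c), keyed (x, c)
        (0, 1): 0.5, (1, 1): 0.8}

def odds(p):
    return p / (1 - p)

def marginal_risk(x):
    """Risk under exposure level x averaged over the distribution of C."""
    return (1 - P_C1) * RISK[x, 0] + P_C1 * RISK[x, 1]

conditional_or = [odds(RISK[1, c]) / odds(RISK[0, c]) for c in (0, 1)]
marginal_or = odds(marginal_risk(1)) / odds(marginal_risk(0))

conditional_rd = [RISK[1, c] - RISK[0, c] for c in (0, 1)]
marginal_rd = marginal_risk(1) - marginal_risk(0)

print(conditional_or, marginal_or)   # odds ratio changes on adjustment
print(conditional_rd, marginal_rd)   # risk difference does not
```

With these assumed risks, the odds ratio is 4 within each stratum of C but about 3.45 after averaging over C, even though C is neither a confounder nor an effect modifier on the odds ratio scale. The risk difference is 0.3 both within strata and marginally, so adjustment leaves it unchanged.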

Lu et al.26 proposed a classification of analytic selection bias for the risk difference and risk ratio into forms that occur due to conditioning on a collider (or a descendant of a collider) and forms that occur due to conditioning on an effect measure modifier. Because both measures are collapsible, these categories probably account for almost all selection bias where adjustment of the risk difference or risk ratio is necessary. The potential outcomes definition is nonparametric, so it identifies selection bias whenever it is not precluded by the structure of the causal DAG in the eligible population.

Analytic Versus Descriptive Selection Bias

The conditions given in Didelez et al.23 for the collapsibility of an X-D odds ratio over S do not ensure that an X-D odds ratio that collapses over S is causal.21 This is a special case of the fact that analytic selection bias can occur when there is no descriptive selection bias. However, no analytic selection bias implies no descriptive selection bias.

Theorem 3. No analytic selection bias implies no descriptive selection bias, but analytic selection bias can occur without descriptive selection bias.

Proof. Assume the analytic cohort condition holds given a set of variables C1. By consistency and equation (7), we have:

$$\begin{aligned}
\Pr(D=1 \mid C_1=c_1, X=x) &= \Pr(D^{x}=1 \mid C_1=c_1, X=x) \\
&= \Pr(D=1 \mid C_1=c_1, X=x, S=1)
\end{aligned} \tag{15}$$

If we let C=C1, this is equivalent to equation (8), which guarantees no descriptive selection bias.

Now assume the analytic case–control condition holds given C1 and C2. If we let $C = C_1 \cup C_2$ (i.e., the union of C1 and C2), then equation (12) is equivalent to equation (14), which guarantees no descriptive selection bias.

In the DAG at the top of Figure 2, the descriptive cohort condition holds given B. However, the analytic cohort condition fails because conditioning on A is required to block the backdoor path from X to D. The analytic case–control condition also fails, so there is analytic selection bias given B. In the DAG at the top of Figure 3, the descriptive case–control condition holds given {B,V}. However, the analytic case–control condition fails because conditioning on A is required to block the backdoor path from X to D. The analytic cohort condition also fails, so there is analytic selection bias given {B,V}. □

APPLICATIONS TO STUDY DESIGN AND ANALYSIS

The potential outcomes approach to selection bias provides a novel perspective on the variety of mechanisms that can generate selection bias, and it correctly handles cases where adjustment is necessary for generalization to the eligible population even though the unadjusted causal effect or association is valid within the study sample. It simplifies the analysis of matched studies and case–cohort studies by eliminating the need to consider the cancellation of associations along different paths in a DAG.27 The eAppendix (http://links.lww.com/EDE/C59) contains examples implemented in R.28

Selection Bias Due to a Collider at X

Measures of association based on the risk of disease condition on X when calculating risks within exposure groups. When there is an open backdoor path from X to D, this conditioning can cause descriptive selection bias. Figure 6 shows an example. The descriptive case–control condition fails because of the arrow from X to S in the underlying causal DAG. In the descriptive cohort condition, conditioning on the collider at X opens the path Sx-V-X-U-Dx. However, $S^{x} \perp\!\!\!\perp D^{x} \mid (X, C)$, where C={U}, C={V}, or C={U,V}. The descriptive cohort condition holds given any of these sets.

FIGURE 6.


Single-world intervention graph showing selection bias caused by a collider at X. The box around X represents conditioning on X to check the cohort condition. The edge from X to D is dashed because the example does not depend on a causal path from X to D.

The analytic cohort condition holds only given {U} or {U,V} because conditioning on U is necessary to close the backdoor path X-U-D. Conditioning on {V} controls descriptive selection bias but not confounding, so the conditional association between X and D given V is the same in the study sample and the eligible population but differs systematically from the conditional causal effect of X on D given V in the eligible population. This form of selection bias requires a backdoor path from X to D, so it falls outside the range of selection bias considered by Hernán et al.8
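The collider-at-X mechanism can be illustrated with an exact calculation. Because the figure itself is not reproduced here, the sketch below assumes a graph reconstructed from the description in the text (U and V both affect X, U affects D, X affects D, and V and X both affect S). The probabilities and function names are hypothetical.

```python
from itertools import product

def bern(p, value):
    """Probability that a Bernoulli(p) variable equals value (0 or 1)."""
    return p if value else 1 - p

def joint():
    """Yield (u, v, x, d, s, probability) for the assumed graph
    U -> X <- V, U -> D, X -> D, V -> S, X -> S."""
    for u, v, x, d, s in product((0, 1), repeat=5):
        pr = bern(0.5, u) * bern(0.5, v)
        pr *= bern(0.1 + 0.4 * u + 0.4 * v, x)
        pr *= bern(0.1 + 0.2 * x + 0.4 * u, d)
        pr *= bern(0.2 + 0.3 * x + 0.4 * v, s)
        yield u, v, x, d, s, pr

def risk(**given):
    """Pr(D = 1 | given), where given may fix any of u, v, x, s."""
    num = den = 0.0
    for u, v, x, d, s, pr in joint():
        values = {"u": u, "v": v, "x": x, "s": s}
        if all(values[name] == val for name, val in given.items()):
            den += pr
            num += pr * d
    return num / den

def causal_risk(x):
    """Pr(D^x = 1 | V = v), which is the same for both v because U and V are independent."""
    return sum(bern(0.5, u) * (0.1 + 0.2 * x + 0.4 * u) for u in (0, 1))

print(risk(x=1), risk(x=1, s=1))            # differ: selection bias given X alone
print(risk(x=1, v=1), risk(x=1, v=1, s=1))  # equal: V controls descriptive selection bias
print(risk(x=1, u=1), risk(x=1, u=1, s=1))  # equal: U also controls it
print(risk(x=1, v=1), causal_risk(1))       # differ: conditioning on V leaves confounding by U
```

Given X alone, the risk of disease differs between the study sample and the eligible population. Conditioning on either V or U removes that difference, but only conditioning on U makes the conditional risk match the risk under intervention, in line with the distinction between descriptive and analytic selection bias drawn above.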

Selection Bias Due to a Collider at D

Measures of association in case–control studies condition on D when calculating exposure prevalences among cases and controls. Figure 7 shows selection bias caused by a collider at D in a case–control study. The analytic and descriptive cohort conditions fail because of the arrow from D to S in the underlying causal DAG. In the analytic and descriptive case–control conditions, conditioning on the collider at D opens the path Sd-L-D-X. This path can be blocked by conditioning on L, so the descriptive case–control condition holds given {L}. The variable L can be in C1 because it is not a descendant of X. It cannot be in C2 because $L^{x} = L \not\perp\!\!\!\perp D^{x} \mid X$. Thus, the analytic case–control condition holds given C1={L} and $C_2 = \varnothing$ (the empty set). This form of selection bias cannot occur if there is no causal path and no open backdoor path from X to D, so it also falls outside the range of selection bias considered by Hernán et al.8

FIGURE 7.


Single-world intervention graph showing selection bias caused by a collider at D. The box around D represents conditioning on D to check the case–control condition. This example also works if the causal path from X to D is replaced by an open backdoor path.

Randomized Clinical Trials

Figure 8 shows a randomized clinical trial in which there is a common cause C of disease and selection (e.g., selection into the study is based on risk factors for disease). The arrow from S to X exists because selection into the trial affects the probability of treatment, which might be near zero outside the study sample. Although randomization of X prevents a backdoor path from X to D in the causal DAG for the study sample, there is a backdoor path X-S-C-D in the causal DAG for the eligible population.

FIGURE 8.


Single-world intervention graph for setting X=x in the eligible population for a randomized clinical trial, where S precedes assignment of X and there is a common cause of S and D. When selection is related to risk factors for disease, there is a backdoor path from X to D in the eligible population even if X is randomized.

The analytic and descriptive cohort conditions both hold given C. In the eligible population, randomization of X ensures that:

  • All backdoor paths from X to D are blocked by conditioning on S, so the effect of X on D is unconfounded within the study sample.

  • Because $C \perp\!\!\!\perp X \mid S$, all treatment groups have the same distribution of C. Thus, a crude measure of association between X and D is implicitly standardized to the distribution of C in the study sample.

Therefore, a randomized trial provides a valid estimate of the causal effect of X on D in the study sample without adjustment for C. However, generalization to the eligible population can require adjustment for C if there is effect modification or noncollapsibility.22
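The gap between validity within the trial and generalization to the eligible population can be quantified in a toy example. The sketch below assumes a binary risk factor C that increases enrollment and modifies the effect of X on the risk difference scale. All probabilities and names are hypothetical.

```python
P_C1 = 0.5                                   # Pr(C = 1) in the eligible population
P_S1 = {0: 0.2, 1: 0.8}                      # Pr(S = 1 | C): high-risk individuals enroll more often
RISK = {(0, 0): 0.10, (1, 0): 0.15,          # Pr(D^x = 1 | C = c), keyed (x, c)
        (0, 1): 0.30, (1, 1): 0.60}

def pr_c(c, selected):
    """Pr(C = c) in the eligible population, or Pr(C = c | S = 1) in the trial."""
    if not selected:
        return P_C1 if c else 1 - P_C1
    weights = {k: (P_C1 if k else 1 - P_C1) * P_S1[k] for k in (0, 1)}
    return weights[c] / (weights[0] + weights[1])

def risk_difference(selected):
    """Stratum-specific risk differences averaged over the chosen distribution of C."""
    return sum(pr_c(c, selected) * (RISK[1, c] - RISK[0, c]) for c in (0, 1))

# With X randomized inside the trial, the crude comparison of arms estimates the
# risk difference standardized to the trial's own distribution of C.
trial_rd = risk_difference(selected=True)
# Reweighting the same stratum-specific differences to the eligible population.
population_rd = risk_difference(selected=False)

print(trial_rd, population_rd)
```

Under these assumed numbers the crude trial risk difference is 0.25, a valid causal effect for the trial participants, while the effect in the eligible population is 0.175. Standardizing the stratum-specific trial results to the population distribution of C gives the population value, which requires C to have been measured.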

Matched Cohort Studies

Figure 9 shows a cohort study matched on a confounder C. The analytic cohort condition holds given C because $S^{x} \perp\!\!\!\perp D^{x} \mid (X, C)$, so the descriptive cohort condition also holds. If matching ensures that the distribution of C is the same in all exposure groups in the study sample, then $C \perp\!\!\!\perp X \mid S$ even though C and X are d-connected given S. In this case, a crude measure of association based on disease risks is implicitly standardized to a common distribution of C, so no adjustment for C is needed to estimate the marginal causal effect of X within the study sample. If matching ensures that the distribution of C in each exposure group in the study sample matches the distribution of C among the exposed in the eligible population, then this unadjusted estimate corresponds to the average treatment effect among the treated in the eligible population. Generalization requires adjustment for C if there is effect modification or noncollapsibility and a marginal causal effect is being estimated for a different distribution of C than in the study sample.22

FIGURE 9.


Causal directed acyclic graph (DAG) and single-world intervention graph (SWIG) for setting X=x in the eligible population for a cohort study matched on a confounder C. Because of matching, all exposure groups have the same distribution of C in the study sample even though X and C are d-connected in the DAG.

Matched Case–Control Studies

Figure 10 shows a case–control study matched on a confounder C. Matching ensures that the distribution of C is equal in the case and control groups, not in the exposure groups, so the crude odds ratio in the study sample does not have a causal interpretation. The conditional odds ratios given C have a causal interpretation, and they are identical (up to random variation) in the study sample and the eligible population.

FIGURE 10.


Causal directed acyclic graph (DAG) and single-world intervention graph (SWIG) for setting D=d in the eligible population for a case–control study matched on a confounder C. Because of matching, the case and control groups—not the exposure groups—have the same distribution of C.

Case–Cohort Studies

Figure 11 shows a causal DAG and corresponding SWIGs for a case–cohort study. S1 indicates selection into the underlying cohort, and S2 indicates selection into the subcohort or becoming a case (or both). For selection into the cohort, we have $S_1^{x} \perp\!\!\!\perp D^{x} \mid X$, so the analytic cohort condition holds. For selection into the study sample, the cohort condition fails because of the edge from D to S2, but the analytic case–control condition holds with $C_1 = \varnothing$ and $C_2 = \{S_1\}$. Because the cohort condition does not hold for S2, the exposure odds ratio comparing cases and controls must be used to estimate the causal effect of exposure on disease. The need to condition on S1 to control selection bias for S2 implies that cases from outside the cohort must be excluded if the cohort was selected based on exposure.

FIGURE 11.


Directed acyclic graph (DAG) and single-world intervention graphs (SWIGs) for the eligible population in a case–cohort study. S1 indicates selection into the underlying cohort, and S2 indicates selection into the subcohort or becoming a case (or both).

DISCUSSION

The potential outcomes approach to selection bias is nonparametric, captures all selection bias under the structural approach of Hernán et al.8 and the traditional definition, and can be adapted to both analytic and descriptive epidemiology. It is an important practical application of SWIGs, and it provides a unified analysis of confounding and selection bias in analytic epidemiology. We hope to extend this approach to studies with time-dependent confounding and complex censoring patterns.29–31

We have assumed throughout that our DAGs completely represent causal relationships in the eligible population. Causal relationships in a different population might not be represented by the same DAGs. Thus, the absence of selection bias does not guarantee that an estimated causal effect or association can be generalized to a population containing individuals who were not eligible for the study. It guarantees generalizability but not transportability,1 which is a strong argument for the inclusive recruitment of study participants in epidemiology.

EAPPENDIX

Implementations in R28 of examples from Figures 2 to 10 (text file).

ACKNOWLEDGMENTS

I would like to thank Miguel Hernán, Forrest Crawford, Sander Greenland, and Patrick Schnell for their useful comments.

Supplementary Material

ede-34-865-s001.r (22.7KB, r)

Footnotes

Supported by National Institute of Allergy and Infectious Diseases (NIAID) grants R01 AI116770 and U01 AI169375 and National Institute of General Medical Sciences (NIGMS) grant U54 GM111274. The content is solely the responsibility of the author and does not represent the official views or policies of NIAID, NIGMS, or the National Institutes of Health.

The authors report no conflicts of interest.

Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article (www.epidem.com).

The previously submitted version of this article was posted on arXiv (https://arxiv.org/abs/2008.03786).

The data and code required to replicate the results are included as an R file in the Supplemental Digital Content http://links.lww.com/EDE/C59.

REFERENCES

  • 1. Dahabreh IJ, Hernán MA. Extending inferences from a randomized trial to a target population. Eur J Epidemiol. 2019;34:719–722.
  • 2. Berkson J. Limitations of the application of fourfold table analysis to hospital data. Biometrics. 1946;2:47–53.
  • 3. Hernán MA. Invited commentary: selection bias without colliders. Am J Epidemiol. 2017;185:1048–1050.
  • 4. Miettinen OS, Cook EF. Confounding: essence and detection. Am J Epidemiol. 1981;114:593–603.
  • 5. Greenland S, Robins JM. Identifiability, exchangeability, and epidemiological confounding. Int J Epidemiol. 1986;15:413–419.
  • 6. Wickramaratne PJ, Holford TR. Confounding in epidemiologic studies: the adequacy of the control group as a measure of confounding. Biometrics. 1987;43:751–765.
  • 7. Greenland S, Robins JM, Pearl J. Confounding and collapsibility in causal inference. Stat Sci. 1999a;14:29–46.
  • 8. Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology. 2004;15:615–625.
  • 9. Pearl J. Causal diagrams for empirical research. Biometrika. 1995;82:702–710.
  • 10. Greenland S, Pearl J, Robins J. Causal diagrams for epidemiologic research. Epidemiology. 1999b;10:37–48.
  • 11. Greenland S. Response and follow-up bias in cohort studies. Am J Epidemiol. 1977;106:184–187.
  • 12. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701.
  • 13. Richardson TS, Robins JM. Single world intervention graphs: A primer. In Second UAI Workshop on Causal Structure Learning, Bellevue, Washington. Association for Uncertainty in Artificial Intelligence; 2013a.
  • 14. Richardson TS, Robins JM. Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality. Center for the Statistics and the Social Sciences, University of Washington Series. Working Paper, 128(30):2013, 2013b.
  • 15. Dawid AP. Conditional independence in statistical theory. J R Stat Soc Series B. 1979;41:1–15.
  • 16. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55.
  • 17. Daniel RM, Kenward MG, Cousens SN, De Stavola BL. Using causal diagrams to guide analysis in missing data problems. Stat Methods Med Res. 2012;21:243–256.
  • 18. Miettinen OS. The “case-control” study: valid selection of subjects. J Chronic Dis. 1985;38:543–548.
  • 19. Wacholder S, McLaughlin JK, Silverman DT, Mandel JS. Selection of controls in case-control studies: I. principles. Am J Epidemiol. 1992;135:1019–1028.
  • 20. Mansournia MA, Greenland S. The relation of collapsibility and confounding to faithfulness and stability. Epidemiology. 2015;26:466–472.
  • 21. Bareinboim E, Pearl J. Controlling selection bias in causal inference. In Artificial Intelligence and Statistics, 100–108. PMLR, 2012.
  • 22. Greenland S, Pearl J. Adjustments and their consequences—collapsibility analysis using graphical models. Int Stat Rev. 2011;79:401–426.
  • 23. Didelez V, Kreiner S, Keiding N. Graphical models for inference under outcome-dependent sampling. Stat Sci. 2010;25:368–387.
  • 24. Howe CJ, Cole SR, Lau B, Napravnik S, Eron JJ, Jr. Selection bias due to loss to follow up in cohort studies. Epidemiology. 2016;27:91–97.
  • 25. Miettinen O. Estimability and estimation in case-referent studies. Am J Epidemiol. 1976;103:226–235.
  • 26. Lu H, Cole SR, Howe CJ, Westreich D. Toward a clearer definition of selection bias when estimating causal effects. Epidemiology. 2022;33:699–706.
  • 27. Mansournia MA, Hernán MA, Greenland S. Matched designs and causal diagrams. Int J Epidemiol. 2013;42:860–869.
  • 28. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria. R Foundation for Statistical Computing, 2023. Available at: https://www.R-project.org/.
  • 29. Robins J. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Math Model. 1986;7:1393–1512.
  • 30. Robins JM, Blevins D, Ritter G, Wulfsohn M. G-estimation of the effect of prophylaxis therapy for Pneumocystis carinii pneumonia on the survival of AIDS patients. Epidemiology. 1992;3:319–336.
  • 31. Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–560.

Articles from Epidemiology (Cambridge, Mass.) are provided here courtesy of Wolters Kluwer Health
