Abstract
Using administrative patient-care data such as Electronic Health Records (EHR) and medical/pharmaceutical claims for population-based scientific research has become increasingly common. With vast sample sizes leading to very small standard errors, researchers need to pay more attention to potential biases in the estimates of association parameters of interest, specifically to biases that do not diminish with increasing sample size. Among these sources of bias, we focus in this paper on understanding selection bias. We present an analytic framework using directed acyclic graphs to guide applied researchers in dissecting how different sources of selection bias may affect estimates of the association between a binary outcome and an exposure (continuous or categorical) of interest. We consider four easy-to-implement weighting approaches to reduce selection bias, with accompanying variance formulae. We demonstrate through a simulation study when these approaches can rescue us in practice, together with an analysis of real-world data. We compare these methods using a data example where our goal is to estimate the well-known association of cancer and biological sex, using EHR from a longitudinal biorepository at the University of Michigan Healthcare system. We provide annotated R code to implement these weighted methods with associated inference.
Keywords: calibration, directed acyclic graphs, inverse probability weighting, Michigan Genomics Initiative, nonprobability sample, poststratification
1. Introduction
Massive amounts of data are routinely collected in health care clinics for administrative and billing purposes. Longitudinally varying time-stamped observational patient care data such as electronic health records (EHRs) allow researchers from various disciplines to run agnostic queries (Denny et al., 2013; Hoffmann et al., 2017) or validate hypothesis driven questions in large databases (Roberts et al., 2022; Shen et al., 2022). However, these observational studies pose several practical challenges for health research which can negatively impact internal validity and external generalizability of the results (Beesley, Fritsche, et al., 2020). Without properly accounting for potential sources of biases and study design issues, association analysis using these data can result in spurious findings (Madigan et al., 2014) and misguided policies (Wang and Wright, 2020). One major challenge in removing or reducing bias in these studies lies in the fact that there can be several potential causes of bias that may be simultaneously at play in an analysis done with a given real-world dataset. With larger datasets at researchers’ fingertips, the impact of bias relative to variance becomes even more pronounced. This phenomenon has recently been termed as the ‘curse of large n’ (Bradley et al., 2021; Kaplan et al., 2014). The common sources of biases related to EHR studies do not disappear with increased sample size and thus with increased precision comes the increased possibility of achieving incorrect inference. This is also termed as the ‘big data paradox’ (Meng et al., 2018). In studies with large n, bias often dominates the mean-squared error of an estimator and thus we need to update our statistical thinking to focus on strategies for reducing bias as opposed to the classical thinking around reducing variance.
Given a scientific question and access to a potentially large and messy database, we first need to define a target population of inference. A careful investigator then needs to think about the possible sources of bias that are most critical for the underlying question at hand. Selection bias, missing data, clinically informative patient encounter process, confounding, lack of consistent data harmonization across cohorts, true heterogeneity of the studied populations, registration of start time or definition of time zero, and misclassification bias due to imperfect phenotyping are some of the most common sources of bias in EHR. An overview of the different kinds of biases mentioned above, with relevant references, is given in Table 1.
Table 1.
Different types of biases in EHR studies other than selection bias along with their description and relevant literature to reduce the corresponding bias
| Type of Bias | Definition | Literature | Software |
|---|---|---|---|
| Imperfect Phenotyping | Major bias in EHR studies. Misclassification of derived disease phenotypes. Overreporting and underreporting can both occur. Underreporting is the primary source. | Neuhaus (1999) and Beesley, Salvatore, et al. (2020); Tong et al. (2020) and Chen et al. (2019); Yin et al. (2022) and Liu et al. (2022); Huang et al. (2018) | SAMBA |
| Missing Data | Lack of routine checkup. Loss of follow-up reasons. | Hot Deck: Madow et al. (1983); Tree-based methods: Doove et al. (2014); Expectation–Maximization Algorithm: Dempster et al. (1977); IPW and AIPW techniques: Seaman and Vansteelandt (2018); Full information maximum likelihood: Marcoulides and Schumacker (2013); Multiple imputation: Rubin (2004); Pattern mixture models: Little (1993); Heckman imputation: Galimard et al. (2016) | MICE, MissForest |
| Confounding | Direct cause of both exposure and the response. Major challenge caused by unmeasured confounders. | Toh et al. (2011) and Sun et al. (2022); Negative and Double Negative Controls: Shi et al. (2020) and Lipsitch et al. (2010) | |
| Lack of data harmonization across cohorts | Integrating disparate data of various sources and formats. Different clinics recruit patients using varying selection criteria. Misclassification of phenotypes differs across cohorts. | Almeida et al. (2021) and Abbasizanjani et al. (2023); Fu et al. (2020), Glynn and Hoffman (2019) and Zawistowski et al. (2023) | meta |
| Heterogeneity of studied populations | Systematic differences between the population characteristics or sampling mechanisms | Beesley, Salvatore, et al. (2020) | meta |
In this paper, we focus on understanding and tackling one major source of bias, namely selection bias in administrative healthcare data. The selection mechanism underlying the question ‘Who is in my study sample?’ may vary widely across the different sources of real-world data. For example, in using EHRs in the USA, where there is no universal healthcare or nationally integrated clinical data warehouse, one challenge is understanding factors that influence selection into a given study such as health care seeking behaviour and insurance coverage (Haneuse and Daniels, 2016; Heart et al., 2017; Heintzman et al., 2015; Rexhepi et al., 2021). Population-based biobanks such as the UK Biobank that are based on invitation to volunteers can lead to specific types of biases such as healthy control bias (Fry et al., 2017). Nationally representative studies such as the NIH All of Us often have a purposeful sampling strategy that leads to, say, oversampling certain underrepresented subgroups (All Of Us Research Programs Investigators, 2019). In contrast, medical centre and health system based studies attempt to recruit patients meeting specific criteria within the health system, often through multiple disease/treatment clinics. This leads to enrichment of certain diseases in the study sample (Pendergrass et al., 2011; Zawistowski et al., 2023). In addition, there is nonresponse and consenting bias among those who are approached to participate in the study. Since the process of selection into each study is unique and often unknown, conventional survey sampling techniques to handle probability samples with known sampling/survey weights are not generally applicable for such type of observational data which can be predominantly considered as nonprobability samples (samples where selection probabilities are unknown) (Beesley and Mukherjee, 2022b; Chen et al., 2020).
If the issue of selection bias is ignored, it can negatively impact downstream inference (Christensen et al., 1992; Kleinbaum et al., 1981). Due to unknown selection weights, naive inference from these nonprobability samples is generally not directly transportable to the target population. On the other hand, it is important to know when the selection process can be ignored and we can proceed with straightforward naive analysis. A structural framework to study selection bias using directed acyclic graphs (DAGs) was introduced in Hernán et al. (2004). We use this approach to study some common scenarios of selection mechanisms and their effects on estimates of association between a binary disease outcome and an exposure of interest (after adjusting for a set of confounders/covariates) specifically for real-world data. We consider a logistic regression model as the underlying disease outcome model.
After dissecting the selection mechanism to the best of our ability, we need to think about methods that are available to account for selection bias. Some of these methods rely on having individual-level data from an external probability sample. Chen et al. (2020) adopted the method of pseudolikelihood-based estimating equations to account for selection bias in estimating the population mean of a response variable in nonprobability samples using individual-level data from an external probability sample. On the other hand, a beta regression generalized linear model (glm) (Ferrari and Cribari-Neto, 2004) was used to estimate selection probabilities in Elliot (2009) and Beesley and Mukherjee (2022b). When only summary-level information is available on an external probability sample, some methods in survey sampling, such as poststratification, raking, and calibration techniques as in Kim and Park (2010), Deville and Särndal (1992), and Montanari and Ranalli (2005), can be modified to reduce selection bias in nonprobability samples (Beesley and Mukherjee, 2022a, 2022b). We consider simulation settings reflecting common selection mechanisms represented by the DAGs and assess the bias-reduction properties of four of these weighting methods using the general framework of inverse-probability weighted (IPW) logistic regression. The methods differ in how the weights are constructed and what type of external data are required. We also present variance formulae associated with each weighting method.
Using EHR data from a longitudinal biorepository at the University of Michigan Healthcare system, the Michigan Genomics Initiative (MGI) and auxiliary data from a nationally representative probability sample study to define the selection weights, we illustrate how and when the weighted methods enable us to get closer to the truth compared to naive unweighted logistic regression.
The rest of the paper is organized as follows. In Section 2.1, we describe the study setting and four common types of selection DAGs. The expected extent of biases under different selection DAGs in a logistic regression outcome-exposure association model is studied using an analytical expression that relates the parameters of the true association model in overall population to the model restricted to the selected sub-population (without any adjustment for selection). Four variants of weighted logistic regression methods with individual or summary level external data (targeted to reduce selection bias in association parameters of interest) are described in Sections 2.3–2.5. We also present variance formulae for each method in Section 2.6. In Section 3, we conduct a simulation study comparing the four methods under different selection DAGs. In Section 4, we estimate the association between cancer and biological sex in the Michigan Genomics Initiative Data using the four IPW methods discussed in the previous sections with associated confidence intervals. We conclude with a brief discussion in Section 5.
2. Methods
2.1. Notation
Our main focus is on the relationship between a binary disease indicator D and a set of covariates $Z$ in a target population from which the internal nonprobability sample is drawn. Selection is denoted by a binary indicator $S$, which is assumed to be driven by a set of covariates and may also depend on D. Figure 1 summarizes the structures of the disease and selection models. $Z_1$ is the subset of $Z$ present only in the disease model, whereas $Z_2$ influences the disease indicator D and may influence the selection indicator S. $W$ denotes covariates present only in the selection model. The primary disease model of interest is:
Figure 1.
Figure depicting the disease and selection models along with the different variables present in both the models.
| $\operatorname{logit}\{P(D=1\mid Z_1,Z_2)\}=\theta_0+\theta_1 Z_1+\theta_2 Z_2$ | (1) |
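Selection into the internal sample is driven by a probability mechanism that is allowed to be completely nonparametric. Our desired target is $P(D=1\mid Z_1,Z_2)$ as in equation (1). However, we can only fit $P(D=1\mid Z_1,Z_2,S=1)$. One can relate the true model parameters to those of the naively fitted model (conditional on $S=1$) by the following key relationship:
| $\operatorname{logit}\{P(D=1\mid Z_1,Z_2,S=1)\}=\theta_0+\theta_1 Z_1+\theta_2 Z_2+\log\{r(Z_1,Z_2)\}$ | (2) |
where
$$r(Z_1,Z_2)=\frac{P(S=1\mid D=1,Z_1,Z_2)}{P(S=1\mid D=0,Z_1,Z_2)}$$
describes how disease predictors modify the selection mechanism. The derivation of the above expression is as follows. By Bayes' theorem, we obtain
$$P(D=1\mid Z_1,Z_2,S=1)=\frac{P(S=1\mid D=1,Z_1,Z_2)\,P(D=1\mid Z_1,Z_2)}{P(S=1\mid D=1,Z_1,Z_2)\,P(D=1\mid Z_1,Z_2)+P(S=1\mid D=0,Z_1,Z_2)\,P(D=0\mid Z_1,Z_2)}.$$
From equation (1) in Section 2.1, we know that $P(D=1\mid Z_1,Z_2)=e^{\theta_0+\theta_1Z_1+\theta_2Z_2}/(1+e^{\theta_0+\theta_1Z_1+\theta_2Z_2})$ and $P(D=0\mid Z_1,Z_2)=1/(1+e^{\theta_0+\theta_1Z_1+\theta_2Z_2})$. Therefore, dividing both the numerator and denominator by $P(S=1\mid D=0,Z_1,Z_2)\,P(D=0\mid Z_1,Z_2)$, we obtain
$$P(D=1\mid Z_1,Z_2,S=1)=\frac{r(Z_1,Z_2)\,e^{\theta_0+\theta_1Z_1+\theta_2Z_2}}{1+r(Z_1,Z_2)\,e^{\theta_0+\theta_1Z_1+\theta_2Z_2}}.$$
Taking the logit of both sides yields equation (2). Unless $r(Z_1,Z_2)$ is a constant function of $(Z_1,Z_2)$ (as in a population-based case-control study), estimates obtained from the naive unweighted logistic regression model of D on $Z_1$ and $Z_2$ based on just the internal data lead to biased estimates of $\theta_1$, $\theta_2$, or both. A common example of such predictor- and outcome-dependent selection bias is case-control studies where factors like education could influence the likelihood of volunteering as controls (Geneletti et al., 2009; Kleinbaum et al., 1981).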
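The offset relationship in equation (2) can be checked numerically. The sketch below is only an illustration under assumed values: the coefficients in `theta` and the selection mechanism `sel` are hypothetical and are not taken from the paper. It computes the logit of the disease probability in the selected sample directly from Bayes' theorem and compares it with the population linear predictor plus the log selection ratio.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def naive_logit(z1, z2, theta, sel_prob):
    """logit P(D=1 | z1, z2, S=1), computed directly from Bayes' theorem.

    theta = (theta0, theta1, theta2): population disease-model coefficients.
    sel_prob(d, z1, z2): a hypothetical P(S=1 | D=d, z1, z2).
    """
    p1 = expit(theta[0] + theta[1] * z1 + theta[2] * z2)  # P(D=1 | z1, z2)
    num = sel_prob(1, z1, z2) * p1
    den = num + sel_prob(0, z1, z2) * (1.0 - p1)
    return logit(num / den)

def selection_offset(z1, z2, sel_prob):
    """log r(z1, z2): log ratio of selection probabilities, cases vs non-cases."""
    return math.log(sel_prob(1, z1, z2) / sel_prob(0, z1, z2))

# Assumed, purely illustrative values.
theta = (-2.0, 0.5, 0.3)

def sel(d, z1, z2):
    # Selection depends on D and z2 (and their interaction), not on z1.
    return expit(-1.0 + 0.8 * d + 0.4 * z2 + 0.6 * d * z2)

points = [(0.0, 0.0), (1.0, -1.0), (-0.5, 2.0)]
for z1, z2 in points:
    lhs = naive_logit(z1, z2, theta, sel)
    rhs = theta[0] + theta[1] * z1 + theta[2] * z2 + selection_offset(z1, z2, sel)
    print(f"z1={z1:5.2f} z2={z2:5.2f}  naive logit={lhs:.6f}  theta'z + log r={rhs:.6f}")
```

Because the assumed `sel` varies with `z2` but not with `z1`, the offset changes with `z2` only, which is the situation in which the naive coefficient of the second covariate is distorted while that of the first is not.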
2.2. Selection DAGs
We study the extent of bias introduced by the additional term in equation (2) when we use naive logistic regression on the selected sample, namely . Note that,
| (3) |
We study the bias in the naive approach under some plausible DAGs with increasing levels of complexity in dependencies among D, S, , , and . We simplify the expression of in equation (3) under the different DAGs introduced in Figure 2.
Figure 2.
Selection DAGs representing some plausible relationships between different variables of interest: D (Disease Indicator), S (Selection Indicator into the internal sample), $Z_1$ (Predictors in the disease model only), $Z_2$ (Predictors both in disease and selection models) and $W$ (Predictors in the selection model only).
Example DAG 1: unbiased case
Under DAG 1 in Figure 2, the arrows from (, ) to do not exist. In addition, D does not directly affect . This implies, none of the disease model predictors and affect the selection mechanism. As shown in online supplementary material, Section S1.1.1, the expression of in equation (3) simplifies to a constant denoted by r:
In this case, estimates obtained from an unweighted logistic regression of D on and in the selected sample (conditional on ) are unbiased for and even without adjusting for any selection bias. However, the intercept term estimate is biased for with the bias being the offset term . This is equivalent to the results that are well known for a case-control study (Cornfield et al., 1959).
Example DAG 2: arrow induced bias for coefficient of
Under DAG 2 in Figure 2, we observe that there is an additional direct dependence from to compared to DAG 1. As shown in online supplementary material, Section S1.1.2, the expression for under this scenario reduces to,
With the introduction of additional arrow between and , the function depends on through but does not depend on . Consequently estimates from a naive logistic regression of leads to biased estimates of but not of . Similarly if is present and is absent, then using identical arguments, we obtain unbiased estimates for and biased for .
Example DAG 3: and induced bias for coefficients of and
Under DAG 3 in Figure 2, has a direct causal pathway to S which leads to increase in the strength of dependence between the selection and disease models when compared to DAG 2. As shown in online supplementary material, Section S1.1.3, the expression for in this case is,
Therefore, is a function of both and . The dependence on is through . The dependence on is through . The naive unweighted logistic regression method fails to provide unbiased estimates of both and . In case where exists but there is no arrow then estimates for will be biased and unbiased for .
Example DAG 4: strong dependence, increased bias for coefficients of and
DAG 4 in Figure 2 corresponds to a situation where the dependence between the selection and disease model is the most complex among the four selection DAGs we considered. As shown in online supplementary material, Section S1.1.4, the expression for in this case is given by,
Here, depends on via and . Consequently, the estimated coefficient of from a naive unweighted logistic regression of potentially becomes more biased for compared to the other DAGs conditioned on the fact that the strength of associations among the variables remain same across the different DAGs. However, the dependence of on is only through potentially leading to less bias in estimate of compared to estimate of .
Remark
The issue of correcting for selection bias becomes more challenging in our setting due to the joint dependence of S on both the disease indicator D and other covariates . If in fact there was no arrow , then conditioned on , all paths between D and S are blocked leading to d-separation of D and S in DAGs 1, 2, and 3, implying . This also implies . Thus, estimates from fitting the naive unweighted model in equation (2) are consistent for the true parameters, and . On the other hand for DAG 4, the estimates are biased since conditioned on the path is still unblocked.
Now that we have established that fitting a model on the selected sample namely, can generally (for example in DAGs 2, 3, and 4) lead to biased estimates of the true model parameters and in the target population, we consider four easy-to-use weighted logistic regression methods that address selection bias. The methods differ in terms of their construction of weights and the type of external data required.
2.3. Weighted logistic regression
In this section, we use the following notation. We assume that we have an internal nonprobability sample with selection indicator S and an external probability sample with selection indicator drawn from the same target population. Figure 3 is a schematic representation of the assumed scenario. The internal and external samples may or may not have overlap ( or 0, respectively, as in Figure 3).
Figure 3.
Figure depicting the relationship between the target population, internal nonprobability, and external probability samples. S and are the selection indicator variables of internal and external samples, respectively. is the selection indicator variable for a person present in both internal and external samples.
Inverse probability weighted (IPW) regression is a potential remedy to adjust for selection bias and obtain less biased estimates of parameters in the disease model (Beesley and Mukherjee, 2022b; Haneuse and Daniels, 2016). Let N be the size of the target population. Let $X$ denote the selection model covariates and $\pi(X)=P(S=1\mid X)$ denote the probability of selection into the internal sample. Therefore, the size of the internal nonprobability sample is given by $n=\sum_{i=1}^{N}S_i$. Let $Z=(1,Z_1,Z_2)$ denote the disease model covariates in equation (1), including the intercept, with parameters denoted by $\theta=(\theta_0,\theta_1,\theta_2)$. In IPW logistic regression, the estimating equations are given by
| $\dfrac{1}{N}\sum_{i=1}^{N}\dfrac{S_i}{\pi(X_i)}\left\{D_i-\dfrac{e^{\theta^{\top}Z_i}}{1+e^{\theta^{\top}Z_i}}\right\}Z_i=0$ | (4) |
where i indexes the $i$-th individual in the target population. The consistency of the estimate obtained as a solution to equation (4) with known $\pi(X_i)$ is presented in online supplementary material, Section S1.2.
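As an illustration of how a weighted score equation of the form (4) can be solved, the following minimal sketch runs Newton–Raphson for a logistic model with an intercept and a single covariate. The function names and the toy data are hypothetical; the weights `w` stand in for inverse selection probabilities that are assumed known. This is not the authors' R implementation.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - b[0] * a[1][0]) / det]

def weighted_score(theta, d, z, w):
    """Weighted score: sum_i w_i * (d_i - expit(theta0 + theta1 * z_i)) * (1, z_i)."""
    s = [0.0, 0.0]
    for di, zi, wi in zip(d, z, w):
        r = wi * (di - expit(theta[0] + theta[1] * zi))
        s[0] += r
        s[1] += r * zi
    return s

def ipw_logistic(d, z, w, n_iter=50, tol=1e-10):
    """Newton-Raphson solution of the weighted score equation."""
    theta = [0.0, 0.0]
    for _ in range(n_iter):
        s = weighted_score(theta, d, z, w)
        h = [[0.0, 0.0], [0.0, 0.0]]  # weighted information matrix
        for zi, wi in zip(z, w):
            p = expit(theta[0] + theta[1] * zi)
            v = wi * p * (1.0 - p)
            h[0][0] += v
            h[0][1] += v * zi
            h[1][0] += v * zi
            h[1][1] += v * zi * zi
        step = solve2(h, s)
        theta = [theta[0] + step[0], theta[1] + step[1]]
        if max(abs(step[0]), abs(step[1])) < tol:
            break
    return theta

# Toy selected sample with a binary covariate (hypothetical data).
d = [1, 0, 0, 1, 1, 0]
z = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
w = [2.0, 1.0, 1.0, 1.0, 2.0, 1.0]   # assumed inverse selection probabilities
theta_hat = ipw_logistic(d, z, w)
print(theta_hat)
```

With a binary covariate the model is saturated, so the solution reproduces the weighted disease proportions within each covariate level (here 2/4 and 3/4), which gives a simple way to sanity-check the solver.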
For nonprobability samples, the selection probability of individual i, given by $\pi(X_i)$, is unknown. Since there is no information available on individuals who are not selected into the internal study ($S=0$), the estimation of $\pi(X_i)$ requires some form of external information. Auxiliary external data are typically available in two forms: either individual-level data or summary-level statistics. Moreover, since the external sample is a probability sample drawn from the target population, we assume that we have access to the known sampling probabilities, say $\pi_{\text{ext}}$.
In Sections 2.4 and 2.5, we describe four methods to estimate the selection probabilities depending on the nature of available external information. All four methods adopt a two-step process: the first step involves obtaining estimates of the selection probabilities, $\hat{\pi}(X_i)$; the second step is estimation of disease model parameters using the weighted score equation (4) with $\pi(X_i)$ replaced by $\hat{\pi}(X_i)$. A summary of all the methods, including the unweighted method and the four weighted ones, is given in Table 2.
Table 2.
Short Summary of the five methods including the naive and the four weighted ones
| Class | Method | Reference | Description | Features | Software |
|---|---|---|---|---|---|
| Unweighted | Naive | | Unweighted logistic regression with D as response and $Z$ as predictors. | Estimates $\theta_1$ and $\theta_2$ without bias only when $r$ is free of $Z_1$ and $Z_2$, respectively. | glm |
| Individual External Patient Data | Simplex Regression (SR) | Barndorff-Nielsen and Jørgensen (1991), Elliot (2009) and Beesley and Mukherjee (2022b) | Weighted logistic regression with D as response and $Z$ as predictors. Using individual-level external data, the weights are obtained from simplex and multinomial regressions. | Bias in estimation of the disease model parameters increases with the complexity of the DAG. Differences between the predictors of the internal and external selection models lead to biased estimates for SR. | simplexreg, SAMBA |
| Individual External Patient Data | Pseudo Likelihood (PL) | Chen et al. (2020) | Weighted logistic regression with D as response and $Z$ as predictors. Using individual-level external data, the maximum likelihood estimating equations for the weights are approximated by quantities computed from the external data, resulting in pseudolikelihood estimating equations. | With correct specification of the selection model, the estimates are highly accurate for all set-ups. However, the method is highly sensitive to model misspecification. | nleqslv |
| Summary Level Statistics Data | Post Stratification (PS) | Beesley and Mukherjee (2022b) | Weighted logistic regression with D as response and $Z$ as predictors. The weights are obtained from external joint probabilities of discretized versions of the selection variables. | Fails to work for set-ups 2 and 3. However, the efficiency in terms of bias increases for set-up 4, and the method is least sensitive to model misspecification. | survey, SAMBA |
| Summary Level Statistics Data | Calibration (CL) | Wu (2003) | Weighted logistic regression with D as response and $Z$ as predictors. Using external marginal probabilities of the selection variables, the weights are estimated using estimating equations similar to the PL method. | With correct specification of the selection model, the estimates are highly accurate for all set-ups. However, the method is highly sensitive to model misspecification. | survey |
2.4. Estimation of weights using individual-level external data
In this subsection, we consider two methods to account for selection bias in the internal sample using individual-level data from an external probability sample. The first one is adaptation of a pseudolikelihood based estimating equation approach originally proposed in Chen et al. (2020) for estimation of population mean of a response variable. We modified this technique to our context. The second one is based on simplex regression method (Barndorff-Nielsen and Jørgensen, 1991), as an improvement over beta regression that has been previously used in this problem (Beesley and Mukherjee, 2022b; Elliot, 2009).
2.4.1. Pseudolikelihood-based estimating equation
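The selection indicator $S_i$ into the internal sample for the $i$-th individual in the population is a Bernoulli random variable with success probability $\pi(X_i)$. In this method, we assume a parametric model for $\pi(X_i)$ indexed by parameters $\alpha$, specified by the logistic form $\pi(X_i,\alpha)=e^{\alpha^{\top}X_i}/(1+e^{\alpha^{\top}X_i})$.
The likelihood function of $\alpha$ is given by
| $L(\alpha)=\prod_{i=1}^{N}\pi(X_i,\alpha)^{S_i}\{1-\pi(X_i,\alpha)\}^{1-S_i}$ | (5) |
Equivalently, the log likelihood is
| $\ell(\alpha)=\sum_{i=1}^{N}S_i\log\left\{\dfrac{\pi(X_i,\alpha)}{1-\pi(X_i,\alpha)}\right\}+\sum_{i=1}^{N}\log\{1-\pi(X_i,\alpha)\}$ | (6) |
The first term of the above equation only involves values of $X_i$ from the internal nonprobability sample. Ideally, the selection parameters would have been obtained by maximizing the log likelihood in equation (6); however, the second part of the log likelihood cannot be calculated solely based on the available data from the internal sample. This term also requires the values of $X_i$ for individuals outside the internal sample ($S_i=0$). Chen et al. (2020) provide an approximation to the log likelihood using the following expression:
| $\ell^{*}(\alpha)=\sum_{i=1}^{N}S_i\log\left\{\dfrac{\pi(X_i,\alpha)}{1-\pi(X_i,\alpha)}\right\}+\sum_{i=1}^{N}\dfrac{S_{\text{ext},i}}{\pi_{\text{ext},i}}\log\{1-\pi(X_i,\alpha)\}$ | (7) |
where $S_{\text{ext},i}$ and $\pi_{\text{ext},i}$ denote the selection indicator and the known sampling probability of individual i for the external sample. Since the exact sampling weights of the external probability sample are known, the second term of equation (7) is an unbiased estimator of the second term in equation (6). Using the logistic form of the internal selection model and differentiating equation (7) with respect to $\alpha$, we obtain the following estimating equation
| $\sum_{i=1}^{N}S_iX_i-\sum_{i=1}^{N}\dfrac{S_{\text{ext},i}}{\pi_{\text{ext},i}}\,\pi(X_i,\alpha)\,X_i=0$ | (8) |
The Newton–Raphson method is used to estimate $\alpha$ from the above equation. We obtain the estimates of internal selection probabilities, $\hat{\pi}(X_i)$, by plugging the estimates of $\alpha$ into the logistic functional form of the selection model.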
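A minimal sketch of the Newton–Raphson step for a pseudolikelihood estimating equation of this type is given below. It assumes a logistic selection model with an intercept and one selection covariate; the function name `pl_selection_model` and the toy inputs are hypothetical, and the external design probabilities `pi_ext` are assumed known. It illustrates the idea and does not reproduce the paper's R code.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - b[0] * a[1][0]) / det]

def pl_selection_model(x_int, x_ext, pi_ext, n_iter=100, tol=1e-10):
    """Solve  sum_internal (1, x_i) - sum_external (1/pi_ext_j) * pi(x_j, alpha) * (1, x_j) = 0
    for alpha, where pi(x, alpha) = expit(alpha0 + alpha1 * x)."""
    alpha = [0.0, 0.0]
    for _ in range(n_iter):
        g = [float(len(x_int)), float(sum(x_int))]   # internal-sample totals
        h = [[0.0, 0.0], [0.0, 0.0]]                 # minus the Jacobian of g
        for xj, pj in zip(x_ext, pi_ext):
            p = expit(alpha[0] + alpha[1] * xj)
            wj = 1.0 / pj                            # external design weight
            g[0] -= wj * p
            g[1] -= wj * p * xj
            v = wj * p * (1.0 - p)
            h[0][0] += v
            h[0][1] += v * xj
            h[1][0] += v * xj
            h[1][1] += v * xj * xj
        step = solve2(h, g)
        alpha = [alpha[0] + step[0], alpha[1] + step[1]]
        if max(abs(step[0]), abs(step[1])) < tol:
            break
    return alpha

# Toy example (hypothetical): binary covariate, 45 internal individuals,
# and 4 external individuals whose design weights represent 100 + 50 people.
x_int = [0.0] * 20 + [1.0] * 25
x_ext = [0.0, 0.0, 1.0, 1.0]
pi_ext = [0.02, 0.02, 0.04, 0.04]
alpha_hat = pl_selection_model(x_int, x_ext, pi_ext)
pi_hat = [expit(alpha_hat[0] + alpha_hat[1] * x) for x in (0.0, 1.0)]
print(alpha_hat, pi_hat)
```

In this toy set-up the external weights imply 100 population members with covariate value 0 and 50 with value 1, so the fitted selection probabilities should equal the internal counts divided by those totals.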
2.4.2. Simplex regression
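The main idea underlying this method is based on the identity
| $\pi(X)=\pi_{\text{ext}}(X)\,\dfrac{p_{11}(X)+p_{10}(X)}{p_{11}(X)+p_{01}(X)}$ | (9) |
where $p_{jk}(X)=P(S=j,\,S_{\text{ext}}=k\mid X,\ S=1\ \text{or}\ S_{\text{ext}}=1)$, $S_{\text{ext}}$ is the selection indicator of the external sample, and $\pi_{\text{ext}}(X)=P(S_{\text{ext}}=1\mid X)$. The proof of identity (9) is provided in online supplementary material, Section S1.3. From equation (9), we observe that we need to estimate $\pi_{\text{ext}}(X_i)$ and $p_{jk}(X_i)$ for each internal sample individual to calculate the internal selection probabilities $\pi(X_i)$. We adopt two separate regression frameworks to model the dependencies of $\pi_{\text{ext}}$ and $p_{jk}$ on $X$, respectively.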
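Estimation of $\pi_{\text{ext}}(X)$: We used simplex regression (Barndorff-Nielsen and Jørgensen, 1991) to model the dependence of $\pi_{\text{ext}}$ on $X$. Simplex regression is a glm-type regression method with proportions as the response. The main idea is to fit the best possible model in the external sample to the known design probabilities as a function of $X$, say, $\mu(X,\beta)$. The parameter $\beta$ is estimated by maximizing the following likelihood function based on the external probability sample obtained using the simplex distribution
| $L(\beta,\sigma^2)=\prod_{i:\,S_{\text{ext},i}=1}\left[2\pi\sigma^2\{y_i(1-y_i)\}^3\right]^{-1/2}\exp\left\{-\dfrac{d(y_i;\mu_i)}{2\sigma^2}\right\}$ | (10) |
where $y_i$ is the known design probability of the $i$-th external individual, $\mu_i=\mu(X_i,\beta)$, $\sigma^2$ is a dispersion parameter, and $\pi$ in the factor $2\pi\sigma^2$ is the mathematical constant, with the unit deviance function,
$$d(y;\mu)=\frac{(y-\mu)^2}{y(1-y)\,\mu^2(1-\mu)^2}.$$
In R, the simplexreg package (Zhang et al., 2016) provides estimates of $\beta$ by maximizing the likelihood in (10). The external selection probabilities for individuals in the internal nonprobability sample are then estimated by the plug-in estimate $\hat{\pi}_{\text{ext}}(X_i)=\mu(X_i,\hat{\beta})$.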
Estimation of $p_{jk}(X)$: On the other hand, the probabilities $p_{jk}(X)$ of the sample-membership pair $(S,S_{\text{ext}})$ are estimated based on the combined (external union internal) sample. We define a nominal categorical variable with three levels corresponding to the different values of the pair $(S,S_{\text{ext}})$, where $S_{\text{ext}}$ is the external selection indicator. An individual with level (1,1) is a member of both samples; (0,1) indicates a member of the external sample only, whereas (1,0) corresponds to the internal sample only. This multicategory response is regressed on the internal selection model variables using a multinomial regression model, and we obtain the estimates $\hat{p}_{jk}(X_i)$.
Using the estimates $\hat{p}_{jk}(X_i)$ from multinomial regression and $\hat{\pi}_{\text{ext}}(X_i)$ from simplex regression, respectively, the selection probabilities for the internal sample were estimated from equation (9); these estimates serve as $\hat{\pi}(X_i)$ in equation (4).
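The final combination step can be sketched in a few lines. The sketch assumes the identity takes the form internal probability = external probability × (p11 + p10) / (p11 + p01), where p11, p10, and p01 are the fitted probabilities of being in both samples, the internal sample only, and the external sample only, given membership in at least one of them. The function name and the numbers are hypothetical.

```python
def internal_selection_prob(pi_ext, p11, p10, p01):
    """Combine an external selection probability with multinomial membership
    probabilities to obtain an internal selection probability.

    pi_ext : estimated P(S_ext = 1 | x), e.g. from a simplex regression fit
    p11    : P(in both samples | x, in at least one sample)
    p10    : P(internal only   | x, in at least one sample)
    p01    : P(external only   | x, in at least one sample)
    """
    return pi_ext * (p11 + p10) / (p11 + p01)

# Check with independent selection: P(S=1)=0.2 and P(S_ext=1)=0.1, so that
# P(both)=0.02, P(internal only)=0.18, P(external only)=0.08, P(either)=0.28.
either = 0.02 + 0.18 + 0.08
pi_int = internal_selection_prob(0.1, 0.02 / either, 0.18 / either, 0.08 / either)
print(pi_int)
```

In this constructed case the function should recover the internal selection probability of 0.2 that was used to build the membership probabilities.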
2.5. Estimation of weights using summary-level statistics
In this section, we discuss two methods to account for selection bias using summary-level information that corresponds to the target population. This summary information may be obtained directly from the target population (such as from census data) or from summary data that have been made available by applying known survey design weights to an external sample drawn from the target population. We consider two types of summary-level statistics, namely joint and marginal probabilities of the selection variables. Similar to Beesley and Mukherjee (2022b), we adopt poststratification methods (Holt and Smith, 1979) when joint probabilities of the selection variables are available to us. On the other hand, when only marginal probabilities are available, we modify the calibration method used in Wu (2003), originally proposed for obtaining modified sampling probabilities from survey data.
2.5.1. Poststratification
We assume that the joint distribution of the selection variables in the target population is available to us. In the case of continuous selection variables, we can at best expect to have access to joint probabilities of discretized versions of those variables. Beyond this coarsening, obtaining joint probabilities of a large multivariate set of predictors becomes extremely challenging. In such cases, several conditional independence assumptions will be needed to specify a joint distribution from sub-conditionals.
We consider the scenario where both $Z_2$ and $W$ are continuous variables. Let $Z_2'$ and $W'$ be the discretized versions of $Z_2$ and $W$, respectively, and let $X'$ denote the resulting vector of discretized selection variables. We assume that the joint distribution of $X'$ in the target population is available from external sources. The poststratification method estimates the selection weight (inverse of the selection probability into the internal sample) for the $i$-th individual belonging to the internal sample by,
$$\omega_i=\frac{P(X'=x_i')}{\hat{P}(X'=x_i'\mid S=1)}.$$
The numerator of the above expression is the known population-level joint distribution for the discretized selection variables obtained from external sources. On the other hand, the denominator is the same probability empirically estimated from the internal sample. The inverses of the weights, $1/\omega_i$, are normalized to obtain estimates of $\pi(X_i)$ in equation (4).
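The poststratification weights described above amount to a ratio of cell probabilities, which the following sketch computes. The function name, the cell labels, and the population probabilities are hypothetical; each cell label stands for one combination of the discretized selection variables.

```python
from collections import Counter

def poststratification_weights(cells, pop_probs):
    """Return one weight per internal-sample individual.

    cells     : cell label of each internal-sample individual
    pop_probs : dict mapping each cell label to its known population probability
    The weight is (population cell probability) / (internal-sample cell proportion).
    """
    n = len(cells)
    counts = Counter(cells)
    return [pop_probs[c] / (counts[c] / n) for c in cells]

# Toy internal sample (hypothetical): cell "a" is over-represented
# relative to a population in which both cells are equally likely.
cells = ["a", "a", "a", "b"]
pop_probs = {"a": 0.5, "b": 0.5}
weights = poststratification_weights(cells, pop_probs)
print(weights)
```

Individuals in the over-represented cell receive weights below one and those in the under-represented cell receive weights above one; a cell that is empty in the internal sample cannot be reweighted, which is one practical reason for coarsening continuous variables.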
2.5.2. Calibration
Calibration methods are often used in survey sampling to obtain corrected sampling weights in probability samples (Wu, 2003). We borrowed this idea to estimate internal selection probabilities by a model $\pi(X,\alpha)$, indexed by parameters $\alpha$, when marginal population means of the selection variables $X$ are available from external sources. Using the target population size N and the given marginal population means of the selection variables, we derive the population totals, namely $\sum_{i=1}^{N}X_i$. We obtain the estimate of $\alpha$ by solving the following calibration equation,
| $\sum_{i=1}^{N}\dfrac{S_iX_i}{\pi(X_i,\alpha)}=\sum_{i=1}^{N}X_i$ | (11) |
In this approach, we match the sum of each selection variable in the internal sample (as estimated by the inverse probability weighted sum on the LHS of equation (11)) with the available total from the target population (RHS of equation (11)), analogous to the method of estimation by first moment matching. Similar to Section 2.4.1, the Newton–Raphson method is used to solve equation (11) to estimate $\alpha$ and hence obtain $\hat{\pi}(X_i)$ for each individual in the internal sample. We used a logistic specification of $\pi(X,\alpha)$ in our numerical work, but any selection model consistent for $\pi(X)$ will lead to consistent estimates of $\theta$ in equation (4).
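The moment-matching idea can be sketched as follows, assuming a logistic selection model with an intercept and one selection covariate, so that the inverse selection probability equals 1 + exp(-(alpha0 + alpha1 * x)). The function name `calibrate` and the toy totals are hypothetical; this is an illustration, not the paper's implementation.

```python
import math

def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - b[0] * a[1][0]) / det]

def calibrate(x_int, totals, n_iter=100, tol=1e-10):
    """Solve  sum_internal (1, x_i) / pi(x_i, alpha) = totals  by Newton-Raphson.

    totals = (population size, population total of x), assumed known externally.
    """
    alpha = [0.0, 0.0]
    for _ in range(n_iter):
        g = [-totals[0], -totals[1]]      # weighted internal sum minus target totals
        h = [[0.0, 0.0], [0.0, 0.0]]      # minus the Jacobian of g
        for xi in x_int:
            e = math.exp(-(alpha[0] + alpha[1] * xi))   # 1/pi = 1 + e
            g[0] += 1.0 + e
            g[1] += (1.0 + e) * xi
            h[0][0] += e
            h[0][1] += e * xi
            h[1][0] += e * xi
            h[1][1] += e * xi * xi
        step = solve2(h, g)               # Newton step: alpha + h^{-1} g
        alpha = [alpha[0] + step[0], alpha[1] + step[1]]
        if max(abs(step[0]), abs(step[1])) < tol:
            break
    return alpha

# Toy example (hypothetical): 45 internal individuals with a binary covariate,
# and a target population of 150 people of whom 50 have x = 1.
x_int = [0.0] * 20 + [1.0] * 25
alpha_cl = calibrate(x_int, (150.0, 50.0))
pi_cl = [1.0 / (1.0 + math.exp(-(alpha_cl[0] + alpha_cl[1] * x))) for x in (0.0, 1.0)]
print(alpha_cl, pi_cl)
```

In the toy example the calibrated selection probabilities should equal 20/100 and 25/50, since those are the only values that reproduce both population totals.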
2.6. Asymptotic distribution and variance estimation
We study the asymptotic distribution of the IPW estimator $\hat{\theta}$ under each of the four weighting methods. We consider infinite population inference with population size N going to infinity. We assume that all the variables, including the selection indicators, the disease indicator, and the covariates, are random. This asymptotic setting is intrinsically different from finite population asymptotics, often followed in the survey literature, where all the variables other than the selection indicators are considered to be nonrandom. This asymptotic analysis allows us to derive consistent estimators of the variance of $\hat{\theta}$ to be used in subsequent inference.
Pseudolikelihood: For PL, we derive the consistency, asymptotic normality, and asymptotic variance estimator of $\hat{\theta}$ in online supplementary material, Section S1.3. The two-step variance estimation procedure incorporates uncertainty associated with estimates of the selection model parameter that are obtained by solving equation (8).
Simplex Regression: For SR, due to composite nature of the selection model, we use an approximation of the variance ignoring uncertainty in the estimates of the selection model parameters. The details of this approach are provided in online supplementary material, Section S1.2.
Poststratification: For PS, the weights are known from summary statistics of the target population and the variance formula is provided in online supplementary material, Section S1.2.
Calibration: Similar to PL, for CL we considered the uncertainty associated with estimates of the selection model parameter that are obtained by solving equation (11) while deriving the estimated asymptotic variance of . online supplementary material, Section S1.5 contains the details.
We compared the average of the variance estimators proposed above across simulated datasets with the empirical Monte Carlo variances of the obtained parameter estimates. In particular, we quantify the potential inconsistency of our variance estimator for the SR method, which arises because it ignores the uncertainty associated with estimation of the parameters of the selection model.
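The comparison described above reduces to a simple ratio. The sketch below is one plausible way to compute it, assuming each simulated dataset yields a point estimate and an estimated variance; the function name `variance_ratio` is hypothetical.

```python
import statistics


def variance_ratio(estimates, estimated_variances):
    """Average estimated variance divided by the empirical Monte Carlo
    variance of the point estimates across simulated datasets.
    Values near 1 suggest a well-calibrated variance estimator;
    values below 1 suggest the estimator understates uncertainty."""
    monte_carlo_variance = statistics.variance(estimates)
    return statistics.mean(estimated_variances) / monte_carlo_variance


# Toy illustration with made-up numbers.
ratio = variance_ratio([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(ratio)
```

A variance estimator that ignores part of the uncertainty, as the SR approximation does, would be expected to give a ratio below 1 when that ignored part matters.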
3. Simulation study
In this section, we present three simulation scenarios for each of the four DAGs introduced in Figure 2. The three set-ups differed in the assumption of the functional form of the selection model of the internal sample, namely . For all three set-ups, we consider the following generative distributions
- Disease model covariates and: The joint distribution of is specified as,
- Disease outcome D: D is simulated from the conditional distribution specified by,
where, , and .
- Selection model covariate : W is a univariate random variable simulated from the conditional distribution of , specified by,
to incorporate the dependencies of W on D, , and , respectively. We set for the four DAGs, respectively.
The internal sample selection models for the three set-ups are specified as follows.
- Set-up 1: We set the target population size to N = 50,000. The functional form of the selection model is given by,
We set , for DAGs 1 and 2 and for DAGs 3 and 4, , . (12)
- Set-up 2: The internal selection model in set-up 1 is perturbed by a constant multiplication given by,
We set the exact same values for as in set-up 1. This perturbation of the selection model leads to a misspecification issue for the pseudolikelihood and calibration methods when we fit them using a logistic form. To ensure a comparable internal sample size across the simulation scenarios, we increased the target population size to 125,000, which is 2.5 times the previous population size of 50,000.
- Set-up 3: In this set-up, we incorporate interaction terms of and in the selection model. The new selection model is given by,
The values of are identical to set-up 1. We set for DAGs 1 and 2 and for DAGs 3 and 4. Therefore this set-up leads to a misspecification issue in pseudolikelihood, simplex regression, and calibration methods when we fit these models without considering the interaction terms.
- External Selection Model: For external data, the selection model can take any functional form and the selection probabilities are known to us. In our case, we assumed that the functional form of the external selection model is given by,
The values of are given by . The probabilities from the above equation were multiplied by a factor of 0.75.
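A short calculation illustrates why the constant multiplication in set-up 2 breaks the logistic form. The symbols $c$ (the multiplier, with $0 < c < 1$) and $\eta$ (the linear predictor of the set-up 1 selection model) are notation assumed here for illustration; this section does not state them in this form.

$$
\operatorname{logit}\{c\,\operatorname{expit}(\eta)\}
= \log\frac{c\,e^{\eta}}{1 + (1-c)\,e^{\eta}}
= \log c + \eta - \log\{1 + (1-c)\,e^{\eta}\}.
$$

The last term is nonlinear in $\eta$ unless $c = 1$, so a logistic model in the same covariates cannot reproduce the perturbed selection probabilities exactly, which is the source of misspecification for methods that fit a logistic form.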
For the PS method, the joint distribution of is available from external sources, where both and are the coarsened versions of and W. The criteria that we used to discretize these variables in the simulations are described in online supplementary material, Section S1.6. All simulation results are summarized over 1,000 replications.
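As a rough illustration of how poststratification weights can be built from a coarsened joint distribution, the sketch below weights each sampled unit by the ratio of the population proportion to the sample proportion of its cell. This is a generic textbook form assumed for illustration; the cell labels, proportions, and the function name `poststratification_weights` are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_cells, population_props):
    """Weight for each sampled unit: population proportion of its cell
    divided by the sample proportion of that cell."""
    n = len(sample_cells)
    counts = Counter(sample_cells)
    return [population_props[c] / (counts[c] / n) for c in sample_cells]


# Toy example: cells are (coarsened covariate, coarsened selection variable).
sample_cells = [("low", "yes"), ("low", "yes"), ("low", "yes"), ("high", "no")]
population_props = {("low", "yes"): 0.5, ("high", "no"): 0.5}
weights = poststratification_weights(sample_cells, population_props)
print(weights)
```

After weighting, each cell carries the same share of total weight in the sample as it has in the target population. Any information lost when the continuous variables are coarsened into cells is not recovered by these weights.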
Evaluation Metrics for Comparing Methods
In all the simulation set-ups, we compared the bias, relative bias and relative mean squared error (RMSE) relative to the unweighted method for both and across the four different weighted methods introduced in the previous section. The bias and relative bias % in estimation for a parameter θ using are given by,
where, is the estimate of θ in the simulated dataset, and .
RMSE of with respect to the unweighted estimator is defined as the ratio of the two MSEs given by,
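The evaluation metrics named above can be computed as follows. This is a sketch using the standard definitions (mean estimate minus truth, absolute bias as a percentage of the truth, and a ratio of mean squared errors), which are assumed here because the displayed formulas are not reproduced in this section. The true value of 0.5 in the example is inferred from the percentages in Table 3 rather than stated in the text.

```python
import statistics


def bias(estimates, theta):
    """Mean of the estimates across simulated datasets minus the truth."""
    return statistics.mean(estimates) - theta


def relative_bias_pct(estimates, theta):
    """Absolute bias expressed as a percentage of the true value."""
    return 100.0 * abs(bias(estimates, theta)) / abs(theta)


def mse(estimates, theta):
    """Mean squared error of the estimates around the truth."""
    return statistics.mean([(e - theta) ** 2 for e in estimates])


def relative_mse(weighted_estimates, unweighted_estimates, theta):
    """MSE of a weighted method divided by the MSE of the unweighted method.
    Values below 1 favour the weighted method."""
    return mse(weighted_estimates, theta) / mse(unweighted_estimates, theta)


# A bias of -1.48 / 1000 around an assumed true value of 0.5 corresponds to
# a relative bias of about 0.30%, in line with the first row of Table 3.
example_relative_bias = relative_bias_pct([0.5 - 0.00148], 0.5)
print(round(example_relative_bias, 2))
```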
3.1. Results from the simulation study
3.1.1. Example DAG 1: unbiased case
Under DAG 1, as described in Section 2.2, the selection bias-inducing term in the observed disease model , namely is a constant function in . We proved in online supplementary material, Section S1.1.1 that the unweighted method produces unbiased estimates of and for this DAG. The simulation results in Table 3 and Figure 4 bear this out under all three set-ups: all five methods, including the unweighted approach, estimate both disease model parameters accurately, with a highest relative bias of 0.82%. The different specifications of the functional form of the selection model therefore do not materially affect the performance of any method, other than a minor inflation in the variance of the parameter estimates in some cases. The RMSEs of the four weighted methods range from 0.88 to 1.21.
Table 3.
Bias and RMSE comparison between the unweighted and four weighted methods in DAGs 1, 2, 3, and 4, under simulation set-ups 1, 2, and 3

| DAG | Method | Bias() (Multiplied by 1000) | | | Bias() (Multiplied by 1000) | | |
|---|---|---|---|---|---|---|---|
| | | Set-up 1 | Set-up 2 | Set-up 3 | Set-up 1 | Set-up 2 | Set-up 3 |
| DAG 1 | Unweighted | −1.48 (0.30%) | −1.48 (0.30%) | −1.48 (0.30%) | 1.00 (0.20%) | 1.00 (0.20%) | 1.00 (0.20%) |
| | PL | −1.56 (0.32%) | −1.56 (0.32%) | −1.56 (0.32%) | 1.21 (0.24%) | 1.21 (0.24%) | 1.21 (0.24%) |
| | SR | −1.58 (0.32%) | −1.58 (0.32%) | −1.58 (0.32%) | 1.22 (0.24%) | 1.22 (0.24%) | 1.22 (0.24%) |
| | PS | 0.83 (0.17%) | 0.83 (0.17%) | 0.83 (0.17%) | −4.10 (0.82%) | −4.10 (0.82%) | −4.10 (0.82%) |
| | CL | 0.68 (0.14%) | 0.68 (0.14%) | 0.68 (0.14%) | −0.61 (0.12%) | −0.61 (0.12%) | −0.61 (0.12%) |
| DAG 2 | Unweighted | −70.80 (14.16%) | −70.42 (14.08%) | 47.61 (9.53%) | 1.02 (0.20%) | 0.66 (0.13%) | 1.69 (0.34%) |
| | PL | −1.39 (0.28%) | −51.02 (10.20%) | 157.33 (31.47%) | 1.29 (0.25%) | 0.61 (0.12%) | 2.11 (0.42%) |
| | SR | −8.61 (1.72%) | −8.25 (1.65%) | 129.80 (25.96%) | 1.39 (0.28%) | 0.68 (0.14%) | 2.34 (0.47%) |
| | PS | 103.90 (20.78%) | 103.85 (20.77%) | 137.93 (27.59%) | 10.05 (2.01%) | 10.29 (2.06%) | −0.82 (0.16%) |
| | CL | 0.44 (0.09%) | −47.79 (9.56%) | 145.26 (29.05%) | −0.67 (0.13%) | −0.13 (0.03%) | 0.40 (0.08%) |
| DAG 3 | Unweighted | −62.83 (12.6%) | −62.28 (12.46%) | 9.24 (1.85%) | −143.75 (28.74%) | −143.86 (28.77%) | 79.05 (15.81%) |
| | PL | −0.30 (0.06%) | −48.90 (9.78%) | 122.56 (24.51%) | 1.27 (0.25%) | −112.28 (22.46%) | 147.67 (29.53%) |
| | SR | −13.80 (2.76%) | −13.41 (2.68%) | 84.07 (16.81%) | −83.72 (16.74%) | −84.22 (16.84%) | 77.48 (15.50%) |
| | PS | 123.44 (24.7%) | 124.32 (24.86%) | 138.89 (27.78%) | 2.15 (0.43%) | 2.10 (0.42%) | 9.74 (1.95%) |
| | CL | 0.14 (0.03%) | −42.95 (8.59%) | 108.11 (21.62%) | −0.62 (−0.12%) | −102.32 (20.46%) | 116.77 (23.35%) |
| DAG 4 | Unweighted | −71.92 (14.38%) | −71.67 (14.33%) | −42.33 (8.47%) | −235.72 (47.14%) | −235.24 (47.05%) | −201.69 (40.34%) |
| | PL | 0.07 (0.01%) | −60.53 (12.11%) | 82.38 (16.47%) | 1.09 (0.22%) | −198.12 (39.62%) | 173.74 (34.75%) |
| | SR | −30.11 (6.02%) | −30.23 (6.05%) | 21.63 (4.33%) | −145.59 (29.20%) | −145.58 (29.12%) | −61.17 (12.23%) |
| | PS | 2.96 (0.59%) | −20.03 (4.01%) | 12.68 (2.54%) | 29.11 (5.82%) | −57.71 (11.54%) | 31.38 (6.28%) |
| | CL | 0.45 (0.09%) | −51.03 (10.21%) | 68.70 (13.7%) | −0.65 (0.13%) | −174.77 (34.75%) | 129.91 (25.98%) |

| DAG | Method | RMSE() | | | RMSE() | | |
|---|---|---|---|---|---|---|---|
| | | Set-up 1 | Set-up 2 | Set-up 3 | Set-up 1 | Set-up 2 | Set-up 3 |
| DAG 1 | Unweighted | 1 | 1 | 1 | 1 | 1 | 1 |
| | PL | 1.03 | 1.03 | 1.03 | 1.07 | 1.07 | 1.07 |
| | SR | 1.06 | 1.06 | 1.06 | 1.07 | 1.07 | 1.07 |
| | PS | 1.21 | 1.21 | 1.21 | 0.88 | 0.88 | 0.88 |
| | CL | 1.13 | 1.13 | 1.13 | 1 | 1 | 1 |
| DAG 2 | Unweighted | 1 | 1 | 1 | 1 | 1 | 1 |
| | PL | 0.09 | 0.57 | 9.22 | 1.06 | 1.05 | 1.07 |
| | SR | 0.10 | 0.12 | 6.33 | 1.05 | 1.06 | 1.05 |
| | PS | 2.07 | 2.22 | 7.15 | 1.14 | 1.86 | 0.94 |
| | CL | 0.09 | 0.65 | 7.89 | 1.01 | 1.13 | 1.00 |
| DAG 3 | Unweighted | 1 | 1 | 1 | 1 | 1 | 1 |
| | PL | 0.11 | 0.67 | 31.01 | 0.03 | 0.63 | 3.33 |
| | SR | 0.15 | 0.15 | 15.08 | 0.36 | 0.36 | 0.98 |
| | PS | 3.61 | 3.86 | 39.64 | 0.02 | 0.04 | 0.09 |
| | CL | 0.11 | 0.74 | 24.32 | 0.02 | 0.56 | 2.10 |
| DAG 4 | Unweighted | 1 | 1 | 1 | 1 | 1 | 1 |
| | PL | 0.08 | 0.75 | 3.33 | 0.01 | 0.72 | 0.75 |
| | SR | 0.24 | 0.24 | 0.41 | 0.39 | 0.39 | 0.10 |
| | PS | 0.15 | 0.52 | 0.37 | 0.10 | 0.40 | 0.11 |
| | CL | 0.08 | 0.72 | 2.38 | 0.01 | 0.57 | 0.42 |
Note. Unweighted, Unweighted Logistic Regression; SR, Simplex regression; PL, Pseudolikelihood; PS, Poststratification; CL, Calibration.
Figure 4.
(a) Estimates of in, coefficient of in the disease model along with 95% C.I using the unweighted and the four weighted methods under the three simulation set-ups for each of the four DAGs. (b) Estimates of in, coefficient of in the disease model along with 95% C.I using the unweighted and the four weighted methods under the three simulation set-ups for each of the four DAGs. Unweighted, unweighted logistic regression; SR, simplex regression; PL, pseudolikelihood; PS, poststratification; CL, calibration.
3.1.2. Example DAG 2: arrow induced bias for coefficient of
Under DAG 2, we showed in Section 2.2 that is a function of only. With the introduction of the dependence, (), the relative bias in estimation of using the unweighted method increases to at least 9.53% compared to 0.30% in DAG 1, under all three simulation set-ups. Under set-up 1, the selection models for PL and CL are correctly specified. In this set-up, PL and CL perform best in terms of both bias and RMSE in estimating , whereas SR estimates with a higher bias (1.72%). Due to the loss of information in discretizing the selection variables, the relative bias and RMSE of PS are the highest among all five methods (20.78% and 2.07, respectively) under set-up 1. However, under simulation set-ups 2 and 3, the biases and RMSEs of both PL and CL increase substantially due to misspecification of the selection models. For PS, the functional form of the selection model affects neither bias nor RMSE. The estimate of using SR under set-up 2 is close to the estimate in set-up 1 since the estimation procedure of SR does not depend on the logistic form of the selection model. Therefore, the perturbation of the selection probabilities by a constant in set-up 2 is inconsequential for SR. However, the introduction of the interaction term in set-up 3 increases the relative bias and RMSE for SR to 25.96% and 6.33, respectively, since SR assumes no interaction in the estimation method. In set-up 3, the RMSEs of all four weighted methods are remarkably high (at least 6) due to severe misspecification.
On the other hand, due to the lack of dependence of on , all the methods produce accurate estimates of in terms of both bias and RMSE under all three simulation set-ups.
3.1.3. Example DAG 3: and induced bias for coefficients of and
Under DAG 3, is a function of both and . Consequently, under all the set-ups, the relative bias in estimation of increases to at least 16% using the unweighted logistic method. Due to correct specification of the selection model for PL and CL in set-up 1, these two methods accurately estimate both and . However, under both set-ups 2 and 3, the relative bias of the estimates of and using PL and CL increases by a large amount. The relative bias in estimation of using SR increases to 16.74% in DAG 3 from 0.28% in DAG 2 under set-up 1 due to incorrect model specification. The bias in estimation of using SR did not change much in set-ups 2 and 3 from set-up 1. For , we observe a big increase in bias in set-up 3 using SR. In terms of RMSE, PL, CL, and SR perform better than the unweighted logistic regression except for estimation of in set-up 3, where the RMSE increased to at least 15. This abrupt rise in RMSE arises because all four weighted estimates are highly biased compared with the naive estimator. In all the other cases, the RMSEs of these methods are below 1. The estimate of using PS performs poorly in all three set-ups, with high relative bias (at least 24%) and RMSE (at least 3.61). On the other hand, both the relative bias and RMSE in estimation of using PS are fairly low (at most 1.95% and 0.09, respectively).
3.1.4. Example DAG 4: strong dependence, increased bias for coefficients of and
Due to the increase in dependence of on , the bias in estimation of is the highest among all the DAGs for the unweighted method. The relative bias in estimation of increases to at least 40.34% in all three set-ups using the unweighted method. Similar to the previous DAGs, under set-up 1, PL and CL perform best in terms of both RMSE and bias among all the methods in estimation of the disease parameters due to correct specification of the selection model. Under set-ups 2 and 3, these two methods perform poorly because of model misspecification. For SR, we observe an increase in relative bias to 29% in estimation of compared to DAG 3 in set-up 1. The bias in estimation of both and decreases compared to the other DAGs using PS. For all the methods in most scenarios, the RMSE is less than 1, which implies better performance of the weighted methods compared to the unweighted logistic method.
3.1.5. Summary takeaways
The comparative performances of the different methods under all varying simulation scenarios are summarized in Figure 5.
Figure 5.
Preferred methods of estimation for including the unadjusted and the four weighted ones in terms of bias of estimation of the disease model parameters under different DAG set-ups in all the three considered simulation set-ups. Unweighted, unweighted logistic regression; SR, simplex regression; PL, pseudolikelihood; PS, poststratification; CL, calibration.
Set-up 1 Correctly Specified Individual Selection Model: As expected, PL and CL estimate both disease model parameters accurately under all four DAGs when the selection model is correctly specified. They offer better solutions than the naive logistic regression across all scenarios. It is not fair to compare PS and SR directly since they use different types of external data. Still, between PS and SR, there is no clear winner: while PS does well in DAG 4, SR performs better in the simpler DAGs. SR is also always better than naive logistic regression in all simulations, whereas PS can have very large RMSEs due to high bias, as we noticed in DAGs 2 and 3 (Table 3). The loss of information in discretizing the selection variables leads to incorrect estimation of the selection weights using PS. As a result, even with only the marginal means of the selection variables from the target population, CL works better than PS under correct specification of the selection model. However, in DAG 4, due to high dependence among the different variables, the information contained in the discretized versions becomes adequate to estimate accurate weights for PS.
Variance Estimation/Uncertainty Quantification: Online supplementary material, Figures S1 and S2 assess the performances of the proposed variance estimators for the weighted methods under all the DAGs in set-up 1. Online supplementary material, Figure S1 shows the deviation of the estimated variance of using the variance estimators discussed in Section 2.6 from the Monte Carlo variance under all four DAGs. We observe that the variance estimators for the four methods accurately estimate the Monte Carlo variance, except for the SR variance estimator in DAG 4. Online supplementary material, Figure S2 shows the coverage probabilities of the 95% confidence intervals constructed using the proposed variance estimators. The coverage probabilities of PL and CL are close to 0.95 for all four DAGs in set-up 1. The coverage probabilities are conservative for PS in DAGs 1, 2, and 3. In DAG 4, the coverage probability is less than 0.5 for PS. The coverage probabilities of SR in DAGs 1 and 2 are comparable to the other methods. On the other hand, in DAGs 3 and 4, the coverage probability of SR is close to 0. The main reason for this low coverage probability is the high bias of SR in DAGs 3 and 4 observed in Figure 4b.
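For readers who want to reproduce this kind of coverage check, a minimal sketch follows. It assumes Wald-type 95% intervals (estimate plus or minus 1.96 standard errors); the function name `coverage_probability` and the toy numbers are hypothetical.

```python
def coverage_probability(estimates, std_errors, theta, z=1.959964):
    """Fraction of Wald intervals (estimate +/- z * SE) that contain the
    true parameter value theta across simulated datasets."""
    covered = sum(
        1
        for est, se in zip(estimates, std_errors)
        if est - z * se <= theta <= est + z * se
    )
    return covered / len(estimates)


# Toy illustration: two of three intervals contain the true value 0.5.
toy_coverage = coverage_probability([0.5, 0.6, 1.0], [0.1, 0.1, 0.1], 0.5)
print(toy_coverage)
```

The calculation also shows why a biased estimator can have coverage near 0 even with a sound variance estimator: if the estimates sit several standard errors away from the truth, almost no interval will contain it.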
Set-up 2 Incorrectly Specified Selection Model 1: In set-up 2, our results indicate that all methods performed remarkably well in DAG 1, similar to the previous set-up, since the bias term is constant in . SR and PS did not show major changes from the previous set-up, and their performance in terms of relative bias % and RMSE is better than that of PL and CL. In DAG 4, PS estimates both disease model parameters with low relative bias % and RMSE. On the other hand, in DAGs 3 and 4, we observed highly inaccurate estimates for PL and CL in terms of relative bias (%). Our findings suggest that these two methods are highly sensitive to selection model misspecification.
Set-up 3 Incorrectly Specified Selection Model 2: The key takeaways in this set-up are similar to the previous one. However, the RMSEs of all four weighted methods are extremely high in DAGs 2 and 3 in estimation of . Due to the high degree of selection model misspecification introduced by the interaction among the selection variables, the selection weight estimates of the four weighted methods are extremely inaccurate, which leads to a huge increase in RMSE. However, in DAG 4, the performance of the unweighted method degraded substantially and, as a result, the RMSEs of the weighted methods are much lower than in DAGs 2 and 3.
4. Data application: the Michigan Genomics Initiative
4.1. Introduction
The Michigan Genomics Initiative (MGI) is a rolling enrollment health EHR-linked biorepository within the University of Michigan Healthcare System consisting of over 93,000 participants primarily recruited through surgical encounters at Michigan Medicine. Due to the perioperative recruitment strategy, participants in MGI exhibit a lower overall health status and higher prevalence of cancer compared to the general population (Zawistowski et al., 2023). Time-stamped ICD (International Classification of Disease) diagnosis data are available for each patient. A rich ecosystem of additional information is available, including lifestyle and behavioural-risk factors, laboratory and medication data, geo-coded residential information, socioeconomic metrics, and other patient-level, census tract-level, and provider-level characteristics.
In this section, we use the MGI data to study the association between cancer (D) and biological sex () in the target US adult population. The direction of association in this case is well known from national SEER (Surveillance, Epidemiology, and End Results) registry estimates. SEER data indicate lower lifetime cancer risk among women relative to men, with corresponding marginal log-odds ratios of −0.24 (2008–2010), −0.19 (2010–2012), −0.08 (2012–2014), and −0.07 (2014–2016), respectively (seer.cancer.gov). This known national-level association presents us with an opportunity to assess and compare the methods when applied to MGI in terms of bias in . In this analysis, we investigate the marginal/unadjusted and age () adjusted association between cancer and biological sex. For all the methods, we divided age into three categories, namely (18–39) (reference level), (40–59), and (). For the selection model we use diabetes, race, smoking currently, BMI (body mass index), and CHD (coronary heart disease) as . BMI has four categories, namely (0–18) (reference level), [18.5–25), [25–30), and (). For the individual-level data methods (PL and SR), we use publicly available NHANES 2017–2018 (National Health and Nutrition Examination Survey) data to construct IPW weights (cdc.gov/nchs/nhanes). NHANES uses a complex multistage probability sampling design to select participants representative of the civilian, noninstitutionalized US population. On the other hand, we use age-specific and marginal summary statistics from SEER, the US Census, and the US CDC (Centers for Disease Control and Prevention) to construct poststratification and calibration weights, respectively.
4.2. Descriptive summaries
We select adult participants in NHANES since MGI consists of participants with age 18 years or older. After removing observations with incomplete data on the variables of interest, we are left with 80,947 and 5,153 participants in MGI and NHANES, respectively. Table 4 presents a comprehensive summary of the variables of interest in both the MGI and NHANES datasets. The reported statistics for the NHANES dataset in this table are unweighted. As expected, MGI is enriched with cancer patients, with 48.7% participants having a past or current cancer diagnosis (D). The NHANES dataset demonstrates a prevalence of cancer at 10.3%. The two studies differ in terms of the distribution of sex (), age (), and other selection covariates ().
Table 4.
Descriptive summaries of the different variables of interest in both MGI and NHANES data
| Variables | MGI | NHANES |
|---|---|---|
| Cancer | Yes (48.7%) | Yes (10.3%) |
| | No (51.2%) | No (89.7%) |
| Sex | Female (53.8%) | Female (51.8%) |
| | Male (46.2%) | Male (48.2%) |
| Age | 57.5 (18.1) | 51.2 (17.6) |
| Race | Non-Hispanic White (85.3%) | Non-Hispanic White (34.3%) |
| | Others (14.7%) | Others (66.7%) |
| BMI (kg/m²) | 29.9 (7.26) | 29.8 (7.4) |
| CHD | Yes (16.5%) | Yes (4.6%) |
| | No (83.5%) | No (95.4%) |
| Diabetes | Yes (33.3%) | Yes (15.7%) |
| | No (66.7%) | No (84.3%) |
| Current Smoking | Yes (9.8%) | Yes (18.2%) |
| | No (90.2%) | No (81.8%) |
Note. The statistics for NHANES provided here are unweighted. CHD stands for Coronary heart disease. For continuous variables, we reported Mean (SD).
4.3. Analyses of MGI data
In this data example, the disease model uses cancer, sex, and age as D, , and , respectively. The sex variable is coded as 1 for female participants. We are primarily interested in estimating the marginal and age-adjusted association parameters between cancer and sex, , which are defined by the following equation.
In the marginal association model, we did not adjust for age. The additional terms in the adjusted model are displayed in red. Note that the reference data from SEER corresponds to the marginal association model of cancer on sex without adjusting for age.
For all four weighting methods, we first estimated the IPW weights without including cancer (D) as a selection variable (defined as , inverse of ). This is due to the small number of cancer cases in NHANES compared with MGI, as displayed in Table 4. Then we modified the weights for the two individual-level methods (PL and SR) using the following expression to incorporate cancer into the selection model,
| (13) |
| (14) |
where is obtained from fitting a logistic regression model of D on in MGI. On the other hand, we fit a weighted logistic regression in the NHANES data with the given sampling weights to obtain . The details of deriving equation (13) are provided in online supplementary material, Section S1.7. In the case of the summary-level methods PS and CL, estimation of in equation (14) is not possible due to the limited availability of joint summary statistics from the population. Therefore, we approximate in equation (14) by the age-specific cancer estimate from SEER. Similarly, in the denominator is obtained using a logistic regression of D on age in MGI. We still need the joint distribution of to estimate in order to implement PS. Due to the limited availability of joint and conditional summary data on from the US target population, we assumed that, given , all the other selection variables are independent of each other. For all the weighted methods, we winsorized the selection weights by replacing values below the 2.5th percentile and above the 97.5th percentile with the respective quantiles to stabilize the methods.
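The winsorization step can be sketched as follows. This is an illustrative version, not the authors' code: it assumes the 2.5th and 97.5th percentiles are computed with Python's `statistics.quantiles` under its default interpolation rule, and the function name `winsorize_weights` is hypothetical.

```python
import statistics


def winsorize_weights(weights):
    """Replace weights below the 2.5th percentile and above the 97.5th
    percentile by those percentiles, leaving all other weights unchanged."""
    # n=40 gives cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(weights, n=40)
    lower, upper = cuts[0], cuts[-1]
    return [min(max(w, lower), upper) for w in weights]


# Toy illustration with made-up weights 1, 2, ..., 100.
raw_weights = [float(i) for i in range(1, 101)]
stable_weights = winsorize_weights(raw_weights)
print(min(stable_weights), max(stable_weights))
```

Capping the extremes trades a small amount of bias for lower variance, since a handful of very large inverse probability weights can otherwise dominate the weighted estimating equations.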
4.4. Results
We present the estimates of marginal and age-adjusted association parameters between cancer and sex in Figure 6a and b, respectively, using all the four weighted methods and unweighted logistic regression.
Figure 6.
(a) Estimates of the marginal association between cancer and sex along with 95% C.I in using all the four weighted methods and the unweighted logistic regression with and without including cancer as a selection variable. (b) Estimates of the age-adjusted association between cancer and sex along with 95% C.I in using all the four weighted methods and the unweighted logistic regression with and without including cancer as a selection variable.
Marginal/Unadjusted Association: We consider the SEER estimates of the cancer-sex association to be the target truth (−0.24, −0.07). The estimate using the naive unweighted logistic regression method is −0.05 [95% Confidence Interval (C.I) (−0.08, −0.03)]. The corresponding estimates obtained using the four IPW methods, namely PL, SR, PS, and CL, without including cancer as a selection variable are 0.08 [95% C.I (0.04, 0.12)], 0.12 [95% C.I (0.06, 0.18)], 0.19 [95% C.I (0.15, 0.23)], and 0.22 [95% C.I (0.15, 0.23)], respectively, showing that misspecified weights can sway the OR estimates in the wrong direction, further away from the truth than the unweighted estimator. On the other hand, the estimates obtained using the four IPW methods with cancer included as a selection variable are −0.13 [95% C.I (−0.16, −0.09)], −0.11 [95% C.I (−0.17, −0.06)], −0.11 [95% C.I (−0.15, −0.07)], and −0.12 [95% C.I (−0.15, −0.08)], respectively. The 95% C.I of using all four weighted methods largely lie within the SEER range (−0.24, −0.07).
Age-adjusted Association: The age-adjusted estimate using the unweighted logistic method is 0.10 [95% C.I (0.07, 0.13)], which lies in the opposite direction of the SEER range. The estimates obtained using the four IPW methods, namely PL, SR, PS, and CL, without including cancer as a selection variable also skew the OR estimates in the opposite direction. In contrast, the estimates obtained using the four IPW methods with cancer included as a selection variable are −0.07 [95% C.I (−0.10, −0.03)], −0.09 [95% C.I (−0.15, −0.02)], −0.07 [95% C.I (−0.12, −0.02)], and −0.05 [95% C.I (−0.11, 0.02)], respectively. We observe that all four weighted methods have reduced the bias of the estimated association parameter.
4.5. Effects of different sub-sampling strategies within MGI
In this section, we carry out an idealized experiment using the MGI data. In real data, we do not know the actual variables that are driving the selection mechanism. However, when we subsample data intentionally based on certain variables from MGI, the selection model and variables are known to us. This intentional and known subsampling strategy provides a framework to study the extent of selection bias introduced by different choices of selection variables and allows us to study the performance of different methods in recovering the truth in a more realistic situation. Let denote the selection indicator of being included in the subsample of MGI. We incorporate four subsampling strategies using a logistic selection model with varying parameter values. The first is a random sample, the second depends only on cancer (D), the third on cancer (D) and sex (Z), and the fourth on cancer (D), sex (Z), and diabetes (W). In this exercise, we do not include age in the disease model. The details of the subsampling strategies are given in online supplementary material, Section S1.8. Using the above four subsamples of the MGI data, we evaluate the performances of the different methods in estimating the association parameter between cancer and biological sex. We consider two scenarios with two target populations (MGI and the US population, respectively) as we develop the weights. In both scenarios, we assume that the true subsampling strategy is known.
First Scenario: In the first scenario, we assume that the MGI cohort is the target population. Therefore, in this case, the unweighted estimate obtained from MGI [−0.05, 95% C.I (−0.08, −0.03)] is assumed to be the truth, and we compare the estimates of the different methods under varying subsamples. The different subsamples serve as the nonprobability samples of interest drawn from the target MGI population. For the individual-level methods, in this scenario the external data and the target are the same, namely MGI, and hence for each participant. Therefore, it does not make sense to apply SR, since the response variable for the Simplex Regression step is 1 for all data points. For PS and CL, we constructed joint probabilities and marginal means from the MGI data. The performances of the three weighted methods and the unweighted logistic method are presented in Figure 7a. Under random sampling, all four methods accurately estimate in terms of bias, as expected. When only cancer affects subsampling, all the methods including the unweighted logistic regression are unbiased. This case is exactly the same as DAG 1, which explains the accurate performance of all the methods. However, when sex (Z) and cancer (D) both impact selection, the estimate using the unweighted logistic method is severely biased: the association changes to an entirely wrong direction [0.20, 95% C.I (0.15, 0.25)]. All three weighted methods, namely PL [−0.05, 95% C.I (−0.10, 0)], PS [−0.05, 95% C.I (−0.10, 0)], and CL [−0.05, 95% C.I (−0.10, 0)], estimate the association parameter with negligible bias. We observe similar results in the fourth case, where diabetes (W) affects selection along with cancer and sex. In all the cases, we observe that the variances of the estimates increase in comparison to the true MGI C.I due to the smaller sample size of the subsamples.
Figure 7.
(a) Estimates of the association between cancer and sex along with 95% C.I using three weighted methods and the unweighted logistic regression under the four subsampling strategies when MGI is assumed to be the target population. (b) Estimates of the association between cancer and sex along with 95% C.I using three weighted methods and the unweighted logistic regression under the four subsampling strategies when US is assumed to be the target population. The band represents the 95% C.I of estimate of obtained from MGI using unweighted logistic regression. Unweighted, unweighted logistic regression; PL, pseudolikelihood; PS, poststratification; CL, calibration.
Second Scenario: In this scenario, we assume that the US adult population is the target population, not MGI. Therefore, in this case, the SEER estimates are assumed to be the truth, and we compare the estimates of the different methods under varying subsampling schemes. For each of the three weighted methods, we apply a two-stage weighting approach to obtain the final weights for the IPW regression. The first and second sets of weights transport the subsample estimates to MGI and then to the US adult population, respectively. In the second weighting step we use all the variables in in Section 4.3, including age. All three weighted methods reduce the bias in estimating the association parameter compared to the unweighted method. We observe from Figure 7b that under the first two subsampling strategies, all three weighted methods perform well in terms of bias. For the last two subsampling strategies, CL and PL have a large overlap with the SEER band. For example, when subsampling is based on both cancer and sex, the majority of the 95% C.I bands of PL [−0.14, 95% C.I (−0.21, −0.08)] and CL [−0.14, 95% C.I (−0.22, −0.07)] lie within the SEER band. Compared to PL and CL, PS did not perform well, since a large portion of the PS interval is outside the SEER band. Again, in all the cases, we observe that the variances of the estimates increase due to the smaller sample size of the subsamples.
Similar to the simulation results obtained in Section 3.1, we observe that when either Z or W or both affect selection along with D, the unweighted estimate is highly biased. The IPW methods help in reducing the bias of the parameter of interest.
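The logic of this subsampling experiment can be mimicked on synthetic data. The sketch below is a toy analogue, not the MGI analysis: the population, the true log-odds ratio of −0.5, and the subsampling probabilities in `p_select` are all made up. It uses the fact that, for a binary outcome and a single binary covariate, the slope of a weighted logistic regression equals the log-odds ratio of the weighted 2×2 table.

```python
import math
import random


def expit(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def weighted_log_or(rows):
    """Log-odds ratio from (d, z, weight) rows with binary d and z.
    Equals the slope of a weighted logistic regression of D on Z."""
    cell = {(d, z): 0.0 for d in (0, 1) for z in (0, 1)}
    for d, z, w in rows:
        cell[(d, z)] += w
    return math.log(cell[(1, 1)] * cell[(0, 0)] / (cell[(1, 0)] * cell[(0, 1)]))


# Synthetic "target population" with a true log-odds ratio of -0.5.
random.seed(7)
N = 100000
population = []
for _ in range(N):
    z = 1 if random.random() < 0.5 else 0
    d = 1 if random.random() < expit(-1.0 - 0.5 * z) else 0
    population.append((d, z))

# Hypothetical known subsampling probabilities P(S = 1 | D, Z):
# selection depends jointly on the outcome and the exposure.
p_select = {(0, 0): 0.10, (0, 1): 0.30, (1, 0): 0.40, (1, 1): 0.30}
subsample = [(d, z) for d, z in population if random.random() < p_select[(d, z)]]

target = weighted_log_or((d, z, 1.0) for d, z in population)
unweighted = weighted_log_or((d, z, 1.0) for d, z in subsample)
ipw = weighted_log_or((d, z, 1.0 / p_select[(d, z)]) for d, z in subsample)
print(round(target, 2), round(unweighted, 2), round(ipw, 2))
```

With these made-up probabilities the unweighted estimate is shifted by roughly log(0.25), the log-odds ratio of the selection probabilities themselves, while weighting by the inverse of the known selection probabilities brings the estimate back close to the population value.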
5. Discussion and conclusion
Selection bias is a major concern in EHR studies since it is extremely difficult to ascertain the process through which a patient from the target population enters the analytic sample or why a particular observation or laboratory result appears in the health record of a patient. The mechanism of patients’ interactions with the healthcare system may be influenced by a variety of patient characteristics such as age, sex, race, healthcare access, and other health-related co-morbidities. If the issue of selection bias is overlooked, association analyses are generally biased because unadjusted inference from these nonprobability samples from EHR data is generally not transportable to the target population. Therefore, there is a pressing need to understand the structure of selection bias and correct for it when needed, in order to draw valid inferences for the target population.
Hospital-based biobanks are enriched with specific diseases. For example, the dataset we used, MGI (Zawistowski et al., 2023), recruits patients while they are waiting for surgery. Consequently, it is enriched for many diseases including skin cancer (Fritsche et al., 2019). Thus the results from MGI are not directly generalizable to the Michigan or US population, as is evident from the results shown in Section 4 on the cancer-sex association. On the other hand, population-based biobanks such as the UK Biobank, Estonian Genome Center Biobank, and Taiwan Biobank attempt to recruit participants nationally by inviting volunteers. Even large population-based biobanks like the UK Biobank suffer from healthy control bias (Fry et al., 2017). Nationally representative studies such as the NIH All of Us often have a purposeful sampling strategy that leads to, say, oversampling certain underrepresented subgroups (All Of Us Research Programs Investigators, 2019). The problem of selection bias may be magnified when multiple biobanks all over the world are harmonized together for massive meta-analysis. For example, the Global Biobank Meta-analysis Initiative (GBMI) (Vogan, 2022) has linked 24 biobanks with more than 2.2 million genotyped samples linked with health records. To turn such big data into meaningful knowledge, one needs to characterize the different sampling mechanisms underlying the recruitment strategies of these diverse biobanks. We hope this paper provides a conceptual and analytic framework for understanding selection bias and a set of tools that are available to us.
In this work, we introduce a framework to assess selection bias using DAGs when estimating the association of a binary response variable with other independent variables of interest. We consider four inverse probability weighting methods to reduce selection bias in the estimation of association parameters in a logistic disease regression model. The four methods, namely Pseudolikelihood (PL), Simplex Regression (SR), PostStratification (PS), and Calibration (CL), differ primarily in the nature of the external data used to construct the inverse probability weights. In all four methods, we first have to start with a plausible set of selection variables. This can be done either through a description of the study’s recruitment mechanism or by an agnostic search, sifting through an array of variables that are predictive of selection. The first two methods (PL and SR) require individual-level external data on the selection variables from the target population or from an external probability sample drawn from the target population. In contrast, the summary-level methods PS and CL use the joint distribution and the marginal means of the selection variables, respectively, from the target population or a probability sample drawn from the target population. Using a simulation study, we present the extent to which these methods are able to reduce selection bias across a diverse set of simulation settings. Next, we discuss a data example of estimating the association of biological sex with cancer in a hospital-based biobank, namely MGI, and compare the results obtained from the different methods with the population-based SEER estimate. Finally, we describe the considerations that might guide a practitioner in choosing one method over the others for a given analysis.
There can be significant challenges in gathering individual-level data from the target population (which is nearly impossible) or from an external probability sample drawn from the target population. Obtaining access to an external probability sample with individual-level data encompassing all the selection variables that drive internal sample selection can be exceedingly difficult at times. Thus one may be forced to resort to methods that use summary data. PS requires that the joint distribution of the selection variables in the target population is available to us. We can also use a probability sample drawn from the target population to approximate the target joint distribution. In the case of continuous selection variables, we can at best expect to have access to joint probabilities of discretized/categorized versions of those variables. Even with this coarsening, obtaining joint probabilities of a large multivariate set of predictors becomes daunting. As in our data example, one has to make several conditional independence assumptions to specify a joint distribution from available sub-conditionals. CL has the simplest data requirement on the target population (or an external probability sample drawn from the target population): it requires only that the marginal totals of the selection variables in the target population be available to us. Because only first-order moments are available, it is not feasible to include second moments or interaction terms between variables in the selection model for CL.
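The difference between these two summary-data requirements can be made concrete with a small Python sketch. Poststratification weights are computed from target joint cell shares, whereas margin-only information is used here through raking (iterative proportional fitting). Raking is a related calibration technique chosen for brevity and is not the CL estimator studied in this paper. All function names and the toy sample are hypothetical.

```python
from collections import Counter


def poststrat_weights(sample_cells, target_joint):
    """Poststratification: weight = target share of the cell / sample share of the cell."""
    n = len(sample_cells)
    sample_share = {cell: k / n for cell, k in Counter(sample_cells).items()}
    return [target_joint[cell] / sample_share[cell] for cell in sample_cells]


def rake_weights(sample_cells, target_margins, n_iter=50):
    """Raking: rescale weights until each weighted margin matches its target share."""
    w = [1.0] * len(sample_cells)
    for _ in range(n_iter):
        for dim, margin in enumerate(target_margins):
            total = sum(w)
            current = Counter()
            for cell, wi in zip(sample_cells, w):
                current[cell[dim]] += wi
            w = [wi * margin[cell[dim]] * total / current[cell[dim]]
                 for cell, wi in zip(sample_cells, w)]
    return w


def weighted_share(sample_cells, w, predicate):
    """Weighted share of units whose cell satisfies the predicate."""
    return sum(wi for cell, wi in zip(sample_cells, w) if predicate(cell)) / sum(w)


# Hypothetical sample of (sex, age group) cells; young men are over-represented.
sample = [("M", "young")] * 3 + [("M", "old"), ("F", "young"), ("F", "old")]
target_joint = {("M", "young"): 0.25, ("M", "old"): 0.25,
                ("F", "young"): 0.25, ("F", "old"): 0.25}
target_margins = [{"M": 0.5, "F": 0.5}, {"young": 0.5, "old": 0.5}]

w_ps = poststrat_weights(sample, target_joint)
w_rake = rake_weights(sample, target_margins)

ps_joint = weighted_share(sample, w_ps, lambda c: c == ("M", "young"))
rake_margin = weighted_share(sample, w_rake, lambda c: c[0] == "M")
rake_joint = weighted_share(sample, w_rake, lambda c: c == ("M", "young"))
print(ps_joint, rake_margin, rake_joint)
```

In this toy example, the poststratified weights reproduce the target joint cell share exactly, while the raked weights match both margins but leave the sample's sex-by-age association untouched. This mirrors the point above that margin-only information cannot capture interactions among the selection variables.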
Suppose that the needed data are available for all four methods. How does one then choose the optimal method? The performance of the methods is driven by the relationship between the selection-only variables, the disease outcome (D), and the predictors in the disease regression model, as well as by how close we are to knowing the true selection mechanism. In this paper, we considered three simulation set-ups under all the Directed Acyclic Graph (DAG) settings in Figure 2 to study the performance of the four above-mentioned methods. In summary, we find that PL and CL are more robust to DAG complexity, but they are highly sensitive to selection model misspecification. The exact opposite pattern is observed for the other two methods, SR and PS. The simulation results suggest that, in a given problem, the more we know about the selection mechanism and have measured the variables driving it in internal and external data, the better our chances are with methods that use individual-level data, in particular PL. Under misspecification of the selection model and outcome model, we are better off with summary-level methods, in particular PS. In practice, the selection model will be hard to pin down in many cases. It is then natural to ask whether it is worth the effort to curate individual-level data on the external sample. To answer this question, we implemented a new simulation setting. Under this set-up, we changed the functional form of the internal selection model to
| (15) |
where . Essentially, we augmented the internal selection model of set-up 3 in our primary simulation study by incorporating an additional interaction term between and W, as seen in the above equation. The choice of the selection parameters shows that this particular setting falls under DAG 4 in Figure 2. For the individual-level methods, PL and SR, we considered two implementations. In the first, we included all two-way interaction terms between D, , and W in the selection model. In the second, the models were fitted without any interactions. The selection weights for PS are estimated using the available joint distributions of the selection variables from the target population, so their estimation does not incorporate the exact parametric structure of the selection model. The selection model for CL could only accommodate the main effects of D, , and W, because only marginal means are available for the selection variables. Thus, even though the actual specification of the internal selection model was known, incorporating interaction terms in the implementation of CL was not feasible.
Online supplementary material, Table S1 shows that under this new simulation set-up, PL accurately estimates both disease model parameters when the interaction term is incorporated in the selection model, exhibiting low relative percentage biases of 0.70% and 1.05%, respectively, with RMSEs of 0.27 and 1.12. Conversely, omitting the interaction term results in larger biases (of at least 31%) and larger RMSEs (by at least a factor of 3) for both parameters. SR incorporates simplex and multinomial regression models to estimate selection weights. SR has high bias and RMSE with and without interaction terms, as it misspecifies the underlying logistic model in both cases. For PS, introducing the new interaction term leads to greater misspecification of the underlying selection probabilities, which rely on categorized versions of the continuous variables, including W, and subsequently increases bias and RMSE in estimation (by at least 45.17% and a factor of 7, respectively), as observed from online supplementary material, Table S1. Similar to PL, CL is highly sensitive to selection model misspecification. As mentioned earlier, under this set-up we do not have access to enough information to incorporate interaction terms into the selection weights estimation using CL. This leads to model misspecification due to the limited availability of external information. The relative biases and RMSEs for both parameters are at least 40% and a factor of 5, respectively. In summary, collecting individual-level data from the external probability sample can lead to improved performance of the PL method when we have accurate knowledge about a complex selection mechanism.
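For readers who wish to compute summaries of this kind from their own simulation replicates, the following short Python sketch shows one common way to calculate relative percentage bias and RMSE. The definitions used here are the usual textbook ones and are an assumption. The exact definitions used for Table S1 (for example, whether RMSE is reported relative to a reference method) may differ. The function names and the replicate values are hypothetical.

```python
import math


def relative_bias_pct(estimates, truth):
    """Relative percentage bias: 100 * (mean estimate - truth) / |truth|."""
    mean_est = sum(estimates) / len(estimates)
    return 100.0 * (mean_est - truth) / abs(truth)


def rmse(estimates, truth):
    """Root mean squared error of the estimates around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


# Hypothetical estimates of one parameter across four simulation replicates.
estimates = [0.9, 1.1, 1.0, 1.2]
truth = 1.0
print(relative_bias_pct(estimates, truth), rmse(estimates, truth))
```

Reporting both quantities is useful because bias alone can look small when positive and negative errors cancel across replicates, whereas RMSE also reflects the variability of the estimator.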
In the data example presented in Section 4, we conducted an age-adjusted association analysis between cancer and biological sex for MGI patients. Within the MGI dataset, the selection variables are known, and we also have access to individual-level data on these variables from an external probability sample (NHANES in this case). Importantly, in this context, PL exhibits the lowest variance when compared with the other methods. As observed from Figure 6b, these findings align with the above simulation results.
In reality, it is not possible to know the exact selection model. Certain degrees of misspecification are inevitable, taking the form of either functional form misspecification or the absence of knowledge regarding certain selection variables. One way to address this issue is to extend PL and SR with a lasso-based method that selects among all the possible interactions between the selection variables. The selection mechanism can also be modelled using a more robust nonparametric method. Extensions of common doubly robust methods (Chen et al., 2020) can be a suitable alternative to the approaches suggested above. However, such intricate extensions are feasible primarily when individual-level external data are available, as these methods require information on higher-order moments and covariance terms between the various selection variables.
This work has several limitations. All the methods we consider suffer when the selection probability model is misspecified. We only considered functional misspecification of the selection model in our simulation studies, but there will likely be many omitted covariates. It is nearly impossible to measure all the variables driving selection. Gathering more data on a representative subsample of the population embedded within EHR may also lead to a more substantial reduction of bias. Chart review (Yin et al., 2022), multi-wave sampling (Liu et al., 2022), and double sampling approaches (Chen and Chen, 2000) should also be considered as possible avenues. We also ignored selection model uncertainty in the simplex regression method. The bootstrap can offer a potential solution for consistent variance estimation. Finally, as described in Table 1, selection bias occurs not in isolation but in conjunction with several other sources of bias, for example outcome misclassification (Beesley, Fritsche, et al., 2020). We need sensitivity analysis tools and source-of-bias diagnostics for EHR data to identify a hierarchy of the different sources of bias for a given problem. In this analysis, we did not consider the time stamps of the observations in longitudinal EHR data. The relationships between covariates and outcomes in the DAGs are highly dependent on their relative ordering. Extension of the discussed methods to longitudinal data may address this issue.
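The bootstrap idea mentioned above can be sketched in Python as follows. The key point is that the weights are re-estimated inside every resample, so that the uncertainty from weight estimation is carried into the standard error. For brevity the sketch uses a poststratified prevalence as a stand-in estimator rather than the simplex regression method, and it treats the target stratum shares as known. If the weights depended on an external probability sample, that sample would also need to be resampled. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def poststratified_prevalence(rows, target_share):
    """Weighted outcome prevalence with weights re-estimated from the given rows.

    rows: list of (stratum, outcome) pairs; target_share: stratum -> population share.
    """
    n = len(rows)
    sample_share = {s: k / n for s, k in Counter(s for s, _ in rows).items()}
    weights = [target_share[s] / sample_share[s] for s, _ in rows]
    return sum(w * d for w, (_, d) in zip(weights, rows)) / sum(weights)


def bootstrap_se(rows, estimator, n_boot=500, seed=0):
    """Standard error from resampling units and re-running the whole estimator."""
    rng = random.Random(seed)
    n = len(rows)
    reps = []
    for _ in range(n_boot):
        resample = [rows[rng.randrange(n)] for _ in range(n)]
        reps.append(estimator(resample))
    mean = sum(reps) / n_boot
    return (sum((r - mean) ** 2 for r in reps) / (n_boot - 1)) ** 0.5


# Hypothetical data: (stratum, binary outcome); stratum "a" is over-represented.
rows = [("a", 1)] * 30 + [("a", 0)] * 30 + [("b", 1)] * 10 + [("b", 0)] * 30
target_share = {"a": 0.5, "b": 0.5}

point = poststratified_prevalence(rows, target_share)
se = bootstrap_se(rows, lambda r: poststratified_prevalence(r, target_share))
print(point, se)
```

Passing the estimator as a function keeps the resampling loop independent of the weighting method, so the same loop could in principle wrap any of the weighted estimators discussed in this paper.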
Finally, the creation of nationally integrated databases, where all health encounters for everyone are recorded in the same data system, will enable researchers to harness the full potential of real-world healthcare data for everyone, not just for some selected (often historically privileged) subpopulations. The use of exclusionary cohorts and data disparity is at the heart of fairness in modern machine learning methods (Mhasawade et al., 2021; Parikh et al., 2019). In that sense, the equal probability sample selection method (EPSEM) is a tool to ensure equity and fairness in data science. In the absence of EPSEM in real-world data, thinking about selection bias is at the heart of doing inclusive science with data. Our hope is that our paper will contribute to that important discourse.
Acknowledgments
The authors thank Professor Ruth Keogh and organizers and attendees of the symposium on 50 years of the Cox Model for including this work in the programme and providing feedback during the presentation.
Contributor Information
Ritoban Kundu, Department of Biostatistics, University of Michigan, Ann Arbor, USA.
Xu Shi, Department of Biostatistics, University of Michigan, Ann Arbor, USA.
Jean Morrison, Department of Biostatistics, University of Michigan, Ann Arbor, USA.
Jessica Barrett, MRC Investigator, Biostatistics Unit, Medical Research Council, University of Cambridge, Cambridge, UK.
Bhramar Mukherjee, Department of Biostatistics and Epidemiology, University of Michigan, Ann Arbor, USA.
Funding
This research is supported by National Science Foundation Division of Mathematical Sciences 1712933, National Institutes of Health/National Cancer Institute CA267907, and National Institutes of Health R01GM139926.
Data availability
All correspondence should be directed to Bhramar Mukherjee (bhramar@umich.edu). All code is available at https://github.com/Ritoban1/Short-Note-Selection-Bias.git. Michigan Genomics Initiative data are available to select researchers after institutional review board approval. See https://precisionhealth.umich.edu/our-research/michigangenomics/.
Supplementary material
Supplementary material is available online at Journal of the Royal Statistical Society: Series A.
References
- Abbasizanjani H., Torabi F., Bedston S., Bolton T., Davies G., Denaxas S., Griffiths R., Herbert L., Hollings S., Keene S., Khunti K., Lowthian E., Lyons J., Mizani M. A., Nolan J., Sudlow C., Walker V., Whiteley W., & Wood A., …CVD-COVID-UK/COVID-IMPACT Consortium (2023). Harmonising electronic health records for reproducible research: Challenges, solutions and recommendations from a UK-wide COVID-19 research collaboration. BMC Medical Informatics and Decision Making, 23(1), 1–15. 10.1186/s12911-022-02093-0
- All Of Us Research Programs Investigators (2019). The “All of Us” research program. New England Journal of Medicine, 381(7), 668–676. 10.1056/NEJMsr1809937
- Almeida J. R., Silva L. B., Bos I., Visser P. J., & Oliveira J. L. (2021). A methodology for cohort harmonisation in multicentre clinical research. Informatics in Medicine Unlocked, 27, 100760. 10.1016/j.imu.2021.100760
- Barndorff-Nielsen O. E., & Jørgensen B. (1991). Some parametric models on the simplex. Journal of Multivariate Analysis, 39(1), 106–116. 10.1016/0047-259X(91)90008-P
- Beesley L. J., Fritsche L. G., & Mukherjee B. (2020). An analytic framework for exploring sampling and observation process biases in genome and phenome-wide association studies using electronic health records. Statistics in Medicine, 39(14), 1965–1979. 10.1002/sim.v39.14
- Beesley L. J., & Mukherjee B. (2022a). Case studies in bias reduction and inference for electronic health record data with selection bias and phenotype misclassification. Statistics in Medicine, 41(28).
- Beesley L. J., & Mukherjee B. (2022b). Statistical inference for association studies using electronic health records: Handling both selection bias and outcome misclassification. Biometrics, 78(1), 214–226. 10.1111/biom.v78.1
- Beesley L. J., Salvatore M., Fritsche L. G., Pandit A., Rao A., Brummett C., Willer C. J., Lisabeth L. D., & Mukherjee B. (2020). The emerging landscape of health research based on biobanks linked to electronic health records: Existing resources, statistical challenges, and potential opportunities. Statistics in Medicine, 39(6), 773–800. 10.1002/sim.v39.6
- Bradley V. C., Kuriwaki S., Isakov M., Sejdinovic D., Meng X.-L., & Flaxman S. (2021). Unrepresentative big surveys significantly overestimated US vaccine uptake. Nature, 600(7890), 695–700. 10.1038/s41586-021-04198-4
- Chen Y., Li P., & Wu C. (2020). Doubly robust inference with nonprobability survey samples. Journal of the American Statistical Association, 115(532), 2011–2021. 10.1080/01621459.2019.1677241
- Chen Y., Wang J., Chubak J., & Hubbard R. A. (2019). Inflation of type I error rates due to differential misclassification in EHR-derived outcomes: Empirical illustration using breast cancer recurrence. Pharmacoepidemiology and Drug Safety, 28(2), 264–268. 10.1002/pds.v28.2
- Chen Y.-H., & Chen H. (2000). A unified approach to regression analysis under double-sampling designs. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(3), 449–460. 10.1111/1467-9868.00243
- Christensen K., Holm N., Olsen J., Kock K., & Fogh-Andersen P. (1992). Selection bias in genetic-epidemiological studies of cleft lip and palate. American Journal of Human Genetics, 51(3), 654. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1682715/pdf/ajhg00067-0208.pdf
- Cornfield J., Haenszel W., Hammond E. C., Lilienfeld A. M., Shimkin M. B., & Wynder E. L. (1959). Smoking and lung cancer: Recent evidence and a discussion of some questions. Journal of the National Cancer Institute, 22(1), 173–203. 10.1093/jnci/22.1.173
- Dempster A. P., Laird N. M., & Rubin D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22. 10.1111/j.2517-6161.1977.tb01600.x
- Denny J. C., Bastarache L., Ritchie M. D., Carroll R. J., Zink R., Mosley J. D., Field J. R., Pulley J. M., Ramirez A. H., Bowton E., Basford M. A., Carrell D. S., Peissig P. L., Kho A. N., Pacheco J. A., Rasmussen L. V., Crosslin D. R., Crane P. K., Pathak J., & Roden D. M. (2013). Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nature Biotechnology, 31(12), 1102–1111. 10.1038/nbt.2749
- Deville J.-C., & Särndal C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418), 376–382. 10.1080/01621459.1992.10475217
- Doove L. L., Van Buuren S., & Dusseldorp E. (2014). Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, 92–104. 10.1016/j.csda.2013.10.025
- Elliot M. R. (2009). Combining data from probability and non-probability samples using pseudo-weights. Survey Practice, 2(6), 2982. 10.29115/SP-2009-0025
- Ferrari S., & Cribari-Neto F. (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31(7), 799–815. 10.1080/0266476042000214501
- Fritsche L. G., Beesley L. J., VandeHaar P., Peng R. B., Salvatore M., Zawistowski M., Gagliano Taliun S. A., Das S., LeFaive J., Kaleba E. O., Klumpner T. T., Moser S. E., Blanc V. M., Brummett C. M., Kheterpal S., Abecasis G. R., Gruber S. B., & Mukherjee B. (2019). Exploring various polygenic risk scores for skin cancer in the phenomes of the Michigan genomics initiative and the UK Biobank with a visual catalog: PRSWeb. PLoS Genetics, 15(6), e1008202. 10.1371/journal.pgen.1008202
- Fry A., Littlejohns T. J., Sudlow C., Doherty N., Adamska L., Sprosen T., Collins R., & Allen N. E. (2017). Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. American Journal of Epidemiology, 186(9), 1026–1034. 10.1093/aje/kwx246
- Fu S., Leung L. Y., Raulli A.-O., Kallmes D. F., Kinsman K. A., Nelson K. B., Clark M. S., Luetmer P. H., Kingsbury P. R., Kent D. M., & Liu H. (2020). Assessment of the impact of EHR heterogeneity for clinical research through a case study of silent brain infarction. BMC Medical Informatics and Decision Making, 20(1), 1–12. 10.1186/s12911-020-1072-9
- Galimard J.-E., Chevret S., Protopopescu C., & Resche-Rigon M. (2016). A multiple imputation approach for MNAR mechanisms compatible with Heckman’s model. Statistics in Medicine, 35(17), 2907–2920. 10.1002/sim.v35.17
- Geneletti S., Richardson S., & Best N. (2009). Adjusting for selection bias in retrospective, case–control studies. Biostatistics, 10(1), 17–31. 10.1093/biostatistics/kxn010
- Glynn E. F., & Hoffman M. A. (2019). Heterogeneity introduced by EHR system implementation in a de-identified data resource from 100 non-affiliated organizations. JAMIA Open, 2(4), 554–561. 10.1093/jamiaopen/ooz035
- Haneuse S., & Daniels M. (2016). A general framework for considering selection bias in EHR-based studies: What data are observed and why? eGEMs (Generating Evidence & Methods to Improve Patient Outcomes), 4(1), 16. 10.13063/2327-9214.1203
- Heart T., Ben-Assuli O., & Shabtai I. (2017). A review of PHR, EMR and EHR integration: A more personalized healthcare and public health policy. Health Policy and Technology, 6(1), 20–25. 10.1016/j.hlpt.2016.08.002
- Heintzman J., Marino M., Hoopes M., Bailey S. R., Gold R., O’Malley J., Angier H., Nelson C., Cottrell E., & Devoe J. (2015). Supporting health insurance expansion: Do electronic health records have valid insurance verification and enrollment data? Journal of the American Medical Informatics Association, 22(4), 909–913. 10.1093/jamia/ocv033
- Hernán M. A., Hernández-Díaz S., & Robins J. M. (2004). A structural approach to selection bias. Epidemiology, 15(5), 615–625. 10.1097/01.ede.0000135174.63482.43
- Hoffmann T. J., Ehret G. B., Nandakumar P., Ranatunga D., Schaefer C., Kwok P.-Y., Iribarren C., Chakravarti A., & Risch N. (2017). Genome-wide association analyses using electronic health records identify new loci influencing blood pressure variation. Nature Genetics, 49(1), 54–64. 10.1038/ng.3715
- Holt D., & Smith T. F. (1979). Post stratification. Journal of the Royal Statistical Society: Series A (General), 142(1), 33–46. 10.2307/2344652
- Huang J., Duan R., Hubbard R. A., Wu Y., Moore J. H., Xu H., & Chen Y. (2018). PIE: A prior knowledge guided integrated likelihood estimation method for bias reduction in association studies using electronic health records data. Journal of the American Medical Informatics Association, 25(3), 345–352. 10.1093/jamia/ocx137
- Kaplan R. M., Chambers D. A., & Glasgow R. E. (2014). Big data and large sample size: A cautionary note on the potential for bias. Clinical and Translational Science, 7(4), 342–346. 10.1111/cts.2014.7.issue-4
- Kim J. K., & Park M. (2010). Calibration estimation in survey sampling. International Statistical Review, 78(1), 21–39. 10.1111/insr.2010.78.issue-1
- Kleinbaum D. G., Morgenstern H., & Kupper L. L. (1981). Selection bias in epidemiologic studies. American Journal of Epidemiology, 113(4), 452–463. 10.1093/oxfordjournals.aje.a113113
- Lipsitch M., Tchetgen E. T., & Cohen T. (2010). Negative controls: A tool for detecting confounding and bias in observational studies. Epidemiology (Cambridge, Mass.), 21(3), 383–388. 10.1097/EDE.0b013e3181d61eeb
- Little R. J. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association, 88(421), 125–134. 10.1080/01621459.1993.10594302
- Liu X., Chubak J., Hubbard R. A., & Chen Y. (2022). SAT: A Surrogate-Assisted Two-wave case boosting sampling method, with application to EHR-based association studies. Journal of the American Medical Informatics Association, 29(5), 918–927. 10.1093/jamia/ocab267
- Madigan D., Stang P. E., Berlin J. A., Schuemie M., Overhage J. M., Suchard M. A., Dumouchel B., Hartzema A. G., & Ryan P. B. (2014). A systematic statistical approach to evaluating evidence from observational studies. Annual Review of Statistics and Its Application, 1(1), 11–39. 10.1146/statistics.2013.1.issue-1
- Madow W. G., Nisselson H., Olkin I., & Rubin D. B. (1983). Incomplete data in sample surveys: Theory and bibliographies. (Vol. 2). Academic Press.
- Marcoulides G. A., & Schumacker R. E. (2013). Advanced structural equation modeling: Issues and techniques. Psychology Press.
- Meng W., Adams M. J., Hebert H. L., Deary I. J., McIntosh A. M., & Smith B. H. (2018). A genome-wide association study finds genetic associations with broadly-defined headache in UK Biobank (N=223,773). EBioMedicine, 28, 180–186. 10.1016/j.ebiom.2018.01.023
- Mhasawade V., Zhao Y., & Chunara R. (2021). Machine learning and algorithmic fairness in public and population health. Nature Machine Intelligence, 3(8), 659–666. 10.1038/s42256-021-00373-4
- Montanari G. E., & Ranalli M. G. (2005). Nonparametric model calibration estimation in survey sampling. Journal of the American Statistical Association, 100(472), 1429–1442. 10.1198/016214505000000141
- Neuhaus J. M. (1999). Bias and efficiency loss due to misclassified responses in binary regression. Biometrika, 86(4), 843–855. 10.1093/biomet/86.4.843
- Parikh R. B., Teeple S., & Navathe A. S. (2019). Addressing bias in artificial intelligence in health care. Jama, 322(24), 2377–2378. 10.1001/jama.2019.18058
- Pendergrass S., Dudek S. M., Roden D. M., Crawford D. C., & Ritchie M. D. (2011). Visual integration of results from a large DNA biobank (BioVU) using synthesis-view. In Biocomputing 2011 (pp. 265–275). World Scientific.
- Rexhepi H., Huvila I., Åhlfeldt R.-M., & Cajander Å. (2021). Cancer patients’ information seeking behavior related to online electronic healthcare records. Health Informatics Journal, 27(3). 10.1177/14604582211024708
- Roberts E. K., Gu T., Wagner A. L., Mukherjee B., & Fritsche L. G. (2022, September). Estimating COVID-19 vaccination effectiveness using electronic health records of an academic medical center in Michigan. AJPM Focus, 1(1), 100015. 10.1016/j.focus.2022.100015
- Rubin D. B. (2004). Multiple imputation for nonresponse in surveys. (Vol. 81). John Wiley & Sons.
- Seaman S. R., & Vansteelandt S. (2018). Introduction to double robust methods for incomplete data. Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 33(2), 184. 10.1214/18-STS647
- Shen C., Risk M., Schiopu E., Hayek S. S., Xie T., Holevinski L., Akin C., Freed G., & Zhao L. (2022). Efficacy of COVID-19 vaccines in patients taking immunosuppressants. Annals of the Rheumatic Diseases, 81(6), 875–880. 10.1136/annrheumdis-2021-222045
- Shi X., Miao W., Nelson J. C., & Tchetgen E. J. T. (2020). Multiply robust causal inference with double-negative control adjustment for categorical unmeasured confounding. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(2), 521–540. 10.1111/rssb.12361
- Sun J. W., Wang R., Li D., & Toh S. (2022). Use of linked databases for improved confounding control: Considerations for potential selection bias. American Journal of Epidemiology, 191(4), 711–723. 10.1093/aje/kwab299
- Toh S., García Rodríguez L. A., & Hernán M. A. (2011). Confounding adjustment via a semi-automated high-dimensional propensity score algorithm: An application to electronic medical records. Pharmacoepidemiology and Drug Safety, 20(8), 849–857. 10.1002/pds.v20.8
- Tong J., Huang J., Chubak J., Wang X., Moore J. H., Hubbard R. A., & Chen Y. (2020). An augmented estimation procedure for EHR-based association studies accounting for differential misclassification. Journal of the American Medical Informatics Association, 27(2), 244–253. 10.1093/jamia/ocz180
- Vogan K. (2022). Global biobank meta-analysis. Nature Genetics, 54(12), 1764. 10.1038/s41588-022-01264-z
- Wang E. C.-H., & Wright A. (2020). Characterizing outpatient problem list completeness and duplications in the electronic health record. Journal of the American Medical Informatics Association, 27(8), 1190–1197. 10.1093/jamia/ocaa125
- Wu C. (2003). Optimal calibration estimators in survey sampling. Biometrika, 90(4), 937–951. 10.1093/biomet/90.4.937
- Yin Z., Tong J., Chen Y., Hubbard R. A., & Tang C. Y. (2022). A cost-effective chart review sampling design to account for phenotyping error in electronic health records (EHR) data. Journal of the American Medical Informatics Association, 29(1), 52–61. 10.1093/jamia/ocab222
- Zawistowski M., Fritsche L. G., Pandit A., Vanderwerff B., Patil S., Schmidt E. M., VandeHaar P., Willer C. J., Brummett C. M., Kheterpal S., Zhou X., Boehnke M., Abecasis G. R., & Zöllner S. (2023, Jan 31). The Michigan Genomics Initiative: A biobank linking genotypes and electronic clinical records in Michigan Medicine patients. Cell Genomics, 3(2), 100257. 10.1016/j.xgen.2023.100257
- Zhang P., Qiu Z., & Shi C. (2016). simplexreg: An R package for regression analysis of proportional data using the simplex distribution. Journal of Statistical Software, 71(11), 1–21. 10.18637/jss.v071.i11