American Journal of Epidemiology. 2022 Jan 6;191(4):711–723. doi: 10.1093/aje/kwab299

Use of Linked Databases for Improved Confounding Control: Considerations for Potential Selection Bias

Jenny W Sun, Rui Wang, Dongdong Li, Sengwee Toh
PMCID: PMC9430441  PMID: 35015823

Abstract

Pharmacoepidemiologic studies are increasingly conducted within linked databases, often to obtain richer confounder data. However, the potential for selection bias is frequently overlooked when linked data are available only for a subset of patients. We highlight the importance of accounting for potential selection bias by evaluating the association between antipsychotics and type 2 diabetes in youths within a claims database linked to a smaller laboratory database. We used inverse probability of treatment weights (IPTW) to control for confounding. In analyses restricted to the linked cohorts, we applied inverse probability of selection weights (IPSW) to create a population representative of the full cohort. We used pooled logistic regression weighted by IPTW only or by both IPTW and IPSW to estimate treatment effects. Metabolic conditions were more prevalent in the linked cohorts than in the full cohort. Within the full cohort, the confounding-adjusted hazard ratio was 2.26 (95% CI: 2.07, 2.49) comparing initiation of antipsychotics with initiation of control medications. Within the linked cohorts, a different magnitude of association was obtained without adjustment for selection, whereas applying IPSW resulted in point estimates similar to the full cohort's (e.g., an adjusted hazard ratio of 1.63 became 2.12). Linked database studies may generate biased estimates without proper adjustment for potential selection bias.

Keywords: health-care databases, linked data, pharmacoepidemiology, selection bias

Abbreviations

CI, confidence interval; HR, hazard ratio; IPSW, inverse probability of selection weights; IPTW, inverse probability of treatment weights; T2D, type 2 diabetes.

Health-care databases, such as administrative claims data, electronic health records, and registries, are widely used in pharmacoepidemiology and health services research. However, the data in these databases are not collected for research purposes and often lack information on important confounders (1). With the widespread availability of health-care databases, records for the same patient may be available across different data sources (2, 3). As a result, richer patient data can be obtained through data linkage (4–11). Guidance on the feasibility of data linkage and recommendations for transparent reporting have been published elsewhere (12–15).

One of the advantages of a linked database study is improved confounding control, but the potential for selection bias is often overlooked. Typically, data linkage is feasible for only a subset of the study population. For example, a claims database from a health plan may be linked to an electronic health record database from a delivery system, but only among patients who appear in both data sources. Therefore, linked database studies are generally restricted to a subset of the original study population (16–20). If the subset of linked patients is not representative of the original study population, then restricting an analysis to the linked population may introduce bias (21).

Here we highlight the importance of considering potential selection bias when working with linked data sources to estimate treatment effects in the original study population (i.e., the target population of interest). We also demonstrate how available analytical approaches can be used to account for this potential bias.

METHODS

Application example

We applied the proposed approaches to evaluate the association of antipsychotics and type 2 diabetes (T2D) in youths (aged 5–24 years). In young patients, antipsychotic use is associated with a 2- to 3-fold increased risk of developing T2D (22, 23), as well as an increased risk of other adverse cardiometabolic side effects, such as weight gain and lipid and glucose abnormalities (24, 25). Therefore, the American Diabetes Association recommends metabolic monitoring for youths treated with antipsychotics (26). Metabolic screening prior to treatment initiation is intended to guide treatment decision making, so patients with poor metabolic health at baseline may choose an alternative treatment. A study evaluating antipsychotics and the risk of T2D within a database that does not capture laboratory data may have residual confounding by unmeasured metabolic test results.

Definitions

We defined the “primary data set” as the data set in which the original study population was identified (and therefore the target population about which we would like to make inferences) and the “supplemental data set” as the data set containing additional covariate data that were not available in the primary data set. The “linked cohort” consisted of patients in the primary data set who were linked to the supplemental data set.

Data sources

The primary data set was the IBM MarketScan Commercial Database (MarketScan; IBM Watson Health, Cambridge, Massachusetts; January 2010 to March 2019), a nationwide claims database in the United States (27). To obtain additional confounder data, we identified a supplemental data set that captures test results from select laboratory networks for patients who have laboratory tests ordered (IBM MarketScan Lab Database). A previous study showed that the distribution of laboratory results within this database is representative of the general US population (28). Using the deidentified patient enrollment identification number, we linked records across data sources for the subset of MarketScan patients who had a test result available within the laboratory database. The laboratory database includes results from dozens of tests. For the application example, we were interested in 3 of these tests: hemoglobin A1c (HbA1c), cholesterol (total, high-density lipoprotein, low-density lipoprotein), and triglycerides. This study was approved by the Institutional Review Board of Harvard Pilgrim Health Care with a waiver of informed consent.

Study population

We defined the study population within the claims database (primary data set) as youths aged 5–24 years who initiated an antipsychotic medication or a comparator psychotropic medication. The exposed group included initiators of an antipsychotic medication. The date of the first observed dispensing for an antipsychotic served as the index date. The comparator group consisted of new users of other psychotropic drugs (antidepressants, medications for attention-deficit/hyperactivity disorder, and mood stabilizers; details in Web Table 1 available at https://doi.org/10.1093/aje/kwab299). The date of the first observed dispensing for a comparator drug served as the index date. In the comparator group, we required no previous use of the initiation drug, but use of the other comparator drugs was allowed. For example, antidepressant initiators were required to have no previous antidepressant dispensings, but previous dispensings of medications for attention-deficit/hyperactivity disorder or mood stabilizers were allowed.

Patients who did not have continuous medical and pharmacy enrollment, had a diagnosis of diabetes or a dispensing for an antihyperglycemic medication, or were pregnant during the 180 days prior to the index date were excluded. Youths were additionally required to have ≥1 mental health diagnosis on the index date or any point prior to the index date. In the comparator group, we also required patients to have no antipsychotic use in the 180 days prior to the index date.

Linked subset

For patients with data available in both the primary and supplemental data sets, we varied the definition of the linked cohort as follows (illustrated in Figure 1):

Figure 1. Overview of study cohorts, IBM MarketScan Data, United States, 2010–2019. We implemented the following framework for identifying the study cohorts: The study population was defined within the primary database. This population reflected the target population that we would like to make inferences about. To obtain additional confounder data, the primary database was linked to the supplemental database. Several definitions were considered for identifying the subset of patients who appeared in both data sources (linked cohort 1, linked cohort 2, linked cohort 3). In our study, the IBM MarketScan Commercial Database served as the primary database and the IBM MarketScan Lab Database served as the supplemental database.

  1. Linked cohort 1: linked patients with data for any laboratory test (not limited to the 3 tests of interest) during the study period. This linked cohort included eligible patients who appeared in both data sets, but some of these patients might not have a recorded measurement for the tests of interest during the covariate assessment period (defined in the next section).

  2. Linked cohort 2: linked patients with data for any laboratory test (not limited to the 3 tests of interest) during the covariate assessment period. This linked cohort included eligible patients who had any supplemental data available during the covariate assessment period. As in linked cohort 1, some patients might not have a recorded measurement for the tests of interest during the covariate assessment period.

  3. Linked cohort 3: linked patients with a recorded result for each of the 3 tests of interest during the covariate assessment period. This linked cohort included eligible patients who would be included in a conventional complete-case analysis.

Patient characteristics

We measured baseline covariates during the 180 days prior to the index date (covariate assessment period). Within the claims database, we identified several characteristics as potential confounders or proxies of confounders, including demographic factors, metabolic conditions, psychiatric conditions, laboratory tests ordered, lifestyle factors, medications, indicators of health-care utilization, and the pediatric comorbidity index (full list in Web Table 2) (29). As is typically done in claims database studies, patients were defined as not having a given characteristic (e.g., depression) unless they had a recorded diagnosis or dispensing in the database. Therefore, there were no missing values among the claims-based covariates.

Within the laboratory database, we obtained test results for hemoglobin A1c, cholesterol, and triglycerides. Implausibly extreme laboratory values were set to missing (details in Web Table 2). When there were multiple records of the test result available, we used only the value closest to the index date.
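To make this covariate-assessment step concrete, the following is a minimal sketch of selecting the laboratory value closest to the index date after setting implausible results to missing. The study itself was implemented in SAS (code in the Web Material); this Python snippet is illustrative only, and the column names (patient_id, test, value, result_date, index_date) and plausibility bounds are assumptions rather than the authors' variable names or cutoffs.

```python
# Hedged sketch: closest-to-index laboratory value per patient and test.
import pandas as pd

def closest_baseline_value(labs: pd.DataFrame, bounds: dict) -> pd.DataFrame:
    """labs: one row per patient/test/result_date; returns one row per patient/test."""
    labs = labs.copy()
    # Set implausibly extreme results to missing (bounds supplied per test; illustrative)
    for test, (lo, hi) in bounds.items():
        bad = labs["test"].eq(test) & ~labs["value"].between(lo, hi)
        labs.loc[bad, "value"] = pd.NA
    # Keep results in the 180-day covariate assessment period before the index date
    gap = (labs["index_date"] - labs["result_date"]).dt.days
    labs = labs[gap.between(0, 180)].assign(gap=gap)
    # Within each patient and test, keep the result closest to the index date
    idx = labs.groupby(["patient_id", "test"])["gap"].idxmin()
    return labs.loc[idx].drop(columns="gap")
```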

Outcome

We followed patients until the onset of T2D, end of insurance coverage, or end of available data. We defined cases using a previously validated algorithm for identifying T2D in children using health-care databases (positive predictive value = 87%) (30). This definition was based on the presence of inpatient or outpatient diagnosis codes for T2D and use of antihyperglycemic medications.
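As a hedged illustration of this follow-up definition, the sketch below computes each patient's follow-up end and event indicator as the earliest of T2D onset, disenrollment, or end of available data. Column names are assumptions, not the authors' variable names.

```python
# Hedged sketch: follow-up end = first of T2D onset, disenrollment, or end of data.
import pandas as pd

def define_followup(cohort: pd.DataFrame, data_end: pd.Timestamp) -> pd.DataFrame:
    cohort = cohort.copy()
    cohort["data_end"] = data_end
    # Earliest of the three candidate dates (missing T2D dates are ignored)
    cohort["followup_end"] = cohort[["t2d_date", "disenroll_date", "data_end"]].min(axis=1)
    # Event indicator: follow-up ended because T2D occurred
    cohort["event"] = cohort["t2d_date"].eq(cohort["followup_end"]).astype(int)
    cohort["followup_days"] = (cohort["followup_end"] - cohort["index_date"]).dt.days
    return cohort
```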

Statistical analysis

Descriptive statistics.

First, we examined whether patients within the linked cohorts were representative of the full cohort by comparing the distributions of baseline patient characteristics across cohorts. Then we compared the distributions of patient characteristics between treatment groups within each cohort to assess for potential confounding.

Adjusting for confounding only.

We used inverse probability of treatment weights (IPTW) to adjust for baseline confounding (31). We estimated stabilized treatment weights as the marginal probability of treatment divided by the probability of treatment conditional on measured baseline covariates, separately for the full cohort and each of the linked cohorts. Then we truncated weights at the 99th and 1st percentiles to prevent outliers from influencing the analysis (32). For each cohort, we applied 2 levels of baseline confounding adjustment: 1) adjusted for claims-based covariates only and 2) adjusted for claims-based and laboratory covariates. To estimate the association of antipsychotics and T2D, we estimated hazard ratios (HRs) using pooled logistic regression models weighted by IPTW. These analyses did not account for potential selection bias (see next section).
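For readers who want a concrete picture of the weighting step, the following is a minimal sketch of stabilized IPTW with truncation at the 1st and 99th percentiles. The authors' own implementation was in SAS (Web Material); the data frame layout, binary treatment column, and covariate list here are illustrative assumptions.

```python
# Hedged sketch of stabilized inverse probability of treatment weights (IPTW).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def stabilized_iptw(df: pd.DataFrame, treatment: str, covariates: list) -> pd.Series:
    """Return stabilized treatment weights, truncated at the 1st/99th percentiles."""
    a = df[treatment].to_numpy()
    # Denominator: P(A = observed treatment | X), from a logistic propensity score model
    ps_model = LogisticRegression(max_iter=1000).fit(df[covariates], a)
    p_treat = ps_model.predict_proba(df[covariates])[:, 1]
    denom = np.where(a == 1, p_treat, 1 - p_treat)
    # Numerator: marginal probability of the observed treatment (stabilization)
    p_marg = a.mean()
    numer = np.where(a == 1, p_marg, 1 - p_marg)
    w = numer / denom
    # Truncate extreme weights to limit the influence of outliers
    lo, hi = np.percentile(w, [1, 99])
    return pd.Series(np.clip(w, lo, hi), index=df.index, name="iptw")
```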

For analyses adjusted for laboratory covariates, we used multiple imputation by chained equations to handle missing laboratory data using the PROC MI procedure in SAS, version 9.4 (SAS Institute, Inc., Cary, North Carolina) (33, 34). We performed imputation on the continuous laboratory covariates and then dichotomized the respective variables to define high total cholesterol, high low-density lipoprotein cholesterol, and high triglyceride levels. The imputation models included the outcome and all previously defined claims-based and laboratory covariates. We assumed that the laboratory covariates were missing at random. We created 20 imputed data sets and specified a multivariate normal distribution for the imputation of continuous laboratory covariates. We fitted separate treatment weights and pooled logistic regression models (as described above) for each imputation (35, 36) and then used Rubin’s rules to pool HRs and 95% confidence intervals (CIs) across imputations (37).
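The pooling step applies Rubin's rules on the log hazard ratio scale. A minimal sketch, assuming the per-imputation log-HRs and their standard errors have already been extracted from the weighted models:

```python
# Hedged sketch of Rubin's rules for pooling hazard ratios across m imputed data sets.
import numpy as np

def pool_hazard_ratios(log_hrs, ses):
    """log_hrs, ses: per-imputation log hazard ratios and their standard errors."""
    log_hrs, ses = np.asarray(log_hrs), np.asarray(ses)
    m = len(log_hrs)
    qbar = log_hrs.mean()                    # pooled point estimate (log scale)
    within = (ses ** 2).mean()               # within-imputation variance
    between = log_hrs.var(ddof=1)            # between-imputation variance
    total = within + (1 + 1 / m) * between   # Rubin's total variance
    se = np.sqrt(total)
    # Normal approximation for the CI; Rubin's rules formally use a t reference
    # distribution with estimated degrees of freedom.
    hr = np.exp(qbar)
    ci = (np.exp(qbar - 1.96 * se), np.exp(qbar + 1.96 * se))
    return hr, ci
```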

Adjusting for selection bias and confounding.

To account for potential selection bias in restricting analyses to the linked cohorts, we applied inverse probability of selection weights (IPSW) (38, 39). Within each treatment group, we estimated stabilized selection weights as the marginal probability of being in the respective linked cohort divided by the probability of being in the respective linked cohort conditional on all previously described claims-based covariates. Then we truncated weights at the 99th and 1st percentiles. This weighting created a pseudopopulation in which the distributions of measured factors related to selection were expected to be balanced between the full cohort and the respective linked cohort. To evaluate the performance of these weights, we reexamined the distributions of characteristics across cohorts in the reweighted sample.
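A hedged sketch of the stabilized IPSW step, estimated separately within each treatment group as described above, is shown below. The selection indicator marks membership in the linked cohort of interest; the column names and covariate list are illustrative assumptions, and the authors' analysis was implemented in SAS.

```python
# Hedged sketch of stabilized inverse probability of selection weights (IPSW),
# estimated separately within each treatment group.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def stabilized_ipsw(df: pd.DataFrame, selected: str, treatment: str,
                    claims_covariates: list) -> pd.Series:
    """Weights are defined for all rows but applied only to the linked (selected) rows."""
    w = pd.Series(np.nan, index=df.index, name="ipsw")
    for _, grp in df.groupby(treatment):
        s = grp[selected].to_numpy()
        model = LogisticRegression(max_iter=1000).fit(grp[claims_covariates], s)
        p_sel = model.predict_proba(grp[claims_covariates])[:, 1]
        w.loc[grp.index] = s.mean() / p_sel      # stabilized: P(S=1) / P(S=1 | X)
    lo, hi = np.percentile(w, [1, 99])           # truncate at 1st/99th percentiles
    return w.clip(lo, hi)
```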

Then, within each newly defined pseudopopulation, we applied IPTW to account for baseline confounding. Specifically, we used logistic regression models weighted by IPSW to estimate stabilized treatment weights separately for each of the linked cohorts. We included the previously described baseline covariates in these weight models. By adjusting for confounding within the pseudopopulation created by IPSW, this approach would ideally create covariate balance across treatment groups (internal validity) within the target population of interest. For analyses adjusted for laboratory covariates, we filled in missing laboratory data using multiple imputation by chained equations (details above) before estimating IPTW.

To estimate treatment effects adjusted for confounding and selection bias, we fitted pooled logistic regression models weighted by IPTW and IPSW to generate HRs.
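To illustrate this final estimation step, the sketch below fits a weighted discrete-time (pooled logistic) hazard model on person-period data, using the product of IPTW and IPSW as the weight and a robust variance clustered on patient. A GEE fit with an independence working correlation is used here as one way to obtain sandwich standard errors; the data layout (one row per patient per follow-up interval with columns event, treated, period, patient_id) is an assumption, follow-up time would in practice be modeled more flexibly, and the authors' analysis was implemented in SAS.

```python
# Hedged sketch: pooled logistic (discrete-time hazard) model weighted by IPTW x IPSW,
# with a robust (sandwich) variance clustered on patient.
import numpy as np
import statsmodels.api as sm

def weighted_pooled_logistic(person_periods, iptw, ipsw):
    df = person_periods.assign(weight=iptw * ipsw)   # baseline weights, constant within patient
    X = sm.add_constant(df[["treated", "period"]])   # linear term for follow-up time (simplification)
    model = sm.GEE(df["event"], X, groups=df["patient_id"],
                   family=sm.families.Binomial(),
                   cov_struct=sm.cov_struct.Independence(),
                   weights=df["weight"])
    res = model.fit()                                # default covariance is robust (sandwich)
    log_hr, se = res.params["treated"], res.bse["treated"]
    hr = np.exp(log_hr)
    ci = (np.exp(log_hr - 1.96 * se), np.exp(log_hr + 1.96 * se))
    return hr, ci
```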

For all analyses, we computed 95% CIs for the HRs using the standard sandwich variance estimator (40) and further quantified precision using confidence limit ratios (41), the ratio of the upper limit to the lower limit of the 95% CI. We additionally estimated variance using a nonparametric bootstrapping method. We assumed that the targeted treatment effects were identifiable under the assumptions of conditional exchangeability, positivity, and causal consistency (42).
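For reference, the confidence limit ratio used to quantify precision is simply the upper 95% confidence limit divided by the lower limit; a worked example using the full-cohort claims-adjusted estimate from the Results is shown below.

```python
# Confidence limit ratio (CLR): ratio of the upper to the lower 95% confidence limit.
def confidence_limit_ratio(lower: float, upper: float) -> float:
    return upper / lower

# Example with the full-cohort claims-adjusted estimate reported in the Results
# (HR = 2.26, 95% CI: 2.07, 2.49): CLR = 2.49 / 2.07 = 1.20 (approximately).
```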

The SAS (SAS Institute, Inc.) code used to implement the main analyses is available in the Web Material.

RESULTS

Data linkage

The full cohort, identified from the claims database, consisted of 349,180 antipsychotic initiators and 2,000,308 initiators of a control medication (Figure 2). After linkage to the laboratory database, 10.7% of antipsychotic initiators and 8.4% of control patients remained in linked cohort 1. Requiring data for any laboratory test during the covariate assessment period (linked cohort 2) reduced the sample to 5.3% of antipsychotic initiators and 2.8% of control patients. Restriction to complete cases (linked cohort 3) substantially reduced the sample size (0.4% of antipsychotic initiators, 0.1% of control patients).

Figure 2. Flow diagram of cohort assembly, IBM MarketScan Data, United States, 2010–2019. A) New users aged 5–24 years of an antipsychotic medication. B) New users aged 5–24 years of a control medication (attention-deficit/hyperactivity disorder medications, antidepressants, mood stabilizers). HbA1c, hemoglobin A1c.

Patient characteristics

Compared with the full cohort, patients within the linked cohorts were slightly older (mean age of controls, 16.2 (standard deviation, 5.3) years in the full cohort versus 18.2 (standard deviation, 4.7) years in linked cohort 2; Table 1, Web Table 3). They were also more likely to have diagnoses of metabolic conditions and laboratory tests ordered, with the prevalence increasing as the definition for linked cohort became more restrictive. Notably, the prevalence of obesity or overweight diagnosis among control patients was 3.5% in the full cohort, 5.2% in linked cohort 1, 7.4% in linked cohort 2, and 26.0% in linked cohort 3. Similar trends were observed in the antipsychotic group. Other measured characteristics were generally similar across cohorts.

Table 1.

Characteristics of Patients Who Initiated Antipsychotic Treatment or Control Treatment Before and After Accounting for Selection Bias, IBM MarketScan Data, United States, 2010–2019

Columns (left to right): Characteristic; Initiators of Antipsychotics (Full Cohort, Linked Cohort 1, Linked Cohort 2, Linked Cohort 3); Initiators of Other Psychotropic Drugs (Full Cohort, Linked Cohort 1, Linked Cohort 2, Linked Cohort 3). Each cohort column reports No. and %, unless otherwise noted.
No Adjustment for Selection
Age at initiation, yearsa 16.7 (4.9) 17.2 (4.6) 17.7 (4.5) 17.6 (4.5) 16.2 (5.3) 17.4 (4.9) 18.2 (4.7) 18.8 (4.4)
Female sex 155,078 44.4 19,365 51.9 7,243 54.9 425 47.5 997,621 49.9 104,221 61.9 36,821 65.8 1,515 58.6
Pediatric comorbidity indexa 5.6 (4.9) 6.2 (4.2) 6.9 (4.2) 6.3 (4.0) 2.6 (2.7) 3.0 (2.9) 3.7 (3.1) 4.1 (3.1)
Medical diagnoses
 Obesity or overweight 16,538 4.7 2,310 6.2 1,086 8.2 142 15.9 70,254 3.5 8,678 5.2 4,118 7.4 673 26.0
 Weight management 7,235 2.1 1,203 3.2 590 4.5 44 4.9 30,784 1.5 4,175 2.5 1833 3.3 159 6.1
 Abnormal glucose/prediabetes 2,404 0.7 359 1.0 181 1.4 41 4.6 6,202 0.3 869 0.5 509 0.9 153 5.9
 Hyperlipidemia 6,381 1.8 1,058 2.8 600 4.5 93 10.4 20,521 1.0 3,158 1.9 1975 3.5 329 12.7
 Bipolar disorder 121,548 34.8 13,558 36.3 5,187 39.3 336 37.6 78,173 3.9 7,311 4.3 2,751 4.9 152 5.9
 Depression 169,212 48.5 19,562 52.4 7,300 55.3 434 48.5 596,370 29.8 56,573 33.6 21,394 38.2 1,135 43.9
 Psychotic disorders 42,016 12.0 4,785 12.8 1831 13.9 160 17.9 16,240 0.8 1,657 1.0 645 1.2 40 1.5
Laboratory tests ordered
 Comprehensive metabolic panel 125,125 35.8 16,969 45.5 8,624 65.3 738 82.6 373,700 18.7 49,460 29.4 30,241 54.1 2,181 84.3
 Glucose test 16,098 4.6 2,479 6.6 1,428 10.8 128 14.3 36,774 1.8 5,281 3.1 3,361 6.0 311 12.0
 HbA1c test 18,080 5.2 2,820 7.6 1789 13.6 812 90.8 44,708 2.2 6,871 4.1 4,775 8.5 2,390 92.4
 Lipid test 54,724 15.7 8,235 22.1 5,016 38 811 90.7 156,308 7.8 22,537 13.4 15,322 27.4 2,376 91.9
Weighted by the Inverse Probability of Selection
Age at initiation, yearsa 16.7 (4.9) 16.7 (4.3) 17.1 (4.0) 17.9 (4.4) 16.2 (5.3) 16.3 (5.3) 16.8 (5.1) 19.3 (5.5)
Female sex 155,078 44.4 13,744 44.7 4,702 49.1 263 38.4 997,621 49.9 87,829 50.7 29,688 56.7 2,491 62.3
Pediatric comorbidity indexa 5.6 (4.9) 5.7 (3.6) 6.2 (3.3) 7.3 (3.8) 2.6 (2.7) 2.6 (2.8) 3.0 (2.7) 3.9 (3.7)
Medical diagnoses
 Obesity or overweight 16,538 4.7 1,486 4.8 558 5.8 72 10.6 70,254 3.5 6,527 3.8 2,464 4.7 867 21.7
 Weight management 7,235 2.1 691 2.3 281 2.9 33 4.8 30,784 1.5 3,018 1.7 1,191 2.3 197 4.9
 Abnormal glucose/prediabetes 2,404 0.7 222 0.7 83 0.9 1 0.2 6,202 0.3 592 0.3 259 0.5 16 0.4
 Hyperlipidemia 6,381 1.8 573 1.9 220 2.3 32 4.7 20,521 1.0 1950 1.1 834 1.6 142 3.6
 Bipolar disorder 121,548 34.8 10,648 34.7 3,527 36.8 302 44.1 78,173 3.9 6,914 4.0 2,351 4.5 259 6.5
 Depression 169,212 48.5 15,029 48.9 4,982 52 370 54.0 596,370 29.8 52,611 30.4 17,340 33.1 1898 47.4
 Psychotic disorders 42,016 12.0 3,703 12.1 1,255 13.1 240 35.0 16,240 0.8 1,447 0.8 538 1.0 142 3.5
Laboratory tests ordered
 Comprehensive metabolic panel 125,125 35.8 11,149 36.3 3,923 40.9 240 35.1 373,700 18.7 34,134 19.7 12,445 23.8 661 16.5
 Glucose test 16,098 4.6 1,464 4.8 619 6.5 92 13.4 36,774 1.8 3,522 2.0 1,658 3.2 314 7.8
 HbA1c test 18,080 5.2 1,614 5.3 626 6.5 23 3.4 44,708 2.2 4,171 2.4 1791 3.4 54 1.3
 Lipid test 54,724 15.7 4,879 15.9 1774 18.5 36 5.3 156,308 7.8 14,423 8.3 5,834 11.1 97 2.4

Abbreviation: HbA1c, hemoglobin A1c.

a Values are expressed as mean (standard deviation).

After applying IPSW, the distributions of baseline patient characteristics (including metabolic conditions) in linked cohort 1 and linked cohort 2 were similar to those of the full cohort (Table 1, Web Table 4). There were residual imbalances for linked cohort 3 compared with the full cohort. Within the full cohort and linked cohorts 1 and 2, characteristics were similar between treatment groups after weighting by IPTW and IPSW, with absolute standardized differences of less than 0.10 for nearly all measured covariates (Table 2, Web Figure 1) (43).

Table 2.

Distribution of Patient Characteristics After Accounting for Potential Selection Bias and Baseline Confoundinga, IBM MarketScan Data, United States, 2010–2019

Columns (left to right): Characteristic; Initiators of Antipsychotics (Full Cohort, Linked Cohort 1, Linked Cohort 2, Linked Cohort 3); Initiators of Other Psychotropic Drugs (Full Cohort, Linked Cohort 1, Linked Cohort 2, Linked Cohort 3). Each cohort column reports No. and %, unless otherwise noted.
No. of patients 288,402 100 25,069 100 7,804 100 14,797 100 2,020,663 100 173,439 100 52,681 100 16,137 100
Demographic factors
 Age at initiation, yearsb 16.7 (4.9) 16.4 (4.0) 16.8 (3.7) 20.2 (16.3) 16.3 (5.3) 16.3 (5.1) 16.9 (5.1) 20.0 (9.9)
 Female sex 155,078 44.4 12,400 49.5 4,363 55.9 10,751 72.7 993,276 49.2 81,207 48.9 29,240 55.5 7,460 46.2
 Pediatric comorbidity indexb 5.6 (4.0) 3.7 (2.7) 4.1 (2.6) 5.3 (13.1) 3.1 (3.3) 3.2 (3.3) 3.6 (3.2) 9.5 (13.9)
Metabolic conditions
 Obesity or overweight 11,452 4.0 1,003 4.0 368 4.7 2,967 20.0 74,650 3.7 6,738 3.9 2,548 4.8 1,571 9.7
 Weight management 5,109 1.8 470 1.9 186 2.4 973 6.6 32,891 1.6 3,122 1.8 1,215 2.3 818 5.1
 Abnormal weight gain 2,708 0.9 248 1.0 100 1.3 1,664 11.2 17,259 0.9 1,566 0.9 646 1.2 282 1.7
 Abnormal glucose or prediabetes 1,255 0.4 99 0.4 48 0.6 4,802 32.5 7,487 0.4 667 0.4 275 0.5 74 0.5
 Metabolic syndrome 338 0.1 39 0.2 18 0.2 425 2.9 2025 0.1 186 0.1 76 0.1 45 0.3
 Hyperlipidemia 3,697 1.3 347 1.4 145 1.9 1,034 7.0 23,156 1.1 2,101 1.2 894 1.7 1,140 7.1
 Hypothyroidism 4,636 1.6 435 1.7 190 2.4 3,692 24.9 27,990 1.4 2,501 1.4 1,131 2.1 81 0.5
Lab tests ordered
 Comprehensive metabolic panel 71,358 24.7 6,330 25.2 2,198 28.2 9,807 66.3 436,490 21.6 38,472 22.2 13,899 26.4 9,780 60.6
 Glucose test 7,467 2.6 682 2.7 302 3.9 1708 11.5 45,888 2.3 4,159 2.4 1914 3.6 1819 11.3
 HbA1c test 8,717 3.0 778 3.1 318 4.1 14,368 97.1 53,728 2.7 4,807 2.8 2034 3.9 12,380 76.7
 Lipid test 27,361 9.5 2,500 10.0 955 12.2 9,985 67.5 179,603 8.9 16,042 9.2 6,373 12.1 9,561 59.3
Psychiatric conditions
 ADHD 57,598 20.0 4,824 19.2 1,339 17.2 1,564 10.6 472,542 23.4 39,894 23.0 11,165 21.2 1818 11.3
 Anxiety 103,698 36.0 9,038 36.1 2,913 37.3 10,375 70.1 676,179 33.5 58,712 33.9 19,353 36.7 10,414 64.5
 Autism 13,169 4.6 1,187 4.7 348 4.5 612 4.1 70,769 3.5 5,956 3.4 1,693 3.2 686 4.3
 Bipolar disorder 32,263 11.2 2,873 11.5 1,008 12.9 138 0.9 185,632 9.2 14,993 8.6 5,269 10.0 8,145 50.5
 Depression 112,228 38.9 9,891 39.5 3,357 43.0 9,377 63.4 671,151 33.2 57,974 33.4 19,239 36.5 9,741 60.4
 Psychotic disorders 9,693 3.4 882 3.5 318 4.1 172 1.2 54,744 2.7 4,326 2.5 1709 3.2 1,153 7.1
Medications
 Lithium 3,621 1.3 314 1.3 112 1.4 8 0.1 17,624 0.9 1,344 0.8 497 0.9 1,076 6.7
 Anticonvulsant mood stabilizers 28,030 9.7 2,513 10.0 845 10.8 1823 12.3 140,177 6.9 11,651 6.7 4,047 7.7 5,986 37.1
 SSRIs 153,764 53.3 13,339 53.2 4,382 56.1 7,373 49.8 964,072 47.7 84,308 48.6 27,539 52.3 7,879 48.8
 Other antidepressants 50,626 17.6 4,438 17.7 1,492 19.1 2002 13.5 271,163 13.4 22,950 13.2 7,860 14.9 4,836 30.0
 ADHD medications 133,131 46.2 11,166 44.5 3,119 40.0 4,299 29.0 1,019,952 50.5 85,993 49.6 23,645 44.9 4,151 25.7
Health-care utilization
 No. of outpatient visitsc 5 (2–10) 5 (2–10) 6 (3–11) 7 (6–12) 4 (2–8) 4 (2–10) 5 (3–9) 9 (8–13)
 No. of distinct generic drugsc 3 (1–5) 4 (2–5) 4 (3–6) 3 (2–5) 4 (2–5) 3 (2–5) 3 (2–5) 6 (2–9)
 Any hospitalization 29,066 10.1 2,583 10.3 972 12.5 281 1.9 152,044 7.5 12,546 7.2 4,833 9.2 5,085 31.5
Laboratory test resultsd
 HbA1c, %b 5.28 (0.2) 5.30 (0.2) 5.34 (0.2) 5.26 (0.9) 5.27 (0.5) 5.3 (0.4) 5.33 (0.3) 5.26 (0.8)
  Proportion missing 99.8 98.6 96.4 0.0 99.8 98.4 96.3 0.0
 Total cholesterol, mg/dLb 161.06 (26.2) 159.07 (21.5) 157.12 (21.4) 156.75 (94.8) 163.11 (42.2) 160.28 (35.3) 157.15 (32.3) 153.87 (81.8)
  Proportion missing 99.4 95.4 87.1 0.0 99.3 94.6 87.5 0.0
  Proportion high cholesterol 13.1 12.0 12.2 6.6 14.5 13.2 12.4 11.0
 LDL cholesterol, mg/dLb 90.04 (21.2) 88.75 (17.1) 87.69 (16.6) 78.04 (89.8) 90.85 (34.2) 89.42 (27.7) 87.67 (23.9) 81.72 (70.7)
  Proportion missing 99.4 95.7 88.0 0.0 99.3 95.0 88.8 0.0
  Proportion high LDL cholesterol 8.5 7.9 7.6 3.2 9.8 9.3 8.9 4.7
 HDL cholesterol, mg/dLb 51.13 (10.6) 50.83 (8.6) 50.62 (8.2) 61.33 (70.1) 51.82 (17.5) 50.79 (14.1) 49.76 (12.4) 50.25 (32.9)
   Proportion missing 99.4 95.6 87.6 0.0 99.3 94.9 88.5 0.0
 Triglycerides, mg/dLb 101.76 (43.7) 99.93 (34.9) 98.87 (33.2) 87.53 (218.0) 103.31 (67.3) 102.25 (53.4) 101.43 (42.8) 101.58 (154.2)
  Proportion missing 99.4 95.7 88.1 0.0 99.3 95.0 89.0 0.0
  Proportion high triglycerides 16.5 15.6 14.8 10.0 15.9 15.5 15.3 22.5

Abbreviations: ADHD, attention-deficit/hyperactivity disorder; HbA1c, hemoglobin A1c; HDL, high-density lipoprotein; LDL, low-density lipoprotein; SSRI, selective serotonin reuptake inhibitor.

a The mean of the inverse probability of treatment weights after truncation was as follows: 0.98 (standard deviation, 0.63) for the full cohort, 0.98 (standard deviation, 0.73) for linked cohort 1, 0.98 (standard deviation, 0.77) for linked cohort 2, and 1.00 (standard deviation, 1.02) for linked cohort 3.

b Values are expressed as mean (standard deviation).

c Values are expressed as median (interquartile range).

d Distribution of lab test results summarized among available values (before multiple imputation).

Treatment effects

In the full cohort, the unadjusted association suggested a 3-fold increased hazard of T2D among antipsychotic initiators compared with control patients (HR = 3.06, 95% CI: 2.87, 3.25; Figure 3A, Web Table 5). The magnitude of association attenuated after controlling for baseline confounders in the claims (HR = 2.26, 95% CI: 2.07, 2.49; Figure 3B) and baseline confounders in both the claims and the imputed laboratory data (HR = 2.25, 95% CI: 2.05, 2.47; Figure 3C). The distributions of laboratory test results were generally similar before and after imputation (Web Table 6).

Figure 3. Comparison of treatment effect estimates before and after accounting for selection bias and confounding, IBM MarketScan Data, United States, 2010–2019. A) Estimates with no confounding adjustment. B) Confounding adjustment by claims covariates. C) Confounding adjustment by claims and laboratory covariates. The full cohort, identified in the primary data set, was the target population of interest. The linked cohorts became progressively more restrictive: linked cohort 1 consisted of linked patients with data for any laboratory test (not limited to the 3 tests of interest) during the study period, linked cohort 2 consisted of linked patients with data for any laboratory test (not limited to the 3 tests of interest) during the covariate assessment period, and linked cohort 3 consisted of linked patients with a recorded result for each of the 3 laboratory tests of interest during the covariate assessment period (complete confounder data). Estimates reported for the full cohort were repeated after adjustment for selection bias for ease of comparison. The 95% confidence intervals (CIs) are based on standard sandwich variance estimators; see Web Table 8 for 95% CIs based on variance estimated using a nonparametric bootstrap method. HR, hazard ratio.

With no adjustment for potential selection bias, effect estimates in the linked cohorts suggested a different magnitude of association from the full cohort. The HR adjusted for claims covariates was 1.79 (95% CI: 1.41, 2.27) in linked cohort 1, 1.63 (95% CI: 0.99, 2.68) in linked cohort 2, and 0.87 (95% CI: 0.29, 2.56) in linked cohort 3 (Figure 3B). Estimates were similar after controlling for both claims and laboratory covariates (Figure 3C).

After accounting for selection bias, claims-only confounding–adjusted HRs in linked cohort 1 (HR = 2.11, 95% CI: 1.62, 2.74) and linked cohort 2 (HR = 2.12, 95% CI: 0.88, 5.13) were nearly identical to the point estimate observed in the full cohort (HR = 2.26, 95% CI: 2.07, 2.49; Figure 3B). In linked cohort 3, the claims-only confounding-adjusted estimate (HR = 8.15, 95% CI: 1.24, 53.56) remained different from the full cohort estimate, but accounting for potential selection bias corrected the direction of association (Figure 3B). After adjustment for both claims and laboratory covariates, effect estimates remained similar in linked cohort 1 and linked cohort 2, whereas the variance increased substantially in linked cohort 3 (HR = 13.90, 95% CI: 1.61, 120.09; Figure 3C). As expected, treatment effects estimated within the linked cohorts were less efficient compared with the full cohort, and weighting by IPSW resulted in even wider 95% CIs. The extent to which applying IPSW increased variance differed for each cohort (Web Table 7). For example, for claims-only confounding–adjusted estimates, IPSW had limited impact on precision (confidence limit ratio of 1.61 before IPSW vs. 1.69 after IPSW) in linked cohort 1, whereas a more substantial increase in variance was observed in linked cohort 2 (confidence limit ratio 2.71 vs. 5.83) and linked cohort 3 (confidence limit ratio 8.83 vs. 43.19). The standard errors obtained from bootstrapping were generally similar to the estimates obtained from the standard sandwich variance estimator (Web Table 8), although in the weighted settings, the estimates from bootstrapping were slightly smaller.

DISCUSSION

We highlighted the importance of accounting for potential selection bias in linked database studies using the example of antipsychotics and the risk of T2D in youths within a claims and laboratory linked database. While this linked database offered more potential for confounding control compared with the claims database alone, patients within the linked cohorts were not representative of the full claims-based cohort, and failure to account for potential selection bias resulted in incorrect point estimates. In our application example, the laboratory values within the supplemental data set were not strong confounders, and restriction to a linked cohort introduced a substantial amount of selection bias. Applying IPSW resulted in effect estimates that were comparable to the full, original study cohort (the target population), demonstrating that a valid solution exists for addressing the often-neglected issue of selection bias in linked database studies.

Studies conducted within linked databases should carefully consider who is being analyzed and what the target population is. We assumed that the target population was the full cohort identified in the primary data set, and therefore, estimates obtained within linked subsets of the full cohort could potentially be biased if selection bias was not appropriately accounted for. IPSW provided a valid approach to extending our inferences from the linked population to the full cohort. However, some linked database studies may consider the linked subset as the target population (as opposed to the full cohort). In such circumstances, effect estimates from the linked subset target the proper estimand associated with the population of interest and are not necessarily prone to selection bias (44). In other words, the estimates would be unbiased for the linked subset, but they may not be generalizable to the full cohort. Finally, some studies may consider the target population as an external population that is distinct from the study sample (i.e., partially or completely nonoverlapping with the full cohort or the linked subset). In these situations, findings from the study sample can potentially be extended to an external population using approaches for transportability (such as inverse odds weighting) that are described elsewhere (4548).

Our study highlights the importance of explicitly specifying the target population (e.g., full cohort, linked subset, an external population) and identifying an appropriate approach to generate unbiased effect estimates for the target population of interest (with the usual assumptions of conditional exchangeability, positivity, and causal consistency). We found that applying IPSW created a pseudopopulation comparable to the full cohort for 2 of the 3 linked cohorts. However, residual selection bias was present in linked cohort 3 (complete cases), which was not unexpected given the small sample size (reflecting <0.5% of the full cohort) that was substantially unrepresentative of the full cohort. A complete-case analysis is also known to be biased in most circumstances (49, 50). While we highlighted several definitions of a linked cohort, we anticipate that linked cohort 2 (any linked data during covariate assessment period) will be the most widely used in pharmacoepidemiologic studies.

To further investigate the residual biases in linked cohort 3, we explored different ways to truncate inverse probability weights and a different specification of the weight models (Web Table 9, Web Figure 2). In these exploratory analyses, we found that possible positivity violations likely generated extreme weights, which resulted in residual imbalances across treatment groups and unstable effect estimates (adjusted HR ranging from 0.63 to 11.20). By requiring a recorded test result for hemoglobin A1c, cholesterol, and triglycerides, we likely included a much greater proportion of patients who were at a higher metabolic risk in the complete-case cohort and were not able to adequately account for the selection bias due to insufficient information on patients with a lower metabolic risk (who were more common in the full cohort).

There are several potential limitations to this analysis. First, we applied the approaches to only one empirical example. In our study, the potential confounders from the supplemental database turned out not to be strong confounders, but the data linkage introduced a substantial amount of selection bias. The extent to which selection bias or confounding may be present in a linked database study may differ in other applications. Nevertheless, the outlined approaches can be applied to other linked database studies in general. Second, we used multiple imputation to handle missing laboratory data but there are other approaches, such as inverse probability weighting, that could be considered (51, 52). Third, we truncated inverse probability weights to minimize the influence of outliers, but compared with no weight truncation, this approach increased precision at the expense of potentially increasing the imbalances between treatment groups (32). Since the degree of truncation was small, it is unlikely to substantially influence our findings. Finally, as expected, we observed a bias-variance tradeoff in effect estimates weighted by IPSW. Weighted estimates can increase variances (32), and we observed that variances got progressively larger as the linked cohorts differed more from the full cohort. However, accurate point estimates are generally prioritized in nonrandomized studies to minimize bias and achieve internal validity.

CONCLUSIONS

Studies conducted within linked databases, often with the goal of improved confounding control, may be restricted to patients who are not representative of the target population of interest. Analyses conducted within linked cohorts may generate biased effect estimates for the target population of interest, but this selection bias can be reduced through inverse probability of selection weights.

Supplementary Material

Web_Material_kwab299

ACKNOWLEDGMENTS

Author affiliations: Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, Massachusetts, United States (Jenny W. Sun, Rui Wang, Dongdong Li, Sengwee Toh); and Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States (Rui Wang).

This study was funded by Harvard Medical School and Harvard Pilgrim Health Care Institute through the Thomas O. Pyle Fellowship Fund and the Agency for Healthcare Research and Quality (grant R01HS026214).

This study was based on data from the IBM MarketScan Commercial Database obtained and used under license for the present study. Restrictions apply to the availability of these data, so they are not publicly available. The data underlying the results presented in the study are available for purchase by contacting the database owners.

We thank Jenny Hochstadt for her assistance in accessing the MarketScan data.

This work was presented as a podium presentation at the 37th International Conference on Pharmacoepidemiology (online), August 23–25, 2021.

J.W.S. is currently employed by Pfizer Inc. for unrelated work. All aspects of this work included in the initial submission, including the study design, data analysis, and manuscript draft, were completed prior to employment at Pfizer Inc. The other authors report no conflicts.

REFERENCES

1. Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. J Clin Epidemiol. 2005;58(4):323–337.
2. Bradley CJ, Penberthy L, Devers KJ, et al. Health services research and data linkages: issues, methods, and directions for the future. Health Serv Res. 2010;45(5):1468–1488.
3. Trifirò G, Sultana J, Bate A. From big data to smart data for pharmacovigilance: the role of healthcare databases and other emerging sources. Drug Saf. 2018;41(2):143–149.
4. Mears GD, Rosamond WD, Lohmeier C, et al. A link to improve stroke patient care: a successful linkage between a statewide emergency medical services data system and a stroke registry. Acad Emerg Med. 2010;17(12):1398–1404.
5. García Álvarez L, Aylin P, Tian J, et al. Data linkage between existing healthcare databases to support hospital epidemiology. J Hosp Infect. 2011;79(3):231–235.
6. van Herk-Sukel MPP, Lemmens VEPP, van de Poll-Franse LV, et al. Record linkage for pharmacoepidemiological studies in cancer patients. Pharmacoepidemiol Drug Saf. 2012;21(1):94–103.
7. Harron K, Goldstein H, Wade A, et al. Linkage, evaluation and analysis of national electronic healthcare data: application to providing enhanced blood-stream infection surveillance in paediatric intensive care. PLoS One. 2013;8(12):e85278.
8. Setoguchi S, Zhu Y, Jalbert JJ, et al. Validity of deterministic record linkage using multiple indirect personal identifiers: linking a large registry to claims data. Circ Cardiovasc Qual Outcomes. 2014;7(3):475–480.
9. Patorno E, Gopalakrishnan C, Franklin JM, et al. Claims-based studies of oral glucose-lowering medications can achieve balance in critical clinical variables only observed in electronic health records. Diabetes Obes Metab. 2018;20(4):974–984.
10. Huybrechts KF, Gopalakrishnan C, Franklin JM, et al. Claims data studies of direct oral anticoagulants can achieve balance in important clinical parameters only observable in electronic health records. Clin Pharmacol Ther. 2019;105(4):979–993.
11. Schmidt M, Schmidt SAJ, Adelborg K, et al. The Danish health care system and epidemiological research: from health care contacts to database records. Clin Epidemiol. 2019;11:563–591.
12. Pratt NL, Mack CD, Meyer AM, et al. Data linkage in pharmacoepidemiology: a call for rigorous evaluation and reporting. Pharmacoepidemiol Drug Saf. 2020;29(1):9–17.
13. Rivera DR, Gokhale MN, Reynolds MW, et al. Linking electronic health data in pharmacoepidemiology: appropriateness and feasibility. Pharmacoepidemiol Drug Saf. 2020;29(1):18–29.
14. Lin KJ, Schneeweiss S. Considerations for the analysis of longitudinal electronic health records linked to claims data to study the effectiveness and safety of drugs. Clin Pharmacol Ther. 2016;100(2):147–159.
15. Dusetzina SB, Tyree S, Meyer A-M, et al. Linking Data for Health Services Research: A Framework and Instructional Guide. Rockville, MD: Agency for Healthcare Research and Quality (US); 2014. https://www.ncbi.nlm.nih.gov/books/NBK253313/. Accessed November 19, 2020.
16. Mansfield KE, Nitsch D, Smeeth L, et al. Prescription of renin–angiotensin system blockers and risk of acute kidney injury: a population-based cohort study. BMJ Open. 2016;6(12):e012690.
17. Bouras G, Markar SR, Burns EM, et al. The psychological impact of symptoms related to esophagogastric cancer resection presenting in primary care: a national linked database study. Eur J Surg Oncol. 2017;43(2):454–460.
18. Solomon DH, Liu C-C, Kuo I-H, et al. Effects of colchicine on risk of cardiovascular events and mortality among patients with gout: a cohort study using electronic medical records linked with Medicare claims. Ann Rheum Dis. 2016;75(9):1674–1679.
19. Lee MP, Glynn RJ, Schneeweiss S, et al. Risk factors for heart failure with preserved or reduced ejection fraction among Medicare beneficiaries: application of competing risks analysis and gradient boosted model. Clin Epidemiol. 2020;12:607–616.
20. Berger A, Simpson A, Leeper NJ, et al. Real-world predictors of major adverse cardiovascular events and major adverse limb events among patients with chronic coronary artery disease and/or peripheral arterial disease. Adv Ther. 2020;37(1):240–252.
21. Bohensky M. Bias in data linkage studies. In: Harron K, Goldstein H, Dibben C, eds. Methodological Developments in Data Linkage. London, UK: John Wiley & Sons, Ltd; 2015:63–82.
22. Galling B, Roldán A, Nielsen RE, et al. Type 2 diabetes mellitus in youth exposed to antipsychotics: a systematic review and meta-analysis. JAMA Psychiat. 2016;73(3):247–259.
23. Bobo WV, Cooper WO, Stein CM, et al. Antipsychotics and the risk of type 2 diabetes mellitus in children and youth. JAMA Psychiat. 2013;70(10):1067.
24. De Hert M, Detraux J, van Winkel R, et al. Metabolic and cardiovascular adverse effects associated with antipsychotic drugs. Nat Rev Endocrinol. 2012;8(2):114–126.
25. De Hert M, Dobbelaere M, Sheridan EM, et al. Metabolic and endocrine adverse effects of second-generation antipsychotics in children and adolescents: a systematic review of randomized, placebo controlled trials and guidelines for clinical practice. Eur Psychiatry. 2011;26(3):144–158.
26. American Diabetes Association. Consensus development conference on antipsychotic drugs and obesity and diabetes. Diabetes Care. 2004;27(2):596–601.
27. IBM. MarketScan Research Databases. 2019. https://www.ibm.com/products/marketscan-research-databases. Accessed November 19, 2020.
28. Brookhart MA, Todd JV, Li X, et al. Estimation of biomarker distributions using laboratory data collected during routine delivery of medical care. Ann Epidemiol. 2014;24(10):754–761.
29. Sun JW, Bourgeois FT, Haneuse S, et al. Development and validation of a pediatric comorbidity index. Am J Epidemiol. 2021;190(5):918–927.
30. Teltsch DY, Fazeli Farsani S, Swain RS, et al. Development and validation of algorithms to identify newly diagnosed type 1 and type 2 diabetes in pediatric population using electronic medical records and claims data. Pharmacoepidemiol Drug Saf. 2019;28(2):234–243.
31. Robins JM, Hernán MÁ, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550–560.
32. Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol. 2008;168(6):656–664.
33. Sterne JAC, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:b2393.
34. SAS Institute Inc. SAS/STAT 14.1 User's Guide: The MI Procedure. Cary, NC: SAS Institute Inc; 2015.
35. Leyrat C, Seaman SR, White IR, et al. Propensity score analysis with partially observed covariates: how should multiple imputation be used? Stat Methods Med Res. 2019;28(1):3–19.
36. Granger E, Sergeant JC, Lunt M. Avoiding pitfalls when combining multiple imputation and propensity scores. Stat Med. 2019;38(26):5120–5132.
37. Rubin DB. Multiple Imputation for Survey Nonresponse. New York, NY: Wiley; 1987.
38. Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology. 2004;15(5):615–625.
39. Cole SR, Stuart EA. Generalizing evidence from randomized clinical trials to target populations: the ACTG 320 trial. Am J Epidemiol. 2010;172(1):107–115.
40. Lin DY, Wei L-J. The robust inference for the Cox proportional hazards model. J Am Stat Assoc. 1989;84(408):1074–1078.
41. Poole C. Low P values or narrow confidence intervals: which are more durable? Epidemiology. 2001;12(3):291–294.
42. Hernán MA, Robins JM. Causal Inference: What If? Boca Raton, FL: CRC Press LLC; 2020.
43. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009;28(25):3083–3107.
44. Hernán MA. Invited commentary: selection bias without colliders. Am J Epidemiol. 2017;185(11):1048–1050.
45. Dahabreh IJ, Robertson SE, Tchetgen Tchetgen EJ, et al. Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics. 2019;75(2):685–694.
46. Westreich D, Edwards JK, Lesko CR, et al. Transportability of trial results using inverse odds of sampling weights. Am J Epidemiol. 2017;186(8):1010–1014.
47. Dahabreh IJ, Robertson SE, Steingrimsson JA, et al. Extending inferences from a randomized trial to a new target population. Stat Med. 2020;39(14):1999–2014.
48. Webster-Clark M, Lund JL, Stürmer T, et al. Reweighting oranges to apples: transported RE-LY trial versus nonexperimental effect estimates of anticoagulation in atrial fibrillation. Epidemiology. 2020;31(5):605–613.
49. Laird NM. Missing data in longitudinal studies. Stat Med. 1988;7(1–2):305–315.
50. Ross RK, Breskin A, Westreich D. When is a complete-case approach to missing data valid? The importance of effect-measure modification. Am J Epidemiol. 2020;189(12):1583–1589.
51. Horton NJ, Kleinman KP. Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat. 2007;61(1):79–90.
52. Little RJ, Rubin DB. Statistical Analysis With Missing Data. 3rd ed. Hoboken, NJ: John Wiley & Sons; 2019.
