Use of Linked Databases for Improved Confounding Control: Considerations for Potential Selection Bias

Jenny W Sun; Rui Wang; Dongdong Li; Sengwee Toh

doi:10.1093/aje/kwab299

2022 Jan 6;191(4):711–723.

PMCID: PMC9430441 PMID: 35015823

Abstract

Pharmacoepidemiologic studies are increasingly conducted within linked databases, often to obtain richer confounder data. However, the potential for selection bias is frequently overlooked when linked data is available only for a subset of patients. We highlight the importance of accounting for potential selection bias by evaluating the association between antipsychotics and type 2 diabetes in youths within a claims database linked to a smaller laboratory database. We used inverse probability of treatment weights (IPTW) to control for confounding. In analyses restricted to the linked cohorts, we applied inverse probability of selection weights (IPSW) to create a population representative of the full cohort. We used pooled logistic regression weighted by IPTW only or IPTW and IPSW to estimate treatment effects. Metabolic conditions were more prevalent in linked cohorts compared with the full cohort. Within the full cohort, the confounding-adjusted hazard ratio was 2.26 (95% CI: 2.07, 2.49) comparing initiation of antipsychotics with initiation of control medications. Within the linked cohorts, a different magnitude of association was obtained without adjustment for selection, whereas applying IPSW resulted in point estimates similar to the full cohort’s (e.g., an adjusted hazard ratio of 1.63 became 2.12). Linked database studies may generate biased estimates without proper adjustment for potential selection bias.

Keywords: health-care databases, linked data, pharmacoepidemiology, selection bias

Abbreviations

CI: confidence interval
HR: hazard ratio
IPSW: inverse probability of selection weights
IPTW: inverse probability of treatment weights
T2D: type 2 diabetes

Health-care databases, such as administrative claims data, electronic health records, and registries, are widely used in pharmacoepidemiology and health services research. However, these databases are not collected for research purposes and often lack information on important confounders (1). With the widespread availability of health-care databases, records for the same patient may be available across different data sources (2, 3). As a result, richer patient data can be obtained through data linkage (4–11). Guidance on the feasibility of data linkage and recommendations for transparent reporting have been published elsewhere (12–15).

One of the advantages to a linked database study is improved confounding control, but the potential for selection bias is often overlooked. Typically, a data linkage is feasible for only a subset of the study population. For example, a claims database from a health plan may be linked to an electronic health record database from a delivery system but only among patients who appear in both data sources. Therefore, linked database studies are generally restricted to a subset of the original study population (16–20). If the subset of linked patients is not representative of the original study population, then restricting an analysis to the linked population may introduce bias (21).

Here we highlight the importance of considering potential selection bias when working with linked data sources to estimate treatment effects in the original study population (i.e., the target population of interest). We also demonstrate how available analytical approaches can be used to account for this potential bias.

METHODS

Application example

We applied the proposed approaches to evaluate the association of antipsychotics and type 2 diabetes (T2D) in youths (aged 5–24 years). In young patients, antipsychotic use is associated with a 2- to 3-fold increased risk of developing T2D (22, 23), as well as an increased risk of other adverse cardiometabolic side effects, such as weight gain and lipid and glucose abnormalities (24, 25). Therefore, the American Diabetes Association recommends metabolic monitoring for youths treated with antipsychotics (26). Metabolic screening prior to treatment initiation is intended to guide treatment decision making, so patients with poor metabolic health at baseline may choose an alternative treatment. A study evaluating antipsychotics and the risk of T2D within a database that does not capture laboratory data may have residual confounding by unmeasured metabolic test results.

Definitions

We defined the “primary data set” as the data set where the original study population was identified (and therefore, the target population we would like to make inferences about) and the “supplemental data set” as the data set that contains additional covariate data that was not available in the primary data set. The “linked cohort” consisted of patients within the primary data set who have been linked to the supplemental data set.

Data sources

The primary data set was the IBM MarketScan Commercial Database (MarketScan; IBM Watson Health, Cambridge, Massachusetts; January 2010 to March 2019), a nationwide claims database in the United States (27). To obtain additional confounder data, we identified a supplemental data set that captures test results from select laboratory networks for patients who have laboratory tests ordered (IBM MarketScan Lab Database). A previous study showed that the distribution of laboratory results within this database is representative of the general US population (28). Using the deidentified patient enrollment identification number, we linked records across data sources for the subset of MarketScan patients who had a test result available within the laboratory database. The laboratory database includes results from dozens of tests. For the application example, we were interested in 3 of these tests: hemoglobin A1c (HbA1c), cholesterol (total, high-density lipoprotein, low-density lipoprotein), and triglycerides. This study was approved by the Institutional Review Board of Harvard Pilgrim Health Care with a waiver of informed consent.

Study population

We defined the study population within the claims database (primary data set) as youths aged 5–24 years who initiated an antipsychotic medication or a comparator psychotropic medication. The exposed group included initiators of an antipsychotic medication. The date of the first observed dispensing for an antipsychotic served as the index date. The comparator group consisted of new users of other psychotropic drugs (antidepressants, medications for attention-deficit/hyperactivity disorder, and mood stabilizers; details in Web Table 1 available at https://doi.org/10.1093/aje/kwab299). The date of the first observed dispensing for a comparator drug served as the index date. In the comparator group, we required no previous use of the initiation drug, but use of the other comparator drugs was allowed. For example, antidepressant initiators were required to have no previous antidepressant dispensings, but previous dispensings of medications for attention-deficit/hyperactivity disorder or mood stabilizers were allowed.

Patients who did not have continuous medical and pharmacy enrollment, had a diagnosis of diabetes or a dispensing for an antihyperglycemic medication, or were pregnant during the 180 days prior to the index date were excluded. Youths were additionally required to have ≥1 mental health diagnosis on the index date or any point prior to the index date. In the comparator group, we also required patients to have no antipsychotic use in the 180 days prior to the index date.

Linked subset

For patients with data available in both the primary and supplemental data sets, we varied the definition of the linked cohort as follows (illustrated in Figure 1):

Linked cohort 1: linked patients with data for any laboratory test (not limited to the 3 tests of interest) during the study period. This linked cohort included eligible patients who appeared in both data sets, but some of these patients might not have a recorded measurement for the tests of interest during the covariate assessment period (defined in the next section).
Linked cohort 2: linked patients with data for any laboratory test (not limited to the 3 tests of interest) during the covariate assessment period. This linked cohort included eligible patients who had any supplemental data available during the covariate assessment period. As in linked cohort 1, some patients might not have a recorded measurement for the tests of interest during the covariate assessment period.
Linked cohort 3: linked patients with a recorded result for each of the 3 tests of interest during the covariate assessment period. This linked cohort included eligible patients who would be included in a conventional complete-case analysis.

Figure 1. Overview of study cohorts, IBM MarketScan Data, United States, 2010–2019. We implemented the following framework for identifying the study cohorts: The study population was defined within the primary database. This population reflected the target population that we would like to make inferences about. To obtain additional confounder data, the primary database was linked to the supplemental database. Several definitions were considered for identifying the subset of patients who appeared in both data sources (linked cohort 1, linked cohort 2, linked cohort 3). In our study, the IBM MarketScan Commercial Database served as the primary database and the IBM MarketScan Lab Database served as the supplemental database.
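As an illustration only, the three linked-cohort definitions can be expressed as a small membership rule. The sketch below is written in Python rather than the SAS used for the study. The function name, the record layout, the test labels, and the reading of the covariate assessment period as the 180 days strictly before the index date are all assumptions, and every laboratory record passed in is assumed to fall within the study period.

```python
from datetime import date, timedelta

# Hypothetical labels for the 3 tests of interest (cholesterol treated as one test).
TESTS_OF_INTEREST = {"hba1c", "cholesterol", "triglycerides"}
ASSESSMENT_DAYS = 180  # covariate assessment period before the index date


def linked_cohorts(index_date, lab_records):
    """Return the set of linked cohorts (1, 2, 3) that a patient belongs to.

    lab_records is a list of (test_name, result_date) tuples taken from the
    supplemental data set; an empty list means the patient was not linked.
    """
    cohorts = set()
    if not lab_records:
        return cohorts

    # Linked cohort 1: any laboratory record during the study period.
    cohorts.add(1)

    # Keep only records inside the covariate assessment period
    # (assumed here to exclude the index date itself).
    window_start = index_date - timedelta(days=ASSESSMENT_DAYS)
    in_window = [name for name, day in lab_records
                 if window_start <= day < index_date]

    # Linked cohort 2: any laboratory record during the assessment period.
    if in_window:
        cohorts.add(2)

    # Linked cohort 3: a result for each test of interest (complete cases).
    if TESTS_OF_INTEREST <= set(in_window):
        cohorts.add(3)

    return cohorts
```

Because the cohorts are nested, a patient in linked cohort 3 is also in linked cohorts 1 and 2, which is why the function returns a set rather than a single label.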

Patient characteristics

We measured baseline covariates during the 180 days prior to the index date (covariate assessment period). Within the claims database, we identified several characteristics as potential confounders or proxies of confounders, including demographic factors, metabolic conditions, psychiatric conditions, laboratory tests ordered, lifestyle factors, medications, indicators of health-care utilization, and the pediatric comorbidity index (full list in Web Table 2) (29). As is typically done in claims database studies, patients were defined as not having a certain characteristic (e.g., depression) unless they had a recorded diagnosis or dispensing in the database. Therefore, there were no missing values in the claims-based covariates.

Within the laboratory database, we obtained test results for hemoglobin A1c, cholesterol, and triglycerides. Implausibly extreme laboratory values were set to missing (details in Web Table 2). When there were multiple records of the test result available, we used only the value closest to the index date.
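The selection of a single baseline laboratory value might look like the following Python sketch. It is not the study code. The function name and record layout are hypothetical, the plausibility bounds are supplied by the caller (the actual cutoffs are in Web Table 2 and are not reproduced here), and the choice to fall back to the next-closest plausible value when the closest one is implausible is an assumption about the order of the two steps.

```python
from datetime import date, timedelta


def baseline_lab_value(index_date, records, plausible_range, window_days=180):
    """Pick the plausible result closest to (and before) the index date.

    records is a list of (result_date, value) tuples for one test.
    plausible_range is a (low, high) tuple; values outside it are
    treated as missing. Returns None when no usable value exists.
    """
    low, high = plausible_range
    window_start = index_date - timedelta(days=window_days)

    # Keep results inside the assessment window with plausible values.
    eligible = [(day, value) for day, value in records
                if window_start <= day < index_date and low <= value <= high]
    if not eligible:
        return None  # covariate is missing and would later be imputed

    # The latest date before the index date is the closest one.
    return max(eligible, key=lambda record: record[0])[1]
```

A returned None corresponds to a missing laboratory covariate, which the analysis handles through multiple imputation.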

Outcome

We followed patients until the onset of T2D, end of insurance coverage, or end of available data. We defined cases using a previously validated algorithm for identifying T2D in children using health-care databases (positive predictive value = 87%) (30). This definition was based on the presence of inpatient or outpatient diagnosis codes for T2D and use of antihyperglycemic medications.

Statistical analysis

Descriptive statistics.

First, we examined whether patients within the linked cohorts were representative of the full cohort by comparing the distributions of baseline patient characteristics across cohorts. Then we compared the distributions of patient characteristics between treatment groups within each cohort to assess for potential confounding.

Adjusting for confounding only.

We used inverse probability of treatment weights (IPTW) to adjust for baseline confounding (31). We estimated stabilized treatment weights as the marginal probability of treatment divided by the probability of treatment conditional on measured baseline covariates, separately for the full cohort and each of the linked cohorts. Then we truncated weights at the 99th and 1st percentiles to prevent outliers from influencing the analysis (32). For each cohort, we applied 2 levels of baseline confounding adjustment: 1) adjusted for claims-based covariates only and 2) adjusted for claims-based and laboratory covariates. To estimate the association of antipsychotics and T2D, we estimated hazard ratios (HRs) using pooled logistic regression models weighted by IPTW. These analyses did not account for potential selection bias (see next section).
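To make the weight construction concrete, the following Python sketch computes stabilized treatment weights and truncates them at the 1st and 99th percentiles. It assumes the conditional treatment probabilities (propensity scores) have already been estimated by some model. The function names are hypothetical, and the nearest-rank percentile rule is an assumption that may differ from the percentile definition used in SAS.

```python
def stabilized_treatment_weights(treated, propensity):
    """Stabilized IPTW: marginal P(A = a) divided by P(A = a | covariates).

    treated is a list of 0/1 indicators; propensity holds each patient's
    estimated probability of receiving the treatment (A = 1).
    """
    p_marginal = sum(treated) / len(treated)
    weights = []
    for a, ps in zip(treated, propensity):
        if a == 1:
            weights.append(p_marginal / ps)
        else:
            weights.append((1 - p_marginal) / (1 - ps))
    return weights


def percentile(values, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # ceil(q * n / 100)
    return ordered[rank - 1]


def truncate(weights, lower_q=1, upper_q=99):
    """Set weights below/above the chosen percentiles to those percentiles."""
    lo = percentile(weights, lower_q)
    hi = percentile(weights, upper_q)
    return [min(max(w, lo), hi) for w in weights]
```

Truncation replaces extreme weights with the cutoff value rather than dropping patients, so the cohort size is unchanged.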

For analyses adjusted for laboratory covariates, we used multiple imputation by chained equations to handle missing laboratory data using the MI procedure (PROC MI) in SAS, version 9.4 (SAS Institute, Inc., Cary, North Carolina) (33, 34). We performed imputation on the continuous laboratory covariates and then dichotomized the respective variables to define high total cholesterol, high low-density lipoprotein cholesterol, and high triglyceride levels. The imputation models included the outcome and all previously defined claims-based and laboratory covariates. We assumed that the laboratory covariates were missing at random. We created 20 imputed data sets and specified a multivariate normal distribution for the imputation of continuous laboratory covariates. We fitted separate treatment weights and pooled logistic regression models (as described above) for each imputation (35, 36) and then used Rubin’s rules to pool HRs and 95% confidence intervals (CIs) across imputations (37).
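The pooling step can be sketched as follows. Rubin's rules combine the per-imputation log HRs by averaging the estimates and adding the within-imputation and between-imputation variances. This Python sketch uses hypothetical function names and, as a simplifying assumption, a normal approximation for the 95% CI rather than the t reference distribution that Rubin's rules formally specify.

```python
import math


def pool_rubin(log_hrs, variances):
    """Pool per-imputation log hazard ratios using Rubin's rules.

    Returns (pooled log HR, total variance).
    """
    m = len(log_hrs)
    pooled = sum(log_hrs) / m
    within = sum(variances) / m
    between = sum((q - pooled) ** 2 for q in log_hrs) / (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


def hr_with_ci(log_hr, variance, z=1.96):
    """Back-transform to the HR scale with a normal-approximation 95% CI."""
    se = math.sqrt(variance)
    return (math.exp(log_hr),
            math.exp(log_hr - z * se),
            math.exp(log_hr + z * se))
```

Pooling is done on the log scale because the log HR is closer to normally distributed than the HR itself.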

Adjusting for selection bias and confounding.

To account for potential selection bias in restricting analyses to the linked cohorts, we applied inverse probability of selection weights (IPSW) (38, 39). Within each treatment group, we estimated stabilized selection weights as the marginal probability of being in the respective linked cohort divided by the probability of being in the respective linked cohort conditional on all previously described claims-based covariates. Then we truncated weights at the 99th and 1st percentiles. This weighting created a pseudopopulation in which the distributions of measured factors related to selection were expected to be balanced between the full cohort and the respective linked cohort. To evaluate the performance of these weights, we reexamined the distributions of characteristics across cohorts in the reweighted sample.
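A minimal Python sketch of the stabilized selection weights follows, assuming the conditional probability of being in the linked cohort has already been estimated from the claims-based covariates within each treatment group. The function name and data layout are hypothetical, and both treatment groups are assumed to be nonempty.

```python
def stabilized_selection_weights(treated, linked, p_linked):
    """Stabilized IPSW, computed separately within each treatment group.

    treated:  0/1 treatment indicators for the full cohort
    linked:   0/1 indicators of membership in the linked cohort
    p_linked: estimated P(linked = 1 | treatment group, claims covariates)

    Unlinked patients receive None because they are excluded from the
    linked-cohort analysis.
    """
    # Marginal probability of linkage within each treatment group.
    marginal = {}
    for arm in (0, 1):
        flags = [s for a, s in zip(treated, linked) if a == arm]
        marginal[arm] = sum(flags) / len(flags)

    return [marginal[a] / p if s == 1 else None
            for a, s, p in zip(treated, linked, p_linked)]
```

Linked patients who resemble the unlinked majority (low conditional probability of linkage) receive weights above 1, which is how the reweighted linked cohort comes to resemble the full cohort on measured characteristics.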

Then, within each newly defined pseudopopulation, we applied IPTW to account for baseline confounding. Specifically, we used logistic regression models weighted by IPSW to estimate stabilized treatment weights separately for each of the linked cohorts. We included the previously described baseline covariates in these weight models. By adjusting for confounding within the pseudopopulation created by IPSW, this approach would ideally create covariate balance across treatment groups (internal validity) within the target population of interest. For analyses adjusted for laboratory covariates, we filled in missing laboratory data using multiple imputation by chained equations (details above) before estimating IPTW.

To estimate treatment effects adjusted for confounding and selection bias, we fitted pooled logistic regression models weighted by IPTW and IPSW to generate HRs.
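The doubly weighted analysis can be illustrated with a deliberately simplified Python sketch. It assumes the two weights are combined by multiplication, which the text does not state explicitly, and it fits the simplest possible pooled logistic model: treatment as the only predictor with no time terms, for which the weighted maximum likelihood odds ratio equals the ratio of weighted person-period odds. The function names and the dictionary layout are hypothetical, and the study's actual models may have included functions of time.

```python
def person_periods(periods, event):
    """Expand one patient into per-period outcomes (1 = event in that period).

    The event, if any, is placed in the final period of follow-up.
    """
    return [1 if (event and t == periods - 1) else 0 for t in range(periods)]


def weighted_pooled_odds_ratio(patients):
    """Weighted odds ratio from a treatment-only pooled logistic model.

    Each patient is a dict with keys: treated (0/1), periods (int >= 1),
    event (bool), iptw (float), ipsw (float).
    """
    events = {0: 0.0, 1: 0.0}
    nonevents = {0: 0.0, 1: 0.0}
    for p in patients:
        weight = p["iptw"] * p["ipsw"]  # assumed combination rule
        for outcome in person_periods(p["periods"], p["event"]):
            if outcome:
                events[p["treated"]] += weight
            else:
                nonevents[p["treated"]] += weight

    odds = {arm: events[arm] / nonevents[arm] for arm in (0, 1)}
    return odds[1] / odds[0]
```

When the probability of the outcome in any single period is small, the odds ratio from a pooled logistic model approximates the hazard ratio, which is why the estimates are reported as HRs.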

For all analyses, we computed 95% CIs for the HRs using the standard sandwich variance estimator (40) and further quantified precision using confidence limit ratios (41), the ratio of the upper limit to the lower limit of the 95% CI. We additionally estimated variance using a nonparametric bootstrapping method. We assumed that the targeted treatment effects were identifiable under the assumptions of conditional exchangeability, positivity, and causal consistency (42).

The SAS (SAS Institute, Inc.) code used to implement the main analyses is available in the Web Material.

RESULTS

Data linkage

The full cohort, identified from the claims database, consisted of 349,180 antipsychotic initiators and 2,000,308 initiators of a control medication (Figure 2). After linkage to the laboratory database, 10.7% of antipsychotic initiators and 8.4% of control patients remained in linked cohort 1. Requiring data for any laboratory test during the covariate assessment period (linked cohort 2) reduced the sample to 5.3% of antipsychotic initiators and 2.8% of control patients. Restriction to complete cases (linked cohort 3) substantially reduced the sample size (0.4% of antipsychotic initiators, 0.1% of control patients).

Figure 2. Flow diagram of cohort assembly, IBM MarketScan Data, United States, 2010–2019. A) New users aged 5–24 years of an antipsychotic medication. B) New users aged 5–24 years of a control medication (attention-deficit/hyperactivity disorder medications, antidepressants, mood stabilizers). HbA1c, hemoglobin A1c.

Patient characteristics

Compared with the full cohort, patients within the linked cohorts were slightly older (mean age of controls, 16.2 (standard deviation, 5.3) years in the full cohort versus 18.2 (standard deviation, 4.7) years in linked cohort 2; Table 1, Web Table 3). They were also more likely to have diagnoses of metabolic conditions and laboratory tests ordered, with the prevalence increasing as the definition for linked cohort became more restrictive. Notably, the prevalence of obesity or overweight diagnosis among control patients was 3.5% in the full cohort, 5.2% in linked cohort 1, 7.4% in linked cohort 2, and 26.0% in linked cohort 3. Similar trends were observed in the antipsychotic group. Other measured characteristics were generally similar across cohorts.

Table 1.

Characteristics of Patients Who Initiated Antipsychotic Treatment or Control Treatment Before and After Accounting for Selection Bias, IBM MarketScan Data, United States, 2010–2019

	Initiators of Antipsychotics								Initiators of Other Psychotropic Drugs
	Full Cohort		Linked Cohort 1		Linked Cohort 2		Linked Cohort 3		Full Cohort		Linked Cohort 1		Linked Cohort 2		Linked Cohort 3
Characteristic	No.	%	No.	%	No.	%	No.	%	No.	%	No.	%	No.	%	No.	%
No Adjustment for Selection
Age at initiation, years^a	16.7 (4.9)		17.2 (4.6)		17.7 (4.5)		17.6 (4.5)		16.2 (5.3)		17.4 (4.9)		18.2 (4.7)		18.8 (4.4)
Female sex	155,078	44.4	19,365	51.9	7,243	54.9	425	47.5	997,621	49.9	104,221	61.9	36,821	65.8	1,515	58.6
Pediatric comorbidity index^a	5.6 (4.9)		6.2 (4.2)		6.9 (4.2)		6.3 (4.0)		2.6 (2.7)		3.0 (2.9)		3.7 (3.1)		4.1 (3.1)
Medical diagnoses
Obesity or overweight	16,538	4.7	2,310	6.2	1,086	8.2	142	15.9	70,254	3.5	8,678	5.2	4,118	7.4	673	26.0
Weight management	7,235	2.1	1,203	3.2	590	4.5	44	4.9	30,784	1.5	4,175	2.5	1,833	3.3	159	6.1
Abnormal glucose/prediabetes	2,404	0.7	359	1.0	181	1.4	41	4.6	6,202	0.3	869	0.5	509	0.9	153	5.9
Hyperlipidemia	6,381	1.8	1,058	2.8	600	4.5	93	10.4	20,521	1.0	3,158	1.9	1,975	3.5	329	12.7
Bipolar disorder	121,548	34.8	13,558	36.3	5,187	39.3	336	37.6	78,173	3.9	7,311	4.3	2,751	4.9	152	5.9
Depression	169,212	48.5	19,562	52.4	7,300	55.3	434	48.5	596,370	29.8	56,573	33.6	21,394	38.2	1,135	43.9
Psychotic disorders	42,016	12.0	4,785	12.8	1,831	13.9	160	17.9	16,240	0.8	1,657	1.0	645	1.2	40	1.5
Laboratory tests ordered
Comprehensive metabolic panel	125,125	35.8	16,969	45.5	8,624	65.3	738	82.6	373,700	18.7	49,460	29.4	30,241	54.1	2,181	84.3
Glucose test	16,098	4.6	2,479	6.6	1,428	10.8	128	14.3	36,774	1.8	5,281	3.1	3,361	6.0	311	12.0
HbA1c test	18,080	5.2	2,820	7.6	1,789	13.6	812	90.8	44,708	2.2	6,871	4.1	4,775	8.5	2,390	92.4
Lipid test	54,724	15.7	8,235	22.1	5,016	38.0	811	90.7	156,308	7.8	22,537	13.4	15,322	27.4	2,376	91.9
Weighted by the Inverse Probability of Selection
Age at initiation, years^a	16.7 (4.9)		16.7 (4.3)		17.1 (4.0)		17.9 (4.4)		16.2 (5.3)		16.3 (5.3)		16.8 (5.1)		19.3 (5.5)
Female sex	155,078	44.4	13,744	44.7	4,702	49.1	263	38.4	997,621	49.9	87,829	50.7	29,688	56.7	2,491	62.3
Pediatric comorbidity index^a	5.6 (4.9)		5.7 (3.6)		6.2 (3.3)		7.3 (3.8)		2.6 (2.7)		2.6 (2.8)		3.0 (2.7)		3.9 (3.7)
Medical diagnoses
Obesity or overweight	16,538	4.7	1,486	4.8	558	5.8	72	10.6	70,254	3.5	6,527	3.8	2,464	4.7	867	21.7
Weight management	7,235	2.1	691	2.3	281	2.9	33	4.8	30,784	1.5	3,018	1.7	1,191	2.3	197	4.9
Abnormal glucose/prediabetes	2,404	0.7	222	0.7	83	0.9	1	0.2	6,202	0.3	592	0.3	259	0.5	16	0.4
Hyperlipidemia	6,381	1.8	573	1.9	220	2.3	32	4.7	20,521	1.0	1,950	1.1	834	1.6	142	3.6
Bipolar disorder	121,548	34.8	10,648	34.7	3,527	36.8	302	44.1	78,173	3.9	6,914	4.0	2,351	4.5	259	6.5
Depression	169,212	48.5	15,029	48.9	4,982	52.0	370	54.0	596,370	29.8	52,611	30.4	17,340	33.1	1,898	47.4
Psychotic disorders	42,016	12.0	3,703	12.1	1,255	13.1	240	35.0	16,240	0.8	1,447	0.8	538	1.0	142	3.5
Laboratory tests ordered
Comprehensive metabolic panel	125,125	35.8	11,149	36.3	3,923	40.9	240	35.1	373,700	18.7	34,134	19.7	12,445	23.8	661	16.5
Glucose test	16,098	4.6	1,464	4.8	619	6.5	92	13.4	36,774	1.8	3,522	2.0	1,658	3.2	314	7.8
HbA1c test	18,080	5.2	1,614	5.3	626	6.5	23	3.4	44,708	2.2	4,171	2.4	1,791	3.4	54	1.3
Lipid test	54,724	15.7	4,879	15.9	1,774	18.5	36	5.3	156,308	7.8	14,423	8.3	5,834	11.1	97	2.4

Abbreviation: HbA1c, hemoglobin A1c.

^a Values are expressed as mean (standard deviation).

After applying IPSW, the distributions of baseline patient characteristics (including metabolic conditions) in linked cohort 1 and linked cohort 2 were similar to those of the full cohort (Table 1, Web Table 4). There were residual imbalances for linked cohort 3 compared with the full cohort. Within the full cohort and linked cohorts 1 and 2, characteristics were similar between treatment groups after weighting by IPTW and IPSW, with absolute standardized differences of less than 0.10 for nearly all measured covariates (Table 2, Web Figure 1) (43).
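For readers unfamiliar with the balance metric, the standardized difference for a binary characteristic can be computed from the two group proportions alone. The Python sketch below uses the common formula for proportions, which is assumed here to match the one in the cited reference, and applies it to the weighted prevalence of obesity or overweight reported in Table 2. The function name is hypothetical.

```python
import math


def standardized_difference(p_treated, p_control):
    """Standardized difference between two proportions (binary covariate)."""
    pooled_sd = math.sqrt(
        (p_treated * (1 - p_treated) + p_control * (1 - p_control)) / 2
    )
    return (p_treated - p_control) / pooled_sd


# Obesity or overweight after weighting (Table 2):
# full cohort, 4.0% of antipsychotic initiators vs. 3.7% of controls
full_cohort = standardized_difference(0.040, 0.037)
# linked cohort 3, 20.0% of antipsychotic initiators vs. 9.7% of controls
linked_cohort_3 = standardized_difference(0.200, 0.097)

print(f"full cohort: {full_cohort:.3f}")
print(f"linked cohort 3: {linked_cohort_3:.3f}")
```

The full-cohort value falls well under the 0.10 threshold, whereas the linked cohort 3 value exceeds it, consistent with the residual imbalance described for that cohort.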

Table 2.

Distribution of Patient Characteristics After Accounting for Potential Selection Bias and Baseline Confounding^a, IBM MarketScan Data, United States, 2010–2019

	Initiators of Antipsychotics								Initiators of Other Psychotropic Drugs
	Full Cohort		Linked Cohort 1		Linked Cohort 2		Linked Cohort 3		Full Cohort		Linked Cohort 1		Linked Cohort 2		Linked Cohort 3
Characteristic	No.	%	No.	%	No.	%	No.	%	No.	%	No.	%	No.	%	No.	%
No. of patients	288,402	100	25,069	100	7,804	100	14,797	100	2,020,663	100	173,439	100	52,681	100	16,137	100
Demographic factors
Age at initiation, years^b	16.7 (4.9)		16.4 (4.0)		16.8 (3.7)		20.2 (16.3)		16.3 (5.3)		16.3 (5.1)		16.9 (5.1)		20.0 (9.9)
Female sex	155,078	44.4	12,400	49.5	4,363	55.9	10,751	72.7	993,276	49.2	81,207	48.9	29,240	55.5	7,460	46.2
Pediatric comorbidity index^b	5.6 (4.0)		3.7 (2.7)		4.1 (2.6)		5.3 (13.1)		3.1 (3.3)		3.2 (3.3)		3.6 (3.2)		9.5 (13.9)
Metabolic conditions
Obesity or overweight	11,452	4.0	1,003	4.0	368	4.7	2,967	20.0	74,650	3.7	6,738	3.9	2,548	4.8	1,571	9.7
Weight management	5,109	1.8	470	1.9	186	2.4	973	6.6	32,891	1.6	3,122	1.8	1,215	2.3	818	5.1
Abnormal weight gain	2,708	0.9	248	1.0	100	1.3	1,664	11.2	17,259	0.9	1,566	0.9	646	1.2	282	1.7
Abnormal glucose or prediabetes	1,255	0.4	99	0.4	48	0.6	4,802	32.5	7,487	0.4	667	0.4	275	0.5	74	0.5
Metabolic syndrome	338	0.1	39	0.2	18	0.2	425	2.9	2,025	0.1	186	0.1	76	0.1	45	0.3
Hyperlipidemia	3,697	1.3	347	1.4	145	1.9	1,034	7.0	23,156	1.1	2,101	1.2	894	1.7	1,140	7.1
Hypothyroidism	4,636	1.6	435	1.7	190	2.4	3,692	24.9	27,990	1.4	2,501	1.4	1,131	2.1	81	0.5
Lab tests ordered
Comprehensive metabolic panel	71,358	24.7	6,330	25.2	2,198	28.2	9,807	66.3	436,490	21.6	38,472	22.2	13,899	26.4	9,780	60.6
Glucose test	7,467	2.6	682	2.7	302	3.9	1,708	11.5	45,888	2.3	4,159	2.4	1,914	3.6	1,819	11.3
HbA1c test	8,717	3.0	778	3.1	318	4.1	14,368	97.1	53,728	2.7	4,807	2.8	2,034	3.9	12,380	76.7
Lipid test	27,361	9.5	2,500	10.0	955	12.2	9,985	67.5	179,603	8.9	16,042	9.2	6,373	12.1	9,561	59.3
Psychiatric conditions
ADHD	57,598	20.0	4,824	19.2	1,339	17.2	1,564	10.6	472,542	23.4	39,894	23.0	11,165	21.2	1,818	11.3
Anxiety	103,698	36.0	9,038	36.1	2,913	37.3	10,375	70.1	676,179	33.5	58,712	33.9	19,353	36.7	10,414	64.5
Autism	13,169	4.6	1,187	4.7	348	4.5	612	4.1	70,769	3.5	5,956	3.4	1,693	3.2	686	4.3
Bipolar disorder	32,263	11.2	2,873	11.5	1,008	12.9	138	0.9	185,632	9.2	14,993	8.6	5,269	10.0	8,145	50.5
Depression	112,228	38.9	9,891	39.5	3,357	43.0	9,377	63.4	671,151	33.2	57,974	33.4	19,239	36.5	9,741	60.4
Psychotic disorders	9,693	3.4	882	3.5	318	4.1	172	1.2	54,744	2.7	4,326	2.5	1,709	3.2	1,153	7.1
Medications
Lithium	3,621	1.3	314	1.3	112	1.4	8	0.1	17,624	0.9	1,344	0.8	497	0.9	1,076	6.7
Anticonvulsant mood stabilizers	28,030	9.7	2,513	10.0	845	10.8	1,823	12.3	140,177	6.9	11,651	6.7	4,047	7.7	5,986	37.1
SSRIs	153,764	53.3	13,339	53.2	4,382	56.1	7,373	49.8	964,072	47.7	84,308	48.6	27,539	52.3	7,879	48.8
Other antidepressants	50,626	17.6	4,438	17.7	1,492	19.1	2,002	13.5	271,163	13.4	22,950	13.2	7,860	14.9	4,836	30.0
ADHD medications	133,131	46.2	11,166	44.5	3,119	40.0	4,299	29.0	1,019,952	50.5	85,993	49.6	23,645	44.9	4,151	25.7
Health-care utilization
No. of outpatient visits^c	5 (2–10)		5 (2–10)		6 (3–11)		7 (6–12)		4 (2–8)		4 (2–10)		5 (3–9)		9 (8–13)
No. of distinct generic drugs^c	3 (1–5)		4 (2–5)		4 (3–6)		3 (2–5)		4 (2–5)		3 (2–5)		3 (2–5)		6 (2–9)
Any hospitalization	29,066	10.1	2,583	10.3	972	12.5	281	1.9	152,044	7.5	12,546	7.2	4,833	9.2	5,085	31.5
Laboratory test results^d
HbA1c, %^b	5.28 (0.2)		5.30 (0.2)		5.34 (0.2)		5.26 (0.9)		5.27 (0.5)		5.3 (0.4)		5.33 (0.3)		5.26 (0.8)
Proportion missing	99.8		98.6		96.4		0.0		99.8		98.4		96.3		0.0
Total cholesterol, mg/dL^b	161.06 (26.2)		159.07 (21.5)		157.12 (21.4)		156.75 (94.8)		163.11 (42.2)		160.28 (35.3)		157.15 (32.3)		153.87 (81.8)
Proportion missing	99.4		95.4		87.1		0.0		99.3		94.6		87.5		0.0
Proportion high cholesterol	13.1		12.0		12.2		6.6		14.5		13.2		12.4		11.0
LDL cholesterol, mg/dL^b	90.04 (21.2)		88.75 (17.1)		87.69 (16.6)		78.04 (89.8)		90.85 (34.2)		89.42 (27.7)		87.67 (23.9)		81.72 (70.7)
Proportion missing	99.4		95.7		88.0		0.0		99.3		95.0		88.8		0.0
Proportion high LDL cholesterol	8.5		7.9		7.6		3.2		9.8		9.3		8.9		4.7
HDL cholesterol, mg/dL^b	51.13 (10.6)		50.83 (8.6)		50.62 (8.2)		61.33 (70.1)		51.82 (17.5)		50.79 (14.1)		49.76 (12.4)		50.25 (32.9)
Proportion missing	99.4		95.6		87.6		0.0		99.3		94.9		88.5		0.0
Triglycerides, mg/dL^b	101.76 (43.7)		99.93 (34.9)		98.87 (33.2)		87.53 (218.0)		103.31 (67.3)		102.25 (53.4)		101.43 (42.8)		101.58 (154.2)
Proportion missing	99.4		95.7		88.1		0.0		99.3		95.0		89.0		0.0
Proportion high triglycerides	16.5		15.6		14.8		10.0		15.9		15.5		15.3		22.5

Abbreviations: ADHD, attention-deficit/hyperactivity disorder; HbA1c, hemoglobin A1c; HDL, high-density lipoprotein; LDL, low-density lipoprotein; SSRI, selective serotonin reuptake inhibitor.

^a The means of the inverse probability of treatment weights after truncation were as follows: 0.98 (standard deviation, 0.63) for the full cohort, 0.98 (standard deviation, 0.73) for linked cohort 1, 0.98 (standard deviation, 0.77) for linked cohort 2, and 1.00 (standard deviation, 1.02) for linked cohort 3.

^b Values are expressed as mean (standard deviation).

^c Values are expressed as median (interquartile range).

^d Distribution of lab test results summarized among available values (before multiple imputation).

Treatment effects

In the full cohort, the unadjusted association suggested a 3-fold increased hazard of T2D among antipsychotic initiators compared with control patients (HR = 3.06, 95% CI: 2.87, 3.25; Figure 3A, Web Table 5). The magnitude of association attenuated after controlling for baseline confounders in the claims (HR = 2.26, 95% CI: 2.07, 2.49; Figure 3B) and baseline confounders in both the claims and the imputed laboratory data (HR = 2.25, 95% CI: 2.05, 2.47; Figure 3C). The distributions of laboratory test results were generally similar before and after imputation (Web Table 6).

Figure 3. Comparison of treatment effect estimates before and after accounting for selection bias and confounding, IBM MarketScan Data, United States, 2010–2019. A) Estimates with no confounding adjustment. B) Confounding adjustment by claims covariates. C) Confounding adjustment by claims and laboratory covariates. The full cohort, identified in the primary data set, was the target population of interest. The linked cohorts gradually became more restrictive: Linked cohort 1 consisted of linked patients with data for any laboratory test (not limited to the 3 tests of interest) during the study period, linked cohort 2 consisted of linked patients with data for any laboratory test (not limited to the 3 tests of interest) during the covariate assessment period, and linked cohort 3 consisted of linked patients with a recorded result for each of the 3 lab tests of interest during the covariate assessment period (complete confounder data). Estimates reported for the full cohort were repeated after adjustment for selection bias for ease of comparison. The 95% confidence intervals (CIs) are based on standard sandwich variance estimators; see Web Table 8 for 95% CIs based on variance estimated using a nonparametric bootstrap method. HR, hazard ratio.

With no adjustment for potential selection bias, effect estimates in the linked cohorts suggested a different magnitude of association from the full cohort. The HR adjusted for claims covariates was 1.79 (95% CI: 1.41, 2.27) in linked cohort 1, 1.63 (95% CI: 0.99, 2.68) in linked cohort 2, and 0.87 (95% CI: 0.29, 2.56) in linked cohort 3 (Figure 3B). Estimates were similar after controlling for both claims and laboratory covariates (Figure 3C).

After accounting for selection bias, claims-only confounding–adjusted HRs in linked cohort 1 (HR = 2.11, 95% CI: 1.62, 2.74) and linked cohort 2 (HR = 2.12, 95% CI: 0.88, 5.13) were similar to the point estimate observed in the full cohort (HR = 2.26, 95% CI: 2.07, 2.49; Figure 3B). In linked cohort 3, the claims-only confounding–adjusted estimate (HR = 8.15, 95% CI: 1.24, 53.56) remained different from the full cohort estimate, but accounting for potential selection bias corrected the direction of association (Figure 3B). After adjustment for both claims and laboratory covariates, effect estimates remained similar in linked cohort 1 and linked cohort 2, whereas the variance increased substantially in linked cohort 3 (HR = 13.90, 95% CI: 1.61, 120.09; Figure 3C). As expected, treatment effects estimated within the linked cohorts were less precise than those in the full cohort, and weighting by IPSW resulted in even wider 95% CIs. The extent to which applying IPSW increased variance differed for each cohort (Web Table 7). For example, for claims-only confounding–adjusted estimates, IPSW had limited impact on precision (confidence limit ratio of 1.61 before IPSW vs. 1.69 after IPSW) in linked cohort 1, whereas a more substantial increase in variance was observed in linked cohort 2 (confidence limit ratio 2.71 vs. 5.83) and linked cohort 3 (confidence limit ratio 8.83 vs. 43.19). The standard errors obtained from bootstrapping were generally similar to the estimates obtained from the standard sandwich variance estimator (Web Table 8), although in the weighted settings, the estimates from bootstrapping were slightly smaller.
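The confidence limit ratios quoted above follow directly from the reported 95% CIs for the claims-only confounding-adjusted estimates. The short Python check below recomputes them. The function name and dictionary layout are illustrative, and the inputs are the rounded limits given in the text, so small rounding differences from the unrounded estimates are possible.

```python
def confidence_limit_ratio(lower, upper):
    """Ratio of the upper to the lower limit of a 95% CI."""
    return upper / lower


# Claims-only confounding-adjusted 95% CIs reported in the text,
# keyed by (linked cohort, whether IPSW was applied).
reported_cis = {
    (1, "before IPSW"): (1.41, 2.27),
    (1, "after IPSW"): (1.62, 2.74),
    (2, "before IPSW"): (0.99, 2.68),
    (2, "after IPSW"): (0.88, 5.13),
    (3, "before IPSW"): (0.29, 2.56),
    (3, "after IPSW"): (1.24, 53.56),
}

ratios = {key: round(confidence_limit_ratio(lo, hi), 2)
          for key, (lo, hi) in reported_cis.items()}

for (cohort, stage), ratio in ratios.items():
    print(f"linked cohort {cohort}, {stage}: {ratio}")
```

A larger ratio indicates a wider interval on the multiplicative scale, so the jump in linked cohort 3 reflects how much precision is lost when a very small cohort is reweighted.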

DISCUSSION

We highlighted the importance of accounting for potential selection bias in linked database studies using the example of antipsychotics and the risk of T2D in youths within a claims and laboratory linked database. While this linked database offered more potential for confounding control compared with the claims database alone, patients within the linked cohorts were not representative of the full claims-based cohort, and failure to account for potential selection bias resulted in incorrect point estimates. In our application example, the laboratory values within the supplemental data set were not strong confounders, and restriction to a linked cohort introduced a substantial amount of selection bias. Applying IPSW resulted in effect estimates that were comparable to the full, original study cohort (the target population), demonstrating that a valid solution exists for addressing the often-neglected issue of selection bias in linked database studies.

Studies conducted within linked databases should carefully consider who is being analyzed and what the target population is. We assumed that the target population was the full cohort identified in the primary data set, and therefore, estimates obtained within linked subsets of the full cohort could potentially be biased if selection bias was not appropriately accounted for. IPSW provided a valid approach to extending our inferences from the linked population to the full cohort. However, some linked database studies may consider the linked subset as the target population (as opposed to the full cohort). In such circumstances, effect estimates from the linked subset target the proper estimand associated with the population of interest and are not necessarily prone to selection bias (44). In other words, the estimates would be unbiased for the linked subset, but they may not be generalizable to the full cohort. Finally, some studies may consider the target population as an external population that is distinct from the study sample (i.e., partially or completely nonoverlapping with the full cohort or the linked subset). In these situations, findings from the study sample can potentially be extended to an external population using approaches for transportability (such as inverse odds weighting) that are described elsewhere (45–48).
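The idea of reweighting a linked subset so that it resembles the full cohort can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the data are invented, there is a single binary covariate (a hypothetical metabolic condition flag), and the probability of linkage is estimated nonparametrically within covariate strata rather than with the selection models used in the study.

```python
import numpy as np


def ipsw_by_stratum(x, linked):
    """Inverse probability of selection weights, 1 / P(linked = 1 | x),
    with P estimated as the observed linked proportion in each level of x.
    Unlinked patients receive weight 0 because they are not analyzed."""
    x = np.asarray(x)
    linked = np.asarray(linked).astype(bool)
    weights = np.zeros(len(x), dtype=float)
    for level in np.unique(x):
        in_level = x == level
        p_linked = linked[in_level].mean()
        if p_linked == 0:
            # Positivity violation: no linked patients represent this stratum.
            raise ValueError(f"no linked patients with x == {level}")
        weights[in_level & linked] = 1.0 / p_linked
    return weights


# Invented full cohort of 1,000 patients: 200 with the condition, 800 without.
x = np.repeat([1, 0], [200, 800])
# Linkage is more likely with the condition (100 of 200) than without (80 of 800).
linked = np.concatenate([
    np.repeat([1, 0], [100, 100]),
    np.repeat([1, 0], [80, 720]),
]).astype(bool)

w = ipsw_by_stratum(x, linked)

full_prevalence = x.mean()
linked_prevalence = x[linked].mean()
weighted_prevalence = np.average(x[linked], weights=w[linked])

print(f"full cohort prevalence:       {full_prevalence:.3f}")
print(f"linked subset, unweighted:    {linked_prevalence:.3f}")
print(f"linked subset, IPSW-weighted: {weighted_prevalence:.3f}")
```

In this toy example the condition is overrepresented in the linked subset, and weighting restores the full cohort prevalence. If the linked subset itself were the target population, no such weighting would be needed.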

Our study highlights the importance of explicitly specifying the target population (e.g., full cohort, linked subset, an external population) and identifying an appropriate approach to generate unbiased effect estimates for the target population of interest (with the usual assumptions of conditional exchangeability, positivity, and causal consistency). We found that applying IPSW created a pseudopopulation comparable to the full cohort for 2 of the 3 linked cohorts. However, residual selection bias was present in linked cohort 3 (complete cases), which was not unexpected given that this cohort was small (<0.5% of the full cohort) and substantially unrepresentative of the full cohort. A complete-case analysis is also known to be biased in most circumstances (49, 50). While we highlighted several definitions of a linked cohort, we anticipate that linked cohort 2 (any linked data during covariate assessment period) will be the most widely used in pharmacoepidemiologic studies.

To further investigate the residual biases in linked cohort 3, we explored different ways to truncate inverse probability weights and a different specification of the weight models (Web Table 9, Web Figure 2). In these exploratory analyses, we found that possible positivity violations likely generated extreme weights, which resulted in residual imbalances across treatment groups and unstable effect estimates (adjusted HR ranging from 0.63 to 11.20). By requiring a recorded test result for hemoglobin A1c, cholesterol, and triglycerides, we likely included a much greater proportion of patients who were at a higher metabolic risk in the complete-case cohort and were not able to adequately account for the selection bias due to insufficient information on patients with a lower metabolic risk (who were more common in the full cohort).
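To show how a single extreme weight can dominate a weighted analysis, and what truncation does about it, here is a small sketch. The weights are invented, the 1st and 99th percentile cutoffs are hypothetical defaults rather than the thresholds used in this study, and the Kish effective sample size is used only as a convenient summary of weight variability.

```python
import numpy as np


def truncate_weights(weights, lower_pct=1.0, upper_pct=99.0):
    """Set weights below/above the given percentiles to the percentile values.
    The default cutoffs are illustrative assumptions."""
    weights = np.asarray(weights, dtype=float)
    lo, hi = np.percentile(weights, [lower_pct, upper_pct])
    return np.clip(weights, lo, hi)


def effective_sample_size(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    weights = np.asarray(weights, dtype=float)
    return weights.sum() ** 2 / np.sum(weights ** 2)


# Invented example: 99 patients with weight 1 and one patient with weight 500,
# as might arise when a stratum is almost never linked (near-positivity violation).
weights = np.array([1.0] * 99 + [500.0])
truncated = truncate_weights(weights)

print(f"max weight before: {weights.max():.2f}, after: {truncated.max():.2f}")
print(f"effective sample size before: {effective_sample_size(weights):.1f}")
print(f"effective sample size after:  {effective_sample_size(truncated):.1f}")
```

Truncation recovers precision by shrinking the extreme weight, but the truncated weights no longer fully correct for the covariate pattern that the extreme weight was standing in for, which is one way residual imbalance can persist.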

There are several potential limitations to this analysis. First, we applied the approaches to only one empirical example. In our study, the potential confounders from the supplemental database turned out not to be strong confounders, but the data linkage introduced a substantial amount of selection bias. The extent to which selection bias or confounding may be present in a linked database study may differ in other applications. Nevertheless, the outlined approaches can be applied to other linked database studies in general. Second, we used multiple imputation to handle missing laboratory data, but other approaches, such as inverse probability weighting, could be considered (51, 52). Third, we truncated inverse probability weights to minimize the influence of outliers; compared with no weight truncation, this approach increased precision at the expense of potentially increasing the imbalances between treatment groups (32). Since the degree of truncation was small, it is unlikely to have substantially influenced our findings. Finally, as expected, we observed a bias-variance tradeoff in effect estimates weighted by IPSW. Weighting can increase variance (32), and we observed that variances grew progressively larger as the linked cohorts differed more from the full cohort. However, accurate point estimates are generally prioritized in nonrandomized studies to minimize bias and achieve internal validity.

CONCLUSIONS

Studies conducted within linked databases, often with the goal of improved confounding control, may be restricted to patients who are not representative of the target population of interest. Analyses conducted within linked cohorts may generate biased effect estimates for the target population of interest, but this selection bias can be reduced through inverse probability of selection weights.

Supplementary Material

Web_Material_kwab299


ACKNOWLEDGMENTS

Author affiliations: Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, Massachusetts, United States (Jenny W. Sun, Rui Wang, Dongdong Li, Sengwee Toh); and Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States (Rui Wang).

This study was funded by Harvard Medical School and Harvard Pilgrim Health Care Institute through the Thomas O. Pyle Fellowship Fund and the Agency for Healthcare Research and Quality (grant R01HS026214).

This study was based on data from the IBM MarketScan Commercial Database obtained and used under license for the present study. Restrictions apply to the availability of these data, so they are not publicly available. The data underlying the results presented in the study are available for purchase by contacting the database owners.

We thank Jenny Hochstadt for her assistance in accessing the MarketScan data.

This work was presented as a podium presentation at the 37th International Conference on Pharmacoepidemiology (online), August 23–25, 2021.

J.W.S. is currently employed by Pfizer Inc. for unrelated work. All aspects of this work included in the initial submission, including the study design, data analysis, and manuscript draft, were completed prior to employment at Pfizer Inc. The other authors report no conflicts.

REFERENCES

1. Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. J Clin Epidemiol. 2005;58(4):323–337.
2. Bradley CJ, Penberthy L, Devers KJ, et al. Health services research and data linkages: issues, methods, and directions for the future. Health Serv Res. 2010;45(5):1468–1488.
3. Trifirò G, Sultana J, Bate A. From big data to smart data for pharmacovigilance: the role of healthcare databases and other emerging sources. Drug Saf. 2018;41(2):143–149.
4. Mears GD, Rosamond WD, Lohmeier C, et al. A link to improve stroke patient care: a successful linkage between a statewide emergency medical services data system and a stroke registry. Acad Emerg Med. 2010;17(12):1398–1404.
5. García Álvarez L, Aylin P, Tian J, et al. Data linkage between existing healthcare databases to support hospital epidemiology. J Hosp Infect. 2011;79(3):231–235.
6. van Herk-Sukel MPP, Lemmens VEPP, van de Poll-Franse LV, et al. Record linkage for pharmacoepidemiological studies in cancer patients. Pharmacoepidemiol Drug Saf. 2012;21(1):94–103.
7. Harron K, Goldstein H, Wade A, et al. Linkage, evaluation and analysis of national electronic healthcare data: application to providing enhanced blood-stream infection surveillance in paediatric intensive care. PLoS One. 2013;8(12):e85278.
8. Setoguchi S, Zhu Y, Jalbert JJ, et al. Validity of deterministic record linkage using multiple indirect personal identifiers: linking a large registry to claims data. Circ Cardiovasc Qual Outcomes. 2014;7(3):475–480.
9. Patorno E, Gopalakrishnan C, Franklin JM, et al. Claims-based studies of oral glucose-lowering medications can achieve balance in critical clinical variables only observed in electronic health records. Diabetes Obes Metab. 2018;20(4):974–984.
10. Huybrechts KF, Gopalakrishnan C, Franklin JM, et al. Claims data studies of direct oral anticoagulants can achieve balance in important clinical parameters only observable in electronic health records. Clin Pharmacol Ther. 2019;105(4):979–993.
11. Schmidt M, Schmidt SAJ, Adelborg K, et al. The Danish health care system and epidemiological research: from health care contacts to database records. Clin Epidemiol. 2019;11:563–591.
12. Pratt NL, Mack CD, Meyer AM, et al. Data linkage in pharmacoepidemiology: a call for rigorous evaluation and reporting. Pharmacoepidemiol Drug Saf. 2020;29(1):9–17.
13. Rivera DR, Gokhale MN, Reynolds MW, et al. Linking electronic health data in pharmacoepidemiology: appropriateness and feasibility. Pharmacoepidemiol Drug Saf. 2020;29(1):18–29.
14. Lin KJ, Schneeweiss S. Considerations for the analysis of longitudinal electronic health records linked to claims data to study the effectiveness and safety of drugs. Clin Pharmacol Ther. 2016;100(2):147–159.
15. Dusetzina SB, Tyree S, Meyer A-M, et al. Linking Data for Health Services Research: A Framework and Instructional Guide. Rockville, MD: Agency for Healthcare Research and Quality (US); 2014. https://www.ncbi.nlm.nih.gov/books/NBK253313/. Accessed November 19, 2020.
16. Mansfield KE, Nitsch D, Smeeth L, et al. Prescription of renin–angiotensin system blockers and risk of acute kidney injury: a population-based cohort study. BMJ Open. 2016;6(12):e012690.
17. Bouras G, Markar SR, Burns EM, et al. The psychological impact of symptoms related to esophagogastric cancer resection presenting in primary care: a national linked database study. Eur J Surg Oncol. 2017;43(2):454–460.
18. Solomon DH, Liu C-C, Kuo I-H, et al. Effects of colchicine on risk of cardiovascular events and mortality among patients with gout: a cohort study using electronic medical records linked with Medicare claims. Ann Rheum Dis. 2016;75(9):1674–1679.
19. Lee MP, Glynn RJ, Schneeweiss S, et al. Risk factors for heart failure with preserved or reduced ejection fraction among Medicare beneficiaries: application of competing risks analysis and gradient boosted model. Clin Epidemiol. 2020;12:607–616.
20. Berger A, Simpson A, Leeper NJ, et al. Real-world predictors of major adverse cardiovascular events and major adverse limb events among patients with chronic coronary artery disease and/or peripheral arterial disease. Adv Ther. 2020;37(1):240–252.
21. Bohensky M. Bias in data linkage studies. In: Harron K, Goldstein H, Dibben C, eds. Methodological Developments in Data Linkage. London, UK: John Wiley & Sons, Ltd; 2015:63–82.
22. Galling B, Roldán A, Nielsen RE, et al. Type 2 diabetes mellitus in youth exposed to antipsychotics: a systematic review and meta-analysis. JAMA Psychiat. 2016;73(3):247–259.
23. Bobo WV, Cooper WO, Stein CM, et al. Antipsychotics and the risk of type 2 diabetes mellitus in children and youth. JAMA Psychiat. 2013;70(10):1067.
24. De Hert M, Detraux J, van Winkel R, et al. Metabolic and cardiovascular adverse effects associated with antipsychotic drugs. Nat Rev Endocrinol. 2012;8(2):114–126.
25. De Hert M, Dobbelaere M, Sheridan EM, et al. Metabolic and endocrine adverse effects of second-generation antipsychotics in children and adolescents: a systematic review of randomized, placebo controlled trials and guidelines for clinical practice. Eur Psychiatry. 2011;26(3):144–158.
26. American Diabetes Association. Consensus development conference on antipsychotic drugs and obesity and diabetes. Diabetes Care. 2004;27(2):596–601.
27. IBM. MarketScan Research Databases. 2019; https://www.ibm.com/products/marketscan-research-databases. Accessed November 19, 2020.
28. Brookhart MA, Todd JV, Li X, et al. Estimation of biomarker distributions using laboratory data collected during routine delivery of medical care. Ann Epidemiol. 2014;24(10):754–761.
29. Sun JW, Bourgeois FT, Haneuse S, et al. Development and validation of a pediatric comorbidity index. Am J Epidemiol. 2021;190(5):918–927.
30. Teltsch DY, Fazeli Farsani S, Swain RS, et al. Development and validation of algorithms to identify newly diagnosed type 1 and type 2 diabetes in pediatric population using electronic medical records and claims data. Pharmacoepidemiol Drug Saf. 2019;28(2):234–243.
31. Robins JM, Hernán MÁ, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550–560.
32. Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol. 2008;168(6):656–664.
33. Sterne JAC, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:b2393.
34. SAS Institute Inc. SAS/STAT, 14.1 User’s Guide The MI Procedure. Cary, NC: SAS Institute Inc; 2015.
35. Leyrat C, Seaman SR, White IR, et al. Propensity score analysis with partially observed covariates: how should multiple imputation be used? Stat Methods Med Res. 2019;28(1):3–19.
36. Granger E, Sergeant JC, Lunt M. Avoiding pitfalls when combining multiple imputation and propensity scores. Stat Med. 2019;38(26):5120–5132.
37. Rubin DB. Multiple Imputation for Survey Nonresponse. New York, NY: Wiley; 1987.
38. Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology. 2004;15(5):615–625.
39. Cole SR, Stuart EA. Generalizing evidence from randomized clinical trials to target populations: the ACTG 320 trial. Am J Epidemiol. 2010;172(1):107–115.
40. Lin DY, Wei L-J. The robust inference for the Cox proportional hazards model. J Am Stat Assoc. 1989;84(408):1074–1078.
41. Poole C. Low P values or narrow confidence intervals: which are more durable? Epidemiology. 2001;12(3):291–294.
42. Hernán MA, Robins JM. Causal Inference: What If? Boca Raton, FL: CRC Press LLC; 2020.
43. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009;28(25):3083–3107.
44. Hernán MA. Invited commentary: selection bias without colliders. Am J Epidemiol. 2017;185(11):1048–1050.
45. Dahabreh IJ, Robertson SE, Tchetgen Tchetgen EJ, et al. Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics. 2019;75(2):685–694.
46. Westreich D, Edwards JK, Lesko CR, et al. Transportability of trial results using inverse odds of sampling weights. Am J Epidemiol. 2017;186(8):1010–1014.
47. Dahabreh IJ, Robertson SE, Steingrimsson JA, et al. Extending inferences from a randomized trial to a new target population. Stat Med. 2020;39(14):1999–2014.
48. Webster-Clark M, Lund JL, Stürmer T, et al. Reweighting oranges to apples: transported RE-LY trial versus nonexperimental effect estimates of anticoagulation in atrial fibrillation. Epidemiology. 2020;31(5):605–613.
49. Laird NM. Missing data in longitudinal studies. Stat Med. 1988;7(1–2):305–315.
50. Ross RK, Breskin A, Westreich D. When is a complete-case approach to missing data valid? The importance of effect-measure modification. Am J Epidemiol. 2020;189(12):1583–1589.
51. Horton NJ, Kleinman KP. Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat. 2007;61(1):79–90.
52. Little RJ, Rubin DB. Statistical Analysis With Missing Data. 3rd ed. Hoboken, NJ: John Wiley & Sons; 2019.
