Analysis of Sampling Bias in Large Health Care Claims Databases

Alex Dahlen, PhD¹; Vivek Charu, MD, PhD¹,²

JAMA Netw Open. 2023 Jan 6;6(1):e2249804. doi:10.1001/jamanetworkopen.2022.49804

¹Quantitative Sciences Unit, Department of Medicine, Stanford University School of Medicine, Stanford, California

²Department of Pathology, Stanford University School of Medicine, Stanford, California

Accepted for Publication: November 15, 2022.

Published: January 6, 2023. doi:10.1001/jamanetworkopen.2022.49804

Corresponding Authors: Alex Dahlen, PhD, Quantitative Sciences Unit, Department of Medicine, Stanford University School of Medicine, 300 Pasteur Dr, Palo Alto, CA 94304 (adahlen@stanford.edu); Vivek Charu, MD, PhD, Quantitative Sciences Unit, Department of Medicine, Stanford University School of Medicine, 300 Pasteur Dr, Palo Alto, CA 94304 (vcharu@stanford.edu).

Author Contributions: Dr Dahlen had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Concept and design: Both authors.

Acquisition, analysis, or interpretation of data: Both authors.

Drafting of the manuscript: Both authors.

Critical revision of the manuscript for important intellectual content: Both authors.

Statistical analysis: Both authors.

Administrative, technical, or material support: Charu.

Supervision: Charu.

Conflict of Interest Disclosures: None reported.

Funding/Support: Research reported in this publication was supported by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number KL2TR003143.

Role of the Funder/Sponsor: The funder had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Disclaimer: The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Data Sharing Statement: See Supplement 2.

Additional Contributions: We thank Kate Miller and Steve Goodman for conversations about the manuscript.

Additional Information: Data for this project were accessed using the Stanford Center for Population Health Sciences (PHS) Data Core. The PHS Data Core is supported by a National Institutes of Health National Center for Advancing Translational Sciences Clinical and Translational Science Award (UL1TR003142) and by internal Stanford funding.

PMCID: PMC9857613 PMID: 36607640

Abstract

This cross-sectional study characterizes variation in sampling in a large health care claims database at the zip code level in 2018 and assesses whether socioeconomic and demographic factors are associated with inclusion.

Introduction

Health care claims databases that aggregate claims from multiple commercial insurers are increasingly being used to generate clinical evidence.¹⁻³ These databases represent a nonrandom sample of the underlying population, but often little attention is paid to the inherent sampling bias within the data and how it might affect results. As an illustrative example, we characterize variation in sampling in the Optum Clinformatics Data Mart (CDM) at the zip code level in 2018 and identify socioeconomic and demographic factors associated with inclusion.

Methods

This cross-sectional study was approved by the Stanford University institutional review board. Reporting followed the STROBE reporting guideline.

The Optum CDM consists of administrative claims derived from several large commercial and Medicare Advantage health plans. For the primary analysis of this cross-sectional study, we count the number of individuals with CDM coverage on a single day (June 1, 2018) in each zip code and compare it with the 2018 census estimates of the total population in that zip code. In sensitivity analyses, we also consider 1 strictly larger cohort—individuals with at least 1 day of coverage at any point during 2018—and 1 strictly smaller cohort—individuals with continuous coverage during the entire year of 2018.
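The single-day sampling fraction described above can be sketched in a few lines. This is a minimal illustration only: the enrollment tuple layout, the `census_pop` dictionary, and the function name are hypothetical and do not reflect the actual CDM schema or the authors' code.

```python
from datetime import date

def sampling_fraction_by_zip(enrollment, census_pop, index_date):
    """Fraction of each zip code's census population with coverage on index_date.

    enrollment: iterable of (person_id, zip_code, start, end) tuples (hypothetical layout).
    census_pop: dict mapping zip_code -> census population estimate.
    """
    covered = {}
    for person_id, zip_code, start, end in enrollment:
        if start <= index_date <= end:
            # A set keeps each person counted once even with overlapping records.
            covered.setdefault(zip_code, set()).add(person_id)
    return {
        z: len(covered.get(z, ())) / pop
        for z, pop in census_pop.items()
        if pop > 0
    }

enrollment = [
    ("p1", "94304", date(2018, 1, 1), date(2018, 12, 31)),
    ("p2", "94304", date(2018, 7, 1), date(2018, 12, 31)),  # not covered on June 1
    ("p3", "80202", date(2018, 3, 1), date(2018, 6, 30)),
    ("p3", "80202", date(2018, 5, 1), date(2018, 9, 30)),   # same person, counted once
]
census_pop = {"94304": 4, "80202": 10, "99501": 20}
fractions = sampling_fraction_by_zip(enrollment, census_pop, date(2018, 6, 1))
print(fractions)
```

Under the same assumptions, the two sensitivity cohorts would change only the coverage condition: any overlap with calendar year 2018 for the larger cohort, and coverage spanning the whole year for the smaller one.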

To explain the variation in zip code–level CDM sampling, we fit an inverse-variance weighted multivariable linear regression model with 30 socioeconomic and demographic features extracted from the 2018 census, and state-level fixed effects. See the Table for the list of features and the eAppendix in Supplement 1 for details about the model and methods (including sensitivity analyses).
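A weighted regression with state-level fixed effects can be sketched as follows. This is not the authors' implementation (the eAppendix holds those details). It assumes, for illustration, that the variance of a zip code's sampling fraction shrinks in proportion to 1/population, so that the inverse-variance weight is proportional to population. The function name, the single synthetic feature, and all numbers are hypothetical.

```python
import numpy as np

def weighted_fixed_effects_fit(y, X, states, weights):
    """Weighted least squares of y on X plus one indicator column per state.

    Returns (feature_coefficients, state_intercepts). The per-state indicators
    take the place of a global intercept.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    labels, state_idx = np.unique(states, return_inverse=True)
    dummies = np.eye(len(labels))[state_idx]           # one-hot state indicators
    design = np.hstack([X, dummies])
    root_w = np.sqrt(np.asarray(weights, dtype=float))
    # Scaling rows by sqrt(weight) turns weighted least squares into ordinary least squares.
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    k = X.shape[1]
    return coef[:k], dict(zip(labels, coef[k:]))

# Synthetic data: sampling fraction (%) = state level + 0.05 * (% college) + noise.
rng = np.random.default_rng(0)
n = 200
states = rng.choice(["CA", "CO", "AK"], size=n)
pct_college = rng.uniform(0, 60, size=n)
population = rng.integers(500, 50_000, size=n)
state_level = {"CA": 4.0, "CO": 9.0, "AK": 1.0}
noise_free = np.array([state_level[s] for s in states]) + 0.05 * pct_college
# Assumed noise model: smaller zip codes have noisier sampling fractions.
y = noise_free + rng.normal(0, 1, size=n) * np.sqrt(1000 / population)

beta, intercepts = weighted_fixed_effects_fit(y, pct_college[:, None], states, population)
print(beta, intercepts)
```

With the fixed effects in the design matrix, each feature coefficient is estimated from within-state variation only, which is why between-state differences in sampling do not drive the estimates.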

Table. Associations Between 30 Socioeconomic and Demographic Features and Claims Database Sampling Fraction at the Zip Code Level, Accounting for State-Level Variation in Sampling Fraction in 2 Models.

Characteristic	Partial correlation coefficient^a	Multivariable regression coefficient (SD)^b	P value
Population
Total population (millions)	0.01	−0.0005 (0.00018)	<.001
Log₁₀ pop density (1/square mile)	0.14	0.67 (0.06)	<.001
Female sex	0.12	0.050 (0.010)	<.001
Race and ethnicity
Asian (non-Hispanic)	0.16	−0.010 (0.004)	<.001
Black (non-Hispanic)	−0.15	−0.008 (0.002)	<.001
Hispanic	−0.19	0.001 (0.004)	.64
White (non-Hispanic)	0.20	[Reference]	NA
Other^c	−0.08	−0.026 (0.008)	<.001
Age, y
<18	−0.09	[Reference]	NA
18-40	−0.21	−0.019 (0.010)	<.001
40-60	0.32	0.084 (0.014)	<.001
60-80	0.14	0.002 (0.012)	.71
>80	0.12	0.055 (0.024)	<.001
Household income, $
<15 000	−0.34	[Reference]	NA
15 000-30 000	−0.40	−0.016 (0.014)	.01
30 000-45 000	−0.37	−0.009 (0.012)	.16
45 000-60 000	−0.23	0.002 (0.014)	.78
60 000-100 000	0.10	0.007 (0.010)	.15
100 000-125 000	0.33	0.033 (0.018)	<.001
125 000-200 000	0.43	0.016 (0.012)	.01
>200 000	0.42	0.071 (0.014)	<.001
Work and insurance
Unemployed	−0.29	−0.001 (0.014)	.89
No health insurance	−0.31	−0.030 (0.008)	<.001
Education
Less than high school	−0.33	[Reference]	NA
High school	−0.27	0.038 (0.010)	<.001
Some college	−0.09	−0.010 (0.008)	.01
College	0.41	0.071 (0.010)	<.001
Graduate	0.30	−0.021 (0.010)	<.001
Housing
Houses that are owner occupied	0.25	−0.003 (0.004)	.14
Median house price (millions), $	0.36	−0.00001 (0.00003)	.43

Abbreviation: NA, not applicable.

^a The first set of models considers each covariate of interest separately, along with state-level fixed effects. Partial correlation coefficients derived from these models are presented; positive correlations indicate that zip codes with higher values of the covariate of interest are associated with higher zip code–level sampling in the claims database, even after adjusting for state-level clustering in sampling.

^b The second model is a full multivariable model that includes all 30 covariates of interest in addition to state-level fixed effects. For example, for a 10–percentage point increase in a zip code’s fraction of households earning more than $200 000, the model suggests that the claims database sampling fraction will increase by about 0.7 percentage points, on average (10 × 0.071 = 0.71).

^c Other race and ethnicity includes persons identifying as non-Hispanic American Indian and/or Alaska Native, non-Hispanic Native Hawaiian and Other Pacific Islander, non-Hispanic other races, and 2 or more races.
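The partial correlations in the first column can be illustrated with a short sketch. When the only adjustment is a set of state-level fixed effects, removing each state's mean from both the covariate and the sampling fraction and then correlating the residuals gives the partial correlation. The version below is unweighted for simplicity, which is an assumption and may differ from the weighting used in the actual models; the toy data and the function name are hypothetical.

```python
import numpy as np

def partial_corr_within_groups(x, y, groups):
    """Correlation between x and y after removing each group's mean from both."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    _, idx = np.unique(groups, return_inverse=True)
    counts = np.bincount(idx)
    x_res = x - (np.bincount(idx, weights=x) / counts)[idx]
    y_res = y - (np.bincount(idx, weights=y) / counts)[idx]
    return float(x_res @ y_res / np.sqrt((x_res @ x_res) * (y_res @ y_res)))

# Toy data: within each state, sampling rises with the high-income share,
# but the state with the higher income share has lower sampling overall.
states = ["A", "A", "A", "B", "B", "B"]
high_income_pct = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
sampling_pct = [6.0, 7.0, 8.0, 1.0, 2.0, 3.0]

raw = float(np.corrcoef(high_income_pct, sampling_pct)[0, 1])
partial = partial_corr_within_groups(high_income_pct, sampling_pct, states)
print(round(raw, 3), round(partial, 3))
```

In this toy example the raw correlation is negative while the state-adjusted correlation is positive, which shows why the adjustment for state-level clustering matters when reading the Table.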

Results

There were 16.4 million distinct individuals captured in the CDM on June 1, 2018, representing 5.4% of the US population. The median (IQR) zip code sampling fraction was 4.4% (2.5%-7.1%), with clear geographic variation in sampling (Figure). At the state level, Alaska had the lowest sampling rate (0.8%) and Colorado had the highest sampling rate (11.0%); in multivariable regression models, state-level fixed effects explained 34.6% of the zip code–level variation in sampling.

Figure. Zip Code–Level and State-Level Variation in Sampling in the Optum Clinformatics Data Mart Database (CDM) in 2018. A, CDM sampling in each zip code, estimated from the number of patients with coverage in CDM on June 1, 2018. B, CDM sampling in each state, estimated using the same criteria.

Associations between socioeconomic and demographic features and CDM sampling fraction, after adjusting for state-level variation, are provided in the Table. Estimated partial correlations and regression model coefficients indicated that inclusion in CDM was higher in zip codes with wealthier, older, more educated, and disproportionately White residents. These patterns were robust to the choice of cohort definition and held within each of the 10 most populous states individually. Socioeconomic and demographic features explained an additional 19.4% of the zip code–level variation in sampling on top of state-level variation, for a total adjusted R² of 54.0%.

Discussion

To interpret results generated from health care claims databases, it is essential to understand which patients are represented in them. We found that inclusion in the Optum CDM at the zip code level in 2018 varies spatially and along socioeconomic and demographic lines. Our study is limited by the granularity of demographic data; we analyzed data at the smallest geographic scale available in CDM, the zip code. Given our findings, there is likely to be additional bias within zip codes as well.

The socioeconomic and demographic features correlated with overrepresentation in claims data have also been shown to be effect modifiers across a diverse spectrum of health outcomes.⁴ The combination of heterogeneous sampling and effect modification (both driven, in this case, by social determinants of health) gives rise to external validity bias, where results generated from the claims data will fail to generalize to the underlying population.⁵ This bias can affect studies that estimate disease incidence and/or prevalence and even comparative effectiveness studies that use contemporary causal inference methods whenever there is meaningful heterogeneity in treatment or policy effects along socioeconomic and demographic lines.
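A small numeric sketch shows how this bias arises. All numbers below are hypothetical and chosen for illustration only; they are not estimates from CDM. Two strata differ both in how heavily they are sampled and in the prevalence of some outcome, so the pooled prevalence in the claims sample drifts toward the oversampled stratum.

```python
def pooled_prevalence(strata, sampled):
    """Prevalence in the full population or, if sampled=True, in the claims sample."""
    cases = people = 0.0
    for s in strata:
        n = s["population"] * (s["sampling_fraction"] if sampled else 1.0)
        people += n
        cases += n * s["prevalence"]
    return cases / people

# Hypothetical strata: the higher-income stratum is sampled more heavily
# and has a lower outcome prevalence.
strata = [
    {"name": "higher income", "population": 400_000,
     "sampling_fraction": 0.08, "prevalence": 0.05},
    {"name": "lower income", "population": 600_000,
     "sampling_fraction": 0.03, "prevalence": 0.10},
]

true_prev = pooled_prevalence(strata, sampled=False)
claims_prev = pooled_prevalence(strata, sampled=True)
print(true_prev, claims_prev)
```

Here the population prevalence is 8.0% but the claims sample gives 6.8%. If either the sampling fractions or the prevalences were equal across strata, the two numbers would agree, which is why both heterogeneous sampling and effect modification are needed for the bias to appear.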

Our study highlights the well-established importance of investigating sampling heterogeneity in analyses of large health care claims data to evaluate how sampling bias might compromise the accuracy and generalizability of results.⁶ Importantly, investigating these biases or accurately reweighting the data will require data sources external to the claims database itself. Health care claims databases offer enormous promise for medical research; characterizing and overcoming sampling bias in these data sets is essential.
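One common form of reweighting with external data is poststratification, sketched below. This is an illustration of the general technique, not a method used in this study. It assumes that stratum population totals are available from an external source such as the census; the function names, strata, and numbers are hypothetical.

```python
def unweighted_mean(sample_means, sample_counts):
    """Pooled mean in the sample, ignoring how strata were sampled."""
    total = sum(sample_counts.values())
    return sum(sample_counts[s] * sample_means[s] for s in sample_counts) / total

def poststratified_mean(sample_means, sample_counts, population_counts):
    """Reweight stratum-level sample means to external population totals.

    All arguments are dicts keyed by stratum. Each sampled unit in stratum s
    effectively receives weight population_counts[s] / sample_counts[s].
    """
    missing = [s for s in population_counts if sample_counts.get(s, 0) == 0]
    if missing:
        raise ValueError(f"no sampled units in strata: {missing}")
    total = sum(population_counts.values())
    return sum(population_counts[s] * sample_means[s] for s in population_counts) / total

# Hypothetical example: urban zip codes are heavily oversampled.
sample_means = {"urban": 0.06, "rural": 0.12}
sample_counts = {"urban": 9_000, "rural": 1_000}
population_counts = {"urban": 700_000, "rural": 300_000}  # from an external source

naive = unweighted_mean(sample_means, sample_counts)
adjusted = poststratified_mean(sample_means, sample_counts, population_counts)
print(naive, adjusted)
```

Reweighting of this kind corrects only for the variables used to define the strata. Any selection that operates within a stratum, such as the within-zip code bias noted above, is left untouched.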

Supplement 1.

eAppendix. Supplemental Methods

Supplement 2.

Data Sharing Statement

References

1. U.S. Food and Drug Administration. Real-world data: assessing electronic health records and medical claims data to support regulatory decision-making for drug and biological products. Published December 10, 2021. Accessed August 23, 2022. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/real-world-data-assessing-electronic-health-records-and-medical-claims-data-support-regulatory
2. Bykov K, He M, Gagne JJ. Trends in utilization of prescribed controlled substances in US commercially insured adults, 2004-2019. JAMA Intern Med. 2020;180(7):1006-1008. doi:10.1001/jamainternmed.2020.0989
3. Eberly LA, Garg L, Yang L, et al. Racial/ethnic and socioeconomic disparities in management of incident paroxysmal atrial fibrillation. JAMA Netw Open. 2021;4(2):e210247. doi:10.1001/jamanetworkopen.2021.0247
4. Braveman P, Gottlieb L. The social determinants of health: it’s time to consider the causes of the causes. Public Health Rep. 2014;129(1)(suppl 2):19-31. doi:10.1177/00333549141291S206
5. Degtiar I, Rose S. A review of generalizability and transportability. arXiv. Preprint posted online February 23, 2021. doi:10.48550/arXiv.2102.11904
6. Konrad R, Zhang W, Bjarndóttir M, Proaño R. Key considerations when using health insurance claims data in advanced data analyses: an experience report. Health Syst (Basingstoke). 2017;9(4):317-325. doi:10.1080/20476965.2019.1581433
