Abstract
Commercial healthcare claims datasets are a nonrandom sample of the US population, affecting generalizability. Rigorous comparisons of claims-derived results to ground-truth data that quantify external validity bias are lacking. Our goals were to (1) quantify external validity bias in commercial healthcare claims data and (2) evaluate how socioeconomic/demographic factors are related to the bias. We analyzed inpatient discharge records occurring between January 1, 2019 and December 31, 2019 in five states: California, Iowa, Maryland, Michigan, and New Jersey, and compared rates (per person-year) of the 250 most common inpatient procedures between claims and reference data for each target population. We used the Merative MarketScan Commercial Database for the claims data and the State Inpatient Databases and the US Census as reference. For a target population of all Americans, commercial healthcare claims underestimate the rate of overall inpatient discharges by 28.5%. The extent of bias varied across procedures, with the rates of ~25% of procedures being under- or overestimated by more than a factor of 2. Socioeconomic factors were significantly associated with the magnitude of bias (P < .001). When the target population was restricted to commercially insured Americans, the bias decreased substantially (1.2% of procedures were biased by more than a factor of 2), but some variation across procedures remained.
Keywords: generalizability, healthcare claims data, electronic healthcare records data
Introduction
Commercial healthcare claims databases are among the largest patient-level datasets ever assembled, offering enormous potential for clinical research. They are increasingly used for disease and healthcare utilization monitoring, comparative effectiveness research, and policy evaluation, with over 7000 publications using commercial healthcare claims indexed in PubMed.1-7 Despite their widespread use, commercial healthcare claims databases have not been rigorously vetted against ground-truth data.
Commercial healthcare claims datasets are assembled by aggregating data from potentially multiple commercial insurers and, as such, they represent a nonrandom sample of the underlying population. Inclusion in large commercial healthcare claims databases varies geographically and is systematically biased along socioeconomic and demographic lines, compared to the overall US population: patients who are old, White, wealthy, or college-educated are over-sampled relative to other populations.8 Statistical inferences derived from nonrandom samples are susceptible to external validity bias, meaning that they may fail to generalize to the underlying target population of interest.9 In the context of claims data, this bias arises when the same social determinants of health associated with inclusion in the datasets also affect the health outcomes of interest. As such, reliance on commercial healthcare claims alone may produce biased estimates of disease burden, healthcare access, and even treatment effectiveness. Though studies that analyze commercial healthcare claims often acknowledge that results only apply to a “commercially insured cohort,” the extent of bias in any specific context is rarely quantified.
Relatively few studies have attempted to quantify external validity bias in claims-derived results, in part because ground-truth data on the outcome of interest is often unavailable.10,11 Here, we present the most detailed empirical analysis of internal and external validity bias in healthcare claims data to date, focusing on the rates of inpatient procedures, for which a unique ground-truth dataset exists. We quantify the extent and variation in external validity bias across a comprehensive set of inpatient procedures, and evaluate how social determinants of health explain the magnitude of bias.
Methods
This cross-sectional study was approved by the Stanford University institutional review board (IRB 40974). Reporting followed the STROBE reporting guideline.
Data
For healthcare claims data, we used the Merative MarketScan Commercial Database (MarketScan). It includes the health service data for approximately 250 million privately insured employees and dependents in the United States with primary healthcare coverage through fee-for-service, point-of-service or capitated health plans. All enrollment records and inpatient, outpatient, ancillary, and drug claims are collected. Patient-level demographic information has been de-identified except for age, gender, and the state and Metropolitan Statistical Area of residence. The dataset’s digital object identifier (DOI) is: 10.57761/n5v8-0v21.
Ground-truth data were derived from the State Inpatient Databases (SIDs), which are part of a family of databases maintained by the Healthcare Cost and Utilization Project and sponsored by the Agency for Healthcare Research and Quality. The SIDs provide all inpatient discharges from nonfederal acute care hospitals; they include information on patient demographics, primary and secondary diagnosis and procedure codes, health insurance status/type, hospital charges, and length of stay. Because the SIDs capture more than 97% of all hospital discharges in each state, we considered data derived from the SIDs as ground truth. Patient/discharge-level SID data include age, gender, insurance type, and the patient’s zip code. State populations and demographic information to characterize the overall cohort were extracted from the 2019 American Community Survey (ACS) five-year census data.
We used a convenience sample of SIDs from California, Iowa, Maryland, Michigan, and New Jersey, and analyzed all inpatient discharges that occurred during the period January 1, 2019 to December 31, 2019 for patients aged 18–64 (at the time of discharge). Likewise, for the MarketScan data, we analyzed all inpatient discharges that occurred during the same period, for patients in the same age range (18–64), residing in the same five states. Both SID and MarketScan inpatient data exclude ambulatory and outpatient procedures. As described in Appendix S1 [Detailed Statistical Methods], we restricted both datasets to acute-care facilities and to procedures occurring therein.
Outcomes of interest
We compared the rates of inpatient procedures between claims data and the ground-truth data. Inpatient procedures were classified using the Clinical Classification Software Refined classification of ICD-10-PCS codes,12 and we studied the 250 most common inpatient procedures in the SID dataset, after excluding a small subset of procedures with extreme distributions of coding by state (see Appendix S1). A list of the procedure codes used to identify each inpatient procedure is provided in Table S1 [Excel file]. We chose to study inpatient procedures because, while diagnosis codes are often carried over across multiple encounters with the healthcare system, procedure codes reflect a procedure performed and billed for at a specific encounter.
Characterizing social determinants of health
We used two strategies to measure aggregate neighborhood-level SDOH. Our primary measure was the zip-code-level Neighborhood Deprivation Index (NDI), a single metric of deprivation defined and maintained by the National Cancer Institute13: 13 socioeconomic indicators extracted from the 2017 five-year census data are combined into a single measure of deprivation. We used population averages to roll the score up from the census-tract level to the zip-code level. As a sensitivity analysis, we used principal component analysis (PCA) to reduce 25 socioeconomic and demographic indicators extracted from the 2019 census to a single metric of socioeconomic status (SES) (see Appendix S1 and Figures S1 and S2). Because both measures put significant weight on poverty indicators, they are strongly correlated with each other.
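The PCA step described above can be sketched as follows. This is a minimal sketch, not the authors' exact pipeline: it assumes the 25 census indicators are arranged as a NumPy array with one row per zip code, and the function name `first_pc_score` and sign convention are our own choices.

```python
import numpy as np

def first_pc_score(indicators: np.ndarray) -> np.ndarray:
    """Collapse a (zip codes x indicators) matrix to one score per zip code:
    standardize each column, then project onto the first principal component
    (computed via SVD of the standardized matrix)."""
    z = (indicators - indicators.mean(axis=0)) / indicators.std(axis=0)
    # Rows of vt are the principal-component loading vectors
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    loadings = vt[0]
    # PCA signs are arbitrary; fix a convention so scores are reproducible
    if loadings.sum() < 0:
        loadings = -loadings
    return z @ loadings
```

In practice the resulting score would then be oriented so that higher values correspond to higher SES (or higher deprivation), matching the convention of the NDI.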
Statistical analysis
The goal of our analysis was to (1) quantify the potential bias in estimates of the prevalence of a large number of inpatient procedures derived from MarketScan and (2) to characterize factors associated with the size of the bias.
Quantifying the bias
We estimated the rate of each procedure in the claims data and the SID data separately. For the claims data, rates were estimated by dividing the total number of inpatient discharges with the appropriate ICD-10-PCS codes by the total number of patient-years of coverage in the dataset for the entire cohort in 2019. For the reference data, rates were estimated by dividing the total number of discharges with the appropriate ICD-10-PCS code in the SID data by the 2019 population estimate for the cohort as derived from ACS data, effectively assuming that the population is constant over the year.
We defined the relative bias for each procedure as the ratio of the claims-derived rate divided by the ground truth: relative bias = rate derived from claims data/rate derived from reference data. A relative bias of 1 indicates that claims-derived estimates align with ground truth data (no bias); values less than 1 indicate that the claims data underestimate the rate, and values greater than 1 indicate that claims data overestimate it. We computed 95% Poisson confidence intervals for both rate estimates, and errors were propagated to the relative bias using the delta method on the log of the ratio.
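The relative-bias calculation can be sketched directly from the counts and person-year denominators. This is a minimal sketch of the delta-method propagation described above (the function name and the 1.96 normal quantile for a 95% interval are our own choices): the standard error of the log of a ratio of two Poisson counts is sqrt(1/n1 + 1/n2).

```python
import math

def relative_bias_ci(n_claims, py_claims, n_ref, py_ref, z=1.96):
    """Relative bias (claims rate / reference rate) with a 95% CI.

    n_*  : number of discharges with the relevant procedure codes
    py_* : person-years of observation in each dataset

    The delta method applied to the log of the ratio of two Poisson
    counts gives SE(log ratio) = sqrt(1/n_claims + 1/n_ref).
    """
    rb = (n_claims / py_claims) / (n_ref / py_ref)
    se_log = math.sqrt(1.0 / n_claims + 1.0 / n_ref)
    return rb, rb * math.exp(-z * se_log), rb * math.exp(z * se_log)
```

Because the reference counts are large, the interval width is driven almost entirely by the claims-side count, which is why the rarest procedures in Table 2 have the widest intervals.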
Characterizing factors associated with the bias
We hypothesized that the magnitude of the relative bias for a given procedure would depend strongly on the social determinants of health of the patient population that undergoes each procedure. In particular, since people who are old, White, wealthy, or college-educated are over-sampled in the claims data,8 we hypothesized that procedures that are disproportionately performed on those demographic groups will tend to be overestimated, and vice versa.
For each procedure, we evaluated the strength of the association between the procedure and SDOH using a zip-code-level Poisson regression of the form:

log E[Y_i^(k)] = α_k + β_k · SDOH_i + log(N_i),

where k indexes the procedures and i indexes zip codes. For each zip code i, Y_i^(k) is the count of the procedure, N_i is the population (included as an offset), and SDOH_i is a zip-code-level proxy metric of aggregate SDOH. In our primary analysis, we use NDI as this proxy metric, and in a sensitivity analysis, we use the SES metric we defined by PCA. For each procedure k, β_k measures the strength of the association between SDOH and the procedure rate (across zip codes). Larger positive values of β_k indicate that the procedure disproportionately occurs in zip codes with higher levels of the SDOH metric, and vice versa.

Finally, we used a log-linear regression to quantify the association between the relative bias for each procedure and β_k. A significant negative association between these quantities would indicate that the most underestimated procedures are those that tend to be performed in zip codes with higher NDI (greater deprivation). We estimated an R2 value to quantify the fraction of the variation in relative bias explained by the association between NDI and the rate of the procedure.
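The per-procedure Poisson regression can be sketched as below. This is a minimal sketch, not the authors' code: it assumes zip-code-level arrays `y` (procedure counts), `pop` (population), and `sdoh` (the NDI or PCA metric), and fits the model with a log-population offset by Newton-Raphson rather than a packaged GLM routine.

```python
import numpy as np

def poisson_sdoh_beta(y, pop, sdoh, n_iter=50):
    """Fit log E[y_i] = alpha + beta * sdoh_i + log(pop_i) by
    Newton-Raphson on the Poisson log-likelihood; return (alpha, beta)."""
    X = np.column_stack([np.ones_like(sdoh), sdoh])
    offset = np.log(pop)
    # Start at the marginal log-rate with beta = 0
    b = np.array([np.log(y.sum() / pop.sum()), 0.0])
    for _ in range(n_iter):
        mu = np.exp(X @ b + offset)        # expected counts
        grad = X.T @ (y - mu)              # score vector
        hess = X.T @ (X * mu[:, None])     # Fisher information
        step = np.linalg.solve(hess, grad)
        b = b + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return b[0], b[1]
```

The fitted β_k for each procedure would then be regressed against the log of that procedure's relative bias (e.g., with `np.polyfit(betas, np.log(rel_bias), 1)`) to obtain the slope and R2 reported in the Results.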
Quantifying the bias for different target populations
Lastly, we were interested in understanding how the relative bias changes as the target population of interest changes. For our primary analysis described above, the target population was taken to be all Americans (aged 18–64 in 2019). We also considered two other possible target populations: insured Americans and commercially insured Americans. As the target population was restricted to more closely resemble the population represented in the claims dataset, we expected the overall bias to decrease, but we were still interested in quantifying how much variation in the bias remained. To compute reference estimates for these two additional target populations, we stratified both the SID data and the ACS census population estimates by insurance type to filter down to the relevant group.
Reproducibility
Full details of the statistical methods are provided in Appendix S1. The analysis was conducted in Python version 3.8.5, and the code has been shared publicly at https://github.com/alex-dahlen/ClaimsDataBenchmarking.
Results
We identified ~2.95 million hospital discharges among our cohort from January 1 to December 31, 2019, in the State Inpatient Databases (Table 1). For the same time period and age group, ~115 000 hospital discharges were identified in the claims dataset. Demographic information for the underlying populations is provided in Table 1.
Table 1.
A comparison of the two cohorts: the claims-data cohort (MarketScan) and the ground-truth reference cohort (SID and ACS five-year census data).
| | Claims data (MarketScan) 2019, aged 18–64 (five states [CA, IA, MD, MI, NJ]) | Ground truth (Census) 2019, aged 18–64 (five states [CA, IA, MD, MI, NJ]) |
|---|---|---|
| n | 2 756 043 | 42 062 788 |
| Days of coverage | ||
| Mean (SD) | 303.6 (103.9) | 365 (0) |
| Median [IQR] | 365 [273–365] | 365 [365–365] |
| Inpatient visits | ||
| Number of discharges | 114 756 | 2 945 667a |
| Inpatient discharges/100 patient-years | 5.01 | 7.003 |
| Female (%) | 1 384 162 (50.2%) | 21 014 821 (50.0%) |
| State (%) | ||
| CA | 1 187 569 (43.1%) | 24 775 310 (58.9%) |
| IA | 95 748 (3.5%) | 1 885 249 (4.5%) |
| MD | 275 767 (10.0%) | 3 774 488 (9.0%) |
| MI | 674 593 (24.5%) | 6 121 044 (14.6%) |
| NJ | 524 596 (19.0%) | 5 506 697 (13.1%) |
| Age, n (%) | ||
| 18–34 | 1 008 596 (36.6%) | 11 320 834 (26.9%) |
| 35–49 | 862 019 (31.3%) | 13 320 352 (31.7%) |
| 50–64 | 885 428 (32.1%) | 17 421 602 (41.4%) |
| Health insurance, n (%) | ||
| Commercial insurance | 100% | 29 339 545 (72.3%) |
| Public (Medicaid) | 0 (0%) | 7 197 270 (17.7%) |
| Uninsured | 0 (0%) | 4 015 926 (9.9%) |
| Race/ethnicity, n (%) | ||
| Hispanic | Unknown | 19 256 925 (28.4%) |
| Non-Hispanic Asian | Unknown | 7 693 855 (11.4%) |
| Non-Hispanic Black | Unknown | 6 367 232 (9.4%) |
| Non-Hispanic White | Unknown | 30 772 555 (45.5%) |
| Others | Unknown | 3 599 769 (5.3%) |
| Education, n (%) | ||
| Less than high school | Unknown | 6 241 773 (13.7%) |
| High school | Unknown | 10 728 771 (23.5%) |
| Some college | Unknown | 13 025 249 (28.5%) |
| College | Unknown | 9 595 841 (21.0%) |
| Graduate | Unknown | 6 071 561 (13.3%) |
| Household income, n (%) | ||
| <$20 k | Unknown | 3 030 399 (12.8%) |
| $20 – $40 k | Unknown | 3 663 457 (15.5%) |
| $40 – $75 k | Unknown | 5 494 523 (23.2%) |
| $75 – $125 k | Unknown | 5 283 574 (22.3%) |
| $125 – $200 k | Unknown | 3 626 380 (15.3%) |
| >$200 k | Unknown | 2 583 525 (10.9%) |
Abbreviations: CA, California; IA, Iowa; MD, Maryland; MI, Michigan; NJ, New Jersey; SID, State Inpatient Databases.
For both cohorts, inclusion criteria are people aged 18–64, in the year 2019, residing in one of five states: California, Iowa, Maryland, Michigan, and New Jersey. For the claims data, we record the number of days of coverage during 2019 for each member of the cohort; for the reference group, we assume members remain in the cohort for the entire year. MarketScan provides limited demographic data on its members (age, sex, and region of residence); the additional demographic details about the reference cohort are derived from the 2019 ACS five-year census. Some members of the claims-data cohort had coverage in two or more states in 2019, so the state percentages slightly exceed 100%.
aThe number of inpatient discharges was derived from the SID.
The overall estimated rate of all inpatient discharges was 7.003 [7.002, 7.004] per 100 person-years derived from the reference data, and 5.01 [4.98, 5.04] derived from claims data. This corresponds to a relative bias (claims/reference) of 0.715 (95% CI, 0.711, 0.719), indicating that, on average, the claims data underestimated the rate of inpatient discharges for all Americans by 28.5%.
We found considerable variation in estimates of this relative bias (with respect to the target population of all Americans) across the 250 most common procedures. The ten most under- and overestimated procedures are shown in Table 2, and a forest plot of the relative bias for all 250 procedures is shown in Figures 1 and S1 (see Table S1 [Excel file] for full results). We found that 56.4% of procedures were under- or overestimated by more than a factor of 1.5, and 24.8% by more than a factor of 2 (Table S2). Overestimated procedures included speech therapy treatment (overestimated by 58%), prostatectomy (47%), breast reconstruction (30%), hip replacement (29%), and knee replacement (17%); severely underestimated procedures included subcutaneous contraceptive implant insertion (underestimated by 89%), cardiac stress tests (81%), hemodialysis (80%), and upper extremity amputation (78%).
Table 2.
The top 10 most underestimated and the top 10 most overestimated procedures in our analysis.
| Procedure | Claims data rate/100 patient-years | Reference data rate/100 patient-years | Relative bias claims/reference |
|---|---|---|---|
| Top 10 most overestimated procedures | |||
| Speech therapy treatment | 0.0056 ± 0.0010 | 0.00353 ± 0.00002 | 1.582 [1.330, 1.881] |
| Prostatectomy | 0.0185 ± 0.0018 | 0.01259 ± 0.00003 | 1.466 [1.332, 1.612] |
| Breast reconstruction | 0.0197 ± 0.0018 | 0.01518 ± 0.00004 | 1.299 [1.185, 1.425] |
| Hip arthroplasty | 0.1106 ± 0.0043 | 0.08595 ± 0.00009 | 1.287 [1.238, 1.338] |
| Administration and transfusion of bone marrow, stem cells, pancreatic islet cells, and t-cells | 0.0069 ± 0.0011 | 0.00546 ± 0.00002 | 1.270 [1.087, 1.483] |
| Prostate and seminal vesicle procedures (excluding prostatectomy) | 0.0088 ± 0.0012 | 0.00696 ± 0.00003 | 1.261 [1.098, 1.447] |
| Lymph node excision (therapeutic) | 0.0148 ± 0.0016 | 0.01238 ± 0.00003 | 1.198 [1.077, 1.333] |
| Knee arthroplasty | 0.1026 ± 0.0041 | 0.08759 ± 0.00009 | 1.171 [1.125, 1.220] |
| Pancreatectomy | 0.0054 ± 0.0010 | 0.00470 ± 0.00002 | 1.151 [0.965, 1.373] |
| Lymph node dissection | 0.0165 ± 0.0017 | 0.01438 ± 0.00004 | 1.147 [1.037, 1.269] |
| Top 10 most underestimated procedures | |||
| Subcutaneous contraceptive implant | 0.0010 ± 0.0004 | 0.00903 ± 0.00003 | 0.106 [0.070, 0.161] |
| Cardiac stress tests | 0.0017 ± 0.0005 | 0.00939 ± 0.00003 | 0.186 [0.136, 0.253] |
| Hemodialysis | 0.0460 ± 0.0028 | 0.22782 ± 0.00014 | 0.202 [0.190, 0.215] |
| Finger and other upper extremity amputation | 0.0007 ± 0.0003 | 0.00318 ± 0.00002 | 0.219 [0.134, 0.358] |
| Administration of diagnostic substances, NEC | 0.0018 ± 0.0006 | 0.00692 ± 0.00003 | 0.265 [0.196, 0.358] |
| Cardiac chest compression | 0.0072 ± 0.0011 | 0.02704 ± 0.00005 | 0.266 [0.229, 0.310] |
| Peripheral arterial pressure monitoring | 0.0066 ± 0.0011 | 0.02380 ± 0.00005 | 0.277 [0.236, 0.325] |
| Gastrostomy | 0.0112 ± 0.0014 | 0.03858 ± 0.00006 | 0.289 [0.256, 0.327] |
| Radiation therapy, not elsewhere classified | 0.0010 ± 0.0004 | 0.00356 ± 0.00002 | 0.294 [0.197, 0.438] |
| Above knee and other proximal lower extremity amputation | 0.0012 ± 0.0004 | 0.00383 ± 0.00002 | 0.308 [0.211, 0.449] |
Abbreviation: SID, State Inpatient Databases.
The rate of each procedure was estimated using claims data (MarketScan) and ground-truth reference data (SID and Census). The relative bias is determined by taking the ratio between the two estimates (claims/reference). Relative biases that are smaller than 1 correspond to cases where claims data has underestimated the rate and vice-versa.
Figure 1.

Forest plot of the relative bias for all 250 procedures. The overall estimated rate of all inpatient discharges was 5.0/100 person-years in the claims data compared to 7.0/100 person-years in the reference data, corresponding to a relative bias of 0.72, or an underestimate of 28% (the target population is all Americans in this analysis). This forest plot shows the relative bias (and 95% confidence interval) for each of the 250 most common inpatient procedures, ordered by the magnitude of the bias. There is large variation in the extent of the bias across procedures: 25% of procedures are underestimated by 50% or more, and another 6% are overestimated by 10% or more.
We found a clear relationship between the relative bias for a given procedure, and the procedure’s association with social determinants of health, measured via the Neighborhood Deprivation Index (Figure 2; R2 = 69.4%, P < .0001, slope = −1.97). Procedures that are disproportionately performed in neighborhoods with higher levels of deprivation (NDI) were significantly more likely to be underestimated, and vice versa. Some examples of procedures that were disproportionately performed in neighborhoods with high deprivation are: hemodialysis (underestimated by 80%), upper extremity amputation (underestimated by 78%), and arterial oxygen saturation monitoring (underestimated by 71%). Examples of procedures that were disproportionately performed in neighborhoods with low deprivation are: breast reconstruction (overestimated by 30%), prostatectomy (overestimated by 47%), and hip replacement (overestimated by 29%). As a sensitivity analysis, we performed a similar analysis with a socioeconomic status metric we defined using 2019 census data, and we found an even stronger association (R2 = 79.3%, P < .0001, Figures S4 and S5), perhaps because the alternative metric is derived from data overlapping with the time period of our analysis, while the NDI metric is derived from 2017 data.
Figure 2.

The relationship between the relative bias for each procedure and the association between the procedure rate and the neighborhood deprivation index (NDI). For each procedure, we computed the latter association by comparing zip-code-level procedure rates with NDI; higher values of this association indicate that the procedure is disproportionately performed in zip codes with high deprivation and vice versa. Examples of procedures that are disproportionately performed in high-deprivation areas are hemodialysis, above knee amputation, and arterial oxygen saturation monitoring. Examples of procedures that are disproportionately performed in low-deprivation areas are breast reconstruction, prostatectomy, and hip replacement. We find that procedures disproportionately performed in high-deprivation areas are far more likely to be underestimated and vice versa (P < .0001). We obtained similar results when we defined an alternative measure of neighborhood socioeconomic status using 2019 census data, and the results of that sensitivity analysis are shown in Figure S5.
When we restricted the target population to those with any health insurance (Figure 3A), we found the overall pattern of bias was similar, likely because patients without health insurance account for less than 10% of total inpatient discharges in the reference data. However, when we further restricted the target population to those with commercial health insurance (Figure 3B), the pattern of bias changed. The overall rate of all inpatient discharges among the commercially insured in our reference cohort was 4.123 [4.122, 4.123] per 100 person-years; compared to this, the claims data estimate (5.01 [4.98, 5.04]) is a 21.4% overestimate. The relative bias for most procedures improved: the interquartile range of the relative biases was [0.50, 0.81] when comparing to the overall population, but [0.92, 1.14] when comparing to the commercially insured population, which is narrower and, notably, overlaps 1. Nonetheless, some variation in the bias by procedure remained: 5.6% of procedures were under- or overestimated by more than a factor of 1.5, and 1.2% by more than a factor of 2. When we re-evaluated the relationship between the relative bias and the strength of the association with SDOH for this new target population, we found a markedly attenuated association (R2 = 17.4%, P < .001, slope = −0.41; Figure S6).
Figure 3.
The impact of changing the target population. Figure 3A displays a forest plot of the relative bias for all 250 procedures, restricting the reference data to the insured population; Figure 3B displays a forest plot of the relative bias for all 250 procedures, restricting the reference data to the commercially insured population. (For all target populations, we focused on the subset aged 18–64 in the year 2019 and living in California, Iowa, Maryland, Michigan, and New Jersey.) Procedures are ordered on the y-axis in the same way as in Figure 1.
Discussion
Healthcare claims data are increasingly being used to evaluate disease burden and quantify the effects of health policies and biomedical treatments on health outcomes. These datasets are a nonrandom sample of the US population. Prior work has shown that, compared to the US population, inclusion in large commercial healthcare claims databases is biased geographically and along socioeconomic and demographic lines.8 When the outcome of interest is associated with these socioeconomic and demographic features, claims-derived inferences are susceptible to external validity bias. While methods have been developed to translate clinical trial results to different target populations,14 these methods cannot be directly applied to claims data results, since claims datasets typically lack the individual-level socioeconomic/demographic indicators needed to transport effects to different populations of interest.
Isolated studies have evaluated the bias for selected outcomes,10,11,15,16 but they have been limited by the difficulty of obtaining ground-truth reference data. In the setting of infectious diseases, with regional surveillance data serving as the ground truth, prior work has demonstrated that claims-derived incidence rates of measles, mumps, and varicella are dramatically overestimated10; and, though claims-derived incidence rates of influenza were inaccurate, claims-derived disease patterns were representative.11,15 The RCT-DUPLICATE study has made direct comparisons between treatment effect estimates derived from 32 randomized controlled trials (RCTs) and those estimated from claims data, demonstrating that claims-derived inferences were generally comparable in a highly selected and nonrepresentative set of RCTs.17 Taken together, these prior studies highlight challenges in quantifying external validity bias in claims-derived results and demonstrate that the extent of bias depends on the disease context and outcome of interest.
Here we report on the empirical bias in healthcare claims-derived estimates of the 250 most common inpatient procedures for US patients aged 18–64, using a unique ground-truth dataset of inpatient hospitalizations. We found that: (1) with respect to all Americans, commercial healthcare claims data underestimate the true rate of overall inpatient discharges by ~28%, reflecting lower inpatient healthcare utilization among commercially insured patients; (2) the extent of the bias varies considerably across inpatient procedures, with 24.8% of procedures being under- or overestimated by more than a factor of 2; (3) procedures that disproportionately occur in patients from low-SES neighborhoods are the most severely underestimated and vice versa; and (4) if healthcare claims data are compared to a restricted target population of commercially insured Americans, the magnitude of external validity bias is considerably attenuated, but some variation in the bias across procedures remains, and it is at least partially explained by SDOH.
A strength of our study is the focus on inpatient procedures, with outcome definitions as similar as possible between the claims and ground-truth data. All studies that make use of claims data are susceptible to misclassification bias,18 where errors diagnosing or coding for a disease occur. By focusing on two datasets with identical outcome definitions, we attribute the bias we have measured to nonrandom sampling.
Several limitations must be mentioned. First, we relied on a convenience sample of SID data, limited to five states. Although this sample of states was nonrandom, it reflects over 20% of the US population in 2019 and covers a reasonable geographic distribution. Nonetheless, it is unknown how the findings presented here would generalize to other states. Second, while the SID data are extensive, up to 5% of inpatient visits are not captured; nonetheless, we believe this is as close to a ground-truth reference dataset as we can achieve for this purpose. Third, because the SID data only provide region information at the zip code level, our estimate of the strength of the association between a procedure and NDI was defined at the zip code level. More granular regional information could improve this metric. Finally, we restricted this study to the bias associated with inferences of rates of inpatient procedures; many other types of inferences remain to be studied, including comparative effect sizes.
Despite these limitations, our analysis has several implications for studies analyzing healthcare claims data. Whenever the outcome (or treatment effect) of interest is either associated with or modified by SDOH, claims-derived results can be biased. In particular: (1) studies that seek to estimate disease prevalence/incidence rates and medication prescription rates are very likely to be biased with respect to the US population, since SDOH are known to be associated with disease burden and access to health care; (2) studies that seek to estimate treatment effects by using claims data to emulate an RCT17 will be biased whenever treatment (or access to treatment, or adherence to treatment) is related to SDOH; and (3) studies that seek to evaluate the impact of policy-level changes will be biased if the policy of interest has heterogeneous effects across either insurance status or SDOH. To improve the transparency and reliability of studies using healthcare claims data, investigators should provide a first-principles argument for how SDOH might or might not modify the outcome of interest. Additionally, whenever possible, studies should seek to replicate their findings in more than one claims dataset with different patterns of sampling (eg, Medicaid claims).
Because of their large sample size, healthcare claims data offer enormous potential in research; characterizing and overcoming this selection bias is an essential first step to unlocking their potential.
Acknowledgments
Data for this project were accessed using the Stanford Center for Population Health Sciences Data Core. The PHS Data Core is supported by a National Institutes of Health National Center for Advancing Translational Science Clinical and Translational Science Award (UL1TR003142) and from Internal Stanford funding. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Contributor Information
Alex Dahlen, Department of Biostatistics, School of Global Public Health, New York University, New York, NY, United States; Quantitative Sciences Unit, Department of Medicine, Stanford University School of Medicine, Stanford, CA, United States.
Yaowei Deng, Quantitative Sciences Unit, Department of Medicine, Stanford University School of Medicine, Stanford, CA, United States; Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, United States.
Vivek Charu, Quantitative Sciences Unit, Department of Medicine, Stanford University School of Medicine, Stanford, CA, United States; Department of Pathology, Stanford University School of Medicine, Stanford, CA, United States.
Supplementary material
Supplementary material is available at American Journal of Epidemiology online.
Funding
This work was supported by KL2TR003143 and P30DK116074 (to V.C.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Conflict of interest
The authors have no conflicts of interest to disclose.
Disclaimer
Each author contributed important intellectual content during manuscript drafting or revision, agrees to be personally accountable for their own contributions, and agrees to ensure that questions pertaining to the accuracy or integrity of any portion of the work, even portions in which the author was not directly involved, are appropriately investigated and resolved, with documentation in the literature if appropriate.
Data availability
Neither the Merative MarketScan Commercial Database nor the State Inpatient Databases are publicly available. The analysis code is available at https://github.com/alex-dahlen/ClaimsDataBenchmarking.
References
- 1. Birnbaum HG, Cremieux PY, Greenberg PE, et al. Using healthcare claims data for outcomes research and pharmacoeconomic analyses. Pharmacoeconomics. 1999;16(1):1-8. doi:10.2165/00019053-199916010-00001
- 2. Baggs J, Fridkin SK, Pollack LA, et al. Estimating national trends in inpatient antibiotic use among US hospitals from 2006 to 2012. JAMA Intern Med. 2016;176(11):1639-1648. doi:10.1001/jamainternmed.2016.5651
- 3. Patel M, Chen J, Kim S, et al. Analysis of MarketScan data for immunosuppressive conditions and hospitalizations for acute respiratory illness, United States. Emerg Infect Dis. 2020;26(8):1720-1730. doi:10.3201/eid2608.191493
- 4. Song Z, Zubizarreta JR, Giuriato M, et al. Changes in health care spending, use, and clinical outcomes after nonfatal firearm injuries among survivors and family members. Ann Intern Med. 2022;175(6):795-803. doi:10.7326/M21-2812
- 5. McGinty EE, Bicket MC, Seewald NJ, et al. Effects of state opioid prescribing laws on use of opioid and other pain treatments among commercially insured U.S. adults. Ann Intern Med. 2022;175(5):617-627. doi:10.7326/M21-4363
- 6. Fralick M, Kesselheim AS, Avorn J, et al. Use of health care databases to support supplemental indications of approved medications. JAMA Intern Med. 2018;178(1):55-63. doi:10.1001/jamainternmed.2017.3919
- 7. Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. J Clin Epidemiol. 2005;58(4):323-337. doi:10.1016/j.jclinepi.2004.10.012
- 8. Dahlen A, Charu V. Analysis of sampling bias in large health care claims databases. JAMA Netw Open. 2023;6(1):e2249804. doi:10.1001/jamanetworkopen.2022.49804
- 9. Degtiar I, Rose S. A review of generalizability and transportability. Annu Rev Stat Appl. 2023;10(1):501-524. doi:10.1146/annurev-statistics-042522-103837
- 10. Majumder MS, Cusick M, Rose S. Measuring concordance of data sources used for infectious disease research in the USA: a retrospective data analysis. BMJ Open. 2023;13(2):e065751. doi:10.1136/bmjopen-2022-065751
- 11. Viboud C, Charu V, Olson D, et al. Demonstrating the use of high-volume electronic medical claims data to monitor local and regional influenza activity in the US. PLoS One. 2014;9(7):e102429. doi:10.1371/journal.pone.0102429
- 12. Clinical Classifications Software Refined (CCSR) for ICD-10-CM diagnoses. Accessed May 22, 2024. https://hcup-us.ahrq.gov/toolssoftware/ccsr/dxccsr.jsp
- 13. Messer LC, Laraia BA, Kaufman JS, et al. The development of a standardized neighborhood deprivation index. J Urban Health. 2006;83(6):1041-1062. doi:10.1007/s11524-006-9094-x
- 14. Ling AY, Montez-Rath ME, Carita P, et al. An overview of current methods for real-world applications to generalize or transport clinical trial findings to target populations of interest. Epidemiology. 2023;34(5):627-636. doi:10.1097/EDE.0000000000001633
- 15. Majumder MS, Rose S. Health care claims data may be useful for COVID-19 research despite significant limitations. Health Affairs Forefront. Accessed May 22, 2024. https://www.healthaffairs.org/do/10.1377/forefront.20201001.977332/full/
- 16. Majumder M, Cusick MM, Rose S. Data source concordance for infectious disease epidemiology. medRxiv. 2022. doi:10.1101/2022.06.02.22275926
- 17. Wang SV, Schneeweiss S, RCT-DUPLICATE Initiative. Emulation of randomized clinical trials with nonrandomized database analyses: results of 32 clinical trials. JAMA. 2023;329(16):1376-1385. doi:10.1001/jama.2023.4221
- 18. Wilchesky M, Tamblyn RM, Huang A. Validation of diagnostic codes within medical services claims. J Clin Epidemiol. 2004;57(2):131-141. doi:10.1016/S0895-4356(03)00246-4