2025 Sep 23;34(10):e70213. doi: 10.1002/pds.70213

Quantification of Information Gained by Linking Claims Data to an Electronic Health Record Cohort of Patients With Metastatic Breast Cancer

Jonah Geddes 1, Julie Katz 2, Alex Asiimwe 2, M Alan Brookhart 3, Charlotte Carroll 2, Lev Eldemir 4, Vera Mucaj 4, Kevin Nolan 5, Ioanna Ntalla 6, Carrie M Nielson 2
PMCID: PMC12457698  PMID: 40988051

ABSTRACT

Purpose

Linking claims data to electronic health record (EHR) data can improve completeness, often at a cost of decreased sample size. Quantifying information gained and differences in patient characteristics between EHR and EHR‐claims linked cohorts may inform study design.

Methods

Using ConcertAI Patient360 EHR linked to multiple closed insurance claims sources, we compared an EHR cohort of patients with incident metastatic breast cancer (mBC) to an EHR‐claims subcohort (requiring ≥ 90 days claims coverage). We analyzed diagnosis coverage, patient time during lookback and follow‐up, baseline characteristics, and rates of 14 adverse events (AEs). Analyses were age stratified due to insurance coverage changes at age 65.

Results

For the EHR cohort (N = 6289), 1438 (23%) were in the EHR‐claims subcohort. A greater proportion were aged ≥ 65 years in the EHR cohort (30%) than in the EHR‐claims subcohort (17%). EHR‐claims patients had longer observation periods and more unique diagnoses across both age groups. For most AEs, incidences were higher in both age groups in the EHR‐claims subcohort than in the EHR cohort.

Conclusions

EHR‐claims provided more diagnoses and observation time, at the cost of a reduction in sample size and underrepresentation of patients ≥ 65 years. Differing age proportions support age‐stratified or standardized analyses for EHR‐claims data. Results aid interpretation of differences between EHR and EHR‐claims results due to shifts in age, completeness of diagnosis history, and duration of observation.

Keywords: data linkage, electronic health records, insurance claims data, real‐world data


Summary.

  • EHR‐claims linkage offered enhanced ascertainment of clinical data but resulted in substantial sample size reduction (23% retention rate from original EHR cohort).

  • A shift in age distribution occurred in the linked cohort, with underrepresentation of patients ≥ 65 years (30% in EHR vs. 17% in EHR‐claims).

  • EHR‐claims linkage demonstrated better diagnosis coverage in both breadth and density, with longer observation periods compared with EHR‐only data.

  • Adverse event detection rates and 180‐day incidence were consistently higher in the EHR‐claims subcohort across age groups.

  • Methodological implications include the necessity for age‐stratified or standardized analyses when using linked EHR‐claims data to account for demographic differences.

Abbreviations

AE

adverse event

ALT

alanine transaminase

BMI

body mass index

CAI

ConcertAI

CM

clinical modification

CPT

current procedural terminology

ECOG

Eastern Cooperative Oncology Group

EHR

electronic health record

ICD

International Classification of Diseases

LDH

lactate dehydrogenase

mBC

metastatic breast cancer

NCI

National Cancer Institute

RWD

real‐world data

SMD

standardized mean difference

1. Introduction

Electronic health record (EHR) data focus on outpatient health care interactions and can miss broader interactions necessary to characterize potential confounders measured prior to patients' oncology care and events during follow‐up, such as inpatient visits and outpatient visits that occurred outside the EHR network [1]. Linking patients' claims data to their EHR data can help fill these gaps [2]. ConcertAI (CAI) provides oncology EHR data and has conducted a linkage with closed insurance claims (including Medicare Advantage) for patients in the United States with metastatic breast cancer (mBC).

The downside to linkage is a decrease in sample size, because only a subset of patients in EHR can be linked to claims data [3]. Whether the subset with linkage differs meaningfully from the full EHR population is often unknown. Understanding the nature and extent of improvements in completeness, as well as any differences in patient characteristics in the EHR cohort vs EHR‐claims subcohort, can inform future observational study designs. The approach in this study, which evaluates overall database‐level quality as well as high‐priority cohort‐specific measures, can be readily applied to assess the comprehensiveness of other real‐world data (RWD).

While studies based on patients who have linked data available often allude to improved data capture and reduced generalizability, these facets are rarely described quantitatively. In this study, we evaluated database‐level quality by measuring the breadth and density of diagnoses, which have previously been described [4]. Breadth is quantified as the number of unique codes (e.g., unique diagnosis codes in the database), whereas density is quantified as the number of unique codes per unit of patient time (e.g., unique diagnosis codes per patient month). For cohort‐specific analyses, we assessed a look‐back period prior to the index diagnosis and the study follow‐up. The amount of time in these respective periods and the variables observed help to directly address concerns of selection bias with the misclassification of potential confounders (due to the often‐narrower ascertainment windows in EHR than in claims data) and measurement error from outcome misclassification (because some care settings and provider types are not observed in oncology EHR data).

Claims data can further support evaluations of EHR data completeness, for both general measurements of completeness and for an assessment of the feasibility of use for specific purposes such as measurement of adverse events (AEs). For example, the appearance of a claim for a lab procedure (e.g., complete blood count) should be followed by the appearance of a lab result in EHR data. When a high proportion of lab claims are followed by lab results, confidence in the completeness of the EHR lab data for the cohort is bolstered. Knowing the extent to which adding claims linkage to EHR data results in a higher rate of certain AEs can inform decisions on the appropriate uses of EHR and linked claims data for AE analyses and interpretation of those results.

The approach proposed in this study can be used to inform decisions on using EHR or EHR‐claims linked data for analyses that include a look‐back period for ascertainment of prior health events, drug exposures, and potential confounders and for studies of AEs that may not be well captured in EHR‐only data.

2. Materials and Methods

2.1. Data Source

This retrospective cohort study was conducted using the CAI Patient360 Breast Cancer Electronic Medical Record dataset linked with selected data elements from closed insurance claims datasets (including Medicare Advantage). CAI sources clinical data from various organizations, aggregating health records from more than 400 oncology practices in community and academic centers across the US. ConcertAI accesses this data through agreements with individual practices, and through partnerships with data aggregation and service networks, including CancerLinQ. CAI processes all available oncology data (patient documents, physician notes, structured data) into a common data model (CDM). ConcertAI's data partners each aggregate their source data into a central database that represents a CDM‐driven, software‐agnostic, distributed data network. The networks are CDM‐driven in that the aggregation of data within each network uses a common data model for mapping of practice data from each source practice. They are distributed in that data aggregated into each network are drawn from multiple sites. And they are software‐agnostic in that the source practices that contribute data into each network do not all use the same software or the same data model. However, the data drawn from each site are mapped to the data partner's central database so that the data can be centrally queried. Data ingested from each data partner's central database is then mapped to ConcertAI's internal CDM.

Linkage of mortality and claims data to EHR records in the development of ConcertAI data is performed through deterministic and probabilistic methods. These methods use multiple identifiers to produce third‐party tokens (Datavant) that preserve the privacy and de‐identified status of the underlying source data and allow ConcertAI to use the underlying source data in a manner that ensures the most accurate linkage possible. Of all 56,508 breast cancer patients, 16,464 (29%) had closed claims data available for linkage.

ConcertAI has developed an all‐source composite mortality endpoint (ASCME), which includes the limited access master death file from the Social Security Administration, digital obituary and burial records, structured and unstructured data from the EHR, and open or closed administrative claims. The completeness and accuracy of death information from ASCME in the ConcertAI data against the National Death Index have been evaluated in 32,358 solid tumor patients, including 9377 breast cancer cases [5]. Sensitivity was 95%, specificity was 97%, and both PPV and NPV were 96%.

Analyses were conducted in SAS Studio 3.81, Enterprise edition (SAS Institute Inc., Cary, NC, USA).

2.2. Patient Selection

Adult (age ≥ 18 years) patients in the United States who were diagnosed with mBC during the study period (01 October 2016 to 30 April 2023) and had at least one visit, claim, or evidence of death in the CAI Patient360 EHR claims database were included. Patients with a diagnosis of any primary cancer (except for non‐melanoma skin cancer) other than breast within 5 years prior to their first mBC diagnosis (index date) were excluded. An EHR‐claims subcohort of patients was required to have ≥ 90 days of claims coverage during the study period.

2.3. Observation Period

The lookback period ran from the date of the earliest encounter or claim after 01 October 2015 (to avoid including ICD‐9‐CM codes) to the day before the index date. The study period ended 30 April 2023. Patients were followed until the earliest of death, last activity (EHR cohort), disenrollment from medical coverage (EHR‐claims subcohort), or the end of the study period. The last activity in the EHR cohort was defined as the last available measurement in the data, inclusive of a patient encounter, diagnosis, treatment start or end, adverse event, exam, or test result. The same definition was applied to patients in the EHR‐claims subcohort who had neither a death date nor a disenrollment date.

2.4. Measures and Outcomes

2.4.1. Patient Characteristics

Patient characteristics were described in the EHR cohort and EHR‐claims subcohort. Characteristics evaluated at index included age, sex, race, provider practice region, and index year of mBC diagnosis. Payer type at index was categorized as commercial only, Medicare Advantage only, and “other”, which included Medicaid (including both fee‐for‐service and Managed Care), dual coverage (Medicaid/Medicare), and payers designated as “other”. Obesity (BMI ≥ 30 kg/m²) was categorized using EHR data as present, absent, or unknown; if a patient with unknown obesity status had an ICD‐10‐CM code for obesity, they were recategorized as obese (Table S1) [6, 7]. Smoking status was categorized as current, former, never, or unknown using EHR data; if a patient with unknown smoking status had an ICD‐10‐CM code for smoking (current or former), they were categorized accordingly [8, 9]. Eastern Cooperative Oncology Group (ECOG) performance status at the visit nearest index was derived solely from EHR. Obesity, ECOG score, and smoking status were ascertained in the period from 80 days before index to 7 days after index. In case of multiple ECOG scores or smoking status values, the value closest to index was reported. For obesity and smoking status, missing values and absence of diagnosis codes were categorized as unknown. Finally, the National Cancer Institute (NCI) comorbidity index [10] and specific comorbidities (including myocardial infarction (MI), congestive heart failure (CHF), peripheral vascular disease (PVD)) were evaluated during all available lookback. The NCI comorbidity index does not include cancer as a comorbidity, making it possible for patients with mBC in this study to have an index score of 0 (unlike in the Charlson comorbidity index score). Codes used to operationalize each variable are listed in Table S1.

2.4.2. Diagnosis Coverage

Diagnosis coverage was evaluated by the breadth and density of diagnoses for patients in the EHR cohort and EHR‐claims subcohort. Breadth of encounter coverage was calculated as all unique codes in the eligible cohort, scaled by cohort size and reported by summary statistics (mean, median, minimum, maximum, and interquartile range). Diagnosis breadth consisted of unique International Classification of Diseases, Tenth Revision, Clinical Modification (ICD‐10‐CM) codes divided by the number of patients in the cohort. Diagnosis density was calculated as unique diagnoses per patient‐month. Two approaches were used to calculate unique diagnoses. In the first approach, all digits of the diagnosis codes were considered to determine uniqueness (e.g., R41.82, Altered Mental Status, unspecified, is distinct from R41.0, Disorientation, unspecified). In the second approach, only the digits before the decimal were considered to determine uniqueness (e.g., R41.82 and R41.0 are considered duplicates of R41.X, Other symptoms and signs involving cognitive functions and awareness, and only counted as one unique diagnosis). These two approaches are described in tables as either all unique diagnostic codes or unique highest‐level codes.
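The two counting approaches can be illustrated with a minimal Python sketch of per‐patient breadth and density. This is not the study's SAS code: the function names, the example codes, and the 10 patient‐months are hypothetical, and the sketch assumes codes are stored with a decimal point.

```python
from typing import Iterable


def unique_codes(codes: Iterable[str], highest_level: bool = False) -> set[str]:
    """Return the unique ICD-10-CM codes recorded for one patient.

    With highest_level=True only the characters before the decimal point are
    kept, so R41.82 and R41.0 both collapse to R41.
    """
    cleaned = [c.strip().upper() for c in codes if c.strip()]
    if highest_level:
        return {c.split(".")[0] for c in cleaned}
    return set(cleaned)


def breadth_and_density(codes: Iterable[str], patient_months: float,
                        highest_level: bool = False) -> tuple[int, float]:
    """Breadth = number of unique codes; density = breadth per patient-month."""
    if patient_months <= 0:
        raise ValueError("patient_months must be positive")
    breadth = len(unique_codes(codes, highest_level))
    return breadth, breadth / patient_months


# Hypothetical patient with 10 months of observed time.
patient_codes = ["R41.82", "R41.0", "C50.911", "R41.82"]
print(breadth_and_density(patient_codes, 10))                      # all unique codes
print(breadth_and_density(patient_codes, 10, highest_level=True))  # highest-level codes
```

In this toy example the repeated R41.82 is counted once, and the two R41 codes merge only under the highest‐level approach. Cohort‐level summaries (mean, median, interquartile range) would then be taken over the per‐patient values.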

2.4.3. Patient Time

The duration of patient time in the EHR and EHR‐claims databases was calculated for each cohort during the lookback period (from earliest encounter or claim to the day before index date) and follow‐up period (from index to censoring).
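A sketch of how lookback and follow‐up time could be derived from patient‐level dates is given below. The function and variable names are hypothetical, and two details are assumptions rather than statements from the study: records dated on 01 October 2015 are treated as eligible, and follow‐up is counted as the difference in days between index and the censoring date.

```python
from datetime import date
from typing import Optional

LOOKBACK_FLOOR = date(2015, 10, 1)  # earlier records are ignored (ICD-9-CM era)
STUDY_END = date(2023, 4, 30)       # end of the study period


def lookback_days(first_record: date, index: date) -> int:
    """Days from the earliest eligible record through the day before index."""
    start = max(first_record, LOOKBACK_FLOOR)
    return max((index - start).days, 0)


def followup_days(index: date, censor_candidates: list[Optional[date]]) -> int:
    """Days from index to the earliest known censoring date, capped at study end.

    censor_candidates holds, for example, the death date, the last activity
    date, and the disenrollment date; unknown dates are passed as None.
    """
    known = [d for d in censor_candidates if d is not None]
    return (min(known + [STUDY_END]) - index).days


# Hypothetical patient: first record before the floor, no death date recorded.
lb = lookback_days(date(2015, 1, 1), date(2015, 10, 11))
fu = followup_days(date(2020, 1, 1), [None, date(2020, 3, 1)])
print(lb, fu)
```

For the EHR cohort the candidate list would contain death and last activity; for the EHR‐claims subcohort it would also contain disenrollment.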

2.4.4. Adverse Events and Death

AEs of interest included anemia, neutropenia, febrile neutropenia, leukopenia, interstitial lung disease, left ventricular dysfunction, hand‐foot syndrome, diarrhea, alopecia, fatigue, neuropathy, nausea/vomiting, hepatotoxicity, and death. These AEs were selected based on their relevance to mBC treatment [11]. The rate of each AE was calculated as events per 100 person‐months. The number of patients with each AE within 180 days after index was also reported. The cumulative incidence and 95% confidence interval of each AE at 180 days were calculated, accounting for death as a competing risk.
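The rate calculation can be sketched as follows. The conversion of days to months (365.25 / 12 days per month) and the function name are assumptions; the study does not state how person‐months were derived or whether repeat events were counted.

```python
DAYS_PER_MONTH = 365.25 / 12  # assumption: average calendar month length


def rate_per_100_person_months(n_events: int, followup_days: list[float]) -> float:
    """Events per 100 person-months of observed follow-up."""
    person_months = sum(followup_days) / DAYS_PER_MONTH
    if person_months <= 0:
        raise ValueError("total follow-up must be positive")
    return 100.0 * n_events / person_months


# Hypothetical cohort: 10 patients with 304.375 days (10 months) each, 5 events.
example_rate = rate_per_100_person_months(5, [304.375] * 10)
print(example_rate)
```

Because the denominator is observed time, a cohort with longer follow‐up does not automatically show a higher rate, which is why rates are reported alongside 180‐day incidences.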

2.4.5. Lab Procedures and Subsequent Results

In the EHR‐claims subcohort, the proportion of claims followed by a quantitative result (through EHR data) was calculated for the lab procedures of lactate dehydrogenase (LDH), hemoglobin (HGB) count, absolute neutrophil count (ANC), and alanine transaminase (ALT). Current Procedural Terminology (CPT) codes were used to identify lab procedures in claims data.
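The matching of claims to results can be sketched for a single patient and a single lab type as below. The 10‐day window follows the definition used in the Results; the function name, the inclusion of same‐day results, and the choice to let one result satisfy more than one claim are assumptions made for illustration.

```python
from bisect import bisect_left
from datetime import date, timedelta


def proportion_with_result(claim_dates: list[date], result_dates: list[date],
                           window_days: int = 10) -> float:
    """Share of lab claims followed by an EHR result within window_days."""
    if not claim_dates:
        return float("nan")
    results = sorted(result_dates)
    followed = 0
    for claim in claim_dates:
        i = bisect_left(results, claim)  # first result on or after the claim date
        if i < len(results) and results[i] <= claim + timedelta(days=window_days):
            followed += 1
    return followed / len(claim_dates)


# Hypothetical patient: three claims, two results (one of them too late to count).
claims = [date(2020, 1, 1), date(2020, 2, 1), date(2020, 3, 1)]
results = [date(2020, 1, 3), date(2020, 2, 20)]
completion = proportion_with_result(claims, results)
print(completion)
```

In the example only the January claim is followed by a result within 10 days, so one of three claims counts as completed.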

2.5. Statistical Analysis

Demographic and clinical characteristics were reported as frequencies and percentages in the EHR cohort and EHR‐claims subcohort. Standardized mean differences for each characteristic were calculated within age groups (< 65 and ≥ 65 years). Breadth and density of encounter coverage, patient time, rate of AEs, and number of lab procedures with subsequent results were described in both cohorts. A sensitivity analysis for follow‐up time was done with death as censoring. Although the EHR‐claims data included Medicare Advantage claims, not all Medicare patients will be represented; therefore, all analyses other than baseline characteristics were stratified by age at index (< 65 and ≥ 65 years old) to account for shifts from commercial to Medicare coverage at age 65. The 180‐day unadjusted cumulative incidences for all specific AEs were derived with a non‐parametric cumulative incidence function, accounting for death as a competing risk [12]. The 180‐day cumulative incidence for individual AEs was calculated by age group within the EHR cohort and the EHR‐claims subcohort.
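A non‐parametric cumulative incidence function of this kind can be sketched in a few lines. The code below is an Aalen‐Johansen type estimator written for illustration only; it is not the study's SAS implementation, the status coding (0, 1, 2) is an assumption, and confidence intervals are omitted.

```python
def cumulative_incidence(times: list[float], statuses: list[int],
                         horizon: float = 180) -> float:
    """Estimate P(AE by horizon) with death as a competing risk.

    statuses: 0 = censored, 1 = adverse event, 2 = death without a prior AE.
    """
    data = sorted(zip(times, statuses))
    n_at_risk = len(data)
    event_free = 1.0  # probability of no AE and no death just before time t
    cif = 0.0
    i = 0
    while i < len(data) and data[i][0] <= horizon:
        t = data[i][0]
        n_ae = n_death = n_leaving = 0
        while i < len(data) and data[i][0] == t:  # group tied times
            status = data[i][1]
            n_ae += int(status == 1)
            n_death += int(status == 2)
            n_leaving += 1
            i += 1
        cif += event_free * n_ae / n_at_risk
        event_free *= 1.0 - (n_ae + n_death) / n_at_risk
        n_at_risk -= n_leaving
    return cif


# Toy data: AE at day 10, death at day 20, censoring at day 30, AE at day 40.
toy_cif = cumulative_incidence([10, 20, 30, 40, 200], [1, 2, 0, 1, 0])
print(round(toy_cif, 3))  # about 0.5 for this toy data
```

Treating the death at day 20 as ordinary censoring and taking one minus the Kaplan‐Meier estimate would give 0.6 for the same data, which shows why the competing‐risk estimator is preferred when death is common.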

3. Results

For the EHR cohort, 6289 patients were eligible. Of those, 1438 (23%) were also in the EHR‐claims subcohort (Table 1, Table S2). A greater proportion were aged ≥ 65 years old in the EHR cohort (n = 1916, 30%) than in the EHR‐claims subcohort (n = 250, 17%). Of patients < 65 years old in the EHR‐claims subcohort, 46% had commercial coverage only, 11% had Medicare Advantage only, and 43% had some other form of insurance. Most patients ≥ 65 years old in the EHR‐claims subcohort had Medicare Advantage only (64%, Table 1). Within each of the age strata, there were no notable differences between the EHR cohort and the EHR‐claims subcohort either for the demographic characteristics and ECOG status (obtained solely from EHR) or for obesity and smoking status (measured from both EHR and claims). The proportion of patients with an NCI comorbidity index score of zero was lower in the EHR‐claims subcohort (66% among those < 65 years old and 44% among those ≥ 65 years) than in the EHR cohort (80% and 66%, respectively). Similarly, the proportions with comorbidities of myocardial infarction, congestive heart failure, and peripheral vascular disease at baseline were lower in the EHR cohort than in the EHR‐claims subcohort (Table 1).

TABLE 1.

Baseline characteristics of patients with mBC in the EHR cohort and EHR‐claims subcohort.

Patients < 65 years old Patients ≥ 65 years old
EHR Cohort EHR‐claims subcohort SMD a EHR Cohort EHR‐claims subcohort SMD a
Patient characteristic b N = 4373 N = 1188 N = 1916 N = 250
Payer type
Commercial only 545 46% 23 9%
Medicare Advantage only 131 11% 160 64%
Other c 512 43% 67 27%
Age (mean ± SD) 51 ±9 50 ±9 0.051 73 ±6 73 ±6 0.005
Age groups 0.061
18–35 319 7% 99 8%
36–45 972 22% 263 22%
46–55 1522 35% 431 36%
56–64 1560 36% 395 33%
65 and older 1916 100% 250 100% < 0.001
Sex 0.036 0.071
Female 4331 99% 1172 99% 1886 98% 248 99%
Race 0.055 0.197
Am. Indian or Alaska Nat. 62 1% 18 1% 22 1% 3 1%
Asian 143 3% 46 4% 27 1% 3 1%
Black or African American 607 14% 164 14% 233 12% 37 15%
Nat. Hawaiian or Pac. Isl. 4 0% 1 1% 1 0%
Other or Unknown 322 7% 84 7% 119 6% 26 11%
White 3235 74% 876 74% 1514 79% 180 72%
Practice region 0.249 0.227
Northeast 529 12% 107 9% 444 24% 40 16%
Midwest 1044 24% 380 32% 282 15% 64 26%
South 1907 44% 477 40% 826 44% 95 38%
West 811 19% 221 19% 330 18% 51 20%
Multiple 10 0% 1 0%
Missing 72 2% 2 0% 34 2%
Index year 0.103 0.145
2016 241 6% 62 5% 109 6% 13 5%
2017 1114 26% 326 27% 431 23% 59 24%
2018 1005 23% 285 25% 429 22% 60 24%
2019 891 20% 247 21% 414 22% 56 23%
2020 617 14% 163 14% 303 16% 41 17%
2021 383 9% 79 6% 166 9% 13 5%
2022 119 3% 26 2% 63 3% 8 3%
2023 3 0% 1 0%
Obesity 0.097 0.269
Yes 214 5% 71 6% 71 4% 16 6%
No 368 8% 128 11% 139 7% 36 14%
Unknown 3791 87% 989 83% 1703 89% 198 79%
Smoking 0.034 0.096
Current 579 13% 176 16% 150 8% 19 8%
Former 1083 25% 295 27% 530 28% 70 30%
Never 2416 55% 640 58% 1126 59% 138 55%
Unknown 295 7% 77 7% 110 6% 23 9%
ECOG performance status d 0.086 0.146
0 1513 35% 416 35% 431 22% 56 22%
1 892 20% 264 22% 474 25% 49 20%
≥ 2 367 8% 75 6% 334 17% 42 17%
Unknown 1601 37% 433 36% 677 35% 103 41%
NCI comorbidity index 0.375 0.505
0 3483 80% 769 66% 1258 66% 111 44%
1 708 16% 278 23% 463 24% 78 31%
2 150 3% 97 8% 132 7% 28 11%
≥ 3 32 1% 44 4% 63 3% 33 13%
Comorbidities
Myocardial infarction 44 1% 24 2% 0.083 69 4% 11 4% 0.041
Congestive heart failure 103 2% 43 4% 0.074 134 7% 34 14% 0.219
Peripheral vascular disease 42 1% 33 3% 0.134 24 1% 18 7% 0.299
a Standardized mean difference.

b Race, sex, practice region, and ECOG were ascertained from EHR data only.

c Includes Medicaid (fee for service, managed, unspecified), dual coverage (Medicaid/Medicare), and payers designated as other.

d ECOG, Eastern Cooperative Oncology Group.
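The SMD values for the binary comorbidities in Table 1 can be checked with the usual formula for two proportions, sketched below. The article does not state which SMD formula was used, so this formula is an assumption, and the function name is hypothetical. It does reproduce the tabulated values for congestive heart failure and peripheral vascular disease among patients ≥ 65 years.

```python
from math import sqrt


def smd_binary(x1: int, n1: int, x2: int, n2: int) -> float:
    """Absolute standardized mean difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled_sd = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    return abs(p1 - p2) / pooled_sd if pooled_sd > 0 else 0.0


# Counts from Table 1, patients >= 65 years: EHR cohort vs. EHR-claims subcohort.
smd_chf = smd_binary(134, 1916, 34, 250)  # congestive heart failure
smd_pvd = smd_binary(24, 1916, 18, 250)   # peripheral vascular disease
print(round(smd_chf, 3), round(smd_pvd, 3))  # 0.219 0.299
```

Unlike a p‐value, the SMD does not shrink or grow with sample size, which makes it suitable for comparing a large cohort with a much smaller subcohort.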

Comparison of length of follow‐up and look‐back time from cohort entry was possible as the EHR cohort and EHR‐claims subcohort had a similar distribution of index years for cohort entry (Table S3). The median observation periods were longer in EHR‐claims than in the EHR cohort for both age strata (Figures 1a,b and 2a,b). For example, among those < 65 years old, the median look‐back period was 691 days in the EHR cohort but 868 days in the EHR‐claims cohort, and the median follow‐up period was 830 days in EHR but 1005 days in EHR‐claims (Figures 1a,b and 2a,b, Table S3). The proportions of patients with specific tiers of thresholds of look‐back and follow‐up time were consistently higher in the EHR‐claims subcohort than in the EHR cohort. For example, among patients < 65 years old and requiring at least 365 days in the look‐back period, 70% in the EHR cohort had ≥ 30 days of follow‐up, while 84% in the EHR‐claims subcohort had ≥ 30 days of follow‐up (Table S3). Differences between the EHR cohort and the EHR‐claims subcohort were larger for those ≥ 65 years old (Table S3). Similar differences for follow‐up time were observed for a sensitivity analysis for patient time from cohort entry, with death as censoring (Table S3).

FIGURE 1.

Lookback days from cohort entry, in the EHR cohort and EHR‐claims subcohort, for mBC patients < 65 years old (a) and ≥ 65 years old (b).

FIGURE 2.

Follow‐up days from cohort entry, in the EHR cohort and EHR‐claims subcohort, for patients with mBC < 65 years old (a) and ≥ 65 years old (b).

For all unique diagnostic codes or only the highest‐level codes, the median breadth (number of unique diagnoses) and density (unique codes per person‐month) were higher in the EHR‐claims than in EHR for both age categories of patients (Table S4, Figures S1a,b and S2a,b). For patients < 65 years old, the median unique highest‐level codes were 6 per person and 0.3 per person‐month for the EHR cohort, contrasted with 28 per person and 0.9 per person‐month in the EHR‐claims subcohort. Similarly, for patients ≥ 65 years old, the median unique highest‐level codes were 6 per person and 0.4 per person‐month for the EHR cohort, contrasted with 26 per person and 1.0 per person‐month in the EHR‐claims subcohort (Figures 3a,b and 4a,b, Table S4).

FIGURE 3.

Breadth of unique high‐level diagnoses, per patient, in the EHR cohort and EHR‐claims subcohort, for patients with mBC < 65 years old (a) and ≥ 65 years old (b).

FIGURE 4.

Density of unique high‐level diagnoses per patient‐month, in the EHR cohort and EHR‐claims subcohort, for patients with mBC < 65 years old (a) and ≥ 65 years old (b).

Rates of most AEs (anemia, neutropenia, febrile neutropenia, leukopenia, interstitial lung disease, left ventricular dysfunction, hand‐foot syndrome, diarrhea, fatigue, neuropathy, and nausea/vomiting) were higher in both age groups in EHR‐claims than in EHR (Table S5, Figure S3a,b). Incidences were also higher in EHR‐claims for most AEs (Figure 5a,b, Table S5). The greatest differences were observed for neuropathy (for patients < 65 years old: 6.37 in EHR‐claims subcohort vs. 1.59 in EHR cohort) and anemia (for patients ≥ 65 years old: 5.01 in EHR‐claims subcohort vs. 1.54 in the EHR cohort). Notably, the incidence of death was lower in the EHR‐claims subcohort than in the EHR cohort for both age groups but with a greater difference for the patients ≥ 65 years (7.6% EHR‐claims subcohort vs. 16.7% EHR cohort, Figure 5) [13].

FIGURE 5.

Cumulative incidence of adverse events at 180 days in the EHR cohort and EHR‐claims subcohort for patients with mBC < 65 years old (a) and ≥ 65 years old (b).

For labs (Table 2), the proportions for completion (lab claims with a subsequent lab result in EHR within 10 days after the claim) varied by the type of lab. Among patients < 65 years old in the EHR‐claims subcohort, the proportions for completion ranged from 39% (neutrophil) to 77% (lactate dehydrogenase), with similar results observed among those ≥ 65 years old, from 38% (neutrophil) to 68% (lactate dehydrogenase).

TABLE 2.

Lab claims with subsequent results in EHR for the EHR‐claims subcohort a,b.

Patients < 65 years old Patients ≥ 65 years old
N = 1188 N = 250
LDH claims, n (claims per patient) 4098 (3.45) 213 (0.85)
LDH result, n (% of LDH claims) 3158 (77%) 145 (68%)
Hemoglobin claim, n (claims per patient) 30 351 (25.55) 4344 (17.38)
Hemoglobin result, n (% of CBC claims) 20 129 (66%) 2942 (68%)
Neutrophil claim, n (claims per patient) 29 443 (24.78) 4181 (16.72)
Neutrophil result, n (% of CBC claims) 11 536 (39%) 1596 (38%)
ALT claim, n (claims per patient) 239 (0.20) 29 (0.12)
ALT result, n (% of ALT claims) 144 (60%) 12 (41%)
a ALT, alanine transaminase; CBC, complete blood count; LDH, lactate dehydrogenase.

b Each patient may have more than one occurrence of a lab claim and subsequent result. Lab results are from EHR data within 10 days after procedure claim.

4. Discussion

In this study of a US cohort of patients with mBC in EHR data and an EHR‐claims linked subcohort, we found advantages to using EHR‐claims linked data, including extended look‐back time, increased follow‐up time, and greater capture of all diagnoses and specific AEs. Increased diagnosis capture was evident even after controlling for the longer observation time for the measures of breadth and density of all high‐level diagnosis codes. We also quantified the reduction in sample size and the age bias that US commercial claims linkage introduces by failing to capture patients 65 and older who are covered by Medicare but not by commercial insurers. After stratifying by age, we observed minimal differences in demographics between the EHR cohort and EHR‐claims subcohort. We observed a somewhat higher comorbidity burden and a somewhat lower rate of death in EHR‐claims than in EHR, which appears to support better capture of comorbidity diagnoses in EHR‐claims data [13].

4.1. Age Bias

These results demonstrate that using EHR linked to commercial claims for studies of patient populations with a substantial proportion over 65 years old is likely to bias the age distribution downward if linked Medicare claims data are not comprehensive. In cohorts of patients with mBC or other cancers, controlling for this bias through stratification or weighting in subsequent analyses of age‐related conditions or events should be considered.
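One way to apply such weighting is direct standardization of stratum‐specific estimates from the linked subcohort to the age distribution of the full EHR cohort. The sketch below uses the age‐group sizes from Table 1 as the reference population; the stratum‐specific incidences are hypothetical values chosen only to show the arithmetic, and the function name is an assumption.

```python
def age_standardize(stratum_estimates: dict[str, float],
                    reference_counts: dict[str, int]) -> float:
    """Weight stratum-specific estimates by a reference age distribution."""
    total = sum(reference_counts.values())
    return sum(stratum_estimates[age] * n / total
               for age, n in reference_counts.items())


ehr_counts = {"<65": 4373, ">=65": 1916}     # EHR cohort (Table 1)
linked_counts = {"<65": 1188, ">=65": 250}   # EHR-claims subcohort (Table 1)

# Hypothetical 180-day incidences observed in the linked subcohort, by age group.
linked_incidence = {"<65": 0.10, ">=65": 0.20}

crude = age_standardize(linked_incidence, linked_counts)       # subcohort's own age mix
standardized = age_standardize(linked_incidence, ehr_counts)   # EHR cohort's age mix
print(round(crude, 3), round(standardized, 3))
```

With these illustrative inputs the standardized estimate is higher than the crude one, because the reference population gives more weight to patients ≥ 65 years than the linked subcohort does.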

4.2. Data Source Selection

For researchers considering studies in EHR‐claims linked datasets, straightforward metrics can be used to quantify information gained relative to sample size lost by requiring a claims linkage to EHR. The trade‐off between total sample size and EHR data quantity is well‐known [14, 15]. For example, in order to reduce the amount of misclassification through EHR discontinuity, Anand et al. restricted an EHR‐Medicaid claims cohort by selecting for the presence of variables that were highly predictive of EHR continuity [15]. Limiting the sample to acceptable levels of misclassification resulted in a sample size that was 20% of the original for patients ages ≥ 65 years and 50% for patients ≥ 18 years.

For EHR‐claims datasets with less stringent inclusion criteria for claims eligibility, such as 90 days, the continuity or completeness of the data can be difficult to identify by quantitative methods alone. While the amount of observed patient time can affect the completeness of the variables, comprehensiveness may have unmeasured sources of variation. The breadth and density of unique diagnoses are likely to be higher in EHR‐claims than EHR alone, given the increase in observable patient time. Our results show that even with adjustment for the greater follow‐up time observed in the EHR‐claims data, the mean and median breadth of unique codes remain greater for both age categories compared with the EHR cohort. However, in contrast to the increase in breadth and density of high‐level diagnoses, the addition of diagnostic claims made only minimal differences to critical confounders such as smoking status and obesity status.

Multiple measures of completeness should be assessed through quantitative and qualitative measures. Length of look‐back and follow‐up periods, as well as sample sizes under proposed minimal look‐back periods required for inclusion, can inform design. For researchers initiating studies in EHR‐claims linked datasets, these metrics can aid in interpretation of comparisons to EHR‐only results. Attention should be paid to the potential for selection bias resulting from use of claims linkage. The difference in mortality we observed suggests differential selection of patient severity based on the availability of the claims linkage by patient age. This contrasts with our observation of more comorbidities found in the EHR‐claims subcohort than in the EHR cohort.

4.3. AE Ascertainment

Even with the addition of claims data, the incidence estimates for certain AEs were lower than expected. For example, a review of AEs after therapies for HER2‐positive mBC reported the proportion who had experienced a Grade 3+ AE ranged from 39.4% to 66.3% [11]. Gastrointestinal AEs such as diarrhea or nausea/vomiting were higher in the EHR‐claims subcohort than in the EHR cohort but were still below the minima of the ranges reported for diarrhea (> 19%), nausea (> 34%), or vomiting (19%) by Perez et al. [11]. For hematologic AEs such as anemia or neutropenia, a higher incidence was observed in the EHR‐claims subcohort than EHR only (anemia was 11%, neutropenia 21% among patients age < 65; among patients ≥ 65 years, anemia was 13% and neutropenia 19%). In this subcohort, the incidence of anemia and neutropenia was equal to or above the percentages for the minima of the ranges in the review (anemia 11%, neutropenia 12%).

Although the lack of therapy exposure eligibility criteria in our study might reduce comparability to most studies in the literature, it is unlikely that patients with mBC undergoing usual care in CAI‐contributing clinics were exposed to therapies with such low risk of AEs. The under‐ascertainment of oncology AEs in RWD has been reported, and our results demonstrate that adding claims to EHR data can attenuate but not completely address this problem [16].

The value of the linkage may vary according to the importance of specific AEs. Common AEs such as fatigue or alopecia were either minimally reported (fatigue 6%) or not at all (alopecia 0%) [17, 18]. For severe but uncommon adverse events such as interstitial lung disease (ILD), the higher incidence (0.4%–0.5%) observed for the EHR‐claims subcohort could be relevant for studies of specific therapeutics, such as CDK4/6 inhibitors, with a low but meaningful published incidence (1%–2%) [17] of ILD that require special considerations for clinical care. For studies concerned with neuropathy, the incidence for the EHR‐claims subcohort (13%–22%) supports the use of a linkage [19].

4.4. Lab Result Missingness

In the EHR‐claims subcohort, we observed a substantial proportion of lab claims with no subsequent lab result in EHR within the following 10 days. The proportion of claims followed by a result ranged from 38% for neutrophil counts to 77% for LDH. In traditional EHR analysis, when a lab result does not appear in the record, it is ambiguous whether the lab result is missing or the lab procedure was never done. Claims data can be used to demonstrate that the lab procedure was done, thereby disambiguating the reason for the lab result not appearing. In future studies, EHR‐claims linked data can aid in evaluating patterns of missingness [20]. Complete case analysis is often used but is only appropriate when lab values are missing completely at random. Predictors of missingness could be used to assess biases, to determine if complete‐case analysis is appropriate and, if not, to then inform an alternative approach with imputation of missing values [21, 22]. Precise estimation of the comprehensiveness of the data, through use of variables such as labs, is the first step to addressing limitations from missingness present in EHR.

5. Limitations

This study demonstrates an approach to evaluating impacts of linkage in a ConcertAI cohort of patients with mBC. Although the approach could be applied broadly to all EHR datasets with a claims linkage, our results may not be generalizable to disease cohorts other than cancers. Other disease cohorts could have unique patterns for treatment or care, periods for measurement of potential confounders, or follow‐up time for AEs that would alter the impact of an EHR–claims linkage. Differences in the accuracy or measurement of diagnoses in other EHR data sources would also affect the generalizability of our results. Additional analytic choices in this study could alter our findings. For example, we required no specific healthcare setting, treatment exposure, or verification of a single diagnostic code for comorbidities at baseline; more stringent criteria could reduce the apparent comorbidity prevalence and index scores in the EHR–claims subcohort. We recommend that researchers undertaking studies in EHR–claims linked data evaluate the impact of linkage on key patient characteristics, treatment exposures and comedications, and outcome ascertainment.

The criterion to define the EHR‐claims subcohort was 90 days of claims coverage at any time in the study period, regardless of whether it overlapped with the index date. This is a relatively lenient criterion and may have limited the ability of claims data to inform baseline characteristics and incident AEs. More stringent coverage criteria might improve variable capture but would further shrink the subcohort.
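The trade‐off between a lenient and a more stringent coverage criterion can be sketched as follows. This is an illustration under stated assumptions, not the study's code: coverage is assumed to be counted cumulatively across enrollment spans, spans are assumed to be non‐overlapping, inclusive of both end dates, and already restricted to the study period, and the stricter rule shown (coverage must also include the index date) is only one hypothetical alternative. All function names and dates are invented for the example.

```python
from datetime import date


def covered_days(spans):
    """Total days of claims coverage across (start, end) spans.
    Assumes spans do not overlap and both end dates are inclusive."""
    return sum((end - start).days + 1 for start, end in spans)


def meets_lenient(spans, min_days=90):
    """Lenient rule: at least `min_days` of coverage at any time."""
    return covered_days(spans) >= min_days


def meets_index_overlap(spans, index_date, min_days=90):
    """Hypothetical stricter rule: the lenient rule holds AND some
    coverage span contains the index date."""
    on_index = any(start <= index_date <= end for start, end in spans)
    return meets_lenient(spans, min_days) and on_index


# Hypothetical patient with coverage that ended before the index date.
spans = [(date(2018, 1, 1), date(2018, 6, 30))]
index_date = date(2019, 3, 15)

print(covered_days(spans))                     # 181
print(meets_lenient(spans))                    # True
print(meets_index_overlap(spans, index_date))  # False
```

A patient like this one would enter the subcohort under the lenient rule even though claims contribute nothing around the index date, which is how a lenient criterion can dilute the information gained for baseline characteristics and incident AEs.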

6. Conclusion

Shifts in age, completeness of diagnosis history, and duration of observation can contribute to differences in results derived from EHR data and the subset of patients from the same EHR source who have linkage to claims data. Measuring each domain in advance allows for clearer interpretations of differences in results between EHR‐based studies and EHR‐claims‐based studies.

6.1. Plain Language Summary

This study compared two ways of tracking medical information for patients with metastatic breast cancer: using electronic health records (EHR) alone versus combining EHR with insurance claims data. Out of 6289 patients in the EHR group, only 23% (1438 patients) had insurance claims data. While the combined EHR–claims approach provided more detailed medical information and longer observation periods, it had two main drawbacks: a much smaller sample size and fewer patients aged 65 and older. In the EHR‐only group, 30% of patients were aged 65 or older, compared with only 17% in the combined group. Adverse events were reported more frequently in the combined data group, likely because the combined data captured more complete information. These findings suggest that researchers should consider age and other population differences when combined EHR–claims data are used and should understand the trade‐offs involving sample size, generalizability, and data completeness.

Disclosure

The authors have nothing to report.

Ethics Statement

This work was funded by Gilead Sciences. The authors with Gilead affiliations hold stock in the company and played a role in design, execution, interpretation, and writing.

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

Table S1: ICD‐10 CM Diagnosis or CPT codes for all variables.

PDS-34-e70213-s002.pdf (584.3KB, pdf)

Table S2: Attrition for metastatic breast cancer cohort in EHR cohort and EHR‐claims subcohort.

PDS-34-e70213-s001.pdf (99.6KB, pdf)

Table S3: Length of look‐back and follow‐up periods in the EHR cohort and EHR‐claims subcohort.

PDS-34-e70213-s006.pdf (165.9KB, pdf)

Table S4: Coverage of diagnosis codes in the EHR cohort and EHR‐claims subcohort.

PDS-34-e70213-s005.pdf (132.7KB, pdf)

Table S5: Adverse events in the EHR cohort and EHR‐claims subcohort.

PDS-34-e70213-s004.pdf (99.4KB, pdf)

Figure S1: Breadth of unique diagnoses, per patient, in the EHR cohort and EHR‐claims subcohort, for patients with mBC < 65 years old (a) and ≥ 65 years old (b).

Figure S2: Density of unique high‐level diagnoses per patient‐month, in the EHR cohort and EHR‐claims subcohort, for patients with mBC < 65 years old (a) and ≥ 65 years old (b).

Figure S3: Rates of adverse events per 100 person‐months, in the EHR cohort and EHR‐claims subcohort for patients with mBC < 65 years old (a) and ≥ 65 years old (b).

PDS-34-e70213-s003.docx (209.9KB, docx)

Acknowledgements

Thanks to Tyler Alexander, SimulStat Inc., for manuscript editing and formatting.

Geddes J., Katz J., Asiimwe A., et al., “Quantification of Information Gained by Linking Claims Data to an Electronic Health Record Cohort of Patients With Metastatic Breast Cancer,” Pharmacoepidemiology and Drug Safety 34, no. 10 (2025): e70213, 10.1002/pds.70213.

Funding: This work was supported by Gilead Sciences.

Jonah Geddes and Carrie M. Nielson contributed equally to this study.

This work was presented at the International Society for Pharmacoepidemiology annual meeting in August 2024. The work was sponsored by Gilead Sciences.

References

  • 1. Sauer C. M., Chen L. C., Hyland S. L., Girbes A., Elbers P., and Celi L. A., “Leveraging Electronic Health Records for Data Science: Common Pitfalls and How to Avoid Them,” Lancet Digital Health 4, no. 12 (2022): e893–e898.
  • 2. Lin K. J. and Schneeweiss S., “Considerations for the Analysis of Longitudinal Electronic Health Records Linked to Claims Data to Study the Effectiveness and Safety of Drugs,” Clinical Pharmacology and Therapeutics 100, no. 2 (2016): 147–159.
  • 3. Harron K., Dibben C., Boyd J., et al., “Challenges in Administrative Data Linkage for Research,” Big Data & Society 4, no. 2 (2017): 1–12.
  • 4. Weiskopf N. G., Hripcsak G., Swaminathan S., and Weng C., “Defining and Measuring Completeness of Electronic Health Records for Secondary Use,” Journal of Biomedical Informatics 46, no. 5 (2013): 830–836.
  • 5. Slipski L. D. B., Yu Y., Walker M., and Natanzon Y., “Addition of Open Administrative Claims Significantly Improves Capture of Mortality in Electronic Health Record,” Presented at: ISPOR; November 17‐20 2024; Barcelona, Spain, https://www.ispor.org/conferences‐education/conferences/past‐conferences/ispor‐europe‐2024/program/program/session/euro2024‐4013/142887.
  • 6. McGinnis K. A., Skanderson M., Justice A. C., et al., “Using the Biomarker Cotinine and Survey Self‐Report to Validate Smoking Data From United States Veterans Health Administration Electronic Health Records,” JAMIA Open 5, no. 2 (2022): 1–7.
  • 7. Haque M. A., Gedara M. L. B., Nickel N., Turgeon M., and Lix L. M., “The Validity of Electronic Health Data for Measuring Smoking Status: A Systematic Review and Meta‐Analysis,” BMC Medical Informatics and Decision Making 24, no. 1 (2024): 33.
  • 8. Clemens K. K., Reid J. N., Shariff S. Z., and Welk B., “Validity of Hospital Codes for Obesity in Ontario, Canada,” Canadian Journal of Diabetes 45, no. 3 (2021): 243–248.e4.
  • 9. Samadoulougou S., Idzerda L., Dault R., Lebel A., Cloutier A. M., and Vanasse A., “Validated Methods for Identifying Individuals With Obesity in Health Care Administrative Databases: A Systematic Review,” Obesity Science and Practice 6, no. 6 (2020): 677–693.
  • 10. Stedman M. R., Doria‐Rose P., Warren J. L., Klabunde C. N., and Mariotto A., “Comorbidity Technical Report: The Impact of Different SEER‐Medicare Claims‐Based Comorbidity Indexes on Predicting Non‐Cancer Mortality for Cancer Patients,” https://Healthcaredelivery.Cancer.Gov/Seermedicare/Considerations/Comorbidity‐Report.Pdf.
  • 11. Perez E. A., Dang C., Lee C., et al., “Incidence of Adverse Events With Therapies Targeting HER2‐Positive Metastatic Breast Cancer: A Literature Review,” Breast Cancer Research and Treatment 194, no. 1 (2022): 1–11.
  • 12. Klein J. P. and Moeschberger M. L., “Chapter 13,” in Survival Analysis: Techniques for Censored and Truncated Data, vol. XIV, 1st ed., ed. Dietz K., Gail M., Krickeberg K., Samet J., and Tsiatis A. (Springer‐Verlag, 1997), 502.
  • 13. Salas M., Henderson M., Sundararajan M., et al., “Use of Comorbidity Indices in Patients With Any Cancer, Breast Cancer, and Human Epidermal Growth Factor Receptor‐2‐Positive Breast Cancer: A Systematic Review,” PLoS One 16, no. 6 (2021): e0252925.
  • 14. Merola D., Schneeweiss S., Schrag D., Lii J., and Lin K. J., “An Algorithm to Predict Data Completeness in Oncology Electronic Medical Records for Comparative Effectiveness Research,” Annals of Epidemiology 76 (2022): 143–149.
  • 15. Anand P., Zhang Y., Merola D., et al., “Comparison of EHR Data‐Completeness in Patients With Different Types of Medical Insurance Coverage in the United States,” Clinical Pharmacology and Therapeutics 114, no. 5 (2023): 1116–1125.
  • 16. Nielson C. M., Bylsma L. C., Fryzek J. P., Saad H. A., and Crawford J., “Relative Dose Intensity of Chemotherapy and Survival in Patients With Advanced Stage Solid Tumor Cancer: A Systematic Review and Meta‐Analysis,” Oncologist 26, no. 9 (2021): e1609–e1618.
  • 17. Schlam I., Giordano A., and Tolaney S. M., “Interstitial Lung Disease and CDK4/6 Inhibitors in the Treatment of Breast Cancer,” Expert Opinion on Drug Safety 22, no. 12 (2023): 1149–1156.
  • 18. Schlam I., Tarantino P., and Tolaney S. M., “Managing Adverse Events of Sacituzumab Govitecan,” Expert Opinion on Biological Therapy 23, no. 11 (2023): 1103–1111.
  • 19. Seretny M., Currie G. L., Sena E. S., et al., “Incidence, Prevalence, and Predictors of Chemotherapy‐Induced Peripheral Neuropathy: A Systematic Review and Meta‐Analysis,” Pain 155, no. 12 (2014): 2461–2470.
  • 20. Wells B. J., Chagin K. M., Nowacki A. S., and Kattan M. W., “Strategies for Handling Missing Data in Electronic Health Record Derived Data,” EGEMS (Washington, DC) 1, no. 3 (2013): 1035.
  • 21. Ferri P., Romero‐Garcia N., Badenes R., et al., “Extremely Missing Numerical Data in Electronic Health Records for Machine Learning Can Be Managed Through Simple Imputation Methods Considering Informative Missingness: A Comparative of Solutions in a COVID‐19 Mortality Case Study,” Computer Methods and Programs in Biomedicine 242 (2023): 107803.
  • 22. Ibrahim J. G., Chu H., and Chen M. H., “Missing Data in Clinical Studies: Issues and Methods,” Journal of Clinical Oncology 30, no. 26 (2012): 3297–3303.

Articles from Pharmacoepidemiology and Drug Safety are provided here courtesy of Wiley
