PLOS One. 2026 Jan 23;21(1):e0336967. doi: 10.1371/journal.pone.0336967

Investigation into EHR data coverage in the All of Us Research Program via linkage to health insurance claims

Yuyang Yang 1,*, Kelsey Rodriguez 2, Javier Ezcurra 3, Romain Bogaerts 3, Andres Corrada-Emmanuel 3, Lew Berman 4, Melissa Basford 2, Abel Kho 1
Editor: Sreeram V Ramagopalan 5
PMCID: PMC12829818  PMID: 41576060

Abstract

In 2020, the All of Us (AoU) Research Program Data Completeness Task Force identified several challenges in the state of program health data, including the high likelihood that Electronic Health Record (EHR) data from recruitment institutions is missing a significant portion of care received by participants. To improve data availability, the AoU Data Completeness Task Force recommended efforts to work with industry partners to better understand the degree of missingness within AoU EHR data, with an initial focus on claims data. In this study we describe efforts from AoU’s collaboration with Swoop, a commercial analytics company holding health insurance claims data for over 80% of Americans, to assess the degree to which health insurance claims data can fill out the complete picture of care received by AoU participants. Using record linkage to link between individual participant AoU EHR data and their respective insurance claims, we quantitatively assess the amount of missing health data in both Swoop claims data and AoU EHR data over a decade long sample, identifying trends in data missingness based on participant characteristics. Our analysis demonstrates that AoU would greatly benefit from ingestion of claims data, gaining an estimated 16 million (90 per person) unique diagnosis codes, 17.8 million (99 per person) unique procedure codes, and 9.4 million (53 per person) unique drug codes through linkage to claims data.

Introduction

The All of Us Research Program

The All of Us (AoU) Research Program is an NIH-funded longitudinal cohort research study which aims to recruit a diverse sample of one million individuals to support the study of human health and advance precision medicine [1]. Participants in AoU agree to share their health information, including electronic health record (EHR) data from care institutions, self-reported survey data about health behaviors, physical measurements, data from wearable devices like Fitbits, and genetic sequencing information with the program, providing a rich resource for health researchers. As of November 2024, the program had enrolled over 850,000 individuals across 50 US states and territories [2].

Care fragmentation within AoU EHR data

EHR data exists as a longitudinal record of patient exposures and outcomes, serving as a powerful source of real-world data to drive biomedical discovery that has become widely accepted in the study of diverse topics like healthcare utilization, medication post-marketing safety surveillance, and increasingly in clinical research [3]. Despite the promise of EHR data, data missingness and quality are known pitfalls that need to be addressed to maximize its potential [4]. Recently, internal investigations within AoU have identified significant potential for missingness in the shared EHR data of program participants [5–7]. The primary method for All of Us to acquire participant EHR data is from the healthcare provider organization (HPO) through which a participant enrolls in the program. For individuals who enroll in AoU outside of an HPO (participants referred to as Direct Volunteers (DV)), EHR data can instead be ingested through connections to clinician portals, using Fast Healthcare Interoperability Resources (FHIR) protocols, from healthcare institutions identified by the participants themselves [1,7]. In either case, after receiving participant consent and a HIPAA authorization to share clinical data [7,8], EHR data is converted to a standard format under the Observational Medical Outcomes Partnership (OMOP) common data model [9] before it is sent to the AoU Data and Research Center (DRC) [10]. Because participant EHR data typically comes from the enrollment HPO, the degree to which participant healthcare records outside of the enrollment HPO are included in AoU is limited. As a result, a major potential source of EHR data missingness in AoU is care fragmentation [11], in which care received by patients at one healthcare institution is not necessarily communicated to the EHRs of other healthcare institutions where the patient receives care, leading EHR data from enrollment HPOs to provide an incomplete record of health activities [12].

To investigate the risk of care fragmentation in AoU EHR data, members of our group previously used a privacy-preserving record linkage method [13] to identify the proportion of shared patients between seven AoU-contributing HPO sites in three Midwestern states (Wisconsin, Illinois, Indiana) between the years 2011–2018 [14]. The same individual appeared in more than one of these HPOs at rates of 6.1% to 32.7% between sites, suggesting significant potential for care fragmentation in patient records of AoU-contributing HPO sites.

Use of insurance claims to supplement EHR data

Health insurance claims are records of billable services submitted by healthcare systems to public or private insurers for the purposes of reimbursement. They contain standardized billing codes representing patient diagnoses, types of procedures performed, and medications received. Due to their structured nature and availability, health insurance claims are frequently used in healthcare studies [15]. However, because claims data are designed for obtaining reimbursement, they may not contain non-billable services or conditions, may under-document diagnoses not relevant to obtaining compensation, and may incentivize documentation that makes patients appear sicker (“upcoding”) [16]. In addition, patients with unstable or no insurance may not be traceable using this data type [16,17]. In comparison, EHR data is primarily used for supporting and documenting patient care. EHR data contains broader healthcare information than claims, including patient data from clinical notes, lab results, and medical history that are important for clinical care but not required for billing. While EHR data is also considered a valuable resource in health research, it has known data quality issues [18], as well as the previously mentioned problems with care fragmentation.

There is broad support for the view that integrating claims and EHR data can help health research by providing a more complete picture of a patient’s health. For example, use of combined data from both claims and EHR has been shown to outperform EHR or claims data alone in predictive modeling tasks across multiple scenarios [19–21]. While it may be helpful to link claims data to existing EHR systems, this practice is not typical due to difficulties in interoperability, a lack of standardized pipelines to link EHR and claims data, and legal barriers surrounding HIPAA as well as state confidentiality laws [22,23].

To directly assess the extent of EHR data missingness within AoU participant data, we previously used record linkage to compare EHR data of 400,000+ AoU participants with claims data [24,25]. Unfortunately, only 41% of AoU participants matched to claims data provided by our data partner, Swoop, a precision health omnichannel solutions company with aggregated claims data for over 80% of Americans (obtained through HIPAA-compliant business associate agreements with primary sources including pharmacy benefit managers, clearinghouses, and payer organizations) [24]. Despite this limitation, our study identified significant missingness in the service dates, diagnosis events, procedure codes, and prescriptions in AoU EHR data compared to claims. In this manuscript, we describe improvements to our previous comparison of AoU EHR data with Swoop claims. We improved the participant match rate from 41% to 95% and conducted additional analysis to examine whether participant-level characteristics are associated with the extent of EHR data missingness when compared to claims.

Materials and methods

Permission to conduct this study was obtained from the AoU Research Program Research Compliance Branch Institutional Review Board and Northwestern University Institutional Review Board. A cohort of 246,128 total AoU participants were included in the final analysis. AoURP participants provided written informed consent to share their EHR data with the AoURP for research purposes through a HIPAA authorization form, including linkage of their data to secondary data sources such as insurance claims [7,8]. EHR and claims data for each participant was included between the years 2011 and 2021. Data analysis took place between September 24th, 2021, and April 30th, 2025. All participant data was fully anonymized prior to access by authors, who did not have access to information that could identify individual participants during or after data collection.

Data linkage process

Identifiers for each participant were aggregated by an analyst within the AoU DRC and included participant date of birth, full name, and gender. These identifiers were combined into a single string and used to create tokens at the individual person level to encrypt participant identity via software provided by the company Datavant. Using a one-way hash, two tokens for each participant were created from strings made up of the following patient identifiers [26]:

  1. Token1=LastName+1st Initial of First Name+Sex at Birth+Date of Birth

  2. Token2=LastName (Soundex)+FirstName (Soundex)+Sex at Birth+Date of Birth

After tokenization, the hashed identifiers from the AoU DRC, alongside counts of healthcare activity to be compared, are sent to Swoop, which generates their own tokens on the same set of identifiers. Swoop identifies persons that exist in both datasets, where a match exists when both Token1 and Token2 are the same between AoU and Swoop data. For each matching participant, Swoop calculates counts of healthcare activity from claims data on their side and sends the file back to AoU. Finally, the data is analyzed by the DRC team at Northwestern Medicine (NM) on the AoU Researcher Workbench.
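The token construction and the two-token matching rule described above can be illustrated with a minimal sketch. This is not Datavant's software, whose internals are not described in this paper. The SHA-256 hash, the `|` separator, the upper-casing of names, the simplified Soundex, and every function name below are assumptions made purely for illustration.

```python
import hashlib

# Simplified American Soundex (illustrative stand-in, not Datavant's implementation).
_CODES = {
    letter: digit
    for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                           "4": "L", "5": "MN", "6": "R"}.items()
    for letter in letters
}


def soundex(name: str) -> str:
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            out += code
        if c not in "HW":  # H and W do not separate letters sharing a code
            prev = code
    return (out + "000")[:4]


def _hash(parts):
    # One-way hash of the concatenated identifiers (SHA-256 is an assumption).
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def make_tokens(last, first, sex_at_birth, dob):
    last, first = last.strip().upper(), first.strip().upper()
    # Token1 = LastName + first initial of FirstName + Sex at Birth + Date of Birth
    token1 = _hash([last, first[:1], sex_at_birth, dob])
    # Token2 = LastName (Soundex) + FirstName (Soundex) + Sex at Birth + Date of Birth
    token2 = _hash([soundex(last), soundex(first), sex_at_birth, dob])
    return token1, token2


def is_match(tokens_a, tokens_b):
    # A match requires BOTH Token1 and Token2 to agree between the two datasets.
    return tokens_a[0] == tokens_b[0] and tokens_a[1] == tokens_b[1]


aou = make_tokens("Smith", "Jon", "M", "1980-01-02")
claims_same = make_tokens("SMITH", "John", "M", "1980-01-02")
claims_other = make_tokens("Smyth", "Jon", "M", "1980-01-02")
```

In this toy example, `aou` and `claims_same` match on both tokens, while `claims_other` shares the Soundex-based Token2 but differs on Token1, so it is not counted as a match under the two-token rule.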

Description of shared data

We identified four healthcare activity types used for comparison in this study: 1) Service Dates, 2) Diagnosis (Dx) Codes (ICD9 and ICD10), 3) Procedure (Px) Codes (CPT and HCPCS), and 4) National Drug Codes (NDCs).

Patient event months.

We defined units of healthcare activity at the monthly level in the form of patient-event months (PEM). For service dates, each distinct day in a calendar month on which healthcare activity (a Dx, Px, or NDC code) is reported adds one to the count for that PEM. For the Dx, Px, and NDC categories, the PEM count is the total number of reported activities in that month. For example, if a participant has six diagnosis codes on three separate days in January 2018, a PEM count of 3 is recorded for service dates, regardless of the number of distinct diagnosis codes that occurred on those days, while a PEM count of 6 is recorded for diagnoses in that month.
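The PEM definition can be made concrete with a short sketch that reproduces the January 2018 example. The event tuple layout, the function name `pem_counts`, and the choice to count every reported event (rather than distinct codes) are assumptions for illustration; the study's actual aggregation code is not shown in the paper.

```python
from collections import defaultdict
from datetime import date


def pem_counts(events):
    """Aggregate (participant_id, service_date, category) events into PEM counts.

    category is one of "Dx", "Px", "NDC". Service dates count distinct days
    with any activity; the other categories count reported events.
    """
    days = defaultdict(set)
    counts = defaultdict(lambda: {"service_dates": 0, "Dx": 0, "Px": 0, "NDC": 0})
    for pid, day, category in events:
        key = (pid, day.year, day.month)  # one patient-event month
        counts[key][category] += 1
        days[key].add(day)
    for key, distinct_days in days.items():
        counts[key]["service_dates"] = len(distinct_days)
    return dict(counts)


# Six diagnosis codes spread over three separate days in January 2018.
events = (
    [("p1", date(2018, 1, 5), "Dx")] * 2
    + [("p1", date(2018, 1, 12), "Dx")] * 3
    + [("p1", date(2018, 1, 20), "Dx")]
)
january = pem_counts(events)[("p1", 2018, 1)]
print(january)  # service_dates is 3 and Dx is 6 for this month
```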

Participant characteristics

We queried the AoU Researcher Workbench to identify additional information for each participant. This included participant self-identified race, ethnicity, and gender, current age (by decade), and common chronic conditions. Chronic conditions were chosen based on the US Department of Health and Human Services Office of the Assistant Secretary of Health (OASH) list of prevalent chronic conditions [27] and include Hypertension, Congestive Heart Failure, Coronary Artery Disease, Cardiac Arrhythmias, Hyperlipidemia, Stroke, Arthritis, Asthma, Autism Spectrum Disorder, Cancer, Chronic Kidney Disease, COPD, Dementia, Depression, Diabetes, Inflammatory disease of liver, HIV, Osteoporosis, Schizophrenia, and Substance abuse disorder.

Additional patient data that is not available on the AoU Researcher Workbench (RWB) was obtained using an internal program data resource called the AoU Program Data Repository (PDR). This resource is not available on the RWB for general researcher use; however, it was used specifically on this project to inform programmatic activities and in terms of assessing EHR data quality (missingness) and determining the utility of pursuing claims data as an asset for use on the RWB. These included HPO Type (type of site through which the participant enrolls), income level, education, insurance type, and participant home address. We used the participant home Census tract to link in indices of socioeconomic disadvantage and degree of urbanization in the participant living area by linking to the Neighborhood Atlas Area Deprivation Index (ADI) [28,29] and Rural-Urban Commuting Area (RUCA) [30]. We used the national rank score of each census tract in the ADI, which provides a composite measure of deprivation between 1–100 and divided the score into quartiles (q1 [1–25), q2 [25–50), q3 [50–75), q4 [75–100]). RUCA codes were grouped into four categories depending on secondary RUCA code using “Scheme 1” provided by the Washington State Department of Health, which was chosen to account for participant accessibility to urban-based healthcare services [31]. Census tracts were categorized as urban core (1), suburban (2 and 3), large rural (4, 5, 6), and small town/rural (7, 8, 9, 10) depending on what number their respective RUCA codes started with.
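The ADI quartile bins and RUCA groupings described above can be written as two small lookup functions. This is a sketch under stated assumptions: the function names are hypothetical, the ADI national rank is assumed to be an integer from 1 to 100, and RUCA codes are assumed to arrive as numbers or strings such as `4.1`, whose leading number selects the group.

```python
def adi_quartile(national_rank: int) -> str:
    """Bin an ADI national rank (1-100) into q1 [1-25), q2 [25-50), q3 [50-75), q4 [75-100]."""
    if not 1 <= national_rank <= 100:
        raise ValueError(f"ADI national rank out of range: {national_rank}")
    if national_rank < 25:
        return "q1"  # least deprived
    if national_rank < 50:
        return "q2"
    if national_rank < 75:
        return "q3"
    return "q4"  # most deprived


# Grouping by the leading number of the RUCA code, as described in the text.
_RUCA_GROUPS = {
    1: "urban core",
    2: "suburban", 3: "suburban",
    4: "large rural", 5: "large rural", 6: "large rural",
    7: "small town/rural", 8: "small town/rural",
    9: "small town/rural", 10: "small town/rural",
}


def ruca_category(ruca_code) -> str:
    leading = int(float(ruca_code))  # e.g. 4.1 -> 4, "10.2" -> 10
    if leading not in _RUCA_GROUPS:
        raise ValueError(f"Unrecognized RUCA code: {ruca_code}")
    return _RUCA_GROUPS[leading]


print(adi_quartile(24), adi_quartile(25), adi_quartile(100))
print(ruca_category(1.0), ruca_category("4.1"), ruca_category(10.2))
```

Raising an error for out-of-range values, instead of silently assigning a category, is a design choice of this sketch; the paper instead filtered out participants whose addresses failed geocoding.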

Statistical analysis

Descriptive statistics were calculated to identify differences in counts of healthcare activities seen between AoU and Swoop sets of data. Additionally, to see what patient characteristics were associated with differential activity counts between AoU and Swoop data, we performed an adjusted linear regression analysis using ordinary least squares regression in the statsmodels package (Version 0.14.2) in Python. The covariates used were participant characteristics including demographic information (e.g., age, race, gender), socioeconomic characteristics (e.g., income, education, insurance type, deprivation index, geography type), and chronic conditions (e.g., heart failure, cancer, dementia). The reference values for this analysis were white (for race), non-Hispanic (for ethnicity), male (for gender), 20–29-year-old (for age), annual income of 200k or greater (for income), advanced degree (for education), private (for insurance), urban core (for geography type), first quartile (being least deprived, for deprivation index), and not having the condition for all chronic diagnoses of interest. A cutoff of P < 0.05 was used to assess significance. All covariates used in the model had a variance inflation factor of less than 10. Because moderate heteroscedasticity of residuals was observed in the sample, we used robust standard errors when calculating the regression.
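A regression of this general shape can be sketched in statsmodels as follows. The data here are synthetic and the column names are hypothetical. The sign convention of the outcome (Swoop count minus AoU count) and the specific robust covariance estimator (`HC3`) are assumptions, since the text states only that robust standard errors were used. Only three covariates are included to keep the example short.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-in data; not drawn from AoU or Swoop.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "age_decade": rng.choice(["20-29", "40-49", "60-69"], size=n),
    "insurance": rng.choice(["Private", "Medicaid", "None"], size=n),
    "hypertension": rng.integers(0, 2, size=n),
})
age_effect = df["age_decade"].map({"20-29": 0.0, "40-49": 10.0, "60-69": 20.0})
# Noise variance depends on a covariate, so residuals are heteroscedastic.
noise = rng.normal(0, 5, size=n) * (1 + df["hypertension"])
# Hypothetical outcome: Swoop Dx PEM count minus AoU Dx PEM count.
df["dx_diff"] = age_effect - 8.0 * df["hypertension"] + noise

# Treatment coding sets the reference level for each categorical covariate.
formula = (
    "dx_diff ~ C(age_decade, Treatment(reference='20-29')) "
    "+ C(insurance, Treatment(reference='Private')) + hypertension"
)
model = smf.ols(formula, data=df)
result = model.fit(cov_type="HC3")  # heteroscedasticity-robust standard errors

# Variance inflation factor per design-matrix column (intercept skipped).
vifs = {
    name: variance_inflation_factor(model.exog, i)
    for i, name in enumerate(model.exog_names)
    if name != "Intercept"
}
significant = result.pvalues[result.pvalues < 0.05].index.tolist()
print(result.params.round(2))
print(vifs)
```

With this sign convention, a negative coefficient would indicate relatively more AoU-sided activity and a positive coefficient more Swoop-sided activity for that characteristic.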

Statistical analysis was performed using Python, version 3.10.12, within the All of Us Researcher Workbench (All of Us Registered Tier Dataset v7). A copy of our Jupyter Notebook and a link to our workspace are available upon request.

Results

Linkage evaluation

Health records for 472,877 AoU participants were initially considered for matching (Fig 1). Of these, 2,167 (0.5%) participant records encountered a Datavant tokenization error (inability to generate tokens from patient identifiers), while 16,594 (3.5%) participants became ineligible for matching due to having identifiers that generate the same token pair as one or more other individuals. Of the 454,116 remaining potentially matchable individuals, 430,803 matched to a patient in the Swoop dataset, leading to a match rate of 95%. After finding that 808 (0.2%) of matched participants could not be identified in the AoU Researcher Workbench, 429,995 individuals were available for analysis on the workbench. We found that a substantial portion of AoU participants (178,331; 41.5% of the remaining participants) have yet to have EHR data ingested into the program (due to participants not providing HIPAA authorization to share EHR data with the program, a lag in the receipt of participant EHR data from partner HPOs, or participants having joined the program digitally with no available EHR data) and were therefore removed from the primary analysis. A small portion of participants (n = 9) were then removed due to lack of data in demographic variables. Finally, participants without address information or those that failed geocoding were filtered out, leaving a final analysis cohort of 246,128 AoU participants (52% of the initial set).
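The cohort attrition figures reported above can be checked with simple arithmetic. All inputs below are taken from the text; the variable names are illustrative, and the number of participants removed at the address/geocoding step is not stated in the text and is derived here only as the remainder.

```python
initial = 472_877
token_errors = 2_167        # Datavant tokenization errors
duplicate_tokens = 16_594   # same token pair as another individual

matchable = initial - token_errors - duplicate_tokens
matched = 430_803
match_rate = matched / matchable

not_on_workbench = 808
on_workbench = matched - not_on_workbench

no_ehr = 178_331
with_ehr = on_workbench - no_ehr

missing_demographics = 9
final_cohort = 246_128
# Implied by the reported totals; not stated directly in the text.
geocoding_exclusions = with_ehr - missing_demographics - final_cohort

print(f"matchable: {matchable:,}")
print(f"match rate: {match_rate:.1%}")
print(f"on workbench: {on_workbench:,}")
print(f"share without EHR: {no_ehr / on_workbench:.1%}")
print(f"with EHR: {with_ehr:,}")
print(f"implied address/geocoding exclusions: {geocoding_exclusions:,}")
print(f"final share of initial set: {final_cohort / initial:.0%}")
```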

Fig 1. Breakdown of all participants considered for the study including the number of participants excluded for missing data.

HPO = Healthcare Provider Organization, SDOH = Social Determinants of Health.

When comparing the final analysis cohort (n = 246,128) to the set of participants that were missing EHR data but had valid demographic data on the AoU Researcher Workbench, no major discrepancies in the breakdown of demographic variables before and after exclusions were noted (Table 1).

Table 1. Demographic variable breakdown for matched participants in AoU workbench pre and post exclusion of participants without EHR data.

Pre-filter: all patients with demographic data (N = 375,739). Post-filter: all patients with demographic, SDOH, and EHR data (N = 246,128).

Variable                                     Count (Pre-Filter)  Percentage  Count (Post-Filter)  Percentage
Age
  18-40                                      84,879              22.590      50,265               20.422
  41-60                                      119,258             31.740      76,195               30.957
  61-80                                      146,924             39.103      101,554              41.261
  80+                                        24,678              6.568       18,114               7.360
Gender
  Female                                     226,859             60.377      149,653              60.803
  Male                                       138,383             36.830      90,402               36.730
  Skips and other responses (Self-Identified) 10,497             2.794       6,073                2.467
Race
  White                                      211,826             56.376      136,692              55.537
  Black                                      71,146              18.935      48,419               19.672
  Asian                                      12,644              3.365       6,779                2.754
  Multiracial                                7,174               1.909       4,369                1.775
  Other                                      6,635               1.766       4,235                1.721
  Unknown                                    66,314              17.649      45,634               18.541
Ethnicity (Self-Identified)
  Not Hispanic/Latino                        300,741             80.040      195,018              79.234
  Hispanic/Latino                            64,831              17.254      44,669               18.149
  Unknown                                    10,167              2.706       6,441                2.617

SDOH = Social Determinants of Health.

Analysis of recorded healthcare activity in EHR vs claims

The counts of healthcare activity found in AoU EHR data and Swoop claims data between 2011–2021 are shown (Fig 2). There are a total of 30 million service dates, 34.6 million diagnosis codes, 17.4 million procedure codes, and 30.7 million national drug codes observed in AoU data for the 251,664 matched patients with EHR data. In comparison, Swoop found 32.4 million (+2.4 million) service dates, 36.4 million (+1.8 million) diagnosis codes, 39.6 million (+22.1 million) procedure codes, and 20.5 million (−10.2 million) drug codes. The counts of service dates and diagnosis codes are relatively equal between the two sources, but procedure codes are much higher in Swoop data, while drug codes are higher in AoU data. When combining the total number of PEMs (adding Dx, Px, and NDCs), Swoop (96.4 million) has roughly 17% more activity counts compared to AoU (82.7 million) for matched patients.

Fig 2. A) The overall number of PEMs between AoU and Swoop data by category. B) The total count of healthcare activities (Dx + Px + NDCs) between AoU and Swoop.

Analysis of recorded healthcare activity in EHR vs claims in shared vs non-shared months

In some months, both AoU EHR data and Swoop claims report healthcare activity for a given patient, while in other months only one of the two sources reports activity. We broke down PEMs into months in which both AoU and Swoop data report healthcare activity and months in which only one of the two data sources reports information (Fig 3). When looking at months in which only AoU or only Swoop reports activity, it is clear that both data sources provide significant amounts of unique data. Much of the information comes from months in which the two data sources do not overlap: 106.5 million (59.4%) counts of activities are reported in months unique to one of the two data sources, while 72.7 million (40.6%) counts occur in months during which both AoU and Swoop observe patient healthcare activity. Examining months in which both AoU (All of Us Shared) and Swoop (Claims Shared) report healthcare activity, we see differences in the number of PEMs reported by each data source despite the activity occurring in the same month. While diagnosis code counts in shared months are similar, Swoop reports nearly 16% more service dates, 75% more Px codes, and 26% fewer NDCs compared to AoU in months in which both data sources observe patient healthcare activity.
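The split into shared and source-unique months can be sketched as a set operation on patient-month keys. The dictionary layout (keys of participant, year, month mapped to an activity count, with a key present only when that source reports activity) and the function name are assumptions for illustration, using toy numbers.

```python
def split_by_overlap(aou, swoop):
    """Split activity counts into the four groups shown in Fig 3.

    aou, swoop: dicts mapping (participant_id, year, month) -> activity count,
    containing a key only when that source reports activity in that month.
    """
    shared = aou.keys() & swoop.keys()  # months seen by both sources
    return {
        "aou_only": sum(v for k, v in aou.items() if k not in shared),
        "aou_shared": sum(aou[k] for k in shared),
        "swoop_shared": sum(swoop[k] for k in shared),
        "swoop_only": sum(v for k, v in swoop.items() if k not in shared),
    }


# Toy data: one shared month, one AoU-only month, one Swoop-only month.
aou = {("p1", 2018, 1): 6, ("p1", 2018, 2): 4}
swoop = {("p1", 2018, 1): 9, ("p1", 2018, 3): 5}

groups = split_by_overlap(aou, swoop)
unique_total = groups["aou_only"] + groups["swoop_only"]
shared_total = groups["aou_shared"] + groups["swoop_shared"]
unique_share = unique_total / (unique_total + shared_total)
print(groups, round(unique_share, 3))
```

The same ratio applied to the reported totals, 106.5 million out of 106.5 + 72.7 million, gives the 59.4% share of activity in source-unique months.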

Fig 3. The amount of All of Us and Swoop PEMs broken down by source in months in which activity is reported by both or only one dataset.

White bar = PEMs in months recorded only by All of Us, White striped bar = PEMs in months in which both All of Us and Claims record activity within All of Us data, Grey Bar = PEMs in months in which both All of Us and Swoop record activity within Swoop data, Black bar = PEMs in months recorded only by Swoop.

Relative contributions by data source for four care categories over time

Breaking down PEMs into years, trends in the relative contribution of AoU and Swoop PEMs are observed (S1 Fig). The contribution of PEMs observed in AoU EHR data makes up most of the activity during the beginning of the observation period from 2011 through 2013. Contribution of activity between Swoop and AoU data stabilizes around 2014 and stays similar throughout the rest of the study period.

Analysis of gain in participant health data using claims for participants without existing EHR data

To see how much data would be gained through the harmonization of Swoop claims data into AoU program data for participants without existing EHR data, the counts of claims events for the 178,331 participants without existing AoU EHR data were aggregated (Fig 4). Overall, a total of 14.8 million service dates, 16 million diagnosis codes, 17.8 million procedure codes, and 9.4 million drug codes could be gained from supplementation of Swoop claims data into AoU participant data for participants that do not currently have EHR data in the Researcher Workbench. At the per-participant level, this amounts to an average of 83 service dates, 90 diagnosis codes, 99 procedure codes, and 53 medications per participant.

Fig 4. The overall number of PEMs found in Swoop claims data for matched AoU participants that do not have EHR data available (n = 178,331).

Adjusted analysis of characteristics associated with differential AoU contributions

We examined the relative richness of AoU EHR data compared to claims data for participants depending on their characteristics, using a dependent variable consisting of differences in the count of PEMs observed in AoU vs Swoop data per category in a linear regression model (diagnosis data as outcome shown in Fig 5; service dates, procedures, and medications shown in S2–S4 Figs). When examining which participant characteristics were associated with richer data in AoU or in Swoop, several trends emerged.

Fig 5. Adjusted linear regression showing association between patient characteristics and greater presence of diagnoses data within AoU vs Swoop data.

Leftward facing data has stronger AoU contribution while rightward facing data has stronger Swoop contribution. HTN = Hypertension, COPD = Chronic Obstructive Pulmonary Disease, Dep: = Deprivation, Geo: = Geography, Ins: = Insurance, Edu: = Education, Inc: = Income, Hpo: = Healthcare Provider Organization, VA = Veterans Association, FQHC = Federally Qualified Health Center, Ethn: = Ethnicity.

Participant race had little influence on the likelihood of greater AoU or Swoop-sided information, although Asian participants and participants with Unknown race had slightly higher AoU-sided information throughout PEM categories. Hispanic patients had consistently higher Swoop-sided information in all PEM categories, with an extra 20–33 Swoop counts of activity per category relative to non-Hispanic participants. Participant age correlated almost linearly with greater Swoop-sided data, with higher age increasingly associated with more claims-sided healthcare activity. Individuals who enrolled through a Veterans Association had much greater AoU-sided information compared to participants that enrolled through other HPO types, having large effect sizes of greater than 200 service days and procedures, over 179 NDCs and around 92 diagnoses.

For variables relating to social determinants of health, increasingly lower education attainment and lower income both correlated with greater claims data relative to AoU data. Relative to participants with private insurance, participants without coverage have more AoU information compared to Swoop while those with Medicaid have the most consistently high representation in Swoop information. Individuals in public insurance categories (public, Medicare, Medicaid) generally had slightly higher claims-sided data compared to those with private insurance in categories outside of NDCs, where the effect is mostly equal. In comparison to participants living in an urban center, participants who lived in suburbs had slightly more claims-sided information in categories outside of procedures, while patients that lived in small town/rural and especially large rural settings had more AoU information. Deprivation index had relatively little association with the relative strength of healthcare data, with participants in the most deprived category (q4) having the greatest differential association compared to participants in q1, with higher (+25) AoU services days, diagnoses (+8), and procedures (+15), and somewhat more Swoop NDCs (+12).

The presence of most chronic conditions was associated with higher AoU-sided information in categories other than procedures, where the presence of several comorbid conditions was associated with higher Swoop-sided data. The presence of a few chronic conditions indicated greater Swoop activity. HIV was the only condition with claims-sided medication data, and COPD and especially Schizophrenia had greater Swoop activity in categories outside of medications.

Discussion

Our study found significant benefit towards data coverage when combining EHR data of AoU participants with their respective claims data. Compared to AoU EHR data, Swoop claims contained roughly 17% more counts of healthcare activity in PEMs overall for matched patients. 59.4% (106.5 million) of total PEMs in the study were found in months that were unique to AoU or Swoop data, suggesting that each data source contains large amounts of information that is not found in the other. The distinctive data of each data asset highlights the need for integrating additional data sources into AoU. Indeed, through the AoU Center for Linkage and Acquisition of Data (CLAD), claims is a data stream being acquired and curated to fill in missingness and enhance research [32]. Additionally, a significant portion of AoU participants have no EHR data recorded, with these participants having the most to gain from linkage to claims (an average of 83 service dates, 90 diagnosis codes, 99 procedure codes, and 53 medications per patient).

We also found differences in the relative strength of the two data sources. Differences in PEM counts in shared months suggest that each dataset is richer in different types of data. Compared to claims, EHR data in AoU has notably increased representation in NDC type information, while claims data is much richer in Procedure codes. This may reflect differences in how data is captured between EHR and claims data. For example, non-prescription medications may not show up in claims data, leading to a higher NDC count in EHRs [33]. Additionally, comparing AoU data to Swoop by year, we see that there is a relative lack of Swoop claims data in 2011 and 2012 compared to later years (S1 Fig).

Our linear regression analysis found several patient factors that are associated with more AoU-sided or claims-sided healthcare activity. Generally, lower income, older, and less educated participants, participants on Medicaid, and participants self-identifying as Hispanic ethnicity had increased claims-sided representation. Participants recruited from the Veterans Association, participants without insurance, and participants living in more rural settings had higher AoU-sided information. The especially striking difference in Veterans Association participants compared to other HPOs may suggest better data coverage in those organizations or decreased tendency to seek care at other institutions. Meanwhile, the greater EHR data coverage in rural participants may be related to increased care options in urban settings leading to higher potential for fragmented care. Large effect sizes towards claims-sided data in older, lower income participants with decreased education attainment, as well as participants self-identifying as Hispanic, may be important for AoU to consider in ensuring data quality in a diverse and equitable way.

Area-level measures of social determinants of health were weakly associated with relative strength of healthcare utilization data in this study, with only individuals in the highest deprivation areas (q4) having somewhat greater AoU-sided information compared to individuals in the least deprived areas. The presence of most comorbid conditions in the EHRs of AoU participants was associated with greater AoU-sided information. In contrast, AoU participants with COPD and especially schizophrenia instead had greater Swoop representation in categories outside of NDCs, while participants with HIV had greater claims information related to medications. The general AoU-sidedness of comorbidity data may be due to the existence of EHR data serving as a marker of high information coverage in the EHRs of these patients. The reason for the reversal of this trend with certain comorbidities such as schizophrenia is unknown. It may be related to an increased tendency toward fragmented care in these patients and requires further investigation.

Improving the quality of EHR data is a major focus of the All of Us Research Program. This work offers the first comprehensive look into the extent of data missingness in AoU EHR data and the potential that linkage to ancillary resources like health insurance claims would provide for program data. Overall, the existence of substantial amounts of information in months covered only in AoU or Swoop, the differential data quantity in shared months in different PEM categories, and the potential gain in healthcare data for participants without EHR provide strong evidence that the AoU EHR data would benefit greatly from linkage to health insurance claims. Deficiencies in relative coverage of AoU EHR data compared to claims we find here based on patient-facing characteristics are important considerations for the AoU research program in ensuring equitable data quality. Further, other factors not considered in this analysis such as relative coverage of EHR data coming from individual HPO enrollment sites may represent useful quality improvement checks for the program in improving data ingestion pipelines. Recently, AoU established the Center for Linkage and Acquisition of Data (CLAD) in order to improve missingness in participant EHR data through linkage to other sources of data [34]. The findings of this work will provide valuable guidance for EHR enrichment efforts like the CLAD to aid AoU in creating a more comprehensive data environment to advance health research.

Limitations

Our data partner Swoop has claims records for 300+ million unique patient journeys [25]. While our match rate suggests that Swoop does contain data for a large proportion of Americans, it is difficult to know what proportion of total healthcare activity for patients is captured by Swoop’s data and what blind spots exist within that data coverage. For one, the analysis in S1 Fig shows that Swoop has gaps in its representation of claims from earlier years, with data contribution between AoU and Swoop evening out around 2013. Per our colleagues at Swoop, the depth of data availability Swoop has for patients is dependent on the extent to which patients’ insurance providers make data available in the market for Swoop to use. While comprehensive, the dataset has some limitations worth noting: (1) it may underrepresent certain populations, particularly those who are uninsured or underinsured; (2) data coverage varies by geographic region and insurance type; (3) claims data inherently captures only billable healthcare encounters, potentially missing care provided through non-traditional channels or cash payments; and (4) clinical details beyond what is required for reimbursement are typically not captured. Furthermore, patients receiving clinical services from retail pharmacies with which Swoop or its data vendors do not have a data sharing agreement will not appear in Swoop’s dataset.

A large proportion of the initially considered 472,877 participants were excluded, mostly due to lack of AoURP EHR data availability, failure to match participants between claims and EHR data, or errors in the tokenization process itself (Fig 1). While we did not identify major differences in participant matching by demographics (Table 1), we acknowledge that hidden biases in AoURP EHR data availability and Swoop claims coverage may impact the study conclusions. An additional limitation comes from our agreement with Swoop, which only allowed us to analyze data at the level of the PEMs, restricting the granularity at which we can examine the data. Since we are unable to compare PEMs at the daily level, we are unable to account for date shifts in recorded activity between EHR data capture and claims records, which may impact the accuracy of shared vs unique months (Fig 3). Per Swoop, however, minor date shifts are infrequent and occur at constant rates; losses across a month boundary are balanced by gains from the preceding month, so monthly PEM totals remain effectively unchanged. This limitation also means we are unable to investigate exactly which codes are driving the differences seen between AoU and Swoop data, because we cannot see healthcare activity data on the Swoop side. In addition to the limitation of PEM resolution, fundamental differences in the way healthcare events are captured between claims and EHR data may complicate the analysis of our results. Another limitation of this research is that the PDR data used in this activity is not generally available to researchers. Nonetheless, it was used because it is a critical resource to guide programmatic activities in the investigation of data missingness and linkage. Finally, this research points to the gaps inherent to both EHR and claims for research purposes. While EHR may contain lab values and narrative data that are missing in claims, claims may conversely fill in important areas of missingness such as procedural information. Taken together, these two sources are complementary and amplify research.
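
The interaction between date shifts and month-level comparison can be illustrated with a minimal sketch. It assumes that a patient event month (PEM) is a (participant, calendar month) pair containing at least one recorded event, which is our simplified reading of the unit used in this study; the function names and the toy dates are hypothetical and are not drawn from the study data.

```python
from datetime import date, timedelta


def to_pems(events):
    """Collapse (participant_id, event_date) records into a set of
    (participant_id, year, month) patient event months."""
    return {(pid, d.year, d.month) for pid, d in events}


def classify_pems(ehr_events, claims_events):
    """Split PEMs into shared, EHR-only, and claims-only sets."""
    ehr, claims = to_pems(ehr_events), to_pems(claims_events)
    return {
        "shared": ehr & claims,
        "ehr_only": ehr - claims,
        "claims_only": claims - ehr,
    }


def shift(events, days):
    """Return a copy of the events with every date moved by `days`."""
    return [(pid, d + timedelta(days=days)) for pid, d in events]


# Hypothetical toy data: one participant, two visits seen by both sources.
ehr = [("p1", date(2019, 3, 14)), ("p1", date(2019, 6, 30))]

# Assume the claims side records the same two visits two days later.
claims = shift(ehr, 2)

result = classify_pems(ehr, claims)
print(result)
```

In this toy case each source still contributes two PEMs, so per-source totals are unchanged, but the visit on 30 June crosses a month boundary and is counted as one EHR-only month and one claims-only month rather than one shared month. The mid-month visit remains shared, which is why only shifts near a boundary affect the shared vs unique classification.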

Supporting information

S1 Fig. The relative proportion of patient event months per data source per year.

(TIF)

pone.0336967.s001.tif (35.9KB, tif)
S2 Fig. Adjusted linear regression showing association between patient characteristics and relative strength of service dates data between AoU vs Swoop.

Leftward facing data has stronger AoU contribution while rightward facing data has stronger Swoop contribution. HTN = Hypertension, COPD = Chronic Obstructive Pulmonary Disease, Dep: = Deprivation, Geo: = Geography, Ins: = Insurance, Edu: = Education, Inc: = Income, Hpo: = Healthcare Provider Organization, VA = Veterans Affairs, FQHC = Federally Qualified Health Center, Ethn: = Ethnicity.

(TIF)

pone.0336967.s002.tif (321KB, tif)
S3 Fig. Adjusted linear regression showing association between patient characteristics and relative strength of procedures data between AoU vs Swoop.

Leftward facing data has stronger AoU contribution while rightward facing data has stronger Swoop contribution. HTN = Hypertension, COPD = Chronic Obstructive Pulmonary Disease, Dep: = Deprivation, Geo: = Geography, Ins: = Insurance, Edu: = Education, Inc: = Income, Hpo: = Healthcare Provider Organization, VA = Veterans Affairs, FQHC = Federally Qualified Health Center, Ethn: = Ethnicity.

(TIF)

pone.0336967.s003.tif (314.3KB, tif)
S4 Fig. Adjusted linear regression showing association between patient characteristics and relative strength of medications data between AoU vs Swoop.

Leftward facing data has stronger AoU contribution while rightward facing data has stronger Swoop contribution. HTN = Hypertension, COPD = Chronic Obstructive Pulmonary Disease, Dep: = Deprivation, Geo: = Geography, Ins: = Insurance, Edu: = Education, Inc: = Income, Hpo: = Healthcare Provider Organization, VA = Veterans Affairs, FQHC = Federally Qualified Health Center, Ethn: = Ethnicity.

(TIF)

pone.0336967.s004.tif (307.1KB, tif)
S1 Table. Table showing effect sizes of regression coefficients used for linear regression analysis between patient characteristics and relative strength of diagnosis data between AoU vs Swoop.

Negative coefficients represent higher AoU contribution while positive coefficients represent stronger Swoop contribution. HTN = Hypertension, COPD = Chronic Obstructive Pulmonary Disease, VA = Veterans Affairs, FQHC = Federally Qualified Health Center.

(CSV)

pone.0336967.s005.csv (6.1KB, csv)
S2 Table. Table showing effect sizes of regression coefficients used for linear regression analysis between patient characteristics and relative strength of service dates data between AoU vs Swoop.

Negative coefficients represent higher AoU contribution while positive coefficients represent stronger Swoop contribution. HTN = Hypertension, COPD = Chronic Obstructive Pulmonary Disease, VA = Veterans Affairs, FQHC = Federally Qualified Health Center.

(CSV)

pone.0336967.s006.csv (6.1KB, csv)
S3 Table. Table showing effect sizes of regression coefficients used for linear regression analysis between patient characteristics and relative strength of procedures data between AoU vs Swoop.

Negative coefficients represent higher AoU contribution while positive coefficients represent stronger Swoop contribution. HTN = Hypertension, COPD = Chronic Obstructive Pulmonary Disease, VA = Veterans Affairs, FQHC = Federally Qualified Health Center.

(CSV)

pone.0336967.s007.csv (6.1KB, csv)
S4 Table. Table showing effect sizes of regression coefficients used for linear regression analysis between patient characteristics and relative strength of medication data between AoU vs Swoop.

Negative coefficients represent higher AoU contribution while positive coefficients represent stronger Swoop contribution. HTN = Hypertension, COPD = Chronic Obstructive Pulmonary Disease, VA = Veterans Affairs, FQHC = Federally Qualified Health Center.

(CSV)

pone.0336967.s008.csv (6.1KB, csv)

Acknowledgments

We gratefully acknowledge All of Us Research Program participants for their contributions, without whom this research would not have been possible. We also thank the National Institutes of Health’s All of Us Research Program for making available the participant data examined in this study. We would also like to acknowledge Swoop for their support in the development of this manuscript.

Data Availability

All of the data used for analysis of this work can be found in our workspace on the All of Us user workbench. This link is now provided (https://workbench.researchallofus.org/workspaces/aou-rw-f6117030/duplicateofevaluatevisitspermonth/about).

Funding Statement

This study was supported by the National Institutes of Health in the form of a grant awarded to M.B. (1OT2OD035404-01) and Vanderbilt University Medical Center in the form of a salary for M.B. The specific roles of this author are articulated in the ‘author contributions’ section. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Investigators AoURP. The “All of Us” research program. NEJM. 2019;381(7):668–76.
  • 2. Program AoUR. Data snapshots; 2024 [cited 2024 Apr 2]. Available from: https://www.researchallofus.org/data-tools/data-snapshots/
  • 3. Cowie MR, Blomster JI, Curtis LH, Duclaux S, Ford I, Fritz F, et al. Electronic health records to facilitate clinical research. Clin Res Cardiol. 2017;106(1):1–9. doi: 10.1007/s00392-016-1025-6
  • 4. Tang AS, Woldemariam SR, Miramontes S, Norgeot B, Oskotsky TT, Sirota M. Harnessing EHR data for health research. Nat Med. 2024;30(7):1847–55. doi: 10.1038/s41591-024-03074-8
  • 5. Berman L, Ostchega Y, Giannini J, Anandan LP, Clark E, Spotnitz M, et al. Application of a data quality framework to ductal carcinoma in situ using electronic health record data from the All of Us Research Program. JCO Clin Cancer Inform. 2024;8:e2400052. doi: 10.1200/CCI.24.00052
  • 6. Spotnitz M, Giannini J, Ostchega Y, Goff SL, Anandan LP, Clark E, et al. Assessing the data quality dimensions of partial and complete mastectomy cohorts in the All of Us Research Program: cross-sectional study. JMIR Cancer. 2025;11:e59298. doi: 10.2196/59298
  • 7. Ramirez AH, Sulieman L, Schlueter DJ, Halvorson A, Qian J, Ratsimbazafy F. The All of Us Research Program: data quality, utility, and diversity. Patterns. 2022;3(8).
  • 8. Program AoUR. Consent to Join the All of Us Research Program; 2021. Available from: https://allofus.nih.gov/sfsites/c/resource/aouConsenttoJoinAoUEnglish
  • 9. Mayo KR, Basford MA, Carroll RJ, Dillon M, Fullen H, Leung J, et al. The All of Us data and research center: creating a secure, scalable, and sustainable ecosystem for biomedical research. Annu Rev Biomed Data Sci. 2023;6:443–64. doi: 10.1146/annurev-biodatasci-122120-104825
  • 10. Klann JG, Joss MAH, Embree K, Murphy SN. Data model harmonization for the All of Us Research Program: transforming i2b2 data into the OMOP common data model. PLoS One. 2019;14(2):e0212463. doi: 10.1371/journal.pone.0212463
  • 11. Kern LM, Bynum JPW, Pincus HA. Care fragmentation, care continuity, and care coordination-how they differ and why it matters. JAMA Intern Med. 2024;184(3):236–7. doi: 10.1001/jamainternmed.2023.7628
  • 12. Atasoy H, Demirezen EM, Chen P. Impacts of patient characteristics and care fragmentation on the value of HIEs. Prod Oper Manag. 2021;30(2):563–83. doi: 10.1111/poms.13281
  • 13. Kho AN, Cashy JP, Jackson KL, Pah AR, Goel S, Boehnke J, et al. Design and implementation of a privacy preserving electronic health record linkage tool in Chicago. J Am Med Inform Assoc. 2015;22(5):1072–80. doi: 10.1093/jamia/ocv038
  • 14. Kho AN, Yu J, Bryan MS, Gladfelter C, Gordon HS, Grannis S, et al., editors. Privacy-preserving record linkage to identify fragmented electronic medical records in the All of Us Research Program. In: Machine learning and knowledge discovery in databases. Cham: Springer International Publishing; 2020.
  • 15. Konrad R, Zhang W, Bjarndóttir M, Proaño R. Key considerations when using health insurance claims data in advanced data analyses: an experience report. Health Syst (Basingstoke). 2019;9(4):317–25. doi: 10.1080/20476965.2019.1581433
  • 16. Ferver K, Burton B, Jesilow P. The use of claims data in healthcare research. 2009.
  • 17. Devoe JE, Gold R, McIntire P, Puro J, Chauvie S, Gallia CA. Electronic health records vs Medicaid claims: completeness of diabetes preventive care data in community health centers. Ann Fam Med. 2011;9(4):351–8. doi: 10.1370/afm.1279
  • 18. Gianfrancesco MA, Goldstein ND. A narrative review on the validity of electronic health record-based research in epidemiology. BMC Med Res Methodol. 2021;21(1):234. doi: 10.1186/s12874-021-01416-5
  • 19. Guo A, Foraker R, White P, Chivers C, Courtright K, Moore N. Using electronic health records and claims data to identify high-risk patients likely to benefit from palliative care. Am J Manag Care. 2021;27(1):e7–15. doi: 10.37765/ajmc.2021.88578
  • 20. Pandya CJ, Hatef E, Wu J, Richards T, Weiner JP, Kharrazi H. Impact of social needs in electronic health records and claims on health care utilization and costs risk-adjustment models within medicaid population. Popul Health Manag. 2022;25(5):658–68. doi: 10.1089/pop.2022.0069
  • 21. Kharrazi H, Chi W, Chang H-Y, Richards TM, Gallagher JM, Knudson SM, et al. Comparing population-based risk-stratification model performance using demographic, diagnosis and medication data extracted from outpatient electronic health records versus administrative claims. Med Care. 2017;55(8):789–96. doi: 10.1097/MLR.0000000000000754
  • 22. West SL, Johnson W, Visscher W, Kluckman M, Qin Y, Larsen A. The challenges of linking health insurer claims with electronic medical records. Health Inform J. 2014;20(1):22–34. doi: 10.1177/1460458213476506
  • 23. Ehrenstein V, Kharrazi H, Lehmann H, Taylor CO. Obtaining data from electronic health records. In: Tools and technologies for registry interoperability, registries for evaluating patient outcomes: a user’s guide. 3rd ed, Addendum 2 [Internet]. Agency for Healthcare Research and Quality (US); 2019.
  • 24. Yang Y, Rodriguez K, Basford M, Nambiar S, Berman L, Kho A. Ancillary data record linkage to characterize the completeness of data for the All of Us Research Program. IJPDS. 2022;7(3). doi: 10.23889/ijpds.v7i3.2090
  • 25. Swoop. Faq; 2024 [cited 2024 Nov 8]. Available from: https://www.swoop.com/faq
  • 26. NationalArchives. The Soundex Indexing System; 2024. Available from: https://www.archives.gov/research/census/soundex
  • 27. Goodman RA, Posner SF, Huang ES, Parekh AK, Koh HK. Defining and measuring chronic conditions: imperatives for research, policy, program, and practice. Prev Chronic Dis. 2013;10:E66. doi: 10.5888/pcd10.120239
  • 28. Kind AJH, Buckingham WR. Making neighborhood-disadvantage metrics accessible - the neighborhood atlas. N Engl J Med. 2018;378(26):2456–8. doi: 10.1056/NEJMp1802313
  • 29. Health UoWSoMaP. Area Deprivation Index V4.0.1; 2021 [cited 2024 Apr 3]. Available from: https://www.neighborhoodatlas.medicine.wisc.edu/
  • 30. U.S. Department of Agriculture ERS. Rural-urban commuting area codes [cited 2024 Apr 3]. Available from: https://www.ers.usda.gov/data-products/rural-urban-commuting-area-codes/
  • 31. Hailu A, Wassermanh C. Guidelines for using rural-urban classification systems for community health assessment. Washington State Department of Health; 2016.
  • 32. Program AoUR. Center for linkage and acquisition of data; 2025 [cited 31 Mar 2025]. Available from: https://allofus.nih.gov/article/center-for-linkage-and-aquisition-of-data
  • 33. Wilson J, Bock A. The benefit of using both claims data and electronic medical record data in health care analysis. Optum Insight. 2012;1:1–4.
  • 34. Program AoUR. All of Us Research Program establishes new center for linkage and acquisition of data; 2023. Available from: https://allofus.nih.gov/news-events/announcements/all-us-research-program-establishes-new-center-linkage-and-acquisition-data

Decision Letter 0

Sreeram V Ramagopalan

8 Jul 2025

Dear Dr. Yang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Aug 22 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Sreeram V. Ramagopalan

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. Please provide additional details regarding participant consent. In the ethics statement in the Methods and online submission information, please ensure that you have specified (1) whether consent was informed and (2) what type you obtained (for instance, written or verbal, and if verbal, how it was documented and witnessed). If your study included minors, state whether you obtained consent from parents or guardians. If the need for consent was waived by the ethics committee, please include this information.

If you are reporting a retrospective study of medical records or archived samples, please ensure that you have discussed whether all data were fully anonymized before you accessed them and/or whether the IRB or ethics committee waived the requirement for informed consent. If patients provided informed written consent to have data from their medical records used in research, please include this information.

3. Thank you for stating the following in the Competing Interests section:

“Abel Kho is an advisor of the company Datavant.”

We note that one or more of the authors are employed by a commercial company: Datavant

a. Please provide an amended Funding Statement declaring this commercial affiliation, as well as a statement regarding the Role of Funders in your study. If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study. You can update author roles in the Author Contributions section of the online submission form.

Please also include the following statement within your amended Funding Statement.

“The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.”

If your commercial affiliation did play a role in your study, please state and explain this role within your updated Funding Statement.

b. Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc.

Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests) . If this adherence statement is not accurate and there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include both an updated Funding Statement and Competing Interests Statement in your cover letter. We will change the online submission form on your behalf.

4. Please note that your Data Availability Statement is currently missing [the repository name and/or the DOI/accession number of each dataset OR a direct link to access each database]. If your manuscript is accepted for publication, you will be asked to provide these details on a very short timeline. We therefore suggest that you provide this information now, though we will not hold up the peer review process if you are unable.

5. Please remove all personal information, ensure that the data shared are in accordance with participant consent, and re-upload a fully anonymized data set.

Note: spreadsheet columns with personal information must be removed and not hidden as all hidden columns will appear in the published file.

Additional guidance on preparing raw data for publication can be found in our Data Policy (https://journals.plos.org/plosone/s/data-availability#loc-human-research-participant-data-and-other-sensitive-data) and in the following article: http://www.bmj.com/content/340/bmj.c181.long.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

Reviewer #1: This manuscript presents a timely and relevant investigation into the completeness of EHR data within the All of Us Research Program, through linkage with data provided by Swoop. The scale of the dataset and the analytical approach are commendable and contribute meaningfully to the field of data integration in population health research.

Suggestions for Improvement

The introduction would benefit from a clearer explanation of what Swoop is, how it obtains its data, and the scope and limitations of its claims dataset. This context is essential for readers to assess the representativeness and reliability of the claims data used in the study.

The manuscript should more explicitly explain the structural and functional differences between EHR and claims data. This includes their respective purposes, typical content, and known limitations. Such clarification would help readers understand why certain types of information are more prevalent in one source than the other.

It would be helpful to briefly discuss why EHR and claims data are not routinely integrated in the U.S. healthcare system, touching on technical, legal, and institutional barriers. This would provide important context for the significance of the authors’ linkage efforts.

The manuscript states that 472,877 participants were considered for linkage, but only 246,128 were included in the final analysis. This represents a substantial reduction that should be clearly stated.

The authors should estimate and report the volume of diagnostic, procedural, and medication data that was not utilized due to unmatched or excluded participants. These figures should be critically discussed in terms of potential data loss, bias, and the impact on the study’s conclusions.

While the manuscript includes chronic conditions as covariates in regression models, it does not explore the completeness of data for these subgroups in detail. Given the importance of chronic disease populations in health research, the authors are encouraged to provide more granular analysis or discussion on how data completeness varies by condition. This would be particularly valuable for assessing equity and data quality across clinically vulnerable groups.

The percentage values in Table 1 currently display inconsistent decimal places. For clarity and professionalism, it is recommended that all percentages be presented with a uniform number of decimal places.

Reviewer #2: This study makes a valuable contribution by providing practical insights into maximizing patient-level data availability for the research of secondary use of data. However, several critical points require clarification to strengthen the manuscript's methodological rigor and practical applicability.

Point 1.

The manuscript appears to assume that duplicate data between EHR and claims sources were identified and excluded from analysis, particularly given the use of patient-event months (PEMs) as the unit of comparison. However, the methods lack a clear definition of what constitutes a duplicate event across databases (e.g., same patient, same month, same code), nor do they detail how such duplicates were detected and managed.

How were duplicate events defined between EHR and claim data? What criteria were used to identify duplicate diagnoses, procedures, and medications across EHR and claims data? How were potential legitimate duplicates (e.g., bilateral procedures, multiple prescriptions) distinguished from true redundancies?

Point 2

While the manuscript demonstrates the quantitative gain in data elements through claims linkage, it does not sufficiently discuss the specific types of research questions or clinical scenarios where claims data integration is most impactful. The authors should elaborate on concrete use cases where the breadth and continuity of claims data provide unique advantages over EHR data alone. Articulating these scenarios would help readers understand the practical value of data expansion in the research using secondary use of data.

Point 3

The study equates an increase in the number of captured data elements with improved “completeness.” In data quality frameworks, completeness refers to the absence of missing values within captured records, not the comprehensiveness of data sources. Clearly distinguish between: (a) completeness (non-missing values in existing records), (b) coverage (proportion of all healthcare events captured), and (c) comprehensiveness (breadth of data sources/types)

Point 4

The manuscript provides insufficient detail about how heterogeneous data from EHR and claims sources were harmonized for comparison. Given the inherent differences in data structure, coding systems, temporal granularity, and semantic representations between EHR and claims data, the manuscript should clearly describe the harmonization process used to enable valid comparisons. What mapping strategies, code translations, or aggregation rules were applied to ensure that events from both sources were comparable at the PEM level? Were there limitations in mapping certain codes or event types? How were discrepancies in coding standards addressed? Transparent reporting of these harmonization steps is essential for reproducibility and for interpreting the validity of the comparative analyses.

Point 5

The manuscript clearly states that the study received approval from the relevant Institutional Review Boards and that written informed consent was obtained from participants. Furthermore, the process of privacy-preserving record linkage using Datavant software—whereby personal identifiers are tokenized and encrypted prior to linkage—appears to be well described. The authors also note that researchers did not have access to identifiable information at any stage, and that all analyses were conducted within a secure research environment. However, I respectfully suggest that the manuscript could be strengthened by providing additional detail regarding the specific content of the participant consent forms, particularly regarding the scope of data linkage and secondary data use. Additionally, could you elaborate on the legal framework supporting this data linkage?

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy

Reviewer #1: Yes: António da Luz Pereira

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

PLoS One. 2026 Jan 23;21(1):e0336967. doi: 10.1371/journal.pone.0336967.r002

Author response to Decision Letter 1


13 Aug 2025

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

- We have reviewed the style templates and have adjusted figures/tables/in-text citations to reflect these guidelines.

2. Please provide additional details regarding participant consent. In the ethics statement in the Methods and online submission information, please ensure that you have specified (1) whether consent was informed and (2) what type you obtained (for instance, written or verbal, and if verbal, how it was documented and witnessed). If your study included minors, state whether you obtained consent from parents or guardians. If the need for consent was waived by the ethics committee, please include this information.

If you are reporting a retrospective study of medical records or archived samples, please ensure that you have discussed whether all data were fully anonymized before you accessed them and/or whether the IRB or ethics committee waived the requirement for informed consent. If patients provided informed written consent to have data from their medical records used in research, please include this information.

- Additional information about consent has been included in the ethics statement and Methods section.

3. Thank you for stating the following in the Competing Interests section:

“Abel Kho is an advisor of the company Datavant.”

We note that one or more of the authors are employed by a commercial company: Datavant

a. Please provide an amended Funding Statement declaring this commercial affiliation, as well as a statement regarding the Role of Funders in your study. If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study. You can update author roles in the Author Contributions section of the online submission form.

Please also include the following statement within your amended Funding Statement.

“The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.”

If your commercial affiliation did play a role in your study, please state and explain this role within your updated Funding Statement.

b. Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc.

Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests) . If this adherence statement is not accurate and there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include both an updated Funding Statement and Competing Interests Statement in your cover letter. We will change the online submission form on your behalf.

- An updated Funding Statement and Competing Interests Statement have been added to the cover letter.

4. Please note that your Data Availability Statement is currently missing [the repository name and/or the DOI/accession number of each dataset OR a direct link to access each database]. If your manuscript is accepted for publication, you will be asked to provide these details on a very short timeline. We therefore suggest that you provide this information now, though we will not hold up the peer review process if you are unable.

- This link is now provided (https://workbench.researchallofus.org/workspaces/aou-rw-f6117030/duplicateofevaluatevisitspermonth/about)

5. Please remove all personal information, ensure that the data shared are in accordance with participant consent, and re-upload a fully anonymized data set.

Note: spreadsheet columns with personal information must be removed and not hidden as all hidden columns will appear in the published file.

Additional guidance on preparing raw data for publication can be found in our Data Policy (https://journals.plos.org/plosone/s/data-availability#loc-human-research-participant-data-and-other-sensitive-data) and in the following article: http://www.bmj.com/content/340/bmj.c181.long.

- To keep AoURP data within the AoURP network (and because of upload size restrictions on the PLOS One portal), we have added a link in point 4 to our data analysis workspace, which contains a fully anonymized dataset.

The introduction would benefit from a clearer explanation of what Swoop is, how it obtains its data, and the scope and limitations of its claims dataset. This context is essential for readers to assess the representativeness and reliability of the claims data used in the study.

- We thank the reviewer for raising this point and agree that additional context surrounding Swoop is needed in the manuscript introduction. We have included new information surrounding Swoop (a company which provides precision healthcare omnichannel solutions) and the source of its claims dataset (multiple primary sources including pharmacy benefit managers, clearinghouses, and payer organizations). We have also included additional information about the limitations of its claims dataset [“(1) it may underrepresent certain populations, particularly those who are uninsured or underinsured; (2) data coverage varies by geographic region and insurance type; (3) claims data inherently captures only billable healthcare encounters, potentially missing care provided through non-traditional channels or cash payments; and (4) clinical details beyond what is required for reimbursement are typically not captured”] which has been added to the limitations section of the study.

The manuscript should more explicitly explain the structural and functional differences between EHR and claims data. This includes their respective purposes, typical content, and known limitations. Such clarification would help readers understand why certain types of information are more prevalent in one source than the other.

- We have added a section to the introduction (“Use of Insurance Claims to Supplement EHR data”) explaining the role of insurance claims in population health research and the differences between EHR and claims data.

It would be helpful to briefly discuss why EHR and claims data are not routinely integrated in the U.S. healthcare system, touching on technical, legal, and institutional barriers. This would provide important context for the significance of the authors’ linkage efforts.

- Context on the barriers that make EHR and claims data difficult to integrate has been added to “Use of Insurance Claims to Supplement EHR data”.

The manuscript states that 472,877 participants were considered for linkage, but only 246,128 were included in the final analysis. This represents a substantial reduction that should be clearly stated.

- We have adjusted the language in the Materials and Methods section to reflect the final analysis population rather than the participants initially considered. The percentage reduction is now stated explicitly in the “Linkage Evaluation” section. That section also now explains why participants did not have EHR data available within AoURP data: they did not consent to release their EHR information, ingestion of EHR information from partner healthcare provider organizations into AoURP lagged, or they enrolled online without providing EHR information.

The authors should estimate and report the volume of diagnostic, procedural, and medication data that was not utilized due to unmatched or excluded participants. These figures should be critically discussed in terms of potential data loss, bias, and the impact on the study’s conclusions.

- We appreciate this point by the reviewer and agree that understanding the reasons behind potential data loss and the impact of data loss on the study conclusions is important. To understand data loss properly, we want to break down the major sources of data loss described in Figure 1.

1. A large majority of the participants not included in our study were filtered out due to a lack of EHR data in the AoURP platform (n = 178,331). This occurs when participants have not provided a HIPAA authorization allowing their EHR data to be shared with the program, when receipt of participant EHR data from partner Health Provider Organizations lags, or when participants joined the program digitally with no available EHR data. Similarly, data loss due to a lack of SDOH data within AoURP data (n = 5,526) occurs when participants have not filled out the surveys from which SDOH data are sourced.

2. A smaller proportion of participants were excluded due to a failure to link the participant’s EHR and claims data (n = 23,313). Non-matching at this step can arise from two sources: 1) the AoURP participant’s information is not contained within Swoop’s claims data, or 2) first/last name, sex at birth, or date of birth differ between the two datasets, as these are the characteristics used to generate Token 1 and Token 2 during matching. To assess potential bias in matching, we include Table 1, which shows that the patient demographic breakdowns before and after exclusion are similar, suggesting that data loss due to a lack of match between EHR and claims data does not localize to a particular group (at least from a demographic standpoint).

3. The final source of data loss occurred during the tokenization process itself (Same Token Pair Assigned to Different Individuals, n = 16,594, or Tokenization Error, n = 2,167). These errors occur when 1) fields from which a token is generated (such as first name, last name, or date of birth) are missing in either input dataset, or 2) two or more participants share the exact same inputs from which Token 1 and Token 2 are generated (the exact same name and date of birth).

We prefer to include a breakdown of the total number of participants “lost” at each step rather than a breakdown of non-utilized data, as most of these participants fall under condition #1, meaning they have no EHR data available to report. Additional context on these sources of data exclusion has been added to the manuscript under “Linkage Evaluation” and “Limitations”.
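The two tokenization failure modes in item 3 can be illustrated with a minimal Python sketch. It is not the tokenization software used in the study: the token recipes, the SHA-256 hashing, the field names, and the helpers `make_token_pair` and `classify` are all assumptions made for illustration. The only detail taken from the response is that Token 1 and Token 2 are derived from first/last name, sex at birth, and date of birth.

```python
import hashlib
from collections import defaultdict

# Fields the response says the tokens are derived from (key names are hypothetical).
REQUIRED = ("first_name", "last_name", "sex_at_birth", "date_of_birth")


def make_token_pair(record):
    """Return a hypothetical (token1, token2) pair, or None if a field is missing."""
    if any(not record.get(field) for field in REQUIRED):
        return None  # corresponds to a "Tokenization Error"
    norm = {field: record[field].strip().lower() for field in REQUIRED}
    # Assumed recipes: the real token definitions are not given in this response.
    t1 = "|".join([norm["last_name"], norm["first_name"][0],
                   norm["sex_at_birth"], norm["date_of_birth"]])
    t2 = "|".join([norm["last_name"], norm["first_name"],
                   norm["sex_at_birth"], norm["date_of_birth"]])
    return tuple(hashlib.sha256(s.encode("utf-8")).hexdigest() for s in (t1, t2))


def classify(records):
    """Sort participant IDs into usable, colliding, and failed tokenizations."""
    by_pair = defaultdict(list)
    errors = []
    for pid, rec in records.items():
        pair = make_token_pair(rec)
        if pair is None:
            errors.append(pid)
        else:
            by_pair[pair].append(pid)
    usable = sorted(ids[0] for ids in by_pair.values() if len(ids) == 1)
    # Same token pair assigned to different individuals: neither can be linked safely.
    collision = sorted(pid for ids in by_pair.values() if len(ids) > 1 for pid in ids)
    return {"usable": usable, "collision": collision, "error": sorted(errors)}


# Made-up participants: p1 and p2 share identical inputs, p4 lacks a first name.
participants = {
    "p1": {"first_name": "Ana", "last_name": "Diaz",
           "sex_at_birth": "F", "date_of_birth": "1980-01-02"},
    "p2": {"first_name": " ana", "last_name": "DIAZ",
           "sex_at_birth": "F", "date_of_birth": "1980-01-02"},
    "p3": {"first_name": "Ben", "last_name": "Lee",
           "sex_at_birth": "M", "date_of_birth": "1975-05-06"},
    "p4": {"first_name": "", "last_name": "Kim",
           "sex_at_birth": "F", "date_of_birth": "1990-09-09"},
}
result = classify(participants)
```

In this toy input only `p3` would remain eligible for linkage, which mirrors why both kinds of records are excluded rather than matched.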

While the manuscript includes chronic conditions as covariates in regression models, it does not explore the completeness of data for these subgroups in detail. Given the importance of chronic disease populations in health research, the authors are encouraged to provide more granular analysis or discussion on how data completeness varies by condition. This would be particularly valuable for assessing equity and data quality across clinically vulnerable groups.

- We thank the reviewer for raising this point and agree that more granular analysis is important in assessing differences in data completeness by patient condition, as well as their causes from a health equity perspective. As the reviewer notes, we take a preliminary look at this using an adjusted multiple regression model examining some of the most common chronic conditions, patient demographics, and the indicators of socio-economic deprivation that were available to us. This manuscript is a primary investigation with two major goals: 1) to gain insight into differences in data characteristics between claims and EHR data in a large (~250K), diverse, multi-year retrospective cohort, something that is seldom done on a large scale due to difficulties in data integration between the two sources, and 2) to provide internal benefit to the AoURP research userbase and leadership in better understanding the strengths and weaknesses of program EHR data, and whether program data would benefit from integration with secondary sources such as claims. While we agree with the reviewer that deeper analysis into why some groups of patients may have more EHR or claims data than others would be valuable, it is not within the primary objectives of this manuscript, and we believe that analysis beyond what is provided in Figure 5 represents a substantial effort that is better served as a focus of future work by our group. Further, at this stage we can only speculate on the reasons behind relative deficiencies in claims and EHR data comprehensiveness for patients of different groups. Internal investigations by the AoURP, based on the initial findings in this paper, will be required to better understand the causes of these differences and whether they constitute disparities from a health equity standpoint.

The percentage values in Table 1 currently display inconsistent decimal places. For clarity and professionalism, it is recommended that all percentages be presented with a uniform number of decimal places.

-We appreciate the reviewer’s attention in identifying this stylistic inconsistency and have adjusted the percentage values accordingly.

Reviewer #2: This study makes a valuable contribution by providing practical insights into maximizing patient-level data availability for the research of secondary use of data. However, several critical points require clarification to strengthen the manuscript's methodological rigor and practical applicability.

Point 1.

The manuscript appears to assume that duplicate data between EHR and claims sources were identified and excluded from analysis, particularly given the use of patient-event months (PEMs) as the unit of comparison. However, the methods lack a clear definition of what constitutes a duplicate event across databases (e.g., same patient, same month, same code), nor do they detail how such duplicates were detected and managed.

How were duplicate events defined between EHR and claim data? What criteria were used to identify duplicate diagnoses, procedures, and medications across EHR and claims data? How were potential legitimate duplicates (e.g., bilateral procedures, multiple prescriptions) distinguished from true redundancies?

- To clarify, our study design intentionally did not exclude data duplicated between the AoU and Swoop datasets. Per agreements between AoURP and Swoop, data analysis was limited to the PEM level, and codes (ICD, NDC, CPT) were not directly compared between the two organizations. As we mention in our limitations section, this reduces the granularity with which we can identify truly unique data coming from either source. Instead, uniqueness is implied by healthcare interactions occurring on separate days across the two data sources, leading to differences in PEM counts in either dataset. Uniqueness is further supported by the analysis in Figure 3, which shows the high percentage of PEMs coming from months in which one data source reports events while the other does not.
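A PEM-level comparison of this kind can be sketched in Python as follows. This is an illustration only and not the study's analysis code. It assumes that a patient-event month is a distinct (participant, calendar month) pair in which a source records at least one event of a given type. The function names and the example events are hypothetical.

```python
from datetime import date


def patient_event_months(events):
    """Collapse (participant_id, event_date) rows to distinct participant-months."""
    return {(pid, d.year, d.month) for pid, d in events}


def compare_sources(ehr_events, claims_events):
    """Count PEMs seen only in EHR data, only in claims data, or in both."""
    ehr = patient_event_months(ehr_events)
    claims = patient_event_months(claims_events)
    return {
        "ehr_only": len(ehr - claims),
        "claims_only": len(claims - ehr),
        "both": len(ehr & claims),
    }


# Made-up events; no diagnosis, procedure, or drug codes are compared,
# only whether each source reports any event in a given month.
ehr_events = [
    ("a", date(2019, 3, 5)),
    ("a", date(2019, 3, 20)),  # same month as above, so still one PEM
    ("a", date(2019, 7, 1)),
]
claims_events = [
    ("a", date(2019, 3, 6)),   # one-day offset from the EHR date, same PEM
    ("a", date(2019, 4, 2)),
    ("b", date(2020, 1, 15)),
]
summary = compare_sources(ehr_events, claims_events)
```

Because events are collapsed to calendar months, a one-day offset within the same month does not change the PEM counts, whereas an offset that crosses a month boundary would.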

Swoop has also addressed potential temporal discrepancies of the same duplicate data (such as the same ICD code being reported on a Tuesday in Swoop vs Wednesday in EHR data, leading to false differences in PEM counts between the two datasets). Prior analysis by Swoop confirms that occasional minor date shifts in health records from admini

Attachment

Submitted filename: Response To Reviewers_Final.docx

pone.0336967.s010.docx (35.5KB, docx)

Decision Letter 1

Sreeram V Ramagopalan

4 Nov 2025

Investigation into EHR data coverage in the All of Us Research Program via linkage to health insurance claims

PONE-D-25-23520R1

Dear Dr. Yang,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Sreeram V. Ramagopalan

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #2: Yes

**********

Reviewer #2: The authors have thoughtfully and thoroughly addressed the concerns raised in the initial review, and the revised version represents a clear advancement in quality. I am pleased to recommend acceptance of the manuscript.

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********

Acceptance letter

Sreeram V Ramagopalan

PONE-D-25-23520R1

PLOS One

Dear Dr. Yang,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS One. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Sreeram V. Ramagopalan

Academic Editor

PLOS One

Associated Data

    Supplementary Materials

    S1 Fig. The relative proportion of patient event months per data source per year.

    (TIF)

    pone.0336967.s001.tif (35.9KB, tif)
    S2 Fig. Adjusted linear regression showing association between patient characteristics and relative strength of service dates data between AoU vs Swoop.

    Leftward facing data has stronger AoU contribution while rightward facing data has stronger Swoop contribution. HTN = Hypertension, COPD = Chronic Obstructive Pulmonary Disease, Dep: = Deprivation, Geo: = Geography, Ins: = Insurance, Edu: = Education, Inc: = Income, Hpo: = Healthcare Provider Organization, VA = Veterans Association, FQHC = Federally Qualified Health Center, Ethn: = Ethnicity.

    (TIF)

    pone.0336967.s002.tif (321KB, tif)
    S3 Fig. Adjusted linear regression showing association between patient characteristics and relative strength of procedures data between AoU vs Swoop.

    Leftward facing data has stronger AoU contribution while rightward facing data has stronger Swoop contribution. HTN = Hypertension, COPD = Chronic Obstructive Pulmonary Disease, Dep: = Deprivation, Geo: = Geography, Ins: = Insurance, Edu: = Education, Inc: = Income, Hpo: = Healthcare Provider Organization, VA = Veterans Association, FQHC = Federally Qualified Health Center, Ethn: = Ethnicity.

    (TIF)

    pone.0336967.s003.tif (314.3KB, tif)
    S4 Fig. Adjusted linear regression showing association between patient characteristics and relative strength of medications data between AoU vs Swoop.

    Leftward facing data has stronger AoU contribution while rightward facing data has stronger Swoop contribution. HTN = Hypertension, COPD = Chronic Obstructive Pulmonary Disease, Dep: = Deprivation, Geo: = Geography, Ins: = Insurance, Edu: = Education, Inc: = Income, Hpo: = Healthcare Provider Organization, VA = Veterans Association, FQHC = Federally Qualified Health Center, Ethn: = Ethnicity.

    (TIF)

    pone.0336967.s004.tif (307.1KB, tif)
    S1 Table. Table showing effect sizes of regression coefficients used for linear regression analysis between patient characteristics and relative strength of diagnosis data between AoU vs Swoop.

    Negative coefficients represent higher AoU contribution while positive coefficients represent stronger Swoop contribution. HTN = Hypertension, COPD = Chronic Obstructive Pulmonary Disease, VA = Veterans Association, FQHC = Federally Qualified Health Center.

    (CSV)

    pone.0336967.s005.csv (6.1KB, csv)
    S2 Table. Table showing effect sizes of regression coefficients used for linear regression analysis between patient characteristics and relative strength of service dates data between AoU vs Swoop.

    Negative coefficients represent higher AoU contribution while positive coefficients represent stronger Swoop contribution. HTN = Hypertension, COPD = Chronic Obstructive Pulmonary Disease, VA = Veterans Association, FQHC = Federally Qualified Health Center.

    (CSV)

    pone.0336967.s006.csv (6.1KB, csv)
    S3 Table. Table showing effect sizes of regression coefficients used for linear regression analysis between patient characteristics and relative strength of procedures data between AoU vs Swoop.

    Negative coefficients represent higher AoU contribution while positive coefficients represent stronger Swoop contribution. HTN = Hypertension, COPD = Chronic Obstructive Pulmonary Disease, VA = Veterans Association, FQHC = Federally Qualified Health Center.

    (CSV)

    pone.0336967.s007.csv (6.1KB, csv)
    S4 Table. Table showing effect sizes of regression coefficients used for linear regression analysis between patient characteristics and relative strength of medication data between AoU vs Swoop.

    Negative coefficients represent higher AoU contribution while positive coefficients represent stronger Swoop contribution. HTN = Hypertension, COPD = Chronic Obstructive Pulmonary Disease, VA = Veterans Association, FQHC = Federally Qualified Health Center.

    (CSV)

    pone.0336967.s008.csv (6.1KB, csv)
    Attachment

    Submitted filename: Response To Reviewers_Final.docx

    pone.0336967.s010.docx (35.5KB, docx)

    Data Availability Statement

    All of the data used for analysis of this work can be found in our workspace on the All of Us user workbench. This link is now provided (https://workbench.researchallofus.org/workspaces/aou-rw-f6117030/duplicateofevaluatevisitspermonth/about).


    Articles from PLOS One are provided here courtesy of PLOS
