Abstract
Electronic health records (EHRs) offer the potential to study large numbers of patients but are designed for clinical practice, not research. Despite increasing availability, utilizing EHR data for research comes with its own set of challenges. In this paper, we describe some important considerations and potential solutions for commonly encountered problems when working with large-scale, EHR-derived data for health services and community-relevant health research. Specifically, using EHR data requires the researcher to define the relevant patient subpopulation, reliably identify the primary care provider, recognize the EHR as containing episodic (i.e., unstructured longitudinal) data, account for changes in health system composition and treatment options over time, understand that the EHR is not always well-organized and accurate, design methods to identify the same patient across multiple health systems, account for the enormous size of the EHR, and consider barriers to data access. Associations found in the EHR may be non-representative of associations in the general population, but a clear understanding of the EHR-based associations can be enormously valuable to the process of improving outcomes for patients in learning health care systems.
In the context of building two large-scale EHR-derived data sets for health services research, we describe the potential pitfalls of EHR data and propose some solutions for those planning to use EHR data in their research. As ever greater amounts of clinical data are amassed in the EHR, use of these data for research will become increasingly common and important. Attention to the intricacies of EHR data will allow for more informed analysis and interpretation of results from EHR-based data sets.
Health researchers increasingly have access to data resources from electronic health records (EHRs). The Office of the National Coordinator for Health Information Technology identified 11 advantages of EHRs, including providing accurate information about patients, improved safety, and improved data security.1 For researchers, EHRs offer the potential to study large numbers of patients, potentially with minimal resources. A wealth of national data sources exist based on clinical claims and surveys,2–7 but they suffer from important limitations. Many are cross-sectional, preventing longitudinal interpretations. Most do not allow for studies of physician behavior, as each physician may have only a few patients in the data set. Important variables may be inaccurate or not collected. Sample sizes tend to be small, and although findings may be nationally representative, analyses requiring large numbers of patients may be underpowered. Administrative claims data are often limited to specific populations (e.g., Medicare for ages ≥65 years) and lack clinical information, such as smoking and obesity. Local EHR data potentially address many of these shortcomings and help inform local disease surveillance and public health.
However, as with claims data, the EHR is designed for clinical practice rather than research. Researchers wishing to analyze longitudinal patient data may not know the underlying data structure (table/variable names), the evolution of clinical ontology, or appropriate patient samples, and may lack permission or skills to access data directly. Researchers generally need to involve information technology (IT) personnel who lack clinical knowledge, frequently resulting in miscommunication. For example, IT staff might not realize the difference between “hemoglobin” and “hemoglobin A1c”, resulting in errors. Although some health systems, such as Veterans Affairs and Kaiser Permanente, have long-established protocols for EHR-based research, extracting longitudinal clinical data remains difficult in most health systems. Additionally, once data are obtained, biostatistical methods may differ from those for administrative claims. Complex issues arise from cohort extraction, irregularly-spaced health care encounters, and utilization data limited to specific health care systems.
Our group recently created two large-scale, EHR-derived data sets to facilitate a combination of descriptive, predictive, geospatial and longitudinal studies in health services research. In this article, we describe the process and highlight pitfalls of constructing such data sets, and offer alternatives of how we addressed these pitfalls. We include methodological decisions that individuals at any institution should consider for successful development of an EHR data set, which otherwise may be easily overlooked by researchers accustomed to administrative claims and nationally-representative data.
Description of the EHR data sets
The Cleveland Clinic Medicine Institute Primary Care Registry was created to facilitate the use of EHR data for health services research. The registry contains 10 years of EHR data for >800,000 patients seen by primary care providers between 2006 and 2015. The registry provides access to commonly-requested EHR variables in a single location and easy-to-use format. Data have been cleaned, validated and transformed into a longitudinal format. The registry was developed at the Cleveland Clinic Health System (CCHS), based in northeast Ohio. The CCHS includes a large academic medical center, 17 regional hospitals, 18 family health centers and >210 outpatient locations.8
The Northeast Ohio Cohort for Atherosclerotic Risk Estimation (NEOCARE) database was created to study variation in atherosclerotic disease risk pertaining to socioeconomic position across a region for a series of birth cohorts over time. Thus, NEOCARE was disease-specific whereas the primary care database considered a broad range of primary care conditions. By developing a large observational cohort with representation across the entire socioeconomic spectrum, NEOCARE allows for analyses of narrowly-defined subpopulations. NEOCARE combines data from two health systems, CCHS (described above) and MetroHealth System (MHS), a public health system with 3 hospitals, 23 health centers and >40 additional sites. In creating NEOCARE, we linked EHRs for all patients aged ≥18 years with 2 or more visits at either MHS or CCHS between 1999 and 2017, identified unique patients and merged clinical records across institutions, and linked the data with national and state-wide death records. The patients from CCHS and MHS provide a diverse population including CCHS’s primarily commercially-insured and Medicare patients and the safety-net (mostly Medicaid and uninsured) population of MHS. Together, these systems provide approximately 70% of inpatient and outpatient services for northeast Ohio. Both health systems utilize an integrated, vendor-based EHR system (Epic™, Verona, WI), implemented in 1999 at MHS and 2001 at CCHS. The EHR at each site also leverages clinical decision support features including reminders for disease prevention and screening. Since 2014, the Healthcare Information and Management Systems Society (HIMSS) has certified both CCHS and MHS as having a Stage 7 Electronic Medical Record Adoption Model (EMRAM), the highest level of progress in EHR adoption.9,10
Both the Primary Care Registry and the NEOCARE database contain patient addresses which have been geocoded to census block-groups and linked with a variety of publicly-available neighborhood-level data sources, such as the US Census. Table 1 provides summary statistics stratified by age group for both databases. The Primary Care Registry contains 834,553 patients, 2.6 million patient-years of observation and 6.7 million encounters. The NEOCARE data set contains 3,018,569 patients and >74 million encounters.
Table 1.
Summary Statistics for the Primary Care and the NEOCARE Registries
| Age Range* | Patients, Primary Care Registry | Patients, NEOCARE | Patient-years, Primary Care Registry | Patient-years, NEOCARE | Encounters, Primary Care Registry | Encounters, NEOCARE |
|---|---|---|---|---|---|---|
| All, n (%) | 834,553 | 3,018,569 | 2,649,941 | 16,789,481 | 6,695,236 | 74,009,190 |
| Age 18–29 y | 131,094 (15.7) | 415,371 (13.8) | 318,517 (12.0) | 1,129,887 (6.7) | 648,885 (9.7) | 4,509,644 (6.1) |
| Age 30–39 y | 119,117 (14.3) | 400,390 (13.3) | 350,079 (13.2) | 2,068,045 (12.3) | 762,922 (11.4) | 7,454,662 (10.1) |
| Age 40–49 y | 135,072 (16.2) | 454,002 (15.0) | 453,731 (17.1) | 2,648,353 (15.8) | 1,048,015 (15.7) | 9,461,941 (12.8) |
| Age 50–64 y | 233,300 (28.0) | 822,564 (27.3) | 820,601 (31.0) | 5,155,832 (30.7) | 2,063,406 (30.8) | 21,573,538 (29.1) |
| Age 65–84 y | 181,836 (21.8) | 777,949 (25.8) | 616,345 (23.3) | 4,777,315 (28.5) | 1,860,921 (27.8) | 25,381,690 (34.3) |
| Age ≥85 y | 34,134 (4.1) | 148,293 (4.9) | 90,668 (3.4) | 1,010,050 (6.0) | 311,087 (4.6) | 5,627,715 (7.6) |

*Age at the most recent encounter in the registry.
In creating these two EHR-derived databases, we encountered multiple difficulties. Below we describe some important considerations when working with US-based EHR data and methods we used to address these challenges. The studies were approved by the Institutional Review Boards of Cleveland Clinic and (for NEOCARE) MetroHealth System.
Potential Pitfall #1: Failure to Define the Patient Population
EHRs are designed to manage the needs of patients within a health system, but generalizable research requires that subjects represent a defined population. Defining the population may appear straightforward; however, if one wants to focus on patients for whom the EHR is likely to represent complete health care utilization (overall or by specialty), as in our main research area of interest, the process is more complicated. An important first step in designing any research database is to define the population.
For the Primary Care Registry, we needed to define “primary care patients.” To allow flexibility, we considered several inclusion criteria. Our broadest definition included patients aged ≥18 years with at least 1 visit to a primary care location (defined as internal medicine or family medicine) between 2006 and 2015. Our most stringent definition included patients aged ≥18 years with at least 1 visit per year in at least 5 consecutive years.
Table 2 shows the effect of choosing different definitions of ongoing primary care. Most primary care patients used the system infrequently: 24% had a single encounter and 51% had <5 visits over 10 years. Under our broadest definition of ongoing care (≥2 visits in any one calendar year), 51% of patients had ongoing primary care; under the most stringent definition (≥1 visit per year for ≥5 consecutive years), only 19% did. Results were robust to stratification by age (Appendix Table 1).
Table 2. Effect of the Definition of “Ongoing Primary Care” on Sample Size.
EHR data are reflective of the Primary Care Registry, which includes patients aged ≥18 years with encounters in internal medicine or family medicine during 2006–2015, at the Cleveland Clinic Health System in northeast Ohio.
| Ongoing Primary Care Definition* | Patients (% of N = 834,553) | Patient-years (% of N = 2,649,941) | Encounters (% of N = 6,695,236) |
|---|---|---|---|
| ≥1 primary care visit ever | 100% | 100% | 100% |
| ≥2 primary care visits in any one calendar year | 51% | 45% | 66% |
| ≥3 primary care visits in any one calendar year | 35% | 26% | 49% |
| ≥1 primary care visit in each of 3 calendar years with time between visits ≤2 years | 34% | 59% | 64% |
| ≥1 primary care visit per year for ≥3 consecutive years | 32% | 58% | 63% |
| ≥1 primary care visit in each of 5 calendar years with time between visits ≤2 years | 22% | 45% | 49% |
| ≥1 primary care visit per year for ≥5 consecutive years | 19% | 42% | 46% |

*All criteria also required age ≥18 years and local residence, defined as residence in northeast Ohio or an adjacent county.
Recognizing that different research questions will require different study populations, we created 4 variables: year of the earliest primary care encounter, year of the latest primary care encounter, number of years in which a patient had a primary care encounter and number of primary care encounters by year. These variables allow researchers to apply their own criteria to create the desired study population. Alternatively, recognizing that studies vary in design, researchers could choose to ignore these variables and include all patient encounters.
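As an illustration, the derivation of these cohort-definition variables can be sketched as follows. This is a minimal sketch; the field names and the assumption that encounters arrive as (patient ID, calendar year) pairs are hypothetical, not the registry's actual schema.

```python
from collections import defaultdict

def has_consecutive_run(years, k):
    """True if a set of calendar years contains a run of k consecutive years."""
    ys = sorted(years)
    if not ys:
        return False
    run = 1
    if run >= k:
        return True
    for a, b in zip(ys, ys[1:]):
        run = run + 1 if b == a + 1 else 1
        if run >= k:
            return True
    return False

def cohort_flags(encounters):
    """Derive per-patient cohort-definition variables from (patient_id, year) pairs."""
    years = defaultdict(set)
    for pid, year in encounters:
        years[pid].add(year)
    return {
        pid: {
            "first_year": min(ys),
            "last_year": max(ys),
            "n_years": len(ys),
            # most stringent registry criterion: >=1 visit/year for >=5 consecutive years
            "ongoing_5y": has_consecutive_run(ys, 5),
        }
        for pid, ys in years.items()
    }
```

Keeping the raw per-year counts and letting each study apply its own threshold, as described above, avoids rebuilding the registry for every new definition of "ongoing care."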
Potential Pitfall #2: Failure to Carefully Define a Patient’s Primary Health Care Provider
Health services researchers may wish to include a patient’s provider as a covariate in regression analyses or to assess heterogeneity across providers in clinical care. Yet, because a patient may see many providers, it can be difficult to attribute patients to a single provider responsible for their care. In academic systems, there also is no general agreement as to whether to assign patients to residents or supervising faculty. Deciding which visit types (ambulatory, emergency department, inpatient, etc.) to include in the allocation method is also important. Most EHRs contain a discrete PCP field. This field’s reliability depends on the process by which it is populated and maintained. For certain research questions, assignment to a site of care rather than a provider may be adequate.
To address this potential pitfall in the Primary Care Registry, we attributed patients to primary care providers (and similarly, primary care sites) using two methods: first, by assuming the EHR-designated PCP for each encounter was correct and second, by determining the primary care provider most visited by the patient in each year with >1 encounter. For the latter definition, we manually examined the list of providers to exclude generic groups or service lines (e.g., nurse visit, flu clinic) and utilized the PCP of record for ambiguous cases. Additionally, we recorded whether the PCP was a resident. For both methods, we limited to in-person encounters (excluding patient portals, telehealth, etc.). Historically, most clinical activity has been in-person; other encounters were less likely to be documented in discrete EHR fields. To assess clinical activity of providers, we calculated the number of encounters by provider and year. Because some providers may see patients at multiple locations, we assigned a primary department and practice site for each provider by year, based on where the majority of their encounters occurred.
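The second attribution method can be sketched as follows, under the hypothetical assumption that in-person encounters are available as (patient ID, year, provider ID) records and the PCP-of-record field as a simple lookup; the real registry's tables and tie-breaking details differ.

```python
from collections import Counter

def attribute_pcp(visits, pcp_of_record):
    """Attribute each patient-year to the most-visited primary care provider.
    `visits`: iterable of (patient_id, year, provider_id) in-person encounters.
    Single-encounter years and ties fall back to the EHR's PCP-of-record field,
    mirroring the registry's handling of ambiguous cases."""
    counts = {}
    for pid, year, prov in visits:
        counts.setdefault((pid, year), Counter())[prov] += 1
    attributed = {}
    for (pid, year), c in counts.items():
        (top, n), *rest = c.most_common()
        if sum(c.values()) > 1 and (not rest or rest[0][1] < n):
            # clear plurality in a year with >1 encounter
            attributed[(pid, year)] = top
        else:
            # ambiguous: defer to the discrete PCP field
            attributed[(pid, year)] = pcp_of_record.get(pid)
    return attributed
```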
Our process frequently identified potential PCP mismatches. The provider seen most commonly in a year differed from the PCP of record in 33% of patient-years and 16% of patients with ongoing primary care had not seen their PCP of record in >2 years. Other limitations include lack of a recognized reference standard and assuming the presence of a meaningful clinical relationship for patients with few visits. Guides to help select an attribution model for specific circumstances have been published,11 but absent a recognized standard, researchers will need to specify the patient attribution method they used when creating their data set.
Potential Pitfall #3: Failure to Think of the EHR as Episodic (i.e., unstructured longitudinal) Data
The EHR is not designed to reconstruct historical health status. Rather, it focuses on a patient’s current health as of each encounter, with selected information on health history. Moreover, as a dynamic and ever-changing live data system, the same query run on two separate days may yield different results as data are updated. For researchers accustomed to working with “locked” data sets, the EHR may seem an easy way to go back and collect additional data. However, reconstructing the EHR as of a specific time point in the past (each historical encounter date) may be difficult. This can create challenges in estimating incidence and prevalence from EHR-derived data without introducing bias.12,13
To address this potential source of error, we validated our primary care registry data using external sources. We compared the registry’s prevalence of important variables with US-level data (reported in the National Health and Nutrition Examination Survey [NHANES]14 and previously-published literature). Naturally, differences should be expected between health system and national data, but we considered the high quality of national surveys to merit comparison. For variables with meaningful discrepancies, we conducted extensive manual chart review, iteratively updating variable definitions until discrepancies were resolved. Such comparisons suggested that our approach to transforming the EHR into longitudinal format was generally reasonable. For example, we found first-degree relatives with colorectal and breast cancer in 4.4% and 7.0% of patients, compared with general population estimates of 5.0% and 7.7%, respectively.15 Similarly, the proportion of eligible individuals prescribed medication for hypertension and diabetes was consistent with NHANES.14
However, we found that some diagnoses had low sensitivity. For example, alcohol misuse was identified in 1.6% of encounters compared with 14% nationwide.16 This discrepancy is likely due in part to infrequent screening for alcohol misuse and to documentation in free-text comments rather than discrete variables. Thus, the proportion of patients with alcohol misuse disorders may “increase” over time as providers begin to document alcohol use in discrete EHR fields. Such improvements may occur when health system leaders ask providers to improve documentation for operational purposes, such as rapid identification of patients eligible for counseling or computation of quality metrics. (Alternately, because alcohol abusers are less likely to seek primary care, EHR-generated prevalence may remain constant.) This highlights how changes in EHR representation can reflect changing clinical practice rather than changing incidence or prevalence. This problem grows more complex when combining data from multiple sites and institutions, each with unique screening, coding and documentation practices.
We also found that internal coding was not easily aligned with common taxonomies such as Current Procedural Terminology (CPT) codes. For example, while 8 separate CPT codes identify breast cancer screening, our EHR contained more than 200 internally-developed procedure codes for it. Aligning these codes to CPT terminology required either merging the EHR with the billing system, a substantial project especially for data from earlier years, or manually mapping the internal codes. We chose the latter approach and generated a list of all unique procedures associated with any primary care patient across all years, searching for desired terms and then manually reviewing the results. For example, for mammography we initially included all procedures with descriptions beginning “MAMM,” but this also captured unrelated mammary artery procedures and omitted 33,079 mammograms (1.4%) abbreviated “MAM.” Similarly, for laboratory results, we identified 16 unique codes for glycosylated hemoglobin (HbA1c), many of which had the same name.
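The term-search step can be sketched as below. The code-to-description mapping and stem list are illustrative, not the actual EHR layout; as the mammography example shows, stem matching only produces candidates, and manual review must still drop false positives.

```python
import re

def candidate_codes(procedures, stems):
    """Flag internal procedure codes whose descriptions contain any search stem,
    case-insensitively, for subsequent manual review.
    `procedures` maps internal code -> description (a hypothetical layout)."""
    pattern = re.compile("|".join(re.escape(s) for s in stems), re.IGNORECASE)
    return {code: desc for code, desc in procedures.items() if pattern.search(desc)}

# Searching on the short stem "MAM" catches both "MAMM..." and "MAM..." spellings,
# but also mammary artery procedures that manual review must exclude.
procs = {
    "P1": "MAMMOGRAM SCREENING BILATERAL",
    "P2": "MAM SCREEN DIGITAL",
    "P3": "LIMA MAMMARY ARTERY GRAFT",  # false positive requiring manual exclusion
    "P4": "COLONOSCOPY SCREENING",
}
hits = candidate_codes(procs, ["MAM"])
```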
On the other hand, we found the identification of pharmaceuticals straightforward because the EHR included National Drug Codes (NDCs). Specifically, each medication had an associated therapeutic and pharmaceutic class, allowing us to capture all relevant drugs.
Potential Pitfall #4: Failure to Account for Changes in Health System Composition
As a dynamic document, the EHR adapts and changes with the health system. During the creation of our data sets, both of the involved health systems expanded both physically and technologically, adding new practice sites, acquiring existing sites from other health systems and expanding use of electronic health records. With regard to physical expansion, because of multiple acquisitions at Cleveland Clinic, medical record numbers were not unique across all sites and patients at different locations could have the same number. Therefore, we utilized “enterprise ID,” a unique identifier within the health system. Similarly, MetroHealth acquired another health system and despite having the same EHR vendor, data from the acquired system could not be fully reconciled and merged into the MetroHealth EHR and were instead archived in a legacy system. Such changes can lead to dramatic shifts in the distribution of key study variables. For example, at MHS the percentage of patients with commercial insurance increased substantially after including data from the merged health system. Additionally, expansions and changes in affiliation agreements at MHS (e.g., contracts with different entities for pediatric specialty care over time) resulted in periodic changes in availability of some variables.
Prior to 2006, all providers at some of the CCHS’s newly acquired sites were grouped under the same department code (regardless of specialty), hindering identification of primary care visits. Therefore, we excluded these encounters. Since 2006, availability of unique identifiers for all providers regardless of site allowed inclusion of all primary care patients. Researchers should consider such changes and scrutinize descriptive statistics over time, rather than in aggregate.
With regard to technological expansion, both health systems in our analysis were early adopters of the EHR, with extensive discrete data throughout the sample period. However, for some fields, data quality improved over time. For example, providers began to more completely document changes in lifestyle (e.g., smoking, alcohol, sexual activity) in discrete fields rather than free-text. More generally, researchers should consider whether changes in their institutions’ use of the EHR materially affected data quality over time. For example, if historically, the EHR was unable to communicate between primary care and specialty (including emergency) departments, or between ambulatory and inpatient settings, then researchers should consider whether it is possible to accurately reconstruct a longitudinal history of vital signs data. In some cases, it may be better to exclude early years of an institution’s EHR data in favor of the first year with (near-) complete use.
Potential Pitfall #5: Assuming the EHR is Well-Organized and Accurate
Researchers familiar with administrative claims data might expect to “just find” a variable in EHR documentation and extract it. However, the EHR rarely has full historical documentation. Instead, EHR changes are routinely executed without widely-distributed notice or consideration of consequences for researchers. Reconstructing historical updates required extensive personnel hours and manual chart review.
For example, when MHS changed the method for storing race and ethnicity, operations staff created a new table in the EHR but removed all previously-collected data. As a result, almost overnight the percentage of patients with missing data on race/ethnicity increased from <5% to nearly 35%. To deal with this type of problem, we investigated historical data structure for each variable and adjusted queries to harmonize data definitions over time, with mappings to multiple EHR source tables.
We also found multiple reporting formats that were not uniform, requiring standardization. For example, one row of HbA1c laboratory results might report “0.07”; the next, “7%”. When results for the same lab test were reported multiple times on the same day, we recorded the mean. Extreme values, identified by percentiles and clinical judgment, were replaced as missing. Results from international laboratories were converted from SI (metric) to US units. We removed text (e.g., “Unable to assay. No specimen received.”) from both the lab result and reference values. In some instances, this process could change a value from “< 0.001” to “0.001”; however, both the lab value and the reference value would still match, and the reference flag would remain unchanged.
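The HbA1c normalization can be sketched as follows. The fraction-vs-percent heuristic and text-stripping rules are illustrative simplifications of the registry's validated cleaning logic, not its exact cutoffs.

```python
import re
from statistics import mean

def parse_hba1c(raw):
    """Normalize one HbA1c result string to percent units, or None if non-numeric.
    Values reported as fractions (e.g., "0.07") are rescaled to percent ("7%").
    The <1 threshold is an illustrative assumption, not a validated cutoff."""
    m = re.search(r"\d+(\.\d+)?", str(raw))
    if not m:
        return None  # e.g., "Unable to assay. No specimen received."
    value = float(m.group())
    if value < 1:
        value *= 100  # "0.07" reported as a fraction -> 7.0%
    return value

def daily_mean(results):
    """Average same-day duplicate results after normalization, as the registry did,
    dropping unparseable rows."""
    vals = [v for v in (parse_hba1c(r) for r in results) if v is not None]
    return mean(vals) if vals else None
```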
Additionally, in the primary care database, information obtained from EHR-derived health maintenance tables (which focus on preventive care and chronic condition management) frequently disagreed with documented procedures and lab results. Often, this was because preventive services were obtained outside the health system; providers manually overrode the due date to reflect this. Conversely, some preventive services reported as completed in health maintenance could not be verified in other EHR locations; for example, a patient who obtained colonoscopy in 2005 (before the study commenced) would be up-to-date through 2015 but this status would only be found in health maintenance. We resolved these conflicts in favor of health maintenance.
Although the EHR included ICD-9 and ICD-10 diagnosis codes, which facilitated data retrieval, it was important to verify their accuracy. Recognizing that providers may choose a diagnosis to rule-out, rather than diagnose, a condition (e.g., testing for venous thromboembolism), the Primary Care Registry required non-acute conditions to be diagnosed on ≥2 encounters at least 30 days apart.17 Additionally, for some conditions, such as diabetes, we relied on both diagnoses and laboratory test results. This process improved accuracy but required additional effort.
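The two-encounter rule reduces to a simple date comparison: some pair of diagnosis dates is ≥30 days apart exactly when the earliest and latest dates are. A minimal sketch, assuming diagnosis dates are available per patient and condition:

```python
from datetime import date

def confirmed_chronic(dx_dates, min_gap_days=30):
    """True when a non-acute diagnosis appears on >=2 encounters at least
    `min_gap_days` apart, the registry's rule for filtering out rule-out coding."""
    ds = sorted(set(dx_dates))
    # a qualifying pair exists iff the earliest and latest dates are far enough apart
    return len(ds) >= 2 and (ds[-1] - ds[0]).days >= min_gap_days
```

For conditions such as diabetes, this check would be combined with laboratory criteria, as described above.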
In the EHRs at both CCHS and MHS, we found conflicts in historical demographics. Race, ethnicity and sex sometimes changed and subsequently returned to the previous value. To reconcile this, we took the mode. Patient addresses suffered from frequent misspellings and mischaracterizations (e.g., zip code mismatch), preventing linkage with US Census data. In the primary care database, we probability-matched to the most likely valid address using US Census geographic data.18,19 Using this process, we obtained a valid patient address, county and census block group for >97% of encounters.
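Taking the mode of conflicting historical values can be sketched as below; the filtering of blank and "Unknown" entries is an illustrative assumption rather than the registry's documented rule.

```python
from collections import Counter

def reconcile(values):
    """Collapse a patient's conflicting historical values (e.g., race, ethnicity,
    sex) to the most frequently recorded one, per the mode-based approach.
    Dropping empty/"Unknown" entries first is an assumed cleaning step."""
    filled = [v for v in values if v not in (None, "", "Unknown")]
    return Counter(filled).most_common(1)[0][0] if filled else None
```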
For each potential challenge, we emphasize the importance of researchers conducting manual chart review to identify errors and confirm appropriate solutions. One strategy that we found particularly useful was to randomly select patient charts within subgroups of interest (e.g., very high or low body mass index) and confirm that their chart was appropriately coded across all registry variables. Then, as the registry was updated in an attempt to fix errors, we repeatedly returned to those charts until all variables were correct.
Potential Pitfall #6: Failure to Identify the Same Patient in Different Health Systems
NEOCARE combines data from two healthcare systems in the same region. In this situation, it is important to identify patients across systems. Most patients receive care within a single health system but may transition, for example, with a change in health insurance. Others receive ongoing care at both systems. Additionally, patients may have different identifying information (e.g., name, address) across health systems. To address this issue in the NEOCARE database, we utilized a multi-stage matching method to identify unique persons across institutions.
In the first step, we took advantage of both systems’ use of Care Everywhere, Epic’s built-in health information exchange platform that shares select patient data (e.g., medical history, medications, allergies) across health systems.20 We identified patients with a Care Everywhere ID indicating encounters at both CCHS and MHS.
For those failing to match in the first step and for the remaining patients without Care Everywhere IDs, we created a unique identifier which consisted of the first name, last name, last four digits of Social Security Number and birth year. We verified that this combination of variables was unique within each institution and identified patients at both institutions who matched on all 4 fields. In rare instances that a patient at one institution matched multiple patients at the other institution, we randomly selected one match. Finally, we combined all matches from the two steps above with the remaining non-matched patients (considered to have obtained care at only 1 institution) and assigned each patient a unique master study ID number.
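The second-stage deterministic match can be sketched as follows. The normalization (trimming whitespace, upper-casing names) is an illustrative assumption; the paper specifies only the four fields composing the key.

```python
def match_key(first, last, ssn_last4, birth_year):
    """Deterministic cross-institution key: first name, last name, last 4 SSN
    digits, birth year (the second-stage identifier described above)."""
    return (first.strip().upper(), last.strip().upper(),
            str(ssn_last4)[-4:], int(birth_year))

def cross_system_matches(cchs, mhs):
    """Pair local patient IDs from two systems that share the same 4-field key.
    Each input maps local ID -> (first, last, ssn_last4, birth_year)."""
    index = {}
    for pid, fields in cchs.items():
        index.setdefault(match_key(*fields), []).append(pid)
    pairs = []
    for pid, fields in mhs.items():
        for other in index.get(match_key(*fields), []):
            pairs.append((other, pid))
    return pairs
```

In the rare case of one-to-many matches, the registry randomly selected a single match, a step omitted from this sketch.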
Utilizing this 2-stage matching process for the 3,018,622 patients in NEOCARE, we found that 2,541,169 (84.2%) had data only from CCHS, 231,682 (7.7%) had data only from MHS, and 245,771 (8.1%) had data from both institutions. While other data sets may have different patient matching requirements, taking advantage of the built-in health information exchange (HIE) platforms already present in many commercial EHRs can greatly simplify the process.
Potential Pitfall #7: Failure to Account for the Enormous Size of EHR Data
Modern EHRs contain massive amounts of data on millions of patients. The amount of data contained in a large research database extracted from the EHR may rapidly overwhelm the capacity of desktop computers. To facilitate data management, the primary care registry was organized as a single longitudinal file. The NEOCARE registry initially took the same approach, but its size quickly overwhelmed available desktop resources and even simple management tasks took days, or failed to run at all. The data were then transferred to research servers, which are orders of magnitude more powerful than desktop systems. While this improved data management, the increasing resource usage was noticed with concern by the computing services department after we inadvertently exhausted the entire capacity of the server, thereby preventing access for other users. This led to a redesign of NEOCARE using a distributed database environment (Teradata SQL, Teradata Corporation, San Diego, CA) capable of quickly performing complex operations on millions of records. This improved data management, though not without raising concerns about resource-intensive computations on the Teradata cluster, which is primarily used for production-level functions within the enterprise. These tasks also required considerable investment and flexibility in learning to manipulate data in new computing environments.
Potential Pitfall #8: Assuming the EHR is Representative of the General Population
Although EHRs contain information on large numbers of patients, they may not be representative of the general population. Even defining the general population from which the EHR patients arose may be difficult. Additionally, individuals in the EHR may differ from the general population in underlying illnesses, in socioeconomic position, and according to whether the health system is a referral site for specific conditions.
Recognizing that each health system served different subsets of the general population, NEOCARE focuses on subpopulations based on socioeconomic position, birth cohort and race/ethnicity. Analyses are conducted within these subpopulations as opposed to only examining average relationships across groups. This process also recognizes that female and older patients are more likely to obtain clinical care.
To evaluate how closely the NEOCARE population mirrors the general population in the region, we compared distributions of age and neighborhood socioeconomic position between the patients in NEOCARE and the general population of Cuyahoga County, Ohio. Socioeconomic position was characterized at the census block-group level using the 2014 US Census American Community Survey to calculate Area Deprivation Index (ADI), an index comprised of 17 education, employment, housing-quality, and poverty measures.21 We derived localized ADI measures using the ‘sociome’ package in R.22
As shown in Figure 1, older populations were overrepresented in the EHR-derived data compared with the general population (e.g., 11.9% aged ≥80 y in the EHR vs. 6.5% in the US Census), likely because older persons receive more health care. Younger populations were slightly underrepresented (e.g., 23.6% aged 18–43 y in the EHR vs. 28.0% in the US Census). However, representation across the socioeconomic spectrum (ADI quintiles) was similar.
Figure 1. Comparison of the Regional Electronic Health Record and US Census Estimates, by Age Group and Area Deprivation Index.
EHR data are reflective of the NEOCARE database, which includes patients from Cleveland Clinic Health System and MetroHealth System in northeast Ohio. Older patients were overrepresented in electronic health records data (larger hollow and diagonally-striped blocks in the EHR rows, as compared with the US Census rows). Representation across the socioeconomic spectrum (quintiles of the Area Deprivation Index) was consistent.
EHR: Electronic health record. US Census: US Census Block Group.
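The over- and underrepresentation described above can be summarized as a simple representation ratio (EHR share divided by census share). The shares below are the figures quoted in the text; the function name is our own.

```python
# Category shares from the comparison in the text: age >=80 y was 11.9%
# of the EHR vs. 6.5% of the census; age 18-43 y was 23.6% vs. 28.0%.
ehr_share = {"18-43": 0.236, "80+": 0.119}
census_share = {"18-43": 0.280, "80+": 0.065}

def representation_ratio(group):
    """Ratio > 1 means the group is overrepresented in the EHR
    relative to the general population."""
    return ehr_share[group] / census_share[group]

for group in ehr_share:
    print(f"{group}: representation ratio {representation_ratio(group):.2f}")
```

By this measure the oldest group appears in the EHR at nearly twice its census share, while younger adults fall somewhat below parity.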
Potential Pitfall #9: Barriers to Data Access
EHRs are used daily in clinical care, but multiple barriers exist to using these data for research. We encountered 3 potential barriers to data access: technological issues, regulatory and legal issues, and data governance.
To reduce technological access barriers, the primary care registry is organized as a single file, with rows representing each patient’s historical encounter dates and columns representing the longitudinal variables. It is available in multiple formats, including R, SAS, and Stata. For less experienced users, a Microsoft Excel file containing the first 1,000 rows is also available to help users understand the data and finalize an analysis plan. This review often can be completed the same day, rather than after the weeks or months of wait time associated with a new data extraction request to information technology.
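The preview-file idea can be sketched as follows. File names are hypothetical, and a CSV (which Excel opens directly) stands in for the actual Excel export; the point is simply that a small, structure-preserving sample can be generated without a new extraction request.

```python
import csv
import itertools

def write_preview(registry_path, preview_path, n_rows=1000):
    """Copy the header row plus the first n_rows data records of the
    registry file, so users can inspect the column structure and
    finalize an analysis plan before working with the full data."""
    with open(registry_path, newline="") as src, \
         open(preview_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        # islice takes the header plus the first n_rows records.
        for row in itertools.islice(reader, n_rows + 1):
            writer.writerow(row)

# Hypothetical usage:
# write_preview("registry_full.csv", "registry_preview.csv")
```

Because the preview preserves the one-row-per-encounter layout of the full file, code written against it should run unchanged on the complete registry.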
Regulatory issues include obtaining permission to access raw EHR data and designing and executing data sharing agreements across institutions. We solved the former issue by partnering with data scientists and other colleagues who already had access; these individuals retrieved raw data and (with IRB permission) saved them in a database accessible by the rest of the study team. Cross-institutional collaboration was critical for NEOCARE, which cross-credentialed project personnel at both institutions, subjecting them to each institution’s accountability, training, and data security structures. Data agreements were reviewed by numerous officials in multiple departments of each institution, inevitably resulting in delays; although specifics will vary by institution, these negotiations required patience and open communication between research and regulatory personnel.
For data governance in the primary care registry, we applied for registry status with an Institutional Review Board (IRB) identifier. This means that any investigator wishing to use the data may submit a protocol stating that data will be obtained from the registry. A research administrator asks the investigator to complete a statement of assurance, and then creates a restricted-access file directory which is only accessible to designated study personnel. Individuals using the registry are required to notify the research administrator of resulting publications and presentations.
Personnel Resources
For context, rather than as a potential pitfall, we provide a rough sense of the personnel required to develop the registries. Each registry took approximately 2 years to create. The primary care data registry required approximately 0.4 full-time equivalent (FTE) effort/year, split between 1 PhD and 1 Master’s-level scientist. NEOCARE required approximately 1.5 FTE/year among 3 PhDs, 3 Master’s-level scientists, and 1 physician. Supervision was intensive at first, then evolved to approximately 2–4 meetings/month.
DISCUSSION
In 2017, an estimated 96% of non-federal acute care hospitals and 80% of office-based physicians had adopted a certified EHR for clinical care.23,24 As use of the EHR has grown, so has the use of EHR data for research.
As more investigators begin to use EHR data for research, understanding potential pitfalls has become more important. This paper highlights that while EHRs are a focus of “big data” analysis and are increasingly considered a readily-available source of research data, the reality of using the EHR for research is complicated. We identified 9 potential pitfalls, including complex and sometimes inaccurate data,25 episodic design, potentially non-representative data, and utilization limited to specific health system(s), and suggested alternative solutions. Since most researchers will not have the knowledge or skills necessary to abstract EHR data themselves, learning how to specify data requests and communicate with those extracting the data is important. Our pitfalls offer general guidelines and expand on a recent study in Europe.26 Further, researchers must carefully consider study design, including an appropriate patient population and assessment of summary statistics stratified by year, requiring time and careful attention in preparation for a valid study.
Naturally, this list of pitfalls should not be considered complete; researchers are likely to encounter other issues specific to their institutions. More broadly, while our proposed alternatives should reduce the limitations of EHR data, they do not eliminate them completely. Like administrative claims, EHRs reflect the nature of clinical practice, which lacks the rigorous design and quality controls of randomized clinical trials and national surveys.
In addition to being readily available, EHR data offer at least 4 advantages over other data sources. First, they provide longitudinal clinical data, such as vital signs, smoking status, and laboratory test results, which are essential to certain research questions; administrative claims lack this capacity. Second, EHR data offer large sample sizes, greatly improving power for subgroup analyses relative to nationally-representative surveys. Large sample sizes also facilitate assessment of heterogeneity across patients and providers. Third, EHR data offer improved estimates of sociodemographic variables, which may be derived by geocoding patient addresses to census block-groups rather than the zip codes or counties available in administrative claims. Fourth, Medicare claims, perhaps the most widely-used administrative database, primarily include only elders; other claims data are often proprietary, and all exclude the uninsured. Among younger adults, 30% are uninsured and nearly 50% change insurance annually;27 however, these individuals continue to be captured in EHRs as long as they receive care from a provider or health system.
In conclusion, we have addressed initial steps toward employing the EHR in more meaningful and useful analyses of local and regional data. The use of EHR data for research likely will continue to increase. Our experience highlights potential pitfalls that must be negotiated when utilizing EHR data outside clinical settings. When carefully designed, it is feasible to build large-scale EHR-based data registries to facilitate health services and community-level research. Learning from the experience of others will improve knowledge and understanding of EHR data and can serve as a guide for future investigators who seek to build data registries at other institutions.
Acknowledgments
The data registries described in this study were approved by the Institutional Review Boards of Cleveland Clinic and (for NEOCARE) MetroHealth System. Dr. Taksler was supported by R01AG059979 and R21AG052849 (National Institute on Aging). Drs. Taksler and Dawson were supported by KL2TR000440 (National Center for Advancing Translational Sciences and Clinical and Translational Science Collaborative of Cleveland). Drs. Dalton, Perzynski, Rothberg, Dawson, and Einstadter were supported by R01AG055480 (National Institute on Aging). The funding sources had no role in study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the article for publication.
Footnotes
Financial conflicts: Dr. Perzynski reports book royalties from Springer Nature and Taylor & Francis and is co-founder of Global Health Metrics. No other financial disclosures were reported.
REFERENCES
- 1. What are the advantages of electronic health records? https://www.healthit.gov/providers-professionals/faqs/what-are-advantages-electronic-health-records. Accessed July 19, 2018.
- 2. Surveys and Data Collection Systems. https://www.cdc.gov/nchs/surveys.htm. Accessed July 19, 2018.
- 3. CDC WONDER. https://www.cdc.gov/nchs/surveys.htm. Accessed July 19, 2018.
- 4. Health and Retirement Study. https://hrs.isr.umich.edu/data-products. Accessed February 25, 2019.
- 5. Behavioral Risk Factor Surveillance System. https://www.cdc.gov/brfss/index.html. Accessed February 25, 2019.
- 6. Ambulatory Health Care Data. https://www.cdc.gov/nchs/ahcd/index.htm. Accessed February 25, 2019.
- 7. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med. 2015;372(9):793–795.
- 8. Facts & Figures. https://my.clevelandclinic.org/about/overview/who-we-are/facts-figures. Accessed October 31, 2019.
- 9. HIMSS Analytics Honors MetroHealth Medical Center With Stage 7 Award. https://www.healthitoutcomes.com/doc/himss-analytics-honors-metrohealth-medical-center-with-stage-award-0001. Accessed May 25, 2020.
- 10. HIMSS Analytics Honors Cleveland Clinic Health System With Stage 7 Ambulatory Award. https://hitconsultant.net/2014/12/15/himss-analytics-honors-cleveland-clinic-health-system. Accessed May 14, 2020.
- 11. Attribution: principles and approaches. www.qualityforum.org/Publications/2016/12/Attribution_-_Principles_and_Approaches.aspx. Accessed August 12, 2019.
- 12. Rassen JA, Bartels DB, Schneeweiss S, Patrick AR, Murk W. Measuring prevalence and incidence of chronic conditions in claims and electronic health record databases. Clin Epidemiol. 2019;11:1–15.
- 13. Manuel DG, Rosella LC, Stukel TA. Importance of accurately identifying disease in studies using electronic health records. BMJ. 2010;341:c4226.
- 14. Specifying Weighting Parameters. May 10, 2013. http://www.cdc.gov/nchs/tutorials/nhanes/SurveyDesign/Weighting/intro.htm. Accessed May 15, 2020.
- 15. Mai PL, Wideroff L, Greene MH, Graubard BI. Prevalence of family history of breast, colorectal, prostate, and lung cancer in a population-based study. Public Health Genomics. 2010;13(7–8):495–503.
- 16. Results from the 2016 National Survey on Drug Use and Health: Detailed Tables. Table 5.2D—Substance Use Disorder for Specific Substances in Past Year among Persons Aged 12 or Older, by Age Group: Standard Errors of Percentages, 2015 and 2016. https://www.samhsa.gov/data/sites/default/files/NSDUH-DetTabs-2016/NSDUH-DetTabs-2016.pdf. Accessed July 19, 2018.
- 17. Klabunde CN, Potosky AL, Legler JM, Warren JL. Development of a comorbidity index using physician claims data. J Clin Epidemiol. 2000;53(12):1258–1267.
- 18. TIGER/Line® Shapefiles and TIGER/Line® Files. https://www.census.gov/geo/maps-data/data/tiger-line.html. Accessed January 11, 2019.
- 19. Yancey WE. An Adaptive String Comparator for Record Linkage. Washington, DC: Statistical Research Division, U.S. Bureau of the Census; February 19, 2004.
- 20. Organizations on the Care Everywhere Network. https://www.epic.com/careeverywhere/. Accessed May 26, 2020.
- 21. Kind AJH, Buckingham WR. Making Neighborhood-Disadvantage Metrics Accessible - The Neighborhood Atlas. N Engl J Med. 2018;378(26):2456–2458.
- 22. Krieger NI, Wang C, Dalton JE, Perzynski AT. sociome: Helping Researchers to Operationalize Social Determinants of Health Data. R package version 0.4.0; 2018.
- 23. Percent of Hospitals, By Type, that Possess Certified Health IT. Health IT Quick-Stat #52. https://dashboard.healthit.gov/quickstats/pages/certified-electronic-health-record-technology-in-hospitals.php. Accessed November 18, 2019.
- 24. Office-based Physician Electronic Health Record Adoption. Health IT Quick-Stat #50. https://dashboard.healthit.gov/quickstats/pages/physician-ehr-adoption-trends.php. Accessed November 18, 2019.
- 25. Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013;20(1):144–151.
- 26. Verheij RA, Curcin V, Delaney BC, McGilchrist MM. Possible Sources of Bias in Primary Care Electronic Health Record Data Use and Reuse. J Med Internet Res. 2018;20(5):e185.
- 27. Austic E, Lawton E, Riba M, Udow-Phillips M. Insurance Churning. University of Michigan Center for Healthcare Research & Transformation Policy Brief; 2016.