Abstract
Electronic health records (EHRs) have become increasingly relied upon as a source for biomedical research. One important research application of EHRs is the identification of biomarkers associated with specific patient states, especially within complex conditions. However, using EHRs for biomarker identification can be challenging because the EHR was not designed with research as the primary focus. Despite this challenge, the EHR offers huge potential for biomarker discovery research to transform our understanding of disease etiology and treatment and generate biological insights informing precision medicine initiatives. This review paper provides an in-depth analysis of how EHR data are currently used for phenotyping and identifying molecular biomarkers, current challenges and limitations, and strategies we can take to mitigate challenges going forward.
Keywords: electronic health records, biomarker discovery, phenotyping, precision medicine
The utility of electronic health records in research
Data from electronic health records (EHRs) (see glossary) have become increasingly relied upon as a source for biomedical research. One important research application of EHRs is the identification of biomarkers that are associated with specific patient states, especially within complex conditions. However, EHRs were not designed with research as the primary focus. Data within EHRs are generated predominantly as documentation for patient care and billing (see categories of data in Figure 1). Additionally, there is a cost associated with data collection within EHRs, whether the time required to write a patient note or the financial and patient-burden costs of going into the clinic for a lab test. This means that the availability of data within EHRs is biased and driven by processes unlikely to directly align with research goals. Despite this challenge, the vast amount of patient data present within EHRs offers huge potential for biomarker discovery research. Done correctly, researchers have the potential to transform our understanding of disease etiology and treatment and generate biological insights informing precision medicine initiatives [1].
Figure 1. Anatomy of the ideal patient electronic health record.

This ideal example of a patient chart contains routinely collected structured data including billing-related data (gray), unstructured notes, oral history, and messaging data (blue), consumer health data and molecular omic data (green), and environmental and SDOH data (purple). Data type-specific methods, validation tests, and interpretations are needed to identify biomarkers effectively. Icons in patient chart customized using BioRender.com.
This review paper provides an in-depth analysis of how EHR data are currently used for phenotyping and identifying molecular biomarkers, the current challenges and limitations, and the strategies we can take to mitigate challenges going forward. We begin by introducing (1) the different data types available in EHRs and (2) how routinely collected data can be used for phenotyping. The next sections explore (3) imaging and digital diagnostic data for identifying digital biomarkers and (4) Natural Language Processing (NLP) approaches for extracting data from clinical notes in the patient’s record for phenotyping. We discuss (5) the implications of missing data in EHRs and methods to mitigate the effects of missingness. We conclude by discussing (6) the future of the EHR, including data capture and quality control, as well as interoperability and utility.
Understanding EHRs in the context of biomarker discovery
An important starting point for performing EHR-based analyses is understanding the different data types present within EHRs as well as the driving factors behind the generation of these data. By acknowledging these different factors, we can begin to categorize data types and devise type-specific strategies for the downstream usage of these data, given that each data type comes with its own caveats for utility and coding that are often hidden from researchers using EHR data. Table 1 specifies most of the primary data types available in the EHR and aims to differentiate some of the purposes and challenges of using different sources or representations of the same categories of data. Best practice when using EHR data collected for other purposes often requires consideration of multiple data types and generating heuristics to examine individual values in their larger context. This generally requires working closely with clinical experts to understand when a particular lab would and wouldn't be ordered. This idea is critical to important concepts in EHR-based biomarker association such as phenotyping algorithms.
Table 1.
Primary categories of data relevant to biomarker association available within the EHR, as well as their sources, purposes, and potential caveats of use in downstream analyses. Sources listed are not exhaustive, but rather focus on common data model sources and terminologies.
| Category | Source* | Purpose or Caveats |
|---|---|---|
| Patient, Visit, & Admission Details | Explicit tables available within structured EHR data | Demographic and visit details for a patient - the multiple uses include patient care, justifying care decisions, and tracking billing information. It is important to note that significant differences have been found between patient-provided and biological data (e.g., genetically derived ethnicity) [2] |
| Diagnoses | International Classification of Diseases (ICD) Diagnosis Codes [3] | Communicating a patient's disease status to payers for the purpose of justifying costs associated with patient care |
| | Problem List | These are used for communication between clinicians and eventually help in the generation of ICD codes. They may be less likely to be updated over time due to lack of external downstream appraisal. |
| Medications | Administered Medications | Tracking medications administered within the hospital |
| | Prescribed Medications | These are used for generating a prescription for the patient to fill outside of the hospital. A key challenge is that prescriptions may be renewed by other providers outside of a single EHR, so it may be difficult to achieve complete medication coverage. The lack of a link to pharmacy data also makes it challenging to track adherence, or whether the prescription was even filled. This may be supplemented by insurance claims data to observe prescription fills. |
| | Notes-based Medication Extraction | Within notes, clinicians record a history of present illness, pertinent medical history, and a current list of patient medications (e.g., within the Subjective section of a "SOAP" note). These are predominantly used for patient care decisions but may also be used for legal or billing purposes. |
| Procedures | Procedure codes (e.g., Current Procedural Terminology - CPT®) [4] | Coding procedures (and other medical services) for billing purposes |
| | Operative Notes | Patient care, but required for legal and billing purposes (reimbursement justification) |
| Labs | Logical Observation Identifier Names and Codes (LOINC) [5] and associated results | The result can be critical for patient care, but the code is required for billing (the result may also be required for prior authorization). An important caveat to results is that each reference lab or set of diagnostics can have different reference ranges, making cross-institutional comparison dependent on standardization. |
| | Genetic testing | Genetic testing results can include genetic variants and polymorphisms, whole exome/genome sequencing reports, and cytogenetic analyses. Traditionally these data have been included in an unstructured format (such as scanned PDF documents), but more recently EHRs are beginning to accommodate structured genetic variant entry. Pharmacogenomic testing provides insight into the influence of genetic markers on a patient's medication response. Notes from genetic counseling sessions can also be included here. |
| Vital Measurements | Automated Electronic Chart (e.g., device-fed vitals) | Digital devices may record data directly to a patient's electronic chart, providing a more comprehensive view of vital measurements. Importantly, these data are only available in certain settings (e.g., ICU) and therefore present issues with selection bias. |
| | Manually entered vitals or progress notes | These values generally represent either a point in time or a human-summarized form of vitals. This is especially relevant for values with high temporal variance (e.g., post-op blood pressure). |
| Microbiology | Manual or semiautomated microbiology lab results | Microbiology culture results are used for antimicrobial therapy decisions, as well as infectious disease management. In general, they are less standardized than other data types across multiple institutions, presenting challenges in harmonization and standardization. |
| Imaging & Diagnostics | Direct/Raw Imaging & Diagnostics | The direct or raw imaging results (e.g., X-ray, CT scan, MRI, etc.) and diagnostics (e.g., EEG, ECG) offer direct sensor-based representations of patient phenotypes but require modality-specific methods. These data frequently have high dimensionality and require feature extraction. |
| | Imaging & Diagnostic Reports | Imaging and diagnostic reports in the form of notes represent the clinician-synthesized interpretation of the imaging or diagnostic. While this can make working with these data easier, it presents a limited view of the diagnostic. Reports may be focused around specific relevant findings, with abnormal anatomies highlighted. |
| | Pathology reports | Pathology reports can include the results of microscopic and cytologic examination of surgical samples, biopsies, and body fluids. These provide both structured and unstructured data on the structural and functional changes seen within patient samples and can be critical in confirming patient diagnoses. |
*Data from outside sources may be digitally scanned and stored. These files are linked to the EHR through a document management system and may not always be searchable from within the EHR.
Extracting, defining, and validating phenotypes using EHRs
When creating disease phenotypes from databases of structured clinical data, there are several strategies that can be employed. Most of these strategies have been used with EHR data for many years [6–8]. However, the ability to do so varies a great deal from one trait to another. For example, phenotypes for complex diseases such as Type 2 Diabetes [9] or cataracts [10] can be extracted through the creation of electronic phenotyping algorithms. These algorithms may include structured data such as International Classification of Diseases (ICD) codes, Current Procedural Terminology (CPT) codes, vital signs, clinical lab measurements, and medications in addition to unstructured data from clinical notes processed through natural language processing (NLP). Over 45 such algorithms have been created for various diseases and deposited in the public repository PheKB [11], a Phenotype Knowledgebase that maintains the pseudocode to implement these electronic algorithms across different EHRs. Using these algorithms, the phenotypes created have been successfully used for genome-phenome association studies. For example, the Electronic MEdical Records and GEnomics network (eMERGE) has conducted GWAS studies on resistant hypertension [12], white blood cell count [13], cataracts [14], and venous thromboembolism [15] (to name a few traits), each using an electronic phenotyping algorithm. These algorithms have been demonstrated to have high positive predictive value and are portable across different healthcare organizations.
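A rule-based phenotyping algorithm of this kind can be sketched in a few lines. The rule below is a minimal, illustrative caricature of a PheKB-style Type 2 Diabetes definition; the column names, thresholds, and logic are hypothetical stand-ins, not any deposited algorithm:

```python
import pandas as pd

# Hypothetical patient-level EHR extract; all column names are illustrative.
records = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "n_t2dm_icd_codes": [3, 0, 1, 0],   # count of T2DM diagnosis codes
    "max_hba1c": [8.1, 5.2, 6.9, 5.4],  # highest recorded HbA1c (%)
    "on_metformin": [True, False, True, False],
})

def t2dm_case(row) -> bool:
    """Toy rule: >=2 diagnosis codes, OR 1 code plus supporting
    evidence (HbA1c >= 6.5% or an antidiabetic medication)."""
    if row.n_t2dm_icd_codes >= 2:
        return True
    if row.n_t2dm_icd_codes >= 1 and (row.max_hba1c >= 6.5 or row.on_metformin):
        return True
    return False

records["t2dm_case"] = records.apply(t2dm_case, axis=1)
print(records[["patient_id", "t2dm_case"]])
```

In practice the rule logic is iterated against manual chart review, and the thresholds come from clinical expertise rather than convenience.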
These phenotype algorithms can also be implemented in epidemiologic survey databases (e.g., NHANES and ARIC), as well as population registries (e.g., UK Biobank and deCODE). In addition, the adoption of phenotype ontologies and/or standardized vocabularies such as the Human Phenotype Ontology [16], SNOMED-CT [17], and LOINC [18], can assist in the mapping of the complex data elements collected in diverse EHRs and patient registries into more complete structured phenotype definitions. The way phenotypes have been defined thus far has been largely based on clinician expertise and experience. These algorithms identify individuals who have or do not have a disease, based on what a clinician indicates should be included in such an algorithm. The algorithm is then refined through iterative cycles of manual chart review and revision [19].
It is still important to understand the potential biases that specific decisions in phenotyping algorithms may introduce. For example, when defining the control population for Type 2 diabetes [20], data missingness can present challenging decisions. Younger, healthy patients may be less likely to have a blood glucose test ordered because their physician is not concerned about their glucose level. Requiring a blood glucose test with a normal value could therefore select for an older population, a population that is less healthy in general, or a population that interacts with the healthcare system more frequently (and is therefore potentially of a particular socioeconomic group). Sensitivity analyses with alternate definitions are one way to ensure more robust biomarker association analyses [21].
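The sensitivity-analysis idea can be made concrete by comparing control pools under alternate definitions. The sketch below uses a tiny hypothetical cohort (all field names invented) to show how a strict definition, requiring a documented normal glucose lab, selects a different, smaller control pool than a lenient one:

```python
import pandas as pd

# Illustrative cohort; all fields are hypothetical.
cohort = pd.DataFrame({
    "patient_id": range(6),
    "is_case": [True, True, False, False, False, False],
    "has_normal_glucose_lab": [False, False, True, False, True, False],
    "n_visits": [10, 8, 2, 1, 9, 3],
})

# Definition A (strict): controls must have a documented normal glucose test.
controls_a = cohort[~cohort.is_case & cohort.has_normal_glucose_lab]

# Definition B (lenient): any non-case with at least one recorded visit.
controls_b = cohort[~cohort.is_case & (cohort.n_visits >= 1)]

# A sensitivity analysis re-runs the association under each definition and
# compares estimates; here we only show how the control pools differ.
print(len(controls_a), len(controls_b))
```

If the downstream association estimate is stable across such definitions, the finding is less likely to be an artifact of the missingness mechanism.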
Exploring routinely collected data for biomarkers
The integration of clinical and genomic data in EHR systems allows for easy access to patients' clinical information alongside their genomic information. Consortia like i2b2 and eMERGE have led the way in having an EHR with patient-matched genomic information by routinely biobanking collected blood samples [22,23]. This comprehensive phenotyping in EHR data also enables the examination of variants against a larger array of outcomes, which might otherwise be missed in studies that look at specific clinical outcomes. For example, with EHR access, it is convenient to conduct phenome-wide association studies (PheWAS), which can uncover associations between a genetic variant and a panel of phenotypic outcomes. Access to the full array of phenotypes also allows for studies that reveal pleiotropic effects of genes [24]. Exploratory analyses with previously unrelated phenotypes could also help in the discovery of clinically meaningful comorbidities associated with the genetic variant. Continuous efforts are also being made to better integrate genomic test results into EHRs. Typically, these results are included in EHRs as summaries and reports, but the development of Fast Healthcare Interoperability Resources (FHIR) as well as knowledge-driven user interfaces which prioritize gene variant data are beginning to make these data accessible to a growing number of EHR users [25]. FHIR interfaces can connect proprietary sequencing platforms with EHRs, exposing gene variant data for presentation to the end user of the EHR. In Alterovitz et al. [25], three representative apps based on FHIR are demonstrated to test end-to-end feasibility, including integration of genomic and clinical data.
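The core of a PheWAS is a loop that tests one variant against many phenotypes. The following minimal sketch, on synthetic data with invented phenotype names, computes a Haldane-corrected odds ratio per phenotype; a real PheWAS would use regression models with covariate adjustment and multiple-testing correction:

```python
import pandas as pd

# Synthetic data: one variant (carrier status 0/1) against several phenotypes.
df = pd.DataFrame({
    "carrier":    [1, 1, 1, 1, 0, 0, 0, 0],
    "pheno_gout": [1, 1, 1, 0, 0, 0, 1, 0],
    "pheno_ckd":  [0, 1, 0, 0, 0, 1, 0, 0],
})

def odds_ratio(df, pheno, pseudo=0.5):
    """Odds ratio of phenotype vs carrier status, with a Haldane
    continuity correction to avoid division by zero."""
    a = ((df.carrier == 1) & (df[pheno] == 1)).sum() + pseudo
    b = ((df.carrier == 1) & (df[pheno] == 0)).sum() + pseudo
    c = ((df.carrier == 0) & (df[pheno] == 1)).sum() + pseudo
    d = ((df.carrier == 0) & (df[pheno] == 0)).sum() + pseudo
    return (a * d) / (b * c)

# PheWAS loop: test the variant against every phenotype column.
results = {p: odds_ratio(df, p) for p in df.columns if p.startswith("pheno_")}
print(results)
```

Phenotypes whose odds ratios (and, in a full analysis, p-values) survive correction become candidate pleiotropic associations for follow-up.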
The utilization of EHR data is also an efficient and pragmatic way to identify patient cohorts for biomarker studies, making it ideal for the development of machine learning (ML) algorithms to identify molecular biomarkers [26] (Figure 2). For example, researchers have been able to cross-apply cerebrospinal fluid (CSF) data collected on smaller cohorts to different domains. CSF is a valuable resource for studying various diseases, including those related to immune, neurological, musculoskeletal, and neoplastic conditions. However, collecting CSF data on large patient cohorts can be challenging due to the invasive nature of the procedure. CSF is not a routinely collected clinical variable, given that only specific medical circumstances necessitate the test. Thus, any CSF data captured in the EHR can help address this limitation by providing a larger pool of longitudinal patient data that can be used to identify patients with the relevant conditions and characteristics. The wealth of longitudinal patient data available in EHRs also enables the expansion of cohorts that might otherwise be limited due to small sample sizes. The longitudinal nature of patient data from EHRs also allows for the discovery and validation of biomarkers over a period of "pre-diagnosis" time, enabling researchers to identify early indicators of disease and inform interventions for prevention or early detection. For example, complete blood count (CBC) and liver function tests (LFTs) are two routinely collected labs that provide valuable information for the discovery of new biomarkers. CBC measures the levels of different types of blood cells, while LFTs are a proxy for liver damage or inflammation. Both sets of tests can provide important insight into the overall health of an individual by capturing dynamic changes over time.
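One simple way to exploit such pre-diagnosis longitudinal labs is to summarize each patient's trajectory, for example as a least-squares slope, and use that as a candidate feature. The sketch below does this for an invented ALT series (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical longitudinal lab values (e.g., ALT from routine LFT panels),
# with days relative to the eventual diagnosis date (negative = pre-diagnosis).
labs = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2],
    "days_to_dx": [-720, -360, -30, -700, -350, -40],
    "alt_u_per_l": [22.0, 31.0, 58.0, 25.0, 24.0, 26.0],
})

def pre_dx_slope(group):
    """Least-squares slope of the lab value over time before diagnosis."""
    slope, _intercept = np.polyfit(group.days_to_dx, group.alt_u_per_l, deg=1)
    return slope

slopes = {pid: pre_dx_slope(g) for pid, g in labs.groupby("patient_id")}
print(slopes)
```

Here patient 1's rising pre-diagnosis trajectory stands out against patient 2's flat one; such trajectory features can then feed the ML models discussed above.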
Figure 2. Biomarker discovery workflow.

Integration of existing patient data with other health variables can create a more comprehensive picture of patient health. Transforming structured and unstructured data using knowledge sources generates an interpretable clinical context. Lastly, implementation of machine learning and statistical models on large-scale, diverse patient data can enable biomarker discovery. Icons in patient chart customized using BioRender.com.
It is important to note that many EHR populations (especially those linked to DNA biobanks) are of European ancestry and may not account for disease associations that present themselves in populations with non-European ancestries [1]. This could potentially lead to findings that might not accurately translate to other populations [27]. Analyzing data from large, diverse populations is essential to generating meaningful clinical insights. It is also essential to acknowledge that some data elements that might be important in accounting for outcomes, such as family history, lifestyle and behavior, and medication compliance, might not be comprehensively collected in EHRs [1,28,29]. Moreover, a link between EHR data and research cohorts such as those created through DNA biobanks is not universal, as some private practices where research is not the main mission may not have the infrastructure to facilitate such links.
Using imaging and diagnostics as digital biomarkers
Digital biomarkers from imaging, pathology, and diagnostic tests are another source of unstructured data useful for phenotyping. For example, electrocardiograms (ECGs) are used to check heart rhythm and electrical activity to identify and characterize different subtypes of heart disease. However, several challenges arise when leveraging these data effectively and at scale for biomarker discovery. One challenge is that variability in ECG signals due to patient age, sex, body mass index, and other factors, as well as the equipment used, makes it difficult to establish a baseline of normal and abnormal patterns. The lack of standardized reference datasets due to heterogeneity and the high dimensionality of data points generated by ECG tests makes it challenging to analyze and interpret data. Tools such as IntroECG extract quantitative metrics from raw ECG waveforms and provide a framework for machine learning implementation that can be instrumental in biomarker discovery [30]. Digital pathology is another area in which artificial intelligence (AI) is used to aid in pattern detection and image recognition. For example, reproducible frameworks for quantitation of amyloid and tau pathologies in Alzheimer's disease have been developed [31]. Implementation of ML approaches in digital pathology workflows can allow for a wider evaluation of interindividual differences in morphologic characteristics such as neuritic plaque area and neurofibrillary tangle densities.
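To illustrate what "extracting quantitative metrics from raw waveforms" means at its simplest, the sketch below detects R-peaks in a synthetic, idealized signal by thresholding and derives a mean heart rate. This is a toy illustration on fabricated data, not the IntroECG pipeline; real ECG peak detection must handle noise, baseline wander, and morphology:

```python
import numpy as np

# Synthetic single-lead "ECG": a flat baseline with sharp R-peaks every 0.8 s
# (75 bpm) at a 250 Hz sampling rate. Purely illustrative, not real ECG data.
fs = 250
t = np.arange(0, 10, 1 / fs)
signal = np.zeros_like(t)
signal[(np.arange(len(t)) % int(0.8 * fs)) == 0] = 1.0  # place the R-peaks

def mean_heart_rate(signal, fs, threshold=0.5):
    """Estimate heart rate (bpm) from R-peak locations above a threshold."""
    peaks = np.flatnonzero(signal > threshold)
    rr_intervals = np.diff(peaks) / fs  # seconds between successive beats
    return 60.0 / rr_intervals.mean()

print(round(mean_heart_rate(signal, fs)))  # → 75
```

Derived metrics of this kind (rates, interval variability, interval durations) become the features that downstream ML models consume.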
AI algorithms are being implemented in the field of nuclear medicine and radiology, as well. 3D imaging captured through computed tomography (CT), magnetic resonance imaging (MRI), single-photon emission computed tomography (SPECT), and positron emission tomography (PET) is subject to high variations in resolution, noise, and contrast. Decuyper et al. [32] reviewed the applications of various deep learning models in imaging data that can aid in biomarker discovery. For example, in oncology, nuclear scans can reveal information about metabolic activity, blood flow, and anatomical features of a tumor, which can help guide treatment decisions. In rheumatology, MRI and CT scans can shed light on inflammation and joint damage. This can help identify biomarkers such as the presence of certain cytokines indicative of disease severity or progression. In diabetes, a PET scan can reveal the metabolic activity of the pancreas, shedding light on the association between hormone levels and insulin resistance. In pulmonary conditions, a PET scan can reveal structural and functional changes in the lungs, allowing for identification of biomarkers such as the thickness of the airway wall associated with diagnosis and treatment response. The development of more efficient image processing technologies allows novel derived phenotypes to be extracted from EHR databases. For example, brain morphology and function can be predicted from brain images [33]. Digital diagnostic tests, such as the ones identified here, can be used to find new biomarkers for disease prognosis, monitoring, and patient stratification.
Natural language processing for information extraction
Natural language processing (NLP) of clinical notes offers the ability to extract information which may not be contained in the structured data of EHRs. Clinical notes may contain more refined information about phenotypes such as symptoms and severity. For example, within epilepsy, it is difficult to estimate a seizure burden from structured data [34]. A patient may be seizure free but receive epilepsy ICD codes because of follow-up visits to neurologists for check-ins or tied to refilling anti-seizure medications. Within a clinical note seizure frequency will often be noted. Importantly, a note may also explicitly state that a patient has been seizure free for a specified period.
Not only is seizure frequency often unavailable in structured EHR data, but in general we cannot assume that something absent from structured data has not occurred. For example, if an epilepsy patient has been seizure free for two years, but then experiences a seizure while out of town, their seizure instance may be captured in the EHR at a different health system (the one near the vacation destination). Thus, the primary neurology clinic EHR does not capture this new seizure instance; however, at the next visit, the patient is likely to tell the neurologist, who will add this to the clinical notes. Additionally, clinical notes will often justify clinical decisions for multiple potential audiences: A) providers for care synchronization by suggesting a particular plan, B) payors for prior authorization, C) medico-legal for risk mitigation, and increasingly D) to help patients better understand their care. These justifications can be useful for phenotyping particularly because they may inform us that the provider is ruling out a particular diagnosis or providing their reasoning for choosing one medication over another [35–36] (Box 1).
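At the simplest end of the NLP spectrum, a regular expression can already pull a statement like "seizure free for N years" out of a note. The toy pattern below, with invented note text, shows this baseline approach; production clinical NLP must additionally handle negation, hedging, temporality, and far more phrasing variation:

```python
import re

notes = [
    "Patient has been seizure free for 2 years on levetiracetam.",
    "Reports 3 seizures in the last month despite medication.",
    "Seizure-free for 6 months; continue current regimen.",
]

# Toy pattern for "seizure free/seizure-free for <N> <unit>".
PATTERN = re.compile(r"seizure[- ]free for (\d+)\s+(day|week|month|year)s?",
                     re.IGNORECASE)

def seizure_free_duration(note):
    """Return (count, unit) for a stated seizure-free period, else None."""
    m = PATTERN.search(note)
    return (int(m.group(1)), m.group(2).lower()) if m else None

extracted = [seizure_free_duration(n) for n in notes]
print(extracted)  # [(2, 'year'), None, (6, 'month')]
```

Even such a crude extractor, validated against chart review, can turn free-text seizure burden into a usable structured phenotype variable.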
Box 1: Importance of clinical notes.
Clinical notes may offer the potential to provide greater specificity than is available in structured data. The justification for why a lab was not ordered may allow secondary analyses to handle this missing data more appropriately. Notes may contain disease-specific clinical rating scales that are not directly entered into the EHR. For example, the Hoehn and Yahr scale [35,36] for functional disability associated with Parkinson's disease is generally not available within structured data, but may be captured within the notes of a movement disorders clinic to track disease progression and guide decision making about treatment (e.g., initiation or dosing of levodopa). Even when data are available in structured form, notes may provide information about how those data were derived. This could include specifying whether a Hoehn and Yahr score is taken when "on" levodopa, clarifying whether the evaluation was performed while a patient was taking an appropriate therapeutic dose of medication. Another example can be observed in maternal-fetal medicine. There are several ways to estimate the gestational age of a fetus, including the last menstrual period or measurements directly from ultrasounds (e.g., gestational sac size). However, at different periods of gestation, different measures have been shown to be more accurate than others. While EHRs frequently capture a gestational age value, understanding how the estimate was made can enable like-to-like comparison or appropriate uncertainty estimation.
Traditionally, phenotyping from clinical notes was done by performing manual chart review, in which a clinician or other medical personnel with appropriate expertise read through patient charts to extract specific values. This process is time-consuming and may require substantial expertise, making it potentially costly. It can also be difficult to find domain experts who are willing to undertake what can be an onerous task on top of busy workloads. To this end, it became important to scale this process more efficiently, predominantly by using natural language processing validated by manual chart review on a sampled subset of patient charts.
Recently, large language models (LLMs) have emerged and demonstrated promise for extracting phenotypic information from clinical notes. For general NLP purposes, large language models have represented a giant step forward on a multitude of tasks [38]. Domain-specific clinical models such as Med-PaLM [39] have already shown state-of-the-art performance on question banks designed to represent medical licensing exams. This is a rapidly emerging area, and there is substantial work to be done in understanding these models, particularly in answer confidence and sourcing. Despite this, early clinical feature extraction models have demonstrated promise [40], and this is clearly an area worth paying attention to.
Absence of clinical data is not necessarily the absence of disease
The interpretation of absence of data as periods of "good health" has critical consequences for phenotyping and predictive modeling. Ideally, our medical charts would be effective trackers of our health over time. However, the reality is that EHR data are confounded by many non-medical factors. For example, if an individual receives care at multiple health centers, their data will most likely not be transferred between systems to fill in the "gaps". An individual may choose to go to a hospital system for more serious or specialty conditions, and a local practice for less serious conditions. Socioeconomic factors such as healthcare coverage and benefits, the ability to take time off from work for a doctor's visit, childcare, and public transportation offerings influence the decision to go to one medical center versus another, as well.
These instances, in addition to human errors and lack of documentation, leave holes in EHR data making it difficult to paint an accurate picture of an individual’s health journey over time. While there are instances where a clinician may no longer find it necessary to continue monitoring the labs of a stabilized patient, missing data is largely attributable to health disparities and systemic inefficiencies. Li et al. [41] find an inverse relationship between patients’ comorbidity burden and the rate of missingness in lab data. Tan et al. [42] derive clinical insights from temporal trends in patterns of missingness in multi-site COVID-19 inpatient data. By aligning both the mechanism of data missingness in model development and model deployment in the clinic, a feedback loop can be created to optimize performance of clinical prediction tools as certain variables are revealed to be more predictive than others [43]. This information could inform ways to mitigate missingness for certain variables, thereby changing the model itself.
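Before modeling, it is worth quantifying missingness directly and stratifying it by clinical factors, in the spirit of the Li et al. finding above. The sketch below, on a fabricated extract with invented column names, computes per-lab missingness rates by comorbidity burden:

```python
import pandas as pd
import numpy as np

# Hypothetical extract: NaN means the lab was never ordered for that patient.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "n_comorbidities": [0, 0, 1, 3, 4, 5],
    "hba1c": [np.nan, np.nan, 6.1, 7.4, 8.0, 6.8],
    "ldl": [np.nan, 110.0, np.nan, 130.0, 145.0, 120.0],
})

# Stratify missingness by comorbidity burden; in this toy example, as in the
# pattern Li et al. report, sicker patients have less missing lab data.
df["burden"] = np.where(df.n_comorbidities >= 2, "high", "low")
rates = df.groupby("burden")[["hba1c", "ldl"]].agg(lambda s: s.isna().mean())
print(rates)
```

A strong dependence of missingness on clinical or social factors is a warning that the data are not missing at random, which should shape both the imputation strategy and the interpretation of downstream models.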
Leveraging clinically integrated networks has the potential to identify such data biases introduced at the system level. By connecting both EHR and claims data across healthcare entities, patterns within clinical diagnostics may emerge that shed light on coding or insurance practices that hinder accurate portrayal of an individual’s health. For example, an integrated network approach may reveal that certain institutions have far more documented cases of a certain subset of symptoms in individuals with a chronic condition compared to another institution. The individual-level data may indicate that there is a difference in disease progression or characterization between the institutions, when in actuality, there is a missing subset of symptoms that could be attributable to a variety of non-pathological factors.
Imputation techniques can help to mitigate missingness in EHR data and ensure that the data used for biomarker discovery are as accurate and complete as possible. Different imputation practices present their own biases [44]. Most commonly, incomplete records are omitted from cross-sectional analyses where variables of interest are predefined. In some temporal EHR studies, last observation carried forward (LOCF) is done to propagate known data points. In model-driven imputation techniques, the outcomes of patients are assumed to be known and prediction evaluation metrics are used to benchmark the performance of imputed data. 3-Dimensional multiple imputation by chained equations (3D-MICE) is one such approach that uses sequential regression models to impute one variable at a time [47]. Jazayeri et al. [48] build on this approach by computing an additional interpatient similarity coefficient to impute lab data based on similar patient baselines. Random forest regression methods such as missForest [49] impute both continuous and categorical data by fitting a random forest on the observed data to predict the missing data. K-Nearest Neighbors (KNN) and recurrent neural network-based (RNN) methods have been used to recognize sequential time-series patterns for imputation. Generative adversarial networks (GANs) have also been used to combine synthetic data generation with predictive modeling for long- and short-term outcomes [51]. Diffusion models [52] have also shown promise in other data domains. It is important to understand the type of missingness in a dataset to select the best method for imputation.
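The two simplest temporal strategies mentioned above are easy to contrast on a toy series. The sketch below applies LOCF and, as a comparison point, linear interpolation in time to an invented longitudinal lab; the dates and values are fabricated for illustration:

```python
import pandas as pd
import numpy as np

# Longitudinal lab values for one patient; NaN = not measured at that visit.
series = pd.Series(
    [98.0, np.nan, np.nan, 104.0],
    index=pd.to_datetime(["2021-01-01", "2021-03-01", "2021-06-01", "2021-09-01"]),
)

# Last observation carried forward (LOCF): propagate the previous value.
locf = series.ffill()

# An alternative: linear interpolation weighted by elapsed time.
interp = series.interpolate(method="time")

print(locf.tolist())
print(interp.round(1).tolist())
```

Note how the two methods encode different assumptions: LOCF asserts the value was stable until re-measured, while interpolation asserts a smooth drift between measurements; neither is appropriate when the very act of not ordering the lab is informative.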
Conclusions and future directions of EHR-based biomarker research
As we have discussed in this review, there is a tremendous amount of data in the EHR that can be used for biomarker research. The availability of very large datasets enables association testing even within rare diseases or for small effect sizes. Data from diverse populations of participants, with a breadth of measurements collected as part of routine clinical care, come at minimal cost to researchers and patients because they are captured within the clinical workflow rather than in separate research study workflows (which cost both time and money). While there are numerous challenges, the potential of this research warrants methodological or other mitigation of these challenges as well as appropriate acknowledgement of limitations. We highlight some of the outstanding questions in the Outstanding Questions Box.
First, considering how EHR data are generated in the context of healthcare delivery is incredibly important. As discussed, data are captured as part of routine clinical care, without a lens for research. Are the data being entered uniformly across clinics? Are the data being entered in the same way by different providers? Are all the important data elements being populated for each patient? Are data missing for specific patient groups for reasons unrelated to disease, but instead related to social factors (age, race, sex, socioeconomic status, health insurance, etc.)? Each of these questions should be considered as we extract data from an EHR. Next, we should consider enhancing data collection in the clinic in ways that would improve not only clinical decision making but also research depth. For example, many EHR vendors have started to create modules to collect Social Determinants of Health (SDOH). This is largely in response to a 2015 report from the National Academy of Medicine [63] in which the NAM committee identified multiple domains and measures of SDOH that are important for health and disease and that should be better captured in an EHR. While advances have been made since 2015, EHRs still lack many important SDOH measures such as sexual orientation and geolocation data. These measures are examples where direct collection of data from patients would likely enhance the completeness of the EHR data in a more ethical and unbiased manner. Another category of data that is not currently collected broadly is behavioral and lifestyle factors. Capturing data on patient exposures and activities in the real world (outside of the clinic) could make a significant impact on their healthcare as well as on research capability with EHR data [64]. Smartphones and other wearable technologies may be another streamlined approach to collect and integrate these types of data with the EHR [65].
Along with collecting more diverse domains of data to supplement the EHR and capture more of the human condition outside the clinic, the quality control and interoperability of the data are additional challenges to consider. Because these data are not collected under robust, carefully designed standard operating procedures and data collection instruments, it is critical to conduct thorough and thoughtful quality control and data cleaning before the data are used for research. As with all statistical and computational analyses, low-quality data fed into models will yield low-quality results (garbage in, garbage out). This means that carefully analyzing the distributions of variables, and ensuring that values fall within ranges that are possible for humans and are recorded in the same units for every patient in the dataset, is crucial (Box 2).
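The range and unit checks described above can be sketched in a few lines of pandas. This is a minimal illustration with hypothetical glucose values; the unit conversion factor is standard, but the plausibility bounds are assumptions for demonstration, not clinical thresholds.

```python
import pandas as pd

# Hypothetical extract of serum glucose results pooled from two clinics.
labs = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "glucose": [95.0, 5.3, 110.0, 9000.0],  # mixed mg/dL and mmol/L, plus one entry error
    "unit": ["mg/dL", "mmol/L", "mg/dL", "mg/dL"],
})

# 1) Harmonize units: convert mmol/L to mg/dL (1 mmol/L ≈ 18.016 mg/dL).
mmol = labs["unit"] == "mmol/L"
labs.loc[mmol, "glucose"] = labs.loc[mmol, "glucose"] * 18.016
labs.loc[mmol, "unit"] = "mg/dL"

# 2) Flag values outside a physiologically plausible range (assumed bounds).
PLAUSIBLE = (10, 2000)  # mg/dL
labs["implausible"] = ~labs["glucose"].between(*PLAUSIBLE)

print(labs[["patient_id", "glucose", "implausible"]])
```

In a real pipeline, each analyte would have its own LOINC-specific units and plausibility bounds, and flagged values would be reviewed rather than silently dropped.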
Box 2: Use of AI and ML in medicine.
While there is tremendous excitement about the utility of AI and ML in medicine, there is also significant concern about structural racism and bias in healthcare. These biases are captured in EHR data and have the potential to be exacerbated through the use of AI/ML. Many Americans seek healthcare where their insurance is considered in-network, or at the clinic closest to their home or work. When their insurance, job, or home changes, they go to a different clinic. As a result, their longitudinal clinical data are captured across many EHR systems that are not well integrated. The ability to fill in gaps in missing data by linking records across these disparate EHR systems could substantially reduce this bias. Technologies such as The Blue Button Project [69,70] are designed to give patients access to their EHR data and empower them to share those data with other providers. In reality, however, the education and technical savvy needed to use such tools are likely to be barriers for the very patient population whose EHR data are sparsest across health systems.
Developing technology and simple policies that enable health systems and smaller clinics to easily share patient data, while maintaining patient privacy and protections, is sorely needed. Still, the future of biomedical research utilizing EHRs is incredibly exciting. The recent explosion of large language models such as ChatGPT [66] and GPT-3 has spurred a race to answer questions about the probability of future healthcare events and outcomes in patients, and to develop tools such as voice-to-text transcription of clinic visits.
As we look forward, integration across large-scale health, insurance, consumer, and publicly available data will be needed to build effective patient-centric models of disease progression. Coupled with the development of robust biomarker identification techniques, personalized medicine approaches are well positioned to improve healthcare outcomes for patients worldwide.
Clinician’s corner.
EHRs offer huge potential in biomarker discovery for several reasons: 1) substantially larger sample sizes enable association testing within rare diseases and the identification of smaller effect sizes; 2) populations are more diverse, with measurements taken as part of routine clinical care; and 3) because these data are collected as part of clinical care, they are instantly available in a clinical data warehouse for free or at minimal cost, meaning that patients do not need to be actively recruited or go through research data collection procedures.
Many clinical variables can be repurposed for association testing across a wide spectrum of conditions, accelerating the discovery process.
Longitudinal EHR data capture disease progression and treatment response over time and can be particularly useful in identifying biomarkers that change in response to different interventions, such as drug therapies or lifestyle changes.
Outstanding Questions.
Are data entered into the EHR uniformly across different clinics, by different providers, and across all patients?
Are data missing for specific patient groups for reasons unrelated to disease but instead related to social factors (age, race, sex, socioeconomic status, health insurance, etc.)?
How can we improve the data capture of social determinants of health to integrate with EHR clinical data for research and also clinical care?
How can we integrate data from wearables/mobile devices for research and/or clinical care?
What steps are needed to properly conduct thorough and thoughtful quality control and data cleaning activities before the EHR data are used for research?
Can the structured clinical data along with the unstructured clinical notes be used to build a large language model (LLM) that can then be queried to ask questions about the probability of future health care events/outcomes in specific patients?
Could these types of models be used in combination with ambient listening devices and voice-to-text transcription to develop transcripts of clinic visits and create content for the structured data in the EHR and drafts of provider documentation to include in the clinical notes?
How do we prevent structural racism and bias in the EHR from propagating into the AI and Machine Learning models being implemented on EHR data?
Can we fill in the gaps in the missing EHR data by linking the data from disparate EHR systems in different health systems to reduce healthcare access/utilization bias?
Highlights.
Electronic health records (EHRs) have become increasingly relied upon as a source for biomedical research.
Many different data types are available in EHRs, including diagnoses, laboratory measurements, imaging and digital diagnostic data for identifying digital biomarkers, and clinical notes.
There are challenges to consider when using EHR data, including potential bias in the availability of EHR data and the implications of missing data in EHRs.
Different modalities of longitudinal patient data can be used to investigate dynamic changes in health and inform biomarker discovery.
There is a bright and promising future for the EHR, including advances in data capture, quality control, interoperability, governance, and utility.
Acknowledgements
PS is supported by F31 AG069441-01. MDR is supported by R01GM138597, UL1-TR-001878, R01HG010067, and R01HG012670. BBJ was supported by the NINDS of the National Institutes of Health under award number: K99NS114850. TGD is supported by K08DK127247, R01HG012670, and the Burroughs Wellcome Fund.
During the preparation of this work the author(s) used ChatGPT in order to create the definitions in the glossary. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.
Glossary
- Electronic health records
Electronic health records are digital records of a patient’s medical history, including diagnoses, treatments, and test results. EHRs can help improve communication and coordination among healthcare providers and can provide a more complete picture of a patient’s health.
- eMERGE
eMERGE (Electronic Medical Records and Genomics) is a research network that aims to combine electronic health records and genomic data to improve our understanding of disease risk and develop more personalized treatments.
- Fast Healthcare Interoperability Resources (FHIR)
FHIR is a standard for exchanging healthcare information electronically. FHIR enables interoperability between different healthcare systems and applications and can help improve the speed and accuracy of healthcare data exchange.
- Generative adversarial networks (GANs)
Generative adversarial networks are a type of machine learning model that consists of two networks - a generator and a discriminator. The generator generates fake data, while the discriminator tries to distinguish the fake data from real data. The two networks are trained together, with the generator improving over time to create more realistic fake data.
- i2b2
i2b2 (Informatics for Integrating Biology and the Bedside) is an open-source software framework designed for clinical research. It allows for querying and analyzing clinical data in a secure and efficient manner.
- K-Nearest Neighbors (KNN) methods
K-Nearest Neighbors (KNN) methods are a type of machine learning algorithm used for classification and regression tasks. KNN methods involve finding the K closest data points to a new data point and using these points to predict the label or value of the new data point.
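As a minimal illustration of the KNN definition above, the following pure-Python sketch classifies a query point by majority vote among its nearest neighbors; the data points and labels are entirely hypothetical.

```python
from collections import Counter
import math

# Toy KNN classifier over hypothetical 2-D feature vectors.
def knn_predict(train, labels, query, k=3):
    # Rank training points by Euclidean distance to the query.
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    # Majority vote among the k nearest neighbors.
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

train = [(1, 1), (1, 2), (8, 8), (9, 9), (8, 9)]
labels = ["low", "low", "high", "high", "high"]
print(knn_predict(train, labels, (2, 2)))  # nearest neighbors are mostly "low"
```

For regression, the vote is replaced by an average of the neighbors' values; KNN imputation of missing EHR lab values works the same way, averaging over the most similar patients.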
- Large language models (LLMs)
Large language models (LLMs) are machine learning models that are trained on vast amounts of text data to generate human-like language. LLMs can be used for various natural language processing tasks, such as language translation, chatbots, and text summarization.
- Last observation carried forward (LOCF)
Last observation carried forward (LOCF) is a statistical method used to impute missing data in longitudinal studies. LOCF involves carrying forward the last observed value of a variable for subsequent time points where data is missing.
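A minimal sketch of LOCF with pandas, using hypothetical HbA1c values: each patient's missing visits are filled with that patient's most recent earlier value, and gaps before a patient's first observation remain missing.

```python
import pandas as pd

# Hypothetical longitudinal lab panel with gaps.
visits = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "visit":      [1, 2, 3, 1, 2],
    "hba1c":      [6.1, None, 6.4, None, 7.2],
})

# LOCF: forward-fill within each patient, never across patients.
visits["hba1c_locf"] = visits.groupby("patient_id")["hba1c"].ffill()
print(visits)
```

Note that grouping by patient before forward-filling is essential; a naive `ffill` over the whole table would carry one patient's value into the next patient's record.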
- LOINC
Logical Observation Identifiers Names and Codes (LOINC) is a standardized system for identifying and naming medical laboratory tests and observations. LOINC codes can help facilitate interoperability and data exchange across different healthcare systems and applications.
- 3-Dimensional multiple imputation by chained equation (3-D MICE)
An extension of multiple imputation by chained equations (MICE) that integrates cross-sectional (across-patient) and longitudinal (within-patient) information to impute missing values in multi-analyte clinical time series.
- Random forest regression methods
A type of machine learning algorithm that is used for predictive modeling. It involves creating multiple decision trees and combining their predictions to make a more accurate prediction.
- Recurrent neural network-based (RNN) methods
These are machine learning algorithms that are used for sequential data, such as time series or language data. They are commonly used in natural language processing and speech recognition.
References
- 1. Abul-Husn NS and Kenny EE (2019) Personalized Medicine and the Power of Electronic Health Records. Cell 177, 58–69
- 2. Mersha TB and Abebe T (2015) Self-reported race/ethnicity in the age of genomic research: its potential impact on understanding health disparities. Hum. Genomics 9, 1
- 3. Hirsch JA et al. (2016) ICD-10: History and Context. AJNR Am. J. Neuroradiol 37, 596–599
- 4. Dotson P (2013) CPT® Codes: What Are They, Why Are They Necessary, and How Are They Developed? Adv. Wound Care 2, 583–587
- 5. Forrey AW et al. (1996) Logical observation identifier names and codes (LOINC) database: a public use set of codes and names for electronic reporting of clinical laboratory test results. Clin. Chem 42, 81–90
- 6. Chiu P-H and Hripcsak G (2017) EHR-based phenotyping: Bulk learning and evaluation. J. Biomed. Inform 70, 35–51
- 7. Banda JM et al. (2018) Advances in Electronic Phenotyping: From Rule-Based Definitions to Machine Learning Models. Annu. Rev. Biomed. Data Sci 1, 53–68
- 8. Ritchie MD (2018) Large-Scale Analysis of Genetic and Clinical Patient Data. Annu. Rev. Biomed. Data Sci 1, 263–274
- 9. Kho AN et al. (2012) Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study. J. Am. Med. Inform. Assoc 19, 212–218
- 10. Peissig PL et al. (2012) Importance of multi-modal approaches to effectively identify cataract cases from electronic health records. J. Am. Med. Inform. Assoc 19, 225–234
- 11. Kirby JC et al. (2016) PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J. Am. Med. Inform. Assoc 23, 1046–1052
- 12. Dumitrescu L et al. (2017) Genome-wide study of resistant hypertension identified from electronic health records. PLoS One 12, e0171745
- 13. Crosslin DR et al. (2012) Genetic variants associated with the white blood cell count in 13,923 subjects in the eMERGE Network. Hum. Genet 131, 639–652
- 14. Ritchie MD et al. (2014) Electronic medical records and genomics (eMERGE) network exploration in cataract: several new potential susceptibility loci. Mol. Vis 20, 1281–1295
- 15. Heit JA et al. (2017) Identification of unique venous thromboembolism-susceptibility variants in African-Americans. Thromb. Haemost 117, 758–768
- 16. Robinson PN et al. (2008) The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am. J. Hum. Genet 83, 610–615
- 17. Randorff Højen A and Rosenbeck Gøeg K (2012) SNOMED CT Implementation. Methods Inf. Med 51, 529–538
- 18. Vreeman DJ et al. (2010) LOINC® - A Universal Catalog of Individual Clinical Observations and Uniform Representation of Enumerated Collections. Int. J. Funct. Inform. Personal. Med 3, 273–291
- 19. Wei W-Q and Denny JC (2015) Extracting research-quality phenotypes from electronic health records to support precision medicine. Genome Med. 7, 41
- 20. Pacheco JA and Lead WTN (2011) Type 2 diabetes mellitus electronic medical record case and control selection algorithms [Online]. Available: http://www.phekb.org/sites/phenotype/files/T2DM-algorithm.pdf [Accessed: 17-Mar-2023]
- 21. Huang Y et al. (2021) Illustrating potential effects of alternate control populations on real-world evidence-based statistical analyses. JAMIA Open 4, ooab045
- 22. Murphy SN et al. (2010) Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J. Am. Med. Inform. Assoc 17, 124–130
- 23. McCarty CA et al. (2011) The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med. Genomics 4, 13
- 24. Kanai M et al. (2018) Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases. Nat. Genet 50, 390–400
- 25. Alterovitz G et al. (2015) SMART on FHIR Genomics: facilitating standardized clinicogenomic apps. J. Am. Med. Inform. Assoc 22, 1173–1178
- 26. Wells QS et al. (2019) Accelerating Biomarker Discovery Through Electronic Health Records, Automated Biobanking, and Proteomics. J. Am. Coll. Cardiol 73, 2195–2205
- 27. Manrai AK et al. (2016) Genetic Misdiagnoses and the Potential for Health Disparities. N. Engl. J. Med 375, 655–665
- 28. Li W et al. (2019) Obtaining a Genetic Family History Using Computer-Based Tools. Curr. Protoc. Hum. Genet 100, e72
- 29. Orlando LA et al. (2013) Development and validation of a primary care-based family health history and decision support program (MeTree). N. C. Med. J 74, 287–296
- 30. Elias P et al. (2022) Deep Learning Electrocardiographic Analysis for Detection of Left-Sided Valvular Heart Disease. J. Am. Coll. Cardiol 80, 613–626
- 31. Neltner JH et al. (2012) Digital pathology and image analysis for robust high-throughput quantitative assessment of Alzheimer disease neuropathologic changes. J. Neuropathol. Exp. Neurol 71, 1075–1085
- 32. Decuyper M et al. (2021) Artificial intelligence with deep learning in nuclear medicine and radiology. EJNMMI Phys 8, 81
- 33. Elliott LT et al. (2018) Genome-wide association studies of brain imaging phenotypes in UK Biobank. Nature 562, 210–216
- 34. Villamar MF et al. (2022) Severity of Epilepsy and Response to Antiseizure Medications in Individuals With Multiple Sclerosis: Analysis of a Real-World Dataset. Neurol. Clin. Pract 12, e49–e57
- 35. Goetz CG et al. (2004) Movement Disorder Society Task Force report on the Hoehn and Yahr staging scale: status and recommendations. The Movement Disorder Society Task Force on rating scales for Parkinson's disease. Mov. Disord 19, 1020–1028
- 36. Hoehn MM and Yahr MD (1998) Parkinsonism: onset, progression, and mortality. 1967. Neurology 50, 318 and 16 pages following
- 37. Bodenreider O (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32, D267–70
- 38. Brown TB et al. (2020) Language Models are Few-Shot Learners. arXiv [cs.CL]
- 39. Singhal K et al. (2022) Large Language Models Encode Clinical Knowledge. arXiv [cs.CL]
- 40. Agrawal M et al. (2022) Large language models are few-shot clinical information extractors. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 1998–2022
- 41. Li J et al. (2021) Imputation of missing values for electronic health record laboratory data. NPJ Digit. Med 4, 147
- 42. Tan ALM et al. (2023) Informative missingness: What can we learn from patterns in missing laboratory data in the electronic health record? J. Biomed. Inform 139, 104306
- 43. Groenwold RHH (2020) Informative missingness in electronic health record systems: the curse of knowing. Diagn. Progn. Res 4, 8
- 44. Beaulieu-Jones BK and Moore JH (2017) Missing data imputation in the electronic health record using deeply learned autoencoders. Pac. Symp. Biocomput 22, 207–218
- 45. Cesare N and Were LPO (2022) A multi-step approach to managing missing data in time and patient variant electronic health records. BMC Res. Notes 15, 64
- 46. Azur MJ et al. (2011) Multiple imputation by chained equations: what is it and how does it work? Int. J. Methods Psychiatr. Res 20, 40–49
- 47. Luo Y et al. (2018) 3D-MICE: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data. J. Am. Med. Inform. Assoc 25, 645–653
- 48. Jazayeri A et al. (2020) Imputation of Missing Data in Electronic Health Records Based on Patients' Similarities. Int. J. Healthc. Inf. Syst. Inform 4, 295–307
- 49. Stekhoven DJ and Bühlmann P (2011) MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118
- 50. Beaulieu-Jones BK et al. (2018) Characterizing and Managing Missing Structured Data in Electronic Health Records: Data Analysis. JMIR Med. Inform 6, e11
- 51. Gupta M et al. (2021) Concurrent Imputation and Prediction on EHR data using Bi-Directional GANs: Bi-GANs for EHR imputation and prediction. ACM BCB 2021
- 52. Ho J et al. (2020) Denoising Diffusion Probabilistic Models. arXiv [cs.LG]
- 53. Dagliati A et al. (2020) Using topological data analysis and pseudo time series to infer temporal phenotypes from electronic health records. Artif. Intell. Med 108, 101930
- 54. Zhou F et al. (2020) Use of disease embedding technique to predict the risk of progression to end-stage renal disease. J. Biomed. Inform 105, 103409
- 55. Herrero-Zazo M et al. (2023) Using Machine Learning to Model Older Adult Inpatient Trajectories From Electronic Health Records Data. iScience, DOI: 10.1016/j.isci.2022.105876
- 56. Watson C et al. (2022) Latent class trajectory modelling: impact of changes in model specification. Am. J. Transl. Res 14, 7593–7606
- 57. Haue AD et al. (2022) Temporal patterns of multi-morbidity in 570157 ischemic heart disease patients: a nationwide cohort study. Cardiovasc. Diabetol 21, 87
- 58. Singhal P et al. (2022) DETECT: Feature extraction method for disease trajectory modeling. bioRxiv
- 59. do Valle IF et al. (2022) Network-medicine framework for studying disease trajectories in U.S. veterans. Sci. Rep 12, 12018
- 60. Lim B and van der Schaar M (2018) Disease-Atlas: Navigating Disease Trajectories using Deep Learning. In Proceedings of the 3rd Machine Learning for Healthcare Conference, 85, pp. 137–160
- 61. Motahari-Nezhad H et al. (2022) Digital Biomarker-Based Studies: Scoping Review of Systematic Reviews. JMIR Mhealth Uhealth 10, e35722
- 62. Dinh-Le C et al. (2019) Wearable Health Technology and Electronic Health Record Integration: Scoping Review and Future Directions. JMIR Mhealth Uhealth 7, e12861
- 63. Institute of Medicine et al. (2015) Capturing Social and Behavioral Domains and Measures in Electronic Health Records: Phase 2. National Academies Press
- 64. McCarthy MM et al. (2021) Implementing the physical activity vital sign in an academic preventive cardiology clinic. Prev. Med. Rep 23, 101435
- 65. Patel MS et al. (2020) Smartphones vs Wearable Devices for Remotely Monitoring Physical Activity After Hospital Discharge: A Secondary Analysis of a Randomized Clinical Trial. JAMA Netw. Open 3, e1920677
- 66. van Dis EAM et al. (2023) ChatGPT: five priorities for research. Nature 614, 224–226
- 67. Patel SB and Lam K (2023) ChatGPT: the future of discharge summaries? Lancet Digit. Health 5, e107–e108
- 68. Ali SR et al. (2023) Using ChatGPT to write patient clinic letters. Lancet Digit. Health, DOI: 10.1016/S2589-7500(23)00048-1
- 69. Mohsen MO and Aziz HA (2015) The Blue Button Project: Engaging Patients in Healthcare by a Click of a Button. Perspect. Health Inf. Manag 12, 1d
- 70. Klein DM et al. (2015) Use of the Blue Button Online Tool for Sharing Health Information: Qualitative Interviews With Patients and Providers. J. Med. Internet Res 17, e199
