Abstract
The electronic Medical Records and Genomics (eMERGE) Network assessed the feasibility of deploying portable phenotype rule-based algorithms with natural language processing (NLP) components added to improve performance of existing algorithms using electronic health records (EHRs). Based on scientific merit and predicted difficulty, eMERGE selected six existing phenotypes to enhance with NLP. We assessed performance, portability, and ease of use. We summarized lessons learned by: (1) challenges; (2) best practices to address challenges based on existing evidence and/or eMERGE experience; and (3) opportunities for future research. Adding NLP resulted in improved, or the same, precision and/or recall for all but one algorithm. Portability, phenotyping workflow/process, and technology were major themes. With NLP, development and validation took longer. Besides portability of NLP technology and algorithm replicability, factors to ensure success include privacy protection, technical infrastructure setup, intellectual property agreement, and efficient communication. Workflow improvements can improve communication and reduce implementation time. NLP performance varied mainly due to clinical document heterogeneity; therefore, we suggest using semi-structured notes, comprehensive documentation, and customization options. NLP portability is possible with improved phenotype algorithm performance, but careful planning and architecture of the algorithms is essential to support local customizations.
Subject terms: Translational research, Genetics research, Communication and replication, Computational models, Data integration, Data mining, Data processing, Machine learning, Programming language, Software, Standards
Introduction
Accurate extraction of complete and detailed phenotypic information from large-scale electronic health record (EHR) data improves efficiency and accuracy of precision medicine research. However, structured data alone is often insufficient to fully identify or describe many conditions, particularly when an attribute is not commonly billed for or requires nuanced interpretation1–4. Natural language processing (NLP) and machine learning (ML) promise to enable deep phenotyping using nuanced EHR narratives5–8.
Both sophisticated NLP pipelines, such as MedLEE9, CLAMP10, cTAKES11 and MetaMap12,13; and simpler rule-based approaches combining regular expressions (RegEx) and logic; have increasingly been leveraged for deep phenotyping14. However, it is challenging to achieve broad generalizability and phenotype algorithm portability given the disparate EHR systems and heterogeneous documentation approaches used by clinicians15. For instance, Sohn et al. reported how variations in asthma related clinical documentation between two cohorts affect NLP system portability16. Additionally, document types and structures vary among EHRs, and some sites have more unstructured data than others. Abbreviations, terminologies, and other language usage also varies across sites, clinicians, and time. For example, Adekkanattu et al. reported variability in system performance due to the heterogeneity of local text formats and lexical terms used to document various concepts, across three different institutions assessing the portability of a specialized echocardiography information extraction system17.
The biomedical NLP community has developed a number of approaches to address these issues, including measuring semantic similarity of text, deploying ensemble NLP systems, using comprehensive term dictionaries, and converting text into data standards, such as Fast Health Interoperability Resources (FHIR) and the Observational Medical Outcomes Partnership (OMOP) common data model (CDM)18. Specifically, Liu et al.19 demonstrated that ensembles of NLP systems can improve portability through both generic phenotypic concept recognition and patient specific phenotypic concept identification over individual systems. Furthermore, Jiang et al. leveraged the FHIR standard to develop a scalable data normalization pipeline that integrates both structured and unstructured clinical data for phenotyping20. Lastly, Sharma et al. developed a portable NLP system by extracting phenotype concepts, normalizing them using Unified Medical Language System (UMLS), and mapping them to the OMOP CDM21.
The eMERGE (electronic MEdical Records and GEnomics) Network was organized and funded by the National Human Genomic Research Institute (NHGRI) in 2007 to study the intersection of genomics and EHRs22–26. One of the network’s most enduring contributions is the development of computable phenotypes to identify common diseases within EHRs for genetic research. Each phenotype algorithm is validated across multiple sites and is publicly available in the Phenotype KnowledgeBase (PheKB.org)27. Over the past fourteen years, the eMERGE Network has accumulated considerable experience in phenotyping algorithm development, validation, and implementation17,22–25,28–32. This collaboration among multiple participating institutions provides rare opportunities to explore NLP performance and portability for the ‘big data’ in EHRs across diverse settings. An ongoing critical task remains identification of the knowledge gap of best practices in development, validation, and implementation of portable phenotype algorithms using NLP.
Objective
One of the goals of phase III of the eMERGE Network (2015–2020) was to incorporate NLP/ML into existing eMERGE phenotype algorithms to improve their performance and/or better ascertain sub-phenotypes. To that end, in 2019–2020, a 1 year pilot study was conducted to test the feasibility of deploying portable phenotype algorithms that incorporated NLP components into existing rules-based phenotype algorithms. Specifically, we aimed to use NLP to identify sub-populations and improve existing phenotype algorithms. As we are identifying cases (and also sometimes controls) for genetic research, having the highest number possible of accurately identified patients (cases) with the given phenotype is important. Thus, improvement was defined as either improved recall, to increase the number of cases; and/or improved precision to correctly identify a higher percentage of true cases. We hypothesized that development of portable, accurate, and efficient NLP tools for multi-site application depends on the availability of intra- and inter-site human and technological resources, due to highly variable experience in the field, including among our sites. These must be capable of exposing and addressing the various sources of heterogeneity, such as different environments, which impact an NLP system’s ability to accurately extract information. Reflecting on this eMERGE work, the objective of this paper is to: (1) report challenges we faced during implementation of eMERGE phenotype algorithms with NLP/ML- components added and, (2) recommend best practices we encountered and/or found upon review, to help others overcome those challenges, in order to implement portable phenotype algorithms, especially those with NLP/ML components.
Materials and methods
In order to achieve these objectives, an NLP sub-workgroup of the eMERGE Phenotyping Workgroup was formed that included representatives from nine eMERGE sites: Children’s Hospital of Philadelphia (CHOP), Cincinnati Children’s Hospital Medical Center (CCHMC), Columbia University, Geisinger, Harvard/Mass General Brigham, Kaiser Permanente Washington and the University of Washington (KPWA/UW), Mayo Clinic, Northwestern University (NU), and Vanderbilt University Medical Center (VUMC). Based on scientific merit and predicted difficulty, the group selected six phenotypes with existing computable phenotype algorithms to enhance with NLP: chronic rhinosinusitis (CRS)33, electrocardiogram (ECG) traits34, systemic lupus erythematosus (SLE)35, asthma/chronic obstructive pulmonary disease (COPD) overlap (ACO)36, familial hypercholesterolemia (FH)37, and atopic dermatitis (AD)38. All of the algorithms were case – control algorithms; specifically, cases were patients with, and controls without, the phenotype, as defined by each algorithm. Sub-phenotypes included traits on ECG reports such as Brugada syndrome, CRS with and without nasal polyps, and sub-types of SLE and AD.
To reduce study heterogeneity to accommodate time limitations and to lower barriers to implementation by clinicians with minimal NLP training, we restricted NLP pipelines to those with which we had experience39–44, and that were a reasonable reflection of the variety of NLP tools currently used in healthcare settings, as seen in a recent review45. To this end, NLP platform selection was based on a survey of platforms that sites had the most experience using. The selected tools were: cTAKES11, MetaMap12,13, and/or regular expressions (RegEx), along with two commonly adopted negation detection modules: NegEx and ConText46,47, which are rule-based. The modified AD and COPD/ACO phenotype algorithms also had ML components, for which custom code written in Python and Java was used, respectively. The phenotypes, along with goals and selected tools, are shown in Table 1, and more details of the algorithms are available on PheKB.org27.
Table 1.
Phenotype | Lead | Validation | Goal | Tool(s) |
---|---|---|---|---|
Chronic rhinosinusitis | Geisinger | NU | Improve precision | cTAKES |
ECG traits | VUMC | Mayo clinic | Enrich phenotype & extract sub-phenotypes | RegEx |
Systemic lupus erythematosus | NU | VUMC | Improve sensitivity & extract sub-phenotypes | MetaMap w/RegEx |
Asthma/chronic obstructive pulmonary disease overlap | Harvard | KPWA/UW | Improve sensitivity | RegEx, Java |
Familial hypercholesterolemia | Mayo clinic | Geisinger | Improve precision | cTAKES |
Atopic dermatitis | NU | CHOP, Marshfield clinic, Geisinger | Improve sensitivity (for adults) & extract sub-phenotypes | cTAKES, RegEx, Python |
NU Northwestern University, VUMC Vanderbilt University Medical Center, KPWA/UW Kaiser Permanente Washington and the University of Washington, CHOP The Children’s Hospital of Philadelphia.
To validate the phenotype algorithms according to our objectives, we focused on validating if patients were correctly identified as cases (and/or controls) by both the original and new NLP-enhanced phenotype algorithms. The original algorithms were previously validated33–38. Then for this study, the “lead” (primary) site added NLP component(s) to the original algorithm, which they had previously led (with one exception, AD, which was previously led by a pediatric site, but in this pilot project was led by a site focused on adults). Then the lead site validated the NLP/ML-enhanced phenotype algorithm, via manual chart review of randomly selected subsets of: patients’ charts, and as needed, clinical notes for those patients. Next, as is typical in the development of eMERGE phenotype algorithms23, the lead site worked with at least one “validation” (secondary) site to further adjust the algorithms as needed, until satisfactory precision and recall was achieved, calculated via the manual reviews. Specifically, the eMERGE network’s phenotype algorithm validation procedures23, which were used here, involve sites having clinicians experienced in diagnosing and treating the given phenotype, or medical professionals who are highly trained, to ascertain presence or absence of the phenotype in the entire patient health record (not just the clinical text), and if necessary its detailed characteristics, such as signs and symptoms. As also is typical within eMERGE, if possible, at least 2 people reviewed the charts and also reviewed at least a few of the same charts in the beginning to ensure inter-rater reliability, while a more senior person adjudicates any differences where possible; or, if there is only a single reviewer, the person is an expert for the phenotype. For example, for the ACO phenotype development, 2 pulmonologists reviewed and a 3rd pulmonologist reconciled discordant labels; while at KPWA for the same phenotype, chart reviews were conducted by one professional non-clinician chart abstractor with access to an MD clinician to assist the abstractor in resolving any questions/concerns that were beyond the abstractors’ competency. Similarly, at Mayo and Geisinger, a single MD reviewed the charts and at VUMC, a senior cardiologist reviewed all ECG reports and for SLE, a rheumatologist doing SLE research did that review. A lead site reviews approximately 50 patients’ charts and at least one validation (secondary) site subsequently reviews approximately 25 charts: the number of charts reviewed is sometimes higher depending on the phenotype23–27, which did occur in this study. If the phenotype algorithm is for identifying both cases and controls, then the total number of charts reviewed includes both (for example, approximately 25 potential cases and 25 potential controls when the total is 50 charts reviewed)23–27, as seen for multiple phenotypes in this study. Finally, the phenotype algorithms were disseminated to all participating sites for implementation, and further iteratively improved as needed based on feedback from implementing sites. Final accuracy statistics were re-calculated, if necessary, after all modifications were made for reporting here.
We then retrospectively compared NLP methods and tools to assess performance, portability, and ease of use. To do this, we asked sites to report their lessons learned for creating and sharing NLP/ML algorithms via a brief informal survey about each phenotype algorithm they developed, validated, and/or implemented (questions asked are listed in Supplementary Appendix A). Quantitatively, sites were asked to report performance (especially recall and precision) at both the lead and (secondary) validation sites, for both the original and modified (NLP added) phenotype algorithms. Sites were also asked to estimate the amount of resources and time it took to complete development, validation, and implementation. These estimates were based on approximations after the work was completed. In addition, since personnel typically did not spend 100% of their time on the algorithms, time estimates are variable as they are dependent on proportion of effort. Also, some sites optionally separated the expertise of people needed to complete the task (e.g. clinical, informatics, and EHR analysts). Physical resources were reported as the number of servers needed to query the data and/or execute the algorithms. Qualitatively, sites were asked to report on how difficult they felt each algorithm was to implement; how portable it was, including any local customizations that were needed for the algorithm to perform; and any other issues identified by sites when sharing, including technical or performance issues. Additional qualitative feedback on the experience was informally collected at monthly workgroup meetings and from direct emails from sites.
Using grounded theory48, a thematic analysis was conducted by two authors (JAP, LVR) via independent review of all qualitative feedback. First, open and axial coding on categories of issues or concerns was completed to identify key phrases and loosely categorize them. The coders used selective coding to refine axial codes into a comprehensive hierarchical codebook, independently re-coded the feedback, and reviewed to achieve consensus. Emergent themes were identified through iterative review of the codes. Next, we prepared a review and summary of lessons learned, including (a) challenges for each theme; (b) corresponding best practices to address those challenges based on existing published evidence and/or experience of the eMERGE Network; and (c) if applicable, opportunities for future research. Finally, to assess credibility, the results were presented to co-authors, then lessons and recommendations were further refined as needed.
Ethics, consent and permissions
Informed consent was obtained from all subjects involved in the study per each site’s Institutional Review Board (IRB). The research was performed in accordance with the relevant guidelines and regulations for use of human participants’ biomedical data, including those of each site’s approved IRB protocols, and in accordance with the Declaration of Helsinki.
Results
For each phenotype algorithm, Table 2 presents accuracy statistics and personnel required. Although not reported by all sites, the roles of personnel involved included programmers, clinicians, and computational linguists. Although most sites reviewed 50–100 patients’ charts as is standard within eMERGE for validation of phenotype algorithms, the range did vary: lead sites reviewed anywhere from 46 to 972 charts, with a median of 100 charts reviewed, and validating sites reviewed 50–950 charts, with a median of 65 charts reviewed. From those patient chart reviews, for all but one algorithm (SLE, where overall the precision decreased), the precision and recall overall were unchanged or improved at both the lead (primary) and (secondary) validation sites. Changes in accuracy statistics for sub-phenotypes varied between sub-phenotypes and developing and validating sites. Differences in phenotype algorithm performance were not associated with the tools used. Only two sites noted the number of records in the EHR (containing both clinical text and discrete data such as labs) that were used, and for implementation of the final NLP/ML enhanced phenotype algorithm: for the ECG algorithm, it was noted that just over 1 million ECG records from the EHR were used in VUMC’s implementation; for the SLE algorithm, 185,838 notes were processed from 4468 patients for VUMC’s implementation; and for the AD algorithm, 4094 patients’ notes, labs, and/or codes were reviewed for another site’s implementation.
Table 2.
Phenotype | People Involved | Charts reviewed | Precision | Recall | Comments |
---|---|---|---|---|---|
Chronic rhinosinusitis | 2 | 126 | 76% → 78–83% | 97% → 100% | Also significant improvement on specificity |
ECG traits | 1–3 | 1050 | Cases: 80–100% Controls: 94–99% | N/A | Unable to extract 1 sub-phenotype; precision varied between sub-phenotypes |
Systemic lupus erythematosus | 2–3 | 1022 | 99% → 96% | 79% → 91% | 2/3 sub-phenotypes performed better at validation site |
Asthma/chronic obstructive pulmonary disease overlap | 1–2 | 300 | 90% → 91% | 38% → 54% | Although overall improved, performed worse at validation site possibly due to how the ML model used counts of features |
Familial hypercholesterolemia | 1–4 | 150 | 96–98% → 74–96% | N/A | Negative predictive value decreased |
Atopic dermatitis | 1–3 | 150 | 73–79% → 72–84% | 51–54% → 63–75% | Mixed results across sub-phenotypes & sites |
The “People Involved” column lists the estimated number or range of full-time equivalent persons involved with all aspects of the implementation, and includes programmers, clinicians, and computational linguists. Charts reviewed is the total number of patients’ charts reviewed for each phenotype, a sum of the charts reviewed for cases, and controls if applicable, at both the lead and validating sites. Precision and Recall columns list those statistics for the original computable phenotype rule-based algorithm vs. the new computable phenotype rule-based algorithm with NLP components added: arrows indicate change in these statistics from these original to new phenotype algorithms. Some algorithms have a range for precision or recall as either multiple (secondary) validation sites reviewed patients’ charts from which accuracy statistics were calculated, or there were separate precision/recall measures for sub-phenotypes. N/A not applicable: recall was not targeted for improvement in all phenotypes; thus, it was not calculated for all phenotypes.
Lastly, time to develop, and validate (including chart review) by lead and validation sites, was considerably longer than the subsequent implementation by other sites; specifically, 6 months or more for development and validation, versus only weeks for implementation. For example, ECG took 11 months to develop and validate, but sites only took anywhere from 1–3 weeks to implement. Also, sites reported that 1–2 servers were needed for executing the algorithms, although no further details were provided on server configuration.
Themes
Figure 1 shows the three major themes identified from the qualitative analysis: portability, phenotyping workflow/process, and technology. The technology theme was found to be a modifier for the other two primary themes, as all technologies were associated with another theme. This approach was used for the analysis and summarization phase to identify any recurring themes associated strongly with one or more technologies. Each of the themes is summarized in Table 3, with the full codebook available in Supplementary Appendix B.
Table 3.
Theme | Challenges |
---|---|
Portability of algorithms | Algorithm performance varies by phenotype |
Identifying the correct type(s) of notes across sites can be challenging given differences how notes are categorized | |
Well-known challenges in NLP and ML persist | |
Implementation environments | Use of different programming languages/NLP pipelines can cause delays in implementation when a site does not have local expertise |
Sites run NLP and ML in different environments, which may have different requirements for the software that can be run | |
Local changes/customization were often needed for things like file paths and document input formats | |
Data preparation steps were the most time and resource intensive | |
Privacy | Given identifiers embedded in clinical notes, sites have different requirements and restrictions on their use of notes for NLP |
Documentation | Scripts and software often lacked sufficient documentation on how to execute, and the expected output |
Phenotyping workflow/process | Communication delays between author and implementer could have compounding effects on overall time to complete |
Sharing NLP/ML pipelines with other sites may be hindered by intellectual property concerns | |
Reconsider traditional workflows to phenotyping |
Summary of the top themes found within our analysis, and a summary of the challenges reported by eMERGE sites within each theme. A full listing of themes is available in Supplementary Appendix B.
A few common sub-themes were identified, including both the portability and use of different technology. Filtering of data was also important, for both proper selection for the algorithm and appropriate filters to decrease the amount of data to improve performance of the software. Another important sub-theme was the need for human resources, both the need for team members with specialized skills to assist with the portability of technology, and the need for team members to communicate well.
Summary of major challenges
Portability of algorithms
Considerations regarding the portability of phenotype algorithms were split into two sub-themes. The first theme was algorithm portability: how the ML and/or NLP algorithm performed at sites other than the lead site. This reinforced established observations that algorithm performance can differ by phenotype. For example, for atopic dermatitis, at a (secondary) validation site, many of the relevant dermatology records were captured on paper, from which text was not converted into parse-able format into the EHR, thus, the EHR-based algorithm had a high number of false negative results.
The format, composition, and classification of documents at different sites also played a role in algorithm portability, including the formats of clinical notes. This was an issue across all the types of notes used, which included ECG and other procedure and lab reports, and office/clinic encounter/visit notes. More often, sites described challenges in identifying the right documents to process with NLP. For example, the phenotype algorithm would require “radiology notes”, but no a priori semantic grouping was readily available at each implementing site for identifying broad categories of notes such as imaging, pathology, and microbiology. Instead, sites needed to review and map local document types to the document types specified in the algorithm. Similar issues occurred for the medical specialties/departments with which notes were associated, as well as the specific sections within notes. Manual review was often needed to resolve these issues. An unexpected finding was that the inclusion of general patient educational material in clinical notes also negatively impacted performance at some sites.
Finally, sites acknowledged that well-known challenges within NLP and ML persisted. The most prevalent challenge was negation: the task of inferring from the context of a term or phrase when it was not present or true. We observed several NLP components suffer performance losses because the modules failed to correctly capture some negation instances, e.g. “atrial fibrillation/flutter is no longer present” was falsely identified as a case. Accurate detection of negation can be difficult regardless of the NLP technologies used49. In addition to negation, language usage and document formatting can vary by institution or even across specialties at the same institution, which affected NLP performance. One example was the use of the colon as a separator in the text, which was interpreted in some sites as a terminator and in others as the start of a list. Diagnostic uncertainty (when the text indicates that the diagnosis is unclear) and rare terms were linguistic features also noted as issues, although we note NLP solutions may not exist to alleviate the former.
Implementation environments
The second sub-theme identified regarding portability was centered around the execution of the algorithm code—specifically, making the NLP/ML software run. Although NLP was restricted to two systems (cTAKES and MetaMap), setting up and executing these systems in different computing environments (e.g. different operating systems) introduced challenges. In addition, there were no restrictions placed on the programming languages used for ML and rule-based components of the phenotype algorithm as a whole. Sites noted that certain programming languages (e.g. Ruby) were not widely used across institutions. For some institutions, this meant the language was not supported, and as such the algorithm code could not be run. For others, the language was not the preferred language, and local experts had to be found to assist in the execution. This surfaced two additional themes for “Resource needs”: dedicated server environments to run the NLP/ML, and specialized staff—most often someone with experience in NLP.
Regardless of how familiar sites were with an NLP system or programming language, they frequently needed to modify the algorithm code before it would run locally (“Customization/Localization”). These changes were typically minor, such as changing file paths in the code and document input formats. Other changes included separate pre-processing steps for the clinical text—a technical solution to general problems noted in the “Data heterogeneity across sites” sub-theme.
Another difference noted across sites was “Performance (Speed)” as it relates to both the total elapsed time to get NLP/ML to run and the actual execution time. Sites noted that data preparation steps were typically the most labor intensive, and there was wide variation in time needed across sites. Execution time varied with computational resources and volume of the textual information available. With memory intensive text processing, one site noted that an NLP algorithm deployed as a Jupyter notebook on a PC with limited resources took “ > 2 h to run”, in response to which the site extracted the Python code and deployed directly to the server with augmented memory and disk space. Filtering of notes was a prevalent performance related theme. Some NLP algorithms as deployed would process all clinical notes, which at some sites was not feasible because of the very large numbers of notes at those sites, which at least at 1 site, were over 1 million notes, even after filtering. To address this, sites applied filters either by pre-selecting patients for whom to process notes or narrowing down to the appropriate clinical note types to process. Pre-selection/filtering of patients was very broad, such as selecting all patients whom had any diagnosis code for, or related to, the given phenotype.
Sites also noted how the use of multiple technologies (“Heterogeneous environments”) impeded portability. As previously noted, depending on the technology, local specialists were needed. Finding and coordinating availability of those individuals increased the total elapsed implementation time at some sites. Across multiple technologies, or sometimes even when using the same technology, the algorithm was implemented as disjoint scripts or programs (“Lack of integration”). Sites noted that they would need to run each of these steps separately, which also lengthened the total implementation time.
Additional themes relating to the software implementation were also noted by sites, including lack of boundary condition checks that caused software to crash. This included things such as unexpected null/empty/misformatted input. These also increased delays in the implementation as time was required to troubleshoot and resolve the issue.
Privacy
Sites reported that because clinical notes often contain patient identifiers, additional measures had to be pursued to assure patient privacy. One site required additional approvals to access clinical notes to run NLP. Another site observed that by running the NLP locally and distributing the final output/results, they obviated the complications of having to share entire clinical notes with the algorithm author. Therefore, by only sharing the outputs, this allowed sites and the network to maintain de-identified data, while still providing a deeper search into the EHRs of each institution.
Documentation
A lack of documentation crossed both the technical and algorithm themes. Sites noted sufficient documentation and instructions were not always available on how to execute the phenotype algorithm. In addition, insufficient documentation about what was the intended function of an algorithm, or the exact input needed, complicated the implementation. For the latter, sites would sometimes need to read the code itself, which also lacked sufficient documentation and/or comments.
Phenotyping workflow/process
During the implementation process, sites noted there were delays related to communication issues. For example, lack of documentation would prompt a site to request more information. While awaiting a reply, a site may have been required to shift focus away from the phenotype algorithm to another project, causing another delay before the site could shift focus back.
One site noted delays in implementation and dissemination due to intellectual property (IP) concerns at their institution. Since NLP and ML typically require significant investments in resources, an internally-developed system at this site was considered protected IP. The site worked to establish a version of the NLP algorithm that could be shared across sites. The considerable amount of time required to conduct the review and secure approvals delayed the overall implementation timeline.
An adjustment to the phenotyping process also included porting/re-writing code, which took on two forms. The first was specific to this study and driven by the network’s decision to limit the NLP pipelines that would be used. One site had a pre-existing NLP pipeline that was not one of the ones chosen; as a result, the site was required to port the NLP algorithm to cTAKES. Issues were identified in the ported version of the algorithm, which required correction. The second form of porting was driven by site-specific needs, requirements, or preferences to refactor or rewrite the provided algorithm. For example, one site rewrote a Ruby RegEx implementation in Python.
Overall, the network identified the need for and proposed a new phenotyping workflow to guide development and improve the process of validation (Fig. 2), especially for, but not limited to, NLP/ML algorithms. In the pre-existing workflow23, secondary site validation of algorithms did not commence until after a lead (primary) site develops and subsequently validates an algorithm. Therefore, the first workflow improvement is development of an algorithm at the lead site to be performed in parallel with the creation of a “gold standard” validation cohort by medical record review at both the lead and (secondary) validation sites, especially for NLP/ML algorithms that need training sets in order to develop the algorithm. This requires screening the EHR at the start of the workflow for a defined cohort from which the training and validation sets of patients are chosen. For example, a screen could be at least one International Classification of Diseases (ICD)-9/ICD-10 code for that phenotype as a highly sensitive filter. Consequently, selection of a random sample from a population enriched for that phenotype facilitates reasonable prevalence, usually in the 20–80% range. From this process, each site can select a random sample of perhaps 100–200 patients that clinicians classify as positive or negative cases, or undetermined, for a goal of at least 50 confirmed cases in each gold standard dataset. The algorithm developed in the primary dataset can be tested in the secondary dataset; therefore, if performance metrics are poor, the algorithm can be revised and tested in both sites’ datasets, without additional medical record review. Thus, the lead (phenotype creation) sites will encounter less inherent pressure to produce a “perfect” algorithm as a prerequisite to release to (secondary) validation sites, expediting the algorithm development process.
Discussion
We leveraged the unique resources of the eMERGE network to assess the advantages and challenges of integrating NLP into portable computational phenotypes. Advantages of NLP include: improvement of sensitivity (SLE and ACO) for identifying more cases of a rarer condition; increased precision (CRS), an important consideration in more common conditions; and enabling deep phenotyping, such as extracting subphenotypes from ECG notes. In general, algorithm performance at both lead and validation sites was enhanced with the addition of portable NLP. Similarly, an implementation of a portable and computable phenotyping algorithm, for identifying patients for clinic trial recruitment, added NLP to their algorithm, improving algorithm recall and precision26.
NLP performance may vary between sites, due to heterogeneity in clinical document names and the basic structure of clinical notes. Ideally, the implementation of standardized terminology (e.g. LOINC Document Ontology) across all sites could provide explicit input descriptions and reduce inconsistency18. However, implementation of these standard terminologies is impractical due to the absence of clear selection criteria currently. The overall process could be costly, time-consuming, and difficult to change when insufficient evidence is available to guide the selection. Furthermore, even if all sites adopt the same terminology and CDM for the clinical notes, because the notes may vary in their local templates, documentation patterns, document quality (i.e. spelling mistakes and typos), overall EHR data quality, and sublanguages; portability is still challenging16,26. Thus, we suggest starting with semi-structured clinical notes (e.g. problem/medication lists): for example, recent studies have demonstrated the benefits of using allergy lists for clinical studies50,51.
Notably, the generalizability of negation modules remains an open NLP challenge, and is consistent with other reports49,52. Local tailoring on negation may be necessary, such as adding correction rules to the code for negating language. In addition, errors in the software code was another potential source of differing algorithm performance between sites. The use of formal collaborative version control systems (such as GitHub) should be prioritized over other less effective means such as e-mail distributions of code and documentation. For this and other reasons already mentioned, portability can be further improved by requiring institutions to improve development processes, provide comprehensive documentation, and customization options.
Successfully sharing and implementing a computable phenotype using NLP is not just about the NLP technology or the algorithm itself. Other critical factors include privacy protection, technical infrastructure setup, intellectual property agreement, and efficient communication. For example, as clinical notes are not always able to be de-identified, sites may be unable to exchange example notes, causing difficulties for cross-site validation. Recent advances on the Privacy-Protective Generative Adversarial Network may generate fake text data with retained structure similarity that can be used for NLP algorithm development and validation53. Federated learning approaches have also emerged to preserve privacy without needing to transport clinical text54. Formatting information embedded in notes (e.g. Rich Text Format [RTF]) has been shown to improve phenotyping results55; however, cross-site utilization of format information is used inconsistently across the eMERGE network. Infrastructure challenges may be ameliorated by cloud computing in which algorithms and data workflows can be prepackaged and used by researchers with little training55,56; however, institutions may not be comfortable putting protected health information (PHI) into a sharable cloud. Although not explicitly tested in this work, we also believe full-text indexing of all clinical notes at the beginning would speed up execution time and reduce infrastructure needs by narrowing down the notes to process with a rule-based NLP system.
Lastly, efficient and effective communication across sites is critical. Our traditional approach (i.e. communication via comments on PheKB.org), may be unsuitable for timely, iterative, bi-directional communication. Furthermore, as others have also noted, collaboration between sites and also between the different types of experts (i.e. clinicians, informaticists, etc.) needed is critical23,27,29. Additionally, developing a “simplicity metric” to characterize phenotyping algorithms would allow researchers to more easily determine the skills needed for implementation. For example, data types required by the algorithm could be ranked in order of simplicity of extraction from the EHR.
There are a few limitations to this study. First, the comparison of the performance using NLP pipelines other than MetaMap or cTAKES, such as CLAMP, was beyond our resources and timeline. Our approach to NLP platform selection was based on those with which we had the most experience, which is not necessarily based on the strengths or capabilities of the platform itself. While the advantage of our approach is that the results are likely more generalizable to organizations wanting to implement NLP enhanced phenotyping, sometimes by clinicians with minimal NLP training; the disadvantage is that it precluded using the most up to date NLP approaches, which could impact the results. A separate study may be needed to evaluate other pipelines’ performance. In addition, we were not able to assess how portable NLP performs for rare phenotypes: although we intended to identify patients with Brugada syndrome from ECG reports, we did not find sufficient cases for evaluation. As stated previously, sites were only asked to qualitatively evaluate their experiences, and gather quantitative data beyond performance statistics, in the last quarter of the 1 year pilot project; thus, sites had to at least partly rely on their memories, resulting in loss of some details. For example, unfortunately as significant time had passed, we could not accurately estimate hours spent; however, we felt it more important to report real time elapsed given the additional complexity noted needing to wait for team members across multiple sites to be available.In addtion, no formal, standardized measurement of time and effort was used, leading to reliance on estimates that could also lead to inconsistent reporting and inaccuracies. Finally, the number of charts reviewed for some of the phenotypes was small, and, for at least one phenotype, only 1 person reviewed the chart.
Conclusion
In conclusion, incorporating NLP and ML into EHR phenotyping algorithms can improve phenotyping performance and enable deep phenotyping. Furthermore, while applying NLP at multiple sites entails several challenges, it is feasible to develop and implement phenotype algorithms with NLP/ML components with reproducible performance. Lastly, NLP requires dedicated personnel who are skilled in EHR phenotyping and NLP, and who communicate well. Given the value of mixed-methods evaluation of the portability of phenotype algorithms with NLP/ML, we recommend its use in studies of this type. While portable and replicable phenotype definitions and algorithms are possible, careful planning and architecture of the algorithms that support local customizations are expected to be needed for the foreseeable future.
Supplementary Information
Acknowledgements
This work was primarily conducted under Phase III of the eMERGE Network, and additional work was completed in the current phase of the eMERGE Network; therefore, we acknowledge and thank our eMERGE colleagues in both phases, especially those who were/are part of the eMERGE Phenotyping Workgroup. We also acknowledge the support of our funding agencies, which are listed below in the funding section.
Author contributions
J.A.P., L.V.R., K.W., T.N.P., S.S., S.N.M., V.M.C., C.L., T.L., A.S., O.D., K.K., Y.L., G.N., M.S.W., E.W.K., J.E.L., C.W., W.W. made substantial contributions to the conception and/or design of the work. J.A.P., L.V.R., T.N.P., D.C., S.S., S.N.M., J.H.G., V.S.G., V.M.C., F.M., T.L., A.S., G.E., V.W., A.F., R.P., D.S.C., Y.D., N.W., B.S., I.J.K, O.D., J.C.S., J.F.P., N.S., K.K., Y.N., Y.L., G.N., E.A.R., T.L.W., M.S.W., E.W.K., J.E.L., C.W., W.W. made substantial contributions to the acquisition, analysis, and/or interpretation of data for the work.
Funding
This work was primarily conducted under Phase III of the eMERGE Network, which was initiated and funded by the NHGRI through the following grants: U01HG008657 (Group Health Cooperative/University of Washington); U01HG008685 (Brigham and Women’s Hospital); U01HG008672 (Vanderbilt University Medical Center); U01HG008666 (Cincinnati Children’s Hospital Medical Center); U01HG006379 (Mayo Clinic); U01HG008679 (Geisinger Clinic); U01HG008680 (Columbia University Health Sciences); U01HG008684 (Children’s Hospital of Philadelphia); U01HG008673 (Northwestern University); U01HG008701 (Vanderbilt University Medical Center serving as the Coordinating Center); U01HG008676 (Partners Healthcare/Broad Institute); U01HG008664 (Baylor College of Medicine); and U54MD007593 (Meharry Medical College). Additional work was completed in the current phase of the eMERGE Network, which was initiated and funded by the NHGRI through the following grants: U01HG011172 (Cincinnati Children’s Hospital Medical Center); U01HG011175 (Children’s Hospital of Philadelphia); U01HG008680 (Columbia University); U01HG008685 (Mass General Brigham); U01HG006379 (Mayo Clinic); U01HG011169 (Northwestern University); U01HG008657 (University of Washington); U01HG011181 (Vanderbilt University Medical Center); U01HG011166 (Vanderbilt University Medical Center serving as the Coordinating Center). The systemic lupus erythematosus phenotype development was also partially funded by the National Institute of Arthritis and Musculoskeletal Disease, grant 5R21AR072262.
Data availability
The data used for this work was from electronic health records which include identifiable data and thus cannot be shared per the HIPAA Privacy Rule. The code is available on PheKB.org under the page for each phenotype and survey data can be de-identified and available upon request by contacting the corresponding author, Jennifer A. Pacheco.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Chunhua Weng and WeiQi Wei.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-023-27481-y.
References
- 1.Liao KP, Cai T, Savova GK, et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ. 2015;350:h1885. doi: 10.1136/bmj.h1885. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Velupillai S, Suominen H, Liakata M, et al. Using clinical natural language processing for health outcomes research: Overview and actionable suggestions for future advances. J. Biomed. Inform. 2018;88:11–19. doi: 10.1016/j.jbi.2018.10.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Yu S, Liao KP, Shaw SY, et al. Toward high-throughput phenotyping: Unbiased automated feature extraction and selection from knowledge sources. J. Am. Med. Inform. Assoc. 2015;22:993–1000. doi: 10.1093/jamia/ocv034. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N. Engl. J. Med. 2019;380:1347–1358. doi: 10.1056/NEJMra1814259. [DOI] [PubMed] [Google Scholar]
- 5.Luo Y, Uzuner Ö, Szolovits P. Bridging semantics and syntax with graph algorithms-state-of-the-art of extracting biomedical relations. Brief Bioinform. 2017;18:160–178. doi: 10.1093/bib/bbw001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Miller TA, Avillach P, Mandl KD. Experiences implementing scalable, containerized, cloud-based NLP for extracting biobank participant phenotypes at scale. JAMIA Open. 2020;3:185–189. doi: 10.1093/jamiaopen/ooaa016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Zeng Z, Deng Y, Li X, et al. Natural language processing for EHR-based computational phenotyping. IEEE/ACM Trans. Comput. Biol. Bioinform. 2019;16:139–153. doi: 10.1109/TCBB.2018.2849968. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Son JH, Xie G, Yuan C, et al. Deep phenotyping on electronic health records facilitates genetic diagnosis by clinical exomes. Am. J. Hum. Genet. 2018;103:58–73. doi: 10.1016/j.ajhg.2018.05.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Friedman C. Towards a comprehensive medical language processing system: methods and issues. Proc Conf Am Med Inform Assoc AMIA Fall Symp 595–9 (1997). [PMC free article] [PubMed]
- 10.Soysal E, Wang J, Jiang M, et al. CLAMP—A toolkit for efficiently building customized clinical natural language processing pipelines. J. Am. Med. Inform. Assoc. JAMIA. 2018;25:331–336. doi: 10.1093/jamia/ocx132. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): Architecture, component evaluation and applications. J. Am. Med. Inform. Assoc. JAMIA. 2010;17:507–513. doi: 10.1136/jamia.2009.001560. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. Proc AMIA Symp 17–21 (2001). [PMC free article] [PubMed]
- 13.Aronson AR, Lang F-M. An overview of MetaMap: Historical perspective and recent advances. J. Am. Med. Inform. Assoc. JAMIA. 2010;17:229–236. doi: 10.1136/jamia.2009.002733. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Banda JM, Seneviratne M, Hernandez-Boussard T, et al. Advances in electronic phenotyping: From rule-based definitions to machine learning models. Annu. Rev. Biomed. Data Sci. 2018;1:53–68. doi: 10.1146/annurev-biodatasci-080917-013315. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Carrell DS, Schoen RE, Leffler DA, et al. Challenges in adapting existing clinical natural language processing systems to multiple, diverse health care settings. J. Am. Med. Inform. Assoc. JAMIA. 2017;24:986–991. doi: 10.1093/jamia/ocx039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Sohn S, Wang Y, Wi C-I, et al. Clinical documentation variations and NLP system portability: A case study in asthma birth cohorts across institutions. J. Am. Med. Inform. Assoc. JAMIA. 2018;25:353–359. doi: 10.1093/jamia/ocx138. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Adekkanattu P, Jiang G, Luo Y, et al. Evaluating the portability of an NLP System for processing echocardiograms: A retrospective, multi-site observational study. AMIA Annu. Symp. Proc. 2020;2019:190–199. [PMC free article] [PubMed] [Google Scholar]
- 18.Hong N, Wen A, Stone DJ, et al. Developing a FHIR-based EHR phenotyping framework: A case study for identification of patients with obesity and multiple comorbidities from discharge summaries. J. Biomed. Inform. 2019;99:103310. doi: 10.1016/j.jbi.2019.103310. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Liu C, Ta CN, Rogers JR, et al. Ensembles of natural language processing systems for portable phenotyping solutions. J. Biomed. Inform. 2019;100:103318. doi: 10.1016/j.jbi.2019.103318. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Hong N, Wen A, Shen F, et al. Developing a scalable FHIR-based clinical data normalization pipeline for standardizing and integrating unstructured and structured electronic health record data. JAMIA Open. 2019;2:570–579. doi: 10.1093/jamiaopen/ooz056. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Sharma H, Mao C, Zhang Y, et al. Developing a portable natural language processing based phenotyping system. BMC Med. Inform. Decis. Mak. 2019;19:78. doi: 10.1186/s12911-019-0786-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Ryan GW, Bernard HR. Techniques to Identify Themes. Field Methods. 2003;15:85–109. doi: 10.1177/1525822X02239569. [DOI] [Google Scholar]
- 23.Newton KM, Peissig PL, Kho AN, et al. Validation of electronic medical record-based phenotyping algorithms: Results and lessons learned from the eMERGE network. J. Am. Med. Inform. Assoc. JAMIA. 2013;20:e147–154. doi: 10.1136/amiajnl-2012-000896. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Kho AN, Pacheco JA, Peissig PL, et al. Electronic medical records for genetic research: Results of the eMERGE consortium. Sci. Transl. Med. 2011 doi: 10.1126/scitranslmed.3001807. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Gottesman O, Kuivaniemi H, Tromp G, et al. The Electronic Medical Records and Genomics (eMERGE) Network: Past, present and future. Genet. Med. Off. J. Am. Coll. Med. Genet. 2013;15:761–771. doi: 10.1038/gim.2013.72. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Ahmed A, Vairavan S, Akhoundi A, et al. Development and validation of electronic surveillance tool for acute kidney injury: A retrospective analysis. J. Crit. Care. 2015;30:988–993. doi: 10.1016/j.jcrc.2015.05.007. [DOI] [PubMed] [Google Scholar]
- 27.Kirby JC, Speltz P, Rasmussen LV, et al. PheKB: A catalog and workflow for creating electronic phenotype algorithms for transportability. J. Am. Med. Inform. Assoc. JAMIA. 2016;23:1046–1052. doi: 10.1093/jamia/ocv202. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Shang N, Liu C, Rasmussen LV, et al. Making work visible for electronic phenotype implementation: Lessons learned from the eMERGE network. J. Biomed. Inform. 2019;99:103293. doi: 10.1016/j.jbi.2019.103293. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Ahmad FS, Ricket IM, Hammill BG, et al. Computable phenotype implementation for a national, multicenter pragmatic clinical trial: Lessons learned from ADAPTABLE. Circ. Cardiovasc. Qual. Outcomes. 2020;13:e006292. doi: 10.1161/CIRCOUTCOMES.119.006292. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Nadkarni GN, Gottesman O, Linneman JG, et al. Development and validation of an electronic phenotyping algorithm for chronic kidney disease. AMIA Annu. Symp. Proc. AMIA Symp. 2014;2014:907–916. [PMC free article] [PubMed] [Google Scholar]
- 31.Pacheco JA, Rasmussen LV, Kiefer RC, et al. A case study evaluating the portability of an executable computable phenotype algorithm across multiple institutions and electronic health record environments. J. Am. Med. Inform. Assoc. JAMIA. 2018;25:1540–1546. doi: 10.1093/jamia/ocy101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Jackson KL, Mbagwu M, Pacheco JA, et al. Performance of an electronic health record-based phenotype algorithm to identify community associated methicillin-resistant Staphylococcus aureus cases and controls for genetic association studies. BMC Infect. Dis. 2016;16:684. doi: 10.1186/s12879-016-2020-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Hsu J, Pacheco JA, Stevens WW, Smith ME, Avila PC. Accuracy of phenotyping chronic rhinosinusitis in the electronic health record. Am. J. Rhinol. Allergy. 2014;28(2):140–144. doi: 10.2500/ajra.2014.28.4012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Denny JC, Ritchie MD, Crawford DC, Schildcrout JS, Ramirez AH, Pulley JM, Basford MA, Masys DR, Haines JL, Roden DM. Identification of genomic predictors of atrioventricular conduction: Using electronic medical records as a tool for genome science. Circulation. 2010;122(20):2016–2021. doi: 10.1161/CIRCULATIONAHA.110.948828. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Walunas TL, Ghosh AS, Pacheco JA, Mitrovic V, Wu A, Jackson KL, Schusler R, Chung A, Erickson D, Mancera-Cuevas K, Luo Y. Evaluation of structured data from electronic health records to identify clinical classification criteria attributes for systemic lupus erythematosus. Lupus Sci. Med. 2021;8(1):e000488. doi: 10.1136/lupus-2021-000488. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Chu SH, Wan ES, Cho MH, Goryachev S, Gainer V, Linneman J, Scotty EJ, Hebbring SJ, Murphy S, Lasky-Su J, Weiss ST. An independently validated, portable algorithm for the rapid identification of COPD patients using electronic health records. Sci. Rep. 2021 doi: 10.1038/s41598-021-98719-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Safarova MS, Liu H, Kullo IJ. Rapid identification of familial hypercholesterolemia from electronic health records: The SEARCH study. J. Clin. Lipidol. 2016;10(5):1230–1239. doi: 10.1016/j.jacl.2016.08.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Gustafson E, Pacheco J, Wehbe F, Silverberg J, Thompson W. A machine learning algorithm for identifying atopic dermatitis in adults from electronic health records. In: Gustafson E, Pacheco J, Wehbe F, Silverberg J, Thompson W, editors. 2017 IEEE International Conference on Healthcare Informatics (ICHI) IEEE; 2017. pp. 83–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Kullo IJ, Fan J, Pathak J, Savova GK, Ali Z, Chute CG. Leveraging informatics for genetic studies: Use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease. J. Am. Med. Inform. Assoc. 2010;17:568–574. doi: 10.1136/jamia.2010.004366. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Savova GK, Fan J, Ye Z, Murphy SP, Zheng J, Chute CG, Kullo IJ. Discovering peripheral arterial disease cases from radiology notes using natural language processing. AMIA Annu. Symp. Proc. 2010;2010:722–726. [PMC free article] [PubMed] [Google Scholar]
- 41.Sohn S, Ye Z, Liu H, Chute CG, Kullo IJ. Identifying abdominal aortic aneurysm cases and controls using national language processing of radiology reports. AMIA Summits Transl. Sci. Proc. 2013;2013:249–253. [PMC free article] [PubMed] [Google Scholar]
- 42.Khaleghi, M., Isseh, I. N., Jouni, H., Sohn, S., Bailey, K. R., Kullo, I. J. Family history as a risk factor for carotid artery stenosis. Stroke, 45(8), 2252–6 (2014). Erratum in: Stroke, 45(9), e198 (2014). [DOI] [PubMed]
- 43.Lingren T, Chen P, Bochenek J, Doshi-Velez F, Manning-Courtney P, Bickel J, Welchons LW, Reinhold J, Bing N, Ni Y, Barbaresi W, Mentch F, Basford M, et al. Electronic health record based algorithm to identify patients with autism spectrum disorder. PLoS One. 2016;11(7):e0159621. doi: 10.1371/journal.pone.0159621. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Lingren T, Thaker V, Brady C, Namjou B, Kennebeck S, Bickel J, Patibandla N, Ni Y, Van Driest SL, Chen L, Roach A, Cobb B, Kirby J, Denny J, Bailey-Davis L, Williams MS, Marsolo K, Solti I, Holm IA, Harley J, Kohane IS, Savova G, Crimmins N. Developing an algorithm to detect early childhood obesity in two tertiary pediatric medical centers. Appl. Clin. Inform. 2016;7(3):693–706. doi: 10.4338/ACI-2016-01-RA-0015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Koleck TA, Dreisbach C, Bourne PE, Bakken S. Natural language processing of symptoms documented in free-text narratives of electronic health records: A systematic review. J. Am. Med. Inform. Assoc. 2019;26(4):364–379. doi: 10.1093/jamia/ocy173. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Chapman WW, Bridewell W, Hanbury P, et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. J. Biomed. Inform. 2001;34:301–310. doi: 10.1006/jbin.2001.1029. [DOI] [PubMed] [Google Scholar]
- 47.Harkema H, Dowling JN, Thornblade T, et al. ConText: An algorithm for determining negation, experiencer, and temporal status from clinical reports. J. Biomed. Inform. 2009;42:839–851. doi: 10.1016/j.jbi.2009.05.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Strauss A, Corbin J. Grounded theory methodology: An overview. In: Denzin NK, Lincoln YS, editors. Handbook of Qualitative Research. Thousand Oaks, CA: SAGE; 1994. pp. 273–285. [Google Scholar]
- 49.Wu S, Miller T, Masanz J, et al. Negation’s not solved: Generalizability versus optimizability in clinical natural language processing. PLoS One. 2014 doi: 10.1371/journal.pone.0112774. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Wu P, Nelson SD, Zhao J, et al. DDIWAS: High-throughput electronic health record-based screening of drug-drug interactions. J. Am. Med. Inform. Assoc. 2021;28:1421–1430. doi: 10.1093/jamia/ocab019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Zheng NS, Stone CA, Jiang L, et al. High-throughput framework for genetic analyses of adverse drug reactions using electronic health records. PLoS Genet. 2021;17:e1009593. doi: 10.1371/journal.pgen.1009593. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Mehrabi S, Krishnan A, Sohn S, et al. DEEPEN: A negation detection system for clinical text incorporating dependency relation into NegEx. J. Biomed. Inform. 2015;54:213–219. doi: 10.1016/j.jbi.2015.02.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Liu, Y., Peng, J., Yu, J. J. Q. et al. PPGAN: Privacy-preserving generative adversarial network. In 2019 IEEE 25th Int Conf Parallel Distrib Syst ICPADS 985–9 10.1109/ICPADS47876.2019.00150 (2019).
- 54.Sui, D., Chen, Y., Zhao, J., Jia, Y., Xie, Y., Sun, W. FedED: Federated learning via ensemble distillation for medical relation extraction. In Proc of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2118–2128 (Association for Computational Linguistics, 2020).
- 55.Zeng Z, Zhao Y, Sun M, et al. Rich text formatted EHR narratives: A hidden and ignored trove. Stud. Health Technol. Inform. 2019;264:472–476. doi: 10.3233/SHTI190266. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Terra. https://app.terra.bio/ (Accessed 23 September 2021).
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The data used for this work was from electronic health records which include identifiable data and thus cannot be shared per the HIPAA Privacy Rule. The code is available on PheKB.org under the page for each phenotype and survey data can be de-identified and available upon request by contacting the corresponding author, Jennifer A. Pacheco.