Author manuscript; available in PMC: 2022 May 26.
Published in final edited form as: Heart. 2022 May 25;108(12):909–916. doi: 10.1136/heartjnl-2021-319769

Table 4.

Portfolio of open-source NLP tools and datasets applied in cardiology contexts

| Name and Origin | Description | Accessibility |
| --- | --- | --- |
| NLP tools | | |
| clinical Text Analysis and Knowledge Extraction System (cTAKES); Mayo Clinic | A modular pipeline of components that uses both rule-based and machine learning methods to support information extraction; based on UIMA (Unstructured Information Management Architecture) standards. | Open-source at http://www.ohnlp.org |
| EchoExtractor; Veterans Affairs | An application that extracts concept-value pairs for metrics measured during an echocardiogram study. | Open-source at https://github.com/department-of-veterans-affairs/EchoExtractor |
| Leo; Veterans Affairs Informatics and Computing Infrastructure (VINCI) | A set of services and libraries that leverage UIMA standards to enable rapid creation and deployment of NLP analysis tools and incorporation of previously developed tools. | Open-source at https://department-of-veterans-affairs.github.io/Leo/userguide.html |
| MedTagger; Mayo Clinic | A set of tools for dictionary-based indexing, pattern-based information extraction, and machine learning-based named entity recognition; based on UIMA standards. | Open-source at https://github.com/OHNLP/MedTagger |
| pyConText; University of Utah | A Python implementation of ConText, a simple text processing algorithm for identifying a large number of features and relationships between features. | Open-source at https://pypi.org/project/pyConTextNLP/0.6.0.5/ |
| SemEHR; King’s College London, UK | A general-purpose search and analytics tool that processes heterogeneous data sources, covers a range of biomedical concepts, and captures context to support information extraction in study-specific or case-specific settings. | Open-source at https://github.com/CogStack/CogStack-SemEHR |
| Datasets | | |
| The Medical Information Mart for Intensive Care III (MIMIC-III); Massachusetts Institute of Technology | Deidentified, freely available critical care database of over 60,000 intensive care unit admissions. | https://mimic.mit.edu/ |
| Electronic Medical Records and Genomics (eMERGE) network; National Human Genome Research Institute (NHGRI) | Combines DNA biorepositories with EHR data from several clinical sites nationally and has been used extensively to develop phenotyping algorithms. | https://emerge-network.org/ |
| Informatics for Integrating Biology and the Bedside (i2b2); Partners Healthcare | A dataset of deidentified patient discharge summaries made available for research purposes. | https://www.i2b2.org/ |
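
To make the idea of concept-value pair extraction from echocardiogram text (as described for EchoExtractor in the table above) more concrete, the following is a minimal rule-based sketch in Python. It is not EchoExtractor's implementation and does not use its API; the lexicon `CONCEPTS`, the helper names `build_pattern` and `extract_pairs`, and the sample sentence are all hypothetical and chosen only for illustration. A real system would also need negation and context handling, a far larger lexicon, and validation against annotated reports.

```python
import re

# Hypothetical lexicon: canonical concept name -> surface forms seen in reports.
CONCEPTS = {
    "LVEF": ["left ventricular ejection fraction", "ejection fraction", "lvef", "ef"],
    "LA_DIAMETER": ["left atrial diameter", "la diameter"],
}


def build_pattern(surface_forms):
    """Compile a regex matching any surface form followed by a nearby value."""
    # Try longer forms first so "ejection fraction" is preferred over "ef".
    ordered = sorted(surface_forms, key=len, reverse=True)
    alternatives = "|".join(re.escape(form) for form in ordered)
    return re.compile(
        r"\b(?:" + alternatives + r")\b"
        r"[^0-9\n]{0,30}?"                      # short filler, e.g. " is estimated at "
        r"(\d+(?:\.\d+)?)"                      # value, or lower bound of a range
        r"(?:\s*(?:-|to)\s*(\d+(?:\.\d+)?))?"   # optional upper bound of a range
        r"\s*(%|mm|cm)?",                       # optional unit
        re.IGNORECASE,
    )


def extract_pairs(text):
    """Return a list of concept-value dictionaries found in free text."""
    pairs = []
    for concept, forms in CONCEPTS.items():
        for match in build_pattern(forms).finditer(text):
            low = float(match.group(1))
            # A single value is stored as a degenerate range (low == high).
            high = float(match.group(2)) if match.group(2) else low
            pairs.append(
                {"concept": concept, "low": low, "high": high, "unit": match.group(3)}
            )
    return pairs


if __name__ == "__main__":
    sample = ("Left ventricular ejection fraction is estimated at 55-60%. "
              "LA diameter: 4.1 cm.")
    for pair in extract_pairs(sample):
        print(pair)
```

Storing every value as a low/high range keeps single measurements and reported ranges in one shape, which simplifies downstream comparison against thresholds.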