AMIA Annual Symposium Proceedings. 2006;2006:931.

A Suite of Natural Language Processing Tools Developed for the I2B2 Project

Sergey Goryachev 1, Margarita Sordo 1, Qing T Zeng 1
PMCID: PMC1839726  PMID: 17238550

Abstract

Textual medical records contain a wealth of information that must be extracted, indexed, or both before automated tools can analyze and interpret it. We have developed a collection of natural language processing (NLP) tools to extract various types of information from unstructured medical records. The generic NLP components, when assembled into pipelines and initialized with custom configuration parameters, become a powerful medical data mining instrument. We have successfully extracted medical concepts such as diagnoses, comorbidities, discharge medications, and smoking status from various types of medical records.

INTRODUCTION

A textual medical record is a rich source of clinical information. Although a number of NLP systems have demonstrated good accuracy in information extraction, they are often domain-, institution-, and application-specific. We have developed a suite of NLP tools for the I2B2 (Informatics for Integrating Biology and the Bedside, a national center for biomedical computing) project to address a wide range of text processing needs. We took a modular, parameterized approach to software development and employed syntactic, statistical, and template-based methods for different parsing tasks. This approach allows users to tailor the NLP tools to extract and index specific information from different domains and institutions.
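
The modular, parameterized design described above can be illustrated with a minimal Python sketch. This is not the I2B2 implementation: the `Pipeline` class, the `tokenize` and `lowercase` components, and the `separator` parameter are all hypothetical names chosen for illustration. The sketch only shows how generic components, each taking its own parameters, might be chained into a pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A document travels through the pipeline as a plain dict of annotations.
Document = Dict[str, object]
Component = Callable[[Document, dict], Document]


def tokenize(doc: Document, params: dict) -> Document:
    """Hypothetical tokenizer: splits on a configurable separator."""
    separator = params.get("separator")  # None means any whitespace
    doc["tokens"] = doc["text"].split(separator)
    return doc


def lowercase(doc: Document, params: dict) -> Document:
    """Hypothetical normalizer: lowercases every token."""
    doc["tokens"] = [token.lower() for token in doc["tokens"]]
    return doc


@dataclass
class Pipeline:
    """Ordered list of (component, parameters) pairs."""
    steps: List[Tuple[Component, dict]] = field(default_factory=list)

    def add(self, component: Component, **params) -> "Pipeline":
        self.steps.append((component, params))
        return self

    def run(self, text: str) -> Document:
        doc: Document = {"text": text}
        for component, params in self.steps:
            doc = component(doc, params)
        return doc


pipeline = Pipeline().add(tokenize).add(lowercase)
result = pipeline.run("Patient denies SMOKING")
print(result["tokens"])  # ['patient', 'denies', 'smoking']
```

Because each component sees only the shared document and its own parameters, a step can be swapped or reconfigured without touching the others, which is the property the approach relies on.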

METHODS AND RESULTS

We have developed 11 modules for text report processing (Figure 1):

Figure 1. NLP components for medical report processing assembled into pipelines for various information extraction tasks.

  1. Section Splitter

  2. Section Filter

  3. Text Tokenizer

  4. Part-of-Speech (POS) Tagger

  5. Noun Phrase Finder

  6. UMLS Concept Finder

  7. Negation Finder

  8. Regular Expression-based Concept Finder

  9. Sentence Splitter

  10. N-Gram Tool

  11. Classifier (e.g. Smoking Status Classifier)
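
To make a few of these modules concrete, the sketch below combines a naive sentence splitter, a regular expression-based concept finder, and a very simple negation check. It is an illustration under stated assumptions, not the authors' code: the concept patterns and negation triggers are invented, and the negation rule (a trigger word appearing earlier in the same sentence) is a deliberately crude stand-in for whatever method the Negation Finder actually uses.

```python
import re

# Hypothetical concept patterns; a real configuration would be far larger.
CONCEPT_PATTERNS = {
    "diabetes": re.compile(r"\bdiabetes(?: mellitus)?\b", re.IGNORECASE),
    "hypertension": re.compile(r"\bhypertension\b|\bhtn\b", re.IGNORECASE),
}

# Hypothetical negation triggers.
NEGATION_TRIGGERS = re.compile(
    r"\b(?:no|denies|without|negative for)\b", re.IGNORECASE
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_concepts(text):
    """Return (concept, negated) pairs in order of sentence, then pattern."""
    findings = []
    for sentence in split_sentences(text):
        for name, pattern in CONCEPT_PATTERNS.items():
            for match in pattern.finditer(sentence):
                # Negated if a trigger appears earlier in the same sentence.
                preceding = sentence[:match.start()]
                negated = bool(NEGATION_TRIGGERS.search(preceding))
                findings.append((name, negated))
    return findings


note = "History of diabetes mellitus. Patient denies hypertension."
print(find_concepts(note))  # [('diabetes', False), ('hypertension', True)]
```

The naive splitter will mis-handle abbreviations such as "Dr." and the negation rule ignores scope, so this is only a starting point for experimentation.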

These modules were applied to discharge summaries and outpatient notes from two institutions, Brigham and Women's Hospital and Massachusetts General Hospital, with minimal changes to the configuration files. They were also used to extract key data items from a set of medical error reports, which required adding several new modules but no alteration of the original 11 modules.
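
One way to picture reuse across institutions through configuration alone is the following sketch of a section splitter and section filter. Everything specific here is assumed for illustration: the site names, the section headers, and the `CONFIGS` layout are hypothetical and do not describe the actual report formats or configuration files of either hospital. Only the configuration differs between sites; the splitting and filtering code is shared.

```python
import re

# Hypothetical per-site configurations: header wording differs by site.
CONFIGS = {
    "site_a": {
        "headers": ["HISTORY OF PRESENT ILLNESS", "DISCHARGE MEDICATIONS"],
        "keep": ["DISCHARGE MEDICATIONS"],
    },
    "site_b": {
        "headers": ["HPI", "MEDICATIONS ON DISCHARGE"],
        "keep": ["MEDICATIONS ON DISCHARGE"],
    },
}


def split_sections(text, headers):
    """Split a report into {header: body} using 'HEADER:' at line start."""
    alternatives = "|".join(re.escape(h) for h in headers)
    pattern = re.compile(r"^(" + alternatives + r"):[ \t]*", re.MULTILINE)
    matches = list(pattern.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section body runs until the next header or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip()
    return sections


def filter_sections(sections, keep):
    """Keep only the sections named in the configuration."""
    return {name: body for name, body in sections.items() if name in keep}


def extract(text, site):
    config = CONFIGS[site]
    sections = split_sections(text, config["headers"])
    return filter_sections(sections, config["keep"])


report_a = (
    "HISTORY OF PRESENT ILLNESS: Chest pain.\n"
    "DISCHARGE MEDICATIONS: Aspirin 81 mg daily.\n"
)
report_b = (
    "HPI: Chest pain.\n"
    "MEDICATIONS ON DISCHARGE: Aspirin 81 mg daily.\n"
)
print(extract(report_a, "site_a"))  # {'DISCHARGE MEDICATIONS': 'Aspirin 81 mg daily.'}
print(extract(report_b, "site_b"))  # {'MEDICATIONS ON DISCHARGE': 'Aspirin 81 mg daily.'}
```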


Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
