Abstract
Textual medical records contain a wealth of information that must be extracted and/or indexed before automated tools can analyze and interpret it. We have developed a collection of natural language processing (NLP) tools to extract various types of information from unstructured medical records. The generic NLP components, when assembled into pipelines and initialized with custom configuration parameters, become a powerful medical data mining instrument. We have successfully extracted medical concepts such as diagnoses, comorbidities, discharge medications, and smoking status from various types of medical records.
INTRODUCTION
A textual medical record is a rich source of clinical information. Although a number of NLP systems have demonstrated good accuracy in information extraction, they are often domain-, institution-, and application-specific. We have developed a suite of NLP tools for the I2B2 (Informatics for Integrating Biology and the Bedside, a national center for biomedical computing) project to address a wide range of text processing needs. We took a modular, parameterized approach to software development and employed syntactic, statistical, and template-based methods for different parsing tasks. This approach allows users to tailor the NLP tools to extract and index specific information from different domains and institutions.
METHODS AND RESULTS
We have developed 11 modules for text report processing (Figure 1):
Section Splitter
Section Filter
Text Tokenizer
Part-of-Speech (POS) Tagger
Noun Phrase Finder
UMLS Concept Finder
Negation Finder
Regular Expression-based Concept Finder
Sentence Splitter
N-Gram Tool
Classifier (e.g. Smoking Status Classifier)
Figure 1. NLP components for medical report processing assembled into pipelines for various information extraction tasks.
These modules were applied to discharge summaries and outpatient notes from two institutions, Brigham and Women's Hospital and Massachusetts General Hospital, with minimal changes to the configuration files. They were also used to extract key data items from a set of medical error reports, which involved adding several new modules but did not require any alteration of the original 11 modules.
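The idea of generic modules assembled into a task-specific pipeline through configuration can be illustrated with a minimal Python sketch. This is not the implementation described here: the function names, the section header format, the regular expressions, and the sample note are all hypothetical, and only three of the module types are imitated, in a much simplified form.

```python
import re
from typing import Any, Callable, Dict, List

# Each module takes a document (a plain dict) and returns an enriched copy.
Document = Dict[str, Any]
Module = Callable[[Document], Document]


def section_splitter(header_pattern: str) -> Module:
    """Split raw text into sections; the header pattern is a configuration parameter."""
    header_re = re.compile(header_pattern, re.MULTILINE)

    def run(doc: Document) -> Document:
        text = doc["text"]
        matches = list(header_re.finditer(text))
        sections = {}
        for i, m in enumerate(matches):
            # A section body runs from the end of its header to the next header.
            end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
            sections[m.group(1).strip().upper()] = text[m.end():end].strip()
        return {**doc, "sections": sections}

    return run


def section_filter(keep: List[str]) -> Module:
    """Keep only the sections named in the configuration."""
    wanted = {name.upper() for name in keep}

    def run(doc: Document) -> Document:
        kept = {k: v for k, v in doc["sections"].items() if k in wanted}
        return {**doc, "sections": kept}

    return run


def regex_concept_finder(patterns: Dict[str, str]) -> Module:
    """Report which configured concepts occur in the remaining sections."""
    compiled = {c: re.compile(p, re.IGNORECASE) for c, p in patterns.items()}

    def run(doc: Document) -> Document:
        found = [
            (name, concept)
            for name, body in doc["sections"].items()
            for concept, rx in compiled.items()
            if rx.search(body)
        ]
        return {**doc, "concepts": found}

    return run


def build_pipeline(modules: List[Module]) -> Module:
    """Chain modules so that each one consumes the output of the previous one."""
    def run(doc: Document) -> Document:
        for module in modules:
            doc = module(doc)
        return doc

    return run


# Hypothetical configuration for a discharge-medication extraction task.
discharge_meds_pipeline = build_pipeline([
    section_splitter(r"^([A-Z][A-Za-z ]+):"),
    section_filter(["Discharge Medications"]),
    regex_concept_finder({"aspirin": r"\baspirin\b", "metformin": r"\bmetformin\b"}),
])

note = (
    "History of Present Illness:\n"
    "Patient took aspirin at home.\n"
    "Discharge Medications:\n"
    "Metformin 500 mg twice daily.\n"
)
result = discharge_meds_pipeline({"text": note})
print(result["concepts"])
```

In this sketch, adapting the pipeline to another institution or task would mean changing only the arguments passed to each module (header pattern, section names, concept patterns) or adding a new module to the list, while the existing module code stays unchanged.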

