AMIA Annual Symposium Proceedings. 2010 Nov 13;2010:91–95.

Aligning Structured and Unstructured Medical Problems Using UMLS

Lorena Carlo 1, Herbert S Chase 1, Chunhua Weng 1
PMCID: PMC3041294  PMID: 21346947

Abstract

This paper reports a pilot study to align medical problems in structured and unstructured EHR data using UMLS. A total of 120 medical problems in discharge summaries were extracted using NLP software (MedLEE) and aligned with 87 ICD-9 diagnoses for 19 non-overlapping hospital visits of 4 patients. The alignment accuracy was evaluated by a medical doctor. The average overlap of medical problems between the two data sources obtained by our automatic alignment method was 23.8%, which was about half of the manual review result, 43.56%. We discuss the implications for related research in integrating structured and unstructured EHR data.

INTRODUCTION

With the 2009 national initiative to accelerate comparative effectiveness research, more and more clinical research studies will be carried out using data captured by Electronic Health Records (EHR). Efficient secondary use of EHR data requires accurate extraction of a complete list of medical problems or clinical phenotypes at the individual patient level. The International Classification of Diseases, 9th Revision (ICD-9) has been used broadly to encode medical problems and laboratory procedures, but primarily for billing purposes. Although it is well known that structured ICD-9 diagnoses can be incomplete and inaccurate [1,2], they remain popular data sources for clinical research queries. In contrast, rich disease information in unstructured clinical notes remains underused by research queries. Natural language processing (NLP) algorithms have been developed to construct problem lists from narrative notes [3–5]. However, little work has been done to use both ICD-9 codes and narrative clinical notes to generate complete medical problem lists; most studies have used either ICD-9 diagnoses or notes as their single data source.

Studies have identified the lack of concordance between information from physicians and coders [6]. Problem lists derived from only one source may be far from complete. Reconciliation of structured and unstructured data sources is much needed to achieve a complete medical problem list for a patient. An important step toward this goal is to semantically align disease concepts from structured and unstructured data sources in the EHR.

We have previously conducted a case study of the tradeoffs of using structured ICD-9 codes and narrative discharge summaries for identifying eligible patients for a clinical trial [2]. Still, little is known about the degree of information overlap between structured and unstructured data sources. As an extension to our previous study, in this paper we report a pilot study to quantify the information overlap between structured and unstructured clinical data and to align medical problems from ICD-9 codes and discharge summaries using semantic knowledge.

As a standard knowledge source, the Unified Medical Language System (UMLS) assigns a concept unique identifier (CUI) to each biomedical concept and stores mappings across different clinical terminologies, including ICD-9. Therefore, we hypothesized that a CUI can serve as an intermediary for resolving the semantic equivalence between medical problems extracted from structured and unstructured data.

Following the above design rationale, we extracted and aligned medical problems identified from discharge summaries and ICD-9 diagnoses using UMLS. Next, we describe our semantic alignment framework, analyze the alignment results in our pilot study, and discuss the identified issues.

RESEARCH DESIGN

We used the New York Presbyterian Hospital / Columbia University Medical Center’s Clinical Data Warehouse (http://ctcc.cpmc.columbia.edu/rdb/) to extract structured ICD-9 diagnoses and discharge summary texts for 19 hospital visits. We selected 4 hypertension patients between the ages of 50 and 60 and 5 hospital visits from each of them. Figure 1 shows our overall research design.

Figure 1.

The UMLS-based Alignment Framework.

The discharge summaries were parsed by the Medical Language Extraction and Encoding System (MedLEE) [7], encoded with UMLS CUIs, and presented in a tabular format. A concept can be mapped to multiple CUIs with varied specificity. We derived the corresponding ICD-9 codes for the extracted CUIs using UMLS, and aligned them to the official ICD-9 diagnoses in the patient records.

Our alignment experiment consisted of 3 steps: (1) visit selection and co-reference resolution between the two data sources; (2) problem selection; and (3) aligning ICD-9 codes derived from discharge summaries to official ICD-9 diagnoses.

1. Visit selection and co-reference resolution

For this pilot study, the first step was to select patient visits that involved minimal confounding factors for the alignment analysis. In other words, we needed to identify a set of ICD-9 diagnosis codes and their “corresponding” discharge summaries for the same hospital visits. Generally, these two data elements are entered into the EHR at different time points by different users, including attending physicians, nurses, lab technicians, and ICD-9 coders. In addition, there could be delays between the visit dates and the time stamps in the EHR. Therefore, the time stamps were not necessarily reliable for linking corresponding ICD-9 diagnoses and discharge summaries, especially for patients who had multiple consecutive visits within a short time period. To avoid the confounding that would arise from teasing apart diagnoses whose time windows adjoin or overlap across different discharges, we included only discharge summaries that were at least one month apart in time. The patient visits were randomly retrieved from a large data set and chosen using the following process.

We manually reviewed a small sample of ICD-9 diagnoses and discharge summaries and decided to use the time window of “[admission date, discharge date + 14 days]” to select official ICD-9 codes that might correlate with a patient discharge. We used the admission date as the lower boundary because no diagnoses could be given before the admission time. The upper boundary of 14 days after the discharge date was selected to accommodate potential data entry delays and to capture diagnoses that were entered into the EHR after the patient was discharged.
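
To illustrate this selection rule, the following Python sketch applies the “[admission date, discharge date + 14 days]” window to link official ICD-9 entries to a visit. The record attributes (admission_date, discharge_date, entry_date, code) are hypothetical placeholders rather than the actual Clinical Data Warehouse schema.

    from datetime import date, timedelta

    def icd9_window(admission: date, discharge: date):
        """Return the [admission date, discharge date + 14 days] window used
        to link official ICD-9 entries to one hospital visit."""
        return admission, discharge + timedelta(days=14)

    def codes_for_visit(visit, icd9_entries):
        """Keep ICD-9 entries whose entry timestamp falls inside the visit
        window. `visit` and `icd9_entries` are assumed to carry
        admission_date, discharge_date, entry_date, and code attributes."""
        lo, hi = icd9_window(visit.admission_date, visit.discharge_date)
        return [e.code for e in icd9_entries if lo <= e.entry_date <= hi]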

2. Problem selection for both data sources

The concepts extracted from the discharge summaries by MedLEE [7] were automatically filtered to retain only those classified under one of the following “illness-related” UMLS semantic types, as recommended by a medical doctor (Chase) [8]: Disease or Syndrome, Congenital Abnormality, Pathologic Function, Injury or Poisoning, Neoplastic Process, and Mental or Behavioral Dysfunction. Furthermore, MedLEE specifies multiple attributes for each concept, including certainty (positive vs. negative), currency (past vs. current), and primary type (e.g., problem vs. procedure). Using these attributes, we were able to select only the current, positive medical problems for further alignment. The MedLEE output often includes multiple CUIs of different semantic granularities for a single medical problem; all of these were included in our alignment analysis.
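
As an illustration, a minimal Python filter consistent with this step is sketched below. The dictionary keys (cui, semantic_type, certainty, currency) are assumed names for the tabular MedLEE output; the actual column names may differ.

    # Illness-related UMLS semantic types used to filter MedLEE output.
    ILLNESS_SEMANTIC_TYPES = {
        "Disease or Syndrome",
        "Congenital Abnormality",
        "Pathologic Function",
        "Injury or Poisoning",
        "Neoplastic Process",
        "Mental or Behavioral Dysfunction",
    }

    def select_problems(medlee_rows):
        """Keep only the CUIs of current, positive medical problems whose
        semantic type is illness-related. Each row is assumed to be a dict
        with 'cui', 'semantic_type', 'certainty', and 'currency' keys."""
        return [
            row["cui"]
            for row in medlee_rows
            if row["semantic_type"] in ILLNESS_SEMANTIC_TYPES
            and row["certainty"] == "positive"
            and row["currency"] == "current"
        ]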

We also performed problem selection on the ICD-9 diagnosis set. Since our study focused only on medical problems, we defined preference rules in our semantic alignment framework for choosing concepts that have disease codes rather than procedure codes. Procedure codes were automatically filtered out of the official ICD-9 diagnoses. We also excluded all ICD-9 codes in the range E800–E999 (External Causes of Injury and Poisoning) and V01–V89 (Factors Influencing Health Status and Contact with Health Services). For example, the CUI C0042029 has two ICD-9 codes, V13.02 and 599.0; the framework selects the latter because it is a disease code.
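
The exclusion and preference rules can be expressed as a small filter. The sketch below assumes that codes are plain strings such as "599.0", "V13.02", or "E888.1"; it is an illustration, not the framework's actual implementation.

    def is_disease_code(icd9: str) -> bool:
        """Return False for E800-E999 (external causes) and V01-V89 (factors
        influencing health status), True for everything else."""
        code = icd9.strip().upper()
        try:
            if code.startswith("E"):
                return not (800 <= int(float(code[1:])) <= 999)
            if code.startswith("V"):
                return not (1 <= int(float(code[1:])) <= 89)
        except ValueError:
            return False
        return True

    def prefer_disease_codes(codes):
        """Given every ICD-9 code mapped to one CUI (e.g. C0042029 ->
        V13.02 and 599.0), keep only the disease codes when any exist."""
        diseases = [c for c in codes if is_disease_code(c)]
        return diseases or list(codes)

For the C0042029 example, prefer_disease_codes(["V13.02", "599.0"]) returns ["599.0"].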

3. Aligning ICD-9 codes derived from discharge summaries to official ICD-9 diagnosis codes

A subset of the UMLS MRCONSO.RRF file was loaded into a database and used to automatically assign ICD-9 codes to the disease concept CUIs extracted from the discharge summaries. The following source vocabularies and term types were extracted from MRCONSO: ICD-9CM PT (preferred term), ICD-9CM AB (abbreviation), ICD-9CM HT (hierarchical term), and MTHICD-9 ET (entry term). A UMLS CUI can map to zero, one, or more ICD-9 codes. For CUIs that had no direct mapping to an ICD-9 code, we used the UMLS MRREL.RRF file to identify their PAR (parent) relationships, with the objective of finding the equivalent ICD-9 codes of their parent concepts. The derived ICD-9 codes were then aligned with the official ICD-9 diagnosis codes. A partial match method was developed to align all terms under the same ICD-9 hierarchy, that is, all disease concepts whose ICD-9 codes share the same characters to the left of the decimal point. For example, one of the patients was diagnosed with Diabetes Mellitus and received the official ICD-9 diagnosis code 250, while the ICD-9 code derived from the discharge summaries was 250.0 (“Diabetes mellitus NOS”). The partial match algorithm grouped them as the same concept because both codes begin with “250” and therefore belong to the same ICD-9 hierarchy.
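
A minimal sketch of the CUI-to-ICD-9 lookup and the partial match rule is shown below. It assumes the standard pipe-delimited MRCONSO.RRF layout (CUI, SAB, TTY, and CODE in fields 0, 11, 12, and 13) and the UMLS source abbreviations ICD9CM and MTHICD9; the fallback to parent concepts through MRREL.RRF PAR relationships is omitted for brevity.

    from collections import defaultdict

    # Source vocabulary / term type pairs whose codes we keep (verify the
    # abbreviations against your UMLS release).
    WANTED = {("ICD9CM", "PT"), ("ICD9CM", "AB"),
              ("ICD9CM", "HT"), ("MTHICD9", "ET")}

    def load_cui_to_icd9(mrconso_path):
        """Build a CUI -> set of ICD-9 codes lookup from MRCONSO.RRF."""
        mapping = defaultdict(set)
        with open(mrconso_path, encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip("\n").split("|")
                cui, sab, tty, code = fields[0], fields[11], fields[12], fields[13]
                if (sab, tty) in WANTED:
                    mapping[cui].add(code)
        return mapping

    def icd9_category(code: str) -> str:
        """Partial-match key: the characters left of the decimal point,
        so 250, 250.0, and 250.02 all fall in category '250'."""
        return code.split(".")[0]

    def aligns(derived_code: str, official_code: str) -> bool:
        """Two ICD-9 codes are treated as the same problem when they share
        a category."""
        return icd9_category(derived_code) == icd9_category(official_code)

For the diabetes example above, aligns("250.0", "250") returns True.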

When computing the overlap, equivalent medical problems in the same visit were counted as one. For example, “429.2: Cardiovascular Degeneration” and “429.9: Heart Disease NOS” were counted as one medical problem within the same visit. We calculated the percentage of terms extracted from the discharge summaries that could and could not be mapped to an ICD-9 code, as well as the overlap rate between the ICD-9 diagnoses and the discharge summaries.
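
One way to compute the per-visit overlap rate that is consistent with this counting rule is sketched below: codes from each source are deduplicated by ICD-9 category (the characters left of the decimal point), and the rate is the share of categories found in both sources. This is an assumption consistent with the figures reported later, not necessarily the exact formula used.

    def overlap_rate(derived_codes, official_codes):
        """Per-visit overlap: problems sharing an ICD-9 category count once;
        the rate is the number of shared categories divided by the number of
        distinct categories from either source."""
        derived = {c.split(".")[0] for c in derived_codes}
        official = {c.split(".")[0] for c in official_codes}
        shared = derived & official
        union = derived | official
        return len(shared) / len(union) if union else 0.0

For example, overlap_rate(["250.0", "427.9"], ["250", "401.9"]) returns 1/3, because only the "250" category appears in both sources out of three distinct categories.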

RESULTS

The complete list for the 19 visits is available at http://people.dbmi.columbia.edu/~chw7007/download.htm. Our results below show (1) how well UMLS concepts can be encoded with ICD-9 codes; (2) how well structured ICD-9 diagnoses overlap with UMLS concepts for primary problems; and (3) how well the ICD-9 codes derived from UMLS concepts align with the official ICD-9 diagnoses.

1. Encoding UMLS Concepts with ICD-9 Codes

Table 1 shows the results of deriving ICD-9 codes from the extracted UMLS concepts. About 59.3% of the medical problems in the discharge summaries were mapped to an ICD-9 code, and 40.7% had no mapping.

Table 1.

Percentages of concepts that have or do not have ICD-9 Mappings.

Concept mapping category                                             Concept count   Percentage
Discharge summary concepts that were mapped to an ICD-9 code         73              59.3%
Discharge summary concepts that were not mapped to an ICD-9 code     50              40.7%

2. Measuring the Overlap between Structured and Unstructured Medical Problems

We aggregated all the distinct medical problems identified from discharge summaries and official ICD-9 diagnoses from the 19 visits. A total of 85 out of 172 (49.4%) distinct medical problems were extracted only from discharge summaries (Set A), while 52 out of 172 (30.2%) concepts were present only in official ICD-9 diagnosis data (Set B). The number of problems that overlap (Set C) was 35 out of 172 (20.3%). See Figure 2.

Figure 2.

Degree of Overlapping Diagnoses Achieved by the Automatic Alignment Method.

3. Accuracy of the Medical Problem Lists

Co-author (HC) manually reviewed the free-text discharge summaries of the 19 hospital visits, their automatically derived ICD-9 codes, and the official ICD-9 diagnoses associated with the visits. Important and relevant clinical conditions were identified, and a gold standard was established to represent a complete list of medical problems for each visit. In this context, past illnesses (e.g., “history of fibroids”) were presumed to be still present and were counted as active problems. Only unique concepts were counted.

Table 2 shows the accuracy of three different medical problem lists. We use “MICD9 output” to refer to the integrated list of medical problems generated by the semantic alignment framework. The sensitivity and positive predictive value (PPV) of the MICD9 output and of the official ICD-9 codes were calculated by counting true positives (TP), false positives (FP), and false negatives (FN). A TP was a gold standard term appearing in the MICD9 output, a FP was a MICD9 output term that was not in the gold standard list, and a FN was a gold standard term that did not appear in the MICD9 output. True negatives could not be calculated for this study. The sensitivity of the MICD9 output is 72.4% and its PPV is 75.6%; the sensitivity of the official ICD-9 codes is 61.3% and its PPV is 85.5%; the sensitivity of the overlap is 30.3% and its PPV is 97.3%. Table 2 shows that the MedLEE-derived ICD-9 codes have balanced sensitivity and PPV.

Table 2.

Accuracy of the three medical problem lists.

                 MedLEE-derived ICD-9 only   Official ICD-9 diagnoses only   Overlapping problems
Sensitivity      72.4%                       61.3%                           30.3%
PPV              75.6%                       85.5%                           97.3%
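
The sensitivity and PPV figures in Table 2 follow directly from the TP, FP, and FN counts described above. A minimal sketch of the calculation, assuming each problem list and the gold standard are represented as sets of comparable problem labels (an assumption; the paper does not specify its data structures):

    def sensitivity_ppv(problem_list, gold_standard):
        """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP). TP are gold
        standard terms found in the list, FP are listed terms absent from
        the gold standard, FN are gold standard terms missing from the list."""
        listed, gold = set(problem_list), set(gold_standard)
        tp = len(listed & gold)
        fp = len(listed - gold)
        fn = len(gold - listed)
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        ppv = tp / (tp + fp) if (tp + fp) else 0.0
        return sensitivity, ppv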

4. Aligning Derived ICD-9 Codes from Discharge Summaries to Official ICD-9 Diagnosis Codes

We also calculated the concept alignment results for each patient visit. The average overlap per visit was 23.8%. The Venn diagram in Figure 3 represents the medical problems obtained in visit 3 of patient 1. The left side (Set A) corresponds to the problems identified only in the discharge summaries, the right side (Set B) corresponds to the problems identified only in the official ICD-9 diagnoses, and the overlap (Set C) represents the medical problems found in both sources. False positives and false negatives are marked in the diagram with the labels [FP] and [FN], respectively.

Figure 3.

Example of the overlap between disease concepts extracted from discharge summaries and official ICD-9 diagnoses for one patient visit.

In Set A, the conditions Enzymopathy NOS (277.9) and Infantilism NOS (259.9) are false positives; false positives increase the number of medical problems and thus decrease the overlap rate. There is also a subset of CUIs that could not be translated to ICD-9 codes (e.g., C1956346: Artery Disease, Coronary); these have no mappings to specific ICD-9 codes, probably because the CUIs are too coarse, and they are false negatives. To simplify the Venn diagram, we displayed only the concepts at the top of the ICD-9 hierarchy that corresponded to the problems in the overlap. The complete list of diseases for Set C is: 427.60: Premature contractions or systoles NOS; 427.6: Premature beats; 427.9: Cardiac arrhythmia NOS; 427: Cardiac dysrhythmias; 427.89: Other specified cardiac dysrhythmias; 250.0: Diabetes mellitus NOS; 250: Diabetes mellitus; 493.9: Asthma, allergic NOS; 493: Asthma; 493.2: Asthma with COPD; 428.0: Right heart failure secondary to left heart failure; 428: Heart failure; and 428.2: Systolic heart failure. The only disease missed in Set C is Hypertension: in Set A it is represented by 405.99 Hypertensive Disease, while in Set B it has the ICD-9 code 401.9 Hypertension NOS, so the two were not matched because they do not belong to the same ICD-9 hierarchy. The conditions MR (mitral regurgitation) and TR (tricuspid regurgitation), which were found only in the clinical notes, were also false negatives; they are not displayed in the diagram.

DISCUSSION

We used the UMLS Metathesaurus and Semantic Network to derive ICD-9 codes for the concepts in the discharge summaries of 19 hospital admissions. A total of 40.7% of the UMLS concepts in the discharge summaries for the 19 visits could not be mapped to an ICD-9 code.

The overall overlap calculated by the alignment framework was 20.3%, while the overall overlap calculated manually was 43.56%. One possible cause of this discrepancy is that some UMLS concepts did not have an ICD-9 code and hence could not be detected as overlapping by our alignment framework.

Another possible cause of the discrepancy in the overlap rates is that the same medical problem can be assigned different ICD-9 codes. In Figure 3, Hypertension (997.91) was found in the discharge summaries (Set A), whereas the coder manually coded it as 401.9: Hypertension NOS (Set B). Even though both represented the same condition, they were viewed as different medical problems by our algorithm. Another possible cause is that a single ICD-9 code can represent more than one medical problem. For example, one visit had the official ICD-9 code “Asthma with COPD” (493.2), while the alignment framework found the codes “COPD” (496) and “acute bronchospasm” (519.11), which corresponds to asthma. Because these ICD-9 codes do not resemble one another, it was hard to tell that they referred to the same disease. Methods that can semantically integrate ICD-9 codes for similar diseases would help solve this problem.

The partial match method used to compute the overlap between the data sources worked well for most codes, but not for some hierarchies, such as “997.XX”, which groups unrelated diseases. For example, it produced an overlap between the concept “Hypertension” (CUI C0085623), which has the ICD-9 code 997.91, and the concept “Complications of external stoma of urinary tract”, which has the ICD-9 code 997.5.

The sensitivity and PPV values of the two medical problem lists (the MICD9 output and the official ICD-9 codes) suggest that completeness might be improved by combining the two data sources: some gold standard terms were found in one list but not the other, and vice versa. The PPV of the overlap was nearly 100%, which means that if a term appears on both lists, it is almost certain to be a true positive. The sensitivity of the overlap is low (30.3%), and our future goal is to improve it by addressing the issues mentioned above and by adding new data integration methods, such as converting the official ICD-9 codes to CUIs and computing semantic distances between concepts.

Errors in ICD-9 coding can arise for several reasons (e.g., misspecification of diagnoses by physicians or inexperienced coders). Automatic generation or recommendation of ICD-9 codes from discharge summaries can help coders validate their coding results and avoid potential problems. The resulting list of overlapping medical problems can also be used to analyze which diseases reported in EHR notes are also coded by coders or physicians, and why.

Despite the small sample size of this pilot study, we suspect that the discrepancies between structured and unstructured EHR data are common and substantial. On one hand, diseases that are secondary medical problems often do not receive an ICD-9 code. On the other hand, the ICD-9 coding system does not cover well the disease concepts commonly used by clinicians, and many disease concepts have no corresponding ICD-9 code. The ICD-9 class hierarchy is also semantically incoherent and not a strict hierarchy.

Moreover, identifying all the pertinent data for the same patient visit is not trivial. We need better methods to detect corresponding structured and unstructured data before we can align them. These open research questions need further investigation to achieve a complete patient profile, which will be our future work.

CONCLUSION

This pilot study contributes early knowledge of the information overlap between structured ICD-9 diagnoses and unstructured discharge summaries, as well as practical experience in using UMLS to convert medical problems to accurate ICD-9 codes. We also identify interesting topics for future research.

Acknowledgments

This research was funded under NLM grant R01 LM009886 and CTSA award UL1 RR024156. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NIH. We thank Dr. Carol Friedman and her team for providing MedLEE to parse discharge summaries into structured formats.

REFERENCES

1. O’Malley KJ, Cook KF, Price MD, Wildes KR, Hurdle JF, Ashton CM. Measuring Diagnoses: ICD Code Accuracy. Health Serv Res. 2005;40(5 Pt 2):1620–1639. doi:10.1111/j.1475-6773.2005.00444.x.
2. Li L, Chase H, Patel C, Friedman C, Weng C. Comparing ICD9-Encoded Diagnoses and NLP-Processed Discharge Summaries for Clinical Trials Pre-Screening: A Case Study. Proc AMIA Annu Symp 2008:404–408.
3. Meystre S, Haug P. Medical problem and document model for natural language understanding. AMIA Annu Symp Proc 2003:455–459.
4. Meystre S, Haug P. Comparing natural language processing tools to extract medical problems from narrative text. Stud Health Technol Inform. 2005;116:823–828.
5. Meystre S, Haug P. Automation of a problem list using natural language processing. BMC Med Inform Decis Mak. 2005;5(30). doi:10.1186/1472-6947-5-30.
6. Yao P, Wiggs BR, Gregor C, Sigurnjak R, Dodek P. Discordance between physicians and coders in assignment of diagnoses. Int J Qual Health Care. 1999;11(2):147–153. doi:10.1093/intqhc/11.2.147.
7. Friedman C. A broad-coverage natural language processing system. Proc AMIA Symp 2000:270–274.
8. Chase HS, Kaufman DR, Johnson SB, Mendonca EA. Voice Capture of Medical Residents’ Clinical Information Needs During an Inpatient Rotation. Journal of the American Medical Informatics Association. 2009;16:387–394. doi:10.1197/jamia.M2940.
