Author manuscript; available in PMC: 2011 Aug 1.
Published in final edited form as: J Biomed Inform. 2010 Mar 31;43(4):595–601. doi: 10.1016/j.jbi.2010.03.011

Selecting Information in Electronic Health Records for Knowledge Acquisition

Xiaoyan Wang 1, Herbert Chase 1, Marianthi Markatou 2, George Hripcsak 1, Carol Friedman 1
PMCID: PMC2902678  NIHMSID: NIHMS199562  PMID: 20362071

Abstract

Knowledge acquisition of relations between biomedical entities is critical for many automated biomedical applications, including pharmacovigilance and decision support. Automated acquisition of statistical associations from biomedical and clinical documents has shown some promise. However, acquisition of clinically meaningful relations (i.e. specific associations) remains challenging because textual information is noisy and co-occurrence does not typically determine specific relations. In this work, we focus on acquisition of two types of relations from clinical reports: disease-manifestation related symptom (MRS) and drug-adverse drug event (ADE), and explore the use of filtering by sections of the report to improve performance. Evaluation indicated that applying the filters improved recall (disease-MRS: from 0.85 to 0.90; drug-ADE: from 0.43 to 0.75) and precision (disease-MRS: from 0.82 to 0.92; drug-ADE: from 0.16 to 0.31). This preliminary study demonstrates that selecting information in narrative electronic reports based on the section improves the detection of disease-MRS and drug-ADE types of relations. Further investigation of complementary methods, such as more sophisticated statistical methods, more complex temporal models and use of information from other knowledge sources, is needed.

Keywords: knowledge acquisition, natural language processing (NLP), text mining, pharmacovigilance, decision support, electronic health record (EHR)

1. Introduction

Knowledge of relationships between biomedical entities, such as ‘a symptom is a manifestation of a disease’ or ‘a drug causes a symptom’, is critical for many automated biomedical applications. Such relations are often hidden in either the biomedical literature or narrative clinical reports because they are typically not explicitly stated in the text, particularly in the case of clinical reports [1]. When processing clinical reports, as an initial step, associations among entities are often acquired using statistical methods based on co-occurrences in the text before specific relationships can be identified. Natural language processing (NLP), as one of the high throughput technologies, can extract and encode massive amounts of text data in the literature and in clinical reports within a relatively short time [2]. Various NLP systems have been applied to the biomedical literature and to narrative clinical reports [3-8]. Text mining methods involving statistical methods combined with NLP systems have been explored in recent years for extracting and establishing associations between entities from textual data. Several studies have demonstrated the use of narrative clinical reports for discovering associations [9, 10]. Rindflesch and colleagues extracted drug and disease entities from the Mayo Clinic notes using the natural language system SemRep, and constructed a repository of drug-disease co-occurrences to validate inferences produced by SemRep from the literature about drug treatments for diseases [10].

Our group has been developing automated methods for detecting associations between clinical entities from both the biomedical literature and narrative clinical records [11-13]. Recently, we extended that work and adapted NLP and statistical methods to acquire knowledge of disease-symptom associations as well as potential drug-adverse drug event (ADE) associations for pharmacovigilance [14, 15]. The field of pharmacovigilance was introduced after the thalidomide tragedy [16], and surveillance systems have been developed to monitor large populations [17, 18]. Data mining algorithms using spontaneous reporting systems, pharmacoepidemiology databases and electronic health record (EHR) systems have produced some interesting results [19-21]. However, data mining algorithms for pharmacovigilance have focused on coded and structured data, and have therefore missed clinical data relevant to pharmacovigilance, such as “chills” and “suicidal thoughts”, that are generally available only in narrative EHR reports. Our work on detecting associations between drugs and potential ADEs is intended to provide a high throughput model and method to identify drug safety signals by mining narrative reports in the EHR [15]. Similarly, our work concerning detection of disease-symptom associations not only provides statistical evidence of association strength, but also enables automated updating of knowledge bases that provide comprehensive disease-specific information. In addition, our work demonstrates a method for customizing knowledge for particular patient populations. These knowledge bases are essential for clinical applications ranging from quality of care to hypothesis generation [14].

Determining clinically relevant relations from narrative reports using text mining remains challenging, however. There are two issues with the use of statistical associations obtained from text mining. First, the associations depend on statistical strength only. Statistically significant associations, however, may not be medically important. Second, these methods usually identify statistically significant associations without discriminating between the various types of relationships. For example, a drug-symptom association may represent the following relations: 1) a ‘treat’ relation in which the drug is used to treat a symptom or disease (e.g. ibuprofen-headache); 2) a ‘cause’ relation in which the drug causes the symptom (e.g. rosiglitazone-headache); or 3) an ‘indirect treat’ relation in which the drug treats a disease that is manifested by the symptom (e.g. rosiglitazone-(diabetes)-polyuria). Likewise, a disease-symptom association could represent the following relations: 1) a ‘manifestation’ relation in which the symptom is a direct manifestation of the disease (e.g. diabetes-polyuria); 2) an ‘indirect manifestation’ relation in which the symptom is a manifestation of a disease that is highly associated with the disease of interest (e.g. diabetes-(heart disease)-angina pectoris); or 3) a ‘treatment-induced’ relation in which the symptom is caused by a treatment or procedure (e.g. depressive disorder-(paroxetine)-chills). In order to detect the appropriate relation between drugs and potential ADEs for pharmacovigilance, it is critical to obtain the ‘drug cause symptom’ relationship and not the others, and therefore it is critical to eliminate the ‘treat’ or ‘indirect treat’ relations. Similarly, ‘treatment-induced’ symptoms should be eliminated when acquiring ‘disease-related manifestation’.

One possible way to help detect certain specific types of relations is to use temporal information related to clinical entities when computing statistical associations. For example, a symptom occurring before a drug is given is unlikely to be caused by that drug, and can be eliminated as a potential ADE. Temporal reasoning has received a great deal of attention from the medical and computer science communities [22]. Unlike structured data in EHRs, which is usually time-stamped, narrative reports frequently do not provide detailed temporal information corresponding to clinical entities, and when they do, it is frequently vague. A sophisticated temporal reasoning system, Timetext, has been developed to represent, extract and reason with temporal information in clinical narrative reports and may be useful in future work [23]. Other methods focused on determining whether an event occurred in the past or whether it is current. A method developed by Harkema and colleagues used a regular-expression based algorithm (ConText) to infer the status of a condition, including the temporal status, from clinical reports, and the study demonstrated reasonable performance [24]. Automatic acquisition of the temporal status of a condition, such as smoking status (‘past’ vs ‘current’), from clinical reports was also shown to be challenging [25]. However, more sophisticated temporal systems would be needed to determine whether the start of a medication event precedes or follows the occurrence of a condition. Alternative approaches to resolve the problem would involve applying information from other knowledge sources. For example, relations between a drug and the diseases it treats could be obtained from terminologies such as the National Drug File Reference Terminology (NDF-RT) in the UMLS [26].

However, before using the more complex approaches, a simple strategy of selecting information would be interesting to explore. Clinical information is usually expressed as either structured databases or narrative reports. Some clinical reports, such as discharge summaries for inpatients, often contain sections (e.g. “History of Present Illness”, “Chief Complaints”), which are demarcated by well-defined section headers. The section headers are often regarded as ‘containers’ of the clinical information providing relevant context, where coding systems and terminologies are considered as ‘contents’ [27]. Although many clinical NLP applications do not recognize section headers explicitly, several NLP systems can identify predefined sections among clinical notes [27-30].

In this study, we experiment with a simple approach to improve acquisition of relations, which is based on use of contextual filters consisting of the sections in the report where the clinical entities are found. We continue to build upon previous work that extracts certain types of entities (e.g. medications, diseases, symptoms), and establishes statistical associations between them [12-15]. We focus on detection of two types of relations: disease-manifestation related symptom (MRS), which includes both ‘manifestation’ and ‘indirect manifestation’ relations between a disease and a symptom, and a drug-potential ADE relation that denotes a ‘possible cause’ relation between a drug and a disease/symptom. In the present study we consider side effects of drugs only if they occurred during the hospital stay. We measure the performance of relation detection with and without the contextual filters, in order to analyze the effectiveness of these filters and to explore the usefulness of various types of information based on the section where the information occurs.

2. Methods

2.1. Materials

The data warehouse in NYPH maintains a variety of structured and unstructured patient information including narrative reports, coded laboratory data, and pharmaceutical orders. Textual discharge summaries for patients admitted in 2004 were used to determine associations in this study after IRB approval was obtained. An example of a partial discharge summary written in NYPH (some information has been modified to ensure patient privacy) is shown in Figure 1A, and the sample output generated by an NLP system is shown in Figure 1B and described in Section 2.2.2. The discharge summary describes a patient with metastatic breast cancer and mild myocardial infarction, who previously had hypertension and asthma and complained of severe cough, headache, chest pain and increased sweating at admission. In this study, an NLP system, MedLEE (the Medical Language Extraction and Encoding System), was used to parse and transform discharge summaries into structured representations consisting of UMLS codes with modifiers [8]. MedLEE output of the sentence ‘The patient has severe cough’ is illustrated in Figure 1B.

Figure 1. Example of a partial discharge summary in NYPH and simplified MedLEE output for the first sentence.

2.2. Research design and experiments

2.2.1. Contextual filters

A regular filter in this study is always applied and is used to include entities that are current and asserted about the patients, and not information occurring in the past, negated, or concerning family members. A contextual filter for a clinical entity in this study is defined as a filter that consists of the section in the discharge summary where the clinical entity occurs. Four contextual filters were designed, as described below, to determine the two types of relations of interest, and their effectiveness was evaluated.

Disease-MRS

In order to avoid those symptoms primarily caused by subsequent treatments or procedures occurring in the hospital course, two filters were developed to obtain diseases and symptoms occurring primarily at or before the time of admission. The filter Dx-Adm used information in the Diagnosis and History of Present Illness sections only. A similar filter Sx-Adm was developed that used information in the History of Present Illness and Chief Complaints sections only.

Drug-Potential ADE

An adverse drug event (ADE) in this study denotes a symptom/disease that, as shown by clinical evidence, is the result of taking a particular drug. In order to capture ADE relations, a filter Rx-HC was designed to capture drugs mentioned only in the Hospital Course and Medications sections in an effort to eliminate medications not given in the hospital, such as those in Discharge Medications. Drugs in the Medications section were included because, unless otherwise specified, this section typically contains medications prescribed during an outpatient visit prior to admission and continued in the hospital, or medications given at admission. Another contextual filter, ADE-HC, was designed to include symptoms/diseases occurring in the Hospital Course section only and to avoid diseases/symptoms mentioned in sections signifying that they occurred at the time of admission or before the hospital stay, such as Chief Complaints and History of Present Illness (HPI), since these sections more frequently describe the disease a patient has rather than an ADE.
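The four contextual filters can be pictured as sets of allowed sections. The sketch below is only an illustration under stated assumptions: the normalized section labels, the `Entity` record, and the `passes` helper are hypothetical names introduced here, not part of MedLEE or of the study's implementation.

```python
from dataclasses import dataclass

# Each contextual filter is modeled as the set of sections it accepts.
# The lowercase section labels are assumed normalized forms for this sketch.
CONTEXTUAL_FILTERS = {
    "Dx-Adm": {"diagnosis", "history of present illness"},
    "Sx-Adm": {"history of present illness", "chief complaints"},
    "Rx-HC": {"hospital course", "medications"},
    "ADE-HC": {"hospital course"},
}


@dataclass(frozen=True)
class Entity:
    """Hypothetical record for one extracted clinical entity."""
    name: str
    kind: str      # "disease", "symptom", or "drug"
    section: str   # section of the discharge summary where it was found


def passes(entity: Entity, filter_name: str) -> bool:
    """Return True if the entity's section is accepted by the named filter."""
    return entity.section.lower() in CONTEXTUAL_FILTERS[filter_name]
```

For instance, a symptom found in Hospital Course would pass ADE-HC but not Sx-Adm, and a drug listed only under Discharge Medications would be rejected by Rx-HC.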

2.2.2. Association Detection

The framework for acquiring clinical associations from narrative reports has been described in more detail in previous work [15]. The method consists of five main phases.

  1. Collecting the set of reports for the study. Discharge summaries in 2004 were extracted from NYPH data repository and de-identified in this study after obtaining IRB approval.

  2. Processing the reports. NLP was used to extract and encode clinical entities into UMLS codes with modifiers, such as negation, time, change, and degree. For example, as shown in Figure 1B, a sentence ‘the patient has severe cough’ is encoded as UMLS code C0010200 corresponding to ‘cough’, which has a certainty modifier high certainty corresponding to ‘has’ and a degree modifier high degree corresponding to ‘severe’.

  3. Selecting entities. The semantic classes of the UMLS codes were used to select the appropriate information types for this study, which consisted of two types of clinical entities (disease and symptom) and one environmental entity type (drug). For example, the UMLS codes that corresponded to the UMLS semantic classes Disease or Syndrome [T047], Mental or Behavioral Dysfunction [T048], and Neoplastic Process [T191] were used for disease entities. Thus heart failure would be selected as a disease because it is classified in the UMLS as semantic class T047.

  4. Applying regular and contextual filters. This was done in order to reduce the amount of potentially confounding information. By using regular filters, entities associated with modifiers corresponding to certain certainty values (negation, low certainty, workup), past events, or certain sections such as family history were excluded to ensure that the entities were about the patient and were current. Meanwhile, by applying contextual filters, the information most relevant to the questions of interest was kept.

  5. Applying statistical methods to detect disease-symptom and drug-potential ADE relations of interest. For the statistical analysis we obtained the frequencies of drugs, diseases, and symptoms, as well as frequencies of pairs of co-occurring drug-disease/symptom and disease-symptom. A pair was considered to co-occur if each entity of the pair was selected and not filtered out, and occurred within a single report. Selected co-occurring pairs in one report that consist of a symptom and a disease, such as ‘chest pain’ and ‘myocardial infarction’, will be included in the statistics used for disease-MRS detection. These types of entities frequently appear in different sentences and in different sections of a report, and their relationships are usually not expressed. Therefore typical linguistic characteristics, such as proximity and explicit relationships cannot be used. For example, in the report shown in Figure 1A, ‘atenolol’ in the Medication section was likely used to treat ‘chest pain’ in the Chief Complaint section, while ‘chest pain’ was likely a symptom of ‘myocardial infarction’ in the Admission/Discharge Diagnosis section.
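The report-level co-occurrence counting described in phase 5 can be sketched as follows. This is a minimal illustration, assuming each report is already reduced to `(name, kind, section)` tuples; the function name `count_cooccurrences` and the `keep` predicate (standing in for the regular and contextual filters) are hypothetical.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(reports, left_kind, right_kind, keep=lambda e: True):
    """Count report-level frequencies of single entities and of pairs.

    reports: list of reports, each a list of (name, kind, section) tuples.
    keep:    predicate standing in for the filters; entities for which it
             returns False are ignored.
    An entity or pair is counted at most once per report, however many
    times it is mentioned in that report.
    """
    singles = Counter()
    pairs = Counter()
    for entities in reports:
        kept = [e for e in entities if keep(e)]
        left = {name for name, kind, _ in kept if kind == left_kind}
        right = {name for name, kind, _ in kept if kind == right_kind}
        singles.update(left | right)
        # Every selected left entity co-occurs with every selected right
        # entity in the same report, regardless of sentence or section.
        pairs.update(product(sorted(left), sorted(right)))
    return singles, pairs
```

Passing a `keep` predicate that rejects a section changes which pairs are counted, which is the effect the contextual filters are meant to have on the statistics.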

To test the hypothesis of no association between a pair of entities, the χ2 statistic was used. In the present study, because the data are 2 × 2 tables with the same row margins, we computed the adjustment to the chi-square p-value that corresponds to tables with fixed row margins. Fixed row margin tests are partially conditional tests, where the conditioning argument is the variable that describes the row marginals. This conditioning guarantees that the margins of the table do not provide any evidence either for or against the null hypothesis of independence (i.e. no association). Fixed row margin volume tests have an interpretation similar to that of the unconditional volume tests; that is, they can be interpreted as a distance from the surface of independence. The larger the distance, the stronger the association. How the cutoff point was determined is described by Cao and colleagues [11, 12]. A list was then generated consisting of associations above the cutoff and their strengths.
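As a point of reference, the ordinary Pearson chi-square statistic for a 2 × 2 co-occurrence table can be computed as sketched below. Note the assumption: this is the plain, unadjusted test with one degree of freedom, not the fixed-row-margin volume test adjustment used in the study, and the function and argument names are hypothetical.

```python
import math


def chi_square_2x2(n_both, n_x_only, n_y_only, n_neither):
    """Pearson chi-square statistic and p-value for a 2 x 2 table.

    The four arguments are report counts: both entities present, only X,
    only Y, and neither. No continuity correction is applied.
    """
    a, b, c, d = n_both, n_x_only, n_y_only, n_neither
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a margin is zero; the statistic is undefined")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

A table with counts proportional to its margins gives a statistic of 0 (no evidence of association), and larger values indicate a stronger departure from independence.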

2.2.3. Experimental Design

We evaluated the effectiveness of the contextual filters in detecting the two types of relations of interest by performing a series of experiments. Since the contextual filters depend on accurate identification of the sections where the information occurred, we first performed an evaluation of MedLEE’s section detection method as applied to discharge summaries. The method uses straightforward pattern matching, which utilizes a set of predetermined section headers and their target forms. As shown in Figure 1B, the output associates each term that is extracted with the section it occurred in. We then evaluated the performance of relation detection using the contextual filters. Relations of disease-MRS were detected without any contextual filter, with Dx-Adm only, with Sx-Adm only, and with both Dx-Adm and Sx-Adm filters. Similarly, relations of drug-potential ADE were detected without any contextual filter, with Rx-HC only, with ADE-HC only, and with Rx-HC and ADE-HC.
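Header-based section detection of the kind described above can be sketched as a lookup from header variants to target forms. This is an illustration only: the header table, the regular expression, and the `split_sections` function are assumptions made here and do not reproduce MedLEE's actual header list or matching rules.

```python
import re

# Hypothetical mapping from header variants (lowercased) to target forms.
SECTION_HEADERS = {
    "chief complaint": "chief complaints",
    "chief complaints": "chief complaints",
    "history of present illness": "history of present illness",
    "hpi": "history of present illness",
    "hospital course": "hospital course",
    "medications": "medications",
    "discharge medications": "discharge medications",
}

# A candidate header is a short run of letters at line start ending in ':'.
HEADER_RE = re.compile(r"^\s*([A-Za-z][A-Za-z /]*?)\s*:\s*(.*)$")


def split_sections(text):
    """Split a report into (target_form, section_text) pairs.

    Lines before the first recognized header are dropped in this sketch.
    """
    sections = []
    current, buf = None, []
    for line in text.splitlines():
        m = HEADER_RE.match(line)
        target = SECTION_HEADERS.get(m.group(1).lower()) if m else None
        if target is not None:
            if current is not None:
                sections.append((current, " ".join(buf).strip()))
            current, buf = target, [m.group(2)]
        elif current is not None:
            # Not a known header: the line continues the current section.
            buf.append(line.strip())
    if current is not None:
        sections.append((current, " ".join(buf).strip()))
    return sections
```

Each extracted term can then be tagged with the target form of the section whose text contains it, which is the information the contextual filters rely on.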

2.2.4. Evaluation

Evaluation of Section Identification

A subset of 11 discharge summaries was used to evaluate the performance of section identification using MedLEE. These reports were randomly selected from the 2004 NYPH data repository.

Reference Standard

Two reference standards were constructed for the evaluation of section identification: one for all sections and a second only for the sections related to this study (i.e. the medication, hospital course, chief complaint, and diagnosis at admission sections). An independent physician, who was presented with the selected discharge summaries, formed the reference standard by identifying the sections corresponding to the concepts in the reports.

Quantitative Evaluation

Two metrics were used to assess the performance of section identification in this study. Recall was calculated as the ratio of the number of concept-section pairs that were identified by MedLEE that were correct according to the reference standards over the total number of concept-section pairs in the reference standards (i.e. TP/(TP+FN)). Precision was measured as the ratio of the number of concept-section pairs returned by MedLEE that were correct according to the reference standards divided by the total number of concept-section pairs found by our method (i.e. TP/(TP+FP)). Confidence intervals of recall and precision were calculated by bootstrapping to estimate the variability of the metrics.
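The two metrics can be written directly from their definitions. In the sketch below, concept-section pairs are represented as tuples and compared as sets; the function name and the tuple representation are assumptions for illustration.

```python
def precision_recall(predicted, reference):
    """Precision TP/(TP+FP) and recall TP/(TP+FN) over two collections.

    predicted: concept-section pairs returned by the system.
    reference: concept-section pairs in the reference standard.
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # returned and correct
    fp = len(predicted - reference)   # returned but not in the reference
    fn = len(reference - predicted)   # in the reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A concept assigned to the wrong section counts as both a false positive (the wrong pair was returned) and a false negative (the right pair was missed) under this representation.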

Evaluation of Contextual Filters

We evaluated the performance of contextual filtering using a total of 315 associations selected from the entire result set. The evaluation set contained 183 disease-MRS pairs associated with 12 diseases, and 132 drug-ADE pairs associated with seven drugs/drug classes. The diseases were selected based on their frequencies of occurrence in our database, where four strata were determined (i.e. most common, common, less common, rare). Three diseases were randomly chosen from each stratum. The seven drugs were chosen to detect 1) ADEs known before marketing, 2) ADEs that first became known after marketing and after 2004, and 3) common ADEs for drug classes. For more detailed information on the diseases and drugs/drug classes and how they were selected, please refer to our previous work [14, 15].

Reference Standard

To evaluate the relations, two reference standards (disease-MRS and drug-ADE respectively) from our previous work were used. The reference standards consisted of well-known reliable reference resources (i.e. Micromedex [31], WebMD [32], a textbook [33] and two experts). For disease-MRS, three references (a comprehensive consumer health online resource, a textbook and an expert) were combined to create the reference standard as follows: 1) if an association was agreed on by at least two resources, the majority was chosen as the reference standard; 2) if an association was not agreed on by any of the resources, the response from WebMD was chosen to be the reference standard due to its comprehensiveness. In contrast, for the drug-ADE relation, the reference standard was formed based on an expert and Micromedex, a well-respected, evidence-based reference [31]. The expert summarized the ADEs for each drug/drug class based on his medical knowledge and knowledge from Micromedex. Therefore, for evaluation purposes, the study related to drug-ADE relations was based on known ADEs, but our method was developed to detect all potential ADEs, including ones that are not yet known.

Quantitative evaluation

Similarly, two metrics were used to assess the performance of each experiment. Recall was calculated as the ratio of the number of distinct disease-MRS/drug-ADE pairs that were identified by an experiment over the total number of the corresponding disease-MRS/drug-ADE pairs in the reference standards. Precision was measured as the ratio of the number of distinct disease-MRS/drug-ADE pairs returned by an experiment that were correct according to the reference standards divided by the total number of disease-MRS/drug-ADE pairs found by the experiment. Confidence intervals of recall and precision were calculated by bootstrapping to estimate the variability of the metrics.
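A percentile bootstrap is one common way to obtain such confidence intervals. The sketch below is an assumption about the general technique, since the text does not state which bootstrap variant or how many resamples were used; `bootstrap_ci` and its parameters are hypothetical names.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a proportion.

    outcomes: list of 0/1 flags. For recall, one flag per reference pair
    (1 if it was detected); for precision, one flag per returned pair
    (1 if it is correct). The metric is the mean of the flags.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and recompute the metric each time.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

With a fixed seed the interval is reproducible; the interval narrows as the number of evaluated pairs grows.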

Qualitative Analysis

To understand how the contextual filters affected the performance of relation detection, a qualitative analysis was performed. A random sample of relations in the evaluation set was manually reviewed and compared to the reference standard, and an error analysis was performed to categorize the types of errors that occurred with and without contextual filters. For disease-MRS pairs, a relationship of ‘manifestation’ or ‘indirect manifestation’ between a disease and a symptom was considered a true positive. False disease-MRS relations were categorized into two groups: 1) ‘treatment-induced’ relations, in which a symptom was caused by a treatment or procedure based on clinical evidence (e.g. chills in diabetes-(metformin)-chills); 2) ‘unknown’ relations, in which a symptom was either conceptually poorly defined (e.g. difficulty in depressive disorder-difficulty) or currently unknown to be associated with a disease (e.g. gravida 0 in kidney disease-gravida 0). For drug-potential ADE pairs, a relationship of ‘cause’ between a drug and a symptom was considered a true positive. False drug-potential ADE relations were further classified into the following categories: 1) ‘treat’ relations, in which a drug treats the symptom/disease (e.g. pain in ibuprofen-pain); 2) ‘indirect treat’ relations, in which a symptom/disease is known to be highly associated with, or to be a consequence of, the indications that a drug treats (e.g. tremor in rosiglitazone-(diabetes-hypercholesterolemia-stroke)-tremor); 3) ‘unknown’ relations, in which a symptom/disease is either conceptually poorly defined (e.g. thicken in paroxetine-thicken) or currently unknown to be associated with a drug (e.g. erythema in rosiglitazone-erythema). Since our reference standard for drug-ADEs by necessity was based on known drug-ADE relations, it is possible that some of the unknown drug-ADE associations could be novel associations, but further exploration would be necessary.

3. Results

3.1. Data Statistics

The case reports in this study included a total of 25,074 discharge summaries from NYPH. Results with and without the four contextual filters are summarized in Table 1. There were a total of 206,812 disease occurrences in all the sections (i.e. without any contextual filter), whereas 124,305 (60%) disease occurrences were selected when the Dx-Adm filter was applied, and only 8,072 (4%) were found using the ADE-HC filter. Of the 220,672 symptom occurrences found without any filter, about half (51%) occurred in the admission-related sections selected by Sx-Adm, and 43% occurred in the Hospital Course section selected by ADE-HC. For drug-ADE relations, Rx-HC retained about 48% of the drug occurrences. Similarly, of the 1,718 unique symptom concepts found without any contextual filter, 563 (32%) and 732 (42%) remained after the Sx-Adm and ADE-HC filters respectively. There were 1,997 unique drug concepts obtained using Rx-HC compared to a total of 2,289 without filters.

Table 1.

Summary of the Concepts Identified With and Without Contextual Section Filtering

Data in the corpus | Count (no filter) | Count (disease-MRS filters) | Count (drug-ADE filters)
Total disease occurrences | 206,812 | 124,305 | 8,072
Total symptom occurrences | 220,672 | 113,299 | 95,290
Total drug occurrences | 295,493 | NA | 143,828
Unique disease concepts | 3,449 | 1,366 | 175
Unique symptom concepts | 1,718 | 563 | 732
Unique drug concepts | 2,289 | NA | 1,997

3.2. Results of Evaluation

3.2.1 Evaluation of Section Identification

There were 791 relevant concepts associated with the 11 discharge summaries. Results of the section identification evaluation are summarized in Table 2. Recall and precision were 0.91 and 0.92 respectively for all sections. Recall and precision for the relevant sections only were 0.97 and 0.99 respectively. A further analysis indicated that some rare sections (i.e. ‘Impairments’, ‘Discharge Equipment’) were not recognized by MedLEE, but these had little effect on the terms in the relevant sections.

Table 2.

Summary of recall and precision for identification of overall sections and relevant sections

Metrics | Overall sections (95% CI) | Relevant sections (95% CI)
Recall | 0.91 (0.84-0.96) | 0.97 (0.94-0.99)
Precision | 0.92 (0.85-0.95) | 0.99 (0.97-1.00)

3.2.2. Evaluation of Contextual Filters

Quantitative evaluation

Table 3 presents the results of the quantitative evaluation of the contextual filters. For filters designed to detect disease-MRS relations, the Dx-Adm filter alone barely increased recall and precision, and the Sx-Adm filter increased recall slightly from 0.85 to 0.88 and precision from 0.82 to 0.89. Applying both Dx-Adm and Sx-Adm contextual filters for acquiring disease-MRS relations indicated that the recall was increased from 0.85 to 0.90 and precision was increased from 0.82 to 0.92.

Table 3.

Summary of recall and precision with/without contextual filters

Relation | Filter | Recall (95% CI) | Precision (95% CI)
Disease-MRS | No filter | 0.85 (0.79-0.89) | 0.82 (0.76-0.86)
Disease-MRS | Dx-Adm | 0.86 (0.81-0.90) | 0.82 (0.78-0.87)
Disease-MRS | Sx-Adm | 0.88 (0.84-0.92) | 0.89 (0.84-0.93)
Disease-MRS | Dx-Adm & Sx-Adm | 0.90 (0.86-0.94) | 0.92 (0.88-0.95)
Drug-ADE | No filter | 0.43 (0.39-0.47) | 0.16 (0.12-0.20)
Drug-ADE | Rx-HC | 0.48 (0.43-0.51) | 0.19 (0.13-0.22)
Drug-ADE | ADE-HC | 0.54 (0.50-0.57) | 0.23 (0.19-0.26)
Drug-ADE | Rx-HC & ADE-HC | 0.75 (0.68-0.79) | 0.31 (0.26-0.35)

For filters designed to acquire drug-potential ADE relations, using drugs in the hospital course (Rx-HC) increased recall from 0.43 to 0.48 and precision from 0.16 to 0.19, whereas using potential ADEs from the hospital course section (ADE-HC) improved recall from 0.43 to 0.54 and precision from 0.16 to 0.23. When both filters were used, recall and precision improved almost two-fold (recall: from 0.43 to 0.75; precision: from 0.16 to 0.31).

Qualitative Analysis

Using contextual filters eliminated 15 of the 34 false positives for the disease-MRS relation and 206 of the 298 false positives for the drug-ADE relation. A few examples of the false positive relations are presented in Table 4. Note that some known ADEs of drugs used to treat a disease (e.g. metformin to treat diabetes), such as cough and chills, were erroneously obtained as symptoms related to the disease when contextual filters were not applied (Table 4A). Similarly, more manifestations of diabetes, such as polyuria and visual impairment, were erroneously obtained as potential ADEs for rosiglitazone when contextual filters were not applied (Table 4B).

Table 4.

Elimination of False Positive Relations with Contextual Filters

A

Disease | Filter | Manifestation-Related Symptom (treatment induced)
Diabetes | No filter | cough, chills
Diabetes | Dx-Adm & Sx-Adm | (none)
Depressive disorder | No filter | pain chest, dyspnea
Depressive disorder | Dx-Adm & Sx-Adm | (none)

B

Drug | Filter | Treat | Indirect treat
Rosiglitazone | No filter | polyuria, visual impairment, sensory discomfort, sciatica, asthenia, sweating increased, dizziness | bruit, renal angle tenderness, facial paresis, slurred speech
Rosiglitazone | Rx-HC & ADE-HC | syncope, vertigo | tremor, pins and needle, cyanosis, colic abdominal

A: Contextual filters for Dx-MRS; B: Contextual filters for Rx-ADE

4. Discussion

This study explored the effectiveness of contextual filters for selecting information from an EHR when automatically acquiring certain relations using statistical methods. As an initial step, four contextual filters designed for two types of relations were examined. Our findings demonstrated that applying contextual filters based on the sections where clinical entities occurred improved both recall and precision in detecting the specific relations of interest. The results highlight the potential of the filters for improving the performance of automated knowledge acquisition.

Upon examining the results obtained in the study, we found that when the admission-related filters Dx-Adm and Sx-Adm were used 1) more of the treatment-induced relations were removed, 2) more of the manifestation relations were obtained, and 3) the ADE events related to treatments or procedures performed in the hospital course were reduced. In contrast, when the hospital course information filters were used (i.e. ADE-HC and Rx-HC filters), more ADEs and fewer treatment-related manifestations were found. This explains why use of contextual information improved the performance of detecting meaningful relations. In addition, it appears that detection of drug-potential ADE relations benefits more from selecting information using contextual filters than detection of disease-MRS relations, likely because ADEs are sparsely distributed in clinical reports while symptoms related to diseases occur more often. Filtering information is therefore more effective for drug-ADE relations.

Determining the specific relations is still difficult, and filtering by sections is only a partial solution. Other approaches could be combined with it. In a parallel study, we have already explored an approach based on mutual information (MI) and its data processing inequality (DPI) property to help differentiate direct and indirect relations between clinical entities [34]. For example, a symptom could be a direct manifestation of a disease (disease-manifestation), or it could be a manifestation of another disease that is highly associated with the disease of interest (disease-indirect manifestation). Our preliminary results demonstrate that this information theoretic approach also shows promise in differentiating direct and indirect relations. Additionally, knowledge resources such as the UMLS contain rich, manually curated biomedical knowledge, especially disease-specific knowledge, and their use could be explored to address the problem as well. A further line of research would extend the proposed methodology with more sophisticated statistical methods, more complex temporal models, and additional information from knowledge sources in order to differentiate between the different types of relations observed.
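To make the MI and DPI idea more concrete, the sketch below computes mutual information between two binary presence/absence variables from report counts, and then applies a simple DPI-based pruning step in which the weakest edge of every fully connected triple is treated as indirect. This is an ARACNE-style simplification chosen for illustration, and it is not necessarily the procedure used in [34]; the function names and the toy MI values are assumptions.

```python
import math
from itertools import combinations


def mutual_information(n_xy, n_x, n_y, n):
    """MI in bits between two binary variables, from report counts.

    n_xy: reports containing both entities; n_x, n_y: reports containing
    each entity; n: total number of reports.
    """
    # The four cells of the 2x2 contingency table.
    cells = {
        (1, 1): n_xy,
        (1, 0): n_x - n_xy,
        (0, 1): n_y - n_xy,
        (0, 0): n - n_x - n_y + n_xy,
    }
    mi = 0.0
    for (x, y), count in cells.items():
        if count == 0:
            continue  # a zero cell contributes nothing to the sum
        p_xy = count / n
        p_x = (n_x if x else n - n_x) / n
        p_y = (n_y if y else n - n_y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def dpi_prune(mi):
    """Remove the weakest edge of every fully connected triple.

    mi maps frozenset({a, b}) to the mutual information of that pair.
    Triples are checked against the original edge set, not the pruned one.
    """
    nodes = sorted(set().union(*mi))
    indirect = set()
    for a, b, c in combinations(nodes, 3):
        edges = [frozenset(pair) for pair in ((a, b), (a, c), (b, c))]
        if all(edge in mi for edge in edges):
            indirect.add(min(edges, key=mi.get))
    return {edge: value for edge, value in mi.items() if edge not in indirect}


# Toy MI values (invented): the symptom is linked to the disease only
# through a strongly associated comorbidity.
toy_mi = {
    frozenset({"disease", "comorbidity"}): 0.30,
    frozenset({"comorbidity", "symptom"}): 0.25,
    frozenset({"disease", "symptom"}): 0.10,
}
pruned = dpi_prune(toy_mi)
```

Here the disease-symptom edge has the lowest MI in the triple and is removed, which corresponds to labelling it an indirect manifestation. A practical implementation would likely add a tolerance threshold so that near-ties are not pruned.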

There are a number of limitations in this study. One limitation is that we included only discharge summaries, which are narrative reports for inpatients. As a result, our findings are based on single admissions and are applicable in the context of a sick patient population. However, this limitation is due to the type of reports we focused on and not to the methodology. A similar approach could be extended to the longitudinal record, including outpatient reports, although this would likely present different challenges. Another limitation is that this method depends on accurate section identification. Sections in discharge summaries of NYPH are typically regular, but other reports may not have such a regular structure. For example, outpatient clinic notes at NYPH frequently do not have well-defined sections, and many section headers are abbreviated. In that situation, a more sophisticated section identification system would have to be developed. Denny et al. have developed a section header terminology that contained 99.9% of the clinical sections in a randomly selected corpus of history and physical (H&P) notes [35]. These researchers further developed an algorithm to identify and characterize certain section headers in H&P notes [30]. The evaluation indicated that the algorithm could accurately recognize both labeled and unlabeled section headers for these types of reports [30]. This type of algorithm may be needed in other institutions or for other types of clinical reports when selecting information in clinical documents. Alternatively, instead of section headers, a longitudinal record consisting of multiple reports for one patient could be collected, and the temporal information associated with the reports, such as the time of the visit, could be used. Even accurate section identification, however, does not guarantee that all the information in a section is appropriate. For example, the sentence ‘the patient presented with high fever today’ could appear in the Past History section. More complex methods should be investigated to identify clinical entities with accurate temporal information, although considering certain modifiers, such as ‘today’ or ‘now’, regardless of the section could partially solve the problem. An additional problem is that chronic conditions often occur in the Past Medical History section although the patient still has the condition. For example, hypertension often occurs in the Past Medical History section, but based on medical knowledge, it is likely that the patient still has that condition.
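The modifier-based workaround mentioned above can be pictured as a small rule that lets an explicit temporal modifier override the default implied by the section. The sketch below is only an illustration under assumptions: the modifier list is limited to the two examples given in the text, and the section names in HISTORICAL_SECTIONS and the function is_current are hypothetical.

```python
import re

# Only the two modifiers named in the text; a real list would be longer.
CURRENT_MODIFIERS = re.compile(r"\b(today|now)\b", re.IGNORECASE)

# Hypothetical names for sections that normally describe past events.
HISTORICAL_SECTIONS = {"past history", "past medical history"}


def is_current(sentence, section):
    """Guess whether a finding in `sentence` refers to the current episode.

    An explicit modifier such as 'today' wins regardless of the section;
    otherwise the section decides.
    """
    if CURRENT_MODIFIERS.search(sentence):
        return True
    return section.lower() not in HISTORICAL_SECTIONS


example = is_current("the patient presented with high fever today", "Past History")
```

A rule of this kind does not address the chronic-condition case (for example, hypertension listed under Past Medical History), which would need medical knowledge about which conditions persist.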

A further limitation of this investigation is that, although the evaluation involved a total of 315 relations, they corresponded to a set of only 12 diseases and seven drugs or drug classes. Also, the reference standard was obtained using only two experts and three other information resources. The majority rule was applied to create the reference standards, and an a priori preference was used when there was no majority among the resources. As pointed out by Pestian et al. [36], there are potential problems with this strategy, one of which is the level of inter-rater agreement among resources. Kappa agreement statistics were computed in this work, but there are more sophisticated techniques to evaluate and improve the reliability of the reference standard, as discussed by Hripcsak and Heitjan [37]. A more comprehensive evaluation involving a larger sample size and a more reliable reference standard will be undertaken in future work.
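The reference-standard strategy described above (majority rule with an a priori preference as tie-breaker, plus a kappa statistic) can be sketched as follows. This is an illustrative reading, not the exact procedure of the study: it assumes that a resource may be silent on a relation (vote of None), that ties are broken by a fixed preference order over resources, and that agreement is measured with Cohen's kappa for two raters. The resource names and function names are hypothetical.

```python
from collections import Counter


def majority_label(votes, preferred_order):
    """Combine True/False/None votes from several resources.

    votes: dict mapping resource name to True, False, or None (silent).
    On a tie, the vote of the first resource in preferred_order that
    expressed an opinion is used. Returns None if nobody voted.
    """
    tally = Counter(v for v in votes.values() if v is not None)
    if tally[True] != tally[False]:
        return tally[True] > tally[False]
    for resource in preferred_order:
        if votes.get(resource) is not None:
            return votes[resource]
    return None


def cohen_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k]
                   for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1 - expected)


# Toy votes (invented): one expert says yes, one says no, one resource is silent.
label = majority_label(
    {"expert_1": True, "expert_2": False, "resource_a": None},
    preferred_order=["expert_1", "expert_2", "resource_a"],
)
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

In the toy case the tie is resolved in favour of expert_1, and the two raters agree on three of four items, giving an observed agreement of 0.75 against a chance agreement of 0.5.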

5. Conclusion

Narrative text, such as clinical reports, has been shown to be a rich resource for biomedical knowledge. In this study, we applied contextual filters, based on the sections of clinical discharge summaries in which certain clinical entities occurred, to select relevant clinical information from the EHR, and then used that information to improve the performance of relation detection. The results demonstrated that contextual filters improved automated knowledge acquisition of certain relations, which is critical for many clinical applications including pharmacovigilance and decision support. A strategy that combines complementary methods, such as more sophisticated statistical methods, more complex temporal models, and information from other knowledge sources, could be devised to help identify specific types of relations among clinical entities.

Acknowledgments

The authors thank Lyudmila Shagina for assistance with the use of MedLEE, and Dr Amy Chused for reviewing the results obtained in this study and constructing the reference standards. This work is supported in part by grants T15-LM007079 (XW), R01 LM010016, R01 LM010016-0S1, R01 LM010016-0S2 (CF), R01 LM008635 (CF), and R01 LM06910 (GH) from the National Library of Medicine, and DMS-0504957 (MM) from the National Science Foundation.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  • 1. Weeber M, Klein H, Aronson AR, Mork JG, de Jong-van den Berg LT, Vos R. Text-based discovery in biomedicine: the architecture of the DAD-system. Proc AMIA Symp. 2000:903–7.
  • 2. Baruch JJ. Progress in programming for processing English language medical records. Ann N Y Acad Sci. 1965;126:795–804. doi: 10.1111/j.1749-6632.1965.tb14324.x.
  • 3. Christensen L, H P, Fiszman M. MPLUS: a probabilistic medical language understanding system. Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain. 2002:29–36.
  • 4. Hahn U, Romacker M, Schulz S. Creating knowledge repositories from biomedical reports: the MEDSYNDIKATE text mining system. Pac Symp Biocomput. 2002:338–49.
  • 5. Aronson AR, Bodenreider O, Chang HF, Humphrey SM, Mork JG, Nelson SJ, Rindflesch TC, Wilbur WJ. The NLM Indexing Initiative. Proc AMIA Symp. 2000:17–21.
  • 6. Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform. 2003;36:462–77. doi: 10.1016/j.jbi.2003.11.003.
  • 7. Chen L, Friedman C. Extracting phenotypic information from the literature via natural language processing. Stud Health Technol Inform. 2004;107:758–62.
  • 8. Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc. 2004;11:392–402. doi: 10.1197/jamia.M1552.
  • 9. Heinze DT, Morsch ML, Holbrook J. Mining free-text medical records. Proc AMIA Symp. 2001:254–8.
  • 10. Rindflesch TC, Pakhomov SV, Fiszman M, Kilicoglu H, Sanchez VR. Medical facts to support inferencing in natural language processing. AMIA Annu Symp Proc. 2005:634–8.
  • 11. Cao H, Markatou M, Melton GB, Chiang MF, Hripcsak G. Mining a clinical data warehouse to discover disease-finding associations using co-occurrence statistics. AMIA Annu Symp Proc. 2005:106–10.
  • 12. Cao H, Hripcsak G, Markatou M. A statistical methodology for analyzing co-occurrence data from a large sample. J Biomed Inform. 2007;40:343–52. doi: 10.1016/j.jbi.2006.11.003.
  • 13. Chen ES, Hripcsak G, Xu H, Markatou M, Friedman C. Automated acquisition of disease drug knowledge from biomedical and clinical documents: an initial study. J Am Med Inform Assoc. 2008;15:87–98. doi: 10.1197/jamia.M2401.
  • 14. Wang X, Friedman C, Chused A, Markatou M, Elhadad N. Automated knowledge acquisition from clinical narrative reports. AMIA Annu Symp Proc. 2008;6:783–7.
  • 15. Wang X, Hripcsak G, Markatou M, Friedman C. Active computerized pharmacovigilance using natural language processing, statistics, and electronic health records: a feasibility study. J Am Med Inform Assoc. 2009;16:328–37. doi: 10.1197/jamia.M3028.
  • 16. McBride W. Thalidomide and congenital malformation. Lancet. 1961;2:1238.
  • 17. http://www.emea.europa.eu/
  • 18. http://www.fda.gov/cder/aers/default.htm
  • 19. Wood L, Martinez C. The general practice research database: role in pharmacovigilance. Drug Saf. 2004;27:871–81. doi: 10.2165/00002018-200427120-00004.
  • 20. Sturkenboom M. Other database in Europe for the analytic evaluation of drug effects. In: Pharmacovigilance. 2nd ed. Chichester: Wiley; 2007.
  • 21. Wysowski DK, Swartz L. Adverse drug event surveillance and drug withdrawals in the United States, 1969-2002: the importance of reporting suspected reactions. Arch Intern Med. 2005;165:1363–9. doi: 10.1001/archinte.165.12.1363.
  • 22. O’Connor MJ, Shankar RD, Parrish DB, Das AK. Knowledge-level querying of temporal patterns in clinical research systems. Stud Health Technol Inform. 2007;129:311–5.
  • 23. Zhou L, Friedman C, Parsons S, Hripcsak G. System architecture for temporal information extraction, representation and reasoning in clinical narrative reports. AMIA Annu Symp Proc. 2005:869–73.
  • 24. Harkema H, Dowling JN, Thornblade T, Chapman WW. ConText: An algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform. 2009;10:10. doi: 10.1016/j.jbi.2009.05.002.
  • 25. Uzuner O, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc. 2008;15:14–24. doi: 10.1197/jamia.M2408.
  • 26. http://www.nlm.nih.gov/research/umls/
  • 27. Mori AR, C F, Galeazzi E. A tagging system for section headings in a CEN standard on patient record. Proc AMIA Symp. 1998:755–9.
  • 28. Meystre S, H P. Automation of a problem list using natural language processing. BMC Med Inform Decis Mak. 2005:5. doi: 10.1186/1472-6947-5-30.
  • 29. Meystre S, H P. Natural language processing to extract medical problems from electronic clinical documents: Performance evaluation. J Biomed Inform. 2006;39:589–99. doi: 10.1016/j.jbi.2005.11.004.
  • 30. Denny JC, S A, Johnson KB, Peterson NB, Peterson JF, Miller RA. Evaluation of a method to identify and categorize section headers in clinical documents. J Am Med Inform Assoc. 2009;16:806–15. doi: 10.1197/jamia.M3037.
  • 31. http://www.micromedex.com/
  • 32. http://www.webmd.com/
  • 33. Ferri F. Ferri’s Differential Diagnosis: A Practical Guide to the Differential Diagnosis of Symptoms, Signs, and Clinical Disorders. Mosby Elsevier; 2006.
  • 34. Wang X, Hripcsak G, Friedman C. Characterizing Environmental and Phenotypic Associations using Information Theory and Electronic Health Records. BMC Bioinformatics. 2009; in press. doi: 10.1186/1471-2105-10-S9-S13.
  • 35. Denny JC, M R, Johnson KB, Spickard A 3rd. Development and evaluation of a clinical note section header terminology. AMIA Annu Symp Proc. 2008:156–60.
  • 36. Pestian J, Brew C, Matykiewicz P, Hovermale D, Johnson N, Cohen K, Duch W. A Shared Task Involving Multi-label Classification of Clinical Free Text. Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing; 2007.
  • 37. Hripcsak G, Heitjan DF. Measuring agreement in medical informatics reliability studies. J Biomed Inform. 2002;35:99–110. doi: 10.1016/s1532-0464(02)00500-2.
