Abstract
Patient Electronic Health Records (EHRs) typically contain a substantial amount of data, which can lead to information overload for clinicians, especially in high-throughput fields like radiology. Thus, it would be beneficial to have a mechanism for summarizing the most clinically relevant patient information pertinent to the needs of clinicians. This study presents a novel approach for the curation of clinician EHR data preference information towards the ultimate goal of providing robust EHR summarization. Clinicians first provide a list of data items of interest across multiple EHR categories. Since this data is manually dictated, it has limited coverage and may not cover all the important terms relevant to a concept. To address this problem, we have developed a knowledge-driven semantic concept expansion approach by leveraging rich biomedical knowledge from the UMLS. The approach expands 1094 seed concepts to 22,325 concepts with 92.69% of the expanded concepts identified as relevant by clinicians.
1. Introduction
Patient Electronic Health Records (EHRs) usually contain very detailed information and are a source of a large amount of clinical data for a patient. EHRs document various aspects of the patients’ clinical information such as the reason for visit, problem list, labs and test results, allergies, medications, etc. They include information over multiple patient encounters and care provided by different healthcare professionals. Although the information is valuable, an abundance of patient information can lead to an information overload condition for clinicians1. In order to find information pertinent to the current case, clinicians may go over several notes, labs, and reports spread across multiple visits and years. In such cases, identifying the most relevant information for clinical decision making can be difficult and time-consuming. Moreover, current EHR systems often do not present this tremendous amount of patient data in a way that supports the clinical workflow or cognitive reasoning, and immensely large records can negatively affect clinical work due to error of omission or delay2,3. Thus, it would be beneficial to have a mechanism for summarizing the most clinically relevant patient information. Previous studies4,5 have shown that EHR summaries can have a positive impact on overall patient care.
Selecting the relevant information from a patient record is a difficult problem. In this work, we propose a method for the curation and expansion of EHR data preferences, towards summarization based on clinician identified concepts. In the first step, clinicians manually generate a summarization blueprint, or a “summary template”, by specifying which information they would like to see in a holistic summary of a patient’s EHR. This patient information is captured by identifying important clinical concepts and their categories that are useful for generating summary documents. Since the summary template is generated manually, it has limited coverage and may not cover all the important terms relevant to a concept. For example, for the “diabetes” concept, there are multiple relevant concepts such as diabetes mellitus, high blood sugar, diabetes mellitus infantile, insulin pump, diabetes mellitus insulin dependent, insulin resistance, pregnancy induced diabetes, etc. Moreover, for a given seed concept, the related concepts may span across multiple categories such as medication, labs, and allergy.
The objective of this work is to semantically expand the clinician-generated initial summarization template in order to have broader coverage of clinically relevant concepts and to categorize these concepts into the 11 selected clinical categories. Although powerful tools such as the Unified Medical Language System (UMLS)6 Metathesaurus are able to aid in the expansion of medical concepts, there is no universally agreed upon method for performing this expansion. The most straightforward expansion methods all have notable limitations. Keyword-based search on the UMLS Metathesaurus does not consider the semantics of the query and returns all the concepts containing the keyword; it does not rank the expanded concepts by their relevance to the seed concepts. And while UMLS has relationships between nodes, these relationships are often vague and inconsistent in their granularity, making simple graph traversal perilous. Using UMLS semantic types and relationships yields a large number of possibly related concepts that have to be filtered carefully to select relevant expansions. Thus, most naïve methods of semantic expansion would require a large amount of manual correction by clinician experts.
In order to address this problem, we propose the following multi-step summary template generation approach. First, clinician experts provide a list of important concepts of interest across multiple summarization categories. As these experts may not have knowledge of the UMLS structure, these concepts are initially provided in plain-text format rather than mapped to UMLS identifiers. In the second step, we use a clinical concept extractor7 to map these plain-text terms to UMLS concept identifiers automatically. This automated mapping is then validated and corrected as necessary by the original clinical experts. These validated seed concepts are then automatically expanded by leveraging rich information from a UMLS-backed biomedical knowledge graph. The seed concepts are expanded by identifying their clinical variants and related concepts based on the hierarchy and relationships from the biomedical knowledge graph. Further, the expanded concepts are filtered by removing duplicate concepts and semantically irrelevant concepts. In the fourth step, the filtered concepts are categorized into the 11 selected clinical categories based on concept semantics and a keyword-based approach. Finally, clinicians review and validate the expanded template to ensure concept relevancy and correct categorization. This expanded template can then be used as a guide for understanding clinical data preferences to aid with the prioritization of data in an EHR summary.
This work aims to utilize semi-automated clinician-in-the-loop systems in order to efficiently understand data preferences. By doing so, we are able to keep the burden on the clinical experts low while greatly expanding upon their seed concepts and ensuring that their initial intent is understood and met. Further, linking every term to one or more concepts in the knowledge base ensures that they can be mapped to EHR data from disparate sources. The proposed method is able to expand 1,094 seed concepts from 11 clinical categories to 22,325 expanded concepts. This is done while keeping the relevancy rate at 92.69%, with 84.52% of concepts both relevant and categorized into the correct clinical categories, which helps to reduce the validation burden. By streamlining data preference gathering and interpretation, summarization systems may more easily be tailored towards the needs of specific specialists. To the best of our knowledge, this is the first EHR data prioritization approach that keeps the clinicians’ inputs as the focal point while leveraging the rich knowledge from a biomedical knowledge base. The data prioritization and expansion techniques described herein should be applicable across disciplines.
1.1. Related work
1.1.1. Patient summarization report generation
Summarization methods can be broadly categorized as extractive or abstractive5,8. Extractive summaries are created by borrowing phrases or sentences from the original input text. In the domain of clinical summarization, an extractive approach can identify pieces of the patient’s record and display them without providing additional layers of abstraction. Abstractive summaries generate new text that synthesizes the original text. In the domain of clinical summarization, abstractive summaries may provide additional higher-level context to explain the data, such as computed quantities (e.g., trends) or automatically generated text. Much of the current research on summarization in the biomedical domain has focused on text summarization, in which one or more texts are reduced to a single condensed reference text9. Text summarization strategies have been developed for automated summarization of scientific literature10, generation of literature abstracts11, and summarization and translation9.
Another dichotomy in summary generation techniques is between methods based on knowledge and those based on data. The data-driven approaches require less context-specific knowledge to create the summary, while “knowledge-rich” methods necessitate larger and more advanced knowledge bases8. A review by Mishra et al.12 indicated that there is a growing interest in knowledge-rich approaches in the biomedical domain, coinciding with the increased availability of comprehensive lexical resources, such as WordNet13 and the Unified Medical Language System (UMLS). Common techniques of extraction-based summarization include topic-based sentence extraction14, where the relevance of a sentence is computed with respect to one or more topics of interest; Van Vleck et al.15 performed structured interviews to identify and classify phrases that clinicians considered relevant to explaining a patient's history.
1.1.2. Automatic expansion of medical terms
In this paper, we present an approach to data expansion reliant on automatic semantic expansion of seed terms. This is similar to existing methods in information retrieval (IR) that fall broadly into the category of automatic query expansion (AQE)16. The majority of AQE techniques are tailored towards Web search and rely on relevancy feedback17, click-through data18,19 or similar queries16. Until recently most techniques used in AQE were “knowledge-poor”, but with recent advancements in knowledge graphs20,21 some AQE techniques have been developed to leverage this knowledge22.
In the medical domain, researchers have developed AQE techniques based on domain-specific ontologies and using the UMLS Metathesaurus. Most of the recent medical IR research has focused on developing knowledge-based (or concept-based) retrieval models dependent on medical resources such as the UMLS. Aronson and Rindflesch23 use the MetaMap program for associating UMLS Metathesaurus concepts with the original query. They conclude that the optimal strategy would be to combine query expansion with retrieval feedback. Many studies have also utilized the Medical Subject Headings (MeSH) thesaurus in query expansion24. Zhou et al.25 expanded query terms utilizing several vocabulary sources, including MeSH. Sondhi et al.26 used both the MeSH medical thesaurus and manual physician feedback in query expansion, comparing different combinations of methodology.
2. Methods
This section gives an overview of the proposed data preference gathering method based on summary template generation and expansion. Broadly, the major steps are as follows: 1) Generation of the seed summarization template, 2) Semantic expansion of the summarization template, and 3) Patient summary generation based on the expanded template.
2.1. Generation of seed summarization template
The seed summarization template is a minimal version of one or more clinical experts’ data preferences seeded by manual curation. To arrive at this initial template, one must first identify the clinical categories of interest and the terms within those categories that are most relevant to the clinical experts. The following are the major components to the generation of the seed summarization template:
Identification of Clinical categories: The clinical categories of interest may vary between specialties. In each iteration of the data preference gathering technique described herein, the clinical experts provide feedback to help select the appropriate categories. In any examination of electronic health records, the clinical variables of interest can be described in terms of a set of known clinical categories denoted as {𝑪𝟏 , 𝑪𝟐 , … 𝑪𝒏}. The clinical categories used in this work are shown in Table 1. These categories were first seeded by those commonly used in the history and physical examination and were then modified based on input from radiologists who participated in the seed summary template generation.
Term identification: During the term identification stage, 10 practicing radiologists worked collaboratively to identify data items in each category that they felt were important to see as part of a clinical summary, focusing specifically on what data was of interest in their discipline. These terms were provided as plain text explanations with varying levels of specificity. For instance, categories such as Physical Exam Findings could contain both terms specific to a particular finding such as murmurs, or a much more general statement of interest such as enlarged solid organs.
Term to clinical concept (CUI) mapping: Terms identified by clinicians are free text English terms and in order to understand their semantics we need to map these terms to their respective clinical concepts in a biomedical knowledge base. Using a medical concept extractor7, we mapped the clinician identified terms to UMLS concepts.
Concept mapping review: With automatic term to CUI mapping, each term can be mapped to multiple concepts or may fail to be mapped at all. Therefore, to confirm that the terms are mapped to the correct concepts, the mapping is reviewed by the clinicians who first generated the seed terms. A summarization template 𝑻 is then defined as a collection of seed concepts, each linked to its clinical concept(s) in the knowledge base and to its corresponding clinical category 𝑪𝒊.
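As an illustration of the template definition above, the following is a minimal sketch of one possible in-memory representation, not the data model used in this work. The names `SeedConcept` and `build_template` are hypothetical, and `CUI_PLACEHOLDER` stands in for a reviewed identifier; only C0020538 is taken from the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedConcept:
    """One reviewed seed entry (hypothetical structure)."""
    term: str      # plain-text term supplied by the clinician
    cui: str       # UMLS CUI confirmed during the mapping review
    category: str  # clinical category C_i from Table 1


def build_template(seeds):
    """Group reviewed seed concepts by their clinical category."""
    template = defaultdict(list)
    for seed in seeds:
        template[seed.category].append(seed)
    return dict(template)


seeds = [
    SeedConcept("hypertension", "C0020538", "Problem list"),
    # Placeholder identifier, not a real CUI.
    SeedConcept("pacemaker", "CUI_PLACEHOLDER", "Implanted Devices"),
]
template = build_template(seeds)
```

Keying the template by category keeps the later categorization and review steps simple, since each category can be processed and validated on its own.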
Table 1.
Clinical categories selected for the summarization template
| No | Clinical Category | Description |
|---|---|---|
| 1 | Allergy | Allergy to medication, contrast agents, food or other allergens |
| 2 | Family Member | List of “close”, usually first-degree, family members whose history of a heritable medical illness may indicate an increased risk of the same illness for the patient |
| 3 | Family history | Heritable medical illnesses |
| 4 | Imaging | Radiology imaging related concepts |
| 5 | Implanted Devices | Medical devices implanted in the patient’s body, such as a pacemaker |
| 6 | Labs | All non-imaging investigations such as blood and urine tests |
| 7 | Medications | The medications that the patient has been prescribed |
| 8 | Patient Management | Treatments and procedures (such as surgery) |
| 9 | Problem list | Patient’s medical illnesses, signs and symptoms |
| 10 | Social History | Patient’s social factors that may influence health outcome, such as smoking status, occupation and other at-risk behaviors |
| 11 | Vitals | Key physical exam parameters, such as heart rate, temperature and respiratory rate |
2.2. Semantic expansion of the summarization template
A multi-fold, automated semantic term expansion methodology is employed in order to improve the coverage of the seed summarization template. The following are the major components of the semantic template expansion approach: 1) Biomedical knowledge graph, 2) Clinical variant generation, 3) Hierarchical expansion, 4) Concept filtration, 5) Concept categorization, and 6) Review and validation of expanded template concepts.
2.2.1. Biomedical knowledge graph
The core of our biomedical knowledge graph is the UMLS6. The National Library of Medicine (NLM) produces the UMLS to facilitate computer understanding of biomedical text. The UMLS is a repository of more than 100 biomedical vocabularies. The UMLS consists of the following three subcomponents: the Metathesaurus, the Semantic Network, and the SPECIALIST lexicon. The Metathesaurus forms the base of the UMLS and comprises over 4 million biomedical concepts and 16 million concept names, all of which stem from the over 130 incorporated controlled vocabularies and classification systems. The Semantic Network consists of semantic types and semantic relationships. Semantic types are broad subject categories like “Disease or Syndrome” and “Clinical Drug”. Semantic relationships are the relationships that exist between semantic types. Each concept in the Metathesaurus is assigned one or more semantic types (categories), which are linked with one another through semantic relationships. The SPECIALIST lexicon is an English-language lexicon that contains biomedical terms. The lexicon entry for each word or term records the syntactic, morphological, and orthographic information of the respective lemma. It also contains spelling variants, acronyms, and abbreviations.
The biomedical knowledge base used in this work contains a subset of the UMLS, generated as follows: all vocabularies with license type 0 as defined by UMLS are selected along with SNOMED CT. All vocabularies in this set that are not in English are then removed. Our biomedical knowledge base also includes Radlex27, a radiology-specific vocabulary not typically contained within UMLS. To link terms from this vocabulary to the UMLS-based concepts in the rest of the knowledge base, we utilize the linking provided by the National Cancer Institute Metathesaurus (NCIm)28.
2.2.2. Clinical variant generation
In the initial summarization template generation step, all the terms identified by clinicians are mapped to clinical concepts in the knowledge base. In the variant generation process, clinical variants of each seed concept are generated by identifying synonymous concepts. In the UMLS, each concept is associated with a single unique identification string referred to as its Concept Unique Identifier (CUI). This CUI may be linked to multiple concepts from the source vocabularies that feed into UMLS, which may be described by different term variants. For instance, UMLS CUI C0020538 “Hypertensive disease” is linked to multiple term variants such as “high BP”, “high blood pressure”, and “hypertension”. The clinical variant generation collects all of these synonymous concepts.
In the second phase of clinical variant generation, related concepts are identified. The candidate related concepts are assembled by a keyword-based search performed on the knowledge base using the lexical variants of the seed concepts and their clinical concept variants as inputs. The keyword-based search on the clinical knowledge base returns numerous concepts. In order to maintain the relevancy of the concepts, from the search results we select all the concepts that encompass either the seed terms or their clinical variants (e.g., for “diabetes”, all the concepts that contain “diabetes”) and select only the top 20 concepts from the partial match results. While this step improves the recall of the concept expansion, it also identifies many irrelevant concepts. For example, for “diabetes”, this step identifies concepts like “diabetes insipidus” and “vasopressin resistant diabetes insipidus”, which are not clinically relevant to “diabetes”. Therefore, the candidate related concepts are filtered based on semantic and textual features as described in the “Concept filtration” section.
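The candidate selection rule described above can be sketched as follows. This is a hedged illustration under stated assumptions: the search results are assumed to arrive as an already ranked list of concept names, "encompass" is interpreted as substring containment, and `select_candidates` is a hypothetical helper rather than part of any UMLS tooling.

```python
def select_candidates(ranked_results, variants, max_partial=20):
    """Keep every result that contains a seed term or variant, plus the
    top `max_partial` remaining (partial match) results in rank order."""
    lowered_variants = [v.lower() for v in variants]
    full, partial = [], []
    for name in ranked_results:
        lowered = name.lower()
        if any(v in lowered for v in lowered_variants):
            full.append(name)     # contains the seed term or a variant
        else:
            partial.append(name)  # only a partial match
    return full + partial[:max_partial]


# Toy ranked search output (invented for illustration).
results = ["diabetes mellitus", "insulin pump", "diabetes insipidus"]
results += [f"other result {i}" for i in range(30)]
candidates = select_candidates(results, ["diabetes"], max_partial=2)
```

Note that "diabetes insipidus" survives this stage, which is why the later filtration step is needed.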
2.2.3. Hierarchical expansion
Hierarchical expansion involves identification of parent, sibling and children concepts for the clinical variants. Concepts in the UMLS are connected to each other using semantic relationships in the semantic network. For each concept, we can retrieve the concept’s parent and children concepts by traversing the concept hierarchy of a source vocabulary (such as SNOMED CT). Using the seed’s parent concept, we further identify the siblings of the seed term. The concept hierarchy traversal is done for two hops and parent/children concepts are retained only if their semantic type matches with the starting concept. This helps to keep the scope narrow and to get relevant concepts. For all the concepts retained in the concept hierarchy traversal, we further identify their clinical variants as described previously. Figure 1 illustrates an extract of the relevancy graph generated for the seed term “diabetes” that maps to the concept ‘diabetes mellitus’ in the knowledge graph. As shown, the sphere of influence for a semantic expansion includes synonyms and many similar terms that may have been intended to be captured by the original clinician-provided phrase.
Figure 1.
Illustration of semantic expansion of a seed term (“diabetes”).
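A minimal sketch of the two-hop hierarchy traversal described above is given below. It assumes the hierarchy is available as plain parent and child adjacency dictionaries and that a semantic type "match" means sharing at least one type with the seed; both are simplifying assumptions, and the toy hierarchy is invented for illustration rather than taken from SNOMED CT.

```python
def hierarchical_expand(seed, parents, children, sem_types, max_hops=2):
    """Collect parents, children and siblings within `max_hops` of the seed,
    keeping only concepts that share a semantic type with the seed."""
    seed_types = sem_types[seed]
    seen = {seed}
    frontier = [seed]
    for _ in range(max_hops):
        next_frontier = []
        for concept in frontier:
            neighbours = parents.get(concept, []) + children.get(concept, [])
            for neighbour in neighbours:
                if neighbour in seen:
                    continue
                # Retain only concepts whose semantic type matches the seed.
                if sem_types.get(neighbour, set()) & seed_types:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    seen.discard(seed)
    return seen


# Invented toy hierarchy and semantic types.
parents = {
    "diabetes mellitus": ["glucose metabolism disorder"],
    "type 2 diabetes": ["diabetes mellitus"],
    "hyperglycemia": ["glucose metabolism disorder"],
}
children = {
    "glucose metabolism disorder": ["diabetes mellitus", "hyperglycemia"],
    "diabetes mellitus": ["type 2 diabetes", "diabetic clinic"],
}
disease = {"Disease or Syndrome"}
sem_types = {
    "diabetes mellitus": disease,
    "glucose metabolism disorder": disease,
    "type 2 diabetes": disease,
    "hyperglycemia": disease,
    "diabetic clinic": {"Health Care Related Organization"},
}
expanded = hierarchical_expand("diabetes mellitus", parents, children, sem_types)
```

In this sketch, siblings are reached as children of the seed's parent on the second hop, so a parent that fails the semantic type check also hides its other children.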
2.2.4. Concept filtration
During the clinical variant generation and hierarchical expansion steps multiple synonymous and related concepts are identified as relevant for the seed concepts. In this step, duplicate and semantically irrelevant concepts are filtered out to reduce noise. First, all concepts are converted to singular form (e.g., medications to medication). Duplicate concepts are filtered out based on the syntactic and morphological information of the concepts. If multiple concepts contain the same set of words in a different order (e.g., “high blood pressure”, “blood pressure, high”), all but one of these concepts is filtered out.
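The duplicate removal described above can be sketched as follows. The singularization here is a deliberately crude suffix rule standing in for whatever morphological normalization is actually used, and the helper names are hypothetical. The normalized form is used only as a comparison key; the first surface form seen is the one kept.

```python
import re


def _singular(word):
    """Crude singularization: strip a trailing 's' (assumption, not a lemmatizer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def dedup_key(name):
    """Order-insensitive key: lowercase, drop punctuation, singularize, sort."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return tuple(sorted(_singular(t) for t in tokens))


def drop_duplicates(names):
    """Keep the first concept name seen for each normalized key."""
    seen, kept = set(), []
    for name in names:
        key = dedup_key(name)
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept


unique = drop_duplicates(
    ["high blood pressure", "blood pressure, high", "medications", "medication"]
)
```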
Semantically irrelevant concepts are identified by examining the semantic information associated with each concept in the knowledge base as well as some empirically discovered textual features. The semantic features employed include the semantic type of the concept, the distance in hops from the concept to the original seed concept, and the UMLS relationship types between the concepts. We have selected a list of semantic types from UMLS that are relevant to EHR summarization such as “Clinical Drug”, “Laboratory or Test Result”, “Diagnostic Procedure”, “Disease or Syndrome”, and “Medical Device”. We retain only those concepts that have semantic types from our selected list of semantic types. Second, we compute the distance from the expanded concept to the seed clinical concept in the knowledge graph and we retain only those concepts that are less than or equal to 3 hops. This step helps us to filter out “diabetes insipidus” as a related concept for “diabetes” since the distance between these concepts in the knowledge-graph is not less than or equal to 3 hops. Finally, we perform filtration based on selected UMLS relationship types.
For textual features, several stop words indicative of semantic drift were identified by experimentation. Examples of identified stop words include “institute”, “doctor”, “nurse”, “admission”, “hospital”, “education”, and “facility”. The semantic and textual features were selected and tuned based on iterative empirical analysis of relevant and non-relevant concepts.
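The semantic type, hop distance and stop word filters described above could be combined roughly as in the sketch below. The semantic types and stop words are the examples named in the text; the graph, its placeholder nodes `n1` to `n3`, and the function names are assumptions made for illustration, and the relationship-type filter is omitted because its rules are not specified here.

```python
from collections import deque

ALLOWED_TYPES = {
    "Clinical Drug", "Laboratory or Test Result", "Diagnostic Procedure",
    "Disease or Syndrome", "Medical Device",
}
STOP_WORDS = {
    "institute", "doctor", "nurse", "admission",
    "hospital", "education", "facility",
}


def hop_distance(graph, start, goal, limit):
    """Breadth-first search; returns the hop count, or None if above `limit`."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == limit:
            continue  # do not explore beyond the hop limit
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def keep_concept(name, types, graph, seed, max_hops=3):
    """Apply the semantic type, stop word and hop distance filters in turn."""
    if not set(types) & ALLOWED_TYPES:
        return False
    if set(name.lower().split()) & STOP_WORDS:
        return False
    return hop_distance(graph, seed, name, max_hops) is not None


# Toy undirected graph: "diabetes insipidus" sits four hops from the seed.
graph = {
    "diabetes mellitus": ["n1", "type 2 diabetes", "diabetes education"],
    "n1": ["diabetes mellitus", "n2"],
    "n2": ["n1", "n3"],
    "n3": ["n2", "diabetes insipidus"],
    "diabetes insipidus": ["n3"],
    "type 2 diabetes": ["diabetes mellitus"],
    "diabetes education": ["diabetes mellitus"],
}
seed = "diabetes mellitus"
disease = ["Disease or Syndrome"]
```

With this toy graph, `keep_concept("type 2 diabetes", disease, graph, seed)` passes all three filters, while "diabetes insipidus" is rejected on distance and "diabetes education" on the stop word list.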
2.2.5. Concept categorization
In the concept categorization step, the concepts are categorized into the 11 selected clinical categories. This work is inspired by a previous approach that uses UMLS semantic types for concept categorization29. As mentioned earlier, we have curated a list of semantic types that are relevant for EHR summarization. We mapped these selected semantic types to one or more clinical categories from Table 1. For most of the clinical categories, we found a direct mapping to one or more of the selected semantic types, e.g., for the “Medication” clinical category, we have mapped the “Pharmacologic Substance” and “Clinical Drug” semantic types. For some of the categories, we have used a combination of semantic types and concepts for the categorization, e.g., for the “Social history” category, we have used semantic types like “Occupational Activity” and concepts/keywords like “smoking”, “substance abuse”, and “illicit drug”.
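The combination of semantic type mapping and keyword matching described above can be sketched as a pair of lookup tables. Only the mappings explicitly named in the text are included; the full mapping is not given here, and `categorize` is a hypothetical helper name.

```python
# Semantic type -> clinical categories (only the examples named in the text).
TYPE_TO_CATEGORIES = {
    "Pharmacologic Substance": {"Medications"},
    "Clinical Drug": {"Medications"},
    "Occupational Activity": {"Social History"},
}

# Keyword -> clinical category, for categories not covered by types alone.
KEYWORD_TO_CATEGORY = {
    "smoking": "Social History",
    "substance abuse": "Social History",
    "illicit drug": "Social History",
}


def categorize(name, semantic_types):
    """Return every clinical category suggested by types or keywords."""
    categories = set()
    for sem_type in semantic_types:
        categories |= TYPE_TO_CATEGORIES.get(sem_type, set())
    lowered = name.lower()
    for keyword, category in KEYWORD_TO_CATEGORY.items():
        if keyword in lowered:
            categories.add(category)
    return categories
```

A concept that receives an empty set under such a scheme would need either an additional mapping or manual assignment during clinician review.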
2.2.6. Review and validation of expanded template concepts
Once potentially relevant concepts have been identified via the above expansion process, they are given to clinician annotators for review. Two clinicians were involved in the review process. To reduce the probability of human error and subjectivity, the two clinicians worked together to set up an annotation scheme and annotated the first 250 concepts collaboratively. First, the clinicians labeled each term as relevant or not relevant for the template. Second, for each relevant concept, the clinicians checked the assigned clinical category of the concept against the category description and labeled it as correct or incorrect. The clinicians then worked independently to annotate the rest of the expanded concepts from the dataset. Each concept was annotated by both clinicians. After all the annotations were completed, the annotators discussed and resolved conflicts on the concepts that had mismatching labels. This step is further elaborated on in the Results section.
2.3. Use-case - Patient summarization report generation
Once the expanded template is complete, it can be used in conjunction with EHR extraction tools to generate a patient summarization report. If a piece of information in the patient record is linked to a UMLS CUI, the expanded summarization template can be used as a guide to understand its relevancy. All the assembled concepts from the patient’s electronic health record can be matched against the concepts from the expanded template, with only matched concepts retained in the summary. Alternatively, the extracted information could be ranked based on template-guided factors, such as whether the patient information matches concepts specified by the clinician in the initial template or the expanded concepts, or the depth in the knowledge graph at which the concept matches. Since much of the clinical information is buried in free text, such as in surgical notes or discharge summaries, having a list of important concepts would be very valuable. Furthermore, since the concepts are organized by clinical category, one approach would be to extract relevant sections of EHR documents and use concepts from related clinical categories for summarization.
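The matching and ranking idea outlined above might look like the following sketch. It assumes the expanded template is stored as a mapping from CUI to a (category, depth) pair, where depth 0 marks a clinician-provided seed; this layout, the `summarize` helper, and the placeholder identifiers are all hypothetical.

```python
def summarize(patient_cuis, template):
    """Keep patient concepts found in the template, seeds first, then by depth.

    `template` maps CUI -> (clinical category, depth from the seed concept).
    """
    matched = [cui for cui in patient_cuis if cui in template]
    # Sort by depth so seed matches (depth 0) rank above expanded ones.
    matched.sort(key=lambda cui: (template[cui][1], cui))
    return [(cui, template[cui][0]) for cui in matched]


# Placeholder identifiers, not real CUIs.
template = {
    "CUI_SEED": ("Problem list", 0),
    "CUI_EXP1": ("Medications", 1),
    "CUI_EXP2": ("Labs", 2),
}
patient_cuis = ["CUI_EXP2", "CUI_NOT_IN_TEMPLATE", "CUI_SEED"]
summary = summarize(patient_cuis, template)
```

Returning the category alongside each CUI allows the summary to be grouped by clinical category when it is rendered.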
3. Results
The seed template generated using the methodology described in section 2.1 had 1,094 concepts. The distribution of these seed concepts across the selected clinical categories is listed in the second column of Table 2. After all the steps in the template expansion process were completed, the 1,094 concepts in the seed template were expanded to a total of 22,325 concepts. Table 2 provides a summary of the results of our template expansion work. The “number of relevant concepts” is computed from the clinician annotations described in the review and validation step of section 2.2, while the “percentage of relevant concepts” is computed as (number of relevant concepts*100/number of concepts after expansion). The “number of relevant and correctly categorized concepts” indicates the number of concepts that the clinicians found to be 1) relevant for the summarization template and 2) categorized into the correct clinical categories. The “percentage of relevant and correctly categorized concepts” is computed as (number of relevant and correctly categorized concepts*100/number of concepts after expansion). Finally, the “average number of relevant and correctly categorized concepts per seed” is computed as (number of relevant and correctly categorized concepts/number of seed concepts).
Table 2.
Summary result for template concept expansion by clinical categories
| Clinical categories | Number of seed concepts | Number of concepts after expansion | Number of relevant concepts | Percentage of relevant concepts | Number of relevant and correctly categorized concepts | Percentage of relevant and correctly categorized concepts | Average number of relevant and correctly categorized concepts per seed |
|---|---|---|---|---|---|---|---|
| Allergy | 16 | 129 | 125 | 96.90% | 113 | 87.60% | 7.06 |
| Family History | 66 | 385 | 313 | 81.30% | 313 | 81.30% | 4.74 |
| Family Member | 21 | 107 | 80 | 74.77% | 80 | 74.77% | 3.81 |
| Imaging | 119 | 1196 | 1038 | 86.79% | 948 | 79.26% | 7.97 |
| Implanted Devices | 86 | 1452 | 1376 | 94.77% | 1308 | 90.08% | 15.21 |
| Labs | 83 | 1396 | 1278 | 91.55% | 1190 | 85.24% | 14.34 |
| Medication | 66 | 1058 | 1028 | 97.16% | 936 | 88.47% | 14.18 |
| Patient Management | 270 | 5160 | 4768 | 92.40% | 4203 | 81.45% | 15.57 |
| Problem list | 332 | 10468 | 9746 | 93.10% | 8939 | 85.39% | 26.92 |
| Social History | 21 | 516 | 497 | 96.32% | 442 | 85.66% | 21.05 |
| Vitals | 14 | 458 | 445 | 97.16% | 397 | 86.68% | 28.36 |
| Total | 1094 | 22325 | 20694 | 92.69% | 18869 | 84.52% | 17.25 |
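The derived columns of Table 2 follow directly from the three formulas given above. The short sketch below recomputes them for the “Allergy” and “Total” rows using the counts from the table; the function name `expansion_metrics` is a hypothetical choice.

```python
def expansion_metrics(seeds, expanded, relevant, relevant_correct):
    """Recompute the derived Table 2 columns from the raw counts."""
    return {
        "pct_relevant": relevant * 100 / expanded,
        "pct_relevant_correct": relevant_correct * 100 / expanded,
        "avg_per_seed": relevant_correct / seeds,
    }


# Counts copied from the "Allergy" and "Total" rows of Table 2.
allergy = expansion_metrics(16, 129, 125, 113)
total = expansion_metrics(1094, 22325, 20694, 18869)

for label, row in (("Allergy", allergy), ("Total", total)):
    print(label, {name: f"{value:.2f}" for name, value in row.items()})
```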
As shown in Table 2, the number of concepts, their expansion, relevancy percentage, and categorization performance differ across the clinical categories. The “Problem list” and “Patient Management” categories had the largest numbers of seed concepts, while the “Allergy”, “Family Member”, “Social History”, and “Vitals” categories had the smallest. The “percentage of relevant concepts” is over 90% for most of the categories, while “Family History” and “Family Member” had the lowest “percentage of relevant concepts”. The “percentage of relevant and correctly categorized concepts” depends on the performance of each of the previous modules (clinical variant generation, hierarchical expansion and concept filtration) as well as on our categorization approach. We found that 84.52% of the expanded concepts are relevant for the summarization template and categorized into the correct clinical categories. The “average number of relevant and correctly categorized concepts per seed” shows the average concept expansion rate for relevant and correctly categorized concepts by clinical category. The “Problem list” and “Vitals” categories had the highest average concept expansion rates (more than 26 concepts per seed concept), while the “Family History” and “Family Member” categories had the lowest (fewer than 5 concepts per seed concept). During the validation step, clinicians manually corrected the categories of misclassified concepts and removed incorrect concepts, resulting in a final template of 20,694 validated concepts.
Although the validation of the expanded concepts gives some indication of the overall precision of the method, it is unable to assess its recall. A full understanding of recall would need ground truth information about all concepts that are relevant to each seed concept, which would be taxing and likely unreliable to produce via manual clinician input. However, in order to get a rough understanding of recall, we performed a manual analysis on a small scale. We randomly selected twenty seed terms from the original summarization template, and two clinicians manually expanded these selected terms. Since the manual expansions produced by the two clinicians did not match (they varied by multiple terms), we could not calculate recall statistics. However, we found that our automated approach covered, on average, 94% of the terms mentioned by the clinicians, as well as many relevant terms that the clinicians missed. The 6% of terms that were missed in the automated expansion were related to the seed terms but were separated from them by multiple hops in the knowledge graph.
4. Discussion and Conclusion
Patient Electronic Health Records (EHRs) usually contain very detailed information. Although the information is valuable, an abundance of patient information can lead to an information overload condition for clinicians. Thus, it would be beneficial to have a mechanism for summarizing the most clinically relevant patient information pertinent to the needs of the clinicians. For example, let us consider a real-world use-case with radiologists. Radiologists today rely mostly on their visual interpretation of imaging studies for their diagnostic decisions. There is considerable information about the patient in the electronic health record that could positively impact their decisions. However, this data is often physically distributed across many enterprise hospital systems including electronic medical record (EMR) systems, radiology informatics systems (RIS), picture archiving and communication systems (PACS), laboratory systems, pharmacy systems, etc. Due to the volume of imagery to examine, radiologists have little time themselves to assimilate all pertinent clinical information from these distributed records. As a result, overlooking of diagnosis and misinterpretation is a common problem contributing to diagnosis error rates30. This problem is further exacerbated by the differing needs among clinicians. Much of the information of note to a primary care physician would not impact a radiologist’s diagnosis, but certain domain-specific pieces of knowledge, such as an allergy to imaging contrast, could be critically important. If a radiologist’s visual examination could be augmented by providing a compact summary of the patient’s clinical history that focused on their specific data needs, it could lead to improvement in overall clinical decision making.
To this end, this work presents a hybrid approach that leverages biomedical knowledge and expert (clinician) knowledge to efficiently gather clinical data preferences for use in producing an EHR data summary. To do so, we introduce the notion of a “summarization template”, a catalogue of data preferences linked to a knowledge base. The summarization template is crafted semi-automatically with clinician assistance. An initial summarization template is manually generated by clinicians and is a high-level specification of the patient information that the clinician would like to see in a holistic summary of the patient’s EHR. In its initial state, the template is incomplete and specifies the expected information in free text without linking to a knowledge base. If used as such, it would have limited coverage within an actual patient record. To alleviate this, seed concepts from the initial template are automatically linked to the knowledge base, validated by clinicians, and further expanded with related concepts. The expanded template is then reviewed and validated by clinical experts before it is considered ready for use in EHR summarization.
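The expand-then-validate loop described above can be sketched as a hop-limited traversal of a concept graph followed by a relevance check. This is only an illustration under stated assumptions: the toy graph stands in for the knowledge base, the concept names are invented, `expand_concepts` and `build_template` are hypothetical helpers, and the `is_relevant` callback stands in for the clinicians' review. It is not the expansion procedure used in this work.

```python
from collections import deque


def expand_concepts(graph, seeds, max_hops=1):
    """Return {concept: hop distance} for concepts within max_hops of a seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(distance)
    while queue:
        concept = queue.popleft()
        if distance[concept] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(concept, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[concept] + 1
                queue.append(neighbour)
    return distance


def build_template(graph, seeds, is_relevant, max_hops=1):
    """Keep the seeds plus every expanded concept the reviewer accepts."""
    seed_set = set(seeds)
    expanded = expand_concepts(graph, seeds, max_hops)
    return sorted(c for c in expanded if c in seed_set or is_relevant(c))


# Hypothetical toy graph: edges point from a concept to related concepts.
graph = {
    "contrast allergy": ["iodinated contrast reaction", "gadolinium reaction"],
    "iodinated contrast reaction": ["anaphylaxis"],
    "gadolinium reaction": [],
}
seeds = ["contrast allergy"]

one_hop = expand_concepts(graph, seeds, max_hops=1)
two_hop = expand_concepts(graph, seeds, max_hops=2)

# Stand-in for clinician validation: reject one concept for illustration.
template = build_template(
    graph, seeds, is_relevant=lambda c: c != "gadolinium reaction", max_hops=1
)
print(one_hop)
print(two_hop)
print(template)
```

In this toy graph, "anaphylaxis" is reached only when `max_hops` is 2. A small hop limit therefore keeps the reviewers' workload low but can miss related concepts that lie several hops from a seed.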
Limitations and outlook: Evaluation of the summaries generated using the summarization template is outside the scope of this work and is not presented here. In Section 2.3, we discussed an approach for EHR summarization using the summarization template. In future work, we plan to report an evaluation of the summaries generated using the summarization template and to investigate whether specialty templates (e.g., cardiology, neurology) are more helpful for summarization than a generic summarization template.
This study provides a novel approach to EHR data preference selection that keeps clinicians at the center. It presents a knowledge-driven semantic expansion approach that leverages a rich biomedical knowledge base to expand upon manually identified seed terms. The approach expands 1,094 seed concepts to 20,694 clinically relevant concepts. Results of a validation study indicate that the expansion technique identifies relevant concepts with a relevancy rate of 92.69% and correctly categorizes these concepts in 84.52% of cases, while keeping the burden on the clinical validators fairly low. By iteratively relying on clinical expertise and automatic semantic expansion, the proposed method is able to generate and validate specialty-specific data preferences, which can then be used to tailor the output of EHR summarization systems to the needs of each specialist. The approach presented in this work can be used to curate custom vocabularies for different medical specialties, such as a cardiology vocabulary or a neurology vocabulary. Another possible application of this work is in improving EHR document annotation and understanding. The approach is transferable and replicable, and it can be applied to other problems in healthcare that may benefit from domain- or problem-specific custom vocabularies.
References
- 1.Beasley JW, Wetterneck TB, Temte J, Lapin JA, Smith P, Rivera-Rodriguez AJ, Karsh BT. Information chaos in primary care: implications for physician performance and patient safety. The Journal of the American Board of Family Medicine. 2011;24(6):745–751. doi: 10.3122/jabfm.2011.06.100255.
- 2.Laker LF, Froehle CM, Windeler JB, Lindsell CJ. Quality and efficiency of the clinical decision-making process: Information overload and emphasis framing. Production and Operations Management. 2018.
- 3.Payne TH. EHR-related alert fatigue: minimal progress to date, but much more can be done. BMJ quality & safety. 2019 Jan 1;28(1):1–2. doi: 10.1136/bmjqs-2017-007737.
- 4.Feblowitz JC, Wright A, Singh H, Samal L, Sittig DF. Summarization of clinical information: a conceptual model. Journal of biomedical informatics. 2011 Aug 1;44(4):688–99. doi: 10.1016/j.jbi.2011.03.008.
- 5.Pivovarov R, Elhadad N. Automated methods for the summarization of electronic health records. Journal of the American Medical Informatics Association. 2015 Apr 15;22(5):938–47. doi: 10.1093/jamia/ocv032.
- 6.Unified Medical Language System (UMLS) https://www.nlm.nih.gov/research/umls/index.html.
- 7.Guo Y, Kakrania D, Baldwin T, Syeda-Mahmood T. Efficient clinical concept extraction in electronic medical records. In Thirty-First AAAI Conference on Artificial Intelligence. 2017 Feb 12.
- 8.Moen H, Peltonen LM, Heimonen J, Airola A, Pahikkala T, Salakoski T, Salanterä S. Comparison of automatic summarisation methods for clinical free text notes. Artificial intelligence in medicine. 2016 Feb 1;67:25–37. doi: 10.1016/j.artmed.2016.01.003.
- 9.Feblowitz JC, Wright A, Singh H, Samal L, Sittig DF. Summarization of clinical information: a conceptual model. Journal of biomedical informatics. 2011 Aug 1;44(4):688–99. doi: 10.1016/j.jbi.2011.03.008.
- 10.Elhadad N, Kan MY, Klavans JL, McKeown KR. Customization in a unified framework for summarizing medical literature. Artificial intelligence in medicine. 2005 Feb 1;33(2):179–98. doi: 10.1016/j.artmed.2004.07.018.
- 11.Paice CD. Constructing literature abstracts by computer: techniques and prospects. Information Processing & Management. 1990 Jan 1;26(1):171–86.
- 12.Mishra R, Bian J, Fiszman M, Weir CR, Jonnalagadda S, Mostafa J, Del Fiol G. Text summarization in the biomedical domain: a systematic review of recent research. Journal of biomedical informatics. 2014 Dec 1;52:457–67. doi: 10.1016/j.jbi.2014.06.009.
- 13.Fellbaum C. WordNet. In Theory and applications of ontology: computer applications. Dordrecht: Springer; 2010; pp. 231–243.
- 14.Goldstein J, Mittal V, Carbonell J, Kantrowitz M. Multi-document summarization by sentence extraction. In Proceedings of the 2000 NAACL-ANLP Workshop on Automatic summarization. Association for Computational Linguistics. 2000 Apr 30; pp. 40–48.
- 15.Van Vleck TT, Stein DM, Stetson PD, Johnson SB. Assessing data relevance for automated generation of a clinical summary. In AMIA annual symposium proceedings. 2007;2007:761.
- 16.Carpineto C, Romano G. A survey of automatic query expansion in information retrieval. ACM Computing Surveys (CSUR). 2012 Jan 1;44(1):1.
- 17.Salton G, Buckley C. Improving retrieval performance by relevance feedback. Readings in information retrieval. 1997 Dec 1;24(5):355–63.
- 18.Joachims T, Granka L, Pan B, Hembrooke H, Gay G. Accurately interpreting clickthrough data as implicit feedback. In ACM SIGIR Forum. ACM. 2017 Aug 2;51(1):4–11.
- 19.White RW, Marchionini G. Examining the effectiveness of real-time query expansion. Information Processing & Management. 2007 May 1;43(3):685–704.
- 20.Wang H, Zhang Q, Yuan J. Semantically enhanced medical information retrieval system: a tensor factorization based approach. IEEE Access. 2017 Apr 26;5:7584–93.
- 21.Xiong C, Callan J. Query expansion with Freebase. In Proceedings of the 2015 International Conference on The Theory of Information Retrieval. ACM. 2015 Sep 27:111–120.
- 22.Martinez D, Otegi A, Soroa A, Agirre E. Improving search over Electronic Health Records using UMLS-based query expansion through random walks. Journal of biomedical informatics. 2014 Oct 1;51:100–6. doi: 10.1016/j.jbi.2014.04.013.
- 23.Aronson AR, Rindflesch TC. Query expansion using the UMLS Metathesaurus. In Proceedings of the AMIA Annual Fall Symposium. American Medical Informatics Association. 1997:485.
- 24.Díaz-Galiano MC, García-Cumbreras MA, Martín-Valdivia MT, Montejo-Ráez A, Urena-López LA. Integrating mesh ontology to improve medical information retrieval. In Workshop of the Cross-Language Evaluation Forum for European Languages. Berlin, Heidelberg: Springer; 2007 Sep 19; pp. 601–606.
- 25.Zhou W, Yu C, Smalheiser N, Torvik V, Hong J. Knowledge-intensive conceptual retrieval and passage extraction of biomedical literature. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM. 2007 Jul 23:655–662.
- 26.Sondhi P, Sun J, Zhai C, Sorrentino R, Kohn MS. Leveraging medical thesauri and physician feedback for improving medical literature retrieval for case queries. Journal of the American Medical Informatics Association. 2012 Mar 21;19(5):851–8. doi: 10.1136/amiajnl-2011-000293.
- 27.RadLex radiology lexicon. https://www.rsna.org/en/practice-tools/data-tools-and-standards/radlex-radiology-lexicon.
- 28.National Cancer Institute Metathesaurus (NCIm) https://ncim.nci.nih.gov/ncimbrowser/
- 29.Jadhav A, Sheth A, Pathak J. Analysis of online information searching for cardiovascular diseases on a consumer health information portal. In AMIA Annual Symposium Proceedings. American Medical Informatics Association. 2014;2014:739.
- 30.Reiner BI, Krupinski E. The insidious problem of fatigue in medical imaging practice. Journal of digital imaging. 2012 Feb 1;25(1):3–6. doi: 10.1007/s10278-011-9436-4.

