Abstract
Objective:
Word embeddings project semantically similar terms into nearby points in a vector space. When trained on clinical text, these embeddings can be leveraged to improve keyword search and text highlighting. In this paper, we present methods to refine the selection process of similar terms from multiple EMR-based word embeddings, and evaluate their performance quantitatively and qualitatively across multiple chart review tasks.
Materials and Methods:
Word embeddings were trained on each clinical note type in an EMR. These embeddings were then combined, weighted, and truncated to select a refined set of similar terms to be used in keyword search and text highlighting. To evaluate their quality, we measured the similar terms’ information retrieval (IR) performance using precision-at-K (P@5, P@10). Additionally, a user study evaluated users’ search term preferences, while a timing study measured the time to answer a question from a clinical chart.
Results:
The refined terms outperformed the baseline method’s information retrieval performance (e.g., increasing the average P@5 from 0.48 to 0.60). Additionally, the refined terms were preferred by most users and reduced the average time to answer a question.
Conclusions:
Clinical information can be more quickly retrieved and synthesized when using semantically similar terms from multiple embeddings.
Keywords: electronic medical records (EMR), search engines, query expansion, highlighting, clinical similar terms, semantic embeddings
BACKGROUND AND SIGNIFICANCE
Electronic Medical Records (EMRs) [1–3] contain detailed, unstructured text describing medical conditions. As the size and complexity of EMR systems grow [4,5], tools are needed to help clinicians and researchers efficiently find relevant information within clinical notes. Two popular methods for clinical information retrieval and consumption are search engines and text highlighting.
Search engines have been widely used to help users retrieve information from medical charts [6–11]. Some EMR search engines apply the query expansion method [12–14], which takes the original search term, expands it into multiple terms, and returns documents containing any of the expanded terms. Similarly, text-highlighting systems highlight text within a note that contains a search term or similar terms to quickly focus the reviewer on important text [6–11].
Underpinning both tools is the need to extract similar terms for a given keyword. Two popular ways to produce clinical similar terms are (i) ontologies, such as SNOMED CT [15] and UMLS [16], and (ii) EMR-based semantic embeddings. While clinical ontologies are hard to construct and update, EMR-based semantic embeddings are trained using unsupervised machine learning methods (e.g., GloVe [17], word2vec [18]) on EMR text and identify similar terms based on the EMRs’ semantics. For example, Pakhomov et al. [19] found that word embeddings capture semantic relatedness between medical terms. Moreover, Zhu et al. [20] and Hanauer et al. [21] showed that semantically-based query recommendation systems can effectively expand search queries.
However, little research has analyzed the quality or impact of combining multiple semantic embeddings for chart review tasks. In this paper, we present the EMR-subsets method to identify similar terms from multiple embeddings, and evaluate the quality and quantity of similar terms across various clinical chart review tasks.
To evaluate the identified similar terms across quantitative and qualitative dimensions, we conduct multiple experiments, including an information retrieval evaluation, a user preference study, and a timed chart review task. The results show that the identified similar terms achieved better IR performance than the baseline methods, were preferred by most users, and significantly reduced the time to answer a question. Moreover, the selection method is able to identify an optimal number of similar terms.
This work differs from previous work in two critical ways: (1) the EMR-subsets method extracts similar terms by combining multiple EMR-based word embeddings; and (2) it is evaluated across multiple dimensions, including information retrieval performance, user preference, and time to answer a question from a chart.
MATERIALS AND METHODS
Semantic Embeddings
A semantic embedding, such as word2vec, projects words into a vector space by training a neural network on text [22,23]. Word2vec embeddings can be trained with two different methods: the Continuous Bag-of-Words (CBOW) method and the skip-gram method (predicting a word from its surrounding context vs. predicting the surrounding context from a word, respectively). Researchers have already applied word2vec embeddings to support clinical chart reviews, such as with query expansion [13] and search [24]. In this paper, we use the CBOW method for training, which is the default training algorithm of a word2vec model in Gensim [25]. The positions of words in the learned embedded vector space are used to estimate their similarity. Specifically, we measure the similarity of two words word_i and word_j using the cosine similarity of their embedded vectors v_i and v_j. The range of similarity is from zero to one.
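For concreteness, the following minimal sketch (assuming Gensim 4.x; the toy corpus and parameter values are illustrative, not the study’s data) trains a CBOW word2vec model and queries cosine similarities:

```python
from gensim.models import Word2Vec

# Toy stand-in for tokenized clinical notes; in the paper, each note type
# (e.g., "Clinic Note") forms a separate training corpus.
sentences = [
    ["patient", "history", "of", "breast", "cancer"],
    ["melanoma", "treated", "with", "radiation"],
    ["family", "history", "grandfather", "cancer"],
]

# sg=0 selects CBOW, Gensim's default word2vec training algorithm.
# vector_size, window, and min_count are illustrative values.
model = Word2Vec(sentences, sg=0, vector_size=100, window=5, min_count=1)

# Cosine similarity of two words' embedded vectors.
print(model.wv.similarity("cancer", "melanoma"))

# Top-K most similar terms for a keyword.
print(model.wv.most_similar("cancer", topn=10))
```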
Table 1 lists the documents used to train our embeddings. The "Complete EMR" data set refers to all clinical notes from the Vanderbilt University Medical Center Synthetic Derivative [26], a de-identified mirror of the EMR, which contained approximately 100 million clinical notes at the time of this study.
Table 1.
Data sets used for training semantic embeddings. Vocabulary size is the number of distinct words in the data set appearing at least 50 times.
| Training Data Set | Note Count | Vocabulary Size |
|---|---|---|
| Complete EMR | 100m | 277k |
| Clinical Communication | 19.2m | 67.0k |
| HP | 8.0m | 24.1k |
| Outpatient Rx Order Summary | 5.0m | 16.2k |
| Prescription | 4.0m | 17.1k |
| Problem List | 3.1m | 6.4k |
| Provider Communications | 2.6m | 12.2k |
| Clinic Note | 2.4m | 33.2k |
| Respiratory Care | 2.2m | 3.4k |
| Clinic Summary | 2.2m | 14.7k |
| Clinic Summary 2 | 2.1m | 28.2k |
| Rehab | 1.7m | 31.5k |
| Nurse’s Note | 1.4m | 16.9k |
| Emergency Department Nurse's Triage | 1.0m | 19.3k |
| Letter | 1.0m | 26.2k |
The other data sets are the largest 14 subsets of the EMR, each containing at least 1 million notes. For each data set, we trained a word2vec model with the default parameters using the implementation provided by Gensim [25], a Python library for semantic analysis. We name each embedding with the name of its training data set, and we call any embedding trained with a subset of the EMR system an “EMR-subset embedding.”
In addition to the EMR-based embeddings, we downloaded the pre-trained word2vec model from Google News (which we refer to as the News embedding) [22], which contains 3 million word vectors in a 300-dimension vector space, as one of the baseline word embeddings. The News embedding has been used in prior work for query expansion [27] and identifying similar terms [18].
The preprocessing transformations applied before training include:
(1) Parsing XML and HTML data formats to plain text using Beautiful Soup [28].
(2) Excluding stop words, words shorter than two characters, and words with a frequency less than ten in the training data set.
(3) Tokenizing the words using the Gensim [25] word tokenizer and lowercasing all words.
It is important to note that all EMR-based embeddings in this work were trained on unigrams, while the Google News embedding includes bi-grams.
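A minimal sketch of this preprocessing pipeline (using Beautiful Soup and Gensim utilities; the corpus-level frequency filter is applied separately, e.g., via word2vec’s min_count, so it is omitted here):

```python
from bs4 import BeautifulSoup
from gensim.utils import tokenize
from gensim.parsing.preprocessing import STOPWORDS

def preprocess(raw_note):
    """Approximate the paper's steps: (1) strip XML/HTML markup,
    (3) tokenize and lowercase, (2) drop stop words and short tokens."""
    text = BeautifulSoup(raw_note, "html.parser").get_text()
    tokens = (t.lower() for t in tokenize(text))
    return [t for t in tokens if len(t) >= 2 and t not in STOPWORDS]

print(preprocess("<note>The patient denies chest pain.</note>"))
# -> ['patient', 'denies', 'chest', 'pain']
```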
Similar Terms Extraction
In this section, we describe our EMR-subsets method to extract and merge similar terms from multiple EMR-subset embeddings. The approach is motivated by the observation that embeddings created from the entire EMR can be distorted by frequently occurring text. Instead, terms should be similar to the keyword throughout subsets of the EMR. For example, as shown in Table-A in the appendix, the “Rehab” EMR-subset embedding identifies “ca” as a top-10 similar term for “cancer.” Similarly, another EMR-subset embedding identifies “grandfather” as a word similar to “cancer,” likely because physicians document a family history of cancer (we queried the complete EMR and found that 27% of the documents containing “cancer” also contain “grandfather,” and that the two terms were often within five words of each other). However, “cancer” is not similar to “ca” or “grandfather” in other subsets, indicating these similar terms may be biased by the text of a single subset and therefore may not be ideal for searching or highlighting clinical documents.
The EMR-subsets method identifies similar terms of a given keyword w that have consistent similarity values across EMR subsets. As shown in Figure 1, three metrics are calculated to determine a similarity score for the EMR-subsets method. The intra-subset similarity is a term’s similarity to the keyword w using a specific subset’s embedding. The inter-subsets similarity is a term’s average similarity to the keyword w in all other subsets’ embeddings. The harmonic similarity is the harmonic mean of the intra-subset and inter-subsets similarities, which is maximized when the two similarities are equal and is zero if a term exists in only a single subset.
Figure 1.
Similar terms of “cancer” from the “Clinic Note” EMR-subset embedding, broken down by intra-subset similarity, inter-subsets similarity, and harmonic similarity. The harmonic similarity is used for ranking terms.
Extracting similar terms from the EMR-subset embeddings requires multiple steps.
(1) Candidate Term Generation and Intra-Subset Similarity: For a given keyword w and an EMR-subset embedding (e.g., the “Clinic Note” embedding), we generate the top-K similar terms of the keyword w. The similarities of these terms define the intra-subset similarities. The first column in Figure 1 lists the similar terms for “cancer” from the “Clinic Note” EMR-subset embedding, including family history terms (e.g., grandfather), misspellings (e.g., caner), and organs (e.g., colon).
(2) Inter-Subsets Similarity: For each candidate term t in each subset, we compute its average similarity to the keyword w (i.e., the inter-subsets similarity) based on the other EMR-subset embeddings (i.e., excluding the “Clinic Note” EMR-subset embedding). A candidate term that does not exist in some embeddings has a similarity of zero there, which lowers its inter-subsets similarity. If a candidate term exists only in the current EMR-subset embedding, we set its inter-subsets similarity to a minimum value (e.g., 0.001). The second column in Figure 1 lists those terms’ similarities to “cancer” across the other subsets; we observe that “grandfathers” has a lower similarity in other subsets, while “melanoma” is more similar.
(3) Harmonic Similarity: For each candidate term t in each subset, we compute the harmonic mean of its intra-subset similarity and inter-subsets similarity. As shown in Figure 1, the inter-subsets similarity between “cancer” and “grandfathers” is 0.21, which is much lower than the intra-subset similarity. In other words, “grandfathers” is similar to “cancer” only in the “Clinic Note” embedding, so it is unlikely to be included in the similar term list.
(4) Term Cutoff: For each subset, we apply the similarity-based cutoff method (described in detail below) to remove candidate terms with low harmonic similarities. As shown in red in Figure 1, we remove some of the family terms, such as “grandfathers” and “great-grandfather,” using a similarity cutoff of 0.33.
(5) Merge Similar Terms: Repeat steps (1)–(4) for each subset embedding and merge the similar term lists extracted from the EMR-subset word embeddings into one list. Table-A in the appendix shows the top 10 similar terms of “cancer” generated by the EMR-subsets similarity algorithm.
Formally, we present the process of extracting similar terms from a list of EMR-subset embeddings M = {M1, M2, …, Mm} for a keyword w (i.e., there are m embeddings in the list, one for each note type). Given an EMR-subset embedding Mj, we define the intra-subset similarity of two words as Sj(w1, w2), and the inter-subsets similarity of two words as Ij(w1, w2).
For each EMR-subset embedding Mj, we generate the top-K similar terms of the keyword w as the candidate terms. We then compute the inter-subsets similarity of each candidate term t as the average intra-subset similarity across the other subsets:

Ij(w, t) = (1 / (m − 1)) · Σ_{i ≠ j} Si(w, t)
We then compute the harmonic similarity of each candidate term t:

Hj(w, t) = 2 · Sj(w, t) · Ij(w, t) / (Sj(w, t) + Ij(w, t))
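A minimal sketch of steps (1)–(3), assuming `models` is a list of trained Gensim word2vec models, one per note type; the function name and the top_k default are illustrative:

```python
def emr_subsets_terms(keyword, models, j, top_k=100, min_sim=0.001):
    """Score the candidate similar terms of `keyword` for subset embedding
    models[j]. A sketch only; step (4), the similarity cutoff, is applied
    to the returned list afterwards."""
    # (1) candidate terms and their intra-subset similarities from subset j
    candidates = models[j].wv.most_similar(keyword, topn=top_k)
    scored = []
    for term, intra in candidates:
        # (2) inter-subsets similarity: average similarity to the keyword
        # over all other subsets, counting a missing term as similarity 0
        others = [m.wv.similarity(keyword, term) if term in m.wv else 0.0
                  for i, m in enumerate(models) if i != j]
        inter = sum(others) / len(others)
        if inter <= 0:            # term exists only in subset j
            inter = min_sim
        # (3) harmonic mean of intra- and inter-subsets similarities
        harmonic = 2 * intra * inter / (intra + inter)
        scored.append((term, harmonic))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```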
Next, we remove low-similarity terms from each EMR-subset embedding Mj, since the number of similar terms impacts the quality of search and highlighting. For example, Figure 2 shows that as the list of search terms is expanded from [epilepsy] to include additional terms, the relevance of the retrieved documents increases initially but then decreases as the list grows (here, relevance is the percentage of highly similar terms in the documents of the expanded search result).
Figure 2.
Example of expanded document quality analysis for “epilepsy.” The proportion of high similarity terms (i.e. terms that have similarities larger than 0.60 while 1.0 is the maximum value) decreases with similar term expansion.
Cutoff Method: The method to determine the similarity cutoff is outlined as follows. We represent the similar terms of a keyword as a two-dimensional curve L (Figure 3), with the similar terms along the x-axis (represented by their indexes), sorted by harmonic similarity in descending order, and their similarity values along the y-axis. We define the cutoff point as the “elbow” of the curve L, because the benefit of adding more terms after this point is lower than the average benefit of choosing all terms. Formally, a cutoff point has a smoothed derivative equal to the slope of the line ℓ joining the endpoints of L. Because irregularities in the curve L produce multiple points with a derivative matching the slope of ℓ, we use an approximate method to identify a unique cutoff point on the curve L:
(1) Draw a line ℓ between the endpoints of L.
(2) Calculate the minimum distance from each point on the curve L to the line ℓ.
(3) Choose the point with the maximum distance to the line ℓ as the cutoff point. For a smooth curve, the derivative of L at this point equals the slope of ℓ by the mean value theorem.
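A minimal NumPy-based sketch of this approximate cutoff computation (the similarity values below are illustrative):

```python
import numpy as np

def elbow_cutoff(similarities):
    """Return the index of the 'elbow' of a descending similarity curve:
    the point with maximum perpendicular distance to the chord joining
    the curve's endpoints."""
    y = np.asarray(similarities, dtype=float)
    x = np.arange(len(y), dtype=float)
    p1, p2 = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
    chord = (p2 - p1) / np.linalg.norm(p2 - p1)   # (1) line between endpoints
    vecs = np.stack([x, y], axis=1) - p1
    proj = np.outer(vecs @ chord, chord)          # projection onto the chord
    dists = np.linalg.norm(vecs - proj, axis=1)   # (2) distance to the line
    return int(np.argmax(dists))                  # (3) farthest point wins

sims = [0.62, 0.55, 0.50, 0.46, 0.36, 0.33, 0.32, 0.31, 0.30]
k = elbow_cutoff(sims)
print(k, sims[k])  # terms ranked below the cutoff are discarded
```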
Figure 3.
Example of similarity cutoff computation. Since all terms have similarities larger than 0.40, the y-axis starts from 0.3. Similarity cutoff is at the “elbow” of the similarity curve (arrow).
Finally, we merge the similar terms extracted from the EMR-subsets embeddings as the final similar term list for the keyword w.
The results of the elbow method depend on the number of candidate terms K: different K values change the similarity curve and can shift the cutoff value. In fact, the elbow method can also be used to choose the K value itself. For the search terms “diabetes” and “seizure,” we tested values of K from 1 to 1000, and the elbow method identified K = 100; larger values of K did not improve results. The optimal K value may vary for different search terms.
Baseline Methods
To compare against the EMR-subsets method’s similar terms, we use three baseline methods: (i) terms from the Complete EMR embedding, (ii) terms from the News embedding, and (iii) terms from the combined Complete EMR and News embeddings (EMR-News), which is defined as follows. For a keyword w, the EMR-News similar terms are terms that are similar to the keyword w in both the EMR and News embeddings. We extracted EMR-News similar terms in three steps:
(1) Extract the top-K similar terms of the keyword w from the EMR and News embeddings, and use the intersection of those similar terms as the candidate terms.
(2) For each candidate term t, compute its EMR-News harmonic similarity to the keyword w, which is the weighted harmonic mean of the similarity Sc provided by the EMR embedding and the similarity Sg provided by the News embedding:

Hcg(w, t) = (λc + λg) / (λc / Sc(w, t) + λg / Sg(w, t)),

where λc and λg are the weights assigned to the two embeddings.
(3) Sort all candidate terms by their EMR-News harmonic similarities and use the cutoff method to select a similar term list. A candidate term t has a high EMR-News harmonic similarity if and only if it has high similarity to the keyword w in both the EMR and News embeddings.
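Because the paper does not report the weights it used, the sketch below assumes equal weights; it shows only the weighted harmonic mean of step (2):

```python
def emr_news_similarity(s_emr, s_news, w_emr=1.0, w_news=1.0):
    """Weighted harmonic mean of a candidate term's similarities in the
    EMR (s_emr) and News (s_news) embeddings. Equal weights are an
    assumption; the paper does not specify the weights."""
    return (w_emr + w_news) / (w_emr / s_emr + w_news / s_news)

# High only when the term is similar to the keyword in BOTH embeddings.
print(emr_news_similarity(0.7, 0.6))   # ~0.65
print(emr_news_similarity(0.7, 0.05))  # ~0.09
```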
Evaluations
User Preference Study
We designed a user preference study to evaluate whether the extracted similar terms are preferred by users with different medical knowledge levels in various chart review scenarios. We compared the selections of similar terms provided by the EMR-subsets method and the three baseline methods. As shown in Table 2, we recruited 11 Vanderbilt University Medical Center medical doctors (MDs) at the level of residency training or above, and 20 Non-MD Amazon Mechanical Turk [29] workers in the United States. Only the MDs have verified clinical knowledge. We chose fourteen keywords (each of which was categorized as a general or clinical term), and asked users to choose the best list of similar terms for each keyword.
Table 2.
Framework of the user preference study.
| (a) User types | | |
|---|---|---|
| Name | Knowledge Level | Size |
| MD | Medical Doctor Level | 11 |
| Non-MD | No Verified Level | 20 |
| (b) Medical note review tasks | | |
| Type | Keywords | |
| General | Advil, Cancer, Fracture, Kidney, Ventilator, Walking | |
| Clinical | Cefuroxime, EEG, Epilepsy, Irrigate, Keppra, Pruritis, Rhinorrhea | |
| (c) Similar Terms | | |
| Source | Abbreviation | |
| EMR word embedding | EMR | |
| News word embedding | News | |
| EMR and News word embeddings | EMR-News | |
| EMR-subset word embeddings | EMR-subsets | |
Figure 4 shows the web page for the user preference study, which contains 14 questions asking participants to choose their preferred similar term list in a chart review task.
Figure 4.
Screenshot of the preference survey. An introduction is provided, followed by 14 questions that ask the participant to choose the best list to expand a keyword. List orders were randomized to hide source methods.
We applied multinomial logistic regression [30,31] to analyze users’ preferences for similar terms across the extraction methods. As shown below, each logistic model takes user type (0 = MD, 1 = Non-MD) and task type (0 = Clinical, 1 = General) as input, and outputs the log-odds of choosing one method over the reference method:

log(P(method) / P(reference)) = β0 + β1 · UserType + β2 · TaskType

The null hypotheses are: (1) the user type and task type have no effect on the selection of similar terms; and (2) there is no significant preference among the similar terms provided by the EMR-subsets method and the baseline methods.
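A sketch of this analysis using statsmodels’ MNLogit; the data frame below is a simulated stand-in for the survey responses (the marginal choice shares roughly follow Table 4, everything else is made up):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated stand-in for the survey data: one row per selection.
rng = np.random.default_rng(0)
n = 441
df = pd.DataFrame({
    "user_type": rng.integers(0, 2, n),   # 0 = MD, 1 = Non-MD
    "task_type": rng.integers(0, 2, n),   # 0 = Clinical, 1 = General
    "choice": rng.choice(["EMR", "News", "EMR-News", "EMR-subsets"],
                         size=n, p=[0.10, 0.29, 0.09, 0.52]),
})

# Multinomial logistic regression of choice on user type and task type;
# fitted coefficients are log-odds of each method vs. the reference category.
X = sm.add_constant(df[["user_type", "task_type"]])
y = pd.Categorical(df["choice"]).codes
fit = sm.MNLogit(y, X).fit(disp=False)
print(fit.summary())
```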
Information Retrieval Experiments
To evaluate the information retrieval (IR) performance of the EMR-subsets method, we selected nine search terms from Table 2 (shown in Table 3), including eight single-word search terms and one multi-word search term. For each search term, we randomly selected 60 documents from a patient cohort defined by a specific ICD-9 code (Table 3), in which some documents contain the search term (referred to as the exact-match subset) and others do not (referred to as the non-exact match subset). We then asked three medical researchers (referred to as users 1, 2, and 3) to label each note’s relevance to the search term (1 = relevant, 2 = partially relevant, or 3 = irrelevant).
Table 3.
Information Retrieval Performance Evaluation Data Sets
| Search Term | ICD-9 code | Type | Number of Exact-Match Documents | Number of Non-Exact Match Documents |
|---|---|---|---|---|
| Breast Cancer | 174.9 | General | 40 | 20 |
| Epilepsy | 345.9 | Clinical | 37 | 23 |
| Fracture | 829.0 | General | 38 | 22 |
| Headache | 784.0 | General | 30 | 30 |
| Kidney | 593.9 | General | 34 | 26 |
| Pruritus | 698.9 | Clinical | 26 | 34 |
| Respiration | 786.52 | Clinical | 36 | 24 |
| Rhinorrhea | 478.19 | Clinical | 31 | 29 |
| Walking | 719.7 | General | 36 | 24 |
Next, for each search term and extraction method, we evaluated the P@5 and P@10 scores, where precision-at-K (P@K) is defined as the proportion of relevant or partially relevant notes among the top-K ranked notes. Notes are ranked proportionally to the number and weight of similar terms in a note; formally, for a keyword w with similar term list T(w), a note d is scored as

score(d, w) = Σ_{t ∈ T(w)} freq(t, d) · sim(w, t),

where freq(t, d) is the frequency of term t in note d and sim(w, t) is the term’s similarity to the keyword w.
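A minimal sketch of the ranking and P@K computation under these definitions (the note scoring follows the sum-of-term-similarities ranking described in the Discussion; all data below are illustrative):

```python
def note_score(note_tokens, similar_terms):
    """Rank score for a note: sum of term similarities over every
    occurrence of a similar term in the note."""
    return sum(similar_terms.get(tok, 0.0) for tok in note_tokens)

def precision_at_k(ranked_labels, k):
    """P@K: fraction of the top-K ranked notes labeled relevant or
    partially relevant (encoded as True)."""
    return sum(ranked_labels[:k]) / k

# similar_terms maps each expansion term to its (harmonic) similarity;
# the exact search term itself carries weight 1.0.
similar_terms = {"epilepsy": 1.0, "seizures": 0.72, "eeg": 0.55}
notes = [["patient", "seizures", "eeg"], ["knee", "pain"], ["epilepsy", "eeg"]]
labels = [True, False, True]  # human relevance judgments, same order as notes

order = sorted(range(len(notes)), reverse=True,
               key=lambda i: note_score(notes[i], similar_terms))
print(precision_at_k([labels[i] for i in order], k=2))  # -> 1.0
```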
Elbow Method Evaluation
To evaluate the elbow method, we randomly identified 300 notes from patients in the EMR system with an ICD-9 code for “seizure” (780.39), and another 300 notes from patients with an ICD-9 code for “diabetes” (250.*). As a result, some notes are relevant to diabetes or seizure care and some are not. We then asked four medical researchers to label each note’s relevance to the disease (1 = relevant, 2 = partially relevant, or 3 = irrelevant), which produced four labeled document sets for the “diabetes” cohort and four for the “seizure” cohort.
Next, for each document set, we used “diabetes” and “seizure” as the initial queries for the respective document sets, expanded each search with the similar terms from the EMR-subsets method, and evaluated the impact of the cutoff method by comparing its IR performance to manually selected cutoff values.
Time Efficiency Experiment
Two medical researchers, who were not investigators of this study, analyzed a cohort of 100 patients (with an average of 75 notes per patient) to determine whether each patient had dialysis within 2 weeks of surgery. For each patient, the researchers answered YES or NO. For half of the patients, we provided exact keyword search and highlighting to support chart review, in which notes were ranked proportionally to the keyword’s frequency in a note. For the other half, similar terms were used to expand the search and highlighting. We recorded and compared the time needed to identify the answer under the two methods. Moreover, we compared the two researchers’ answers to measure label accuracy.
RESULTS
User Preference Study
We received responses from all 11 MDs and all 20 Non-MDs (a 100% response rate), for a total of 441 preferences (31 users × 14 questions = 434 selections, plus 7 additional selections from users who chose more than one list). As shown in Table 4, the EMR-subsets method received 52% of the selections, more than any other similar term extraction method. Moreover, the selection of the EMR-subsets method varies with user type and task type.
Table 4.
Panel (a) records the overall preferences for similar terms extracted from different sources. Panel (b) records the count and percentage of similar term selections by user type. Panel (c) records the count and percentage of selections by task type.
| (a) Preference of Similar Terms | |||||
|---|---|---|---|---|---|
| Source | EMR | News | EMR-News | EMR-subsets | Total |
| Total Selections | 44 (9.9%) | 129 (29.0%) | 39 (8.8%) | 229 (52.0%) | 441 |
| (b) Similar Terms Selections by User type | |||||
| Source | EMR | News | EMR-News | EMR-subsets | Total |
| MD Selections | 15 (9.7%) | 31 (20.0%) | 15 (9.7%) | 93 (60.0%) | 154 (100%) |
| Non-MD Selections | 29 (10.0%) | 98 (34.0%) | 24 (8.0%) | 136 (47.0%) | 287 (100%) |
| (c) Similar Term Selections by Task type | |||||
| Source | EMR | News | EMR-News | EMR-subsets | Total |
| Clinical Selections | 26 (12.0%) | 56 (25.0%) | 21 (9.0%) | 120 (54.0%) | 223 (100%) |
| General Selections | 18 (8.0%) | 73 (33.0%) | 18 (8.0%) | 109 (50.0%) | 218 (100%) |
We applied multinomial logistic regression models to analyze the results of the user preference study. As shown in Table 5, both the user type and task type have a significant effect on user preference. Based on the intercepts and coefficients of models 3, 4, and 5 in Table 5, we concluded that both MD and Non-MD users prefer the similar terms provided by the EMR-subsets method over the baseline methods, in both the clinical and general tasks.
Table 5.
Analysis of the Impact of user type and task type on the preference of similar terms. User type (MD=0, Non-MD=1) and task type (Clinical=0, General=1) are the inputs of the multinomial logistic regression models. The significance levels are: {**: p-Value < 0.001, *: p-Value < 0.05, one-tailed}.
| Index | Logistic Regression Model | Intercept | User type | Task type |
|---|---|---|---|---|
| 1 | EMR vs. News | -0.40 | -0.51 | -0.64 |
| 2 | EMR-News vs. News | -0.50 | -0.69 | -0.43 |
| 3 | EMR-subsets vs. News | 1.30** | -0.78* | -0.38 |
| 4 | EMR-subsets vs. EMR | 1.70** | -0.27 | 0.26 |
| 5 | EMR-subsets vs. EMR-News | 1.80** | -0.09 | 0.06 |
| 6 | EMR vs. EMR-News | 0.09 | 0.18 | -0.21 |
Information Retrieval Performance
Table 6 shows the average percentage of positive labels (i.e., relevant or partially relevant labels) in the exact match and non-exact match subsets of each evaluation data set. As Table 6 shows, the non-exact match subsets contain amounts of positive documents comparable to the exact match subsets. Therefore, it is important to develop efficient methods to identify useful documents in the non-exact match subsets.
Table 6.
The Distribution of Positive Labels in the Evaluation Data Sets.
| Search Term | Average percentage of positive labeled Exact Match Documents | Average percentage of positive labeled Non-Exact Match Documents |
|---|---|---|
| Breast Cancer | 68.5% | 73.6% |
| Epilepsy | 47.7% | 59.4% |
| Fracture | 48.2% | 54.5% |
| Headache | 83.3% | 65.6% |
| Kidney | 69.6% | 71.8% |
| Pruritus | 47.4% | 68.6% |
| Respiration | 43.5% | 51.4% |
| Rhinorrhea | 52.7% | 33.3% |
| Walking | 68.5% | 73.6% |
The P@5 performance for all search terms is shown in Table 7, and the P@10 performance in Table 8. As Tables 7 and 8 show, adding similar words provided by the EMR-subsets method improves the average P@5 and P@10 results in all evaluation data sets compared to keyword-only search. Moreover, the EMR-subsets method outperforms the other extraction methods. In particular, it significantly outperforms the other methods in the non-exact match subsets, which indicates that the EMR-subsets method provides better similar words.
Table 7.
The average P@5 scores of each similar word extraction method on different data sets. A one-sided Mann-Whitney U test was applied to compare the P@5 scores of the EMR-subsets method with those of the other methods. Methods that the EMR-subsets method significantly outperformed are marked with ** (p-value < 0.001).
| Data Sets | EMR-subsets | EMR | News | EMR-News | Keywords |
|---|---|---|---|---|---|
| Exact & Non-Exact Match | 0.60 | 0.48 | 0.59 | 0.55 | 0.48 |
| Exact Match | 0.57 | 0.48 | 0.59 | 0.56 | 0.48 |
| Non-Exact Match | 0.59 | 0.39** | 0.37** | 0.41** | 0.00 |
Table 8.
The average P@10 scores of each similar word extraction method on different data sets. A one-sided Mann-Whitney U test was applied to compare the P@10 scores of the EMR-subsets method with those of the other methods. Methods that the EMR-subsets method significantly outperformed are marked with ** (p-value < 0.001).
| Data Sets | EMR-subsets | EMR | News | EMR-News | Keywords |
|---|---|---|---|---|---|
| Exact & Non-Exact Match | 0.56 | 0.46 | 0.50 | 0.55 | 0.50 |
| Exact Match | 0.53 | 0.46 | 0.47 | 0.57 | 0.50 |
| Non-Exact Match | 0.59 | 0.32** | 0.19** | 0.39** | 0.00 |
Elbow Method Evaluation
As shown in Table 9, the similarity cutoff method is able to identify an optimal similarity cutoff, which provides a better P@20 score than the manually selected similarity cutoffs when using the EMR-subsets method.
Table 9.
The average P@20 scores when searching “diabetes” and “seizure” with similar words defined by different similarity cutoffs.
| Similarity Cutoff | Average P@20 when searching “diabetes” | Average P@20 when searching “seizure” |
|---|---|---|
| 1.0 | 0.64 | 0.80 |
| 0.8 | 0.64 | 0.89 |
| 0.4 | 0.61 | 0.90 |
| 0.2 | 0.54 | 0.61 |
| Elbow method | 0.68 | 0.94 |
Time Efficiency Analysis
For the note review task, we measured the time to complete each task and the quality of labels produced by the two researchers. Ideally, the researchers would maintain their label accuracy while completing tasks faster.
The results showed that the labels provided by the researchers were highly consistent: the researchers agreed on all documents except one. Table 10 shows the median time and interquartile range (IQR) of the time each researcher spent reviewing notes with and without highlighting of the query’s similar words. We used a one-sided Mann-Whitney U test to analyze the difference in review times with and without highlighted similar words. All Mann-Whitney U tests produced p-values less than 0.05, showing that searching and highlighting similar words reduced task time.
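For reference, this test corresponds to SciPy’s mannwhitneyu with a one-sided alternative; the timing vectors below are hypothetical stand-ins, not the study’s measurements:

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-patient review times (seconds) for one researcher.
times_similar = [9, 8, 11, 10, 9, 12, 8, 9]      # similar-term highlighting
times_exact   = [12, 9, 26, 11, 15, 13, 10, 22]  # exact-keyword highlighting

# One-sided test: are review times with similar-term highlighting
# stochastically smaller than those with exact-keyword highlighting?
stat, p = mannwhitneyu(times_similar, times_exact, alternative="less")
print(stat, p)
```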
DISCUSSION
This paper reports the development and evaluation of a novel similar term extraction method, the EMR-subsets method, which utilizes the subsets of an EMR system to extract similar terms that support efficient search and consumption of clinical documents. Compared to the baseline methods, the EMR-subsets method (i) utilized less training data, (ii) received more selections in the user preference study, (iii) achieved higher IR performance, and (iv) reduced the time needed to answer questions in a timed chart review task.
Previous research demonstrated that ensemble semantic embeddings provide better similar terms (for example, by summing the similarities from multiple semantic spaces [5] or combining vectors from multiple semantic embeddings [18]). However, these methods combined embeddings trained on different data sources, or attempted to learn a global embedding instead of merging the most similar terms from each subset. In this paper, the EMR-subsets method utilized the subsets of a single data set and was preferred by users, while the combination of the EMR and News embeddings was less preferred in the user study.
Interestingly, as shown in Table-A of the appendix, highly similar terms for “cancer” in the Complete EMR embedding are related to family history, while the similar terms from the News embedding describe types of cancer. In contrast, the EMR-subsets method listed more clinical terms as similar to “cancer.” One possible reason for this difference is that physicians commonly document a patient’s family history of cancer in specific note types; the EMR-subsets method reduces the impact of words that co-occur within a popular note type. Therefore, the community should be careful about incorporating increasingly large data sets when training semantic embeddings for clinical applications.
There are several limitations of this study that suggest future work. First, we limited the EMR-subsets method to the largest clinical note types in an EMR system. Future work can consider all note types, or subsets constructed by alternative methods, such as by common phenotypes [32]. Second, while the study attempted to discern the scenarios in which the News embedding would perform best (i.e., general note review tasks), additional analysis is needed to understand why some users preferred the similar terms provided by the News embedding in some tasks. In addition, a fine-grained information retrieval analysis is needed to determine whether positive search preferences correlate with information retrieval performance across many search scenarios (i.e., whether the preferred similar terms provide better information retrieval performance). Third, the construction of user types and task types could be formalized and made more fine-grained, for example, by categorizing MD users by discipline or skill. Fourth, we utilized semantic embeddings to identify similar words, while other methods, such as graphical models [33], could be used to find related terms. Fifth, we only included unigrams when training EMR-based embeddings in the current study. We did experiment with embeddings based on bi-grams and tri-grams; however, they required much more training data and computational resources due to the larger vocabulary space, and some bi-grams have no clinical meaning, such as “table_also.” One possible extension is to add bi-grams or tri-grams drawn from a clinical dictionary, such as SNOMED CT or RxNorm. Sixth, we ranked notes by the sum of term similarities. Possible future work includes normalizing the similarities before ranking and introducing other ranking methods. Moreover, as shown in Table 6, many notes contain the search term but were not marked as relevant, which confounds recall evaluations; therefore, in the information retrieval experiments, we only presented P@K scores.
CONCLUSION
This paper presents the EMR-subsets method, which extracts similar terms from multiple semantic embeddings trained from subsets of the EMR. We systematically evaluated the similar terms extracted by the approach using qualitative and quantitative methods. Compared to the other baseline methods, the similar terms provided by the EMR-subsets method were preferred in a user preference study, achieved higher P@5 and P@10 scores across multiple search terms, and reduced the time spent searching and consuming clinical information for two researchers in a small pilot study.
Table 10.
The median time (25th and 75th percentile) that medical researchers spent reviewing one patient’s notes. A one-sided Mann-Whitney U test was applied for the analysis. The significance levels are: {**: p-value < 0.001, *: p-value < 0.05, one-tailed}.
| Researcher | Median time in seconds when reviewing one patient’s notes (25th and 75th percentile time) with highlighted similar words | Median time in seconds when reviewing one patient’s notes (25th and 75th percentile time) with highlighted exact words |
|---|---|---|
| Medical researcher 1 | 9.0 (8.0 – 11.0)** | 11.5 (9.0–26.3)** |
| Medical researcher 2 | 76.5 (57.0 – 112.0)* | 91.5 (73.5 – 135.0)* |
Highlights.
Multiple word embeddings have been trained on different types of clinical notes.
A method is presented to extract and merge terms from the word embeddings.
The method was evaluated in terms of user preference and retrieval performance.
The method allows researchers to answer chart review questions more quickly.
Acknowledgments:
The training data for the word2vec semantic embeddings was obtained from VUMC’s Synthetic Derivative, which is supported by institutional funding and by Vanderbilt CTSA grant UL1 TR000445 from NCATS/NIH.
Funding: Crowd Sourcing Labels from Electronic Medical Records to Enable Biomedical Research Award Number: 1 UH2 CA203708–01
Appendix
Table-A: The similar terms for “cancer” provided by the EMR-subsets, EMR-News, EMR, and News similar term extraction methods and the “Clinical Summary 2,” “Problem List,” and “Rehab” EMR-subset embeddings.
| EMR-subsets | EMR-News | EMR | News | Clinical Summary 2 | Problem List | Rehab |
|---|---|---|---|---|---|---|
| melanoma | leukemia | cancem | lung cancer | anut | Prostate | ca |
| breast | hashimoto | cnacer | colon cancer | paternal | Lung | carcinoma |
| prostate | malignancies | endocrinopathies | leukemia | deceased | colon | melanoma |
| carcinoma | nonpolyposis | at age | cancers | maternal | breast | xrt |
| metastatic | diabetes | cousins | liver cancer | grandmother | father | ademiocarcinoma |
| colon | cancer | gf | brain tumor | Uncle | maternal | lumpectomy |
| malignant | alzheimer | social history | brain tumors | Sister | skin | prostate |
| tumor | hpth | grandfather | bladder cancer | Father | oid | chemoxrt |
| radiation | sitosterolemia | meopausal | prostrate cancer | Mother | diabetes | chemoradiation |
| ca | masectomy | cance | colorectal cancer | Grandfather | of | malignant |
Table-B: The similar terms of “epilepsy” provided by the EMR-subsets, EMR-News, News, and EMR similar term extraction methods and the “Clinical Summary” and “Clinical Summary 2” EMR-subset embeddings.
| EMR-subsets | EMR-News | News | EMR | Clinical Summary | Clinical Summary 2 |
|---|---|---|---|---|---|
| seizures | jme | schizophrenia | eses | clonipin | seizure |
| seizure | onti | bipolar disorder | localization | emu | seizures |
| eeg | clobazam | intractable epilepsy | epileptic | vimpat | epileptic |
| intractable | Pgb | Epilepsy | semiology | amir | intractable |
| keppra | vimpat | ADHD | jme | arain | gastaut |
| epileptic | epilepsies | Lennox Gastaut syndrome | clobazam | hasan | staring |
| tonic | epileptic | epileptic seizures | veeg | somnezturk | cerebral |
| myoclonic | epileptiform | dystonia | generalization | bassel | lemiox |
| bipolar | dystonia | multiple sclerosis | wada | klialil | myoclonic |
| neurology | astatic | Dravet syndrome | cbz | neuro | siezures |
Table-C: The similar terms of “ventilator” provided by the EMR-subsets, EMR-News, News, and EMR similar term extraction methods and the “Clinical Summary,” “Clinical Summary 2,” and “Rehab” EMR-subset embeddings.
| EMR-subsets | EMR-News | News | EMR | Clinical Summary | Clinical Summary 2 | Rehab |
|---|---|---|---|---|---|---|
| vent | cmv | respirator | servo | vent | vent | canula |
| trach | ventilators | mechanical ventilator | asset | hme | ventilation | vent |
| cannula | bedside | intensive care | hed | trach | intubation | hfov |
| intubation | flolan | ventilators | simy | humidified | gj | vapotherm |
| tracheostomy | ventilation | breathing tube | idb | saturations | requirement | ncpap |
| oxygen | intubation | artificial respirator | model | tracheostomy | peep | extubation |
| ventilation | extubated | ECMO machine | drager | percussion | vapotherm | hljv |
| bipap | nebulizer | ventilator support | serial | cannula | pccu | intubation |
| sats | cannula | Intensive Care Unit | rrt | cuffed | weaned | simv |
| intubated | weaned | tracheotomy tube | monitor | acapella | cooling | ventilation |
Table-D: The similar terms of “EEG” provided by the EMR-subsets, EMR-News, News, and EMR similar term extraction methods and the “Clinical Summary 2” and “Prescription” EMR-subset embeddings.
| EMR-subsets | EMR-News | News | EMR | Clinical Summary 2 | Prescription |
|---|---|---|---|---|---|
| seizure | pdr | electroencephalogram EEG | discharges | emu | vlkuham |
| seizures | epileptiform | electroencephalograph | pdr | wada | seizure |
| emu | fosphenytoin | electroencephalogram | interictally | ictal | aed |
| epilepsy | hypsarrhythmia | electroencephalograph EEG | generalized | deprived | keppra |
| mri | alternant | EEGs | voltage | epileptogenicity | emu |
| brain | jme | evoked potentials | frontally | epileptiform | epilepsy |
| spells | opisthotonus | electroencephalograph EEG | fronto | discharges | pinaqvi |
| aed | amobarbital | electroencephalography | electrographic | nmri | seizures |
| staring | clonic | electroencephalograms | epileptogenici | seizures | gnb |
| neurology | frontopolar | brainwaves | theta | seizure | vimpat |
Footnotes
Competing interests: None.
Provenance and peer review: Not commissioned; externally peer reviewed.
REFERENCES
- [1] Rasmussen LV, The electronic health record for translational research, J. Cardiovasc. Transl. Res. 7 (2014) 607–614. doi:10.1007/s12265-014-9579-z.
- [2] Chen L, Guo U, Illipparambil LC, Netherton MD, Sheshadri B, Karu E, Peterson SJ, Mehta PH, Racing against the clock: internal medicine residents’ time spent on electronic health records, J. Grad. Med. Educ. 8 (2016) 39–44. doi:10.4300/JGME-D-15-00240.1.
- [3] Hripcsak G, Vawdrey DK, Fred MR, Bostwick SB, Use of electronic clinical documentation: time spent and team interactions, J. Am. Med. Inform. Assoc. 18 (2011) 112–117. doi:10.1136/jamia.2010.008441.
- [4] Lai KH, Topaz M, Goss FR, Zhou L, Automated misspelling detection and correction in clinical free-text records, J. Biomed. Inform. 55 (2015) 188–195. doi:10.1016/j.jbi.2015.04.008.
- [5] Henriksson A, Moen H, Skeppstedt M, Daudaravicius V, Duneld M, Synonym extraction and abbreviation expansion with ensembles of semantic spaces, J. Biomed. Semantics 5 (2014) 6. doi:10.1186/2041-1480-5-6.
- [6] Biron P, Metzger MH, Pezet C, Sebban C, Barthuet E, Durand T, An information retrieval system for computerized patient records in the context of a daily hospital practice: the example of the Leon Berard Cancer Center (France), Appl. Clin. Inform. 5 (2014) 191–205. doi:10.4338/ACI-2013-08-CR-0065.
- [7] Natarajan K, Stein D, Jain S, Elhadad N, An analysis of clinical queries in an electronic health record search utility, Int. J. Med. Inform. 79 (2010) 515–522. doi:10.1016/j.ijmedinf.2010.03.004.
- [8] Tawfik AA, Kochendorfer KM, Saparova D, Al Ghenaimi S, Moore JL, “I don’t have time to dig back through this”: the role of semantic search in supporting physician information seeking in an electronic health record, Perform. Improv. Q. 26 (2014) 75–91. doi:10.1002/piq.21158.
- [9] Zalis M, Harris M, Advanced search of the electronic medical record: augmenting safety and efficiency in radiology, J. Am. Coll. Radiol. 7 (2010) 625–633. doi:10.1016/j.jacr.2010.03.011.
- [10] Gregg W, Jirjis J, Lorenzi NM, Giuse D, StarTracker: an integrated, web-based clinical search engine, AMIA Annu. Symp. Proc. (2003) 855. http://www.ncbi.nlm.nih.gov/pubmed/14728360 (accessed October 24, 2016).
- [11] Hanauer DA, Mei Q, Law J, Khanna R, Zheng K, Supporting information retrieval from electronic health records: a report of University of Michigan’s nine-year experience in developing and using the Electronic Medical Record Search Engine (EMERSE), J. Biomed. Inform. 55 (2015) 290–300. doi:10.1016/j.jbi.2015.05.003.
- [12] Ooi J, Ma X, Qin H, Liew SC, A survey of query expansion, query suggestion and query refinement techniques, 2015 4th Int. Conf. Softw. Eng. Comput. Syst. (ICSECS 2015) (2015) 112–117. doi:10.1109/ICSECS.2015.7333094.
- [13] Goodwin T, Harabagiu SM, UTD at TREC 2014: query expansion for clinical decision support, 23rd Text Retrieval Conf. (TREC 2014) Proc. (2014).
- [14] Pal D, Mitra M, Bhattacharya S, Exploring query categorisation for query expansion: a study, arXiv preprint arXiv:1509.05567 (2015) 1–34. http://arxiv.org/abs/1509.05567.
- [15] NIH-NLM, SNOMED Clinical Terms® (SNOMED CT®), U.S. National Library of Medicine (2015). http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html.
- [16] Martinez D, Otegi A, Soroa A, Agirre E, Improving search over Electronic Health Records using UMLS-based query expansion through random walks, J. Biomed. Inform. 51 (2014) 100–106. doi:10.1016/j.jbi.2014.04.013.
- [17] Pennington J, Socher R, Manning CD, GloVe: global vectors for word representation, Proc. 2014 Conf. Empir. Methods Nat. Lang. Process. (2014) 1532–1543. doi:10.3115/v1/D14-1162.
- [18] Speer R, Chin J, An ensemble method to produce high-quality word embeddings, arXiv preprint arXiv:1604.01692 (2016). http://arxiv.org/abs/1604.01692.
- [19] Pakhomov SVS, Finley G, McEwan R, Wang Y, Melton GB, Corpus domain effects on distributional semantic modeling of medical terms, Bioinformatics 32 (2016) 3635–3644. doi:10.1093/bioinformatics/btw529.
- [20] Zhu D, Wu S, Carterette B, Liu H, Using large clinical corpora for query expansion in text-based cohort identification, J. Biomed. Inform. 49 (2014) 275–281. doi:10.1016/j.jbi.2014.03.010.
- [21] Hanauer DA, Wu DTY, Yang L, Mei Q, Murkowski-Steffy KB, Vydiswaran VGV, Zheng K, Development and empirical user-centered evaluation of semantically-based query recommendation for an electronic health record search engine, J. Biomed. Inform. 67 (2017) 1–10. doi:10.1016/j.jbi.2017.01.013.
- [22] Mikolov T, Corrado G, Chen K, Dean J, Efficient estimation of word representations in vector space, Proc. Int. Conf. Learn. Represent. (ICLR 2013) (2013) 1–12.
- [23] Mikolov T, Chen K, Corrado G, Dean J, Distributed representations of words and phrases and their compositionality, Adv. Neural Inf. Process. Syst. (NIPS 2013) (2013) 1–9.
- [24] Jin M, Li H, Schmid CH, Wallace BC, Using electronic medical records and physician data to improve information retrieval for evidence-based care, IEEE Int. Conf. Healthc. Informatics (2016). doi:10.1109/ICHI.2016.12.
- [25] Rehurek R, Sojka P, Software framework for topic modelling with large corpora, in: Proc. LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, 2010, pp. 45–50.
- [26] Roden DM, Pulley JM, Basford MA, Bernard GR, Clayton EW, Balser JR, Masys DR, Development of a large-scale de-identified DNA biobank to enable personalized medicine, Clin. Pharmacol. Ther. 84 (2008) 363.
- [27] Diaz F, Mitra B, Craswell N, Query expansion with locally-trained word embeddings, arXiv preprint arXiv:1605.07891 (2016) 367–377. http://arxiv.org/abs/1605.07891 (accessed October 24, 2016).
- [28] Richardson L, Beautiful Soup Documentation, (2016) 1–72. http://www.crummy.com/software/BeautifulSoup/bs4/doc/.
- [29] Buhrmester M, Kwang T, Gosling SD, Amazon’s Mechanical Turk: a new source of inexpensive, yet high-quality, data?, Perspect. Psychol. Sci. 6 (2011) 3–5. doi:10.1177/1745691610393980.
- [30] Starkweather J, Moske AK, Multinomial logistic regression, (2011) 404–410.
- [31] UCLA Statistical Consulting Group, Multinomial logistic regression | R data analysis examples, (2014). https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/.
- [32] Shivade C, Raghavan P, Fosler-Lussier E, Embi PJ, Elhadad N, Johnson SB, Lai AM, A review of approaches to identifying patient phenotype cohorts using electronic health records, J. Am. Med. Inform. Assoc. 21 (2014) 221–230. doi:10.1136/amiajnl-2013-001935.
- [33] Ganesan K, Lloyd S, Sarkar V, Discovering related clinical concepts using large amounts of clinical notes, Biomed. Eng. Comput. Biol. 7 (2016) 27–33. doi:10.4137/BECB.S36155.




