Abstract
This paper presents a method for converting natural language questions about structured data in the electronic health record (EHR) into logical forms. The logical forms can subsequently be converted to EHR-dependent structured queries. The natural language processing task, known as semantic parsing, has the potential to convert questions to logical forms with extremely high precision, resulting in a system that is usable and trusted by clinicians for real-time use in clinical settings. We propose a hybrid semantic parsing method, combining rule-based methods with a machine learning-based classifier. The overall semantic parsing precision on a set of 212 questions is 95.6%. The parser’s rules furthermore allow it to “know what it does not know”, enabling the system to indicate when unknown terms prevent it from understanding the question’s full logical structure. When combined with a module for converting a logical form into an EHR-dependent query, this high-precision approach allows a question answering system to provide a user with a single, verifiably correct answer.
Introduction
The wealth of information available in the electronic health record (EHR) of a patient can be both a blessing and a curse. While having the most information possible for a patient stored in the EHR is important for the many uses of EHR data (record keeping, billing, legal, research, etc.), the large amount of information can create difficulties for clinicians when they need to quickly locate specific information. For this reason, much focus has been placed on presenting clinicians with the most important information, either visually1 or in text2,3. This is no replacement, however, for mechanisms to efficiently access any piece of information. Largely this is accomplished by (often cumbersome) graphical interfaces for structured information and textual search for unstructured notes. Textual search offers notable advantages: it is often quick to enter a query and far less training is required than vendor- and provider-specific graphical interfaces. While some effort has gone into a more semantic search to find structured data elements4, these methods still largely rely on a keyword-based approach that cannot deeply grasp the user’s information need in a manner that presents a single, verifiably correct response. Instead, a search engine-style listing of ranked results is presented, burdening the user with determining the relevance of each result in turn.
Instead of keyword queries, which are often ambiguous in regards to their precise information need, natural language questions provide a clear and intuitive form to query EHRs. For this reason, observational studies of clinician needs often use natural language questions as their representation of choice5,6. Due to the utility of expressing information needs as questions, numerous automatic question answering (QA) methods have been proposed for medical data7, combining natural language processing (NLP) and information retrieval (IR) methods. However, comparatively little attention has focused on QA for patient-specific data in the EHR. Furthermore, the QA methods that have been proposed for EHRs are largely analogous to the keyword-based semantic search methods for general-purpose IR queries. This limits their ability to search over structured data and furthermore limits their ability to provide a single, verifiably correct answer. For example, given the question “What are her three most recent glucose measurements?”, a clinician would ideally want a single result (e.g., “145, 139, 156”) that can be verified (e.g., by providing proof the question was properly understood and links to the lab results in the EHR interface). Further, if a clinician were to ask a question without an answer (e.g., the patient only had two glucose measurements), or if the system did not fully understand the question (either through human error, e.g. a misspelling, or lack of system functionality, e.g. inability to understand “three most recent”), then appropriate responses could be shown instead of a list of completely incorrect search results.
To provide such an answer, the natural language question must be converted to a complete, semantically-grounded logical form using a pre-defined set of logical operations. While semantic grounding tasks such as concept normalization8 and semantic relation tasks such as semantic role labeling9 provide partial information toward a complete semantic grounding, they are insufficient for capturing the full meaning of a question. The NLP task of transforming natural language into a logical form based on pre-defined logical operations is known as semantic parsing. While semantic parsing is a well-studied task in NLP, almost no work exists on semantic parsing for a task like EHR QA.
This paper presents a semantic parsing method for understanding EHR questions. As one might imagine, converting any form of natural language to a completely structured form is a difficult, NLP-complete task. To avoid the difficult task of enumerating the lexicon of logical operations, some semantic parsing methods utilize ungrounded logical forms10,11, where the words in the question themselves become the logical functions. However, this is generally only used for searching unstructured data, or assumes words are easily mappable to the database structure. This can further be problematic when mathematical operations (e.g., min, less_than) or knowledge-based operations (is_positive, is_significant) are mixed in with logical operations that correspond to fields in the database. As such, for tasks such as this, pre-defined grounded logical forms are preferable, as they are easier to convert to structured queries. Manually enumerating a pre-defined lexicon for natural language is largely intractable, so most grounded semantic parsing techniques thus focus on short texts (such as questions) and small domains, such as U.S. geography12 (e.g., What states border Texas?). Even in closed domains, oftentimes thousands of manually annotated <question, logical form> pairs are necessary to train a semantic parser to achieve satisfactory results. Our method, therefore, differs from many existing machine learning (ML) based semantic parsers due to the complexity of the domain and the difficulty of obtaining so many annotations. It does so by relying on existing clinical NLP resources and incorporating several rule-based elements to improve its customizability. Instead of performing semantically-embedded syntactic parsing as done by many semantic parsers (e.g., using a combinatory categorial grammar (CCG) on the raw question input), the approach proposed here first leverages existing clinical NLP methods to simplify the question. Second, the simplified question is converted to a dependency-based tree structure. Third, a lexicon is used to convert lexico-syntactic substructures into logical operations, resulting in an initial set of logical trees. Fourth, a set of rules transforms and then filters these trees. Fifth, an ML-based classifier chooses the best logical form tree for the question. Sixth, a separate classifier identifies the relevant temporal intervals for medical concepts in the question. Finally, the best tree is turned into a flat logical form and items withheld during the simplification step are substituted back in. We use a logical form designed to concisely represent the question while also being easily convertible to EHR query standards such as Fast Healthcare Interoperability Resources (FHIR). Examples of such logical forms are:
Is he on any blood products?
δ(λx.has_treatment(x, C0852255, status))
How low has her blood pressure been?
min(λx.has_test(x, C0005823, visit))
What are the positive tests?
positive(λx.has_test(x, C022885, visit))
What is the trend in hemoglobin?
trend(λx.has_test(x, C0019046, visit))
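To make the shape of these logical forms concrete, the following minimal sketch (purely illustrative; it is not the internal representation used by our parser) shows one way such forms could be built and printed programmatically:

```python
from dataclasses import dataclass, field
from typing import List, Union

# A hypothetical, minimal representation of the logical forms above:
# a lambda statement (e.g., "λx.has_test(x, C0005823, visit)") wrapped
# by zero or more logical operations (e.g., min, δ, trend).

@dataclass
class Lambda:
    predicate: str          # e.g., "has_test"
    cui: str                # e.g., "C0005823"
    time: str = "visit"     # e.g., "visit" or "status"

    def __str__(self) -> str:
        return f"λx.{self.predicate}(x, {self.cui}, {self.time})"

@dataclass
class Operation:
    name: str                                                # e.g., "min", "δ", "trend"
    args: List[Union["Operation", Lambda]] = field(default_factory=list)

    def __str__(self) -> str:
        return f"{self.name}({', '.join(str(a) for a in self.args)})"

# "How low has her blood pressure been?"
form = Operation("min", [Lambda("has_test", "C0005823", "visit")])
print(form)  # min(λx.has_test(x, C0005823, visit))
```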
This paper addresses the critical second through fifth steps described above: the actual semantic parsing steps. The first step can leverage any number of existing clinical NLP systems (e.g., MetaMap13, cTAKES14); the sixth step is an important but tangential task to the semantic parsing; and the final step is a deterministic process re-incorporating the information extracted during the first step. Evaluating the semantic parsing steps in isolation enables identifying key issues in converting natural language into a logical structure without the difficulties involved in error propagation from upstream components. Optimizing the semantic parser in the presence of errors in the first step and classifying the appropriate time intervals in the sixth step are both left to future work.
Background
As previously stated, numerous methods have been proposed for medical QA7. Again, the bulk of these methods are not focused on patient-specific EHR QA, but rather on general medical knowledge or the scientific literature. Some works focus on questions asked by clinicians, while others focus on consumer questions (see Roberts et al.15 for a brief discussion of these systems). The key reasons for the separation include the linguistic differences between medical professionals and consumers16 as well as the appropriate information source (e.g., consumers are less likely to understand scientific literature articles). Further, while QA for EHRs is less studied, IR for EHRs has received more attention4,17–20.
QA methods for EHR data vary along a spectrum from keyword-based IR approaches to deep semantic approaches, though the only work involving a deep semantic approach that we are aware of falls under the project discussed in this paper. The IBM Watson system21 uses a passage scoring method. While this is similar to IR approaches, QA-specific features are integrated to improve the method over baseline IR capabilities. A more semantic method is proposed by Patrick & Li22 utilizing templates. Templates are essentially pre-defined logical relations that are classifiable directly from the question using traditional multi-class ML. They lack the full representative power achievable through semantic parsing and likely require far more training data to achieve similar levels of understanding23.
Our current project aims to overcome the limitations of existing EHR QA methods to enable a true natural language interface for structured data in the EHR. We have previously described the structure of the logical form23 as well as an annotated corpus of <question, logical form> pairs15. The corpus furthermore contains gold-annotated concepts and other manual annotations in a layer-wise fashion to enable training the semantic parser on correctly simplified questions. There are three annotated layers: (1) a syntactic simplification step (i.e., question decomposition), (2) a concept recognition and normalization step, and (3) a logical form step (i.e., semantic parsing). We have further proposed an automatic method for syntactic simplification24 as well as an automatic method for distinguishing patient-specific questions (i.e., answerable with the EHR) and other types of questions (i.e., answerable with many of the existing QA approaches) so that multiple QA systems can be integrated into a single interface25. In this work, we focus on the most critical NLP component of the overall project: the semantic parser that converts natural language into a logical form. Future work will cover the remaining tasks, such as mapping logical forms to FHIR queries for direct integration of a QA system into EHRs.
Materials and Methods
The overall QA framework for this project is shown in Figure 1. Most of the following subsections except the semantic parsing subsection describe elements of this framework at a high level, motivating certain choices made in semantic parsing and providing references to already-published descriptions.
Figure 1:

High-level architecture.
The input question (Q) is first split into multiple questions via the Syntax Decomposition module, then each of the n output questions is separately run through the remaining pipeline. After Concept Normalization, the question is simplified (SQ) by replacing concepts with consistent placeholders. A dependency parse is then generated from SQ and run through the semantic parsing components.
The dependency tree is run through the Lexicon, which produces all N possible Lexicon Match Trees. Each of these is run through both generation rules (increasing the number of trees) and filtering rules (pruning trees), resulting in M Logical Trees. A Tree Classifier then selects the best Logical Tree.
After the semantic parser, the concept CUIs are re-substituted back into the logical form, the TIME attribute is classified, and finally a query module uses the logical form as a template for interacting with the EHR (e.g., through FHIR queries).
1. Data
The dataset consists of 212 manually-annotated questions originally collected from actual clinicians by Patrick & Li22. Full details of the dataset can be found in our previous work15,23; only a high-level description is provided here. The questions were annotated with successive layers. The first layer splits multi-answer questions (referred to as decomposition) using the question’s syntactic structure. For instance, “What was the highest and lowest glucose?” is split into “What was the highest glucose?” and “What was the lowest glucose?”. This step is rarely necessary, as less than 5% of questions require decomposition. The second layer recognizes and normalizes concepts, both traditional medical concepts, such as problems, treatments, and tests, as well as concepts specific to clinical questions, such as references to the patient (e.g., “she”, “he”, “the patient”) as well as admission, discharge, staff, etc. The traditional medical concepts are normalized to SNOMED codes (using UMLS CUIs). Finally, the last layer provides the logical forms using a lambda calculus notation (an extension of first-order logic incorporating a λ function that identifies all items matching some condition). Most medical concepts correspond to a common λ pattern, λx.has_TYPE(x, CUI, TIME), which corresponds to all events x of the type CUI within the temporal interval specified by TIME. TIME can take several values, including visit (corresponding to the current hospital visit) and status (corresponding to the currently active events). Non-concept words in the question generally correspond to logical functions, e.g., “highest” corresponds to the function max. Most questions, roughly 97%, have only one λ-statement, while multi-λ questions can be particularly complicated for semantic parsers. For instance:
δ(λx.has_treatment(x, C0087111, visit) ∧ δ(λy.is_response(x, y)))
This logical form corresponds to the question “Did she have a reaction to the treatment?”. The second λ-statement is nested inside the first, essentially testing every treatment event x (C0087111 is the high-level code for treatments) to see if any other event (denoted y) is a response to event x. There is insufficient space here to discuss the reasoning behind every decision for the logical form annotations. We thus refer the interested reader to our previous work describing these decisions15,23.
2. Initial Question Processing
The important insight behind the layer-wise approach is that while the concepts are present in the final logical form, they do not affect the structure of the logical form. As a result, the semantic parser need not understand the differences between “diabetes” and “hypertension”, only that “Does he have PROBLEM?” corresponds to the logical form δ(λx.has_problem(x, CUI, TIME)), where the UMLS code for the problem is substituted into the place of CUI. For this reason, the first two layers are referred to as simplification steps, as they reduce both the lexical and syntactic complexity of the question. Further, these steps correspond to clinical NLP tasks for which there is existing data. As a result, far less training data for the semantic parser is required. The layer-wise annotation on this dataset furthermore enables experimentation using gold standard simplification, isolating just the semantic parsing.
While the automatic methods behind these steps are beyond the scope of this paper, as previously stated, existing methods can largely be leveraged to perform this processing. A syntactic decomposition method, for instance, was proposed by Roberts24. Concept recognition and normalization has been extensively studied, with both ML-based methods based on the SemEval task focused on disorders8 as well as rule-based methods such as MetaMap13.
The final step prior to the semantic parser, which is performed automatically in this work, is the dependency parse. The Stanford dependency parser26 operates on the simplified questions. To ensure a proper dependency parse, the concepts in the simplified question are substituted with a part-of-speech appropriate replacement word. For example: a noun concept is replaced by the word “concept”; a past-tense verb concept is replaced by the word “ate”; and the reference to the patient is replaced by the pronoun “he” or “his” (if possessive). This replacement ensures a proper dependency parse on the simplified question, since replacing a phrase with a word from a different part-of-speech would likely result in an erroneous dependency parse. The original placeholders are then substituted back into the output dependency tree (e.g., “ate” is changed back to “concept” and “he” is changed back to “patient”).
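As an illustration of this substitution step only (our system uses the Stanford dependency parser; the sketch below substitutes spaCy, and the surrogate word list is an assumption), the simplified question can be parsed and the placeholders restored as follows:

```python
import spacy  # assumes the en_core_web_sm model has been downloaded

nlp = spacy.load("en_core_web_sm")

# Surrogate words chosen to preserve the part of speech of the replaced concept.
SURROGATES = {"nn:procedure": "concept", "vbd:procedure": "ate", "pos:patient": "his"}

def parse_simplified(simplified_question: str):
    """Parse a simplified question, then restore the original placeholders.
    Assumes a one-to-one mapping between whitespace tokens and parser tokens."""
    tokens = simplified_question.split()
    surrogate_text = " ".join(SURROGATES.get(t, t) for t in tokens)
    doc = nlp(surrogate_text)
    return [(tokens[tok.i], tok.dep_, tokens[tok.head.i]) for tok in doc]

for word, dep, head in parse_simplified("what was pos:patient lowest nn:procedure"):
    print(f"{word:15s} --{dep}--> {head}")
```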
To demonstrate the full extent of processing prior to the semantic parser, consider the following two questions:
| Layer | Example 1 | Example 2 |
|---|---|---|
| Lexical | When was the patient most recently dialyzed? | What was her lowest and highest blood sugar level? |
| Syntax | When was the patient most recently dialyzed? | What was her lowest blood sugar level? |
| Concept | when was patient most recently vbd:procedure | what was pos:patient lowest nn:procedure |
| Dependency | (tree shown in the original figure) | (tree shown in the original figure) |
3. Semantic Parsing
The semantic parsing components are shown inside the box in Figure 1. The first component is a dependency tree-based lexicon. Note that most publicly available semantic parsers27,28 make use of a similar lexicon: while a lexicon can be learned directly from <question, logical form> pairs, such systems generally prefer to rely on lexicons for increased precision. In our case, since a high-precision semantic parser is crucial, we follow suit.
The tree-based lexicon maps nodes and edges in the dependency tree to logical operations. The simplest cases are individual nodes (i.e., single words/concepts):
concept ⇒ λx.has_concept
when ⇒ time
did ⇒ δ
lowest ⇒ min
first ⇒ earliest
was ⇒ latest
temporal_ref ⇒ time_within
patient ⇒ null
yet ⇒ null
Null matches indicate words that are not useful for semantic parsing, but possibly useful for downstream tasks (e.g., “yet” is useful for time classification). Multi-word matches are also common. They are specified with the head node and specific edge types to dependent nodes:
concept (det:the) ⇒ λx.has_concept
recently (advmod:most) ⇒ latest
done (aux:have, auxpass:been) ⇒ δ
often (advmod:how) ⇒ count
much (advmod:how) ⇒ sum
received (dobj:concept) ⇒ λx.has_concept
Note that while the overall focus of the system is on precision, lexicon building invariably must focus on recall. For this reason, multiple lexicon entries are often necessary for the same word (e.g., [was ⇒ null], [was ⇒ δ]). The result is that most dependency trees have at least one overlapping lexicon match. When this happens, all possible combinations of lexicon matches are generated such that every node in the dependency tree is covered by exactly one lexicon match. A new tree is then created with nodes corresponding to the dependency relations (nodes and edges in multi-word matches are collapsed into a single node). Leaf nodes with null values are also discarded. These trees are referred to as lexicon match trees: the nodes contain logical operations, but the edges between those operations do not necessarily correspond to relations in the final logical form. The trees for the above examples include:
Lexicon Match Tree: (the lexicon match trees for the two example questions are shown as diagrams in the original figure)
Note that while the two initial questions are quite different on the surface, by now the structures are quite similar.
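The combination step above can be sketched roughly as follows (assuming, for brevity, that every lexicon entry covers a single dependency node; multi-word matches would additionally collapse the covered nodes into one). The lexicon contents shown are a small illustrative subset:

```python
from itertools import product

# Illustrative single-node lexicon: word -> candidate logical operations.
LEXICON = {
    "what": [None],
    "was": [None, "δ", "latest"],
    "lowest": ["min"],
    "concept": ["λx.has_concept"],
    "patient": [None],
}

def lexicon_match_assignments(dependency_nodes):
    """Yield every way to cover each dependency node with exactly one lexicon
    entry; a word with no entry makes the question unanswerable."""
    candidates = []
    for word in dependency_nodes:
        if word not in LEXICON:
            return  # unknown word: report it and ask the user to rephrase
        candidates.append([(word, op) for op in LEXICON[word]])
    yield from product(*candidates)

for assignment in lexicon_match_assignments(["what", "was", "patient", "lowest", "concept"]):
    print(assignment)  # null leaves are later dropped when building the tree
```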
At this point in the processing, questions without any lexicon match trees are considered unanswerable, as this means some part of the question has an unknown logical function. This is the source of the system’s high-precision design: we would rather the system answer fewer questions than risk returning incorrect answers, which would diminish clinicians’ trust in the system. When a question is unanswerable, the user can be presented with the unknown word(s) and asked to rephrase the question.
The next component converts each lexicon match tree into zero or more logical trees, using a set of rules that follow a grow-and-prune strategy. First, two “grow” rules generate new trees by manipulating related nodes in the initial lexicon match tree:
1. FlipRule: if a parent has one child, create a new tree with parent and child flipped (e.g., the tree [A→B→C→D] would also result in [A→C→B→D]).
2. PromoteRule: if a parent has more than one child, promote each child in a new tree (e.g., [A→B, C] would also result in [A→B→C] and [A→C→B]).
Trees generated by these rules are also run through the same rules, resulting in a final set of trees that includes not only trees resembling the original lexicon match tree (which frequently contains nodes with multiple children), but also every possible unary tree (which is more typical of the logical forms in the dataset). Next, the “prune” rules remove invalid trees:
NullRule: null leaves are removed, and trees with null non-leaf nodes are filtered.
TypeRule: Every logical function has pre-defined input and output types. For instance, λx.has_concept has no input and returns an EVENTSET; latest and max take in an EVENTSET and return an EVENT; δ takes in an EVENTSET and returns a TRUEFALSE; and time takes in an EVENT and returns a TIME. Using these types, any tree that has an incompatible parent-child relationship (the parent’s input must equal the child’s output) is filtered.
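A rough sketch of this rule machinery is given below; it shows only a root-level FlipRule and the TypeRule check (PromoteRule and the full recursive tree generation are omitted), and the type table is a partial, illustrative one:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    op: str
    children: tuple = ()

# Declared signatures for the prune step: op -> (input type, output type).
TYPES = {
    "λx.has_concept": (None, "EVENTSET"),
    "latest": ("EVENTSET", "EVENT"),
    "min": ("EVENTSET", "EVENT"),
    "δ": ("EVENTSET", "TRUEFALSE"),
    "time": ("EVENT", "TIME"),
}

def flip(node: Node) -> Node:
    """FlipRule: swap a parent with its only child (applied at the root here)."""
    if len(node.children) != 1:
        return node
    child = node.children[0]
    return Node(child.op, (Node(node.op, child.children),))

def type_ok(node: Node) -> bool:
    """TypeRule: a parent's declared input must match each child's output."""
    inp, _ = TYPES[node.op]
    for child in node.children:
        _, out = TYPES[child.op]
        if inp != out or not type_ok(child):
            return False
    return True

# "When was the patient most recently dialyzed?" -> candidate chains
good = Node("time", (Node("latest", (Node("λx.has_concept"),)),))
bad = flip(good)   # latest(time(λx.has_concept)) is rejected by the TypeRule
print(type_ok(good), type_ok(bad))  # True False
```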
These rules drastically reduce the number of trees, as the grow rules produce many invalid trees. As the trees are now compatible with logical forms, they are referred to as logical trees. Logical trees for the previous examples include:
Logical Tree: (the logical trees for the two example questions are shown as diagrams in the original figure)
This tree structure is useful for downstream processing, however, so the trees are not yet “flattened” into logical forms.
The final step in semantic parsing is to choose between the remaining logical trees. Several statistical factors dictate the true logical form. The two most important are that some lexicon matches are more likely than others (e.g., “was” is more likely to be null than δ), and some arrangements of the logical form are more likely (corresponding to parent-child relations in the logical tree). For this reason, an ML-based approach using a support vector machine29 (SVM) is used. The two statistical factors described are the first two features considered by the SVM, along with a third feature emphasizing the importance of the top-level logical operation. The three features are:
LexiconMatch: IDs of the lexicon matches that were used to generate the lexicon match tree that, in turn, was used to generate the logical tree.
ParentChild: All parent-child pairs in the logical tree, including identifying the root and leaf nodes.
Stem+Root: The question stem (e.g., what, how) combined with the logical operation of the logical tree’s root node. For example, what-latest, how_many-count, when-time.
The SVM classifies all remaining logical trees for a question. The tree with the highest positive confidence is then selected as the output of the semantic parsing phase.
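As a sketch of how these three feature templates could feed a binary tree-selection classifier, the example below uses scikit-learn’s DictVectorizer and LinearSVC (which, like the LIBLINEAR package cited above, trains a linear SVM); the feature values and the helper signature are illustrative rather than the exact ones used in our experiments:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def tree_features(question_stem, lexicon_match_ids, parent_child_pairs, root_op):
    """Build the three feature templates for one candidate logical tree."""
    feats = {}
    for match_id in lexicon_match_ids:                      # LexiconMatch
        feats[f"lex={match_id}"] = 1.0
    for parent, child in parent_child_pairs:                # ParentChild
        feats[f"pc={parent}>{child}"] = 1.0
    feats[f"stem_root={question_stem}-{root_op}"] = 1.0     # Stem+Root
    return feats

# Toy training data: one row per candidate tree, labeled 1 if it is the gold tree.
X = [
    tree_features("what", ["was:latest", "concept"], [("latest", "λx.has_concept")], "latest"),
    tree_features("what", ["was:δ", "concept"], [("δ", "λx.has_concept")], "δ"),
]
y = [1, 0]

vec = DictVectorizer()
clf = LinearSVC()
clf.fit(vec.fit_transform(X), y)

# At prediction time, score every remaining candidate tree for a question and
# keep the one with the highest decision value.
scores = clf.decision_function(vec.transform(X))
best = max(range(len(X)), key=lambda i: scores[i])
```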
4. Subsequent Question Processing
After semantic parsing, several more steps are necessary. First, the TIME element needs to be classified, which can be done using traditional ML-based multi-class classification. Second, the concepts simplified after concept normalization need to have their CUIs inserted into their proper place (easily done with multiple concepts by tracking which logical tree node corresponds to which concept). This will produce the final logical form, which can be represented as a flat string.
To query an EHR, a query module needs to convert the elements in the logical form to either EHR-specific queries or functions within the module. For instance, given the logical form time(latest(λx.has_test(x, C0392201, visit))), the query module might convert λx.has_test(x, C0392201, visit) into a FHIR query (likely multiple queries) to find all the blood glucose measurements occurring during the patient’s visit. Each of these results would be represented as an event, and the full set of events would be passed to the latest function to find the most recent matching event. Finally, that event would be given to the time function to return just that event’s timestamp.
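A minimal sketch of such a query module is shown below; the FHIR base URL, patient ID, and CUI-to-LOINC mapping are illustrative assumptions, and visit-level filtering and error handling are omitted:

```python
import requests

FHIR_BASE = "https://example.org/fhir"   # hypothetical FHIR server
CUI_TO_LOINC = {"C0392201": "2345-7"}    # illustrative CUI -> LOINC mapping

def has_test(patient_id: str, cui: str):
    """λx.has_test(x, CUI, visit): fetch matching lab events (visit filter omitted)."""
    resp = requests.get(
        f"{FHIR_BASE}/Observation",
        params={"subject": f"Patient/{patient_id}", "code": CUI_TO_LOINC[cui]},
    )
    resp.raise_for_status()
    return [entry["resource"] for entry in resp.json().get("entry", [])]

def latest(events):
    """latest(...): the most recent matching event (assumes effectiveDateTime is set)."""
    return max(events, key=lambda e: e["effectiveDateTime"])

def time(event):
    """time(...): just the event's timestamp."""
    return event["effectiveDateTime"]

# time(latest(λx.has_test(x, C0392201, visit)))
print(time(latest(has_test("1234", "C0392201"))))
```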
Using this approach, a QA system can provide a single answer (resulting from the structured query and logical functions) that is verifiably correct in that a visualization of the logical form (for instance, see Figure 2) can be provided to the user to verify the system properly understood the question. If the question understanding was accurate, the answer should therefore be correct (at least from the NLP perspective, errors in data entry and EHR querying notwithstanding).
Figure 2:

Possible graphical representation of logical form.
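One simple way to produce such a verification display from the selected logical tree is sketched below, rendered here as an indented outline rather than the graphical form of Figure 2:

```python
def render(node, indent=0):
    """Render a logical tree as an indented outline so the user can verify
    that the question was understood (node = (operation, children))."""
    op, children = node
    lines = ["  " * indent + op]
    for child in children:
        lines.extend(render(child, indent + 1))
    return lines

tree = ("time", [("latest", [("λx.has_test(x, C0392201, visit)", [])])])
print("\n".join(render(tree)))
# time
#   latest
#     λx.has_test(x, C0392201, visit)
```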
Results
We evaluate the semantic parsing accuracy on the questions with leave-one-out validation to maximize data use (the small number of questions and features means this is still quite fast, around 15 seconds). Table 1 shows feature experiments to demonstrate the importance of each of the three features, along with several baselines. All methods reported in Table 1 use the same lexicon and logical tree rules; only the tree classifier is altered. The Stem+Root feature, which simply looks at the question stem and the root logical operation, has the highest performance. This is likely due to the type-based filtering rules: if the top-level logical operation is correct, this limits the potential children. Additionally, many logical forms start with either latest or δ, after which the amount of ambiguity decreases, and the Stem+Root feature can identify high-likelihood root operations (e.g., is-δ, what-latest) and eliminate unlikely ones (e.g., what-δ, did-latest).
Table 1:
Experiments for tree classifier using baseline methods and feature combinations.
| System | Accuracy |
|---|---|
| Random Baseline | 39.1% |
| Largest Tree Baseline | 42.5% |
| Smallest Tree Baseline | 48.9% |
| LexiconMatch | 74.8% |
| ParentChild | 73.1% |
| Stem+Root | 78.2% |
| LexiconMatch, ParentChild | 86.7% |
| LexiconMatch, Stem+Root | 88.8% |
| ParentChild, Stem+Root | 88.1% |
| LexiconMatch, ParentChild, Stem+Root | 95.6% |
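For clarity, the leave-one-out protocol behind every row of Table 1 can be sketched as follows (the question representation and the train/select helpers are placeholders for the classifier described above):

```python
def loo_accuracy(questions, train_fn, select_fn):
    """Hold out each question in turn, train the tree classifier on the rest,
    and count how often the selected logical tree matches the gold tree.
    `train_fn` and `select_fn` stand in for the SVM training and tree selection."""
    correct = 0
    for i, held_out in enumerate(questions):
        model = train_fn(questions[:i] + questions[i + 1:])
        predicted = select_fn(model, held_out["candidate_trees"])
        correct += int(predicted == held_out["gold_tree"])
    return correct / len(questions)
```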
Table 2 shows experiments to assess the impact of the logical tree rules on the final result. Performance is hurt dramatically without the generation rules, achieving 56.8% and 72.1% without the FlipRule and PromoteRule, respectively. Removing both rules results in a performance of 50.7%. The filtering rules have far less of an impact on the final result. This makes sense, as removing generation rules fundamentally limits recall. The filtering rules, by contrast, are designed to help with precision, but the ultimate decision rests with the tree classifier. The NullRule, which removes logical trees with null nodes, removes trees that are easily detected by the SVM classifier, and so there is no effect on the final score. The TypeRule has some effect on the system: without it, the accuracy drops to 89.8%. While the classifier should ultimately learn which parent-child relationships are invalid, the training data can be quite sparse for some relations, so having a hard rule that enforces type agreement is useful. It also greatly speeds up the system: without the TypeRule, there is a mean of 206 and a median of 13 logical tree candidates per question. With the TypeRule, that drops to a mean of 11 and a median of 2 candidates. This also reduces data imbalance, which likely benefits training the SVM.
Table 2:
Experiments removing various logical tree rules using best tree classifier.
| System | Accuracy |
|---|---|
| + All Rules | 95.6% |
| -FlipRule | 56.8% |
| -PromoteRule | 72.1% |
| -FlipRule, PromoteRule | 50.7% |
| -NullRule | 95.6% |
| -TypeRule | 89.8% |
| -NullRule, TypeRule | 89.9% |
| -All Rules | 49.7% |
To demonstrate the effect of data size, Figure 3 shows the leave-one-out accuracy (using all three features) with increasing percentages of the total data. With small amounts of data (<20%), there are random fluctuations in performance, as would be expected, but afterward there is a steady climb in performance. It appears likely that adding more data would continue to benefit the semantic parsing performance.
Figure 3:

Leave-one-out accuracy using increasing amounts of data (measured in percent of the full dataset).
Discussion
The high performance of the semantic parser (>95%) on the annotated dataset demonstrates the potential for a high-precision question understanding system, converting natural language to a fully-structured logical form. However, the use of the lexicon means that the results reported above are high estimates of the performance of the semantic parser on unseen questions, since almost all questions had the necessary lexicon entries to generate their proper logical form. For this reason, the 95.6% accuracy is less of a measure of system accuracy than it is a measure of system precision: some unseen questions will contain unseen words, which have no lexicon match and therefore will return no answer at all (i.e., a false negative). Precision does not measure false negatives, however, so the results above are a decent estimate for precision. Recall, however, while not a focus of our method, would still be a concern if it were so poor that hardly any questions would be answered. The primary limitation of recall in our system is the lexicon: if it fails to find an entry for a term in the question, or the entries it does find fail to produce a valid logical tree, then no answer will be provided. Figures 4 and 5 provide a sense of the distribution of the entries in the lexicon. Figure 4 shows that the lexicon has a very long tail: around 20 lexicon entries are highly used when generating gold logical trees, while dozens of entries are only associated with a single gold logical tree. Figure 5 shows that 72 questions rely on at least one lexicon entry that is not used for any other question. It is difficult to estimate recall exactly from these figures, but it would not be surprising if recall on unseen questions was in the 60-80% range, far below the high precision of the semantic parser. However, this assumes that users ask questions without knowledge of the system: over time a user would likely come to understand how to phrase questions based on the parser’s vocabulary.
Figure 4:

Sorted histogram showing how often the lexicon entries were used in generating gold logical trees. That is, the most frequent lexicon entry was used around 350 times, the second most frequent around 255, etc.
Figure 5:

Histogram showing the number of questions (y-axis) that rely on a lexicon entry with a given frequency (x-axis), i.e., the minimum frequency lexicon match used by a question. So, over 70 questions rely on a lexicon entry that is only used once, while roughly 40 questions rely entirely on lexicon entries used at least 42 times.
Even on the annotated data, for which we have a complete lexicon, the semantic parser is not perfect, so some discussion of the types of errors it makes is worthwhile. The most common type of error revolves around rare logical operations. For instance, the question “Is his Computed Tomography positive?” has the gold logical form is_positive(latest(λx.has_concept)). The predicted logical form, however, was δ(λx.has_concept ∧ is_positive(x)). The predicted form more closely resembles other uses of is_positive in the data, but unfortunately in this case would actually answer the question “Does the patient have a positive Computed Tomography?”, i.e., it will return true if any prior test is positive, whereas the original question is only interested in whether the most recent test is positive. In order to maximize precision further, these rare predicates either need to be removed or generalized to a common concept. For example, many of the rarer predicates revolve around testing the value of some event (positive, negative, significant, present, etc.), so this provides hope for at least partially solving this problem in future work.
Other, more difficult errors involve some degree of ambiguity or implied answer type. While the intentional decision was made during annotation to take the questions as literally as possible, it was not possible to remove all real-world knowledge of the true intention of the question. For instance, the question “How high is his Troponin?” has the gold logical form latest(λx.has_concept), which essentially is just asking for the result of the latest Troponin test. The phrase “How high” is roughly analogous to “What is the value of” in the question. While that interpretation can be added to the lexicon, the semantic parser still focuses on a strict interpretation of the phrase “How high” and predicts the logical form max(λx.has_concept), i.e., “What is the highest Troponin level?”. Similarly, “How much sputum does he have?” is interpreted as the summation of all sputum tests (sum(λx.has_concept)) and not simply the result of the most recent test (latest(λx.has_concept)).
Many of these errors can be tolerated if the answer is placed in the proper visual context. For instance, an incorrect interpretation of the question about sputum should result in a graphically presented answer that reveals the misunderstanding. In this case, showing not only a graphical query representation but also an answer that displays the individual sputum tests beneath the sum would be a good visual clue to the clinician that the question was misinterpreted. This is yet another case of how placing a noisy NLP system within a proper visual interface can prevent NLP errors from propagating to clinical decision-making, and thus retain the trust of clinicians. Unlike most NLP systems, however, the logical forms created by the semantic parser provide an incredibly useful structure for visually representing information. NLP systems that rely on bag-of-words text categorization, for instance, are difficult to visually interpret beyond highlighting noteworthy words. Instead, the logical form, when displayed graphically as in Figure 2, not only allows the user to know almost immediately that an error occurred, but also gives a strong clue for how to re-phrase the question in a less ambiguous manner. Such a trust-building combination of NLP and interface design can lead to greater use of NLP to directly benefit clinicians in real-time situations. Currently, however, most clinical applications that involve NLP focus on its offline use for research and administrative purposes30. For the most part, this is because NLP systems are not sufficiently precise to risk introducing errors into clinical decision-making. It is our hope, however, that methods like the one proposed here, by focusing sufficiently on precision while still offering substantial utility, will eventually be adopted by clinicians.
Conclusion
We have presented a semantic parsing method to convert clinical questions into a structured logical form. The method combines rule-based techniques (a lexicon and generation/pruning rules) to create tree-based structures that are then classified by an SVM to select the tree corresponding to the best logical form. This semantic parsing method achieves an accuracy of 95.6% on a manually annotated set of questions, though as discussed this is probably better interpreted as the system’s precision. The method is intentionally high-precision, sacrificing recall in order to preserve user trust. Not only is the logical form approach an effective way to retrieve a single, verifiably correct answer, it also provides a useful opportunity to visualize the NLP prediction in a way that can further increase user trust. Future work on the semantic parsing includes expanding the number of questions, including questions with additional complexity. For the project as a whole, future work will focus on integrating the semantic parsing component into an end-to-end QA system, from natural language question to FHIR query.
References
1. Shneiderman B, Plaisant C, Hesse BW. Improving health and healthcare with interactive visualization methods. Computer. 2013.
2. Mishra R, Bian J, Fiszman M, Weir CR, Jonnalagadda S, Mostafa J, Del Fiol G. Text summarization in the biomedical domain: a systematic review of recent research. J Biomed Inform. 2014;52:457–467. doi: 10.1016/j.jbi.2014.06.009.
3. Pivovarov R, Elhadad N. Automated methods for the summarization of electronic health records. J Am Med Inform Assoc. 2015;22:938–947. doi: 10.1093/jamia/ocv032.
4. Prager JM, Liang JJ, Devarakonda MV. SemanticFind: Locating What You Want in a Patient Record, Not Just What You Ask For. In AMIA Jt Summits Trans Sci Proc. 2017.
5. Ely JW, Osheroff JA, Ebell MH, Chambliss ML, Vinson DC, Stevermer JJ, Pifer EA. Obstacles to answering doctors’ questions about patient care with evidence: qualitative study. BMJ. 2002;324:1–7. doi: 10.1136/bmj.324.7339.710.
6. Del Fiol G, Workman TE, Gorman PN. Clinical Questions Raised by Clinicians at the Point of Care: A Systematic Review. JAMA Intern Med. 2014;174(5):710–718. doi: 10.1001/jamainternmed.2014.368.
7. Athenikos SJ, Han H. Biomedical question answering: A survey. Comput Meth Prog Bio. 2010;99:1–24. doi: 10.1016/j.cmpb.2009.10.003.
8. Pradhan S, Elhadad N, South BR, Martinez D, Christensen L, Vogel A, Suominen H, Chapman WW, Savova G. Evaluating the state of the art in disorder recognition and normalization of the clinical narrative. J Am Med Inform Assoc. 2015;22(1):143–154. doi: 10.1136/amiajnl-2013-002544.
9. Albright D, Lanfranchi A, Fredriksen A, Styler IV WF, Warner C, Hwang JD, Choi JD, Dligach D, Nielsen RD, Martin J, Ward W, Palmer M, Savova GK. Towards comprehensive syntactic and semantic annotations of the clinical narrative. J Am Med Inform Assoc. 2013;20(5):922–930. doi: 10.1136/amiajnl-2012-001317.
10. Terol RM, Martinez-Barco P, Palomar M. A knowledge based method for the medical question answering problem. Comput Biol Med. 2007;37(10):1511–1521. doi: 10.1016/j.compbiomed.2007.01.013.
11. Reddy S, Täckström O, Collins M, Kwiatkowski T, Das D, Steedman M, Lapata M. Transforming dependency structures to logical forms for semantic parsing. TACL. 2016;4:127–140.
12. Artzi Y, Zettlemoyer LS. Learning compact lexicons for CCG semantic parsing. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. 2011.
13. Aronson A, Lang FM. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc. 2010;17:229–236. doi: 10.1136/jamia.2009.002733.
14. Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, Chute CG. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17:507–513. doi: 10.1136/jamia.2009.001560.
15. Roberts K, Demner-Fushman D. Annotating Logical Forms for EHR Questions. In LREC. 2016:3772–3778.
16. Roberts K, Demner-Fushman D. Interactive use of online health resources: A comparison of consumer and professional questions. J Am Med Inform Assoc. 2016;23(4):802–811. doi: 10.1093/jamia/ocw024.
17. Voorhees EM, Tong RM. Overview of the TREC 2011 Medical Records Track. In TREC. 2011.
18. Voorhees EM, Hersh W. Overview of the TREC 2012 Medical Records Track. In TREC. 2012.
19. Hanauer DA, Mei Q, Law J, Khanna R, Zheng K. Supporting information retrieval from electronic health records: A report of University of Michigan’s nine-year experience in developing and using the Electronic Medical Record Search Engine (EMERSE). J Biomed Inform. 2014;55:290–300. doi: 10.1016/j.jbi.2015.05.003.
20. Hanauer DA, Wu DT, Yang L, Mei Q, Murkowski-Steffy KB, Vinod Vydiswaran VG, Zheng K. Development and empirical user-centered evaluation of semantically-based query recommendation for an electronic health record search engine. J Biomed Inform. 2017;67:1–10. doi: 10.1016/j.jbi.2017.01.013.
21. Raghavan P, Patwardhan S. Question Answering on Electronic Medical Records. In AMIA Jt Summits Trans Sci Proc. 2016:331–332.
22. Patrick J, Li M. An ontology for clinical questions about the contents of patient notes. J Biomed Inform. 2012;45:292–306. doi: 10.1016/j.jbi.2011.11.008.
23. Roberts K, Demner-Fushman D. Toward a Natural Language Interface for EHR Questions. In AMIA Jt Summits Trans Sci Proc. 2015:157–161.
24. Roberts K, Kilicoglu H, Fiszman M, Demner-Fushman D. Decomposing Consumer Health Questions. In Proceedings of the 2014 BioNLP Workshop. 2014:29–37.
25. Roberts K, Rodriguez L, Shooshan SE, Demner-Fushman D. Resource Classification for Medical Questions. In AMIA Annu Symp Proc. 2016:1040–1049.
26. de Marneffe MC, MacCartney B, Manning CD. Generating Typed Dependency Parses from Phrase Structure Parses. In LREC. 2006.
27. Artzi Y, Zettlemoyer L. UW SPF: The University of Washington Semantic Parsing Framework. 2013. arXiv:1311.3011.
28. Berant J, Chou A, Frostig R, Liang P. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013.
29. Fan R, Chang K, Hsieh C, Wang X, Lin C. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research. 2008;9:1871–1874.
30. Roberts K, Boland MR, Pruinelli L, Dcruz J, Berry A, Georgsson M, Hazen R, Sarmiento RF, Backonja U, Yu KH, Jiang Y, Brennan PF. Biomedical informatics advancing the national health agenda: the AMIA 2015 year-in-review in clinical and consumer informatics. J Am Med Inform Assoc. 2016. doi: 10.1093/jamia/ocw103.

