Abstract
Clinical semantic parsing (SP) is an important step toward identifying the exact information need (as a machine-understandable logical form) from a natural language query aimed at retrieving information from electronic health records (EHRs). Current approaches to clinical SP are largely based on traditional machine learning and require hand-building a lexicon. Recent advances in neural SP show promise for building robust and flexible semantic parsers without much human effort. Thus, in this paper, we systematically assess the performance of two such neural SP models for EHR question answering (QA). We found that the performance of these advanced neural models on two clinical SP datasets is promising given their ease of application and generalizability. Our error analysis surfaces the common types of errors made by these models and has the potential to inform future research into improving the performance of neural SP models for EHR QA.
1. Introduction
Navigating the information present in electronic health records (EHRs) is difficult due to various usability issues associated with these systems 1. Question answering (QA) systems help in this regard by providing a way to express an information need in the form of natural language (instead of through a long series of clicks). One approach to interpreting the information need expressed by a natural language query is semantic parsing (SP), in which the inherently ambiguous input query is mapped to an unambiguous logical form (LF) that is understandable by machines 2. Most current approaches to clinical SP (where the queries are focused on EHRs) have been centered around rule-based techniques and traditional machine learning (ML) 3,4,5. Recently, many neural SP techniques have emerged that hold the promise of achieving performance comparable to traditional techniques while overcoming some of the difficulties involved in building traditional systems. However, it is unknown how these neural systems will perform in the clinical domain, given domain-specific intricacies and difficulties such as the lack of large, high-quality datasets.
Many traditional semantic parsers make use of a lexicon to map the phrases present in natural language (such as healed) to the logical predicates present in the target grammar (e.g., is_healed). Lexicons are good at modeling the compositionality of logic 2. However, building such a lexicon requires domain expertise and is usually a time-consuming process. Also, while rule-based systems may perform well on the specific dataset for which they are implemented, they often fail to generalize across a broader variety of questions beyond what is present in the original dataset. Neural semantic parsers overcome some of these shortcomings by diminishing the need for a lexicon. This, in turn, also improves the generalizability of the systems, i.e., their capability to accommodate a wider variety of data. For instance, given a word with a corresponding entry in the lexicon (e.g., healed → is_healed), a lexicon-based system may fail to understand a different synonym of the word (in the clinical context) that is unavailable in the lexicon (e.g., repaired, mended). Thus, the reach of such lexicon-based systems is somewhat limited to the phrases available in their lexicon. Neural systems instead use advanced embeddings to represent the words in a query, which play an important role in identifying synonyms and even misspellings 6.
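To illustrate this limitation, the toy sketch below (with a hypothetical `toy_lexicon` and `lookup_predicate`, not the lexicon of any cited system) shows how a hand-built phrase-to-predicate map silently fails on a synonym it does not cover.

```python
# Toy illustration only: a hand-built lexicon maps surface phrases to logical
# predicates, so any synonym missing from the map simply cannot be parsed.
toy_lexicon = {
    "healed": "is_healed",
    "closed": "is_healed",
    "negative": "is_negative",
}

def lookup_predicate(phrase):
    """Return the logical predicate for a phrase, or None if it is not covered."""
    return toy_lexicon.get(phrase.lower())

print(lookup_predicate("healed"))    # is_healed
print(lookup_predicate("repaired"))  # None -- synonym outside the lexicon
```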
Another requirement of a traditional semantic parser is feature engineering for its ML component. It is a crucial step in building ML models, where input features are extracted from raw data 7. It requires domain knowledge, is usually time-consuming, and is error-prone due to human involvement. More importantly, since feature selection and extraction are based on a given dataset, these techniques are oftentimes not generalizable to other kinds of datasets. Neural SP approaches alleviate the need for manually built features by using advanced forms of representation (such as word and character embeddings) to pass input text into the model 2.
Despite the numerous advantages of neural semantic parsers, a traditional system with a well-engineered, manually built lexicon can still outperform them on a given dataset, even if it does not generalize as well. However, especially given the ease of setting up a neural SP system and its advantages, analyzing the errors made by neural semantic parsers is a significant step toward improving these systems in a clinical context. Thus, in this paper, we systematically assess the efficacy of neural SP when applied to the task of EHR QA. To identify the common errors made by such models, we use two different neural SP models in our evaluations. Also, we evaluate these models on two different clinical SP datasets to identify the usual types of errors seen in the clinical context. We also report the results of traditional SP techniques on all the evaluated datasets for a rounded evaluation. Our error analysis surfaces the most frequent types of errors made by neural models on clinical datasets and, along with our detailed discussion of how to tackle these types of errors, can serve as a starting point for future research in this domain. To our knowledge, this is the first work to systematically assess the performance of neural SP techniques for EHR QA.
2. Background
2.1. Clinical Natural Language Understanding
Data: Much work has been done on building datasets for understanding the information need of clinical questions, specifically those directed toward EHRs 3,5,8,9,10. Pampari et al. 10 take a template-based approach to building a large corpus of question-LF pairs (based on a different representation for LFs), exploiting an existing set of natural language processing (NLP) annotations from the i2b2 datasets. Though this dataset, named emrQA, is large (with around 1 million question-LF pairs), the variety of questions is limited due to templatization (as is also found in a separate systematic analysis of emrQA's machine comprehension data 11) and thus it is not representative of real-world clinical QA. Likewise, Wang et al. 12 built a large dataset of question-SQL (structured query language) pairs using SQL-based templates, where they first automatically generated a set of questions and later filtered/rephrased them through crowd-sourcing. Again, the use of templates restricts the variety of questions in this dataset to only those whose SQL query has the same underlying structure as one of the templates. Patrick & Li 3 built a dataset of clinical questions collected from the physicians and staff in an intensive care unit (ICU) setting. Later, Roberts & Demner-Fushman 8,9 designed a grammar based on standard λ-calculus to represent the meaning of clinical questions and annotated the aforementioned set of questions with their corresponding LFs. Another work used the same λ-calculus grammar to build a dataset of clinical questions annotated with their LFs in a study using Fast Healthcare Interoperability Resources (FHIR) 13. We use these two clinical question-LF datasets for our analysis, as the underlying λ-calculus grammar allows for a rich representation of the information need of questions and because several general-domain neural SP models have been developed for this kind of grammar.
Methods: Another strand of work in this domain focuses on methods for clinical natural language understanding 3,4,10,14,15. Patrick & Li 3 built a multilayer classification model to classify clinical questions into their corresponding templates using rules and traditional ML. Such a technique is severely limited to the set of templates used for building the model and does not capture the whole spectrum of information representation achievable through SP. Another work, by Ruan et al. 14, leveraged knowledge graphs to build a SP system specifically focused on statistical clinical queries in Chinese; however, their tool supports only a few statistical operations and its vocabulary is limited. Schwertner et al. 15 built a Portuguese QA system using knowledge bases where the semantic understanding engine was based on an external system. Conversely, two studies 16,17 tackled the task of shallow SP (for a dialogue system), where the main focus is on filling slots of information from a given utterance (not on mapping it to a meaning representation). In one, Neuraz et al. 16 evaluated the efficacy of expanding their existing dataset using an external machine translation system (Google Translate), while in the other, Neuraz et al. 17 experimented with different word embeddings trained on general- and biomedical-domain data. Pampari et al. 10 applied a baseline neural model to the emrQA dataset but did not conduct a thorough analysis of the errors made by neural models. Similarly, Wang et al. 12 proposed a neural model based on sequence-to-sequence networks with attention-based copying; however, they do not assess the efficacy of their technique against rule-based alternatives. Roberts & Patra 4 developed a SP pipeline using rules (to generate a candidate set of LFs) and traditional ML (to select the best candidate thereafter). In this work, we use this SP methodology as the lexicon-based system and compare its performance against the neural alternatives.
2.2. General-domain Neural Semantic Parsing
Neural SP has had much success in the general domain 2, particularly using sequence-to-sequence (Seq2Seq) models 18. Such methods for parsing input queries into their corresponding LFs can be broadly divided into three categories based on the intermediate and target meaning representations (MRs) used for training.
Direct Learning: Models in the first category directly learn to map input queries to their LFs in the target grammar. Here, the approaches do not explicitly define an intermediate meaning representation (such models do, however, use an implicit intermediate representation that is usually not insightful). E.g., Dong & Lapata 19 tackle the SP task using an encoder-decoder scheme with attention, where the query is passed as a token sequence to the encoder and a LF is generated by the decoder. They proposed a decoder that takes into account the hierarchical structure of the LFs. To overcome a common problem with these methods related to rare words, a series of works anonymize the entities with their corresponding types (using an ontology), e.g., mapping Paracetamol to its type Medicine. Specifically, Jia & Liang 20 proposed a method using attention-based copying where, alongside the normal decoding mechanism of employing a softmax over the entire vocabulary, the decoder can choose to copy words directly from the input query.
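As a rough intuition for attention-based copying, the schematic numpy sketch below (our own simplification, not the implementation of Jia & Liang; the scores and token lists are hypothetical) shows a single decoding step that scores both the output vocabulary and the input positions under a joint softmax, so a rare input word can be emitted verbatim.

```python
import numpy as np

def copy_or_generate(vocab_scores, copy_scores, vocab, input_tokens):
    """One decoding step of an attention-based copy mechanism (schematic).

    vocab_scores: unnormalized scores over the output vocabulary.
    copy_scores:  unnormalized attention scores over the input positions.
    A joint softmax over both score vectors lets the decoder either emit a
    vocabulary symbol or copy a word directly from the input query.
    """
    joint = np.concatenate([vocab_scores, copy_scores])
    probs = np.exp(joint - joint.max())
    probs /= probs.sum()
    best = int(np.argmax(probs))
    if best < len(vocab):                      # generate from the vocabulary
        return vocab[best]
    return input_tokens[best - len(vocab)]     # copy a rare word from the input

vocab = ["has_concept", "latest", "lambda", "x"]
tokens = ["is", "she", "on", "paracetamol"]
# Hypothetical scores where the copy attention on "paracetamol" dominates.
print(copy_or_generate(np.array([1.0, 0.5, 0.2, 0.1]),
                       np.array([0.0, 0.0, 0.0, 3.0]), vocab, tokens))
```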
Using Intermediate MR: The second category of models employs abstract representations as intermediate forms while parsing the input to its respective LF. This type of abstraction allows sharing between examples and helps models to better generalize on smaller datasets 2. E.g., Cheng et al. 21 build insightful intermediate representations (in natural language) using a transition system and map them to target representations thereafter. Similarly, Dong & Lapata 22 develop a COARSE2FINE model that generates a coarse intermediate representation (where granular details of the LFs are masked) that is passed through the decoder to predict the final LF (with all the fine details).
Constrained Grammar: The last category of models uses a constrained grammar that controls the decoder while generating the derivations for LFs. E.g., Krishnamurthy et al. 23 use a type-constraining grammar to guide the decoder; in other words, the grammar ensures that the decoder predicts LFs satisfying the type constraints. Similarly, the TRANX model by Yin & Neubig 24 employs the abstract syntax description language (ASDL) framework to learn a general-purpose representation in the form of an abstract syntax tree (AST). The ASTs are not passed through a decoder; instead, they are mapped to the final LFs using a deterministic user-defined function. Such an architecture relieves the model from having to learn the target grammar from inadequate training data.
Given the advantages offered by the last two categories on smaller datasets (such as clinical datasets), we choose to evaluate one model from each. Specifically, we use the COARSE2FINE (by Dong & Lapata 22) and TRANX (by Yin & Neubig 24) models for our analysis. We also modify an existing direct learning model (Jia & Liang 20) by incorporating transformers 25 and recombination techniques 20 to augment our data.
3. Materials and Methods
3.1. Data
We use two clinical QA datasets for our evaluations. Both the datasets consist of patient-specific clinical questions (that can be answered using EHR data) and their corresponding LFs based on λ-calculus.
ICUDATA: Questions in one dataset were originally collected from the staff of an intensive care unit (ICU) through interviews and observation of the clinical workflow 3. This set of questions was later annotated with meaning representations in a separate study 9. We use a set of 401 deduplicated questions from this study.
FHIRDATA: The other dataset was constructed in a study demonstrating the efficacy of an annotation tool based on Fast Healthcare Interoperability Resources (FHIR) 13 that reduces the overall LF annotation time by automating the time-consuming step of concept normalization annotation and by better aligning the normalized concepts to a specific EHR implementation 5. The questions were created (and annotated with their LFs) by a physician and a biomedical informatics doctoral student as they viewed EHR data using the constructed FHIR tool. This work produced a corpus of 1000 questions annotated with meaning representations using the same λ-calculus grammar as the ICUDATA 9. We include a deduplicated set of 980 questions in our analyses.
Logical Forms: The LFs in both datasets consist of two main elements, namely, predicates and parameters, along with the quantifier λ. Predicates are functions that are used to retrieve and manipulate event information. For instance, consider the following question (Q) and LF pair from the ICUDATA.
Q: What microorganisms have been grown?
LF: λx.has_concept(x, C2242979, visit)
Here, has_concept is a predicate in the clinical grammar for retrieving all the concepts with a given CUI (concept unique identifier) and implicit time frame (temporal information to use while retrieving the concepts from the EHRs) as parameters. The CUI is usually normalized to a standard vocabulary (such as the UMLS 26). In this case, C2242979 is a UMLS code for Microbial culture. The other parameter, visit, is the implicit time frame that restricts the retrieved concepts to the current inpatient/outpatient visit.
Other predicates are frequently applied with the aforementioned has_concept predicate to express different information needs of the natural language questions. Consider the following example.
Q: What is the volume of his urine last night?
LF: sum(λx.has_concept(x, C0042036, visit) ∧ time_within(x, ‘last night’))
In this instance, the time_within predicate, when applied in conjunction with has_concept, further filters down the retrieved concepts using the explicit temporal reference from the question, i.e., “last night”. Further, the sum predicate adds up the values of the given set of concepts and returns that information. Some other examples include latest (selects the most recent concept from a given set of concepts), is_negative (returns whether a given concept is positive or negative in the clinical context), count (returns the total number of concepts in a given set), and reason (returns the reason associated with a given concept). Though all the predicates in a LF have some direct or indirect relation to the natural language question, the outermost predicate can provide a good sense of the question and answer types. E.g., the outermost predicates for the aforementioned examples are has_concept and sum.
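To make the composition of predicates concrete, the sketch below interprets the LF above over a small in-memory list of events. The events, field names, and helper functions are our own illustration, not the grammar's reference implementation or an actual EHR query.

```python
# Hypothetical toy events; a real system would retrieve these from the EHR.
events = [
    {"cui": "C0042036", "time": "last night", "value": 300.0},
    {"cui": "C0042036", "time": "yesterday morning", "value": 250.0},
    {"cui": "C0005903", "time": "last night", "value": 37.2},
]

def has_concept(evts, cui):
    """λx.has_concept(x, cui): keep events normalized to the given UMLS CUI."""
    return [e for e in evts if e["cui"] == cui]

def time_within(evts, time_ref):
    """time_within(x, 'last night'): filter by an explicit temporal reference."""
    return [e for e in evts if e["time"] == time_ref]

def sum_values(evts):
    """sum(...): add up the numeric values of the selected concepts."""
    return sum(e["value"] for e in evts)

# sum(λx.has_concept(x, C0042036) ∧ time_within(x, 'last night'))
print(sum_values(time_within(has_concept(events, "C0042036"), "last night")))  # 300.0
```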
3.2. Preprocessing
Most neural SP models assume that the entities in input queries have already been identified 2,22,24. We thus follow suit and replace the entities in the clinical datasets, such as person references, temporal references, and measurements, by the identifiers person, temporal_ref, and measurement. The clinical entities in the questions are replaced by concept. This abstraction step ensures a proper evaluation of the semantic parser without error propagation from other tasks such as entity extraction. We exclude the implicit time frames to be consistent with the evaluated models’ LF format.
Further, we apply some textual preprocessing steps to the clinical datasets in order to conform to the input and output data requirements of the neural models used in our evaluations. Specifically, for all datasets we (i) lowercase the questions, (ii) apply the Porter stemming algorithm using NLTK, and (iii) remove all punctuation. An example from each dataset is shown in Table 1.
Table 1:
Example Qs. Q – Question. LF – Logical Form. ES – Post entity substitution. PP – Post preprocessing.
| Dataset | Example |
|---|---|
| ICUDATA | Q: Did her temperature fall below 38C? |
| | LF: delta(λx.has_concept(x, cui) ∧ less_than(x, ‘38C’)) |
| | ES: Did patient concept(C0005903) fall below measurement(‘38C’)? |
| | PP: did patient concept fall below measur |
| FHIRDATA | Q: How many times were the influenza shots given to her in the past 3 years? |
| | LF: count(λx.has_concept(x, cui) ∧ time_within(x, ‘in the past 3 years’)) |
| | ES: How many times were the concept(C0234422) given to patient temporal_ref(‘in the past 3 years’)? |
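A minimal sketch of the entity substitution and preprocessing steps, reproducing the ICUDATA example in Table 1; it assumes the entity spans have already been identified, and the helper names and simple string replacement are ours (omitting details such as keeping the CUIs alongside the placeholders).

```python
import string
from nltk.stem import PorterStemmer

def substitute_entities(question, entities):
    """Replace already-identified entity spans with placeholder identifiers."""
    for surface, placeholder in entities:
        question = question.replace(surface, placeholder)
    return question

def preprocess(question):
    """Lowercase the question, strip punctuation, and Porter-stem each token."""
    stemmer = PorterStemmer()
    cleaned = question.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stemmer.stem(token) for token in cleaned.split())

q = "Did her temperature fall below 38C?"
es = substitute_entities(q, [("her", "patient"), ("temperature", "concept"),
                             ("38C", "measurement")])
print(preprocess(es))  # did patient concept fall below measur
```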
Descriptive statistics for the datasets used in our evaluation are presented in Table 2.
Table 2:
Descriptive statistics. All the statistics are on full datasets. The statistics are calculated after preprocessing. Note that the set of unique predicates for FHIRDATA is a proper subset of the same in ICUDATA.
| Metric | ICUDATA | FHIRDATA |
|---|---|---|
| # of queries | 401 | 980 |
| # unique tokens | 157 | 191 |
| # unique predicates | 53 | 21 |
| Avg # of tokens / query | 5.18 | 5.84 |
| Avg # of preds / query | 3.64 | 3.73 |
Note that the FHIRDATA consists of a wider variety of natural language tokens than the ICUDATA (191 versus 157).
3.3. Models
We use two neural SP models based on different architectures that have been shown to be effective for small datasets, along with a transformer-based direct learning model (described below). To view the performance of the neural systems in light of a traditional lexicon-based approach, we also report the results of two traditional rule-based and ML approaches for all the evaluated datasets. We collectively refer to these traditional methods as LEXICON-BASED, as both employ some kind of lexicon to map natural language phrases from the input utterances to the logical predicates present in the λ-calculus.
3.3.1. TRANX
A high-level architecture of this system is presented in Figure 1a. The main component of this model is a transition system that converts a given query to its corresponding abstract syntax tree (AST), used as a general-purpose meaning representation 24. The transition system employs a domain-specific grammar, defined by the user under the abstract syntax description language (ASDL) framework, to generate ASTs. After a candidate AST is selected by the neural network (an encoder-decoder framework), it is converted to a domain-specific meaning representation (in our case, λ-calculus) in a deterministic manner. We make minimal modifications to the existing ASDL grammar 27 to accommodate both clinical datasets. Specifically, we add a few constructors to the grammar to cover clinical domain-specific constructs such as Dose and Reason (corresponding to logical predicates present in the clinical lexicons, such as dose and reason). We further define additional mappings in the grammar for predicates such as has_concept and time_within.
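The final AST-to-LF step is deterministic; the sketch below renders a toy nested-tuple AST into a λ-calculus string. This is a minimal illustration under our own AST encoding, not TRANX's actual ASDL machinery.

```python
def ast_to_lf(node):
    """Deterministically render a toy AST, encoded as nested tuples of the
    form (constructor, *children), into a λ-calculus string."""
    if isinstance(node, str):                  # terminal: variable, CUI, or literal
        return node
    head, *children = node
    if head == "Lambda":                       # (Lambda, var, body) -> λvar.body
        var, body = children
        return f"λ{var}.{ast_to_lf(body)}"
    if head == "And":                          # conjunction of constraints
        return " ∧ ".join(ast_to_lf(c) for c in children)
    args = ", ".join(ast_to_lf(c) for c in children)
    return f"{head}({args})"                   # generic predicate application

ast = ("sum", ("Lambda", "x",
               ("And", ("has_concept", "x", "C0042036"),
                       ("time_within", "x", "'last night'"))))
print(ast_to_lf(ast))
# sum(λx.has_concept(x, C0042036) ∧ time_within(x, 'last night'))
```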
Figure 1.
High-level architectures of the evaluated models.
3.3.2. COARSE2FINE
The workflow of this system is shown in Figure 1b. The approach is broadly divided into two parts, where a series of encoders and decoders are employed to construct intermediate representations 22. In the earlier part, the input is encoded (using the input encoder) into a representation that is used to predict a sketch sequence (using the sketch decoder). A meaning sketch generated by this model is an abstraction over the LFs where more specific (or fine) details of a LF, such as variable names and arguments, are omitted. Thus, a sketch is a coarser representation of the LF and is somewhat easier to predict (because of fewer details). In the later phase, the model uses this sketch (through the sketch encoder) together with the input query representation to predict a LF (using the sketch-guided output decoder). The decoders in both phases use an attention mechanism to better model the alignments.
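For intuition, a coarse sketch can be thought of as the LF with its fine details masked out. The regex-based masking below is a rough approximation we wrote for illustration, not the COARSE2FINE preprocessing code, and the placeholder tokens (@cui, @literal) are hypothetical.

```python
import re

def coarse_sketch(lf):
    """Mask the fine details (CUIs and quoted literals) of a logical form,
    leaving only its predicate skeleton -- the kind of abstraction the
    sketch decoder is asked to predict first."""
    lf = re.sub(r"'[^']*'", "@literal", lf)   # temporal/measurement literals
    lf = re.sub(r"\bC\d+\b", "@cui", lf)      # UMLS concept identifiers
    return lf

lf = "sum(λx.has_concept(x, C0042036) ∧ time_within(x, 'last night'))"
print(coarse_sketch(lf))
# sum(λx.has_concept(x, @cui) ∧ time_within(x, @literal))
```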
3.3.3. TRANSFORMER
We also apply transformer models 25 to our SP task (Figure 1c). The model architecture is inspired by Jia & Liang 20, who used recurrent neural networks (RNNs) for encoding and decoding. In contrast, we explore transformers 25 for both the encoder and decoder. While RNNs model sequential information in text very well, transformer models have been shown to learn bidirectional representations that take into account textual dependencies from both directions. One caveat with these powerful models is that they require large amounts of data in order to produce good results. To overcome this, we also apply data recombination techniques 20 to expand the size of our datasets. Specifically, we use 3 types of augmentation, namely, entity abstraction (replace entities in questions with other similar entities to create new questions), whole phrase abstraction (replace key phrases in the dataset questions that represent similar meanings), and concatenation (combine different questions to create longer, harder queries for the model).
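Two of these augmentations are sketched below in simplified form; the helper names, the example pairs, and the and(...) wrapper used to join LFs are our own shorthand, whereas the actual recombination of Jia & Liang is grammar-driven.

```python
def entity_abstraction(example, old_cui, new_cui):
    """Create a new example by swapping one concept identifier for a similar one
    in both the question and its logical form."""
    question, lf = example
    return question.replace(old_cui, new_cui), lf.replace(old_cui, new_cui)

def concatenation(ex1, ex2):
    """Combine two question/LF pairs into one longer, harder training example."""
    return f"{ex1[0]} and {ex2[0]}", f"and({ex1[1]}, {ex2[1]})"

ex1 = ("what is the volume of his urine concept(C0042036)",
       "sum(λx.has_concept(x, C0042036))")
ex2 = ("did patient concept(C0005903) fall below measurement",
       "delta(λx.has_concept(x, C0005903) ∧ less_than(x))")

print(entity_abstraction(ex1, "C0042036", "C0005903"))
print(concatenation(ex1, ex2))
```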
3.3.4. LEXICON-BASED
An overall architecture of the models used in our evaluations is given in Figure 1d. Both systems use some kind of candidate generator to produce a set of candidate LFs (or representations that can be trivially converted to LFs). Then, an ML model is used to rank the candidates, from which the top-ranked item is returned as the predicted LF. The method used for the clinical datasets employs a rule-based pipeline to generate a set of candidate LFs 4. Then, a support vector machine (SVM) is employed to select the best LF based on features extracted from the input utterance and the generated candidates.
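A highly simplified sketch of the candidate-ranking step is shown below; the toy feature vectors, training pairs, and candidate LFs are invented for illustration, and the actual system of Roberts & Patra uses a rule-generated candidate set with much richer features.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy training data: each row describes one (question, candidate LF) pair,
# e.g., lexicon-match overlap and candidate length; the label marks whether
# that candidate was the gold LF.
X_train = np.array([[0.9, 3], [0.2, 5], [0.8, 4], [0.1, 2], [0.7, 3], [0.3, 6]])
y_train = np.array([1, 0, 1, 0, 1, 0])
ranker = LinearSVC(C=1.0).fit(X_train, y_train)

# At parse time, rule-based generation yields several candidate LFs for one
# question; the SVM's decision score selects the best one.
candidates = ["latest(λx.has_concept(x))", "λx.has_concept(x)",
              "count(λx.has_concept(x))"]
cand_features = np.array([[0.85, 3], [0.4, 2], [0.2, 3]])
scores = ranker.decision_function(cand_features)
print(candidates[int(np.argmax(scores))])  # candidate with the highest score
```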
3.4. Evaluation
The models are trained on a subset of the dataset (the train set), while their performance during training is monitored on a development set for early stopping. Finally, the selected models (after early stopping) are evaluated on a test set to obtain the reported performance. We report the accuracy of the SP models, which measures the fraction of input queries for which the predicted LF exactly matches the gold-annotated LF.
3.4.1. Cross-validation
For both clinical datasets, the train, dev, and test splits are not predefined. Thus, for the neural models, we use a 10-fold cross-validation (CV) scheme to evaluate performance. Specifically, we first split each clinical dataset into 10 equal (or nearly equal) parts, or folds. Then, each fold is treated as the test set in turn, while the model is trained on 8 of the remaining folds and the last fold is kept as the development set. We report the average performance of the models after considering each fold as the test set exactly once, i.e., running the model 10 times. In the case of the LEXICON-BASED model, we report the results using a leave-one-out CV (LOOV) evaluation scheme. LOOV is a special (more extensive) type of CV where the number of folds is equal to the total number of instances in a dataset. Precisely, each instance in the dataset is kept aside for testing exactly once while the model is trained on all the remaining instances.
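For reference, a minimal sketch of the exact-match accuracy metric and the 10-fold split (8 train folds, 1 dev fold, 1 test fold) using scikit-learn's KFold; the random seed and the choice of the next fold as the dev set are our own arbitrary conventions.

```python
from sklearn.model_selection import KFold

def exact_match_accuracy(predicted_lfs, gold_lfs):
    """Fraction of queries whose predicted LF string exactly matches the gold LF."""
    correct = sum(p == g for p, g in zip(predicted_lfs, gold_lfs))
    return correct / len(gold_lfs)

def ten_fold_splits(examples, seed=13):
    """Yield (train, dev, test) index lists: each fold serves as the test set
    once, one of the remaining folds as the dev set, and the other 8 as train."""
    folds = [test_idx for _, test_idx in
             KFold(n_splits=10, shuffle=True, random_state=seed).split(examples)]
    for i in range(10):
        dev = (i + 1) % 10
        train_idx = [j for k, fold in enumerate(folds)
                     if k not in (i, dev) for j in fold]
        yield train_idx, list(folds[dev]), list(folds[i])
```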
3.4.2. Cross-dataset
Alongside the CV evaluation for the clinical datasets, we also experiment with training the neural models in a cross-dataset (CD) fashion. Here, a model is trained on one clinical dataset and its performance is evaluated on the other. Particularly, we use the best-performing models from our CV evaluation for each dataset and apply them to the other clinical dataset, considering the whole target dataset as a test set.
3.5. Experimental setting
We use the recommended set of hyperparameters from the TRANX and COARSE2FINE papers. For TRANX, the hidden vector size is 256; the word vector dimension is 128; the learning rate is 0.0025; the maximum number of epochs is 100 with early stopping using a training patience of 5. For COARSE2FINE, the hidden vector size is 300; the word vector dimension is 150; the learning rate is 0.005; the maximum number of epochs is 100, validating every 10 epochs for early stopping. For both models, the dropout rate is 0.5 and the learning rate decay is 0.985. The word embeddings are initialized using GloVe 28. For the TRANSFORMER model, the number of examples added using recombination is 1800 for train and 300 for dev; the number of epochs is 200; the learning rate is 5e-4.
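For convenience, the same settings can be summarized in a small configuration; the values are transcribed from the text above and the key names are our own, not those of the original toolkits.

```python
# Hyperparameters as reported above; key names are ours, not the toolkits'.
HYPERPARAMS = {
    "tranx":       {"hidden_size": 256, "word_dim": 128, "lr": 0.0025,
                    "max_epochs": 100, "patience": 5,
                    "dropout": 0.5, "lr_decay": 0.985},
    "coarse2fine": {"hidden_size": 300, "word_dim": 150, "lr": 0.005,
                    "max_epochs": 100, "validate_every": 10,
                    "dropout": 0.5, "lr_decay": 0.985},
    "transformer": {"recombined_train": 1800, "recombined_dev": 300,
                    "epochs": 200, "lr": 5e-4},
}
```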
4. Results
The performance of the evaluated models is presented in Table 3. The LEXICON-BASED model performed the best in nearly all settings (the one exception being the cross-dataset evaluation on ICUDATA, discussed below), highlighting the importance of the domain-specific “manual” fine-tuning that goes into constructing it. In the CV setting, the COARSE2FINE model performed consistently better than the other two neural models. While the performance of all the models on the ICUDATA is promising, their accuracies on the FHIRDATA are much higher (about 15 points better for the neural models).
Table 3:
Accuracy of the evaluated models in the cross-validation (CV) and cross-dataset (CD) settings. For the CD setting, the column indicates the test dataset.
| Model | CV: ICUDATA | CV: FHIRDATA | CD: ICUDATA | CD: FHIRDATA |
|---|---|---|---|---|
| TRANX | 77.5 | 93.1 | 71.8 | 67.7 |
| COARSE2FINE | 80.5 | 94.7 | 72.8 | 66.2 |
| TRANSFORMER | 79.3 | 93.9 | 74.8 | 45.9 |
| LEXICON-BASED | 86.8 | 97.6 | 74.6 | 71.7 |
In the CD setting, the performance of all the models on the target datasets dropped after training on a different clinical dataset. Note that the reduction in accuracy for the FHIRDATA (more than 25 points) is larger than that for the ICUDATA (around 4-7 points for the neural models). Contrary to the CV results, all the models performed better on the ICUDATA than on the FHIRDATA, and the TRANSFORMER model achieved the best score on ICUDATA while the LEXICON-BASED model performed the best on FHIRDATA. Interestingly, the TRANSFORMER model did not hold up well in the cross-dataset evaluation on FHIRDATA. Also, the LEXICON-BASED model suffered the largest drop in the CD setting on ICUDATA.
The errors made by all the models for a given dataset and evaluation method may be representative of the typical errors made by such models in these configurations (dataset and evaluation). Thus, we report the number of questions in each corpus, broken down by which models predicted them correctly, in Table 4. The example in the first row shows a question where the LEXICON-BASED model correctly predicted the LF but all the other models failed. Here, the reason may be domain-specific artifacts of the data (such as fistula closed being translated to the is_healed predicate) that are incorporated into the lexicon built for these models. On the other hand, referring to the second example under FHIRDATA (asking about the MMR vaccine start date), the complex syntactic information in the question was not properly parsed by the LEXICON-BASED system. We further analyze these errors and present some of the prominent types of errors made by the MR models in Table 5. We note that the most frequently mispredicted variants for both datasets include is_∗ and λx.∗, followed by sum for ICUDATA and reason for FHIRDATA. Interestingly, among the few instances of count in the ICUDATA, both models failed to predict correctly in the case of CV, whereas in the cross-dataset evaluation no such errors were made. Similarly, the LFs starting with earliest in the FHIRDATA were all incorrectly predicted by both models in the cross-dataset evaluation, while this was not the case in CV.
Table 4:
Error analysis. MR – models using intermediate meaning representations (TRANX and COARSE2FINE). TF – TRANSFORMER. LB – LEXICON-BASED. ✓/✗ – whether the model’s prediction was correct. Q – Question. L – Simplified logical form for brevity.
| Corpus | MR | TF | LB | Example | Count |
|---|---|---|---|---|---|
| ICUDATA | ✗ | ✗ | ✓ | Q: Has the fistula closed? L: delta(λx.has_concept(x) ∧ is_healed(x)) | 18 |
| | ✓ | ✓ | ✗ | Q: How much sedation is she on? L: latest(λx.has_concept(x)) | 7 |
| | ✓ | ✗ | ✗ | Q: How did they confirm H1N1? L: latest(λx.has_concept(x)) | 2 |
| | ✗ | ✓ | ✗ | Q: How was the neurological state overnight? L: λx.has_concept(x) ∧ time_within(x) | 5 |
| | ✗ | ✗ | ✗ | Q: Has his culture grown any microorganisms? L: is_positive(latest(λx.has_concept(x))) | 32 |
| FHIRDATA | ✗ | ✗ | ✓ | Q: What part had laceration? L: location(latest(λx.has_concept(x))) | 17 |
| | ✓ | ✓ | ✗ | Q: When was the MMR started? L: time(earliest(λx.has_concept(x))) | 8 |
| | ✓ | ✗ | ✗ | Q: How much did he weigh in the last measurement? L: latest(λx.has_concept(x)) | 3 |
| | ✗ | ✓ | ✗ | – | 0 |
| | ✗ | ✗ | ✗ | Q: What about her animal dander allergy? L: λx.has_concept(x) | 8 |
Table 5:
Incorrect predictions made by both the TRANX and COARSE2FINE models under the different evaluation strategies. Variant refers to the questions whose logical form has the given predicate/quantifier at the outermost position. Count – total number of questions of that variant. CV – errors under cross-validation. CD – errors under cross-dataset evaluation. Q – Question. L – Simplified logical form excluding some parameters for brevity.
| Corpus | Variant | Example | Count | CV | CD |
|---|---|---|---|---|---|
| ICUDATA | is_∗ | Q: Was the blood culture negative? L: is_negative(latest(λx.has_concept(x))) | 20 | 11 | 19 |
| | λx.∗ | Q: What were the drain outputs for the past 12 hours? L: λx.has_concept(x) ∧ time_within(x) | 19 | 10 | 16 |
| | sum | Q: What is the volume of his urine last night? L: sum(λx.has_concept(x) ∧ time_within(x)) | 33 | 5 | all |
| | count | Q: How often were they bleeding? L: count(λx.has_concept(x)) | 3 | all | none |
| FHIRDATA | is_∗ | Q: Are his triglycerides getting low? L: is_decreasing(λx.has_concept(x)) | 28 | 5 | 27 |
| | λx.∗ | Q: What is her vaccination history for polio? L: λx.has_concept(x) | 9 | 7 | 4 |
| | reason | Q: What was the recent office visit for? L: reason(latest(λx.has_concept(x))) | 29 | 2 | all |
| | earliest | Q: What are the details of her first allergy screening? L: earliest(λx.has_concept(x)) | 5 | none | all |
5. Discussion
In the case of CV, the neural models achieve a lower accuracy on the ICUDATA than on the FHIRDATA (see Table 3). This can be related to the higher diversity of logical predicates present in the ICUDATA. Specifically, the ICUDATA contains a total of 53 unique logical predicates while the FHIRDATA has 21 distinct predicates (refer to Table 2). This difference may have led to improved performance in the case of the FHIRDATA, as the neural models had to generalize over a smaller number of unique predicates. The models performed better on the FHIRDATA despite the fact that it contains a higher number of unique tokens, i.e., it is more diverse in terms of natural language words. However, the pre-trained word embeddings (GloVe 28) may have helped the models in interpreting the different natural language tokens and thus reduced any performance differences that may have occurred because of this.
On the other hand, the performance gap between the LEXICON-BASED system and the neural models was more consistent across the two clinical datasets (with a somewhat larger advantage on the ICUDATA). The reason for this consistency may be rooted in the rule-based backbone of these methods, which uses a hand-built lexicon to generate a set of candidate LFs. The higher the number of questions with a correct LF among the generated candidates (i.e., the higher the coverage), the higher the chance that the final ML model predicts correctly. Thus, the quality of the lexicon impacts the overall performance of such semantic parsers, more so than the variety of logical predicates present in a dataset. The LEXICON-BASED methods achieved the best CV scores on both datasets. Also, the COARSE2FINE model performed consistently better than the TRANX model on both datasets.
The cross-dataset results largely maintain the same relative performance of the models on the clinical datasets (see Table 3). However, all the models suffered a performance drop when trained on one dataset and tested on the other. Such a drop in performance was also seen in the task of machine comprehension when various transformer-based deep learning models were tested in a cross-dataset fashion (i.e., fine-tuned on one clinical dataset and tested on another) 29. This observed performance difference highlights the differences between the two evaluated clinical SP datasets in terms of the variety of logical predicates and natural language tokens. This, in turn, also suggests the need for a generalizable clinical SP dataset.
Some of the most common errors made by both neural models in each evaluation setting were related to questions with a LF starting with is_∗. Since each specific predicate of this form has few instances in the entire dataset, the models could not figure out the correct natural language terms to be mapped to these predicates. For most of these cases, the models incorrectly predicted a LF starting with delta (a predicate that checks if a given set is non-empty) or, in other cases, one beginning with a different is_∗ predicate. Note that the answer type for both varieties of predicates (is_∗ and delta) is the same, i.e., yes or no. So the models were able to map the questions to an appropriate answer type but failed to interpret the correct information need of a given question in such cases.
Interestingly, for the questions with a LF having λx at the outermost position, we found that almost all the incorrect predictions resulted in a LF starting with latest (a predicate that returns the most recent concept from a given set). Here, even the answer types were wrongly interpreted by the neural models, in that a LF that returns a single concept was generated instead of one that returns a set. We hypothesize that this could be because of the similarity of questions whose LFs have λx or latest at the outermost position (see the examples for λx.∗ and earliest in Table 5).
Not surprisingly, the most common types of errors found during the cross-dataset evaluation were related to the presence of certain predicates in the training set. Consider, for example, the questions with LFs starting with sum for the ICUDATA and starting with reason or earliest for the FHIRDATA. In all these cases, the models failed to predict even a single instance correctly. The common factor in these cases is that the predicate at hand was either absent from the training set or present only a handful of times. For the examples shown in Table 5, sum is not present at all in the FHIRDATA, while there is only one question for each of the other variants, reason and earliest, in the ICUDATA. We confirmed that the ICUDATA folds used for training the model employed in the cross-dataset evaluation contained these single instances of the two aforementioned variants (reason and earliest). Also, for the variant count, there are only 3 examples in the ICUDATA to learn from, while there are a total of 111 instances in the FHIRDATA. Again, interestingly, the models performed poorly on this variant in the cross-validation setting while they predicted all its examples correctly in the cross-dataset setting. Hence, though the neural models can be used without laboriously built lexicons, the variety and availability of the different kinds of predicates in the training set play an important role in their overall performance.
To keep the scope of this study consistent with the SP task as tackled by the evaluated neural models, we took several steps (Section 3.2) to adapt our clinical SP task. Unlike in the general domain, it is much harder to extract and normalize clinical concepts for many reasons, such as the limited availability of annotated datasets (due to concerns such as privacy) and the differences in EHR standards and physician preferences 30. Also, the implicit time frames play an important role in interpreting and retrieving the exact information need of a given question. Though the implicit time frame can be classified separately and added to the resulting LFs from these models, it is important to note that such a classifier may add another layer of error in the process of generating an extended LF. In the future, we plan to stack all the key components involved in a clinical SP system, going from raw question to LF, and evaluate the impact of each step on the overall performance.
In this evaluation, we use GloVe word representations to feed questions into the neural models, following the recommendations from the models’ original papers. Such vector representations are based on co-occurrence statistics between words and have been used extensively in the research community. Recently, language modeling techniques such as BERT 31 have been shown to work well on other types of clinical QA, such as machine comprehension 29, as well as on other clinical NLP tasks, such as concept extraction 30. In the future, we aim to apply such techniques, which capture the nuances of natural language via transfer learning, to the task of clinical SP.
6. Conclusion
We applied different neural semantic parsers to the task of EHR QA to assess their performance relative to rule-based alternatives. The performance of neural methods is promising given their ease of application and generalizability. The performance disparity of the neural-based methods between different clinical datasets can be attributed to the diversity present in different corpora. To understand the common sources of errors made by such neural models, we conducted a thorough error analysis using the incorrect predictions from all the neural models.
Acknowledgments This work was supported by the U.S. National Library of Medicine, National Institutes of Health (NIH), (R00LM012104); the National Institute of Biomedical Imaging and Bioengineering (NIBIB: R21EB029575); the Cancer Prevention and Research Institute of Texas (CPRIT RP170668); and the UTHealth Innovation for Cancer Prevention Research Training Program Predoctoral Fellowship (CPRIT RP210042).
References
- [1]. Roman LC, Ancker JS, Johnson SB, et al. Navigation in the Electronic Health Record: A Review of the Safety and Usability Literature. Journal of Biomedical Informatics. 2017;67:69–79. doi: 10.1016/j.jbi.2017.01.005.
- [2]. Kamath A, Das R. A Survey on Semantic Parsing. Automated Knowledge Base Construction. 2019.
- [3]. Patrick J, Li M. An Ontology for Clinical Questions about the Contents of Patient Notes. Journal of Biomedical Informatics. 2012;45:292–306. doi: 10.1016/j.jbi.2011.11.008.
- [4]. Roberts K, Patra BG. A Semantic Parsing Method for Mapping Clinical Questions to Logical Forms. AMIA Annual Symposium Proceedings. 2017;2017:1478–1487.
- [5]. Soni S, Gudala M, Wang DZ, et al. Using FHIR to Construct a Corpus of Clinical Questions Annotated with Logical Forms and Answers. AMIA Annual Symposium Proceedings. 2019;2019:1207–1215.
- [6]. Kalyan KS, Sangeetha S. SECNLP: A Survey of Embeddings in Clinical Natural Language Processing. Journal of Biomedical Informatics. 2020;101:103323. doi: 10.1016/j.jbi.2019.103323.
- [7]. Zheng A, Casari A. Feature Engineering for Machine Learning. 1st ed. 2018.
- [8]. Roberts K, Demner-Fushman D. Toward a Natural Language Interface for EHR Questions. AMIA Joint Summits on Translational Science Proceedings. 2015;2015:157–161.
- [9]. Roberts K, Demner-Fushman D. Annotating Logical Forms for EHR Questions. LREC. 2016. pp. 3772–3778.
- [10]. Pampari A, Raghavan P, Liang J, et al. emrQA: A Large Corpus for Question Answering on Electronic Medical Records. EMNLP. 2018. pp. 2357–2368.
- [11]. Yue X, Jimenez Gutierrez B, Sun H. Clinical Reading Comprehension: A Thorough Analysis of the emrQA Dataset. ACL. 2020. pp. 4474–4486.
- [12]. Wang P, Shi T, Reddy CK. Text-to-SQL Generation for Question Answering on Electronic Medical Records. Proceedings of The Web Conference. 2020. pp. 350–361.
- [13]. Health Level Seven International. Welcome to FHIR. Available from: https://www.hl7.org/fhir/
- [14]. Ruan T, Huang Y, Liu X, et al. QAnalysis: A Question-Answer Driven Analytic Tool on Knowledge Graphs for Leveraging Electronic Medical Records for Clinical Research. BMC Medical Informatics and Decision Making. 2019;19:82. doi: 10.1186/s12911-019-0798-8.
- [15]. Schwertner MA, Rigo SJ, Araujo DA, et al. Fostering Natural Language Question Answering Over Knowledge Bases in Oncology EHR. IEEE Computer-Based Medical Systems. 2019. pp. 501–506.
- [16]. Neuraz A, Llanos LC, Burgun A, et al. Natural Language Understanding for Task Oriented Dialog in the Biomedical Domain in a Low Resources Context. arXiv:1811.09417. 2018.
- [17]. Neuraz A, Rance B, Garcelon N, et al. The Impact of Specialized Corpora for Word Embeddings in Natural Language Understanding. Studies in Health Technology and Informatics. 2020;270:432–436. doi: 10.3233/SHTI200197.
- [18]. Sutskever I, Vinyals O, Le QV. Sequence to Sequence Learning with Neural Networks. NeurIPS. 2014. pp. 3104–3112.
- [19]. Dong L, Lapata M. Language to Logical Form with Neural Attention. ACL. 2016. pp. 33–43.
- [20]. Jia R, Liang P. Data Recombination for Neural Semantic Parsing. ACL. 2016. pp. 12–22.
- [21]. Cheng J, Reddy S, Saraswat V, et al. Learning Structured Natural Language Representations for Semantic Parsing. ACL. 2017. pp. 44–55.
- [22]. Dong L, Lapata M. Coarse-to-Fine Decoding for Neural Semantic Parsing. ACL. 2018. pp. 731–742.
- [23]. Krishnamurthy J, Dasigi P, Gardner M. Neural Semantic Parsing with Type Constraints for Semi-Structured Tables. EMNLP. 2017. pp. 1516–1526.
- [24]. Yin P, Neubig G. TRANX: A Transition-Based Neural Abstract Syntax Parser for Semantic Parsing and Code Generation. EMNLP: System Demonstrations. 2018. pp. 7–12.
- [25]. Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need. NeurIPS. 2017. pp. 5998–6008.
- [26]. Lindberg DAB, Humphreys BL, McCray AT. The Unified Medical Language System. Methods of Information in Medicine. 1993;32:281–291. doi: 10.1055/s-0038-1634945.
- [27]. Rabinovich M, Stern M, Klein D. Abstract Syntax Networks for Code Generation and Semantic Parsing. ACL. 2017. pp. 1139–1149.
- [28]. Pennington J, Socher R, Manning C. GloVe: Global Vectors for Word Representation. EMNLP. 2014. pp. 1532–1543.
- [29]. Soni S, Roberts K. Evaluation of Dataset Selection for Pre-Training and Fine-Tuning Transformer Language Models for Clinical Question Answering. LREC. 2020. pp. 5534–5540.
- [30]. Fu S, Chen D, He H, et al. Clinical Concept Extraction: A Methodology Review. Journal of Biomedical Informatics. 2020:103526.
- [31]. Devlin J, Chang MW, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT. 2019. pp. 4171–4186.

