Published in final edited form as: Proc IEEE Int Conf Big Data. 2020 Feb 24;2019:2800–2807. doi: 10.1109/bigdata47090.2019.9006241

Enhancing Clinical Information Retrieval through Context-Aware Queries and Indices

Andrew Wen 1, Yanshan Wang 2, Vinod C Kaggal 3, Sijia Liu 4, Hongfang Liu 5, Jungwei Fan 6
PMCID: PMC10782810  NIHMSID: NIHMS1955973  PMID: 38213777

Abstract

The big data revolution has created a hefty demand for searching large-scale electronic health records (EHRs) to support clinical practice, research, and administration. Despite the volume of data involved, fast and accurate identification of clinical narratives pertinent to a clinical case being seen by any given provider is crucial for decision-making at the point of care. In the general domain, this capability is accomplished through a combination of the inverted index data structure, horizontal scaling, and information retrieval (IR) scoring algorithms. These technologies are also being used in the clinical domain, but have met with limited success, particularly as clinical cases become more complex. One barrier affecting clinical performance is that contextual information, such as negation, temporality, and the subject of clinical mentions, impacts clinical relevance but is not considered in general IR methodologies. In this study, we implemented a solution by identifying and incorporating the aforementioned semantic contexts as part of IR indexing/scoring with Elasticsearch. Experiments were conducted in comparison to baseline approaches with respect to: 1) evaluation of the impact on the quality (relevance) of the returned results, and 2) evaluation of the impact on execution time and storage requirements. The results showed a 5.1–23.1% improvement in retrieval quality, along with a 35% faster query execution time. Cost-wise, the solution required 1.5–2 times more storage space and about a 3-fold increase in indexing time. The higher relevance demonstrated the merit of incorporating contextual information into clinical IR, and the near-constant increase in time and space suggested promising scalability.

Keywords: Electronic Health Records, EHR, Information Retrieval, Clinical Information Retrieval, Elasticsearch

I. Introduction

Big data in the digital health era has revived interest in the identification and prompt retrieval of past clinical cases from electronic health records so as to provide insights to support clinical decision making at the point of care [1], notably for the development of clinical decision support systems and in support of personalized medicine. It is estimated that up to 80% of the clinical information generated in practice is contained within clinical narratives [2]. The capability to retrieve relevant clinical cases from clinical narratives has therefore become essential in any mature EHR-centered digital infrastructure. Given that these information needs typically occur at the point of care, any such retrieval mechanism must be responsive, a significant challenge in the context of healthcare big data.

In non-clinical domains, search engines based on information retrieval (IR) techniques have been evolving and delivering impressive results. Unlike the general domain, however, clinical information retrieval must incorporate contextual information into determination of relevance. Specifically, negation (“patient was ruled out for …”), temporality (“patient had … three years ago”), certainty (“this presentation might be …”), and subject (“mother had …”), are all important in clinical contexts and affect whether a given match is actually relevant to the input search. General IR approaches often are not equipped with the capacity to take contextual information into account when determining relevance. For example, when searching for patients with diabetes, one would want to avoid matching a family member mentioned as the subject of having diabetes or patients explicitly mentioned as having no diabetes. This suggests that context-aware relevance scoring should provide a benefit to retrieval performance, as supported by our prior findings [3].

In this study, we explored the enhancement of an open source IR engine in an operational clinical big data environment by incorporating several clinically relevant types of contextual information into the document scoring process: temporality, certainty, subject, and negation. Specifically, we aimed to achieve this objective while assessing the two critical factors impacting operational use of a search engine: retrieval performance (i.e., the relevance of the returned results) and scalability (i.e., runtime performance as data grows larger). Further, we aimed to achieve this in a user-friendly manner by implementing the context support such that queries formulated in natural language continue to be supported.

II. Background

A. Incorporating Contextual Information into Information Retrieval Systems

One of the most popular open source IR frameworks available today is Elasticsearch, which provides a scalable wrapper around, and adds functionality on top of, Apache Lucene [4]. Apache Lucene leverages inverted indices [5], [6], a data structure foundational to its ability to quickly return results from a large document corpus. Instead of running a search on each individual document in the index, a mapping from each token (in most cases, a normalized word) to the set of documents containing it is used, allowing for responsive retrievals from large document corpora (a minimal sketch of this structure is given after the list below). Retrieving results for a given query can then generally be described as the following process:

  1. Tokenization of the query into a sequence of tokens

  2. Combination of the document sets corresponding to each token

  3. Scoring the documents for relevance ranking purposes; the ranked documents are then returned as the results
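As a minimal illustration of this structure and process (an assumed toy sketch in Python, not the actual Lucene implementation, which additionally stores term frequencies and positions), consider:

```python
from collections import defaultdict

def tokenize(text):
    # Simplified normalization: lowercase and split on whitespace
    return text.lower().split()

def build_inverted_index(docs):
    # Map each token to the set of document IDs containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def retrieve(query, index):
    # 1) tokenize the query, 2) combine the per-token document sets,
    # 3) score the candidates (here, a trivial count of matching query tokens)
    tokens = tokenize(query)
    candidates = set().union(*(index[t] for t in tokens if t in index))
    scores = {d: sum(1 for t in tokens if d in index.get(t, set())) for d in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {1: "patient denies diabetes", 2: "history of diabetes mellitus"}
print(retrieve("diabetes history", build_inverted_index(docs)))  # [(2, 2), (1, 1)]
```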

The scoring algorithm in use varies, but typically involves some combination of term features for each of the individual tokens in the document vs. the overall corpus, such as the well-known TF-IDF algorithm [7] and its successor, Okapi BM25 [8]. By default, Elasticsearch operates using the Okapi BM25 relevance scoring algorithm.

A key limitation of the typical inverted index approach is that scoring is done as an aggregation of per-token document scores. Specifically, under the inverted index schema the original document is generally not accessible at query execution time; instead, a mapping of the individual tokens in the document (or query), along with their frequencies and relative positions, is stored. While there is a mechanism built into Lucene for the incorporation of positional information (e.g., distance between individual tokens and ordering in the source query vs. the target document), there is limited accommodation for contextual information, as it is neither prepared nor made available in the index for relevance scoring. A naïve approach to addressing this issue would be to forgo the inverted index for retrieval and instead perform a document scan, which would make the whole document (and presumably any context feature that could then be derived) accessible at query time. It is, however, difficult to justify moving away from an inverted index structure, as it offers logarithmically scaling retrieval time complexity compared to the linearly scaling time complexity required for a document-by-document scan. This gap becomes even more evident as the number of records grows much larger.

Another approach to addressing this problem is to formulate semi-structured queries, using Boolean operators such as AND, OR, and NOT to encode some of the contextual logic into the scoring process. For instance, a query for a positive personal history of diabetes could be formulated as

(diabetes AND NOT (“no diabetes” OR “neg hx diabetes” OR …)).

This is, however, neither intuitive nor easy to use. Additionally, there are many differing indicators of clinically relevant contextual information, and they can be widely separated from the target term. For instance, the example presented above would fail to exclude diabetes in the context of “neg hx stroke and diabetes”. Formulation of such a Boolean query would therefore involve many Boolean clauses and be time-consuming to create and execute.

In summary, any clinical information retrieval system that incorporates semantically relevant contextual information has the following requirements:

  1. For Performance: The inverted index structure must be used. By extension, any semantic information must be stored at a token level within the document index and retrievable without examination of the full text around a given matching word.

  2. For Ease of Use: Natural language queries should be allowed and parsed for semantic information, which must also be available and taken into account during scoring.

B. Determining Clinical Contexts from Unstructured Text

With respect to determination of the clinical contexts themselves, the NegEx algorithm has been established to have high performance for negation detection in the clinical domain [9]. Briefly, NegEx operates as a combination of start and stop triggers; start triggers denote that all words contained within a span of text (which can extend to either the left or right of the trigger, depending on the specific trigger) are negated, up until the point where either a terminal trigger or a sentence boundary is encountered. For example, one might encounter the following sentence fragment in a clinical narrative:

“… [ruled out] iron deficiency and anemia, [but] hypothyroidism is still a possibility”.

Here, using the NegEx algorithm, “ruled out” would be a right-extending trigger, and “but” would be a terminal trigger, thus, the entire span in between, “iron deficiency and anemia,”, would be marked as negated.
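To make the trigger-and-scope behavior concrete, the following toy sketch (our own simplified approximation with a deliberately tiny, hypothetical trigger list, not the published NegEx implementation) marks every token between a right-extending negation trigger and the next terminal trigger or sentence end as negated:

```python
NEG_TRIGGERS = {"ruled out", "no", "denies"}   # right-extending negation triggers (toy list)
TERMINAL_TRIGGERS = {"but", "however"}         # triggers that terminate a negation scope

def mark_negated(sentence):
    tokens = sentence.lower().replace(",", "").split()
    negated = [False] * len(tokens)
    i = 0
    while i < len(tokens):
        bigram = " ".join(tokens[i:i + 2])
        if bigram in NEG_TRIGGERS or tokens[i] in NEG_TRIGGERS:
            j = i + (2 if bigram in NEG_TRIGGERS else 1)
            # Negate everything up to a terminal trigger or the end of the sentence
            while j < len(tokens) and tokens[j] not in TERMINAL_TRIGGERS:
                negated[j] = True
                j += 1
            i = j
        else:
            i += 1
    return list(zip(tokens, negated))

print(mark_negated("ruled out iron deficiency and anemia, but hypothyroidism is still a possibility"))
# "iron", "deficiency", "and", "anemia" are marked negated; the remaining tokens are not
```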

A successor algorithm, ConText [10], extended NegEx by supporting the identification of temporality, certainty, and experiencer. Accordingly, the defined triggers are also expanded to include historical, experiencer, and certainty related triggers. Note that it is possible for multiple context types to overlap. For example:

“… [no] known [family] [history] of diabetes, [but] …”,

Here “no”, “family”, and “history” would be recognized as negation, experiencer, and temporal triggers respectively, resulting in the diabetes mention being marked as negated and historical, with a non-patient/familial subject.

III. Methods

In this section, we will first outline how context support was implemented within an Elasticsearch/Lucene environment and the changes necessary to support context-aware document scoring. We will then discuss how this implementation was evaluated. Elasticsearch version 6.6.0 (with the corresponding Lucene version 7.6.0) was used for all experiments conducted in this study.

A. Implementation

Implementation of semantic context support fundamentally takes advantage of the fact that within the inverted index itself, Lucene allows for attachment of a byte array payload to each token occurrence within any given document. Because of this capability, the semantic contextual information of each token can be compressed into what we hereafter refer to as the NLP Payload, and be attached to the token in the index.
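As an illustration of what such a payload might contain, the sketch below packs four context flags into a single byte in the Historical / Negation / Assertion / Subject order shown in Fig. 1; the 2-bits-per-dimension layout is a hypothetical encoding chosen for illustration, not necessarily the plugin's actual byte format:

```python
# Hypothetical one-byte NLP Payload: 2 bits per context dimension,
# packed most-significant-first in Historical / Negation / Assertion / Subject order.
def encode_payload(historical, negated, asserted, subject_is_patient):
    value = 0
    for flag in (historical, negated, asserted, subject_is_patient):
        value = (value << 2) | int(flag)   # reserve 2 bits per dimension
    return bytes([value])

def decode_payload(payload):
    value = payload[0]
    flags = [(value >> shift) & 0b11 for shift in (6, 4, 2, 0)]
    return dict(zip(("historical", "negated", "asserted", "subject_is_patient"), flags))

p = encode_payload(historical=True, negated=False, asserted=True, subject_is_patient=True)
print(decode_payload(p))  # {'historical': 1, 'negated': 0, 'asserted': 1, 'subject_is_patient': 1}
```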

With that in mind, implementation can be divided into two portions: changes in tokenization and changes in document scoring. We present a visual representation of the query process in Fig. 1 and a detailed description of the tokenization and scoring implementation in the ensuing sections.

Fig. 1. Context-Aware Query Implementation (Payloads in Historical — Negation — Assertion — Subject Format)

  1. Tokenization Changes: All text that enters the Lucene ecosystem, whether a document to be indexed or a query, is analyzed by an analyzer [11]. The analyzer is responsible for breaking down input text into a sequence of tokens and associated metadata. Typically, an analyzer definition is a sequence of different operations, such as tokenization and stemming. We defined a tokenizer implementation that uses the OpenNLP tokenizer version 1.9.0 [12] and applies the ConText algorithm to identify semantic contexts. An NLP Payload is then generated for each token using this information, and the results of the text processing are returned as an ordered stream of tokens. These tokens are then passed through a lowercase filter and an English stop word removal filter, which corresponds to the default analyzer definition as defined by Elasticsearch.
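The sketch below illustrates this analysis pipeline conceptually in Python rather than as actual Lucene analyzer code (the real plugin implements these stages as Lucene token stream components); the whitespace tokenizer, the stubbed detect_contexts() step, and the truncated stop word list are all stand-ins:

```python
STOP_WORDS = {"of", "the", "and", "a", "for"}   # truncated illustrative stop word list

def detect_contexts(tokens):
    # Stub for the ConText step: the real pipeline assigns per-token
    # historical/negation/assertion/subject flags; here every token gets a default context.
    return [{"historical": 0, "negated": 0, "asserted": 1, "subject_is_patient": 1} for _ in tokens]

def analyze(text):
    tokens = text.split()                        # stand-in for the OpenNLP tokenizer
    contexts = detect_contexts(tokens)           # per-token NLP Payload content
    stream = []
    for position, (token, ctx) in enumerate(zip(tokens, contexts)):
        token = token.lower()                    # lowercase filter
        if token in STOP_WORDS:                  # English stop word removal filter
            continue
        stream.append({"term": token, "position": position, "payload": ctx})
    return stream

print(analyze("No history of diabetes"))
```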

  2. Document Scoring Changes: In document scoring, there are two interacting functions that are important to define: document scoring and term similarity. The document scoring function determines how to score a given document result in relation to an input query, and is typically some summation of the term similarity for each of the individual tokens within the input query taken in conjunction with query statistics (such as number of terms in query) and collection statistics. Term similarity, on the other hand, determines what score to give to a specific token in the query in relation to a specific document, typically through some combination of the term’s frequency in the document and the term’s overall frequency in the complete document corpus. For the purposes of implementing context-aware scoring, modifications were made to both components.

The default Elasticsearch term similarity formula for a single query token (BM25, Eq. (1)) and the document scoring formula (Eq. (2)), both of which were extended in this study, are defined as follows (included here to ease cross-reference with our modified versions):

$$\mathrm{BM25}(q_i, d, C) = \mathrm{IDF}(q_i, C)\cdot\frac{f(q_i, d)\,(k_1 + 1)}{f(q_i, d) + k_1\left(1 - b + b\,\dfrac{|d|}{\frac{1}{|C|}\sum_{d_i \in C}|d_i|}\right)} \tag{1}$$
$$\mathrm{score}(q, d, C) = \sum_{i=1}^{|q|} \mathrm{BM25}(q_i, d, C)\cdot\mathrm{bst}(q_i) \tag{2}$$

where a document (sequence of tokens) $d = w_1, w_2, \ldots, w_n$, a query $q = q_1, q_2, \ldots, q_n$, and the document index $C = d_1, d_2, \ldots, d_n$; $f(q_i, d)$ represents the frequency of the token $q_i$ in document $d$; $k_1$ and $b$ are hyperparameters, defined by default to be 1.2 and 0.75 respectively [13]; $\mathrm{bst}(q_i)$ is the user-defined query boost/weight for the specific query term $q_i$; and $\mathrm{IDF}(q_i, C)$ is the inverse document frequency of $q_i$ in collection $C$ (3), defined as:

$$\mathrm{IDF}(q_i, C) = \log\left(1 + \frac{|C| - \mathrm{colfreq}(q_i, C) + 0.5}{\mathrm{colfreq}(q_i, C) + 0.5}\right) \tag{3}$$

where $\mathrm{colfreq}(q_i, C)$ represents the collection frequency (4) of the query token $q_i$ in $C$:

$$\mathrm{colfreq}(q_i, C) = \sum_{d \in C} \left|\{\, w \mid w \in d,\ w = q_i \,\}\right| \tag{4}$$
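As a reference point, the following sketch transcribes Eqs. (1)-(4) directly into Python over a toy tokenized corpus; it is intended only to make the baseline scoring concrete and ignores Lucene-specific details such as cached index statistics:

```python
import math

def colfreq(term, corpus):
    # Eq. (4): total occurrences of the term across the collection
    return sum(tokens.count(term) for tokens in corpus)

def idf(term, corpus):
    # Eq. (3): inverse document frequency based on the collection frequency
    cf = colfreq(term, corpus)
    return math.log(1 + (len(corpus) - cf + 0.5) / (cf + 0.5))

def bm25(term, doc, corpus, k1=1.2, b=0.75):
    # Eq. (1): term similarity for a single query token against one document
    freq = doc.count(term)
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    return idf(term, corpus) * (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * len(doc) / avg_len))

def score(query, doc, corpus, boosts=None):
    # Eq. (2): sum of per-token similarities, weighted by optional query boosts
    boosts = boosts or {}
    return sum(bm25(t, doc, corpus) * boosts.get(t, 1.0) for t in query)

corpus = [["history", "of", "diabetes"], ["no", "diabetes"], ["hypertension"]]
print(score(["diabetes"], corpus[0], corpus))
```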

In the term similarity component, a context multiplier was applied to all similarity scores. Our methodology for determining said multiplier can be summarized as follows, with respect to a query term $t_q$ and a document term $t_d$:

  • If the subject of $t_d$ differs from that of $t_q$, e.g. CHF in the query refers to the patient while CHF in the document refers to a relative, then this token should not be counted as a match at all and should play no part in the scoring. As such, the context multiplier should be set to 0.

  • If the negation state of $t_d$ differs from that of $t_q$, then such a pair are considered opposites. Accordingly, it is not enough to zero out the similarity multiplier as is done for a subject mismatch. For example, “diabetes with retinopathy” is a meaningful subtype in contrast with “diabetes without retinopathy” and is worth distinguishing. To reflect this, the sign of the term similarity score should be set to negative as opposed to the score simply being set to 0. In the above example, while the overlapping token “diabetes” will contribute to the score in both cases, the negation mismatch will be heavily penalized.

  • No action (multiplier of 1) is taken with respect to the temporal state of $t_d$ compared to $t_q$, as a condition occurring in the present also counts as part of a patient’s history (this is configurable per application needs).

  • If the assertion state of $t_d$ differs from that of $t_q$, the action to be taken depends on the input query:
    • If the query is searching for asserted instances (i.e., it is looking for certain as opposed to possible mentions), then heavily penalize the term similarity.
    • If the query is searching for non-asserted instances, then moderately penalize the term similarity.

To allow customization to the characteristics unique to any given document corpus, the precise numeric values of all multipliers defined here are customizable by the end user, bounded between 0 and 1. Multipliers are multiplicative: for instance, with a heavy penalization multiplier of 0.5, for the input query “diabetes” and document “probably no history of diabetes”, the context multiplier for “diabetes” would be −1 (differing negation) * 0.5 (differing assertion), resulting in a context multiplier of −0.5. The final multiplier for a query token is then the average of all non-zero multipliers (i.e., those indicating the same subject class) found across the matching tokens in a given document.
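A small computational sketch of these rules, under the assumption that per-token contexts are available as simple dictionaries and using the heavy (0.5) and moderate (0.75) penalization multipliers reported later for our corpus (both configurable), is as follows:

```python
HEAVY_PENALTY = 0.5      # assertion mismatch when the query asks for asserted mentions
MODERATE_PENALTY = 0.75  # assertion mismatch when the query asks for non-asserted mentions

def context_multiplier(query_ctx, doc_ctx):
    # Subject mismatch: the match should not count at all
    if query_ctx["subject"] != doc_ctx["subject"]:
        return 0.0
    multiplier = 1.0
    # Negation mismatch: flip the sign so that opposites are penalized rather than ignored
    if query_ctx["negated"] != doc_ctx["negated"]:
        multiplier *= -1.0
    # Temporality: no action by default (historical and current mentions both count)
    # Assertion mismatch: penalty strength depends on what the query asks for
    if query_ctx["asserted"] != doc_ctx["asserted"]:
        multiplier *= HEAVY_PENALTY if query_ctx["asserted"] else MODERATE_PENALTY
    return multiplier

def final_multiplier(query_ctx, doc_ctxs):
    # Average over all matching document tokens with a non-zero multiplier
    # (i.e., those sharing the query term's subject class)
    values = [context_multiplier(query_ctx, c) for c in doc_ctxs]
    nonzero = [v for v in values if v != 0.0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0

# Query "diabetes" (asserted, affirmed, patient) vs. a document mention
# within "probably no history of diabetes"
q = {"subject": "patient", "negated": False, "asserted": True}
d = {"subject": "patient", "negated": True, "asserted": False}
print(final_multiplier(q, [d]))  # -0.5, matching the worked example above
```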

Beyond term similarity, several modifications were made to the overall document scoring function to handle context triggers. In scoring, items that act as a ConText trigger should not be factored into the score. For example, there is no functional difference between “no history of…” and “negative for past…”, and the exact words being used (e.g. “history” vs “past”) should not impact scoring. As such, all parts of the document scoring formula were changed to ignore the presence of query tokens that are marked as triggers, including when calculating query term similarities and when using query level statistics, such as query length. An exception to this is subject triggers, which were preserved for scoring purposes, as the specific subject (e.g. mother, father, brother, sister, etc. for non-patient subjects) is also considered to be clinically relevant information that is not directly captured in the NLP payload.

Equations (5) and (6) show the BM25 term similarity (1) and document scoring (2) functions, respectively, extended to incorporate the modifications described above:

$$\mathrm{sim}_{ctx}(q_i, d, C) = \mathrm{BM25}(q_i, d, C)\cdot\frac{\sum_{t \in m(d, q_i)} w_{ctx}(t, q_i)}{|m_s(d, q_i)|} \tag{5}$$
$$\mathrm{score}_{ctx}(q, d, C) = \sum_{q_i \in q_{ctx}(q)} \mathrm{sim}_{ctx}(q_i, d, C)\cdot\mathrm{bst}(q_i) \tag{6}$$

where $w_{ctx}(t, q_i)$ represents the context multiplier calculated from query term $q_i$ and document term $t$ as previously described, $m(d, q_i)$ represents the set of tokens in $d$ whose lexemes match that of $q_i$, $m_s(d, q_i)$ represents the same as $m(d, q_i)$ with the additional constraint that the calculated context multiplier must be non-zero, and $q_{ctx}(q)$ represents the sequence of tokens corresponding to query $q$ with non-subject triggers removed.
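Building on the BM25 sketch above, the following illustrative code transcribes Eqs. (5) and (6) (again a Python approximation, not the plugin's Java implementation): documents are lists of (token, context) pairs, query terms carry a context and a trigger flag, and bm25() and context_multiplier() refer to the earlier sketches.

```python
def sim_ctx(query_term, query_ctx, doc, corpus):
    # Eq. (5): BM25 scaled by the summed context multipliers of matching tokens,
    # averaged over the non-zero (same-subject) matches
    matches = [ctx for token, ctx in doc if token == query_term]            # m(d, q_i)
    multipliers = [context_multiplier(query_ctx, ctx) for ctx in matches]
    nonzero = [m for m in multipliers if m != 0.0]                          # m_s(d, q_i)
    if not nonzero:
        return 0.0
    doc_tokens = [token for token, _ in doc]
    corpus_tokens = [[token for token, _ in d] for d in corpus]
    return bm25(query_term, doc_tokens, corpus_tokens) * sum(multipliers) / len(nonzero)

def score_ctx(query, doc, corpus, boosts=None):
    # Eq. (6): sum over query tokens, with non-subject triggers flagged and dropped
    boosts = boosts or {}
    return sum(
        sim_ctx(term, ctx, doc, corpus) * boosts.get(term, 1.0)
        for term, ctx, is_trigger in query
        if not is_trigger  # q_ctx(q): trigger tokens are excluded from scoring
    )
```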

In this study, semantic context support was implemented as an Elasticsearch plugin, the source code of which can be found at http://www.github.com/OHNLP/elasticsearch_nlp_plugin.

B. Experiment

Our evaluation focused on performance of the IR system after incorporating contextual information into the relevance scoring. Specifically, two aspects were defined for performance: runtime performance (i.e., how long it takes for the IR system to return results and the practicality of scaling out such a system), and retrieval performance (i.e., how accurate the retrieved results are when taking into account contextual information).

A document set corresponding to 45,000 patients used in a past study [14] was loaded separately into two independent indices, each with the same level of parallelism (4 shards). To isolate performance, these two indices were run on different Elasticsearch instances running on separate machines, with each machine having the same hardware specifications. A query set from the same past study consisting of 56 distinct queries used for clinical trial cohort identification tasks was reused for both phases of the evaluation [14].

  1. Runtime Performance Evaluation: When discussing runtime performance and scalability to datasets of large volume, we must consider both the time and space complexity (i.e. growth with each new document added to the document corpus). We assume that the default Elasticsearch implementation is scalable as it is considered to be an industry standard. It is thus sufficient to show that both complexities remain the same as the baseline (i.e. the order of the complexity in Big O notation remains the same).

    We conducted this evaluation during the initial loading of documents into the two indices. Specifically, indexing was done simultaneously to both indices in 1 million document increments up to a total of 39 million documents (corresponding to those 45,000 patients). At each increment, each of the 56 queries was executed 5 times and the average execution time across all query runs for each index was retained. Additionally, the total size of each of the indices was recorded at each increment so as to evaluate the NLP Payload implementation’s impact on storage space consumption. Finally, to evaluate the impact that the additional tokenization and storage requirements have on indexing performance, we determined the average indexing time per document for both indices at each increment.

  2. Retrieval Performance Evaluation: The same prior study [14] also produced a pool of relevance judgements (a determination by a human annotator as to whether a retrieved document was relevant or not relevant) for 46 of the 56 queries that could be reused to conduct evaluation in our study. This evaluation set was the best we could obtain without additional manual annotation, but with an important caveat that must be considered for evaluation purposes: the earlier judgement set can only serve as a silver standard, as it was derived by pooling the top N results from a variety of IR algorithms and thus does not contain relevance judgements on all documents available in the document set. Additionally, the construction of these pools also involved patient-level pre-filtering by structured data such as demographics and billing codes. Because of this, it is expected that the set of results retrieved by the context-aware algorithm would consist of significantly different documents versus those contained within the silver standard set.

The above limitations would not invalidate our primary objective of comparing the context-aware enhancements against the baseline, because the same silver standard was a constant benchmark. However, to accommodate the effect that many of the results returned by the context-aware queries may not be covered in the silver standard, our evaluation was moderately adjusted by instead conducting two separate evaluations:

  • Simple: When running queries and collecting results, return the top 10,000 documents instead of the top 1,000 (typical in IR evaluations for the Text REtrieval Conference), to ensure that a sufficient number of documents contained within the silver standard judgements were retrieved for computing meaningful comparative metrics.

  • Re-Ranked: When running the query, retrieve only the documents in the judgement pool. This essentially evaluated only the relevance ranking, as if our entire document corpus consisted of documents that we had judgements for.

The metrics used for evaluation in this study are as follows (a computational sketch of these metrics is given after the list):

  • Precision@N (7): a metric that measures the quality of the top ranked N results. Namely, it is the precision of the system at the top N returned documents, defined as:
    $\mathrm{Precision}(N) = \dfrac{tp(N)}{tp(N) + fp(N)}$ (7)
    where $tp(N)$ represents the number of relevant results returned within the top N ranked documents and $fp(N)$ represents the number of non-relevant results returned within the same top N ranked documents. We measured both the Precision@N of individual queries and the overall averaged Precision@N across all 46 queries.
  • Mean Average Precision (MAP) (8): a metric that is commonly used to evaluate IR systems. It is the mean of the average precision values across all queries, where average precision represents the average of the precision of the system at the position of each returned relevant document [15]. Specifically, for a set of queries $Q = q_1, q_2, \ldots, q_n$:
    $\mathrm{MAP}(Q, R) = \dfrac{\sum_{q \in Q} \mathrm{AveP}(q, R)}{|Q|}$ (8)
    where $\mathrm{AveP}(q, R)$ is the average precision for query $q$ and a set of relevant results $R$ (9):
    $\mathrm{AveP}(q, R) = \dfrac{\sum_{r \in R} \mathrm{Precision@rank}(r)}{|R|}$ (9)
    and $\mathrm{Precision@rank}(r)$ represents the Precision@N of the system with N equal to the rank of $r$ within the result set.
  • Binary Preference-Based Measure (BPref) (10): a metric that aims to address the same information need as MAP, but also supports missing judgements in its gold standard [16]. More specifically, it is a measure that penalizes the score when a known (judged to be) non-relevant document is ranked higher than a known relevant document. For each known relevant document r in the overall set of documents judged relevant R from the gold standard, BPref can be defined as:
    $\mathrm{Bpref}(R) = \dfrac{1}{|R|} \sum_{r \in R} \left(1 - \dfrac{\mathrm{numNonRelPre}(r)}{|R|}\right)$ (10)
    where numNonRelPre(r) is the number of known non-relevant documents that appear in the ranked result list before known relevant document r.
  • Inferred Average Precision (infAP): The inferred average precision is a measure that performs a similar function to that of BPref in that it is robust to missing judgements, but has been shown to better estimate the true average precision compared to BPref as the number of missing judgements grows larger. Its full definition can be found in [17].

  • R-Precision (R-Prec): A fundamental problem with Precision@N is that it can underrate performance when there are fewer than N relevant results truly present in the results. Using our experimental dataset as an example, 5 of the 46 queries had fewer than 10 results judged relevant in the overall judgement pool, and 25 had fewer than 100 relevant judgements. Conventional Precision@N would therefore be unfairly penalized for those queries that have fewer than N relevant judgements. Despite this, it is sometimes desirable to have large values of N, as it is useful to evaluate overall performance of the system. Accordingly, R-precision aims to resolve this issue; for any given query q, it is equivalent to Precision@N (7) with N equal to the number of documents judged relevant in the pool for q.
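To make the ranking-based metrics above concrete, the sketch below computes Precision@N, average precision, BPref, and R-precision from a ranked result list and a partial judgement pool; the document IDs are hypothetical, and infAP is omitted since its estimator is more involved (see [17]):

```python
def precision_at_n(ranked, relevant, n):
    # Eq. (7): fraction of the top-N returned documents that are judged relevant
    return sum(1 for d in ranked[:n] if d in relevant) / n

def average_precision(ranked, relevant):
    # Eq. (9): average of Precision@rank(r) over the relevant documents r
    precisions = [precision_at_n(ranked, relevant, i + 1)
                  for i, d in enumerate(ranked) if d in relevant]
    return sum(precisions) / len(relevant) if relevant else 0.0

def bpref(ranked, relevant, nonrelevant):
    # Eq. (10): penalize judged non-relevant documents ranked above relevant ones,
    # counting at most |R| non-relevant documents as in the original measure [16]
    total, nonrel_seen = 0.0, 0
    for d in ranked:
        if d in relevant:
            total += 1 - min(nonrel_seen, len(relevant)) / len(relevant)
        elif d in nonrelevant:
            nonrel_seen += 1
    return total / len(relevant) if relevant else 0.0

def r_precision(ranked, relevant):
    # Precision@N with N equal to the number of documents judged relevant
    return precision_at_n(ranked, relevant, len(relevant)) if relevant else 0.0

ranked = ["d3", "d1", "d7", "d2", "d9"]              # hypothetical ranked results
relevant, nonrelevant = {"d1", "d2"}, {"d3", "d9"}   # partial judgement pool ("d7" unjudged)
print(precision_at_n(ranked, relevant, 2), average_precision(ranked, relevant),
      bpref(ranked, relevant, nonrelevant), r_precision(ranked, relevant))
```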

Due to the limitations in the relevance judgement pools, the specific metrics used depended on the evaluation being conducted:

  • Simple: MAP, Precision@N and R-Prec cannot be used as a significant portion of the returned results will have no judgements and thus be unusable in the determination of precision. We must therefore instead use BPref and infAP, which can handle the missing judgements.

  • Re-Ranked: we do not have any missing judgements in the returned results; the traditional MAP, R-Prec, and Precision@N measures can therefore be used. We measured Precision@N for N ∈ {10, 100}. Because Precision@N is only meaningful for queries that have at least N relevant results, we report the average Precision@N values only for those queries with a number of documents judged relevant greater than or equal to N.

IV. Results

A. Runtime Performance

  1. Runtime: A plot of the runtime performance is presented in Fig. 2, and a subset of the data points is presented in Tab. I. The runtime performance achieved suggests that the context-augmented queries had the same time complexity as the baseline and ran 35% faster. This runtime performance improvement appears counter-intuitive given the additional computational burden that would be expected to be introduced from the incorporation of contextual information into document scoring. We will further examine this result in the discussion section.

  2. Storage: Context-Aware indices were found to have the same space complexity as the baseline, with a size requirement corresponding to a 1.5x-2x increase compared to baseline. We present a subset of the data in Tab. II, and a full plot of the data in Fig. 3.

  3. Indexing: The indexing time was found to have the same time complexity as the baseline. The average indexing time was 0.3 milliseconds per document for the baseline index, and was 1.21 milliseconds per document for the context-aware index (i.e., roughly 4 times the amount of time necessary per document).

Fig. 2. Execution Time – Baseline vs Context-Aware

TABLE I.

Execution Time – Baseline vs Context-Aware

Documents in Index Baseline Time (ms) Context-Aware Time (ms) Relative Performance
5,000,000 162.16 112.38 −30.70%
10,000,000 262.38 168.95 −35.61%
15,000,000 346.09 226.77 −34.48%
20,000,000 472.27 308.45 −34.69%
25,000,000 574.46 372.09 −35.23%
30,000,000 733.29 480.27 −34.50%
35,000,000 865.55 552.66 −36.15%

TABLE II.

Index Size – Baseline vs Context-Aware

Documents in Index Baseline Size (GB) Context-Aware Size (GB) Relative Size
5,000,000 4.15 7.70 +85.35%
10,000,000 7.20 16.29 +126.22%
15,000,000 14.52 22.84 +57.26%
20,000,000 15.24 27.49 +80.40%
25,000,000 20.13 38.95 +93.50%
30,000,000 19.67 38.79 +97.18%
35,000,000 28.22 47.68 +68.95%

Fig. 3. Index Size – Baseline vs Context-Aware

B. Retrieval Performance

For this study, we used a multiplier of 0.75 for moderate penalization, and a multiplier of 0.5 for heavy penalization, which was determined to best fit our document corpus through a naïve iteration and review process. The retrieval performance is presented in Tab. III, and the results indicate that incorporating contextual awareness into queries does help increase the precision of the top ranked results.

TABLE III.

Retrieval Performance – Baseline vs. Context-Aware, Averaged Across All Queries

Evaluation Type Evaluation Metric Baseline Value Context Aware Value Relative Difference
Simple BPref 0.3089 0.3359 +8.7%
Simple infAP 0.0195 0.0240 +23.1%
Re-Ranked MAP 0.6825 0.7174 +5.1%
Re-Ranked P@10 (Subset 41/46) 0.7707 0.8146 +5.7%
Re-Ranked P@100 (Subset 21/46) 0.8614 0.8981 +4.2%
Re-Ranked R-Prec 0.6464 0.6911 +6.9%

V. Discussion

The results of this study support that the incorporation of contextual information increased retrieval performance, with a consistent gain across all evaluation metrics used. More importantly, we demonstrated that this gain can be implemented bidirectionally, i.e., the weighting also supports queries that explicitly look for negated, asserted, or non-patient context mentions, and does so without adding user burden, as the input query can still be expressed in natural language.

The experimental results further showed that the proposed implementation is scalable, with the time and space complexities both remaining consistent with the current Elasticsearch implementations for both querying and indexing. Specifically, storage requirements increased moderately (1.5–2 times), indexing time increased about 4-fold, and query time actually decreased by about 35%. Because the smoothed relative performance change for each of these three metrics is near-constant across the different increments of index size, we can state that the time and space complexities are of the same order as those of the baseline.

To investigate the 35% speedup in query execution despite the expected additional computational burden, we instrumented Elasticsearch using JVisualVM [18] to determine the amount of time needed to perform each part of the scoring process. We found that the improvement in query runtime performance stems from the context triggers being excluded from the scoring process. Specifically, the JVisualVM instrumentation showed that BM25 scoring continued to dominate the execution time of the context-aware scoring function (83%), indicating that the additional computational burden introduced by context-aware scoring does not contribute much to the overall execution time. On the other hand, we found that triggers accounted for on average 12% of the input query (excluding stopwords that are already covered by the English stopwords filter), and that they tended to be words that occur frequently in clinical narratives (Tab. IV).

TABLE IV.

A List of Query Words Removed From Scoring

admitted, at, but, considered, currently, department, emergency, following, for, had, have, high, history, if, is, may, negative, never, no, not, of, past, prior, risk, test, underwent, used, was, who, without

We believe that the combination of fewer query terms to consider and significantly fewer documents to score, due to the removal of these pseudo-stopwords, is what contributed to the improved runtime performance; we empirically confirmed that this was indeed the case by manually disabling the trigger term filter.

Despite the positive results, several limitations do exist in our current implementation. Firstly, because contextual information is implemented as a payload and the scoring algorithm changes are implemented at a document level, it is not possible to affect some portions of the scoring process. For example, the word “diabetes” is seen significantly more often in the context of family history mentions than in the patient context. Under a typical TF-IDF based scoring scheme, the scoring algorithm as currently implemented is unable to take this distinction into account on the IDF side: if we were to search for a personal history of diabetes, the scoring algorithm can filter out family mentions of diabetes when determining document-level term frequency, but the collection frequency used for the IDF calculation will continue to contain a mixture of both personal and familial history. This is a limitation because, as part of a larger query, a search for a personal history of diabetes should have a much greater impact on the overall score than a search for a family history of diabetes.

Secondly, it is known that the NLP contextual determination system is imperfect and performance can still be improved. Despite this, substitution of context detection systems is fairly straightforward, with the caveat that because payloads are generated at index time for performance reasons, the entire document corpus must be re-indexed to leverage any updated context detection algorithms.

Thirdly, when measuring retrieval performance, we were measuring a combined effect of the context detection algorithm plus incorporation of the contextual information into IR scoring. Without a gold standard for the correct contexts, we could not evaluate each of these tasks in isolation.

Finally, the ConText algorithm embedded in our distributed implementation is entirely rule-based: while the default ruleset provided has been found to perform well generally, it shares the same limitations as any rule-based system in terms of portability. We would therefore suggest that any adopters spend some time customizing the ConText rules to fit their data set. Similarly, the penalization multipliers presented here were tuned to the experimental data set and may require further tuning.

VI. Future Work

In terms of clinical information retrieval capabilities overall, while contextual information is a key portion of NLP, there are still many other potential NLP features that can be incorporated into IR to improve retrieval performance. For example, there are many textual variants for the expression of any given concept. The concept “Myocardial Infarction” may also be expressed as “MI”, “STEMI”, or “NSTEMI”, none of which would be found if the input query is the full text “myocardial infarction”.

As part of future work, we aim to leverage publicly available terminologies and ontologies in conjunction with named entity recognition to further improve search performance, either through autonomous query expansion or through direct encoding of this information into the index.

VII. Conclusion

In this work, we presented an implementation of context-aware queries within an information retrieval framework that is shown to significantly improve clinical document retrieval performance. We also demonstrated that this proposed implementation is scalable and supports a seamless experience allowing for usage of queries written in the natural language.

Acknowledgment

We thank the Research Computing Services and Big Data Technology Services Units of the Mayo Clinic Department of Information Technology for supplying and maintaining the infrastructure and data warehousing capabilities needed to perform this study.

This work was supported by National Institutes of Health grants R01LM011934 and U01TR002062. This study was approved by the Mayo Clinic institutional review board (IRB #17-003030) for human subject research.

Contributor Information

Andrew Wen, Division of Digital Health Sciences, Mayo Clinic, Rochester, MN, USA.

Yanshan Wang, Division of Digital Health Sciences, Mayo Clinic, Rochester, MN, USA.

Vinod C. Kaggal, Department of Information Technology, Mayo Clinic, Rochester, MN, USA.

Sijia Liu, Division of Digital Health Sciences, Mayo Clinic, Rochester, MN, USA.

Hongfang Liu, Division of Digital Health Sciences, Mayo Clinic, Rochester, MN, USA.

Jungwei Fan, Division of Digital Health Sciences, Mayo Clinic, Rochester, MN, USA.

References

  • [1].Horwitz RI, Hayes-Conroy A, Caricchio R, and Singer BH, “From Evidence Based Medicine to Medicine Based Evidence,” Am J Med, vol. 130, no. 11, pp. 1246–1250, Nov 2017. [DOI] [PubMed] [Google Scholar]
  • [2].Martin-Sanchez F and Verspoor K, “Big data in medicine is driving big changes,” Yearb Med Inform, vol. 9, pp. 14–20, Aug 15 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [3].Wu S, Masanz J, Ravikumar K, and Liu H, “Three questions about clinical information retrieval,” Proceedings of the Twenty-first Text REtrieval Conference. National Institute of Standards and Technology (NIST), 2012. [Google Scholar]
  • [4].Apache Lucene, The Apache Software Foundation. [Online]. Available: https://lucene.apache.org/. [Google Scholar]
  • [5].Zobel J and Moffat A, “Inverted files for text search engines,” ACM computing surveys (CSUR), vol. 38, no. 2, p. 6, 2006. [Google Scholar]
  • [6].Manning C, Raghavan P, and Schütze H, “Introduction to information retrieval,” Natural Language Engineering, vol. 16, no. 1, pp. 100–103, 2010. [Google Scholar]
  • [7].Ramos J, “Using tf-idf to determine word relevance in document queries.”
  • [8].Robertson S, “The Probabilistic Relevance Framework: BM25 and Beyond,” Foundations and Trends® in Information Retrieval, vol. 3, no. 4, pp. 333–389, 2010. [Google Scholar]
  • [9].Chapman WW, Bridewell W, Hanbury P, Cooper GF, and Buchanan BG, “A simple algorithm for identifying negated findings and diseases in discharge summaries,” J Biomed Inform, vol. 34, no. 5, pp. 301–10, Oct 2001. [DOI] [PubMed] [Google Scholar]
  • [10].Harkema H, Dowling JN, Thornblade T, and Chapman WW, “ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports,” J Biomed Inform, vol. 42, no. 5, pp. 839–51, Oct 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [11].(2017). Understanding Analyzers, Tokenizers, and Filters [Online]. Available: https://lucene.apache.org/solr/guide/6_6/understanding-analyzers-tokenizers-and-filters.html.
  • [12].Morton T, Kottmann J, Baldridge J, and Bierner G. (2005). OpenNLP: A Java-based NLP Toolkit. [Online]. Available: http://opennlp.sourceforge.net.
  • [13].Elasticsearch BV. (2019). [Online]. Available: https://www.elastic.co/guide/en/elasticsearch/reference/6.6/index-modules-similarity.html#bm25.
  • [14].Wang Y, Wen A, Liu S, Hersh W, Bedrick S, and Liu H, “Test Collections for EHR-based Clinical Information Retrieval,” JAMIA Open, 2019 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [15].Voorhees EM and Harman DK, TREC: Experiment and Evaluation in Information Retrieval. The MIT Press, 2005. [Google Scholar]
  • [16].Buckley C and Voorhees EM, “Retrieval evaluation with incomplete information,” in Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, 2004: ACM, pp. 25–32. [Google Scholar]
  • [17].Yilmaz E and Aslam JA, “Inferred AP: estimating average precision with incomplete judgments,” in Fifteenth ACM International Conference on Information and Knowledge Management (CIKM) 2006, pp. 102–111. [Google Scholar]
  • [18].(2019). VisualVM 1 [Online]. Available: https://docs.oracle.com/javase/8/docs/technotes/guides/visualvm/.
