Abstract
In many applications autocompletion functionality saves keystrokes, improves the user experience, and helps the user comply with standardized terminology. Intuitively, the more context information we have about the user, the more accurate autocompletion suggestions we can give. In this paper we investigate the added value of contextual information for autocompletion algorithms, measured as the average number of saved keystrokes. In our experiments, a context is represented as a set of SNOMED CT terms. Using the structure of SNOMED CT we determine the semantic distance of each SNOMED CT term to the context terms. The resulting distance function is injected into the autocompletion algorithms to reward terms that are semantically close to the context. Our results show that semantic enhancement saves up to 18% of keystrokes, in addition to the percentage of keystrokes saved by the non-semantic base algorithm.
Introduction
Autocompletion is a common feature of many types of software applications, such as online search engines (e.g. Yahoo! and Google), web browsers (e.g. Chrome and Firefox), e-mail tools (e.g. Microsoft Outlook), and texting (e.g. T9). The advantages of autocompletion are manifold. Autocompletion saves keystrokes (hence typos) and improves the user experience by reassuring users that the application “understands” their intentions. Further, it helps the user to adopt standardized terminology. The latter advantage is important for clinical software applications in view of the increasing interest in standardization of data in clinical information systems. For instance, ICD, CPT and SNOMED CT provide standardized terminology for (parts of) medicine.
Autocompletion algorithms take as input a string entered by the user and, possibly, information that describes the context of the query (previous queries, geographic location, user profile, etc.). Contextual information can be used to suggest terms used earlier or to disambiguate queries. We hypothesize that more contextual information results in more accurate autocompletion suggestions. To test this hypothesis, we consider the use case wherein the user wants to produce a certain term (the “target”) and the context is described by a set of terms semantically related to the target. For instance, the user wants to produce “optic nerve sheath meningioma” and the context is described by “cranial nerve II” and “tumor”. The target and context terms are extracted from a large corpus of MEDLINE abstracts, allowing us to run large-scale experiments that quantitatively measure the added value of contextual information under different parameter settings. A downside of this approach is that the contexts generated in our experiments do not necessarily correspond to situations that occur in clinical practice.
The target and context terms are SNOMED CT terms that we automatically extracted from a collection of MEDLINE journal abstracts from various biomedical journals. SNOMED CT [1] comprehensively describes the biomedical domain and contains more than 480k terms spread over forty categories including body structure, disorder and finding. SNOMED CT groups synonymous terms together in concepts (of which it has 307k), and relates concepts along various types of relations.
For a target term from a certain abstract, we consider a variety of context types. For instance, one context type selects context terms from the target’s abstract. In another type, we select as context terms the most representative terms of the journal to which the abstract belongs. The context type is the first degree of freedom of the contexts in our experiments, called “dimension (i)”. The other two dimensions are: (ii) the number of terms in the context and (iii) the semantic categories of the terms (body structure, disorder, finding). Thus, a context is determined by its settings for the dimensions (i)–(iii). For instance, a context could consist of twelve body structure terms selected from one and the same abstract.
We have developed two autocompletion algorithms. The suggestion space of these algorithms is spanned by the terms in SNOMED CT. In their non-semantic mode, the algorithms rank the suggestion terms on syntactic criteria only. In their semantic mode, they take into account the semantic distance of a candidate term to the context terms. We determine the semantic distance between two terms by means of SNOMED CT’s relations.
To measure the performance of a (semantic) autocompletion algorithm, we developed a metric that measures the minimal number of keystrokes required to produce a given target (with respect to one particular user interface). This quantitative metric enables us to measure the added value of various types of contextual information in terms of the performance of our autocompletion algorithms.
Background
For references on autocompletion, [2] is a good entry point. The work we found on semantic autocompletion focuses on disambiguation of queries [3] or presentation of suggestions grouped semantically [4]. Taking into account the context is well known in the area of text search technology [5]. The web search publication [6] bears considerable similarity to our approach, as it proposes to use as context the text surrounding a marked query. To the best of our knowledge no publications quantify the performance of semantic autocompletion algorithms.
Methods
In this section we describe two autocompletion algorithms. Further, we describe how the context terms are selected, how the semantic neighborhood of these terms is determined, and how this information is injected into the two autocompletion algorithms.
Autocompletion
We developed two autocompletion algorithms for integration in an industrial healthcare application, the details of which are not pertinent to this paper. The first algorithm, coined horizon expansion (HE), suggests terms that start with a given query (i.e. what the user has entered). HE works on the prefix tree of all terms. Since SNOMED CT has many terms, the number of possible suggestions exceeds the number of terms that can be displayed (∼10), even for longer queries. In these cases, HE suggests the extensions of the query, up to the next word boundary, that are shared by many matching terms. For instance, 39 SNOMED CT terms start with “optic n”. For this query, HE’s top-four suggestions are “optic nerve”, “optic neuritis”, “optic neuropathy” and “optic neuromyelitis”.
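By way of illustration, the following is a minimal sketch of HE under simplifying assumptions, not the production implementation: a prefix tree over all terms, plus a fallback that, when more terms match than can be displayed, cuts each completion at the next word boundary and ranks these shared extensions. All names and the display limit of 10 are illustrative.

```python
from collections import Counter

class TrieNode:
    def __init__(self):
        self.children = {}      # char -> TrieNode
        self.is_term = False

class PrefixTree:
    def __init__(self, terms):
        self.root = TrieNode()
        for term in terms:
            node = self.root
            for ch in term:
                node = node.children.setdefault(ch, TrieNode())
            node.is_term = True

    def completions(self, query):
        """All indexed terms that start with `query`."""
        node = self.root
        for ch in query:
            if ch not in node.children:
                return []
            node = node.children[ch]
        found, stack = [], [(node, query)]
        while stack:
            node, text = stack.pop()
            if node.is_term:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return found

def horizon_expand(tree, query, limit=10):
    terms = tree.completions(query)
    if len(terms) <= limit:
        return sorted(terms)
    # Too many hits: cut each completion at the first word boundary after
    # the query and rank these "horizons" by how many terms share them.
    def horizon(t):
        cut = t.find(" ", len(query))
        return t if cut == -1 else t[:cut]
    return [h for h, _ in Counter(horizon(t) for t in terms).most_common(limit)]
```

For the query “optic n”, the horizon of “optic nerve sheath meningioma” is “optic nerve”, so the many terms extending “optic nerve” are collapsed into one high-ranking suggestion.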
The second algorithm, coined multi-word matching (MWM), matches words from the query with words in SNOMED CT terms. A query word matches a term word if the latter starts with the former. Only those terms are suggested in which every query word has a matching term word. For instance, the query “op ne” matches “optic nerve” and “nerve operation”, but not “metal-press operator”. MWM assigns a score to every term in SNOMED CT, favoring shorter candidate terms that have fewer term words and respect the order of the query words.
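A minimal sketch of MWM’s matching rule; the scoring weights are illustrative stand-ins for the tuned production scoring:

```python
def mwm_score(query, term):
    """Score `term` for `query`; 0.0 means no match."""
    q_words = query.lower().split()
    t_words = term.lower().split()
    positions, used = [], set()
    for qw in q_words:
        hit = next((i for i, tw in enumerate(t_words)
                    if i not in used and tw.startswith(qw)), None)
        if hit is None:
            return 0.0          # some query word has no matching term word
        used.add(hit)           # greedy first match; a full matcher would
        positions.append(hit)   # search all assignments
    score = len(q_words) / len(t_words)   # favor terms with fewer words
    if positions != sorted(positions):    # penalize out-of-order matches
        score *= 0.5
    return score

# "op ne" scores "optic nerve" (1.0) and "nerve operation" (0.5),
# but not "metal-press operator" (0.0).
```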
Extracting SNOMED CT terms
Target and context terms are selected from the SNOMED CT terms extracted from 148,480 MEDLINE abstracts. The corpus was developed for the genomics track of the TREC conference [4]. Henceforth, when we use the word “term” we refer to a SNOMED CT term.
We only extracted terms that belong to the three main SNOMED CT categories: body structure, disorder and finding. Together, these account for more than 44% of all terms.
We developed an extraction method that compares terms with phrases from the abstracts ignoring capitalization (Blood = blood); ignoring word order (left upper arm = upper left arm); ignoring stop-words; and using Porter stemmer to normalize words (body = bodies). By means of example, in the following fragment the five extracted terms are highlighted:
“…with the experimental **autoimmune diseases** to which they are susceptible: **insulin-dependent diabetes mellitus**, **systemic lupus erythematosus** and experimental autoimmune **encephalomyelitis**. We discovered recently that NOD/LtJ mice also spontaneously produce IgG antibodies to the acetylcholine receptor (AchR), an antigen that can induce experimental autoimmune **myasthenia gravis** (EAMG) in susceptible...”
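A minimal sketch of the normalization underlying this matching; it assumes NLTK’s Porter stemmer, and the stop-word list is a stand-in for the one actually used:

```python
from nltk.stem import PorterStemmer  # assumes NLTK is installed

STOP_WORDS = {"the", "of", "and", "to", "a", "in"}   # stand-in list
stemmer = PorterStemmer()

def normalize(phrase):
    """Canonical form: lowercase, drop stop words, stem, ignore word order."""
    words = phrase.lower().split()
    return frozenset(stemmer.stem(w) for w in words if w not in STOP_WORDS)

# "left upper arm" and "Upper left arm" map to the same key, so a SNOMED CT
# term can be matched against abstract phrases via dictionary lookup.
assert normalize("left upper arm") == normalize("Upper left arm")
```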
From each abstract from which more than twelve terms were extracted, we randomly pick one term of nine or more characters and add it to the “target list”. The target list comprises more than 29k terms. To determine a target’s context, we maintain the link between a target and its abstract.
Context types
The context of a target term is modeled as a set of terms. We consider four context types: abstract-based, journal-based, static and random. These types span dimension (i) from the Introduction. Let n be the number of context terms (dimension (ii)).
An abstract-based context of a target is a random selection of n terms from the target’s abstract. We skip abstract-based contexts that contain the target itself, as these would bias the results. For instance, in the fragment above, “autoimmune disease”, “insulin-dependent diabetes mellitus” and “encephalomyelitis” form an abstract-based context (n = 3) for “myasthenia gravis”.
The journal-based context of a target consists of the top-n most discriminative terms of the journal of the target’s abstract, compared to the other journals. The discriminative power of a term for a journal is quantified by a term frequency-inverse document frequency metric (tf-idf) [5]. This metric considers a journal as the bag of terms that appear in its abstracts.
For instance, the top-four terms of the journal *Nephrology, dialysis, transplantation* are “disease of kidney”, “transplant”, “kidney” and “entire kidney”. Each journal has one journal-based context (for given n). To avoid biasing the results, we skip targets that appear in their own journal-based context.
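A minimal sketch of this selection, assuming each journal is available as the list of terms extracted from its abstracts; names and the exact tf-idf variant are illustrative:

```python
import math
from collections import Counter

def top_discriminative_terms(journal_bags, journal, n):
    """journal_bags: journal name -> list of terms extracted from its
    abstracts. Returns the journal's top-n terms by tf-idf."""
    n_journals = len(journal_bags)
    df = Counter()                      # number of journals containing a term
    for bag in journal_bags.values():
        df.update(set(bag))
    tf = Counter(journal_bags[journal])  # term frequency within this journal
    scores = {t: count * math.log(n_journals / df[t])
              for t, count in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```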
The static context is the context that contains all body structure, finding and disorder terms. This context type is independent of the number of context terms n. For benchmarking, we include random contexts that consist of n randomly drawn body structure, finding and/or disorder terms.
Semantic distance
The semantic autocompletion algorithms reward candidate terms that are semantically close to the context terms. We use a spreading activation algorithm [7] to compute, for each term, its semantic distance to the context terms. The spreading activation algorithm assigns maximum activity (1.0) to the context terms. Then, it iteratively propagates activity from terms with nonzero activity to the terms to which they are related in SNOMED CT. A decay factor ensures that only a portion of the activity is propagated. Activity thus spreads from the context terms to semantically related terms until the propagated activity no longer exceeds a certain threshold. The activation of a term (a value between 0.0 (distant) and 1.0 (close)) is taken as its semantic distance to the context terms. The semantic distances are collected in a distance function.
No spreading was performed on static contexts (hence the name).
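A minimal sketch of the spreading step, assuming the SNOMED CT relations are given as an adjacency map; the decay and threshold values are illustrative, not the settings used in the experiments:

```python
def spread_activation(graph, context_terms, decay=0.5, threshold=0.01):
    """graph: term -> iterable of related terms (SNOMED CT relations).
    Returns term -> activation in [0, 1]; 1.0 for the context terms."""
    activation = {t: 1.0 for t in context_terms}
    frontier = dict(activation)
    while frontier:
        next_frontier = {}
        for term, act in frontier.items():
            propagated = act * decay          # decay: only a portion spreads
            if propagated < threshold:
                continue                      # activity has died out
            for neighbor in graph.get(term, ()):
                if propagated > activation.get(neighbor, 0.0):
                    activation[neighbor] = propagated
                    next_frontier[neighbor] = propagated
        frontier = next_frontier
    return activation                         # the distance function D
```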
Semantic autocompletion
We inject the distance function into the algorithms in a uniform way. To this end, we view HE and MWM as scoring functions F that, given a query, assign a score between 0 and 1 to every candidate term. The terms with the highest F scores are suggested to the user. The semantic scoring function S of HE and MWM takes the distance function D into account as follows:

S(q, t) = 0.5 · F(q, t) · (1 + D(t))

for a query q and a candidate term t. The constant 0.5 normalizes the score between 0 and 1.
The rationale behind the semantic scoring function S is illustrated as follows. Let s and t be two candidate completions for q = “chronic r” that receive equal scores from the standard scoring function: F(q, s) ≈ F(q, t). Suppose that the context consists of kidney transplantation related terms and that s = “chronic renal failure”, whereas t = “chronic residual schizophrenia”. Semantically, s is closer to the context terms than t: D(s) > D(t). Hence, the semantic scoring function prefers s over t: S(q, s) > S(q, t). Besides modifying the scoring function, we also propagate activity upward in HE’s prefix tree. Semantic distance is assigned only to complete terms, which are leaf nodes of the prefix tree; without upward propagation, HE, when evaluating candidates for a short query high up in the tree, would not know where the semantically close terms sit further down.
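A minimal sketch of the injection, assuming `D` is the activation map produced by the spreading step and `F` is one of the base scoring functions sketched above:

```python
# Inject the distance function D into a base scoring function F
# (e.g. mwm_score above). F values and D values both lie in [0, 1],
# so S stays within [0, 1].
def semantic_score(F, D, query, term):
    return 0.5 * F(query, term) * (1.0 + D.get(term, 0.0))

# With a context of kidney transplantation related terms, "chronic renal
# failure" outranks "chronic residual schizophrenia" for q = "chronic r".
```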
Keystroke-based metric
To evaluate the algorithms we developed a metric based on the least number of keystrokes required to produce a target. We assume a text box in which the user manipulates a “current” string by entering characters. For every current string, the suggestions are presented in an ordered list whose focus is on the first element. Focus shifts to the next element by pressing the down-arrow key. Pressing enter replaces the current string with the suggestion that has focus. HE interacts with this UI in an interesting way. For illustration, suppose the target is “optic nerve head”. We saw that the first suggestion for “optic n” is “optic nerve”. The first suggestion for “optic nerve” is “optic nerve head”. So, one way to produce the target with the help of HE is to enter “optic n” followed by pressing enter twice (= nine keystrokes).
In our metric every keystroke (character, space, down arrow, enter) counts as one. Extensions of the metric may assign different weights to the various keys, but we do not pursue this here. The κ score of a (semantic) autocompletion algorithm denotes the average minimal number of keystrokes required to produce a target term from the target list. For a semantic autocompletion algorithm, λ is the ratio of its κ to the κ of its non-semantic counterpart. So if λ < 1, the semantic autocompletion saves more keystrokes on average than its non-semantic counterpart.
We developed a module that automatically computes an autocompletion algorithm’s κ, taking into account the cascading effect described for HE.
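A minimal sketch of such a module, under simplifying assumptions: the κ computation for one target is a shortest-path search over UI states (current strings), restricted here to states on the way to the target, which captures the cascading effect described above. The function `suggest` stands in for the (semantic) autocompletion algorithm.

```python
import heapq

def min_keystrokes(target, suggest, max_suggestions=10):
    """Least keystrokes to turn "" into `target`; one keystroke is a typed
    character, a down-arrow, or enter (accepting the focused suggestion)."""
    best = {"": 0}
    heap = [(0, "")]
    while heap:
        cost, current = heapq.heappop(heap)
        if current == target:
            return cost
        if cost > best.get(current, float("inf")):
            continue                                # stale heap entry
        moves = [(target[:len(current) + 1], 1)]    # type the next character
        for i, s in enumerate(suggest(current)[:max_suggestions]):
            if target.startswith(s):                # prune useless detours
                moves.append((s, i + 1))            # i down-arrows + 1 enter
        for state, extra in moves:
            if cost + extra < best.get(state, float("inf")):
                best[state] = cost + extra
                heapq.heappush(heap, (cost + extra, state))
    return None
```

On the “optic nerve head” example this search finds the seven typed characters plus two enters, i.e. nine keystrokes.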
Results
The mean number of characters per target on the target list is 13.30 (standard deviation: 4.79). On the target list, the non-semantic HE and MWM both score κ = 6.71 (no significant difference; paired, 2-tailed t-test, p = .05). Thus, both autocompletion algorithms save about 50% of keystrokes.
In the first experiment, we vary dimension (i), the context type, and dimension (ii), the number of context terms n = 4, 8, 12 (except for static contexts, which are independent of n). The λ scores are given in Table 1. For instance, in abstract-based contexts with 4 context terms, HE has λ = .97; in the same setting, MWM has λ = .93. Colloquially speaking, abstract-based contexts save 3% of keystrokes for HE and 7% for MWM in addition to the 50% they already save (corresponding to 0.2 and 0.5 keystrokes, respectively).
Table 1.
The λ score and standard deviation (to the right of ±) for varying context types and n; static contexts are independent of n.

| Context type | HE, n = 4 | HE, n = 8 | HE, n = 12 | MWM, n = 4 | MWM, n = 8 | MWM, n = 12 |
|---|---|---|---|---|---|---|
| Abstract | .97±.09 | .97±.09 | .97±.09 | .93±.13 | .91±.14 | .90±.15 |
| Journal | .98±.06 | .98±.07 | .98±.08 | .97±.08 | .97±.09 | .95±.10 |
| Static | .98±.06 (all n) | | | .88±.13 (all n) | | |
| Random | .97±.20 | .98±.20 | .98±.20 | .97±.22 | .97±.22 | .97±.22 |
We conclude the following (claims of statistical significance are with respect to a 1-tailed t-test, p = .05):

- Semantic enhancement significantly improves the base algorithm for every context type (λ < 1).
- The number of context terms n does not affect HE. It does affect MWM, especially in abstract-based contexts: λ decreases significantly as n increases.
- For every n and context type, HE is significantly outperformed by MWM.
- Abstract-based contexts produce the smallest λ for HE. MWM performs best in abstract-based and static contexts.
Not shown in Table 1 is that only 2 to 6% of the targets require more keystrokes under semantic autocompletion than under the non-semantic counterpart.
In the previous experiment, target and context terms were selected from the body structure, disorder and finding terms, in any configuration. The second experiment restricts the terms’ semantic categories (dimension (iii)). For each target term of category B (body structure), we construct contexts of terms (a) of the same category B; (b) of the other categories D+F (disorder and/or finding); and (c) of all categories, B+D+F. Likewise for targets of categories D and F. Again, the target is never among the context terms.
Table 2 gives the results of the second experiment, which considers abstract-based and static contexts only. The row in which neither the target’s nor the context terms’ categories are restricted (B+D+F twice) is copied from Table 1 for comparison. The smallest λ values are obtained in scenario (a), in which the target’s category matches that of the context terms. If the target’s category does not match that of the context terms (scenario (b)), the results are considerably poorer. These findings match our expectations. In static contexts of scenario (b), semantic enhancement even results in poorer autocompletion than non-semantic MWM (λ > 1).
Table 2.
The first two columns give the categories of the target and context terms, respectively. The columns marked “w/o sem.” give the κ of the non-semantic algorithm as a fraction of the average target length. The remaining columns give the λ scores and standard deviations of the semantic algorithms, for abstract-based (abstr.) and static contexts.
| Target’s category | Context terms’ category | HE, w/o sem. | HE, abstr. n = 4 | HE, abstr. n = 8 | HE, abstr. n = 12 | HE, static | MWM, w/o sem. | MWM, abstr. n = 4 | MWM, abstr. n = 8 | MWM, abstr. n = 12 | MWM, static |
|---|---|---|---|---|---|---|---|---|---|---|---|
| B+D+F | B+D+F | .50 | .97±.09 | .97±.09 | .97±.09 | .98±.06 | .50 | .93±.13 | .91±.14 | .90±.15 | .88±.13 |
| Body str. | B+D+F | .49 | .96±.09 | .96±.09 | .96±.09 | .98±.05 | .49 | .88±.14 | .85±.15 | .82±.15 | .92±.10 |
| Body str. | Body str. | .47 | .94±.09 | .94±.09 | .94±.09 | .95±.08 | .48 | .88±.13 | .85±.14 | .83±.14 | .84±.11 |
| Body str. | D+F | .51 | .98±.07 | .98±.08 | .98±.08 | .98±.05 | .51 | .93±.11 | .92±.13 | .91±.13 | 1.51±.51 |
| Disorder | B+D+F | .50 | .97±.09 | .96±.09 | .97±.09 | .99±.03 | .50 | .93±.13 | .91±.14 | .90±.15 | .88±.12 |
| Disorder | Disorder | .42 | .94±.12 | .93±.12 | .93±.11 | .98±.05 | .43 | .91±.13 | .89±.14 | .89±.14 | .82±.14 |
| Disorder | B+F | .55 | .99±.07 | .99±.08 | .99±.08 | 1.01±.09 | .54 | .99±.08 | .98±.10 | .98±.10 | 1.14±.22 |
| Finding | B+D+F | .56 | .99±.08 | .99±.09 | .99±.09 | .96±.07 | .55 | .97±.10 | .96±.11 | .95±.12 | .86±.13 |
| Finding | Finding | .59 | .92±.15 | .92±.16 | .92±.16 | .96±.07 | .57 | .89±.16 | .88±.17 | .87±.18 | .82±.14 |
| Finding | B+D | .50 | .98±.05 | .97±.06 | .97±.07 | .96±.07 | .52 | .97±.07 | .96±.08 | .96±.08 | 1.10±.15 |
Discussion
Table 1 shows that semantic enhancement saves 2 to 12% of keystrokes in addition to the savings obtained by the non-semantic algorithm. If the category of the context terms matches the category of the target term, semantic autocompletion algorithms save 5 to 18%, see Table 2.
Random contexts save between 2 and 3% of keystrokes, see Table 1. This may be due to the fact that they contain body structure, disorder and finding terms only. We found that after spreading random contexts, 97% of the activated terms still belong to one of these categories. Since the targets are drawn exclusively from these categories, a random context is biased towards the target’s category.
HE is largely unaffected by semantic enhancement: its λ scores for the other context types are comparable to those for random contexts (albeit with considerably lower standard deviations). This is caused by the fact that, after propagating all activations upward, nodes high up in the prefix tree mostly have high, if not maximal, activation, which renders the propagated activation values meaningless. Recall that HE’s suggestions result from inspecting the terms that start with the query and grouping them together, and that activation is propagated upward from the leaves. A considerable fraction of terms (up to 17%) is activated after spreading.
Semantic enhancement improves MWM. Journal-based contexts perform only slightly better than random contexts. If we extract terms from the abstracts, which constitute the target’s immediate environment, performance increases considerably, saving up to 10% of keystrokes compared to non-semantic MWM. Static contexts appear to be the best context type for MWM: if we know the target’s category (scenario (a)), they save up to 18% of keystrokes. However, if the static context activates the terms of the wrong category (scenario (b)), the average minimal number of keystrokes increases by up to 51%.
Which context type is most appropriate for a given application depends on the application’s prerequisites. Static contexts assume that we know the target’s category; abstract-based contexts do not. On the other hand, abstract-based contexts require access to the target’s immediate environment. In practice it may not always be clear what this environment is, and even when it is, extracting its terms from the available IT systems may be problematic.
The performance of HE and MWM has been optimized in numerous development cycles. Thus, we are tempted to believe that, for MWM in particular, 7 to 18% keystroke reduction in addition to the base algorithm’s reduction is considerable. In informal evaluations of MWM we have seen that users recognize that the suggested terms are related to what they have been told is the context (e.g. “breast cancer”).
The κ metric quantifies keystroke reduction. We mentioned two other advantages of autocompletion (user experience and compliance with standardized terminology). Even though κ does not measure these, we believe they are correlated with κ. Thus, we believe that κ is a relevant metric for evaluating autocompletion algorithms.
Conclusion
We conducted experiments that measure the benefits of taking into account a semantic representation of the context. The definition of the experiment allowed us to quantify the benefits of semantic autocompletion in terms of saved keystrokes. We are not aware of a similar quantitative approach to measuring the performance of autocompletion algorithms.
Our experiments show that semantically enhancing syntactic autocompletion algorithms generally saves keystrokes, but that the extent of the savings depends on the syntactic autocompletion algorithm and on various aspects of the context, such as its size and semantic category. Further research is required to translate these results to a clinical application. This entails understanding the application’s query domain, its context, and how the context can best be captured as a set of SNOMED CT concepts.
We believe that the setup of our experiments may be an interesting test bed for researching the potential of knowledge representation and semantic reasoning techniques for autocompletion. In particular, the setup allows for large-scale experiments with quantitative outcomes that can be compared against purely syntactic autocompletion algorithms, i.e. algorithms that are not semantically enhanced.
References
- 1. Stearns MQ, Price C, Spackman KA, Wang AY. SNOMED clinical terms: overview of the development process and project status. Proc AMIA Symp. 2001:662–666.
- 2. Bast H, Weber I. Type less, find more: fast autocompletion search with a succinct index. Proc ACM SIGIR. 2006:364–371.
- 3. Hyvönen E, Mäkelä E. Semantic autocompletion. Proc ASWC. 2006:4–9.
- 4. Amin A, Hildebrand M, van Ossenbruggen J, Evers V, Hardman L. Organizing suggestions in autocompletion interfaces. ECIR 2009, LNCS 5478. 2009:521–529.
- 5. Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. Cambridge University Press; 2008.
- 6. Finkelstein L, Gabrilovich E, Matias Y, et al. Placing search in context: the concept revisited. Proc WWW10. 2001:406–414.
- 7. Collins AM, Loftus EF. A spreading-activation theory of semantic processing. Psychological Review. 1975;82(6):407–428.
