Abstract
The idea behind any auto-complete feature is to suggest words or phrases that are likely to complete what the user is typing. The user can then select a suggestion instead of typing it in full. This feature is widely regarded as having enhanced the usability of web interfaces. Browsers typically implement it by caching, on the client side, a fixed number of queries previously entered by the user. Google Suggest (http://ajaxpatterns.org/Suggestion) is an application that replaces a browser’s auto-complete feature with one specific to Google searching. Implementing a Google Suggest-like feature requires a large server infrastructure, because essentially every keystroke triggers a database query. Applications without access to such an infrastructure can still implement the feature by creating a small list that contains the most likely suggestions for that application. In this poster we describe the methodology used to create such a list of suggestions for the UMLS Knowledge Source Server (UMLSKS) interface. We believe that this methodology can be used by any web-based application to create a list of suggestions.
Introduction
The auto-complete feature supported by most browsers today is client specific. Whenever a user starts to type a query in a web interface, the browser presents a subset of that user’s previously typed queries. This subset is determined by matching the sequence of letters typed so far against all of the previously typed queries; only those queries that contain the sequence are displayed. As the user types more letters, the list of possible suggestions therefore becomes shorter. Google came up with an innovative way to implement an auto-complete feature that is application specific. In this approach, every time a user types a letter in a web interface, a request is made to the server, which generates a suggestion list specific to that application. For the feature to be useful, the list from the server has to be displayed in the browser very quickly in response to the user’s keystrokes, almost giving the illusion that the list is available locally on the client. Google implemented this feature using a design approach called AJAX (Asynchronous JavaScript and XML). AJAX takes advantage of two browser features that had been overlooked for years by web developers: the ability to make requests to the server without reloading the entire web page, and the ability to parse and work with XML documents. With this mechanism available, the remaining challenge is to create a useful list of suggestions.
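As an illustration of the client-side matching described above, the following minimal Python sketch filters a user's query history by the letters typed so far. It is not taken from any browser's implementation; the function name, the case-insensitive comparison, and the sample history are assumptions made for the example.

```python
def suggest_from_history(history, typed):
    """Return the previously typed queries that contain the letters typed so far.

    Hypothetical helper: the case-insensitive match and the empty result
    for empty input are assumptions, not documented browser behavior.
    """
    needle = typed.lower()
    if not needle:
        return []
    return [query for query in history if needle in query.lower()]


# Made-up history; the suggestion list shrinks as more letters are typed.
history = ["heart attack", "heart failure", "hearing loss", "diabetes"]
print(suggest_from_history(history, "hea"))      # three matches
print(suggest_from_history(history, "heart"))    # two matches
print(suggest_from_history(history, "heart f"))  # one match
```

Because every extra letter can only remove candidates from the list, the list of suggestions never grows as the user keeps typing.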
Methodology
As part of the UMLSKS logging system we store the details of every query made by a user. These details include the query term, the term matching method (exact match, normalized string index, word index, etc.) and a flag indicating whether the query was successful. We extracted a list of all queries made to the UMLSKS from 2003 to 2005. From this list we eliminated all queries made by NLM users, as most of those queries were test queries. We further eliminated all queries in which the query term was a concept unique identifier (CUI), e.g. query term=C0001175. We also eliminated all queries for which no results were found using any of the matching methods. We then lowercased all the query terms and sorted the list alphabetically. From this list we created a new list that contained, for each query term, the term frequency count and the total number of unique users who had used that query term. Since the goal of this method is to create a list of the most likely suggestions for all users of the UMLSKS, we discarded all queries with a term frequency count of one and a user count of one. These steps reduced the size of our list by about 90%. For the remaining list we needed a metric to rank each term based on frequency of usage. We used the common tf-idf weight (term frequency-inverse document frequency) to compute the relevancy of each query term, substituting users for documents. A high tf-idf weight indicates a term with a high term frequency that is nevertheless used by only a small share of users. Since we wanted to rank terms by how widely they are used, a term with a lower weight received a higher ranking. We then sorted this list alphabetically on the query term and in descending order of ranking. To reduce the load on the backend server we also required that a minimum of four characters be typed before any suggestions are displayed. Initial feedback from users on the list of suggestions generated by this method has been very encouraging.
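The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the UMLSKS: the record layout (user_id, query_term, succeeded, is_nlm_user), the function names, the weight formula tf * log(N / uf) with users in place of documents, the prefix matching, and the ordering of the returned suggestions are all assumptions, since the text does not specify them.

```python
import math
import re
from collections import defaultdict

# A CUI is the letter C followed by seven digits, e.g. C0001175.
CUI_PATTERN = re.compile(r"^C\d{7}$", re.IGNORECASE)
MIN_CHARS = 4  # minimum number of typed characters before suggesting


def build_suggestions(log_records):
    """Build a ranked suggestion list from query log records.

    Each record is a hypothetical tuple:
    (user_id, query_term, succeeded, is_nlm_user).
    """
    term_count = defaultdict(int)   # term frequency (tf)
    term_users = defaultdict(set)   # distinct users per term
    all_users = set()
    for user_id, term, succeeded, is_nlm_user in log_records:
        if is_nlm_user or not succeeded:
            continue                # drop test queries and failed queries
        term = term.strip()
        if CUI_PATTERN.match(term):
            continue                # drop queries whose term is a CUI
        term = term.lower()
        term_count[term] += 1
        term_users[term].add(user_id)
        all_users.add(user_id)

    n_users = len(all_users)
    weighted = []
    for term, tf in term_count.items():
        uf = len(term_users[term])
        if tf == 1 and uf == 1:
            continue                # drop one-off queries
        # Assumed tf-idf variant, with users substituted for documents.
        weight = tf * math.log(n_users / uf)
        weighted.append((weight, term))
    weighted.sort()                 # lower weight ranks higher; ties alphabetical
    return [term for _, term in weighted]


def suggest(ranked_terms, typed, limit=10):
    """Return up to `limit` ranked terms that start with the typed text."""
    prefix = typed.lower()
    if len(prefix) < MIN_CHARS:
        return []
    return [t for t in ranked_terms if t.startswith(prefix)][:limit]


# Made-up log records for illustration only.
records = [
    ("u1", "Heart Attack", True, False),
    ("u2", "heart attack", True, False),
    ("u3", "heart attack", True, False),
    ("u1", "heart failure", True, False),
    ("u1", "Heart Failure", True, False),
    ("u4", "heartburn", True, False),      # used once by one user: dropped
    ("u2", "C0001175", True, False),       # CUI: dropped
    ("u9", "heart block", True, True),     # NLM user: dropped
    ("u3", "heart xyz", False, False),     # no results: dropped
]
ranked = build_suggestions(records)
print(ranked)                   # ['heart attack', 'heart failure']
print(suggest(ranked, "hea"))   # [] (fewer than four characters)
print(suggest(ranked, "hear"))  # ['heart attack', 'heart failure']
```

In this sample, "heart attack" is used by three of the four remaining users and so receives a lower weight than "heart failure", which a single user entered twice; under the lower-weight-first rule it is therefore suggested first.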
Future work
We want to extend this work to take advantage of term matching methods and user profiles to tailor suggestion lists for individual users. We also plan to turn this method into a generic open-source tool to be distributed with the SPECIALIST NLP tools.