Author manuscript; available in PMC: 2009 Oct 20.
Published in final edited form as: J Biomed Inform. 2008 Dec 30;42(5):831–838. doi: 10.1016/j.jbi.2008.12.006

Improving accuracy for identifying related PubMed queries by an integrated approach

Zhiyong Lu 1,*, W John Wilbur 1
PMCID: PMC2764279  NIHMSID: NIHMS143932  PMID: 19162232

Abstract

PubMed is the most widely used tool for searching biomedical literature online. As with many other online search tools, a user often types a series of multiple related queries before retrieving satisfactory results to fulfill a single information need. Meanwhile, it is also a common phenomenon to see a user type queries on unrelated topics in a single session. In order to study PubMed users’ search strategies, it is necessary to be able to automatically separate unrelated queries and group together related queries. Here, we report a novel approach combining both lexical and contextual analyses for segmenting PubMed query sessions and identifying related queries and compare its performance with the previous approach based solely on concept mapping.

We experimented with our integrated approach on sample data consisting of 1,539 pairs of consecutive user queries in 351 user sessions. The prediction results of 1,396 pairs agreed with the gold-standard annotations, achieving an overall accuracy of 90.7%. This demonstrates that our approach is significantly better than the previously published method. By applying this approach to a one day query log of PubMed, we found that a significant proportion of information needs involved more than one PubMed query, and that most of the consecutive queries for the same information need are lexically related. Finally, the proposed PubMed distance is shown to be an accurate and meaningful measure for determining the contextual similarity between biological terms. The integrated approach can play a critical role in handling real-world PubMed query log data as is demonstrated in our experiments.

Keywords: PubMed Distance, Related Query, PubMed Query Log, Session Segmentation, Lexical Similarity, Contextual Similarity

1 Introduction

PubMed is the most widely used tool for searching biomedical and life science literature online. Since the beginning of 2007, there have been about three to four million user queries to PubMed each day. Query logs are widely studied in the general information retrieval domain as they are key data for understanding the intent of user information needs. Although there is a large interest in analyzing commercial Web search engine query logs (Silverstein et al., 1998; Chau et al., 2005; Beitzel et al., 2007), studies on the query logs of PubMed are almost absent in the literature. This is perhaps largely because of the lack of publicly available data. Due to the user privacy policy of the National Library of Medicine (NLM), the agency which collects and manages the query logs of PubMed, PubMed query logs are not generally released for public use. The only publicly available data are a digested form of a day’s worth of queries to PubMed and can be accessed from the NLM website 1. The data file contains information in three columns, separated by a pipe symbol as shown in Table 1. The first column contains user information. In Table 1, multiple queries from the same user are listed together. The second column is a time stamp, which is the number of seconds since midnight. The third column is query text that varies significantly in length and complexity. For example, the first user in Table 1 issued three queries. In this work, we call this a single user session, which includes all of the queries issued by a single user in one or more information-seeking tasks.

Table 1.

Query log example. The one day query log file provided by the NLM has three columns: unique user identification, time stamp, and query text, separated by the pipe (|) sign. Two user sessions (separated by the horizontal bar) are shown below. Each user session includes consecutive queries issued by the same user. Duplicated queries (e.g., repeated Laughter) and misspellings (e.g., genatic) appear in the log and were left uncorrected to preserve authenticity. User1: 07CBgI-IOFkIAAE9G8OEAAAAK; User2: FWIDAYIOFkIAAGjJ2roAAAAD.

User1|47178|“NEONATAL SCREENING” “BLOOD SPOT”
User1|48111|“NEONATAL SCREENING” “BLOOD”
User1|52170|“newborn SCREENING” “samples”
—–
User2|11368|Nicotine
User2|11454|Nicotine general health
User2|11454|Nicotine general health
User2|11860|smoking and lung cancer
User2|11860|smoking and lung cancer
User2|11963|nicotine lung cancer
User2|12160|nicotine lung injury
User2|12206|nicotine lungs
User2|12334|lung cancer smoking
User2|12334|lung cancer smoking
User2|12370|lungs cancer smoking
User2|12393|smoking nicotine lung cancer
User2|12393|smoking nicotine lung cancer
User2|12474|genatic modified food human
User2|12475|genatic modified food human
User2|12497|genetic modifier food human
User2|12543|genetic modified food human
User2|12543|genetic modified food human
User2|12679|brain emotions
User2|12679|brain emotions
User2|12821| Laughter
User2|12821| Laughter
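
The pipe-separated format above is straightforward to parse. A minimal sketch (the function and variable names below are our own, not part of the NLM data release):

```python
from collections import OrderedDict

def parse_log(lines):
    """Group pipe-separated log lines (user|seconds-since-midnight|query)
    into per-user sessions, preserving query order."""
    sessions = OrderedDict()
    for line in lines:
        # Split on at most two pipes so any pipe inside the query text survives
        user, stamp, query = line.split("|", 2)
        sessions.setdefault(user, []).append((int(stamp), query.strip()))
    return sessions

log = [
    'User1|47178|"NEONATAL SCREENING" "BLOOD SPOT"',
    'User1|48111|"NEONATAL SCREENING" "BLOOD"',
    "User2|11368|Nicotine",
    "User2|11454|Nicotine general health",
]
sessions = parse_log(log)  # two user sessions
```
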

Queries in a single user session are not necessarily related to one information need since a user may switch search topics completely in the same session. For example, the second user in Table 1 issued a total of 22 queries between 03:09:28 and 03:33:41. These 22 queries were manually categorized into three topics by Herskovic et al. (2007): smoking and lung cancer, genetic modified food human, and brain emotions. Each topic involves a sequence of related queries. The problem is further complicated when multiple users search PubMed on a public computer (e.g., in a medical library) because single user sessions are determined in terms of browser cookies in the data-gathering protocol for PubMed logs. The work presented here specifically tackles the session segmentation problem. That is, we are developing new methods to automatically identify individual groups (in a given user session) of user queries that fill unique information needs. The step of session segmentation has served as a building block in many query log based analyses (Silverstein et al., 1998; He and Göker, 2000). For instance, this step plays a critical role in finding refined queries: a process to help users refine their search by automatically suggesting alternative queries in response to a user input (Fonseca et al., 2003; Huang et al., 2003; Shi and Yang, 2007). We conduct this research as part of an on-going project for introducing a similar search assistant into PubMed.

Previous methods have attempted to separate unrelated queries in a user session by inspecting the time when the queries were issued (Silverstein et al., 1998; He and Göker, 2000; Fonseca et al., 2003; Huang et al., 2003). That is, if there is a significant time gap between the two queries, then they would be classified as unrelated. However, such a time cutoff is often difficult to determine in practice. For example, in Table 1, the two time gaps for the three different topics in the second user session are 81 seconds (between smoking nicotine lung cancer and genatic modified food human) and 136 seconds (between genetic modified food human and brain emotions), respectively. However, neither 81 nor 136 seconds is capable of consistently identifying different topics because there exist time gaps between 81 and 136 seconds (e.g., 86 seconds between Nicotine and Nicotine general health), as well as time gaps greater than 136 seconds (e.g., 142 seconds between brain emotions and Laughter) within single topics in this example.
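
The failure of any single cutoff on these gaps can be checked directly. A toy illustration using the four gaps cited above:

```python
# Gaps in seconds from Table 1: True = same topic, False = topic switch
gaps = [(86, True),    # Nicotine -> Nicotine general health
        (81, False),   # smoking nicotine lung cancer -> genatic modified food human
        (142, True),   # brain emotions -> Laughter
        (136, False)]  # genetic modified food human -> brain emotions

def works(cutoff):
    # A time-based rule calls a pair related iff its gap <= cutoff
    return all((gap <= cutoff) == related for gap, related in gaps)

# No integer cutoff classifies all four gaps correctly
no_cutoff_works = not any(works(t) for t in range(1, 1000))
```
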

More recently, several alternative approaches have been reported in the literature. Shi and Yang (2007) proposed to improve the earlier time interval algorithm by utilizing the surface similarity between adjacent queries based on the Levenshtein distance (Levenshtein, 1966), as a prior step for mining related queries from a Chinese web search engine. The most relevant work is by Herskovic et al. (2007), where the authors proposed a semantics-based algorithm for segmenting user sessions and recognizing related queries 2. To the best of our knowledge, this is the first and only report on this issue for PubMed. These authors pioneered solving the problem by evaluating the semantic distance between consecutive queries. Specifically, the two consecutive queries were first mapped to MeSH concepts (http://www.nlm.nih.gov/mesh/). Next, the semantic distance was computed as the shortest path between pairs of concepts in MeSH. They primarily relied on MetaMap (Aronson, 2001) to map queries to MeSH concepts, which could subsequently be used to infer the search topic of grouped results—an important and unique feature of their method. Hereafter, we will call their method the MetaMap approach.

Unlike the MetaMap approach, which relied solely on semantics, we developed a novel approach that integrates the results of both lexical and contextual analyses. The main contributions of our work include: First, the integrated approach substantially improves the accuracy of segmenting single user sessions and identifying related queries in PubMed. Second, the proposed approach is applicable to real-world PubMed log data. By applying this approach to the one day query log in PubMed, we found that a significant proportion of information needs involved more than one PubMed query, and that most of the consecutive queries are lexically related. Third, a novel metric for evaluating the contextual distance between consecutive queries was proposed and evaluated in this work. In our experiments, the metric proved accurate and meaningful for measuring contextual similarities between biological terms.

The lexical analysis in our approach includes two string similarity measures, one of which is based on edit distance (Myers, 1986; Ukkonen, 1985). The second measurement looks for overlapping keywords as an indication of lexical similarity.

The proposed contextual metric is an adaptation of the normalized Google distance (NGD), which uses Google page counts to measure the similarity of two words or phrases on the world-wide web and has been successfully applied in several settings, such as weighting approximate ontology matches (Gligorov et al., 2007). The NGD is computed as follows:

NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log M − min{log f(x), log f(y)})   (1)

In Equation (1), M is the total number of web pages indexed by Google. f(x) is defined as the number of pages a Google search returns for the search term x. Similarly for f(y). f(x, y) is the number of pages Google returns for searching x and y. The range of the NGD is between zero and infinity. More specifically, according to Cilibrasi and Vitanyi (2007):

  1. NGD(x, y) is undefined for f(x) = f(y) = 0;

  2. NGD(x, y) = ∞ for f(x, y) = 0 and f(x) > 0 or f(y) > 0 (or both);

  3. NGD(x, y) ≥ 0 otherwise.

Although it was originally conceived for the general domain, we adapted it to the biomedical domain by replacing the Google page counts with PubMed citation counts for biomedical search terms. The resulting metric is named the PubMed distance in this work.
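
As a sketch, the adapted metric and the three boundary cases above might be implemented as follows (the function name and count arguments are our own, not the authors' code):

```python
import math

def pubmed_distance(fx, fy, fxy, M):
    """PubMed distance: Equation (1) with Google page counts replaced by
    PubMed citation counts. fx, fy: counts for each term searched alone;
    fxy: count for both terms together; M: total number of indexed citations."""
    if fx == 0 and fy == 0:
        return None          # case 1: undefined
    if fxy == 0:
        return math.inf      # case 2: no co-occurrence at all
    lx, ly = math.log(fx), math.log(fy)
    return (max(lx, ly) - math.log(fxy)) / (math.log(M) - min(lx, ly))
```

Identically distributed terms score 0, and the score grows as the two terms co-occur less often than their individual frequencies would suggest.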

2 Methods

To identify related queries in each single user session, our system makes use of the results of both lexical and contextual analyses in a three-step process as shown in Figure 1.

Fig. 1.

Fig. 1

Three steps for determining related queries in a single user session. All six queries were selected from the second user session in Table 1 for the purpose of illustration. In the first step, we compute both lexical and contextual similarities for each pair of consecutive queries. Next, consecutive queries are classified as either related or unrelated and subsequently grouped together or put into separate groups according to similarity scores. Thus, all of the queries in each group are related and are meant to fill a single information need. Finally, consecutive groups that target the same information need are joined.

First, for each pair of two consecutive queries in a single user session, we compute separately their lexical as well as contextual similarity. Depending on the corresponding similarity scores, the two queries are classified to be either related or unrelated. Specifically, two consecutive queries are classified to be unrelated only when they are neither lexically nor contextually related. Next, according to the classification results, the two queries are either clustered together into one group (related) or put into two separate groups (unrelated). In the final step, pairs of two queries from two consecutive groups (one from each group) are compared for string similarity. If there exists a pair of two related queries, the two groups are subsequently joined. As a result, all of the queries belonging to both groups are considered to be related.

We perform two different measurements for string similarity in the lexical analysis. A pair of consecutive queries is classified as lexically related when it meets the similarity requirement in either measure. In one measurement, we make use of string edit distance (character-based), which is primarily based on the approximate string matching algorithm described in (Myers, 1986; Ukkonen, 1985). The algorithm roughly works by finding the smallest number of edits needed to change one string into the other after punctuation removal. When two strings are compared, the output is a score between 0 and 1. A value of 0 means that the strings are entirely different, while a value of 1 means that the strings are identical. Values in between indicate the degree of similarity. In this work, the threshold for string similarity was predefined to be 0.8 (see Section 4.2 for discussion). For example, the consecutive queries nicotine lung cancer and nicotine lung injury in Figure 1 were considered to be lexically related because of their similarity score of 0.8.
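
The paper's scores come from the approximate-matching algorithm of Myers and Ukkonen; as a simplified stand-in, a normalized Levenshtein similarity behaves comparably, though exact scores may differ slightly from those reported:

```python
import string

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def string_similarity(q1, q2):
    """Score in [0, 1]: 1 = identical, 0 = entirely different."""
    strip = str.maketrans("", "", string.punctuation)
    a, b = q1.translate(strip).lower(), q2.translate(strip).lower()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

With this stand-in, nicotine lung cancer vs. nicotine lung injury scores 0.75 rather than the 0.8 reported for the original algorithm.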

In the second measurement, we search for overlapping keywords as an indication of lexical similarity. Specifically, we first remove punctuation marks (e.g., double quotation marks) and stop words 3 from both queries and then search for overlapping keywords (case-insensitive) between the two query terms. If such a keyword in common can be found, we classify the two queries as lexically related. For instance, the two queries Nicotine general health and nicotine lung cancer in Figure 1 were not considered to be similar by the approximate string matching algorithm because of their low similarity score (0.605); but due to the overlapping word nicotine, we still classify the two queries as lexically related.
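
A sketch of the keyword-overlap test; the stop-word list here is a small placeholder for the actual list referenced in footnote 3:

```python
import string

# Placeholder stop list; the real system uses the list cited in footnote 3
STOP_WORDS = {"a", "an", "and", "in", "of", "the"}

def share_keyword(q1, q2):
    """True if the two queries share any non-stop-word keyword
    after punctuation removal (case-insensitive)."""
    strip = str.maketrans("", "", string.punctuation)
    def keywords(q):
        return {w for w in q.translate(strip).lower().split()
                if w not in STOP_WORDS}
    return bool(keywords(q1) & keywords(q2))
```
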

The contextual similarity is measured by the PubMed distance where the PubMed counts can be programmatically retrieved by using the Entrez Programming Utilities (eutilities) (Geer and Sayers, 2003). Like the NGD, the value of a PubMed distance is also between 0 and infinity. The smaller a PubMed distance is, the closer the two search terms are. A value of 0 means that they are identically distributed. In this study, the threshold for contextual similarity was predetermined to be 0.5 (see Section 4.2 for discussion). That is, any value below 0.5 indicates that the two queries are contextually related. For instance, PubMed returned 12,252 citations for the query brain emotions, 1,187 for Laughter and 219 for the two queries together (brain emotions) AND (Laughter). The total number of indexed citations searched by PubMed at the time (November 2007) was 17,531,670. Therefore, the PubMed distance for the two queries was computed to be 0.419, suggesting that the two queries are contextually related.
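
Plugging the reported counts into Equation (1) (with natural logarithms) reproduces the stated score; `pubmed_distance` below is our own helper name, not NLM code:

```python
import math

def pubmed_distance(fx, fy, fxy, M):
    lx, ly = math.log(fx), math.log(fy)
    return (max(lx, ly) - math.log(fxy)) / (math.log(M) - min(lx, ly))

# Counts reported in the text (November 2007 snapshot of PubMed)
d = pubmed_distance(12252,     # brain emotions
                    1187,      # Laughter
                    219,       # (brain emotions) AND (Laughter)
                    17531670)  # total indexed citations
# d ≈ 0.419, below the 0.5 threshold, so the queries are contextually related
```
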

Groups of related queries were established in the second step depending on the previously computed similarity scores. In each of those individual groups, all the queries were related and were meant to fill a single information need. For example, for the sample user session in Figure 1, the second group consisted of three related queries targeting the same information need.

The last step was used to join consecutive groups that were actually meant to target the same information need but were misclassified to be separate by the similarity scores in the previous steps. For example, after the second step in Figure 1, there existed three separate groups for the sample user session that consisted of six queries. The first group included the first query in the session while the next three queries belonged to the second group, and the last group contained the remaining two queries. This segmentation was due to the fact that both lexical and contextual analyses suggested there should be separations between the two consecutive queries Nicotine general health and smoking and lung cancer, as well as between Nicotine lung injury and brain emotions in the previous steps. However, it is obvious that all of the first four queries in this session were related and were meant to satisfy the user’s information need about ‘Nicotine’, ‘smoking’ and ‘lung cancer’. The last step of our approach provided a remedy mechanism such that all of the related queries in consecutive groups could be joined if two lexically related queries between groups were identified. In this example, because the query term Nicotine in the first group was related to nicotine lung cancer in the second group based on the overlapping word nicotine, the first four queries in this session became related and joined into one group. Note that the contextual relatedness was not used in this step due to two reasons: a) our concern for algorithm efficiency (cross group comparisons using eutilities could notably slow down the entire process); and b) both lexical and contextual similarity will be less accurate when the queries compared are more separated in time. There will be fewer truly related query pairs and more false positives with a greater time gap between the queries compared. Therefore we use only the relatively fast and conservative lexical matching. 
In the end, two unique information needs were identified in Figure 1.
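
Putting the three steps together, the session-segmentation logic can be sketched as below; `is_related` and `lexically_related` are pluggable predicates standing in for the similarity tests described above, not the authors' exact implementation:

```python
def segment_session(queries, is_related, lexically_related):
    """Segment a non-empty session into groups of related queries."""
    # Steps 1-2: split wherever consecutive queries are unrelated
    groups = [[queries[0]]]
    for prev, cur in zip(queries, queries[1:]):
        if is_related(prev, cur):
            groups[-1].append(cur)
        else:
            groups.append([cur])
    # Step 3: join consecutive groups containing any lexically related pair
    joined = [groups[0]]
    for group in groups[1:]:
        if any(lexically_related(a, b) for a in joined[-1] for b in group):
            joined[-1].extend(group)
        else:
            joined.append(group)
    return joined
```

For example, with a simple word-overlap predicate, Nicotine joins the nicotine lung cancer group in step 3 even though step 2 had separated them.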

3 Results

In order to evaluate our approach and compare it to the MetaMap approach, we requested and obtained the data set from the authors of (Herskovic et al., 2007). The given data contain a total of 2,372 non-empty queries issued in 351 individual user sessions. All of the queries were previously classified as navigational or informational by Herskovic et al. (2007). Queries were defined as navigational if they “contained only bibliographic tags (e.g., [pdat], [au]).” 514 such queries were found in the data set. Since we were comparing with the MetaMap approach, we followed their lead of only evaluating approaches on the informational queries, which are equivalent to all non-navigational queries in this study.

After removing all of the navigational queries, there remain 1,858 informational queries in 319 user sessions. The session length (i.e. the number of queries in a session) ranges from 1 to 31, with an average of 6. In each session, our three-step algorithm was applied to determine whether a pair of two consecutive informational queries were meant to fill a single information need. There are a total of 1,539 pairs to be compared. Most of them (1,336/1,539) were manually annotated as related by two of the authors in (Herskovic et al., 2007) (Interannotator agreement was 93.10%, Elmer Bernstam, personal communication). Only 13.2% (203) were manually annotated as non-related (i.e., searches on different topics were performed in one user session). Thus, by classifying every pair as related, we could obtain an accuracy of 86.8% for this sample data set. We call this the baseline approach. Note that the total number of comparisons (1,539) is smaller than the total number of informational queries (1,858) because comparisons between different sessions are excluded.

Using our three-step algorithm presented in Section 2, we made predictions for all of the 1,539 pairs. The prediction results of 1,396 pairs agreed with the gold standard, thus achieving an accuracy of 90.7%. This demonstrates that our method is superior to the baseline approach, as well as the previous MetaMap approach (82.0%, Jorge Herskovic, personal correspondence) in identifying related queries in PubMed. We made errors in prediction for 143 (9.3%) pairs. They are classified into two error groups:

  1. 106 pairs that were previously annotated as related but were predicted to be unrelated by our approach.

  2. 37 pairs that were previously annotated as unrelated but were predicted to be related by our approach.

3.1 Error analysis for the 106 related pairs

To gain a better understanding of the results and how they might be improved, we assessed the etiology of errors separately for the two groups. Our approach was unable to recognize related queries in the first group of 106 pairs because no lexical or contextual relations could be found by our metrics. Further analysis of the PubMed distance of these pairs shows that approximately 20% (21/106) of the pairs had some evidence in PubMed, but not strong enough (i.e., the PubMed distance score is greater than the threshold) for them to be considered related by our approach. For instance, one such pair is listed below:

d-6Yo4IOFpIAAD9-1ssAAAAN|75704|IDF guideline
d-6Yo4IOFpIAAD9-1ssAAAAN|75732|type diabetes

When the two query terms were searched together in PubMed, three citations were returned. Together with their individual PubMed counts, we obtained a distance score of 0.765 (greater than the predefined threshold for similarity). The remaining 80% of pairs had an infinite or undefined PubMed distance because no citation was found when their corresponding queries were searched together. For instance, in the following user session of five user queries, our algorithm predicted three breaks (a break indicates a pair of unrelated queries; shown as plus signs before the second query of each unrelated pair), as opposed to none in the gold standard. Each break was predicted because the two consecutive queries (e.g., htra2 and omi and jnk and smac) had an infinite or undefined PubMed distance.

jek0H4IOFt4AAA1WfxkAAAAJ|48201|htra2 and omi
jek0H4IOFt4AAA1WfxkAAAAJ|49367|+jnk and smac
jek0H4IOFt4AAA1WfxkAAAAJ|51442|+htra2 and RNAi
jek0H4IOFt4AAA1WfxkAAAAJ|51450|htra2 and siRNA
jek0H4IOFt4AAA1WfxkAAAAJ|51502|+jnk and smac

Although it is obvious that the third query is lexically related to the first one, the second query is lexically related to the last one, and all five queries were manually classified as related to the same topic, our approach failed to recognize these relations. This is because after the first two steps, there were already four different groups. Although we attempted to join previously misclassified groups in the third step, our approach is limited to comparing queries in adjacent groups. In this example, the first and third queries were located in two nonadjacent groups and thus could never be compared. Similarly, the last query jnk and smac was only compared with the two queries (htra2 and RNAi and htra2 and siRNA) in its preceding group, not with the second query. Therefore, no groups could be joined in the final step in this example.

3.2 Error analysis for the 37 unrelated pairs

The second part of our error assessment involved 37 query pairs. These queries were predicted to be related for three different reasons:

  • Queries were found to be lexically related, e.g., by sharing overlapping keyword(s).

  • Queries became related after their groups were joined.

  • Queries were related in context because the PubMed distance score indicated so.

Eight pairs can be classified into the first category. For example, in the following user session, there were two breaks (shown as asterisks before the queries) in the gold standard. The output of our approach disagreed with the first break in the gold standard because it found an overlapping word stress in both queries stress factors and fetal stress.

HVRHI4IOFj4AACuUVBYAAAAC|65593|stress
HVRHI4IOFj4AACuUVBYAAAAC|65713|stress in humans
HVRHI4IOFj4AACuUVBYAAAAC|65870|stress factors
HVRHI4IOFj4AACuUVBYAAAAC|65870|stress factors
HVRHI4IOFj4AACuUVBYAAAAC|65897|*fetal stress
HVRHI4IOFj4AACuUVBYAAAAC|65897|fetal stress
HVRHI4IOFj4AACuUVBYAAAAC|66218|*childbirth
HVRHI4IOFj4AACuUVBYAAAAC|66218|childbirth
HVRHI4IOFj4AACuUVBYAAAAC|66857|causes posttraumatic stress childbirth

The second category consisted of 18 pairs. For example, in the session above, the output of our algorithm also disagreed with the second break: although the consecutive queries fetal stress and childbirth were neither lexically nor contextually related according to the computed string similarity and PubMed distance scores, the two groups formed after step two (one containing the first six queries and the other the last three queries) were united in the third step because there existed two lexically related queries (fetal stress and causes posttraumatic stress childbirth share the word stress). Consequently, all the queries in the two groups were considered to be related.

The third category included 11 pairs. The queries in these pairs were predicted to be related because the corresponding query terms co-occurred in the literature at a statistically significant rate rather than randomly. We strongly suspected these were annotation mistakes in the gold standard. Thus, we (the two authors) independently judged whether the two queries in each pair should be annotated as related. Our inter-judge agreement was 100%, and the results of our judgments show that they were indeed annotation mistakes. The 11 cases, together with supporting explanations, are made available as supplementary materials and are accessible online 4 for our readers. For example, two breaks were annotated in the gold standard in the following user session:

exVnaoIOFj0AAGlGfucAAAAW|32806|sec23
exVnaoIOFj0AAGlGfucAAAAW|32849|sec13
exVnaoIOFj0AAGlGfucAAAAW|32968|*ER exit sites
exVnaoIOFj0AAGlGfucAAAAW|33006|*COP II

The results of our predictions suggested that all four queries were related to a single topic based on computed PubMed distances. The PubMed distance for sec13 and ER exit sites was 0.186. The PubMed distance for ER exit sites and COP II was 0.362. Both scores were smaller than the threshold. In biology, the COP II vesicle coat protein includes both sec13 and sec23, and it mediates the export of secretory proteins from the endoplasmic reticulum (ER) (Salama et al., 1997; Tang et al., 2001).

4 Discussion

4.1 Assessing different parts of our approach

In order to assess the contribution of each individual component of the system, we performed five experiments. In the first two, we used the lexical and contextual analyses separately. In the third experiment, we used both analyses. Next, we experimented using the lexical analysis plus the group join step, followed by a final experiment in which we used all three components. Results of the five experiments are summarized in Table 2.

Table 2.

Prediction results of five different experiments. In experiments 1 and 2, results of only the lexical or contextual analysis were used to determine if two consecutive queries were related. In experiment 3, results of both analyses were used. In experiment 4, both the lexical analysis and the group join were used. Finally, all three components were applied in experiment 5. The classification scheme for prediction errors follows our discussion in Section 3. The thresholds for lexical and contextual similarities are 0.8 and 0.5, respectively, in all of the experiments.

No Experimental Settings Correct Predictions Errors in Group 1 / 2 Accuracy

1 Lexical Analysis Only 1,357 174 / 8 0.882
2 Contextual Analysis Only 961 567 / 11 0.624
3 Lexical & Contextual Analysis 1,381 139 / 19 0.897
4 Lexical & Group Join 1,380 136 / 23 0.897
5 Lexical & Contextual & Group Join 1,396 106 / 37 0.907

The accuracies in the first two experiments show that the lexical analysis by itself is capable of achieving good performance, while this is not true for the contextual analysis using the PubMed distance. Using only the contextual analysis in experiment 2 yielded 567 errors in group 1 according to our classification scheme in Section 3. That is, 567 pairs of consecutive queries were predicted to be unrelated but were annotated as related in the gold standard. Further error analysis shows that 87% (495/567) of misclassifications were due to the fact that their corresponding PubMed distances were either infinite or undefined, i.e., these queries retrieved zero results in PubMed.

Although differing significantly in the number of errors in Group 1, both analyses are highly precise in identifying related queries, given the number of errors in Group 2 (8 for the lexical analysis and 11 for the contextual analysis). As we discussed in Section 3.2, the 11 errors made by the contextual analysis are suspicious and should be corrected. Thus, we would gain 35 correct predictions when adding the contextual analysis to the results of the lexical analysis (Experiment 3 vs. Experiment 1), as opposed to the 24 correct ones and 11 errors shown in Table 2. It is worth noting that these 35 query pairs were correctly predicted solely on the basis of the contextual analysis. Since each prediction was discrete and essentially independent of the others, we performed a Binomial distribution-based statistical test (Brownlee, 1965); the result showed that the precision for recognizing contextually but not lexically related queries by the PubMed distance measure is at least 0.918 with a confidence of 95%.
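
A quick way to reproduce the reported bound, assuming all 35 of those contextual-only predictions were correct: the one-sided 95% lower confidence limit for the precision p solves p^35 = 0.05.

```python
# With n successes in n independent Bernoulli trials, the one-sided 95%
# lower confidence bound on the success probability p is the value for
# which observing n straight successes has probability 0.05: p**n = 0.05.
n = 35
p_lower = 0.05 ** (1 / n)
# p_lower ≈ 0.918, matching the precision bound quoted in the text
```
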

The overall accuracy was enhanced to 90.7% in the last experiment, when both the contextual analysis and the group joining step were added to the lexical analysis. Using only the lexical analysis in experiment 1, 182 errors (174 in Group 1 and 8 in Group 2) were found. This number was substantially reduced to 143 in the last experiment. This suggests that the contextual analysis and group joining are useful and complementary additions to the lexical analysis.

4.2 Effects of threshold choice on system performance

There are two thresholds we predetermined during system development. One is the threshold for string similarity and the other is involved with the PubMed distance.

We predefined a threshold of 0.8 when determining if two queries are similar as strings. Results reported in Section 3 used this threshold. During error analysis, we experimented with a spectrum of different thresholds for string similarity. The red dashed line in Figure 2 shows the total number of errors under different thresholds. Overall, the number of errors remained almost steady for thresholds greater than 0.5, while it became much more sensitive when the threshold was set to below 0.5. The best performance (i.e., the smallest number of errors) was obtained when the threshold was set to be 0.6.

Fig. 2.

Fig. 2

The number of errors under different thresholds. The red dashed line represents performance changes under different thresholds for string similarity while the PubMed distance threshold is held at 0.5. The green solid line represents performance changes under different thresholds for contextual similarity while the lexical similarity threshold is held at 0.8.

The value of 0.5 was chosen as a threshold for the PubMed distance because we followed the lead of a previous observation of the Google distance in (Gligorov et al., 2007). The complete analysis under different thresholds for the PubMed distance is presented as the green solid line in Figure 2. Unlike the effects of choosing different thresholds for string similarity, the different thresholds for the PubMed distance had only a slight impact on the overall performance. As can be seen from Figure 2, the total number of errors ranges from 139 (T=0.7) to 159 (T=0.3) for most of the threshold values. In this work, when the PubMed distance was computed to be infinite or undefined, we simply assigned it a value of one. Therefore, when the threshold was set to one (i.e., T=1 in Figure 2), all queries were classified as related (i.e., none of the 203 breaks in the data were predicted; the same as the baseline approach).

4.3 Comparing with other statistics for determining contextual similarity

First, we compared the use of the PubMed distance versus the Google distance. That is, instead of using the number of citations returned by PubMed, we searched the terms in Google and subsequently used Google page counts through Google’s SOAP search API (http://code.google.com/api/soapsearch).

Whereas PubMed searches roughly 17 million scientific citations in the biomedical and life sciences, Google indexes and searches billions of pages across all disciplines and in different genres (e.g., webpages, publications, slide presentations). Hence, Google is expected to find and return results much more frequently than PubMed.

Figure 3 compares the number of errors when using the Google distance (orange dotted line) to that when using the PubMed distance (green solid line). Comparing the two lines reveals that a) in general, using the PubMed distance results in fewer errors (i.e., better performance); b) the Google distance is more sensitive to the choice of threshold; and c) with Google, the best performance was found at a threshold of 0.3, as opposed to 0.7 with PubMed. Beyond the lack of any performance gain, the limited availability of the Google API utility also prevents widespread use of the Google distance in practice. This publicly available utility limits ordinary users to 1,000 Google searches per day. Computing the distance for the 1,539 pairs in the sample data required 4,617 Google searches, because each pair demanded three searches; we therefore needed five days to complete the computation. For the analysis of an entire day or week of query logs discussed in Section 4.4, this method would not be appropriate. In contrast, the E-utilities provided by NCBI (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) are free and support unrestricted access for the general public.
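The three hit counts per pair can be obtained through the ESearch E-utility, which returns a `<Count>` element; requesting `rettype=count` asks for the count alone. A sketch that builds the request URL and parses the count from a response (the sample XML string stands in for a live call):

```python
import re
from urllib.parse import quote_plus

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_count_url(term: str) -> str:
    """Build an ESearch URL that asks only for the PubMed hit count."""
    return f"{EUTILS}?db=pubmed&term={quote_plus(term)}&rettype=count"

def parse_count(xml_text: str) -> int:
    """Extract the first (top-level) <Count> element from an ESearch
    XML response."""
    m = re.search(r"<Count>(\d+)</Count>", xml_text)
    if m is None:
        raise ValueError("no <Count> element in response")
    return int(m.group(1))

# A live call would be, e.g.:
#   urllib.request.urlopen(esearch_count_url("p53 AND apoptosis")).read()
sample = "<eSearchResult><Count>1234</Count></eSearchResult>"
```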

Fig. 3.

The comparison of the PubMed distance versus the Google distance. The data presented here used the whole system (i.e., experimental setting No. 5 in Table 2) with a threshold of 0.8 for the lexical similarity.

In addition to comparing with the Google distance, we also computed the cumulative hypergeometric probability using the PubMed counts and applied the corresponding P-value for determining contextual similarity. A hypergeometric probability is the probability associated with a hypergeometric experiment; in this study it involves four parameters:

  • N: The number of items in the population – the total number of indexed citations in PubMed.

  • k: The number of items in the population classified as successes – the PubMed citation count for the first query.

  • n: The number of items in the sample – the PubMed citation count for the second query.

  • x: The number of items in the sample classified as successes – the PubMed citation count for the Boolean AND of the two queries.

The hypergeometric probability and the corresponding P-value are defined in the following formulas:

Pr(x; N, k, n) = C(k, x) * C(N - k, n - x) / C(N, n)   (2)

P-value = Pr(X >= x; N, k, n)   (3)

where C(a, b) denotes the binomial coefficient "a choose b".

The P-value is the cumulative hypergeometric probability that PubMed returns x or more citations when the combined query is searched. If the P-value for two queries was no greater than 0.05, we predicted them to be related. Using the P-value, a total of 142 errors were found: 88 query pairs were predicted to be unrelated but annotated as related, and 54 vice versa. This shows that the overall performance of using the P-value is comparable to that of using the PubMed distance.
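The P-value in Eqs. (2) and (3) can be computed exactly from the four counts with Python's `math.comb` (3.8+); the function name is ours:

```python
from math import comb

def hypergeom_pvalue(N: int, k: int, n: int, x: int) -> float:
    """Upper-tail cumulative hypergeometric probability P(X >= x)
    for a population of N items with k successes and a sample of
    size n, as in Eqs. (2) and (3)."""
    denom = comb(N, n)
    total = 0
    # X cannot exceed the sample size n or the number of successes k.
    for X in range(x, min(k, n) + 1):
        total += comb(k, X) * comb(N - k, n - X)
    return total / denom

# Two queries are predicted related when the P-value <= 0.05.
```

For realistic PubMed-scale N (tens of millions), an exact sum over large binomial coefficients is slow; in practice one would use a library routine such as SciPy's hypergeometric survival function instead.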

4.4 Applying our approach to the one day query log

As mentioned earlier, the work presented here is one step in a larger investigation in which we plan to compile statistics on users’ search strategies and use them to refine queries for better retrieval. The sample data included only 1,858 user queries, less than 0.1% of PubMed’s total volume for one day. To study general user search behavior, we therefore applied our approach to a full day’s worth of PubMed query data. Following (Herskovic et al., 2007), we excluded users who issued more than 50 queries in 24 hours, as they could represent programmatic searchers. Additionally, we removed 31,851 empty queries (where the user entered no search terms). After this preprocessing, 2,657,315 queries issued by 611,083 uniquely identified users remained. Note that unlike the MetaMap approach, our approach does not require filtering out navigational queries, a critical feature for handling real-world data in practice.
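The preprocessing steps can be sketched as follows; the representation of the log as `(user_id, query)` pairs and the choice to drop empty queries before counting per-user volume are our assumptions:

```python
from collections import Counter

MAX_QUERIES_PER_DAY = 50  # heavier users may be programmatic searchers

def preprocess(log):
    """Filter a one-day query log given as (user_id, query) pairs:
    drop empty queries, then drop every query from users who issued
    more than MAX_QUERIES_PER_DAY queries (Herskovic et al., 2007)."""
    nonempty = [(u, q) for u, q in log if q.strip()]
    per_user = Counter(u for u, _ in nonempty)
    return [(u, q) for u, q in nonempty
            if per_user[u] <= MAX_QUERIES_PER_DAY]
```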

Our approach identified a total of 1,038,684 information needs. 446,582 users (73%) searched for a single information need, while the remaining users switched search topics during their sessions. In addition, 558,622 information needs (54%) were satisfied by a single query, while the rest involved multiple queries (3 queries on average). Table 3 compares our analysis results against previously reported data. We identified more information needs, but a smaller proportion of users with a single information need and of information needs with a single query.

Table 3.

Comparing analysis results on the one day query log data in this work to those in Herskovic et al. (2007).

Comparison                                Our Work     Herskovic et al.
Total number of information needs         1,038,684    740,215
% of users with a single info need        73%          ~90%
% of info needs with a single query       54%          63%

5 Conclusions and Future Work

The major goal of this work was to develop an accurate methodology for identifying related queries in user sessions. We achieved this with an integrated approach that relies primarily on lexical analysis; incorporating the contextual information of the query terms further enhances performance. As a result, our integrated approach significantly reduced the number of incorrect classifications compared to both the MetaMap approach (described in Herskovic et al. (2007)) and the baseline approach (described in Section 2).
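The integrated decision can be sketched as a simple rule over the two scores; the function name and the exact precedence (lexical test first, contextual test as fallback) are our illustrative assumptions, with the thresholds taken from Section 4.2:

```python
def related(lexical_sim: float, pubmed_dist: float,
            lex_threshold: float = 0.8,
            dist_threshold: float = 0.5) -> bool:
    """Integrated rule: two consecutive queries are judged related
    if they are similar as strings or, failing that, if their
    PubMed distance indicates contextual similarity."""
    if lexical_sim >= lex_threshold:
        return True
    return pubmed_dist <= dist_threshold
```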

5.1 Improving the PubMed distance by retrieving more relevant citations

As we have illustrated, the PubMed distance is an accurate and meaningful metric for determining contextual similarity between two consecutive PubMed queries, provided that relevant citations can be retrieved. However, the fact that some queries return no citations in PubMed limits its usage. Two techniques could help alleviate this problem in future work:

  1. Searching queries in full-text articles

  2. Using concept recognition techniques

The first technique was inspired by manually inspecting results returned by Google. We noticed that Google was able to return results when nothing was found in PubMed, and that many top Google hits were direct links to papers that mention the query terms in the body of the paper. In contrast, PubMed missed these papers because its searches were constrained to match text words in the title and abstract. For example, the top three hits for the query proteophosphoglycan and nucleus in Google link to three different publications, yet none of them were found in PubMed because one of the search words, nucleus, never occurred in the abstracts of those papers. Searching queries in full text can be realized by searching keywords in PubMed Central (PMC), a free digital archive of biomedical and life sciences journal literature. Because participation was voluntary in the past (a recent law makes it mandatory to deposit papers from NIH-funded research into PMC), the number of full-text articles archived in PMC is relatively small (over 1.2 million as of Dec 26, 2007). Although this is a small percentage (7%) of the PubMed records, these papers are all archived in full text. The number of unique words is about 2.5 million in PubMed but several times more in PMC. Therefore, PMC provides a potential remedy for cases in which no relevant articles can be found in PubMed.

The second approach for improving the probability of retrieving relevant articles in PubMed is to use concept recognition techniques, which first map keywords to biomedical concepts and then search the literature for those concepts rather than the keywords themselves. Integrating concept recognition was shown to be beneficial in previous studies of information retrieval (Hersh and Bhupatiraju, 2003) and information extraction (Baumgartner et al., 2008). The automatic term mapping feature of PubMed (mapping text to MeSH concepts) can be considered one such technique, and it has been experimentally demonstrated to be useful in our own experience (Lu et al., 2008) when compared with strictly searching keywords in text. In the work of Herskovic et al. (2007), MetaMap (Aronson, 2001) was used to recognize UMLS concepts. Similar approaches can be explored in the future to retrieve more relevant results in PubMed.

5.2 Implications of the work reported here

As we have demonstrated, the proposed approach improves on the previous MetaMap and baseline approaches in identification accuracy. Moreover, unlike the other two approaches, it is applicable in realistic situations, as shown in Section 4.4. Our approach therefore makes a deeper query log analysis possible. For example, recognizing related queries is a prerequisite for query refinement: a process for recommending new terms in response to a user’s input. As shown in previous studies on query suggestion (Fonseca et al., 2003; Huang et al., 2003; Shi and Yang, 2007), building such an application involves two separate steps: a) related queries are first identified in single user sessions, and b) refined queries are then extracted by applying machine learning and data mining algorithms to the queries identified in the previous step. Our ongoing research aims to build such an application for PubMed. Specifically, we will focus on experimenting with different learning algorithms on the related queries pre-identified (by the approach proposed here) in segmented PubMed sessions. Since lexical analysis showed excellent results in this study, one simple and direct refinement strategy would be to select new terms lexically related to the original user query, for example by adding a few relevant words. If the user query is diabetes, one of the refined queries could be type 2 diabetes, a specific type of diabetes. Although PubMed has no such feature, many commercial search engines such as Yahoo and Google have already implemented similar strategies.
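A toy version of this lexical refinement strategy: propose previously logged queries that extend the user's query as a substring, shortest extensions first. The function name, the substring criterion, and the length-based ranking are our illustrative assumptions, not the algorithm planned for the actual system:

```python
def suggest_refinements(query: str, logged_queries, limit: int = 5):
    """Suggest logged queries that lexically extend the input query
    (e.g., 'diabetes' -> 'type 2 diabetes'), shortest first."""
    q = query.lower().strip()
    hits = sorted({p.lower().strip() for p in logged_queries
                   if q in p.lower() and p.lower().strip() != q},
                  key=len)
    return hits[:limit]
```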

Another important contribution of this work is the introduction and evaluation of the PubMed distance, a measure of contextual similarity based on PubMed counts. In addition to enhancing accuracy, the PubMed distance is capable of revealing implicit relationships that would otherwise be missed. Take the 11 human annotation errors (see Section 3.2) for example: it was difficult for humans to judge them correctly because the evidence for their relationships was not obviously presented in the queries but rather hidden in PubMed documents. This unique ability of the PubMed distance has implications for uncovering novel relationships and new hypotheses through the automatic recognition of contextual relationships in text, a challenge for the text mining field (Altman et al., 2008). While there are fewer such results than those produced by the lexical similarity measures, they comprise a less obvious and more interesting set of results. Although we only demonstrated the use of the PubMed distance for identifying related queries, its ability to determine contextual similarity can certainly be applied in many other situations; for example, it could be used to detect biologically significant relations between genes, or between genes and diseases, in the literature. This measure is superior to other co-occurrence methods in that it takes advantage of PubMed’s automatic query expansion strategy, its computation is relatively straightforward, and the computed score can accurately distinguish statistically significant correlations from randomly co-occurring items.

Supplementary Material

supplementary file

Acknowledgements

The authors would like to thank Jorge R. Herskovic and Elmer V. Bernstam for their previous work that motivated this study and their kindness for providing the test data and results used in their work for us to compare with. We are grateful to Larry Hunter for bringing the Google distance metric to our attention and Kevin B. Cohen for his critical review of this manuscript. We also thank Bill Baumgartner for proofreading this manuscript. This research was funded by the Intramural Research Program of the NIH, National Library of Medicine.

Footnotes

2

In this article, we use ‘segmenting user sessions’, ‘recognizing related queries’, and ‘separating unrelated queries’ interchangeably.

References

  1. Altman RB, Bergman CM, Blake J, Blaschke C, Cohen A, Gannon F, Grivell L, Hahn U, Hersh W, Hirschman L, Jensen L, Krallinger M, Mons B, O'Donoghue S, Peitsch M, Rebholz-Schuhmann D, Shatkay H, Valencia A. Text mining for biology - the way forward: opinions from leading scientists. Genome Biology. 2008;9 Suppl 2:S7. doi: 10.1186/gb-2008-9-s2-s7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Aronson AR. Effective mapping of biomedical text to the UMLS metathesaurus: the MetaMap approach; Proceedings of the AMIA Symposium; 2001. pp. 17–21. [PMC free article] [PubMed] [Google Scholar]
  3. Baumgartner W, Lu Z, Johnson H, Caporaso J, Paquette J, Lindemann A, White E, Medvedeva O, Cohen K, Hunter L. Concept recognition for extracting protein interaction relations from biomedical text. Genome Biology. 2008;9 Suppl 2:S9. doi: 10.1186/gb-2008-9-s2-s9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Beitzel S, Jensen E, Chowdhury A, Frieder O, Grossman D. Temporal analysis of a very large topically categorized Web query log. Journal of American Society for Information Science and Technology. 2007;58:166–178. [Google Scholar]
  5. Brownlee K. Statistical Theory and Methodology: In Science and Engineering. Wiley Publications in Statistics; 1965. [Google Scholar]
  6. Chau M, Fang X, Sheng O. Analysis of the query logs of a Web site search engine. Journal of American Society for Information Science and Technology. 2005;56:1363–1376. [Google Scholar]
  7. Cilibrasi R, Vitanyi P. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering. 2007;19(3):370–383. [Google Scholar]
  8. Fonseca B, Golgher P, de Moura ES, Ziviani N. Using association rules to discover search engines related queries; Proceedings of the First Conference on Latin American Web Congress; 2003. p. 66. [Google Scholar]
  9. Geer RC, Sayers EW. Entrez: making use of its power. Brief Bioinform. 2003;4(2):179–184. doi: 10.1093/bib/4.2.179. [DOI] [PubMed] [Google Scholar]
  10. Gligorov R, Aleksovski Z, ten Kate W, van Harmelen F. Using Google distance to weight approximate ontology matches. Proceedings of the 16th international conference on World Wide Web. 2007:767–776. [Google Scholar]
  11. He D, Göker A. Detecting Session Boundaries from Web User Logs. Proceedings of the BCS-IRSG 22nd Annual Colloquium on Information Retrieval Research. 2000 [Google Scholar]
  12. Hersh W, Bhupatiraju RT. TREC genomics track overview; Proceedings of The Twelfth Text REtrieval Conference (TREC 2003); 2003. [Google Scholar]
  13. Herskovic JR, Tanaka LY, Hersh W, Bernstam EV. A day in the life of PubMed: analysis of a typical day’s query log. J Am Med Inform Assoc. 2007;14(2):212–220. doi: 10.1197/jamia.M2191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Huang C, Chien L, Oyang Y. Relevant term suggestion in interactive web search based on contextual information in query session logs. Journal of the American Society for Information Science and Technology. 2003;54:638–649. [Google Scholar]
  15. Levenshtein VI. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady. 1966;10:707. [Google Scholar]
  16. Lu Z, Kim W, Wilbur WJ. Evaluation of query expansion using MeSH in PubMed. under revision. 2008 doi: 10.1007/s10791-008-9074-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Myers EW. An O(ND) difference algorithm and its variations. Algorithmica. 1986;1(2):251–266. [Google Scholar]
  18. Salama NR, Chuang JS, Schekman RW. Sec31 encodes an essential component of the COPII coat required for transport vesicle budding from the endoplasmic reticulum. Mol Biol Cell. 1997;8(2):205–217. doi: 10.1091/mbc.8.2.205. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Shi X, Yang C. Mining related queries from Web search engine query logs using an improved association rule mining model. Journal of the American Society for Information Science and Technology. 2007;58:1871–1883. [Google Scholar]
  20. Silverstein C, Henzinger M, Marais H, Moricz M. Technical report. Digital SRC; 1998. Analysis of a very large web search engine query log. [Google Scholar]
  21. Tang BL, Ong YS, Huang B, Wei S, Wong ET, Qi R, Horstmann H, Hong W. A membrane protein enriched in endoplasmic reticulum exit sites interacts with COPII. J Biol Chem. 2001;276(43):40008–40017. doi: 10.1074/jbc.M106189200. [DOI] [PubMed] [Google Scholar]
  22. Ukkonen E. Algorithms for approximate string matching. Information and Control. 1985;65:100–118. [Google Scholar]
