Published in final edited form as: Pharmacogenomics J. 2002;2(2):96–102. doi: 10.1038/sj.tpj.6500084

An introduction to information retrieval: applications in genomics

P M Nadkarni 1
PMCID: PMC3137130  NIHMSID: NIHMS303260  PMID: 12049181

Abstract

Information retrieval (IR) is the field of computer science that deals with the processing of documents containing free text, so that they can be rapidly retrieved based on keywords specified in a user’s query. IR technology is the basis of Web-based search engines, and plays a vital role in biomedical research because it is the foundation of software that supports literature search. Documents can be indexed both by the words they contain and by concepts matched to domain-specific thesauri; concept matching, however, poses several practical difficulties that make it unsuitable for use by itself. This article provides an introduction to IR and summarizes various applications of IR and related technologies to genomics.

Keywords: information retrieval, full-text indexing, text processing, genomics

INTRODUCTION AND HISTORY

Information Retrieval (IR) (more precisely, text information retrieval) is a branch of computer science that deals with the processing of collections of documents containing ‘free text’, such as scientific papers, or even the contents of electronic textbooks. The objective of such processing is to facilitate rapid and accurate search of the text based on keywords of interest. The word ‘free’ (as opposed to ‘structured’) implies that, in contrast to, say, an electronic spreadsheet, such documents have little structure that could serve as a guide for locating specific content. The term IR originally referred to ‘information’ in a very broad sense (and dealt with problems such as lossless data transmission and data compression), but the focus on textual information can be traced to several researchers, most notably Salton, van Rijsbergen and Sparck Jones. The first two have written historically important IR texts;1,2 modern IR texts3,4 provide an excellent overview. IR overlaps several other fields within computer science, notably database technology and natural language processing (NLP).

For several decades IR was an ‘orphan’ technology researched by a relative handful of scientists, and commercial offerings were sparse and modestly featured. However, with the spread of the World Wide Web, IR has become mainstream, because most of the information on the Web is textual. Web search engines such as Yahoo, Excite, Alta Vista and AskJeeves are used by millions of people to locate information on Web pages across the world, on any topic. The use of search engines has spread to the point where, for people with access to the Internet, the World Wide Web has replaced the library as the reference tool of first choice. Of the individual Web-based text resources available to the genomics researcher, the single most useful is unquestionably PubMed, maintained by the National Center for Biotechnology Information (NCBI), part of the US National Library of Medicine (NLM). PubMed provides access to the contents of MEDLINE, which is curated by the NLM.

Document Indexing

There are several ways to pre-process documents electronically so as to speed up their retrieval. All of these fall under the general term ‘indexing’: an index is a structure that facilitates rapid location of items of interest, an electronic analog of a book’s index. The most widely used technique is word indexing, where the entries (or terms) in the index are individual words in the document (ignoring ‘stop words’—very common and uninteresting words such as ‘the’, ‘an’, ‘of’, etc). Another technique is concept indexing, where one identifies words or phrases and tries to map them to a thesaurus of synonyms as concepts. Therefore, the terms in the index are concept IDs. Several kinds of indexes are created. The global term-frequency index records how many times each distinct term occurs in the entire document collection. The document term-frequency index records how often a particular term occurs in each document. An optional proximity index records the position of individual terms within the document as word, sentence or paragraph offsets. (A proximity index is voluminous, but can allow searches requiring that two or more terms be within the same sentence, or within so many words of each other.) Some issues in indexing are now discussed.
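
To make these structures concrete, the following minimal Python sketch (not from the original article; the tokenizer and stop-word list are illustrative assumptions) builds all three kinds of index over a toy two-document collection.

    # A minimal sketch of the three index types described above.
    # The tokenizer and stop-word list are illustrative assumptions.
    import re
    from collections import Counter, defaultdict

    STOP_WORDS = {"the", "an", "a", "of", "and", "in", "to", "is"}

    def tokenize(text):
        return [w for w in re.findall(r"[a-z0-9]+", text.lower())
                if w not in STOP_WORDS]

    def build_indexes(documents):
        global_tf = Counter()            # term -> count across the collection
        doc_tf = defaultdict(Counter)    # doc_id -> (term -> count in that doc)
        proximity = defaultdict(list)    # (term, doc_id) -> list of word offsets
        for doc_id, text in documents.items():
            for pos, term in enumerate(tokenize(text)):
                global_tf[term] += 1
                doc_tf[doc_id][term] += 1
                proximity[(term, doc_id)].append(pos)
        return global_tf, doc_tf, proximity

    docs = {1: "Insulin regulates glucose uptake.",
            2: "The insulin receptor binds insulin."}
    g, d, p = build_indexes(docs)
    print(g["insulin"])        # 3 occurrences collection-wide
    print(p[("insulin", 2)])   # word offsets of 'insulin' within document 2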

Word Indexing

  • In specific contexts, the list of stop words may need to be expanded. Thus, ‘drug’ would be a stop word in a pharmacology-related collection

  • To reduce the size of the index and to enable broader searches, words may be transformed prior to indexing. One transformation is normalization, which involves lower-case conversion and removal of variations in number or tense. Thus, ‘children’ is normalized to ‘child’, and ‘brought’ to ‘bring’. Another is stemming, which strips suffixes or prefixes (using simple pattern-matching rules) to yield a ‘root’ form that is not necessarily an actual word in the language. Stemming can be drastic (and its use is therefore controversial). Thus, with the widely used Porter stemming algorithm,5 ‘modulation’ and ‘module’ yield the same root, ‘modul’, although the two words mean different things (though both were originally derived from the Latin ‘modulus’). Conversely, ‘modular’, the adjectival form of ‘module’, stems to itself, rather than to ‘modul’. Several studies of stemming have cast doubt on its utility for English text.6,7 (A small illustration of both transformations follows this list.)
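
As a small illustration of both transformations, the sketch below contrasts a toy normalization table with the Porter stemmer. It assumes the third-party nltk package (not mentioned in the article), whose PorterStemmer implements the algorithm cited above; the tiny lemma table is an illustrative stand-in for a real morphological normalizer.

    # Normalization vs stemming, per the bullet above.
    # Requires nltk (pip install nltk); lemma table is a toy example.
    from nltk.stem import PorterStemmer

    IRREGULAR_LEMMAS = {"children": "child", "brought": "bring"}  # toy table

    def normalize(word):
        word = word.lower()                       # lower-case conversion
        return IRREGULAR_LEMMAS.get(word, word)   # undo number/tense variation

    stemmer = PorterStemmer()
    for w in ["modulation", "module", "modular"]:
        print(w, "->", stemmer.stem(w))
    # modulation -> modul, module -> modul, modular -> modular
    print(normalize("Children"))  # child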

Concept Indexing

Some practical problems faced in automated concept indexing are discussed below.8

  • The thesaurus may be general-purpose (eg, applying to the English language) or domain-specific. The National Library of Medicine’s Unified Medical Language System (UMLS) (which is a meta-thesaurus or conglomeration of the contents of several biomedical thesauri) is the world’s largest domain-specific thesaurus. The largest and possibly the best curated of the individual component vocabularies within UMLS is the Medical Subject Headings (MeSH) developed at the NLM for the Index Medicus. The UMLS contains much information that is useful to the software developer. Thus, a concept can belong to one or more semantic categories (eg, pharmacological substance, therapeutic procedure) and every term for a concept is tagged with the ID of the source vocabulary from which it was taken. This information allows researchers to create UMLS data subsets for special purposes. The inter-relationships of concepts are also recorded so that concepts can be arranged in a hierarchy from super-concepts (more general concepts) to more specific/particular concepts. The coverage of the UMLS is not uniform across all fields of biomedicine; coverage depends on the source vocabularies that feed into it. Thus, coverage of clinical concepts is much more extensive than coverage of basic science. The UMLS is therefore a continual work in progress, with major updates being issued annually

  • Recognition of concepts by scanning of text is quite difficult. Most attempts have focused on identifying individual noun phrases,9–12 though verb phrases such as ‘mutated’ and ‘catalyzed’ are also important. The challenge here is a disparity between the granularity of concepts as expressed in phrases in text and as they occur in the thesaurus. Many common concepts in biomedicine are compound, that is, they have been formed by fusion of other concepts, but the extent of fusion is variable. A single phrase may thus correspond to multiple concepts, and vice versa. For example, ‘digitalis-induced atrial fibrillation’ does not match a single UMLS concept, though the sub-phrases ‘digitalis’ and ‘atrial fibrillation’ do (a minimal matching sketch follows this list). Conversely, the two noun phrases in the sentence fragment ‘hypertension is secondary to renal disease’ together happen to correspond to a single compound concept, renal hypertension, composed of the two simpler concepts

  • Matching a phrase to a concept in the thesaurus is also complicated by polysemy: the existence of terms with multiple meanings, or homonyms. Thus, to cite an example from a recent review by Daniel Masys,13 ‘insulin’ could refer to the gene, the protein, the hormone, or therapy. Neighboring words may help to disambiguate, eg, the phrase ‘insulin molecule’ can be uniquely matched. However, if the disambiguating word is further away, eg in the previous or succeeding sentence, the problem is much more difficult

  • Every field evolves its own set of acronyms, and the same acronym may appear in different fields whose subject matter does not overlap. Thus, it is possible for a bioinformatics article to cite the Microsoft acronym DNA (‘distributed network architecture’, a term coined by Bill Gates), rather than the expected biological equivalent (‘deoxyribonucleic acid’)

  • The concept’s presence in a document does not, per se, make that document relevant for that concept, because the concept may be negated, or ruled out. A paper by Mutalik et al14 showed that concept negation in dictated medical narrative (such as operative notes or discharge summaries) could mostly be detected reliably by simple computational strategies. This conclusion, however, does not apply generally to biomedical literature. When a trained (and busy) clinician dictates notes, the goal is to convey information directly and succinctly. Spoken negations tend to be simple in structure, with the negating phrase being present mostly within the same sentence as the negated concept, eg, ‘no history of vomiting or diarrhea’. In the written word, such directness is uncommon, and the negation often lies in the next sentence or paragraph (as when a hypothesis is described, only to be refuted later). Further, written medical text is full of ‘linguistic hedges’ that language purists loathe, such as ‘not uncommon’ (which is intended to convey a different meaning than ‘common’)

Because of the issues outlined above, concept indexing can at best serve as an ancillary to word indexing, not as a replacement for it.
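
To make the granularity problem concrete, here is a minimal greedy longest-match concept mapper over a toy thesaurus. The thesaurus entries and concept IDs are invented for illustration; a real system would consult the UMLS and would also need the disambiguation and negation handling discussed above.

    # Greedy longest-match concept mapping against a toy thesaurus.
    # Entries and concept IDs are invented for illustration only.
    THESAURUS = {
        ("atrial", "fibrillation"): "C-0001",
        ("digitalis",): "C-0002",
        ("renal", "hypertension"): "C-0003",
    }
    MAX_PHRASE = 4  # longest phrase length to try

    def map_concepts(tokens):
        concepts, i = [], 0
        while i < len(tokens):
            for n in range(min(MAX_PHRASE, len(tokens) - i), 0, -1):
                span = tuple(tokens[i:i + n])
                if span in THESAURUS:          # prefer the longest match
                    concepts.append((span, THESAURUS[span]))
                    i += n
                    break
            else:
                i += 1  # no concept starts here; skip one token
        return concepts

    print(map_concepts("digitalis induced atrial fibrillation".split()))
    # yields two separate concepts, not one compound concept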

QUERYING TEXT DATABASES: RELEVANCE RANKING AND CLUSTERING

An important feature of search using IR technology is that when the user requests documents containing terms of interest, some non-relevant documents may be returned (false positives) and some relevant documents may be missed (false negatives). Such errors are artifacts of the indexing/concept-recognition process. One can therefore conceive of the sensitivity and specificity of a search: the equivalent terms in the IR literature are recall and precision respectively. Further, since the same term can occur multiple times in the same document, when searching a large document database, a simple Boolean query (‘return all documents containing terms X, Y and Z’) is not very useful if it returns thousands of documents that contain these terms even once. The search must therefore be augmented using relevance ranking, where documents that provide the ‘best match’ are ranked highest.
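
In code, recall and precision for a single query reduce to simple set arithmetic. The document IDs below are illustrative values, not drawn from any real search.

    # Recall and precision for one query, per the definitions above.
    retrieved = {1, 2, 3, 4, 5}   # documents the engine returned
    relevant = {2, 3, 7, 9}       # documents actually relevant

    true_positives = retrieved & relevant
    precision = len(true_positives) / len(retrieved)   # 2/5 = 0.40
    recall = len(true_positives) / len(relevant)       # 2/4 = 0.50
    print(f"precision={precision:.2f}, recall={recall:.2f}")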

Relevance ranking is typically done through variants of the document vector approach,15,16 which uses the frequency indexes mentioned earlier. Terms in the query that are uncommon in the document collection are given more weight than relatively common terms, and documents containing one or more of the query terms many times (in proportion to the document’s length) are weighted more than documents containing the terms infrequently or not at all. All mainstream text-search systems, including PubMed, use the document vector method. Other customized weightings may be used for scientific articles: eg, terms in the title and abstract are weighted more than terms in the body of the text. Note that relevance ranking can return useful results even if no document contains every single term in the user’s query. One can use Boolean criteria to constrain the search, specifying, for example, that certain terms in the query must (or must not) be present in a document for it to be considered relevant. A useful overview of weighting approaches is given in Sparck Jones et al.17,18
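
A minimal sketch of document-vector ranking follows, using one common TF-IDF weighting variant: rare terms carry more weight (the inverse-document-frequency factor), and terms frequent in a document relative to its length carry more weight (the term-frequency factor). This is an illustrative scheme with invented toy documents, not PubMed’s exact weighting.

    # TF-IDF relevance ranking: one common variant of the document
    # vector approach. Documents and query are illustrative.
    import math
    from collections import Counter

    def tf_idf_score(query_terms, doc_tokens, doc_freq, n_docs):
        tf = Counter(doc_tokens)
        score = 0.0
        for term in query_terms:
            if doc_freq.get(term, 0) == 0:
                continue                                # term absent everywhere
            idf = math.log(n_docs / doc_freq[term])     # rare terms weigh more
            score += (tf[term] / len(doc_tokens)) * idf # frequent-in-doc weighs more
        return score

    docs = {1: "insulin receptor signaling in muscle".split(),
            2: "muscle fibre types and exercise".split(),
            3: "insulin secretion by beta cells".split()}
    # number of documents containing each term
    doc_freq = Counter(t for toks in docs.values() for t in set(toks))
    query = ["insulin", "muscle"]
    ranked = sorted(docs, reverse=True,
                    key=lambda d: tf_idf_score(query, docs[d], doc_freq, len(docs)))
    print(ranked)  # document 1 matches both query terms and ranks first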

As an aside, one must add that today’s commercial engines for Web-page indexing use various proprietary refinements in their relevance-ranking algorithms. The earlier approaches, which were based strictly on frequency indexes, could easily be fooled by Web pages that repeated the same word (such as an obscene phrase) hundreds of times within the HTML page as a comment. (HTML comments are invisible to the human viewer of the page, but visible to the computer programs that perform the indexing.) As a result, the relevance-ranking program would rank such pages highly when a search using one of those words was performed. Modern engines are harder to trick. Google, for example, uses a weighting algorithm based on how many other Web pages in the collection refer to a particular page by way of hyperlinks.19 In a sense, the other pages ‘vote’ for some pages as being more useful/relevant than others. Further, links from pages that are themselves important, based on the votes they have received, are weighted more than links from lower-ranked pages.
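
The link-‘voting’ idea can be sketched as a simple power iteration over a toy link graph, in the spirit of Google’s published approach;19 the graph, damping factor and iteration count below are illustrative assumptions, not Google’s actual parameters.

    # A toy power-iteration sketch of link-based 'voting'.
    DAMPING = 0.85
    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

    def link_rank(links, iterations=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - DAMPING) / len(pages) for p in pages}
            for page, outgoing in links.items():
                # each outgoing link casts a vote weighted by the voter's
                # own rank, so votes from important pages count for more
                share = DAMPING * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            rank = new_rank
        return rank

    print(link_rank(links))  # 'C' collects the most votes in this toy graph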

Other Uses of the Document Vector Approach

One can use the document vector approach to derive a similarity measure between two documents, based on the terms that they have in common (and their relative frequency of occurrence). This information can be pre-computed when documents are first loaded into the database, which is what NCBI’s Entrez system (which accesses several databases, including PubMed) does. Every document is compared with every previously loaded document in the database, using an algorithm devised by NCBI’s John Wilbur, and the highest matches are stored for future retrieval. Instant access to the highest matches is available to PubMed’s users through the ‘Find Related Articles’ hyperlink. The similarity algorithm is quite powerful, and it has been successfully used for a purpose not originally intended by Wilbur. The journal Science reported that a researcher using PubMed identified a serial plagiarist.20 The ‘Find Related’ link, applied to several of the suspect’s articles in turn, correctly identified the original (plagiarized) article in every case at the top of the ‘highly similar articles’ list. Even more surprisingly, the plagiarized articles had been written in a different language, and Wilbur’s algorithm was working with English translations of the abstracts.

The document-vector approach can also be used to cluster similar documents in a collection automatically; the clusters that are identified can then be inspected and interpreted, and a new document can be classified electronically into one of them, based on which documents it most resembles. Automatic document clustering has been used for various forms of ‘literary detective work’. Term frequency does not have to be the only criterion for quantifying similarity: other measures, such as sentence length, the proportion of verbs in the passive voice, or the higher-order semantic categories to which individual terms belong, can also be used in the weighting scheme to create a kind of ‘literary fingerprint’ for the author of a work. Such a fingerprint can indicate whether a well-known author wrote a newly discovered work of uncertain attribution, and can help identify possible literary forgeries. In both cases, the similarity measures are computed against a collection of the author’s existing works (as well as the works of several other authors, which serve as controls). During the Cold War, CIA ‘Kremlin watchers’ used document clustering to identify particular communiqués published in Pravda as having come from one of several (anonymous) political commentators.
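
As a modern stand-in for the clustering methods described, the sketch below clusters TF-IDF document vectors with k-means. It assumes the scikit-learn package (not mentioned in the article), and the four toy documents are invented.

    # Automatic clustering of TF-IDF document vectors; assumes
    # scikit-learn is installed. Documents are illustrative.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "insulin receptor signaling and glucose uptake",
        "insulin secretion in pancreatic beta cells",
        "sonnet metre and the authorship of early modern verse",
        "stylometric analysis of disputed literary works",
    ]
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # biomedical texts fall in one cluster, stylometry in the other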

IR AND TRADITIONAL DATABASES

Today, IR technology is available as part of the standard relational database management system (RDBMS) offerings of all major vendors. This enables integrated query of databases that contain a mix of structured and unstructured data, and offers a variety of methods for retrieving information from them. (It is rather surprising that the integration has taken this long: IR indexes can in fact be implemented with RDBMS technology.) These offerings, however, currently lag somewhat behind those of vendors specializing in IR. For example, proximity indexing and synonym indexing (based on an English-language thesaurus) are currently unavailable in RDBMS-based offerings. Specialized IR systems, such as dtSearch, will also index text in a larger variety of formats (word-processing, PDF, HTML) than their RDBMS counterparts. (dtSearch (www.dtsearch.com) is the engine used to index the Physician’s Desk Reference (PDR) pharmacotherapy guide, which is widely distributed on CD-ROM.) The gap between RDBMS and ‘pure IR’ vendors is expected to shrink with time, although the pressure on the latter to innovate in order to survive (IR vendors are much smaller than RDBMS vendors in terms of gross revenues) means it will probably never shrink to the point of insignificance.

The degree of IR integration, and its ease of use, depends on the RDBMS vendor. Microsoft SQL Server 2000 provides an example of high-quality integration. When indexing a table containing numerous fields, the database administrator simply picks the text fields considered worth indexing through a graphical user interface tool (a ‘wizard’), and the engine does the rest. MS SQL Server enables search of the data using proprietary extensions of SQL (Structured Query Language, the lingua franca of RDBMS data manipulation). Four programming functions, whose documentation is about three pages long in total, have been added to Microsoft’s version of the language, allowing Boolean-style query as well as relevance ranking. All of this can be learned and used by most programmers with an hour or two of effort. When searching for a key phrase, one can simply specify that all indexed fields must be searched, so that Google-like user interfaces can be programmed quite easily.
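
The four functions are presumably the CONTAINS and FREETEXT predicates and their CONTAINSTABLE and FREETEXTTABLE rowset counterparts. The following sketch issues a ranked full-text query from Python; the pyodbc package, the data source name ‘BiblioDB’ and the ‘articles’ table layout are all illustrative assumptions, not part of the article.

    # Ranked full-text query against SQL Server via pyodbc (assumed
    # installed). DSN and table schema are hypothetical examples.
    import pyodbc

    conn = pyodbc.connect("DSN=BiblioDB")  # hypothetical data source
    sql = """
    SELECT a.title, ft.RANK
    FROM articles AS a
    JOIN CONTAINSTABLE(articles, abstract, 'insulin AND receptor') AS ft
      ON a.article_id = ft.[KEY]           -- CONTAINSTABLE returns KEY + RANK
    ORDER BY ft.RANK DESC                   -- best matches first
    """
    for title, rank in conn.cursor().execute(sql):
        print(rank, title)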

APPLICATIONS IN GENOMICS

Delivery of Databases Dealing with Specialized Content over the Web

Many databases in genomics contain a large amount of textual content. Some, such as OMIM21 (Online Mendelian Inheritance in Man), which is curated by Victor McKusick’s group at Johns Hopkins University and published over the Web by NCBI, contain predominantly text. An annual issue of Nucleic Acids Research provides brief papers describing various genomics-related databases, and practically all of these are accessible via the Web. Even when a database is essentially non-textual and supported by a relational engine, users are so accustomed to keyword-based search (without needing to specify which field the key phrase lies in) that the use of IR technology becomes almost mandatory. Vendors of IR engines, such as Google and AskJeeves, provide a valuable service to the WWW community by making their indexing facilities available free of charge; ultimately, however, their objective is to sell their engines to customers who need to provide similar facilities for their own Web sites. For customers with more modest budgets, less full-featured indexing engines are also bundled with the Microsoft, Netscape and Apache Web servers.

Many genomics-related databases consist almost entirely of text files and HTML pages with hyperlinks.22,23 In such cases, often the only off-the-shelf software technology needed is Web-page indexing capability. For example, GeneCards,24 built by a group at Israel’s Weizmann Institute and mirrored at several sites such as the National Cancer Institute, uses the Excite engine to facilitate full-text search. GeneCards collates information on individual genes from several sources, such as SWISS-PROT, OMIM and GDB, and also allows ‘approximate searching’25 based on the gene symbol. In approximate searching, a phrase of interest can be located in a block of text even if a certain proportion of the letters in the search pattern are mistyped or omitted. This is particularly useful for gene symbols, because many symbols are highly similar, differing only in their numeric suffixes (and because symbols are often revised subtly).
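
A minimal sketch of approximate matching by edit distance follows: a candidate symbol matches the query if it can be reached within k single-letter insertions, deletions or substitutions. The symbol list and threshold are illustrative, and the Wu–Manber algorithm25 itself is considerably faster than this naive dynamic program.

    # Approximate gene-symbol search via Levenshtein edit distance.
    def edit_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def approx_search(query, symbols, k=1):
        return [s for s in symbols if edit_distance(query.upper(), s) <= k]

    print(approx_search("BRCA", ["BRCA1", "BRCA2", "BRAF", "TP53"], k=1))
    # ['BRCA1', 'BRCA2'] - near misses differing by a numeric suffix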

Data Mining With IR Technologies

‘Data mining’ refers to the automated discovery of information by electronic processing of data (typically, large data volumes) that may not have been expressly gathered for that purpose. While some authors use the phrase knowledge discovery, ‘knowledge’ is a somewhat more pretentious term than ‘information’: data mining programs only detect patterns that might possibly be useful in confirming hypotheses or generating new, possibly interesting ones, and this information can be elevated to the level of ‘knowledge’ only if and when it proves useful. Many detected patterns may in fact be self-evident to someone who is intimately, or even superficially, familiar with the nature of the data being mined. An obvious example is the mining of medical data that ‘discovers’ the pattern, 100% specific, that ovarian cancer occurs only in females.

Data mining uses both classical and modern statistical algorithms. (One of the major vendors in the multi-million-dollar data mining industry is the SAS Institute, maker of the SAS statistics package.) It also makes use of more flexible, but far more computationally intensive, approaches that make fewer assumptions about the nature of the data distribution for individual parameters, such as neural networks. However, for special kinds of data, almost any ad hoc analysis can be performed with specially written programs, as long as the researchers are able to justify their approach.

The phrase ‘text mining’ originated as a marketing term among IR vendors who saw that, when the data being explored are primarily textual, traditional IR technologies such as automatic clustering could be productively deployed. In many cases, ‘text mining software’ was little more than a repackaging of existing IR software. In the genomics area, however, innovative algorithms that could truly be described as mining of text have produced interesting results. We describe a few of these here; the sample is biased by the reviewer’s interest in applications that combine the use of PubMed with components of the UMLS, GenBank and other publicly accessible databases.

Using Concept Hierarchies with PubMed to Assist Interpretation of Microarray-based Expression Patterns

In microarray-based experiments, a particular experimental intervention (eg, the administration of a chemical) results in a decreased or increased expression of several genes. After analysis of the expression pattern using several standard statistical approaches, these genes can be grouped into clusters. However, trying to find a biological basis for the existence of a particular observed cluster is a more complicated task. While the standard textual descriptions associated with gene names include keywords such as ‘respiration’, ‘metabolism’ and so on, these are typically so broad as to be only minimally informative in interpretation: the best guide is still the current scientific literature.

Simply returning the hundreds of citations that would apply to the collective set of genes in a cluster is not enough, however; one must attempt to inter-relate and organize them in some way. The program devised by Daniel Masys and his colleagues at the University of California, San Diego26 accepts a list of up to 750 gene identifiers (which form the individual components of a cluster). For each identifier, the top 20 PubMed citations are retrieved (if available: Masys’ paper states that currently fewer than half the genes on commercial arrays have associated publications, because many of them are anonymous ESTs). MeSH terms and chemical registry number (RN) keywords are extracted from the citations and matched to UMLS concept IDs. Where the concepts refer to enzymes, EC numbers are also retrieved. The UMLS concept hierarchy is then used to retrieve all the ancestral (ie, more general) concepts corresponding to the retrieved concepts.

A set of ‘concept hierarchy trees’ is then generated as an HTML document. Each tree corresponds to a broad semantic category as defined in the UMLS (eg, enzymes, diseases), with more general concepts higher up in the tree. Each item on a tree is a key phrase (eg, ‘Cathepsin D’, ‘myeloid leukemia’, ‘carboxypeptidases’) labeled with a hyperlink showing the number of hits. Following the hyperlink reveals the individual genes that matched this particular key phrase, and the associated PubMed citations. Each item is also labeled with a P value that indicates the probability that this keyword would appear purely by chance. (These values are computed on the basis of a prior simulation that uses numerous randomly generated gene clusters and calculates the expected frequencies of individual keywords arising by chance.) The most significant P values, understandably, are those associated with the more specific concepts at the lower levels of each tree.
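
The ancestor-retrieval step can be sketched as a transitive closure over a concept-to-parent table; the toy concepts and parent links below are illustrative, not real UMLS content.

    # Retrieve all ancestral (more general) concepts for a matched
    # concept, walking a toy parent table upward.
    PARENTS = {
        "cathepsin D": ["aspartic proteinases"],
        "aspartic proteinases": ["peptidases"],
        "peptidases": ["enzymes"],
        "myeloid leukemia": ["leukemia"],
        "leukemia": ["neoplasms"],
    }

    def ancestors(concept, parent_map):
        seen, stack = set(), list(parent_map.get(concept, []))
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(parent_map.get(c, []))  # climb to grandparents etc
        return seen

    print(ancestors("cathepsin D", PARENTS))
    # {'aspartic proteinases', 'peptidases', 'enzymes'}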

The High-Density Array Pattern Interpreter (HAPI), as the program suite described above is called, is publicly accessible via http://www.array.ucsd.edu/hapi/.

Using Natural Language Processing (NLP) with PubMed to Bootstrap the Contents of Databases or Enhance the Specificity of PubMed Searches

PubMed does not maintain proximity indexes. As noted earlier in this review, the disk space requirements for proximity indexes are formidable: they take up much more space than the documents themselves. The most common form of ‘proximity’ search is one in which the user wishes to find a phrase exactly as typed; instead of phrase indexing, PubMed performs automatic term mapping by consulting, in succession, the Medical Subject Headings (MeSH) Translation Table, a Journals Translation Table, a Phrase List, and an Author Index.27 While this approach maximizes the utility of PubMed for the end-user, the lack of proximity indexing limits its direct use for special-purpose searches. However, there is nothing to prevent one from processing PubMed output after identifying likely candidates using a non-proximity search. (For that matter, since the contents of PubMed are provided at no cost to academic institutions after signing a license agreement, one can, if one dares, proximity-index PubMed with a commercial program if disk space is not a constraint.)
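
Post-processing a non-proximity search for an exact phrase might look like the following sketch; the citation records are illustrative stand-ins for downloaded PubMed abstracts.

    # Keep only citations whose abstract contains an exact phrase.
    import re

    def phrase_filter(citations, phrase):
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        return [c for c in citations if pattern.search(c["abstract"])]

    citations = [  # illustrative stand-ins for retrieved records
        {"pmid": "111", "abstract": "Insulin receptor substrate 1 is phosphorylated."},
        {"pmid": "222", "abstract": "The receptor for insulin was not examined."},
    ]
    print([c["pmid"] for c in phrase_filter(citations, "insulin receptor")])
    # ['111'] - only the exact adjacent phrase survives the filter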

Researchers at the National Institutes of Health have developed interesting applications that apply NLP to the contents of PubMed searches. The process is augmented by the use of controlled vocabularies (such as the UMLS, which also contains other useful information, such as lists of drugs) or special-purpose databases, such as those containing data on known genes. We briefly describe a few of these applications.

MedMiner28 assists interpretation of microarray-based gene expression data by seeking out the appropriate literature. Given a list of genes of interest (or phrases indicative of function, such as ‘apoptosis’), MedMiner first queries GeneCards to get a list of all related genes, including those not on the chip. It then retrieves, from PubMed, citations related to those genes. It filters the citations by first parsing the text (title + abstract) of each citation into its individual sentences, and then looking for sentences containing at least one gene name/symbol and one or more items from a list of 51 ‘keywords’ (such as ‘inhibits’, ‘stimulates’, ‘resistant’, and so on, with alternative word forms such as ‘inhibited’ handled by stemming). All citations in which no such sentences are present are eliminated; the remaining citations are relevance-ranked by the number of such sentences and presented to the user.
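
A simplified sketch of this sentence-level filter follows; the sentence splitter, gene list and keyword list are drastic simplifications of MedMiner’s actual resources, and stemming of keyword variants is omitted.

    # MedMiner-style filtering: keep citations with at least one sentence
    # containing both a gene name and a relation keyword; rank by count.
    import re

    GENES = {"tp53", "mdm2"}
    KEYWORDS = {"inhibits", "inhibited", "stimulates", "binds", "resistant"}

    def score(citation_text):
        hits = 0
        for sentence in re.split(r"(?<=[.!?])\s+", citation_text.lower()):
            words = set(re.findall(r"[a-z0-9]+", sentence))
            if words & GENES and words & KEYWORDS:
                hits += 1  # a gene and a keyword co-occur in this sentence
        return hits

    citations = {
        "c1": "MDM2 inhibits TP53 transactivation. This loop is well studied.",
        "c2": "TP53 was sequenced in all samples.",
    }
    ranked = sorted(citations, key=lambda c: score(citations[c]), reverse=True)
    print([(c, score(citations[c])) for c in ranked if score(citations[c]) > 0])
    # [('c1', 1)] - c2 has no qualifying sentence and is eliminated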

ARBITER29 extracts information on macromolecular binding relationships from PubMed citations, matching noun phrases to gene and molecule names from Genbank and UMLS, and looking for variants of the verb ‘bound’/’bind’ that apply to those phrases as subject and object (eg, X binds Y, X is bound by Y), or noun variants of ‘bind’, such as ‘binding of’. It also uses a list of phrases from the domain of molecular biology that indicate entities participating in binding (eg, box, sequence, strand, motif, receptor, domain, target). A similar and more recent program, EDGAR (Extraction of Drugs, Genes and Relations)30 extracts information about drugs and genes applicable to cancer.

Both the ARBITER and EDGAR approaches could be used to populate the contents of special-purpose databases automatically from the biomedical literature. The reliability of automatic database population depends greatly on the precision and recall of the technique used to mine the literature. (The early evaluation of ARBITER describes a precision and recall of 79 and 72 per cent respectively.) If such approaches were sufficiently refined, it would be quicker and simpler for human experts to curate the contents of an already-populated database manually (by removing errors due to false-positive matches) than to make entries into a database that was initially empty.

Mining PubMed for Suggestion of Research Hypotheses

Arrowsmith,31,32 created by Don Swanson and refined with the help of his collaborator Neil Smalheiser, performs hypothesis discovery using an algorithm that is astonishing in its simplicity, and is best described by a scenario. Suppose a patient who is receiving a recently introduced drug is found to have a rare adverse event. The question is whether the agent, rather than the underlying disease, could have been responsible. A PubMed search specifying both the drug and the adverse event returns no references. Suppose we then do a search specifying only the drug, and store all the titles of the returned citations. We do the same with the adverse event. We then eliminate stop words from each set of titles, put the remaining words into two sorted lists, and pick the words/phrases common to both lists. These phrases act as conceptual ‘bridges’ between the drug and the adverse event. While most bridging phrases may be obvious to the researcher in the context of the domain being investigated, sometimes one may discover unexpected associations, some of which may suggest a cause-and-effect relationship when one goes back to citations containing both the term of interest and the bridging phrase. (Obviously, such relationships must be confirmed by experimental research.)
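
A minimal sketch of the bridging-term computation follows: the words shared between two sets of PubMed titles, after stop words are removed. The titles and stop-word list are illustrative.

    # Arrowsmith-style bridging terms: intersect the title vocabularies
    # of two independent PubMed searches, minus stop words.
    import re

    STOP_WORDS = {"the", "of", "in", "and", "a", "to", "with", "by"}

    def title_words(titles):
        words = set()
        for t in titles:
            words |= {w for w in re.findall(r"[a-z]+", t.lower())
                      if w not in STOP_WORDS}
        return words

    drug_titles = ["Cortical depression induced by the novel agent in rats"]
    event_titles = ["Cortical depression as a mechanism of migraine aura"]
    bridges = title_words(drug_titles) & title_words(event_titles)
    print(bridges)  # {'cortical', 'depression'} - candidate conceptual bridges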

In 1990, Swanson used this approach to discover an unexpected link between dietary magnesium deficiency and some types of migraine.33 The bridging phrase was ‘cortical depression.’ On inspecting the citations with ‘magnesium’ and ‘cortical depression’, he discovered that cortical spreading depression was inhibited by magnesium. On looking up ‘migraine’ and ‘cortical depression’, he found that cortical spreading depression was one of the mechanisms involved in some forms of migraine (as well as epilepsy). This suggested that insufficient magnesium predisposed to migraine through the intermediary of cortical spreading depression, a finding that has since been abundantly confirmed.

Arrowsmith, which is publicly accessible via kiwi.uchicago.edu, takes two text files containing the results of PubMed title searches and returns to the user a list of words and phrases common to the two sets of titles. Various refinements to the basic Arrowsmith algorithm are possible. For example, one could increase recall by using the abstracts, or even the full text, rather than titles alone, though this would have the unfortunate side effect of increasing the amount of material the human reviewer must sift through. One could also make the output more succinct by matching concepts rather than words, to address the synonym problem. Arrowsmith has recently received NIH funding as part of the Human Brain Project, which focuses on neuroinformatics. We have mentioned Arrowsmith here because it is very likely to be of use in the next phase of the human genome project—the task of assigning biological function to millions of sequences.

CONCLUSIONS

The most widely available ‘information’ in today’s world is still textual, and IR techniques are broadly applicable in facilitating its productive use. The general availability of large databases such as PubMed, as well as of large vocabularies such as the UMLS, has greatly facilitated the work of IR researchers in devising and testing scalable approaches that can positively impact the work of the laboratory researcher.

Acknowledgments

The author thanks Cynthia Brandt, MD, and John Fisk, MD, of the Yale Center for Medical Informatics, and the anonymous reviewers for feedback on the article. The author is supported by grants U01 ES10867–02 from the National Institute of Environmental Health Sciences, R01 LM06843–02 from the National Library of Medicine and U01 CA78266–04 from the National Cancer Institute.

Footnotes

DUALITY OF INTEREST

None declared.

References

  • 1. Salton G. Automatic Text Processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley; Reading, MA: 1989.
  • 2. Van Rijsbergen CJ. Information Retrieval. Butterworths; London, UK: 1979.
  • 3. Baeza-Yates R, Ribeiro-Neto B. Modern Information Retrieval. Addison-Wesley Longman; Harlow, UK: 1999.
  • 4. Witten IH, Moffat A, Bell TC. Managing Gigabytes. Morgan Kaufman; San Francisco, CA: 1999.
  • 5. Porter MF. An algorithm for suffix stripping. Program. 1980;14:130–137.
  • 6. Harman D. How effective is suffixing? J Am Soc Inform Sci. 1991;42:7–15.
  • 7. Xu J, Croft WB. Corpus-based stemming using co-occurrence of word variants. ACM Trans Inform Syst. 1998;16:61–81.
  • 8. Nadkarni PM, Chen RS, Brandt CA. UMLS concept indexing for production databases: a feasibility study. J Am Med Inform Assoc. 2001;8:80–91. doi: 10.1136/jamia.2001.0080080.
  • 9. Elkin PL, Cimino JJ, Lowe HJ, Aronow DB, Payne TH, Pincett PS, et al. Mapping to MeSH: the art of trapping MeSH equivalence from within narrative text. Proc Symposium on Computer Applications in Medical Care; 1988. pp. 185–190.
  • 10. Aronson A, Rindflesch T, Browne A. Exploiting a large thesaurus for information retrieval. Proceedings of RIAO; 1994. pp. 197–216.
  • 11. Aronson AR, Rindflesch TC. Query expansion using the UMLS Metathesaurus. Proc AMIA Annual Fall Symposium; 1997. pp. 485–489.
  • 12. Rindflesch TC, Aronson AR. Ambiguity resolution while mapping free text to the UMLS Metathesaurus. Proc Annual Symposium on Computer Applications in Medical Care; 1994. pp. 240–244.
  • 13. Masys D. Linking microarray data to the literature (editorial). Nature Genet. 2001;27:9–10. doi: 10.1038/ng0501-9.
  • 14. Mutalik P, Deshpande A, Nadkarni P. Use of general-purpose negation detection to augment concept indexing of medical documents: a quantitative study using the UMLS. J Am Med Inform Assoc. 2001;8:598–609. doi: 10.1136/jamia.2001.0080598.
  • 15. Williams JH, Perriens MP. Automated full text indexing and searching systems. IBM Information Systems Symposium; Washington, DC; 1968. pp. 335–350.
  • 16. Sparck Jones K. A statistical interpretation of term specificity and its application in retrieval. J Documentation. 1972;28:11–21.
  • 17. Sparck Jones K, Walker S, Robertson SE. Information retrieval: development and comparative experiments (Part 1). Inform Proc Manage. 2000;36:779–808.
  • 18. Sparck Jones K, Walker S, Robertson SE. Information retrieval: development and comparative experiments (Part 2). Inform Proc Manage. 2000;36:809–840.
  • 19. Google Inc. Google: Technology Overview. 2001.
  • 20. Marshall E. Medline searches turn up cases of suspected plagiarism (news). Science. 1998;279:473–474. doi: 10.1126/science.279.5350.473.
  • 21. OMIM. Online Mendelian Inheritance in Man. McKusick-Nathans Institute for Genetic Medicine, Johns Hopkins University, Baltimore, MD; National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD: 2001.
  • 22. Ley K, Brewer K, Moton A. A web-based research tool for functional genomics of the microcirculation: the leukocyte adhesion cascade. Microcirculation. 1999;6:259–265.
  • 23. Achard F, Vayssix G, Dessen P, Barillot E. Virgil database for rich links (1999 update). Nucl Acids Res. 1999;27:113–114. doi: 10.1093/nar/27.1.113.
  • 24. Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics. 1998;14:656–664. doi: 10.1093/bioinformatics/14.8.656.
  • 25. Wu S, Manber U. Fast text searching allowing errors. Commun ACM. 1992;35:83–91.
  • 26. Masys D, Welsh J, Lynn Fink J, Gribskov M, Klacansky I, Corbeil J. Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics. 2001;17:319–326. doi: 10.1093/bioinformatics/17.4.319.
  • 27. National Center for Biotechnology Information. PubMed Help. 2001.
  • 28. Tanabe L, Scherf U, Smith L, Lee J, Hunter L, Weinstein J. MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques. 1999;27:1210–1217. doi: 10.2144/99276bc03.
  • 29. Rindflesch T, Hunter L, Aronson A. Mining molecular binding terminology from biomedical text. Proc AMIA Fall Symposium; 1999. pp. 127–131.
  • 30. Rindflesch T, Tanabe L, Weinstein J, Hunter L. EDGAR: extraction of drugs, genes and relations from the biomedical literature. Pacific Symposium on Biocomputing; Honolulu, HI; 2000. pp. 517–528.
  • 31. Swanson D, Smalheiser N. An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artif Intell. 1997;91:183–203.
  • 32. Finn R. Program uncovers hidden connections in the literature. The Scientist. 1998;12. www.the-scientist.com.
  • 33. Swanson D. Migraine and magnesium: eleven neglected connections. Perspect Biol Med. 1988;31:526–557. doi: 10.1353/pbm.1988.0009.
