Human Genomics. 2010 Oct 1;5(1):17–29. doi: 10.1186/1479-7364-5-1-17

What the papers say: Text mining for genomics and systems biology

Nathan Harmston 1, Wendy Filsell 2, Michael PH Stumpf 1
PMCID: PMC3500154  PMID: 21106487

Abstract

Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, and perhaps more detrimental, judicious (or serendipitous) combination of knowledge from different scientific disciplines, which would require following disparate and distinct research literatures, is rapidly becoming impossible for even the most ardent readers of research publications. Text mining -- the automated extraction of information from (electronically) published sources -- could potentially fulfil an important role -- but only if we know how to harness its strengths and overcome its weaknesses. As we do not expect that the rate at which scientific results are published will decrease, text mining tools are now becoming essential in order to cope with, and derive maximum benefit from, this information explosion. In genomics, this is particularly pressing as more and more rare disease-causing variants are found and need to be understood. Not being conversant with this technology may put scientists and biomedical regulators at a severe disadvantage. In this review, we introduce the basic concepts underlying modern text mining and its applications in genomics and systems biology. We hope that this review will serve three purposes: (i) to provide a timely and useful overview of the current status of this field, including a survey of present challenges; (ii) to enable researchers to decide how and when to apply text mining tools in their own research; and (iii) to highlight how the research communities in genomics and systems biology can help to make text mining from biomedical abstracts and texts more straightforward.

Keywords: data mining, systems medicine, literature processing, hypothesis generation

Introduction

The scientific literature provides an important source of knowledge generated by the research community; it does not become defunct five years after publication and it is not just something to promote the authors' careers. While large amounts of data relating to biological systems are stored in public repositories, an even larger amount can be found in a semi-structured form in the literature (see Figure 1). This knowledge is potentially very useful in a variety of genomics and systems biology contexts [1]. For example, manually curated and literature-derived protein-protein interaction data-sets are typically used as gold standards by the systems biology community and it is standard practice to extract parameters for mechanistic models from the literature.

Figure 1.

Biology is becoming a data-driven science, with an exponential growth in the number of papers being published, increasing numbers of databases indexed in the Nucleic Acids Research (NAR) database collection and an exponential growth in the number of base pairs stored in Genbank.

Manual curation lacks the scalability to deal with the ever-increasing numbers of papers being published [2,3] and suffers from inter-annotator disagreement: different curators may interpret a piece of text in different ways. This means that a single paper needs to be annotated at least twice if the reliability of the proposed annotations is in any way to be calculated. The increase in the numbers of papers being published also means that it is becoming harder for researchers to stay up to date with the relevant literature in their field. This has an impact on their ability to generate meaningful and testable hypotheses, with some even suggesting that this is becoming a bottleneck in the scientific discovery process [4].

These issues have motivated a sustained interest in the application of text mining (TM) techniques by both the industrial [5] and academic [6] communities to address some of these problems. TM refers to the process of extracting information encoded in text by authors through the use of techniques from a variety of fields such as information retrieval (IR), machine learning (ML), natural language processing (NLP), statistics and computational linguistics (CL) [7]. The use of these techniques leads to a decrease in the time and effort required to extract information from a paper, speeding up curation [8] and also providing novel opportunities for hypothesis generation using the literature. We feel that, in the context of human genomics, this is particularly promising: an increasing number of studies report rare disease-causing variants and, in order to annotate such variants, assess their functional relevance or link them to existing clinical information, TM approaches will increase in importance as an enabling technology for biomedical research.

Text mining can be thought of as a method by which a systematic review can be performed. As with all methods for reviewing the existing literature, however, there are several biases. Due to copyright issues, only a relatively small number of papers are available for full-text mining and so most work is restricted to abstracts and titles, which are freely available from MEDLINE (only 30 per cent of curated protein-protein interactions (PPIs) can be found in the abstracts rather than the full text [9]). This does mean that extracted information is subject to a selection bias, although of a different form to that seen in manual curation (where only a subset of full text papers are curated). Neither manual curation nor TM techniques can deal with the inherent publication bias in the literature. Publication bias [10] refers to the fact that only positive findings (rather than negative or no findings) tend to get reported in the literature, and certain topics and/or genes tend to be reported more when they are in vogue. There is also evidence that PPI networks derived from the literature [11,12] are subject to ascertainment bias. This occurs when sampling is non-random and conclusions about the population are made based solely on this distorted group of frequently studied proteins. Conclusions about networks that are generated in the presence of ascertainment bias can dramatically change once the necessary corrections have been made [13]. Despite these biases, the literature is still extremely important to researchers as a method for communicating results and ideas and for testing and generating hypotheses.

Below, we give an overview of the current status of TM methodology. Some technical detail is required in order to appreciate fully the potential of this methodology, as well as its (current and future) limitations. At its worst, TM will be an exercise in high-throughput 'stamp collecting'; at best, it opens up the possibility of distilling vast amounts of published information into concrete hypotheses and functional insights into genomics and systems biology.

TM

In order to extract knowledge from text, named entities (NEs) must first be recognised; these NEs are then normalised to identifiers and any relationships between them are identified (see Figure 2). Biological NEs correspond to classes such as genes, proteins, cell lines, species, compounds, phenotypes, diseases, etc. Named entity recognition (NER) refers to the problem of labelling both the location (start, end) and the semantic class/type of a NE in text, and normalisation refers to the process of mapping a NE to a unique identifier (or set of identifiers). Following NER and normalisation it is useful to determine if a real relationship exists between two or more NEs, as well as the type of relationship. Simply identifying that NEs occur together in a contiguous block of text does hint at the existence of some form of relationship; however, this relationship may be completely speculative, or the text may state that a relationship between the NEs does not exist [14]. In biological research papers, two entities can co-occur for many reasons, including functional, physical, syntenic and evolutionary relationships. The performance of TM systems is often evaluated using precision and recall metrics against manually curated gold standard corpora. Precision can be interpreted as the probability that a randomly selected result is a true positive and is calculated as the number of true positives obtained over the sum of true positives and false positives. Recall can be intuitively interpreted as the probability that a randomly selected positive result is correctly identified; it can be calculated as the number of true positives divided by the number of items that should be found (the sum of true positives and false negatives).
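The two evaluation metrics defined above can be made concrete with a short sketch. It assumes that predictions and gold-standard annotations are both represented as sets of (start, end, class) spans; the spans themselves are hypothetical and chosen only for illustration.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted items against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)   # true positives: predicted and in the gold standard
    fp = len(predicted - gold)   # false positives: predicted but not in the gold standard
    fn = len(gold - predicted)   # false negatives: in the gold standard but missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    return precision, recall


# Hypothetical entity mentions as (start, end, class) spans.
gold = {(0, 4, "protein"), (11, 15, "protein"), (30, 35, "species")}
predicted = {(0, 4, "protein"), (11, 15, "protein"),
             (20, 24, "protein"), (40, 44, "species")}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four predictions are correct (precision 0.5) and two of the three gold-standard mentions are found (recall 2/3). Exact span matching is only one possible convention; evaluations may also credit partial overlaps.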

Figure 2.

An example of a text-mining pipeline. Given a sentence from a paper (A), named entities (NEs) are extracted (green for species entities, red for protein/gene entities, blue for relationship cues) (B); these entities are then normalised to a corresponding identifier scheme (C); and relationships between entities extracted (D). The final result in this case is a network which explicitly encodes the semantic relationships between NEs found in the sentence. Text taken from PMID:14613582.

NER

Biology is a dynamic and ever-expanding research area. This means that there are millions of entity names in use, with new ones constantly being created (eg through genome annotation and drug development). Neologisms are prevalent in the literature; it has been jokingly commented: 'Scientists would rather share each other's underwear than use each other's nomenclature' (Keith Yamamoto). Biological NER thus tends to be more difficult than NER tasks in other domains (eg newswires) due to the variability of biological nomenclatures [15,16]. A single gene can have many synonyms (eg P53, TP53 and TRP53 all refer to the same gene). Gene names are subject to morphological (eg transcription factor, transcriptional factor), orthographic (eg nuclear factor [NF] kappa B, NF κB), combinatorial (eg homologue of actin, actin homologue) and inflectional variation (eg antibody, antibodies). The HUGO Gene Nomenclature committee (HGNC) was created with the aim of assigning a unique gene symbol to every gene; however, currently, not all genes have been assigned a name and there are still problems with gene names mentioned in the past literature. Gene names can overlap with other names relating to different entity types in the biological domain, as well as with words that are found in everyday language. It is often difficult to disambiguate similar entity classes, as they can have similar contexts and morphologies. For example, a simple heuristic for determining whether a term refers to a gene or protein is that proteins begin with an upper case letter (PspA) and genes begin with a lower case (pspA). This pattern is, however, not maintained consistently in scientific writing, and humans show substantial disagreement on this task,[17] with an average pair-wise agreement among three annotators of 77.58 per cent.

The Drosophila melanogaster literature is probably the best example of the problems that exist regarding nomenclatures. Some Drosophila genes are named after their associated phenotype, such as eyeless or fruity, which leads to difficulties in disambiguating whether it is the phenotype or the gene that is being described. Gene names such as Not and That also exist, which are homonymous with common English words (see Table 1). Some gene names are multi-word names, such as Mind the gap and IL-2 receptor. In the last case, problems detecting the correct boundary may lead to the entity being tagged as IL-2, which completely alters the meaning of the entity [18].

Table 1.

Table of linguistic terms

Term: Meaning
Anaphor: A word or phrase that refers back to an earlier word or phrase
Polysemy: The coexistence of many possible meanings for a word or phrase
Homonymy: Each of two or more words having the same spelling and pronunciation but different meanings and origins
Semantics: Relating to meaning in language or logic
Syntax: The arrangement of words and phrases to create well-formed sentences in a language
Part of speech: One of the traditional categories of words intended to reflect their functions in a grammatical context

Definitions obtained from the Oxford Dictionary and WordNet

A variety of methods have been proposed for biological NER (see Table 2), but only a small portion are freely available for download or publicly accessible via web servers/services. These tools fall into four main categories: dictionary-based, rule-based/pattern-based, machine-learning and hybrid systems (combinations of these approaches). Most research in this area has concentrated on recognising gene and protein mentions; however, there has also been some work on identifying cell lines, chemicals and species. Competitions such as NLPBA [19] and BioCreative [20] are held in order to evaluate NER methods for gene mention recognition.

Table 2.

Some freely available software for NLP tasks in the biological domain

Task refers to the part of a text-mining pipeline that the software can be used for. Abbreviations: NER, named entity recognition; POS, part of speech tagger; PPI, protein-protein interaction extraction; SEN, sentencisation

Dictionary-based methods [21] work by matching text against a fixed dictionary of entity names. The performance of these methods is highly dependent on both the coverage of the dictionary and the performance of matching techniques used. Use of a simple text-matching algorithm will lead to a large number of false positives being found because of the overlap between dictionary words and common English, as well as some false negatives due to misspellings not present in the dictionary. Gene names which lead to false positives are typically filtered out of dictionaries. Most systems that are based on this method either use an approximate method of string matching [22] or expand the dictionary by generating spelling variants [23,24]. These methods tend to lead to an increase in recall accompanied by a decrease in precision. In some cases, dictionary-based NER methods can perform normalisation at the same time [25].
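A minimal sketch of dictionary-based tagging is given below. The dictionary, its identifiers and the list of filtered names are all hypothetical; the point is only to show how a naive lookup produces a false positive on a common English word, and how filtering such names out of the dictionary removes it.

```python
import re

# Hypothetical mini-dictionary: lower-cased surface form -> placeholder identifier.
GENE_DICT = {
    "tp53": "ID:1", "p53": "ID:1", "trp53": "ID:1",
    "not": "ID:2",   # a gene name that collides with common English
    "akt2": "ID:3",
}

# Names assumed to be filtered out because they cause false positives.
COMMON_WORDS = frozenset({"not", "that"})


def dictionary_tag(text, dictionary, stoplist=frozenset()):
    """Return (start, end, mention, identifier) for every dictionary hit."""
    hits = []
    for m in re.finditer(r"[A-Za-z0-9-]+", text):
        key = m.group().lower()          # case-insensitive exact matching
        if key in dictionary and key not in stoplist:
            hits.append((m.start(), m.end(), m.group(), dictionary[key]))
    return hits


sentence = "TP53 does not bind Akt2 in this assay."
naive = dictionary_tag(sentence, GENE_DICT)
filtered = dictionary_tag(sentence, GENE_DICT, COMMON_WORDS)
print([h[2] for h in naive])     # includes the English word "not"
print([h[2] for h in filtered])  # "not" removed by the filter
```

Because each hit carries the identifier stored in the dictionary, this style of tagger performs a crude normalisation at the same time. The price of the filter is that genuine mentions of the filtered gene can no longer be found.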

Rule-based methods [26] use orthographic and morpho-syntactic features of NEs (capital letters, numbers, symbols and affixes) and their surrounding words to generate patterns and rules. Biochemical suffixes such as -ase and -in are very useful in indicating possible protein names and so a simple rule would be to tag words with these features as proteins. These systems incorporate expert knowledge easily and the rules generated are human readable and easily extendable. Rule-based techniques are able to reach high levels of precision but at the expense of recall, as they are not robust against unseen names. This is mainly because there are so many potential surface grammatical variations (active, passive voice) and it is not feasible to develop robust patterns for all of these.
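The suffix rule mentioned above can be written as a single regular expression. This is a toy rule under our own assumptions (a purely alphabetic token ending in -ase or -in), not the rule set of any published system.

```python
import re

# Toy rule: an alphabetic token of at least three letters plus the suffix
# "ase" or "in" is treated as a protein candidate.
PROTEIN_SUFFIX = re.compile(r"\b[A-Za-z]{3,}(?:ase|in)\b")


def suffix_candidates(text):
    """Return tokens that satisfy the suffix rule."""
    return [m.group() for m in PROTEIN_SUFFIX.finditer(text)]


found = suffix_candidates("Hexokinase binds actin, but Akt2 is missed.")
print(found)  # ['Hexokinase', 'actin']
```

The protein Akt2 is not found because it does not fit the pattern, which illustrates the limited recall of such rules on names they were not written for. A rule this simple would also accept ordinary words such as "within", so practical systems combine several features and the surrounding context before tagging a token.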

Machine learning (ML) methods tend to achieve the highest performance for NER. All of the top ten performing methods in the BioCreative II gene mention task (BCII GM) used a machine-learning component. ML methods use training data in the form of a manually annotated gold standard corpus and learn features that are useful in identifying NEs in text. The performance of the methods used in NER can be very sensitive to feature selection, although this is not always the case [27]. NER can be viewed as either a classification or a sequence-labelling problem. Classification approaches normally consider NER as assigning a class to a bag of features. These features include surface clues and morpho-syntactic features of NEs and their adjacent words. These methods do not tend to take the order of features into account and support only binary classifications. Sequence labelling approaches deduce the most probable sequence of tags for a given sequence of words. Each token is assigned a tag by calculating the most likely label for the current token, given both the features of that token and the previous history of tag assignments. The performance of any ML tagger will be biased by the size, inter-annotator agreement and topic structure of the corpus (see Table 3).
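The sequence-labelling view can be illustrated with a deliberately small sketch. The tag set (B, I, O for the beginning, inside and outside of a mention), the hand-set scores and the greedy left-to-right decoding are all assumptions made for illustration; a trained model would learn its weights from an annotated corpus and would typically search over whole tag sequences rather than committing to one tag at a time.

```python
TAGS = ("B", "I", "O")  # begin / inside / outside a gene or protein mention


def token_scores(token):
    """Hand-set, illustrative feature scores (a trained model would learn these)."""
    looks_like_symbol = any(c.isdigit() for c in token) or token.isupper()
    if looks_like_symbol:
        return {"B": 2.0, "I": 1.0, "O": 0.0}
    if token.lower() in {"receptor", "kinase", "factor"}:
        return {"B": 0.5, "I": 1.5, "O": 1.0}
    return {"B": 0.0, "I": 0.0, "O": 2.0}


def transition_score(prev_tag, tag):
    """Score a tag given the previous tag (the 'history' of assignments)."""
    if tag == "I" and prev_tag == "O":
        return float("-inf")          # "I" may only continue an open mention
    if tag == "I" and prev_tag in ("B", "I"):
        return 0.5                    # mild preference for extending a mention
    return 0.0


def greedy_tag(tokens):
    """Assign each token the best-scoring tag given its features and the previous tag."""
    tags, prev = [], "O"
    for tok in tokens:
        scores = token_scores(tok)
        best = max(TAGS, key=lambda t: scores[t] + transition_score(prev, t))
        tags.append(best)
        prev = best
    return tags


tokens = ["The", "IL-2", "receptor", "binds", "APPL"]
tags = greedy_tag(tokens)
print(list(zip(tokens, tags)))
```

With these scores the multi-word name "IL-2 receptor" is kept together (B followed by I) because the transition score rewards continuing a mention, whereas a tagger that looked at each token in isolation could stop at "IL-2".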

Table 3.

Freely available corpora for training and evaluating text-mining tools in the biological domain

Task refers to the tool training/evaluation use of the corpus. Abbreviations: GM, gene mention (NER); GN, gene normalisation; REL, relationship extraction; SD, species; SM, species mention (NER); SN, species normalisation

Determining the correct class of an NE is complicated by the ubiquitous use of abbreviations and acronyms in biomedical research. Liu et al.[42] found that 81.2 per cent of acronyms in MEDLINE are ambiguous (eg the acronym NF can refer to 61 different full forms [43]). ML methods have been proposed for abbreviation disambiguation,[44] with some work focusing on abbreviations found in the biological literature [43,45].

It is not just gene names that are difficult to identify; the identification of species mentions is also troublesome. Species names can be homonymous with common English words (eg 'honesty' for Lunaria annua and 'bears' for Ursidae) but also with important entities in the biological domain (eg cancer and hippocampus). The performance of a dictionary-based tagging system is again limited by the lack of coverage, widespread use of acronyms and frequent misspelling of species names. Standard dictionaries of species names such as the National Center for Biotechnology Information (NCBI) Taxonomy are incomplete, given the amazing diversity of life. They do, however, contain names for most well-studied organisms. Rule-based methods [46] have been developed which are capable of identifying species terms using rules designed for matching Linnaean binomial nomenclature. The recently published LINNAEUS [34] system uses a dictionary and a set of regular expressions to identify species mentions in text. This system allows both the identification and normalisation of species names, features an acronym disambiguation component and achieves high performance on its own corpus.
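A rule for Linnaean binomials can be sketched as a regular expression. The pattern below is our own simplification (a capitalised genus, or its initial followed by a full stop, then a lower-case epithet) and is not taken from any of the systems cited above.

```python
import re

# Capitalised genus (or abbreviated initial) followed by a lower-case epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s([a-z]{3,})\b")


def species_candidates(text):
    """Return candidate binomial names found by the pattern."""
    return [" ".join(m.groups()) for m in BINOMIAL.finditer(text)]


text = "Lunaria annua and C. elegans were compared. These results differ."
candidates = species_candidates(text)
print(candidates)  # ['Lunaria annua', 'C. elegans', 'These results']
```

The pattern finds both the full and the abbreviated name, but it also accepts "These results", because any capitalised sentence-initial word followed by a lower-case word has the same shape. This is one reason for pairing such rules with a dictionary of known names.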

Cell lines are widely used in biological and biomedical research as a platform for functional studies and to validate biomarkers. It is useful to identify cell line mentions as they can aid in identifying experimental techniques/conditions and to determine the species to which other entity types belong during normalisation. A recent analysis of the cell line nomenclature [47] revealed that it, too, is blighted by ambiguity and variability. Several NER taggers have been trained to identify cell line mentions in text, although there is not yet one specifically designed for tagging cell line mentions. Recently, integrating information from different sources has led to the creation of a cell line knowledge base (CLKB). This work represents the start of efforts to create a lexicon of cell line names, although it is incomplete, so dictionary-based techniques may still miss cell line mentions. As with other subsets of biological nomenclature, there is vertical polysemy (see Table 1) with other NE classes (see Figure 3).

Figure 3.

(A) HUman Natural Killer; (B) Large piece of something without definite shape; (C) A well-built, sexually attractive man; (D) Hormonally Upregulated Neu-associated Kinase. Demonstration of the possible problems due to the biological nomenclature, given the sentence 'HUNK is associated with expression of Frizzled-2': HUNK could refer to a cell type, a protein or either of two common English words. While, in biological text, it is highly probable that (B) and (C) will not be relevant, it is not so easy to disambiguate (A) and (D). This is an example of the problems posed by polysemy (a word or phrase having multiple meanings), homonymy with common English words and the use of abbreviations in the literature [18].

Entity normalisation

Normalisation of NEs allows the results of text mining to be used in tasks like manual curation,[50] knowledge summarisation [51] and model construction and validation [52,53]. The standard method of normalisation is to compare an NE against a dictionary of synonyms and identifiers, and assign the matching identifier. In some domains, this approach can achieve an extremely good performance; however, the variability and ambiguity of biological nomenclature mean that this method is essentially ineffective for biological entities. Exact text matching using a dictionary is flawed for two reasons: the term may be a variation not found in the list of synonyms, and the genomic nomenclature is highly ambiguous, in that one gene name can map to multiple canonical identifiers. Rule-based approaches [54] have been used which try to normalise terms by applying a set of transformations to a tagged entity in order to try to make it match a term in a lexicon. String similarity metrics [55] have been used with some success [56] to match terms which are not present in the original lexicon.
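The combination of a transformation rule with a string similarity metric can be sketched as follows. The lexicon and its identifiers are hypothetical placeholders, the transformation (lower-casing and removing spaces and hyphens) is only one of many possible rules, and the similarity measure is the ratio provided by Python's standard difflib module rather than any metric used in the cited work. The threshold of 0.8 is an arbitrary assumption.

```python
from difflib import SequenceMatcher

# Hypothetical lexicon: placeholder identifier -> known synonyms.
LEXICON = {
    "HYPO:0001": ["TP53", "P53", "TRP53"],
    "HYPO:0002": ["NF kappa B", "nuclear factor kappa B"],
}


def canonical(name):
    """Toy transformation rule: lower-case and drop spaces and hyphens."""
    return name.lower().replace("-", "").replace(" ", "")


def normalise(mention, lexicon, threshold=0.8):
    """Return (identifier, score) for the most similar synonym, or (None, score)."""
    best_id, best_score = None, 0.0
    for identifier, synonyms in lexicon.items():
        for synonym in synonyms:
            score = SequenceMatcher(None, canonical(mention), canonical(synonym)).ratio()
            if score > best_score:
                best_id, best_score = identifier, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


print(normalise("p-53", LEXICON))
print(normalise("NF-kappaB", LEXICON))
print(normalise("eyeless", LEXICON))
```

The first two mentions are variants that an exact lookup would miss but that match a synonym once transformed. The third has no sufficiently similar synonym and is left unnormalised. A sketch like this returns a single best identifier and so does not address the separate problem of one name mapping to several identifiers.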

Due to the ambiguity in biological nomenclatures (Figure 4), it is important to disambiguate between multiple identifiers. Several approaches have been proposed in order to deal with this problem: rule-based, ML or hybrid. Rule-based approaches [57] use various heuristics to try to assign scores to identifiers. The creation of bags of words associated with specific identifiers (known as semantic profiles) has been useful for disambiguation. These profiles are created by extracting information from various genomic knowledge sources such as UniProt, GO and Entrez. These can then be used to train a classifier to distinguish the correct identifier from incorrect ones [58]. Knowledge of paper co-authorship has been found to be useful in identifier disambiguation,[59] based on the idea that an author uses gene names consistently across all of their publications or may work on a specific set of genes consistently.
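The idea of a semantic profile can be illustrated with a small sketch. The identifiers and profile words below are hypothetical, and the sketch substitutes a plain cosine similarity between bags of words for the trained classifier described above, so it should be read as an illustration of the principle only.

```python
import math
from collections import Counter

# Hypothetical semantic profiles; real ones would be built from resources
# such as UniProt, GO and Entrez.
PROFILES = {
    "ID_A": Counter("kinase phosphorylation signalling breast tumour".split()),
    "ID_B": Counter("natural killer cell immune cytotoxic lymphocyte".split()),
}


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def disambiguate(context, profiles):
    """Pick the identifier whose profile is most similar to the mention's context."""
    ctx = Counter(context.lower().split())
    return max(profiles, key=lambda ident: cosine(ctx, profiles[ident]))


print(disambiguate("HUNK kinase activity increased after phosphorylation", PROFILES))
print(disambiguate("cytotoxic HUNK cells were isolated", PROFILES))
```

The same ambiguous mention is assigned to different identifiers depending on the words around it. When no profile shares a word with the context, this sketch simply returns the first identifier, so a practical system would need an explicit fallback.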

Figure 4.

The genomic nomenclature is highly ambiguous. The plot shows the rank of a gene name against the total number of times that the gene name is found in Biothesaurus. The inset shows this only looking at human genes. The plot is in log-log coordinates. Both graphs show Zipf-like (discrete power-law) distributions. Biothesaurus is a collection of gene names mapped to Entrez Gene/Uniprot identifiers across approximately 7,000 species.

It is not just the proteomic and genomic nomenclatures that pose problems for normalisation. While the precise Linnaean binomial name for an organism is unambiguous, it may not be the case for its abbreviated form. Caenorhabditis elegans is commonly abbreviated to C. elegans; however, 49 other species have a name that can be abbreviated to this short form. Due to the widespread use of Caenorhabditis elegans as a model organism, the majority of mentions of C. elegans would probably normalise to NCBI Taxonomy identifier 6239 but this heuristic will have exceptions. Another problem with species normalisation is dealing with the abundance of different strains, particularly among microorganisms. It is important to disambiguate the strains if possible, as genes' functional properties can vary between strains.

Good results for normalising human gene names have been reported. The BCII GN task [60] evaluated performance against a manually annotated gold standard corpus. Overall results were promising, with a combined recall of 97.2 per cent (entries from over 20 teams). This evaluation assumed that the species was human, however. Normalisation for other species continues to be a challenge and has not been helped by the decision made at the 22nd International Society for Animal Genetics (in August 1990) that animal gene names should 'follow the rules for human gene nomenclature, including the use of identical symbols for homologous genes and the reservation of human symbols for as yet unidentified animal genes' [15]. This interspecies ambiguity of the genomic nomenclature means that identifying the correct species for a given mention is an important subtask of gene normalisation, although it has only recently begun to be considered [61].

Relation extraction

Identifying the existence and type of relationships between entities is difficult because of the numerous ways that a relationship can be proposed. A binding relationship between two proteins could, for example, be described in at least three ways:

(1) APPL binds Akt2

(2) Binding of Akt2 by APPL

(3) Binding between Akt2 and APPL

Relationships between two entities can be described over multiple sentences, which can lead to complications, as anaphors need to be identified and resolved (eg APPL is later referred to as this protein in a piece of text). This limits the recall of relation extraction approaches that work at the sentence level only. The relationship type that has attracted the most effort is extracting PPIs.

A number of different approaches have been proposed in order to perform this task based on linguistic, rule-based and ML methods. Rule-based methods use a set of syntactic patterns, which specify how an interaction is described. The patterns can be manually or automatically generated. RelEX [62] applies a simple set of rules on a representation of the dependencies between words in a sentence called a dependency graph. The RLIMS-P [63] is a rule-based approach specifically designed to extract information about protein phosphorylation sites, and performs well compared with manually curated literature sets. Some ML methods treat a sentence as a sequence of words or tokens and completely ignore its syntactic structure. These approaches do not achieve good performance compared with methods which take sentence structure into account. It is clearly important to consider both contextual and linguistic features,[64,65] such as interaction keywords and verbs,[66] to extract relationships with good precision.
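The three phrasings of the APPL and Akt2 binding relationship listed earlier can each be captured by a manually written surface pattern. The sketch below assumes a very crude notion of an entity (a capitalised alphanumeric token); in a real pipeline the entities would come from the NER step, and systems such as RelEx operate on dependency graphs rather than on raw strings.

```python
import re

# Crude stand-in for a tagged entity: a capitalised alphanumeric token.
ENTITY = r"([A-Z][A-Za-z0-9-]+)"

PATTERNS = [
    re.compile(rf"{ENTITY} binds {ENTITY}"),                   # (1) A binds B
    re.compile(rf"[Bb]inding of {ENTITY} by {ENTITY}"),        # (2) binding of B by A
    re.compile(rf"[Bb]inding between {ENTITY} and {ENTITY}"),  # (3) binding between A and B
]


def extract_binding(sentence):
    """Return the set of unordered entity pairs linked by a binding pattern."""
    pairs = set()
    for pattern in PATTERNS:
        for m in pattern.finditer(sentence):
            pairs.add(frozenset(m.groups()))  # binding is treated as symmetric
    return pairs


sentences = [
    "APPL binds Akt2",
    "Binding of Akt2 by APPL",
    "Binding between Akt2 and APPL",
]
results = [extract_binding(s) for s in sentences]
print(results)
```

All three sentences yield the same pair, but every further way of phrasing the relationship needs a further pattern, which is why purely surface-level rules struggle with the variability of natural language.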

To complicate matters further, authors frequently speculate about potential relationships (eg APPL may interact with Akt2). Such statements do not assert that a relationship exists, only that it is proposed to exist. It is important to identify these speculative statements [67] and prevent them from biasing any downstream analyses. For the same reason, it is equally important to detect the negation of relationships [68] (eg APPL does not interact with Akt2).
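A first approximation to detecting speculation and negation is to look for cue words. The cue lists below are short, illustrative assumptions of ours and are far from complete; the cited approaches are considerably more sophisticated.

```python
import re

# Illustrative cue lists only.
SPECULATION_CUES = re.compile(r"\b(may|might|could|possibly|suggests?|putative)\b", re.I)
NEGATION_CUES = re.compile(r"\b(not|no|never|fails? to|unable to)\b", re.I)


def assertion_status(sentence):
    """Label a sentence as 'negated', 'speculative' or 'asserted' using cue words."""
    if NEGATION_CUES.search(sentence):
        return "negated"
    if SPECULATION_CUES.search(sentence):
        return "speculative"
    return "asserted"


for s in ("APPL binds Akt2",
          "APPL may interact with Akt2",
          "APPL does not interact with Akt2"):
    print(s, "->", assertion_status(s))
```

Only relationships labelled as asserted would be passed on to downstream analyses. Cue matching ignores the scope of the cue, and a gene name that is also a cue word (such as Not) would be misread as a negation, so this should be regarded as a baseline rather than a solution.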

Hypothesis generation

The scientific literature not only contains explicit knowledge, such as 'APPL interacts with Akt2', but also implicit knowledge,[69] such as hidden refutations or qualifications, inferences from transitive relations, hidden or unrecognised analogies and the accumulation of weak tests (which could be used in meta-analyses). Swanson's serendipitous discovery of the connection between Raynaud's disease and fish oil [70] is an example of performing an inference on a transitive relation to generate a novel and testable hypothesis. By reading two disjoint sets of literature (no articles are in common, and the articles in one set do not cite or mention articles in the other set), he observed that blood factors were a common theme in both the Raynaud's disease and the fish-oil literature. This led him to propose that fish oil could be used in the treatment of Raynaud's disease, and the relationship was clinically validated in 1989 [71]. The discovery led Swanson to propose that 'new hypotheses can emerge and scientific discovery can be anticipated or stimulated through the investigation of complementary but disjoint literatures'. This method of literature-based discovery is commonly referred to as Swanson's ABC model or Swanson Linking, with the hypotheses and new knowledge being described as undiscovered public knowledge. Although the model has mainly been used within the biomedical and biological fields it has also been applied to the humanities literature and the WWW (see Table 4).
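The transitive inference at the heart of the ABC model can be sketched with sets. The linking terms below are loosely modelled on the Raynaud's disease and fish-oil example but are illustrative only; they were not extracted from MEDLINE, and the second candidate is an invented placeholder.

```python
# Intermediate (B) terms linked to the starting concept (A) in one literature.
A_LINKS = {
    "Raynaud's disease": {"blood viscosity", "platelet aggregation", "vasoconstriction"},
}

# Intermediate (B) terms linked to candidate concepts (C) in a disjoint literature.
C_LINKS = {
    "fish oil": {"blood viscosity", "platelet aggregation", "triglycerides"},
    "placeholder compound": {"bone density"},   # hypothetical, shares no B term
}


def abc_candidates(a_term, a_links, c_links):
    """Rank C terms by the number of B terms they share with the A term."""
    b_terms = a_links[a_term]
    ranked = []
    for c_term, linked in c_links.items():
        shared = b_terms & linked
        if shared:
            ranked.append((c_term, sorted(shared)))
    return sorted(ranked, key=lambda item: len(item[1]), reverse=True)


hypotheses = abc_candidates("Raynaud's disease", A_LINKS, C_LINKS)
print(hypotheses)
```

Each returned pair is a candidate hypothesis (A may be related to C) together with the intermediate terms that support it. Counting shared terms is the simplest possible ranking, and the candidates still require expert assessment and experimental validation.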

Table 4.

Summary of hypotheses generated using Swanson's ABC model and its extensions

Paper: Hypothesis
Cory et al.[76]: Proposed links between Frost (a 20th century poet) and Carneades (an ancient philosopher)
Gordon et al.[77]: Finding new applications for genetic algorithms using the WWW
Hettne et al.[73]: Proposed the role of NF-κB in the aetiology of complex regional pain syndrome
Hristovski et al.[78]: Proposed novel candidate genes that may be involved in bilateral perisylvian polymicrogyria
Kostoff et al.[79]: Proposed novel non-drug treatments (such as calorific restriction) for the treatment of multiple sclerosis
Kostoff et al.[80]: Proposed 'lifestyle/dietary practices that could be interpreted as anti-cataract'
Srinivasan et al.[81,82]: Novel uses for Curcuma longa/turmeric in the treatment of retinal diseases, Crohn's disease and spinal cord-related disorders
Swanson et al.[83]: Classifying viruses as potential biological weapons
van Haagen et al.[74]: Predicting and identifying novel interaction partners for proteins in Escherichia coli
Weeber et al.[84]: Novel uses of thalidomide in the treatment of myasthenia gravis, chronic hepatitis C, Helicobacter pylori-induced gastritis and acute pancreatitis
Wren et al.[85]: Chlorpromazine may reduce cardiac hypertrophy (ABC model in conjunction with experimental evidence)
Wren et al.[86]: Pathogenesis of non-insulin-dependent diabetes is most likely epigenetic
Zhou et al.[87]: Combined MEDLINE with traditional Chinese medicine to propose new functional knowledge about genes

Mendeleev's discovery of the law of periodicity and the development of the periodic table can be considered an early example of literature-based discovery (LBD), as it was: 'a direct outcome of the stock of generalisations and established facts which has accumulated by the end of the decade 1860-1870.' The information required to build the table of elements had already been published, but it had never been analysed as a whole [72]. More recently, Hettne et al.[73] combined TM with network analysis in order to generate new mechanistic hypotheses relating to the complex regional pain syndrome (CRPS). NF-κB was identified as potentially being involved by first extracting genes relating to CRPS from the literature and then investigating potential links between these genes which were not mentioned in the CRPS literature. This hypothesis has led to several new ideas regarding the aetiology of the disease and the proposal of a novel drug target. By exploiting the context of protein mentions, van Haagen et al.[74] were able to predict a novel interaction between CAPN3 and PARVB. Integrating information extracted from the literature with microarray experiments has led to the proposition of a relationship between SIP and the invasiveness of glioblastoma cell lines [75]. All of this work shows the potential for TM to generate testable hypotheses for use in biology.

Hypothesis generation is challenging even to humans, however. Automating this process, or formulating it in such a way that a computer can quickly generate testable scientific propositions, is a non-trivial and daunting task. Only if the universe of potential hypotheses is sufficiently simple for search or enumeration approaches to cover all potential cases is this currently feasible. We feel that the most promising strategies in the short term include the search for suitable heuristics or iterative procedures involving infrequent human input.

Conclusion

TM tools offer a way to retrieve the pertinent information contained within the mass of scientific literature, make it easier to explore [88] and allow the generation of novel insights into existing data, all in an automated fashion. While TM is currently noisy and imperfect, it should be remembered that, due to inter-annotator disagreement, manual curation is too. TM is not just restricted to extracting functional information; it has also been used to identify best practices within the phylogenetics domain,[89] to generate priors for network reconstruction using Bayesian networks [90] and to aid in protein structure comparison and assignment of function [91]. Recently, TM has shown the greatest potential when used in data fusion style approaches. By using information extracted from the literature, Raychaudhuri et al.[92] were able to develop a method that better distinguishes between genomic regions associated with disease and false-positive regions. Ten out of 13 single nucleotide polymorphisms (SNPs) identified by their method as being associated with Crohn's disease were later validated by follow-up genotyping. STRING [93] integrates many different types of evidence about PPIs, including literature co-occurrence, phylogenetic data and results from high-throughput experiments, and has been used to predict novel PPIs in other organisms by transferring annotations to orthologous protein pairs. While there is a significant body of work on applying TM to the biological domain, many challenges remain in areas like relation extraction, species disambiguation and hypothesis generation.

Systems biology and genomics deal with large data models of unprecedented complexity; TM allows us to draw on the published literature in a disciplined manner to inform the development of quantitative models. We expect TM to become an important addition to the systems biologist's toolkit, complementing existing techniques such as comparative and primary data analysis. We hope to have demonstrated the use and limitations of TM in its current guise. Being aware of these limitations should enable the community to develop and adopt protocols that allow for easier, more reliable analysis of published research outputs with these tools. This is important not only for researchers, but also for publishers, funding bodies and regulators. These players have, of course, different but, crucially, not competing interests as far as accessibility of information is concerned. Regulators in particular, whether they are engaged in accrediting new drugs or nutritional supplements or in granting patents, stand to benefit profoundly from information that is provided in an electronically accessible and unambiguous fashion.

References

  1. Ananiadou S, Kell D, Tsujii J. Text mining and its potential applications in systems biology. Trends Biotechnol. 2006;24:571–579. doi: 10.1016/j.tibtech.2006.10.002. [DOI] [PubMed] [Google Scholar]
  2. Baumgartner WA, Cohen KB, Fox LM, Acquaah-Mensah G. et al. Manual curation is not sufficient for annotation of genomic databases. Bioinformatics. 2007;23:i41–i48. doi: 10.1093/bioinformatics/btm229. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Winnenburg R, Wächter T, Plake C, Doms A. et al. Facts from text: Can text mining help to scale-up high-quality manual curation of gene products with ontologies? Brief Bioinform. 2008;9:466–478. doi: 10.1093/bib/bbn043. [DOI] [PubMed] [Google Scholar]
  4. Ng S, Wong M. Toward routine automatic pathway discovery from on-line scientific text abstracts. Genome Inform. 1999;10:104–112. [PubMed] [Google Scholar]
  5. Agarwal P, Searls DB. Literature mining in support of drug discovery. Brief Bioinform. 2008;9:479–492. doi: 10.1093/bib/bbn035. [DOI] [PubMed] [Google Scholar]
  6. Rzhetsky A, Seringhaus M, Gerstein M. Seeking a new biology through text mining. Cell. 2008;134:9–13. doi: 10.1016/j.cell.2008.06.029. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Hearst M. Untangling text data mining. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics. 1999. pp. 3–10.
  8. Deshpande N, Fink J, Bourne P, Cohen K. Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. Pacific Symposium on Biocomputing. 2008. pp. 640–651. [PMC free article] [PubMed]
  9. Blaschke C. Can bibliographic pointers for known biological data be found automatically? Protein interactions as a case study. Comp Funct Genomics. 2001;2:196–206. doi: 10.1002/cfg.91. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Knight J. Negative results: Null and void. Nature. 2003;422:554–555. doi: 10.1038/422554a. [DOI] [PubMed] [Google Scholar]
  11. Pfeiffer T, Hoffmann R. Temporal patterns of genes in scientific publications. Proc Natl Acad Sci USA. 2007;104:12052–12056. doi: 10.1073/pnas.0701315104. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Lehne B, Schlitt T. Protein-protein interaction databases: Keeping up with growing interactomes. Hum Genomics. 2009;3:291–297. doi: 10.1186/1479-7364-3-3-291. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Dickerson J, Pinney J, Robertson D. The biological context of HIV-1 host interactions reveals subtle insights into a system hijack. BMC Syst Biol. 2010;4:80. doi: 10.1186/1752-0509-4-80. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Jenssen T, Lægreid A, Komorowski J, Hovig E. A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001;28:21–28. doi: 10.1038/ng0501-21. [DOI] [PubMed] [Google Scholar]
  15. Chen L, Liu H, Friedman C. Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics. 2005;21:248–256. doi: 10.1093/bioinformatics/bth496. [DOI] [PubMed] [Google Scholar]
  16. Mons B. Which gene did you mean? BMC Bioinform. 2005;6:142. doi: 10.1186/1471-2105-6-142. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Hatzivassiloglou V, Duboue PA, Rzhetsky A. Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics. 2001;17:97–106. doi: 10.1093/bioinformatics/17.suppl_1.S97. [DOI] [PubMed] [Google Scholar]
  18. Barnes J. Conceptual biology: A semantic issue and more. Nature. 2002;417:587–588. doi: 10.1038/417587b. [DOI] [PubMed] [Google Scholar]
  19. Kim J, Ohta T, Tsuruoka Y, Tateisi YN. Introduction to the bio-entity recognition task at JNLPBA. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications. 2004. pp. 70–75.
  20. Smith L, Tanabe LK, Johnson R, Kuo CJ. et al. Overview of BioCreative II gene mention recognition. Genome Biol. 2008;9:S2. doi: 10.1186/gb-2008-9-s2-s2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Liu H, Hu ZZ, Torii M, Wu C. et al. Quantitative assessment of dictionary-based protein named entity tagging. J Am Med Inform Assoc. 2006;13:497–507. doi: 10.1197/jamia.M2085. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Tsuruoka Y, McNaught J, Tsujii J, Ananiadou S. Learning string similarity measures for gene/protein name dictionary look-up using logistic regression. Bioinformatics. 2007;23:2768–2774. doi: 10.1093/bioinformatics/btm393. [DOI] [PubMed] [Google Scholar]
  23. Schuemie M, Mons B, Weeber M, Kors J. Evaluation of techniques for increasing recall in a dictionary approach to gene and protein name identification. J Biomed Inform. 2007;40:316–324. doi: 10.1016/j.jbi.2006.09.002. [DOI] [PubMed] [Google Scholar]
  24. Tsuruoka Y. Probabilistic term variant generator for biomedical terms. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2003. pp. 167–173.
  25. Fundel K, Güttler D, Zimmer R, Apostolakis J. A simple approach for protein name identification: Prospects and limits. BMC Bioinform. 2005;6(Suppl 1):S15. doi: 10.1186/1471-2105-6-S1-S15. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Gaizauskas R, Demetriou G, Artymiuk PJ, Willett P. Protein structures and information extraction from biological texts: The PASTA system. Bioinformatics. 2003;19:135–143. doi: 10.1093/bioinformatics/19.1.135. [DOI] [PubMed] [Google Scholar]
  27. Hakenberg J, Bickel S, Plake C, Brefeld U. et al. Systematic feature evaluation for gene name recognition. BMC Bioinform. 2005;6:S9. doi: 10.1186/1471-2105-6-S1-S9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Tanabe L, Wilbur W. Tagging gene and protein names in biomedical text. Bioinformatics. 2002;18:1124–1132. doi: 10.1093/bioinformatics/18.8.1124. [DOI] [PubMed] [Google Scholar]
  29. Settles B. ABNER: An open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics. 2005;21:3191–3192. doi: 10.1093/bioinformatics/bti475. [DOI] [PubMed] [Google Scholar]
  30. Sætre R, Sagae K, Tsujii J. Syntactic features for protein-protein interaction extraction. Proceedings of the 2nd International Symposium on Languages in Biology and Medicine. 2007. pp. 6.1–6.14.
  31. Leaman R, Gonzalez G. BANNER: An executable survey of advances in biomedical named entity recognition. Pacific Symposium on Biocomputing. 2008. pp. 652–663. [PubMed]
  32. Tsuruoka Y, Tateishi Y, Kim J, Ohta T. et al. Developing a robust part-of-speech tagger for biomedical text. Proceedings of Panhellenic Conference on Informatics. 2005;3746:382–392. [Google Scholar]
  33. Airola A, Pyysalo S, Björne J, Pahikkala T. et al. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinform. 2008;9:S2. doi: 10.1186/1471-2105-9-S11-S2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Gerner M, Nenadic G, Bergman CM. LINNAEUS: A species name identification system for biomedical literature. BMC Bioinform. 2010;11:85. doi: 10.1186/1471-2105-11-85. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Hahn U, Buyko E, Landefeld R. An overview of JCoRe, the JULIE lab UIMA component repository. Proceedings of the LREC'08 Workshop Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP. 2008. pp. 1–7.
  36. Smith L, Rindflesch T, Wilbur W. MedPost: A part-of-speech tagger for bioMedical text. Bioinformatics. 2004;20:2320–2321. doi: 10.1093/bioinformatics/bth227. [DOI] [PubMed] [Google Scholar]
  37. Mika S, Rost B. NLProt: Extracting protein names and sequences from papers. Nucleic Acids Res. 2004;32:W634–W637. doi: 10.1093/nar/gkh427. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Hunter L, Lu Z, Firby J, Baumgartner WA. et al. OpenDMAP: An open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinform. 2008;9:78. doi: 10.1186/1471-2105-9-78. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Corbett P, Murray-Rust P. High-throughput identification of chemistry in life science texts. Proceedings of the 2nd International Symposium on Computational Life Science. 2006. pp. 107–118.
  40. Song Y, Kim E, Lee GG, Yi BK. POSBIOTM-NER: A trainable biomedical named-entity recognition system. Bioinformatics. 2005;21:2794–2796. doi: 10.1093/bioinformatics/bti414. [DOI] [PubMed] [Google Scholar]
  41. Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H. et al. Text processing through Web services: calling Whatizit. Bioinformatics. 2008;24:296–298. doi: 10.1093/bioinformatics/btm557. [DOI] [PubMed] [Google Scholar]
  42. Liu H, Aronson AR, Friedman C. A study of abbreviations in MEDLINE abstracts. Proceedings/AMIA Annual Symposium AMIA Symposium. 2002. pp. 464–468. [PMC free article] [PubMed]
  43. Okazaki N, Ananiadou S. Building an abbreviation dictionary using a term recognition approach. Bioinformatics. 2006;22:3089–3095. doi: 10.1093/bioinformatics/btl534. [DOI] [PubMed] [Google Scholar]
  44. Tsuruoka Y, Ananiadou S. A machine learning approach to acronym generation. Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics. 2005. pp. 25–31.
  45. Bracewell D, Russell S, Wu A. Identification, expansion, and disambiguation of acronyms in biomedical texts. Lect Notes Comput Sci. 2005;3759:186–195. doi: 10.1007/11576259_21. [DOI] [Google Scholar]
  46. Koning D, Sarkar I, Moritz T. TaxonGrab: Extracting taxonomic names from text. Biodiversity Inform. 2005;2:79–82. [Google Scholar]
  47. Sarntivijai S, Ade AS, Athey BD, States DJ. A bioinformatics analysis of the cell line nomenclature. Bioinformatics. 2008;24:2760–2766. doi: 10.1093/bioinformatics/btn502. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Pyysalo S, Ginter F, Heimonen J, Björne J. et al. BioInfer: A corpus for information extraction in the biomedical domain. BMC Bioinform. 2007;8:50. doi: 10.1186/1471-2105-8-50. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Wang X, Tsujii J, Ananiadou S. Disambiguating the species of biomedical named entities using natural language parsers. Bioinformatics. 2010;26:661–667. doi: 10.1093/bioinformatics/btq002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Alex B, Grover C, Haddow B, Kabadjov M. Assisted curation: Does text mining really help? Pac Symp Biocomput. 2008. pp. 556–567. [PubMed]
  51. Craven M, Kumlien J. Constructing biological knowledge bases by extracting information from text sources. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. 1999. pp. 77–86. [PubMed]
  52. Santos C, Eggle D, States D. Wnt pathway curation using automated natural language processing: Combining statistical methods with partial and full parse for knowledge extraction. Bioinformatics. 2005;21:1653–1658. doi: 10.1093/bioinformatics/bti165. [DOI] [PubMed] [Google Scholar]
  53. Waagmeester A, Pezik P, Coort S, Tourniaire F. et al. Pathway enrichment based on text mining and its validation on carotenoid and vitamin A metabolism. OMICS. 2009;13:367–379. doi: 10.1089/omi.2009.0029. [DOI] [PubMed] [Google Scholar]
  54. Lau WW, Johnson CA, Becker KG. Rule-based human gene normalization in biomedical text with confidence estimation. Comput Syst Bioinformatics Conf. 2007;6:371–379. [PubMed] [Google Scholar]
  55. Wang X, Matthews M. Comparing usability of matching techniques for normalising biomedical named entities. Pac Symp Biocomput. 2008;13:628–639. [PubMed] [Google Scholar]
  56. Grover C, Haddow B, Klein E, Matthews M. Adapting a relation extraction pipeline for the BioCreAtIvE II task. Proceedings of the BioCreAtIvE II Workshop. 2007.
  57. Wang X. Rule-based protein term identification with help from automatic species tagging. Proceedings of CICLING. 2007. pp. 288–298.
  58. Crim J, McDonald R, Pereira F. Automatically annotating documents with normalized gene lists. BMC Bioinform. 2005;6:S13. doi: 10.1186/1471-2105-6-S1-S13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Farkas R. The strength of co-authorship in gene name disambiguation. BMC Bioinform. 2008;9:69. doi: 10.1186/1471-2105-9-69. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Morgan AA, Lu Z, Wang X, Cohen AM. et al. Overview of BioCreative II gene normalization. Genome Biol. 2008;9:S3. doi: 10.1186/gb-2008-9-s2-s3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Kappeler T, Kaljurand K, Rinaldi F. TX task: Automatic detection of focus organisms in biomedical publications. Proceedings of the Workshop on BioNLP. 2009. pp. 80–88.
  62. Fundel K, Kuffner R, Zimmer R. RelEx-Relation extraction using dependency parse trees. Bioinformatics. 2007;23:365–371. doi: 10.1093/bioinformatics/btl616. [DOI] [PubMed] [Google Scholar]
  63. Hu ZZ, Narayanaswamy M, Ravikumar KE, Vijay-Shanker K. et al. Literature mining and database annotation of protein phosphorylation using a rule-based system. Bioinformatics. 2005;21:2759–2765. doi: 10.1093/bioinformatics/bti390. [DOI] [PubMed] [Google Scholar]
  64. Niu Y, Otasek D, Jurisica I. Evaluation of linguistic features useful in extraction of interactions from PubMed: Application to annotating known, high-throughput and predicted interactions in I2D. Bioinformatics. 2009;26:111–119. doi: 10.1093/bioinformatics/btp602. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Fayruzov T, Cock MD, Cornelis C, Hoste V. Linguistic feature analysis for protein interaction extraction. BMC Bioinform. 2009;10:374. doi: 10.1186/1471-2105-10-374. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Hatzivassiloglou V, Weng W. Learning anchor verbs for biological interaction patterns from published text articles. Int J Med Inform. 2002;67:19–32. doi: 10.1016/S1386-5056(02)00054-0. [DOI] [PubMed] [Google Scholar]
  67. Kilicoglu H, Bergler S. Recognizing speculative language in biomedical research articles: A linguistically motivated perspective. BMC Bioinform. 2008;9:S10. doi: 10.1186/1471-2105-9-S11-S10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Sanchez-Graillet O, Poesio M. Negation of protein-protein interactions: Analysis and extraction. Bioinformatics. 2007;23:i424–i432. doi: 10.1093/bioinformatics/btm184. [DOI] [PubMed] [Google Scholar]
  69. Davies R. The creation of new knowledge by information retrieval and classification. J Doc. 1989;4:273–301. [Google Scholar]
  70. Swanson D. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med. 1986;30:7–18. doi: 10.1353/pbm.1986.0087. [DOI] [PubMed] [Google Scholar]
  71. DiGiacomo R, Kremer J, Shah D. Fish-oil dietary supplementation in patients with Raynaud's phenomenon: A double-blind, controlled, prospective study. Am J Med. 1989;86:158–164. doi: 10.1016/0002-9343(89)90261-1. [DOI] [PubMed] [Google Scholar]
  72. Murray-Rust P. Data Driven Science-A Scientist's View. NSF/JISC 2007 Digital Repositories Workshop. 2007. http://www.sis.pitt.edu/~repwkshop/papers/murray.html
  73. Hettne K, de Mos M, de Bruijn A, Weeber M. Applied information retrieval and multidisciplinary research: New mechanistic hypotheses in complex regional pain syndrome. J Biomed Discov Collab. 2007;2:2. doi: 10.1186/1747-5333-2-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  74. van Haagen H, 't Hoen P, Bovo AB, de Morrée A. et al. Novel protein-protein interactions inferred from literature context. PLoS ONE. 2009;4:e7894. doi: 10.1371/journal.pone.0007894. [DOI] [PMC free article] [PubMed] [Google Scholar]
  75. Natarajan J, Berrar D, Dubitzky W, Hack C. et al. Text mining of full-text journal articles combined with gene expression analysis reveals a relationship between sphingosine-1-phosphate and invasiveness of a glioblastoma cell line. BMC Bioinform. 2006;7:373. doi: 10.1186/1471-2105-7-373. [DOI] [PMC free article] [PubMed] [Google Scholar]
  76. Cory K. Discovering hidden analogies in an online humanities database. Comput Hum. 1997;31:1–12. doi: 10.1023/A:1000422220677. [DOI] [Google Scholar]
  77. Gordon M, Lindsay R, Fan W. Literature-based discovery on the World Wide Web. ACM Trans Inter Tech. 2002;2:261–275. doi: 10.1145/604596.604597. [DOI] [Google Scholar]
  78. Hristovski D, Peterlin B, Mitchell J, Humphrey S. Using literature-based discovery to identify disease candidate genes. Int J Med Inform. 2005;74:289–298. doi: 10.1016/j.ijmedinf.2004.04.024. [DOI] [PubMed] [Google Scholar]
  79. Kostoff R, Briggs M, Lyons T. Literature-related discovery (LRD): Potential treatments for multiple sclerosis. Technol Forecast Soc Change. 2007;75:239–255. [Google Scholar]
  80. Kostoff R. Literature-related discovery (LRD): Potential treatments for cataracts. Technol Forecast Soc Change. 2007;75:215–225. doi: 10.1016/j.techfore.2011.03.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  81. Srinivasan P, Libbus B. Mining MEDLINE for implicit links between dietary substances and diseases. Bioinformatics. 2004;20:i290–i296. doi: 10.1093/bioinformatics/bth914. [DOI] [PubMed] [Google Scholar]
  82. Srinivasan P, Libbus B, Sehgal A. Mining medline: Postulating a beneficial role for curcumin longa in retinal diseases. HLT Biolink. 2004. pp. 33–40.
  83. Swanson D, Smalheiser N, Bookstein A. Information discovery from complementary literatures: Categorizing viruses as potential weapons. J Am Soc Inf Sci Technol. 2001;52:797–812. doi: 10.1002/asi.1135. [DOI] [Google Scholar]
  84. Weeber M, Vos R, Klein H, de Jong-van den Berg LT. et al. Generating hypotheses by discovering implicit associations in the literature: A case report of a search for new potential therapeutic uses for thalidomide. J Am Med Inform Assoc. 2003;10:252–259. doi: 10.1197/jamia.M1158. [DOI] [PMC free article] [PubMed] [Google Scholar]
  85. Wren JD, Bekeredjian R, Stewart JA, Shohet RV. et al. Knowledge discovery by automated identification and ranking of implicit relationships. Bioinformatics. 2004;20:389–398. doi: 10.1093/bioinformatics/btg421. [DOI] [PubMed] [Google Scholar]
  86. Wren J. Data-mining analysis suggests an epigenetic pathogenesis for type 2 diabetes. J Biomed Biotechnol. 2005;2:104–112. doi: 10.1155/JBB.2005.104. [DOI] [PMC free article] [PubMed] [Google Scholar]
  87. Zhou X, Liu B, Wu Z, Feng Y. Integrative mining of traditional Chinese medicine literature and MEDLINE for functional gene networks. Artif Intell Med. 2007;41:87–104. doi: 10.1016/j.artmed.2007.07.007. [DOI] [PubMed] [Google Scholar]
  88. Hoffmann R, Valencia A. Implementing the iHOP concept for navigation of biomedical literature. Bioinformatics. 2005;21:ii252–ii258. doi: 10.1093/bioinformatics/bti1142. [DOI] [PubMed] [Google Scholar]
  89. Eales J, Pinney J, Stevens R, Robertson D. Methodology capture: Discriminating between the "best" and the rest of community practice. BMC Bioinform. 2008;9:359. doi: 10.1186/1471-2105-9-359. [DOI] [PMC free article] [PubMed] [Google Scholar]
  90. Steele E, Tucker A, 't Hoen P, Schuemie M. Literature-based priors for gene regulatory networks. Bioinformatics. 2009;25:1768–1774. doi: 10.1093/bioinformatics/btp277. [DOI] [PubMed] [Google Scholar]
  91. MacCallum R, Kelley L, Sternberg M. SAWTED: Structure assignment with text description-enhanced detection of remote homologues with automated SWISS-PROT annotation comparisons. Bioinformatics. 2000;16:125–129. doi: 10.1093/bioinformatics/16.2.125. [DOI] [PubMed] [Google Scholar]
  92. Raychaudhuri S, Plenge RM, Rossin EJ, Ng ACY. et al. Identifying relationships among disease regions: Predicting genes at pathogenic SNP associations and rare deletions. PLoS Genet. 2009;5:e1000534. doi: 10.1371/journal.pgen.1000534. [DOI] [PMC free article] [PubMed] [Google Scholar]
  93. von Mering C, Jensen L, Snel B, Hooper S. STRING: Known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res. 2005;33:D433–D437. doi: 10.1093/nar/gki005. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Human Genomics are provided here courtesy of BMC
