Dragon Plant Biology Explorer. A Text-Mining Tool for Integrating Associations between Genetic and Biochemical Entities with Genome Annotation and Biochemical Terms Lists

Vladimir B Bajic; Merlin Veronika; Pardha Sarathi Veladandi; Archana Meka; Mok-Wei Heng; Kanagasabai Rajaraman; Hong Pan; Sanjay Swarup

doi:10.1104/pp.105.060863

Plant Physiol. 2005 Aug;138(4):1914–1925.

PMCID: PMC1183383 PMID: 16172098

Abstract

We introduce a tool for text mining, Dragon Plant Biology Explorer (DPBE) that integrates information on Arabidopsis (Arabidopsis thaliana) genes with their functions, based on gene ontologies and biochemical entity vocabularies, and presents the associations as interactive networks. The associations are based on (1) user-provided PubMed abstracts; (2) a list of Arabidopsis genes compiled by The Arabidopsis Information Resource; (3) user-defined combinations of four vocabulary lists based on the ones developed by the general, plant, and Arabidopsis GO consortia; and (4) three lists developed here based on metabolic pathways, enzymes, and metabolites derived from AraCyc, BRENDA, and other metabolism databases. We demonstrate how various combinations can be applied to fields of (1) gene function and gene interaction analyses, (2) plant development, (3) biochemistry and metabolism, and (4) pharmacology of bioactive compounds. Furthermore, we show the suitability of DPBE for systems approaches by integration with “omics” platform outputs. Using a list of abiotic stress-related genes identified by microarray experiments, we show how this tool can be used to rapidly build an information base on the previously reported relationships. This tool complements the existing biological resources for systems biology by identifying potentially novel associations using text analysis between cellular entities based on genome annotation terms. Thus, it allows researchers to efficiently summarize existing information for a group of genes or pathways, so as to make better informed choices for designing validation experiments. Last, DPBE can be helpful for beginning researchers and graduate students to summarize vast information in an unfamiliar area. DPBE is freely available for academic and nonprofit users at http://research.i2r.a-star.edu.sg/DRAGON/ME2/.

SYSTEMS APPROACHES

Advances in genome sciences, combined with high-throughput technologies, are allowing “omics” researchers to obtain large amounts of data that are essential in understanding biology at the systems level. While obtaining global data at the cellular and organellar levels is possible for a growing number of plant biologists, a major bottleneck still remains in the analyses and interpretation of such complex datasets. One of the formidable tasks for most researchers generating high-volume data is to become familiar with the vast amount of existing knowledge on key genes or gene products uncovered in their system. Systems approaches not only help generate this exhaustive parts list, but also have the potential to enhance this level of knowledge, albeit at the expense of increasing complexity. One of the key features that set apart systems biology from the nonsystems approaches is that it helps to identify associations among the elements of the system and eventually to have a predictive value (Kitano, 2002). In the past, such elements have generally been studied in isolation but in much depth, which has led to the accumulation of a large body of literature on many genes, pathways, and other entities and a limited number on their relationships. Hence, associations that are uncovered or predicted by systems biology need to be evaluated together with the existing knowledge on those elements. Mining of existing literature in relation to the elements of the system being studied is therefore crucial for validating the associations and is useful in designing further experiments.

REQUIREMENTS AND CHALLENGES FOR TEXT MINING IN PLANT BIOLOGY

Any text-mining effort should be able to deal effectively with the diversity of plant-related literature. Plant biology literature is extensive in that it covers diversity in form and function as well as diversity in the molecular and chemical makeup of cells or tissues. The literature is also rich due to the pharmacological effects of many plant natural products. At the genome level, it is well known that nearly 25% of the gene functions are involved in metabolism (Arabidopsis Genome Initiative, 2000; Rensink and Buell, 2004; Nagaki et al., 2004). These metabolic functions are also quite diverse in various plant species (Dewick, 2002). The growth in PubMed documents in selected areas of plant biology over the last two decades is represented in Figure 1. It is not surprising that there has been a steady increase in all the major categories shown here. A few trends emerge from this plot. First, the field of plant development is very active, with a steady increase over the last 20 years. Second, the plant gene literature has increased since 1996 but not as rapidly as in other fields shown here. This could possibly be due to the initial burst with the completion of the Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa) genomes. In contrast, the amount of plant biology and Arabidopsis literature on the three major systems platforms, genomics, proteomics, and metabolomics, has seen a significant rise. Hence, if this trend continues, a significant rise in literature arising from the adoption of systems approaches is expected.

Figure 1. — Increase in the indexed PubMed entries for scientific reports related to selected domains of plants, including Arabidopsis. Keywords used to search the abstracts in the databases were the same as given in the legend.

Mining of vast literature from the standpoint of a biologist is not trivial. With the exception of secondary literature, such as reviews and book chapters, all scientific reports contain a component of novelty. This novelty not only adds to the existing knowledgebase but also creates an association of concepts with those in many previous reports. The technological capabilities of biological research have further increased the complexity and volume of information through the large-scale high-throughput approaches in genomics, proteomics, and metabolomics (Fig. 1). Obtaining the relevant information on topics of interest has therefore become a challenge for biologists on a scale that did not exist earlier. Given the current scale of literature, it is no longer possible for an individual user to easily extract the relevant information on the individual concepts of his or her interest, let alone to identify the increased number of potential associations between the concepts. For the above reasons, many biomedical text-mining tools that can help in extracting such potential pieces of knowledge are being developed. However, currently none exists for plant biology.

A number of text-mining tools exist that cover different aspects of information extraction from biomedical literature in a variety of domains (Andrade and Bork, 2000; de Bruijn and Martin, 2002; Grivell, 2002; Schulze-Kremer, 2002; Dickman, 2003). Examples and features of such publicly available systems are given in Table I. However, none of these Web-based solutions is customized for mining plant biology literature, nor are they structured to integrate aspects of cellular function, structure, development, or biochemical mechanisms of plants and, specifically, of the model plant Arabidopsis. An effort toward addressing these concerns will require (1) usage of plant-specific vocabularies, (2) an efficient system for extracting information from the text, (3) building relationships between the extracted terms, and (4) representing the associations detected in a user-friendly fashion. Given the breadth of plant biology and the interests of Arabidopsis researchers, it is also essential for such a system to allow customization and flexibility of use. Keeping in view the state of the art in the text-mining domain and the above requirements, we have developed Dragon Plant Biology Explorer (DPBE; http://research.i2r.a-star.edu.sg/DRAGON/ME2/).

Table I.

Some examples of the online text-mining tools for biologists

Program	Features	URL	Reference
PubGene	Designed to identify relationships between genes based on their co-occurrence in the abstracts of scientific papers	http://www.pubgene.org/	Jenssen et al. (2001)
MedMiner	Filters extract and organize relevant sentences in the literature based on the query given	http://discover.nci.nih.gov/textmining/main.jsp	Tanabe et al. (1999)
XplorMed	Allows exploration of a set of abstracts derived from a MEDLINE search	http://www.bork.embl-heidelberg.de/xplormed/	Perez-Iratxeta et al. (2003)
PubMatrix	Compares a list of terms against another list of terms in PubMed	http://pubmatrix.grc.nia.nih.gov/	Becker et al. (2003)
AbXtract	Extracts domain-specific information from the analysis of abstracts related to a set of protein families	http://columba.ebi.ac.uk:8765/andrade/abx	Andrade and Valencia (1998)
VxInsight	A general tool for revealing the implicit structure of the data in large databases	http://www.cs.sandia.gov/projects/VxInsight.html	Kim et al. (2001)
SUISEKI	Extracts protein-protein interactions from large collections of scientific text	http://www.pdg.cnb.uam.es/suiseki/	Blaschke and Valencia (2001)
GIS	Biomedical text-mining system focused on gene-related information	http://iir.csie.ncku.edu.tw/~yuhc/gis/	Chiang et al. (2004)
PreBIND	Locates biomolecular interaction information in the scientific literature	http://www.blueprint.org/products/prebind/	Donaldson et al. (2003)
Genes2Diseases	Analyzes relations between phenotypic features and chemical objects, and from chemical objects to protein function terms, based on the whole MEDLINE and RefSeq databases	http://www.bork.embl-heidelberg.de/g2d/	Perez-Iratxeta et al. (2002)
HAPI	Links sets of genes in the published literature by way of keyword hierarchies	http://array.ucsd.edu/hapi	Masys et al. (2001)
TextPresso	An information retrieval and extraction system for biological literature (Caenorhabditis elegans version)	http://www.textpresso.org/	Muller et al. (2004)
Dragon TF Association Miner	Finds association between transcription factors, GO terms, and diseases. Has a module to filter out irrelevant documents	http://research.i2r.a-star.edu.sg/DRAGON/TFAM_v2/	Pan et al. (2004)

THE DPBE SYSTEM

The aim of the DPBE system is to find clues on potential associations between different searched components, particularly those that can suggest the function of the entity found or the association of its functionality with different domains of subcellular, cellular, organ, or whole-plant function. DPBE complements the existing biological resources by presenting associations that may have been described previously or that are possibly novel. This is possible by integrating the associations and forming their networks based on their relationships. At the core of the DPBE is a collection of nine manually curated vocabularies in combination with the available gene ontology (GO) lists that cover different aspects of plant form, development, biochemistry, genes (including mutants), and gene functions. Selection of different combinations of these vocabularies allows a biologist to analyze biological processes in plants from several domains. This combinatorial use of plant-based as well as universal vocabularies, together with the visualization of associations in the form of networks, is unique to DPBE, and none of the existing online resources provides such a cross-field and flexible form of integration. The structure of the DPBE system is described at the DPBE Web site under “System Description” (Supplemental Fig. 1). To use the system, the user needs to (1) collect the targeted documents via the PubMed search engine, (2) upload the file with such documents, (3) select a combination of vocabularies suitable for his or her research interest, and (4) provide an e-mail address to which the results can be sent.

VOCABULARIES USED FOR TEXT MINING IN DPBE

Controlled vocabularies based on standard nomenclatures and GOs can be particularly useful for the purposes of integrating information from various sources, including texts (Kelso et al., 2003; Berardini et al., 2004; Harris et al., 2004). Hence, we have adopted such vocabularies for text-mining purposes. GOs are structured, controlled vocabularies that describe gene products in terms of their associated biological processes, cellular components, and molecular functions in a species-independent manner. The GO Consortium (www.geneontology.org) is mainly responsible for developing such species-independent vocabularies. At the next level, there are plant-specific vocabularies developed by the Plant Ontology Consortium (www.plantontology.org) for plant structure and for growth and developmental stages. Finally, there are species-specific vocabularies, such as those for Arabidopsis (Berardini et al., 2004; http://www.arabidopsis.org/info/ontologies). A combination of these vocabularies can therefore cover general ontologies as well as plant structure and function. However, these ontologies do not include fields of metabolism and biochemistry. Hence, we have created a set of vocabularies for this area, as described here.

The functioning of DPBE is based on the use of nine well-controlled vocabularies containing a total of 92,052 terms that cover different aspects of plant biology with special reference to Arabidopsis (Table II). The vocabularies are organized as follows: (1) one list of Arabidopsis genes consisting of the gene names compiled by The Arabidopsis Information Resource (TAIR; Berardini et al., 2004), and developmental and other mutants from various sources; (2) an anatomy and plant parts list containing terms developed by TAIR and the Plant Ontology Consortium; (3) a list of developmental stages developed in the same way as the anatomy list; (4) three lists of universal biochemical entities, based on metabolic pathways, enzymes, and metabolites derived from AraCyc (Mueller et al., 2003), BRENDA (Pharkya et al., 2003), LIGAND (Kanehisa, 1997; Kanehisa and Goto, 2000), and other metabolism databases; and (5) three lists of universal GO terms relating to cellular components, biological processes, and molecular function, respectively, that are obtained from the GO Consortium.

Table II.

Details of the nine vocabularies used in DPBE

Class/Coverage	Subclass	No. of Terms/Entries	Source(s)
Genes (including mutants)/Arabidopsis	Genes	56,619	ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR_sequenced_genes
	Mutants	307	http://mutant.lse.okstate.edu/embryopage/emb_list.html
Anatomy/plants (including Arabidopsis)		466	ftp://ftp.arabidopsis.org/home/tair/Genes/Gene_Anatomy/
			http://www.plantontology.org/ontology/index.html
			Fahn (1990); Mauseth (1988)
Developmental stages/plants (including Arabidopsis)		102	ftp://ftp.arabidopsis.org/home/tair/Genes/Gene_Developmentalstage/
			http://www.plantontology.org/ontology/index.html
Metabolism/all organisms (including Arabidopsis)	Pathways	186	ftp://ftp.arabidopsis.org/home/tair/Pathways/
	Enzymes	15,557	ftp://ftp.genome.jp/pub/kegg/ligand/enzyme
			ftp://ftp.arabidopsis.org/home/tair/Pathways/
			http://www.brenda.uni-koeln.de/
	Metabolites	16,042	ftp://ftp.genome.jp/pub/kegg/ligand/compound
			ftp://ftp.arabidopsis.org/home/tair/Pathways/
General terms/all organisms (including Arabidopsis)	Cellular components	276	ftp://ftp.arabidopsis.org/home/tair
			http://www.geneontology.org/ontology/component.ontology
	Biological process	1,113	ftp://ftp.arabidopsis.org/home/tair
			http://www.geneontology.org/ontology/process.ontology
	Molecular function	1,384	ftp://ftp.arabidopsis.org/home/tair
			http://www.geneontology.org/ontology/function.ontology
Total		92,052
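
As a quick consistency check on Table II, the per-list term counts can be summed and compared with the stated total. The Python sketch below copies the counts from the table; the dictionary keys are illustrative labels only and are not identifiers used by DPBE. Genes and mutants are listed separately in the table but are merged into a single vocabulary, which is why ten counts correspond to nine vocabularies.

```python
# Term counts copied from Table II. The keys are illustrative labels,
# not names used by the DPBE system itself.
VOCABULARY_SIZES = {
    "genes": 56_619,
    "mutants": 307,  # merged with the gene list into one vocabulary
    "anatomy": 466,
    "developmental_stages": 102,
    "pathways": 186,
    "enzymes": 15_557,
    "metabolites": 16_042,
    "cellular_components": 276,
    "biological_process": 1_113,
    "molecular_function": 1_384,
}

total_terms = sum(VOCABULARY_SIZES.values())
print(total_terms)  # 92052, matching the total reported in Table II
```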

EXPLORER MODULES OF DPBE

In the input module, the nine vocabularies have been organized into four Explorer modules for ease of choice by users based on their broad fields of interest (Fig. 2). For advanced users of the system, it is also possible to select any combination of vocabularies as a fifth option. The DPBE explorer modules are as follows.

Gene/Protein Function Explorer. This module only refers to the genes and mutant list, which is specific to Arabidopsis. The searches within the system have been made case insensitive; hence, proteins with the corresponding uppercase names are also picked up. Mutants are also included in this search as their lists have been merged with the gene list.
Plant Development Explorer. As plant development is a very active area with more than 10,000 reports in PubMed currently (Fig. 1), we have built this module with components from common plant anatomy and developmental terms, as well as those adopted by the Arabidopsis consortium, TAIR. Based on searches we have performed with this module, we believe that it is useful not only for integrating information from much work on Arabidopsis development but also in connecting the work with that done in other plant species. The plant anatomy vocabularies are especially helpful in localizing the role and expression of developmental genes.
Metabolome Explorer. This tool has been developed for performing searches on biochemical, metabolic, and physiological aspects of plant biology. Researchers working on Arabidopsis as well as on other plant species can benefit directly from this tool because of the nature and the comprehensiveness of the vocabularies developed here (Table II). This module can be bridged with the plant development and structure vocabularies. With such a choice, the literature can be mined for the developmental role of metabolites and their tissue-specific occurrence or co-occurrence with other metabolites in selected plant parts.
Natural Products Pharmacology Explorer. The three biochemical vocabularies developed here have entities found in all types of organisms, including plants and mammals. Similarly, the general GOs are universal in nature. Hence, this combination allows users to explore the pharmacological effects of plant-based natural products.
Customized Plant Biology Explorer. Last, based on the diversity of the fields of interest and the choice of vocabularies available, a wider range of choice is made available in the last mining module. Researchers in plant-microbe interactions, stress physiology, or subcellular biology can easily use a suitable set of vocabularies to suit their purposes. In addition, the DPBE system is designed to integrate well with the outputs from systems approaches such as transcriptomics, proteomics, and metabolomics. This part is explained below as part of the section on applications.
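
The case-insensitive matching mentioned for the Gene/Protein Function Explorer can be illustrated with a short Python sketch. This is not the DPBE implementation, which relies on a fast indexing schema (Pan et al., 2004); the function name `find_terms`, the toy vocabulary, and the example sentence are all assumptions made for illustration.

```python
import re

def find_terms(text, vocabulary):
    """Return the vocabulary terms that occur in text, ignoring case.

    A term only matches when it is not embedded in a longer word, so that
    a short gene name such as "AG" is not reported inside "agar".
    """
    found = set()
    for term in vocabulary:
        # re.escape protects names containing characters such as "-" or "."
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(term)
    return found

# Hypothetical mini-vocabulary and sentence, for illustration only.
genes = {"RGA", "SPY", "AG"}
abstract = "The rga and spy mutants alter gibberellin signalling."
print(sorted(find_terms(abstract, genes)))  # ['RGA', 'SPY']
```

Because case is ignored, a lowercase mutant name and an uppercase protein name both resolve to the same vocabulary entry, which is the behavior described above for merged gene and mutant lists.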

Figure 2. — A section of the input module of the DPBE from its home page (http://research.i2r.a-star.edu.sg/DRAGON/ME2/). There are five input Explorer modules of DPBE: (1) Gene/Protein Function Explorer, (2) Plant Development Explorer, (3) Metabolome Explorer, (4) Natural Products Pharmacology Explorer, and (5) Customized Plant Biology Explorer. For each module, a selection of vocabularies is available. A part not shown here includes instructions for “omics” researchers.

TABULAR AND GRAPHICAL OUTPUTS OF DPBE

Once the input information is provided, the system builds a fast indexing schema based on the selected vocabularies and then matches the terms from the vocabularies against the supplied text. The technical details of this process have been described previously (Pan et al., 2004). When the system has analyzed all documents, it produces two categories of reports: tabular and graphical (Fig. 3). Several types of tables are generated for the users. The first type of table provides a list of all terms found, with an indication of the number of documents containing each term as well as links to the original PubMed documents. Another tabular report lists the most frequently referenced documents, so that the relevance of each document can be indirectly assessed and the user can select the most “informative” ones. The third tabular report provides a list of pairs of terms (we call each pair an “association”) based on the frequency of their co-occurrence in the same document. In all these reports, the terms identified are color marked and linked to the original documents for easier inspection. A second set of reports is generated in DPBE, in which the documents are clustered by a self-organizing map (an artificial neural network) based on the terms identified in the documents. Documents with similar content tend to be clustered together. This allows the user to select more closely related groups of documents for further study from the large volume of the originally collated document set.
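
The first and third tabular reports rest on two simple counts: the number of documents containing each term, and the number of documents in which two terms co-occur. A minimal Python sketch of these counts follows. It assumes that term matching has already produced a list of terms per document; the function name and the toy data are hypothetical, and the sketch is not the DPBE code.

```python
from collections import Counter
from itertools import combinations

def count_terms_and_pairs(doc_terms):
    """Count, per term and per term pair, the number of documents involved.

    doc_terms maps a document identifier to the terms found in it.
    Each term and each pair is counted at most once per document.
    """
    term_docs = Counter()
    pair_docs = Counter()
    for terms in doc_terms.values():
        unique = sorted(set(terms))  # sorting gives every pair one canonical order
        term_docs.update(unique)
        pair_docs.update(combinations(unique, 2))
    return term_docs, pair_docs

# Toy input, for illustration only.
doc_terms = {
    "doc1": ["GA", "RGA", "AG"],
    "doc2": ["GA", "RGA"],
    "doc3": ["GA", "SPY"],
}
term_docs, pair_docs = count_terms_and_pairs(doc_terms)
print(term_docs["GA"])           # 3 documents mention GA
print(pair_docs[("GA", "RGA")])  # 2 documents mention GA together with RGA
```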

Figure 3. — Overview of the interactive Output Modules of DPBE. A, Three types of tabular outputs. Two tabular outputs that are hyperlinked to the main table are shown with horizontal arrows. B, A network of associations that is linked to the main table is shown with a vertical arrow. C, The color and shape coding used to highlight the vocabulary elements in the abstracts. D, The nodes of the association networks (shown in B) are linked to the PubMed abstracts that are stored locally and displayed with the highlighted text. This allows users to rapidly screen the relevant literature. The interactivity can be seen at the DPBE Web site under “Examples” (http://research.i2r.a-star.edu.sg/DRAGON/ME2/).

Finally, DPBE presents the associations between the terms from vocabularies in a graphical format. Depending on the topic of analysis and the vocabularies chosen, many association networks can be very complex. We have therefore built in a flexible mechanism of choosing vocabularies to handle this complexity. The user is also provided with an option to control the minimum number of co-occurrences that a pair of terms must have to be shown in the network. The higher this number, the less complex the network presented to the user. If the user wants to inspect the association networks in more detail, the threshold can be lowered. The nodes in the network represent the different terms identified from the vocabularies selected by the user. Each node varies in shape and color according to the vocabulary to which the term belongs. All nodes are linked to the set of original documents from which the information linking the node/term is extracted. This is done to allow users to inspect the PubMed abstracts directly for the associations proposed by DPBE.
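
The effect of the co-occurrence threshold on network complexity can be sketched in a few lines of Python. The function name, the inclusive comparison (`>=`), and the example counts are assumptions made for illustration; the text does not specify how DPBE applies the threshold internally.

```python
def filter_associations(pair_counts, min_cooccurrence):
    """Keep only term pairs seen together in at least min_cooccurrence documents."""
    return {
        pair: count
        for pair, count in pair_counts.items()
        if count >= min_cooccurrence  # inclusive threshold is an assumption
    }

# Hypothetical co-occurrence counts, for illustration only.
pair_counts = {("GA", "RGA"): 5, ("AG", "RGA"): 2, ("GA", "SPY"): 1}

for threshold in (1, 2, 5):
    edges = filter_associations(pair_counts, threshold)
    print(threshold, len(edges))
```

Raising the threshold from 1 to 2 to 5 leaves three, two, and one edge respectively, which mirrors the statement that a higher threshold yields a simpler network.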

Several networks of association could be generated as part of the graphical output. There are two reasons for the generation of multiple networks from a single set of documents. First, separate networks are generated when the terms found in one network do not co-occur with the terms from the other networks. Second, depending on the topic of the literature search, even with very specific selection of documents the resulting networks could be very complex. To allow users to view these networks in the browser, an automatic partitioning procedure has been implemented to split a large network into several smaller networks (Pan et al., 2004). Users can minimize this by selecting fewer vocabulary lists and making their selection of documents even more specific.
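
The first reason for multiple networks, namely that terms in one network never co-occur with terms in another, corresponds to finding the connected components of the association graph. The Python sketch below illustrates only this case. It does not reproduce the automatic partitioning procedure used for very large networks (Pan et al., 2004), and the function name and example edges are hypothetical.

```python
from collections import defaultdict

def split_into_networks(edges):
    """Group terms into separate networks (connected components).

    edges is an iterable of (term_a, term_b) associations. Two terms end up
    in the same network when a chain of associations links them.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    networks = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        networks.append(sorted(component))
    return networks

# Hypothetical associations, for illustration only.
edges = [("GA", "RGA"), ("RGA", "AG"), ("histamine", "papaverine")]
print(split_into_networks(edges))
# [['AG', 'GA', 'RGA'], ['histamine', 'papaverine']]
```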

LIMITATIONS AND EXPECTATIONS FROM DPBE USAGE

The outputs of the DPBE system are not biological networks and should not be taken to be so. Since this system analyzes co-occurrence of terms within a document, and since the documents analyzed are abstracts of scientific reports that present summaries of the most important findings, there is likely to be at least a loose relation between the co-occurring terms. However, the actual nature of these relations is not analyzed by the system. It is left to the user to accept or reject the association proposed by the system. For this reason, links are provided to view and quickly review the relevance of PubMed abstracts. The second issue is the completeness of information or the lack of it. Since the analyzed documents are abstracts, it is unlikely that the system collects all relevant information on the association of terms. An ideal situation would be the mining capability for full texts, which the core toolset presented here can perform. However, owing to the lack of availability of the full texts from many sources, we have restricted the current version to PubMed abstracts. Thus, the resultant association maps will only represent a subset of all possible relationships. Last, the time frame of analysis has to be kept in mind while submitting the queries. Most of the analysis time is spent on the generation of complex association map networks. Sometimes, the networks produced are so large that they cannot be opened and viewed in the Internet browser. In such cases, a more specific selection of documents is suggested, as well as the selection of a smaller number of vocabularies for use in the analysis.

SELECTED APPLICATIONS OF THE DPBE SYSTEM

Here, we describe four examples to showcase the ability of the DPBE system to help integrate knowledge from various fields of plant biology, with special reference to Arabidopsis. These examples cover gene functions, biochemistry, and pharmacology of plant natural products. They are also based on single entity analysis and a list of genes from gene expression profiling data. The complete interactive versions of the networks for the four examples, as shown in Supplemental Figures 2 to 5; their starting PubMed abstract lists; and the parameters used for their generation are provided at the DPBE Web site (http://research.i2r.a-star.edu.sg/DRAGON/ME2/) under “Examples.”

Integrating Biochemistry and Gene Function Knowledge in Developmental Biology

We were interested in finding examples of metabolites that affect gene function and displaying networks of how they may bring about this effect. This analysis was done in two parts. In the first part, a very broad text-mining exercise was conducted to identify some associations of biochemical entities and developmental genes. In the second part, one of the developmental genes identified in the first part was chosen for analysis by gene function association networks. For the first part, a set of 11,780 abstracts was retrieved from PubMed with the broad keywords “plants” and “genes.” Only two vocabulary lists were selected, namely, “metabolites” and “genes” (including protein and mutant names). Tables showing clusters of terms were then screened to find associations of metabolites with some of the well-studied developmental genes. In this analysis, one of the more cited cases was that of the plant growth hormone GA and the well-studied Agamous (AG) gene of Arabidopsis. Among the association networks, display network 5, with the minimum number of links per node set to two, showed the GA-AG connection via the gene RGA (Supplemental Fig. 2A). This connection was established only recently (Yu et al., 2004). The GA network showed the association of (1) biosynthetic metabolites and enzymes such as ent-kaurene and GA oxidase; (2) GA signal transduction molecules such as RGA, SLY1, and SPY; and (3) studies involving multiple growth hormones.

The second part of this analysis focused on the well-studied case of AG. A set of 12,494 PubMed abstracts on “flowering” was submitted to the DPBE system together with three vocabulary lists: genes, anatomy, and development. A part of the largest network showing association of AG with 28 other genes is shown in Supplemental Figure 2B. There were other networks from the same analysis that also showed additional associations of AG with other entities from the three lists selected here. The early transcriptional program controlled by AG was described recently (Gomez-Mena et al., 2005). Nearly all the genes found in the AG association network shown here and in two others (data not shown) are described in that report.

Biochemistry and Pharmacological Effects of Plant Natural Products

Several types of pharmacological effects were uncovered in the course of testing DPBE. While three preanalyzed association network maps with interactive nodes are displayed at the DPBE Web site, only one example is provided here (Supplemental Fig. 3). For this analysis, a set of 10,997 PubMed abstracts containing the keywords “plants” and “alkaloids” was submitted for text-mining analysis by the DPBE system. Only two vocabularies, metabolites and anatomy, were chosen. Display network 5, with a setting of a minimum of two links per node, showed the associations among several types of alkaloids and with the vasodilator histamine. As many as 258 networks were generated, likely because of the reasons given above. Alkaloids from at least five different routes of synthesis were captured (Dewick, 2002). Some of the alkaloids in one part of the network shown here are capsaicin, scopolamine (hyoscine), physostigmine, papaverine, and pilocarpine. In the same part of the network, the tropane alkaloid biosynthesis metabolite, tropic acid, was also captured by the system. In a different part of the above network, association of several alkaloids with noradrenaline was displayed. In network 2, with a setting of a minimum of two links per node, association of several alkaloids with the human catechol hormones, norepinephrine and phenylephrine, was displayed. This interactive network can be accessed at the DPBE Web site under the section “Examples.” Hence, the DPBE system condensed more than 10,000 documents into several major networks that summarized most of the effects of alkaloids. As a result, more focused searches can be performed, as described for the plant development case above.

Application of DPBE Analysis for Small Sets of Documents

To test the utility of DPBE in analyzing small sets of documents, a PubMed search was performed using the term “Arabidopsis and brassinosteroid.” This resulted in a small set of 108 documents, which were analyzed by selecting two vocabularies (genes and metabolites). The resulting network 1, with a setting of two links per node, showed connectivities of many aspects of brassinosteroid work (Supplemental Fig. 4). Brassinosteroid biosynthetic and signal transduction pathway members, together with genes involved in cross talk with other hormones, were displayed. Hence, the DPBE analysis allowed merging of existing knowledge from multiple aspects of brassinosteroid biology into a few association networks.

Integrating the DPBE System with Results from Systems Approaches

The DPBE system has been designed to integrate lists of biological entities emerging from three frequently used levels of systems approaches: transcriptomics, proteomics, and metabolomics. A list of differentially affected entities, such as genes, proteins, or metabolites that are identified by such approaches, can be submitted as input for analysis by the DPBE system. However, owing to the heavy computational needs, this feature has been restricted to offline usage. Specific instructions are provided at the DPBE Web site for “omics” researchers on their data submission and output handling needs. Metabolomics researchers can use their list of differentially affected metabolites with the “metabolites” vocabulary and possibly a plant part vocabulary, depending on the number of texts submitted. Outputs of microarray and proteomics experiments can be analyzed in ways similar to the one described below.

Here, we present an example of detailed text analysis based on a recent transcript profiling study of genes affected by the well-known stress response regulator DREB1A of Arabidopsis (Maruyama et al., 2004). The authors performed a number of well-controlled experiments to identify a highly specific set of 22 genes that are affected by DREB1A. In addition, detailed validation experiments were performed to arrive at their final list of affected genes. The list of 22 genes is available at the DPBE Web site under the section “Examples.” A set of 3,221 PubMed abstracts was collected using these gene names for keyword searches and submitted for analysis by DPBE. From the networks generated, the lowest value of one link per node was chosen to capture as many associations for a single gene as possible. In network 3, 16 of the 22 genes from the original microarray dataset were present in the associations. One of the most cited genes in this network was RD29A, whose name was associated with at least 48 other genes, including DREB1A, as expected. Part of this network surrounding RD29A is shown in Figure 4A, and the full network is shown in Supplemental Figure 5. Because many of the genes present in the network seemed to belong to various biological pathways, we analyzed the text of the linked PubMed abstracts further, using the color coding there to rapidly screen the contents.
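The two checks described here, how many submitted genes appear in a network and which node carries the most links, can be sketched as follows. This is an illustrative outline only: the adjacency map is a toy example, and apart from RD29A and DREB1A (named in the text) the gene labels and function names are hypothetical.

```python
def coverage(input_genes, network_nodes):
    """Return (present, missing) input genes, compared case-insensitively."""
    wanted = {gene.upper() for gene in input_genes}
    in_network = {node.upper() for node in network_nodes}
    return sorted(wanted & in_network), sorted(wanted - in_network)


def most_connected(adjacency):
    """Return the node with the most links; ties are broken alphabetically."""
    return min(adjacency, key=lambda node: (-len(adjacency[node]), node))


# Toy adjacency map; GENE_X, GENE_Y, and GENE_Z are placeholders.
toy_network = {
    "RD29A": {"DREB1A", "GENE_X", "GENE_Y"},
    "DREB1A": {"RD29A"},
    "GENE_X": {"RD29A"},
    "GENE_Y": {"RD29A"},
}
present, missing = coverage(["rd29a", "DREB1A", "GENE_Z"], toy_network)
print(len(present), "of", len(present) + len(missing), "input genes are in the network")
print("hub:", most_connected(toy_network))
```

A hub found this way is a natural starting point for reading the linked abstracts, as was done for RD29A.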

A total of 37 PubMed abstracts linked to the RD29A node and its nearest connected links were collected from the network shown in Figure 4A. The interactive network, color-coded abstracts, and the complete list of 37 references are available at the DPBE Web site under “Examples.” These abstracts were manually scanned specifically for the entries highlighted with various colors, which simplified the screening procedure. The color-coded entries were then binned into five groups, namely, drought, cold, osmotic stress, hormones, and others, based on the presence of these words within the respective abstracts. These entries were then connected to RD29A and to one another, using terms for relationships (such as coregulated, up-regulated, and repressed) as given in the same abstracts. No additional searches or abstracts were used for this analysis. Hence, within a short period of time, the interactive network combined with the color-coded entries generated in the PubMed abstracts by the DPBE system allowed us to zoom in on a key node (RD29A) and develop a network of relationships based on experimentally verified data (Fig. 4B). This analysis may not be complete, as it is based solely on one network generated by DPBE. We present these results to demonstrate how the graphical outputs and color-coded entries in texts generated by the system can ease interpretation and capture the most significant parts of the existing literature. Networks developed in this way, such as the one in Figure 4B, are derived directly from experimental data and therefore have direct biological evidence.
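The binning step was done by hand in this study, but its logic can be outlined in a few lines. The sketch below assumes plain substring matching on lowercase text and one keyword per group; the keyword lists, the function name, and the toy abstracts are illustrative assumptions, and an abstract that mentions several stress types is placed in every matching group.

```python
# Hypothetical keyword lists for four of the five groups; "others" is the fallback.
GROUP_KEYWORDS = {
    "drought": ["drought"],
    "cold": ["cold"],
    "osmotic stress": ["osmotic"],
    "hormones": ["hormone"],
}


def bin_abstracts(abstracts, group_keywords=GROUP_KEYWORDS):
    """Assign each abstract ID to every group whose keyword appears in its text.

    Matching is by substring, so "cold" would also match inside a longer
    word; a real screen would need word-boundary checks and synonyms.
    """
    bins = {group: [] for group in group_keywords}
    bins["others"] = []
    for abstract_id, text in abstracts.items():
        lowered = text.lower()
        matched = [
            group
            for group, keywords in group_keywords.items()
            if any(keyword in lowered for keyword in keywords)
        ]
        for group in matched or ["others"]:
            bins[group].append(abstract_id)
    return bins


# Invented abstracts, for illustration only.
toy_abstracts = {
    "a1": "Drought and cold induce the gene.",
    "a2": "Hormone signalling modulates expression.",
    "a3": "Flowering time is affected.",
}
print(bin_abstracts(toy_abstracts))
```

The relationship terms (coregulated, up-regulated, repressed) would still need to be read from each abstract, since substring matching alone cannot tell which entities a relationship connects.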

In conclusion, we have developed an online tool that provides flexibility in choosing combinations of gene lists, GOs, and lists of biochemical entities to mine plant biology text, with special reference to Arabidopsis, and to build and display association map networks. This system allows researchers to obtain an integrated view of such information from existing literature across a wide range of fields of study. DPBE can be exploited by researchers using systems approaches to gain information on lists of cellular entities in a short period of time. We hope this tool will be especially useful for researchers who must rapidly assimilate the knowledge arising from such data-intensive approaches. We also hope it will help beginning researchers and students become familiar with rapidly advancing fields in a short period of time.

^[w] The online version of this article contains Web-only data.

www.plantphysiol.org/cgi/doi/10.1104/pp.105.060863.

References

Andrade MA, Bork P (2000) Automated extraction of information in molecular biology. FEBS Lett 476: 12–17
Andrade MA, Valencia A (1998) Automatic extraction of keywords from scientific knowledge: application to the knowledge domain of protein families. Bioinformatics 14: 600–607
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796–815
Becker KG, Hosack DA, Dennis G Jr, Lempicki RA, Bright TJ, Cheadle C, Engel J (2003) PubMatrix: a tool for multiplex literature mining. BMC Bioinformatics 4: 61
Berardini ZT, Mundodi S, Reiser L, Huala E, Hernandez MG, Zhang P, Mueller LA, Yoon J, Doyle A, Lander G, et al (2004) Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol 135: 745–755
Blaschke C, Valencia A (2001) The potential use of SUISEKI as a protein interaction discovery tool. Genome Inform Ser Workshop Genome Inform 12: 123–134
Chiang JH, Yu HC, Hsu HJ (2004) GIS: a biomedical text-mining system for gene information discovery. Bioinformatics 20: 120–121
de Bruijn B, Martin J (2002) Getting to the (c)ore of knowledge: mining biomedical literature. Int J Med Inform 67: 7–18
Dewick MP (2002) Alkaloids. In Medicinal Natural Products: A Biosynthetic Approach, Ed 2. John Wiley & Sons, Sussex, UK, pp 291–403
Dickman S (2003) Tough mining: the challenges of searching the scientific literature. PLoS Biol 1: 144–147
Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, Tuekam B, Zhang S, Baskin B, Bader GD, Michalickova K, et al (2003) PreBIND and Textomy—mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 4: 11
Fahn A (1990) Plant Anatomy, Ed 4. Pergamon Press, New York
Gomez-Mena C, De Folter S, Costa MM, Angenent GC, Sablowski R (2005) Transcriptional program controlled by the floral homeotic gene AGAMOUS during early organogenesis. Development 132: 429–438
Grivell L (2002) Mining the bibliome: searching for a needle in a haystack? New computing tools are needed to effectively scan the growing amount of scientific literature for useful information. EMBO Rep 3: 200–203
Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, et al (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32: 258–261
Jenssen TK, Laegreid A, Komorowski J, Hovig E (2001) A literature network of human genes for high-throughput analysis of gene expression. Nat Genet 28: 21–28
Kanehisa M (1997) A database for post-genome analysis. Trends Genet 13: 375–376
Kanehisa M, Goto S (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28: 27–30
Kelso J, Visagie J, Theiler G, Christoffels A, Bardien S, Smedley D, Otgaar D, Greyling G, Jongeneel CV, McCarthy MI, et al (2003) eVOC: a controlled vocabulary for unifying gene expression data. Genome Res 13: 1222–1230
Kim SK, Lund J, Kiraly M, Duke K, Jiang M, Stuart JM, Eizinger A, Wylie BN, Davidson GS (2001) A gene expression map for Caenorhabditis elegans. Science 293: 2087–2092
Kitano H (2002) Systems biology: a brief overview. Science 295: 1662–1664
Maruyama K, Sakuma Y, Kasuga M, Ito Y, Seki M, Goda H, Shimada Y, Yoshida S, Shinozaki K, Shinozaki KY (2004) Identification of cold-inducible downstream genes of the Arabidopsis DREB1A/CBF3 transcriptional factor using two microarray systems. Plant J 38: 982–993
Masys DR, Welsh JB, Fink JL, Gribskov M, Klacansky I, Corbeil J (2001) Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics 7: 319–326
Mauseth JD (1988) Plant Anatomy. Benjamin/Cummings, Menlo Park, CA
Mueller LA, Zhang P, Rhee SY (2003) AraCyc: a biochemical pathway database for Arabidopsis. Plant Physiol 132: 453–460
Muller HM, Kenny EE, Sternberg PW (2004) Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2: 1984–1998
Nagaki K, Cheng Z, Ouyang S (2004) Sequencing of a rice centromere uncovers active genes. Nat Genet 36: 138–145
Pan H, Zuo L, Choudhary V, Zhang Z, Leow SH, Chong FT, Huang Y, Ong VW, Mohanty B, Tan SL, et al (2004) Dragon TF Association Miner: a system for exploring transcription factor associations through text-mining. Nucleic Acids Res 1: 230–234
Perez-Iratxeta C, Bork P, Andrade MA (2002) Association of genes to genetically inherited diseases using data mining. Nat Genet 3: 316–319
Perez-Iratxeta C, Perez AJ, Bork P, Andrade MA (2003) Update on XplorMed: a web server for exploring scientific literature. Nucleic Acids Res 31: 3866–3868
Pharkya P, Nikolaev EV, Maranas CD (2003) Review of the BRENDA Database. Metab Eng 5: 71–73
Rensink WA, Buell CR (2004) Arabidopsis to rice. Applying knowledge from a weed to enhance our understanding of a crop species. Plant Physiol 135: 622–629
Schulze-Kremer S (2002) Ontologies for molecular biology and bioinformatics. In Silico Biol 2: 179–193
Tanabe L, Scherf U, Smith LH, Lee JK, Hunter L, Weinstein JN (1999) MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques 27: 1210–1214, 1216–1217
Yu H, Ito T, Zhao Y, Peng J, Kumar P, Meyerowitz EM (2004) Floral homeotic genes are targets of gibberellin signaling in flower development. Proc Natl Acad Sci USA 101: 7827–7832
