PLOS ONE. 2011 Mar 8;6(3):e17844. doi: 10.1371/journal.pone.0017844

TargetMine, an Integrated Data Warehouse for Candidate Gene Prioritisation and Target Discovery

Yi-An Chen 1,2,#, Lokesh P Tripathi 1,#, Kenji Mizuguchi 1,2,*
Editor: Vladimir Uversky
PMCID: PMC3050930  PMID: 21408081

Abstract

Prioritising candidate genes for further experimental characterisation is a non-trivial challenge in drug discovery and biomedical research in general. An integrated approach that combines results from multiple data types is best suited for optimal target selection. We developed TargetMine, a data warehouse for efficient target prioritisation. TargetMine utilises the InterMine framework, with new data models such as protein-DNA interactions integrated in a novel way. It enables complicated searches that are difficult to perform with existing tools and it also offers integration of custom annotations and in-house experimental data. We proposed an objective protocol for target prioritisation using TargetMine and set up a benchmarking procedure to evaluate its performance. The results show that the protocol can identify known disease-associated genes with high precision and coverage. A demonstration version of TargetMine is available at http://targetmine.nibio.go.jp/.

Introduction

Advances in biomolecular research, coupled with rapidly increasing availability of information from multiple genome sequencing initiatives, global gene expression patterns, large scale molecular interaction experiments and genome wide association studies, have led to an exponential increase in biological data. The explosion of data, accompanied by a plethora of theoretical tools for predicting gene function, has created an information overload. The immense challenges in separating the biological wheat from the chaff have necessitated the development of a variety of analytical tools and databases to store and manage biological data and retrieve meaningful information to facilitate further experimental characterisation.

The biological role of a gene or a protein is defined not only by its sequence and structure but also by when and where it is expressed and by its interactions with other biomolecules (such as proteins, nucleic acids and metabolites). In the post-genomic era, attempts at function annotation increasingly employ data from different types of repositories. Biological data from a single type of data source, though useful, are often limited in the extent to which they may help uncover functional associations, either because of a systematic bias towards specific genes, gene families and pathways or because of the inclusion of erroneous entries during data acquisition (or both). With the focus shifting from genes and proteins to biological systems, integrating information from multiple data types is a more robust and accurate means of enhancing existing interpretations and unravelling new functional associations, as demonstrated in several studies [1], [2].

However, biological data integration is a formidable task. Different computational tools and data sources often employ different approaches and formats for inputting, storing and retrieving relevant information, which may result in appreciable differences in data quality. This heterogeneity often restricts compatibility between different resources and limits the extent and efficiency of combined analysis. Furthermore, investigation of diverse data types necessitates a flexible, uniform and simplified interface to query, retrieve and analyse data across diverse sources. Despite these hurdles, the immense potential benefits of a combined investigative approach have spawned several initiatives towards integrated data repositories [3], [4], [5], [6], [7]. Among these, of particular interest are data warehouses, which compile all the relevant information on a common platform [6], [8], [9], [10], [11], [12], [13], [14]. A data warehouse is particularly desirable, since it permits a wide range of queries based on diverse attributes (including genes, proteins, families, pathways, ontologies, diseases and expression profiles), produces unified output and offers flexibility in selecting the type and the order of the data sources. InterMine is a multi-purpose data warehouse framework (http://www.intermine.org/), originally developed for FlyMine, an integrated database for Drosophila and Anopheles genomics [13]. It features a sequence ontology-based data model and a user-friendly web interface permitting the end users either to design flexible and complex database queries or to choose from a library of ‘templates’ consisting of predefined queries with a simple form and description [13]. In addition, InterMine provides default parsers for integrating data from several resources, together with a framework for incorporating customised parsers and data sources.
The flexibility in designing queries and integrating diverse data types provides a powerful tool for the researchers. In addition to FlyMine, InterMine also powers modEncode (http://intermine.modencode.org/), RatMine (http://ratmine.mcw.edu/ratmine/begin.do), YeastMine (http://yeastmine.yeastgenome.org:8080/yeastmine/begin.do) and MetabolicMine (http://www.metabolicmine.org/).

Identification of suitable targets (such as genes, proteins, non-peptide gene products and pathways) for characterisation is one of the most critical steps in biology, particularly in annotating gene function, drug discovery and understanding molecular bases of diseases. An integrated approach that combines results from multiple data types is best suited for optimal target discovery [15], [16]. The distinct merits of the InterMine framework have inspired us to develop TargetMine, an integrated resource for retrieval of target genes and proteins for experimental characterisation and drug discovery. In this paper, we describe the data sources available in the present release of TargetMine and their access and query capability. We also outline an objective protocol for target prioritisation with TargetMine that relies on the integration of diverse data types. Gene prioritisation refers to the selection of most interesting or promising genes from a larger set of genes for further analysis [17], [18]. Experimental evaluation of large gene lists to identify suitable candidates is a formidable and often impossible task and therefore, computational tools for candidate gene prioritisation have emerged over the years. These tools variously rely on functional associations, protein-protein interactions, gene expression data, sequence and structure properties or combinations thereof to select candidate genes [16], [17], [18], [19], [20], [21], [22], [23], [24], [25]. TargetMine was designed specifically for target prioritisation within the framework of a data warehouse and our prioritisation protocol, though less sophisticated than some standalone tools, is easier to use and provides flexibility in the choice of data sources that may be employed for analysis of query gene sets. Finally, we discuss the possibilities of future implementations in the TargetMine data warehouse to provide maximum coverage of the biological target space.

Results and Discussion

Data sources and Data models

A detailed description of the InterMine system is available elsewhere [13]. Here we restrict ourselves to a brief overview of the InterMine data organisation. InterMine is an open source data warehouse framework. Each entry in the system (such as a gene or a protein) is considered an ‘object’. The InterMine object-based data model consists of ‘classes’ and reflects the relationships between different data types. Each class contains objects that share similar properties and a set of ‘attributes’ that correspond to various types of information (such as gene symbol and gene/protein identifier) associated with each object of that class. The classes are linked with each other by references that specify the associations between objects in different classes. The InterMine data structure readily allows the navigation of the stored biological data via the relationships between different data types, facilitated by an inbuilt tool termed ‘query builder’. The query builder tool permits the users to select and constrain the data types for the desired output. The list function enables the query process to be performed with a user-supplied list of objects and the resulting lists to be exported as either comma separated (csv) or tab separated values (tsv). It also permits the user to convert genes/proteins from one species to another based on KEGG orthology associations. The InterMine Web Service allows the users to query TargetMine from their own web pages and applications.
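The object, class, attribute and reference vocabulary described above can be illustrated with a minimal Python sketch. This is not InterMine code: the class names, field names and the helper `proteins_for_symbols` are hypothetical and only mimic how a constrained query follows a reference from one class to another.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Protein:
    # Attributes: simple values attached to every object of the class.
    primary_accession: str
    name: str


@dataclass
class Gene:
    symbol: str
    ncbi_gene_id: str
    # Reference: associates a Gene object with Protein objects.
    proteins: List[Protein] = field(default_factory=list)


def proteins_for_symbols(genes: List[Gene], symbols: List[str]) -> List[str]:
    """Constrain genes by symbol, then follow the Gene -> Protein reference."""
    wanted = set(symbols)
    return [p.primary_accession
            for g in genes if g.symbol in wanted
            for p in g.proteins]


# Two illustrative objects (hypothetical in-memory data, not a TargetMine dump).
genes = [
    Gene("TP53", "7157", [Protein("P04637", "Cellular tumor antigen p53")]),
    Gene("EGFR", "1956", [Protein("P00533", "Epidermal growth factor receptor")]),
]

print(proteins_for_symbols(genes, ["TP53"]))
```

In this sketch the symbol list plays the role of a user-supplied list, and the returned accessions correspond to the columns selected for output.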

In addition to the existing InterMine classes, we have customised the InterMine data model and created new classes to collate the biological data types most likely to facilitate target discovery (Table S1). We discuss some of these implementations below. At present, the biological data in TargetMine are for the most part limited to human, rat, mouse and fruit fly, the best studied model organisms in biology. The data sources compiled in TargetMine are summarised in Table 1.

Table 1. List of data sources in TargetMine.

Data | Organism* | Source
InterMine default
Genome annotation | H, R, M, F | Entrez Gene
Protein annotation | H, R, M, F | UniProtKB
Protein domain | H, R, M | InterPro
Pathways | H, R, M, F | KEGG Pathway
Gene-gene interactions | H, R, M, F | BioGRID
GO annotation and the Gene Ontology | H, R, M | Gene Ontology, UniProtKB GOA
Data sources newly incorporated in TargetMine
Protein 3D structure | Entire dataset | PDBe SIFTS
Protein-protein interactions | H | PPIview
Protein domain annotations | H, R, M | IPI
Structural classification | Entire dataset | SCOP
Orthologues / Paralogues | H, R, M, F, E | KEGG Orthology
Transcription factor | H | OregAnno, AMADEUS
Enzyme | H, R, M, F | The ENZYME database
Drug | H | DrugBank
Disease | H | OMIM
Disease Ontology and DO annotation | H | Disease Ontology, BMC Genomics 10 Suppl 1:S6
Disease Ontology and DO annotation H Disease Ontology, BMC Genomics 10 Suppl 1:S6

*H: human, R: rat, M: mouse, F: fruit fly (Drosophila), E: E. coli.

OMIM data are presently not distributed with the TargetMine demonstration version.

Protein structures and domains

Structural data for biological macromolecules, especially proteins, have been extremely important in explaining their molecular and biochemical functions, evolutionary relationships and understanding their explicit biological roles [26]. It is well recognised that complementing protein sequence information with structural data is a robust approach towards more accurate protein function annotation [27] and hence, more reliable target discovery. However, integrating protein sequence and structural information from different sources remains a non-trivial task. In recognition of the obvious benefits of an integrated protein sequence-structure repository, we customised and embellished the default InterMine data model to combine protein sequence information from the UniProt database [28] with protein structure information from the Protein Data Bank (PDB) [29] and structural classification based on evolutionary relationships in the Structural Classification of Proteins (SCOP) database [30]. With our customised data model, the user can easily query for PDB structures cross-referenced (if available) with the protein of interest in the UniProt repository and other databases such as DrugBank [31] (e.g., “Show all the protein structures that contain the targets, as defined in DrugBank, of a given set of drugs” or “Given a list of proteins, show all the approved drugs solved in complex with any structure of these proteins if present”). The user can also retrieve disease associations, pathway associations and potential protein-drug associations, based on ligands associated with the protein structures, for the protein of interest (e.g., “Show all the PDB entries that contain a given drug”).

Different data sources use different numbering systems for specifying protein regions. To associate protein sequences (in the Protein class) with protein structures (in the ProteinStructure class), we introduced two new classes (ProteinStructureRegion and PDBRegion; Figure 1). We also introduced the ProteinDomainRegion class to link the Protein class to the Protein domain class that stores InterPro [32] domain annotations. The PDB-UniProt mapping was taken from SIFTS [33] and InterPro domain assignments from IPI [34]. The integration facilitated querying detailed domain and structural assignments; for example, the user can query regions of a protein, for which structural information is available, and then retrieve domain annotations falling within these regions.
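The example query at the end of the preceding paragraph (find the regions of a protein covered by a structure, then retrieve the domain annotations falling within them) reduces to an interval containment test once both kinds of region are expressed in protein sequence coordinates. The following is a minimal sketch under that assumption; the `Region` type, the labels and the coordinates are hypothetical and do not reproduce the TargetMine classes.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    label: str  # hypothetical identifier for a structure or a domain
    start: int  # 1-based, inclusive, in protein sequence coordinates
    end: int    # inclusive


def domains_within(structure: Region, domains: List[Region]) -> List[str]:
    """Return labels of domains lying entirely inside the structural region."""
    return [d.label for d in domains
            if structure.start <= d.start and d.end <= structure.end]


# Hypothetical regions on one protein sequence.
structure_region = Region("structure_A", 90, 300)
domain_regions = [
    Region("domain_1", 100, 250),  # fully covered by the structure
    Region("domain_2", 280, 350),  # extends past the structure
    Region("domain_3", 10, 80),    # lies outside the structure
]

print(domains_within(structure_region, domain_regions))
```

A looser criterion (any overlap rather than full containment) would only require changing the comparison to `d.start <= structure.end and structure.start <= d.end`.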

Figure 1. Schema for selected examples of newly created data sources in TargetMine.

The data model is depicted as a class diagram in the Unified Modeling Language (http://www.uml.org). Some details of the model are ignored to reduce the complexity of the diagram.

Transcription factors

Transcription factors (TFs) are proteins that bind to specific DNA sequences, thereby regulating the expression (transcription) of their target genes [35]. TFs are of immense significance in biomedical investigations and some TFs such as nuclear receptors are important drug targets [36], [37]. In view of the significance of these protein-DNA interactions to cellular physiology, we modified the existing InterMine Interaction class, which describes gene-gene interactions, to define a new class named ProteinDNAInteraction. The ProteinDNAInteraction class contains specific attributes that reflect the unique aspects of protein-DNA interactions, such as protein (TF) binding sites in the regulatory regions of the target genes. These data were retrieved from AMADEUS [38] and OregAnno [39] resources and from assorted literature sources. Since different resources adopt different approaches to compiling protein-DNA interaction information, the combined source data were manually processed to uniformly assign Entrez gene identifiers to each participating gene and remove redundancies prior to the incorporation into TargetMine. The integration enabled us to make a complicated query such as: “Given a list of genes, retrieve all the TF-target relations observed within the list”.
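The query quoted above ("Given a list of genes, retrieve all the TF-target relations observed within the list") amounts to keeping only those interactions in which both the TF and its target belong to the list. A minimal sketch follows, assuming the interactions are available as (TF, target) pairs of gene identifiers; the function name and the identifiers are hypothetical.

```python
from typing import Iterable, List, Tuple


def tf_target_relations_within(
    gene_ids: Iterable[str],
    interactions: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Keep (TF, target) pairs whose two members both occur in gene_ids."""
    members = set(gene_ids)
    kept = {(tf, target) for tf, target in interactions
            if tf in members and target in members}
    return sorted(kept)  # sorted and de-duplicated for stable output


# Hypothetical protein-DNA interactions, written as (TF gene, target gene).
interactions = [
    ("tf1", "geneA"),
    ("tf1", "geneB"),
    ("tf2", "geneA"),
    ("tf1", "geneA"),  # redundant entry, removed by the set
]
gene_list = ["tf1", "geneA", "geneC"]

print(tf_target_relations_within(gene_list, interactions))
```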

Other data classes

For disease and phenotype association, we created new classes and data parsers to retrieve the data from the OMIM database [40] and human genome disease annotations [41]. Enzymes play key roles in many biological processes and are attractive candidates for experimental investigation aimed at understanding cellular processes and diseases and at identifying suitable drug targets. We designed a new Enzyme class (linked to the Protein class) to gather all information on enzymes as curated in the Enzyme database [42]. The Enzyme class was also directly linked to the Pathway class by parsing the KEGG [43] mapping files, thereby providing links to their potential roles in cellular processes. Most genes and proteins function in association with other proteins and thus, the study of protein-protein interactions (PPIs) is critical to understanding their roles in living systems. In addition to the default InterMine Interaction class that was employed for storing biomolecular interactions from the BioGRID database [44], we designed a new ProteinInteraction class to collate all interactions curated in PPIview, an integrated repository of human PPIs [45]. This integration facilitated the querying of interacting partners of a gene/protein or a list of genes/proteins of interest and the inference of overall interaction networks involving these genes/proteins.

In addition, to expand the information space for sparsely annotated genes and proteins, we provided a framework for including in silico annotations derived from selected protein prediction tools (FUGUE [46], Protein-DNA binding propensity [47] and Protein-protein interaction sites [48]) and for including experimental data from in-house research.

Target prioritisation and benchmarking

Our general protocol for target prioritisation using TargetMine is shown in Figure 2. First, we upload a list of initial candidate genes or proteins (e.g., a set of differentially expressed genes or a set of proteins that interact with a given protein) to TargetMine to create a TargetMine gene list. Enrichment of specific biological themes (including, but not limited to, KEGG pathways, Gene Ontology (GO) terms [49] and OMIM phenotypes) associated with the initial list is estimated using the hypergeometric distribution, and the inferred p-values are adjusted for multiple testing to control the false discovery rate using the Benjamini and Hochberg procedure [50]. The significantly enriched biological associations (those that satisfied, in this instance, a condition of p≤0.05 after the Benjamini and Hochberg correction) can be visualised in the individual enrichment widgets. We gather the genes mapped to the top N significant associations (where N = 1, 2, 3, …, an adjustable value reflecting incrementally relaxed thresholds) retrieved from the KEGG (A), GO Biological Process (B) and OMIM (C) databases into separate lists and merge them (for example, by taking the union A∪B∪C of the retrieved genes) to infer the corresponding sets of prioritised genes, although no ranking is provided at present. (We assume that an initial candidate list is from a single species and the enrichment calculation is performed using the data for this species only.)
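The protocol can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the TargetMine implementation: the function names, the toy annotation sources and the background size are hypothetical, only themes sharing at least one gene with the candidate list are tested, and the upper tail of the hypergeometric distribution is used as the enrichment p-value.

```python
from math import comb
from typing import Dict, List, Set


def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail probability P(X >= k): n genes drawn from a background of N,
    of which K carry the annotation and k of the drawn genes carry it."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def top_n_genes(candidates: Set[str], annotations: Dict[str, Set[str]],
                background_size: int, top_n: int, alpha: float = 0.05) -> Set[str]:
    """Candidate genes mapped to the top_n significantly enriched themes."""
    themes = [t for t, members in annotations.items() if members & candidates]
    pvals = [hypergeom_pvalue(len(annotations[t] & candidates), len(candidates),
                              len(annotations[t]), background_size)
             for t in themes]
    adjusted = benjamini_hochberg(pvals)
    significant = sorted((q, t) for q, t in zip(adjusted, themes) if q <= alpha)
    selected: Set[str] = set()
    for _, theme in significant[:top_n]:
        selected |= annotations[theme] & candidates
    return selected


# Hypothetical candidate list and two toy annotation sources.
candidates = {"g1", "g2", "g3", "g4", "g5", "g6", "g7"}
source_a = {
    "theme_X": {"g1", "g2", "g3", "g4"} | {f"x{i}" for i in range(6)},
    "theme_Y": {"g5"} | {f"y{i}" for i in range(199)},
}
source_b = {"theme_Z": {"g5", "g6"} | {f"z{i}" for i in range(3)}}

list_a = top_n_genes(candidates, source_a, background_size=1000, top_n=7)
list_b = top_n_genes(candidates, source_b, background_size=1000, top_n=7)
prioritised = list_a | list_b  # union of the per-source lists
print(sorted(prioritised))
```

In this toy case the large, weakly overlapping theme_Y is not significant, and g7, which maps to no enriched theme, is left out of the prioritised set.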

Figure 2. A schematic representation of the suggested objective protocol for candidate gene prioritisation with TargetMine.

To evaluate the effectiveness of TargetMine in identifying suitable targets for further characterisation, we performed target gene prioritisation tests (as described above) on 19 sets of known disease-associated genes compiled from the literature [51] (Table 2 and Figures 3 and 4; see Materials and Methods for details). In all instances, our prioritisation approach was supported by high sensitivity and precision values, and enforcing a threshold of collecting only the genes mapped to the top seven associations (those that satisfied a p-value cutoff of p≤0.05 after a multiple test correction with the Benjamini and Hochberg procedure) was by and large most suited to ensuring maximum coverage and minimum over-prediction (Table S2). Although the number of false positives for cirrhosis and cervical carcinoma was slightly larger than for the other diseases, the sensitivity and precision remained high.

Table 2. Benchmarking results for 19 sets of known disease-associated genes using top seven significant associations as the threshold.

Disease | Sensitivity | Precision | F-score
Atherosclerosis (athe) | 0.786±0.04 | 0.848±0.06 | 0.814±0.03
Autism (auti) | 0.824±0.02 | 0.839±0.07 | 0.830±0.03
Cervical carcinoma (cerv) | 0.779±0.03 | 0.834±0.03 | 0.805±0.03
Cirrhosis (cirr) | 0.850±0.02 | 0.848±0.05 | 0.848±0.02
Endometrial carcinoma (enca) | 0.770±0.02 | 0.903±0.06 | 0.829±0.02
Endometriosis (endo) | 0.621±0.07 | 0.897±0.07 | 0.729±0.03
Epilepsy (epil) | 0.744±0.02 | 0.777±0.08 | 0.759±0.03
Graves' disease (grav) | 0.803±0.01 | 0.934±0.04 | 0.863±0.02
Hypercholesterolaemia (hycl) | 0.875±0.00 | 0.893±0.03 | 0.884±0.02
Inflammatory bowel disease (inbd) | 0.897±0.04 | 0.838±0.08 | 0.863±0.04
Ischaemic stroke (isch) | 0.909±0.04 | 0.845±0.08 | 0.874±0.04
Lymphoma (lymp) | 0.636±0.10 | 0.788±0.06 | 0.697±0.05
Migraine (migr) | 0.712±0.03 | 0.812±0.10 | 0.755±0.03
Myocardial ischemia (myis) | 0.803±0.02 | 0.889±0.06 | 0.842±0.03
Neural tube defects (neur) | 0.682±0.03 | 0.817±0.07 | 0.742±0.03
Osteoarthritis (oste) | 0.822±0.02 | 0.870±0.05 | 0.844±0.02
Pancreatitis (panc) | 0.923±0.05 | 0.874±0.07 | 0.895±0.03
Systemic scleroderma (sysc) | 0.826±0.03 | 0.818±0.06 | 0.821±0.03
Ulcerative colitis (ulco) | 0.856±0.02 | 0.831±0.08 | 0.841±0.04

Figure 3. Outline of the procedure for benchmarking candidate gene prioritisation on 19 sets of known disease-associated genes with TargetMine.

TP- True positive, FP- False positive (see text for details).

Figure 4. Benchmarking results for 19 sets of known disease-associated genes.

(The full disease names and their abbreviations are listed in Table 2.) Each line represents the F-score for a particular disease data set as a function of the threshold (the top N significant associations considered). The error bars show the standard deviation across ten benchmarking evaluations for each disease.

We have repeated the tests by changing the proportion of known curated genes in an input gene list (from one third to one tenth). Although both sensitivity and precision decreased slightly, reasonable performance was maintained with a cutoff of six (Table S3), suggesting that the method still works for situations where only one tenth of input genes are disease-associated. We have also evaluated the results from a method using only a single data source. By taking the union of the collected genes from KEGG, GO and OMIM, the performance in most cases increased by about 0.1 points (measured by the F-score; see Materials and Methods), demonstrating the usefulness of the integration.

These results showed that the integration of diverse biological properties in TargetMine was a successful approach towards the identification of candidate genes for further investigation. Besides, the operation in TargetMine is semi-automatically accomplished by a few mouse clicks instead of preparing specific data files and running external software. The TargetMine data model permits retrieval of stored data and its analysis in a single interface and thus aids in efficient prioritisation. The ease of accomplishing such analysis via a simple web interface further underscores the utility of TargetMine as an effective tool in investigation of genes and genomes. In our benchmark tests, we chose KEGG, GO Biological Process and OMIM as the best sources for highlighting the functional associations of groups of genes but TargetMine also provides enrichment widgets for GO Molecular Function and Cellular Component, Drug and Disease Ontology (DO) associations, which may be used to assist in selecting candidate genes. The user may also employ TF-target associations to identify common regulatory themes that may be associated with a set of co-expressed functionally similar genes.

Comparisons with other databases

As a data warehouse, TargetMine is not an alternative to large public databases (such as UniProt [28]) but rather, it is designed for use in individual laboratories in academia and industry. In comparison to existing integrated databases, TargetMine provides an alternative usage that aims to rapidly and efficiently retrieve varied biological information for large gene sets in a simplified manner. Most integrated databases are able to retrieve different biological properties, but are largely designed for simple queries for a single gene. Though some may provide facilities for batch query, the users in many instances need to employ external scripts for querying and post-processing the relevant data. In contrast, TargetMine provides a simple interface for batch query with numerous templates and the facility to construct complicated queries. The output options permit user-defined displays on the type and the order of different annotations. Besides, the enrichment widgets, as described above, provide a quick preliminary analysis of the genes in the list and thus, greatly help in understanding the enriched themes associated with query sets and also help complement the analysis performed by specialised gene prioritisation tools. Therefore, TargetMine facilitates biological data gathering and data analysis in a single user-friendly interface.

Although some commercial resources such as Ingenuity® (Redwood City, California) and MetaCore™ (GeneGo, St. Joseph, MI) provide more interaction and/or pathway data plus tools for statistical data analysis, they largely emphasise the collation of gene annotations and mostly lack protein-level annotations such as domains and structures. Additionally, several data types available in TargetMine, such as protein-DNA interactions, are to the best of our knowledge not offered by other publicly available resources, some of which, including GeneDistiller [52] and PolySearch [53], can perform tasks similar to TargetMine's. However, the key difference is TargetMine's flexibility and its built-in prioritisation protocol; the data size and data types are readily customisable in TargetMine, providing a more flexible and comprehensive framework for target discovery.

TargetMine employs an “unsupervised” protocol for prioritisation, as opposed to most other comparable tools such as ToppGene [21] and Endeavour [20], which are “supervised” learning methods. Thus, while direct comparison with these other tools is difficult (and our data warehouse will complement, not replace, stand-alone tools), the preliminary results above suggest that TargetMine is well suited for target prioritisation. In our group, we have been using TargetMine for analysing a diverse array of experimental data and we have verified experimentally that some of the prioritised genes have been associated with the disease of interest [54].

Future developments

TargetMine is structured to accommodate increasingly available biological data from large-scale experiments. Inclusion of new data sources would enhance the repertoire of functional associations currently available in TargetMine and at the same time expand the coverage to newer systems relevant to candidate gene prioritisation and drug discovery. We plan to add new data including host-pathogen interactions, specific gene and protein expression patterns, relationships between potential targets and chemical compounds and/or moieties, protein-compound interactions and single nucleotide polymorphisms (SNPs). We aim to supplement the newer data sources with further developments in the TargetMine web interface, lists, templates and tools for data visualisation (such as novel widgets) and analysis.

Conclusion

TargetMine is an integrated data warehouse that enables complicated searches that are difficult to perform using existing comparable tools and therefore, assists in efficient target prioritisation. The benchmarking results for our proposed protocol for target gene prioritisation suggested the effectiveness of TargetMine in target discovery. The flexibility in TargetMine structure ensures that different types of biological data can be readily added and analysed to generate new hypotheses for further investigation. The inclusion of additional data sources and analytical tools will greatly enhance the ability of TargetMine to investigate biological systems for better target discovery.

Materials and Methods

InterMine was downloaded from http://www.intermine.org. New parsers were written in Java and integrated into the InterMine code base. A list of URLs for the individual data sources can be found in Table S4. Part of the OMIM data, not available in downloadable files, was retrieved from the online resource using custom Perl scripts, and TF-target associations were manually processed prior to integration into TargetMine.

To benchmark our gene prioritisation protocol, we performed target gene prioritisation on 19 sets of known disease-associated genes (denoted by set x) compiled from the literature [51]. We first created test datasets (set y), where each curated gene set was merged with twice its number of unrelated randomly selected human genes (set r) to incorporate background “noise”. To avoid any bias incurred due to the selection of random genes, the process was repeated 10 times to infer 10 test gene sets for each curated gene list. The prioritisation tests (Figures 2 and 3) were then performed for each test gene set. We gathered the genes mapped to up to the top 10 associations retrieved from the KEGG, GO and OMIM databases to infer prioritised genes (set z). These were then compared with the curated gene sets (x∩z) and the efficiency of the prioritisation procedure was estimated with sensitivity and precision measures (Table S2). The True Positives (TP) in z were defined as genes present in x, while those corresponding to r were defined as False Positives (FP). The False Negatives (FN) were those genes corresponding to x that were not included in z at the specified threshold, while the True Negatives (TN) were genes corresponding to r correctly left out from the list of prioritised genes at a given threshold. Sensitivity, measuring the proportion of the known disease-associated genes that were correctly prioritised, was defined as TP/(TP+FN) and precision, measuring the proportion of the prioritised genes that were known disease-associated genes, was defined as TP/(TP+FP). The performance of the prioritisation protocol was also assessed using the F-score defined as 2(precision×sensitivity)/(precision+sensitivity) [55], [56].
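The set definitions above translate directly into code. The sketch below is illustrative only: the function names, gene identifiers and random seed are hypothetical, and the prioritised set z is written by hand rather than produced by an enrichment step.

```python
import random
from typing import Dict, Iterable, Set


def make_noise_set(curated: Set[str], background: Iterable[str],
                   rng: random.Random) -> Set[str]:
    """Draw twice as many unrelated background genes as curated genes (set r)."""
    pool = sorted(set(background) - curated)
    return set(rng.sample(pool, 2 * len(curated)))


def evaluate(x: Set[str], r: Set[str], z: Set[str]) -> Dict[str, float]:
    """Sensitivity, precision and F-score of prioritised set z against x and r."""
    tp = len(z & x)  # curated genes that were prioritised
    fp = len(z & r)  # random genes that were prioritised
    fn = len(x - z)  # curated genes that were missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f_score = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "precision": precision, "f_score": f_score}


# Hypothetical data: four curated genes and a background of 100 other genes.
x = {"a", "b", "c", "d"}
background = [f"bg{i}" for i in range(100)]
r = make_noise_set(x, background, random.Random(0))
y = x | r  # test gene set submitted to the prioritisation protocol

# Hand-written stand-in for the protocol output: three curated genes, two noise genes.
z = {"a", "b", "c"} | set(sorted(r)[:2])
scores = evaluate(x, r, z)
print(scores)
```

Repeating the draw of r with ten different seeds and averaging the scores would mirror the ten test gene sets per disease described in the text.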

Supporting Information

Table S1

A full list of newly defined classes in TargetMine.

(XLS)

Table S2

Detailed benchmarking results for candidate gene prioritisation with TargetMine using 19 sets of known disease-associated genes.

(XLS)

Table S3

Detailed benchmarking results for candidate gene prioritisation with TargetMine using 19 sets of known disease-associated genes with increased background noise.

(XLS)

Table S4

A list of URLs for the individual data sources in TargetMine.

(XLS)

Acknowledgments

We thank Mitsubishi Space Software Co., Ltd. for technical support. We gratefully acknowledge Dr. Tadashi Imanishi of Biomedicinal Information Research Centre (AIST) for providing us with the PPIview interactions and permission for publishing the data.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: This study was supported by the Industrial Technology Research Grant Program in 2007 from New Energy and Industrial Technology Development Organization (NEDO) of Japan (awarded to KM; project ID: 07C46056a, http://www.nedo.go.jp/). YAC is supported by Interchange Association, Japan (IAJ; http://www.koryu.or.jp/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Ge H, Walhout AJ, Vidal M. Integrating ‘omic’ information: a bridge between genomics and systems biology. Trends Genet. 2003;19:551–560. doi: 10.1016/j.tig.2003.08.009. [DOI] [PubMed] [Google Scholar]
  • 2.Gerstein M, Lan N, Jansen R. Proteomics. Integrating interactomes. Science. 2002;295:284–287. doi: 10.1126/science.1068664. [DOI] [PubMed] [Google Scholar]
  • 3.Burgun A, Bodenreider O. Accessing and integrating data and knowledge for biomedical research. Yearb Med Inform. 2008:91–101. [PMC free article] [PubMed] [Google Scholar]
  • 4.Chen LS, Emmert-Streib F, Storey JD. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biol. 2007;8:R219. doi: 10.1186/gb-2007-8-10-r219. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Garcia Castro A, Chen YP, Ragan MA. Information integration in molecular bioscience. Appl Bioinformatics. 2005;4:157–173. doi: 10.2165/00822942-200504030-00001. [DOI] [PubMed] [Google Scholar]
  • 6.Stein LD. Integrating biological databases. Nat Rev Genet. 2003;4:337–345. doi: 10.1038/nrg1065. [DOI] [PubMed] [Google Scholar]
  • 7.Wong L. Technologies for integrating biological data. Brief Bioinform. 2002;3:389–404. doi: 10.1093/bib/3.4.389. [DOI] [PubMed] [Google Scholar]
  • 8.Birkland A, Yona G. BIOZON: a system for unification, management and analysis of heterogeneous biological data. BMC Bioinformatics. 2006;7:70. doi: 10.1186/1471-2105-7-70. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Cornell M, Paton NW, Hedeler C, Kirby P, Delneri D, et al. GIMS: an integrated data storage and analysis environment for genomic and functional data. Yeast. 2003;20:1291–1306. doi: 10.1002/yea.1047. [DOI] [PubMed] [Google Scholar]
  • 10. Helfrich JP. Raw data to knowledge warehouse in proteomic-based drug discovery: a scientific data management issue. Biotechniques. 2002;(Suppl):48–50, 52–53.
  • 11. Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, et al. EnsMart: a generic system for fast and flexible access to biological data. Genome Res. 2004;14:160–169. doi: 10.1101/gr.1645104.
  • 12. Lee TJ, Pouliot Y, Wagner V, Gupta P, Stringer-Calvert DW, et al. BioWarehouse: a bioinformatics database warehouse toolkit. BMC Bioinformatics. 2006;7:170. doi: 10.1186/1471-2105-7-170.
  • 13. Lyne R, Smith R, Rutherford K, Wakeling M, Varley A, et al. FlyMine: an integrated database for Drosophila and Anopheles genomics. Genome Biol. 2007;8:R129. doi: 10.1186/gb-2007-8-7-r129.
  • 14. Shah SP, Huang Y, Xu T, Yuen MM, Ling J, et al. Atlas - a data warehouse for integrative bioinformatics. BMC Bioinformatics. 2005;6:34. doi: 10.1186/1471-2105-6-34.
  • 15. Chen X, Jorgenson E, Cheung ST. New tools for functional genomic analysis. Drug Discov Today. 2009;14:754–760. doi: 10.1016/j.drudis.2009.05.005.
  • 16. Yang Y, Adelstein SJ, Kassis AI. Target discovery from data mining approaches. Drug Discov Today. 2009;14:147–154. doi: 10.1016/j.drudis.2008.12.005.
  • 17. Nitsch D, Goncalves JP, Ojeda F, de Moor B, Moreau Y. Candidate gene prioritization by network analysis of differential expression using machine learning approaches. BMC Bioinformatics. 2010;11:460. doi: 10.1186/1471-2105-11-460.
  • 18. Tranchevent LC, Capdevila FB, Nitsch D, De Moor B, De Causmaecker P, et al. A guide to web tools to prioritize candidate genes. Brief Bioinform. 2010. doi: 10.1093/bib/bbq007.
  • 19. Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS. SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics. 2006;22:773–774. doi: 10.1093/bioinformatics/btk031.
  • 20. Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, et al. Gene prioritization through genomic data fusion. Nat Biotechnol. 2006;24:537–544. doi: 10.1038/nbt1203.
  • 21. Chen J, Bardes EE, Aronow BJ, Jegga AG. ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 2009;37:W305–311. doi: 10.1093/nar/gkp427.
  • 22. Hutz JE, Kraja AT, McLeod HL, Province MA. CANDID: a flexible method for prioritizing candidate genes for complex human traits. Genet Epidemiol. 2008;32:779–790. doi: 10.1002/gepi.20346.
  • 23. Kohler S, Bauer S, Horn D, Robinson PN. Walking the interactome for prioritization of candidate disease genes. Am J Hum Genet. 2008;82:949–958. doi: 10.1016/j.ajhg.2008.02.013.
  • 24. Lage K, Karlberg EO, Storling ZM, Olason PI, Pedersen AG, et al. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol. 2007;25:309–316. doi: 10.1038/nbt1295.
  • 25. Perez-Iratxeta C, Wjst M, Bork P, Andrade MA. G2D: a tool for mining genes associated with disease. BMC Genet. 2005;6:45. doi: 10.1186/1471-2156-6-45.
  • 26. Joachimiak A. High-throughput crystallography for structural genomics. Curr Opin Struct Biol. 2009;19:573–584. doi: 10.1016/j.sbi.2009.08.002.
  • 27. Watson JD, Laskowski RA, Thornton JM. Predicting protein function from sequence and structural data. Curr Opin Struct Biol. 2005;15:275–284. doi: 10.1016/j.sbi.2005.04.003.
  • 28. The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010;38:D142–148. doi: 10.1093/nar/gkp846.
  • 29. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
  • 30. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995;247:536–540. doi: 10.1006/jmbi.1995.0159.
  • 31. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, et al. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008;36:D901–906. doi: 10.1093/nar/gkm958.
  • 32. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 2009;37:D211–215. doi: 10.1093/nar/gkn785.
  • 33. Velankar S, McNeil P, Mittard-Runte V, Suarez A, Barrell D, et al. E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Res. 2005;33:D262–265. doi: 10.1093/nar/gki058.
  • 34. Kersey PJ, Duarte J, Williams A, Karavidopoulou Y, Birney E, et al. The International Protein Index: an integrated database for proteomics experiments. Proteomics. 2004;4:1985–1988. doi: 10.1002/pmic.200300721.
  • 35. Latchman DS. Transcription factors: an overview. Int J Biochem Cell Biol. 1997;29:1305–1312. doi: 10.1016/s1357-2725(97)00085-x.
  • 36. Overington JP, Al-Lazikani B, Hopkins AL. How many drug targets are there? Nat Rev Drug Discov. 2006;5:993–996. doi: 10.1038/nrd2199.
  • 37. Nebert DW. Transcription factors and cancer: an overview. Toxicology. 2002;181–182:131–141. doi: 10.1016/s0300-483x(02)00269-x.
  • 38. Linhart C, Halperin Y, Shamir R. Transcription factor and microRNA motif discovery: the Amadeus platform and a compendium of metazoan target sets. Genome Res. 2008;18:1180–1189. doi: 10.1101/gr.076117.108.
  • 39. Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, et al. ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res. 2008;36:D107–113. doi: 10.1093/nar/gkm967.
  • 40. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD). Online Mendelian Inheritance in Man, OMIM (TM). 2010.
  • 41. Osborne JD, Flatow J, Holko M, Lin SM, Kibbe WA, et al. Annotating the human genome with Disease Ontology. BMC Genomics. 2009;10(Suppl 1):S6. doi: 10.1186/1471-2164-10-S1-S6.
  • 42. Bairoch A. The ENZYME database in 2000. Nucleic Acids Res. 2000;28:304–305. doi: 10.1093/nar/28.1.304.
  • 43. Aoki-Kinoshita KF, Kanehisa M. Gene annotation and pathway mapping in KEGG. Methods Mol Biol. 2007;396:71–91. doi: 10.1007/978-1-59745-515-2_6.
  • 44. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34:D535–539. doi: 10.1093/nar/gkj109.
  • 45. Yamasaki C, Murakami K, Fujii Y, Sato Y, Harada E, et al. The H-Invitational Database (H-InvDB), a comprehensive annotation resource for human genes and transcripts. Nucleic Acids Res. 2008;36:D793–799. doi: 10.1093/nar/gkm999.
  • 46. Shi J, Blundell TL, Mizuguchi K. FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol. 2001;310:243–257. doi: 10.1006/jmbi.2001.4762.
  • 47. Ahmad S, Gromiha MM, Sarai A. Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics. 2004;20:477–486. doi: 10.1093/bioinformatics/btg432.
  • 48. Murakami Y, Mizuguchi K. Applying the Naive Bayes classifier with kernel density estimation to the prediction of protein-protein interaction sites. Bioinformatics. 2010;26:1841–1848. doi: 10.1093/bioinformatics/btq302.
  • 49. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–29. doi: 10.1038/75556.
  • 50. Noble WS. How does multiple testing correction work? Nat Biotechnol. 2009;27:1135–1137. doi: 10.1038/nbt1209-1135.
  • 51. Chen J, Xu H, Aronow BJ, Jegga AG. Improved human disease candidate gene prioritization using mouse phenotype. BMC Bioinformatics. 2007;8:392. doi: 10.1186/1471-2105-8-392.
  • 52. Seelow D, Schwarz JM, Schuelke M. GeneDistiller–distilling candidate genes from linkage intervals. PLoS One. 2008;3:e3874. doi: 10.1371/journal.pone.0003874.
  • 53. Cheng D, Knox C, Young N, Stothard P, Damaraju S, et al. PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites. Nucleic Acids Res. 2008;36:W399–405. doi: 10.1093/nar/gkn296.
  • 54. Tripathi LP, Kataoka C, Taguwa S, Moriishi K, Mori Y, et al. Network based analysis of hepatitis C virus Core and NS4B protein interactions. Mol Biosyst. 2010;6:2539–2553. doi: 10.1039/c0mb00103a.
  • 55. Hripcsak G, Rothschild AS. Agreement, the f-measure, and reliability in information retrieval. J Am Med Inform Assoc. 2005;12:296–298. doi: 10.1197/jamia.M1733.
  • 56. van Rijsbergen CJ. Information retrieval. London: Butterworths; 1979.

Supplementary Materials

Table S1

A full list of newly defined classes in TargetMine.

(XLS)

Table S2

Detailed benchmarking results for candidate gene prioritisation with TargetMine using 19 sets of known disease-associated genes.

(XLS)

Table S3

Detailed benchmarking results for candidate gene prioritisation with TargetMine using 19 sets of known disease-associated genes with increased background noise.

(XLS)

Table S4

A list of URLs for the individual data sources in TargetMine.

(XLS)

