Abstract
BioProfiling.de provides a comprehensive analytical toolkit for the interpretation of gene/protein lists. As input, BioProfiling.de accepts a gene/protein list. As output, in one submission, the gene list is analyzed by a collection of tools that employ advanced enrichment or network-based statistical frameworks. The gene list is profiled with respect to most of the available information regarding gene function, protein interactions, pathway relationships, in silico predicted microRNA to gene associations, as well as information collected by text mining. BioProfiling.de provides a user-friendly dialog-driven web interface for several model organisms and supports most available gene identifiers. The web portal is freely available at http://www.BioProfiling.de/gene_list.
INTRODUCTION
The development of high-throughput technologies has had a dramatic impact on modern biology. Although technically different, ‘omics’ technologies in the majority of cases reduce their experimental output to a list of genes/proteins. Genes or proteins that are differentially expressed or co-expressed across varying cellular conditions, or that have a different epigenetic or mutational status, are commonly reported in biological and clinically related studies. Functional profiling has become the de facto standard approach for the analysis of high-throughput data (1). Functional profiling can be generally defined as a statistical procedure to understand the functional context of a gene/protein list using prior knowledge of gene properties and interactions (1–5). The most widespread example of functional profiling is enrichment analysis of Gene Ontology (GO) terms (6–10).
Recently, we have introduced several web tools that employ either an advanced enrichment profiling schema [ProfCom (11), GeneSet2MiRNA (12), PLIPS (13), CCancer (14)] or a network-based statistical framework [KEGG spider (15), PPI spider (16), R spider (17)] for the interpretation of gene/protein lists based on prior knowledge stored in public databases. BioProfiling.de provides experimentalists with an efficient interface to these tools: in one submission, the gene list is profiled with respect to most of the available information regarding gene function [GO (18)], pathway relations [KEGG database (19), Reactome knowledgebase (20)], protein interactions [IntAct (21)], in silico predicted gene to MiRNA associations [GeneSet2MiRNA (12)] and information collected by text mining [PLIPS&CCancer (13,14)].
BioProfiling.de is not only a common interface for the collection of recently developed tools but also a pipeline for the fast implementation of new tools that explore novel biological principles to group genes into functional classes or to associate genes into a global gene network. For example, ProfCom_PROT_MOTIFS is a new tool implemented within the BioProfiling.de pipeline. In this case, genes are grouped into functional classes based on the amino acid triplet composition of their protein products. ProfCom_PROT_MOTIFS employs the ‘ProfCom’ statistical framework to identify ‘amino acid triplets’ and logical combinations of ‘amino acid triplets’ overrepresented in the submitted gene/protein list.
BioProfiling.de provides a user-friendly dialog-driven web interface and supports most available gene/protein identifiers. BioProfiling.de provides analyses for six organisms: Homo sapiens, Mus musculus, Rattus norvegicus, Caenorhabditis elegans, Drosophila melanogaster and Arabidopsis thaliana.
MATERIALS AND METHODS
Statistical frameworks
The prior knowledge about gene/protein function and interactions is commonly reduced to two data models: either grouping genes into classes based on a shared feature (Type 1) or connecting pairs of genes by edges (Type 2). The GO (18) database is an example of Type 1 data, while the IntAct (21) database of protein–protein interactions is an example of Type 2 data. BioProfiling.de implements two different statistical frameworks to deal with both types of prior knowledge. The first statistical framework, referred to as ProfCom, applies to Type 1 data and represents an advanced enrichment schema. The second statistical framework, referred to as Global Network, was recently introduced to deal with Type 2 data.
ProfCom
In this case, the prior knowledge represents a grouping of genes into functional classes (GO terms) or a grouping of genes based on whether or not they are regulated by the same microRNA. Let us denote each class (i.e. GO term, microRNA, ‘amino acid triplet’) as f and the set of all available classes as F. In a standard enrichment schema, a query list of genes (referred to as list A) and a reference list (referred to as list B, usually all genes from the genome) are compared. For each class f from the set F, the number a of genes in list A and the number b of genes in list B that have been annotated with f are counted. In the next step, the null hypothesis H0 (membership of a gene in list A is independent of its having attribute f) is tested. Hypergeometric, binomial or χ²-tests are usually employed to find over- or under-represented attributes.
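As an illustration of the standard enrichment schema described above, the following minimal sketch computes a one-sided hypergeometric P-value for over-representation of a single class f. It assumes that list A is a subset of list B; the function and parameter names are hypothetical and are not taken from the BioProfiling.de implementation.

```python
from math import comb


def enrichment_p_value(a, size_a, b, size_b):
    """Upper-tail hypergeometric P-value for one class f.

    a      -- genes in the query list A annotated with f
    size_a -- total number of genes in list A
    b      -- genes in the reference list B annotated with f
    size_b -- total number of genes in list B (assumed to contain A)
    """
    # Probability of observing a or more annotated genes when size_a genes
    # are drawn without replacement from B.
    upper = min(b, size_a)
    tail = sum(comb(b, k) * comb(size_b - b, size_a - k)
               for k in range(a, upper + 1))
    return tail / comb(size_b, size_a)
```

A small P-value indicates that class f is over-represented in list A relative to list B; under-representation would be tested with the lower tail instead.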
ProfCom extends the standard enrichment schema by constructing ‘complex classes’, which are Boolean combinations of the available classes of F. ProfCom uses two Boolean operations: intersection and difference. For example, the intersection (AND operator) of two classes f1 and f2 is formally defined as the set of genes that belong to both f1 and f2. The difference (NOT operator) between two classes f1 and f2 is formally defined as the set of genes from f1 that are not in f2.
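The two Boolean operations map directly onto set operations. The sketch below uses invented gene and class names purely for illustration and shows how the counts needed for an enrichment test of a complex class could be obtained.

```python
def class_and(f1, f2):
    """Intersection: genes that belong to both f1 and f2."""
    return f1 & f2


def class_not(f1, f2):
    """Difference: genes from f1 that are not in f2."""
    return f1 - f2


def count_annotated(gene_class, gene_list):
    """Number of genes from gene_list that fall into gene_class."""
    return len(gene_class & set(gene_list))


# Toy annotation data (hypothetical gene identifiers).
f1 = {"g1", "g2", "g3", "g4"}
f2 = {"g2", "g3", "g5"}
list_a = ["g1", "g2", "g3"]

f1_and_f2 = class_and(f1, f2)   # genes g2 and g3
f1_not_f2 = class_not(f1, f2)   # genes g1 and g4
a_for_and = count_annotated(f1_and_f2, list_a)
```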
Unlike the standard enrichment schema, which is limited to the set F, ProfCom tests all possible pairwise combinations of classes from F joined by the logical operators AND and NOT. Next, ProfCom employs an algorithm based on greedy heuristics to search for the most enriched triplet and quadruplet combinations. For triplet and quadruplet combinations, the greedy heuristics do not guarantee finding the optimal solution in every case but significantly reduce the computational complexity. To adjust P-values for multiple testing, ProfCom uses both Bonferroni correction and a Monte Carlo simulation approach.
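The search strategy can be sketched as follows. This is an illustrative assumption about how an exhaustive pairwise search followed by a greedy extension step might look, not the published ProfCom algorithm; all function names are hypothetical, the query list is assumed to be a subset of the reference list, and the Monte Carlo adjustment is omitted.

```python
from itertools import permutations
from math import comb


def tail_p(a, n_query, b, n_ref):
    """Hypergeometric upper tail P(X >= a)."""
    hits = range(a, min(b, n_query) + 1)
    total = comb(n_ref, n_query)
    return sum(comb(b, k) * comb(n_ref - b, n_query - k) for k in hits) / total


def combine(left, right, op):
    """Apply one of the two Boolean operations to two gene sets."""
    return left & right if op == "AND" else left - right


def score(members, query, reference):
    """Enrichment P-value of a (complex) class given as a gene set."""
    return tail_p(len(members & query), len(query),
                  len(members & reference), len(reference))


def best_pair(classes, query, reference):
    """Exhaustively test every ordered pair of classes with AND and NOT."""
    best = None
    for n1, n2 in permutations(classes, 2):
        for op in ("AND", "NOT"):
            members = combine(classes[n1], classes[n2], op)
            p = score(members, query, reference)
            if best is None or p < best[0]:
                best = (p, f"({n1} {op} {n2})", members)
    return best


def greedy_extend(current, classes, query, reference):
    """Add one more class to the current combination if it lowers the P-value."""
    best = current
    for name, genes in classes.items():
        for op in ("AND", "NOT"):
            members = combine(current[2], genes, op)
            p = score(members, query, reference)
            if p < best[0]:
                best = (p, f"({current[1]} {op} {name})", members)
    return best


def bonferroni(p, n_tests):
    """Bonferroni-adjusted P-value, capped at 1."""
    return min(1.0, p * n_tests)
```

Calling `greedy_extend` on the result of `best_pair` yields a candidate triplet combination, and calling it again on that result yields a candidate quadruplet. Because only the single best combination is extended at each step, the search is fast but may miss the globally optimal combination, which is the trade-off noted above.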
Global network (spider tools)
In this case, pairwise gene associations of any biological nature are used as prior knowledge in the form of a global gene network (reference gene network). The sub-network inference procedure is based on two natural assumptions:
most genes from the input list are related and
most genes that are not from the input list are unrelated.
These assumptions can be reformulated as a standard optimization principle:
to find a gene sub-network with a maximal number of input genes connected by a minimal number of missing genes (genes that are not from the input list).
To realize this optimization principle, a network inference algorithm was recently proposed (15–17). A parameter m is introduced that fixes the maximal number of missing genes allowed between any two input genes for them to be connected by an edge in the output network model. The model is inferred in three steps by setting m to 0, 1 and 2. At each step, any two input genes are connected by an edge if at most m genes lie between them in the reference gene network. At each step (m = 0, 1, 2), the connected sub-network component with the maximal number of input genes is inferred and referred to as model D1, D2 and D3, respectively. Clearly, given a reference network and any input gene list (even a randomly generated one), some genes from the input list might be connected into a sub-network just by chance, in particular when m equals 2. All spider tools implement a robust statistical framework to estimate the P-value of the inferred models. More details can be found in the original publications (15–17,22).
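The inference step for a fixed m can be sketched as a bounded breadth-first search followed by a connected-component search. The sketch assumes an undirected reference network stored as an adjacency dictionary and assumes that intermediate genes on a connecting path are missing (non-input) genes; the function names are hypothetical, and the P-value estimation is not shown.

```python
from collections import deque


def linked_inputs(reference, source, inputs, m):
    """Input genes reachable from source via at most m non-input intermediates."""
    found = set()
    seen = {source}
    queue = deque([(source, 0)])  # (gene, number of edges walked so far)
    while queue:
        gene, dist = queue.popleft()
        for neighbour in reference.get(gene, ()):
            if neighbour in seen:
                continue
            seen.add(neighbour)
            if neighbour in inputs:
                found.add(neighbour)  # reached with `dist` missing genes in between
            elif dist < m:
                queue.append((neighbour, dist + 1))  # walk on through a missing gene
    return found


def infer_model(reference, input_genes, m):
    """Largest set of input genes connected with at most m missing genes per link."""
    inputs = set(input_genes) & set(reference)
    links = {g: linked_inputs(reference, g, inputs, m) for g in inputs}
    best, visited = set(), set()
    for gene in sorted(inputs):
        if gene in visited:
            continue
        component, stack = {gene}, [gene]
        while stack:
            current = stack.pop()
            for neighbour in links[current]:
                if neighbour not in component:
                    component.add(neighbour)
                    stack.append(neighbour)
        visited |= component
        if len(component) > len(best):
            best = component
    return best
```

Running `infer_model` with m = 0, 1 and 2 gives the input genes of models D1, D2 and D3. A P-value could then be estimated by repeating the procedure on randomly drawn gene lists of the same size, although the exact null model used by the spider tools is described only in the original publications.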
BioProfiling.de tools
BioProfiling.de provides a common interface for the collection of recently developed tools. A summary of the currently available tools is presented in Table 1. Descriptions and details of the tools can be found in the original publications. Here, we provide a short description of the novel (previously unpublished) tools implemented within the BioProfiling.de analytical pipeline.
Figure 1.
According to the global PPI network, all 47 Bosutinib targets (rectangles) that can be mapped to the global PPI network can be connected into a sub-network with at most two missing genes (triangles) in between. The P-value estimated by Monte Carlo simulation is <0.005.
Table 1.
Summary of currently available BioProfiling.de tools for the interpretation of gene/protein lists
| Tool name | Statistical framework | Database (prior knowledge) |
|---|---|---|
| ProfCom_GO | ProfCom | GO |
| ProfCom_InterPro | ProfCom | InterPro database |
| ProfCom_GO_not_IEA | ProfCom | GO |
| KEGG spider | Global Network | KEGG |
| PPI spider | Global Network | IntAct |
| GeneSet2MiRNA | ProfCom | In silico predicted gene to MiRNA regulatory relations |
| R spider | Global Network | Reactome and KEGG |
| CCancer&PLIPSᵃ | Standard Enrichment | CCancer and PLIPS databases |
| ProfCom_PROT_MOTIFS | ProfCom | Protein sequences (amino acid triplets) |
| CCancer spiderᵃ | Global Network | CCancer and PLIPS databases |
ᵃAvailable only for the human genome.
ProfCom_PROT_MOTIFS
ProfCom_PROT_MOTIFS implements the ‘ProfCom’ statistical framework to identify ‘amino acid triplets’ or logical combinations of ‘amino acid triplets’ overrepresented in the submitted list (genes are mapped to the corresponding proteins). In this case, every ‘amino acid triplet’ represents a functional class (equivalent to a GO category), and genes are grouped into the same class if the corresponding protein(s) contain the same ‘amino acid triplet’. Single, pair, triplet or quadruplet combinations of amino acid triplets are considered (joined by the ‘AND’ and ‘NOT’ logical operators), and those that best discriminate the input list from all other genes are identified.
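The construction of triplet-based classes can be sketched as follows. The sketch assumes that an ‘amino acid triplet’ means three consecutive residues in the protein sequence, which is not stated explicitly here; the sequences and gene names are invented for illustration.

```python
def triplets(sequence):
    """All distinct runs of three consecutive residues in a protein sequence."""
    return {sequence[i:i + 3] for i in range(len(sequence) - 2)}


def triplet_classes(proteins):
    """Map each triplet to the set of genes whose protein(s) contain it.

    proteins -- dict: gene identifier -> list of protein sequences
    """
    classes = {}
    for gene, sequences in proteins.items():
        for sequence in sequences:
            for triplet in triplets(sequence):
                classes.setdefault(triplet, set()).add(gene)
    return classes


# Toy input (hypothetical genes and sequences).
toy_proteins = {"g1": ["ADFGK"], "g2": ["HRDFG"]}
toy_classes = triplet_classes(toy_proteins)
```

Each entry of the resulting dictionary plays the role of one functional class f, so the classes can be passed to an enrichment test in the same way as GO categories.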
CCancer spider
CCancer spider implements the ‘Global Network’ statistical framework to analyze a gene list using, as reference knowledge, the global gene association network derived from the CCancer&PLIPS database. In total, the CCancer&PLIPS database has 5238 gene/protein lists reported in various functional contexts by independent studies. For each gene pair, the number of times k12 the two genes are reported together (in the same gene/protein list) is counted, as well as the number of times each gene is reported alone (k1, k2). The standard urn schema is used to derive significantly associated gene pairs. Let us denote the total number of gene/protein lists in the CCancer&PLIPS database as N (5238 at the moment). The value k12 follows a hypergeometric distribution with parameters N, k1 and k2 (k1 balls are drawn without replacement from an urn containing N balls in total, k2 of which are white). The P-value needs to be adjusted for multiple testing (each gene is tested versus all other genes); Bonferroni correction is used. Two genes are connected by an edge in the resulting global gene network used by CCancer spider if the adjusted P-value of their association is <0.01.
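The urn schema can be sketched as below. Two points are assumptions made for the sketch rather than details confirmed by the text: k1 and k2 are taken to be the total number of lists containing each gene, and the Bonferroni factor is taken to be the total number of gene pairs. All names are hypothetical.

```python
from itertools import combinations
from math import comb


def cooccurrence_p(k12, k1, k2, n_lists):
    """P(X >= k12) when k1 balls are drawn from n_lists balls, k2 of them white."""
    upper = min(k1, k2)
    tail = sum(comb(k2, k) * comb(n_lists - k2, k1 - k)
               for k in range(k12, upper + 1))
    return tail / comb(n_lists, k1)


def association_edges(gene_lists, alpha=0.01):
    """Gene pairs whose Bonferroni-adjusted co-occurrence P-value is below alpha."""
    n_lists = len(gene_lists)
    gene_counts, pair_counts = {}, {}
    for gene_list in gene_lists:
        genes = sorted(set(gene_list))
        for gene in genes:
            gene_counts[gene] = gene_counts.get(gene, 0) + 1
        for pair in combinations(genes, 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    n_tests = comb(len(gene_counts), 2)  # assumed correction factor: all gene pairs
    edges = set()
    for (g1, g2), k12 in pair_counts.items():
        p = cooccurrence_p(k12, gene_counts[g1], gene_counts[g2], n_lists)
        if min(1.0, p * n_tests) < alpha:
            edges.add((g1, g2))
    return edges
```

Pairs that never co-occur are skipped because they cannot be significantly associated in the over-representation sense.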
RESULTS
BioProfiling.de (http://www.BioProfiling.de/gene_list) is a freely available analytical web portal that provides a comprehensive analytical toolkit for the interpretation of gene/protein lists. In one submission, the gene list is analyzed by a collection of tools. BioProfiling.de has a simple user-friendly interface. As input, it accepts several types of gene or protein identifiers, such as ‘Entrez Gene’, ‘Gene Symbols’, ‘UniProt/Swiss-Prot’ (23), ‘IPI - International Protein Index’, ‘UniGene’, ‘Ensembl’ and ‘RefSeq’.
Data submission
To start the analyses, the user needs to upload a text file with gene/protein identifiers and select an organism. After data submission, a link is provided to the ‘Main Result page’. As soon as computations are finished, the results will be available there. The user can either bookmark this page and return to it in 2–3 h or periodically refresh it.
The submitted gene/protein IDs are automatically mapped to ‘Entrez Gene’ IDs. Gene ID mapping is an inherently difficult problem. To avoid errors in the results caused by mapping issues, we recommend submitting ‘Entrez Gene’ identifiers. We also suggest several resources (6,24,25) that focus primarily on solving the gene ID mapping problem.
The mapping report is provided first. If fewer than 10 gene/protein IDs are recognized, the user gets an error message. Next, a table with a short description of the tools available for the submission is provided. Each line of the table corresponds to one tool. The first column of the table specifies the tool name, and the second provides the status of the computations (or a link to the results of the tool, once the computations are finished). The third column provides a short summary of the tool: the statistical framework, the database of prior knowledge and the total number of genes covered/annotated in the database for the selected genome.
After the computations are finished, the status ‘in progress’ is replaced with a link to the tool results (second column of the summary table). The structure of the output is the same for all ‘spider’ tools, and likewise the same for all ‘ProfCom’ tools. In the case of the ‘spider’ tools, the main output is summarized in the table ‘Enriched sub-networks’, which provides the details of the best sub-network models (D1, D2, D3) inferred from the submitted gene list. In the case of the ‘ProfCom’ tools, the user initially gets a short summary table that reports the top enriched complex classes of degree 0, 1, 2 and 3. The last column in the table (‘full report’) provides links to the detailed reports for the ‘complex classes’ of a given degree.
Example: Bosutinib protein targets
BioProfiling.de provides comprehensive functional profiling of a gene/protein list from various biological perspectives. The following example demonstrates the wide spectrum of biological insights that one can obtain by using BioProfiling.de. Bosutinib is a novel drug (a promiscuous kinase inhibitor). The whole-proteome binding spectrum of Bosutinib was identified by chemical proteomics (26); in total, 55 proteins were reported to be direct Bosutinib interactors. Here, we used BioProfiling.de to understand the properties of Bosutinib protein targets. As one might expect, the list of Bosutinib protein targets was significantly enriched from many functional perspectives. Particularly interesting are the results produced by ProfCom_PROT_MOTIFS, a new tool in the BioProfiling.de collection. In this case, the logical combinations of amino acid triplets that are highly discriminative between the list of Bosutinib protein targets and the whole human proteome are reported. For example, the logical pattern ‘((DFG and HRD) not (LPY, HEE))’ was present in 50 (out of 55) Bosutinib protein targets, while only 305 (out of approximately 25 000) proteins in the whole genome comply with the pattern. The P-value of the enrichment adjusted by Bonferroni correction for multiple testing is 1.6e-77. In addition, the results of the spider tools (PPI spider, R spider) suggest that Bosutinib protein targets form a dense interaction pattern. This result supports the novel ‘network pharmacology’ paradigm (27) in drug discovery: to be effective, a drug should target multiple functionally dependent targets.
CONCLUSIONS
BioProfiling.de provides experimentalists with a comprehensive toolkit for gene/protein list interpretation. In one submission, the gene list is profiled with respect to most of the available information regarding gene function (GO), pathway relations (KEGG database, Reactome knowledgebase), protein interactions (IntAct), in silico predicted gene to MiRNA associations (GeneSet2MiRNA), information collected by text mining (PLIPS and CCancer) and protein ‘amino acid triplet’ composition.
BioProfiling.de implements two statistical frameworks (‘ProfCom’ and ‘Global Network’), which allow fast implementation of new tools capable of exploring novel biological principles (as prior knowledge) to group genes into functional classes or to associate genes by edges into a global gene network. In the future, the collection of tools will expand to cover novel biological principles for profiling gene/protein lists using either the ‘ProfCom’ or the ‘Global Network’ statistical framework.
We would also like to point out that both statistical frameworks (‘ProfCom’, ‘Global Network’) are implemented only in BioProfiling.de tools. Although there are many tools for the functional profiling of gene/protein lists, several features of both frameworks distinguish BioProfiling.de. BioProfiling.de therefore provides a combination of traits that sets it apart from other available resources.
FUNDING
This work was supported by the Helmholtz Association “Impuls und Vernetzungsfonds” (Systems Biology Alliance). Funding for open access charge: Helmholtz Zentrum München.
Conflict of interest statement. None declared.
REFERENCES
- 1. Khatri P, Draghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics. 2005;21:3587–3595. doi: 10.1093/bioinformatics/bti565.
- 2. Berger SI, Posner JM, Ma'ayan A. Genes2Networks: connecting lists of gene symbols using mammalian protein interactions databases. BMC Bioinformatics. 2007;8:372. doi: 10.1186/1471-2105-8-372.
- 3. Draghici S, Khatri P, Tarca AL, Amin K, Done A, Voichita C, Georgescu C, Romero R. A systems biology approach for pathway level analysis. Genome Res. 2007;17:1537–1545. doi: 10.1101/gr.6202607.
- 4. Reimand J, Tooming L, Peterson H, Adler P, Vilo J. GraphWeb: mining heterogeneous biological networks for gene modules with functional significance. Nucleic Acids Res. 2008;36(Suppl. 2):W452–W459. doi: 10.1093/nar/gkn230.
- 5. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102.
- 6. Reimand J, Kull M, Peterson H, Hansen J, Vilo J. g:Profiler–a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic Acids Res. 2007;35:W193–W200. doi: 10.1093/nar/gkm226.
- 7. Martin D, Brun C, Remy E, Mouren P, Thieffry D, Jacq B. GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biol. 2004;5:R101. doi: 10.1186/gb-2004-5-12-r101.
- 8. Khatri P, Draghici S, Ostermeier GC, Krawetz SA. Profiling gene expression using onto-express. Genomics. 2002;79:266–270. doi: 10.1006/geno.2002.6698.
- 9. Khatri P, Voichita C, Kattan K, Ansari N, Khatri A, Georgescu C, Tarca AL, Draghici S. Onto-Tools: new additions and improvements in 2006. Nucleic Acids Res. 2007;35:W206–W211. doi: 10.1093/nar/gkm327.
- 10. Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA. Global functional profiling of gene expression. Genomics. 2003;81:98–104. doi: 10.1016/s0888-7543(02)00021-6.
- 11. Antonov AV, Schmidt T, Wang Y, Mewes HW. ProfCom: a web tool for profiling the complex functionality of gene groups identified from high-throughput data. Nucleic Acids Res. 2008;36:W347–W351. doi: 10.1093/nar/gkn239.
- 12. Antonov AV, Dietmann S, Wong P, Lutter D, Mewes HW. GeneSet2miRNA: finding the signature of cooperative miRNA activities in the gene lists. Nucleic Acids Res. 2009;37:W323–W328. doi: 10.1093/nar/gkp313.
- 13. Antonov AV, Dietmann S, Wong P, Igor R, Mewes HW. PLIPS, an automatically collected database of protein lists reported by proteomics studies. J. Proteome Res. 2009;8:1193–1197. doi: 10.1021/pr800804d.
- 14. Dietmann S, Lee W, Wong P, Rodchenkov I, Antonov AV. CCancer: a bird's eye view on gene lists reported in cancer-related studies. Nucleic Acids Res. 2010;38(Suppl.):W118–W123. doi: 10.1093/nar/gkq515.
- 15. Antonov AV, Dietmann S, Mewes HW. KEGG spider: interpretation of genomics data in the context of the global gene metabolic network. Genome Biol. 2008;9:R179. doi: 10.1186/gb-2008-9-12-r179.
- 16. Antonov AV, Dietmann S, Rodchenkov I, Mewes HW. PPI spider: a tool for the interpretation of proteomics data in the context of protein-protein interaction networks. Proteomics. 2009;9:2740–2749. doi: 10.1002/pmic.200800612.
- 17. Antonov AV, Schmidt EE, Dietmann S, Krestyaninova M, Hermjakob H. R spider: a network-based analysis of gene lists by combining signaling and metabolic pathways from Reactome and KEGG databases. Nucleic Acids Res. 2010;38(Suppl.):W78–W83. doi: 10.1093/nar/gkq482.
- 18. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000;25:25–29. doi: 10.1038/75556.
- 19. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 1999;27:29–34. doi: 10.1093/nar/27.1.29.
- 20. Vastrik I, D'Eustachio P, Schmidt E, Gopinath G, Croft D, de Bono B, Gillespie M, Jassal B, Lewis S, Matthews L, et al. Reactome: a knowledge base of biologic pathways and processes. Genome Biol. 2007;8:R39. doi: 10.1186/gb-2007-8-3-r39.
- 21. Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M, Ghanbarian AT, Kerrien S, Khadake J, et al. The IntAct molecular interaction database in 2010. Nucleic Acids Res. 2010;38:D525–D531. doi: 10.1093/nar/gkp878.
- 22. Antonov AV, Dietmann S, Wong P, Mewes HW. TICL–a web tool for network-based interpretation of compound lists inferred by high-throughput metabolomics. FEBS J. 2009;276:2084–2094. doi: 10.1111/j.1742-4658.2009.06943.x.
- 23. Boutet E, Lieberherr D, Tognolli M, Schneider M, Bairoch A. UniProtKB/Swiss-Prot. Methods Mol. Biol. 2007;406:89–112. doi: 10.1007/978-1-59745-535-0_4.
- 24. Berriz GF, Roth FP. The Synergizer service for translating gene, protein and other biological identifiers. Bioinformatics. 2008;24:2272–2273. doi: 10.1093/bioinformatics/btn424.
- 25. Baron D, Bihouee A, Teusan R, Dubois E, Savagner F, Steenman M, Houlgatte R, Ramstein G. MADGene: retrieval and processing of gene identifier lists for the analysis of heterogeneous microarray datasets. Bioinformatics. 2011;27:725–726. doi: 10.1093/bioinformatics/btq710.
- 26. Fernbach NV, Planyavsky M, Muller A, Breitwieser FP, Colinge J, Rix U, Bennett KL. Acid elution and one-dimensional shotgun analysis on an Orbitrap mass spectrometer: an application to drug affinity chromatography. J. Proteome Res. 2009;8:4753–4765. doi: 10.1021/pr900455x.
- 27. Hopkins AL. Network pharmacology: the next paradigm in drug discovery. Nat. Chem. Biol. 2008;4:682–690. doi: 10.1038/nchembio.118.

