Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2008 Aug 1;24(19):2137–2142. doi: 10.1093/bioinformatics/btn403

Predicting pathway membership via domain signatures

Holger Fröhlich 1,*, Mark Fellmann 1, Holger Sültmann 1, Annemarie Poustka 1, Tim Beißbarth 1
PMCID: PMC2553439  PMID: 18676972

Abstract

Motivation: Functional characterization of genes is of great importance for the understanding of complex cellular processes. Valuable information for this purpose can be obtained from pathway databases, like KEGG. However, only a small fraction of genes is annotated with pathway information up to now. In contrast, information on contained protein domains can be obtained for a significantly higher number of genes, e.g. from the InterPro database.

Results: We present a classification model, which for a specific gene of interest can predict the mapping to a KEGG pathway, based on its domain signature. The classifier makes explicit use of the hierarchical organization of pathways in the KEGG database. Furthermore, we take into account that a specific gene can be mapped to different pathways at the same time. The classification method produces a scoring of all possible mapping positions of the gene in the KEGG hierarchy. Evaluations of our model, which is a combination of a SVM and ranking perceptron approach, show a high prediction performance. Moreover, for signaling pathways we reveal that it is even possible to forecast accurately the membership to individual pathway components.

Availability: The R package gene2pathway is a supplement to this article.

Contact: h.froehlich@dkfz-heidelberg.de

Supplementary Information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Biological characterization of genes is of fundamental importance for the understanding of complex cellular processes, like cancer. Valuable information can be obtained from databases, like the Gene Ontology (GO; The Gene Ontology Consortium, 2004) or KEGG (Kanehisa et al., 2008). However, usually only a small fraction of genes have known functions. Most genes are annotated in GO, only few in KEGG. For example, the total number of human genes annotated in KEGG currently is about 4000. This contrasts remarkably with the estimated number of putative protein-coding genes, which is 20 000–25 000 (Pennisi, 2007). It is therefore highly important to link other sources of information with these databases to improve the quality of biological characterization. Especially interesting for this purpose is the InterPro database (Mulder et al., 2008), which offers predicted protein-domain annotation for ∼19 000 genes. Of the 4000 genes in the KEGG database nearly all have at least one InterPro domain. Together, these comprise ∼3000 distinct InterPro domains. Protein domains very often directly correspond to some core biological function, such as DNA binding, kinase or phosphorylation activity or to cellular localization. Hence, predicted protein domains are often utilized for prediction annotations, such as in the GO database.

Hahne et al., 2008) introduced a method linking protein-domain signatures with assignments of genes to KEGG pathways. In this approach one looks for a protein-domain signature being significantly enriched in a list of genes. This information is then used to find the most probable pathway these genes come from by comparing the enriched protein-domain signature with all pathway domain signatures.

In contrast to Hahne et al. our aim is to make a prediction and thus a biological characterization for individual genes. This broadens the applicability of our method significantly. We explicitly take into account that a particular gene can be mapped to different pathways at the same time. Furthermore, our classifier makes use of the hierarchical organization of the KEGG database in three levels: at the top hierarchy there are the four branches ‘Metabolism’, ‘Genetic Information Processing’, ‘Environmental Information Processing’ and ‘Cellular Processes’ (we do not consider ‘Human Diseases’ here). On the next hierarchy level each of these branches is divided further. For instance, ‘Environmental Information Processing’ contains the branches ‘Membrane Transport’, ‘Signal Transduction’ and ‘Signaling Molecules and Interaction’. On the third hierarchy level we have the individual KEGG pathways. We expect that a good classifier should give especially precise predictions at the top levels of the KEGG hierarchy, while at the bottom levels misclassifications are more tolerable. That means it is worse to predict a MAPK pathway (branch ‘Signal Transduction’ in ‘Environmental Information Processing’) gene to be involved in ‘Olfactory transduction’ (branch ‘Sensory System’ in ‘Cellular Processes’) than to predict it as a member of some other signal transduction pathway. This behavior, leading to a hierarchical classification scheme, is encoded into an appropriate loss function within our framework. Our classifier is also able to indicate the reliability of a pathway prediction. A 10 × 10-fold cross-validation experiment with 2346 genes having both, a KEGG annotation and a unique protein-domain signature, shows that our method yields good classification performance. We further demonstrate the usefulness of our method on a microarray dataset, where we obtain meaningful results.

Signaling pathways are of special importance for the functioning of biological systems. In an extension of our approach we demonstrate that it is not only possible to reliably predict a gene's membership to the different signaling pathways, but also to connected pathway components within individual signaling pathways. Again, results on our microarray dataset show the biological relevance of our method.

2 METHODS

2.1 Hierarchical KEGG pathway classification

2.1.1 Classification scheme

We suppose that each gene product p is represented by a binary vector x with component xi=1, if the corresponding InterPro domain is contained in the protein and 0 otherwise. We hereby have to take into account that InterPro domains are organized in a hierarchical fashion. Hence, if domain i is contained in p, also all its parent domains are contained in p, and therefore all corresponding positions in x have to be 1 as well.

The mapping position(s) of a gene to the KEGG hierarchy can be encoded into a binary vector C as well. The dimension K of this vector equals the number of individual KEGG pathways plus the number of branches at level 2 plus the number of branches at top level. We set component Cl=1, if the gene maps to the corresponding branch or any of its sub-branches. Note that any position code vector C can contain more than one, if the corresponding gene maps to more than one branch in the KEGG hierarchy.

Given a binary vector representation x for a gene product p, our classification scheme now consists of two basic steps, which are an adaption of an approach proposed by (Melvin et al., 2007) for classifying proteins within the SCOP hierarchy:

  1. On each hierarchy level we use support vector machine (SVM) classifiers, trained to separate one specific branch from all others. Linear kernels are used, and all soft margin parameters C=1. Each SVM classifier j will produce a decision value fj(x)∈ℝ. Please note that the decision value is not the same as the predicted class label, which is the sign of the decision value. For each gene product p represented by a binary vector x we summarize the decision values of all K SVMs into a input code vector Inline graphic.

  2. Each input code vector Inline graphic is mapped on the best matching position code vector(s)
    graphic file with name btn403m1.jpg (1)
    graphic file with name btn403m2.jpg (2)
    where {C1,…,Cm} is a dictionary of possible position vectors, w is a weight vector and * indicates component-wise multiplication. The dictionary of position vectors consists of all unique position vectors from a training set of gene products with both, KEGG and InterPro domain annotation. The weight vector w is chosen to minimize the mismatch between predicted and true KEGG hierarchy positions on the training data.

    Please note that the maximum in Equation (2) is not necessarily unique. In other words, it is possible to predict several positions vectors, which are all equally likely. Hence, we capture the often appearing situation that a gene maps to several positions in the KEGG hierarchy at the same time.

2.1.2 Training procedure

Similar to the classification scheme, the training procedure consists of two steps.

  1. All K binary SVM classifiers are trained to obtain a position labeled datasetInline graphic. For training the individual SVMs we only use genes belonging to the same super-branch. For example, for training the SVM classifier detecting signal transduction, we only use genes mapping to other branches than signal transduction in ‘Environmental Information Processing’ as negative examples. Each SVM classifier is thus trained to detect one specific branch in the KEGG hierarchy only.

    graphic file with name btn403i4.jpg

  2. Given the position labeled dataset D, we employ the modified ranking perceptron algorithm presented in (Melvin et al., 2007) to learn a weight vector w of the input code vectors Inline graphic. In the spirit of SVM classifiers, the weight vector is optimized to maximize the margin between position code vectors Ci, Cj with CiCj in input code vector space. The algorithm shown in Figure 1 involves updating w proportional to the loss we obtain by predicting a wrong position vector Cj instead of the true position vector Ci. The choice of this loss function is the essential part of the algorithm, because it reflects our knowledge about the KEGG hierarchy. Making a wrong prediction at the higher levels of the hierarchy should be punished more than confusing individual KEGG pathways at the bottom level. We therefore set up the following loss function:
    graphic file with name btn403m3.jpg (3)
    where Anc denotes the set of all ancestors of branch j and 1 is the indicator function. By this loss function we punish the first mismatch on the path down the hierarchy to the final predicted position. The higher in the hierarchy the mismatch occurs, the higher the punishment ci should be. We thus choose
    graphic file with name btn403m4.jpg (4)
    where |T(i)| denotes the size of the hierarchy down of branch i and |T(root)| is the size of the complete KEGG hierarchy.
Fig. 1.

Fig. 1.

Prediction performance of our method (10×10-fold cross-validation). The accuracy measure uses the same loss function, which was used to train the classifier, and which takes into account the KEGG hierarchy. (A) Pathway prediction within pruned KEGG hierarchy (53 branches). (B) Pathway component prediction for signaling pathways (19 branches).

2.2 Hierarchical signaling pathway component classification

Viewing all gene–gene interactions as an undirected graph, we calculated the connected components for each signaling pathway (Siek et al., 2002). Our hierarchy for signaling pathways thus consists of two levels: at the first level we have all individual signaling pathways and at the second level we have their corresponding connected components. The training and classification procedure is then the same as described above.

3 RESULTS

3.1 Estimating prediction performance

3.1.1 Hierarchical KEGG pathway classification

We used all human genes annotated in both, KEGG and InterPro. KEGG annotation was retrieved via the R package KEGG 2.0.1 (released August 2007). InterPro annotation was retrieved directly from the Ensembl database (Flicek et al., 2008) via the R package biomaRt 1.12.1 in March 2008. Hierarchy information for KEGG and InterPro was obtained from the corresponding homepages via FTP in March 2008. A total of 3705 genes had both, KEGG and InterPro annotation. Since we employed a 10-fold cross-validation procedure for estimating the classification accuracy, we decided to remove genes with the same InterPro annotation, thus avoiding an overoptimistic prediction performance estimation by having one of the duplicates in the training and one in the test set. This way our set of genes was reduced to 2346, containing 2752 distinct InterPro domains in total.

As already noted by (Hahne et al., 2008) it is unlikely to reliably separate metabolic pathways based on their InterPro domain signatures. We thus decided to prune the KEGG hierarchy in order to improve the prediction accuracy for branches of especially high importance. We cut the hierarchy for metabolic pathways at the top and for ‘Genetic Information Processing’ pathways at the second hierarchy level. At the same time, we required to have more than 30 genes to be mapped to the corresponding hierarchy branch in order to consider it in the classification hierarchy. This way we ended up with a total of 53 hierarchy branches to distinguish (Table 1).

Table 1.

Pruned KEGG hierarchy used for our classification model

Level 1 Level 2 Level 3
Metabolism
Genetic Inf.Proc. Transcription (–)
Translation
Folding, sorting, degradation
Env. Inf. Proc. Membrane transport (–)
Signal transduction MAPK pathway
ErbB pathway (2, 0)
Wnt pathway (2)
Notch pathway
Hedgehog pathway
TGF-β pathway (3, 2)
VEGF pathway
Jak-STAT pathway
Calcium signaling (4)
Phosphatidylinositol system
mTOR signaling (–)
Signaling molecules and interaction Neuroactive ligand-receptor interaction
Cytokine–cytokine receptor interaction
ECM–receptor interaction
Cell adhesion molecules
Cellular Processes Cell motility
Cell growth and death Cell cycle
Apoptosis
p53pathway
Cell communication Focal adhesion
Adherens junction
Tight junction
Gap junction
Endocrine system Insulin pathway
Adipocytokine pathway
PPAR pathway
GnRH pathway
Melanogenesis
Immune system Hematopoietic cell lineage
Complement and coagulation cascades
Toll-like receptor pathway
Natural killer cell-mediated cytotoxicity
Antigen processing and presentation
T-cell receptor signaling
B-cell receptor signaling
Fc-ɛ RI pathway
Leukocyte transendothelial migration
Nervous system Long-term potentiation
Long-term depression
Sensory system Olfactory transduction (–)
Taste transduction (–)
Development

Hierarchy branches marked with ‘–’ are left out in the cross-validation procedure, but are included in the final model. For signaling pathways the number in brackets indicate the number of connected pathway components. The first number refers to the number of connected pathway components used in the final model, and the second (italic) to the number used in the cross-validation procedure.

We ran a 10 times repeated 10-fold cross-validation procedure to assess the prediction performance of our hierarchical classification model. The classification performance was evaluated using four different measures.

  1. The accuracy, measured as 1 –average classification loss [Equation (3)].

  2. The precision (also known as positive predictive value), defined as Inline graphic, where TP and FP are the number of true positives and false positives summed over all hierarchy branches. That is, we first calculated true and false positives for each component in the position code vector individually and then summed up.

  3. The recall (also known as sensitivity), defined as Inline graphic, where FN are the number of false negatives summed over all hierarchy branches.

  4. The F1 value, defined as Inline graphic

The results, depicted in Figure 1A as boxplots showed a high median accuracy of>95% and a median F1 value of ∼60% with precision and recall being in the same range. It should be noted that only the accuracy measure takes into account the KEGG hierarchy via the loss function Equation (3), whereas the other three measures weight all errors equally. Further analysis of the median F1 values for all top level and second level hierarchy branches approximately showed a uniform distribution, i.e. all branches could be predicted equally well within each hierarchy level.

To train our final hierarchical classification model, which we employed to give predictions on further unseen datasets, we used the complete set of 3705 genes without removing duplicates. The number of hierarchy branches to distinguish was 58 now (Table 1). For further improvement of predictive power and in order to obtain confidence scores for predictions, our final model was bagged (Hastie et al., 2001). That means we drew 11 bootstrap training datasets with replacement and trained our classification model on each of them. To give a prediction, the majority vote among these 11 sub-models was used. This was done for each component in the position code-vector separately. A confidence score for the complete prediction can then be calculated as

graphic file with name btn403m5.jpg (5)

where Inline graphic is the average of all vote proportions>50% and Inline graphic the average of all vote proportions ≤50%.

3.1.2 Hierarchical signaling pathway component classification

A setup similar to the one described above was chosen. The number of human genes with a unique InterPro domain signature and a corresponding KEGG annotation was 515 and the total number of used InterPro domains 795. A minimum of 10 mapping genes per pathway component was required. Therefore, we ended up with 19 hierarchy branches to distinguish (Table 1).

The result, depicted in Figure 1B showed a high median accuracy of ∼100% and a median F1 value of ∼70% with precision and recall being in the same range. Again, the median F1 values for all top-level and second-level hierarchy branches approximately followed a uniform distribution, i.e. all branches could be predicted equally well within each hierarchy level.

To train our final hierarchical classification model, the same procedure was used as described above. The total number of genes used for training was 788, and the number of hierarchy branches to distinguish was 22 (Table 1).

3.2 Application to microarray data

We applied our method to predict the KEGG pathway membership for a microarray dataset produced in our department: human MCF-7 breast cancer cells were treated with 100 nM tamoxifen for 48 h. On mRNA level effects were measured with in-house developed cDNA two-color microarrays having 26 722 functioning probes (Barth et al., 2006). After variance stabilization normalization (VSN) (Huber et al., 2002) 2937 differentially expressed genes were found with limma (Smyth, 2004) using a Benjamini–Hochberg FDR cutoff of 5% (Benjamini and Hochberg, 1995). Further details on the experiment can be obtained from the authors upon request. The 26 722 probes correspond to 12 692 genes with an Entrez gene ID, of which for 10 057 InterPro annotation and for 2760 KEGG annotation was available. Comparison of our predicted and the original KEGG pathway annotations for the 2760 common genes indicated a very good median accuracy of ∼100% with a median F1-value ∼80% and precision and recall in the same range (Fig. 2A). There were a few outliers, as indicated in the boxplot. These genes are mostly linked to the KEGG category ‘Human Diseases’, which we did not include in our model.

Fig. 2.

Fig. 2.

Prediction performance of the hierarchical classification model on an external validation set for the pruned KEGG hierarchy (A, 2760 genes) and for signaling pathway components (B, 458 genes).

By our model we could predict pathway memberships for several genes with previously unknown KEGG annotation: e.g. NR2C2 is a member of the nuclear hormone receptor family and acts as ligand-activated transcription factor (Yoshikawa et al., 1996). We predicted NR2C2 to belong to the branch ‘Neuroactive ligand-receptor interaction’ (confidence=99.66%), which exactly fits this knowledge. As another example we predicted TOMM34 to be a member of the branches ‘Folding, Sorting and Degradation’ and ‘Cell Cycle’ (confidence=100%). Indeed, the protein encoded by TOMM34 is involved in the import of precursor proteins into mitochondria. The encoded protein has a chaperone-like activity, binding the mature portion of unfolded proteins and aiding their import into mitochondria (Chewawiwat et al., 1999).

In a second step of our analysis we filtered those genes, which were either known to be involved in signal transduction by KEGG annotation (458 genes), or which were predicted by our model to map to the corresponding KEGG hierarchy branch with confidence>99% (164 genes). Comparison of our pathway component predictions for the 458 genes with the original KEGG information, revealed a very high median accuracy of ∼100% with a median F1-value>80% and precision and recall in the same range (Fig. 2B). As an example application of our model, in Figure 3 we depict the predicted connected component for PLCH2 (confidence=100%) in the calcium signaling pathway, for which previously no KEGG annotation was available. The gene has an associated GO function ‘calcium ion binding’ and GO process ‘intracellular signaling cascade’ (The Gene Ontology Consortium, 2004).

Fig. 3.

Fig. 3.

Predicted pathway component (shaded) for PLCH2 in the Calcium signaling pathway.

In a final step, we looked for those KEGG branches, which were statistically overrepresented in the set of differentially expressed genes compared to the rest. We used all predicted and all original KEGG annotation for this purpose. Fisher's exact test was employed to assess statistical significance, and a multiple testing correction using the method of (Benjamini and Yekutieli, 2001), which assumes the statistical dependence of the individual tests, with a 10% cutoff was performed. The test shows an enrichment of metabolic, cell motility and cancer-related pathways. None of this would have been found using KEGG annotation only.

4 CONCLUSIONS

We presented a novel hierarchical classification method, which can predict the KEGG annotations of individual genes based on their InterPro domain signatures. In an extension of our approach, we showed that it is also possible to classify individual signaling pathway components via InterPro domain information. We think that linking KEGG with InterPro is an important step to generate new hypotheses about genetic pathways, which is finally of fundamental importance for a better understanding of human diseases like cancer. With our method it is not only possible to analyze lists of genes, as done in (Hahne et al., 2008), but to give predictions for individual genes of interest. This way we can drop the unrealistic assumption that all genes in the list come from the same pathway. Moreover, our method is not restricted to microarray experiments any more, but can be used in a much broader spectrum of applications.

We have implemented our method in the R package gene2pathway, which is available as a supplementary Material to this article.

Supplementary Material

[Supplementary Data]
btn403_index.html (638B, html)

ACKNOWLEDGEMENTS

We thank Markus Ruschhaupt and Ruprecht Kuner for help and discussions, and Dirk Ledwinda for IT support.

Funding: National Genome Research Network (NGFN) of the German Federal Ministry for Education and Research (BMBF)—grants SMP Bioinformatics (01GR0450, subprojects PFB-S19T10, PBF-S02T11 to T.B. and H.F.); exploratory project (EP-S19T03 to T.B. and H.F.); NGFN SMP RNA (01GR0418 to M.F and H.S.).

Conflict of Interest: none declared.

REFERENCES

  1. Barth AS, et al. Identification of a common gene expression signature in dilated cardiomyopathy across independent microarray studies. J. Am. Coll. Cardiol. 2006:1610–1617. doi: 10.1016/j.jacc.2006.07.026. [DOI] [PubMed] [Google Scholar]
  2. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B. 1995:289–300. [Google Scholar]
  3. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 2001:1165–1188. [Google Scholar]
  4. Chewawiwat N, et al. Characterization of the novel mitochondrial protein import component, tom34, in mammalian cells. J. Biochem. 1999:721–727. doi: 10.1093/oxfordjournals.jbchem.a022342. [DOI] [PubMed] [Google Scholar]
  5. Flicek P, et al. Ensembl 2008. Nucleic Acids Res. 2008;(Database issue):D707–D714. doi: 10.1093/nar/gkm988. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Hahne F, et al. Extending pathways based on gene lists using interpro domain signatures. BMC Bioinformatics. 2008:3. doi: 10.1186/1471-2105-9-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Hastie T, et al. The Elements of Statistical Learning. New York, USA: Springer; 2001. [Google Scholar]
  8. Huber W, et al. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002:S96–S104. doi: 10.1093/bioinformatics/18.suppl_1.s96. [DOI] [PubMed] [Google Scholar]
  9. Kanehisa M, et al. Kegg for linking genomes to life and the environment. Nucleic Acids Res. 2008:D480–D484. doi: 10.1093/nar/gkm882. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Melvin I, et al. Multi-class protein classification using adaptive codes. J. Mach. Learn. Res. 2007:1557–1581. [Google Scholar]
  11. Mulder NJ, et al. New developments in the InterPro database. Nucleic Acids Res. 2008:D224–D228. doi: 10.1093/nar/gkl841. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Pennisi E. Genetics: working the (gene count) numbers: finally, a firm answer? Science. 2007:1113. doi: 10.1126/science.316.5828.1113a. [DOI] [PubMed] [Google Scholar]
  13. Siek JG, et al. The Boost Graph Library: User Guide and Reference Manual. Boston, MA, USA: Addison-Wesley, Pearson Education Inc; 2002. [Google Scholar]
  14. Smyth G. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 2004 doi: 10.2202/1544-6115.1027. [DOI] [PubMed] [Google Scholar]
  15. The Gene Ontology Consortium. The gene ontology (GO) database and informatics resource. Nucleic Acids Res. 2004;32:D258–D261. doi: 10.1093/nar/gkh036. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Yoshikawa T, et al. New variants of the human and rat nuclear hormone receptor, tr4: expression and chromosomal localization of the human gene. Genomics. 1996:361–366. doi: 10.1006/geno.1996.0368. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

[Supplementary Data]
btn403_index.html (638B, html)

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES