PLOS One. 2013 Jun 11;8(6):e65265. doi: 10.1371/journal.pone.0065265

Prediction of Heterodimeric Protein Complexes from Weighted Protein-Protein Interaction Networks Using Novel Features and Kernel Functions

Peiying Ruan 1, Morihiro Hayashida 1,*, Osamu Maruyama 2, Tatsuya Akutsu 1,*
Editor: Claudio M Soares 3
PMCID: PMC3679142  PMID: 23776458

Abstract

Since many proteins express their functional activity by interacting with other proteins and forming protein complexes, it is very useful to identify sets of proteins that form complexes. For that purpose, many methods for predicting protein complexes from protein-protein interactions have been developed, such as MCL, MCODE, RNSC, PCP, RRW, and NWE. These methods have dealt only with complexes of size larger than three because they are often based on some measure of subgraph density. However, heterodimeric protein complexes, which consist of two distinct proteins, account for a large fraction of known complexes according to several comprehensive databases. In this paper, we propose several feature space mappings from protein-protein interaction data, in which each interaction is weighted based on its reliability. Furthermore, we make use of prior knowledge on protein domains to develop additional feature space mappings, a domain composition kernel, and a kernel combining it with our proposed features. We perform ten-fold cross-validation computational experiments. The results suggest that our proposed kernel considerably outperforms the naive Bayes-based method, which is the best existing method for predicting heterodimeric protein complexes.

Introduction

Protein complexes play crucial roles in a variety of biological processes, such as protein biosynthesis by ribosomes, molecular transmission, and the evolution of interactions between proteins. In fact, many proteins become functional only after they interact with their specific partners and are assembled into protein complexes. Hence, much effort has been devoted in bioinformatics to predicting protein complexes from protein-protein interaction (PPI) networks [1]–[6]. The Markov Cluster (MCL) algorithm [7] iteratively generates a matrix, called a Markov matrix, in which each row (and each column) corresponds to a protein and each element represents the relationship between two proteins. Then, MCL extracts clusters from the matrix. This algorithm is efficient even for large-scale networks because Markov matrices are calculated by matrix multiplication and exponentiation of individual elements. The Molecular Complex Detection (MCODE) algorithm [8] gives a weight to each vertex by using a modified clustering coefficient, which is defined as the edge density in a subset of neighboring vertices together with the originating vertex. Then, it finds densely connected regions of molecular interaction networks based on the weighted vertices. The Restricted Neighborhood Search Clustering (RNSC) algorithm [9] separates the set of vertices into clusters by searching locally in a randomized fashion based on a cost function. After that, the clusters are filtered according to cluster size, density, and functional homogeneity. The Protein Complex Prediction (PCP) algorithm [10] finds maximal cliques within PPI networks modified by using the functional similarity weight (FS-Weight) based on indirect interactions, and merges these cliques. These methods are intended for detecting dense subgraphs in a PPI network. Hence, they cannot find a protein complex of size two because the density of a single edge is always 1.0 and the subgraph (i.e., an edge) is itself a clique, even if the two interacting proteins do not form a complex. In addition, as pointed out in [11], a given overlap rate between a predicted protein complex and a small known complex is more likely to arise by chance than the same overlap rate with a larger known complex. Most prediction methods have been evaluated on protein complexes of size larger than three, excluding small complexes.

However, a large fraction of known protein complexes are heterodimeric protein complexes. CYC2008 [12], a comprehensive catalogue of 408 manually curated yeast protein complexes reliably supported by small-scale experiments, includes 172 (42%) heterodimeric protein complexes. In addition, the MIPS protein complex catalog [13], which provides detailed information on protein sequences from whole-genome analyses [14]–[16], contains 64 (29%) heterodimeric protein complexes, excluding complexes obtained from high-throughput experiments. Hence, it is necessary to develop a separate method for predicting smaller complexes. Qi et al. proposed a method using a supervised Bayesian classifier [17] that performs well for predicting protein complexes of middle sizes. The method still does not work well for heterodimeric protein complexes because it uses several features based on graph density and degree statistics. There are also approaches based on random walks on PPI networks. The Repeated Random Walks (RRW) method [18] repeatedly expands a focused cluster of proteins depending on the steady-state probability of random walks with restarts from the cluster, whose proteins are equally weighted. The Node-Weighted Expansion (NWE) method [19] is an extension of RRW. NWE restarts from the cluster whose proteins are weighted by the sum of the edge weights of the physical interactions with neighboring proteins, where the edge weights are obtained from the WI-PHI database [1]. Subsequently, Maruyama [11] proposed an approach based on a naive Bayes classifier using heterogeneous genomic data for predicting heterodimeric protein complexes, with features derived from protein-protein interaction data, gene expression data, and gene ontology annotations. This method outperforms other existing prediction methods, MCL, MCODE, RRW, and NWE, in F-measure for heterodimers [11], although those methods are not supervised.

To further improve the prediction accuracy for heterodimeric protein complexes, we propose a method using C-Support Vector Classification (C-SVC) with several features based on protein-protein interaction weights that are considered as reliability of interactions between proteins. The idea behind the design of feature space mappings is, for example, that the neighboring weights of a heterodimeric complex tend to be smaller than the weight inside of the complex. In addition to features based on weights, we propose feature space mappings based on the numbers of protein domains because those are considered to be functional and structural units in proteins. Furthermore, we propose a domain composition kernel based on the idea that two proteins having the same composition of domains as a heterodimeric protein complex would also form a heterodimer. We perform ten-fold cross validation, and calculate the average F-measures. The results suggest that our proposed kernel considerably outperforms the naive Bayes-based method, which is the best existing method.

Methods

The problem we address in this study is stated as follows: Given a network of protein-protein interactions, where interactions are weighted, determine whether or not two distinct interacting proteins form a protein complex of size exactly two. A network of protein-protein interactions can be considered as a graph, where vertices represent proteins and edges represent protein interactions. Let G(V, E) be an undirected graph with a set V of vertices and a set E of edges, where the weight of each edge $(P_i, P_j) \in E$ is denoted by $w_{ij}$ and represents the reliability and strength of the corresponding interaction. Specifically, we use the weights of the WI-PHI database [1] as edge weights; these are derived from heterogeneous data sources and were used in previous studies [11], [18], [19]. In this section, we propose several features for predicting heterodimeric protein complexes, a novel kernel matrix based on protein domain composition, and the combination kernel.

Feature Space Mapping Based on Interaction Weights

We propose simple feature space mappings, shown in Table 1, based on the weights of interactions, which are regarded as the reliabilities and strengths of protein-protein interactions. The basic idea for designing the features is as follows. The reliability of the interaction within a heterodimeric complex should be high. In addition, the reliability of an interaction between a protein contained in the complex and a protein not contained in the complex should be low. These features are not only applied to C-SVC through linear kernels but are transformed to other kernel matrices using extended diffusion and label sequence kernels.

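Table 1. Feature space mapping from two interacting proteins $P_i$, $P_j$ and their neighbors.

(F1) weight $w_{ij}$ of the interaction between $P_i$ and $P_j$
(F2) maximum of the neighboring weights
(F3) minimum of the neighboring weights
(F4) maximum of the smaller weights
(F5) maximum of the differences between the neighboring weights
(F6) maximum of the numbers of domains contained in $P_i$ and $P_j$
(F7) minimum of the numbers of domains contained in $P_i$ and $P_j$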

Consider two interacting proteins $P_i$ and $P_j$ corresponding to an input. Figure 1 shows an example of a subgraph with $P_i$, $P_j$, and their neighboring proteins $P_k$ such that $(P_i, P_k) \in E$ or $(P_j, P_k) \in E$, where interactions between these proteins are shown as edges. One feature is the weight $w_{ij}$ between proteins $P_i$ and $P_j$, denoted by (F1), because the proteins in a heterodimeric protein complex should interact with each other and the weight $w_{ij}$ should be large.

Figure 1. Example of a subgraph with an interacting protein pair and their neighboring proteins.

$P_i$ and $P_j$ denote the interacting proteins of interest, shown in the dashed rectangle. $P_k$ is a neighboring protein. $w_{ij}$ denotes the weight of the interaction between $P_i$ and $P_j$.

However, even if $w_{ij}$ is large, the proteins could be included in a complex of size larger than two. Hence, we consider the weights of interactions with the neighboring proteins $P_k$. Since the neighboring weights of a heterodimeric complex tend to be smaller than the weight inside the complex, we introduce the maximum of the neighboring weights, denoted by (F2), as a feature.

In contrast, if the neighboring weights are larger than the weight $w_{ij}$, we can expect that the proteins $P_i$ and $P_j$ would not form a complex but that neighboring proteins and either $P_i$ or $P_j$ would form some complex. Thus, we introduce the minimum of the neighboring weights, denoted by (F3).

Even if the maximum of the neighboring weights (F2) is large enough, the proteins $P_i$ and $P_j$ as well as $P_i$ and $P_k$ or $P_j$ and $P_k$ may form a heterodimeric complex. Consider the case in which a protein $P_k$ interacts with both $P_i$ and $P_j$. If the two weights $w_{ik}$ and $w_{jk}$ are large, the proteins $P_i$, $P_j$, and $P_k$ are likely to form a complex. On the other hand, if $w_{jk}$ is smaller than $w_{ij}$ and $w_{ik}$, the pairs $P_i$, $P_j$ and $P_i$, $P_k$ can each independently form a heterodimeric complex. For this reason, we introduce the maximum of the smaller weights, denoted by (F4).

In the discussion so far, we dealt only with the value of weights. However, differences between weights are also important for discriminating heterodimeric complexes. Hence, we introduce the maximum of differences between the neighboring weights denoted by (F5).

For the prediction of complexes, biological knowledge about proteins is helpful. We use protein domains, which are parts of proteins known as structural and functional units. Ozawa et al. introduced the domain structural constraint that one domain interacts with at most one other domain for verifying protein complexes [20]. The constraint excludes extra proteins from a set of proteins that is a candidate complex by validating possible interactions between domains. This means that extra domains cause interactions with other proteins and the actual number of proteins contained in the complex may be greater than that in the candidate set of proteins. Since two proteins with small numbers of domains tend to form a heterodimeric complex, we introduce the maximum of the numbers of domains contained in $P_i$ and $P_j$, denoted by (F6). In contrast, we introduce the minimum of the numbers of domains contained in $P_i$ and $P_j$, denoted by (F7), because proteins with large numbers of domains tend to form complexes of large sizes.
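The seven features described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments. It rests on several assumptions: the function names are hypothetical, a missing edge is treated as weight 0.0 when computing (F4) and (F5), "smaller weights" in (F4) is read as the smaller of the two weights to each neighbor, and "differences" in (F5) is read as the absolute difference of those two weights.

```python
from typing import Dict, FrozenSet, List, Set

Weights = Dict[FrozenSet[str], float]


def edge_weight(weights: Weights, a: str, b: str) -> float:
    """Weight of the edge {a, b}; a missing edge counts as 0.0 (assumption)."""
    return weights.get(frozenset((a, b)), 0.0)


def neighbors(weights: Weights, p: str) -> Set[str]:
    """All proteins that share an edge with p."""
    return {q for edge in weights if p in edge for q in edge if q != p}


def pair_features(weights: Weights, n_domains: Dict[str, int],
                  pi: str, pj: str) -> List[float]:
    """Return [F1, ..., F7] for the interacting pair (pi, pj)."""
    ks = (neighbors(weights, pi) | neighbors(weights, pj)) - {pi, pj}
    # Weights of the edges that actually exist between the pair and its neighbors.
    neighboring = [weights[frozenset((p, k))]
                   for k in ks for p in (pi, pj)
                   if frozenset((p, k)) in weights]
    f1 = edge_weight(weights, pi, pj)
    f2 = max(neighboring, default=0.0)
    f3 = min(neighboring, default=0.0)
    # For each neighbor, the smaller of its two weights, then the maximum (assumed reading of F4).
    f4 = max((min(edge_weight(weights, pi, k), edge_weight(weights, pj, k))
              for k in ks), default=0.0)
    # For each neighbor, the absolute difference of its two weights (assumed reading of F5).
    f5 = max((abs(edge_weight(weights, pi, k) - edge_weight(weights, pj, k))
              for k in ks), default=0.0)
    f6 = float(max(n_domains[pi], n_domains[pj]))
    f7 = float(min(n_domains[pi], n_domains[pj]))
    return [f1, f2, f3, f4, f5, f6, f7]


# Hypothetical toy network: A-B is the candidate pair, C and D are neighbors.
toy_weights: Weights = {
    frozenset(("A", "B")): 30.0,
    frozenset(("A", "C")): 10.0,
    frozenset(("B", "C")): 5.0,
    frozenset(("B", "D")): 20.0,
}
toy_domains = {"A": 1, "B": 3, "C": 2, "D": 1}
toy_features = pair_features(toy_weights, toy_domains, "A", "B")
```

In the toy network the pair weight (30.0) exceeds every neighboring weight, which is the pattern the features are designed to reward.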

Domain Composition Kernel

In the previous section, we introduced several feature space mappings from an example, that is, a pair of proteins. Kernel functions can incorporate prior knowledge. If a set of proteins has the same composition of domains as a known complex, it is highly expected that the set forms a complex. On the basis of this idea, we propose a domain composition kernel for candidate complexes $C$ and $C'$ of size $n$ ($n = 2$ in this paper), in which $C$ and $C'$ are regarded as sets of proteins, $\{P_1, \ldots, P_n\}$ and $\{P'_1, \ldots, P'_n\}$, respectively. We define the equivalence $P \equiv P'$ between two proteins $P$ and $P'$ to mean that $P$ consists of the same domains as $P'$, where the number of each domain must also be the same between the proteins. Furthermore, we define the equivalence $C \equiv C'$ between two sets of proteins $C$ and $C'$ using the equivalence between proteins by

$$C \equiv C' \iff \exists \sigma \in S_n \ \text{such that} \ P_k \equiv P'_{\sigma(k)} \ \text{for all} \ k = 1, \ldots, n, \tag{1}$$

where $S_n$ denotes the symmetric group of degree $n$ on the set $\{1, \ldots, n\}$ ($\sigma$ is a permutation of $\{1, \ldots, n\}$). For example, in the case of $C = \{P_1, P_2\}$ and $C' = \{P'_1, P'_2\}$, $C \equiv C'$ if $P_1 \equiv P'_1$ and $P_2 \equiv P'_2$, or $P_1 \equiv P'_2$ and $P_2 \equiv P'_1$, whereas it is not necessary that $C = C'$.

Then, we propose the domain composition kernel $K_D$ by

$$K_D(C, C') = \begin{cases} 1 & \text{if } C \equiv C', \\ 0 & \text{otherwise,} \end{cases} \tag{2}$$

where Inline graphic if Inline graphic holds, otherwise Inline graphic. It should be noted that our kernel is different from pairwise kernels for protein pairs proposed in [21]. Their kernel is defined as Inline graphic for predicting protein-protein interactions, where Inline graphic is called ‘genomic kernel’ and operates on individual genes or proteins. In the case of Inline graphic, that is, Inline graphic, Inline graphic if Inline graphic, otherwise Inline graphic, where Inline graphic. In addition, their pairwise kernels allow extra domains in a candidate complex because the domains do not prevent two proteins to interact with each other.
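The domain composition kernel can be sketched as follows. Because equivalence between proteins is an equivalence relation, a permutation that matches the proteins of two candidate complexes exists exactly when the two complexes have the same multiset of per-protein domain compositions, so comparing sorted canonical signatures is sufficient. The function names and the domain annotations below are hypothetical and serve only as an illustration.

```python
from collections import Counter
from typing import Iterable, Mapping, Sequence, Tuple

ProteinSignature = Tuple[Tuple[str, int], ...]


def protein_signature(domains: Iterable[str]) -> ProteinSignature:
    """Canonical form of one protein: sorted (domain, count) pairs."""
    return tuple(sorted(Counter(domains).items()))


def complex_signature(proteins: Sequence[str],
                      domain_map: Mapping[str, Sequence[str]]
                      ) -> Tuple[ProteinSignature, ...]:
    """Canonical form of a candidate complex: sorted protein signatures."""
    return tuple(sorted(protein_signature(domain_map[p]) for p in proteins))


def domain_composition_kernel(c1: Sequence[str], c2: Sequence[str],
                              domain_map: Mapping[str, Sequence[str]]) -> int:
    """1 if the two candidate complexes have the same domain composition, else 0."""
    same = complex_signature(c1, domain_map) == complex_signature(c2, domain_map)
    return 1 if same else 0


# Hypothetical domain annotations, for illustration only.
example_domains = {
    "A": ["SH3", "kinase"],
    "B": ["WD40"],
    "C": ["kinase", "SH3"],
    "D": ["WD40"],
    "E": ["WD40", "WD40"],
}
k_same = domain_composition_kernel(("A", "B"), ("D", "C"), example_domains)
k_diff = domain_composition_kernel(("A", "B"), ("C", "E"), example_domains)
```

Here `k_same` is 1 because A matches C and B matches D after swapping the order, while `k_diff` is 0 because E carries two copies of the WD40 domain and B only one.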

We can prove that Inline graphic is a kernel.

Theorem 1 Inline graphic defined by Eq. (2) is a positive semidefinite kernel.

Proof) We show that the Gram matrix K for a set of candidate complexes Inline graphic is positive semidefinite. The binary relation Inline graphic on the candidate set is an equivalence relation because for all Inline graphic, Inline graphic (reflexivity), if Inline graphic then Inline graphic (symmetry), if Inline graphic and Inline graphic then Inline graphic (transitivity). Then, the relation Inline graphic partitions Inline graphic into Inline graphic, and we have for any vector Inline graphic

graphic file with name pone.0065265.e112.jpg (3)
graphic file with name pone.0065265.e113.jpg (4)

It should be noted that Inline graphic if Inline graphic and Inline graphic are classified in the same set, otherwise Inline graphic. Consequently, K is positive semidefinite, and Inline graphic is a valid kernel. Inline graphic.
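The argument of the proof can be written out as follows. The notation is an assumption made for this sketch: $C_1, \ldots, C_N$ are the candidate complexes, $B_1, \ldots, B_m$ are the equivalence classes into which the equivalence relation partitions them, $K_{ij}$ is the entry of the Gram matrix for $C_i$ and $C_j$, and $x \in \mathbb{R}^N$ is an arbitrary vector. Since $K_{ij} = 1$ exactly when $C_i$ and $C_j$ lie in the same class and $K_{ij} = 0$ otherwise, the quadratic form splits into one square per class.

```latex
\begin{align*}
x^\top K x &= \sum_{i=1}^{N} \sum_{j=1}^{N} x_i x_j K_{ij} \\
 &= \sum_{l=1}^{m} \sum_{i:\, C_i \in B_l} \sum_{j:\, C_j \in B_l} x_i x_j \\
 &= \sum_{l=1}^{m} \Bigl( \sum_{i:\, C_i \in B_l} x_i \Bigr)^{2} \;\ge\; 0 .
\end{align*}
```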

In addition, for the purpose of predicting whether or not two interacting proteins form a heterodimeric complex, we combine some feature space mapping Inline graphic in Table 1 with the domain composition kernel by

graphic file with name pone.0065265.e121.jpg (5)

where Inline graphic is any kernel for real-valued vectors, and Inline graphic is a positive constant. In this paper, we use the linear kernel for Inline graphic, that is, Inline graphic.
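A combination of the two kernels can be sketched as below. The weighted-sum form used here is an assumption: the text states only that a kernel on the feature vectors (the linear kernel in this paper) is combined with the domain composition kernel through a positive constant. The names `alpha`, `combination_kernel`, and `gram_matrix` are hypothetical. A sum of positive semidefinite kernels with non-negative weights is again positive semidefinite, so this form yields a valid kernel.

```python
from typing import Hashable, List, Sequence


def linear_kernel(x: Sequence[float], y: Sequence[float]) -> float:
    """Inner product of two real-valued feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def combination_kernel(fx: Sequence[float], fy: Sequence[float],
                       k_domain: int, alpha: float) -> float:
    """Assumed form: linear kernel on features plus alpha times the 0/1 domain kernel."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative to keep the kernel valid")
    return linear_kernel(fx, fy) + alpha * k_domain


def gram_matrix(features: Sequence[Sequence[float]],
                signatures: Sequence[Hashable],
                alpha: float) -> List[List[float]]:
    """Gram matrix; signatures[i] is any canonical domain-composition key."""
    n = len(features)
    return [[combination_kernel(features[i], features[j],
                                1 if signatures[i] == signatures[j] else 0,
                                alpha)
             for j in range(n)] for i in range(n)]


# Hypothetical feature vectors and domain-composition keys.
example_gram = gram_matrix([[1.0, 2.0], [0.0, 1.0], [1.0, 2.0]],
                           ["s1", "s2", "s1"], alpha=0.5)
```

A matrix built this way could be passed to an SVM implementation that accepts precomputed kernels. With `alpha = 0` the sketch reduces to the linear kernel on the features alone.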

Computational Experiments

Data and Implementation

To perform computational experiments, we needed protein-protein interaction data with weights and protein complex data. As weighted protein-protein interaction data, we used the WI-PHI database [1], which includes 49607 protein pairs excluding self-interactions; the actual file name was ‘pro200600448_3_s.csv’ at the supporting information web page http://www.wiley-vch.de/contents/jc_2120/2007/pro200600448_s.html. The weights of interactions were calculated by the authors of WI-PHI as follows. They constructed the literature-curated physical interaction (LCPH) dataset using several databases such as BioGRID [2], MINT [3], and BIND [4], and high-throughput yeast two-hybrid data by Ito [22] and Uetz [23]. To evaluate high-throughput data, they constructed a benchmark dataset having interactions supported by two independent methods from LCPH-LS, which was a low-throughput dataset in LCPH, and calculated a log-likelihood score (LLS) for each dataset except LCPH-LS. For each interaction, the weight was calculated by multiplying the socioaffinity (SA) indices [15] and the LLSs from different datasets, where the SA index measures the log-odds score of the number of times two proteins are observed to interact relative to the expected value from their frequency in the dataset.

To compare our method with the naive Bayes-based method proposed by Maruyama [11], we prepared the same dataset as in the paper [11] from CYC2008 protein complex database [12], which is available at http://wodaklab.org/cyc2008/resources/CYC2008_complex.tab. In the dataset, a positive example was restricted to a pair of proteins that is included as a PPI in WI-PHI and is not a proper subset of any other complex in CYC2008. Thus, we used 152 heterodimeric protein complexes contained in CYC2008 as positive examples, and selected 5345 negative examples from interacting protein pairs in the CYC2008 complexes with size more than two, where positive examples were excluded. Figure 2 shows an example of complexes Inline graphic and Inline graphic consisting of four proteins Inline graphic and two proteins Inline graphic and Inline graphic, respectively. According to this figure, four sets of two proteins, Inline graphic, Inline graphic, Inline graphic, and Inline graphic are selected as negative examples, where each interaction between two proteins is confirmed to be included in WI-PHI. The set of two proteins Inline graphic is removed from the dataset. Since negative examples selected in this way are more difficult to be correctly predicted than randomly selected ones, this dataset is considered to be useful for the evaluation.
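The selection of positive and negative examples can be sketched as follows, reading the description literally: a positive example is a catalogued complex of size two that appears as a PPI and is not a proper subset of another complex, and a negative example is an interacting pair inside a complex of size larger than two that is not a positive example. The function name and the toy data are hypothetical. How a heterodimer nested inside a larger complex is treated is an assumption of this sketch.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set, Tuple

Pair = FrozenSet[str]


def build_examples(complexes: Iterable[Iterable[str]],
                   ppi: Set[Pair]) -> Tuple[Set[Pair], Set[Pair]]:
    """Return (positives, negatives) as sets of unordered protein pairs."""
    catalogue = [frozenset(c) for c in complexes]
    positives: Set[Pair] = set()
    for c in catalogue:
        # '<' on frozensets tests for a proper subset.
        nested = any(c < other for other in catalogue)
        if len(c) == 2 and c in ppi and not nested:
            positives.add(c)
    negatives: Set[Pair] = set()
    for c in catalogue:
        if len(c) > 2:
            for a, b in combinations(sorted(c), 2):
                pair = frozenset((a, b))
                if pair in ppi and pair not in positives:
                    negatives.add(pair)
    return positives, negatives


def _pair(a: str, b: str) -> Pair:
    return frozenset((a, b))


# Hypothetical catalogue: one complex of size four and two of size two.
toy_complexes = [{"A", "B", "C", "D"}, {"E", "F"}, {"A", "B"}]
toy_ppi = {_pair("A", "B"), _pair("A", "C"), _pair("B", "C"),
           _pair("C", "D"), _pair("E", "F"), _pair("A", "E")}
toy_pos, toy_neg = build_examples(toy_complexes, toy_ppi)
```

In the toy data the pair E-F becomes the only positive example, the pair A-E is ignored because it does not lie inside any complex, and the four interacting pairs inside the larger complex become negative examples.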

Figure 2. Illustration of the selection of negative examples from complexes with size more than two.

Complex Inline graphic consists of four proteins Inline graphic, whereas heterodimeric complex Inline graphic consists of Inline graphic and Inline graphic. Edges represent protein-protein interactions. According to this figure, four sets of two proteins, Inline graphic, Inline graphic, Inline graphic, and Inline graphic are selected as negative examples. The set of two proteins Inline graphic is removed from the dataset. Each pair of two proteins surrounded by a dashed curve corresponds to a negative example.

C-Support Vector Classification (C-SVC) for unbalanced data

Since the numbers of positive and negative examples in the dataset used in this paper were very unbalanced, we used the extension of C-Support Vector Classification (C-SVC) described in [24], [25]. The extended C-SVC solves the following optimization problem given input feature vectors $x_i$ and the corresponding classes $y_i \in \{+1, -1\}$:

$$\min_{w, b, \xi} \ \frac{1}{2} w^\top w + C^+ \sum_{i:\, y_i = +1} \xi_i + C^- \sum_{i:\, y_i = -1} \xi_i \quad \text{subject to} \quad y_i \left( w^\top \phi(x_i) + b \right) \ge 1 - \xi_i, \ \ \xi_i \ge 0,$$

where $\phi$ maps $x_i$ into the feature space induced by the kernel, $C^+$ and $C^-$ are regularization parameters for positive and negative classes, respectively, and in the usual C-SVC, $C^+ = C^-$.

We used ‘libsvm’ (version 3.11) [26] as an implementation of C-SVC for unbalanced data.
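To make the role of the two regularization parameters concrete, the following sketch evaluates the class-weighted soft-margin objective for a fixed linear classifier. It is an illustration under stated assumptions rather than a solver: the feature map is taken to be the identity, the slack of each example is computed as the hinge loss, and the names `c_pos` and `c_neg` are hypothetical stand-ins for the penalties of the positive and negative classes.

```python
from typing import Sequence


def weighted_csvc_objective(w: Sequence[float], b: float,
                            xs: Sequence[Sequence[float]],
                            ys: Sequence[int],
                            c_pos: float, c_neg: float) -> float:
    """0.5 * ||w||^2 plus class-weighted hinge slacks (labels are +1 or -1)."""
    regularizer = 0.5 * sum(v * v for v in w)
    penalty = 0.0
    for x, y in zip(xs, ys):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        slack = max(0.0, 1.0 - margin)  # smallest slack satisfying the constraint
        penalty += (c_pos if y == 1 else c_neg) * slack
    return regularizer + penalty


# Hypothetical data: one well-classified positive, one positive and one
# negative example inside the margin.
example_objective = weighted_csvc_objective(
    w=[1.0, 0.0], b=0.0,
    xs=[[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0]],
    ys=[1, 1, -1],
    c_pos=4.0, c_neg=1.0,
)
```

With a larger penalty on the positive class, a margin violation by a positive example costs more than the same violation by a negative example, which counteracts the imbalance. In libsvm's command-line tool, per-class penalties of this kind are, to our understanding, set through the `-c` and `-wi` options.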

Performance measure

To evaluate the performance of our method, we used precision, recall and F-measure, which are defined by

$$\text{precision} = \frac{TP}{TP + FP}, \tag{6}$$
$$\text{recall} = \frac{TP}{TP + FN}, \tag{7}$$
$$\text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \tag{8}$$

where $TP$, $FP$, and $FN$ denote the numbers of true positive, false positive, and false negative examples, respectively. Precision is the ratio of correctly predicted positive examples to examples predicted as positive, and recall is the ratio of correctly predicted positive examples to all positive examples. For the evaluation of binary predictors, it is not sufficient to calculate only the precision or only the recall, and thus we used the F-measure, which is their harmonic mean.
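The three measures can be computed directly from the counts of true positives, false positives, and false negatives. The sketch below returns 0.0 when a denominator is zero, which is a convention chosen here and not taken from the paper; the counts in the example are hypothetical.

```python
from typing import Tuple


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and their harmonic mean (F-measure)."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 98 of 152 positives found, with 60 false alarms.
example_p, example_r, example_f = precision_recall_f(tp=98, fp=60, fn=54)
```

The F-measure equals `2 * TP / (2 * TP + FP + FN)`, so it stays low whenever either the precision or the recall is low.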

Results

To evaluate our method, we used several sets of our proposed features, (F1–5), (F1–6), (F1–5,7), and (F1–7). For example, (F1–5) means that we use a feature vector consisting of the five values calculated by (F1), (F2), …, (F5) as shown in Table 1. Then, we calculated the combination kernel with the domain composition kernel as shown in Eq. (5), and employed C-SVC with varying mixing parameter Inline graphic and regularization parameters Inline graphic, Inline graphic. For each case, we performed 10-fold cross-validation using our combination kernel, and took the averages of precision, recall, and F-measure in the same way as in [11].
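The cross-validation protocol can be sketched as follows. The sketch assumes a plain shuffled partition into ten folds; whether the folds were stratified by class is not stated, so no stratification is applied here. The callback `evaluate_fold` is a hypothetical placeholder for training on the training indices and returning a score, such as the F-measure, on the held-out indices.

```python
import random
from typing import Callable, List, Sequence


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n: int,
                   evaluate_fold: Callable[[Sequence[int], Sequence[int]], float],
                   k: int = 10, seed: int = 0) -> float:
    """Average the per-fold score returned by evaluate_fold(train, test)."""
    scores = []
    for held_out in k_fold_indices(n, k, seed):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        scores.append(evaluate_fold(train, held_out))
    return sum(scores) / len(scores)


# 152 positive and 5345 negative examples give 5497 examples in total.
example_folds = k_fold_indices(152 + 5345, k=10)
```

Repeating `cross_validate` for every combination of the mixing and regularization parameters and keeping the combination with the highest average F-measure corresponds to the parameter sweep described above.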

Figure 3 shows the average F-measures obtained using four sets of features, (F1–5), (F1–6), (F1–5,7), (F1–7), and the domain composition kernel for the cases of Inline graphic, Inline graphic, Inline graphic (see Fig. S1 for more cases of Inline graphic and Inline graphic). We can see from these figures that the average F-measures for Inline graphic were about Inline graphic to Inline graphic and were better than that of Inline graphic in each case. This means that the domain composition kernel enhanced the prediction accuracy compared with using the features alone. Furthermore, features (F1–7) tended to have better average F-measures than the other sets of features.

Figure 3. Result on the average F-measures using four sets of features and the domain composition kernel with Inline graphic.

C-SVC was employed with regularization parameters, Inline graphic, Inline graphic. As sets of features, (F1–5), (F1–6), (F1–5,7), and (F1–7) shown in Table 1 were used.

Table 2 shows the average precision, recall, and F-measure obtained using our features and the domain composition kernel in the case with the best average F-measure for each set of features. It also shows the results of the naive Bayes-based method [11], which is the best existing method for heterodimeric complex prediction, MCL [7], MCODE [8], RRW [18], and NWE [19]. (B1), (B2:CC), …, (B6) indicate the features used in the naive Bayes-based method (shown also in Table 3). These existing methods were executed using default parameters except for the option of the minimum size of predicted complexes, which was set to two if possible. For the sets of features (F1–5), (F1–6), (F1–5,7), and (F1–7), the average F-measures in the cases of Inline graphic, Inline graphic, Inline graphic, and Inline graphic were best, respectively. In particular, the average F-measure for (F1–7) using Inline graphic was the best among all the cases, and was much better than that of the naive Bayes-based method. We investigated which feature contributed most to the prediction accuracy. The discriminant function for SVM with a linear kernel can be represented as Inline graphic. Here we suppose that elements Inline graphic of w are the coefficients of the corresponding features (F1), …, (F7), respectively. If each element of x is normalized, features with the largest absolute value of Inline graphic can be considered effective for the discrimination among the seven features. We calculated the coefficients and the averages of the feature values using Inline graphic and the dataset with 152 positive and 5345 negative examples. Thus, we had the coefficients Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic, Inline graphic, and the averages Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic. Then, Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic, and the order was (F4), (F1), (F3), (F5), (F7), (F2), (F6) in descending order of Inline graphic. We can see that (F4) was the most effective and contributed negatively to the discrimination, whereas (F6) was the least effective; in fact, the decrease in the average F-measure caused by removing (F6) from (F1–7) was small, as shown in Table 2. It should be noted that this result does not necessarily mean that supervised methods such as the naive Bayes-based method and our proposed method are always better than unsupervised methods such as MCL and MCODE, because the unsupervised methods were evaluated using the whole PPI data whereas the supervised methods were trained and evaluated via cross-validation using a part of the PPI data. Therefore, unsupervised methods may work better in other situations.
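The ranking of features by their contribution can be sketched as below. The sketch assumes that the contribution of a feature is measured by the absolute value of the product of its linear-SVM coefficient and its average value over the dataset, which is our reading of the description above. The coefficients and averages in the example are hypothetical and are not the values obtained in the experiments.

```python
from typing import List, Sequence, Tuple


def rank_features(names: Sequence[str],
                  coefficients: Sequence[float],
                  averages: Sequence[float]) -> List[Tuple[str, float]]:
    """Sort features by |coefficient * average|, largest first."""
    contributions = [(name, c * a)
                     for name, c, a in zip(names, coefficients, averages)]
    return sorted(contributions, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical values for three features.
example_ranking = rank_features(["F1", "F2", "F3"],
                                coefficients=[0.5, -2.0, 0.1],
                                averages=[10.0, 4.0, 3.0])
```

A negative product at the top of the ranking, as for the second feature in this toy example, indicates a feature that pushes the decision toward the negative class.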

Table 2. Result on the average precision, recall, and F-measure using our features and domain composition kernel in the best average F-measure case for each set of features.

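| method | features | Inline graphic | Inline graphic | Inline graphic | precision | recall | F-measure |
|---|---|---|---|---|---|---|---|
| Our combination kernel | F1–5 | 0.6 | 0.7 | 4.0 | 0.586 | 0.659 | 0.620 |
| | F1–6 | 0.7 | 0.8 | 3.5 | 0.566 | 0.677 | 0.616 |
| | F1–5,7 | 0.6 | 0.7 | 4.0 | 0.592 | 0.667 | 0.627 |
| | F1–7 | 0.5 | 1.0 | 4.0 | 0.618 | 0.644 | 0.631 |
| naive Bayes [11] | B1, B2:CC | | | | 0.24 | 0.44 | 0.31 |
| | B1–6 | | | | 0.17 | 0.65 | 0.27 |
| MCL [7] | | | | | 0.017 | 0.023 | 0.020 |
| MCODE [8] | | | | | 0 | 0 | |
| RRW [18] | | | | | 0.030 | 0.32 | 0.055 |
| NWE [19] | | | | | 0.035 | 0.33 | 0.063 |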

As sets of features, (F1–5), (F1–6), (F1–5,7), and (F1–7) shown in Table 1 were used. The results of the naive Bayes-based method [11], MCL [7], MCODE [8], RRW [18], and NWE [19] are also shown, where the experiments for these methods were performed in [11]. (B1), (B2:CC), …, (B6) indicate the features used in [11] (shown also in Table 3).

Table 3. Feature space mapping from two interacting proteins Inline graphic, Inline graphic in the naive Bayes-based method [11].

(B1) Inline graphic
(B2:X) Inline graphic, where Inline graphic represents
an ontology among biological process (BP), cellular component (CC) and molecular
function (MF) of Gene Ontology [27], and is also regarded to be the set of the terms;
Inline graphic, where Inline graphic is the set of all terms in Inline graphic annotating
both Inline graphic and Inline graphic, and Inline graphic is the set of proteins annotated by term Inline graphic.
(B3) Inline graphic, where Inline graphic and
Inline graphic is the stationary probability from Inline graphic to Inline graphic by a random walk with restarts
at Inline graphic (RRW [18]).
(B4) Inline graphic, where Inline graphic is the Pearson
correlation coefficient between the two genes producing Inline graphic and Inline graphic, respectively, over
some gene expression profiles.
(B5) Inline graphic
(B6) Inline graphic

Figures 4, 5, and 6 show the average precision, recall, and F-measure with varying Inline graphic, Inline graphic, and Inline graphic, respectively, in the case of Inline graphic using features (F1–7). We can see that, in the examined range, the average F-measures did not fluctuate greatly.

Figure 4. Result on the average precision, recall, and F-measure with varying Inline graphic in the best case using features (F1–7).

Figure 5. Result on the average precision, recall, and F-measure with varying Inline graphic in the best case using features (F1–7).

Figure 6. Result on the average precision, recall, and F-measure with varying Inline graphic in the best case using features (F1–7).

In addition, we performed another experiment to validate our method on the remaining PPIs; that is, we used the 152 positive and 5345 negative examples as training data, and used the remaining 44110 examples as test data. We then obtained a prediction accuracy of 98.7% (43554/44110) using the combination kernel with (F1–7) and Inline graphic. These results suggest that our proposed kernel successfully predicts heterodimeric protein complexes and outperforms the naive Bayes-based method.
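The numbers reported for this experiment can be checked with a few lines of arithmetic. The figures are those stated in the text; only the variable names are introduced here.

```python
total_pairs = 49607      # WI-PHI protein pairs, self-interactions excluded
positives = 152          # heterodimeric complexes used for training
negatives = 5345         # negative examples used for training
correct = 43554          # correctly predicted test examples

# Pairs left over after removing the training examples.
remaining = total_pairs - positives - negatives
accuracy = correct / remaining
accuracy_percent = round(100.0 * accuracy, 1)
```

The remaining pairs number 44110, which agrees with the size of the test set, and the accuracy rounds to 98.7%.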

Conclusions

We proposed several feature space mappings using weights of protein-protein interactions for predicting heterodimeric protein complexes. In addition, we proposed the domain composition kernel based on the idea that two proteins having the same composition of domains as a heterodimeric protein complex would also form a heterodimer, and proved that the domain composition kernel is indeed a kernel function. To validate our proposed method, we performed ten-fold cross-validation computational experiments for the combination kernel of the domain composition kernel with the linear kernel using several sets of features. The results suggest that our proposed kernel considerably outperforms the naive Bayes-based method, which is the best existing method, although our proposed method is limited to the prediction of heterodimeric protein complexes. This holds even when only the feature space mappings (F1–5) from weights of protein-protein interactions are used, that is, when (F6) and (F7) are not used and the mixing parameter is 0.

An important contribution of this paper is that we have shown that heterodimeric protein complexes can be successfully predicted using only information on the weights of protein-protein interactions. Furthermore, we showed that the use of protein domain information enhances the prediction accuracy.

There is room to further improve the prediction accuracy. For instance, we can develop kernels on protein domains using protein amino acid sequences and multiple sequence alignments. In addition, we can add new features based on other biological knowledge.

We used the C-SVC classifier, which is a variant of support vector machines, because the numbers of positive and negative examples were not balanced. It is interesting future work to develop more robust methods against unbalanced data for classifying heterodimeric protein complexes.

Supporting Information

Figure S1

Result on the average F-measures using four sets of features and the domain composition kernel with Inline graphic. C-SVC was employed with regularization parameters, Inline graphic, Inline graphic. As sets of features, (F1–5), (F1–6), (F1–5,7), and (F1–7) shown in Table 1 were used.

(EPS)

Figure S2

Result on the average F-measures using four sets of features and the domain composition kernel represented by Eq. (S1) with Inline graphic. C-SVC was employed with regularization parameters, Inline graphic, Inline graphic. As sets of features, (F1–5), (F1–6), (F1–5,7), and (F1–7) were used.

(EPS)

Table S1

Result on the average precision, recall, and F-measure using our combination kernel represented by Eq. (S1) in the best average F-measure case for each set of features. As sets of features, (F1–5), (F1–6), (F1–5,7), and (F1–7) were used.

(PDF)

Text S1

Results on our kernel by another combination.

(PDF)

Funding Statement

This work was partially supported by Grants-in-Aid #22240009 and #24500361 from MEXT, Japan (http://www.mext.go.jp/english/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. No additional external funding was received for this study.

References

  • 1. Kiemer L, Costa S, Ueffing M, Cesareni G (2007) WI-PHI: A weighted yeast interactome enriched for direct physical interactions. Proteomics 7: 932–943.
  • 2. Stark C, Breitkreutz B, Reguly T, Boucher L, Breitkreutz A, et al. (2006) BioGRID: a general repository for interaction datasets. Nucleic Acids Research 34: D535–D539.
  • 3. Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, et al. (2002) MINT: a Molecular INTeraction database. FEBS Letters 513: 135–140.
  • 4. Alfarano C, Andrade C, Anthony K, Bahroos N, Bajec M, et al. (2005) The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Research 33: D418–D424.
  • 5. Sapkota A, Liu X, Zhao XM, Cao Y, Liu J, et al. (2011) DIPOS: database of interacting proteins in Oryza sativa. Molecular BioSystems 7: 2615–2621.
  • 6. Zhao XM, Zhang XW, Tang WH, Chen L (2009) FPPI: Fusarium graminearum protein-protein interaction database. J Proteome Res 8: 4714–4721.
  • 7. Enright A, Dongen SV, Ouzounis C (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research 30: 1575–1584.
  • 8. Bader GD, Hogue CW (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4: 2.
  • 9. King A, Pržulj N, Jurisica I (2004) Protein complex prediction via cost-based clustering. Bioinformatics 20: 3013–3020.
  • 10. Chua H, Ning K, Sung WK, Leong H, Wong L (2008) Using indirect protein-protein interactions for protein complex prediction. Journal of Bioinformatics and Computational Biology 6: 435–466.
  • 11. Maruyama O (2011) Heterodimeric protein complex identification. In: ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011: 499–501.
  • 12. Pu S, Wong J, Turner B, Cho E, Wodak S (2009) Up-to-date catalogues of yeast protein complexes. Nucleic Acids Research 37: 825–831.
  • 13. Mewes HW, Amid C, Arnold R, Frishman D, Guldener U, et al. (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Research 32: D41–D44.
  • 14. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415: 180–183.
  • 15. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, et al. (2006) Proteome survey reveals modularity of the yeast cell machinery. Nature 440: 631–636.
  • 16. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, et al. (2006) Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440: 637–643.
  • 17. Qi Y, Balem F, Faloutsos C, Klein-Seetharaman J, Bar-Joseph Z (2008) Protein complex identification by supervised graph local clustering. Bioinformatics 24: i250–i258.
  • 18. Macropol K, Can T, Singh A (2009) Repeated random walks on genome-scale protein networks for local cluster discovery. BMC Bioinformatics 10: 283.
  • 19. Maruyama O, Chihara A (2010) NWE: Node-weighted expansion for protein complex prediction using random walk distances. In: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM2010): 590–594.
  • 20. Ozawa Y, Saito R, Fujimori S, Kashima H, Ishizaka M, et al. (2010) Protein complex prediction via verifying and reconstructing the topology of domain-domain interactions. BMC Bioinformatics 11: 350.
  • 21. Ben-Hur A, Noble W (2005) Kernel methods for predicting protein-protein interactions. Bioinformatics 21: i38–i46.
  • 22. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 98: 4569–4574.
  • 23. Uetz P, Giot L, Cagney G, Mansfield T, Judson R, et al. (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403: 623–627.
  • 24. Osuna E, Freund R, Girosi F (1997) Support vector machines: Training and applications. In: AI Memo 1602, Massachusetts Institute of Technology.
  • 25. Vapnik V (1998) Statistical Learning Theory. Wiley-Interscience.
  • 26. Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2: 27:1–27:27.
  • 27. Gene Ontology Consortium (2008) The Gene Ontology project in 2008. Nucleic Acids Research 36: D440–D444.

Articles from PLoS ONE are provided here courtesy of PLOS
