Abstract
Recognition of indirect interactions is instrumental to in silico reconstruction of signaling pathways and sheds light on the exploration of unknown physical paths between two indirectly interacting genes. However, very few computational methods have so far explicitly exploited indirect interactions that are supported by experimental evidence. In this work, we attempt to distinguish direct from indirect interactions in human functional protein-protein interaction (PPI) networks via a predictive l2-regularized logistic regression model built on experimental data. The l2-regularized logistic regression method is adopted to counteract the potential homolog noise and reduce the computational complexity on large training data. Computational results show that the proposed model achieves promising performance even though the training data are highly skewed. From the 304,799 PPIs that are curated in several databases, the proposed method detects 23,131 indirect interactions, for most of which the breadth-first graph search algorithm finds dozens of physical paths between the interacting partners. Pathway enrichment analysis shows that most of the physical paths can be mapped onto more than one human signaling pathway, indicating that a series of biochemical signals does exist between the two indirectly interacting genes. The interactome-scale computational results promise to provide useful cues for the following applications: (1) exploration of unknown physical PPIs or physical paths between two indirectly interacting genes; (2) amendment or extension of the existing signaling pathways; (3) recognition of the physical PPIs for druggable target discovery.
Introduction
Biochemical signals are generally transduced along a cascade of physical protein-protein interactions (PPIs), which further mediate a variety of life processes, e.g. gene expression and regulation, molecule/ion transport, protein modification and synthesis, complex metabolic processes, etc. At present, Yeast 2-Hybrid (Y2H) [1] and Affinity Purification followed by Mass Spectrometry (AP-MS) [2] are the two major experimental techniques to identify direct or physical PPIs. However, these two techniques are prone to yield a fraction of indirect interactions [3]. Take AP-MS for example: a fraction of co-purified prey proteins are essentially the indirect interaction partners of the bait protein. The bait protein does not directly interact with these prey proteins, but interacts with them through a chain of physical interactions. Distinguishing direct interactions from indirect interactions remains a challenging problem in experimental and computational biology.
Compared to Y2H and AP-MS, the other indirect high-throughput experimental techniques such as genetic neighbourhood, gene co-expression, co-occurrence, fusion, etc. are likely to yield a much larger number of indirect interactions. The functional PPI database STRING [4] assigns confidence scores to the evidence derived by a variety of techniques, and combines these scores to evaluate the confidence level of the interactions. Besides STRING, other databases including Reactome [5], IntAct [6] and BioGrid [7] have also curated a fraction of functional or indirect PPIs.
In recent years, numerous computational methods have been developed for rapid reconstruction of protein-protein interaction networks, ranging from intra-species PPI networks [8–13] to inter-species PPI networks [14–18], none of which explicitly distinguishes direct interactions from indirect interactions. Furthermore, the definitions of indirect or functional interactions are rather vague. For instance, Wu et al. [19] train a Naive Bayes classifier on the data obtained from Reactome [5] for functional PPI prediction, where functional interaction is defined as the relationship that two proteins are involved in the same biochemical reaction as an input, catalyst, activator, or inhibitor, or as two members of the same protein complex. In [20], indirect interaction is defined as the relationship between two arbitrary nonadjacent genes in the same pathway, which does not necessarily hold true in most cases. For instance, gene b as the source gene transmits some kind of signal (e.g. activation/inhibition) to gene g as the sink gene via a cascade of physical PPIs including gene e, resulting in the activation/inhibition/expression of gene g. In such a case, gene b has an indirect effect on gene g, while gene e is merely an intermediate step of the signal transduction. Gene e alone, or as a source gene, cannot affect gene g. As such, the relationship of indirect interaction between two genes should not be arbitrarily determined. At the least, indirect interaction indicates a causal or regulatory relationship in which the source gene causes a signal event and the signal travels through a series of intermediate genes to cause an indirect effect on the sink gene, e.g. gene expression, translation, activation/inhibition, catalytic reaction, biochemical modification, etc. The indirect effect can be observed and determined by experimental techniques. In fact, Reactome [5] and KEGG [21] have curated a small fraction of indirect interactions with experimental evidence.
To date, computational methods have seldom explicitly distinguished direct interactions from indirect interactions. Soong et al. [22] extract the physical PPIs from DIP [23] as the positive training data and randomly sample the negative data from the protein pair space to train an SVM classifier. Since the negative training data contain both indirect interactions and non-interactions, the method cannot explicitly recognize indirect interactions. To solve this problem, Elefsinioti et al. [20] train a three-class classifier to predict physical interactions, indirect interactions and non-interactions. The training data of physical interactions are extracted from HPRD [24], the training data of indirect interactions are randomly sampled from those protein pairs in the same pathways that are not located next to each other, and the training data of non-interactions are randomly sampled from those protein pairs that neither are located in the same pathway nor occur in the existing PPI databases. Kim et al. [3] develop a graph method to discriminate direct interactions from indirect interactions. In their method, the proteins within one densely-connected clique are assumed to physically interact and those proteins that are remotely connected are assumed to indirectly interact. These methods share the assumption that any arbitrary nonadjacent protein/gene pair in the same pathway can be viewed as an indirect interaction. This assumption, which imposes no cause-effect constraint, is disputable and open to discussion.
In this work, we exploit experimental data to train an l2-regularized logistic regression model [25, 26] that distinguishes direct from indirect interactions at the interactome scale. In the context of this work, indirect interaction is defined as a regulatory or cause-effect relationship of two genes/proteins, between which a series of intermediate genes/proteins function to relay biochemical signals. Gene ontology (GO) terms are used as features to represent protein pairs. To handle the issues of GO sparsity and null-feature vectors, homolog knowledge transfer is conducted by treating the homolog knowledge as independent homolog instances. L2-regularized logistic regression [25, 26] is adopted here to counteract the homolog noise and to reduce the computational complexity on the large training data. The proposed model is evaluated using 5-fold cross validation and an independent test, and is then applied to classify the interactome-scale human PPIs into direct interactions and indirect interactions. To verify the credibility of the computational results, we further use the breadth-first graph search algorithm to find the physical paths between any gene/protein pair that is predicted to indirectly interact. In addition, we conduct pathway enrichment analysis to find supporting evidence that a series of signal flows does exist between the indirectly interacting genes.
Materials and methods
Data and materials
The physical PPIs are extracted from HPRD [24] and BioGrid [7]. After removing those obsolete/uncurated proteins and those proteins that have no corresponding gene names in Uniprot (http://www.uniprot.org/uniprot/), we obtain 9,991 common physical PPIs between HPRD and BioGrid in total, and use them as the positive training data (see Table 1). It is assumed that the commonly curated physical PPIs in HPRD and BioGrid are of relatively higher quality. The remaining physical PPIs in HPRD and BioGrid amounting to 25,255 and 20,814, respectively, are used as the positive independent test data (see Table 2).
Table 1.
Results of 5-fold cross validation.
| | Size | Combined-instance | | | Homolog-instance | | | Target-instance | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | SP | SE | MCC | SP | SE | MCC | SP | SE | MCC |
| Physical | 9,991 | 0.9660 | 0.9651 | 0.8608 | 0.9658 | 0.9600 | 0.8532 | 0.9623 | 0.9629 | 0.8508 |
| Indirect | 2,586 | 0.8659 | 0.8689 | 0.8352 | 0.8495 | 0.8692 | 0.8248 | 0.8570 | 0.8550 | 0.8212 |
| [Acc; MCC] | | [94.53%; 0.8960] | | | [94.13%; 0.8888] | | | [94.13%; 0.8888] | | |
| [ROC-AUC] | | [0.9758] | | | [0.9744] | | | [0.9757] | | |
| F1 score | | 0.9655 | | | 0.9629 | | | 0.9626 | | |
| *hPRINT (AUC score) [20] | | Random forest | | | Naive Bayes | | | SVM | | |
| | | 0.849 | | | 0.793 | | | 0.777 | | |
Table 2.
Results of independent test and interactome-scale prediction.
| | Independent test: physical interaction (HPRD) | Independent test: physical interaction (BioGrid) | Independent test: indirect interaction (KEGG) | Interactome-scale prediction |
|---|---|---|---|---|
| Size | 25,255 | 20,814 | 41 | 304,799 |
| Recognition rate | 93.67% | 98.00% | 75.61% | 7.59% |
The indirect interactions are extracted from Reactome [5] and KEGG [21]. The number of indirect interactions is much smaller than that of physical interactions. From Reactome and KEGG, we obtain 2,627 unique indirect interactions, among which 41 are extracted from KEGG and 2,586 are extracted from Reactome. To reduce the skewness of the training data, i.e. the ratio of the positive class size to the negative class size, we keep all 2,586 indirect interactions from Reactome as the negative training data, so that the positive class contains 9,991 examples and the negative class contains 2,586 examples (see Table 1). The 41 indirect interactions from the relatively independent KEGG are accordingly used as the negative independent test set (see Table 2), which is admittedly small. This is a compromise between the training data and the independent test data. The more indirect interactions we choose as the negative independent test data, the fewer indirect interactions are available as the negative training data, and the higher the risk of model bias caused by the skewed distribution of the training data. Fortunately, all 2,586 negative training examples also take part in model evaluation during five-fold cross validation. The independent test and the cross validation can thus jointly assess the proposed computational framework.
The interactome-scale prediction set is taken from STRING [4], Reactome [5], IntAct [6] and KEGG [21]. To ensure the quality of the data, we only choose those PPIs with experimental evidence. After excluding those PPIs that are already contained in the training set or the independent test set, we obtain 304,799 unique PPIs in total as the prediction set (see Table 2).
Multi-instance feature construction
Feature construction is a significant step of the machine learning approach to biological problems. Protein sequence is the cheapest and most easily available source for feature extraction [13–14]. The commonly used k-mer sequence features are simple and readily capture the contextual information of specific residues. Nevertheless, k-mer features demonstrate poor or moderate performance for PPI prediction [27], partly because they cannot properly handle the following problems: (1) the lengths of motifs are not fixed but vary over a large range; (2) the degree of conservation varies across the residues in motifs; (3) the order of motifs along the protein sequence is important to protein structure and function, but k-mer features cannot capture this information. Gene expression or gene co-expression to some extent depicts the characteristics of a protein or protein interaction at the genomic level [9], but there is a long path from genome to proteome. For instance, a gene transcript has to go through a series of biochemical reactions (e.g. post-transcriptional modifications) and biophysical changes (e.g. conformational folding) to form a functionally and structurally specific protein. Structural information is highly accurate and valuable to depict proteins and infer protein-protein interactions [16], but its high experimental cost restricts its application to most proteins.
Gene ontology (GO) is a hierarchically organized and controlled vocabulary to characterize gene products [28] that is composed of three aspects, i.e. biological processes (BP), cellular components (CC) and molecular functions (MF). The annotations of the three aspects of genes or gene products are provided in terms of GO terms in GOA [29]. Gene ontology has been claimed to be the most discriminative feature to predict protein-protein interactions [12] and has witnessed many applications in protein pair representation and PPI prediction [8–12, 14–18]. As compared to the other features, gene ontology provides experimental biochemical and biophysical evidences to comprehensively depict proteins, which is the major reason why GO features achieve overwhelming predictive performance. Nevertheless, there are many less-studied genes whose GO knowledge is rather limited. In the worst case, we have no GO knowledge about the genes concerned at all. To tackle the problem of GO sparsity, we depict a gene/protein with two instances, namely the target instance and the homolog instance. The target instance depicts the GO knowledge of the gene/protein itself, while the homolog instance depicts the GO knowledge of the homologs. The homolog instance is used to substitute the target instance when the gene/protein is not annotated.
The homologs are extracted from SwissProt [30] using PSI-BLAST [31] (E-value=10) against all species, and the GO terms are extracted from GOA [29]. We do not choose a low E-value because we need more homolog GO terms to counteract GO term sparsity. For each query protein, the homologs show varying sequence identity: high sequence identity indicates a close homolog, while low sequence identity indicates a remote homolog. We compute the distribution of sequence identity of the homologs (see Supplementary Figure S1). The sequence identity of the homologs is largely distributed within [40%, 50%], [30%, 40%] and [20%, 30%]. Below 20%, very few or no homologs are detected. At present, it is hard to determine the threshold of sequence similarity for true homologs. We may perhaps safely treat homologs with similarity >40% as significant, those with similarity <20% as insignificant and those with similarity within [20%, 40%] as moderate. Here we treat all the results from the BLAST search as homologs to enrich the GO feature information. Such a high E-value inevitably introduces a certain level of homolog noise, which is the reason that we adopt the l2-regularized version of logistic regression.
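For each protein i in the training set U, we obtain two sets of GO terms: one set contains the GO terms of the homologs (denoted as $S_H^i$), and the other set contains the GO terms of the protein itself (denoted as $S_T^i$). As such, the complete set of GO terms of the training set (denoted as S) is defined as follows:
| $S = \bigcup_{i \in U} \left( S_T^i \cup S_H^i \right)$ | (1) |
Two feature vectors for the protein pair $(i_1, i_2)$ are formally defined as follows:
| $V_T^{(i_1,i_2)}[g] = \begin{cases} 0, & g \notin S_T^{i_1} \wedge g \notin S_T^{i_2} \\ 2, & g \in S_T^{i_1} \wedge g \in S_T^{i_2} \\ 1, & \text{otherwise} \end{cases} \qquad V_H^{(i_1,i_2)}[g] = \begin{cases} 0, & g \notin S_H^{i_1} \wedge g \notin S_H^{i_2} \\ 2, & g \in S_H^{i_1} \wedge g \in S_H^{i_2} \\ 1, & \text{otherwise} \end{cases}$ | (2) |
For each GO term $g \in S$, $V_T^{(i_1,i_2)}[g]$ denotes the component g of the target instance $V_T^{(i_1,i_2)}$ and $V_H^{(i_1,i_2)}[g]$ denotes the component g of the homolog instance $V_H^{(i_1,i_2)}$. Those GO terms $g \notin S$ are discarded. The formula means that if both proteins of the pair possess the GO term g, then the corresponding component in the feature vector $V_T^{(i_1,i_2)}$ or $V_H^{(i_1,i_2)}$ is set to 2; if neither protein in the pair possesses the GO term g, then the value is set to 0; otherwise the value is set to 1.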
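The 0/1/2 encoding described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names, the protein identifiers `P1` and `P2`, and the toy annotation dictionaries are hypothetical and are not taken from the paper's implementation.

```python
from typing import Dict, List, Set


def build_vocabulary(*annotation_maps: Dict[str, Set[str]]) -> List[str]:
    """Union of all GO terms seen in the given annotation maps (the set S)."""
    terms: Set[str] = set()
    for annotations in annotation_maps:
        for go_terms in annotations.values():
            terms |= go_terms
    return sorted(terms)  # fixed order so every vector shares the same layout


def pair_feature_vector(go_a: Set[str], go_b: Set[str],
                        vocabulary: List[str]) -> List[int]:
    """2 if both proteins carry the term, 1 if exactly one does, 0 if neither."""
    return [int(g in go_a) + int(g in go_b) for g in vocabulary]


# Toy annotations (hypothetical): GO terms of the proteins themselves ...
target_go = {"P1": {"GO:0005634", "GO:0005515"}, "P2": {"GO:0005515"}}
# ... and GO terms transferred from their homologs.
homolog_go = {"P1": {"GO:0005634", "GO:0006355"},
              "P2": {"GO:0006355", "GO:0005737"}}

vocab = build_vocabulary(target_go, homolog_go)
v_target = pair_feature_vector(target_go["P1"], target_go["P2"], vocab)
v_homolog = pair_feature_vector(homolog_go["P1"], homolog_go["P2"], vocab)
print(vocab)
print(v_target, v_homolog)
```

Because the two instances are built over the same vocabulary, the homolog instance can stand in for the target instance whenever the target vector would be all zeros.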
Fast training and noise tolerance via l2-regularized logistic regression
In this work, we conduct homolog knowledge transfer via homolog instances to enrich the feature information and to substitute the target instance when the gene/protein is not annotated. However, the homolog instances potentially introduce a certain level of evolutionary noise and increase the computational complexity via data augmentation. Hence we need a computational framework that can counteract noise and train on large data at low computational cost. Regularization techniques based on statistical learning theory, e.g. Support Vector Machine (SVM) [32], penalize potential overfitting to noise so as to achieve a trade-off between the training error and the generalization error. Unfortunately, kernel computation on large training data incurs a time complexity that grows at least quadratically with the number of training instances. In this work, the training data contain both the target instances and the homolog instances, and so large a training set demands a fast computational method. Classic logistic regression can fit a large training data set with a low computational complexity, but it also fits noise. Here we adopt the l2-regularized logistic regression method [25], implemented in the toolbox LIBLINEAR [26], to counteract the homolog noise and fit the large training data in linear time.
Given a set of instance-label pairs $(x_i, y_i)$, $i = 1, \dots, l$, with $x_i \in R^n$ and $y_i \in \{-1, +1\}$, l2-regularized logistic regression solves the following unconstrained optimization problem:
| $\min_{\omega} \; \frac{1}{2}\omega^T\omega + C\sum_{i=1}^{l}\log\left(1 + e^{-y_i\omega^T x_i}\right)$ | (3) |
where ω denotes the weight vector and C denotes the penalty parameter that balances the regularization term (the first term) against the training loss (the second term), so that fitting of potential noise/outliers is penalized. The optimization of objective function (3) is solved via its dual form:
| $\min_{\alpha} \; \frac{1}{2}\alpha^T Q\alpha + \sum_{i:\alpha_i>0}\alpha_i\log\alpha_i + \sum_{i:\alpha_i<C}(C-\alpha_i)\log(C-\alpha_i) - \sum_{i=1}^{l}C\log C \quad \text{subject to } 0 \le \alpha_i \le C,\; i = 1, \dots, l$ | (4) |
where $\alpha_i$ denotes the Lagrange multiplier and $Q_{ij} = y_i y_j x_i^T x_j$. To help readers see the connections between the classical logistic regression and the l2-regularized logistic regression method, we first briefly review the logistic regression method as follows. Given a data point $x_i$ in the training data and its observation class $y_i \in \{0, 1\}$, linear regression attempts to derive a decision function $f(x_i) = \omega^T x_i$, which is further converted to a probability via the logistic function $\sigma(z) = 1/(1 + e^{-z})$, so that if the observation class $y_i = 1$, the predicted class probability is $p_i = \sigma(\omega^T x_i)$; otherwise, the predicted class probability is $1 - p_i$ if $y_i = 0$. As a result, the likelihood of the training data is $\mathcal{L}(\omega) = \prod_{i=1}^{l} p_i^{y_i}(1 - p_i)^{1 - y_i}$, and accordingly, the log-likelihood is $\ell(\omega) = \sum_{i=1}^{l}\left[y_i\log p_i + (1 - y_i)\log(1 - p_i)\right]$. Maximizing $\ell(\omega)$ is equivalent to minimizing the objective function $-\ell(\omega)$, which is the cross entropy between the observation class and the predicted class for all the data points. We can see that the classical logistic regression method is derived from probability theory and the weights ω are derived by minimizing the cross entropy between the observation class and the predicted class. For two-class classification, the label is often denoted by {−1, +1} [33], so that the loss function is defined as $\log\left(1 + e^{-y_i\omega^T x_i}\right)$ (see the second term in Formula (3)), where opposite signs between the observation class and the predicted class yield a large loss, indicating a predictive error and penalty. Besides the log loss, l2-regularized logistic regression also imposes a constraint on the l2-norm of the weight vector ω as defined by the first term of Formula (3).
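To make the link between the two label conventions concrete, the following pure-Python sketch evaluates the cross entropy with {0, 1} labels and the log loss with {−1, +1} labels on the same toy data, and then adds the l2-norm term to obtain an objective of the kind described for Formula (3). The data, the weight vector and all function names are illustrative assumptions; the sketch does not reproduce the LIBLINEAR solver used in this work.

```python
import math
from typing import List, Sequence


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(w, xs, ys01) -> float:
    """Negative log-likelihood with labels in {0, 1}."""
    total = 0.0
    for x, y in zip(xs, ys01):
        p = sigmoid(dot(w, x))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def log_loss_pm1(w, xs, ys_pm1) -> float:
    """Sum of log(1 + exp(-y * w.x)) with labels in {-1, +1}."""
    return sum(math.log(1.0 + math.exp(-y * dot(w, x)))
               for x, y in zip(xs, ys_pm1))


def l2_objective(w, xs, ys_pm1, C: float) -> float:
    """Half squared l2-norm of w plus C times the log loss."""
    return 0.5 * dot(w, w) + C * log_loss_pm1(w, xs, ys_pm1)


# Toy feature vectors with components in {0, 1, 2} (hypothetical values).
xs: List[List[float]] = [[2, 1, 0], [0, 1, 2], [1, 0, 0]]
ys01 = [1, 0, 1]
ys_pm1 = [2 * y - 1 for y in ys01]  # map {0, 1} to {-1, +1}
w = [0.5, -0.25, -0.75]

print(cross_entropy(w, xs, ys01), log_loss_pm1(w, xs, ys_pm1))
print(l2_objective(w, xs, ys_pm1, C=1.0))
```

The two loss values printed on the first line coincide up to floating-point error, which is why the {−1, +1} form can be used directly as the loss term of the regularized objective.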
For each test protein-protein pair , the decision function yields two outputs , , which is further combined into the final decision as follows:
| (5) |
where denotes the absolute value of ∆. The final label for the test protein pair is defined as follows:
| (6) |
where the threshold δ is used to filter out those weak positive predictions, denotes the undetermined predictions that are discarded. Formula (6) means that only those predictions that are above random guess with threshold δ are accepted. Here we treat as probability and denotes the predicted probability on the positive class or the negative class .
Experimental setting and model evaluation
Three experimental settings are designed to validate the effectiveness of homolog knowledge transfer via homolog instances. The first setting, namely combined-instance, combines the outputs of the target instance and the homolog instance. The second setting, namely homolog-instance, uses the homolog instance alone to assess the model performance. This setting is deliberately designed to evaluate the robustness of the computational framework against GO sparsity or unavailability. The third setting, namely target-instance, uses the target instance alone to assess the model performance, and serves as the baseline for performance comparison. Performance equal to or better than the baseline indicates the effectiveness of homolog knowledge transfer.
The model is evaluated using five performance metrics, i.e. ROC-AUC (Receiver Operating Characteristic AUC), SE (sensitivity), SP (specificity), MCC (Matthews correlation coefficient) and F1 score. Among these metrics, SP, SE and MCC are derived from the confusion matrix M, from which several intermediate variables as defined in formula (7) are first derived. Based on these intermediate variables, $SP_l$, $SE_l$ and $MCC_l$ for each label l are calculated according to formula (8) and the overall MCC is calculated according to formula (9).
| (7) |
| (8) |
| (9) |
where $M_{i,j}$ records the number of instances of class i that are classified to class j, and L denotes the number of labels. AUC is calculated based on the decision values as defined in formula (5). F1 score is defined as follows:
| (10) |
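As a quick consistency check on the reported metrics, the short Python sketch below computes the harmonic mean of SP and SE for the physical class and compares it with the F1 scores listed in Table 1. That the F1 score is obtained in exactly this way is an assumption made here for illustration; it is suggested only by the fact that the numbers agree to four decimals, and the function and variable names are hypothetical.

```python
from typing import Dict, Tuple


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two rates, the usual form of an F1-type score."""
    return 2.0 * a * b / (a + b)


# (SP, SE) of the physical class under each setting, copied from Table 1.
physical_sp_se: Dict[str, Tuple[float, float]] = {
    "combined-instance": (0.9660, 0.9651),
    "homolog-instance": (0.9658, 0.9600),
    "target-instance": (0.9623, 0.9629),
}
# F1 scores as reported in Table 1.
reported_f1: Dict[str, float] = {
    "combined-instance": 0.9655,
    "homolog-instance": 0.9629,
    "target-instance": 0.9626,
}

for setting, (sp, se) in physical_sp_se.items():
    value = harmonic_mean(sp, se)
    print(f"{setting}: computed {value:.4f}, reported {reported_f1[setting]:.4f}")
```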
Results
Cross validation and independent test
Cross validation.
The final training data consist of 9,991 physical PPIs (positive class) and 2,586 indirect PPIs (negative class) that are extracted from several databases. It is evident that the two classes are highly imbalanced. To obtain as many predictions as possible, we set the threshold δ=0 as defined in Formula (6). The larger the threshold δ is, the more credible the obtained predictions are, and the fewer predictions we obtain. Readers may choose the threshold δ according to the application scenario: if coverage matters more than the confidence level, a small δ is suggested; otherwise, a large δ is preferred. The results of 5-fold cross validation are provided in Table 1, and the corresponding ROC curves are illustrated in Figure 1.
Figure 1.
ROC curves for 5-fold cross validation.
The model is assessed under three experimental settings, namely combined-instance, homolog-instance and target-instance. As shown in Table 1, the proposed model performs almost equivalently in these three cases, indicating that the homolog knowledge is properly transferred to the less-studied genes, and that the homolog instance can be safely used as the substitute when the gene concerned is not annotated. The high MCC values, e.g. 0.8960 achieved on the combined instances, indicate that there is little predictive bias between the larger class of physical interactions and the much smaller class of indirect interactions. Encouragingly, the proposed computational framework achieves promising performance even on the small class of indirect interactions, e.g. SP 0.8659, SE 0.8689 and MCC 0.8352 on the combined instances. As illustrated in Figure 1, the computational framework also achieves comparable ROC-AUC scores under the three experimental settings.
Independent test.
Table 2 summarizes the results of the independent test. From the results, we can see that the computational framework generalizes well to the unseen physical interactions, achieving 93.67% and 98.00% recognition rates on HPRD and BioGrid, respectively. By comparison, the negative independent test set contains far fewer indirect interactions, since a large portion of the indirect interactions have to be used as the negative class of the training data to reduce the skewness of the class distribution. The 75.61% recognition rate on the negative independent test data is somewhat low but is still encouraging, given the imbalanced class distribution. As more indirect interactions with experimental evidence accumulate, the skewness between the positive class and the negative class could be greatly reduced, so that the computational framework could generalize better to unseen indirect interactions.
Comparison with the related work
The existing computational methods, e.g. the related work [20], randomly sample the nonadjacent genes in the same pathways as indirect interactions without exploiting the experimentally derived indirect interactions that are curated in the PPI databases. As discussed before, the definition of indirect interaction is still vague in the existing computational and experimental work. The practice of treating any two nonadjacent genes in the same pathway as an indirect interaction is open to further discussion. In our opinion, indirect interaction at least indicates some causal or regulatory relationship between two indirectly interacting genes. Nevertheless, we still compare the proposed computational framework with the related work [20] despite the different views about indirect interactions.
As shown in Table 1, the computational method hPRINT proposed in the related work [20] achieves ROC-AUC scores of 0.849 with Random Forest, 0.793 with Naive Bayes and 0.777 with SVM, which are considerably lower than the ROC-AUC scores of this computational framework, i.e. 0.9758 on the combined instance, 0.9744 on the homolog instance and 0.9757 on the target instance. Here we do not compare l2-regularized logistic regression to the commonly used base classifiers SVM, Random Forest and Naive Bayes, because SVM incurs a large computational complexity on large training data and the other two methods (i.e. Random Forest and Naive Bayes) lack noise-control mechanisms.
Interactome-scale predicted indirect interactions and validation
We further apply the trained model to interactome-scale predictions. As shown in Table 2, 23,131 PPIs, amounting to 7.59% of the total 304,799 PPIs, are identified as indirect interactions. The computational results for the entire prediction set are provided in the Supplementary File S1 and the predicted indirect PPIs are provided in the Supplementary File S2.
Indirect interaction between two genes generally indicates that there exist one or more cascades of physical PPIs that transmit biochemical signals from the effector gene to the receptor gene. Here we use the breadth-first graph search algorithm to search for all the physical paths between the two genes that are predicted to indirectly interact. The physical paths found in HPRD [24] and BioGrid [7] are provided in the Supplementary File S3 and S4, respectively. Among the predicted 23,131 indirect interactions, 51.66% are supported by physical paths found in HPRD [24], and 32.58% are supported by physical paths found in BioGrid [7]. Most of the predicted indirect PPIs have more than one physical path (see Figure 2). For instance, as many as 224 physical paths are found in BioGrid for the predicted indirect interaction between TLR2 and EGFR. Furthermore, most physical paths contain 4 to 7 genes/proteins or even more (see Figure 3). For instance, the interaction between CREB3 and PRKACA taken from Reactome and KEGG is predicted to be an indirect interaction with 136 physical paths between them, and the longest path (CREB3 - EMD - ACTB - PLD2 - ARF1 - PLIN2 - ABHD5 - PLIN1 - PRKACA) contains nine genes (see Supplementary File S3 for details). The two genes are regarded as physically interacting in HPRD, potentially because the experimental techniques such as gene co-expression, co-occurrence and Y2H fail to detect these intermediate physical interactions along the long physical path.
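A breadth-first enumeration of physical paths of this kind can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function names, the `max_nodes` cut-off and the toy network with nodes A to E are assumptions introduced here. As in the text, path length is counted as the number of nodes on the path.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Undirected adjacency map built from a list of physical PPIs."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def find_physical_paths(graph: Dict[str, Set[str]], source: str, target: str,
                        max_nodes: int = 6) -> List[List[str]]:
    """Enumerate simple paths from source to target breadth-first.

    Partial paths are expanded level by level, so shorter paths are
    reported first; paths with more than max_nodes nodes are not explored.
    """
    paths: List[List[str]] = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            paths.append(path)
            continue
        if len(path) >= max_nodes:
            continue
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in path:  # keep paths simple (no revisits)
                queue.append(path + [neighbour])
    return paths


# Toy physical PPI network (hypothetical nodes, not real genes).
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"),
             ("A", "E"), ("E", "D"), ("B", "E")]
toy_graph = build_graph(toy_edges)
all_paths = find_physical_paths(toy_graph, "A", "D")
for p in all_paths:
    print(" - ".join(p))
```

The cut-off matters in practice because the number of simple paths grows quickly with path length in a dense network; without a bound on the number of nodes the enumeration may not be tractable at the interactome scale.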
Figure 2. Statistics of number of physical PPI cascades between two genes that are predicted to indirectly interact.
The horizontal axis denotes the physical path length and the vertical axis denotes the number of predicted indirect interactions with a specific physical path length.
Figure 3. Distribution of path length of the physical PPI cascades searched in HPRD and BioGrid for the predicted indirect interactions.
Here path length is defined as the number of nodes along the cascade of physical PPIs.
We also map the genes that are predicted to indirectly interact onto the known human signaling pathways to verify that a series of signaling genes/proteins does exist between them for signal relay. In NetPath [34], about 35 human immune signaling pathways are manually curated. For simplicity, we merge the 11 sub-types of Interleukin (IL-1 ~ IL-11) into one single signaling pathway, and thus obtain 27 human signaling pathways in total. The pathway enrichment analysis of the physical paths between two indirectly interacting genes is provided in the Supplementary File S5 for HPRD and S6 for BioGrid, respectively. We take the predicted indirect interaction between YAP1 and PPP2CA as an example. Interested readers are referred to the other predicted indirect interactions that are provided in the Supplementary File S3~S6 for useful cues. The interaction between YAP1 and PPP2CA is taken from Reactome [5] and KEGG [21]. YAP1 is a transcriptional regulator and the critical downstream regulatory target in the Hippo signaling pathway that plays a pivotal role in organ size control and tumour suppression by restricting proliferation and promoting apoptosis [35]. PPP2CA is the major phosphatase for microtubule-associated proteins (MAPs) and modulates the activity of phosphorylase B kinase casein kinase 2, mitogen-stimulated S6 kinase, and MAP-2 kinase [36]. Gene ontology (GO) enrichment analysis shows that both YAP1 and PPP2CA are annotated with the common GO terms: GO:0005829 (cytosol), GO:0005737 (cytoplasm), GO:0005634 (nucleus), GO:0016020 (membrane), GO:0008022 (protein C-terminus binding), GO:0005515 (protein binding) and GO:0006355 (regulation of transcription, DNA-dependent). We can see that YAP1 is still likely to interact with PPP2CA indirectly, even though the two proteins are potentially co-localized in the same subcellular compartments (e.g. cytosol, cytoplasm, nucleus, membrane).
In HPRD [24] and BioGrid [7], we have found dozens of physical paths between YAP1 and PPP2CA (see Supplementary File S3 and S4).
Verified cascades of physical PPIs in HPRD and pathways enrichment analysis.
In HPRD [24], twenty physical paths are verified to exist between YAP1 and PPP2CA (see Supplementary File S3 and Figure 4). The path lengths vary from 3 to 6. Here path length is defined as the number of protein nodes along the cascade of physical PPIs. Pathway enrichment analysis shows that all the twenty cascades of physical PPIs constitute parts of the TGFBeta (transforming growth factor beta) signaling pathway in NetPath [34] (see Supplementary File S5). In addition, some part (e.g. PPP2CA - TP53 - CREBBP) of the cascade of physical PPIs (e.g. PPP2CA - TP53 - CREBBP - NFE2 - YAP1) can be mapped onto NetPath signaling pathways (e.g. TNF, IL signaling pathways) (see Figure 5 Ⓐ and Ⓑ). GO enrichment analysis shows that a majority of the proteins in these cascades of physical PPIs that belong to the TGFBeta signaling pathway are localized in the subcellular compartments of nucleus, cytoplasm, nucleoplasm and membrane; fulfil the molecular functions of protein binding, DNA binding, chromatin binding, etc.; and participate in the biological processes of transcription regulation, cell proliferation, cell cycle, cell apoptosis, organismal development, etc. (see Table 3). The cascades of physical PPIs between YAP1 and PPP2CA function to transduce biological signals and induce cross-talks between signaling pathways, e.g. PPP2CA - TP53 - CREBBP between TNF and IL signaling pathways (see Figure 5 Ⓐ and Ⓑ). This evidence suggests that an indirect effect rather than a direct interaction exists between YAP1 and PPP2CA.
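The idea of mapping part of a physical path onto a signaling pathway can be sketched as a search for contiguous runs of path genes that all belong to a pathway's gene set. The function name and the `min_nodes` parameter are hypothetical, and the two gene sets below are toy subsets that contain only the three genes the text names for the TNF and IL pathways; they are not the NetPath gene lists, and this is not the enrichment procedure used in this work.

```python
from typing import Dict, List, Set


def mapped_segments(path: List[str], pathway_genes: Set[str],
                    min_nodes: int = 2) -> List[List[str]]:
    """Maximal contiguous runs of the path whose genes all lie in the pathway."""
    segments: List[List[str]] = []
    current: List[str] = []
    for gene in path:
        if gene in pathway_genes:
            current.append(gene)
            continue
        if len(current) >= min_nodes:
            segments.append(current)
        current = []
    if len(current) >= min_nodes:
        segments.append(current)
    return segments


# One cascade of physical PPIs quoted in the text.
cascade = ["PPP2CA", "TP53", "CREBBP", "NFE2", "YAP1"]
# Toy pathway subsets (assumption: only the genes named in the text).
toy_pathways: Dict[str, Set[str]] = {
    "TNF": {"PPP2CA", "TP53", "CREBBP"},
    "IL": {"PPP2CA", "TP53", "CREBBP"},
}

hits = {name: mapped_segments(cascade, genes)
        for name, genes in toy_pathways.items()}
for name, segments in hits.items():
    print(name, [" - ".join(s) for s in segments])
```

A segment shared by two pathways, as in this toy case, is the kind of pattern the text interprets as a potential cross-talk between signaling pathways.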
Figure 4. Cascades of physical PPIs in HPRD that are verified for the predicted indirect interaction between YAP1 and PPP2CA.

The grey solid line denotes the physical interaction and the red dotted line denotes the predicted indirect interaction. The red nodes denote the genes concerned and the green nodes denote the intermediate genes.
Figure 5. Pathway enrichment analyses of the cascades of physical PPIs in HPRD that are verified for the predicted indirect interaction between YAP1 and PPP2CA.

Ⓐ denotes the TNF signaling pathway; Ⓑ denotes the IL signaling pathway. The grey solid line denotes the physical interaction and the red dotted line denotes the predicted indirect interaction. The red nodes denote the genes concerned and the green nodes denote the intermediate genes.
Table 3.
GO enrichment analysis of the cascades of physical PPIs in NetPath TGFBeta signaling pathway that are verified for the predicted indirect interaction between YAP1 and PPP2CA.
| Category | GO term ID | GO term name | Percentage (%) |
|---|---|---|---|
| Cellular components | GO:0005634 | nucleus | 83.33 |
| | GO:0005737 | cytoplasm | 64.29 |
| | GO:0005654 | nucleoplasm | 61.90 |
| | GO:0005829 | cytosol | 40.48 |
| | GO:0016020 | membrane | 33.33 |
| | GO:0005739 | mitochondrion | 21.43 |
| Molecular functions | GO:0005515 | protein binding | 95.24 |
| | GO:0003677 | DNA binding | 45.24 |
| | GO:0003700 | sequence-specific DNA binding transcription factor activity | 38.10 |
| | GO:0003682 | chromatin binding | 19.05 |
| | GO:0046872 | metal ion binding | 19.05 |
| | GO:0005524 | ATP binding | 16.67 |
| | GO:0044212 | transcription regulatory region DNA binding | 16.67 |
| Biological processes | GO:0006355 | regulation of transcription, DNA-dependent | 54.76 |
| | GO:0045944 | positive regulation of transcription from RNA polymerase II promoter | 35.71 |
| | GO:0006915 | apoptotic process | 21.43 |
| | GO:0008285 | negative regulation of cell proliferation | 19.05 |
| | GO:0006367 | transcription initiation from RNA polymerase II promoter | 16.67 |
| | GO:0007275 | multicellular organismal development | 16.67 |
| | GO:0007049 | cell cycle | 11.90 |
| | GO:0030154 | cell differentiation | 11.90 |
| | GO:0006974 | response to DNA damage stimulus | 11.90 |
Verified cascades of physical PPIs in BioGrid and pathways enrichment analysis.
In BioGrid [7], we have found 35 physical paths between YAP1 and PPP2CA. For clarity, only some of the physical paths are illustrated in Figure 6; the complete set is given in Supplementary File S4. The path lengths vary from 4 to 6. Pathway enrichment analysis shows that all 35 cascades of physical PPIs can also be completely mapped to the TGFBeta signaling pathway (see Supplementary File S6). Some parts (e.g. PPP2CA - MYC - PML) of a physical PPI cascade (e.g. PPP2CA - MYC - PML - RUNX1 - YAP1) can be mapped onto other NetPath signaling pathways (e.g. the TNF signaling pathway) (see Figure 7 Ⓐ and Ⓑ). The mappings to further NetPath signaling pathways, e.g. IL, Wnt, BCR, EGFR1, AR, etc., are given in Supplementary File S6. The cascades of physical PPIs between YAP1 and PPP2CA potentially mediate cross-talk among signaling pathways, e.g. PPP2CA - MYC - BTRC and PPP2CA - MYC - CREBBP between the TNF and Wnt signaling pathways (see Figure 7 Ⓐ and Ⓑ). If we merge the physical PPIs in HPRD with those in BioGrid, or collect more physical PPIs, the number of physical paths between YAP1 and PPP2CA and the average path length are expected to increase accordingly. This provides supporting evidence for the prediction of an indirect interaction between YAP1 and PPP2CA.
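The mapping of a physical path onto signaling pathways can be sketched as a membership test of contiguous path segments against pathway gene sets. This is only an illustrative sketch under loud assumptions: the helper names `mapped_segments` and `pathways_containing` are hypothetical, and the gene sets below are toy sets assembled solely from the gene names quoted in this section; they are not the NetPath pathway definitions.

```python
def mapped_segments(path, pathway_genes, min_nodes=2):
    """Return maximal contiguous runs of `path` whose nodes all lie in the pathway."""
    segments, run = [], []
    for gene in path:
        if gene in pathway_genes:
            run.append(gene)
        else:
            if len(run) >= min_nodes:
                segments.append(tuple(run))
            run = []
    if len(run) >= min_nodes:
        segments.append(tuple(run))
    return segments

def pathways_containing(segment, pathways):
    """Names of pathways that contain every gene of the segment (cross-talk candidates)."""
    return sorted(name for name, genes in pathways.items() if set(segment) <= genes)

# Toy gene sets (assumption): built only from genes named in the text above.
toy_pathways = {
    "TGFBeta": {"PPP2CA", "MYC", "PML", "RUNX1", "YAP1", "BTRC", "CREBBP"},
    "TNF": {"PPP2CA", "MYC", "PML", "BTRC", "CREBBP"},
    "Wnt": {"PPP2CA", "MYC", "BTRC", "CREBBP"},
}

path = ("PPP2CA", "MYC", "PML", "RUNX1", "YAP1")
mapping = {name: mapped_segments(path, genes) for name, genes in toy_pathways.items()}
shared = pathways_containing(("PPP2CA", "MYC", "BTRC"), toy_pathways)
```

In this toy setting the whole path maps onto the TGFBeta set, only the segment PPP2CA - MYC - PML maps onto the TNF set, and a segment contained in more than one set is flagged as a candidate for cross-talk.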
Figure 6. Cascades of physical PPIs in BioGrid that are verified for the predicted indirect interaction between YAP1 and PPP2CA (only some of the physical paths are illustrated).

The grey solid line denotes the physical interaction and the red dotted line denotes the predicted indirect interaction. The red nodes denote the genes concerned and the green nodes denote the intermediate genes.
Figure 7. Pathway enrichment analyses of the cascades of physical PPIs in BioGrid that are verified for the predicted indirect interaction between YAP1 and PPP2CA.

Ⓐ denotes the TNF signaling pathway; Ⓑ denotes the Wnt signaling pathway. The grey solid line denotes the physical interaction and the red dotted line denotes the predicted indirect interaction. The red nodes denote the genes concerned and the green nodes denote the intermediate genes.
Discussion
Physical protein-protein interactions (PPIs) constitute the basic components of signaling pathways. Failing to differentiate between direct and indirect interactions is prone to lead to erroneous inference of signaling pathways. The existing functional and physical PPI databases have curated a fraction of indirect interactions, but only a very small number of indirect interactions are explicitly recognized. At present, the experimental techniques used to detect physical interactions include Yeast 2-Hybrid (Y2H) and Affinity Purification followed by Mass Spectrometry (AP-MS). However, these two techniques are still prone to yield a certain number of indirect interactions. Distinguishing direct interactions from indirect interactions is therefore important for the faithful reconstruction of signaling pathways.
To date, very few computational methods have been proposed to distinguish the two types of interactions in human functional protein-protein interaction networks. The related work generally treats the relationship between any two nonadjacent genes in the same pathway as an indirect interaction, which does not hold true in many cases. In the process of signal transduction, many genes/proteins function only to relay the signals. In our opinion, the relationship should be viewed as an indirect interaction only when the source gene exerts some observable indirect effect on the sink gene via a series of intermediate genes, e.g. activation/inhibition, regulation, expression, transcription, reaction, modification, etc. So far, some databases such as Reactome and KEGG have collected a small number of indirect interactions. However, the existing computational methods have rarely exploited these experimental data.
In this work, we exploit the experimentally derived direct and indirect interactions to train a two-class predictive model. Each protein pair is represented with two instances, namely the target instance and the homolog instance. The target instance is represented with the GO knowledge of the proteins themselves, and the homolog instance is represented with the GO knowledge of their homologs. In this manner, the multi-instance feature construction method facilitates homolog knowledge transfer to tackle the problem of GO sparsity. Nevertheless, it introduces two new problems into the computational modeling: the homolog instances are prone to introduce evolutionary noise, and the computational complexity increases substantially because the homolog instances augment the already large training data. To address both problems, we adopt l2-regularized logistic regression to counteract the homolog noise and fit the large training data at a low computational cost. The results of both the cross validation and the independent test show that the model performance is encouraging and is less biased on the highly unbalanced training data.
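The two-instance representation and the l2-regularized logistic regression can be illustrated with a small self-contained sketch. Several things here are assumptions made for illustration and are not taken from this work: the plain batch gradient descent solver (swapped in for brevity), the hyperparameters `lam`, `lr` and `epochs`, the rule in `predict_pair` that averages the two instance probabilities, the label convention (1 = indirect, 0 = direct), and the toy data over four placeholder GO terms.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_l2_logreg(X, y, lam=0.1, lr=0.5, epochs=500):
    """Batch gradient descent on mean log-loss + (lam / 2) * ||w||^2."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the l2 penalty
        grad_b = 0.0                     # the bias is not penalised
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict_pair(model, target_vec, homolog_vec):
    """Assumed combination rule: average the two instance probabilities."""
    w, b = model
    score = lambda x: sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return 0.5 * (score(target_vec) + score(homolog_vec))

# Hypothetical toy data: (target instance, homolog instance, label).
toy_pairs = [
    ([1, 0, 1, 0], [1, 0, 0, 0], 1),
    ([1, 0, 0, 0], [1, 0, 1, 0], 1),
    ([0, 1, 0, 1], [0, 1, 0, 0], 0),
    ([0, 1, 0, 0], [0, 0, 0, 1], 0),
]
# Each protein pair contributes two training rows that share one label.
X, y = [], []
for target_vec, homolog_vec, label in toy_pairs:
    X += [target_vec, homolog_vec]
    y += [label, label]

model = train_l2_logreg(X, y)
p_indirect = predict_pair(model, [1, 0, 1, 0], [1, 0, 0, 0])
p_direct = predict_pair(model, [0, 1, 0, 1], [0, 0, 0, 1])
```

The penalty term shrinks all weights towards zero, so a GO term that appears only in a few noisy homolog instances receives a small weight rather than dominating the decision.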
It is worth noting that the GO feature space of the training data is high dimensional, as we obtain 12,335, 5,347 and 1,755 GO terms of biological processes, molecular functions and cellular components, respectively. Feature selection is an effective way to reduce the feature dimensionality. However, GO terms are highly sparse among genes and many genes are annotated with only a few GO terms. As such, the core feature set yielded by feature selection would not be sufficient to represent most genes, and many genes would be represented with null vectors. The main reason for conducting homolog knowledge transfer is to reduce the risk of null-vector representation. Fortunately, the sample size is still larger than the feature dimensionality, so the risk of overfitting is lower than in the frequently encountered cases where the feature dimensionality is several times the size of the training data, e.g. the small sample sizes and high feature dimensionality in the computational analysis of gene expression data. Furthermore, the regularization in l2-regularized logistic regression itself penalizes the potential overfitting caused by the high feature dimensionality.
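The null-vector problem can be illustrated with a short sketch. The gene names, their annotations, the homolog annotations and the choice of "core" terms below are all hypothetical toy values; only the GO term IDs themselves appear in Table 3.

```python
def encode(terms, vocabulary):
    """Binary GO feature vector of a gene over a fixed, ordered vocabulary."""
    return [1 if term in terms else 0 for term in vocabulary]

def count_null_vectors(annotations, vocabulary):
    """Number of genes whose feature vector is all zeros."""
    return sum(1 for terms in annotations.values()
               if not any(encode(terms, vocabulary)))

full_vocabulary = ["GO:0005515", "GO:0005634", "GO:0046872", "GO:0005524", "GO:0003677"]
core_vocabulary = ["GO:0005515", "GO:0005634"]  # hypothetical result of feature selection

# Hypothetical sparse annotations of three genes and of their homologs.
target_terms = {
    "geneA": {"GO:0005515", "GO:0005634"},
    "geneB": {"GO:0046872"},
    "geneC": set(),
}
homolog_terms = {
    "geneA": {"GO:0005515"},
    "geneB": {"GO:0005515", "GO:0003677"},
    "geneC": {"GO:0005634"},
}

null_full = count_null_vectors(target_terms, full_vocabulary)
null_core = count_null_vectors(target_terms, core_vocabulary)
null_homolog = count_null_vectors(homolog_terms, full_vocabulary)
```

In this toy case, restricting the vocabulary to the core terms raises the number of null target vectors from one to two, while every gene still has a non-null homolog instance over the full vocabulary.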
The computational results at the interactome scale show that 23,131 indirect interactions are detected among the 304,799 known PPIs, and for most of these indirectly interacting gene pairs the breadth-first graph search finds dozens of physical paths between them in the major physical PPI databases. In addition, pathway enrichment analysis shows that most of the physical paths can be mapped onto more than one human signaling pathway, indicating that a series of intermediate genes exists between the indirectly interacting genes to relay biochemical signals. This evidence supports, to some extent, the credibility of the interactome-scale predictions. Importantly, these computational results could provide useful cues for the following applications: (1) exploring unknown physical PPIs or physical paths between two indirectly interacting genes; (2) exploring novel members of the protein complex formed around the indirectly interacting proteins; (3) amending or extending the existing signaling pathways; (4) recognizing physical protein-protein interactions as potential druggable targets.
Supplementary Material
Acknowledgements
This work is partly supported by funding from NIH NIMHD-RCMI grant 2G12MD007595, DOD ARO grant W911NF-15-1-0510 and the Louisiana Cancer Research Consortium (LCRC). The contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH, DOD or LCRC.
Footnotes
Competing interests
The authors declare that they have no competing interests.
References
- 1. Ito T, Tashiro K, Muta S, Ozawa R, Chiba T, et al. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc Natl Acad Sci U S A 97:1143–1147 (2000).
- 2. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183 (2002).
- 3. Kim ED, Sabharwal A, Vetta AR, Blanchette M. Predicting direct protein interactions from affinity purification mass spectrometry data. Algorithms Mol Bio 5:34 (2010).
- 4. Szklarczyk D, Franceschini A, Wyder S, Forslund K, Heller D, et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res 43 (Database issue):D447–52 (2015).
- 5. Fabregat A, Sidiropoulos K, Garapati P, Gillespie M, Hausmann K, et al. The Reactome pathway Knowledgebase. Nucleic Acids Res 44 (D1):D481–7 (2016).
- 6. Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, et al. The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 42 (Database issue):D358–63 (2014).
- 7. Chatr-Aryamontri A, Breitkreutz BJ, Oughtred R, Boucher L, Heinicke S, et al. The BioGRID interaction database: 2015 update. Nucleic Acids Res 43 (Database issue):D470–8 (2015).
- 8. Wu X, Zhu L, Guo J, Zhang D, Lin K. Prediction of yeast protein-protein interaction network: insights from the Gene Ontology and annotations. Nucleic Acids Res 34:2137–2150 (2006).
- 9. DeBodt S, Proost S, Vandepoele K, Rouze P, Peer Y, et al. Predicting protein-protein interactions in Arabidopsis thaliana through integration of orthology, gene ontology and co-expression. BMC Genomics 10:288 (2009).
- 10. Miller J, Lo RS, Ben-Hur A, Desmarais C, Stagljar I, et al. Large-scale identification of yeast integral membrane protein interactions. Proc Natl Acad Sci U S A 102:12123–12128 (2005).
- 11. Lin N, Wu B, Jansen R, Gerstein M, Zhao H, et al. Information assessment on predicting protein-protein interactions. BMC Bioinformatics 5:154 (2004).
- 12. Maetschke S, Simonsen M, Davis M, Ragan MA. Gene Ontology-driven inference of protein–protein interactions using inducers. Bioinformatics 28:69–75 (2012).
- 13. Shen J, Zhang J, Luo X, Zhu W, Yu K, et al. Predicting protein–protein interactions based only on sequences information. Proc Natl Acad Sci U S A 104:4337–41 (2007).
- 14. Qi Y, Tastan O, Carbonell JG, Klein-Seetharaman J & Weston J. Semi-supervised multi-task learning for predicting interactions between HIV-1 and human proteins. Bioinformatics 26:i645–i652 (2010).
- 15. Mei S. Probability weighted ensemble transfer learning for predicting interactions between HIV-1 and human proteins. PLoS One 8: e79 (2013).
- 16. Bandyopadhyay S, Ray S, Mukhopadhyay A & Maulik U. A review of in silico approaches for analysis and prediction of HIV-1-human protein-protein interactions. Brief Bioinform 16:830–851 (2015).
- 17. Mei S & Zhu H. A novel one-class SVM based negative data sampling method for reconstructing proteome-wide HTLV-human protein interaction networks. Sci Rep 5:8034 (2015).
- 18. Mei S & Zhang K. Computational discovery of Epstein-Barr virus targeted human genes and signalling pathways. Sci Rep 6:30612 (2016).
- 19. Wu G, Feng X & Stein L. A human functional protein interaction network and its application to cancer data analysis. Genome Biol 11:R53 (2010).
- 20. Elefsinioti A, Saraç ÖS, Hegele A, Plake C, Hubner NC, et al. Large-scale de novo prediction of physical protein-protein association. Mol Cell Proteomics 10:M111.010629 (2011).
- 21. Kanehisa M. Representation and analysis of molecular networks involving diseases and drugs. Genome Inform 23:212–3 (2009).
- 22. Soong TT, Wrzeszczynski KO, Rost B. Physical protein-protein interactions predicted from microarrays. Bioinformatics 24:2608–14 (2008).
- 23. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, et al. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 32 (Database issue):D449–51 (2004).
- 24. Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, et al. Human Protein Reference Database--2009 update. Nucleic Acids Res 37 (Database issue):D767–72 (2009).
- 25. Yu F, Huang F, Lin C. Dual coordinate descent methods for logistic regression and maximum entropy models. Mach Learn 85:41–75 (2011).
- 26. Fan R, Chang K, Hsieh C, Wang X & Lin C. LIBLINEAR: A Library for Large Linear Classification. Mach Learn Res 9:1871–1874 (2008).
- 27. Yu J, Guo M, Needham CJ, Huang Y, Cai L, et al. Simple sequence-based kernels do not predict protein-protein interactions. Bioinformatics 26:2610–4 (2010).
- 28. Ashburner M et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25–9.
- 29. Barrell D, Dimmer E, Huntley RP, Binns D & O’Donovan C et al. The GOA database in 2009--an integrated Gene Ontology Annotation resource. Nucleic Acids Res 37 (Database issue):D396–403 (2009).
- 30. Boeckmann B, Bairoch A, Apweiler R, Blatter MC & Estreicher A et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 31:365–70 (2003).
- 31. Altschul SF, Madden TL, Schäffer AA, Zhang J & Zhang Z. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–402 (1997).
- 32. Chih-Chung Chang & Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:1–27 (2011) [Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm].
- 33. Kevin M. Machine Learning: A Probabilistic Perspective. MIT Press (2012).
- 34. Kandasamy K, Mohan SS, Raju R, Keerthikumar S, Kumar GS, et al. NetPath: a public resource of curated signal transduction pathways. Genome Biol 11:R3 (2010).
- 35. Zhao B, Wei X, Li W, Udan RS, Yang Q et al. Inactivation of YAP oncoprotein by the Hippo pathway is involved in cell contact inhibition and tissue growth control. Genes Dev 21:2747–2761 (2007).
- 36. Watkins GR, Wang N, Mazalouskas MD, Gomez RJ, Guthrie CR, et al. Monoubiquitination promotes calpain cleavage of the protein phosphatase 2A (PP2A) regulatory subunit alpha4, altering PP2A stability and microtubule-associated protein phosphorylation. J Biol Chem 287:24207–24215 (2012).
