BMC Genomics
. 2024 May 9;25:406. doi: 10.1186/s12864-024-10299-x

GNNGL-PPI: multi-category prediction of protein-protein interactions using graph neural networks based on global graphs and local subgraphs

Xin Zeng 1, Fan-Fang Meng 1, Meng-Liang Wen 2, Shu-Juan Li 3, Yi Li 1
PMCID: PMC11080243  PMID: 38724906

Abstract

Most proteins exert their functions by interacting with other proteins, making the identification of protein-protein interactions (PPI) crucial for understanding biological activities, pathological mechanisms, and clinical therapies. Developing effective and reliable computational methods for predicting PPI can significantly reduce the time and labor costs associated with traditional biological experiments. However, accurately identifying the specific categories of protein-protein interactions and improving the prediction accuracy of computational methods remain dual challenges. To tackle these challenges, we proposed a novel graph neural network method called GNNGL-PPI for multi-category prediction of PPI based on global graphs and local subgraphs. GNNGL-PPI consisted of two main components: a Graph Isomorphism Network (GIN) that extracts global graph features from the PPI network graph, and GIN As Kernel (GIN-AK) that extracts local subgraph features from the subgraphs of protein vertices. Additionally, considering the imbalanced distribution of samples in each category within the benchmark datasets, we introduced an Asymmetric Loss (ASL) function to further enhance the predictive performance of the method. Through evaluations on six benchmark test sets formed by three different dataset partitioning algorithms (Random, BFS, DFS), GNNGL-PPI outperformed the state-of-the-art multi-category PPI prediction methods, as measured by the comprehensive performance evaluation metric F1-measure. Furthermore, interpretability analysis confirmed the effectiveness of GNNGL-PPI as a reliable method for multi-category prediction of protein-protein interactions.

Keywords: Multi-category prediction of protein-protein interactions, Graph neural network, Global graphs, Local subgraphs, Asymmetric loss function

Introduction

Protein-protein interactions (PPI) play a crucial role in various biological processes within cells. Identifying PPI is of great significance in advancing research across multiple life science fields, including medical diagnosis, drug design, and disease treatment [1]. Currently, the methods for identifying PPI can be broadly categorized into traditional biological experimental methods and computational methods. Traditional biological experimental methods primarily involve techniques such as yeast two-hybrid [2], protein chips [3], and synthetic lethal analysis [4]. However, these methods suffer from several disadvantages, including being time-consuming, labor-intensive, and expensive [5]. To overcome these disadvantages, computational methods for PPI prediction have developed rapidly. Nevertheless, these computational methods face dual challenges. Firstly, they need to accurately identify multiple specific categories of PPI. Secondly, they must achieve the desired predictive performance.

In recent years, the computational methods for predicting PPI have transitioned from docking-based methods to machine learning and deep learning-based methods. Docking-based methods [6] are capable of effectively predicting PPI, but they require high-quality protein 3D structures and significant computational resources. Additionally, their prediction speed is too slow to keep pace with the demands of processing massive data. Machine learning and deep learning methods, on the other hand, have exhibited better performance in handling large amounts of data [7–10]. These methods can utilize both protein sequence and structural data. Machine learning-based methods [9, 11, 12] rely on protein sequence or structural features, utilizing models like SVM [10, 13, 14] and random forest [15] to predict PPI. Although these methods have displayed good predictive performance, they cannot automatically extract deep-level features of PPI from the original sequences or structures of proteins. This creates bottlenecks that hamper improvements in predictive performance. However, the emergence of deep learning models, such as multi-layer neural networks, has provided better model prediction performance and pointed researchers towards breakthroughs in addressing the performance bottlenecks encountered by machine learning techniques [16].

Methods for predicting PPI using deep learning techniques leverage protein sequences, structures, and PPI networks. Deep learning techniques such as deep neural networks (DNN) [17], convolutional neural networks (CNN) [18], recurrent neural networks (RNN) [19], attention mechanisms [20], and graph neural networks (GNN) [21] are employed to extract deep-level features from proteins and PPI networks. DNN-based methods extract protein features through multi-layer neural networks to directly predict PPI [22, 23] or employ machine learning models for PPI prediction [24]. CNN and RNN-based methods focus on extracting local features and long-range dependency features from protein sequences, respectively. For instance, the LSTM-PHV method [8] and other related approaches [25] leveraged the LSTM (Long Short-Term Memory) model to capture long-range dependency features in protein sequences. Methods like ADH-PPI [26] and DCSE [27] integrated CNN and RNN to extract both local and long-range features from protein sequences, which were then combined to predict PPI. Attention mechanisms have also been widely utilized to identify key sequence features in protein sequences [23, 26, 28]. While these DNN, CNN, RNN, and attention-based methods primarily focused on protein sequence features, they often overlooked the structural features of proteins and the hidden interaction features present in PPI networks. To address these limitations, GNN-based methods have emerged. However, incorporating protein structures into these methods has been challenging due to the slow exploration of protein 3D structures. With the advent of protein 3D structure prediction tools like AlphaFold [29] and ColabFold [30], obtaining monomer protein 3D structures has become easier. This has led to rapid advancements in research that utilizes GNN to extract protein structural features for predicting PPI, either independently or in combination with protein sequence features. For instance, the method proposed in reference [31] directly employed the GCN/GAT model to extract the structural features of two interacting proteins. These features were then concatenated and used to predict PPI. The TAGPPI method [32] combined TextCNN and GAT to extract sequence and structural features from protein sequences and contact maps, respectively. The extracted features were fused before being passed through a fully connected (FC) layer to predict PPI. Other methods introduced interaction features by constructing a PPI network graph in which proteins served as vertices and interactions served as edges; GNN was then used to extract interaction features from this graph and ultimately predict PPI. Methods such as S-VGAE [33], HIGH-PPI [34], and Topsy-Turvy [35] utilized PPI network graphs along with protein sequences or structures as vertex features, applying GNN for binary classification to determine whether there was an interaction between vertices in the graph. The aforementioned methods treated PPI as a binary classification task and did not identify specific interaction categories between proteins. However, methods like PIPR [36], GNN-PPI [37], SemiGNN-PPI [38], and AFTGAN [39] have been developed for multi-category PPI prediction. PIPR utilized a Siamese residual recurrent convolutional neural network (RCNN) to extract local features and contextual information from protein sequences. GNN-PPI, SemiGNN-PPI, and AFTGAN leveraged GNN to learn deep-level features from PPI networks, enabling multi-category prediction.
Although these methods extracted protein sequence features and PPI network graph features for multi-category PPI prediction, research in this area was still in its early stages. The performance of the models was somewhat below user expectations. Therefore, proposing an efficient and effective model for multi-category prediction of PPI presents significant challenges.

In this study, we proposed a novel graph neural network method, called GNNGL-PPI, for multi-category prediction of protein-protein interactions. GNNGL-PPI utilized a combination of global graph and local subgraph features. Specifically, we used Graph Isomorphism Network (GIN) to extract global graph features from PPI network graphs. Additionally, we employed GIN-AK to extract local subgraph features from the subgraphs centered at protein vertices in the PPI network graphs. Simultaneously, to address the issue of imbalanced samples for each category in the benchmark dataset, we introduced an Asymmetric Loss (ASL) function, which enhanced the predictive performance of the model by weighting positive and negative samples asymmetrically. We evaluated the model on six standard test sets created using three different dataset partitioning algorithms (Random, BFS, DFS). GNNGL-PPI consistently outperformed the state-of-the-art methods for multi-category prediction of PPI, as measured by the comprehensive evaluation metric F1-measure. Furthermore, we conducted interpretability analysis to validate the effectiveness of GNNGL-PPI. Experimental results confirmed that GNNGL-PPI was a reliable method for multi-category prediction of PPI.

Materials and methods

Datasets

In this study, we approached protein-protein interaction prediction as a multi-category task, encompassing seven PPI categories: Reaction, Binding, Post-translational modification (Ptmod), Activation, Inhibition, Catalysis, and Expression. Each protein-protein interaction pair is assigned to at least one of these categories. For example, the interaction category between protein 9606.ENSP00000005257 (RalA) and protein 9606.ENSP00000202677 (RalGAP A2) is Inhibition (PMID: 34767674), while the interaction category between protein 9606.ENSP00000003100 (Cyp51) and protein 9606.ENSP00000240055 (NF-YB) is Activation (PMID: 27438727). To evaluate the performance of the model, we utilized two benchmark datasets, SHS27k and SHS148k, which were consistent with the datasets used in the GNN-PPI [37] method and exhibited less than 40% sequence identity within each dataset. These datasets consisted of 7624 (1690 proteins) and 44,488 (5189 proteins) protein-protein interaction pairs, respectively. The number and occupation ratio of samples for the seven PPI categories generated by these protein-protein interaction pairs are shown in Table 1. It is evident from Table 1 that the number of samples for the seven PPI categories was imbalanced. During the training and testing of the model, we also employed the Random, Breadth First Search (BFS), and Depth First Search (DFS) algorithms proposed by the GNN-PPI method to partition the SHS27k and SHS148k datasets; a minimal sketch of BFS-style partitioning follows this paragraph. For detailed information on the Random, BFS, and DFS partitioning algorithms, please refer to the GNN-PPI method.
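As an illustration of the BFS partitioning strategy, the following Python sketch grows a connected block of proteins from a random root and takes the interactions inside that block as the test set. The function name, the node-count stopping criterion, and the `adj` adjacency-list format are assumptions for illustration; GNN-PPI's exact criterion (e.g., counting edges rather than nodes) may differ.

```python
import random
from collections import deque

def bfs_partition(adj, test_node_ratio=0.2, seed=0):
    """Grow a connected block of proteins by breadth-first search from a
    random root; PPI pairs with both endpoints inside the block form the
    test set. `adj` maps each protein id to the set of its partners."""
    random.seed(seed)
    target = max(1, int(len(adj) * test_node_ratio))
    root = random.choice(sorted(adj))
    visited, queue = {root}, deque([root])
    while queue and len(visited) < target:
        u = queue.popleft()
        for v in adj[u]:
            if v not in visited and len(visited) < target:
                visited.add(v)
                queue.append(v)
    # interactions fully inside the BFS block become test samples
    test_pairs = {frozenset((u, v)) for u in visited
                  for v in adj[u] if v in visited}
    return visited, test_pairs
```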

Table 1.

Number and occupation ratio of samples for seven categories of protein-protein interactions

PPI Categories SHS27K SHS148K
Number of Samples Occupation Ratio Number of Samples Occupation Ratio
Reaction 3164 18.22% 18,067 17.71%
Binding 4017 23.13% 23,448 22.98%
Ptmod 1303 7.50% 9336 9.15%
Activation 3297 18.98% 18,910 18.53%
Inhibition 1407 8.10% 8987 8.81%
Catalysis 3492 20.11% 19,871 19.47%
Expression 687 3.96% 3419 3.35%

PPI network graph formation and protein features

PPI network graph formation

Assuming a set of proteins $P=\{p_1, p_2, \ldots, p_n\}$, with $n$ as the number of proteins, each protein acted as a vertex in the PPI network graph. The interaction category between $p_i$ and $p_j$ was represented as edge $e_{ij}$ ($1 \le i, j \le n$). The different protein-protein interaction categories made up the category space $D=\{D_1, \ldots, D_t\}$ ($t=7$) of the dataset, where $t$ is the number of PPI categories. If there was an interaction of a certain category between $p_i$ and $p_j$, the corresponding position in the adjacency matrix representing the PPI network graph was assigned a value of 1; otherwise, it was assigned a value of 0.
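The following sketch makes this construction concrete as one binary adjacency matrix per interaction category; the `(t, n, n)` tensor layout and the function name are illustrative assumptions.

```python
import numpy as np

CATEGORIES = ["Reaction", "Binding", "Ptmod", "Activation",
              "Inhibition", "Catalysis", "Expression"]

def build_adjacency(n_proteins, interactions):
    """interactions: iterable of (i, j, category_name) triples.
    Returns a (t, n, n) binary tensor A with A[d, i, j] = 1 when
    proteins p_i and p_j interact with category D_d (undirected)."""
    A = np.zeros((len(CATEGORIES), n_proteins, n_proteins), dtype=np.int8)
    cat_index = {c: d for d, c in enumerate(CATEGORIES)}
    for i, j, cat in interactions:
        d = cat_index[cat]
        A[d, i, j] = A[d, j, i] = 1   # edge e_ij of category d
    return A

# e.g. build_adjacency(3, [(0, 1, "Binding"), (1, 2, "Inhibition")])
```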

Protein features

We employed the pre-trained model MASSA [40] to capture high-level, fine-grained protein features. This pre-trained model leverages multi-modal protein data, including protein sequences, structures, gene ontology annotations, motifs, and region positions, to derive comprehensive protein features. Previous research mainly derived protein features from protein sequences, for example via the Position-Specific Scoring Matrix (PSSM) [41] or Hidden Markov Model (HMM) [42] matrices, or treated protein sequences as a natural language processing (NLP) [43] task. Such features are relatively one-dimensional and do not fully encompass the other biochemical information pertaining to proteins.

Proposed model

To begin with, we input the multimodal data of proteins, such as sequences, structures, and GO annotations, into the pre-trained model MASSA and obtained 512-dimensional protein pre-training features. Next, we embedded the pre-trained features using a layer of linear transformation. Finally, we used the embedded features as the protein vertex features in the PPI network (Fig. 1A).

Fig. 1.

The architecture overview of GNNGL-PPI. (A) We utilized the pre-training model MASSA to obtain comprehensive protein features based on multimodal data such as protein structures, sequences, and gene ontology annotations. (B) GIN-AK extracted the global features and centroid features of subgraphs through two different processes, and then combined these extracted features to obtain the final local subgraph features. (C) The global features extracted from the PPI network graph by GIN and the local subgraph features extracted by GIN-AK were concatenated to obtain the high-level features of vertices in the PPI network, i.e., protein features. We multiplied the features of proteins themselves and their interacting proteins and fed them into the multilayer perceptron (MLP) to complete multi-category prediction of protein-protein interactions

The extraction of high-level features in the PPI network graph can be divided into two main parts, as shown in Fig. 1B and C: global graph features extraction and local subgraph features extraction. In the global graph features extraction part, we utilized one layer of GIN followed by two FC layers, with the ReLU activation function [44] applied in the FC layers. To ensure stable model training, a batch normalization function was employed in the last layer to normalize the extracted features and obtain the global graph features of the PPI network graph. Moving on to the local subgraph features extraction part: firstly, we selected a protein vertex as the center and included all vertices within a path distance of K from that center vertex, forming the subgraph of the protein. Secondly, we used GIN-AK to extract features from the subgraph. GIN-AK incorporated two distinct processes to extract the global features and centroid features of the subgraph, respectively, which were then combined to obtain the final local subgraph features. To extract the global features of the subgraph, we input both the subgraph and the protein features into a single layer of GIN. The output features of GIN were then fed into an FC layer (gating unit) with a sigmoid activation function, resulting in the global features of the subgraph. For the centroid features of the subgraph, we first calculated the path distances between the vertices in the subgraph. The obtained path distance matrix was then processed using the same gating unit to obtain the centroid features. Finally, we concatenated the global features and centroid features of the subgraph, applied batch normalization, and obtained the local subgraph features. The global graph features and local subgraph features were concatenated to form the vertex features in the PPI network graph. It is important to note that the ReLU activation function was used in the gating unit during the subgraph features extraction process. Furthermore, to prevent gradient vanishing and enhance model stability during training, a dropout layer was added to the gating unit.
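For concreteness, here is a minimal PyTorch Geometric sketch of the global branch just described (one GIN layer, then two FC layers with ReLU and a batch-normalized output); the class name and hidden width are assumptions, not details taken from the released code.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GINConv

class GlobalGraphEncoder(nn.Module):
    """Global-graph branch: one GIN layer, two FC layers with ReLU,
    then batch normalization of the extracted features."""
    def __init__(self, in_dim=512, hid_dim=512):
        super().__init__()
        self.gin = GINConv(nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, hid_dim)))
        self.fc1 = nn.Linear(hid_dim, hid_dim)
        self.fc2 = nn.Linear(hid_dim, hid_dim)
        self.bn = nn.BatchNorm1d(hid_dim)

    def forward(self, x, edge_index):
        h = self.gin(x, edge_index)      # message passing over the PPI graph
        h = torch.relu(self.fc1(h))
        h = torch.relu(self.fc2(h))
        return self.bn(h)                # global graph features per vertex
```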

To perform multi-category prediction of PPI, we first multiplied the features of the protein itself and its interacting protein. The resulting multiplied features were then fed into an FC layer, whose output was a 7-dimensional vector. Finally, by applying the sigmoid function to this vector, we completed the multi-category prediction of PPI.
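A sketch of this prediction head follows, assuming the concatenated global-plus-local vertex features have width 1024; the class name and width are illustrative.

```python
import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    """Prediction head: element-wise product of the two vertex features,
    one FC layer to 7 logits, sigmoid for per-category probabilities."""
    def __init__(self, dim=1024, n_classes=7):
        super().__init__()
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, h, pairs):
        # h: (n_proteins, dim) vertex features; pairs: (m, 2) index tensor
        z = h[pairs[:, 0]] * h[pairs[:, 1]]   # feature multiplication
        return torch.sigmoid(self.fc(z))      # multi-label probabilities
```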

GIN-AK

GNN have proven to be an effective framework for topology representation learning. GIN is widely regarded as the most expressive GNN [45], but its discriminative power is still bounded by the first-order Weisfeiler-Leman isomorphism test [46, 47]. To overcome this limitation, a subgraph operation was introduced on top of GIN. This approach updated vertex information using the features of the subgraph around each vertex, rather than relying solely on the features of neighboring vertices. This transformation reduced the complexity of the graph feature problem to a smaller and simpler subgraph feature problem. The resulting GIN model was called GIN-AK (GIN As Kernel). In graph convolutional operations, each vertex aggregates information directly from its neighboring vertices in a star operation. The star operation $\mathrm{Star}(v)$ forms a graph that can be defined as formula (1). The specific update rule for vertices in GIN is shown in formula (2).

$$\mathrm{Star}(v) = \big(\{v\} \cup \mathcal{N}(v),\ \{(v,u) \mid u \in \mathcal{N}(v)\}\big) \qquad (1)$$
$$h_v^{(l+1)} = \mathrm{MLP}^{(l+1)}\Big((1+\epsilon^{(l+1)})\, h_v^{(l)} + \sum_{u \in \mathcal{N}(v)} h_u^{(l)}\Big) \qquad (2)$$

where $\mathcal{N}(v)$ denotes the neighbors of vertex $v$ and $h_v^{(l)}$ is the feature of $v$ at layer $l$.

The limitation of graph convolutional operations lies in their inability to differentiate between graphs that possess the same vertex degrees but exhibit different structures. To address this issue, we substituted the $\mathrm{Star}(\cdot)$ operation with a subgraph encoder, denoted $\mathrm{Sub}_k(\cdot)$. This replacement significantly enhanced the expressive capability of the graph convolutional operation. The subgraph encoder updated the features of vertex $v$ from the subgraph formed by the $k$-hop egonet centered at $v$. This update process is displayed in formula (3), which specifically outlines the procedure for updating vertices in GIN-AK.

$$h_v^{(l+1)} = \mathrm{Emb}^{(l+1)}\big(\mathrm{Sub}_k(v)\big) \qquad (3)$$

where $\mathrm{Sub}_k(v)$ denotes the $k$-hop egonet centered at vertex $v$ and $\mathrm{Emb}$ is the GIN-based subgraph encoder.

In GIN-AK, a $k$-hop propagation mechanism was employed to acquire the subgraph surrounding each vertex. Additionally, the path distance between each vertex and the centroid of its corresponding subgraph was computed. This information enriched the vertex features and contributed to the overall improvement of graph convolutional expressiveness. To capture nonlinear transformations of the global features and centroid features of the subgraph, gating units were introduced. Finally, the global features and centroid features of the subgraph were obtained using formulas (4) and (5), respectively.

$$h_{\mathrm{global}}^{(l+1)}(v) = \mathrm{Gate}\Big(\sum_{u \in \mathrm{Sub}_k(v)} h_{u \mid \mathrm{Sub}_k(v)}^{(l+1)}\Big) \qquad (4)$$
$$h_{\mathrm{centroid}}^{(l+1)}(v) = \mathrm{Gate}\Big(\sum_{u \in \mathrm{Sub}_k(v)} D_{vu}^{(l+1)} \odot h_{u \mid \mathrm{Sub}_k(v)}^{(l+1)}\Big) \qquad (5)$$

where $h_{u \mid \mathrm{Sub}_k(v)}^{(l+1)}$ is the GIN embedding of vertex $u$ within the subgraph $\mathrm{Sub}_k(v)$, and $\mathrm{Gate}(\cdot)$ denotes the gating unit, an FC layer with sigmoid activation.

Here, $D_{ij}^{(l+1)}$ represents the path distance matrix from vertex $i$ to vertex $j$ at layer $l+1$, and $\odot$ represents element-wise multiplication. Combining the global features and centroid features of the subgraph, the update process of vertex $v$ is shown in formula (6).

$$h_v^{(l+1)} = \mathrm{BN}\big(\big[\, h_{\mathrm{global}}^{(l+1)}(v) \,\Vert\, h_{\mathrm{centroid}}^{(l+1)}(v) \,\big]\big) \qquad (6)$$

where $\Vert$ denotes concatenation and $\mathrm{BN}$ batch normalization, consistent with the architecture described above.

GIN-AK improved the expressiveness of the graph convolutional operation by using subgraphs instead of star graphs. This enhancement allowed GIN-AK to capture underlying structural features in the PPI network, leading to improved performance in predicting different categories of protein-protein interactions.
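For concreteness, extracting the $k$-hop egonet of a vertex can be done with PyTorch Geometric's `k_hop_subgraph` utility; the wrapper function and its name are illustrative.

```python
import torch
from torch_geometric.utils import k_hop_subgraph

def egonet(v, edge_index, num_hops=1):
    """Extract the k-hop egonet around vertex v, i.e. the subgraph
    Sub_k(v) consumed by GIN-AK. Returns the local node ids, the
    re-indexed edge list, and the position of v (the centroid)."""
    nodes, sub_edge_index, mapping, _ = k_hop_subgraph(
        int(v), num_hops, edge_index, relabel_nodes=True)
    return nodes, sub_edge_index, mapping

# usage on a toy graph with edges 0-1 and 1-2 (both directions)
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])
print(egonet(1, edge_index, num_hops=1))
```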

ASL

In this study, PPI prediction was regarded as a multi-category classification task. However, the dataset used (Table 1) was imbalanced: some categories had far more samples than others. This can distort the performance of the model, since it may focus on the majority classes and ignore the minority classes. To address this problem, we introduced the Asymmetric Loss (ASL) [48] function.

In multi-category imbalanced datasets, symmetric loss functions like Focal Loss [49], BCE Loss [50], or Cross Entropy [51] may not effectively learn features from positive samples. These loss functions tend to focus more on negative samples than positive samples [48], which can be suboptimal. For example, in the Focal Loss function (formula 8), using the same focusing parameter $\gamma$ for multi-category training may eliminate the gradients of sparse positive samples. Here, $p$ is the output probability of the model, $\gamma$ is the focusing parameter, and $L_+$ and $L_-$ represent the positive and negative parts of the loss, respectively.

$$L = -y L_{+} - (1-y) L_{-}, \qquad L_{+} = (1-p)^{\gamma}\log(p), \qquad L_{-} = p^{\gamma}\log(1-p) \qquad (8)$$

Therefore, in ASL, the focusing parameter of the loss function is decoupled into a positive focusing parameter $\gamma_+$ and a negative focusing parameter $\gamma_-$. This allows asymmetric focusing, enabling better control of the contribution of positive and negative samples to the loss. In addition, asymmetric focusing parameters alone can be insufficient in some cases, such as when very easy negative samples dominate. ASL therefore further proposes a second asymmetric mechanism, probability shifting, as shown in formula (9).

$$p_m = \max(p - m,\ 0) \qquad (9)$$

Here $m \ge 0$ represents an adjustable probability margin. With the above two adjustments, the final definition of ASL is shown in formula (10).

$$\mathrm{ASL} = -y L_{+} - (1-y) L_{-}, \qquad L_{+} = (1-p)^{\gamma_{+}}\log(p), \qquad L_{-} = (p_m)^{\gamma_{-}}\log(1-p_m) \qquad (10)$$

By employing ASL to dynamically regulate the degree of asymmetry throughout the entire training process, the selection of hyperparameters is simplified. This effectively balances the focus of the network on positive and negative samples, ultimately improving the accuracy of multi-category prediction.
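A minimal PyTorch sketch of ASL as defined in formulas (8) to (10); the function name and the epsilon clamp for numerical stability are assumptions, and the default parameters follow the training settings reported below.

```python
import torch

def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=1.0, margin=0.05, eps=1e-8):
    """Asymmetric Loss: p and y are same-shape tensors of sigmoid
    probabilities and binary labels; negatives use the shifted
    probability p_m = max(p - margin, 0)."""
    p_m = (p - margin).clamp(min=0)                                  # formula (9)
    loss_pos = y * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=eps))
    loss_neg = (1 - y) * p_m.pow(gamma_neg) * torch.log((1 - p_m).clamp(min=eps))
    return -(loss_pos + loss_neg).mean()                             # formula (10)
```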

Model training

We trained our model for 400 epochs using the Adam optimizer [52] with a batch size of 1024 and an initial learning rate of 0.01. To optimize the training process, we implemented a learning rate decay function, set the patience to 20, and stopped training when the model did not reduce its loss for 20 consecutive epochs. To prevent overfitting, we used a dropout rate of 0.2 in the local subgraph features extraction part and 0.5 in the global graph features extraction part. In the local subgraph features extraction part, we set $k$ in $k$-hop to 1 and extracted the 1-hop subgraph as the input graph of GIN-AK, and we used ASL with a probability margin $m$ of 0.05. For asymmetric focusing, we set $\gamma_-$ to 1 and $\gamma_+$ to 0, as recommended by ASL.
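A training skeleton matching these settings is sketched below; it assumes a `model` whose forward pass takes vertex features, the PPI edge list, and the pair indices, a loader yielding batches of 1024 pairs, and the `asymmetric_loss` above. The scheduler's decay factor is an assumption, as the paper does not state it.

```python
import torch

def train(model, train_loader, criterion, epochs=400, lr=0.01, patience=20):
    """Adam with learning-rate decay and early stopping after `patience`
    epochs without a reduction in training loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=10)
    best_loss, stall = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0.0
        for x, edge_index, pairs, y in train_loader:   # batches of 1024 PPI pairs
            opt.zero_grad()
            loss = criterion(model(x, edge_index, pairs), y)
            loss.backward()
            opt.step()
            epoch_loss += loss.item()
        scheduler.step(epoch_loss)                     # learning-rate decay
        if epoch_loss < best_loss:
            best_loss, stall = epoch_loss, 0
        else:
            stall += 1
            if stall >= patience:                      # early stopping
                break
    return model
```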

Evaluation metrics

In this study, we regarded PPI prediction as a multi-category classification task. The dataset used in our study had an imbalanced distribution of samples across different categories. To effectively evaluate the model’s performance on this imbalanced dataset, we employed F1-measure as the evaluation metric. F1-measure is a widely-used evaluation metric for imbalanced datasets because it takes into account both precision and recall, providing a balanced measure of model performance [53].

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Among them, TP, FP, TN, and FN represent the number of predicted true positive, false positive, true negative, and false negative samples, respectively.
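The GNN-PPI line of work reports micro-averaged F1 over the seven categories; the helper below, including its name and the 0.5 decision threshold, is an illustrative assumption.

```python
import numpy as np
from sklearn.metrics import f1_score

def micro_f1(probs, labels, threshold=0.5):
    """probs, labels: (n_pairs, 7) arrays of predicted probabilities and
    binary ground-truth labels; micro-averaging pools TP/FP/FN across all
    seven categories before applying the formulas above."""
    preds = (probs >= threshold).astype(int)
    return f1_score(labels, preds, average="micro", zero_division=0)

# toy two-category example: micro_f1(np.array([[0.9, 0.2]]), np.array([[1, 0]])) -> 1.0
```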

Results and discussion

Multimodal pre-training features offer a better representation of protein

The predictive performance of a model is directly influenced by the quality of its input features. In this study, we aimed to select high-quality features for protein vertices in the PPI network. To achieve this, we compared the sequence features of proteins with features obtained from the latest multimodal pre-training model of proteins. Experimental results presented in Table 2 showed that multimodal pre-training features represent proteins more effectively. The sequence features used in our study were derived from the GNN-PPI method, a classic approach for predicting PPI. These sequence features consisted of 13 dimensions, with the first 5 dimensions capturing the co-occurrence similarity features of amino acids, and the remaining 8 dimensions representing the similarity features of electrostatic and hydrophobic interactions between amino acids. Through a combination of one layer of CNN and a linear transformation, these 13-dimensional sequence features were transformed into 512-dimensional sequence features, which were utilized in our study. For the multimodal pre-training features, we employed the latest and more refined pre-training model, MASSA. This model leverages protein data from various modalities, including protein sequences, structures, gene ontology annotations, motifs, and region positions, to extract comprehensive protein features at a higher level. Subsequently, we constructed a GIN model and used the protein sequence features as well as the multimodal pre-training features as vertex features in the PPI network graph to predict PPI.

Table 2.

Performance comparison between multimodal pre-training features and sequence features

Feature SHS27k SHS148K
Random (%) BFS (%) DFS (%) Random (%) BFS (%) DFS (%)
Sequence features 88.21 ± 0.47 63.26 ± 3.70 74.31 ± 3.44 91.98 ± 0.20 65.50 ± 4.59 82.24 ± 0.73
MASSA 88.99 ± 0.80 70.34 ± 0.67 78.32 ± 1.60 92.52 ± 0.17 69.99 ± 4.60 83.19 ± 1.26

In the case of the SHS27K dataset, the GIN model based on multimodal pre-trained features exhibited improved performance under the three dataset partitioning algorithms (Random, BFS, and DFS): the F1-measure increased by approximately 0.7%, 7%, and 4%, respectively. As for the SHS148K dataset, the GIN model based on multimodal pre-trained features achieved improvements of 1%, 4%, and 1% under the Random, BFS, and DFS algorithms, respectively. These experimental findings indicated that, when using graph neural networks to predict PPI, multimodal pre-trained features offered a better representation of proteins than sequence features alone.

Local subgraph features can enhance the performance of the model

In this study, we proposed a novel operation for updating vertex information in GIN. Instead of directly aggregating information from neighboring vertices in a star operation, we updated the vertex information through the features of the vertex’s subgraph. To assess the contribution of local subgraph features to the model’s performance, we conducted ablation experiments based on global graph features, local subgraph features, and their combined features. The global graph features were extracted directly using the GIN model, while the local subgraph features were extracted using the GIN-AK. The combined features were obtained by fusing global graph features and local subgraph features. Experimental results presented in Table 3 indicated that, based solely on global graph features and local subgraph features, the latter performed approximately 5% better than the former for the BFS partitioning algorithm on both SHS27K and SHS148K datasets, while they exhibited similar performance under the other two partitioning algorithms. However, after combining global graph features and local subgraph features, the combined model showed better performance than the global graph features-based and local subgraph features-based models under the two partitioning algorithms (Random and DFS) of the SHS148K dataset and all partitioning algorithms of the SHS27K dataset. Although the F1-measure of the combined model was 1% lower than that of the local subgraph features-based model under the BFS partitioning algorithm of the SHS148K dataset, these findings still showed that incorporating local subgraph features on the basis of global graph features could enhance the model’s performance.

Table 3.

Performance comparison of global graph, subgraph, and their combined features

Feature SHS27k SHS148K
Random (%) BFS (%) DFS (%) Random (%) BFS (%) DFS (%)
Subgraph 89.15 ± 0.69 74.08 ± 0.32 77.76 ± 2.55 90.84 ± 0.32 75.97 ± 3.64 82.87 ± 0.52
Global Graph 88.99 ± 0.80 70.34 ± 0.67 78.32 ± 1.60 92.52 ± 0.17 69.99 ± 4.60 83.19 ± 1.26
Combination 90.04 ± 0.55 76.08 ± 1.22 79.67 ± 4.05 92.76 ± 0.17 74.62 ± 4.35 84.47 ± 1.20

Selection of parameter k in k-hop

To select the most effective parameter k in k-hop, we conducted experiments on the SHS27K dataset using k values of 1 and 2. Experimental results, as presented in Table 4, showed that a k value of 1 outperformed a k value of 2. Due to the significant computational cost associated with calculating subgraphs for all vertices, we decided against conducting further experiments on the SHS27K dataset with k values of 3 or greater. Additionally, given the large size of the SHS148K dataset, it was not feasible to perform subgraph calculation experiments with a k value of 2 or greater on our device. As a result, we ultimately settled on a k value of 1.

Table 4.

Performance comparison of different k values based on SHS27K

k-hop SHS27K
Random (%) BFS (%) DFS (%)
1-hop 90.23 ± 0.31 78.51 ± 5.25 79.81 ± 3.43
2-hop 90.25 ± 0.25 68.29 ± 5.58 76.18 ± 1.19

The impact of ASL on the performance of the model

In the previous experiments, we used the BCE loss function, a symmetric loss function that does not address the imbalanced sample sizes across the different interaction categories in the dataset. To address this limitation, we introduced ASL and conducted experiments with it. Experimental results presented in Table 5 revealed that, under the three partitioning algorithms (Random, BFS, and DFS) on both the SHS27K and SHS148K datasets, the model based on ASL achieved higher F1-measures than the model based on the BCE loss function. Specifically, on the SHS27K dataset, the F1-measure increased by approximately 0.20%, 2%, and 0.15%, respectively, while on the SHS148K dataset, the F1-measure increased by approximately 1%, 1%, and 2%, respectively. These results illustrated that ASL can effectively enhance the performance of the model on imbalanced datasets.

Table 5.

Performance comparison of asymmetric (ASL) and symmetric (BCE) loss functions

Loss Function SHS27k SHS148K
Random (%) BFS (%) DFS (%) Random (%) BFS (%) DFS (%)
BCE 90.04 ± 0.55 76.08 ± 1.22 79.67 ± 4.05 92.76 ± 0.17 74.62 ± 4.35 84.47 ± 1.20
ASL 90.23 ± 0.31 78.51 ± 5.25 79.81 ± 3.43 93.34 ± 0.12 75.14 ± 2.63 86.07 ± 1.05

Performance comparison of GNNGL-PPI with the state-of-the-art methods

In this study, GNNGL-PPI exhibited good performance across the three partitioning algorithms on the SHS27K and SHS148K datasets. To comprehensively assess its performance, we conducted a comparison with two machine learning methods, Random Forest (RF) and Logistic Regression (LR), as well as four deep learning methods: PIPR, GNN-PPI, LDMGNN [54], and SemiGNN-PPI. Experimental results (Table 6) showed that GNNGL-PPI consistently outperformed these state-of-the-art methods in terms of F1-measure under all three partitioning algorithms on both datasets. These findings affirmed the reliability of GNNGL-PPI as a PPI predictor.

Table 6.

Performance comparison of GNNGL-PPI with the state-of-the-art methods

Method SHS27k SHS148K
Random (%) BFS (%) DFS (%) Random (%) BFS (%) DFS (%)
RF 78.45 ± 0.08 37.67 ± 1.57 35.55 ± 2.22 82.10 ± 0.20 38.96 ± 1.94 43.26 ± 3.43
LR 71.55 ± 0.93 43.06 ± 5.05 48.51 ± 1.87 67.00 ± 0.07 47.45 ± 1.42 51.09 ± 2.09
PIPR 83.31 ± 0.75 44.48 ± 4.44 57.80 ± 3.24 90.05 ± 2.59 61.83 ± 10.23 63.98 ± 0.76
GNN-PPI 87.91 ± 0.39 63.81 ± 1.79 74.72 ± 5.26 92.26 ± 0.10 71.37 ± 5.33 82.67 ± 0.85
LDMGNN 89.34 ± 0.44 74.56 ± 3.03 78.20 ± 2.69 92.38 ± 0.08 73.98 ± 5.51 83.79 ± 0.95
SemiGNN-PPI 89.51 ± 0.46 72.15 ± 2.87 78.32 ± 3.15 92.40 ± 0.22 71.78 ± 3.56 85.45 ± 1.17
GNNGL-PPI 90.23 ± 0.31 78.51 ± 5.25 79.81 ± 3.43 93.34 ± 0.12 75.14 ± 2.63 86.07 ± 1.05

Performance comparison between GNNGL-PPI and state-of-the-art methods under the multi-classification evaluation metrics

In addition to binary classification evaluation metrics, we further utilized multi-classification evaluation metrics, namely Weighted precision, Weighted recall, Weighted f1-score, Macro precision, Macro recall, Macro f1-score, Micro precision, Micro recall, and Micro f1-score, to evaluate the model's performance. Experimental results (Table 7) showed that GNNGL-PPI exhibited superior performance compared to other state-of-the-art methods. These outcomes highlighted the effectiveness of GNNGL-PPI in predicting multiple categories of PPI.

Table 7.

Performance comparison between GNNGL-PPI and other state-of-the-art methods under the multi-classification evaluation metrics

Method Multi-classification evaluation metrics SHS27K SHS148K
Random (%) BFS (%) DFS (%) Random (%) BFS (%) DFS (%)
GNNGL-PPI Weighted precision 89.49 ± 0.43 76.41 ± 1.14 76.49 ± 2.22 92.96 ± 0.20 75.37 ± 2.71 84.85 ± 0.65
LDMGNN 89.13 ± 0.55 73.14 ± 2.32 76.11 ± 0.08 92.89 ± 0.33 72.56 ± 3.32 83.35 ± 1.35
GNN-PPI 88.97 ± 0.17 69.75 ± 9.24 70.92 ± 1.94 92.87 ± 0.07 70.91 ± 2.55 83.08 ± 0.56
GNNGL-PPI Weighted recall 90.81 ± 0.43 79.38 ± 2.02 82.92 ± 2.98 94.00 ± 0.07 78.88 ± 6.02 88.74 ± 0.38
LDMGNN 88.41 ± 0.22 74.06 ± 3.47 81.15 ± 2.79 92.04 ± 0.14 65.23 ± 2.76 80.88 ± 1.67
GNN-PPI 88.10 ± 0.32 71.68 ± 9.55 73.67 ± 2.40 91.63 ± 0.25 59.12 ± 6.67 82.60 ± 1.01
GNNGL-PPI Weighted f1-score 90.13 ± 0.41 77.42 ± 0.86 79.39 ± 2.57 93.47 ± 0.07 76.66 ± 4.37 86.69 ± 0.60
LDMGNN 88.73 ± 0.39 72.81 ± 2.39 78.16 ± 1.30 92.45 ± 0.20 67.32 ± 2.71 81.86 ± 1.04
GNN-PPI 88.50 ± 0.25 68.30 ± 1.91 71.87 ± 1.64 92.24 ± 0.10 62.53 ± 4.21 82.72 ± 0.77
GNNGL-PPI Macro precision 86.36 ± 0.66 70.00 ± 5.68 72.63 ± 2.65 89.76 ± 0.47 72.83 ± 2.70 78.44 ± 0.77
LDMGNN 85.92 ± 0.40 70.08 ± 2.81 73.16 ± 0.47 89.60 ± 0.30 71.60 ± 3.26 78.26 ± 0.87
GNN-PPI 85.88 ± 0.40 57.88 ± 20.34 67.37 ± 1.51 89.64 ± 0.15 69.18 ± 3.32 77.65 ± 0.57
GNNGL-PPI Macro recall 86.52 ± 0.60 72.22 ± 4.15 78.09 ± 2.74 90.20 ± 0.06 75.55 ± 5.97 83.71 ± 0.31
LDMGNN 83.89 ± 0.67 64.32 ± 2.12 74.34 ± 1.63 87.32 ± 0.07 58.13 ± 1.70 74.14 ± 1.84
GNN-PPI 83.43 ± 0.45 52.88 ± 8.32 69.44 ± 1.37 87.07 ± 0.40 54.24 ± 8.07 72.37 ± 0.88
GNNGL-PPI Macro f1-score 86.40 ± 0.33 70.37 ± 4.35 74.99 ± 2.72 89.97 ± 0.26 73.70 ± 3.83 80.83 ± 0.55
LDMGNN 84.80 ± 0.52 65.85 ± 2.10 72.80 ± 0.35 88.47 ± 0.16 62.11 ± 3.11 75.77 ± 1.47
GNN-PPI 84.55 ± 0.13 52.96 ± 14.06 68.00 ± 1.34 88.30 ± 0.18 58.62 ± 5.55 73.82 ± 0.66
GNNGL-PPI Micro precision 89.49 ± 0.49 75.74 ± 1.09 76.23 ± 2.14 92.95 ± 0.23 74.77 ± 3.26 84.35 ± 0.77
LDMGNN 89.24 ± 0.48 73.52 ± 2.64 76.10 ± 0.18 93.05 ± 0.32 72.28 ± 3.12 82.90 ± 1.39
GNN-PPI 89.11 ± 0.25 71.54 ± 5.70 70.44 ± 2.11 92.98 ± 0.09 70.03 ± 1.74 83.27 ± 0.57
GNNGL-PPI Micro recall 90.82 ± 0.43 79.39 ± 2.03 82.93 ± 2.98 94.01 ± 0.07 78.88 ± 6.03 88.76 ± 0.36
LDMGNN 88.42 ± 0.22 74.06 ± 3.47 81.15 ± 2.79 92.05 ± 0.15 65.23 ± 2.77 80.88 ± 1.67
GNN-PPI 88.11 ± 0.32 71.69 ± 9.55 73.34 ± 2.26 91.64 ± 0.25 59.12 ± 6.68 82.60 ± 1.02
GNNGL-PPI Micro f1-score 90.15 ± 0.39 77.49 ± 0.56 79.43 ± 2.53 93.48 ± 0.09 76.72 ± 4.29 86.50 ± 0.56
LDMGNN 88.82 ± 0.32 73.74 ± 2.48 78.12 ± 1.10 92.54 ± 0.20 68.52 ± 2.26 81.86 ± 0.94
GNN-PPI 88.60 ± 0.29 70.81 ± 1.72 71.83 ± 1.61 92.31 ± 0.09 63.86 ± 3.43 82.87 ± 0.70

Performance comparison of some statistical tests between GNNGL-PPI and state-of-the-art methods

In this study, we dealt with an imbalanced dataset in which the number of samples varied across categories. To accurately evaluate the model's performance on this imbalanced dataset, we opted for the F1 metric, as it provided a more accurate reflection of the model's true performance. Consequently, we conducted statistical tests on the F1 metrics of three methods, GNN-PPI, LDMGNN, and GNNGL-PPI, under the three partitioning algorithms on both the SHS27K and SHS148K datasets; a minimal sketch of the test is given after Table 8. Experimental results, as shown in Table 8, indicated that GNNGL-PPI exhibited slightly superior performance on the two statistical testing metrics compared to the two other state-of-the-art methods.

Table 8.

Performance comparison of some statistical tests between GNNGL-PPI and state-of-the-art methods

Rank Number of Cases Mean Rank Sum of Ranks
GNN-PPI vs. GNNGL-PPI Negative Ranks 18 (a) 9.50 171.00
Positive Ranks 0 (b) 0.00 0.00
Ties 0 (c)
Total 18
LDMGNN vs. GNNGL-PPI Negative Ranks 17 (d) 9.53 162.00
Positive Ranks 1 (e) 9.00 9.00
Ties 0 (f)
Total 18

(a) GNN-PPI < GNNGL-PPI (b) GNN-PPI > GNNGL-PPI (c) GNN-PPI = GNNGL-PPI

(d) LDMGNN < GNNGL-PPI (e) LDMGNN > GNNGL-PPI (f) LDMGNN = GNNGL-PPI

Statistical tests (a) Z Asymptotic significance (two-tailed) p-value
GNN-PPI vs. GNNGL-PPI -3.724 (b) 0.000 0.000196
LDMGNN vs. GNNGL-PPI -3.332 (b) 0.001 0.000863
(a) Wilcoxon signed-rank test (b) Based on positive ranks
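As a sketch, the same kind of paired test can be reproduced with SciPy. The six values per method below are the mean F1 scores from Table 6; the paper's own test paired 18 scores across datasets, partitions, and runs, so this six-pair version is only illustrative.

```python
from scipy.stats import wilcoxon

# mean F1 per (dataset, partition) from Table 6
f1_gnn_ppi   = [87.91, 63.81, 74.72, 92.26, 71.37, 82.67]
f1_gnngl_ppi = [90.23, 78.51, 79.81, 93.34, 75.14, 86.07]

stat, p_value = wilcoxon(f1_gnn_ppi, f1_gnngl_ppi, alternative="two-sided")
print(f"W = {stat:.1f}, p = {p_value:.4f}")  # all six differences favor GNNGL-PPI
```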

Statistical analysis of prediction accuracy of GNNGL-PPI and GNN-PPI under different interaction categories

Table 6 showed that GNNGL-PPI outperformed the other state-of-the-art methods under the three partitioning algorithms on both the SHS27K and SHS148K datasets. To further evaluate the predictive performance of GNNGL-PPI across different interaction categories, we compared its prediction accuracy with that of GNN-PPI under the BFS partitioning algorithm on the SHS27K dataset (Fig. 2). Statistical analysis revealed that the Reaction category, with 3164 samples in total (Table 1), yielded 613 test samples after the BFS split. GNNGL-PPI correctly predicted 496 of these 613 samples, an accuracy of 80.91% (Fig. 2a); in comparison, GNN-PPI correctly predicted 434, an accuracy of 70.79%. Similarly, for the Binding, Ptmod, Activation, Inhibition, Catalysis, and Expression categories, GNNGL-PPI achieved higher accuracies than GNN-PPI (given in parentheses as GNNGL-PPI vs. GNN-PPI): (85.19%, 71.55%), (86.02%, 71.55%), (84.35%, 81.42%), (86.67%, 85.74%), (87.47%, 74.94%), and (42.55%, 28.72%), respectively (Fig. 2b). In all interaction categories except Activation and Inhibition, GNNGL-PPI exhibited classification accuracy more than 10% higher than that of GNN-PPI. Notably, the Expression category had only 94 test samples; despite this, GNNGL-PPI and GNN-PPI correctly predicted 40 and 27 samples, respectively, an accuracy improvement of 13.83%. This suggested that our proposed GNNGL-PPI method could learn features from small sample sizes and make accurate predictions, mitigating the problem of low prediction accuracy caused by imbalanced dataset categories with small sample sizes.

Fig. 2.

Performance comparison between GNNGL-PPI and GNN-PPI across different interaction categories using the BFS partitioning algorithm on the SHS27K dataset. (a) In each interaction category, GNNGL-PPI predicted more correct samples (green) than GNN-PPI (red). Particularly in the Expression category, GNNGL-PPI correctly predicted 40 positive samples, while GNN-PPI only achieved 27. (b) With the exception of the Activation and Inhibition interaction categories, where the prediction accuracy of both methods was similar, GNNGL-PPI exhibited a prediction accuracy more than 10% higher than that of GNN-PPI in the remaining five interaction categories.

Interpretability analysis of the effectiveness of GNNGL-PPI

To thoroughly evaluate the effectiveness of GNNGL-PPI, we applied the widely recognized dimensionality reduction algorithm t-SNE [55] to the SHS27K dataset, partitioned using the Random algorithm. t-SNE proved an ideal choice for this analysis, as it preserves the proximity of closely positioned data points after dimensionality reduction while maintaining the separation between originally distant data points. We used t-SNE to gain insights into the effectiveness of GNNGL-PPI by visualizing the clustered representations resulting from dimensionality reduction at different epochs of model training. Due to the rapid convergence of GNNGL-PPI in the early stages of training, we applied t-SNE to the features learned in the 1st, 5th, 20th, 50th, 100th, and 200th epochs. The clustering visualization obtained after one epoch of training (Fig. 3a) revealed that data points representing different interaction categories were intertwined, with no distinct separation between them. However, as the number of training epochs increased, the clustering visualizations after the 5th, 20th, 50th, and 100th epochs (Fig. 3b to e) clearly showed that data points representing different interaction categories gradually moved away from each other, while data points representing the same interaction category started to cluster together. By the 200th epoch, clear boundaries between data points of different interaction categories were observed (Fig. 3f). It is important to note that, since this study treated PPI as a multi-category classification task, some data points from different interaction categories may still appear in the same cluster. Nevertheless, the clustering visualizations from different training epochs indicated that GNNGL-PPI was an effective method for multi-category prediction of PPI.

Fig. 3.

Interpretability analysis of the effectiveness of GNNGL-PPI. We used the t-SNE algorithm to elucidate GNNGL-PPI's effectiveness by contrasting the clustered visualizations obtained from dimensionality reduction at different epochs of model training. The clustering visualization results (a) to (f) correspond to the features learned in the 1st, 5th, 20th, 50th, 100th, and 200th epochs, respectively. A clear pattern emerged: data points representing different interaction categories gradually moved apart as the number of training epochs increased, while data points representing the same interaction category gradually clustered together. These results showed that GNNGL-PPI was an effective multi-category PPI predictor
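A sketch of how this visualization can be produced; the feature and label array layouts, the default t-SNE settings, and the file naming are assumptions.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_epoch_embedding(features, labels, epoch):
    """Project learned features (n_samples, d) to 2-D with t-SNE and
    color points by interaction category (integer labels 0..6)."""
    xy = TSNE(n_components=2, random_state=0).fit_transform(features)
    plt.figure(figsize=(5, 5))
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=4, cmap="tab10")
    plt.title(f"Epoch {epoch}")
    plt.savefig(f"tsne_epoch_{epoch}.png", dpi=200)
    plt.close()

# e.g. for epoch in (1, 5, 20, 50, 100, 200): plot_epoch_embedding(...)
```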

Discussions

Although GNNGL-PPI has exhibited good performance in predicting multiple categories of protein-protein interactions and can provide an interpretability analysis of its effectiveness, it also has certain shortcomings:

  1. The training of GNNGL-PPI was conducted on two publicly available benchmark datasets, and the model’s learned deep features were limited. This limitation may result in a gap between the generalization performance of GNNGL-PPI and users’ expected thresholds.

  2. GNNGL-PPI relied on sequence features and GO annotations of proteins, while overlooking protein structural features. However, protein structural features often play a crucial role in the performance of PPI models.

  3. The utilization of GNNGL-PPI requires users to input protein-protein interaction networks, which undoubtedly increases the complexity for non-professionals and restricts the application and promotion of GNNGL-PPI.

In response to these shortcomings, we have carefully considered the issues and proposed potential solutions:

  1. To address the limited deep features learned by GNNGL-PPI, we suggest crawling a more comprehensive PPI interaction dataset from databases. By constructing a more extensive PPI interaction network for training GNNGL-PPI, we can gradually enhance its generalization performance.

  2. In order to incorporate protein structural features into GNNGL-PPI, we recommend extracting such features from protein topology graphs or three-dimensional grids using graph neural networks or 3D-CNN. This approach will enrich the protein features and improve the model’s performance.

  3. To alleviate the challenges faced by non-professionals in utilizing GNNGL-PPI, we propose developing an online service tool. With this tool, users would only need to input protein pairs, and the tool would automatically construct a protein-protein interaction network and predict the interaction categories between proteins.

In order to further improve the performance of GNNGL-PPI, we believe that in-depth research should be conducted in the following areas in the future:

  1. Introducing self-supervised contrastive learning into the GNNGL-PPI method would enhance its feature extraction ability, enabling the method to learn more of the key implicit features that affect multi-category prediction of protein-protein interactions.

  2. At present, the protein features used are still relatively limited; multiple modalities of information, such as protein sequences and structures, should be integrated so that the model can achieve better performance based on rich protein features.

  3. Deepening the interpretability analysis of protein-protein interaction prediction methods will not only promote the application of this method, but also support research into new methods.

GNNGL-PPI not only provides PPI prediction services that support users in understanding the mechanisms of protein-protein interaction; the predicted PPI categories, and the PPI networks constructed from them, can also provide deep and rich features for drug-target interaction prediction and other related tasks, which is helpful for drug screening, repositioning, and target identification research. We briefly review relevant work in Appendix A.

Conclusions

GNNGL-PPI is a novel method for multi-category prediction of PPI based on graph neural networks. This approach leverages GIN and GIN-AK to extract features from both the global graph and the local subgraphs of the PPI network. Additionally, the use of ASL enhances the model's performance on imbalanced datasets. Experimental results have shown that GNNGL-PPI surpasses existing state-of-the-art methods for PPI prediction on two standard datasets. The effectiveness of GNNGL-PPI is further supported by the t-SNE algorithm, which provides visual evidence of the model's capability. Therefore, GNNGL-PPI is a reliable multi-category prediction tool for protein-protein interactions.

Appendix A

Here, we will briefly mention that predicting protein-protein interactions can significantly contribute to drug-target interactions, target identification, and other related research in drug discovery, ultimately advancing drug development.

TripletMultiDTI [56] utilized protein-protein interaction (PPI) and drug-drug interaction (DDI) networks as supplementary knowledge. It integrated multimodal information, including drugs, proteins, DDI networks, and PPI networks, as inputs to predict the affinity of drug-target pairs. Furthermore, a review article [57] elaborated on the crucial role of protein-protein interactions in performing essential cellular functions. These interactions have been pivotal drug targets for the past two decades and are fundamental to drug development and design. Additionally, reference [58] provided a detailed exploration of the vital role of protein-protein interactions in structural biology and drug discovery.

Author contributions

All authors contributed to the study conception and design. X.Z.: ideas, writing (original manuscript), writing (review and editing), writing (revised manuscript). F.-F.M.: creation of models, data curation, implementation of the computer code and supporting algorithms. M.-L.W.: supervision. S.-J.L.: data analysis. Y.L.: development of methodology, formal analysis, modification guidance. All authors read and approved the final manuscript.

Funding

This work was supported by the National Natural Sciences Foundation of China (No. 62366002), Yunnan Fundamental Research Projects (No. 202101BA070001-227), and a grant (No. 2023KF005) from State Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan, Yunnan University.

Data availability

The source data and code repository can be accessed at https://github.com/dldxzx/GNNGL-PPI.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Raman K. Construction and analysis of protein–protein interaction networks. Autom Exp. 2010;2:2. doi: 10.1186/1759-4499-2-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Fields S, Sternglanz R. The two-hybrid system: an assay for protein-protein interactions. Trends Genet. 1994;10:286–92. doi: 10.1016/0168-9525(90)90012-U. [DOI] [PubMed] [Google Scholar]
  • 3.Zhu H, Bilgin M, Bangham R, et al. Global analysis of protein activities using proteome chips. Science. 2001;293:2101–5. doi: 10.1126/science.1062191. [DOI] [PubMed] [Google Scholar]
  • 4.Tong AHY, Evangelista M, Parsons AB, et al. Systematic Genetic Analysis with ordered arrays of yeast deletion mutants. Science. 2001;294:2364–8. doi: 10.1126/science.1065810. [DOI] [PubMed] [Google Scholar]
  • 5.Hu L, Wang X, Huang Y-A, et al. A survey on computational models for predicting protein–protein interactions. Brief Bioinform. 2021;22:bbab036. doi: 10.1093/bib/bbab036. [DOI] [PubMed] [Google Scholar]
  • 6.Hayashi T, Matsuzaki Y, Yanagisawa K, et al. MEGADOCK-Web: an integrated database of high-throughput structure-based protein-protein interaction predictions. BMC Bioinformatics. 2018;19:62. doi: 10.1186/s12859-018-2073-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Wu J, Liu B, Zhang J, et al. DL-PPI: a method on prediction of sequenced protein–protein interaction based on deep learning. BMC Bioinformatics. 2023;24:473. doi: 10.1186/s12859-023-05594-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Tsukiyama S, Hasan MM, Fujii S, et al. LSTM-PHV: prediction of human-virus protein–protein interactions by LSTM with word2vec. Brief Bioinform. 2021;22:bbab228. doi: 10.1093/bib/bbab228. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Kibar G, Vingron M. Prediction of protein–protein interactions using sequences of intrinsically disordered regions. Proteins. 2023;91:980–90. doi: 10.1002/prot.26486. [DOI] [PubMed] [Google Scholar]
  • 10.Romero-Molina S, Ruiz‐Blanco YB, Harms M, et al. PPI‐Detect: a support vector machine model for sequence‐based prediction of protein–protein interactions. J Comput Chem. 2019;40:1233–42. doi: 10.1002/jcc.25780. [DOI] [PubMed] [Google Scholar]
  • 11.Zhang M, Su Q, Lu Y et al. Application of machine learning approaches for protein-protein interactions prediction. MC 2017; 13. [DOI] [PubMed]
  • 12.Sze-To A, Fung S, Lee E-SA, et al. Prediction of protein–protein Interaction via co-occurring aligned pattern clusters. Methods. 2016;110:26–34. doi: 10.1016/j.ymeth.2016.07.018. [DOI] [PubMed] [Google Scholar]
  • 13.Chatterjee P, Basu S, Kundu M et al. PPI_SVM: prediction of protein-protein interactions using machine learning, domain-domain affinities and frequency tables. Cell Mol Biology Lett 2011; 16. [DOI] [PMC free article] [PubMed]
  • 14.Xu D, Xu H, Zhang Y, et al. Protein-protein interactions Prediction based on Graph Energy and protein sequence information. Molecules. 2020;25:1841. doi: 10.3390/molecules25081841. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Su X-R, Hu L, You Z-H, et al. Multi-view heterogeneous molecular network representation learning for protein–protein interaction prediction. BMC Bioinformatics. 2022;23:234. doi: 10.1186/s12859-022-04766-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Ahmed I, Witbooi P, Christoffels A. Prediction of human- Bacillus anthracis protein–protein interactions using multi-layer neural network. Bioinformatics. 2018;34:4159–64. doi: 10.1093/bioinformatics/bty504. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Canziani A, Paszke A, Culurciello E. An Analysis of Deep Neural Network Models for Practical Applications. 2017.
  • 18.He K, Zhang X, Ren S et al. Deep Residual Learning for Image Recognition. 2015.
  • 19.Zaremba W, Sutskever I, Vinyals O. Recurrent Neural Network Regularization. 2015.
  • 20.Vaswani A, Shazeer N, Parmar N et al. Attention Is All You Need. 2017.
  • 21.Xu K, Hu W, Leskovec J et al. How Powerful are Graph Neural Networks? International Conference on Learning Representations. 2019.
  • 22.Sun T, Zhou B, Lai L, et al. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinformatics. 2017;18:277. doi: 10.1186/s12859-017-1700-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Li X, Han P, Wang G, et al. SDNN-PPI: self-attention with deep neural network effect on protein-protein interaction prediction. BMC Genomics. 2022;23:474. doi: 10.1186/s12864-022-08687-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Mahapatra S, Gupta VR, Sahu SS, et al. Deep neural network and Extreme Gradient boosting based hybrid classifier for Improved Prediction of Protein-Protein Interaction. IEEE/ACM Trans Comput Biol Bioinf. 2022;19:155–65. doi: 10.1109/TCBB.2021.3061300. [DOI] [PubMed] [Google Scholar]
  • 25.Zhou X, Song H, Li J. Residue-frustration-based prediction of protein–protein interactions using machine learning. J Phys Chem B. 2022;126:1719–27. doi: 10.1021/acs.jpcb.1c10525. [DOI] [PubMed] [Google Scholar]
  • 26.Asim MN, Ibrahim MA, Malik MI, et al. ADH-PPI: an attention-based deep hybrid model for protein-protein interaction prediction. iScience. 2022;25:105169. doi: 10.1016/j.isci.2022.105169. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Chen W, Wang S, Song T, et al. DCSE:Double-Channel-Siamese-Ensemble model for protein protein interaction prediction. BMC Genomics. 2022;23:555. doi: 10.1186/s12864-022-08772-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Soleymani F, Paquet E, Viktor H, et al. Protein–protein interaction prediction with deep learning: a comprehensive review. Comput Struct Biotechnol J. 2022;20:5316–41. doi: 10.1016/j.csbj.2022.08.070. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9. doi: 10.1038/s41586-021-03819-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Kim G, Lee S, Karin EL et al. Easy and accurate protein structure prediction using ColabFold. 2023. [DOI] [PubMed]
  • 31.Jha K, Saha S, Singh H. Prediction of protein–protein interaction using graph neural networks. Sci Rep. 2022;12:8360. doi: 10.1038/s41598-022-12201-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Song B, Luo X, Luo X, et al. Learning spatial structures of proteins improves protein–protein interaction prediction. Brief Bioinform. 2022;23:bbab558. doi: 10.1093/bib/bbab558. [DOI] [PubMed] [Google Scholar]
  • 33.Yang F, Fan K, Song D, et al. Graph-based prediction of protein-protein interactions with attributed signed graph embedding. BMC Bioinformatics. 2020;21:323. doi: 10.1186/s12859-020-03646-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Gao Z, Jiang C, Zhang J, et al. Hierarchical graph learning for protein–protein interaction. Nat Commun. 2023;14:1093. doi: 10.1038/s41467-023-36736-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Singh R, Devkota K, Sledzieski S, et al. Topsy-Turvy: integrating a global view into sequence-based PPI prediction. Bioinformatics. 2022;38:i264–72. doi: 10.1093/bioinformatics/btac258. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Chen M, Ju CJ-T, Zhou G, et al. Multifaceted protein–protein interaction prediction based on siamese residual RCNN. Bioinformatics. 2019;35:i305–14. doi: 10.1093/bioinformatics/btz328. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Lv G, Hu Z, Bi Y, et al. Learning Unknown from Correlations: Graph Neural Network for Inter-novel-protein Interaction Prediction. 2021.
  • 38.Zhao Z, Qian P, Yang X, et al. SemiGNN-PPI: Self-ensembling Multi-graph Neural Network for Efficient and Generalizable Protein-Protein Interaction Prediction. 2023.
  • 39.Kang Y, Elofsson A, Jiang Y, et al. AFTGAN: prediction of multi-type PPI based on attention free transformer and graph attention network. Bioinformatics. 2023;39:btad052. doi: 10.1093/bioinformatics/btad052. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Hu F, Hu Y, Zhang W, et al. A Multimodal Protein Representation Framework for Quantifying Transferability across Biochemical Downstream Tasks. Adv Sci. 2023;10:2301223. doi: 10.1002/advs.202301223. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Jeong JC, Lin X, Chen X-W. On position-specific Scoring Matrix for protein function prediction. IEEE/ACM Trans Comput Biol Bioinf. 2011;8:308–15. doi: 10.1109/TCBB.2010.93. [DOI] [PubMed] [Google Scholar]
  • 42.Remmert M, Biegert A, Hauser A, et al. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2012;9:173–5. doi: 10.1038/nmeth.1818. [DOI] [PubMed] [Google Scholar]
  • 43.Li H. Deep learning for natural language processing: advantages and challenges. Natl Sci Rev. 2018;5:24–6. doi: 10.1093/nsr/nwx110. [DOI] [Google Scholar]
  • 44.Maas AL, Hannun AY, Ng AY et al. Rectifier nonlinearities improve neural network acoustic models. Proc. icml. 2013; 30:3.
  • 45.Xu K, Hu W, Leskovec J et al. How Powerful are Graph Neural Networks? 2019.
  • 46.Chen Z, Villar S, Chen L, et al. On the equivalence between graph isomorphism testing and function approximation with gnns. Advances in neural information processing systems 2019; 32.
  • 47.Weisfeiler B, Leman A. The reduction of a graph to canonical form and the algebra which appears therein. nti, Series 1968; 2:12–16.
  • 48.Ridnik T, Ben-Baruch E, Zamir N et al. Asymmetric Loss For Multi-Label Classification. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021; 82–91.
  • 49.Lin T-Y, Goyal P, Girshick R et al. Focal loss for dense object detection. Proceedings of the IEEE international conference on computer vision. 2017; 2980–2988.
  • 50.Jadon S. A survey of loss functions for semantic segmentation. 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB) 2020; 1–7.
  • 51.Li L, Doroslovacki M, Loew MH. Approximating the gradient of cross-entropy loss function. IEEE Access. 2020;8:111626–35. doi: 10.1109/ACCESS.2020.3001531. [DOI] [Google Scholar]
  • 52.Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. 2017.
  • 53.Zeng M, Zou B, Wei F et al. Effective prediction of three common diseases by combining SMOTE with Tomek links technique for imbalanced medical data. 2016; 225–8.
  • 54.Zhong W, He C, Xiao C, et al. Long-distance dependency combined multi-hop graph neural networks for protein–protein interactions prediction. BMC Bioinformatics. 2022;23:521. doi: 10.1186/s12859-022-05062-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Linderman GC, Rachh M, Hoskins JG, et al. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat Methods. 2019;16:243–5. doi: 10.1038/s41592-018-0308-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Dehghan A, Razzaghi P, Abbasi K, et al. TripletMultiDTI: Multimodal representation learning in drug-target interaction prediction with triplet loss function. Expert Syst Appl. 2023;232:120754. doi: 10.1016/j.eswa.2023.120754. [DOI] [Google Scholar]
  • 57.Lee AC-L, Harris JL, Khanna KK, et al. A Comprehensive Review on current advances in peptide Drug Development and Design. IJMS. 2019;20:2383. doi: 10.3390/ijms20102383. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Jubb H, Higueruelo AP, Winter A, et al. Structural biology and drug discovery for protein–protein interactions. Trends Pharmacol Sci. 2012;33:241–8. doi: 10.1016/j.tips.2012.03.006. [DOI] [PubMed] [Google Scholar]


