Abstract
Proteins primarily perform their functions through interactions with other proteins, making the accurate prediction of protein–protein interactions (PPIs) a fundamental problem. Experimental methods for determining PPIs are often slow and expensive, which has driven significant efforts to improve the performance of computational methods in this field. While many methods have been designed, recent thorough investigations showed that the existing methods learn exclusively from sequence similarities and node degrees; when such data leakage is avoided, their performance becomes random. We introduce C3PI, a novel sequence-based deep learning framework for predicting PPIs. C3PI feeds ProtT5 protein embeddings into a complex architecture that includes two novel components, a puzzler and an entangler, which significantly enhance the model's performance. Through extensive comparisons with state-of-the-art methods across many datasets, C3PI consistently outperforms competing approaches, especially in the key metrics AUPRC and AUROC. Most importantly, C3PI is the first PPI prediction method to achieve a significant improvement over random on the leakage-free gold standard dataset. C3PI is available as a web server at c3pi.csd.uwo.ca and its source code at github.com/lucian-ilie/C3PI.
Keywords: protein interaction, machine learning, protein embedding, ProtT5
Introduction
Proteins play a crucial role in various cellular processes, including cell growth, gene expression, and intercellular communication [1]. In the late 1990s, analyses of protein functions predominantly focused on individual proteins [2]. However, to gain a comprehensive understanding of protein functionality, it is essential to study proteins in relation to their interacting partners. Since proteins operate collaboratively to ensure proper functionality, examining them within the context of their interactions is imperative. With the publication of the human genome [3] and advancements in proteomics, it has become increasingly important to understand not only the functions of individual proteins but also how they interact with one another within cellular environments.
Protein–protein interactions (PPIs) play numerous critical roles in our bodies, serving as a foundation for various biological processes. One significant application of understanding PPIs is in predicting the function of target proteins, which can provide insights into their roles within the cellular environment. Additionally, PPIs are crucial for determining the effectiveness of drugs, as many therapeutic agents function by modulating these interactions to achieve a desired biological response [4].
In the context of a living organism or cell, PPIs are interpreted as physical contacts between proteins, which occur through molecular docking mechanisms. These interactions enable proteins to form complexes, communicate signals, and execute a wide range of cellular functions essential for maintaining life [5]. Understanding the specifics of how proteins interact at a molecular level not only enhances our knowledge of fundamental biological processes but also aids in the development of new therapeutic strategies and drug designs.
Experimental methods for identifying protein interactions, such as immunoprecipitation utilizing protein-specific antibodies and Protein A/G affinity beads, are known for being time-consuming, labor-intensive, and expensive [6]. Co-immunoprecipitation is widely recognized as a standard technique in the field, employing antibodies and affinity beads to identify molecules that interact with proteins. Another experimental approach is the pull-down assay, which involves affinity purification with multiple wash and elution steps [7]. Additionally, techniques like surface plasmon resonance [8], bacterial two-hybrid [9], and cytology two-hybrid [10] are used to investigate protein interactions in vitro.
In contrast to experimental methods, computational approaches are rapid and cost-effective. Consequently, numerous studies aim to predict protein interactions using computational methods. The protein interaction prediction problem is typically framed as a classification challenge. Classifiers undergo two key phases: training, where the classifier learns from available data, and testing, where the classifier is tasked with predicting the correct class for unseen input data. Given the potential for prediction errors, multiple metrics are employed to assess the performance of classifiers.
We focus in this study on sequence-based programs for two reasons. First, the number of available protein sequences exceeds that of structures by two orders of magnitude [11]. Second, while the advent of AlphaFold [12, 13] made predicted structures significantly better than before, it cannot reliably predict structures for proteins containing intrinsically disordered regions, which are the ones with the most interactions. Therefore, sequence-based prediction is still the best solution. Many sequence-based programs have been designed for PPI prediction, employing a variety of methods. The most popular include DeepFE [14], DPPI [15], R-FC and R-LSTM [16], Bio2Vec [17], PIPR [18], StackPPI [19], D-SCRIPT [20], TAGPPI [21], and Topsy-Turvy [22].
The computational methods have often been claimed to achieve very high accuracy (95%–99%) [23], seemingly implying that this is a solved problem. Bernett et al. [23] noticed that the overlaps between training and test sets resulting from random splitting lead to strongly overestimated performances. They investigated this issue thoroughly and concluded that the PPI prediction problem is not only unsolved but actually wide open. They prove that the current models learn exclusively from sequence similarity and node degree (number of interacting proteins), rather than identifying more complex sequence features representing binding pockets, protein domains, or similar motifs [23]. When data leakage is avoided, performance was shown to become random. This renders the methods unequipped to handle interactions of dark proteins [24], which have no known similarity with other sequences. In order to help future development of sequence-based PPI prediction methods, they built a gold standard dataset, with training, validation, and testing components, that is free from data leakage. They tested many programs on this dataset and proved that their performance is random or very close to it.
In this paper, we introduce C3PI, a new deep learning program for PPI prediction. C3PI feeds ProtT5 protein embeddings [25] into a complex architecture that includes two novel components: a puzzler and an entangler. The puzzler produces random permutations of segments of each input protein sequence, whereas the entangler performs a mid-level combination of features extracted from the two input sequences. We perform a comprehensive evaluation of its performance by comparing it with state-of-the-art tools on six species datasets used by top competitors, as well as on the gold standard dataset mentioned above. C3PI has the highest AUROC and AUPRC in all but one of the species tests. On the gold standard dataset, C3PI is the first method to achieve significant improvement over random.
Materials and methods
Datasets
To assess the performance of C3PI in predicting PPIs, we utilized two types of datasets: the species datasets used by D-SCRIPT [20] and the gold standard dataset developed by Bernett et al. [23].
The species datasets were sourced from the STRING database (version 11) [26] and processed to keep only high-confidence interactions, then filtered by clustering proteins at a 40% similarity threshold using CD-HIT [27, 28], in order to mitigate the model's reliance on sequence similarity alone for memorizing interactions. Negative PPIs were produced by randomly pairing proteins from the non-redundant set [15].
Our human PPI dataset follows different guidelines. While D-SCRIPT maintains a 10:1 ratio of negative to positive interactions, we opted for a 1:1 ratio in the training set and 10:1 in the validation set. This was done to achieve better training while maintaining universal applicability to many species. The fairly large total number of pairs allowed us to employ about 95% of the data for training and the remainder for validation, maximizing the amount of training data while preserving a meaningful validation set.
The gold standard dataset is taken from Bernett et al. [23]. It was purposefully built to be free from data leakage and to minimize pairwise sequence similarities. We refer to [23] for details. All datasets are shown in Table 1.
Table 1.
Datasets
| Species datasets | Pairs | Proteins | Lengths | Avg. len. |
|---|---|---|---|---|
| E.coli | 22 000 | 4437 | 50–799 | 286.42 |
| S.cerevisiae | 55 000 | 5664 | 50–800 | 341.53 |
| D.melanogaster | 55 000 | 19 213 | 50–800 | 387.97 |
| C.elegans | 55 000 | 25 429 | 50–800 | 351.70 |
| M.musculus | 55 000 | 37 497 | 50–800 | 392.85 |
| H.sapiens (train) | 95 864 | 14 272 | 50–800 | 382.79 |
| H.sapiens (validate) | 4219 | 1536 | 51–800 | 381.38 |
| H.sapiens (test) | 52 725 | 15 525 | 50–800 | 382.65 |
| Gold standard dataset (H.sapiens) | | | | |
| Train (INTRA1) | 163 192 | 4286 | 54–14 507 | 752.71 |
| Validate (INTRA0) | 59 260 | 3711 | 56–34 350 | 590.48 |
| Test (INTRA2) | 52 048 | 3022 | 51–5183 | 504.73 |
Architecture
The proposed PPI prediction model is organized as a two-branch, multi-scale feature extractor followed by a fusion and classification head. Each branch independently processes the ProtT5 embedding of one protein sequence, extracting features at six distinct length scales. The outputs from both branches are then fused, aggregated into a single representation, and passed through a final multilayer classifier to calculate the probability of interaction.
Puzzler
Protein embeddings do not directly encode the 3D proximity of nonadjacent residues. It is often the case that residues that are far apart in the sequence become spatially close in the 3D structure. Therefore, the main idea of the puzzler component is to expose the model to alternative residue orderings. Each protein sequence is divided into pieces, which are then shuffled to produce a permuted sequence. These permuted sequences are then fed as input to the model. More precisely, each protein sequence is first normalized to a length of 800, as customary in the area, by either padding or truncating, and then split into 15 pieces of size 53 each; this uses the first 795 residues, while the last 5 are ignored.
The ProtT5 embedding vectors are computed for each residue in the original, unpermuted, sequence. The permuted sequences use these ProtT5 vectors. The 15 pieces are randomly shuffled by the puzzler to produce 8 permuted sequences for each input protein sequence; see Fig. 1. Each input protein is permuted according to 8 fixed permutations, the first of which is the identity, which means the original sequence is always used. During training, for each input protein pair, the puzzler generates eight pairs by combining the corresponding permutations. Each of these is treated as a separate input pair with the same label as the original one. Each pair is passed through the model individually, allowing it to learn from varied representations of the same underlying data. At inference, eight separate permuted pairs are again generated using the same permutations and each instance is independently passed through the trained model to produce a prediction. The final output is obtained by averaging the eight predictions.
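The piece-splitting and shuffling described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code, and the function and constant names are our own; it operates on a list of per-residue embedding vectors (e.g. 1024-dimensional ProtT5 vectors) for a sequence already padded or truncated to length 800.

```python
import random

SEQ_LEN = 800      # padded/truncated sequence length
PIECE_LEN = 53     # length of one piece
NUM_PIECES = 15    # 15 * 53 = 795 residues used; the last 5 are ignored
NUM_PERMS = 8      # permutations per protein; the first is the identity

def make_permutations(num_pieces=NUM_PIECES, num_perms=NUM_PERMS, seed=0):
    """Fixed set of piece orderings; the first is always the identity."""
    rng = random.Random(seed)
    perms = [list(range(num_pieces))]
    while len(perms) < num_perms:
        p = list(range(num_pieces))
        rng.shuffle(p)
        if p not in perms:          # keep the eight permutations distinct
            perms.append(p)
    return perms

def puzzle(embedding, perm):
    """Reorder per-residue embedding vectors piece-wise.

    `embedding` is a list of SEQ_LEN per-residue vectors; the vectors are
    computed on the original sequence and only their order changes.
    """
    pieces = [embedding[i * PIECE_LEN:(i + 1) * PIECE_LEN]
              for i in range(NUM_PIECES)]
    out = []
    for j in perm:
        out.extend(pieces[j])
    return out
```

At inference, the same eight permutations are applied to each protein of a pair and the eight resulting predictions are averaged.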
Figure 1.
The puzzler component: the original sequence produces eight permuted versions, the first of which is the original sequence
We have considered three candidates for the embedding method: Ankh [29], ESM-2 [30], and ProtT5 [25]. Comparison on the validation set indicated that ProtT5 performed the best, and it was thus chosen for the model.
Different sets of random permutations have been tested, but no significant difference in predictive power over the training dataset was detected. While the idea deserves further analysis, it appears that as long as parts of the protein sequences are moved around, the predictive power improves significantly. The Ablation section demonstrates the power of the puzzler component of the architecture.
Limited testing has also been performed for the size of the permuted pieces. For example, splitting each sequence of length 800 into the more natural 16 blocks of length 50 produces a model that is slightly less effective. This, like other aspects of the architecture, could potentially benefit from further tuning.
Multi-scale convolutional dense block
At the core of each branch lies a set of six convolutional dense modules, designed to capture sequence patterns ranging from global to highly local. Each module applies a 1D convolution with a specific kernel size and stride, followed by batch normalization and a nonlinear activation. The resulting feature maps are then flattened and projected via a fully connected layer into a fixed dimensional embedding. A dropout layer follows each dense projection to reduce overfitting.
The six (kernel, stride, output-dim) configurations are chosen as: (795, 1, 16), (400, 200, 32), (200, 100, 64), (100, 50, 128), (50, 25, 256), and (20, 10, 512). These settings ensure that the first module effectively sees the entire input sequence (capturing global context), while the last module focuses on short subsequences with a stride of 10 (capturing very local patterns). Flattening and projecting into a relatively low-dimensional embedding at each scale encourages the network to summarize each receptive field into a compact feature vector. Figure 2 shows the overall structure of a convolutional dense block.
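The receptive-field arithmetic behind these six configurations can be checked with a short script. Assuming the convolutions use no padding and the branch sees the 795 retained residues, the standard formula ⌊(L − k)/s⌋ + 1 gives the number of windows each module produces; a full PyTorch implementation would wrap each configuration in an `nn.Conv1d`.

```python
# (kernel, stride, output_dim) for the six convolutional dense modules
CONFIGS = [(795, 1, 16), (400, 200, 32), (200, 100, 64),
           (100, 50, 128), (50, 25, 256), (20, 10, 512)]
SEQ_LEN = 795  # residues seen by each branch (15 pieces of 53)

def conv_positions(length, kernel, stride):
    """Number of positions a 1D convolution produces (no padding assumed)."""
    return (length - kernel) // stride + 1

for k, s, d in CONFIGS:
    n = conv_positions(SEQ_LEN, k, s)
    # each module flattens its n feature-map positions and projects to d dims
    print(f"kernel={k:3d} stride={s:3d} -> {n:2d} windows, embedding dim {d}")
```

The first module produces a single window spanning the whole sequence (global context), while the last produces many short overlapping windows (local patterns); the six embedding dimensions sum to 1008, the size of the aggregated representation used later.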
Figure 2.
Multi-scale convolutional dense block: each configuration captures protein sequence features at a specific resolution, ranging from global to highly local patterns.
Entangler
To model PPIs, two parallel branches with identical structures independently process the embeddings of the two input proteins. For each protein sequence $P_i$, $i \in \{1, 2\}$, each branch produces six embeddings $e_i^{(s)}$, $s = 1, \dots, 6$, of dimensions $d_s \in \{16, 32, 64, 128, 256, 512\}$. By processing each protein through the same hierarchy of kernel sizes and strides, each branch learns to extract complementary features at matching scales for the two partners. Figure 3 shows the overall structure of a branch block.
Figure 3.
Architecture of a single CNN branch: each branch processes one protein and extracts six scale-specific embeddings using the multi-scale convolutional dense block.
Rather than fusing both protein embeddings only at the very end, the entangler performs a mid-level combination at each scale. Specifically, for scale $s$, the two feature vectors $e_1^{(s)}, e_2^{(s)} \in \mathbb{R}^{d_s}$ are concatenated into a $2d_s$-dimensional vector. This concatenated vector is passed through a small fully connected layer, mapping $\mathbb{R}^{2d_s} \to \mathbb{R}^{d_s}$, with a nonlinear activation and dropout. The intuition behind this design is that it allows the network to learn interactions between the two proteins' features that are specific to a given length scale, while keeping each fused vector at the same dimensionality, $d_s$, as its original scale, simplifying downstream aggregation. Figure 4 shows the overall structure of an entangler block.
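A toy version of this scale-wise fusion, with explicit weights and ReLU but without dropout, illustrates the mapping from the concatenated vector of size 2·d back to d dimensions. This is an illustrative sketch with hypothetical names, not the trained layer.

```python
def entangle(e1, e2, weights, bias):
    """Fuse two same-scale embeddings e1, e2 (each of dimension d):
    concatenate into a 2*d vector, then apply a fully connected layer
    mapping 2*d -> d with a ReLU nonlinearity (dropout omitted here)."""
    d = len(e1)
    assert len(e2) == d
    assert len(weights) == d and all(len(row) == 2 * d for row in weights)
    cat = list(e1) + list(e2)  # 2*d-dimensional concatenation
    fused = []
    for row, b in zip(weights, bias):
        z = sum(w * x for w, x in zip(row, cat)) + b
        fused.append(max(0.0, z))  # ReLU keeps the fused vector at dimension d
    return fused
```

Because the output has the same dimensionality as each input embedding, the six fused vectors can be concatenated directly for the downstream classifier.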
Figure 4.
Scale-wise entangler block: at each scale, embeddings from both protein branches are concatenated and passed through a shared transformation to model interactions.
Aggregation and final classification
After generating the six fused vectors, these are concatenated into a single high-dimensional representation of total size $16 + 32 + 64 + 128 + 256 + 512 = 1008$. This 1008-dimensional vector is then passed through a three-layer multilayer perceptron: (i) a linear projection $1008 \to 64$, followed by ReLU and dropout (rate 0.3), (ii) a linear projection $64 \to 8$, followed by ReLU and dropout (rate 0.3), and (iii) a linear projection $8 \to 1$, followed by a sigmoid activation. The final scalar output represents the predicted probability that the two input proteins interact. Figure 5 shows the overall structure of a classifier block.
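A minimal forward pass of such a classifier can be sketched in plain Python, assuming the 1008 → 64 → 8 → 1 layer sizes with randomly initialized weights (our own sketch; dropout is active only during training and is omitted here).

```python
import math
import random

def mlp_classifier(x, layers):
    """Forward pass of the final MLP: ReLU between layers, sigmoid at the end."""
    for i, (W, b) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]      # ReLU on the hidden layers
    return 1.0 / (1.0 + math.exp(-x[0]))      # sigmoid -> interaction probability

def random_layer(n_in, n_out, rng):
    """Hypothetical initialization for illustration only."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

rng = random.Random(0)
layers = [random_layer(1008, 64, rng),
          random_layer(64, 8, rng),
          random_layer(8, 1, rng)]
x = [rng.uniform(-1.0, 1.0) for _ in range(1008)]  # the 1008-dim fused vector
p = mlp_classifier(x, layers)                      # a probability in (0, 1)
```

In the full model this probability is computed for each of the eight permuted pairs and the eight values are averaged.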
Figure 5.
Final classifier block: the six fused embeddings are aggregated and passed through a multilayer perceptron to predict PPI probability.
Complete architecture
The entire architecture is presented in Fig. 6, which shows precisely how all the previously discussed parts are assembled. The puzzler (Fig. 1) produces eight permuted sequences for each protein, the first of which is the original; the permuted pairs are fed through the branches and the entangler (Fig. 4), and each receives a prediction from the classifier block (Fig. 5); the final prediction is the average of the eight predictions.
Figure 6.
The overall architecture of the model: the entire architecture of the model is summarized, indicating how the puzzler, the entangler and the other blocks fit together when applied to a pair of protein sequences.
Competing methods
When evaluating methods to compare with, one method stood out, Topsy-Turvy [22], which is in agreement with the study of Bernett et al. [23]. Therefore, we spent a considerable amount of effort on this method. We ran both D-SCRIPT, the previous method from the same group, and Topsy-Turvy; however, while the D-SCRIPT results match those in their paper, the Topsy-Turvy ones do not. We communicated with the authors, who helped greatly by providing several trained models, but unfortunately none could match the results in the paper. Therefore, we present both the reported results and the best of the models sent to us. We also ran the R-FC and R-LSTM models on all species datasets. The performance of the pretrained models was very poor, with sensitivity close to zero. To address this, we retrained the models using our training dataset, after which the performance improved significantly. In addition, we include the results of another competitive method, PIPR. All these methods were also included in the study of Bernett et al. [23].
Evaluation scheme
In order to assess the performance of the tested models, we have considered many parameters: sensitivity (recall), specificity, precision, accuracy, F1-score, Matthews correlation coefficient (MCC), and the areas under the receiver operating characteristic curve (AUROC) and the precision–recall curve (AUPRC). The AUPRC is the best indicator of performance for skewed data [31, 32], which makes it most relevant for imbalanced datasets. For balanced datasets, both AUROC and AUPRC reflect well the overall performance of the models. We compared the areas under the curves using DeLong's test for AUROC and bootstrap and permutation tests for AUPRC.
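For reference, the threshold-dependent metrics above can all be computed from the four confusion-matrix counts; the following is a minimal sketch with a hypothetical function name.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Threshold-dependent metrics from confusion-matrix counts."""
    sens = tp / (tp + fn)                      # sensitivity (recall)
    spec = tn / (tn + fp)                      # specificity
    prec = tp / (tp + fp)                      # precision
    acc = (tp + tn) / (tp + fp + tn + fn)      # accuracy
    f1 = 2 * prec * sens / (prec + sens)       # F1-score
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"sens": sens, "spec": spec, "prec": prec,
            "acc": acc, "f1": f1, "mcc": mcc}
```

AUROC and AUPRC, in contrast, are threshold-independent and summarize the full ranking behaviour of a model.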
Results
Species data
We compare in this section the performance of C3PI against D-SCRIPT, PIPR, PIPR+D-SCRIPT, R-FC, R-LSTM, and Topsy-Turvy on the species datasets from Table 1. The results are shown in Table 2; ROC and PR curves for the methods we ran are shown in Fig. 7.
Table 2.
Performance comparisons of methods and area under curve statistics across six different species: eight parameters are used for comparison: sensitivity, specificity, precision, accuracy, F1-score, Matthews correlation coefficient, and the areas under the ROC and PR curves; p-values are given for DeLong's test (AUROC) and for bootstrap and permutation tests (AUPRC) for the pairwise comparison between C3PI and the method in that row; p-values that are not significant are shown in red.
| Dataset | Model | Sens. | Spec. | Prec. | Acc. | F1 | MCC | AUROC | AUPRC | DeLong | Bootstrap | Permutation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E.coli | D-SCRIPTᵃ | 0.520 | | 0.791 | | 0.627 | | 0.863 | 0.571 | | | |
| | D-SCRIPT | 0.382 | 0.987 | 0.788 | 0.921 | 0.515 | 0.515 | 0.860 | 0.574 | 8.03E-228 | 1.20E-12 | 0.00E+00 |
| | PIPRᵃ | 0.131 | | 0.629 | | 0.217 | | 0.675 | 0.308 | | | |
| | PIPR+D-SCRIPTᵃ | 0.394 | | 0.793 | | 0.526 | | 0.863 | 0.588 | | | |
| | R-FC | 0.377 | 0.879 | 0.237 | 0.833 | 0.291 | 0.209 | 0.677 | 0.233 | 0.00E+00 | 2.78E-290 | 0.00E+00 |
| | R-LSTM | 0.241 | 0.936 | 0.272 | 0.872 | 0.256 | 0.187 | 0.692 | 0.227 | 0.00E+00 | 0.00E+00 | 0.00E+00 |
| | Topsy-Turvyᵃ | | | | | | | 0.805 | 0.556 | | | |
| | Topsy-Turvy | 0.464 | 0.930 | 0.451 | 0.879 | 0.457 | 0.389 | 0.643 | 0.405 | 0.00E+00 | 1.42E-94 | 0.00E+00 |
| | C3PI | 0.675 | 0.910 | 0.429 | 0.889 | 0.524 | 0.475 | 0.873 | 0.633 | | | |
| S.cerevisiae | D-SCRIPTᵃ | 0.223 | | 0.706 | | 0.339 | | 0.789 | 0.405 | | | |
| | D-SCRIPT | 0.223 | 0.991 | 0.706 | 0.921 | 0.339 | 0.368 | 0.789 | 0.405 | 0.00E+00 | 3.72E-64 | 0.00E+00 |
| | PIPRᵃ | 0.085 | | 0.398 | | 0.140 | | 0.718 | 0.230 | | | |
| | PIPR+D-SCRIPTᵃ | 0.225 | | 0.708 | | 0.341 | | 0.789 | 0.417 | | | |
| | R-FC | 0.347 | 0.850 | 0.188 | 0.805 | 0.244 | 0.152 | 0.652 | 0.176 | 0.00E+00 | 0.00E+00 | 0.00E+00 |
| | R-LSTM | 0.150 | 0.942 | 0.204 | 0.870 | 0.173 | 0.105 | 0.668 | 0.163 | 0.00E+00 | 0.00E+00 | 0.00E+00 |
| | Topsy-Turvyᵃ | | | | | | | 0.850 | 0.534 | | | |
| | Topsy-Turvy | 0.544 | 0.942 | 0.484 | 0.906 | 0.512 | 0.461 | 0.762 | 0.431 | 0.00E+00 | 3.25E-36 | 0.00E+00 |
| | C3PI | 0.690 | 0.893 | 0.391 | 0.874 | 0.499 | 0.456 | 0.886 | 0.536 | | | |
| D.melanogaster | D-SCRIPTᵃ | 0.359 | | 0.798 | | 0.495 | | 0.824 | 0.552 | | | |
| | D-SCRIPT | 0.359 | 0.991 | 0.798 | 0.933 | 0.495 | 0.508 | 0.824 | 0.552 | 0.00E+00 | 4.21E-53 | 0.00E+00 |
| | PIPRᵃ | 0.121 | | 0.521 | | 0.196 | | 0.728 | 0.278 | | | |
| | PIPR+D-SCRIPTᵃ | 0.361 | | 0.798 | | 0.497 | | 0.824 | 0.562 | | | |
| | R-FC | 0.365 | 0.881 | 0.235 | 0.834 | 0.286 | 0.203 | 0.691 | 0.219 | 0.00E+00 | 0.00E+00 | 0.00E+00 |
| | R-LSTM | 0.206 | 0.931 | 0.231 | 0.865 | 0.218 | 0.145 | 0.660 | 0.188 | 0.00E+00 | 0.00E+00 | 0.00E+00 |
| | Topsy-Turvyᵃ | | | | | | | 0.921 | 0.713 | | | |
| | Topsy-Turvy | 0.741 | 0.955 | 0.621 | 0.935 | 0.675 | 0.643 | 0.896 | 0.649 | 0.00E+00 | 1.43E-04 | 0.00E+00 |
| | C3PI | 0.823 | 0.892 | 0.432 | 0.886 | 0.567 | 0.543 | 0.931 | 0.683 | | | |
| C.elegans | D-SCRIPTᵃ | 0.306 | | 0.840 | | 0.449 | | 0.813 | 0.548 | | | |
| | D-SCRIPT | 0.306 | 0.994 | 0.840 | 0.932 | 0.448 | 0.482 | 0.813 | 0.548 | 0.00E+00 | 9.91E-132 | 0.00E+00 |
| | PIPRᵃ | 0.142 | | 0.673 | | 0.235 | | 0.757 | 0.346 | | | |
| | PIPR+D-SCRIPTᵃ | 0.308 | | 0.841 | | 0.451 | | 0.814 | 0.559 | | | |
| | R-FC | 0.352 | 0.878 | 0.223 | 0.830 | 0.273 | 0.188 | 0.682 | 0.218 | 0.00E+00 | 0.00E+00 | 0.00E+00 |
| | R-LSTM | 0.217 | 0.944 | 0.278 | 0.878 | 0.244 | 0.180 | 0.702 | 0.214 | 0.00E+00 | 0.00E+00 | 0.00E+00 |
| | Topsy-Turvyᵃ | | | | | | | 0.906 | 0.700 | | | |
| | Topsy-Turvy | 0.645 | 0.970 | 0.683 | 0.941 | 0.663 | 0.631 | 0.851 | 0.616 | 0.00E+00 | 2.49E-53 | 0.00E+00 |
| | C3PI | 0.718 | 0.957 | 0.625 | 0.935 | 0.668 | 0.634 | 0.943 | 0.740 | | | |
| M.musculus | D-SCRIPTᵃ | 0.346 | | 0.818 | | 0.486 | | 0.833 | 0.580 | | | |
| | D-SCRIPT | 0.346 | 0.992 | 0.818 | 0.934 | 0.486 | 0.506 | 0.833 | 0.580 | 0.00E+00 | 1.49E-06 | 0.00E+00 |
| | PIPRᵃ | 0.331 | | 0.734 | | 0.456 | | 0.839 | 0.526 | | | |
| | PIPR+D-SCRIPTᵃ | 0.355 | | 0.820 | | 0.495 | | 0.838 | 0.609 | | | |
| | R-FC | 0.586 | 0.870 | 0.311 | 0.845 | 0.407 | 0.349 | 0.807 | 0.375 | 0.00E+00 | 0.00E+00 | 0.00E+00 |
| | R-LSTM | 0.229 | 0.961 | 0.367 | 0.894 | 0.282 | 0.236 | 0.742 | 0.272 | 0.00E+00 | 0.00E+00 | 0.00E+00 |
| | Topsy-Turvyᵃ | | | | | | | 0.934 | 0.735 | | | |
| | Topsy-Turvy | 0.749 | 0.948 | 0.592 | 0.930 | 0.661 | 0.628 | 0.885 | 0.636 | 0.00E+00 | | |
| | C3PI | 0.800 | 0.883 | 0.406 | 0.875 | 0.538 | 0.512 | 0.923 | 0.622 | | | |
| H.sapiens | D-SCRIPTᵃ | 0.386 | | 0.833 | | 0.527 | | 0.854 | 0.617 | | | |
| | D-SCRIPT | 0.386 | 0.992 | 0.833 | 0.937 | 0.527 | 0.541 | 0.854 | 0.617 | 0.00E+00 | 8.24E-266 | 0.00E+00 |
| | PIPRᵃ | 0.701 | | 0.838 | | 0.763 | | 0.960 | 0.835 | | | |
| | PIPR+D-SCRIPTᵃ | 0.400 | | 0.949 | | 0.562 | | 0.962 | 0.844 | | | |
| | R-FC | 0.893 | 0.904 | 0.482 | 0.903 | 0.626 | 0.612 | 0.959 | 0.792 | 3.50E-150 | 4.30E-100 | 0.00E+00 |
| | R-LSTM | 0.469 | 0.979 | 0.691 | 0.933 | 0.559 | 0.535 | 0.900 | 0.607 | 7.78E-300 | 0.00E+00 | 0.00E+00 |
| | Topsy-Turvy | 0.782 | 0.958 | 0.651 | 0.942 | 0.711 | 0.682 | 0.895 | 0.703 | 0.00E+00 | 6.75E-136 | 0.00E+00 |
| | C3PI | 0.914 | 0.934 | 0.581 | 0.932 | 0.710 | 0.696 | 0.977 | 0.878 | | | |
The results for the methods marked by superscript “a” were taken from their papers; the rest were computed by us. The best results are shown in boldface (excluding superscript “a”).
Figure 7.
ROC and PR curves are shown for the five models that have been run by us from Table 2: D-SCRIPT, R-FC, R-LSTM, Topsy-Turvy, and C3PI for the species datasets.
For the threshold-dependent parameters, the winners are distributed among the methods: sensitivity is won by C3PI in all tests, specificity and precision are won by D-SCRIPT in all tests, accuracy is split between D-SCRIPT and Topsy-Turvy, and F1-score and MCC are split between C3PI and Topsy-Turvy. The most important, threshold-independent, parameters, which measure the entire behaviour of a method on a given dataset, are AUROC and AUPRC; these are all won by C3PI with one exception, the AUPRC for M.musculus. The second best is Topsy-Turvy, with several exceptions: D-SCRIPT is second best for both areas for E.coli and for AUROC for S.cerevisiae, and R-FC is second best for both areas for H.sapiens. R-FC exhibits very strong performance for H.sapiens but very poor generalization, placing very low for all the other species.
The average improvement of C3PI over the top competitor, Topsy-Turvy, is 13.35% for AUROC and 26.11% for AUPRC. Given that these are skewed datasets, AUPRC is the most important parameter. In the single case where Topsy-Turvy performs better, it is by 2.09% and statistically not significant, as shown below.
The p-values assessing the statistical significance of the difference between the AUROC and AUPRC of C3PI and those of each competing method we could run (D-SCRIPT, R-FC, R-LSTM, and Topsy-Turvy) are presented in the last three columns of Table 2. For AUROC we used DeLong's test, and for AUPRC we used bootstrap and permutation tests. P-values below the significance threshold are considered significant; those above it are shown in red in the table. All differences are significant except the only case where Topsy-Turvy outperforms C3PI, the AUPRC for M.musculus.
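As an illustration of the AUPRC significance testing, the following is a minimal paired permutation test over two models' scores. This is our own sketch, not the authors' implementation; DeLong's test for AUROC is more involved and omitted here.

```python
import random

def average_precision(labels, scores):
    """AUPRC computed as average precision over the ranked positives."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / max(1, sum(labels))

def permutation_test(metric, labels, scores_a, scores_b, n_iter=1000, seed=0):
    """Two-sided paired permutation test for the difference in a ranking
    metric between two models scored on the same labelled examples: each
    iteration swaps the two models' scores per example with probability
    1/2 and checks whether the shuffled difference is at least as large
    as the observed one."""
    rng = random.Random(seed)
    observed = abs(metric(labels, scores_a) - metric(labels, scores_b))
    count = 0
    for _ in range(n_iter):
        a, b = [], []
        for sa, sb in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a.append(sb); b.append(sa)
            else:
                a.append(sa); b.append(sb)
        if abs(metric(labels, a) - metric(labels, b)) >= observed:
            count += 1
    return (count + 1) / (n_iter + 1)  # add-one smoothing avoids p = 0
```

A bootstrap test follows the same pattern but resamples examples with replacement instead of swapping scores.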
Gold standard dataset
The most relevant comparison is on the gold standard dataset of Bernett et al. [23]. In Table 3, we compare C3PI against all methods they studied (except baseline methods): D-SCRIPT, DeepFE, PIPR, R-FC, R-LSTM, and Topsy-Turvy.
Table 3.
Comparison on the gold standard dataset
| Model | Sens. | Spec. | Prec. | Acc. | F1 | MCC | AUROC | AUPRC |
|---|---|---|---|---|---|---|---|---|
| D-SCRIPTᵃ | 0.194 | 0.813 | 0.509 | 0.503 | 0.280 | 0.009 | 0.499 | 0.510 |
| DeepFEᵃ | 0.466 | 0.570 | 0.520 | 0.518 | 0.492 | 0.037 | 0.521 | 0.513 |
| PIPRᵃ | 0.464 | 0.580 | 0.525 | 0.522 | 0.492 | 0.044 | 0.532 | 0.528 |
| R-FCᵃ | 0.405 | 0.644 | 0.533 | 0.524 | 0.460 | 0.050 | 0.524 | 0.514 |
| R-LSTMᵃ | 0.607 | 0.423 | 0.514 | 0.515 | 0.556 | 0.030 | 0.515 | 0.509 |
| Topsy-Turvyᵃ | 0.261 | 0.857 | 0.646 | 0.559 | 0.372 | 0.146 | 0.587 | 0.590 |
| C3PI | 0.705 | 0.587 | 0.630 | 0.646 | 0.665 | 0.293 | 0.703 | 0.695 |
The results for methods marked by superscript “a” are from Bernett et al. [23]. The best results are shown in boldface.
C3PI is by far the winner, with Topsy-Turvy clearly the main competitor. The main conclusion of the study of Bernett et al. was that the previous methods are essentially random, with the slight exception of Topsy-Turvy. The composite measures, shown visually in Fig. 8, indicate this very clearly: accuracy, AUROC, and AUPRC for all methods stay around 50%, increase slightly for Topsy-Turvy, and then increase again, significantly, for C3PI. The trend is most clearly visible for MCC. C3PI's improvement over Topsy-Turvy is 15.54% for accuracy, 79.06% for F1-score, 100.15% for MCC, 19.66% for AUROC, and 17.86% for AUPRC. The MCC improvement over Topsy-Turvy is already very high, but the improvement over the third best MCC is an extraordinary 482.5%. These results underscore the significance of this dataset, through which Bernett et al. provided a new foundation and direction for subsequent work in the field.
Figure 8.
Bar chart visualization of composite parameters: the plots show, from left to right, the accuracy, F1-score, MCC, AUROC, and AUPRC from the gold standard dataset comparison (Table 3).
Ablation study
Architecture
In order to evaluate the contribution of our puzzler and entangler architecture components, we performed an ablation study in which we trained and evaluated the model on the gold standard dataset while removing one or both components. When the puzzler component is removed, each protein sequence is padded or truncated to a fixed length of 800 amino acids and passed directly to the model; no permuted sequences are generated. The rest of the architecture remains unchanged, including the dual-branch convolutional encoder, scale-wise entangler, and final classification MLP. When the entangler component is removed, the layers that fuse protein features at each scale are removed and the six scale-specific embeddings from each protein are concatenated to form a 2016-dimensional feature vector. The classifier is modified accordingly to include the following components:
– Linear layer (2016 → 64) + ReLU + Dropout
– Linear layer (64 → 8) + ReLU + Dropout
– Linear layer (8 → 1) + Sigmoid
Removing both the puzzler and the entangler involves performing all the above modifications.
The results of the tests on the gold standard dataset are shown in Table 4. The removal of the puzzler or the entangler brings a significant drop in performance, more pronounced for the puzzler than for the entangler. Without either one of them, the performance is still above that of Topsy-Turvy. The removal of both components brings the model to essentially random performance, comparable with the other models in Table 3.
Table 4.
Ablation study: the full model is compared on the gold standard dataset with the model without the entangler component, puzzler component, and both
| Model | Sens. | Spec. | Prec. | Acc. | F1 | MCC | AUROC | AUPRC |
|---|---|---|---|---|---|---|---|---|
| C3PI | 0.705 | 0.587 | 0.630 | 0.646 | 0.665 | 0.293 | 0.703 | 0.695 |
| w/o ent. | 0.648 | 0.561 | 0.596 | 0.604 | 0.621 | 0.209 | 0.646 | 0.634 |
| w/o puz. | 0.569 | 0.618 | 0.598 | 0.593 | 0.583 | 0.187 | 0.569 | 0.634 |
| w/o both | 0.570 | 0.505 | 0.535 | 0.537 | 0.552 | 0.075 | 0.549 | 0.529 |
Embedding methods
Table 5 compares the three top embedding methods, Ankh, ESM-2, and ProtT5, on the gold standard dataset. While ProtT5 is clearly the best overall, Ankh and ESM-2 also exhibit good performance. They trade off F1-score and MCC, but Ankh has better overall performance than ESM-2 in terms of the areas under both curves.
Table 5.
Embedding comparison on the gold standard dataset
| Embed. | Sens. | Spec. | Prec. | Acc. | F1 | MCC | AUROC | AUPRC |
|---|---|---|---|---|---|---|---|---|
| Ankh | 0.557 | 0.714 | 0.661 | 0.636 | 0.605 | 0.275 | 0.700 | 0.689 |
| ESM-2 | 0.747 | 0.478 | 0.589 | 0.613 | 0.659 | 0.234 | 0.666 | 0.656 |
| ProtT5 | 0.705 | 0.587 | 0.630 | 0.646 | 0.665 | 0.293 | 0.703 | 0.695 |
The best results are shown in boldface.
Piece size and number
Table 6 gives the comparison on the gold standard dataset of three possibilities for the size of the pieces and the corresponding number of pieces, chosen such that their product is close to the protein length of 800. The combination 15 × 53, in spite of using only 795 of the 800 available residues, performs considerably better than the neighbouring possibilities, 16 × 50 and 14 × 57.
Table 6.
Comparison of piece sizes on the gold standard dataset: for each case, the number of pieces and the length of one piece are given
| Pieces | Sens. | Spec. | Prec. | Acc. | F1 | MCC | AUROC | AUPRC |
|---|---|---|---|---|---|---|---|---|
| 16 × 50 | 0.578 | 0.613 | 0.599 | 0.595 | 0.588 | 0.191 | 0.640 | 0.630 |
| 15 × 53 | 0.705 | 0.587 | 0.630 | 0.646 | 0.665 | 0.293 | 0.703 | 0.695 |
| 14 × 57 | 0.734 | 0.487 | 0.589 | 0.611 | 0.653 | 0.228 | 0.653 | 0.640 |
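The piece decomposition in Table 6 amounts to slicing each per-residue embedding into equal chunks and discarding the remainder. A minimal sketch, assuming embeddings padded or truncated to 800 residue vectors of ProtT5's 1024 dimensions (the function name and NumPy layout are illustrative, not taken from the C3PI source):

```python
import numpy as np

def split_into_pieces(embedding, n_pieces=15, piece_len=53):
    """Split a per-residue embedding of shape (length, dim) into equal pieces.

    With 15 pieces of 53 residues, only the first 795 of the 800 rows
    are used; the remaining 5 are discarded.
    """
    used = n_pieces * piece_len  # 795 for the 15 x 53 setting
    return embedding[:used].reshape(n_pieces, piece_len, embedding.shape[-1])

emb = np.zeros((800, 1024), dtype=np.float32)  # one padded ProtT5 embedding
print(split_into_pieces(emb).shape)  # (15, 53, 1024)
```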
Application: NOTCH signalling pathway
The NOTCH signalling pathway is one of the best studied protein interaction pathways [33, 34] and continues to attract interest [35]. The NOTCH pathway is critical in many biomolecular processes, and defects in the pathway are known to be involved in several types of cancers and in several neurological diseases. Compared to other signalling pathways, the NOTCH pathway is unusual in that it involves a one-to-one interaction between the NOTCH receptor in the plasma membrane and a single ligand.
The interaction data used here have been taken from Rivas and Fontanillo [5], who used the NOTCH network as a canonical example of networks in biology. Their study shows the network as derived from studies in humans; it comprises a total of 50 binary interactions, 33 of which involve one of the four human NOTCH proteins.
The predictions by D-SCRIPT, Topsy-Turvy, and C3PI are used to plot the networks in Fig. 9. A summary is given in Table 7, and complete predictions in Supplementary Table S1. D-SCRIPT performs very poorly on this example. C3PI performs clearly better than Topsy-Turvy, as seen from the summary in Table 7 and from Fig. 9: the average score of the C3PI predictions is 0.644, whereas Topsy-Turvy achieves 0.490. In spite of this significant improvement, plenty of room for further progress remains.
Figure 9.
The NOTCH network: the predictions for the NOTCH network are shown as computed by (a) D-SCRIPT, (b) Topsy-Turvy, (c) C3PI; a three-colour scheme is used for the edges: white if the prediction is close to 50%, shades of blue if above 50%, and shades of red if below 50%; the width of each edge indicates the number of experiments supporting that interaction.
Table 7.
NOTCH network prediction summary
| Prediction threshold | D-SCRIPT | Topsy-Turvy | C3PI |
|---|---|---|---|
| Above 90% | 0 | 0 | 14 |
| Above 75% | 1 | 5 | 21 |
| Above 50% | 1 | 30 | 37 |
| Below 5% | 47 | 4 | 3 |
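The bands in Table 7 are cumulative (every prediction above 90% is also counted above 75% and above 50%), so they can be reproduced directly from the raw prediction scores. A small illustrative helper, using toy scores rather than the actual NOTCH predictions:

```python
def summarize_predictions(scores):
    """Count predictions per confidence band; bands are cumulative as in Table 7."""
    return {
        "above 90%": sum(s > 0.90 for s in scores),
        "above 75%": sum(s > 0.75 for s in scores),
        "above 50%": sum(s > 0.50 for s in scores),
        "below 5%": sum(s < 0.05 for s in scores),
    }

# Toy scores for illustration
print(summarize_predictions([0.95, 0.80, 0.60, 0.40, 0.02]))
# {'above 90%': 1, 'above 75%': 2, 'above 50%': 3, 'below 5%': 1}
```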
Conclusion
We have introduced C3PI, a novel PPI prediction model. C3PI feeds ProtT5 protein embeddings into a complex architecture that includes two novel components, a puzzler and an entangler, which significantly enhance the model’s performance. C3PI clearly outperforms the state-of-the-art methods, being the first PPI prediction method to achieve a significant improvement over random on the leakage-free gold standard dataset of Bernett et al. [23].
While the improvements over the state-of-the-art programs are considerable, the tests presented on the gold standard dataset and the NOTCH network indicate that there is still room for improvement in this area.
Aside from improvements in prediction accuracy, investigations should be performed into understanding the generalizing capability of C3PI to other interaction types, as well as its ability to capture biologically meaningful features. Interpretability analysis using XAI methods could be used to assess the latter; however, our recent study on interpretability of protein sequence models regarding the production of embeddings or prediction of interaction sites [36] indicates that this is a difficult problem.
Key Points
Proteins play a crucial role in most cellular processes, performing their functions by interacting with other proteins. Predicting protein–protein interactions (PPIs) is a fundamental problem in biology.
Experimental methods for PPI prediction are slow, expensive, have incomplete coverage, and produce many false positives. Therefore, much effort has been put into developing computational methods.
Recent thorough investigations revealed that existing computational methods learn exclusively from sequence similarities and node degrees. When such data leakage is avoided, performances become random.
We introduce C3PI, a new sequence-based method for PPI prediction, that uses ProtT5 protein embeddings and a novel deep learning architecture.
C3PI consistently outperforms existing methods, being the first PPI prediction method to achieve good performance on the leakage-free gold standard dataset.
Supplementary Material
Acknowledgements
All our computations were performed on Digital Research Alliance of Canada servers.
Contributor Information
SeyedMohsen Hosseini, Department of Computer Science, University of Western Ontario, London, N6A 5B7 Ontario, Canada.
G Brian Golding, Department of Biology, McMaster University, Hamilton, L8S 4K1 Ontario, Canada.
Lucian Ilie, Department of Computer Science, University of Western Ontario, London, N6A 5B7 Ontario, Canada.
Author contributions
S.H. designed the architectures, computed the datasets, wrote the software, performed all tests, including installing and running competing methods, and wrote an initial draft of the manuscript. G.B.G. designed and analyzed the NOTCH application and wrote that section. L.I. proposed the problem, designed the project and methodology, suggested the use of embeddings, analyzed the results, supervised the work, and wrote the final version of the manuscript.
Competing interests
None declared.
Funding
This work was supported by NSERC Discovery (RGPIN 2021-03978 to L.I., RGPIN-2020-05733 to G.B.G.).
Data availability
C3PI is freely available as a web server at c3pi.csd.uwo.ca and as source code from github.com/lucian-ilie/C3PI.
References
- 1. Alberts B, Bray D, Lewis J. et al. Molecular Biology of the Cell, Vol. 3. New York: Garland, 1994.
- 2. Golemis EA, Adams PD. Protein-Protein Interactions: A Molecular Cloning Manual. New York: CSHL Press, 2005.
- 3. Venter JC, Adams MD, Myers EW. et al. The sequence of the human genome. Science 2001;291:1304–51. 10.1126/science.1058040
- 4. Srinivasa Rao V, Srinivas K, Sujini GN. et al. Protein-protein interaction detection: methods and analysis. Int J Proteomics 2014;2014:1–12. 10.1155/2014/147648
- 5. Rivas JDL, Fontanillo C. Protein–protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol 2010;6:e1000807. 10.1371/journal.pcbi.1000807
- 6. Lin J-S, Lai E-M. Protein–protein interactions: co-immunoprecipitation. In: Bacterial Protein Secretion Systems. New York: Springer, 2017, 211–9. 10.1007/978-1-4939-7033-9_17
- 7. Louche A, Salcedo SP, Bigot S. Protein–protein interactions: pull-down assays. In: Bacterial Protein Secretion Systems. New York: Springer, 2017, 247–55. 10.1007/978-1-4939-7033-9_20
- 8. Douzi B. Protein–protein interactions: surface plasmon resonance. In: Journet L, Cascales E (eds), Bacterial Protein Secretion Systems. New York: Springer, 2017, 257–75. 10.1007/978-1-4939-7033-9_21
- 9. Karimova G, Gauliard E, Davi M. et al. Protein–protein interaction: bacterial two-hybrid. In: Bacterial Protein Secretion Systems. New York: Springer, 2017, 159–76. 10.1007/978-1-4939-7033-9_13
- 10. Atmakuri K. Protein–protein interactions: cytology two-hybrid. In: Journet L, Cascales E (eds), Bacterial Protein Secretion Systems. New York: Springer, 2017, 189–97. 10.1007/978-1-4939-7033-9_15
- 11. Qiu J, Bernhofer M, Heinzinger M. et al. ProNA2020 predicts protein–DNA, protein–RNA, and protein–protein binding proteins and residues from sequence. J Mol Biol 2020;432:2428–43. 10.1016/j.jmb.2020.02.026
- 12. Jumper J, Evans R, Pritzel A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9. 10.1038/s41586-021-03819-2
- 13. Abramson J, Adler J, Dunger J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024;630:493–500. 10.1038/s41586-024-07487-w
- 14. Sun T, Zhou B, Lai L. et al. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinform 2017;18:1–8. 10.1186/s12859-017-1700-2
- 15. Hashemifar S, Neyshabur B, Khan AA. et al. Predicting protein–protein interactions through sequence-based deep learning. Bioinformatics 2018;34:i802–10. 10.1093/bioinformatics/bty573
- 16. Richoux F, Servantie C, Borès C. et al. Comparing two deep learning sequence-based models for protein-protein interaction prediction. arXiv preprint arXiv:1901.06268, 2019.
- 17. Wang Y, You Z-H, Yang S. et al. A high efficient biological language model for predicting protein–protein interactions. Cells 2019;8:122. 10.3390/cells8020122
- 18. Chen M, Ju CJ-T, Zhou G. et al. Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics 2019;35:i305–14. 10.1093/bioinformatics/btz328
- 19. Chen C, Zhang Q, Bin Y. et al. Improving protein-protein interactions prediction accuracy using XGBoost feature selection and stacked ensemble classifier. Comput Biol Med 2020;123:103899. 10.1016/j.compbiomed.2020.103899
- 20. Sledzieski S, Singh R, Cowen L. et al. D-SCRIPT translates genome to phenome with sequence-based, structure-aware, genome-scale predictions of protein-protein interactions. Cell Syst 2021;12:969–982.e6. 10.1016/j.cels.2021.08.010
- 21. Song B, Luo X, Luo X. et al. Learning spatial structures of proteins improves protein–protein interaction prediction. Brief Bioinform 2022;23:bbab558. 10.1093/bib/bbab558
- 22. Singh R, Devkota K, Sledzieski S. et al. Topsy-Turvy: integrating a global view into sequence-based PPI prediction. Bioinformatics 2022;38:i264–72. 10.1093/bioinformatics/btac258
- 23. Bernett J, Blumenthal DB, List M. Cracking the black box of deep sequence-based protein–protein interaction prediction. Brief Bioinform 2024;25:bbae076. 10.1093/bib/bbae076
- 24. Binder JL, Berendzen J, Stevens AO. et al. AlphaFold illuminates half of the dark human proteins. Curr Opin Struct Biol 2022;74:102372. 10.1016/j.sbi.2022.102372
- 25. Elnaggar A, Heinzinger M, Dallago C. et al. ProtTrans: towards cracking the language of life’s code through self-supervised deep learning and high performance computing. IEEE Trans Pattern Anal Mach Intell 2021;44:7112–27.
- 26. Szklarczyk D, Gable AL, Nastou KC. et al. The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res 2021;49:D605–12. 10.1093/nar/gkaa1074
- 27. Li W, Godzik A. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658–9. 10.1093/bioinformatics/btl158
- 28. Fu L, Niu B, Zhu Z. et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28:3150–2.
- 29. Elnaggar A, Essam H, Salah-Eldin W. et al. Ankh: optimized protein language model unlocks general-purpose modelling. arXiv preprint arXiv:2301.06568, 2023.
- 30. Lin Z, Akin H, Rao R. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023;379:1123–30. 10.1126/science.ade2574
- 31. Davis J, Goadrich M. The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, 2006, 233–40.
- 32. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 2015;10:e0118432. 10.1371/journal.pone.0118432
- 33. Artavanis-Tsakonas S, Rand MD, Lake RJ. Notch signaling: cell fate control and signal integration in development. Science 1999;284:770–6. 10.1126/science.284.5415.770
- 34. Kovall RA, Gebelein B, Sprinzak D. et al. The canonical Notch signaling pathway: structural and biochemical insights into shape, sugar, and force. Dev Cell 2017;41:228–41. 10.1016/j.devcel.2017.04.001
- 35. Bian W, Jiang H, Yao L. et al. A spatially defined human Notch receptor interaction network reveals Notch intracellular storage and Ataxin-2-mediated fast recycling. Cell Rep 2023;42:112819. 10.1016/j.celrep.2023.112819
- 36. Fazel Z, de Souza CPE, Golding GB. et al. Explainability of protein deep learning models. Int J Mol Sci 2025;26:5255. 10.3390/ijms26115255