PEPPI: Whole-proteome Protein-protein Interaction Prediction through Structure and Sequence Similarity, Functional Association, and Machine Learning

Eric W Bell; Jacob H Schwartz; Peter L Freddolino; Yang Zhang

doi:10.1016/j.jmb.2022.167530

J Mol Biol. 2022 Mar 5;434(11):167530. doi: 10.1016/j.jmb.2022.167530

PMCID: PMC8897833 NIHMSID: NIHMS1815720 PMID: 35662463

Keywords: direct interaction prediction, interology, dimer threading, interactome, SARS-CoV-2

Abstract

Proteome-wide identification of protein-protein interactions is a formidable task which has yet to be sufficiently addressed by experimental methodologies. Many computational methods have been developed to predict proteome-wide interaction networks, but few leverage both the sensitivity of structural information and the wide availability of sequence data. We present PEPPI, a pipeline which integrates structural similarity, sequence similarity, functional association data, and machine learning-based classification through a naïve Bayesian classifier model to accurately predict protein-protein interactions at a proteomic scale. Through benchmarking against a set of 798 ground truth interactions and an equal number of non-interactions, we have found that PEPPI attains 4.5% higher AUROC than the best of other state-of-the-art methods. As a proteomic-scale application, PEPPI was applied to model the interactions which occur between SARS-CoV-2 and human host cells during coronavirus infection, where 403 high-confidence interactions were identified with predictions covering 73% of a gold standard dataset from PSICQUIC and demonstrating significant complementarity with the most recent high-throughput experiments. PEPPI is available both as a webserver and in a standalone version and should be a powerful and generally applicable tool for computational screening of protein-protein interactions.

Introduction

The biological function of many proteins is conferred through their interactions with other proteins. Therefore, to fully understand the function of each protein in an organism, one must first attain a comprehensive network of the protein-protein interactions (PPIs) that occur within the cell. The discovery of critical interactions within this interaction network, or “interactome”, can lead to drug development¹ or protein engineering²,³ targeting these interactions. However, many of the experiments used to detect PPIs do not guarantee that the interactions detected are, in fact, direct physical contacts between the proteins; likewise, some of the earliest databases for PPI prediction involve features that assert only a functional association between proteins.⁴ These databases can be useful for predicting physical interactions, since all physically interacting proteins are functionally associated; the converse is not true, however, and many biological applications, such as drug target discovery, require knowledge of which proteins come into physical contact. The methods for elucidating these direct physical interactions are at present either prohibitively costly for whole-proteome analysis (such as structure solving or crosslinking mass spectrometry) or too susceptible to errors (such as yeast two-hybrid).⁵ As an alternative, computational methods can be used to model proteome-wide interactions, as well as refine existing interaction datasets.

One of the most straightforward methods of computational interaction prediction is to determine whether the query protein pair is similar to an already known interaction. Many programs directly leverage sequence similarity for this purpose because the sequence comparison operation is quick and sequence data is plentiful.⁶,⁷ However, since structure is more evolutionarily conserved than sequence, structural similarity is much more effective at detecting distantly similar PPIs; methods which leverage this structural information⁸,⁹,¹⁰,¹¹ grow more powerful as modern structural biology methods such as cryo-EM facilitate the solving of complicated protein complex structures and as computational approaches offer improved accuracy in predicting the folds of individual proteins.¹² In addition, structures provide a clear ground truth as to whether two proteins interact physically; if a solved structure of the interaction exists, the proteins are likely to interact in vivo. Therefore, an effective similarity-based program should consider both structural and sequence similarity.

Another common method for PPI prediction is the application of machine learning-based classifiers. In the earlier days of machine learning, the major novelty of machine learning-based PPI predictors was in how they extracted features from the input amino acid sequences, for example through the conjoint triad method¹³ or autocorrelation,¹⁴ in order to create a fixed-length vector that could be utilized in standard machine learning algorithms. As more modern deep learning techniques became available, improved PPI predictions were achieved with features solely from evolutionary profiles¹⁵ or even based on sequence alone.¹⁶,¹⁷ Despite these initial studies into deep learning, to our knowledge these methods have yet to demonstrate success in a species-agnostic context; more commonly, they are benchmarked by either combining a few species-specific datasets¹⁶ or by training on one species and testing on another.¹⁵,¹⁷

Here, we present a Pipeline for the Extraction of Predicted Protein-protein Interactions (PEPPI), which offers high-accuracy PPI predictions through a consensus of sequence and structural similarity, functional association, and neural network classification. While the source code for PEPPI can be found at https://github.com/ewbell94/PEPPI, an online webserver implementation of this pipeline can be found at https://zhanggroup.org/PEPPI/, which allows users to create PPI predictions from sequences alone. We additionally present an application of PEPPI to make predictions of the inter-species interactome between human host cells and SARS-CoV-2. Through the following benchmarks and examples, we demonstrate that PEPPI is a useful tool for predicting both pairwise and systems-level PPIs.

Results

Pipeline overview and module cross-validation

PEPPI is a protein-protein interaction prediction pipeline which takes in a pair of query sequences and quantifies their likelihood of interaction as a natural log-transformed likelihood ratio (log(LR)) through a consensus of five independent prediction modules (Figure 1). This consensus is determined by a naïve Bayesian classifier model trained on a set of 800 high-confidence interactions from IntAct¹⁸ and 800 curated non-interactions from the Negatome 2.0 database¹⁹ (see Supplementary Methods).

An illustration of the PEPPI pipeline. This pipeline functions by analyzing a pair of input sequences via a series of independent modules, including structure similarity, sequence similarity, neural network classification, and functional association data. These modules are combined using a naïve Bayesian consensus classifier, which provides the final interaction score as a log-likelihood ratio.

Figure 2(a) presents the 10-fold cross-validation performance of each individual module and the full pipeline on this training set. The best-performing individual modules are SPRING and SEQ, which implement structure- and sequence-based similarity approaches, respectively. The SPRING module uses the dimeric threading program SPRING²⁰ to identify dimer structure templates in a database of interacting proteins extracted from the PDB, while SEQ uses BLAST²¹ sequence searching to identify similar interactions in a database of direct interactions identified by high-throughput experiment (HTE) data. These homology-based modules perform well whenever the query pair is homologous to an existing interaction, which is the case for many true interactions. The next best-performing module is the neural network-based CT module, which transforms the input amino acid sequences into a fixed-length vector according to the conjoint triad method¹³ and classifies the resulting vector through a neural network model. This module helps PEPPI retrieve true positive predictions when there is only loose homology to existing interactions. The STRING module, which extracts various functional association features for the query pair from the STRING database,²² performs relatively poorly on its own. This is expected because of its focus on functional association data instead of physical interaction data and because of its inability to provide data if the pair is not present in STRING. SPRINGNEG has a nearly identical methodology to SPRING, but it performs least well on its own because it searches a database of non-interacting protein structures; by design, a hit in this database can only lower the interaction score, so the module contributes information only for filtering out functionally associated non-interactions. Overall, the combination of all modules clearly outperforms any single module, demonstrating that the modules are complementary in classification.
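The conjoint triad featurization that feeds the CT module can be sketched as follows. This is a minimal illustration rather than PEPPI's own code: the seven residue classes and the normalization follow the scheme commonly attributed to the conjoint triad method,¹³ while the function names (`conjoint_triad`, `pair_features`), the skipping of non-standard residues, and the concatenation of the two per-protein vectors are assumptions made for this example. The actual preprocessing and network are described in the Supplementary Material.

```python
# Seven residue classes commonly used for the conjoint triad method
# (an assumption here; PEPPI's exact grouping may differ).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
RESIDUE_TO_CLASS = {aa: i for i, group in enumerate(CLASSES) for aa in group}


def conjoint_triad(sequence):
    """Map a protein sequence to a fixed-length vector of 7^3 = 343 values."""
    counts = [0] * 343
    classes = [RESIDUE_TO_CLASS.get(aa) for aa in sequence.upper()]
    # Slide a window of three consecutive residues along the sequence.
    for a, b, c in zip(classes, classes[1:], classes[2:]):
        if a is None or b is None or c is None:
            continue  # skip triads containing a non-standard residue
        counts[a * 49 + b * 7 + c] += 1
    lowest, highest = min(counts), max(counts)
    if highest == 0:
        return [0.0] * 343  # sequence too short or no valid triads
    # Normalize so that sequence length does not dominate the features.
    return [(n - lowest) / highest for n in counts]


def pair_features(sequence_a, sequence_b):
    """Concatenate both vectors into one 686-value input for a classifier."""
    return conjoint_triad(sequence_a) + conjoint_triad(sequence_b)
```

Because the vector length depends only on the number of residue classes, sequences of any length map to the same input size, which is what allows a standard feed-forward classifier to be applied to arbitrary protein pairs.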

PEPPI benchmark results. (a) 10-fold cross validation AUROC reveals that the full PEPPI pipeline outperforms its component modules: the neural network classifier (CT), the functional association data (STRING), the sequence similarity method (SEQ), the structure similarity method (SPRING), and the non-interaction similarity method (SPRINGNEG). (b) An ROC curve of the performance of PEPPI against PRISM, a structure similarity-based method, SPRINT, a sequence similarity-based method, PIPR, a deep learning-based method, and D-SCRIPT, a structure-aware deep learning-based method on a balanced testing set. The dotted line represents the performance of random classification. (c) Precision-recall curve of the performance of PEPPI against several other comparable programs on an unbalanced testing set. The dotted line represents the performance of random classification. (d) A superposition of an example dimer model (PDB 3CI0; chain J in red, chain K in blue) on its dimer template structure (PDB 5VTM; chain W in yellow, chain X in cyan).

PEPPI benchmark and performance

In order to quantify the classification performance of PEPPI against existing methods, we benchmarked PEPPI alongside PRISM,⁸ a structure-based similarity predictor, SPRINT,⁶ a sequence-based similarity predictor, PIPR,¹⁶ a deep learning predictor which utilizes a combination of recurrent and convolutional layers in its architecture, and D-SCRIPT,¹⁷ a “structure aware” deep learning predictor improving upon PIPR. These benchmarks were performed using a randomly selected test set of 798 interacting structure pairs and an equal number of structure pairs involving chains from the same protein complex but known to not come into physical contact (and thus do not form a physical interaction). All structures of this test set were taken from the PDB and were confirmed to have < 50% sequence identity to the PEPPI training protein pairs. It should be noted that the protein pairs in our dataset are classified through their sequence alone, and therefore, even though many of the corresponding PDB structures contain more than one interaction, each pair of chains can be considered independently.

The results of the PPI predictions are presented in Figure 2(b) and Table S1, where it is shown that PEPPI significantly outperforms all competing methods in terms of area under ROC (AUROC), average precision (similar to area under precision-recall curve, AUPRC), and all but SPRINT in maximum achievable Matthews correlation coefficient (MCC). PRISM had the lowest performance in this benchmark, which is likely due to its outdated interface structure library, which lacks many structures that have been solved since its release to the public. Interestingly, the highest performance from a competing program was seen from SPRINT, a sequence motif-based similarity classifier, and not from either D-SCRIPT or PIPR, the more sophisticated deep-learning pipelines. The deep learning architectures of PIPR and D-SCRIPT were originally trained and benchmarked on species-specific interaction datasets; when these methods were applied to the species-agnostic dataset used in our benchmarks, the deep learning methods’ ability to accurately classify interactions decreased. As a result, the competing method which draws its conclusions from explicit similarity (SPRINT) outperforms the methods which try to sub-optimally learn the interaction problem itself (PIPR & D-SCRIPT). However, we still found that D-SCRIPT outperforms PIPR, as is consistent with D-SCRIPT’s benchmarks.¹⁷
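For readers who want to reproduce this style of comparison, the two threshold-free summary statistics named above can be sketched in a few lines. This is an illustrative implementation, not the evaluation code used for the benchmark: the function names are hypothetical, AUROC is computed through its pairwise-ranking definition, and the maximum MCC is found by trying every observed score as a cutoff.

```python
import math


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0 when undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def max_mcc(scores, labels):
    """Best MCC over all cutoffs, predicting 'interacting' when score >= cutoff."""
    best = -1.0
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        best = max(best, mcc(tp, fp, tn, fn))
    return best
```

The pairwise AUROC is quadratic in the number of examples, which is acceptable for a test set of this size; a rank-based formulation would be preferable for much larger sets.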

One particular case of interest in our benchmark dataset is the interaction between chain A and chain B of PDB code 1F3M (corresponding to the N- and C-terminal domains of the human kinase PAK1). While this is a true interaction (and PEPPI correctly classifies it as such), no hit was found by PEPPI in either the sequence or structure databases after homologous template removal, leading to poor scores for the similarity-based modules (SPRING: 8.937, STRING: not found, SEQ: 0.129, SPRINGNEG: 4.618). However, this interaction was still classified as positive with a log(LR) of 0.056 due to a high interaction probability from the CT module (0.999), thus demonstrating the utility of CT for rescuing interactions that do not attain significant similarity. On the side of non-interaction classification, the potential interaction of chain G and chain I of 3CJH (corresponding to two Tim13 chains of the yeast Tim8-Tim13 complex) poses an interesting case, as these chains are present in the same complex but do not directly interact with each other. PEPPI was able to classify this pair as non-interacting (log(LR) = −0.939) where competing programs could not, despite a high SPRING score (34.208), a high CT confidence (1.0), and loose SEQ homology (0.217). The reason for the correct classification is a high score from SPRINGNEG (36.108), which pushes down the total interaction likelihood, thus demonstrating a case where SPRINGNEG’s ability to identify false positives rescues the pipeline from misclassifying the pair.

While the previous balanced dataset is convenient for benchmarking, it is not fully reflective of the context in which an interactome prediction algorithm is applied because true interactions are much sparser relative to the total set of pairwise combinations of query proteins in almost all contexts. Therefore, we randomly sampled 100 interacting pairs from the previous test set and paired the 200 chains from these interactions in an all-by-all fashion (excluding self pairings), resulting in an unbalanced test set of 100 true interacting pairs and 19,800 decoy pairs. These decoys are not confirmed non-interactions, but due to the low likelihood of finding interactions by random chance, benchmarking using this dataset is still valuable. Due to the high number of protein pairs, PRISM was excluded from this benchmark because of its slow speed and poor performance on the preceding benchmark. The results of this unbalanced benchmark are presented in Figure 2(c) and Table S2. The outcome is similar to the previous benchmark, with PEPPI outperforming all other programs, followed most closely by the sequence-based algorithm SPRINT. In this benchmark, however, the superiority of PEPPI is much clearer, as SPRINT is on average more susceptible to false positive detection for comparable recalls. This resulted in a statistically significant difference in max MCC performance between PEPPI and SPRINT, which was not the case on the balanced benchmark set. Also made clearer is the relatively poor performance of PIPR with respect to false positive errors; so many pairs are classified with the highest confidence score that the maximum achievable precision is 0.186. D-SCRIPT again clearly outperforms PIPR but fails to perform as well as the similarity-based methods. One particular case of interest in this benchmark is the true interaction between chain J and chain K of PDB code 3CI0 (part of the type 2 secretion system of enterotoxigenic E. coli), an interaction which was detected only by PEPPI (Figure 2(d)). This interaction was detected solely through structural similarity (SPRING: 51.45), with all other modules failing to detect the interaction (STRING: not found, SPRINGNEG: 4.09, SEQ: 0.279, CT: 0.026), leading to a log(LR) of 1.433.
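The all-by-all construction of the unbalanced set can be sketched as below. This is a minimal illustration under two assumptions that are not stated in the text: every chain in the sampled interactions is distinct, and a pair is unordered (A-B and B-A are the same pair). The function name and the chain identifiers are hypothetical.

```python
from itertools import combinations


def build_unbalanced_set(interacting_pairs):
    """Return (positives, decoys) from an all-by-all pairing of the chains."""
    positives = {frozenset(pair) for pair in interacting_pairs}
    chains = sorted({chain for pair in interacting_pairs for chain in pair})
    # combinations() yields each unordered pair once and never pairs a
    # chain with itself, which matches "excluding self pairings".
    decoys = [
        pair for pair in combinations(chains, 2)
        if frozenset(pair) not in positives
    ]
    return sorted(tuple(sorted(pair)) for pair in positives), decoys


# Toy example with three interactions (six chains).
toy_pairs = [("A1", "A2"), ("B1", "B2"), ("C1", "C2")]
toy_positives, toy_decoys = build_unbalanced_set(toy_pairs)
print(len(toy_positives), len(toy_decoys))
```

In general, n distinct chains give n(n-1)/2 unordered pairs, so the decoys outnumber the true interactions by a factor that grows linearly with the number of sampled interactions.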

SARS-CoV-2 and human interactome modeling

The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has disrupted the lives of almost every person to some degree, and as of November 2021, over 5 million people have lost their lives to the disease worldwide.²³ SARS-CoV-2 has thus become an essential entity to understand, as our expedient comprehension of this virus translates to the development of therapeutic medicines, such as antiviral drugs and vaccines, for current and future coronavirus infections. One fundamental step towards understanding the function of the virus is to model the virus-host interactome, with which we can begin to identify the purpose of each viral protein through our functional understanding of the human proteins with which each viral protein interacts. To this end, we have predicted the set of interactions which occur between SARS-CoV-2 and human proteins using PEPPI.

Our SARS-CoV-2/Human interactome model consists of 403 interactions whose likelihood ratios were determined to be greater than 1, i.e., protein pairs which are more likely to interact than not. As shown in Figure 3(a)-(b), the SARS-CoV-2 protein which has the highest number of predicted interactions is the Spike protein (86 interactions), followed by the 2′-O-methyltransferase nsp16 (46 interactions), and the RNA polymerase nsp12 (41 interactions). The highest confidence interaction of this network was the Spike/ACE2 interaction (Figure 3(c)), which is expected given the extensive study of this interaction due to its essential role in viral entry.²⁴ PEPPI also correctly predicted Spike to interact with two other host proteins important to viral entry: Furin, a protease which cleaves Spike during entry of SARS-CoV-2 but is not involved in SARS-CoV-1 entry,²⁵ and TMPRSS2, a cell-surface protease involved in viral entry of both SARS-CoV-2 and SARS-CoV-1.²⁴ The PEPPI results also demonstrated the power of structure similarity-based PPI prediction through the prediction of the PARP15-nsp3 interaction (Figure 3(d)); this interaction was predicted with high confidence (log(LR) = 1.435), mainly due to the SPRING module’s high confidence score (35.6). The PARP proteins are known to interact with the nsp3 macrodomain in other coronaviruses,²⁶ so detection of this interaction in our dataset stands as an important validation.

A summary of SARS-CoV-2/human interactome prediction. (a) A network overview of the full interactome of human/SARS-CoV-2 protein pairs, with SARS-CoV-2 proteins colored red and human proteins colored blue. (b) A bar chart of the number of predicted interactions involving each SARS-CoV-2 protein. Proteins which were not predicted to have any interactions were excluded. (c) A superposition of a dimer model of the top-ranked SARS-CoV-2 Spike (in red) and Human ACE2 (in blue) interaction on its dimer template structure (PDB 6ACG; chain D in cyan, chain C in yellow). (d) A superposition of a dimer model of the SARS-CoV-2 nsp3 macrodomain (in red) and human PARP15 macrodomain (in blue) on its dimer template structure (PDB 2W2G; chain A in cyan, chain B in yellow). (e) A superposition of a dimer model of a domain of the SARS-CoV-2 nsp3 (in red) and human NEDD8 (in blue) on its dimer template structure (PDB 5WFI; chain C in cyan, chain A in yellow). (f) A superposition of a dimer model of a domain of the SARS-CoV-2 nsp3 (in red) and human UBD (in blue) on its dimer template structure (PDB 6BI8; chain C in cyan, chain A in yellow).

To evaluate PEPPI’s overall performance at recapitulating known biology, we constructed a gold standard dataset of known interactions between human and SARS coronavirus proteins from PSICQUIC²⁷ for comparison. This dataset consisted of 128 interactions, 94 (73%) of which were predicted by PEPPI. In addition, we compared the overlap between our predicted dataset and a recently published high-throughput experimental dataset.²⁸ PEPPI’s predictions only shared one interaction with this dataset (an interaction between MARK3 and ORF9b), but the experimental dataset presents only functional associations due to its use of AP-MS, which is known to pull down entire interacting complexes instead of only the “prey” protein of interest. Compared to PEPPI, the experimental dataset also misses crucial interactions, such as the interactions involving the Spike protein with ACE2, Furin, and TMPRSS2. In fact, only 3 of the 128 (2%) gold standard interactions we isolated from PSICQUIC are present in this dataset. Thus, even in the presence of a high-throughput experimental dataset, PEPPI provides a demonstrably useful complement and reveals many direct physical interactions which would otherwise be missed.
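The coverage figures quoted above are simple set overlaps. A minimal sketch of that calculation follows, assuming each interaction is stored as a (viral protein, human protein) tuple; the function name and the placeholder identifiers are hypothetical and are not taken from the PEPPI or PSICQUIC data.

```python
def gold_standard_coverage(predicted, gold):
    """Return (gold pairs recovered, fraction of the gold standard recovered)."""
    predicted_set = set(predicted)
    gold_set = set(gold)
    if not gold_set:
        raise ValueError("gold standard set is empty")
    recovered = predicted_set & gold_set
    return len(recovered), len(recovered) / len(gold_set)


# Toy placeholder identifiers, not real data.
predicted = [("V1", "H1"), ("V1", "H2"), ("V2", "H3")]
gold = [("V1", "H1"), ("V2", "H3"), ("V2", "H4"), ("V3", "H5")]
n_recovered, fraction = gold_standard_coverage(predicted, gold)
print(n_recovered, fraction)
```

Applied to the counts reported in the text, 94 recovered pairs out of 128 gives 94/128 ≈ 0.73, and 3 out of 128 gives ≈ 0.02.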

Lastly, PEPPI made the potentially significant predictions that nsp3 interacts with the post-translational modifiers NEDD8 and UBD (FAT10). While it is well-documented that the papain-like protease (PLPro) of SARS-CoV-2 nsp3 both deISGylates and deubiquitinates viral proteins to avoid host detection and thus evade immune response,²⁹ there has been less study of the role of the related, ubiquitin-like, post-translational modifiers NEDD8 and UBD in SARS-CoV-2 disease. NEDD8 tags proteins for degradation, has been implicated in the innate immune response to viruses,³⁰ and is targeted by some viruses for modulation of host immune response.³¹ UBD has been shown to be a ubiquitin-independent and cytokine-inducible modifier which targets proteins for proteasomal degradation³² and has additionally been shown to have roles in viral infection defense.³³ Furthermore, the top structural templates PEPPI found for nsp3/NEDD8 (Figure 3(e)) and nsp3/UBD (Figure 3(f)) were a PLPro in complex with free ubiquitin and a PLPro in complex with the ubiquitin-like and innate-immune-modulating protein ISG15, respectively. We therefore hypothesize that SARS-CoV-2 modulates host innate immune response through interaction of nsp3 with NEDD8 and UBD in a manner similar to the way nsp3 interacts with ubiquitin.

Discussion & conclusion

We have presented a novel PPI prediction pipeline which demonstrates superior performance relative to other approaches. In addition to performance, this method presents a few unique advantages. First, because the structure-based analysis makes use of threading rather than structural alignment, it is much faster than pipelines which need to explicitly model the input chains, while retaining the flexibility of not requiring an input structure. Second, because structure-based analysis is a component of the pipeline, PEPPI can produce rough structural models of the interactions, which can help deepen biological insights such as interface residue determination and can guide follow-up experiments. Finally, because PEPPI is a consensus model, it is not solely dependent on any one methodology to make its predictions; even if all modules classify an interaction with low confidence, if these classifications agree, the final prediction will have reasonable confidence (as shown in an earlier example). In addition, the consensus classifier is constructed such that if any modules are intentionally excluded or fail to produce a score, a prediction can still be made from the remaining modules.

A few shortcomings and assumptions of the pipeline should also be discussed. First, the interaction predictions made in this pipeline are based largely on similarity to known PPIs, and these modules will only detect interactions with similarity to solved structures or to interactions detected in high-throughput screens. Therefore, the method’s performance will depend on the coverage of our knowledge of the existing interaction space, which is currently far from fully comprehensive. However, this knowledge will expand as more interactions are discovered, so the power of the similarity-based method will improve over time. Second, because PEPPI is similarity-dependent, if an interaction is predicted between proteins of two given families, all other proteins in those families will likely also be predicted to interact. In this case, interactions involving proteins of the same families can be sorted by LR; the highest rated interaction is the most likely to be true. Finally, this pipeline predicts the capability for proteins to interact regardless of biological context. As a result, it is possible for some of the interactions predicted here to not exist within the context of the cell due to factors such as incompatible subcellular localization or insufficient expression of the proteins of interest in vivo. While this additional biological insight can be useful in pruning the interaction space in a proteome-wide interactome modeling study, it is not explicitly considered in PEPPI’s interaction predictions. Therefore, it is worth validating the interactions predicted with this program with more focused small-scale biochemical studies, such as crosslinking mass spectrometry experiments. Despite these shortcomings, the whole-proteome interaction networks modeled by PEPPI can help biologists retrieve existing biology and derive novel biology for their system of interest, as we did for the SARS-CoV-2/Human interaction system. Through the understanding of PPI networks on the whole-proteome scale that PEPPI provides, future studies will be able to better understand the systems-level complexity that underpins biological phenomena as well as target individual edges of the network for therapeutic benefit.

Methods

Pipeline overview

The PEPPI pipeline performs predictions through a set of independent modules, each of which scores the interaction likelihood in its own way. These modules include a conjoint triad trained neural network, a STRING database lookup module, and two “interology” based modules: a threading-based module using a modified version of SPRING and a sequence-based module using BLAST. Scores from each of these modules are transformed into a likelihood ratio based on pre-trained score probability distributions, and the final likelihood ratio is calculated as the product of the likelihood ratios from each independent module (i.e., the sum of log-likelihood ratios). A full description of the pipeline methodology can be found in the Supplementary Material.
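The consensus step described above can be sketched as a naïve Bayesian combination of per-module log-likelihood ratios. The example below is illustrative only: the Gaussian class-conditional densities and every numeric parameter in `DENSITIES` are hypothetical stand-ins for the pre-trained score distributions, whose actual form is given in the Supplementary Material, and the function names are not PEPPI's. Modules that return no score are skipped, in line with the statement in the Discussion that a prediction can still be made from the remaining modules.

```python
import math


def gaussian_log_pdf(x, mean, sd):
    """Log density of a normal distribution, computed without underflow."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


# Hypothetical (mean, sd) of each module's score distribution:
# module -> (interacting pairs, non-interacting pairs).
DENSITIES = {
    "SPRING": ((30.0, 10.0), (10.0, 8.0)),
    "SEQ": ((0.6, 0.3), (0.2, 0.2)),
    "CT": ((0.8, 0.2), (0.3, 0.3)),
}


def module_log_lr(module, score):
    """log[ P(score | interacting) / P(score | non-interacting) ]."""
    interacting, non_interacting = DENSITIES[module]
    return (gaussian_log_pdf(score, *interacting)
            - gaussian_log_pdf(score, *non_interacting))


def consensus_log_lr(scores):
    """Sum of module log-LRs; a score of None means the module gave no result."""
    return sum(
        module_log_lr(module, score)
        for module, score in scores.items()
        if score is not None
    )


def is_interacting(scores):
    """A pair is called interacting when log(LR) > 0, i.e. LR > 1."""
    return consensus_log_lr(scores) > 0.0
```

Summing in log space is numerically equivalent to multiplying the individual likelihood ratios, but it avoids overflow and underflow when several modules give extreme scores.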

SARS-CoV-2 virus and human host protein sequence collection

The SARS-CoV-2 proteome was collected from the UniProtKB pre-release. Replicase polyprotein 1ab was split into nsp1-16 (excluding nsp11) according to its “chain” regions as described in the “Protein Processing” subsection of its UniProtKB entry; nsp11 was extracted from replicase polyprotein 1a in a similar fashion. As a result, the SARS-CoV-2 sequence set consisted of 31 protein sequences in total. The human proteome, consisting of 20,600 proteins, was also collected from the UniProt database.³⁴ All 20,600 × 31 = 638,600 putative interactions were analyzed with PEPPI; any pairs resulting in a log(LR) greater than 0 were classified as interacting. The “gold standard” dataset comprises interactions listed in the PSICQUIC database²⁷ annotated as “direct interaction” as of April 2021 between the proteins of either SARS-CoV-1 or SARS-CoV-2 and human proteins, a total of 128 interactions.

CRediT authorship contribution statement

Eric W. Bell: Conceptualization, Methodology, Software, Validation, Formal analysis, Data curation, Writing – original draft, Writing – review & editing. Jacob H. Schwartz: Investigation, Writing – original draft, Writing – review & editing. Peter L. Freddolino: Conceptualization, Writing – review & editing, Supervision, Project administration, Funding acquisition. Yang Zhang: Conceptualization, Writing – review & editing, Supervision, Project administration, Funding acquisition.

Acknowledgments

We would like to thank Dr. Gilbert S. Omenn for helpful discussion. This work is supported in part by the National Institute of General Medical Sciences (GM136422, S10OD026825), the National Institute of Allergy and Infectious Diseases (AI134678), and the National Science Foundation (IIS1901191, DBI2030790, MTM2025426). The work in this manuscript was performed in part using the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by the National Science Foundation (ACI1548562).

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Edited by Rita Casadio

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.jmb.2022.167530.

Supplementary Data 1: mmc1.pdf (277 KB)

Supplementary Data 2: mmc2.pdf (277 KB)

Data availability

The relevant data have been shared as supplementary material, and source code is available with the web server itself.

References

1. Scott D.E., Bayly A.R., Abell C., Skidmore J. Small molecules, big targets: Drug discovery faces the protein-protein interaction challenge. Nature Publishing Group; 2016. p. 533–550.
2. Bryan C.M., Rocklin G.J., Bick M.J., Ford A., Majri-Morrison S., Kroll A.V., et al. Computational design of a synthetic PD-1 agonist. Proc. Natl. Acad. Sci. 2021;118:e2102164118. doi: 10.1073/pnas.2102164118.
3. Huang X., Pearce R., Zhang Y. De novo design of protein peptides to block association of the SARS-CoV-2 spike protein with human ACE2. Aging. 2020;12:11263–11276. doi: 10.18632/aging.103416.
4. Snel B., Lehmann G., Bork P., Huynen M.A. STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res. 2000;28:3442–3444. doi: 10.1093/nar/28.18.3442.
5. Fields S. High-throughput two-hybrid analysis: The promise and the peril. FEBS J. 2005;272:5391–5399. doi: 10.1111/j.1742-4658.2005.04973.x.
6. Li Y., Ilie L. SPRINT: Ultrafast protein-protein interaction prediction of the entire human interactome. BMC Bioinf. 2017;18:1–11. doi: 10.1186/s12859-017-1871-x.
7. Pitre S., Dehne F., Chan A., Cheetham J., Duong A., Emili A., et al. PIPE: A protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs. BMC Bioinf. 2006;7:1–15. doi: 10.1186/1471-2105-7-365.
8. Baspinar A., Cukuroglu E., Nussinov R., Keskin O., Gursoy A. PRISM: A web server and repository for prediction of protein-protein interactions and modeling their 3D complexes. Nucleic Acids Res. 2014;42:W285–W289. doi: 10.1093/nar/gku397.
9. Hosur R., Peng J., Vinayagam A., Stelzl U., Xu J., Perrimon N., et al. A computational framework for boosting confidence in high-throughput protein-protein interaction datasets. 2012. pp. 76-.
10. Garzón J.I., Deng L., Murray D., Shapira S., Petrey D., Honig B. A computational interactome and functional annotation for the human proteome. eLife. 2016;5:1–27. doi: 10.7554/eLife.18715.
11. Gong W., Guerler A., Zhang C., Warner E., Li C., Zhang Y. Integrating Multimeric Threading With High-throughput Experiments for Structural Interactome of Escherichia coli. J. Mol. Biol. 2021;433. doi: 10.1016/j.jmb.2021.166944.
12. Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2.
13. Shen J., Zhang J., Luo X., Zhu W., Yu K., Chen K., et al. Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. 2007;104:4337–4341. doi: 10.1073/pnas.0607879104.
14. Guo Y., Yu L., Wen Z., Li M. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res. 2008;36:3025–3030. doi: 10.1093/nar/gkn159.
15. Hashemifar S., Neyshabur B., Khan A.A., Xu J. Predicting protein-protein interactions through sequence-based deep learning. Bioinformatics. 2018;34:i802–i810. doi: 10.1093/bioinformatics/bty573.
16. Chen M., Ju C.J.T., Zhou G., Chen X., Zhang T., Chang K.W., et al. Multifaceted protein-protein interaction prediction based on Siamese residual RCNN. 14 ed. p. i305–i314.
17. Sledzieski S., Singh R., Cowen L., Berger B. D-SCRIPT translates genome to phenome with sequence-based, structure-aware, genome-scale predictions of protein-protein interactions. Cell Syst. 2021;12:969–982.e6. doi: 10.1016/j.cels.2021.08.010.
18. Orchard S., Ammari M., Aranda B., Breuza L., Briganti L., Broackes-Carter F., et al. The MIntAct project - IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 2014;42:D358–D363. doi: 10.1093/nar/gkt1115.
19.Blohm P., Frishman G., Smialowski P., Goebels F., Wachinger B., Ruepp A., et al. Negatome 2.0: A database of non-interacting proteins derived by literature mining, manual annotation and protein structure analysis. Nucleic Acids Res. 2014;42:D396–D400. doi: 10.1093/nar/gkt1079. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Guerler A., Govindarajoo B., Zhang Y. Mapping monomeric threading to protein-protein structure prediction. J. Chem. Inf. Model. 2013;53:717–725. doi: 10.1021/ci300579r. [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
22.Szklarczyk D., Gable A.L., Lyon D., Junge A., Wyder S., Huerta-Cepas J., et al. STRING v11: Protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019;47:D607–D613. doi: 10.1093/nar/gky1131. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Organization WH, WHO Coronavirus (COVID-19) Dashboard, WHO Coronavirus (COVID-19) Dashboard With Vaccination Data, 2021, pp. 1-5.
24.Hoffmann M., Kleine-Weber H., Schroeder S., Krüger N., Herrler T., Erichsen S., et al. SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor. Cell. 2020;181 doi: 10.1016/j.cell.2020.02.052. 271-80.e8. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Johnson B.A., Xie X., Bailey A.L., Kalveram B., Lokugamage K.G., Muruato A., et al. Loss of furin cleavage site attenuates SARS-CoV-2 pathogenesis. Nature. 2021;591:293–299. doi: 10.1038/s41586-021-03237-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Grunewald M.E., Chen Y., Kuny C., Maejima T., Lease R., Ferraris D., et al. The coronavirus macrodomain is required to prevent PARP-mediated inhibition of virus replication and enhancement of IFN expression. PLoS Pathogens. 2019;15 doi: 10.1371/journal.ppat.1007756. e1007756-e. [DOI] [PMC free article] [PubMed] [Google Scholar]
27.B. Aranda, H. Blankenburg, S. Kerrien, F.S.L. Brinkman, A. Ceol, E. Chautard, et al., PSICQUIC and PSISCORE: Accessing and scoring molecular interactions, Nature Publishing Group, 2011, pp. 528–529. [DOI] [PMC free article] [PubMed]
28.Gordon D.E., Jang G.M., Bouhaddou M., Xu J., Obernier K., White K.M., et al. A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature. 2020;583:459–468. doi: 10.1038/s41586-020-2286-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Shin D., Mukherjee R., Grewe D., Bojkova D., Baek K., Bhattacharya A., et al. Papain-like protease regulates SARS-CoV-2 viral spread and innate immunity. Nature. 2020;587:657–662. doi: 10.1038/s41586-020-2601-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
30.K. Han, J. Zhang, Roles of neddylation against viral infections, Nature Publishing Group, 2018, pp. 292–294. [DOI] [PMC free article] [PubMed]
31.Kumar R., Mehta D., Mishra N., Nayak D., Sunil S. Role of host-mediated post-translational modifications (PTMS) in RNA virus pathogenesis. Multidisciplinary Digital Publishing Institute. 2021:1–26. doi: 10.3390/ijms22010323. [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Hipp M.S., Kalveram B., Raasi S., Groettrup M., Schmidtke G. FAT10, a Ubiquitin-Independent Signal for Proteasomal Degradation. Mol. Cell. Biol. 2005;25:3483–3491. doi: 10.1128/MCB.25.9.3483-3491.2005. [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Basler M., Buerger S., Groettrup M. The ubiquitin-like modifier FAT10 in antigen processing and antimicrobial defense. Pergamon. 2015:129–132. doi: 10.1016/j.molimm.2015.04.012. [DOI] [PubMed] [Google Scholar]
34.The UniProt Consortium UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Research. 2021;49(D1):D480–D489. doi: 10.1093/nar/gkaa1100. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Supplementary Data 1

mmc1.pdf (277 KB, pdf)

Supplementary Data 2

mmc2.pdf (277 KB, pdf)

Data Availability Statement

The relevant data have been shared as supplementary material, and source code is available with the web server itself.
