A graph neural network-based approach for predicting SARS-CoV-2–human protein interactions from multiview data

Sumanta Ray; Syed Alberuni; Alexander Schönhuth

doi:10.1371/journal.pone.0332794

. 2025 Sep 25;20(9):e0332794. doi: 10.1371/journal.pone.0332794



Editor: Chandrabose Selvaraj⁴

PMCID: PMC12463271 PMID: 40997149

Abstract

The COVID-19 pandemic has demanded urgent and accelerated action toward developing effective therapeutic strategies. Drug repurposing models (in silico) are in high demand and require accurate and reliable molecular interaction data. While experimentally verified viral–host interaction data (SARS-CoV-2–human interactions published on April 30, 2020) provide an invaluable resource, these datasets include only a limited number of high-confidence interactions. Here, we extend these resources using a deep learning–based multiview graph neural network approach, coupled with optimal transport–based integration.

Our comprehensive validation strategy confirms 472 high-confidence predicted interactions between 280 host proteins and 27 SARS-CoV-2 proteins. The proposed model demonstrates robust predictive performance, achieving ROC-AUC scores of 85.9% (PPI network), 83.5% (GO similarity network), and 83.1% (sequence similarity network), with corresponding average precision scores of 86.4%, 82.8%, and 82.3% on independent test sets. Comparative evaluation shows that our multiview approach consistently outperforms conventional single-view and baseline graph learning methods.

The model combines features derived from protein sequences, gene ontology terms, and physical interaction information to improve interaction prediction. Furthermore, we systematically map the predicted host factors to FDA-approved drugs and identify several candidates, including lenalidomide and pirfenidone, which have established or emerging roles in COVID-19 therapy. Overall, our framework provides comprehensive and accurate predictions of SARS-CoV-2–host protein interactions and represents a valuable resource for drug repurposing efforts.

Introduction

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged in December 2019 in China and has since spread across the globe [1]. The COVID-19 (Coronavirus Disease-2019) pandemic has already affected more than 300 million people, and the numbers are still increasing. SARS-CoV-2, a highly virulent and contagious novel coronavirus strain, is responsible for causing acute respiratory coronavirus disease (COVID-19) [1]. Given the urgency of the situation, researchers have been searching for new therapeutic strategies and effective drugs over the past few months. In the pursuit of new remedies, one way is to find a proper set of viral targets and an interaction map between viral proteins and host proteins that will help to study the possible drug targets for inhibiting the infection in the host cell.

In general, viral infection involves numerous protein-protein interactions (PPIs) between the virus and host proteins. These interactions range from the initial binding of the viral envelope protein to the host membrane receptor to the hijacking of the host cell’s machinery for replication by viral proteins. Therefore, identifying potential target host proteins for the virus is crucial for understanding the internal mechanisms of viral infections and designing antiviral drugs.

Host-directed therapies (HDT), which mainly target human proteins that are important carriers for the virus (host factors) to enter and control human cells, are now an important supplementary strategy [2] in de novo drug discovery. Due to the unavailability of a proper set of host factors for use in HDT, it is difficult to produce/repurpose effective drugs for COVID-19. In-silico approaches for drug repurposing urgently require accurate viral-host interaction data to connect with drug-host interactions for discovering new drugs or small molecules.

The basis of any drug repurposing screen is viral-host molecular interaction data. The physical interaction sets are constructed using experimental and computational methods by considering data from viral and host proteins from different perspectives, such as domain and sequence information. For SARS-CoV-2–human protein interaction data, traditional experimental predictions often miss crucial connections between viral and host proteins. To address these limitations, advanced computational methods, particularly cutting-edge machine learning strategies such as Graph Convolutional Networks (GCNs), are increasingly being applied in this domain. GCNs are particularly well-suited for predicting interactions that are not captured through experimental approaches alone, including indirect connections that may significantly enhance our understanding of the human-SARS-CoV-2 interactome. These indirect connections are vital as they can reveal broader networks of interactions that are essential for the development of effective therapeutic strategies against COVID-19 [3].

Recent computational approaches for predicting SARS-CoV-2–human protein interactions have employed a wide range of methodologies, including virtual screening, molecular docking, sequence-based predictors, and advanced network analysis. For example, molecular docking-based pipelines have been used to identify potential inhibitors of SARS-CoV-2 proteins by screening FDA-approved antiviral drugs, providing important leads for drug repurposing [4]. Similarly, recent work by Choudhury et al. utilized deep learning and graph-based methods to systematically predict viral-host interactions and their therapeutic implications [5, 6]. A few research groups have employed network algorithms for discovering human PPIs with SARS-CoV-2-host interactors to treat COVID-19. There are two major contributions: Gordon et al. conducted seminal work on generating a protein interaction map between the SARS-CoV-2 and human proteins using affinity-purification mass spectrometry (AP-MS) [8]. In independent work, Dick et al. identified high-confidence interactions between human proteins and SARS-CoV-2 proteins using sequence-based PPI predictors (PIPE4 & SPRINT) [7]. Despite the numerous articles published since then, these two studies remain reliable sources for interactions. These two publicly available interaction sets have been utilized so far in drug repurposing works such as in [9–11]. The interaction sets consist of 512 high-confidence physical interactions between 29 SARS-CoV-2 proteins and 132 human proteins (targets). Apart from the human target proteins, other host proteins within the human interactome may help viral proteins manipulate human cellular machinery indirectly.

Identifying these host factors with advanced AI models may further expand the interaction map, providing an additional advantage for drug repurposing models. Some attempts have been made so far, such as in [12–17], where, in most cases, predictions are guided by a single source of information. In [14], the first attempt was made to incorporate high-level information during the prediction process using the AI model Node2Vec, followed by a rank aggregation technique.

However, relying solely on single-view models has significant limitations. For instance, approaches based exclusively on sequence similarity might overlook biologically relevant interactions that are not detectable at the sequence level but can be identified through functional or network-based similarities. As a specific example, two host proteins with low sequence similarity may still participate in the same biological pathway or cellular process, thus becoming relevant interaction partners for viral proteins. Single-view methods focusing on sequence alone would typically miss such indirect yet biologically meaningful relationships. Our multiview integration strategy explicitly addresses this gap by combining sequence similarity, functional similarity derived from Gene Ontology terms, and experimentally validated protein–protein interaction networks, providing a more comprehensive and biologically meaningful prediction of host factors involved in SARS-CoV-2 infection.

The main obstacles in constructing a computationally curated dataset of interactions involving SARS-CoV-2 are as follows: 1) The scarcity of experimentally confirmed, strong positive instances of interactions between SARS-CoV-2 and host proteins. 2) The cost of characterizing the SARS-CoV-2 strain and the sequence similarities and dissimilarities among the coronavirus families, which renders most existing methods inapplicable. 3) The absence, up to this point, of suitable data-driven computational approaches capable of integrating diverse interaction data. Owing to these difficulties, in-silico approaches generally focus on unveiling the similarity between target and non-target proteins within the human interactome. Their aim is to search the whole interactome for proteins that are similar (in some sense) to target proteins and are thus likely to follow the same interaction pattern with viral proteins. The similarity between proteins may arise in several ways; here, we use functional similarity, which leverages the Gene Ontology terms of biological processes, and sequence similarity between amino acid sequences, which signifies structural and functional importance.

As mentioned above, making full use of the available resources calls for sufficiently advanced high-dimensional (deep) and statistical techniques. In this work, to the best of our knowledge for the first time, we combine all of the ideas raised above. Our contributions are summarized as follows:

– We combined resources from the human interactome with a small set of experimentally verified positive samples of SARS-CoV-2 and host protein interactions, leveraging large-scale human-human PPI data in the prediction task.

– Although most viral-host interaction predictions are driven by experimentally verified physical interactions, incorporating additional data has recently gained considerable momentum [18, 19]. Our approach is the first to explicitly learn and integrate additional information (node features) from multiple views of protein networks.

– This approach is the first to predict host factors for SARS-CoV-2 using a graph convolutional network [20] as an advanced AI model. The model can successfully fuse three different features of protein nodes into an aggregated similarity matrix that finally drives the prediction task.

– We demonstrate a novel application of a statistical distance measure called ‘Wasserstein metric’ or optimal transport distance to assess similarity or dissimilarity between protein pairs, which are represented as two discrete sets of points in a multidimensional space.

– The hierarchical clusters obtained from the distance matrix contain human proteins, including SARS-CoV-2 targets and other non-target proteins that are similar in the three different views of the networks we consider. Human proteins other than SARS-CoV-2 targets that share a cluster with those targets may be considered important host factors for SARS-CoV-2.

Results

Workflow

Fig 1 describes the workflow of our analysis pipeline and outlines the basic ideas of our work. We describe all the important steps in the following paragraphs of this subsection. Throughout the text, we use the term ‘CoV-host’ to represent human proteins that have experimentally verified interactions with SARS-CoV-2 and ‘non-CoV-host’ to refer to human proteins without such interactions. The detailed algorithm is provided in the Methods section.

A. Constructing multi-view interaction networks of host proteins.

See A in Fig 1. We compiled three networks, representing three separate views of the target/non-target host proteins. First, we used the established and refined publicly available human PPI resources, namely the human interactome. Next, we constructed two additional networks based on functional similarity and protein sequence similarity among the target/non-target nodes. Gene ontology-based semantic similarity between two protein nodes is used to calculate the functional similarity score in one network, while the similarity between the amino acid sequences was utilized to build the weighted links in another network (see methods). All three networks have two types of nodes (described in panel-A):

– SARS-CoV-2-associated host proteins, or targets (CoV-host).
– Human proteins other than SARS-CoV-2-associated host proteins, or non-targets.

Thus, we have collected three different views of CoV-host and non-target nodes within the human interactome. These views must be combined into an integrated network that represents the three different views in a mutual context.

B. GCN-based graph embedding of the networks.

See A in Fig 1. First, we employ a network embedding strategy (here, a Graph Convolutional Network [20]) that extracts node features from the three networks separately. The GCN harnesses the capabilities of convolutional neural networks to encode relationships between samples: it combines the graph structure, typically represented as an adjacency matrix, with the information embedded within each node. To apply the GCN to each view/network, we convert the GO semantic similarity matrix and the sequence similarity matrix to graphs representing the relationships between proteins. Next, we encode the entire graph (adjacency matrix) into a fixed-size, low-dimensional latent space. The GCN encoder thus preserves the properties of each node relative to its neighborhood in the network. This process yields three feature matrices ($F_i$), with rows corresponding to nodes and columns representing the inferred network features.
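To make the encoding step concrete, the following is a minimal sketch of a two-layer GCN encoder forward pass in NumPy. It assumes the standard symmetric normalization of the adjacency matrix and an inner-product decoder for edge reconstruction; the layer sizes, the identity node features, and the random (untrained) weights are illustrative assumptions and not the configuration reported in the paper.

```python
import numpy as np

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, the symmetric GCN normalization."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)                 # degree >= 1 thanks to self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_encode(adj, features, w0, w1):
    """Two propagation steps: ReLU(A_norm X W0), then A_norm H W1."""
    a_norm = normalize_adjacency(adj)
    hidden = np.maximum(a_norm @ features @ w0, 0.0)  # ReLU activation
    return a_norm @ hidden @ w1                        # node embeddings

# Toy 5-node ring graph standing in for one view (hypothetical data).
adj = np.array([[0, 1, 0, 0, 1],
                [1, 0, 1, 0, 0],
                [0, 1, 0, 1, 0],
                [0, 0, 1, 0, 1],
                [1, 0, 0, 1, 0]], dtype=float)
n_nodes, hidden_dim, embed_dim = 5, 8, 4   # illustrative sizes
rng = np.random.default_rng(0)
features = np.eye(n_nodes)                 # assumption: featureless nodes
w0 = rng.normal(size=(n_nodes, hidden_dim))    # untrained weights
w1 = rng.normal(size=(hidden_dim, embed_dim))

embeddings = gcn_encode(adj, features, w0, w1)  # one row per node
# Inner-product decoder: score for every candidate edge.
edge_scores = 1.0 / (1.0 + np.exp(-(embeddings @ embeddings.T)))
```

In a trained model the weights would be fitted so that `edge_scores` reconstructs the held-in edges; the rows of `embeddings` then play the role of one feature matrix $F_i$.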

C. Representing a protein in a three-dimensional unit cube.

See B in Fig 1. After GCN-based embedding is applied to the three networks, each protein is represented by a d-dimensional vector for each of the networks. This yields a matrix $F \in R^{3 \times d}$ with one row for each of the networks, where each row is of length d, reflecting the size of the embedding vector. We then transpose F to obtain $F^{T} \in R^{d \times 3}$, where each row corresponds to a 3-dimensional point, so that overall each protein is represented by d 3-dimensional points.
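The construction of the per-protein point set can be sketched as follows. The sketch assumes that the three embedding matrices are aligned so that the same row index refers to the same protein, and it uses per-view min-max scaling to place the points in the unit cube; the scaling scheme is an assumption, since the text only states that each dimension is normalized.

```python
import numpy as np

def protein_point_cloud(view_embeddings, protein_idx):
    """Return a (d, 3) array: d points in the unit cube for one protein."""
    # F has one row per view (PPI, GO, sequence) and d columns.
    f = np.stack([emb[protein_idx] for emb in view_embeddings])  # shape (3, d)
    lo = f.min(axis=1, keepdims=True)
    hi = f.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant rows
    f = (f - lo) / span                      # assumed min-max scaling per view
    return f.T                               # F^T, shape (d, 3)

# Hypothetical embeddings: 6 proteins, d = 16, three views.
rng = np.random.default_rng(1)
n_proteins, d = 6, 16
views = [rng.normal(size=(n_proteins, d)) for _ in range(3)]
cloud = protein_point_cloud(views, protein_idx=2)
```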

D. Computing protein-protein similarity using the Wasserstein distance.

See C in Fig 1. Each protein is represented as a multivariate distribution encoded in a 3-dimensional unit cube. Each dimension represents a normalized univariate distribution corresponding to one of the three views. To capture the relationship between two proteins, we employ the Wasserstein distance, which is derived from optimal transport theory. The distance is computed between the two discrete sets of points within the unit cube that represent the two proteins. It quantifies the minimum cost required to transform the discrete distribution of points of one set into the distribution of the other set. The Wasserstein distance between two distributions in a given metric space M can be understood as the minimum “cost” of reshaping or transporting one collection of items into another, and is therefore commonly referred to as the ‘earth mover’s distance’. This global optimization takes into account both the “local” costs of moving individual elements between the collections and the “global” cost of achieving the overall transformation [21].
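For two proteins represented by the same number of equally weighted points, the optimal transport plan can be chosen as a one-to-one matching, so the distance can be computed with an assignment solver. The sketch below uses this fact with a Euclidean ground cost (the order-1 Wasserstein distance); the choice of order and ground cost is an assumption, and the function name is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def wasserstein_point_clouds(x, y):
    """Order-1 Wasserstein distance between two equal-size, uniform point sets."""
    if x.shape != y.shape:
        raise ValueError("this sketch requires point sets of equal shape")
    cost = cdist(x, y)                        # pairwise Euclidean costs
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return float(cost[rows, cols].mean())     # each point carries mass 1/d

# Hypothetical proteins: the second is the first shifted by 0.1 along one axis.
rng = np.random.default_rng(2)
protein_a = rng.random((16, 3))
protein_b = protein_a + np.array([0.1, 0.0, 0.0])
d_ab = wasserstein_point_clouds(protein_a, protein_b)
d_aa = wasserstein_point_clouds(protein_a, protein_a)
```

A pure translation is a convenient sanity check, because the distance then equals the length of the shift. For point sets of different sizes or with non-uniform weights, a general optimal transport solver would be needed instead.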

E. Clustering proteins.

Hierarchical clustering is performed on the distance matrix to group similar proteins into clusters (see C in Fig 1). Each cluster contains CoV-host as well as non-target host proteins.

F. Predicting probable targets of SARS-CoV-2.

We obtained 10 clusters containing proteins (CoV-host and non-target) similar to each other. The similarity between proteins is derived from three different biological resources: functional similarity, physical interaction information, and sequence similarity. Consequently, non-target proteins that share a cluster with CoV-host proteins may be considered probable targets and host factors of SARS-CoV-2.
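The cluster-sharing rule described here can be sketched in a few lines. The protein identifiers and cluster labels below are hypothetical placeholders, and the function name is illustrative.

```python
from collections import defaultdict

def candidates_by_cluster(cluster_of, cov_hosts):
    """Map each cluster that contains a CoV-host to its non-target members."""
    members = defaultdict(set)
    for protein, cluster in cluster_of.items():
        members[cluster].add(protein)
    candidates = {}
    for cluster, proteins in members.items():
        known = proteins & cov_hosts
        if known:  # only clusters anchored by at least one known CoV-host
            candidates[cluster] = sorted(proteins - known)
    return candidates

# Hypothetical toy assignment: H* are CoV-host proteins, N* are non-targets.
cluster_of = {"H1": 1, "H2": 1, "N1": 1, "N2": 2, "N3": 2, "H3": 3}
cov_hosts = {"H1", "H2", "H3"}
candidates = candidates_by_cluster(cluster_of, cov_hosts)
```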

Comparative evaluation with baseline methods

To demonstrate the effectiveness of our proposed multiview fusion approach using Wasserstein distance, we compared it against several baseline methods. Specifically, we conducted experiments using: (1) single-view Graph Convolutional Networks (GCNs) individually applied on each of the three networks (PPI, GO, and Sequence similarity), (2) simple fusion methods (average and concatenation) of embeddings obtained from single-view GCNs, and (3) standard graph embedding approaches such as DeepWalk followed by a simple concatenation fusion.

The results (see Table 1) show that our proposed multiview fusion framework consistently outperforms these baseline approaches, achieving higher ROC-AUC and Average Precision (AP) scores. For instance, our method achieves ROC-AUC and AP scores of 0.91 and 0.89, respectively, compared to the next-best baseline (GCN embeddings with concatenation fusion), which achieves ROC-AUC and AP scores of 0.87 and 0.85. These improvements clearly indicate the effectiveness and superiority of our Wasserstein-based fusion strategy over simpler embedding and integration techniques.

Table 1. Quantitative comparison of our proposed method against baseline embedding and fusion methods.

Method	ROC-AUC (Val)	AP (Val)	ROC-AUC (Test)	AP (Test)
GCN (PPI network only)	0.87	0.87	0.86	0.86
GCN (GO network only)	0.85	0.83	0.83	0.82
GCN (Sequence network only)	0.83	0.86	0.83	0.82
DeepWalk (PPI network) + Concatenation fusion	0.82	0.81	0.80	0.79
GCN embeddings + Average fusion	0.86	0.85	0.84	0.83
GCN embeddings + Concatenation fusion	0.88	0.86	0.87	0.85
Proposed method (GCN + Wasserstein fusion)	0.92	0.90	0.91	0.89
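The two simple fusion baselines in Table 1 can be illustrated as follows. This is a sketch under the assumption that the per-view embedding matrices share the same protein ordering and embedding size; the array sizes are arbitrary.

```python
import numpy as np

def average_fusion(views):
    """Element-wise mean of the per-view embeddings: output shape (n, d)."""
    return np.mean(np.stack(views), axis=0)

def concat_fusion(views):
    """Concatenate per-view embeddings along the feature axis: shape (n, 3d)."""
    return np.concatenate(views, axis=1)

rng = np.random.default_rng(3)
views = [rng.normal(size=(6, 16)) for _ in range(3)]  # hypothetical embeddings
fused_avg = average_fusion(views)
fused_cat = concat_fusion(views)
```

Both baselines collapse each protein to a single vector, whereas the proposed method keeps the three views as coordinates of a point set and compares proteins through a transport distance.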


Training the GCN model

To train the GCN model on our dataset, we initially performed a random split of the graph data, dividing it in a ratio of 8:1:1 into training, validation, and test sets. Note that the test edges are excluded from the training set, but all nodes of the graph remain included in the training data. We then train the model on the training edges and evaluate how well it reconstructs the previously removed test edges. Training runs for 50 epochs with the Adam optimizer, a learning rate of 0.001, and a dropout rate of 0.1. We use the Rectified Linear Unit (ReLU) as the activation function. Finally, we extract low-dimensional embeddings from the output of the encoder in the trained model.

Table 2 summarizes the model’s quantitative performance on all three network types (PPI network, GO similarity network, and sequence similarity network), reporting both ROC-AUC and average precision (AP) scores for the validation and test splits. The AP score takes into account both precision and recall, key metrics of a classification model. Precision is the ratio of true positive predictions to the total predicted positives, while recall is the ratio of true positive predictions to the total actual positives. The AP score summarizes the area under the precision-recall curve, which plots precision on the y-axis and recall on the x-axis. It emphasizes high precision at low recall levels, making it suitable for imbalanced datasets. It ranges from 0 to 1, with higher values indicating better performance. The ROC-AUC score, also ranging from 0 to 1 with higher values indicating better performance, measures the trade-off between the true positive rate (sensitivity or recall) and the false positive rate (1 - specificity).

As shown in Table 2, the proposed model achieves robust predictive performance across all network views, with ROC-AUC values exceeding 83% and AP scores consistently above 82% on the independent test sets. These results provide strong evidence for the reliability and generalizability of our approach.
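The edge split and the two evaluation metrics can be sketched in plain NumPy as below. The function names are hypothetical, the toy labels and scores are invented for illustration, and the AP routine assumes that there are no tied scores.

```python
import numpy as np

def split_edges(edges, seed=0):
    """Randomly split an (m, 2) edge array into train/val/test at 8:1:1."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(edges))
    n_hold = len(edges) // 10
    test = edges[perm[:n_hold]]
    val = edges[perm[n_hold:2 * n_hold]]
    train = edges[perm[2 * n_hold:]]
    return train, val, test

def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()  # ties count half
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of precision@k over the ranks k at which a positive occurs."""
    order = np.argsort(-scores, kind="stable")
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return (precision_at_k * hits).sum() / hits.sum()

edges = np.arange(200).reshape(100, 2)        # 100 hypothetical edges
train, val, test = split_edges(edges)

labels = np.array([1, 0, 1, 0])               # 1 = true edge, 0 = non-edge
scores = np.array([0.9, 0.8, 0.7, 0.1])       # invented decoder scores
auc = roc_auc(labels, scores)                 # 3 of 4 pos/neg pairs ordered correctly
ap = average_precision(labels, scores)        # (1 + 2/3) / 2
```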

Table 2. Performance of the GCN on the three networks. The first two columns show the number of edges and the number of nodes in each network. The remaining columns show the ROC-AUC and average precision (AP) scores, in percent, for the validation and test edges.

Network	#edges	#nodes	Validation ROC	Validation AP	Test ROC	Test AP
PPI-network	102882	11314	87.32	87.08	85.87	86.39
GO similarity network	18685975	10961	84.79	83.21	83.46	82.81
Sequence similarity network	17685975	10691	83.38	86.48	83.10	82.30


Hierarchical clustering of the Wasserstein distance matrix

Upon computing the Wasserstein distance for each pair of proteins, hierarchical clustering of the proteins is performed using Ward’s minimum variance method [22]. The cutreeDynamic function (with method = ‘hybrid’ and minClusterSize = 200) of the dynamicTreeCut R package is used to cut the dendrogram at a specific level. The function cuts the dendrogram by analyzing the shape of its branches. This results in 10 clusters (silhouette score = 0.7), each composed of a combination of CoV-host and non-CoV-host proteins. Proteins within a cluster are similar with respect to their gene ontology terms, amino acid sequences, and protein-protein interaction (PPI) connections. Therefore, non-CoV-host proteins that are grouped in the same clusters as CoV-host proteins share characteristics that indicate their potential to be targeted by specific SARS-CoV-2 proteins.
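As a rough Python analogue of this step, the sketch below runs Ward linkage on a precomputed distance matrix with SciPy. It is not the dynamic tree cut used in the paper: as a stand-in, it cuts the dendrogram into a fixed number of clusters. Note also that SciPy's Ward update formally assumes Euclidean distances, so applying it to a Wasserstein distance matrix is an approximation. The toy distances are invented.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def ward_clusters(dist_matrix, n_clusters):
    """Ward linkage on a precomputed distance matrix, cut into n_clusters."""
    sym = (dist_matrix + dist_matrix.T) / 2.0    # enforce exact symmetry
    np.fill_diagonal(sym, 0.0)
    condensed = squareform(sym, checks=False)    # upper triangle as a vector
    tree = linkage(condensed, method="ward")
    # Stand-in for cutreeDynamic: cut to a fixed number of clusters.
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Hypothetical distances: two well-separated groups of three proteins each.
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
dist = np.abs(pts[:, None] - pts[None, :])
cluster_labels = ward_clusters(dist, n_clusters=2)
```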

Fig 2, panel-A shows a heatmap of the Wasserstein distance matrix, with 10 identified clusters. Table 3 shows a short summary of the identified clusters.

Table 3. Details of the 10 clusters, including the total number of proteins, the number of CoV-host proteins, the number of non-CoV-host proteins, and the predicted interactions obtained from each cluster.

Sl. No.	#proteins	#CoV-host	#non-CoV-host	#predicted interactions
cluster-1	2075	109	1966	218
cluster-2	1396	34	1361	68
cluster-3	1381	44	1337	88
cluster-4	1226	106	1120	212
cluster-5	915	41	874	82
cluster-6	751	72	688	144
cluster-7	683	20	663	40
cluster-8	330	15	315	30
cluster-9	211	12	199	24
cluster-10	205	13	192	26


Notably, we chose hierarchical clustering for its interpretability and established application in bioinformatics, especially for clearly delineating biologically meaningful groups in protein-protein interaction (PPI) datasets. Ward’s linkage was specifically employed to minimize intra-cluster variance, aiding biological interpretation and visualization. However, we acknowledge that protein clusters can biologically overlap, and hierarchical clustering does not directly capture this overlap. Additionally, hierarchical clustering has relatively high computational complexity ($O(n^3)$), which could pose limitations for larger datasets. Therefore, future work should consider exploring overlapping clustering methods, such as fuzzy clustering or network community detection (e.g., Louvain, Infomap), to capture biological relationships and reduce potential clustering errors, particularly when analyzing larger interactome datasets.

Model interpretability through barycenter

Given that the features used in our framework are extracted from a Graph Convolutional Network (GCN), traditional interpretability methods like SHAP (Shapley Additive exPlanations) [23] values may not provide significant insights due to the complex and interdependent nature of these features. Instead, we have utilized the concept of barycenter in Wasserstein space to summarize and interpret the results produced by our model.

Here, the Wasserstein distance is used to create the feature space, representing features in a 3D space that integrates sequence similarity, Gene Ontology (GO) similarity, and PPI network features. The overall pattern of each identified cluster is summarized by computing its barycenter in Wasserstein space. This barycenter efficiently describes the underlying dependencies between the different measures used to create the Wasserstein distance matrix. Fig 3 shows that in most cases the measures have a near-perfect dependence structure within the cluster (see cluster-7, cluster-8 and cluster-9). This provides a clear and meaningful overview of the clustering and prediction results.

Fig 3. Clusters 7, 8, and 9 demonstrate near-perfect dependence, highlighting the cohesion of these measures.

Predicted interactions

Fig 4 shows a network plot (Sankey diagram) illustrating the direct and indirect links between SARS-CoV-2 proteins, CoV-host proteins, and predicted human proteins at a Wasserstein distance threshold of 0.07. This reveals 73 predicted interactions between SARS-CoV-2 and non-CoV-host proteins. A small Wasserstein distance indicates similar proteins, so the threshold is kept low to produce a small set of interacting proteins. S1 Table lists a total of 472 interactions between SARS-CoV-2 proteins and human proteins, obtained by setting the threshold to 0.5. The first column in Fig 4 represents SARS-CoV-2 proteins, whereas the second and third columns represent CoV-host and predicted non-CoV-host proteins. For example, the SARS-CoV-2 protein ‘M’ has existing interactions with ‘MNDA’, ‘ATP6V1A’ and ‘PMPCB’, which are predicted to be associated with the non-CoV-host proteins ‘SPOP’, ‘CIR’, and ‘SLC22A5’, respectively. This suggests that ‘M’ also has a good chance of interacting with those non-CoV-host proteins.

To visualize the Wasserstein distance between the CoV-host proteins and the predicted non-CoV-host proteins for a particular SARS-CoV-2 protein, we create heatmaps of selected sub-matrices. The rows of the sub-matrices represent the CoV-host proteins interacting with a particular SARS-CoV-2 protein, and the columns represent the non-CoV-host proteins inferred from those CoV-hosts in the predicted list. Fig 2, panel-C shows those heatmaps (heatmaps for all SARS-CoV-2 proteins can be found in S1 Fig). For example, the first heatmap of panel C shows the Wasserstein distance between four CoV-host and eight non-CoV-host proteins for the SARS-CoV-2 protein SPIKE. The eight non-CoV-host proteins that have small distances to at least one of the CoV-hosts of SPIKE are ‘RGS9’, ‘CALCRL’, ‘XDH’, ‘DAGLB’, ‘SPA17’, ‘SERPING1’, ‘MYL2’ and ‘NDUFAF4’. Among them, ‘XDH’ and ‘DAGLB’ have a small distance to angiotensin-converting enzyme 2 (ACE2), which is already known as the entry receptor for SARS-CoV-2 [24]. It has been demonstrated that xanthine dehydrogenase (encoded by the XDH gene), defects of which cause xanthinuria, may contribute to adult respiratory distress syndrome and may potentiate influenza infection through an oxygen metabolite-dependent mechanism [25]; it is thus a potential candidate target of the SPIKE protein. The heatmap also shows that ‘SERPING1’ has a low distance to the CoV-hosts ACE and GOLGA7, suggesting an indirect interaction with SPIKE. Interestingly, [26] provides computational evidence of an interaction between SERPING1 and a SARS-CoV-2 viral protein.
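The threshold rule that turns distances into predicted interactions can be sketched as follows. The protein names are taken from the example in the text, but the distance values and the extra candidate `CANDIDATE_X` are invented for illustration; using "less than or equal to" at the threshold is also an assumption.

```python
def transfer_interactions(viral_to_hosts, distances, threshold):
    """Propose (viral protein, candidate) pairs via close known CoV-hosts.

    viral_to_hosts: dict mapping a viral protein to its set of CoV-hosts.
    distances: dict mapping (cov_host, candidate) to a Wasserstein distance.
    """
    predicted = set()
    for viral, hosts in viral_to_hosts.items():
        for (host, candidate), dist in distances.items():
            if host in hosts and dist <= threshold:
                predicted.add((viral, candidate))
    return predicted

viral_to_hosts = {"M": {"MNDA", "PMPCB"}}
distances = {                       # hypothetical distance values
    ("MNDA", "SPOP"): 0.03,
    ("PMPCB", "SLC22A5"): 0.05,
    ("MNDA", "CANDIDATE_X"): 0.40,  # hypothetical candidate
}
strict = transfer_interactions(viral_to_hosts, distances, threshold=0.07)
relaxed = transfer_interactions(viral_to_hosts, distances, threshold=0.5)
```

Raising the threshold from 0.07 to 0.5 admits more candidates, which mirrors the difference between the 73 interactions shown in Fig 4 and the 472 interactions listed in S1 Table.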

Biological relevance and therapeutic potential of predicted host factors

We assess the validity of our predictions by comparing them with the host factors of other viruses. Specifically, we investigate the host factors associated with six different human-pathogenic RNA viruses: Dengue, HIV-1, HCV, Ebola, Zika, and H1N1. We found evidence in the published literature that proteins on our predicted list interact with various viruses. Table 4 shows the number of predicted proteins that overlap with the experimentally verified interaction sets of the six viruses. Furthermore, we identified predicted proteins that interact with more than one of these viruses. For example, the protein ‘CD209’ interacts with the Dengue, HIV-1 and HCV viruses. Other predicted proteins, such as ‘DHCR24’, ‘HPS6’, ‘PML’, ‘NFKB1’, and ‘UBC’, are found in the interaction lists of the HIV-1, Ebola and H1N1 viruses. Fig 4 Panel-B shows a network plot of all the predicted proteins that have existing interactions with at least three viruses. For example, the figure shows that CD209, which is predicted to interact with the SARS-CoV-2 nucleoprotein ‘N’, also has interactions with five viruses: Dengue, HIV-1, HCV, Ebola, and Zika.

Table 4. Overlap of the predicted human proteins with human proteins targeted by other viruses.

Sl. No.	Virus	Database	#Human proteins in database	#Overlapping proteins
1	Dengue	DenvInt [27]	480	12
2	HIV-1	HIV-1 Human Interaction Database [28]	4667	131
3	HCV	HCVpro database [29]	467	16
4	Ebola	Zhou et al. [30]	3605	123
5	Zika	Zikaabase [31]	20	1
6	H1N1	Saphira et al. [32]	617	17
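The overlap counts in Table 4 amount to set intersections between the predicted list and each virus-specific host-factor set. A minimal sketch is given below; the memberships of CD209 and PML follow the examples in the text, while `HYPOTHETICAL_1` and the reduced set of viruses are placeholders.

```python
from collections import Counter

def virus_overlap(predicted, virus_hosts):
    """Intersect the predicted proteins with each virus's host-factor set."""
    return {virus: predicted & hosts for virus, hosts in virus_hosts.items()}

def multi_virus_proteins(predicted, virus_hosts, min_viruses):
    """Predicted proteins that appear in at least min_viruses host sets."""
    counts = Counter(
        protein
        for hosts in virus_hosts.values()
        for protein in predicted & hosts
    )
    return {protein for protein, n in counts.items() if n >= min_viruses}

predicted = {"CD209", "PML", "HYPOTHETICAL_1"}
virus_hosts = {                      # toy excerpt, not the full databases
    "Dengue": {"CD209"},
    "HIV-1": {"CD209", "PML"},
    "HCV": {"CD209"},
}
overlap = virus_overlap(predicted, virus_hosts)
shared = multi_virus_proteins(predicted, virus_hosts, min_viruses=3)
```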


To further evaluate the biological relevance of our predictions, we performed Gene Ontology (GO) enrichment analysis on the top 100 predicted human proteins (see S2 Table for details). Notably, the significantly enriched terms include processes such as viral entry into host cell (GO:0046718, $p = 3.1 \times 10^{- 6}$ ), regulation of immune response (GO:0050776, $p = 1.2 \times 10^{- 5}$ ), protein localization to membrane (GO:0072657, $p = 4.6 \times 10^{- 4}$ ), and apoptotic signaling pathway (GO:0097190, $p = 2.7 \times 10^{- 3}$ ). These results reinforce the functional plausibility of our predictions and demonstrate that the prioritized proteins are highly involved in host pathways relevant to SARS-CoV-2 infection.
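The text does not state which statistical test underlies these p-values. A common choice for GO over-representation analysis is the one-sided hypergeometric test, sketched below under that assumption; the function name and the toy counts are hypothetical, and no multiple-testing correction is shown.

```python
from scipy.stats import hypergeom

def enrichment_pvalue(n_background, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) when drawing n_selected proteins without replacement.

    n_background: proteins in the background set
    n_annotated:  background proteins annotated with the GO term
    n_selected:   proteins in the selected list (e.g. the top predictions)
    n_overlap:    selected proteins annotated with the GO term
    """
    # sf(k - 1) gives the upper tail P(X >= k).
    return float(hypergeom.sf(n_overlap - 1, n_background, n_annotated, n_selected))

# Toy check: all 5 selected proteins fall among the 5 annotated ones out of 10,
# which happens with probability 1 / C(10, 5) = 1 / 252.
p_extreme = enrichment_pvalue(10, 5, 5, 5)
p_trivial = enrichment_pvalue(10, 5, 5, 0)   # an overlap of >= 0 is certain
```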

Predicted host factors promote host-directed therapy (HDT) options against SARS-CoV-2

When determining repurposable drugs to fight any virus, one has to keep in mind that targeting a single viral protein is not a permanent solution because of resistance-inducing mutations of viral proteins. Therefore, targeting important human proteins that act as carriers for the virus in human host cells offers an important supplementary strategy, called host-directed therapy (HDT) [2]. As this strategy does not target the viral proteins directly, it is less prone to resistance because human proteins are less affected by mutations. For HDT, the main challenge is to identify proteins that are crucial for the maintenance and persistence of the disease-causing virus in human cells. When these proteins are targeted, the replication machinery of the virus in the host cells collapses. For all these reasons, repurposable drugs and the proteins they target for HDT have great potential in COVID-19 therapeutics. Moreover, this approach offers hope for rapid implementation due to fewer side effects. The predicted proteins may act as host factors and can be targeted in an HDT strategy.

To find the association of the predicted proteins with different drugs/small molecules, we used the drug repurposing hub of the CMAP database [33]. This resource comprises a curated and annotated compilation of FDA-approved drugs, clinical trial medications, and pre-clinical tool compounds. Specifically, we adopted a three-step strategy utilizing established drug-target databases:

Identification of Predicted Host Proteins: The predicted host factors from our model were first identified based on their significant Wasserstein similarity to known SARS-CoV-2 interacting host proteins, suggesting their potential roles as targets for Host Directed Therapy (HDT).
Drug-Target Mapping using Connectivity Map (CMAP) Database: Next, the identified host proteins were systematically mapped to known drug-target interactions using the Connectivity Map (CMAP) drug repurposing hub database [33]. CMAP includes curated annotations of FDA-approved drugs, clinical-trial-phase medications, and pre-clinical tool compounds, enabling precise and reliable mapping of proteins to therapeutically relevant small molecules.
Filtering and Validation of Drug Associations: To ensure biological and therapeutic relevance, we specifically focused on predicted host proteins with existing evidence of interaction with at least two other viruses. Additionally, we filtered the associations to drugs approved or launched specifically for infectious diseases (including viral infections), pulmonary diseases, and cancers. Finally, these inferred drug-target associations were cross-validated with recent COVID-19 literature to confirm their emerging or established therapeutic roles in COVID-19 treatment scenarios (e.g., pirfenidone, pomalidomide, lenalidomide, dasatinib, nilotinib, imatinib) [34–37].
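The mapping and filtering steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the record layout, the function name `map_host_factors_to_drugs`, the virus counts, and the placeholder entries (`GENE_X`, `drug_a`, `drug_b`, `drug_c`) are all assumptions introduced here, and the final literature cross-validation step is not modeled.

```python
from typing import NamedTuple


class DrugRecord(NamedTuple):
    """One drug-target row (hypothetical layout, loosely following Table 5)."""
    protein: str            # predicted host protein (gene symbol)
    drug: str
    clinical_phase: str     # e.g. "Launched"
    disease_areas: frozenset


# Disease areas kept by the filter; the labels follow those used in Table 5.
RELEVANT_AREAS = frozenset(
    {"infectious disease", "pulmonary", "hematologic malignancy", "oncology"}
)


def map_host_factors_to_drugs(predicted, virus_counts, drug_table, min_viruses=2):
    """Keep launched drugs in relevant disease areas whose target is a predicted
    host protein that interacts with at least `min_viruses` other viruses."""
    hits = []
    for rec in drug_table:
        if rec.protein not in predicted:
            continue
        if virus_counts.get(rec.protein, 0) < min_viruses:
            continue
        if rec.clinical_phase != "Launched":
            continue
        if not (rec.disease_areas & RELEVANT_AREAS):
            continue
        hits.append(rec)
    return hits


# Toy inputs: the first two rows echo Table 5; counts and other rows are made up.
predicted = {"TNF", "ABCB1", "GENE_X"}
virus_counts = {"TNF": 3, "ABCB1": 2, "GENE_X": 1}
drug_table = [
    DrugRecord("TNF", "pirfenidone", "Launched", frozenset({"pulmonary"})),
    DrugRecord("ABCB1", "roxithromycin", "Launched", frozenset({"infectious disease"})),
    DrugRecord("GENE_X", "drug_a", "Launched", frozenset({"pulmonary"})),
    DrugRecord("TNF", "drug_b", "Phase 2", frozenset({"oncology"})),
    DrugRecord("TNF", "drug_c", "Launched", frozenset({"neurology"})),
]
hits = map_host_factors_to_drugs(predicted, virus_counts, drug_table)
```

Each filter is a separate early exit, so individual criteria (virus count, clinical phase, disease area) can be relaxed or tightened independently.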

We found 50 predicted human proteins that interact with at least two other viruses and are connected to drugs or small molecules. Table 5 shows 14 such proteins and their associations with launched/FDA-approved drugs for infectious diseases (including viral infections), pulmonary diseases, and cancers. As Table 5 shows, some proteins, such as ABCB1 and GRIN2D, are connected with the drugs erythromycin, roxithromycin, and gabapentin, which are mainly used against influenza A virus and various pulmonary and respiratory tract infections. Other proteins, such as TNF and POLE, are connected with drugs that are used as anticancer agents and to treat pulmonary diseases and hematologic malignancies. Some of these drugs are also gaining attention in the treatment of COVID-19. For example, the predicted host factor TNF is associated with six drugs (clenbuterol, epinephrine, pirfenidone, pranlukast, lenalidomide, and pomalidomide), almost all of which are described in recent literature as promising repurposing candidates against COVID-19. In [34], pomalidomide and lenalidomide are described as promising repurposable drugs for use against COVID-19. In [35], pirfenidone is described as a potential treatment for COVID-19. In [36], epinephrine is presented as an intervention to minimize the severity of COVID-19. Pranlukast, which is generally used to treat influenza, metapneumovirus, or coronavirus infections, has also been reported for use against COVID-19. The kinase inhibitors dasatinib, nilotinib, and imatinib, which are associated with the predicted proteins KIT, CSF1R, and STAT5B (Table 5), are described as potential candidates for COVID-19 treatment [37].

Table 5. Table shows associations of FDA-approved drugs with the predicted host factors.

predicted protein	drug	clinical phase	uses	disease area
ABCB1	erythromycin-estolate	Launched	bacterial 50S ribosomal subunit inhibitor	infectious disease
ABCB1	erythromycin-ethylsuccinate	Launched	cytochrome P450 inhibitor\|protein synthesis inhibitor	infectious disease
ABCB1	roxithromycin	Launched	bacterial 50S ribosomal subunit inhibitor	infectious disease
MTR	cyanocobalamin	Launched	methylmalonyl CoA mutase stimulant\|vitamin B	hematology\|infectious disease
GRIN2D	amantadine	Launched	glutamate receptor antagonist	infectious disease\|neurology
GRIN2D	gabapentin	Launched	calcium channel blocker	infectious disease\|neurology
TNF	chloroquine	Launched	antimalarial agent	infectious disease
FKBP1A	sirolimus	Launched	mTOR inhibitor	transplant\|pulmonary
NFKB1	pranlukast	Launched	leukotriene receptor antagonist	pulmonary
TNF	clenbuterol	Launched	adrenergic receptor agonist	pulmonary
TNF	epinephrine	Launched	adrenergic receptor agonist\|carbonic anhydrase activator\|neurotransmitter	cardiology\|allergy\|pulmonary
TNF	pirfenidone	Launched	TGF beta receptor inhibitor	pulmonary
TNF	pranlukast	Launched	leukotriene receptor antagonist	pulmonary
TNF	lenalidomide	Launched	anticancer agent	hematologic malignancy
POLE	cladribine	Launched	adenosine deaminase inhibitor\|ribonucleotide reductase inhibitor	hematologic malignancy
POLE	cytarabine	Launched	ribonucleotide reductase inhibitor	hematologic malignancy
POLE	fludarabine-phosphate	Launched	ribonucleotide reductase inhibitor	hematologic malignancy
KIT	dasatinib	Launched	Bcr-Abl kinase inhibitor\|ephrin inhibitor\|KIT inhibitor\|PDGFR tyrosine	hematologic malignancy
KIT	nilotinib	Launched	Abl kinase inhibitor\|Bcr-Abl kinase inhibitor	hematologic malignancy
PSMB2	carfilzomib	Launched	proteasome inhibitor	hematologic malignancy
CSF1R	imatinib	Launched	Bcr-Abl kinase inhibitor\|KIT inhibitor\|PDGFR tyrosine	hematologic malignancy\|oncology
CSF1R	imatinib	Launched	Bcr-Abl kinase inhibitor\|KIT inhibitor	hematologic malignancy\|oncology
TNF	pomalidomide	Launched	angiogenesis inhibitor\|tumor necrosis factor production inhibitor	hematologic malignancy
STAT5B	dasatinib	Launched	Bcr-Abl kinase inhibitor\|ephrin inhibitor\|KIT inhibitor\|PDGFR tyrosine	hematologic malignancy
ROS1	PF-06463922	Launched	ALK tyrosine kinase receptor inhibitor	oncology
MAP1A	estramustine	Launched	DNA alkylating agent	oncology
KIT	regorafenib	Launched	FGFR inhibitor\|KIT inhibitor\|PDGFR tyrosine	oncology
NTRK3	entrectinib	Launched	ALK tyrosine kinase receptor inhibitor\|proto-oncogene tyrosine protein kinase inhibitor	oncology


In summary, the table shows that 1) the predicted proteins are already targeted in HDT for different viral diseases such as herpes and influenza A, and for various pulmonary and respiratory tract infections, and 2) most of the associated drugs are described as promising candidates for repurposing against COVID-19. Therefore, these proteins may be treated as potential HDT targets for use against SARS-CoV-2 infection.

Materials and methods

Overview of dataset

In this study, two categories of interaction datasets are used: the human protein interactome and SARS-CoV-2-host protein interaction data.

SARS-CoV-2-host interaction data.

We obtained SARS-CoV-2-host interaction data from two recent studies by Gordon et al. and Dick et al. [7, 8]. The dataset of Gordon et al. [8] consists of 332 high-confidence interactions, while Dick et al. [7] identified 261 high-confidence interactions. The two studies are methodologically independent: Gordon et al. used affinity-purification mass spectrometry (AP-MS), whereas Dick et al. employed sequence-based PPI predictors (PIPE4 and SPRINT).

The human protein Interactome.

We have compiled a comprehensive list of human PPIs from two datasets: (1) CCSB human Interactome database, consisting of 7,000 proteins, and 13944 high-quality binary interactions [38–40]; (2) The Human Protein Reference Database [41], consisting of 8920 proteins and 53184 PPIs.

The summary of all the datasets is provided in Table 6.

Table 6. Datasets used in this study.

Sl.No.	Dataset Category	Dataset	#Edges	#Nodes
1	Human PPI	CCSB [45]	13944	4303
1	Human PPI	HPRD [41]	39240	9617
2	SARS_CoV2 Host_PPI	Gordon et al. [8]	332	#SARS-CoV2: 27	#Host: 332
2	SARS_CoV2 Host_PPI	Dick et al. [7]	261	#SARS-CoV2: 6	#Host: 202


In preparing the dataset for analysis, several preprocessing steps were undertaken to ensure data quality and consistency. The data were normalized to ensure that all features were on a comparable scale, which is essential for the effective performance of machine learning models such as Graph Convolutional Networks (GCNs). Missing values were processed by imputing the mean for continuous variables and the most frequent value for categorical variables, thus maintaining the dataset’s integrity. Additionally, duplicate entries were removed to prevent redundancy and potential biases in the analysis. The node degree distribution follows a power-law distribution, as expected in scale-free networks. Notably, as this is an interaction network, there are no edge weights associated with the connections between nodes.

Moreover, only high-confidence, experimentally validated interactions were retained from both CCSB and HPRD interactome databases (Table 6). Specifically, interactions supported by multiple independent experimental validations were considered high-confidence, whereas interactions derived from single experiments or insufficient validation evidence were filtered out. Ambiguous entries were also systematically excluded. This rigorous selection criterion ensures the reliability and biological relevance of interactions, significantly enhancing downstream model performance and interpretability. The adopted confidence thresholding strategy aligns with established recommendations from prior studies [42–44].

Extracting node features using GCN

We utilized a graph convolutional network (GCN) [20] to learn low-dimensional embeddings of nodes from the three networks separately. Given a graph $G = (V, E)$ , the objective is to learn a function of the signals or features associated with the nodes of G. This function takes two inputs: i) an optional feature matrix $X \in \mathbb{R}^{N \times D}$ , whose row x_i describes the features of node i, where N is the number of nodes and D is the number of input features; and ii) a representation of the graph’s structure, typically an adjacency matrix A. The function generates node-level outputs $Z \in \mathbb{R}^{N \times F}$ , where F is the output dimension of each node feature. Graph-level outputs can be modeled using indexing operations, similar to the pooling operations used in standard convolutional neural networks [46]. Generally, each layer of the neural network can be written as a non-linear function $H^{(l + 1)} = f (H^{(l)}, A)$ , where $H^{(0)} = X$ and $H^{(L)} = Z$ , and L is the number of layers. Following the layer-wise propagation rule proposed in [20], the function can be written as $f (H^{(l)}, A) = σ ({\hat{D}}^{- 1 / 2} \hat{A} {\hat{D}}^{- 1 / 2} H^{(l)} W^{(l)})$ , where $\hat{A} = A + I$ , I is the identity matrix, $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$ with $\hat{D}_{i i} = \sum_{j} \hat{A}_{i j}$ , $W^{(l)}$ is the trainable weight matrix of layer l, and σ(.) is a non-linear activation function such as ReLU. Intuitively, the graph convolution operator computes a node’s updated feature by taking a weighted average of its own attributes and those of its neighboring nodes. This operation ensures that two nodes with identical neighbor structures and node characteristics receive identical embeddings. Our adoption of the GCN architecture closely follows the approach outlined in [20], involving a three-layer GCN architecture with randomly initialized weights. 
For each of the three network views (PPI network, GO similarity network, and sequence similarity network), we implemented a three-layer GCN architecture, selected empirically by grid search on validation performance. Each GCN was trained for 50 epochs using the Adam optimizer (learning rate: 0.001), with a dropout rate of 0.1 and the ReLU activation function in all hidden layers. The loss function was binary cross-entropy. Hyperparameter choices, including the number of layers, learning rate, and dropout rate, were optimized based on predictive performance on an 8:1:1 training, validation, and test split. For supervised training, negative samples (non-interacting protein pairs) were generated by randomly sampling protein pairs that were not present in the verified interaction set, so that no sampled negative overlapped with a known positive (interacting) pair. This random negative sampling approach is widely adopted in the PPI prediction literature and aims to create a balanced dataset for model training and evaluation. The number of negative samples was matched to the number of positive samples in both training and test splits to mitigate class imbalance and ensure reliable model performance evaluation. Additional ablation experiments (varying the number of GCN layers and dropout rate) confirmed that the three-layer GCN achieved the best balance between predictive accuracy and model generalization, with deeper models yielding diminishing returns or overfitting. These hyperparameters were kept consistent across all network views before integration of node embeddings via the Wasserstein distance. For the three networks (PPI network, GO similarity network, and sequence similarity network), we use the graph’s adjacency matrix (denoted as A) and set X to an identity matrix (as we lack node-specific features). 
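The negative sampling and 8:1:1 split described above can be illustrated as follows. This is a sketch under stated assumptions: the function names `sample_negative_pairs` and `split_8_1_1`, the fixed seed, and the toy protein identifiers are hypothetical, and the authors' actual sampling code is not given in the text.

```python
import random


def sample_negative_pairs(proteins, positive_pairs, n_samples, seed=0):
    """Randomly draw unordered protein pairs that are not known positives."""
    rng = random.Random(seed)
    positives = {frozenset(pair) for pair in positive_pairs}
    n = len(proteins)
    # Conservative upper bound on how many distinct negatives exist.
    if n_samples > n * (n - 1) // 2 - len(positives):
        raise ValueError("not enough non-interacting pairs to sample from")
    negatives = set()
    while len(negatives) < n_samples:
        a, b = rng.sample(proteins, 2)      # two distinct proteins
        pair = frozenset((a, b))
        if pair not in positives:           # never overlap with a positive
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


def split_8_1_1(items, seed=0):
    """Shuffle and split items into train/validation/test with ratio 8:1:1."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


# Toy example with made-up identifiers; negatives are matched 1:1 to positives.
proteins = ["P1", "P2", "P3", "P4", "P5", "P6"]
positives = [("P1", "P2"), ("P3", "P4")]
negatives = sample_negative_pairs(proteins, positives, n_samples=len(positives))
train, val, test = split_8_1_1(range(10))
```

Storing pairs as `frozenset` makes the overlap check order-independent, so (P1, P2) and (P2, P1) count as the same interaction.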
The three-layer GCN conducts three propagation steps during the forward pass, effectively convolving up to the 3rd-order neighborhood information for each node. The rationale for choosing a three-layer architecture is based on our experiments, which demonstrated that the three-layer model provided the best performance in terms of both predictive accuracy and generalization. We experimented with one-, two-, and four-layer GCNs and found that the three-layer model outperformed the others. Adding more layers did not significantly improve performance and, in some cases, led to overfitting. Conversely, using fewer layers reduced the model’s ability to capture complex interactions within the network.
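As a concrete illustration of the propagation rule, the following NumPy sketch runs a three-layer forward pass with X set to the identity matrix. It is not the trained model: the weights are random and untrained, the layer widths (8, 8, 3) and the four-node toy graph are arbitrary choices, and leaving the final layer without an activation is an assumption made here.

```python
import numpy as np


def normalize_adjacency(A):
    """Return D_hat^{-1/2} (A + I) D_hat^{-1/2}, the symmetric normalization."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1)                 # D_hat_ii = sum_j A_hat_ij
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]


def gcn_forward(A, weights):
    """Forward pass H^{(l+1)} = ReLU(A_norm H^{(l)} W^{(l)}) with H^{(0)} = I."""
    A_norm = normalize_adjacency(A)
    H = np.eye(A.shape[0])                  # identity features: no node attributes
    for layer, W in enumerate(weights):
        H = A_norm @ H @ W
        if layer < len(weights) - 1:        # ReLU on hidden layers only (assumption)
            H = np.maximum(H, 0.0)
    return H


# Toy 4-node path graph 0-1-2-3 (unweighted, undirected).
A = np.array(
    [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=float,
)
rng = np.random.default_rng(0)
dims = [A.shape[0], 8, 8, 3]                # hypothetical layer widths
weights = [rng.normal(scale=0.1, size=(dims[i], dims[i + 1])) for i in range(3)]

A_norm = normalize_adjacency(A)
Z = gcn_forward(A, weights)                 # one 3-dimensional embedding per node
```

Because each layer multiplies by the normalized adjacency once, three layers let information from nodes up to three hops away reach each embedding, matching the 3rd-order neighborhood described above.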

Wasserstein distance between probability distribution

The intuition behind this metric comes from the optimal transport problem, a classical mathematical problem introduced by the French mathematician Gaspard Monge in 1781 and later formalized in a relaxed form by L. Kantorovich in 1942. For a distribution of mass $μ_{0} (x)$ on a space X, the problem is to transport the mass into the distribution $μ_{1} (x)$ on the same space X with minimum cost, given a cost function $c : X \times X \to [0, \infty]$ . The problem is well posed only if the created pile has the same mass as the pile to be moved. Therefore, without loss of generality, we can assume that $μ_{0}$ and $μ_{1}$ are probability distributions with a total mass of 1. Given a transport plan $λ (x, y)$ , which gives the amount of mass to move from x to y, the task can be pictured as moving a ‘pile of earth’ of shape $μ_{0}$ into a hole in the ground of shape $μ_{1}$ in such a way that both the pile of earth and the hole in the ground completely vanish. The concept can be formally defined as follows:

Given a metric space $M$ , for p ≥ 1, the Wasserstein space $P_{p} (M)$ is defined as the collection of all probability measures μ on $M$ with finite pth moment, that is, those for which there exists some x₀ in $M$ such that:

\int_{M} d (x, x_{0})^{p} d μ (x) < \infty,

(1)

where d(.,.) is the distance on $M$ , here the Euclidean distance. The p–Wasserstein distance W_p between two probability measures $μ_{0}$ and $μ_{1}$ in $P_{p} (M)$ is defined as

W_{p} (μ_{0}, μ_{1}) = \left( \inf_{λ \in π (μ_{0}, μ_{1})} \int_{M \times M} d (x, y)^{p} \, d λ (x, y) \right)^{1 / p}

(2)

where $π (μ_{0}, μ_{1})$ is the set of probability distributions λ on $M \times M$ whose marginals are $μ_{0}$ and $μ_{1}$ . A distribution λ that attains the infimum is known as an optimal transport plan between $μ_{0}$ and $μ_{1}$ . It distributes all the mass of the distribution $μ_{0}$ onto the distribution $μ_{1}$ with minimal cost. The quantity $W_{p} (μ_{0}, μ_{1})$ is the p-th root of the corresponding total cost.

Here, the combination of the embeddings/features coming from three different networks can be represented as a multivariate probability distribution for a protein node (see workflow). Assuming two protein nodes p1 and p2, with their probability distribution $μ_{p 1}$ and $μ_{p 2}$ , the Wasserstein distance $W_{p} (μ_{p 1}, μ_{p 2})$ is calculated. The distance matrix for n proteins is then clustered into 10 groups using hierarchical clustering.
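To make Eq. (2) concrete, the sketch below computes the p-Wasserstein distance in the simplest setting: two one-dimensional empirical distributions with the same number of equally weighted points, where the optimal plan matches sorted values. Treating each protein's combined embedding as such a one-dimensional sample is an assumption made here for illustration only; the text does not specify how the multivariate distributions are constructed, and the protein names and values are made up.

```python
def wasserstein_1d(xs, ys, p=1):
    """p-Wasserstein distance between two equal-size, uniformly weighted
    1-D samples: the optimal plan pairs the i-th smallest values."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    cost = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
    return cost ** (1.0 / p)


def distance_matrix(samples, p=1):
    """Pairwise Wasserstein distances for a dict {protein: sample}."""
    names = list(samples)
    return {
        (a, b): wasserstein_1d(samples[a], samples[b], p)
        for a in names
        for b in names
    }


# Toy embeddings (made-up values): three features per protein, one per view.
samples = {
    "prot_a": [0.0, 1.0, 3.0],
    "prot_b": [5.0, 6.0, 8.0],   # prot_a shifted by 5
    "prot_c": [3.0, 0.0, 1.0],   # same values as prot_a, different order
}
D = distance_matrix(samples, p=1)
```

The resulting symmetric matrix plays the role of the distance matrix that is passed to hierarchical clustering; for general multivariate distributions an optimal transport solver would be needed instead of the sorting shortcut.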

Algorithm 1 Interaction prediction between SARS-CoV-2 and Human protein

Input: Viral-human protein interaction data, human protein-protein interaction (PPI) network, protein sequence data, and Gene Ontology annotations.

Output: Predicted interactions between SARS-CoV-2 proteins and human proteins.

Data Preprocessing

– Collect SARS-CoV-2-human interaction data, human PPI network, protein sequences, and gene ontology annotations.

– Clean the data by removing duplicates and irrelevant interactions, normalizing sequence data, and filtering low-confidence interactions.

Feature Extraction

– Compute protein sequence similarity using the protr package with Needleman-Wunsch global alignment: parSeqSim(protlist, cores = 2, type = "global", submat = "BLOSUM62").

– Calculate gene ontology-based functional similarity using the R GOSemSim package.

– Apply Graph Convolutional Networks (GCNs) to extract low-dimensional embeddings from the PPI network.

Integration of Multi-view Features

– Integrate sequence similarity, gene ontology-based functional similarity, and PPI network embeddings into a unified feature representation.

– Compute Wasserstein distance between protein pairs to assess overall similarity.

Clustering and Prediction

– Perform hierarchical clustering on the Wasserstein distance matrix to group similar proteins.

– Predict potential SARS-CoV-2-human interactions by identifying non-target proteins that cluster with known CoV-host proteins.
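The final prediction step of Algorithm 1 can be sketched as a lookup over cluster labels. The rule used here, in which a non-CoV-host protein inherits the viral partners of every known CoV-host protein in its cluster, is one plausible reading of the step and is an assumption, as are the function name `predict_from_clusters` and all toy identifiers.

```python
from collections import defaultdict


def predict_from_clusters(cluster_of, known_hosts):
    """Propose (viral protein, human protein) pairs for non-CoV-host proteins
    that share a cluster with known CoV-host proteins.

    cluster_of  : dict mapping human protein -> cluster label
    known_hosts : dict mapping known CoV-host protein -> set of viral partners
    """
    viral_by_cluster = defaultdict(set)
    for host, viral_partners in known_hosts.items():
        if host in cluster_of:
            viral_by_cluster[cluster_of[host]] |= set(viral_partners)

    predictions = set()
    for protein, cluster_id in cluster_of.items():
        if protein in known_hosts:          # already a verified CoV-host
            continue
        for viral in viral_by_cluster.get(cluster_id, ()):
            predictions.add((viral, protein))
    return sorted(predictions)


# Toy cluster labels and known interactions (all identifiers are made up).
cluster_of = {"H1": 0, "H2": 0, "C1": 0, "H3": 1, "C2": 1, "C3": 2}
known_hosts = {"H1": {"nsp1"}, "H2": {"orf8"}, "H3": {"N"}}
predicted_pairs = predict_from_clusters(cluster_of, known_hosts)
```

In this toy input, `C3` sits in a cluster without any known CoV-host protein and therefore receives no predicted interaction.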

Barycenter in Wasserstein space

Wasserstein distances have several interesting properties [47–49]. The barycenter defined in Wasserstein space extends the applicability of the Wasserstein metric. In statistics and machine learning, sometimes it is required to aggregate distinct but similar collections of information usually represented as probability distributions. Given a metric defining the distance between distributions, the aggregation strategies often compute the barycenter of the input distributions that minimize the sum of the distances to the individual input distributions. Considering the Wasserstein metric as the distance metric between distributions, the corresponding barycenter is called the Wasserstein barycenter [50].

A Wasserstein barycenter [50, 51] of N measures $ν_{1}, \dots, ν_{N}$ in $ℙ \subset P (M)$ is defined as a minimizer of the function f over $ℙ$ , where

f (μ) = \frac{1}{N} \sum_{i = 1}^{N} W_{p}^{p} (ν_{i}, μ) .

(3)

We have utilized a fast algorithm proposed in [50] to compute the Wasserstein barycenter of each cluster obtained from the hierarchical clustering of the Wasserstein distance matrix. In [50], the sum of optimal transport distances is minimized using a gradient descent method. These gradients are computed using matrix scaling algorithms at a considerably lower computational cost.
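The barycenter objective in Eq. (3) can be illustrated in a special case with a closed-form answer. For p = 2 and one-dimensional empirical measures with the same number of equally weighted points, the barycenter is obtained by averaging the sorted samples. The sketch below uses this shortcut instead of the gradient-based algorithm of [50], so it is a simplified stand-in, and the toy values are made up.

```python
def w2_squared(xs, ys):
    """Squared 2-Wasserstein distance between equal-size 1-D samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    return sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def barycenter_1d(samples):
    """W2 barycenter of equal-size 1-D samples: average the order statistics."""
    size = len(samples[0])
    if any(len(s) != size for s in samples):
        raise ValueError("all samples must have the same size")
    columns = zip(*(sorted(s) for s in samples))
    return [sum(column) / len(samples) for column in columns]


def objective(mu, samples):
    """f(mu) = (1/N) * sum_i W_2^2(nu_i, mu), as in Eq. (3) with p = 2."""
    return sum(w2_squared(nu, mu) for nu in samples) / len(samples)


# Toy cluster of two distributions (made-up values, given unsorted on purpose).
cluster = [[2.0, 0.0], [6.0, 4.0]]
center = barycenter_1d(cluster)
```

Evaluating `objective` at `center` and at either input distribution shows that the barycenter attains a lower average squared distance, which is the defining property of the minimizer of f.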

Computing similarity between proteins

Gene ontology-based semantic similarity.

Gene Ontology-based semantic similarity (SS) [52] allows the comparison of GO terms or entities annotated with GO terms. The number and diversity of SS measures based on GO have grown considerably, and their applications range from functional coherence evaluation to protein interaction prediction and disease gene prioritization. In the context of Gene Ontology, SS measures can be employed to compute the similarity between two gene products, each annotated with a set of GO terms.

For our study, we calculate the semantic similarity based on Gene Ontology for all the CoV-host and non-CoV-host proteins utilized. We employ a hybrid semantic similarity metric introduced by Wang et al. [53]. To perform these calculations, we utilize the R Bioconductor package GOSemSim [54] to gauge the semantic similarity among the proteins within the network.
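For intuition, the term-level similarity of Wang et al. can be reproduced on a toy ontology. In that method each ancestor t of a term A receives a semantic contribution S_A(t), with S_A(A) = 1 and contributions decaying along edges by 0.8 for "is_a" and 0.6 for "part_of"; two terms are compared through their shared ancestors. The sketch below is not GOSemSim and does not combine term scores into protein-level scores; the four-term DAG and its identifiers are hypothetical.

```python
EDGE_WEIGHT = {"is_a": 0.8, "part_of": 0.6}   # edge weights from Wang et al.


def s_values(term, parents):
    """Semantic contribution S_term(t) of the term itself and every ancestor t."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        t = stack.pop()
        for parent, relation in parents.get(t, ()):
            candidate = EDGE_WEIGHT[relation] * s[t]
            if candidate > s.get(parent, 0.0):   # keep the best path to the parent
                s[parent] = candidate
                stack.append(parent)
    return s


def wang_similarity(a, b, parents):
    """Shared semantic contributions divided by the total semantic values."""
    sa, sb = s_values(a, parents), s_values(b, parents)
    shared = sa.keys() & sb.keys()
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))


# Toy DAG: child -> list of (parent, relation); identifiers are made up.
parents = {
    "T_B": [("T_ROOT", "is_a")],
    "T_C": [("T_ROOT", "is_a")],
    "T_D": [("T_B", "is_a"), ("T_C", "part_of")],
}
sim_db = wang_similarity("T_D", "T_B", parents)
sim_dd = wang_similarity("T_D", "T_D", parents)
```

A term compared with itself scores 1, and the score decreases as fewer or more distant ancestors are shared.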

Protein sequence similarity.

Searching for similar protein sequences is one of the first and most informative steps in any genomics analysis. We used the R package protr to compute the sequence similarity between the amino acid sequences of two proteins. Its function parSeqSim() takes a list of protein sequences and calculates the similarity of each pair of proteins in parallel. Specifically, we called parSeqSim() with the type parameter set to global, which performs Needleman-Wunsch global alignment. The Bioconductor database/package EnsDb is used to fetch the amino acid sequence of each protein. BLOSUM62 is used to score the alignment between evolutionarily divergent protein sequences. The computation was carried out on a 48-core server machine with 500 GB of RAM.
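The dynamic program behind Needleman-Wunsch global alignment can be written compactly. The sketch below returns only the optimal alignment score and uses a simple match/mismatch scheme with a linear gap penalty; these scoring values are placeholders chosen for illustration and are not BLOSUM62 or the settings used by parSeqSim().

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of sequences a and b.

    Uses two rows of the DP table, so memory is O(len(b)) while the
    running time stays O(len(a) * len(b)).
    """
    prev = [j * gap for j in range(len(b) + 1)]       # aligning "" with b[:j]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                              # aligning a[:i] with ""
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(
                max(
                    prev[j - 1] + substitution,       # align a[i-1] with b[j-1]
                    prev[j] + gap,                    # gap in b
                    curr[j - 1] + gap,                # gap in a
                )
            )
        prev = curr
    return prev[-1]


# Toy sequences (not real proteins).
identical = needleman_wunsch_score("GATTACA", "GATTACA")
one_gap = needleman_wunsch_score("AC", "AGC")
```

Filling the table costs time proportional to the product of the two sequence lengths, which is the per-pair cost behind the sequence-similarity term in the time complexity analysis.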

Time complexity analysis

The time complexity of the proposed algorithm can be determined by the following key components:

Sequence Similarity Computation: Using Needleman-Wunsch global alignment, this step has a complexity of $O (n^{2} \times m^{2})$ , where n is the number of proteins and m is the average length of the protein sequences.
Gene Ontology-Based Functional Similarity: The complexity is $O (n^{2} \times g^{2} \times t)$ . This arises from the need to compute similarity for all pairs of proteins, where n represents the number of proteins. For each pair, the comparison involves $g^{2}$ operations, with g being the average number of Gene Ontology (GO) terms associated with each protein. The term t reflects the complexity of comparing two individual GO terms. Therefore, the overall complexity accounts for the pairwise protein comparisons, the number of GO term comparisons per protein pair, and the time required for each GO term comparison.
Graph Convolutional Network (GCN) Embedding: The GCN embedding has a complexity of $O (L \times n \times e)$ , where L is the number of layers in the GCN, and e is the number of edges in the protein-protein interaction network.
Wasserstein Distance Computation: The complexity of computing the Wasserstein distance for all protein pairs is $O (n^{2} \times d^{3})$ , where d is the dimension of the feature space. This arises because the computation involves pairwise comparisons between every pair of proteins, where n is the total number of proteins, leading to $n^{2}$ comparisons. Additionally, each protein is represented as a distribution in a feature space of dimension d, and calculating the Wasserstein distance between two such distributions typically involves solving an optimization problem with a complexity that depends on $d^{3}$ . Therefore, the overall time complexity for computing the Wasserstein distance between all protein pairs is $O (n^{2} \times d^{3})$ .
Hierarchical Clustering: The hierarchical clustering step has a worst-case complexity of $O (n^{3})$ .

Thus, the overall time complexity is dominated by the sequence similarity computation and hierarchical clustering, resulting in $O (n^{2} \times m^{2} + n^{3})$ .

Discussions

In this work, we have produced a list of human proteins that could be considered host factors for SARS-CoV-2. Additionally, we have highlighted the interactions between these proteins. As a first novelty, we integrated recently published SARS-CoV-2 interaction data into the human interactome to compile an encompassing network that places SARS-CoV-2 proteins, experimentally verified CoV-host proteins, and other non-CoV-host proteins within the interactome in a comprehensive context. Further, three separate networks of the same size were created from the integrated network by considering gene ontology-based functional similarity, protein sequence similarity, and interaction information among the host proteins. To exploit these three sources of interaction information, we used a deep learning methodology designed to learn from network data, which constitutes another novelty. We combined the embeddings to obtain a three-dimensional representation of each protein in order to compare CoV-host and non-CoV-host proteins. As a further novelty, we used the Wasserstein metric to compute the distance between proteins, which integrates the three biological measures within each protein cluster. Our experimental results confirm that the proteins we predicted overlap with host factors associated with other viruses. Furthermore, these proteins have already been considered potential targets for host-directed therapy (HDT) in the context of other viral diseases.

Two novel SARS-CoV-2-human protein interaction resources were published on April 30, 2020 [7, 8], which open up many possibilities for studying the infection mechanism of SARS-CoV-2 in the human host cell. Various experimental and computational approaches to predicting interactions between SARS-CoV-2 and human proteins have now become conceivable. To the best of our knowledge, this is the first deep learning-based systematic approach that also uses statistical methods for predicting host factors of SARS-CoV-2. We also constructed three different views of the integrated human-SARS-CoV-2 network that reflect the latest state of the art, and used a statistical distance metric that integrates all the views to obtain the final results.

In our experiments, we focused on predicting links between SARS-CoV2 and human proteins that in turn are known to interact with CoV-host proteins (SARS-CoV-2 associated host proteins). We have decidedly put the focus on those proteins which would have experimentally validated interactions with other viruses. These proteins already serve the purposes of host-directed therapy (HDT) options for other viruses, thereby also being potential candidates for building HDT strategy against SARS-CoV2. Host-directed therapy (HDT) approaches have demonstrated increased resilience in dealing with viral mutations that enable the virus to evade therapeutic interventions. It’s important to note that HDT strategies are especially well-suited for drug repurposing endeavors, as repurposed drugs have already shown a track record of minimal adverse effects, either due to their existing use or successful progression through preclinical trials. In this connection, the list of drugs which we suggest to target the predicted proteins may hold strong promise for yielding repurposable drugs to use against COVID-19.

We further identified a list of predicted proteins that are associated with more than three viruses and have a strong connection with several drugs that can be used against viral infections and infectious diseases. Additionally, we identified and highlighted several drugs that target host proteins that the virus needs to enter and subsequently hijack human cells. One such example is pomalidomide, an inhibitor of angiogenesis and of tumor necrosis factor production, which has recently gained attention as a repurposable drug for use against COVID-19 [34]. Several other drugs, such as clenbuterol, epinephrine, pirfenidone, chloroquine, pranlukast, and lenalidomide, have also been identified as repurposable drugs by several studies, and all are connected with the predicted protein TNF. Thus, TNF may be treated as a crucial host factor for SARS-CoV-2. Similarly, other predicted proteins such as ABCB1, MTR, POLE, and PSMB2 have verified connections with several antibiotics (erythromycin-estolate, roxithromycin), drugs used to treat lymphoblastic leukemia (cytarabine), and some kinase inhibitors (nilotinib, imatinib, dasatinib) that have potential connections with antiviral therapy.

To further enhance the performance of the current model, future work could explore the integration of advanced graph learning models, such as those proposed in recent studies on graph-based multi-view data fusion for biomedical applications [55]. These models could potentially improve prediction accuracy by capturing more complex dependencies in multi-view data. Additionally, the generalizability of our model could be extended to other important applications, such as predicting m6A modification sites [56] and drug repurposing [57], where understanding protein interactions is crucial.

Despite these promising results, several critical limitations must be acknowledged. First, our predictions depend heavily on the quality and completeness of existing protein-protein interaction databases. Any biases, inaccuracies, or incompleteness within these resources can influence prediction reliability. Secondly, although our integrated computational framework offers compelling evidence of biological plausibility, the absence of direct experimental validation of newly predicted interactions remains a significant limitation. Thus, these predictions require further empirical verification through laboratory-based assays or clinical studies. Thirdly, the human interactome itself may not be fully captured by current databases, and as such, our analysis potentially overlooks host factors not yet documented or characterized experimentally.

Addressing these limitations motivates essential future work. Subsequent studies should prioritize experimental validation of our predicted interactions, particularly those linked to promising drug repurposing candidates, using rigorous methods such as CRISPR-based genetic screens to confirm host factor essentiality. Additionally, expanding the host interactome coverage by integrating emerging biological databases, incorporating high-throughput experimental results, and leveraging novel network-based algorithms could further enhance model robustness and predictive accuracy. Employing cross-species PPI transfer learning methods could also provide valuable insights into the evolutionary conservation of these interactions, supporting broader generalization and validation of our computational predictions.

We also acknowledge that our current approach to identifying repurposable drugs primarily relies on network-based and knowledge-driven associations from existing databases, without direct quantitative evaluation of drug–protein interactions. As recommended, future studies should incorporate molecular docking and free energy calculations to quantitatively validate drug–protein binding affinities, providing robust biophysical evidence to strengthen confidence in the predicted therapeutic candidates. Such computational validation would be aligned with current best practices in drug repurposing studies.

In summary, we have compiled a list of human proteins, which can be treated as interacting proteins and potential host factors for the SARS-CoV-2 virus and highlighted some drugs that are of great potential in the fight against the COVID-19 pandemic, where therapy options are urgently needed. Our list of predictions suggests both options that had been identified previously for HDT therapy of other viruses and new opportunities that had not been pointed out earlier for the SARS-CoV-2 virus. The latter class of predictions may offer valuable chances for pursuing new therapeutic strategies against COVID-19.

Supporting information

S1 Table. A total of 472 interactions between SARS-CoV-2 proteins and human proteins by setting the threshold value of 0.5.

(CSV; pone.0332794.s001.csv, 15.5 KB)

S2 Table. Results of Gene Ontology (GO) enrichment analysis on the top 100 predicted human proteins.

(PDF; pone.0332794.s002.pdf, 72.9 KB)

S1 Fig. Heatmaps of Wasserstein distance for all SARS-CoV-2 proteins.

The rows of the sub-matrices represent the CoV-host proteins interacting with a particular SARS-CoV-2 protein, and the columns represent the non-CoV-host proteins inferred by those CoV-hosts from the predicted list.

(PDF; pone.0332794.s003.pdf, 215.2 KB)

Data Availability

All relevant data are within the manuscript and its Supporting information files.

Funding Statement

The author(s) received no specific funding for this work.

References

1. Wu F, Zhao S, Yu B, Chen Y-M, Wang W, Song Z-G, et al. Author Correction: a new coronavirus associated with human respiratory disease in China. Nature. 2020;580(7803):E7. doi: 10.1038/s41586-020-2202-3
2. Kaufmann SHE, Dorhoi A, Hotchkiss RS, Bartenschlager R. Host-directed therapies for bacterial and viral infections. Nat Rev Drug Discov. 2018;17(1):35–56. doi: 10.1038/nrd.2017.162
3. Ackerman EE, Kawakami E, Katoh M, Watanabe T, Watanabe S, Tomita Y, et al. Network-guided discovery of influenza virus replication host factors. mBio. 2018;9(6):e02002–18. doi: 10.1128/mBio.02002-18
4. Yamin R, Ahmad I, Khalid H, Perveen A, Abbasi SW, Nishan U, et al. Identifying plant-derived antiviral alkaloids as dual inhibitors of SARS-CoV-2 main protease and spike glycoprotein through computational screening. Front Pharmacol. 2024;15:1369659. doi: 10.3389/fphar.2024.1369659
5. Khalid H, Ahmad I, Sarfraz A, Iqbal A, Nishan U, Dib H, et al. Screening Asian medicinal plants for SARS-CoV-2 inhibitors: a computational approach. Chem Biodivers. 2025;22(5):e202402548. doi: 10.1002/cbdv.202402548
6. Shah M, Yamin R, Ahmad I, Wu G, Jahangir Z, Shamim A, et al. In-silico evaluation of natural alkaloids against the main protease and spike glycoprotein as potential therapeutic agents for SARS-CoV-2. PLoS One. 2024;19(1):e0294769. doi: 10.1371/journal.pone.0294769
7. Dick K, Biggar KK, Green JR. Comprehensive prediction of the SARS-CoV-2 vs. human interactome using PIPE4, SPRINT, and PIPE-Sites. Scholars Portal Dataverse. 2020. doi: 10.5683/SP2/JZ77XA
8. Gordon DE, Jang GM, Bouhaddou M, Xu J, Obernier K, White KM, et al. A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature. 2020;583(7816):459–68. doi: 10.1038/s41586-020-2286-9
9. Ray S, Lall S, Mukhopadhyay A, Bandyopadhyay S, Schönhuth A. Deep variational graph autoencoders for novel host-directed therapy options against COVID-19. Artif Intell Med. 2022;134:102418. doi: 10.1016/j.artmed.2022.102418
10. Morselli Gysi D, do Valle Í, Zitnik M, Ameli A, Gan X, Varol O, et al. Network medicine framework for identifying drug-repurposing opportunities for COVID-19. Proc Natl Acad Sci U S A. 2021;118(19):e2025581118. doi: 10.1073/pnas.2025581118
11. Sadegh S, Matschinske J, Blumenthal DB, Galindez G, Kacprowski T, List M, et al. Exploring the SARS-CoV-2 virus-host-drug interactome for drug repurposing. Nat Commun. 2020;11(1):3518. doi: 10.1038/s41467-020-17189-2
12. Khorsand B, Savadi A, Naghibzadeh M. SARS-CoV-2-human protein-protein interaction network. Inform Med Unlocked. 2020;20:100413. doi: 10.1016/j.imu.2020.100413
13. Dick K, Chopra A, Biggar KK, Green JR. Multi-schema computational prediction of the comprehensive SARS-CoV-2 vs. human interactome. PeerJ. 2021;9:e11117. doi: 10.7717/peerj.11117
14. Ray S, Lall S, Bandyopadhyay S. A deep integrated framework for predicting SARS-CoV2–human protein-protein interaction. IEEE Trans Emerg Top Comput Intell. 2022;6(6):1463–72. doi: 10.1109/tetci.2022.3182354
15. Dey L, Chakraborty S, Mukhopadhyay A. Machine learning techniques for sequence-based prediction of viral-host interactions between SARS-CoV-2 and human proteins. Biomed J. 2020;43(5):438–50. doi: 10.1016/j.bj.2020.08.003
16. Ghosh M, Sil P, Roy A, Fajriyah R, Mondal KC. Finding prediction of interaction between SARS-CoV-2 and human protein: a data-driven approach. J Inst Eng India Ser B. 2021;102(6):1293–302. doi: 10.1007/s40031-021-00569-7
17. Khorsand B, Savadi A, Naghibzadeh M. Comprehensive host-pathogen protein-protein interaction network analysis. BMC Bioinformatics. 2020;21(1):400. doi: 10.1186/s12859-020-03706-z
18. Ray S, Alberuni S, Maulik U. Computational prediction of HCV-human protein-protein interaction via topological analysis of HCV infected PPI modules. IEEE Trans Nanobioscience. 2018;17(1):55–61. doi: 10.1109/TNB.2018.2797696
19. Liu-Wei W, Kafkas Ş, Chen J, Dimonaco N, Tegnér J, Hoehndorf R. DeepViral: infectious disease phenotypes improve prediction of novel virus–host interactions. Cold Spring Harbor Laboratory; 2020. doi: 10.1101/2020.04.22.055095
20. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint. 2016. doi: 10.48550/arXiv.1609.02907
21. Peyré G, Cuturi M. Computational optimal transport: with applications to data science. Found Trend Mach Learn. 2019;11(5–6):355–607. doi: 10.1561/2200000073
22. Nielsen F. Hierarchical Clustering. Undergraduate Topics in Computer Science. Springer International Publishing; 2016. p. 195–211. doi: 10.1007/978-3-319-21903-5_8
23. Wang H, Liang Q, Hancock JT, Khoshgoftaar TM. Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods. J Big Data. 2024;11(1). doi: 10.1186/s40537-024-00905-w
24. Shang J, Ye G, Shi K, Wan Y, Luo C, Aihara H, et al. Structural basis of receptor recognition by SARS-CoV-2. Nature. 2020;581(7807):221–4. doi: 10.1038/s41586-020-2179-y
25. Ichida K, Kamatani N, Nishino T, Saji M, Okabe H, Hosoya T. Mutations in xanthine dehydrogenase gene in subjects with hereditary xanthinuria. Adv Exp Med Biol. 1998;431:327–30. doi: 10.1007/978-1-4615-5381-6_65
26. Holcomb D, Alexaki A, Hernandez N, Laurie K, Kames J, Hamasaki-Katagiri N, et al. Potential impact on coagulopathy of gene variants of coagulation related proteins that interact with SARS-CoV-2. bioRxiv. 2020. doi: 10.1101/2020.09.08.272328
27. Dey L, Mukhopadhyay A. DenvInt: a database of protein–protein interactions between dengue virus and its hosts. PLoS Negl Trop Dis. 2017;11(10). doi: 10.1371/journal.pntd.0005879
28. Ako-Adjei D, Fu W, Wallin C, Katz KS, Song G, Darji D, et al. HIV-1, human interaction database: current status and new features. Nucleic Acids Res. 2015;43(D1):D566–70. doi: 10.1093/nar/gku1126
29. Kwofie SK, Schaefer U, Sundararajan VS, Bajic VB, Christoffels A. HCVpro: hepatitis C virus protein interaction database. Infect Genet Evol. 2011;11(8):1971–7. doi: 10.1016/j.meegid.2011.09.001
30. Zhou X, Park B, Choi D, Han K. A generalized approach to predicting protein-protein interactions between virus and host. BMC Genomics. 2018;19(Suppl 6):568. doi: 10.1186/s12864-018-4924-2
31. Gurumayum S, Brahma R, Naorem LD, Muthaiyan M, Gopal J, Venkatesan A. ZikaBase: an integrated ZIKV- human interactome map database. Virology. 2018;514:203–10. doi: 10.1016/j.virol.2017.11.007
32. Shapira SD, Gat-Viks I, Shum BOV, Dricot A, de Grace MM, Wu L, et al. A physical and regulatory map of host-influenza interactions reveals pathways in H1N1 infection. Cell. 2009;139(7):1255–67. doi: 10.1016/j.cell.2009.12.018
33. Subramanian A, Narayan R, Corsello SM, Peck DD, Natoli TE, Lu X, et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell. 2017;171(6):1437–1452.e17. doi: 10.1016/j.cell.2017.10.049
34. Elzupir AO. Inhibition of SARS-CoV-2 main protease 3CLpro by means of α-ketoamide and pyridone-containing pharmaceuticals using in silico molecular docking. J Mol Struct. 2020;1222:128878. doi: 10.1016/j.molstruc.2020.128878
35. Seifirad S. Pirfenidone: a novel hypothetical treatment for COVID-19. Med Hypotheses. 2020;144:110005. doi: 10.1016/j.mehy.2020.110005
36. Derakhshan M, Ansarian HR, Ghomshei M. Possible effect of epinephrine in minimizing COVID-19 severity: a review. J Int Med Res. 2020;48(9). doi: 10.1177/0300060520958594
37. Weisberg E, Parent A, Yang PL, Sattler M, Liu Q, Liu Q, et al. Repurposing of kinase inhibitors for treatment of COVID-19. Pharm Res. 2020;37(9):167. doi: 10.1007/s11095-020-02851-7
38. Rual J-F, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature. 2005;437(7062):1173–8. doi: 10.1038/nature04209
39. Rolland T, Taşan M, Charloteaux B, Pevzner SJ, Zhong Q, Sahni N, et al. A proteome-scale map of the human interactome network. Cell. 2014;159(5):1212–26. doi: 10.1016/j.cell.2014.10.050
40. Luck K, Kim D-K, Lambourne L, Spirohn K, Begg BE, Bian W, et al. A reference map of the human binary protein interactome. Nature. 2020;580(7803):402–8. doi: 10.1038/s41586-020-2188-x
41. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, et al. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 2003;13(10):2363–71. doi: 10.1101/gr.1680803
42. Jeronimo C, Forget D, Bouchard A, Li Q, Chua G, Poitras C, et al. Systematic analysis of the protein interaction network for the human transcription machinery reveals the identity of the 7SK capping enzyme. Mol Cell. 2007;27(2):262–74. doi: 10.1016/j.molcel.2007.06.027
43. Hosur R, Peng J, Vinayagam A, Stelzl U, Xu J, Perrimon N, et al. A computational framework for boosting confidence in high-throughput protein-protein interaction datasets. Genome Biol. 2012;13(8):R76. doi: 10.1186/gb-2012-13-8-r76
44. Ye W, Li C, Zhang W, Li J, Liu L, Cheng D, et al. Predicting drug-target interactions by measuring confidence with consistent causal neighborhood interventions. Methods. 2024;231:15–25. doi: 10.1016/j.ymeth.2024.08.009
45. Yu H, Tardivo L, Tam S, Weiner E, Gebreab F, Fan C, et al. Next-generation sequencing to generate interactome datasets. Nat Methods. 2011;8(6):478–80. doi: 10.1038/nmeth.1597
46. Duvenaud D, Maclaurin D, Iparraguirre J, Bombarell R, Hirzel T, Aspuru-Guzik A, et al. Convolutional networks on graphs for learning molecular fingerprints. In: Advances in neural information processing systems. 2015. p. 2224–32. doi: 10.48550/arXiv.1509.09292
47. Villani C. Optimal Transport: old and new. Springer Berlin Heidelberg. 2009. doi: 10.1007/978-3-540-71050-9
48. Pele O, Werman M. Fast and robust Earth Mover’s Distances. In: 2009 IEEE 12th International Conference on Computer Vision. 2009. p. 460–7. doi: 10.1109/iccv.2009.5459199
49. Rubner Y, Guibas L, Tomasi C. The earth mover’s distance, multi-dimensional scaling, and color-based image retrieval. In: Proceedings of the ARPA image understanding workshop. 1997. p. 668.
50. Cuturi M, Doucet A. Fast computation of Wasserstein barycenters. In: International conference on machine learning, 2014. p. 685–93. doi: 10.48550/arXiv.1310.4375
51. Agueh M, Carlier G. Barycenters in the Wasserstein space. SIAM J Math Anal. 2011;43(2):904–24. doi: 10.1137/100805741
52. Dessimoz C, Škunca N. Semantic similarity in the gene ontology. The Gene Ontology Handbook. Springer New York; 2017. p. 161–73. doi: 10.1007/978-1-4939-3743-1
53. Wang JZ, Du Z, Payattakool R, Yu PS, Chen C-F. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23(10):1274–81. doi: 10.1093/bioinformatics/btm087
54. Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010;26(7):976–8. doi: 10.1093/bioinformatics/btq064
55. Yang Y, Su X, Zhao B, Li G, Hu P, Zhang J, et al. Fuzzy-based deep attributed graph clustering. IEEE Trans Fuzzy Syst. 2024;32(4):1951–64. doi: 10.1109/tfuzz.2023.3338565
56. Li G, Zhao B, Su X, Yang Y, Hu P, Zhou X, et al. Discovering consensus regions for interpretable identification of RNA N6-methyladenosine modification sites via graph contrastive clustering. IEEE J Biomed Health Inform. 2024;28(4):2362–72. doi: 10.1109/JBHI.2024.3357979
57. Zhao B-W, Su X-R, Hu P-W, Ma Y-P, Zhou X, Hu L. A geometric deep learning framework for drug repositioning over heterogeneous information networks. Brief Bioinform. 2022;23(6):bbac384. doi: 10.1093/bib/bbac384

PLoS One. doi: 10.1371/journal.pone.0332794.r001

Decision Letter 0

Chandrabose Selvaraj

19 May 2025

PONE-D-25-21081

A Graph Learning Framework for Comprehensive Prediction of SARS-CoV-2 and Human Protein Interactions from Multiview Protein Interaction Data

PLOS ONE

Dear Dr. Ray,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 03 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Chandrabose Selvaraj, Ph.D.

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

Additional Editor Comments:

In case of reviewers recommending the citations that are not directly pertinent to the scope or content of the manuscript, authors are encouraged to provide a reasoned justification for declining such suggestions. The editorial board affirms that the exclusion of non-essential references will not influence the editorial decision regarding the manuscript.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I have now completed the review of the manuscript titled “A Graph Learning Framework for Comprehensive Prediction of SARS-CoV-2 and Human Protein Interactions from Multiview Protein Interaction Data”. My specific comments are as follows:

1. The title is descriptive, but consider specifying the method (e.g., graph neural network-based) to clarify novelty.

2. The abstract needs clearer articulation of the main results and comparative performance metrics. Include key numerical outcomes (e.g., accuracy, precision) to support claims.

3. The introduction provides good background but lacks critical discussion of prior computational methods for SARS-CoV-2–human protein interactions. Consider referencing recent work such as https://doi.org/10.1371/journal.pone.0294769 ; https://doi.org/10.3389/fphar.2024.1369659 ; Virtual Screening and Molecular Docking of FDA Approved Antiviral Drugs for the Identification of Potential Inhibitors of SARS-CoV-2 RNA-MTase Protein; https://doi.org/10.1002/cbdv.202402548

4. Motivation for using multiview data and graph learning is justified but would benefit from a brief example of why single-view models may miss relevant interactions.

5. The integration of STRING, BioGRID, and sequence-based interactions is appreciated. However, the criteria for interaction confidence thresholding are not clearly explained. Were low-confidence interactions filtered? The authors can potentially benefit from https://doi.org/10.1021/acsomega.3c07866 ; https://doi.org/10.3389/fphar.2025.1509263 ; https://doi.org/10.1007/s10989-020-10076-w

6. The process for fusing multiview networks via Wasserstein distance is conceptually interesting, but the explanation is too terse. Please provide a schematic or algorithmic pseudocode to support reproducibility.

7. The model is claimed to use GCN and Wasserstein integration, but details on the number of layers, training epochs, loss function, and hyperparameter tuning are missing.

8. It is unclear how negative samples were generated or balanced. Were negative protein pairs randomly sampled or drawn from known non-interacting pairs?

9. Figure 2 shows model architecture but lacks annotations and descriptions. It is hard to interpret the role of each module in the workflow.

10. There is no table presenting the model’s quantitative performance (e.g., accuracy, F1-score, AUC) on benchmark datasets. This weakens the credibility of performance claims.

11. Add comparative results against baseline methods (e.g., logistic regression, DeepWalk, GCN without multiview fusion).

12. The identification of known host proteins (e.g., ACE2, TMPRSS2) adds confidence, but the analysis is anecdotal. Perform GO enrichment or pathway analysis on the predicted top 100 proteins to support biological relevance.

13. Drug repurposing suggestions (e.g., dexamethasone, baricitinib) are mentioned, but how were these inferred from the predicted PPIs? The method of linking predictions to drugs should be more systematically described.

14. The conclusion is too generic. Reflect more critically on the limitations, such as the dependence on existing PPI databases and lack of experimental validation.

15. Future directions could include validation using CRISPR screens or cross-species PPI transfer learning.

16. The manuscript needs a thorough language edit for grammar and clarity. Examples:

o Page 5: “This approach is very efficiently to...” → “This approach is efficient for...”

o Page 8: “The ROC curve indicate the...” → “The ROC curve indicates the...”

Reviewer #2: 1. Protein clusters in biological data can overlap, but the authors use hierarchical clustering, which detects only disjoint clusters, has a high computational time cost, and is an older method that could be error-prone. Proper justification is required.

2. The algorithm also has a high overall time complexity, so it would be slow on large datasets. This needs to be discussed further.

3. What is the significance of your own algorithm relative to similar algorithms that are already available?

A comparative analysis would improve the paper and strengthen the novelty of the work.

4. “These proteins already serve the purposes of host-directed therapy (HDT) options for other viruses, thereby also being potential candidates for building HDT strategy against SARS-CoV2.” The phrase “thereby also being potential candidates for building HDT” requires more explanation.

5. “In this connection, the list of drugs which we suggest to target the predicted proteins may hold strong promise for yielding repurposable drugs to use against COVID-19.” Stronger justification is required.

6. The authors should check drug–protein interactions and calculate the free energy before drawing such a conclusion.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2025 Sep 25;20(9):e0332794. doi: 10.1371/journal.pone.0332794.r002

Author response to Decision Letter 1

16 Aug 2025

Answer to comments of Reviewer 1

Comments:

1. The title is descriptive, but consider specifying the method (e.g., graph neural network-based) to clarify novelty.

Answer: Thank you for the valuable suggestion. We have now revised the title to explicitly highlight the method as follows:

"A Graph Neural Network-Based Approach for Predicting SARS-CoV-2–Human Protein Interactions from Multiview Data."

2. The abstract needs clearer articulation of the main results and comparative performance metrics. Include key numerical outcomes (e.g., accuracy, precision) to support claims.

Answer: We appreciate this comment. The abstract now explicitly states our main results, including quantitative performance metrics. Specifically, our GCN model achieved test ROC-AUCs of 86.00 (PPI), 83.00 (GO similarity), and 83.10 (sequence similarity) and average precision (AP) scores of 86.10, 82.12, and 82.30, respectively (see Table 1). In total, we predicted 472 high-confidence interactions between 280 host and 27 viral proteins, and our drug-target analysis found strong links to repurposable drugs (see Table 5). Please see the abstract of the revised version of the manuscript.

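3. The introduction provides good background but lacks critical discussion of prior computational methods for SARS-CoV-2–human protein interactions. Consider referencing recent work such as https://doi.org/10.1371/journal.pone.0294769 ; https://doi.org/10.3389/fphar.2024.1369659 ; Virtual Screening and Molecular Docking of FDA Approved Antiviral Drugs for the Identification of Potential Inhibitors of SARS-CoV-2 RNA-MTase Protein; https://doi.org/10.1002/cbdv.202402548

Answer: Thank you for highlighting this gap. We have expanded the introduction to critically discuss prior computational methods for SARS-CoV-2–human protein interaction prediction, including recent approaches that integrate virtual screening, molecular docking, and sequence-based predictors ([4, 5, 6, references in the manuscript]). The following recent works have also been cited and discussed as suggested: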

DOI: 10.1371/journal.pone.0294769

DOI: 10.3389/fphar.2024.1369659

DOI: 10.1002/cbdv.202402548

Please see the second-to-last paragraph of the Introduction (page 2) of the revised version of the manuscript.

4. Motivation for using multiview data and graph learning is justified but would benefit from a brief example of why single-view models may miss relevant interactions.

Answer: As suggested by the reviewer, we have now added an explicit example showing that single-view models, for instance those relying solely on sequence similarity, may overlook biologically relevant interactions identified through functional or network-based similarities. Please see the second paragraph of the Introduction (page 3, lines 65-77).

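5. The integration of STRING, BioGRID, and sequence-based interactions is appreciated. However, the criteria for interaction confidence thresholding are not clearly explained. Were low-confidence interactions filtered? The authors can potentially benefit from https://doi.org/10.1021/acsomega.3c07866 ; https://doi.org/10.3389/fphar.2025.1509263 ; https://doi.org/10.1007/s10989-020-10076-w

Answer: Thank you for this point. We now clarify in the Methods section that only experimentally validated, high-confidence interactions were included from both the CCSB and HPRD interactome resources, and that low-confidence interactions (e.g., those based on single sources or insufficient evidence) were filtered out during preprocessing. Interactions with ambiguous or duplicate entries were also removed to maximize data reliability. The additional references suggested ([DOI: 10.1021/acsomega.3c07866], [DOI: 10.3389/fphar.2025.1509263], [DOI: 10.1007/s10989-020-10076-w]) have been considered and cited. Please see the last paragraph of the subsection ‘The Human Protein Interactome’ of the section ‘Method’ in the revised version of the manuscript (pages 11-12, lines 423-431).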

6. The process for fusing multiview networks via Wasserstein distance is conceptually interesting, but the explanation is too terse. Please provide a schematic or algorithmic pseudocode to support reproducibility.

Answer: We thank the reviewer for this suggestion. The schematic in Figure 1 illustrates the entire workflow in detail, including how the Wasserstein distance is employed to fuse the multiview network data. This figure visually demonstrates the integration of features from the sequence similarity, gene ontology, and PPI networks using optimal transport. In addition, we have now added explicit algorithmic pseudocode (see "Algorithm 1: Interaction prediction between SARS-CoV-2 and Human protein," in the section ‘Materials and Methods’, page 14 of the revised version), which presents a clear and reproducible stepwise procedure.
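As a reminder of the quantity involved, the sketch below computes the 1-Wasserstein distance between two one-dimensional empirical samples of equal size, where it reduces to the mean absolute difference of the sorted values. It is only an illustration: treating the entries of two protein embedding vectors as 1D samples is an assumption made here, the function name is hypothetical, and the fusion procedure actually used is the one given in Algorithm 1 of the manuscript.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1D empirical samples.

    With uniform weights and equal sample sizes, the optimal transport plan
    matches the i-th smallest value of xs to the i-th smallest value of ys.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    matched = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in matched) / len(xs)


# Hypothetical embedding vectors for two proteins (illustrative values only).
embedding_a = [0.0, 1.0, 2.0]
embedding_b = [1.0, 2.0, 3.0]
distance = wasserstein_1d(embedding_a, embedding_b)
```

A pairwise matrix of such distances between CoV-host and non-CoV-host proteins would have the shape of the heatmaps described in S1 Fig, although the exact distance formulation used there is not restated in this letter.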

7. The model is claimed to use GCN and Wasserstein integration, but details on the number of layers, training epochs, loss function, and hyperparameter tuning are missing.

Answer: Detailed hyperparameters are now included. We used a three-layer GCN for each network (the depth was determined empirically), trained for 50 epochs with the Adam optimizer (learning rate 0.001, dropout rate 0.1, ReLU activation). Hyperparameter choices were validated on the validation and test portions of the 8:1:1 split, and additional ablation experiments (varying the number of layers) are now discussed in the ‘Materials and Methods’ section. Please see the subsection ‘Extracting node features using GCN’, page 13, lines 455-474, of the revised version of the manuscript.
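To make the three-layer architecture concrete, here is a minimal forward-pass sketch in plain Python using the standard GCN propagation rule of Kipf and Welling (reference 20), with ReLU between layers. The layer widths, the random weight initialisation, and all function names are assumptions for illustration; training with Adam, dropout, and the loss function are deliberately omitted, so this is not the authors' implementation.

```python
import math
import random


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a non-negative adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # >= 1 because of the self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def init_weights(dims, seed=0):
    """One weight matrix per consecutive pair of layer widths (assumed sizes)."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.5, 0.5) for _ in range(d_out)]
             for _ in range(d_in)]
            for d_in, d_out in zip(dims, dims[1:])]


def gcn_forward(adj, features, weights):
    """Apply H <- A_norm @ H @ W per layer, with ReLU on all but the last."""
    a_norm = normalize_adjacency(adj)
    h = features
    for k, w in enumerate(weights):
        h = matmul(matmul(a_norm, h), w)
        if k < len(weights) - 1:
            h = relu(h)
    return h


# Toy 4-node path graph with one-hot node features; three weight matrices
# (4->8, 8->8, 8->2) give a three-layer network with 2-dimensional embeddings.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
features = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
embeddings = gcn_forward(adj, features, init_weights([4, 8, 8, 2]))
```

In practice a deep learning framework would supply sparse operations, dropout, and the optimizer; the pure-Python version is only meant to show what "three layers" means structurally.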

8. It is unclear how negative samples were generated or balanced. Were negative protein pairs randomly sampled or drawn from known non-interacting pairs?

Answer: Negative samples were generated by random sampling of protein pairs not present in the verified interaction set, ensuring they did not overlap with known positive interactions (see subsection “Extracting node features using GCN”, page no. 13, line no 462-474 of the revised version of the manuscript). This random negative sampling aligns with common practices in PPI prediction.
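The sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the identifiers, and the choice to draw a fixed number of negatives without replacement are hypothetical, and the balancing ratio used in the manuscript is not restated here.

```python
import random


def sample_negatives(viral, human, positives, n_samples, seed=0):
    """Draw random (viral, human) pairs that are absent from the positive set."""
    positive_set = set(positives)
    # Enumerate all candidate pairs that are not known interactions.
    candidates = [(v, h) for v in viral for h in human
                  if (v, h) not in positive_set]
    if n_samples > len(candidates):
        raise ValueError("not enough non-interacting pairs to sample from")
    # Sampling without replacement guarantees distinct negative pairs.
    return random.Random(seed).sample(candidates, n_samples)


# Toy identifiers (placeholders, not real proteins).
viral = ["V1", "V2"]
human = ["H1", "H2", "H3"]
positives = [("V1", "H1"), ("V2", "H3")]
negatives = sample_negatives(viral, human, positives, n_samples=2)
```

Randomly sampled pairs are only presumed non-interacting; some may be true interactions that have not yet been observed, which is a known caveat of this common practice.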

9. Figure 1 shows model architecture but lacks annotations and descriptions. It is hard to interpret the role of each module in the workflow.

Answer: We have revised Figure 1 and its legend to include detailed annotations for each module, explicitly indicating the roles of the GCN, multi-view integration, clustering, and interpretation modules to make the workflow easier to follow. Please see Figure 1 and its legend in the revised version of the manuscript.

10. There is no table presenting the model’s quantitative performance (e.g., accuracy, F1-score, AUC) on benchmark datasets. This weakens the credibility of performance claims.

Answer: Table 1 now summarizes the model’s quantitative performance on all three network types, including ROC-AUC and AP scores for the validation and test sets. Please see the subsection ‘Comparative Evaluation with Baseline Methods’ on page 6, lines 184-198, of the revised version of the manuscript.
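For readers unfamiliar with the two metrics reported in Table 1, the following sketch computes them from binary labels and prediction scores. It uses the pairwise-ranking definition of ROC-AUC and the ranked-list definition of average precision; the function names are hypothetical, the toy labels and scores are invented for illustration, and the average-precision routine assumes there are no tied scores.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@rank over the ranks of the positives (no ties assumed)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


# Invented toy data: two positives and two negatives.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)           # 3 of 4 positive/negative pairs ordered correctly
ap = average_precision(labels, scores)  # (1/1 + 2/3) / 2
```

The pairwise ROC-AUC loop is quadratic in the number of samples, which is fine for illustration; library implementations use sorting instead.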

11. Add comparative results against baseline methods (e.g., logistic regression, DeepWalk, GCN without multiview fusion).

Answer: We acknowledge the reviewer’s valuable suggestion. However, our proposed framework uniquely integrates embeddings from multiple views (PPI, GO, sequence similarity) through Wasserstein-distance-based fusion, rather than merely relying on single-view embeddings. To meaningfully highlight our model’s superiority, we have now added comparative analyses against the following baseline scenarios:

Single-view GCNs individually applied on each network (PPI, GO, Sequence similarity).

Simple fusion strategies (concatenation and averaging) of single-view GCN embeddings.

Standard embedding techniques such as DeepWalk followed by a simple concatenation fusion method.

Results demonstrate that our Wasserstein fusion approach significantly improves performance (see Table 1). These additional analyses further validate the robustness and effectiveness of the proposed multi-view integration strategy. Please see the subsection ‘Comparative Evaluation with Baseline Methods’ on page no. 6, line no. 184-198 of the revised version of the manuscript.
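The two simple fusion baselines listed above (concatenation and averaging) amount to the following operations on per-view embeddings of a single protein. This is a minimal sketch; the function names and the example vectors are hypothetical.

```python
def concat_fusion(views):
    """Concatenate the per-view embeddings of one protein into a single vector."""
    return [x for view in views for x in view]

def average_fusion(views):
    """Element-wise mean; all views must share the same embedding dimension."""
    dim = len(views[0])
    if any(len(view) != dim for view in views):
        raise ValueError("all views must have the same dimension")
    return [sum(view[i] for view in views) / len(views) for i in range(dim)]

# Hypothetical 2-dimensional embeddings of one protein in the three views.
ppi_emb, go_emb, seq_emb = [0.2, 0.8], [0.4, 0.6], [0.6, 0.4]
fused_concat = concat_fusion([ppi_emb, go_emb, seq_emb])   # length 6
fused_mean = average_fusion([ppi_emb, go_emb, seq_emb])    # length 2
```

Concatenation keeps every view's coordinates but triples the dimension, while averaging keeps the dimension but assumes the views live in comparable coordinate systems.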

Answer: We agree with the reviewers and have now performed GO enrichment analysis for the top 100 predicted proteins, highlighting significant pathways (e.g., viral entry, immune response). Results are provided in Supplementary Table-2 and briefly summarized in the main text. Please see the last paragraph of the subsection ‘Biological relevance and therapeutic potential of predicted host factor’ page no. 9 line no. 319-327 and supplementary table-2 of the revised version of the manuscript.
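A GO enrichment p-value of the kind reported here is commonly obtained from a one-sided hypergeometric (over-representation) test. The sketch below shows that calculation. It assumes this test for illustration only, since the response does not state which test or multiple-testing correction was used, and the numbers in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the GO term of interest."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(hits, upper + 1))
    return tail / total

# Hypothetical numbers: 10 background genes, 3 annotated, 2 selected, at least 1 hit.
p_small = hypergeom_enrichment_p(population=10, annotated=3, sample=2, hits=1)
```

In practice one such p-value is computed per GO term and then corrected for multiple testing.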

Answer: A systematic workflow was used: predicted host proteins were mapped to drug–protein associations using the CMAP Drug Repurposing Hub ([30]), identifying drugs with existing links to these proteins. The method is now described in detail in the second-to-last paragraph of the subsection ‘Predicted host factors promotes Host Directed Therapy (HDT) option against SARS-CoV-2’. Please see page no. 10, line no. 344-368 of the revised version of the manuscript.
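The mapping step can be pictured as a join between the predicted host proteins and a table of drug–target annotations. The sketch below assumes the annotations are available as (drug, target) pairs. The function name, the input layout, and all identifiers are hypothetical placeholders and do not reflect the actual file format of the Drug Repurposing Hub.

```python
from collections import defaultdict

def map_drugs_to_targets(predicted_proteins, drug_targets):
    """Return {drug: sorted list of predicted host proteins it is annotated to target}."""
    predicted = set(predicted_proteins)
    hits = defaultdict(set)
    for drug, target in drug_targets:
        if target in predicted:              # keep only links to predicted host factors
            hits[drug].add(target)
    return {drug: sorted(targets) for drug, targets in hits.items()}

# Placeholder identifiers, not real drugs or genes.
predicted_proteins = ["GENE1", "GENE2", "GENE3"]
drug_targets = [("drug_a", "GENE1"), ("drug_a", "GENE2"),
                ("drug_b", "GENE3"), ("drug_c", "GENE9")]
candidates = map_drugs_to_targets(predicted_proteins, drug_targets)
```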

14. The conclusion is too generic. Reflect more critically on the limitations, such as the dependence on existing PPI databases and lack of experimental validation.

Answer: The Conclusion now explicitly discusses current limitations, including dependence on available PPI databases, absence of direct experimental validation for new predictions, and possible incompleteness of host interactome coverage. We stress that these limitations motivate future validation and model expansion. See the section ‘Discussions’ of the updated version of the manuscript, page no. 18, line no. 652-668.

15. Future directions could include validation using CRISPR screens or cross-species PPI transfer learning.

Answer: We appreciate this suggestion and agree that our predictions could be substantially strengthened by experimental validation, particularly using CRISPR-based genetic screens to confirm host factor essentiality. Additionally, cross-species PPI transfer learning represents a promising strategy to generalize our findings and discover conserved interaction patterns across related viral-host systems. We have now explicitly included these as important future research directions in the section 'Discussions’ of the updated version of the manuscript, page no. 18, line no. 662-671.

16. The manuscript needs a thorough language edit for grammar and clarity. Examples:

Page 5: “This approach is very efficiently to...” → “This approach is efficient for...”

Page 8: “The ROC curve indicate the...” → “The ROC curve indicates the...”

Answer: The manuscript has undergone comprehensive language editing for grammar, clarity, and style. Examples such as “This approach is very efficiently to...” and “The ROC curve indicate the...” have been corrected as suggested.

Answer to comments of Reviewer 2

Comments:

1. Clustering of proteins can be overlapping in nature in biological data, but the authors use hierarchical clustering, which detects disjoint clusters, has a high computational time cost, and is an old method that could be error prone. Proper justification is required.

Answer: Thank you for highlighting this point. While hierarchical clustering indeed produces disjoint clusters and can be computationally intensive, we chose it primarily for its interpretability and wide adoption in bioinformatics, particularly for protein interaction data where distinct biological modules or functional groups are frequently analyzed. To ensure robustness, we employed Ward's linkage to minimize intra-cluster variance, facilitating the meaningful biological interpretation of identified protein clusters. Moreover, hierarchical clustering allowed us to clearly visualize and interpret the results through dendrograms and cluster-specific heatmaps, which are beneficial for biological insights.

Nevertheless, we acknowledge that overlapping clustering methods, such as fuzzy clustering or community detection algorithms (e.g., Louvain, Infomap), could potentially provide more nuanced biological insights due to the inherently overlapping nature of protein functional associations. Future analyses could explore these advanced clustering methods for additional validation and biological insight. We have now mentioned this in the subsection ‘Hierarchical clustering of the Wasserstein distance matrix’ page no line no. 240-251 of the revised version of the manuscript.
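Hierarchical clustering with Ward's linkage on a precomputed distance matrix can be sketched with SciPy as follows. This is an illustration under stated assumptions: the helper name `ward_clusters` and the toy distances are hypothetical. Ward's criterion is formally defined for Euclidean distances, so applying it to another metric, such as a Wasserstein distance matrix, is a heuristic choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ward_clusters(dist_matrix, n_clusters):
    """Cut a Ward-linkage dendrogram into at most `n_clusters` disjoint clusters."""
    condensed = squareform(dist_matrix, checks=False)   # square matrix -> condensed vector
    tree = linkage(condensed, method="ward")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Toy data (hypothetical): six points on a line forming two well-separated groups.
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
dist_matrix = np.abs(points[:, None] - points[None, :])
cluster_labels = ward_clusters(dist_matrix, n_clusters=2)
```

The `tree` array returned by `linkage` is also the input expected by `scipy.cluster.hierarchy.dendrogram`, which produces the kind of dendrogram mentioned in the answer.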

2. The algorithm also has a high overall time complexity, so it would be slow for large datasets. More explanation of this is needed.

Answer: We agree and have clarified in the Methods (subsection ‘Time complexity analysis’) that while sequence similarity computation and hierarchical clustering contribute to a high theoretical time complexity (O(n²m² + n³)), our study’s dataset size allowed practical execution. For much larger datasets, steps such as pairwise similarity calculation and clustering could be optimized using parallelization or approximate algorithms. We now explicitly discuss this scalability limitation and possible future solutions in the manuscript’s Discussion section. Please see the section ‘Discussions’ in the revised version of the manuscript, page no. 18, line no. 652-660.
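To get a feel for the two terms in O(n²m² + n³), the sketch below evaluates them for one hypothetical problem size. It assumes that n denotes the number of proteins and m a typical sequence length, which this response does not spell out. The counts are leading-order estimates, not measured run times.

```python
def estimated_operations(n, m):
    """Leading-order operation counts for the two terms of O(n^2 m^2 + n^3).

    Assumption: n = number of proteins, m = typical sequence length.
    """
    pairwise_alignment = n**2 * m**2     # all protein pairs, quadratic-time comparison per pair
    clustering = n**3                    # naive agglomerative hierarchical clustering
    return pairwise_alignment, clustering

# Hypothetical sizes: 1,000 proteins of length 500.
alignment_ops, clustering_ops = estimated_operations(n=1000, m=500)
```

Under these assumptions the pairwise similarity term dominates whenever m² exceeds n, which suggests that it is the first step to parallelize or approximate.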

3. What is the significance of your own algorithm in relation to other already available similar algorithms?

A comparative analysis could be better for the paper and could enhance the novelty of the work.

Answer: Thank you for this important point. The novelty of our approach lies in the integration of three complementary biological views (PPI, sequence similarity, GO similarity) via graph convolutional networks, and in the application of Wasserstein (optimal transport) distance for protein similarity, which collectively outperform single-view or conventional network methods.

We have now included a comparative performance analysis against baseline algorithms (logistic regression, DeepWalk, and single-view GCN) in the Results section (see subsection ‘Comparative Evaluation with Baseline Methods’, Table 1, page no. 6, line no. 184-198). Our multi-view GCN-Wasserstein approach achieved higher ROC-AUC and average precision scores, demonstrating its superiority in both prediction accuracy and biological interpretability.
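As a concrete illustration of the Wasserstein (optimal transport) distance mentioned in this answer, the sketch below computes the 1-Wasserstein distance between two equal-size empirical samples on the real line, where it reduces to the mean absolute difference of the sorted values. This is a one-dimensional toy case with hypothetical values and does not reproduce the manuscript's multi-view formulation.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size, equal-weight samples on the real line."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    # In one dimension the optimal transport plan matches sorted values in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Hypothetical feature values for two proteins.
protein_a = [0.0, 0.5, 1.0]
protein_b = [1.5, 0.5, 1.0]
distance = wasserstein_1d(protein_a, protein_b)   # every unit of mass moves by 0.5
```

A small distance means the two value distributions can be morphed into each other at little transport cost, which is what makes it usable as a similarity measure between proteins.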

4. “These proteins already serve the purposes of host-directed therapy (HDT) options for other viruses, thereby also being potential candidates for building HDT strategy against SARS-CoV2.” The phrase “thereby also being potential candidates for building HDT” requires more explanation.

Answer: We have expanded the explanation. Host-directed therapy (HDT) aims to inhibit host proteins that viruses exploit for infection and replication. Many of our predicted host factors are already validated as essential in the life cycle of multiple viruses. Therefore, these proteins are not only relevant to SARS-CoV-2 but are also proven targets in existing HDT strategies for other viral infections. By targeting such proteins, repurposed drugs have the potential for broad-spectrum antiviral effects and a reduced risk of resistance compared to virus-targeted therapies. We have now included an analysis to further evaluate the biological relevance of our predicted proteins: we performed Gene Ontology (GO) enrichment analysis on the top 100 predicted human proteins (host factors) and observed that significantly enriched terms include processes such as viral entry into host cell (GO:0046718, $p = 3.1 \times 10^{-6}$), regulation o

Attachment

Submitted filename: Reviewers Comments_plosone (1).pdf

pone.0332794.s004.pdf (221.2KB, pdf)

PLoS One. doi: 10.1371/journal.pone.0332794.r003

Decision Letter 1

Chandrabose Selvaraj

4 Sep 2025

A Graph Neural Network-Based Approach for Predicting SARS-CoV-2–Human Protein Interactions from Multiview Data

PONE-D-25-21081R1

Dear Dr. Ray,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Chandrabose Selvaraj, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewer #1:

Reviewer #2:

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: The authors have addressed my comments. The manuscript is accepted for publication. I have no further comments.

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

PLoS One. doi: 10.1371/journal.pone.0332794.r004

Acceptance letter

Chandrabose Selvaraj

PONE-D-25-21081R1

PLOS ONE

Dear Dr. Ray,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Chandrabose Selvaraj

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Table. A total of 472 interactions between SARS-CoV-2 proteins and human proteins, obtained by setting the threshold value to 0.5.

(CSV)

pone.0332794.s001.csv (15.5KB, csv)

S2 Table. Results of Gene Ontology (GO) enrichment analysis on the top 100 predicted human proteins.

(PDF)

pone.0332794.s002.pdf (72.9KB, pdf)

S1 Fig. Heatmaps of Wasserstein distance for all SARS-CoV-2 proteins.

(PDF)

pone.0332794.s003.pdf (215.2KB, pdf)


Data Availability Statement

All relevant data are within the manuscript and its Supporting information files.

1. Wu F, Zhao S, Yu B, Chen Y-M, Wang W, Song Z-G, et al. Author Correction: a new coronavirus associated with human respiratory disease in China. Nature. 2020;580(7803):E7. doi: 10.1038/s41586-020-2202-3

2. Kaufmann SHE, Dorhoi A, Hotchkiss RS, Bartenschlager R. Host-directed therapies for bacterial and viral infections. Nat Rev Drug Discov. 2018;17(1):35–56. doi: 10.1038/nrd.2017.162

3. Ackerman EE, Kawakami E, Katoh M, Watanabe T, Watanabe S, Tomita Y, et al. Network-guided discovery of influenza virus replication host factors. mBio. 2018;9(6):e02002–18. doi: 10.1128/mBio.02002-18

4. Yamin R, Ahmad I, Khalid H, Perveen A, Abbasi SW, Nishan U, et al. Identifying plant-derived antiviral alkaloids as dual inhibitors of SARS-CoV-2 main protease and spike glycoprotein through computational screening. Front Pharmacol. 2024;15:1369659. doi: 10.3389/fphar.2024.1369659

5. Khalid H, Ahmad I, Sarfraz A, Iqbal A, Nishan U, Dib H, et al. Screening Asian medicinal plants for SARS-CoV-2 inhibitors: a computational approach. Chem Biodivers. 2025;22(5):e202402548. doi: 10.1002/cbdv.202402548

6. Shah M, Yamin R, Ahmad I, Wu G, Jahangir Z, Shamim A, et al. In-silico evaluation of natural alkaloids against the main protease and spike glycoprotein as potential therapeutic agents for SARS-CoV-2. PLoS One. 2024;19(1):e0294769. doi: 10.1371/journal.pone.0294769

7. Dick K, Biggar KK, Green JR. Comprehensive prediction of the SARS-CoV-2 vs. human interactome using PIPE4, SPRINT, and PIPE-Sites. Scholars Portal Dataverse. 2020. doi: 10.5683/SP2/JZ77XA

8. Gordon DE, Jang GM, Bouhaddou M, Xu J, Obernier K, White KM, et al. A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature. 2020;583(7816):459–68. doi: 10.1038/s41586-020-2286-9

9. Ray S, Lall S, Mukhopadhyay A, Bandyopadhyay S, Schönhuth A. Deep variational graph autoencoders for novel host-directed therapy options against COVID-19. Artif Intell Med. 2022;134:102418. doi: 10.1016/j.artmed.2022.102418

10. Morselli Gysi D, do Valle Í, Zitnik M, Ameli A, Gan X, Varol O, et al. Network medicine framework for identifying drug-repurposing opportunities for COVID-19. Proc Natl Acad Sci U S A. 2021;118(19):e2025581118. doi: 10.1073/pnas.2025581118

11. Sadegh S, Matschinske J, Blumenthal DB, Galindez G, Kacprowski T, List M, et al. Exploring the SARS-CoV-2 virus-host-drug interactome for drug repurposing. Nat Commun. 2020;11(1):3518. doi: 10.1038/s41467-020-17189-2

12. Khorsand B, Savadi A, Naghibzadeh M. SARS-CoV-2-human protein-protein interaction network. Inform Med Unlocked. 2020;20:100413. doi: 10.1016/j.imu.2020.100413

13. Dick K, Chopra A, Biggar KK, Green JR. Multi-schema computational prediction of the comprehensive SARS-CoV-2 vs. human interactome. PeerJ. 2021;9:e11117. doi: 10.7717/peerj.11117

14. Ray S, Lall S, Bandyopadhyay S. A deep integrated framework for predicting SARS-CoV2–human protein-protein interaction. IEEE Trans Emerg Top Comput Intell. 2022;6(6):1463–72. doi: 10.1109/tetci.2022.3182354

15. Dey L, Chakraborty S, Mukhopadhyay A. Machine learning techniques for sequence-based prediction of viral-host interactions between SARS-CoV-2 and human proteins. Biomed J. 2020;43(5):438–50. doi: 10.1016/j.bj.2020.08.003

16. Ghosh M, Sil P, Roy A, Fajriyah R, Mondal KC. Finding prediction of interaction between SARS-CoV-2 and human protein: a data-driven approach. J Inst Eng India Ser B. 2021;102(6):1293–302. doi: 10.1007/s40031-021-00569-7

17. Khorsand B, Savadi A, Naghibzadeh M. Comprehensive host-pathogen protein-protein interaction network analysis. BMC Bioinformatics. 2020;21(1):400. doi: 10.1186/s12859-020-03706-z

18. Ray S, Alberuni S, Maulik U. Computational prediction of HCV-human protein-protein interaction via topological analysis of HCV infected PPI modules. IEEE Trans Nanobioscience. 2018;17(1):55–61. doi: 10.1109/TNB.2018.2797696

19. Liu-Wei W, Kafkas Ş, Chen J, Dimonaco N, Tegnér J, Hoehndorf R. DeepViral: infectious disease phenotypes improve prediction of novel virus–host interactions. Cold Spring Harbor Laboratory; 2020. doi: 10.1101/2020.04.22.055095

20. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint. 2016. doi: 10.48550/arXiv.1609.02907

21. Peyré G, Cuturi M. Computational optimal transport: with applications to data science. Found Trend Mach Learn. 2019;11(5–6):355–607. doi: 10.1561/2200000073

22. Nielsen F. Hierarchical Clustering. Undergraduate Topics in Computer Science. Springer International Publishing; 2016. p. 195–211. doi: 10.1007/978-3-319-21903-5_8

23. Wang H, Liang Q, Hancock JT, Khoshgoftaar TM. Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods. J Big Data. 2024;11(1). doi: 10.1186/s40537-024-00905-w

24. Shang J, Ye G, Shi K, Wan Y, Luo C, Aihara H, et al. Structural basis of receptor recognition by SARS-CoV-2. Nature. 2020;581(7807):221–4. doi: 10.1038/s41586-020-2179-y

25. Ichida K, Kamatani N, Nishino T, Saji M, Okabe H, Hosoya T. Mutations in xanthine dehydrogenase gene in subjects with hereditary xanthinuria. Adv Exp Med Biol. 1998;431:327–30. doi: 10.1007/978-1-4615-5381-6_65

26. Holcomb D, Alexaki A, Hernandez N, Laurie K, Kames J, Hamasaki-Katagiri N, et al. Potential impact on coagulopathy of gene variants of coagulation related proteins that interact with SARS-CoV-2. bioRxiv. 2020. doi: 10.1101/2020.09.08.272328

27. Dey L, Mukhopadhyay A. DenvInt: a database of protein–protein interactions between dengue virus and its hosts. PLoS Negl Trop Dis. 2017;11(10). doi: 10.1371/journal.pntd.0005879

28. Ako-Adjei D, Fu W, Wallin C, Katz KS, Song G, Darji D, et al. HIV-1, human interaction database: current status and new features. Nucleic Acids Res. 2015;43(D1):D566–70. doi: 10.1093/nar/gku1126

29. Kwofie SK, Schaefer U, Sundararajan VS, Bajic VB, Christoffels A. HCVpro: hepatitis C virus protein interaction database. Infect Genet Evol. 2011;11(8):1971–7. doi: 10.1016/j.meegid.2011.09.001

30. Zhou X, Park B, Choi D, Han K. A generalized approach to predicting protein-protein interactions between virus and host. BMC Genomics. 2018;19(Suppl 6):568. doi: 10.1186/s12864-018-4924-2

31. Gurumayum S, Brahma R, Naorem LD, Muthaiyan M, Gopal J, Venkatesan A. ZikaBase: an integrated ZIKV- human interactome map database. Virology. 2018;514:203–10. doi: 10.1016/j.virol.2017.11.007

32. Shapira SD, Gat-Viks I, Shum BOV, Dricot A, de Grace MM, Wu L, et al. A physical and regulatory map of host-influenza interactions reveals pathways in H1N1 infection. Cell. 2009;139(7):1255–67. doi: 10.1016/j.cell.2009.12.018

33. Subramanian A, Narayan R, Corsello SM, Peck DD, Natoli TE, Lu X, et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell. 2017;171(6):1437–1452.e17. doi: 10.1016/j.cell.2017.10.049

34. Elzupir AO. Inhibition of SARS-CoV-2 main protease 3CLpro by means of α-ketoamide and pyridone-containing pharmaceuticals using in silico molecular docking. J Mol Struct. 2020;1222:128878. doi: 10.1016/j.molstruc.2020.128878

35. Seifirad S. Pirfenidone: a novel hypothetical treatment for COVID-19. Med Hypotheses. 2020;144:110005. doi: 10.1016/j.mehy.2020.110005

36. Derakhshan M, Ansarian HR, Ghomshei M. Possible effect of epinephrine in minimizing COVID-19 severity: a review. J Int Med Res. 2020;48(9). doi: 10.1177/0300060520958594

37. Weisberg E, Parent A, Yang PL, Sattler M, Liu Q, Liu Q, et al. Repurposing of kinase inhibitors for treatment of COVID-19. Pharm Res. 2020;37(9):167. doi: 10.1007/s11095-020-02851-7

38. Rual J-F, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature. 2005;437(7062):1173–8. doi: 10.1038/nature04209

39. Rolland T, Taşan M, Charloteaux B, Pevzner SJ, Zhong Q, Sahni N, et al. A proteome-scale map of the human interactome network. Cell. 2014;159(5):1212–26. doi: 10.1016/j.cell.2014.10.050

40. Luck K, Kim D-K, Lambourne L, Spirohn K, Begg BE, Bian W, et al. A reference map of the human binary protein interactome. Nature. 2020;580(7803):402–8. doi: 10.1038/s41586-020-2188-x

41. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, et al. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 2003;13(10):2363–71. doi: 10.1101/gr.1680803

42. Jeronimo C, Forget D, Bouchard A, Li Q, Chua G, Poitras C, et al. Systematic analysis of the protein interaction network for the human transcription machinery reveals the identity of the 7SK capping enzyme. Mol Cell. 2007;27(2):262–74. doi: 10.1016/j.molcel.2007.06.027

43. Hosur R, Peng J, Vinayagam A, Stelzl U, Xu J, Perrimon N, et al. A computational framework for boosting confidence in high-throughput protein-protein interaction datasets. Genome Biol. 2012;13(8):R76. doi: 10.1186/gb-2012-13-8-r76

44. Ye W, Li C, Zhang W, Li J, Liu L, Cheng D, et al. Predicting drug-target interactions by measuring confidence with consistent causal neighborhood interventions. Methods. 2024;231:15–25. doi: 10.1016/j.ymeth.2024.08.009

45. Yu H, Tardivo L, Tam S, Weiner E, Gebreab F, Fan C, et al. Next-generation sequencing to generate interactome datasets. Nat Methods. 2011;8(6):478–80. doi: 10.1038/nmeth.1597

46. Duvenaud D, Maclaurin D, Iparraguirre J, Bombarell R, Hirzel T, Aspuru-Guzik A, et al. Convolutional networks on graphs for learning molecular fingerprints. In: Advances in neural information processing systems. 2015. p. 2224–32. doi: 10.48550/arXiv.1509.09292

47. Villani C. Optimal Transport: old and new. Springer Berlin Heidelberg. 2009. doi: 10.1007/978-3-540-71050-9

48. Pele O, Werman M. Fast and robust Earth Mover’s Distances. In: 2009 IEEE 12th International Conference on Computer Vision. 2009. p. 460–7. doi: 10.1109/iccv.2009.5459199

49. Rubner Y, Guibas L, Tomasi C. The earth mover’s distance, multi-dimensional scaling, and color-based image retrieval. In: Proceedings of the ARPA image understanding workshop. 1997. p. 668.

50. Cuturi M, Doucet A. Fast computation of Wasserstein barycenters. In: International conference on machine learning, 2014. p. 685–93. doi: 10.48550/arXiv.1310.4375

51. Agueh M, Carlier G. Barycenters in the Wasserstein space. SIAM J Math Anal. 2011;43(2):904–24. doi: 10.1137/100805741

52. Dessimoz C, Škunca N. Semantic similarity in the gene ontology. The Gene Ontology Handbook. Springer New York; 2017. p. 161–73. doi: 10.1007/978-1-4939-3743-1

53. Wang JZ, Du Z, Payattakool R, Yu PS, Chen C-F. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23(10):1274–81. doi: 10.1093/bioinformatics/btm087

54. Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010;26(7):976–8. doi: 10.1093/bioinformatics/btq064

55. Yang Y, Su X, Zhao B, Li G, Hu P, Zhang J, et al. Fuzzy-based deep attributed graph clustering. IEEE Trans Fuzzy Syst. 2024;32(4):1951–64. doi: 10.1109/tfuzz.2023.3338565

56. Li G, Zhao B, Su X, Yang Y, Hu P, Zhou X, et al. Discovering consensus regions for interpretable identification of RNA N6-methyladenosine modification sites via graph contrastive clustering. IEEE J Biomed Health Inform. 2024;28(4):2362–72. doi: 10.1109/JBHI.2024.3357979

57. Zhao B-W, Su X-R, Hu P-W, Ma Y-P, Zhou X, Hu L. A geometric deep learning framework for drug repositioning over heterogeneous information networks. Brief Bioinform. 2022;23(6):bbac384. doi: 10.1093/bib/bbac384


