Abstract
Background
Accurate prediction of compound-protein interaction (CPI) plays a crucial role in drug discovery. Existing data-driven methods aim to learn from the chemical structures of compounds and proteins yet ignore conceptual knowledge, that is, the interrelationships among the fundamental elements in the biomedical knowledge graph (KG). Knowledge graphs provide a comprehensive view of entities and relationships beyond individual compounds and proteins. They encompass a wealth of information, such as pathways, diseases, and biological processes, offering a richer context for CPI prediction. This contextual information can be used to identify indirect interactions, infer potential relationships, and improve prediction accuracy. In real-world applications, however, the prevalence of knowledge-missing compounds and proteins is a critical barrier to injecting knowledge into data-driven models.
Results
Here, we propose BEACON, a data and knowledge dual-driven framework that bridges chemical structure and conceptual knowledge for CPI prediction. BEACON learns consistent representations by maximizing the mutual information between chemical structure and conceptual knowledge and predicts the missing representations by minimizing their conditional entropy. BEACON achieves state-of-the-art performance on multiple datasets compared to competing methods, notably with 5.1% and 6.6% performance gains on the BIOSNAP and DrugBank datasets, respectively. Moreover, BEACON is the only approach capable of effectively predicting knowledge representations for knowledge-lacking compounds and proteins.
Conclusions
Overall, our work provides a general approach for directly injecting conceptual knowledge to enhance the performance of CPI prediction.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12915-024-02049-y.
Keywords: Compound-protein interactions, Knowledge graphs, Incomplete multi-modal representation learning, Self-supervised learning
Background
Accurate compound-protein interaction (CPI) identification is a prerequisite for screening compounds that bind to target proteins from large-scale chemical libraries [1, 2], playing a pivotal role in boosting drug discovery and development [3]. While in vitro experiments offer a reliable means of identifying CPI, the resources required for determining each potential compound-protein pair are prohibitive. To circumvent costly and laborious experiments, numerous computational methods have been developed to expedite the drug discovery process in the past decades [4, 5].
With the gradual maturation of artificial intelligence, machine learning approaches promise to be transformative for CPI prediction, leading to remarkable advances in augmenting and accelerating current computational pipelines [6, 7]. In machine learning approaches, the chemical structures of compounds and proteins are converted into features that are utilized to predict CPI. Despite a large number of successful applications, traditional machine learning methods still fall short in the ability to handle complex data and suffer from the reliance on manually engineered features. To relieve this bottleneck, researchers have turned to deep learning models, which have sparked the emergence of various data-driven approaches for large-scale CPI prediction [8, 9]. Leveraging the power of deep learning, these models can automatically extract features from vast amounts of data and show more competitive performance due to their ability to capture the nonlinear relationships of CPI [10].
Still, existing works in the field of CPI prediction primarily focus on learning from the chemical structures (i.e., sequences, graphs, or 3D structures) of compounds and proteins, rarely considering the valuable conceptual knowledge encoded by the biomedical knowledge graph (KG). A knowledge graph is a special type of heterogeneous graph with entities and relationships as nodes and edges, respectively [11, 12]. Knowledge graphs provide a comprehensive view of entities and relationships beyond individual compounds and proteins. They encompass a wealth of information like pathways, diseases, and biological processes, providing a better understanding of a molecule’s properties and functions. Since human professionals hold the view that a molecule’s structure determines its properties and functions, injecting the knowledge of molecular properties and functions into models, in turn, facilitates the learning of structure representations, thereby enhancing CPI prediction.
Furthermore, the prevalence of knowledge-missing compounds and proteins presents a formidable hurdle when injecting conceptual knowledge into models in real-world scenarios. Conventional deep learning frameworks typically require complete data, encompassing both chemical structure and conceptual knowledge, for each sample. However, it is common for compounds/proteins, particularly newly developed ones, to lack the necessary conceptual knowledge, resulting in incomplete data. Apart from incompleteness, the inconsistency between chemical structure and conceptual knowledge due to the large modality gap cannot be overlooked. The features derived from chemical structure and conceptual knowledge are very distinct from each other, exhibiting different distributions. During model learning, chemical structure conveys internal information about indicators of molecular properties and interactions, such as functional groups and their positions. In contrast, conceptual knowledge represents external information about the relationships between biomedical entities (e.g., compounds, genes, and diseases). Thus, consistent representations across chemical structure and conceptual knowledge need to be learned: simply concatenating the structure-specific and knowledge-specific representations and modeling them in a conventional way may yield suboptimal performance. As a consequence of inconsistency and incompleteness, it is non-trivial to bridge chemical structure and conceptual knowledge in a unified framework.
In this paper, we propose a novel data and knowledge dual-driven framework named BEACON (Bridging chEmicAl struCture and cOnceptual kNowledge), which establishes connections between chemical structure and conceptual knowledge. Compared to purely data-driven approaches, BEACON bridges the gap between structure information and knowledge information, providing valuable and knowledgeable complements from the knowledge graph. Specifically, to solve the inconsistency issue between chemical structure and conceptual knowledge and the incompleteness issue of multi-modal data, we design dual contrastive learning to learn consistent representations by maximizing the mutual information between chemical structure and conceptual knowledge, and dual predictive learning to predict the missing knowledge representations by minimizing their conditional entropy. In addition, modality-specific reconstruction is used to preserve structure-specific and knowledge-specific information. After the missing knowledge representations are generated, we merge the representations of compounds and proteins into unified representations that are then fed into the classifier for CPI prediction.
Our contributions can be summarized as follows:
We introduce the problem of missing conceptual knowledge, which has been overlooked in previous works. To the best of our knowledge, this is the first attempt to inject conceptual knowledge into data-driven models for CPI prediction and predict knowledge representations of knowledge-missing compounds and proteins.
We propose a data and knowledge dual-driven framework that solves the inconsistency issue between chemical structure and conceptual knowledge and the incompleteness issue of multi-modal data, bridging the gap between chemical structure and conceptual knowledge.
We empirically validate the benefits of injecting conceptual knowledge obtained from the knowledge graph for CPI prediction. Extensive experiments conducted on four datasets show that BEACON significantly outperforms state-of-the-art baseline methods.
Results
Overview of BEACON
Our proposed BEACON framework takes incomplete samples as input to learn representations of compounds and proteins for the prediction of CPIs. As depicted in Fig. 1a, structure representations can be directly obtained for all compounds and proteins, whereas knowledge representations are only available for a subset of them. To establish a connection between chemical structure and conceptual knowledge, as illustrated in Fig. 1b, we design dual contrastive learning and dual predictive learning, tailored to ensure information consistency and predictability, respectively. In addition, we employ modality-specific reconstruction to preserve the maximum amount of information from both chemical structure and conceptual knowledge, while avoiding trivial solutions.
Fig. 1.
Overview of BEACON. a Three modality-specific encoders are adopted to extract the latent representations of compound structure features (blue), knowledge graph embeddings (green) and protein structure features (yellow). The knowledge representations are available only for the subset of compounds/proteins (pink circle) that already exist in the knowledge graph (KG). Consequently, the knowledge representations corresponding to the out-of-KG compounds/proteins (blue circle) are missing. b Self-supervised learning loss. (1) Modality-specific reconstruction is utilized to preserve structure-specific and knowledge-specific information, avoiding trivial solutions. (2) Dual contrastive learning is used to guarantee consistency by maximizing the mutual information between chemical structure and conceptual knowledge. It maximizes the similarity of chemical structure and conceptual knowledge representations for the same compounds/proteins and minimizes the similarity of chemical structure and conceptual knowledge representations for different compounds/proteins. (3) Dual predictive learning, consisting of forward and inverse predictive learning, is employed to ensure the predictability of knowledge information by minimizing the conditional entropy of chemical structure and conceptual knowledge. Forward predictive learning aims at reconstructing knowledge representations from structure representations through an autoencoder. In contrast, inverse predictive learning generates structure representations from their corresponding knowledge representations using another autoencoder. c The missing knowledge representations are predicted from structure representations using the forward predictive learning networks.
Once the missing knowledge representations are obtained, the structure-specific and knowledge-specific representations of compounds/proteins are concatenated as common representations, which are then fed into the multi-layer perceptron for compound-protein interaction prediction
During the training/inference stage, the model is fed with the entire training/testing set, which includes the incomplete samples associated with compounds/proteins that lack conceptual knowledge. The features of chemical structure and conceptual knowledge from all samples are projected into the latent space using their respective autoencoders. However, for training the predictive learning networks and employing the dual contrastive learning, we only utilize the complete instances, i.e., compounds/proteins that are present in both chemical structure and conceptual knowledge. The consistency between chemical structure and conceptual knowledge is guaranteed by maximizing the mutual information between them through dual contrastive learning, and the predictability of knowledge information is ensured by minimizing the conditional entropy of chemical structure and conceptual knowledge through dual predictive learning. As shown in Fig. 1c, the missing knowledge representations are predicted from the corresponding structure representations leveraging the predictive learning networks. Upon acquiring the missing knowledge representations, the common representations, which are the combination of the structure-specific and knowledge-specific representations, are then fed into the classifier to predict the probabilities of CPIs.
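The dual contrastive objective described above can be illustrated with an InfoNCE-style loss, a standard way to maximize a lower bound on the mutual information between two modalities. The following numpy sketch is ours, not the paper's implementation: the function name, temperature, and batch handling are illustrative assumptions. Matching (structure, knowledge) pairs in a batch act as positives; all other pairs act as negatives.

```python
import numpy as np

def info_nce(z_struct, z_know, temperature=0.1):
    """Symmetric InfoNCE-style contrastive loss between structure and
    knowledge representations of the same batch (illustrative sketch)."""
    # L2-normalize so the dot product is a cosine similarity
    zs = z_struct / np.linalg.norm(z_struct, axis=1, keepdims=True)
    zk = z_know / np.linalg.norm(z_know, axis=1, keepdims=True)
    logits = zs @ zk.T / temperature  # (B, B); positives on the diagonal
    idx = np.arange(len(zs))

    def xent(lg):
        # cross-entropy with the diagonal as the target class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average structure->knowledge and knowledge->structure directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned pairs drive the loss toward zero, while mismatched pairs keep it near log(batch size), which is the intuition behind using it to pull the two modalities into a consistent space.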
Benchmark evaluation of BEACON
Table 1 shows the overall performance of our BEACON model on the four datasets in comparison to the baselines. The performance results of the baseline methods were obtained by running the methods on the same dataset splits. It should be noted that we only compared data-driven methods. This is because network-based methods and KG-based methods are not comparable to our method on the datasets we used, as these methods cannot handle compounds and proteins that do not exist in the KG/network during training and inference. Our method, which can handle knowledge-missing compounds and proteins, is specifically proposed to address these limitations. The missing knowledge representations of these compounds and proteins can be predicted by BEACON’s forward predictive learning networks using their structural representations.
Table 1.
Comparison results of the proposed method and baselines on the Human, C.elegans, BIOSNAP and DrugBank datasets
| Dataset | Method | AUC | AUPR | Precision | Recall | ACC |
|---|---|---|---|---|---|---|
| Human | LR | |||||
| DeepDTI | ||||||
| DeepConv-DTI | ||||||
| MolTrans | ||||||
| TransformerCPI | ||||||
| DrugBAN | ||||||
| BEACON | ||||||
| C.elegans | LR | |||||
| DeepDTI | ||||||
| DeepConv-DTI | ||||||
| MolTrans | ||||||
| TransformerCPI | ||||||
| DrugBAN | ||||||
| BEACON | ||||||
| BIOSNAP | LR | |||||
| DeepDTI | ||||||
| DeepConv-DTI | ||||||
| MolTrans | ||||||
| TransformerCPI | ||||||
| DrugBAN | ||||||
| BEACON | ||||||
| DrugBank | LR | |||||
| DeepDTI | ||||||
| DeepConv-DTI | ||||||
| MolTrans | ||||||
| TransformerCPI | ||||||
| DrugBAN | ||||||
| BEACON |
The boldface denotes the highest score, and italics indicate the second highest score
From the comparison results, it is evident that our BEACON model outperforms all baselines across the four datasets on all evaluation metrics. Concretely, BEACON achieves AUCs of 0.989 and 0.996 on the Human and C. elegans datasets, respectively, clearly demonstrating its ability to make accurate CPI predictions on these two datasets. BEACON also exhibits significant improvements on both the BIOSNAP and DrugBank datasets. More precisely, on the BIOSNAP dataset, our method improves performance by at least 5.1% in AUC, 4.7% in AUPR, 6.7% in precision, 3.4% in recall, and 5.7% in ACC compared to the baseline models. Similarly, on the DrugBank dataset, BEACON outperforms the baseline models substantially, with improvements of 6.6%, 6.0%, 4.5%, 6.2%, and 6.8% in AUC, AUPR, precision, recall, and ACC, respectively. These results indicate the superiority of BEACON, which benefits from effectively leveraging conceptual knowledge to complement chemical structure for CPI prediction.
Additionally, we observe that LR is the least effective among the evaluated models, illustrating the limited capability of traditional machine learning methods for CPI prediction. Moreover, it is worth noting that the DeepConv-DTI model yields performance comparable to the Transformer-based models on most datasets. This suggests that molecular fingerprints and convolutional neural networks (CNNs) [13] remain competitive for encoding compounds and proteins, respectively.
Performance evaluation in the compound cold start setting
To further imitate compound screening in a realistic scenario, we evaluate the performance of BEACON in the compound cold start setting, where compounds in the testing set are not observed in the training set. Following the experimental setting of ref [14], we randomly select 20% of the compounds from the BIOSNAP dataset and utilize the associated samples as the testing set. It is more challenging for the model to make accurate predictions in the cold start setting. The experiments in this scenario allow us to verify the model’s generalization ability to handle unseen compounds.
Following DrugVQA [5] and TransformerCPI [15], which only consider compound or protein cold start settings, we evaluate the model’s performance in the compound cold start setting to serve as a representative evaluation of its capabilities across various potential cold start settings. Experiments for other settings can be conducted similarly to the compound cold start setting.
The results presented in Table 2 demonstrate that, as expected in the cold start setting, the performance of all models experiences a drop compared to the results reported in Table 1. However, despite the challenges posed by the cold start scenario, BEACON continues to achieve state-of-the-art performance on all evaluation metrics, surpassing the baseline models. Significantly, BEACON outperforms the best baseline model, DrugBAN, with increases of 4.4%, 3.7%, 5%, 1.1%, and 3.2% for AUC, AUPR, precision, recall, and ACC, respectively. These results highlight the effectiveness of BEACON, even in the challenging cold start scenario, emphasizing its superiority over existing methods.
Table 2.
Comparison results of the proposed method and baselines in the compound cold start setting
| Method | AUC | AUPR | Precision | Recall | ACC |
|---|---|---|---|---|---|
| LR | |||||
| DeepDTI | |||||
| DeepConv-DTI | |||||
| MolTrans | |||||
| TransformerCPI | |||||
| DrugBAN | |||||
| BEACON |
The boldface denotes the highest score, and italics indicate the second highest score
Cross-domain performance comparison
Since similar sequences are highly likely to have similar functions and structures, in-domain classification under random splitting is an easier task and holds less practical importance. In addition to the compound cold start setting, we further study cross-domain CPI prediction, where the training and test data have different distributions. For a fair comparison with other state-of-the-art methods, we used the cross-domain dataset provided by DrugBAN and conducted five independent runs with different random seeds. The performance results of other methods were directly taken from DrugBAN.
The dataset is constructed using a clustering-based pair-splitting strategy on the BIOSNAP dataset. This strategy clusters drug compounds and target proteins separately for cross-domain performance evaluation. Specifically, it employs single-linkage clustering, a bottom-up hierarchical clustering method, to ensure that the distances between samples in different clusters always exceed a predefined minimum distance threshold. This property prevents clusters from being too close, which helps to generate the cross-domain scenario. The binarized ECFP4 feature is used to represent drug compounds, while the integral PSC feature represents target proteins. To accurately measure pairwise distances, the Jaccard distance and cosine distance are used for ECFP4 and PSC, respectively. The same threshold value is chosen for both drug and protein clustering to avoid overly large clusters and ensure the separation of dissimilar samples. After clustering, 60% of the drug clusters and 60% of the protein clusters are randomly selected, and all drug-target pairs between the selected drugs and proteins are considered source domain data. The remaining drug-target pairs form the target domain data. Under this clustering-based pair-splitting strategy, the source and target domains are non-overlapping and have different distributions. Cross-domain evaluation is more challenging than in-domain random splitting but provides a better measure of a model's generalization ability in real-world drug discovery.
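The clustering-based split above can be sketched with SciPy's hierarchical-clustering tools. This is a generic illustration under our own naming and parameter choices (the threshold value and selection fraction here are placeholders; DrugBAN's exact procedure may differ):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_split(fp_matrix, threshold, frac=0.6, seed=0):
    """Single-linkage clustering of binary fingerprints under Jaccard
    distance, then random assignment of `frac` of the clusters to the
    source domain (illustrative sketch of a clustering-based split)."""
    # condensed pairwise Jaccard distance matrix over binary features
    d = pdist(fp_matrix.astype(bool), metric="jaccard")
    # cut the single-linkage dendrogram at the minimum distance threshold
    labels = fcluster(linkage(d, method="single"),
                      t=threshold, criterion="distance")
    clusters = np.unique(labels)
    rng = np.random.default_rng(seed)
    source = rng.choice(clusters, size=int(frac * len(clusters)),
                        replace=False)
    return np.isin(labels, source)  # True -> sample in the source domain
```

Because single linkage merges clusters by their closest members, cutting at the threshold guarantees that any two samples in different clusters are at least that far apart, which is what keeps the source and target domains dissimilar.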
From Fig. 2, we observe that BEACON's advantages become even more pronounced in this more challenging setting, with the performance gap between BEACON and the other baselines widening further. This is likely because the other methods rely solely on the structural information of compounds and proteins, making accurate predictions particularly difficult when the structural similarity between the training and test sets is low. As a result, their performance drops significantly compared to random splits, even with the incorporation of a domain adaptation module. In contrast, BEACON's ability to integrate conceptual knowledge from the knowledge graph allows it to better capture the complex relationships between compounds and proteins, enabling it to maintain strong performance even when the structural similarity between the training and test sets is reduced.
Fig. 2.
Cross-domain performance comparison on the BIOSNAP dataset
Ablation analysis of BEACON
To inspect whether our well-designed self-supervised learning losses can solve the inconsistency issue between chemical structure and conceptual knowledge, as well as the incompleteness issue of multi-modal data, we generate t-SNE [16] plots to visualize the compound representations. As shown in Fig. 3a and b, when using only the modality-specific reconstruction loss, the representations of structure and knowledge are clearly separated. By comparison, when employing all self-supervised learning losses, the representations of structure and knowledge align in the t-SNE manifold. This indicates that our joint self-supervised learning losses minimize the distribution shift between structure and knowledge, solving the inconsistency issue. In Fig. 3c and d, we visualize the representations of the complete knowledge and the estimated missing knowledge, the latter predicted by our predictive learning networks. There is a substantial overlap between the representations of complete and missing knowledge when employing the three self-supervised learning losses, demonstrating that our method successfully addresses the incompleteness issue of multi-modal data. In contrast, the representations from the model solely utilizing the modality-specific reconstruction loss are separated into two distinct parts, indicating the ineffectiveness of predicting missing knowledge without the dual contrastive learning loss and dual predictive learning loss.
Fig. 3.
t-SNE visualizations of representations. a Structure and knowledge representations from the model that solely utilizes modality-specific reconstruction loss. b Structure and knowledge representations from the model that uses three self-supervised learning losses. c Complete and missing knowledge representations when utilizing modality-specific reconstruction loss. d Complete and missing knowledge representations when employing three self-supervised learning losses
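This kind of joint visualization can be produced with scikit-learn's t-SNE by projecting both modalities into one shared 2-D space. The helper below is a generic sketch (function name and parameters are our assumptions, not the paper's pipeline):

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(struct_repr, know_repr, seed=0):
    """Project structure and knowledge representations into a shared
    2-D t-SNE space so the two modalities can be compared visually."""
    joint = np.vstack([struct_repr, know_repr])
    xy = TSNE(n_components=2,
              random_state=seed,
              perplexity=min(30, len(joint) - 1),
              init="pca").fit_transform(joint)
    # split the joint embedding back into the two modalities
    n = len(struct_repr)
    return xy[:n], xy[n:]
```

Fitting t-SNE on the stacked matrix (rather than on each modality separately) matters: the two point clouds then live in the same embedding, so their overlap or separation is meaningful.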
Despite the efficacy of our proposed BEACON in addressing data incompleteness, the most straightforward approach to this issue is to remove all incomplete samples. To evaluate the importance of utilizing incomplete samples, we perform model training exclusively on complete samples, denoted as the "w/o incomplete samples" setting. The results in Table 3 show notable performance gains for our method over using solely complete samples on all evaluation metrics. Notably, the AUC, one of the most critical evaluation metrics, shows a remarkable improvement, increasing from 0.882 to 0.962. This further underscores the necessity of incorporating incomplete samples.
Table 3.
Ablation results of the proposed method on the DrugBank dataset
| Setting | AUC | AUPR | Precision | Recall | ACC |
|---|---|---|---|---|---|
| w/o incomplete samples | |||||
| w/o conceptual knowledge | |||||
| w/o compound knowledge | |||||
| w/o protein knowledge | |||||
| Full dataset |
To further investigate the influence of injecting conceptual knowledge obtained from the knowledge graph, we conduct experiments using the following three settings: (1) w/o conceptual knowledge, where only the structural information of compounds and proteins is utilized; (2) w/o compound knowledge, where the structural information of compounds and proteins and the conceptual knowledge of proteins are utilized; (3) w/o protein knowledge, where the structural information of compounds and proteins and the conceptual knowledge of compounds are utilized. As depicted in Table 3, noticeable performance degradation is observed across all settings, confirming that the inclusion of conceptual knowledge from the knowledge graph enhances the performance of CPI prediction.
Parameter sensitivity
In this section, we evaluate the impact of the three hyperparameters of our loss function, namely the weights of the dual predictive learning loss, the dual contrastive learning loss, and the modality-specific reconstruction loss. The results were obtained by varying one parameter while keeping the others fixed at their optimal values on the DrugBank validation set. As shown in Fig. 4, our method is relatively robust to the choice of the dual predictive learning loss weight, likely because this loss is easier to optimize during training: the contrastive loss already pulls the structure and knowledge embeddings together, facilitating convergence. Additionally, an appropriate selection of the dual contrastive learning loss weight and the modality-specific reconstruction loss weight can significantly boost performance.
Fig. 4.
Parameter sensitivity study on DrugBank
Case study
To illustrate BEACON’s capability in novel interaction prediction and to further elaborate on how conceptual knowledge enhances CPI prediction, we conducted a case study. As shown in Table 4, BEACON trained on the DrugBank dataset predicts the interaction probability of Ezogabine with seven GABA receptors: GABRA1, GABRA2, GABRA3, GABRB1, GABRB3, GABRG2, and GABRG3 to be 0.999, 0.999, 0.996, 0.998, 0.998, 0.999, and 0.999, respectively. These interactions are novel and not included in the DrugBank dataset.
Table 4.
Novel predictions on GABA receptors
| Gene name | UniProt ID | Drug name | DrugBank ID | Score |
|---|---|---|---|---|
| GABRA1 | P14867 | Ezogabine | DB04953 | 0.999 |
| GABRA2 | P47869 | Ezogabine | DB04953 | 0.999 |
| GABRA3 | P34903 | Ezogabine | DB04953 | 0.996 |
| GABRB1 | P18505 | Ezogabine | DB04953 | 0.998 |
| GABRB3 | P28472 | Ezogabine | DB04953 | 0.998 |
| GABRG2 | P18507 | Ezogabine | DB04953 | 0.999 |
| GABRG3 | Q99928 | Ezogabine | DB04953 | 0.999 |
We extract a path where the head entity is Ezogabine and the tail entity is GABRG2 in DRKG. From Fig. 5, it can be seen that Ezogabine (Compound::DB04953) has a relation (GNBR::T::Compound:Disease) with epilepsy (Disease::MESH:D004827), and epilepsy (Disease::MESH:D004827) has a relation (Hetionet::DaG::Disease:Gene) with GABRG2 (Gene::2566).
Fig. 5.
Extracted path for the novel prediction
This path is congruent with the mechanism of action of Ezogabine [17]. Ezogabine is a neuronal potassium channel opener being developed as a first-in-class antiepileptic drug and is currently being studied in phase 3 trials as an adjunctive treatment for partial-onset seizures in adult patients with refractory epilepsy. Ezogabine affects GABA neurotransmission in the GABA-A receptor, a key inhibitory receptor in the central nervous system implicated in epilepsy. Malfunctioning of the GABA-A receptor leads to hyperexcitability in the brain, causing seizures, making this receptor an important target for antiepileptic therapeutics.
As shown in Fig. 6, BEACON uses TransE to encode a wealth of information, such as pathways, diseases, and biological processes, into knowledge graph embeddings by ensuring that each (head, relation, tail) triple satisfies the training objective of the knowledge graph embedding model. These embeddings are low-dimensional representations of entities and relations, where similar biological entities and relations have similar embeddings, offering conceptual knowledge for CPI prediction. This knowledge helps to identify indirect interactions based on known interactions of similar compounds or proteins, infer potential relationships based on shared pathways or diseases, and thus improve prediction accuracy.
Fig. 6.
Encoding conceptual knowledge
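The TransE objective mentioned above can be sketched in a few lines: a triple (head, relation, tail) is scored by the distance between head + relation and tail, and training pushes true triples to score lower than corrupted ones by a margin. This is a minimal illustration of the standard TransE formulation, not the paper's training code:

```python
import numpy as np

def transe_score(h, r, t, norm=1):
    # Lower score = more plausible triple: TransE wants h + r ≈ t
    return np.linalg.norm(h + r - t, ord=norm)

def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    # Hinge loss: push a true triple's score below a corrupted
    # triple's score by at least `margin`
    return max(0.0, margin + pos_score - neg_score)
```

In practice, negative triples are generated by replacing the head or tail with a random entity, and all entity/relation vectors are learned by minimizing this loss over the KG.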
Grouping performance of compound representations
In general, compounds that interact with the same protein typically bind at similar locations on the protein. Hence, it is reasonable to assume that compounds interacting with the same protein may possess similar characteristics [18]. Based on this assumption, we conduct experiments following the experimental setting of ref [18], where we evaluate the grouping performance of 285 compounds selected from the DrugBank database that interact with 5 different proteins. Figure 7 shows the visualization of compound representations obtained from our proposed BEACON model and the baseline model.
Fig. 7.
Visualized representations of 285 compounds interacting with 5 proteins from the DrugBank dataset. a TransformerCPI model. b DrugBAN model. c BEACON model
In Fig. 7a, the baseline TransformerCPI model only shows a centralized cluster for compounds interacting with the protein P04150, while the representations of other compounds do not form obvious clusters. The DrugBAN model enhances compound embeddings, resulting in more well-defined clusters than the TransformerCPI model. For instance, the compounds in the blue group are widely dispersed in the TransformerCPI embeddings, whereas they are more centralized into a cluster in the DrugBAN embeddings. Still, both baseline models fail to fully capture protein-specific information. In contrast, the compound representations learned by the BEACON model exhibit distinguishable clusters that can be mapped to the corresponding proteins they interact with. These experimental results establish that the compound representations learned by BEACON can be grouped effectively according to their interacting proteins. Such protein-specific representations contribute to the improved performance of CPI prediction.
Discussion
It should be noted that BEACON is a KG-encoder-agnostic framework that can be used with any KG encoder capable of producing entity embeddings. In our paper, BEACON utilizes TransE [19] to encode the static KG (i.e., DRKG). When handling dynamic biomedical KGs, we can periodically retrain TransE on the latest KG to incorporate new knowledge. Furthermore, TransE can be replaced with models like LKGE [20], which is designed for dynamic KGs and does not require learning embeddings from scratch. Retraining the KG encoder does not require retraining BEACON, so BEACON is capable of handling both static and dynamic KGs.
Regarding which specific parts of the biomedical KG are particularly influential in improving CPI prediction: according to the statistics of DRKG, the large-scale biomedical KG employed in BEACON, (Protein, Protein) pairs and (Compound, Compound) pairs have the highest and second-highest counts, respectively. Therefore, in DRKG, these two types of pairs might have the greatest impact on improving CPI prediction. Furthermore, relatively small-scale KGs, such as those used in DTINet [21] and deepDTnet [22], only include the following relationships: (Compound, Compound), (Compound, Disease), (Compound, Side-effect), (Protein, Protein), and (Protein, Disease). These relationships may play a more significant role in improving CPI prediction than the other relationships in DRKG. We will explore this question in our future work.
Conclusions
In this study, we present BEACON, a novel data and knowledge dual-driven framework that bridges the gap between chemical structure and conceptual knowledge to enhance CPI prediction. To handle the ubiquitous knowledge-missing compounds and proteins, we design multiple self-supervised learning losses. BEACON solves the issues of inconsistency and incompleteness while avoiding trivial solutions and predicting CPI accurately through four joint learning objectives. The dual contrastive learning loss is designed to learn the consistent representations by maximizing the mutual information between structure and knowledge, solving the inconsistency issue. The dual predictive learning loss is employed to predict the missing knowledge representations by minimizing the conditional entropy of structure and knowledge, solving the incompleteness issue. The modality-specific reconstruction loss is used to preserve modality-specific information, avoiding trivial solutions where representations of structure and knowledge converge to the same constant. The CPI prediction loss is utilized to train the model to capture the nonlinear relationships of CPIs.
Through extensive experiments conducted on four publicly available datasets, we empirically demonstrate the superiority of BEACON over state-of-the-art methods. Furthermore, BEACON can serve as a versatile multi-modal framework that can be easily extended to other tasks and domains, leveraging other forms of incomplete information, such as 3D information of compounds and proteins. For other tasks in drug discovery, BEACON can be directly applied to drug-drug interaction prediction and protein-protein interaction prediction tasks. In other domains, BEACON can be easily extended to multi-modal learning of images and text reports for medical image analysis, addressing the challenges posed by missing text reports.
Methods
Problem formulation
We formulate CPI prediction as a binary classification task. Specifically, compounds/proteins are represented by feature vectors extracted from two distinct data modalities. We denote the feature vectors extracted from the chemical structure and conceptual knowledge modalities of compounds as $X_c \in \mathbb{R}^{N \times d_c}$ and $E_c \in \mathbb{R}^{N_e \times d_e}$, respectively. We assume that $X_c$, consisting of $N$ samples with dimension $d_c$, is complete on the grounds that each compound has its associated chemical structure. Conversely, $E_c$, consisting of $N_e \leq N$ samples with dimension $d_e$, is incomplete due to the absence of corresponding entities in the knowledge graph for certain compounds. Here, $\mathcal{S}_{com}$ represents the set of complete multi-modal compound samples that are completely paired in two modalities, while $\mathcal{S}_{inc}$ denotes the set of incomplete multi-modal compound samples that exist solely in one modality (i.e., chemical structure). Similarly, the feature vectors extracted from the structure and knowledge modalities of proteins are denoted by $X_p$ and $E_p$, respectively. Utilizing $X_c$, $E_c$, $X_p$, and $E_p$ as inputs, we predict the interactions between compound-protein pairs.
Feature extraction
Compound features
We utilize MACCS [23] and ECFP [24], two widely used types of molecular fingerprints, as our compound features. These fingerprints are generated with the open-source cheminformatics package RDKit [25] from the SMILES strings of compounds. The MACCS fingerprint is a 167-bit representation of a molecule: 166 bits encode the presence or absence of specific predefined molecular substructures, while the 0th bit is reserved and not used for encoding. Because its keys are fixed in advance, MACCS can only capture these predefined substructures. The ECFP fingerprint, in contrast, is derived from the Morgan algorithm, which encodes information about the local environment of each atom in the molecule. By choosing different maximum diameters of atom neighborhoods and different output lengths, various ECFPs can be generated by RDKit. In our model, we set the maximum diameter and output length of ECFP to 4 and 256, respectively. To form the compound features, we concatenate these two fingerprints, resulting in 423-dimensional feature vectors.
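As a sketch of this feature construction (assuming RDKit is installed; `compound_features` is our own illustrative helper, not a function from the paper's code):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

def compound_features(smiles):
    """Concatenate MACCS (167 bits) and a 256-bit ECFP4 into one binary vector."""
    mol = Chem.MolFromSmiles(smiles)
    maccs = np.array(MACCSkeys.GenMACCSKeys(mol))   # 167-bit MACCS keys
    # Morgan radius 2 corresponds to the ECFP4 "maximum diameter 4" convention.
    ecfp = np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=256))
    return np.concatenate([maccs, ecfp])            # 167 + 256 = 423 dimensions

feat = compound_features("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
```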
Protein features
We employ the Transformer [26] architecture to derive the protein features from amino acid sequences. The self-attention mechanism in Transformer enables modeling of the contextual information from the entire protein sequence. By capturing the interactions between all characters in the sequence, residue-residue interactions can be directly represented by the Transformer architecture. Specifically, we use ESM [27], a pre-trained deep Transformer [28], as our protein encoder. By inputting the amino acid sequences, we obtain protein features with a dimensionality of 1280.
Knowledge graph embeddings
For compounds/proteins with corresponding entities, we utilize TransE [19] to generate their embeddings. TransE is a knowledge graph embedding method that aims to represent the entities and relations of the knowledge graph in a low-dimensional vector space while preserving their semantic meaning and higher-order proximity. In TransE, a fact $(h, r, t)$ in the knowledge graph $G$ is assumed to satisfy $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$, where $h$ is the head entity, $t$ is the tail entity, and $r$ is the relationship. The score function in TransE is computed as

$$f(h, r, t) = \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert_2 \tag{1}$$
If (h, r, t) holds, the score is low. Otherwise, the score is high.
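The scoring rule can be sketched in a few lines of NumPy (variable names are ours):

```python
import numpy as np

def transe_score(h, r, t):
    """TransE score: low when h + r ≈ t, high otherwise."""
    return np.linalg.norm(h + r - t)

h = np.array([0.1, 0.2, 0.3])
r = np.array([0.4, 0.1, -0.2])
t_true = h + r                    # tail exactly consistent with the relation
t_false = t_true + 1.0            # perturbed tail

low = transe_score(h, r, t_true)
high = transe_score(h, r, t_false)
```

A triple that holds exactly attains the minimum score of 0; the perturbed tail is penalized.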
Self-supervised learning
Modality-specific reconstruction
Usually, the features of different modalities exhibit distinct statistical properties, making it challenging to establish a direct connection between them. To address this, we use autoencoders [29] to encode multi-modal data into their respective latent subspaces, which allows the correlation between different modalities to be modeled. Autoencoders are known for their ability to capture salient representations of data, reducing the dimensionality of features and mitigating the impact of noise. Consequently, they have been widely applied in various self-supervised learning domains [30, 31].
In our approach, multiple autoencoders are utilized to map the original features of each modality into low-dimensional latent representations. Each autoencoder consists of an encoder and a decoder. By ensuring the consistency between inputs and outputs, the latent representations can effectively preserve the information of each modality. Let $X_c$, $X_p$, and $E$ represent compound structural features, protein structural features, and KG embeddings, respectively. The input features are fed into the corresponding autoencoders, and the reconstruction loss is defined as

$$\mathcal{L}_{rec} = \lVert X_c - g_c(Z_c) \rVert_2^2 + \lVert X_p - g_p(Z_p) \rVert_2^2 + \lVert E - g_e(Z_e) \rVert_2^2 \tag{2}$$

where $g_c$, $g_p$, and $g_e$ represent the modality-specific decoders. $Z_c$, $Z_p$, and $Z_e$ are obtained by

$$Z_c = f_c(X_c), \quad Z_p = f_p(X_p), \quad Z_e = f_e(E) \tag{3}$$

where $f_c$, $f_p$, and $f_e$ are the modality-specific encoders. By optimizing the modality-specific reconstruction loss $\mathcal{L}_{rec}$, the model is encouraged to learn meaningful representations that capture the distinct characteristics of each modality. This helps prevent the learned representations from collapsing to the same constant and ensures that the representations preserve the information of the corresponding modalities. At this stage, there is no direct connection between different modalities, so representations can be learned for both complete and incomplete samples without inferring missing data.
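A minimal sketch of one modality's reconstruction term, using a single linear autoencoder with random weights in place of the trained fully connected networks (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_autoencoder(k, d):
    """A linear encoder f: R^k -> R^d and decoder g: R^d -> R^k (random init,
    standing in for the trained fully connected networks)."""
    w_enc = rng.normal(scale=0.05, size=(k, d))
    w_dec = rng.normal(scale=0.05, size=(d, k))
    return (lambda x: x @ w_enc), (lambda z: z @ w_dec)

X_c = rng.normal(size=(8, 423))            # toy compound structural features
f_c, g_c = make_autoencoder(423, 128)      # encoder/decoder for this modality

Z_c = f_c(X_c)                             # latent representations
rec_term = np.mean((X_c - g_c(Z_c)) ** 2)  # this modality's reconstruction term
```

In training, the analogous terms for the protein and knowledge modalities would be summed and minimized jointly.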
Dual contrastive learning
Each modality contributes a specific aspect of the data. However, modality gaps may impede performance due to low similarity between latent representations of different modalities. To narrow these gaps, we design dual contrastive learning, which maximizes the consistency between chemical structure and conceptual knowledge representations, aligning the distributions of representations across modalities.
As one of the most effective self-supervised learning paradigms, contrastive learning has made tremendous progress in representation learning [32, 33]. Its core concept is to find a latent space where the similarity between positive pairs is maximized and the similarity between negative pairs is minimized. In this study, we employ the InfoNCE loss [34] as the learning objective, which maximizes a lower bound on the mutual information between the learned structure representations and entity representations of compounds, i.e., $I(z^s; z^e) \geq \log(N) - \mathcal{L}_{\mathrm{InfoNCE}}$, with $N$ representing the number of negative pairs.

To implement dual contrastive learning, we first extract the structure and entity representations $z_i^s$ and $z_i^e$ for each compound. Subsequently, we proceed with the selection of positive and negative pairs. Specifically, the positive pairs consist of structure-entity representation pairs pertaining to the same compound, while the remaining pairs within the batch are treated as negative. The positive pairs will be aligned, whereas the negative pairs will be contrasted. The dual contrastive learning loss on the compound side can be expressed as follows

$$\ell(z_i^s, z_i^e) = -\log \frac{\exp(\langle z_i^s, z_i^e \rangle)}{\exp(\langle z_i^s, z_i^e \rangle) + \sum_{j \neq i} \exp(\langle z_i^s, z_j^e \rangle)} \tag{4}$$

$$\ell(z_i^e, z_i^s) = -\log \frac{\exp(\langle z_i^e, z_i^s \rangle)}{\exp(\langle z_i^e, z_i^s \rangle) + \sum_{j \neq i} \exp(\langle z_i^e, z_j^s \rangle)} \tag{5}$$

$$\mathcal{L}_{cl}^{c} = \frac{1}{N} \sum_{i=1}^{N} \left[ \ell(z_i^s, z_i^e) + \ell(z_i^e, z_i^s) \right] \tag{6}$$

where $z_i^s$ and $z_i^e$ represent the chemical structure and conceptual knowledge representations, respectively, forming the positive pair. $z_j^s$ and $z_j^e$ are randomly sampled chemical structure and conceptual knowledge representations for the anchor pair $(z_i^s, z_i^e)$, treated as negative samples. $\langle \cdot, \cdot \rangle$ denotes the inner product. The dual contrastive learning loss on the protein side, $\mathcal{L}_{cl}^{p}$, is defined identically. Therefore, the final dual contrastive learning loss is the sum of the individual losses on the compound and protein sides

$$\mathcal{L}_{cl} = \mathcal{L}_{cl}^{c} + \mathcal{L}_{cl}^{p} \tag{7}$$
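The in-batch InfoNCE computation can be sketched as follows. This is a common variant using L2-normalized representations with a temperature, not necessarily the exact parameterization used in BEACON; all names are illustrative:

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.1):
    """InfoNCE with in-batch negatives: row i of z_a pairs with row i of z_b."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                    # positives on the diagonal

rng = np.random.default_rng(0)
z_struct = rng.normal(size=(16, 8))                       # toy structure representations
z_entity = rng.normal(size=(16, 8))                       # toy entity representations

# The dual loss sums both directions (structure->entity and entity->structure).
dual_loss = info_nce(z_struct, z_entity) + info_nce(z_entity, z_struct)

aligned = info_nce(z_struct, z_struct)                    # perfectly aligned modalities
mismatched = info_nce(z_struct, z_entity)                 # unrelated modalities
```

Aligned modalities yield a much lower loss than unrelated ones, which is exactly the signal that pulls the two views of the same compound together.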
Dual predictive learning
To overcome the issue of incompleteness in multi-modal representation learning, we propose dual predictive learning to hallucinate the missing modalities. We assume that each modality can be viewed as a mapping of the other modalities, as they share common semantic information. For example, $G_{s \to e}$ represents the mapping of compound structure representations $Z^s$ to entity representations $Z^e$. In the latent space of a deep neural network, the compound representations of one modality can be used to predict the compound representations of another modality by minimizing the conditional entropy $H(Z^{m_2} \mid Z^{m_1})$, where $(m_1, m_2) = (s, e)$ or $(e, s)$.

To be specific, our proposed dual predictive learning consists of two parts: forward predictive learning and inverse predictive learning. Forward predictive learning encourages the learned structure representations $Z^s$ to reconstruct the entity representations $Z^e$ using the deterministic mapping $G_{s \to e}$. It guides $Z^s$ to extract information related to the entity representations $Z^e$. In contrast, inverse predictive learning emphasizes the learned entity representations $Z^e$ to reconstruct the structure representations $Z^s$ using the deterministic mapping $G_{e \to s}$. It guides $Z^e$ to extract the information associated with the structure representations $Z^s$.

For bi-modal data, the dual predictive learning loss for compound representations is computed as follows

$$\mathcal{L}_{pre}^{c} = \lVert G_{s \to e}(Z^s) - Z^e \rVert_2^2 + \lVert G_{e \to s}(Z^e) - Z^s \rVert_2^2 \tag{8}$$

It is important to note that relying solely on the dual predictive learning loss may lead to trivial solutions where $Z^s$ and $Z^e$ converge to the same constant. To avoid this situation, the modality-specific reconstruction loss should be added to the optimization process. In a similar manner, we can obtain the dual predictive learning loss for protein representations, $\mathcal{L}_{pre}^{p}$. The final dual predictive learning loss is the sum of the compound loss and protein loss

$$\mathcal{L}_{pre} = \mathcal{L}_{pre}^{c} + \mathcal{L}_{pre}^{p} \tag{9}$$

During the training and inference phases, we input the entire dataset, including the incomplete samples, into the network to obtain the representations of all modalities. For samples associated with compounds that are missing in the knowledge modality, we predict the missing representations from the available representations through the predictive learning networks

$$\hat{Z}^e = G_{s \to e}(Z^s) \tag{10}$$

where $\hat{Z}^e$ represents the predicted representations of the missing entity modality. The common representations are obtained by concatenating all modality-specific representations, resulting in $[Z^s \,\Vert\, Z^e]$ for compounds that exist in all modalities and $[Z^s \,\Vert\, \hat{Z}^e]$ for compounds that only exist in the structure modality. The common representations of proteins are formed in an analogous way.
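The imputation-then-concatenation step can be sketched as below, with a toy nonlinear map standing in for the trained forward predictive network (all names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4                                            # toy latent dimensionality

W = rng.normal(scale=0.5, size=(D, D))           # stand-in for a trained network
predict_entity = lambda z_s: np.tanh(z_s @ W)    # forward predictive mapping

z_struct = rng.normal(size=(3, D))               # structure reps (always present)
z_entity = rng.normal(size=(3, D))               # entity reps (may be missing)
has_kg = np.array([True, False, True])           # sample 1 lacks a KG entity

# Impute missing entity representations from the structure modality, then
# concatenate both modalities into the common representation.
z_entity_full = np.where(has_kg[:, None], z_entity, predict_entity(z_struct))
common = np.concatenate([z_struct, z_entity_full], axis=1)
```

Samples with a KG entity keep their learned entity representation; knowledge-lacking samples receive the predicted one, so every sample ends up with a complete common representation.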
Compound-protein interaction prediction
For each sample of CPI, we first concatenate the structure and entity representations of the compound/protein into the common representation as follows

$$h_{ij} = z_{c,i}^{s} \,\Vert\, z_{c,i}^{e} \,\Vert\, z_{p,j}^{s} \,\Vert\, z_{p,j}^{e} \tag{11}$$

where $z_{c,i}^{s}$ and $z_{c,i}^{e}$ represent the structure and entity representations of the $i$-th compound, respectively, while $z_{p,j}^{s}$ and $z_{p,j}^{e}$ represent the structure and entity representations of the $j$-th protein, respectively. $\Vert$ denotes the concatenation operation.

Next, we apply a multi-layer perceptron (MLP) to make the final prediction based on the concatenated interaction representation. The prediction is done as

$$\hat{y}_{ij} = \sigma(\mathrm{MLP}(h_{ij})) \tag{12}$$

where $\sigma$ denotes the sigmoid function and $\hat{y}_{ij}$ represents the probability of CPI.

For the training objective, the binary cross-entropy loss for CPI prediction is formulated as

$$\mathcal{L}_{cpi} = -\frac{1}{M} \sum_{i=1}^{M} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] \tag{13}$$

where $y_i$ (0 or 1) represents the actual label of the $i$-th CPI sample, $\hat{y}_i$ denotes the predicted probability of interaction for the $i$-th compound-protein pair, and $M$ is the number of training samples.
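The binary cross-entropy objective is standard and can be checked numerically on a toy batch (a sketch; the clipping constant is our own safeguard against log(0)):

```python
import numpy as np

def bce_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy averaged over CPI samples."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y = np.array([1.0, 0.0])          # one interacting pair, one non-interacting
p = np.array([0.9, 0.1])          # confident, correct predictions
loss = bce_loss(y, p)             # both terms reduce to log 0.9, so loss = -log(0.9)
```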
Training
BEACON employs four joint learning objectives to form the overall loss function, which is formulated as follows

$$\mathcal{L} = \mathcal{L}_{cpi} + \alpha \mathcal{L}_{cl} + \beta \mathcal{L}_{pre} + \gamma \mathcal{L}_{rec} \tag{14}$$

where $\mathcal{L}_{cpi}$, $\mathcal{L}_{cl}$, $\mathcal{L}_{pre}$, and $\mathcal{L}_{rec}$ are the CPI prediction loss, dual contrastive learning loss, dual predictive learning loss, and modality-specific reconstruction loss, respectively. $\alpha$, $\beta$, and $\gamma$ are the trade-off hyperparameters that control the importance of $\mathcal{L}_{cl}$, $\mathcal{L}_{pre}$, and $\mathcal{L}_{rec}$, respectively. For the selection of $\alpha$, $\beta$, and $\gamma$, values are searched within the range {0.001, 0.01, 0.1, 1, 10, 100}, leveraging the validation results of DrugBank. In our experiments, we set these hyperparameters to fixed values of 0.1, 0.01, and 0.01, respectively.
The CPI loss is the primary objective of the model, directly measuring performance on the core task. Introducing a weight could undermine the optimization of this core objective. The other loss components are auxiliary objectives that help improve generalization and incorporate additional constraints. The weights that control the influence of these auxiliary losses ensure they do not overshadow the primary task.
Datasets
To demonstrate the effectiveness of our proposed method, we conduct comparative experiments on four benchmark datasets: Human [35], C.elegans [35], BIOSNAP [36], and DrugBank [37].
Human [35] comprises 3369 positive samples. To establish a balanced dataset, we randomly select 3369 negative samples from a pool of 384,916 highly credible negative samples. As a result, the Human dataset contains a total of 6738 compound-protein pairs involving 4248 compounds and 2018 proteins.
C.elegans [35] includes 4000 positive samples. Similar to the Human dataset, we randomly select 4000 negative samples from a collection of 88,261 highly credible negative samples. Consequently, the C.elegans dataset consists of 8000 samples, encompassing 2997 compounds and 1970 proteins.
BIOSNAP [36] is composed of 4510 compounds and 2181 proteins. It contains 13,741 positive CPI pairs. To construct a balanced dataset, we follow the common practice [38, 39] of sampling from unlabeled pairs to obtain negative samples.
DrugBank [37] consists of 6655 compounds and 4294 proteins. For the selection of negative samples, we randomly sample from the unlabeled compound-protein pairs. The number of negative samples is equated to the number of positive samples, resulting in a set of 17,511 negative samples.
We randomly divide the CPI samples of the Human, C.elegans, and DrugBank datasets into training/validation/testing with a ratio of 8/1/1. Regarding the BIOSNAP dataset, we follow the partitioning strategy described in ref [14]. Accordingly, the dataset is divided into training (70%), validation (10%), and testing (20%) sets.
In addition to the CPI samples, we incorporate the large-scale knowledge graph DRKG [40] into our methodology. DRKG contains 97,238 vertices belonging to 13 entity-types and 5,874,261 edges attributed to 107 relation-types, providing conceptual knowledge. Compounds and proteins are mapped into DRKG entities based on their respective identifiers. Notably, we exclude the CPI samples present in DRKG to prevent data leakage and maintain the integrity of our experimental setting. Additional file 1 provides the statistics of incomplete samples in datasets.
Baselines
We compare our proposed model BEACON with the following competitive baselines.
LR employs a logistic regression model for CPI prediction. It uses extended-connectivity fingerprints of diameter 4 (ECFP4) as compound features and PSC52 as protein features.
DeepDTI [41] applies deep belief networks [42] to process features. It concatenates ECFP2, ECFP4, and ECFP6 as compound features and employs PSC as protein features.
DeepConv-DTI [43] utilizes 1D-CNNs with a global max pooling layer to extract protein features. The ECFPs of compounds are processed using fully connected layers.
TransformerCPI [15] modifies the encoder and final linear layers of the Transformer model. It encodes protein sequences using gated convolutional networks and extracts compound features using graph neural networks. The interaction vectors are generated through the decoder of the Transformer.
MolTrans [14] first extracts substructure representations of the compound/protein sequence using the FCS algorithm. These representations are then fed into the compound/protein encoder to obtain augmented contextual representations. Afterwards, an interaction map is generated based on the substructure representations, and a 2D-CNN model is applied to extract higher-order interactions. Finally, a decoder is used to output the CPI probability.
DrugBAN [44] is a deep bilinear attention network framework designed to learn pairwise local interactions between compounds and proteins. It takes compound molecular graphs and target protein sequences as inputs. Currently, it stands as the most competitive model in the field.
Implementation details
The proposed model comprises three primary modules: modality-specific autoencoders, predictive learning networks and a classifier. The architecture of the first two modules is based on fully connected networks with layer normalization [45] and ReLU activation [46] applied after each layer.
In the modality-specific autoencoders, the dimensionality of encoders is set to K-256-D, where K and D represent the dimensionality of the original data and the dimensionality of the latent space, respectively. It should be noted that the decoder architecture is symmetrical to the encoder architecture. The predictive learning networks have a fixed dimensionality of D-128-256-128-256-128-D. In our model, D is set to 128. For the classifier, we set the dimensionality to 512-256-256-1.
The BEACON model is implemented using PyTorch [47]. To optimize the model, Adam [48] is adopted with an initial learning rate of 0.0001 for all datasets. The maximum number of training epochs is fixed to 200. We run the experiments on one NVIDIA GeForce RTX 2080 Ti GPU.
Evaluation metrics
Following DrugVQA [5] and DrugBAN [44], we choose five evaluation metrics to assess the performance: AUC (area under the curve), AUPR (area under the precision-recall curve), precision, recall, and ACC (accuracy). AUC represents the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example. Precision measures the accuracy of positive predictions. Recall measures the model’s ability to correctly identify all relevant instances in a dataset. AUPR focuses on the trade-off between precision and recall. ACC measures overall correctness. These metrics are widely used in binary classification tasks. AUC and AUPR provide an overall evaluation, while precision and recall focus on specific aspects of performance, and ACC provides a global measure of accuracy. By using these five metrics, the effectiveness of BEACON can be adequately captured. The model with the highest AUC performance on the validation set is selected for evaluation on the testing set. We report the mean and standard deviation of three independent runs with three different random seeds.
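The ranking interpretation of AUC described above can be computed directly from its definition (a sketch by exhaustive pair comparison; library implementations use a faster rank-based formula):

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC via its definition: the probability that a randomly chosen positive
    is scored above a randomly chosen negative (ties count 1/2)."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([1, 1, 0, 0, 1])
s = np.array([0.9, 0.8, 0.7, 0.2, 0.6])
auc = auc_score(y, s)   # 5 of the 6 positive-negative pairs are ordered correctly
```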
Supplementary Information
Additional file 1: Tables S1-S4. Table S1 - The statistics of incomplete samples in the Human dataset. Table S2 - The statistics of incomplete samples in the C.elegans dataset. Table S3 - The statistics of incomplete samples in the BIOSNAP dataset. Table S4 - The statistics of incomplete samples in the DrugBank dataset.
Additional file 2. Data values for the figures.
Acknowledgements
We thank Pengyong Li and Peng Zhou for their helpful discussion and feedback.
Abbreviations
- CPI
Compound-protein interaction
- KG
Knowledge graph
- BEACON
Bridging chEmicAl struCture and cOnceptual kNowledge
- CNN
Convolutional neural network
- AUC
Area under the curve
- AUPR
Area under the precision-recall curve
- ACC
Accuracy
- MLP
Multi-layer perceptron
- ECFP4
Extended-connectivity fingerprints of diameter 4
Authors’ contributions
X.Z. and Y.L. guided the project. W.T. conceived the original ideas, wrote the manuscript, and designed and performed the experiments. X.L. was involved in the discussion and polished the manuscript. L.Z. polished the manuscript. T.M. conceived the original ideas and was involved in the discussion. N.C. performed part of the experiments. J.J. refined the figures. S.Y. contributed to the manuscript’s revision process. All authors read and approved the final manuscript.
Funding
This work was partly supported by the National Nature Science Foundation of China (62372159, U22A2037, 62122025, 61972138, 62272151, 62102140, 62202413, 62425204, 62450002, 62432011, 62172002, 62202413), Hunan Provincial Natural Science Foundation of China (2022JJ20016), the Science and Technology Innovation Program of Hunan Province (2022RC1099, 2022RC1100), and Excellent Youth Funding of Hunan Provincial Education Department, China (No. 23B0129).
Data availability
All data generated or analyzed during this study are included in this published article, its supplementary information files and publicly available repositories. The data and source code used in this project are freely available at both the GitHub repository (https://github.com/wentao228/BEACON) and Zenodo (https://doi.org/10.5281/zenodo.13913963) [49]. The data values for the figures are provided in Additional file 2.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Yuansheng Liu, Email: yuanshengliu@hnu.edu.cn.
Sisi Yuan, Email: syuan4@charlotte.edu.
References
- 1.Ashburn TT, Thor KB. Drug repositioning: identifying and developing new uses for existing drugs. Nat Rev Drug Discov. 2004;3(8):673–83. [DOI] [PubMed] [Google Scholar]
- 2.Tsubaki M, Tomii K, Sese J. Compound-protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics. 2019;35(2):309–18. [DOI] [PubMed] [Google Scholar]
- 3.Zhao Q, Yang M, Cheng Z, Li Y, Wang J. Biomedical data and deep learning computational models for predicting compound-protein relations. IEEE/ACM Trans Comput Biol Bioinform. 2021;19(4):2092–110. [DOI] [PubMed] [Google Scholar]
- 4.Li S, Wan F, Shu H, Jiang T, Zhao D, Zeng J. MONN: a multi-objective neural network for predicting compound-protein interactions and affinities. Cell Syst. 2020;10(4):308–22. [Google Scholar]
- 5.Zheng S, Li Y, Chen S, Xu J, Yang Y. Predicting drug-protein interaction using quasi-visual question answering system. Nat Mach Intell. 2020;2(2):134–40. [Google Scholar]
- 6.Rube HT, Rastogi C, Feng S, Kribelbauer JF, Li A, Becerra B, et al. Prediction of protein-ligand binding affinity from sequencing data with interpretable machine learning. Nat Biotechnol. 2022;40(10):1520–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Kc GB, Bocci G, Verma S, Hassan MM, Holmes J, Yang JJ, et al. A machine learning platform to estimate anti-SARS-CoV-2 activities. Nat Mach Intell. 2021;3(6):527–35. [Google Scholar]
- 8.Quan Z, Guo Y, Lin X, Wang ZJ, Zeng X. Graphcpi: Graph neural representation learning for compound-protein interaction. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). San Diego: IEEE; 2019. p. 717–22.
- 9.Zeng X, Xiang H, Yu L, Wang J, Li K, Nussinov R, et al. Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework. Nat Mach Intell. 2022;4(11):1004–16. [Google Scholar]
- 10.Liu Y, Zhou Z, Cao X, Cao D, Zeng X. Effective drug-target affinity prediction via generative active learning. Inf Sci. 2024;679:121135.
- 11.Bonner S, Barrett IP, Ye C, Swiers R, Engkvist O, Bender A, et al. A review of biomedical datasets relating to drug discovery: a knowledge graph perspective. Brief Bioinform. 2022;23(6):bbac404. [DOI] [PubMed] [Google Scholar]
- 12.Ma T, Tao W, Li M, Zhang J, Pan X, Lin J, et al. KGExplainer: towards exploring connected subgraph explanations for knowledge graph completion. arXiv preprint arXiv:240403893. 2024.
- 13.Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. New York: Association for Computing Machinery; 2012.
- 14.Huang K, Xiao C, Glass LM, Sun J. MolTrans: molecular interaction transformer for drug-target interaction prediction. Bioinformatics. 2021;37(6):830–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Chen L, Tan X, Wang D, Zhong F, Liu X, Yang T, et al. TransformerCPI: improving compound-protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics. 2020;36(16):4406–14. [DOI] [PubMed] [Google Scholar]
- 16.Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(11):2579–605.
- 17.Gunthorpe MJ, Large CH, Sankar R. The mechanism of action of retigabine (ezogabine), a first-in-class K+ channel opener for the treatment of epilepsy. Epilepsia. 2012;53(3):412–24. [DOI] [PubMed] [Google Scholar]
- 18.Pei Q, Wu L, Zhu J, Xia Y, Xie S, Qin T, et al. Breaking the barriers of data scarcity in drug-target affinity prediction. Brief Bioinform. 2023;24(6):bbad386. [DOI] [PubMed] [Google Scholar]
- 19.Bordes A, Usunier N, Garcia-Duran A, Weston J, Yakhnenko O. Translating embeddings for modeling multi-relational data. In: Advances in neural information processing systems. Red Hook: Curran Associates Inc.; 2013.
- 20.Cui Y, Wang Y, Sun Z, Liu W, Jiang Y, Han K, et al. Lifelong embedding learning and transfer for growing knowledge graphs. In: Proceedings of the AAAI Conference on Artificial Intelligence. AAAI Press; 2023;37:4217–24.
- 21.Luo Y, Zhao X, Zhou J, Yang J, Zhang Y, Kuang W, et al. A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nat Commun. 2017;8(1):573. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Zeng X, Zhu S, Lu W, Liu Z, Huang J, Zhou Y, et al. Target identification among known drugs by deep learning from heterogeneous networks. Chem Sci. 2020;11(7):1775–97. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Durant JL, Leland BA, Henry DR, Nourse JG. Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci. 2002;42(6):1273–80. [DOI] [PubMed] [Google Scholar]
- 24.Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model. 2010;50(5):742–54. [DOI] [PubMed] [Google Scholar]
- 25.Landrum G. RDKit: A software suite for cheminformatics, computational chemistry, and predictive modeling. Greg Landrum. 2013;8(31.10):5281. [Google Scholar]
- 26.Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in neural information processing systems. Red Hook: Curran Associates Inc.; 2017.
- 27.Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci. 2021;118(15):e2016239118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (Long and Short Papers). Minneapolis: Association for Computational Linguistics; 2019;1:4171–4186.
- 29.Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006;313(5786):504–7. [DOI] [PubMed] [Google Scholar]
- 30.Lin Y, Gou Y, Liu Z, Li B, Lv J, Peng X. Completer: incomplete multi-view clustering via contrastive prediction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Nashville: IEEE; 2021. p. 11174–83.
- 31.He K, Chen X, Xie S, Li Y, Dollár P, Girshick R. Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. New Orleans: IEEE; 2022. pp. 16000–9.
- 32.Tao W, Liu Y, Lin X, Song B, Zeng X. Prediction of multi-relational drug-gene interaction via Dynamic hyperGraph Contrastive Learning. Brief Bioinform. 2023;24(6):bbad371. [DOI] [PubMed] [Google Scholar]
- 33.Ma T, Chen Y, Tao W, Zheng D, Lin X, Pang CI, et al. Learning to denoise biomedical knowledge graph for robust molecular interaction prediction. IEEE Trans Knowl Data Eng. 2024;1–13.
- 34.Oord Avd, Li Y, Vinyals O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:180703748. 2018.
- 35.Liu H, Sun J, Guan J, Zheng J, Zhou S. Improving compound-protein interaction prediction by building up highly credible negative samples. Bioinformatics. 2015;31(12):i221–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Zitnik M, Sosic R, Leskovec J. BioSNAP Datasets: Stanford Biomedical Network Dataset Collection. 2018;5(1). https://snap.stanford.edu/biodata. Accessed 20 Aug 2022.
- 37.Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2018;46(D1):D1074–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Zitnik M, Agrawal M, Leskovec J. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics. 2018;34(13):i457–66. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Zhang M, Chen Y. Link prediction based on graph neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18). Red Hook: Curran Associates Inc.; 2018:5171–81.
- 40.Ioannidis V, Song X, Manchanda S, Li M, Pan X, Zheng D, et al. DRKG-Drug Repurposing Knowledge Graph for COVID-19. 2020. https://github.com/gnn4dr/DRKG. Accessed 20 Aug 2022.
- 41.Wen M, Zhang Z, Niu S, Sha H, Yang R, Yun Y, et al. Deep-learning-based drug-target interaction prediction. J Proteome Res. 2017;16(4):1401–9. [DOI] [PubMed] [Google Scholar]
- 42.Hinton GE. Deep belief networks. Scholarpedia. 2009;4(5):5947. [Google Scholar]
- 43.Lee I, Keum J, Nam H. DeepConv-DTI: prediction of drug-target interactions via deep learning with convolution on protein sequences. PLoS Comput Biol. 2019;15(6):e1007129. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Bai P, Miljković F, John B, Lu H. Interpretable bilinear attention network with domain adaptation improves drug-target prediction. Nat Mach Intell. 2023;5(2):126–36. [Google Scholar]
- 45.Ba JL, Kiros JR, Hinton GE. Layer normalization. arXiv preprint arXiv:160706450. 2016.
- 46.Nair V, Hinton GE. Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th international conference on machine learning (ICML-10). Madison: Omnipress; 2010. pp. 807–14.
- 47.Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. Pytorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems. 2019;32.
- 48.Kingma DP, Ba JL. Adam: A Method for Stochastic Optimization. In: International Conference on Learning Representations (ICLR). 2015.
- 49.Tao W, Lin X, Liu Y, Zeng L, Ma T, Cheng N, et al. Bridging chemical structure and conceptual knowledge enables accurate prediction of compound-protein interaction. Zenodo. 2024. 10.5281/zenodo.13913963. [DOI] [PMC free article] [PubMed] [Google Scholar]