Abstract
Motivation
Predicting protein–ligand binding sites is crucial in studying protein interactions with applications in biotechnology and drug discovery. Two distinct paradigms have emerged for this purpose: sequence-based methods, which leverage protein sequence information, and structure-based methods, which rely on the three-dimensional (3D) structure of the protein. Here, we analyze a hybrid approach that combines the strengths of both paradigms by integrating two recent deep learning architectures: protein language models (pLMs) from the sequence-based paradigm and Graph Neural Networks (GNNs) from the structure-based paradigm. Specifically, we construct a residue-level Graph Attention Network (GAT) model based on the protein’s 3D structure that uses pre-trained pLM embeddings as node features. This integration enables us to study how the interplay between the sequential information encoded in the protein sequence and the spatial relationships within the protein structure affects model performance.
Results
Using a benchmark dataset covering a range of ligands and ligand types, we show that using the structure information consistently enhances the predictive power of the baselines in absolute terms. Nevertheless, as more complex pLMs are used to represent node features, the relative impact of the structure information represented by the GNN architecture diminishes. These observations suggest that although the use of the experimental protein structure almost always improves binding site prediction accuracy, complex pLMs still contain structural information that leads to good predictive performance even without the use of 3D structure.
Availability and implementation
The datasets generated and/or analyzed during the current study, as well as pretrained models, are available in the following Zenodo link https://zenodo.org/records/15184302. The source code that was used to generate the results of the current study is available in the following GitHub repository https://github.com/hamzagamouh/pt-lm-gnn as well as in the following Zenodo link https://zenodo.org/records/15192327.
1 Introduction
Proteins are fundamental biomolecules that play a critical role in the functioning of all living organisms. They are involved in various biological processes such as signal transduction or cell regulation and interact with other macromolecules and small molecules to perform their functions. The interaction is mediated through binding sites on the protein surface, as well as through buried sites accessible via entry channels (Pravda et al. 2014). These binding sites contain residues crucial for the recognition and binding of the ligand molecule. Thus, the study of protein–ligand binding sites and binding residues is essential to understand the fundamental mechanisms of biological processes with a profound impact on applications such as drug discovery (Ferreira et al. 2015, Konc and Janežič 2022) and biotechnology (Kim et al. 2017).
With the rapid advances in computational techniques over the last two decades, various methods have been developed for detecting protein–ligand binding sites. The methods use diverse algorithms and exploit different types of information from protein sequences and 3D structures; based on this, the approaches can be broadly categorized into sequence-based and structure-based methods (Roche et al. 2015, Zhao et al. 2020).
Before describing the existing methods, we should emphasize that the problem of predicting protein–ligand interactions can be approached in two main ways: binding residue prediction, where sequence-based methods are mainly used, and binding site prediction, where structure-based methods are the most appropriate. Binding residue prediction involves labeling individual residues of the protein depending on whether they belong to a binding site. In contrast, binding site prediction aims at detecting surface regions capable of accommodating ligands that can potentially bind to the protein.
Sequence-based methods operate on amino acid sequences and are characterized by their ability to identify binding residues solely from protein sequence data. Although sequence-based methods can only predict individual binding residues and not full binding sites, they are still relevant in many applications, such as variant effect prediction, as the mutation of a binding residue is more likely to have a detrimental impact by hampering the protein’s ability to bind ligands (Kim et al. 2017).
Traditional sequence-based tools, such as ConSurf (Ashkenazy et al. 2010) and S-Site (Yang et al. 2013), are template-based methods that use proteins with known binding sites as templates together with the evolutionary conservation information to predict binding residues from highly conserved regions of the protein.
In contrast, more recent methods rely on machine learning algorithms to make predictions. With the exponential increase in the size of biological databases (Tiwary 2022), there has been an explosion of machine learning methods to solve all kinds of tasks in bioinformatics (Serra et al. 2018). In the context of sequence-based methods for protein–ligand binding site prediction, different machine learning-based methods utilize different types of information about a protein sequence and its amino acids.
Several methods use Support Vector Machines (SVM) and Random Forest (RF) as their main classification algorithms and various input features. TargetS (Yu et al. 2013) constructs features using evolutionary information from Position Specific Scoring Matrix (PSSM), predicted secondary structure, and ligand-specific binding propensities of residues. ATPint (Chauhan et al. 2009) utilizes evolutionary information, hydrophobicity, and other predicted features such as average accessible surface area. NsitePred (Chen et al. 2012) computes features from the predicted secondary structure and uses additional information such as the predicted relative solvent accessibility (RSA) and dihedral angles, as well as PSSM features and residue conservation scores. LigandDSES (Chen et al. 2016) and LigandRFs (Chen et al. 2014) use amino acid physico-chemical properties provided by the AAIndex database (Kawashima and Kanehisa 2000).
Deep learning methods have attracted enormous attention from bioinformaticians in recent years (Li et al. 2019) due to their potential for automatically learning complex representations from vast amounts of available data and due to their recent success in other fields, such as Natural Language Processing (NLP) (Khurana et al. 2023) and Computer Vision (CV) (Chai et al. 2021). Deep learning has also been used for binding residue detection in methods such as DeepBind (Alipanahi et al. 2015) and DeepCSeqSite (Cui et al. 2019). These approaches use Convolutional Neural Networks (CNNs) on protein sequences to predict binding residues. DeepBind uses residue types as input features, while DeepCSeqSite relies on various types of information, such as position-specific scoring matrix (PSSM), secondary structure (SS), dihedral angle (DA), and conservation scores (CS).
Recently, language models (LMs) have emerged as a viable option to represent protein sequences. Large LMs have become the standard method in NLP (Min et al. 2024) due to their remarkable performance in a wide range of language-related tasks. An example of a very successful LM is the famous ChatGPT, based on the GPT-3 architecture (Brown et al. 2020), which can generate human-like responses in conversation. In bioinformatics, LMs have also been applied to address various challenges related to protein analysis (Unsal et al. 2022, Lin et al. 2022b, Zheng et al. 2023).
An LM is a deep learning architecture that is trained to learn complex representations of text input, also called embeddings, from an extensive corpus of text. LMs are built upon two key ideas in NLP: masked language modeling and the transformer architecture. Masked language modeling (Devlin et al. 2018) is a self-supervised learning strategy based on masking parts of the text and training the model to predict the missing parts. This strategy benefits from vast amounts of available unannotated data and forces the model to learn general embeddings that can be fine-tuned on downstream tasks where data is scarce. The transformer architecture (Vaswani et al. 2017) relies on the attention mechanism, which helps the model attend only to relevant parts of the input by learning attention weights over different parts of a text input.
Treating protein amino acids as words and sequences as sentences of a natural language opens a way to apply language modeling techniques to proteomics. Recently, several protein language models (pLMs) (Ferruz and Höcker 2022) were constructed by training Transformer architectures on large protein sequence datasets. The learned embeddings of protein sequences were then successfully applied to the prediction of various protein characteristics, such as protein structure (Rao et al. 2020, Høie et al. 2022) or protein-protein interactions (Wang et al. 2019, Jha et al. 2023). In our recent work, we explored the potential of pLMs to predict protein–ligand binding residues, showing superior performance over several state-of-the-art methods on multiple datasets (Hoksza and Gamouh 2022). In a broader view, the binding residue prediction problem can be viewed as a type of more general task of protein residue annotation, such as post-translational modification prediction, where, indeed, pLMs have also been successfully applied (Pokharel et al. 2022, Pratyush et al. 2023).
On the other hand, structure-based methods for protein–ligand binding site prediction utilize features derived from the protein 3D structure. Structure-based methods differ in how they represent the 3D protein structure and in the algorithm used to make the predictions.
FINDSITE (Brylinski and Skolnick 2008) is a 3D template-based method that uses a threading algorithm based on binding-site similarity to groups of template structures. 3DLigandSite (Wass et al. 2010) and FunFOLD (Roche et al. 2011) are also template-based methods that combine sequence and structure similarity to extract homologous proteins from PDB, from which ligands are extracted, superimposed, and clustered to determine the binding site associated with each cluster. Various other methods apply geometrical measurements over the 3D structure to detect cavities or hollows on the protein’s surface. SURFNET (Laskowski 1995) is a method that positions spheres within the space between two protein atoms. LIGSITE (Hendlich et al. 1997) detects pockets with a series of simple operations on a cubic grid. FPocket (Le Guilloux et al. 2009) is based on Voronoi tessellation and alpha spheres. CurPocket (Liu et al. 2020) defines the binding sites by identifying clusters of concave regions from the curvature distribution of the protein surface. Methods such as Q-SiteFinder (Laurie and Jackson 2005), FTSite (Ngan et al. 2012), and SiteComp (Lin et al. 2012) are energy-based methods. Such methods place probes on the protein surface and subsequently locate cavities by estimating the energy potentials between the probes and the cavities. In addition to template-based, geometry-based, and energy-based methods, machine learning methods rely on 3D structural features, sometimes combined with other features, to train various machine learning algorithms. For instance, P2Rank (Krivák and Hoksza 2018) labels solvent-accessible surface points of the protein by using the Random Forest algorithm on a set of handcrafted physicochemical and structural features. The ligandable points are then clustered to obtain the binding pockets. Recently, deep-learning methods have been introduced for structure-based binding residue/site prediction as well. Often, the methods represent the protein structure as a 3D grid of voxels and use a 3D Convolutional Neural Network (CNN) (O’Shea and Nash 2015) as their primary model architecture to learn the binding sites. These methods differ mainly in the input features and model hyperparameters. DeepSite (Jiménez et al. 2017), PUResNet (Kandel et al. 2021), and DeepSurf (Mylonas et al. 2021) employ atomic chemical properties, DeepDrug3D (Pu et al. 2019) is based on interaction energies of ligand atoms with protein residues, while Deeppocket (Aggarwal et al. 2022) uses atom types. More recent methods, such as SiteRadar (Evteev et al. 2023), GraphPLBR (Wang et al. 2023), EquiPocket (Zhang et al. 2023), GraphBind (Xia et al. 2021), GraphSite (Yuan et al. 2022), and SKITTLES (Evteev et al. 2025), use different variations of the Graph Neural Network (GNN) architecture and have demonstrated state-of-the-art performance.
GNN is a class of neural networks designed to operate on graphs and other structured data (Veličković 2023). GNNs are based on the idea of representing the input data as a graph and propagating node information between the graph nodes. Each node is associated with a feature vector containing the node features. These features are iteratively updated by aggregating information from neighboring nodes using a series of message-passing steps. This property of GNNs enables the model to capture the graph’s local structure and learn more structure-based and context-aware embeddings. Methods based on GNNs may also benefit from large libraries of predicted protein structures by methods like AlphaFold (Jumper et al. 2021, Varadi et al. 2022). The primary output of a GNN is node feature vectors, which can be used for various node-level and graph-level downstream tasks. In recent years, GNNs have been applied extensively in bioinformatics and have shown state-of-the-art results across multiple tasks (Zhang et al. 2021).
In the following sections, we analyze the interplay of protein sequence and structure information by building a machine learning model that exploits two recent state-of-the-art deep learning architectures, a Graph Neural Network augmented with protein-language model embeddings. Particularly, we want to address the following research questions: Can we improve the prediction performance by fusing both approaches? How much does the structure information from GNNs contribute to the predictive power of the solely sequence-based pLMs?
2 Materials and methods
The high-level view of our approach, sketched in Fig. 1, is as follows. The first input of the pipeline is the protein sequence of single-letter amino acid codes. The sequence is processed by a pLM (Embedder), which computes an embedding for each amino acid in the sequence, i.e., residue-level embeddings. The second input is the corresponding protein 3D structure, described as a set of atom 3D coordinates. The structure is converted to a graph by the protein graph constructor (described in Subsection 2.1). In the protein graph, nodes correspond to residues, labeled with the residue-level embeddings, and edges connect residues that are close in 3D space. The protein graph is then processed by a GNN that predicts binding probabilities for each residue. Using a threshold, the predicted probabilities are converted to binding residue labels (binding vs. non-binding).
Figure 1.
General architecture of our models.
Furthermore, we measured the effect of the structure information that comes from the GNN models by comparing them to a baseline model, which is a sequence-based model that lacks graph structure information. The sequence baseline takes the residue-level embeddings as input and feeds them to a multi-layer perceptron, which predicts the binding residue probability.
As mentioned, we use the Graph Neural Network (GNN) as our primary model architecture. Different GNN architectures vary in how they aggregate information from other nodes to transform the feature vectors. In our approach, we compare two well-known GNN architectures—Graph Convolutional Network (GCN) (Kipf and Welling 2016) and Graph Attention Network (GAT) (Veličković et al. 2017).
The GCN uses convolutional operations to learn feature representations of nodes in a graph. The principle of GCNs is based on the idea of adapting convolutional neural networks (CNNs) (O’Shea and Nash 2015) to the graph domain by replacing the regular grid-like structure of image data with an irregular graph structure. By analogy, GCNs define a convolution operation on graphs, which involves aggregating information from the node’s neighbors and updating the node’s feature representation accordingly. The graph convolution works by learning a trainable weight matrix shared across all nodes, enabling the GCN to learn a set of filters specific to the graph structure.
The GAT follows the trend of the attention mechanism of the NLP Transformer architectures (Vaswani et al. 2017). The model attends differently to different parts of a given node neighborhood by assigning importance scores to each neighbor based on their relevance to the current node. The attention mechanism enables the GAT to focus on the most relevant nodes in the graph while ignoring noise and irrelevant information. Figure 2 shows the architectural differences between GCNs and GATs.
Figure 2.
Comparison of GCN and GAT architectures.
2.1 Protein graph construction
To use the GNN architecture, the protein needs to be represented as a graph with node features. In general, the strength of electrostatic interactions is inversely proportional to the square of the distance between atoms, so it is physically plausible to enable information sharing between parts of the protein that are close to each other. Therefore, to construct the protein graph, we started with the 3D structure of the protein and built a proximity graph at the residue level. Nodes correspond to residues of the protein, and edges represent the closeness of residues to each other. Two residues are connected if the distance between their alpha-carbon atoms is less than a threshold distance. In this work, we explored the following thresholds: 4, 6, 8, and 10 Å.
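The construction can be summarized by the following sketch, which builds the residue-level proximity graph from an array of alpha-carbon coordinates with DGL; the function name and the exact treatment of edge direction and self-loops are illustrative choices, not a verbatim excerpt of our implementation.

```python
import numpy as np
import dgl
import torch

def build_residue_graph(ca_coords: np.ndarray, cutoff: float = 8.0) -> dgl.DGLGraph:
    """Connect residues whose alpha-carbon atoms lie within `cutoff` angstroms.

    ca_coords: (N, 3) array of alpha-carbon coordinates, one row per residue.
    """
    # Pairwise Euclidean distances between alpha carbons
    dists = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    # Keep pairs below the cutoff, excluding self-loops
    src, dst = np.where((dists < cutoff) & ~np.eye(len(ca_coords), dtype=bool))
    graph = dgl.graph((torch.as_tensor(src), torch.as_tensor(dst)),
                      num_nodes=len(ca_coords))
    return graph

# Residue-level pLM embeddings are then attached as node features, e.g.:
# graph.ndata["feat"] = torch.as_tensor(embeddings, dtype=torch.float32)
```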
2.2 Protein language model embeddings
pLMs process sequences of amino acid letters and return two kinds of embeddings: an embedding for the whole protein sequence and an embedding for each sequence letter, i.e., residue-level embeddings. The latter embeddings can be directly used as node features of the protein graph. In this work, we used the following pLMs: two pLMs that are part of the ProtTrans project (Elnaggar et al. 2022), ProtBERT-BFD, which was pre-trained on BFD (Steinegger et al. 2019), and ProtT5-XL-UniRef50 (Prot-T5), which was pre-trained on BFD and fine-tuned on UniRef50 (Suzek et al. 2015). Both embeddings were computed using the bio-embeddings Python library (https://docs.bioembeddings.com/v0.2.3/). Moreover, we used SeqVec (Heinzinger et al. 2019) embeddings, obtained also using the bio-embeddings library, as well as ESM-2 embeddings (Lin et al. 2022a) obtained using the model file esm2_t36_3B_UR50D from the ESM GitHub repository (https://github.com/facebookresearch/esm). Both the SeqVec and ESM-2 pLMs were pretrained on the UniRef50 dataset. For all the above pLMs, the encoder part of the model was used to compute the embeddings, which were extracted from the last layer of the encoder. This represents the standard strategy used for evaluating the pre-trained embeddings on downstream tasks in the original papers (Heinzinger et al. 2019, Elnaggar et al. 2022, Lin et al. 2022a). Further information about the embeddings, such as the number of parameters and embedding dimension, can be found in Supplementary Table S1.
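As an illustration of the extraction step for ESM-2, the following sketch follows the standard usage of the fair-esm package; the sequence is a placeholder, and loading the 3B-parameter checkpoint requires considerable memory.

```python
import torch
import esm  # fair-esm package from https://github.com/facebookresearch/esm

# Load ESM-2 (3B parameters, 36 layers) and its tokenizer
model, alphabet = esm.pretrained.esm2_t36_3B_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # placeholder sequence
_, _, tokens = batch_converter([("query", sequence)])

with torch.no_grad():
    out = model(tokens, repr_layers=[36])

# Residue-level embeddings from the last encoder layer;
# positions 1..L skip the BOS token and stop before EOS.
residue_embeddings = out["representations"][36][0, 1 : len(sequence) + 1]
print(residue_embeddings.shape)  # (L, 2560) for esm2_t36_3B_UR50D
```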
2.3 AA index
pLMs are context-aware, resulting in different feature vectors for the same amino acid in different sequential contexts. To test the effect of information propagation through the protein graph (see Subsection 3.6), we also generated context-independent feature vectors, i.e., vectors whose values are not dependent on the neighborhood, serving as good baseline node features for our GNN models. For that purpose, we used the AAIndex database (Kawashima and Kanehisa 2000), a large collection of physicochemical and biochemical properties of amino acids. Using the AAIndex database, we constructed node features by collecting all returned properties of the respective amino acid into one vector. We used the Python AAIndex library (https://github.com/amckenna41/aaindex) to extract AAIndex features. The AAIndex features were normalized over all amino acids, resulting in 566-dimensional feature vectors.
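A minimal sketch of how such context-independent node features can be assembled is shown below; loading the raw AAIndex property table is left abstract here (the aaindex package handles it in our pipeline), and the helper name is illustrative.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def normalize_aaindex_table(raw_table: dict) -> dict:
    """z-score normalize each AAIndex property over the 20 amino acids.

    raw_table maps each one-letter amino acid code to its vector of
    AAIndex property values (566-dimensional in our setting).
    """
    matrix = np.stack([raw_table[aa] for aa in AMINO_ACIDS])  # (20, 566)
    mean, std = matrix.mean(axis=0), matrix.std(axis=0)
    std[std == 0] = 1.0  # guard against constant properties
    normalized = (matrix - mean) / std
    return {aa: normalized[i] for i, aa in enumerate(AMINO_ACIDS)}

# Node features for a protein are then a simple per-residue lookup:
# features = np.stack([table[aa] for aa in sequence])
```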
2.4 Datasets
As our main dataset, we used a benchmark designed by Yu et al. (2013) (hereafter referred to as the Yu benchmark), involving 12 different ligands, to build and test our models. Second, to validate that our methodology is on par with recent GNN-based approaches, we evaluated it on another dataset for protein-DNA and protein-RNA binding sites from the works of GraphBind (Xia et al. 2021) and GraphSite (Yuan et al. 2022), details of which are given in Supplementary Table S8. Lastly, to evaluate the generalizability of our findings, we created a new dataset based on the PDBBind dataset (Wang et al. 2004), with further details provided in Subsection 3.7.
The benchmarking dataset designed by Yu et al. (2013) contains training and independent test sets of protein sequences and their corresponding actual binding residues for 12 different ligands: 5 nucleotides (AMP, ADP, ATP, GTP, GDP), DNA, HEME, and 5 ions (Ca²⁺, Mg²⁺, Mn²⁺, Fe³⁺, Zn²⁺). For the remainder of the text and in all tables, we refer to the ions using their atom names (CA, MG, MN, FE, ZN).
As the benchmark was used to test several sequence-based methods, such as Yu et al. (2013) and Hoksza and Gamouh (2022), and given that our method has a structural component, we needed to collect the corresponding 3D structures of the protein sequences. To achieve this, we downloaded the entire BioLip dataset (Yang et al. 2013), which was used to construct the benchmark, and we extracted the tertiary structures of the sequences by matching their PDB IDs and chain IDs. For sequences whose corresponding structures were not found in BioLip, we used the latest version of PDB (Berman et al. 2000) to extract the structures.
The PDB files were first parsed by the Biopython library (https://biopython.org/) in order to obtain the sequences and the atomic coordinates. Some of the sequences obtained from the Biopython parser underwent minor manual corrections to match them with the sequences from the benchmark dataset. In total, the letters of some modified residues were changed for 12 sequences, one residue was skipped for 13 sequences, and 2 sequences were skipped due to a high mismatch between the sequence retrieved from the benchmark and the sequence retrieved after processing the PDB file. Finally, each residue from a sequence was associated with a 3D coordinate. The obtained coordinates were used to construct the protein graphs as described in Subsection 2.1, using the Python Deep Graph Library (DGL, https://www.dgl.ai/).
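For illustration, extracting per-chain alpha-carbon coordinates with Biopython can look roughly as follows; the file and chain identifiers are placeholders, and the real preprocessing additionally applied the manual corrections described above.

```python
import numpy as np
from Bio.PDB import PDBParser

def chain_ca_coordinates(pdb_path: str, chain_id: str) -> np.ndarray:
    """Return an (N, 3) array of alpha-carbon coordinates for one chain."""
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure("protein", pdb_path)
    chain = structure[0][chain_id]  # first model, requested chain
    coords = [
        residue["CA"].get_coord()
        for residue in chain
        if "CA" in residue  # skip hetero residues / waters without an alpha carbon
    ]
    return np.asarray(coords, dtype=np.float32)

# coords = chain_ca_coordinates("101m.pdb", "A")  # placeholder PDB file and chain
```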
We also need to note that due to technical problems with the ProtT5 embeddings, we could not obtain embeddings for all of the proteins. In total, we could not obtain the protein graphs for 31 sequences. The sequences for which we could not generate the embeddings consisted of training sequences only, so this issue did not affect the reported results, as those were based on the test sets. Table 1 illustrates statistics of the benchmark datasets as well as the number of protein graphs obtained after the preprocessing phase.
Table 1.
Yu benchmark summary.
| Ligand | Training: Sequences | Training: Missing protein graphs | Training: Binding residues | Training: Non-binding residues | Test: Sequences | Test: Binding residues | Test: Non-binding residues |
|---|---|---|---|---|---|---|---|
| ATP | 221 | 0 | 3021 | 72 334 | 50 | 647 | 16 639 |
| ADP | 296 | 0 | 3833 | 98 740 | 47 | 686 | 20 327 |
| AMP | 145 | 0 | 1603 | 44 401 | 33 | 392 | 10 355 |
| GDP | 82 | 0 | 1101 | 26 244 | 14 | 194 | 4180 |
| GTP | 54 | 1 | 745 | 21 205 | 7 | 89 | 1868 |
| CA | 965 | 4 | 4914 | 287 801 | 165 | 785 | 53 779 |
| ZN | 1168 | 16 | 4705 | 315 235 | 176 | 744 | 47 851 |
| MG | 1138 | 7 | 3860 | 350 716 | 217 | 852 | 72 002 |
| MN | 335 | 1 | 1496 | 112 312 | 58 | 237 | 17 484 |
| FE | 173 | 1 | 818 | 50 453 | 26 | 120 | 9092 |
| DNA | 335 | 0 | 6461 | 71 320 | 52 | 973 | 16 225 |
| HEME | 206 | 1 | 4380 | 49 768 | 27 | 580 | 8630 |
2.5 Model hyperparameters
For building our models, we used the implementation of GCN and GAT provided by the Python library DGL-LifeSci (https://lifesci.dgl.ai/), and we trained and evaluated the models using the Pytorch Python library (https://pytorch.org/). Our GCN architecture consisted of graph convolutional layers of size 512 with ReLU activation, a dropout rate of 0.5 (Srivastava et al. 2014), residual connections (He et al. 2016), and batch normalization (Ioffe and Szegedy 2015). Our GAT architecture, in turn, consisted of graph attention layers of size 512, ReLU activations, a dropout rate of 0.5, 4 attention heads, and residual connections. We used a dense layer with two softmax units on top of the GCN and GAT models to compute the node-level outputs. We also utilized a weighted version of the binary cross-entropy loss due to the high class imbalance of the datasets and the AdamW optimizer (Loshchilov and Hutter 2017) as the optimization algorithm with learning_rate = 3e-4 and weight_decay = 1e-5, and we trained all the models for 2000 epochs with a batch_size = 32. Since the process of training and evaluating the GNN models on the pLM embeddings is time-consuming, the hyperparameters of the GNN models were chosen after manual tuning on a random validation split from the training set. The range of values tried in the manual tuning is described in Supplementary Tables S6 and S7.
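The following sketch shows a single-attention-layer model corresponding to the GAT configuration above, written directly with DGL's GATConv rather than the DGL-LifeSci wrapper we actually used; the hyperparameter values follow the text, while the remaining details (head aggregation by concatenation, returning logits) are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import dgl
from dgl.nn import GATConv

class ResidueGAT(nn.Module):
    """One graph-attention layer followed by a per-residue classification head."""

    def __init__(self, in_dim: int, hidden_dim: int = 512, num_heads: int = 4):
        super().__init__()
        self.gat = GATConv(
            in_dim, hidden_dim, num_heads=num_heads,
            feat_drop=0.5, attn_drop=0.5, residual=True,
            activation=torch.relu,
        )
        # Two output units (non-binding / binding); the softmax is applied by the
        # loss function or when converting logits to probabilities.
        self.head = nn.Linear(hidden_dim * num_heads, 2)

    def forward(self, graph: dgl.DGLGraph, feats: torch.Tensor) -> torch.Tensor:
        h = self.gat(graph, feats)   # (num_nodes, num_heads, hidden_dim)
        h = h.flatten(1)             # concatenate the attention heads
        return self.head(h)

# Weighted cross-entropy for the class imbalance and AdamW, as described above:
# model = ResidueGAT(in_dim=1024)
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-5)
# loss_fn = nn.CrossEntropyLoss(weight=class_weights)  # class_weights: tensor of size 2
# binding_prob = torch.softmax(model(graph, feats), dim=-1)[:, 1]
```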
Regarding the sequence baseline models, we compared three model classes: Multi-Layer Perceptron (MLP), Random Forest (RF), and SVM. The models were built using the embeddings from the ProtT5 language model. To select the sequence baseline architecture used in the remaining experiments, we performed fivefold Cross-Validation (CV) on the ADP ligand training set using different hyperparameters of the model classes. The results of the fivefold CV can be found in Supplementary Table S4. The SVM and RF models were implemented using the Scikit-learn Python library (Pedregosa et al. 2011). The MLP classifiers were trained using the Pytorch Python library for 2000 epochs with a batch size of 32, and the reported validation scores of the MLPs represent the best validation scores obtained during the 2000 epochs of training. To account for class imbalance in the sequence baselines, we used weighted binary cross-entropy as the loss function for the MLPs, and we set the class_weight parameter to 'balanced' in the Scikit-learn implementation of the RF and SVM. Based on the fivefold CV results, we chose the sequence baseline model for all remaining experiments to be a single-layer MLP with 512 units and a dropout rate of 0.1, as it achieved the best mean CV score.
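The selected sequence baseline corresponds roughly to the following PyTorch module; the activation function between the layers is not specified in the text and is an assumption of this sketch.

```python
import torch.nn as nn

class SequenceBaselineMLP(nn.Module):
    """Per-residue classifier on pLM embeddings, without any graph information."""

    def __init__(self, in_dim: int, hidden_dim: int = 512, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),                 # activation assumed; not stated in the text
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 2),  # binding vs. non-binding logits
        )

    def forward(self, embeddings):
        return self.net(embeddings)
```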
3 Results and discussion
To evaluate the residue-level predictions of our models, we used standard binary classification metrics. Specifically, we have chosen to show our results with respect to the Matthews Correlation Coefficient (MCC) due to the significant class imbalance present in the datasets, as it has been shown that the MCC metric is one of the most suitable metrics in such cases (Chicco and Jurman 2020).
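For reference, for binary confusion-matrix counts TP, TN, FP, and FN, the MCC is defined as

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}},$$

which takes values between −1 and 1, with 0 corresponding to performance no better than random, and accounts for all four confusion-matrix entries even when the two classes are highly imbalanced.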
Our recent work (Hoksza and Gamouh 2022) shows that more complex LMs often yield better performance. Therefore, we used the ProtT5 embeddings in most of our experiments as one of the most complex pLMs.
We used a random split of the processed benchmark training sets to obtain training and validation sets. The training/validation split ratio was designed for the validation sets to have the same size as the independent test sets. The validation sets were used to define the early stopping epoch while training the models. The training was stopped at the epoch with the best validation MCC. In the subsequent sections, we report the results of the independent test sets.
3.1 Effect of the number of graph convolutional layers
The effect of information propagation through the protein graph can best be seen by varying the number of convolutional layers. One round of graph convolution collects information from the neighborhood of a given node. Thus, as the number of graph convolutions increases, a given node will have access to more distant neighbors since the one-hop neighbors will already contain information about farther neighbors in their hidden features computed from previous rounds of graph convolution. Therefore, increasing the number of convolutional layers enables information propagation between distant parts of the graph. To test the effect of the number of convolutional layers on the prediction performance, we used graphs constructed using a 6 Å cutoff distance, ProtT5 embeddings, and we varied the number of graph convolutional layers in our standard GCN architecture; specifically, we tried 1, 2, 4, and 6 layers. Furthermore, we report the mean and standard deviation of the validation MCC score for fivefold CV splits. The results are shown in Fig. 3, which was created using the Supplementary Table S1. The reported validation scores represent the best validation score obtained while training the models for 2000 epochs.
Figure 3.
Effect of the number of graph convolutional layers. The bars represent the mean of the validation MCC scores for 5-fold CV splits. The error bars represent the standard deviation of the validation MCC scores. The colors correspond to the number of graph convolutional layers of 512 units.
We can observe that for about half of the ligand datasets, the models constructed using different numbers of convolutional layers have very similar performance. Moreover, for most of the remaining ligand datasets, adding more graph convolutional layers decreases the performance. This suggests that there is little positive effect of adding more graph convolutional layers.
Based on the above observations, we decided to use a single-layered GNN architecture and an arbitrary random split with the same random seed in the remaining experiments. Another reason for choosing a single layer in the following experiments is to avoid the common oversmoothing problem in GNNs (Rusch et al. 2023), where deep GNNs result in nearly indistinguishable node features in the last layers of the network, which may result in poor performance in downstream tasks.
3.2 Effect of graph cutoff distances
Next, we tested the effect of the graph cutoff distance. The cutoff distance influences the graph’s connectivity, as a higher cutoff distance results in more connections and thus leads to a more densely connected graph. In such a graph, a given node has more neighbors, and therefore more nodes are taken into account in information propagation to determine the state of the given node. A typical cutoff seen in other works is 6 Å, computed based on the distance of alpha carbons (Fout et al. 2017). In this work, we tested the following cutoff distances: 4 Å, 6 Å, 8 Å, and 10 Å. Moreover, we constructed an ensemble model from the models trained on graphs built using the above cutoff distances. This model combines the predicted binary classes from each cutoff distance and outputs the majority class. An ensemble model that uses multiple cutoff distances removes the bias of choosing a predefined cutoff distance. Therefore, it has the potential to improve the generalization capability of the GNN.
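A minimal sketch of this majority-vote combination is shown below; it assumes the four cutoff-specific models have already produced binary labels, and since the text does not specify how a two-against-two tie is resolved, the tie-breaking rule here is an assumption.

```python
import numpy as np

def ensemble_majority_vote(per_cutoff_labels: np.ndarray) -> np.ndarray:
    """Combine binary residue labels from models trained at different cutoffs.

    per_cutoff_labels: (n_models, n_residues) array of 0/1 predictions,
    one row per cutoff distance (4, 6, 8 and 10 angstroms in our setting).
    """
    votes = per_cutoff_labels.sum(axis=0)
    # A residue is labeled binding if more than half of the models agree;
    # with an even number of models, ties are resolved toward non-binding here.
    return (votes > per_cutoff_labels.shape[0] / 2).astype(int)

# labels = ensemble_majority_vote(np.stack([pred_4, pred_6, pred_8, pred_10]))
```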
Table 2 compares the different cutoff distances and the ensemble model. We can observe that although the graph cutoff distance significantly affects the performance of the GCN model, there is no observable consistent trend by varying the cutoff distance. Moreover, Supplementary Table S3 shows the effect of cutoff distances across multiple classification metrics, namely MCC, together with Precision and Recall.
Table 2.
Effect of graph cutoff distance and the graph attention mechanism.a
| Ligand | GCN 4 Å | GCN 6 Å | GCN 8 Å | GCN 10 Å | GCN Ensemble | GAT 4 Å | GAT 6 Å | GAT 8 Å | GAT 10 Å | GAT Ensemble | Sequence Baseline |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ADP | 0.569 | 0.564 | 0.581 | 0.557 | 0.584 | 0.571 | 0.578 | 0.597 | 0.582 | 0.583 | 0.553 |
| AMP | 0.450 | 0.412 | 0.424 | 0.419 | 0.445 | 0.449 | 0.463 | 0.489 | 0.475 | 0.482 | 0.416 |
| ATP | 0.546 | 0.537 | 0.538 | 0.557 | 0.569 | 0.566 | 0.575 | 0.572 | 0.587 | 0.583 | 0.501 |
| CA | 0.396 | 0.382 | 0.403 | 0.420 | 0.421 | 0.383 | 0.408 | 0.408 | 0.411 | 0.426 | 0.513 |
| DNA | 0.473 | 0.476 | 0.470 | 0.459 | 0.490 | 0.460 | 0.483 | 0.510 | 0.488 | 0.499 | 0.371 |
| FE | 0.618 | 0.645 | 0.614 | 0.645 | 0.645 | 0.704 | 0.668 | 0.692 | 0.719 | 0.703 | 0.651 |
| GDP | 0.665 | 0.668 | 0.737 | 0.693 | 0.710 | 0.696 | 0.695 | 0.746 | 0.705 | 0.744 | 0.651 |
| GTP | 0.537 | 0.514 | 0.575 | 0.564 | 0.556 | 0.666 | 0.669 | 0.670 | 0.573 | 0.695 | 0.524 |
| HEME | 0.689 | 0.672 | 0.736 | 0.675 | 0.691 | 0.675 | 0.674 | 0.743 | 0.682 | 0.685 | 0.720 |
| MG | 0.343 | 0.344 | 0.351 | 0.362 | 0.365 | 0.325 | 0.347 | 0.364 | 0.349 | 0.364 | 0.332 |
| MN | 0.617 | 0.606 | 0.594 | 0.590 | 0.634 | 0.602 | 0.642 | 0.607 | 0.642 | 0.638 | 0.585 |
| ZN | 0.660 | 0.681 | 0.673 | 0.693 | 0.699 | 0.670 | 0.672 | 0.685 | 0.690 | 0.699 | 0.671 |
| Average | 0.547 | 0.542 | 0.558 | 0.553 | 0.567 | 0.564 | 0.573 | 0.590 | 0.575 | 0.592 | 0.541 |
Values represent MCC scores for the Yu benchmark test set. Bold text highlights the highest value in the respective row.
3.3 Effect of graph attention mechanism
For GAT, we tested the effect of the graph attention mechanism initially designed as a regularization strategy for the GNN models. The attention may contribute to a better generalization performance as the model attends only to relevant parts of the neighborhood of a node. To test the added value of the graph attention mechanism, we compared our shallow GCN model with a shallow version of GAT, where we used our standard GAT architecture with a single graph attention layer. Table 2 compares the GCN and GAT models for the different cutoff distances.
We see that, unlike in the case of GCN, for most datasets, there is a consistent improvement in the performance of the GAT model with increasing cutoff distance. This observation can be explained by the capacity of the attention mechanism to reduce noise in larger neighborhoods. For graphs obtained using a high cutoff distance, each node has a bigger neighborhood and collects information from more (distant) neighbors. Without using the attention mechanism, the model does not have the capacity to filter out irrelevant information. The graph attention mechanism fixes this issue by adjusting the neighbor weights to attend only to neighbors relevant for making the prediction.
Moreover, we observe that, for both GAT and GCN, the ensemble models have a better average performance across the ligand datasets than any single cutoff distance, and this performance is very similar to the average performance of the model with the 8 Å cutoff. These observations suggest that the model with the 8 Å cutoff can be considered a lightweight proxy for the ensemble model in terms of the number of parameters and the required preprocessing steps. We will call those models GCN8 and GAT8 in the rest of the work. Table 2 shows that GAT8 has significantly higher performance than GCN8 for the GTP ligand, while it performs slightly better for most other ligands. Furthermore, GAT8 significantly outperforms the sequence baseline for most ligands. In the following experiments, we therefore consider our best-performing model architecture to be GAT8. Supplementary Table S2 also includes a comparison of GAT and GCN using more classification metrics.
3.4 What is the attention attentive to?
In the previous experiments (Table 2), we showed that attention helps to improve prediction accuracy in comparison with GCN; we further wondered which amino acids were helpful and therefore investigated a number of binding sites of the Zn ion, GTP, and HEME as three diverse representatives of the studied ligands. We specifically investigated cases where a ligand-binding residue was not predicted by GCN but was correctly predicted by GAT. To do that, we used 10 Å protein graphs; for every binding residue, we extracted the attention value for each neighbor. As our model uses four attention heads, the attention values were averaged across the heads. Then, individually for each binding residue, we colored the binding site by the relative attention contribution of the binding residue's neighbors. We observed that in many cases, the residues with the highest attention were other ligand-binding residues (and sequence neighbors of the studied residue). The high-attention residues were often physically close to the ligand, but we also observed cases where the residues with the highest attention were on the other side of the binding site and away from the studied residue (see Figs 4–6). The figures visualize the attention. The binding residue and its neighbors are represented as sticks. The binding residue is colored yellow, with neighbors going from green (highest attention) to red (lowest attention). The low-attention neighbors are partially transparent. The ligand is colored gray.
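The head-averaged attention values used for this inspection can be extracted roughly as follows with DGL's GATConv; the helper is illustrative, and the molecular-viewer coloring itself is not shown.

```python
import torch

def neighbor_attention(gat_layer, graph, feats, residue_idx: int):
    """Average attention over heads for edges pointing at one residue."""
    with torch.no_grad():
        _, attn = gat_layer(graph, feats, get_attention=True)  # attn: (E, num_heads, 1)
    attn = attn.squeeze(-1).mean(dim=1)                        # (E,) head-averaged
    src, dst = graph.edges()
    mask = dst == residue_idx          # edges whose destination is the studied residue
    neighbors = src[mask]
    weights = attn[mask]
    # Relative contribution of each neighbor to the studied residue
    return neighbors, weights / weights.sum()
```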
Figure 4.

Structure of the Zn binding site on zinc finger antiviral protein (3u9g)—Cys 73 was correctly predicted with the help of attention—the biggest contribution came from Cys 78 (green) and His 86 and Leu 85 (both brown)—Cys 78 and His 86 directly interact with Zn 226.
Figure 5.

Structure of GTP binding site of dethiobiotin synthetase (3qxj)—Thr 15 was correctly predicted with the help of attention—the biggest contribution came from Gly 12 (green) and Asp 53 (brown). Gly 12 is directly interacting with GTP.
Figure 6.

Structure of HEME binding site of fungal catalase-peroxidase 2 MagKatG2 (3ut2)—His 314 was correctly predicted with the help of attention—the biggest contribution came from Lys 318 (green), Trp 365, and Gly 317 (brown). All three residues are directly interacting with heme.
We should emphasize that the goal of this exercise was to offer a visual way of inspecting the attention, but a more quantitative approach should be taken to draw a conclusive statement regarding the attention. This is further supported by the fact that we also encountered instances where it was not clear how the residues with high attention could contribute to the accurate prediction of the studied residue.
3.5 Comparison with existing methods
To put the proposed approach in the context of existing research, we compared our GAT8 model with the Prot-T5 embeddings, which consistently demonstrated higher performance in the previous experiments, to other approaches that were trained and tested using the Yu benchmark dataset, namely TargetS (Yu et al. 2013), EC-RUS (Ding et al. 2017), and SXGBsite (Zhao et al. 2019), which are based on different hand-crafted, but context-dependent, features as described in the Introduction. For each of the three methods, we show the results of the best-performing versions of those methods as presented in the respective papers. Table 3 compares the methods using the area under the ROC curve (ROC-AUC) and MCC. Our GAT8 model with ProtT5 embeddings outperforms all of the methods on the MCC metric for all ligand datasets, and on the ROC-AUC metric for most datasets. However, it should be emphasized that the compared methods are sequence-based, using only predicted structural features (such as predicted secondary structure). On the other hand, our approach does not incorporate the 3D structure directly, as the protein graph only approximates the 3D information.
Table 3.
Comparison with existing methods—Yu benchmark.a
| Ligand | AUC: TargetS | AUC: EC-RUS | AUC: SXGBsite | AUC: ProtT5 GAT8 | MCC: TargetS | MCC: EC-RUS | MCC: SXGBsite | MCC: ProtT5 GAT8 |
|---|---|---|---|---|---|---|---|---|
| ADP | 0.896 | 0.872 | 0.907 | 0.945 | 0.507 | 0.511 | 0.521 | 0.597 |
| AMP | 0.83 | 0.815 | 0.851 | 0.892 | 0.359 | 0.393 | 0.366 | 0.489 |
| ATP | 0.898 | 0.871 | 0.886 | 0.936 | 0.502 | 0.506 | 0.448 | 0.572 |
| CA | 0.767 | 0.77 | 0.757 | 0.882 | 0.243 | 0.225 | 0.167 | 0.408 |
| DNA | 0.836 | 0.814 | 0.827 | 0.932 | 0.377 | 0.319 | 0.27 | 0.510 |
| FE | 0.945 | 0.936 | 0.913 | 0.986 | 0.479 | 0.49 | 0.454 | 0.692 |
| GDP | 0.896 | 0.872 | 0.93 | 0.963 | 0.55 | 0.579 | 0.678 | 0.746 |
| GTP | 0.855 | 0.861 | 0.883 | 0.932 | 0.617 | 0.641 | 0.572 | 0.670 |
| HEME | 0.907 | 0.935 | 0.9 | 0.976 | 0.598 | 0.64 | 0.555 | 0.743 |
| MG | 0.706 | 0.78 | 0.819 | 0.782 | 0.294 | 0.317 | 0.326 | 0.364 |
| MN | 0.888 | 0.891 | 0.888 | 0.920 | 0.449 | 0.31 | 0.329 | 0.607 |
| ZN | 0.936 | 0.958 | 0.892 | 0.962 | 0.527 | 0.437 | 0.363 | 0.685 |
Values represent scores for the Yu benchmark test set. Bold text highlights the highest value in the respective row.
Finally, we also validate that our approach is comparable with recently published methods that predict nucleic acid binding using GNNs. Specifically, Supplementary Table S9 compares our approach with GraphBind (Xia et al. 2021) and GraphSite (Yuan et al. 2022), which used variations of GNNs, in addition to GeoBind (Li and Liu 2023) and EquiPNAS (Roche et al. 2024), which used combinations of GNNs and pLMs. For each of the presented methods, we report the best-performing version, which relied on the experimental protein structure to construct the protein graph. We compare the methods using the area under the ROC curve (AUC), the area under the Precision-Recall curve (AUPR), and MCC. We report the scores directly from the original published results of the methods whenever the score was available. While EquiPNAS and GeoBind showed the best performance on the DNA/RNA benchmarks, our GAT8 model with ProtT5 embeddings shows similar performance to GraphBind and GraphSite, especially on the DNA benchmark.
In conclusion, while our GAT8 model performs on par with some of the existing models, it is, in some cases, slightly outperformed by some of the more recent and more involved architectures. Several factors contribute to this. Firstly, many models used in the DNA/RNA benchmark are specifically fine-tuned for this ligand type, while our GAT8 model is a general-purpose architecture designed to explore the effect of incorporating structure for various ligand types. Additionally, newer architectures like EquiPNAS incorporate advanced GNN components, such as equivariance, which have been shown to enhance GNN performance across various tasks (Han et al. 2025). Moreover, the selection of features for node and edge embeddings can significantly impact model performance.
3.6 How much does the GNN architecture contribute to the performance?
In the previous sections, we observed that the GNN architecture improves the performance of the ProtT5 pLM. This observation prompted us to quantify how much the structural information processed by the GNN architecture contributes to the predictive performance of sequence-based pLMs. To this end, we designed two experiments to analyze the interplay of sequence information represented by node embeddings and structural information embedded in the graph connectivity.
3.6.1 Effect of varying the pLM
The first experiment involved comparing the GAT8 architecture with the sequence baseline model for several node embeddings using the Yu benchmark. Specifically, we compared one embedding with context-independent features and four embeddings from four different pLMs. The first embedding uses the context-independent AAIndex physico-chemical properties of amino acids, where a residue is represented by the same feature vector independently of its sequential context. The four remaining models use different context-aware pLMs of varying complexity: SeqVec and ProtBERT embeddings, which are relatively less complex, and ProtT5 and ESM-2 embeddings, which are relatively more complex. The embedding complexity can be measured by two main indicators: the number of parameters of the pLM and the dimensionality of the embedding (see Supplementary Table S5). The model complexity increases when one or both indicators increase. We measured the effect of the structure information by calculating the absolute (Δ) and relative (Δrel) improvements of the GAT8 models over their respective sequence baselines. We define the improvements in MCC scores as Δ = MCC(GAT8) − MCC(baseline) and Δrel = Δ / MCC(baseline).
We compared the embeddings using a fivefold CV approach, reporting the highest validation MCC scores achieved during training, along with their absolute and relative improvements. The results are presented in Table 4, showing the means and standard deviations across the five folds.
Table 4.
Effect of different embeddings.a
| Embedding | Model / score | ADP | AMP | ATP | CA | DNA | FE | GDP | GTP | HEME | MG | MN | ZN | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAIndex | Sequence | 0.068 0.012 | 0.056 0.009 | 0.075 0.004 | 0.124 0.009 | 0.159 0.016 | 0.198 0.015 | 0.116 0.016 | 0.088 0.021 | 0.147 0.03 | 0.086 0.003 | 0.172 0.011 | 0.247 0.01 | 0.12 |
| | GAT8 | 0.152 0.015 | 0.081 0.016 | 0.138 0.009 | 0.152 0.008 | 0.218 0.018 | 0.281 0.029 | 0.25 0.039 | 0.158 0.045 | 0.228 0.035 | 0.107 0.005 | 0.206 0.018 | 0.317 0.012 | 0.19 |
| | Δ | 0.084 0.01 | 0.025 0.012 | 0.063 0.008 | 0.027 0.008 | 0.058 0.007 | 0.084 0.023 | 0.133 0.025 | 0.07 0.033 | 0.081 0.022 | 0.022 0.004 | 0.034 0.017 | 0.07 0.003 | 0.06 |
| | Δrel | 1.266 0.336 | 0.451 0.211 | 0.851 0.141 | 0.224 0.071 | 0.367 0.05 | 0.423 0.111 | 1.144 0.134 | 0.815 0.382 | 0.572 0.225 | 0.252 0.05 | 0.198 0.105 | 0.284 0.012 | 0.57 |
| | P-Value | 0.0011 | 0.0087 | 0.0002 | 0.0021 | 0.0001 | 0.001 | 0.0 | 0.0088 | 0.0048 | 0.0003 | 0.0135 | 0.0 | |
| SeqVec | Sequence | 0.577 0.051 | 0.313 0.04 | 0.461 0.031 | 0.356 0.026 | 0.302 0.041 | 0.483 0.061 | 0.606 0.091 | 0.461 0.084 | 0.524 0.052 | 0.371 0.034 | 0.434 0.034 | 0.565 0.017 | 0.45 |
| | GAT8 | 0.639 0.04 | 0.41 0.054 | 0.543 0.03 | 0.399 0.035 | 0.371 0.041 | 0.617 0.074 | 0.697 0.072 | 0.595 0.07 | 0.586 0.044 | 0.415 0.035 | 0.539 0.028 | 0.625 0.014 | 0.53 |
| | Δ | 0.063 0.016 | 0.096 0.029 | 0.082 0.031 | 0.043 0.017 | 0.07 0.013 | 0.134 0.021 | 0.091 0.025 | 0.134 0.021 | 0.063 0.01 | 0.044 0.009 | 0.106 0.024 | 0.06 0.012 | 0.08 |
| | Δrel | 0.111 0.034 | 0.309 0.091 | 0.18 0.076 | 0.119 0.047 | 0.235 0.063 | 0.28 0.037 | 0.158 0.07 | 0.303 0.09 | 0.122 0.031 | 0.119 0.027 | 0.247 0.066 | 0.107 0.024 | 0.19 |
| | P-Value | 0.0019 | 0.0016 | 0.0062 | 0.0048 | 0.0011 | 0.0001 | 0.0072 | 0.0017 | 0.001 | 0.0006 | 0.0011 | 0.0006 | |
| ProtBERT | Sequence | 0.468 0.045 | 0.265 0.033 | 0.384 0.017 | 0.373 0.028 | 0.385 0.024 | 0.613 0.027 | 0.545 0.07 | 0.376 0.042 | 0.522 0.05 | 0.359 0.025 | 0.497 0.044 | 0.584 0.01 | 0.44 |
| | GAT8 | 0.562 0.041 | 0.313 0.04 | 0.469 0.024 | 0.443 0.037 | 0.435 0.029 | 0.66 0.061 | 0.646 0.078 | 0.498 0.078 | 0.599 0.058 | 0.417 0.013 | 0.565 0.044 | 0.647 0.011 | 0.52 |
| | Δ | 0.093 0.009 | 0.048 0.016 | 0.084 0.032 | 0.07 0.013 | 0.049 0.011 | 0.046 0.078 | 0.101 0.021 | 0.122 0.062 | 0.077 0.01 | 0.058 0.015 | 0.068 0.015 | 0.063 0.008 | 0.07 |
| | Δrel | 0.201 0.034 | 0.18 0.058 | 0.222 0.09 | 0.188 0.028 | 0.128 0.027 | 0.079 0.124 | 0.186 0.042 | 0.328 0.16 | 0.148 0.015 | 0.164 0.053 | 0.138 0.036 | 0.108 0.014 | 0.17 |
| | P-Value | 0.0002 | 0.0023 | 0.0053 | 0.0001 | 0.0005 | 0.2278 | 0.0006 | 0.0102 | 0.0 | 0.0023 | 0.001 | 0.0001 | |
| ProtT5 | Sequence | 0.636 0.035 | 0.423 0.065 | 0.543 0.019 | 0.441 0.026 | 0.494 0.036 | 0.699 0.027 | 0.646 0.056 | 0.503 0.067 | 0.629 0.041 | 0.444 0.027 | 0.589 0.052 | 0.673 0.016 | 0.55 |
| | GAT8 | 0.68 0.029 | 0.473 0.069 | 0.603 0.025 | 0.486 0.02 | 0.526 0.026 | 0.743 0.064 | 0.715 0.072 | 0.609 0.056 | 0.665 0.043 | 0.47 0.028 | 0.626 0.041 | 0.695 0.02 | 0.60 |
| | Δ | 0.043 0.009 | 0.05 0.046 | 0.059 0.015 | 0.045 0.015 | 0.032 0.014 | 0.044 0.081 | 0.069 0.035 | 0.107 0.033 | 0.036 0.012 | 0.026 0.008 | 0.036 0.012 | 0.022 0.008 | 0.04 |
| | Δrel | 0.069 0.017 | 0.127 0.127 | 0.109 0.027 | 0.103 0.037 | 0.066 0.034 | 0.065 0.114 | 0.107 0.052 | 0.218 0.084 | 0.057 0.019 | 0.059 0.018 | 0.064 0.027 | 0.032 0.012 | 0.08 |
| | P-Value | 0.0008 | 0.0902 | 0.0009 | 0.0034 | 0.0118 | 0.2674 | 0.0101 | 0.0043 | 0.0025 | 0.0018 | 0.0062 | 0.0044 | |
| ESM-2 | Sequence | 0.648 0.028 | 0.457 0.057 | 0.576 0.035 | 0.431 0.025 | 0.506 0.027 | 0.717 0.026 | 0.708 0.057 | 0.601 0.074 | 0.655 0.042 | 0.418 0.025 | 0.606 0.043 | 0.648 0.014 | 0.58 |
| | GAT8 | 0.693 0.032 | 0.515 0.066 | 0.616 0.034 | 0.474 0.023 | 0.523 0.023 | 0.736 0.018 | 0.732 0.06 | 0.633 0.064 | 0.675 0.036 | 0.459 0.018 | 0.636 0.04 | 0.687 0.018 | 0.61 |
| | Δ | 0.045 0.006 | 0.058 0.017 | 0.04 0.006 | 0.043 0.013 | 0.018 0.013 | 0.019 0.009 | 0.023 0.01 | 0.032 0.027 | 0.02 0.011 | 0.04 0.012 | 0.03 0.013 | 0.039 0.009 | 0.03 |
| | Δrel | 0.069 0.008 | 0.126 0.037 | 0.07 0.012 | 0.101 0.033 | 0.035 0.028 | 0.026 0.013 | 0.033 0.014 | 0.056 0.051 | 0.031 0.019 | 0.097 0.032 | 0.05 0.023 | 0.06 0.014 | 0.06 |
| | P-Value | 0.0 | 0.0015 | 0.0002 | 0.0024 | 0.0468 | 0.0117 | 0.0064 | 0.0676 | 0.0208 | 0.0024 | 0.0078 | 0.0007 | |
Values represent the mean and standard deviation of the highest validation MCC scores from a 5-fold CV setting for the Yu benchmark training set. The P-values correspond to the result of the t-test performed on the relative improvement values from the CV folds.
Table 4 indicates that the absolute improvement in the validation MCC scores of the GAT8 model over the sequence baseline is positive on average across all ligand datasets. Moreover, although different embeddings have varying degrees of absolute improvement depending on the ligand dataset, they have similar values on average. Nevertheless, in the case of the less complex AAIndex, SeqVec, and ProtBERT embeddings, we can observe the highest relative improvements in the MCC score of the GAT8 model over the sequence baseline for most ligands, while the more complex ESM-2 and ProtT5 embeddings show smaller relative improvements. These observations show that while the protein structural information almost always improves the sequence baseline regardless of the chosen embedding, the relative effect is more pronounced for simple embeddings and decreases with the complexity of the language models.
To consolidate the relative improvement observations, we performed statistical significance tests for all embeddings and across all ligand datasets. Table 4 presents the results of t-tests for the mean relative improvement scores of the GAT8 model over the sequence baseline in the fivefold CV setting. The null hypothesis of the t-tests is that there is no relative improvement, and the significance threshold is chosen to be 0.01. Table 4 shows that for all embeddings, most relative improvement values are statistically significant (P-value < 0.01).
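A sketch of the corresponding test with SciPy is given below; the fold values are placeholders, and since the text does not state whether a one- or two-sided test was used, SciPy's default two-sided test is shown.

```python
import numpy as np
from scipy import stats

# Relative improvements of GAT8 over the sequence baseline, one value per CV fold
# (placeholder numbers; the real values come from the fivefold CV runs).
rel_improvements = np.array([0.07, 0.05, 0.09, 0.06, 0.08])

# One-sample t-test against the null hypothesis of no relative improvement (mean = 0)
t_stat, p_value = stats.ttest_1samp(rel_improvements, popmean=0.0)
significant = p_value < 0.01
```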
3.6.2 Effect of the original protein structure
To quantify how much improvement is caused by the concrete graph topology as opposed to random propagation of information, we devised the following experiment. We compared the GAT8 model built on graphs constructed from the experimental PDB structure (the “original” model) with a “random” version of the same model, in which the original graph was replaced by a random graph. Specifically, we randomly assigned edges between residues while explicitly excluding every edge present in the original graph. The “random” model provides a solid baseline against which to measure the effect of the experimental structure information in the GNN architecture and its relationship with pLMs. In Table 5, we report the absolute and relative improvements in test MCC scores of the GAT8 models with original graphs over their respective random graph baselines. We observe from the absolute improvement scores that for all embeddings, the original structure almost always contributes positively to the performance. Nevertheless, this effect tends to decrease on average, both in terms of absolute and relative improvement, especially for more complex pLMs.
Table 5.
Effect of original structure.a
| Embedding | Graph / score | ADP | AMP | ATP | CA | DNA | FE | GDP | GTP | HEME | MG | MN | ZN | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAIndex | Original | 0.142 | 0.101 | 0.106 | 0.122 | 0.168 | 0.241 | 0.215 | 0.192 | 0.214 | 0.083 | 0.204 | 0.331 | 0.177 |
| | Random | 0.052 | 0.054 | 0.064 | 0.115 | 0.122 | 0.193 | 0.083 | 0.090 | 0.135 | 0.066 | 0.165 | 0.256 | 0.116 |
| | Δ | 0.090 | 0.047 | 0.042 | 0.007 | 0.046 | 0.048 | 0.132 | 0.102 | 0.079 | 0.017 | 0.039 | 0.075 | 0.060 |
| | Δrel | 1.734 | 0.870 | 0.646 | 0.065 | 0.377 | 0.249 | 1.579 | 1.125 | 0.585 | 0.261 | 0.239 | 0.295 | 0.669 |
| SeqVec | Original | 0.571 | 0.365 | 0.512 | 0.310 | 0.322 | 0.587 | 0.640 | 0.613 | 0.602 | 0.298 | 0.480 | 0.598 | 0.492 |
| | Random | 0.528 | 0.352 | 0.500 | 0.291 | 0.298 | 0.576 | 0.611 | 0.613 | 0.566 | 0.293 | 0.479 | 0.572 | 0.473 |
| | Δ | 0.043 | 0.013 | 0.012 | 0.019 | 0.024 | 0.011 | 0.029 | 0.000 | 0.036 | 0.005 | 0.001 | 0.026 | 0.018 |
| | Δrel | 0.082 | 0.036 | 0.024 | 0.064 | 0.082 | 0.018 | 0.048 | 0.000 | 0.064 | 0.017 | 0.002 | 0.045 | 0.040 |
| ProtBERT | Original | 0.504 | 0.326 | 0.445 | 0.378 | 0.381 | 0.635 | 0.580 | 0.552 | 0.564 | 0.324 | 0.514 | 0.618 | 0.485 |
| | Random | 0.468 | 0.305 | 0.423 | 0.337 | 0.357 | 0.601 | 0.589 | 0.535 | 0.527 | 0.297 | 0.485 | 0.614 | 0.462 |
| | Δ | 0.036 | 0.021 | 0.022 | 0.041 | 0.024 | 0.034 | −0.009 | 0.017 | 0.037 | 0.027 | 0.029 | 0.004 | 0.024 |
| | Δrel | 0.078 | 0.068 | 0.051 | 0.121 | 0.067 | 0.057 | −0.016 | 0.031 | 0.070 | 0.090 | 0.060 | 0.007 | 0.057 |
| ProtT5 | Original | 0.597 | 0.489 | 0.572 | 0.408 | 0.510 | 0.692 | 0.746 | 0.670 | 0.743 | 0.364 | 0.607 | 0.649 | 0.587 |
| | Random | 0.583 | 0.471 | 0.560 | 0.381 | 0.494 | 0.688 | 0.706 | 0.635 | 0.729 | 0.351 | 0.589 | 0.689 | 0.573 |
| | Δ | 0.014 | 0.018 | 0.012 | 0.027 | 0.016 | 0.004 | 0.040 | 0.035 | 0.014 | 0.013 | 0.018 | −0.040 | 0.014 |
| | Δrel | 0.024 | 0.038 | 0.021 | 0.071 | 0.031 | 0.005 | 0.057 | 0.055 | 0.020 | 0.037 | 0.030 | −0.059 | 0.028 |
| ESM-2 | Original | 0.616 | 0.493 | 0.597 | 0.401 | 0.647 | 0.643 | 0.750 | 0.671 | 0.755 | 0.350 | 0.597 | 0.683 | 0.600 |
| | Random | 0.617 | 0.507 | 0.603 | 0.416 | 0.475 | 0.674 | 0.755 | 0.705 | 0.750 | 0.342 | 0.574 | 0.680 | 0.591 |
| | Δ | −0.001 | −0.014 | −0.006 | −0.015 | 0.172 | −0.031 | −0.005 | −0.034 | 0.005 | 0.008 | 0.023 | 0.003 | 0.009 |
| | Δrel | −0.002 | −0.027 | −0.010 | −0.035 | 0.363 | −0.046 | −0.007 | −0.048 | 0.006 | 0.023 | 0.040 | 0.005 | 0.022 |
Values represent MCC scores for the Yu benchmark test set.
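A sketch of how such a random-graph baseline can be generated is given below; matching the number of edges of the original graph is an assumption of this sketch, since the text only specifies that the original edges are excluded.

```python
import numpy as np
import dgl
import torch

def random_graph_baseline(original: dgl.DGLGraph, seed: int = 0) -> dgl.DGLGraph:
    """Random graph over the same residues that avoids all original edges.

    Sampling until the edge count of the original graph is matched is an
    assumption made for this sketch.
    """
    rng = np.random.default_rng(seed)
    n = original.num_nodes()
    forbidden = set(zip(*(t.tolist() for t in original.edges())))
    edges = set()
    while len(edges) < original.num_edges():
        u, v = (int(x) for x in rng.integers(0, n, size=2))
        if u != v and (u, v) not in forbidden:
            edges.add((u, v))
    src, dst = zip(*edges)
    return dgl.graph((torch.tensor(src), torch.tensor(dst)), num_nodes=n)
```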
The results of both experiments suggest that, due to the fact that more complex embeddings significantly improve the performance of the sequence and the random graph baselines, a significant part of the structure information necessary for predicting protein–ligand binding sites is already encoded in the protein language models. This may be explained by the fact that, as complex protein language models were built using masked language modeling, a large number of parameters and huge training sets, important relationships between residues that correlate with structural features may already be captured in the embeddings and can thus be used for binding site predictions.
3.7 Evaluation on the PDBBind dataset
In previous experiments (see Table 4), we found that the incorporation of protein structure information using a GNN significantly improved the performance of the sequence-based pLM baselines. However, an open question remains: do these results hold in broader contexts, especially those involving a more diverse range of ligands?
Although the Yu benchmark was constructed from BioLip (Yang et al. 2013), which includes a variety of biologically plausible ligands, it predominantly contains chemically and structurally similar ligands and ligand types. For example, ligands like ADP, ATP, AMP, GDP, and GTP are highly similar, as are ion ligands such as Zn and Mg. This similarity could introduce bias in protein–ligand binding site models, potentially affecting the generalizability of our findings. To address this issue, we studied the impact of structural information on a newly constructed dataset from the PDBBind dataset (Wang et al. 2004, Su et al. 2019), which includes chemically diverse ligands. The data set was carefully divided into training, validation, and test sets, where we controlled for protein sequence similarity and ligand similarity using the Tanimoto similarity of Extended-Connectivity fingerprints (ECFP4) (Rogers and Hahn 2010). This ensures a fair evaluation of the performance of the model on a wider range of ligands and avoids potential data leakage risks.
PDBBind is a dataset comprising experimentally determined protein–ligand binding affinities alongside the corresponding structures of protein–ligand complexes derived from the PDB. The binding affinities are curated from the literature by reviewing the references associated with these complex structures. Here, we used a refined set of 5,316 protein–ligand complexes from the 2020 version of PDBBind (http://www.pdbbind.org.cn/download/pdbbind_2020_intro.pdf). The refined set was obtained by curating 157,974 experimentally determined PDB structures, following the methodology described in the work of Su et al. (2019), and the version used in our work was downloaded from PDBBind-Plus (https://www.pdbbind-plus.org.cn/download).
We constructed our dataset using the following procedure.
1. PDBBind v2020 already contains annotations of protein–ligand binding pockets, obtained by labeling as binding residues all residues within a 10 Å distance of the ligand (a sketch of this labeling criterion is given after this list).
2. Since our GAT8 model accepts only a single chain as input and cannot capture information from other chains when a binding pocket spans multiple chains, we focused exclusively on structures in which the binding pocket lies within a single chain. We also removed chains that do not contain any binding residues, resulting in a total of 3465 labeled chains.
3. To address the data leakage associated with protein sequence and ligand similarity, we designed the training, validation, and test splits using standard sequence and ligand clustering algorithms. We clustered protein sequences with the MMseqs tool (Hauser et al. 2016) (easy-cluster command, default parameters) using a 40% sequence-similarity cutoff. We used RDKit (https://www.rdkit.org/docs/index.html) to compute ECFP4 fingerprint Tanimoto similarities (radius=2, nBits=1024) between all ligands in the dataset and applied hierarchical clustering with a 40% cutoff to define the ligand clusters (see the clustering sketch below).
4. We defined the training, validation, and test splits by randomly assigning the sequence clusters obtained in step 3 to each set, which guaranteed that no pair of sequences from different splits shared more than 40% similarity.
5. We removed from the validation and test sets all sequences whose ligands share a cluster with any ligand in the training set, and applied the same filter between the test and validation sets. This step removed any ligand-similarity bias by ensuring that no pair of ligands from different splits shared more than 40% chemical similarity.
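For illustration only, the 10 Å labeling criterion from step 1 could be recomputed roughly as in the sketch below. The function name, the use of Biopython, and the handling of hetero residues are our assumptions rather than PDBBind's exact pipeline, which ships precomputed pocket annotations.

```python
import numpy as np
from Bio.PDB import PDBParser

def label_binding_residues(pdb_path, chain_id, ligand_resname, cutoff=10.0):
    """Label chain residues whose closest atom lies within `cutoff` angstroms of any ligand atom."""
    model = PDBParser(QUIET=True).get_structure("complex", pdb_path)[0]
    # Collect coordinates of the ligand's atoms (hetero residues with the given residue name).
    lig_coords = np.array([atom.coord
                           for res in model.get_residues()
                           if res.id[0].startswith("H_") and res.resname == ligand_resname
                           for atom in res])
    assert len(lig_coords) > 0, "ligand not found in structure"
    labels = {}
    for res in model[chain_id]:
        if res.id[0] != " ":        # skip hetero residues and waters
            continue
        res_coords = np.array([atom.coord for atom in res])
        dmin = np.linalg.norm(res_coords[:, None, :] - lig_coords[None, :, :], axis=-1).min()
        labels[res.id[1]] = int(dmin <= cutoff)  # 1 = binding residue, 0 = non-binding
    return labels
```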
The above procedure resulted in a total of 2516 protein–ligand single-chain complexes. We used the same procedure to build five different train/validation/test splits for our experiment, as a strategy equivalent to the five-fold cross-validation used in Subsection 3.6.2 while providing stricter control of data leakage. More details about the constructed dataset are presented in Supplementary Table S10, and the training, validation, and test splits used can be found at https://zenodo.org/records/15184302.
We assessed the impact of incorporating structural information by comparing the performance of the GAT8 model with that of the sequence baseline on the new dataset. Both models were trained and evaluated following the same procedure and hyperparameters outlined in Subsection "How much does the GNN architecture contribute to the performance?". To assess the effect of pLM complexity on the contribution of structural information, the models were built using both the complex ESM-2 embeddings and the simpler SeqVec embeddings.
Supplementary Table S11 reports the models’ performance, measured by the best validation MCC score reached during training on each split, together with t-test results assessing the statistical significance of the differences.
The results indicate that the GAT8 model outperforms the sequence baseline in all settings for both pLMs. However, while a statistically significant relative improvement of 0.112 ± 0.02 in MCC was observed for the simple SeqVec embeddings, only a relative improvement of 0.003 ± 0.014 was observed for the complex ESM-2 embeddings, and this improvement was not statistically significant. These observations align with the findings of Subsection "Effect of the original protein structure".
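A per-split comparison of this kind could be computed with scikit-learn and SciPy as sketched below; the paired form of the t-test is our assumption, since only "t-test" is stated in the text, and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import matthews_corrcoef

def compare_models(split_preds_a, split_preds_b, split_labels):
    """Per-split MCC for two models and a paired t-test on the per-split differences.

    split_preds_a, split_preds_b: lists of binary prediction arrays, one per split
    split_labels:                 list of matching ground-truth label arrays
    """
    mcc_a = np.array([matthews_corrcoef(y, p) for y, p in zip(split_labels, split_preds_a)])
    mcc_b = np.array([matthews_corrcoef(y, p) for y, p in zip(split_labels, split_preds_b)])
    t_stat, p_value = ttest_rel(mcc_a, mcc_b)   # paired across the five splits
    return mcc_a.mean() - mcc_b.mean(), p_value
```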
Therefore, the experiment indicates that incorporating structural information through the GNN positively impacts protein–ligand binding site prediction, with smaller relative improvements for more complex pLMs, even when sequence and ligand similarity between the training and test splits are tightly controlled.
4 Conclusion
In this work, we integrated sequence-based and structure-based paradigms for predicting protein–ligand binding sites by designing a GNN model augmented with protein language model embeddings. While the model’s performance varies with the cutoff distance used to construct the protein graph, the introduction of the graph attention mechanism significantly enhances predictive performance for densely connected graphs. Our findings indicate that although the structural information processed by the GNN architecture generally contributes positively to the model’s performance, this effect is more pronounced with simple node features and diminishes with the use of more complex language models.
Overall, our research demonstrates the potential utility of combining sequence-based and structure-based approaches, specifically a GNN model enhanced with protein language model embeddings, to improve protein–ligand binding site prediction. This is particularly promising given the increasing availability of predicted 3D models. Although slight inaccuracies in atom positions within these predicted structures might pose challenges for tasks like molecular docking, they should not significantly impact the protein–ligand residue prediction task, because the graph topology that serves as input to the GNN is only an approximation of the protein’s three-dimensional structure and remains relatively unaffected by minor perturbations in atom positions.
Consequently, we believe that integrating protein sequence information from language models with 3D structure data is a promising approach for predicting protein–ligand binding residues.
Contributor Information
Hamza Gamouh, Faculty of Mathematics and Physics, Charles University, 118 00 Prague, Czech Republic.
Marian Novotný, Faculty of Science, Charles University, 128 00 Prague, Czech Republic.
David Hoksza, Faculty of Mathematics and Physics, Charles University, 118 00 Prague, Czech Republic.
Author contributions
Hamza Gamouh (Conceptualization [equal], Data curation [lead], Formal analysis [equal], Investigation [lead], Methodology [equal], Resources [lead], Software [lead], Writing—original draft [lead], Writing—review & editing [lead]), Marian Novotný (Conceptualization [supporting], Investigation [supporting], Visualization [supporting], Writing—original draft [supporting], Writing—review & editing [supporting]), and David Hoksza (Conceptualization [equal], Funding acquisition [lead], Methodology [equal], Project administration [lead], Supervision [lead], Validation [lead], Writing—original draft [supporting], Writing—review & editing [supporting])
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest: The authors declare that they have no competing interests.
Funding
This work was supported by the Czech Science Foundation [23-07349S]. Computational resources were provided by the e-INFRA CZ project (ID: 90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic. Part of this work was carried out with the support of ELIXIR CZ Research Infrastructure (ID LM2023055, MEYS CR).
Data availability
The datasets generated and/or analyzed during the current study, as well as pretrained models, are available in the following Zenodo link https://zenodo.org/records/15184302. The source code that was used to generate the results of the current study is available in the following GitHub repository https://github.com/hamzagamouh/pt-lm-gnn as well as in the following Zenodo link https://zenodo.org/records/15192327.
References
- Aggarwal R, Gupta A, Chelur V et al. DeepPocket: ligand binding site detection and segmentation using 3D convolutional neural networks. J Chem Inf Model 2022;62:5069–79.
- Alipanahi B, Delong A, Weirauch MT et al. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 2015;33:831–8.
- Ashkenazy H, Erez E, Martz E et al. ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids. Nucleic Acids Res 2010;38:W529–W533.
- Berman HM, Westbrook J, Feng Z et al. The Protein Data Bank. Nucleic Acids Res 2000;28:235–42.
- Brown T, Mann B, Ryder N et al. Language models are few-shot learners. Adv Neural Info Process Syst 2020;33:1877–901.
- Brylinski M, Skolnick J. A threading-based method (FINDSITE) for ligand-binding site prediction and functional annotation. Proc Natl Acad Sci USA 2008;105:129–34.
- Chai J, Zeng H, Li A et al. Deep learning in computer vision: a critical review of emerging techniques and application scenarios. Mach Learn Appl 2021;6:100134.
- Chauhan JS, Mishra NK, Raghava GP. Identification of ATP binding residues of a protein from its primary sequence. BMC Bioinformatics 2009;10:434–9.
- Chen K, Mizianty MJ, Kurgan L. Prediction and analysis of nucleotide-binding residues using sequence and sequence-derived structural descriptors. Bioinformatics 2012;28:331–41.
- Chen P, Huang JZ, Gao X. LigandRFs: random forest ensemble to identify ligand-binding residues from sequence information alone. BMC Bioinformatics 2014;15:S4–12.
- Chen P, Hu S, Zhang J et al. A sequence-based dynamic ensemble learning system for protein ligand-binding site prediction. IEEE/ACM Trans Comput Biol Bioinform 2016;13:901–12.
- Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 2020;21:6. 10.1186/s12864-019-6413-7
- Cui Y, Dong Q, Hong D et al. Predicting protein–ligand binding residues with deep convolutional neural networks. BMC Bioinformatics 2019;20:93–12.
- Devlin J, Chang M-W, Lee K et al. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv, 2018, preprint: not peer reviewed.
- Ding Y, Tang J, Guo F. Identification of protein–ligand binding sites by sequence information and ensemble classifier. J Chem Inf Model 2017;57:3149–61. 10.1021/acs.jcim.7b00307
- Elnaggar A, Heinzinger M, Dallago C et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell 2022;44:7112–27.
- Evteev S, Ereshchenko A, Adjugim D et al. Skittles: GNN-assisted pseudo-ligands generation and its application for binding sites classification and affinity prediction. Prot Struct Funct Bioinfo 2025;93:1269–80.
- Evteev SA, Ereshchenko AV, Ivanenkov YA. SiteRadar: utilizing graph machine learning for precise mapping of protein–ligand-binding sites. J Chem Inf Model 2023;63:1124–32.
- Ferreira LG, Dos Santos RN, Oliva G et al. Molecular docking and structure-based drug design strategies. Molecules 2015;20:13384–421.
- Ferruz N, Höcker B. Controllable protein design with language models. Nat Mach Intell 2022;4:521–32.
- Fout A, Byrd J, Shariat B et al. Protein interface prediction using graph convolutional networks. Adv Neural Info Process Syst 2017;30.
- Hauser M, Steinegger M, Söding J. MMseqs software suite for fast and deep clustering and searching of large protein sequence sets. Bioinformatics 2016;32:1323–30.
- He K, Zhang X, Ren S et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, 770–778.
- Heinzinger M, Elnaggar A, Wang Y et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 2019;20:723–17.
- Hendlich M, Rippmann F, Barnickel G. LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins. J Mol Graph Model 1997;15:359–89.
- Høie MH, Kiehl EN, Petersen B et al. NetSurfP-3.0: accurate and fast prediction of protein structural features by protein language models and deep learning. Nucleic Acids Res 2022;50:W510–W515.
- Hoksza D, Gamouh H. Exploration of protein sequence embeddings for protein–ligand binding site detection. In: 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2022, 3356–3361. IEEE.
- Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, 2015, 448–56. PMLR.
- Jha K, Karmakar S, Saha S. Graph-BERT and language model-based framework for protein–protein interaction identification. Sci Rep 2023;13:5663.
- Han J, Cen J, Wu L et al. A survey of geometric graph neural networks: data structures, models and applications. Front Comp Sci 2025;19. 10.1007/s11704-025-41426-w
- Jiménez J, Doerr S, Martínez-Rosell G et al. DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics 2017;33:3036–42.
- Jumper J, Evans R, Pritzel A et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9.
- Kandel J, Tayara H, Chong KT. PUResNet: prediction of protein–ligand binding sites using deep residual neural network. J Cheminform 2021;13:65–14.
- Kawashima S, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res 2000;28:374.
- Khurana D, Koli A, Khatter K et al. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl 2023;82:3713–44.
- Kim P, Zhao J, Lu P et al. MutLBSgeneDB: mutated ligand binding site gene database. Nucleic Acids Res 2017;45:D256–D263.
- Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv, 2016, preprint: not peer reviewed.
- Konc J, Janežič D. Protein binding sites for drug design. Biophys Rev 2022;14:1413–21.
- Krivák R, Hoksza D. P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J Cheminform 2018;10:39.
- Laskowski RA. SURFNET: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. J Mol Graph 1995;13:323–30.
- Laurie AT, Jackson RM. Q-SiteFinder: an energy-based method for the prediction of protein–ligand binding sites. Bioinformatics 2005;21:1908–16.
- Le Guilloux V, Schmidtke P, Tuffery P. Fpocket: an open source platform for ligand pocket detection. BMC Bioinformatics 2009;10:168–11.
- Li P, Liu Z-P. GeoBind: segmentation of nucleic acid binding interface on protein surface with geometric deep learning. Nucleic Acids Res 2023;51:e60.
- Li Y, Huang C, Ding L et al. Deep learning in bioinformatics: introduction, application, and perspective in the big data era. Methods 2019;166:4–21.
- Lin Y, Yoo S, Sanchez R. SiteComp: a server for ligand binding site analysis in protein structures. Bioinformatics 2012;28:1172–3.
- Lin Z, Akin H, Rao R et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022a;500902, preprint: not peer reviewed.
- Lin Z, Akin H, Rao R et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022b, preprint: not peer reviewed.
- Liu Y, Grimm M, Dai W-T et al. CB-Dock: a web server for cavity detection-guided protein–ligand blind docking. Acta Pharmacol Sin 2020;41:138–44.
- Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv, 2017, preprint: not peer reviewed.
- Min B, Ross H, Sulem E et al. Recent advances in natural language processing via large pre-trained language models: a survey. ACM Comput Surv 2024;56:1–40. 10.1145/3605943
- Mylonas SK, Axenopoulos A, Daras P. DeepSurf: a surface-based deep learning approach for the prediction of ligand binding sites on proteins. Bioinformatics 2021;37:1681–90.
- Ngan C-H, Hall DR, Zerbe B et al. FTSite: high accuracy detection of ligand binding sites on unbound protein structures. Bioinformatics 2012;28:286–7.
- O’Shea K, Nash R. An introduction to convolutional neural networks. arXiv, 2015, preprint: not peer reviewed.
- Pedregosa F, Varoquaux G, Gramfort A et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825–30.
- Pokharel S, Pratyush P, Heinzinger M et al. Improving protein succinylation sites prediction using embeddings from protein language model. Sci Rep 2022;12:16933.
- Pratyush P, Pokharel S, Saigo H et al. pLMSNOSite: an ensemble-based approach for predicting protein S-nitrosylation sites by integrating supervised word embedding and embedding from pre-trained protein language model. BMC Bioinformatics 2023;24:41.
- Pravda L, Berka K, Svobodová Vařeková R et al. Anatomy of enzyme channels. BMC Bioinformatics 2014;15:379.
- Pu L, Govindaraj RG, Lemoine JM et al. DeepDrug3D: classification of ligand-binding pockets in proteins with a convolutional neural network. PLoS Comput Biol 2019;15:e1006718.
- Rao R, Meier J, Sercu T et al. Transformer protein language models are unsupervised structure learners. bioRxiv, 2020, preprint: not peer reviewed.
- Roche DB, Tetchner SJ, McGuffin LJ. FunFOLD: an improved automated method for the prediction of ligand binding residues using 3D models of proteins. BMC Bioinformatics 2011;12:160–20.
- Roche DB, Brackenridge DA, McGuffin LJ. Proteins and their interacting partners: an introduction to protein–ligand binding site prediction methods. Int J Mol Sci 2015;16:29829–42.
- Roche R, Moussad B, Shuvo MH et al. EquiPNAS: improved protein–nucleic acid binding site prediction using protein-language-model-informed equivariant deep graph neural networks. Nucleic Acids Res 2024;52:e27.
- Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model 2010;50:742–54.
- Rusch TK, Bronstein MM, Mishra S. A survey on oversmoothing in graph neural networks. arXiv, 2023, preprint: not peer reviewed.
- Serra A, Galdi P, Tagliaferri R. Machine learning for bioinformatics and neuroimaging. Wiley Interdiscip Rev Data Min Knowl Discov 2018;8:e1248.
- Srivastava N, Hinton G, Krizhevsky A et al. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15:1929–58.
- Steinegger M, Mirdita M, Söding J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat Methods 2019;16:603–6.
- Su M, Yang Q, Du Y et al. Comparative assessment of scoring functions: the CASF-2016 update. J Chem Inf Model 2019;59:895–913.
- Suzek BE, Wang Y, Huang H et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 2015;31:926–32.
- Tiwary BK. Biological databases. Bioinformatics and Computational Biology: A Primer for Biologists. 2022, 11–31. 10.1007/978-981-16-4241-8
- Unsal S, Atas H, Albayrak M et al. Learning functional properties of proteins with language models. Nat Mach Intell 2022;4:227–45.
- Varadi M, Anyango S, Deshpande M et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res 2022;50:D439–D444.
- Vaswani A, Shazeer N, Parmar N et al. Attention is all you need. Adv Neural Info Process Syst 2017;30.
- Veličković P. Everything is connected: graph neural networks. Curr Opin Struct Biol 2023;79:102538.
- Veličković P, Cucurull G, Casanova A et al. Graph attention networks. arXiv, 2017, preprint: not peer reviewed.
- Wang R, Fang X, Lu Y et al. The PDBbind database: collection of binding affinities for protein–ligand complexes with known three-dimensional structures. J Med Chem 2004;47:2977–80.
- Wang W, Sun B, Yu M et al. GraphPLBR: protein–ligand binding residue prediction with deep graph convolution network. IEEE/ACM Trans Comput Biol Bioinform 2023;20:2223–32.
- Wang Y, You Z-H, Yang S et al. A high efficient biological language model for predicting protein–protein interactions. Cells 2019;8:122.
- Wass MN, Kelley LA, Sternberg MJ. 3DLigandSite: predicting ligand-binding sites using similar structures. Nucleic Acids Res 2010;38:W469–73.
- Xia Y, Xia C-Q, Pan X et al. GraphBind: protein structural context embedded rules learned by hierarchical graph neural networks for recognizing nucleic-acid-binding residues. Nucleic Acids Res 2021;49:e51.
- Yang J, Roy A, Zhang Y. BioLiP: a semi-manually curated database for biologically relevant ligand–protein interactions. Nucleic Acids Res 2013;41:D1096–D1103.
- Yang J, Roy A, Zhang Y. Protein–ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment. Bioinformatics 2013;29:2588–95.
- Yu D-J, Hu J, Yang J et al. Designing template-free predictor for targeting protein–ligand binding sites with classifier ensemble and spatial clustering. IEEE/ACM Trans Comput Biol Bioinform 2013;10:994–1008.
- Yuan Q, Chen S, Rao J et al. AlphaFold2-aware protein–DNA binding site prediction using graph transformer. Brief Bioinform 2022;23:bbab564.
- Zhang X-M, Liang L, Liu L et al. Graph neural networks and their current applications in bioinformatics. Front Genet 2021;12:690049.
- Zhang Y, Huang W, Wei Z et al. EquiPocket: an E(3)-equivariant geometric graph neural network for ligand binding site prediction. arXiv, 2023, preprint: not peer reviewed.
- Zhao J, Cao Y, Zhang L. Exploring the computational methods for protein–ligand binding site prediction. Comput Struct Biotechnol J 2020;18:417–26.
- Zhao Z, Xu Y, Zhao Y. SXGBsite: prediction of protein–ligand binding sites using sequence information and extreme gradient boosting. Genes (Basel) 2019;10:965. 10.3390/genes10120965
- Zheng Z, Deng Y, Xue D et al. Structure-informed language models are protein designers. bioRxiv, 2023, preprint: not peer reviewed.