GigaScience. 2026 Jan 24;15:giag010. doi: 10.1093/gigascience/giag010

ViralBindPredict: empowering viral protein–ligand binding sites through deep learning and protein sequence-derived insights

A M B Amorim 1,2,3,4, C Marques-Pereira 5,6, T Almeida 7, N Rosário-Ferreira 8,9, H S Pinto 10,11, C Vaz 12,13, A Francisco 14,15, I S Moreira 16,17,18
PMCID: PMC13014472  PMID: 41578956

Abstract

Background

The development of a single therapeutic compound can exceed 1.8 billion USD and take more than a decade, underscoring the urgent need to accelerate drug discovery. Computational methods have become indispensable; however, traditional approaches, such as docking simulations, face limitations because they depend on protein and ligand structures that may be unavailable, incomplete, or of low accuracy. Even recent breakthroughs, such as AlphaFold, do not consistently provide models precise enough to identify ligand-binding sites or drug–target interactions.

Results

We present ViralBindPredict, a deep learning framework that predicts viral protein–ligand binding sites directly from sequence. We also introduce the first curated large-scale benchmark of viral protein–ligand interactions, comprising >10,000 viral chains and ≈13,000 interactions processed using a 4.5 Å heavy-atom contact threshold. ViralBindPredict combines Mordred ligand descriptors with contextual protein embeddings from ESM2 or ProtTrans, enabling structure-free learning of binding preferences. Leakage-controlled data splits were applied to prevent overlap across protein sequence clusters and ligand scaffolds (Cluster90%, NoRed90%→Cluster90%, Cluster40%, NoRed90%→Cluster40%). Across most regimes, multilayer perceptrons, especially with ESM2 embeddings, outperformed LightGBM baselines, maintaining strong precision–recall for unseen ligands but showing larger drops for unseen proteins, indicating that the protein context dominates generalization.

Conclusions

ViralBindPredict introduces the first leakage-controlled benchmark for viral protein–ligand interactions and demonstrates accurate ligand-binding residue prediction directly from protein sequence. Together, these advances establish ViralBindPredict as a robust and extensible workflow for sequence-based antiviral discovery, supporting rapid target prioritization, compound repurposing, and de novo drug design, even in the absence of structural data.

Keywords: viral drug discovery, viral drug–target interactions, viral ligand binding site, deep learning, supervised learning, neural networks

Introduction

Viral structural proteins are increasingly recognized as promising targets for antiviral drug development [1, 2] because of their central roles in the viral life cycle, including assembly, disassembly, and interactions with host proteins [3–6]. Targeting these proteins offers a significant advantage because they are unique to viruses and typically lack homologues in humans [2, 7], thereby reducing the risk of off-target effects. Additionally, many viral structural proteins form oligomers [7], providing an opportunity for therapeutic intervention by disrupting oligomerization, which prevents the formation of functional viral complexes and inhibits viral replication.

Despite these advantages, targeting viral structural proteins poses significant challenges. Unlike enzymes, which have well-defined active sites for substrate binding, structural proteins rely on interactions at macromolecular interfaces, which are more complex to target [2]. These interactions may also involve allosteric regulation, in which a ligand binds to one site and influences activity at a distant, functionally important site. The design of drugs targeting allosteric sites requires a deeper understanding of how these structural proteins interact with other viral or host proteins. For example, pleconaril, an antiviral drug, binds to the capsid of rhinovirus and poliovirus, blocking the release of viral ribonucleic acid (RNA) [8] (Fig. 1) and preventing infection. Similarly, bevirimat targets the Gag protein of HIV and inhibits viral maturation [9]. These examples highlight the potential for targeting viral structural proteins and emphasize the need for accurate identification of ligand-binding sites (LBS) to optimize drug design.

Figure 1.

Figure 1

Crystal structure of human rhinovirus 14 (HRV14) complexed with the antiviral compound pleconaril (PDB ID: 1NA1). The structure illustrates pleconaril bound to a hydrophobic pocket within Viral Protein 1 (VP1) of the HRV14 capsid. Images were rendered using Protein Imager [10].

Understanding LBS is crucial for evaluating the druggability of viral proteins. However, traditional drug discovery methods are costly and time-consuming and often limit the number of diseases and viral targets that can be studied. Drug discovery involves multiple stages, including the identification of unmet medical needs, validation of targets, screening of compounds, and optimization of leads. Preclinical and clinical testing adds further time and cost, making it imperative to streamline the process. In particular, the computational prediction of drug–target interactions (DTIs) and LBS has become increasingly important for accelerating drug discovery and reducing reliance on labor-intensive experimental methods.

Machine learning (ML) [11–13] and deep learning (DL) [14, 15] approaches have revolutionized the in silico predictions of DTIs and LBS [13, 16–20]. By learning from large datasets of experimentally validated interactions, these models can predict novel interactions based on the structural or sequence features of proteins and ligands. For DTI prediction, methods such as DTi2Vec [21], Drug3D-DTI [22], and DeepDTIs [13] have been previously developed. These methods leverage experimentally verified DTIs to extract structural elements, protein sequence features, and drug sequence features, which are then used to train models to predict new DTIs. However, negative labels (DTI pairs assumed not to interact) are typically inferred, which can lead to misclassification, even when experimental data on positive interactions are available. Most existing approaches for LBS prediction rely on structural data to identify potential binding sites. Classical ML-based tools, such as COACH-D [23] and FINDSITE [24], and DL models, such as DeepProSite [25], DeepSite [26], and PUResNet [27], have been developed. For example, PUResNet, a DL-based LBS predictor, uses structural data to treat proteins as three-dimensional (3D) grids to predict the binding sites [27]. PUResNet depends heavily on high-quality structural data and, like many other models, has limited applicability to viral proteins, for which structural information is often unavailable. Similarly, PLINDER, the largest annotated protein–ligand interaction (PLI) dataset to date, provides extensive annotations but also relies on structural data [28]. Although these models are powerful tools, their dependence on structural data can limit their usefulness for viral proteins, highlighting the need for sequence-based approaches to address the lack of high-resolution structural information.
Sequence-based models operate with simpler input formats, such as amino acid sequences and Simplified Molecular Input Line Entry System (SMILES), eliminating the need for costly and time-consuming experimental techniques, such as X-ray crystallography or cryo-EM, and reducing the computational time for large-scale PLI predictions. Their scalability and broad applicability make them essential for expanding in silico drug discovery beyond structurally characterized viral proteins [29, 30]. However, existing sequence-based methods, such as LIBRUS [31] and DeepCSeqSite [29], are not specifically designed for viral LBS prediction and lack the necessary specialization to address the key challenges unique to viral proteins. These challenges include high mutation rates, unconventional interaction patterns, and distinct structural motifs such as viral capsid folds [32, 33]. Consequently, these models may struggle to accurately capture viral-specific binding site features, leaving a significant gap in the field and highlighting the need for dedicated sequence-based approaches tailored to viral proteomes.

ViralBindPredict addresses this unmet need by providing a curated, large-scale viral-only PLI resource and sequence-based predictive models that identify ligand-binding residues in viral proteins without requiring experimentally solved 3D structures. Each residue–ligand pair was represented using contextual protein embeddings from large protein language models (PLMs) (ESM2 or ProtTrans) together with ligand physicochemical descriptors (Mordred), enabling the learning of viral binding preferences directly from the sequence and SMILES. We further introduced stringent, leakage-controlled data partitions, separating unseen ligands, unseen proteins, and fully novel protein–ligand pairs, with additional clustering at 90 and 40% sequence identities to obtain biologically meaningful generalization estimates. On this basis, we trained and evaluated multilayer perceptron (MLP) and gradient-boosted (LightGBM) classifiers to predict viral LBS at the residue resolution. ViralBindPredict models enable the rapid prioritization of druggable regions in viral proteins, support the repurposing of known chemotypes against new viral targets, and establish a benchmark for sequence-driven antiviral discovery in settings where structural information is sparse or unavailable. Rapid analysis of newly sequenced viral genomes supports the development of targeted therapeutics against emerging threats. ViralBindPredict models are publicly available on GitHub.

Methods

Dataset curation

To construct a comprehensive dataset of viral protein structures, entries from the Protein Data Bank (PDB) [34] were curated as of 18 June 2025. A total of 23,292 entries were initially selected based on taxonomic classifications spanning the major viral realms and clades: Riboviria, Duplodnaviria, Varidnaviria, Monodnaviria, Ribozyviria, Adnaviria, and Singelaviria, as well as unclassified viruses, including those of uncertain placement (incertae sedis) and bacterial viruses. Each structure was programmatically screened using polymer entity annotations to confirm the presence of at least 1 protein chain, resulting in 23,173 PDB structures being retained for further analysis. For each of these, the corresponding Macromolecular Crystallographic Information File (mmCIF) was downloaded, with this format prioritized over the legacy PDB format owing to its superior standardization and greater support for macromolecular complexity. Importantly, mmCIF accommodates extended ligand identifiers (e.g., 5-character codes), which are increasingly common in viral drug discovery but are no longer supported by the traditional PDB format [35]. To ensure accurate referencing and prevent duplication of identifiers (e.g., PDB chain IDs or ligand chain IDs), label identifiers from mmCIF files were used instead of author-assigned identifiers from legacy PDB files, as the former were standardized and unique within each structure, making them more suitable for systematic processing. In addition, chain-level metadata were extracted from the retained entries, including the source organism (to confirm viral origin), resolution (to assess structural quality), and experimental method (e.g., X-ray crystallography and cryo-electron microscopy). For structures containing multiple models, only the first model was retained to ensure consistency and to avoid redundancy.

Protein and ligand preprocessing

Entries containing nucleic acids (deoxyribonucleic acid [DNA] or RNA) or other non-protein polymers were excluded, as the focus was on viral PLIs [36]. Such macromolecules cannot be consistently mapped to canonical amino acids and would introduce structural and functional heterogeneity, whereas DNA/RNA–ligand interactions represent a distinct problem outside the scope of this dataset. For the remaining protein entries, non-standard amino acids were mapped to their canonical equivalents using the _pdbx_struct_mod_residue record in the mmCIF files [36, 37]. To achieve this, modified residues, often represented as HETATM entries, were reclassified as ATOM and substituted with their parent residues. For example, MSE was replaced with MET while preserving the original atom coordinates and sequence position. This procedure ensured that the modified residues were included as part of the protein sequence rather than misclassified as ligands. All modified residues with a defined parent were converted accordingly, whereas chains containing ambiguous residues or undefined amino acids (e.g., UNK) were excluded.
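The residue reclassification step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: in the real pipeline the modified-to-parent mapping is read from the `_pdbx_struct_mod_residue` record of each mmCIF file, whereas here a small mapping table and a dict-based record layout are hard-coded for clarity.

```python
# Hypothetical sketch of non-standard residue remapping (e.g., MSE -> MET).
# The mapping table below is illustrative; real values come from the
# _pdbx_struct_mod_residue record of the mmCIF file.
PARENT_RESIDUE = {"MSE": "MET", "SEP": "SER", "TPO": "THR"}

def remap_residue(record):
    """Reclassify a modified residue from HETATM to ATOM, substituting its
    parent residue name while preserving coordinates and position."""
    name = record["res_name"]
    if record["group"] == "HETATM" and name in PARENT_RESIDUE:
        record = dict(record, group="ATOM", res_name=PARENT_RESIDUE[name])
    return record

def chain_is_valid(residue_names):
    """Exclude chains containing ambiguous or undefined residues (e.g., UNK)
    that have no defined parent. The standard set is truncated for brevity."""
    standard = {"MET", "SER", "THR", "GLY", "ALA"}
    return all(r in standard or r in PARENT_RESIDUE for r in residue_names)

rec = {"group": "HETATM", "res_name": "MSE", "x": 1.0, "y": 2.0, "z": 3.0}
fixed = remap_residue(rec)  # becomes an ATOM record named MET
```

After remapping, the residue is treated as part of the protein sequence rather than as a ligand, which is exactly why this step must precede ligand extraction.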

For each valid chain, the protein sequence was extracted directly from the 3D coordinates, and a chain-specific mmCIF file was generated containing only relevant ATOM records and associated HETATM ligands (excluding the water HOH). This initial filtering resulted in 133,504 chains. A length filter was applied, retaining only protein chains with more than 30 amino acids, as shorter chains are often fragments or peptides lacking stability and key residues required for accurate PLI analysis [37, 38]. Additionally, only chains associated with at least 1 ligand in the HETATM records were retained, as chains without ligands do not contribute to PLI data. This reduced the dataset to 74,578 chains. To ensure viral origin, chain source annotations were screened using viral-specific keywords (e.g., virus, phage, sars, hiv, ebola, coronavirus, viridae, virinae, viricetes, h1n1, h3n2, h5n1, h6n1, h7n9, h17n10, htlv, hbv, virulent, prrsv, viral, betacov). Moreover, entries annotated as “unknown” were manually curated by the authors. Chains exclusively annotated as viral were retained, whereas non-viral and chimeric chains with mixed (viral and non-viral) origins were excluded, resulting in 46,844 viral chains being retained.
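The viral-origin screen can be sketched as a keyword match over each chain's source annotations. The keyword list is abridged from the one given in the text, and the function name and three-way return value are illustrative conventions of ours, not the authors' code.

```python
# Hypothetical sketch of the viral-origin screen. A chain is retained only
# if all its source annotations match a viral keyword; chains with mixed
# viral/non-viral annotations are treated as chimeric and excluded.
VIRAL_KEYWORDS = ("virus", "phage", "sars", "hiv", "ebola", "coronavirus",
                  "viridae", "virinae", "h1n1", "viral", "betacov")

def classify_chain(source_annotations):
    """Return 'viral', 'non-viral', or 'chimeric' for a chain given the
    list of source-organism strings annotated on it."""
    flags = [any(k in s.lower() for k in VIRAL_KEYWORDS)
             for s in source_annotations]
    if all(flags):
        return "viral"
    if any(flags):
        return "chimeric"  # mixed origins -> excluded from the dataset
    return "non-viral"
```

Entries annotated as "unknown" would fall through this screen and, as the text notes, were resolved by manual curation rather than by any automatic rule.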

Each retained viral chain was paired with its associated ligand to define the initial set of protein–ligand complexes used for downstream analysis. Each PDB entry may be associated with multiple ligand IDs, with the same ligand ID potentially appearing at multiple binding sites. For clarity, we define the ligand identifier alone as LIG_base (e.g., ATP) and the site-specific identifier, which includes the ligand chain ID, as LIG_full (e.g., ATP_A, ATP_B). Ligand information corresponding to 5,678 unique LIG_base identifiers was retrieved from the PDB Chemical Component Dictionary (CCD) [39]. First, the ligands were cross-referenced against curated exclusion lists (BioLiP2 [36], Q-BioLiP [37], PLIC [40], and an internal exclusion set), which captured common artifacts, such as metal ions, covalently bound ligands, and crystallization agents. To refine the dataset, the ligands were further filtered by type according to their classification in the PDB CCD. For _chem_comp.pdbx_type, only HETAIN (inhibitors), HETAD (drugs), HETAC (coenzymes), and ATOMS (sugars) were retained. Inhibitors and drugs represent the most pharmacologically meaningful compounds, whereas coenzymes and sugars were retained because of their critical biological roles in protein function. Other categories, such as ATOMN (nucleic acids), ATOMP (proteins), HETAI (ions), HETAS (solvents), and HETIC (ion complexes), were excluded because our dataset specifically targeted biologically relevant molecules in viral PLIs. For _chem_comp.type, only non-polymer, peptide-like, d-saccharide, l-saccharide, and saccharide entries were retained [41]. These categories include pharmacologically relevant compounds, peptide-like inhibitors, and biologically important sugars. 
Other types (e.g., peptide linking groups, termini, DNA, and RNA) were excluded because they represent capping groups or non-standard forms of residues that had already been converted to standard amino acids during preprocessing, or because they fall outside the scope of this protein–ligand dataset. Entries with missing _chem_comp.pdbx_type values ("?") were retained only if their _chem_comp.type matched one of the accepted categories. These cases were flagged for manual review and incorporated when they were judged to be biologically relevant. In addition, ligands containing non-relevant keywords (e.g., ion, metal, solvent, water, buffer, salt, terminus, linking, RNA, and DNA) were flagged and excluded.
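The type-based ligand whitelist described above can be sketched as a simple predicate. The function name and argument layout are ours; the whitelists mirror the `_chem_comp.pdbx_type` and `_chem_comp.type` categories named in the text, and the keyword check is applied to the ligand name as a simplification.

```python
# Hypothetical sketch of the ligand type filter.
ALLOWED_PDBX_TYPE = {"HETAIN", "HETAD", "HETAC", "ATOMS"}
ALLOWED_CHEM_TYPE = {"non-polymer", "peptide-like", "d-saccharide",
                     "l-saccharide", "saccharide"}
FORBIDDEN_TERMS = ("ion", "metal", "solvent", "water", "buffer", "salt",
                   "terminus", "linking", "rna", "dna")

def keep_ligand(pdbx_type, chem_type, name):
    """Retain a ligand only if its CCD type annotations fall in the
    accepted categories and its name carries no non-relevant keyword.
    A missing pdbx_type ('?') falls back to the chem_type check alone
    (such cases were flagged for manual review in the real pipeline)."""
    if any(t in name.lower() for t in FORBIDDEN_TERMS):
        return False
    if pdbx_type == "?":
        return chem_type.lower() in ALLOWED_CHEM_TYPE
    return (pdbx_type in ALLOWED_PDBX_TYPE
            and chem_type.lower() in ALLOWED_CHEM_TYPE)
```

In the actual workflow this filter runs after the exclusion-list cross-referencing, so artifacts caught by BioLiP2, Q-BioLiP, or PLIC never reach it.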

Second, canonical SMILES were extracted from ligand CCD records, prioritizing OpenEye SMILES and using CACTVS SMILES as a fallback. All SMILES were standardized and canonicalized using RDKit, preserving isomerism and stereochemistry and ensuring consistency with downstream cheminformatics workflows (e.g., Mordred descriptors). Since InChI [42] encodes molecular connectivity, stereochemistry, and tautomeric states deterministically, it should correspond uniquely to a single canonical SMILES. Therefore, ligands with conflicts (identical InChI but different standardized canonical SMILES, or identical canonical SMILES but different InChIs) were eliminated. Invalid SMILES were also excluded. Finally, ligands containing uncommon atoms (e.g., metals such as Al, Fe, and Tb) and ligands with fewer than 4 heavy atoms were discarded, following published guidelines [43]. These filters removed rare or very small molecules that were poorly represented in the available data and were not the focus of typical PLI studies. Retaining only molecules composed of common bio-relevant atoms (H, C, N, O, F, P, S, Cl, Br, and I) ensured consistency and avoided data sparsity, which would hinder model learning.
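The InChI/SMILES consistency check can be sketched as a grouping pass over precomputed identifier strings. This assumes the InChI and canonical SMILES have already been produced by RDKit standardization; the function name and input layout are illustrative.

```python
# Hypothetical sketch of the one-to-one InChI <-> canonical SMILES check.
# Any ligand whose InChI maps to more than one SMILES (or vice versa)
# is flagged as conflicting and eliminated.
from collections import defaultdict

def find_conflicts(ligands):
    """ligands: dict mapping LIG_base -> (inchi, canonical_smiles).
    Returns the set of ligand IDs that violate the expected one-to-one
    correspondence between InChI and standardized canonical SMILES."""
    by_inchi = defaultdict(set)
    by_smiles = defaultdict(set)
    for inchi, smi in ligands.values():
        by_inchi[inchi].add(smi)
        by_smiles[smi].add(inchi)
    return {
        lig for lig, (inchi, smi) in ligands.items()
        if len(by_inchi[inchi]) > 1 or len(by_smiles[smi]) > 1
    }
```

Both members of a conflicting pair are removed, since there is no principled way to decide which representation is the correct one.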

In total, 706 ligands were excluded at this stage: 226 due to disallowed chemical types, 41 containing non-biorelevant atoms, 3 with fewer than 4 heavy atoms, 3 flagged by forbidden terms, 15 with invalid SMILES, 328 from curated exclusion lists, 12 with unknown or disallowed types, and 78 with conflicting InChI and canonical SMILES. After filtering and standardization, 4,972 chemically valid and biologically relevant ligands (LIG_base) were retained, yielding a dataset of 7,716 unique PDB entries and 16,665 protein chains.

Interaction class construction

To identify the protein residues involved in ligand binding, atom–atom distance matrices were computed between all heavy atoms of the protein (ATOM) and ligand (HETATM). Hydrogen atoms were excluded because their positions are often poorly resolved in crystallographic structures, which could introduce uncertainty in the distance calculations. A 4.5 Å minimum heavy atom-to-heavy atom distance cutoff was used: residues with at least 1 non-hydrogen atom within this range of a ligand heavy atom were classified as interacting (1), and all others were classified as non-interacting (0). This threshold provided a balance between sensitivity and specificity, capturing close non-covalent contacts, such as van der Waals contacts, while minimizing noise from distant or non-relevant atoms [40, 44–50]. Each protein residue was assigned to 1 of 2 classes, and for chains associated with multiple ligands, each chain–ligand pair was treated as an independent instance (Fig. 2). For reference, Interactions_base denotes entries in the simplified format PDB_ID:PDB_CHAIN:LIG_base (ignoring the ligand site) (e.g., 1gx6:A:UTP). Conversely, interactions were reported as Interactions_full in the format PDB_ID:PDB_CHAIN:LIG_full (e.g., 1gx6:A:UTP_B). Applying this procedure yielded 13,725 protein chains and 17,916 Interactions_full with at least 1 interacting residue. To ensure structural quality, only interactions from chains with an experimental resolution ≤5 Å were retained, reducing the dataset to 17,712 Interactions_full. Finally, low-information contacts were excluded, specifically those with fewer than 2 interacting residues or in which all interacting residues were consecutive in sequence, because biologically relevant ligands are typically stabilized by networks of non-adjacent residues distributed across the protein surface [36, 51]. After these refinements, 15,695 Interactions_full were retained and 2,017 were excluded.
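The class construction above reduces to a minimum heavy-atom distance test per residue, plus the low-information filter on the resulting interaction pattern. The sketch below uses plain coordinate tuples and our own function names; the real pipeline operates on parsed mmCIF atom records.

```python
# Hypothetical sketch of binary class construction at a 4.5 Å heavy-atom
# cutoff, followed by the low-information interaction filter.
import math

CUTOFF = 4.5  # minimum heavy-atom-to-heavy-atom distance threshold (Å)

def interacting_residues(protein_atoms, ligand_atoms, cutoff=CUTOFF):
    """protein_atoms: dict residue_id -> list of (x, y, z) heavy-atom
    coordinates; ligand_atoms: list of ligand heavy-atom coordinates.
    A residue is class 1 if at least one of its heavy atoms lies within
    `cutoff` of any ligand heavy atom, else class 0. Hydrogens are
    excluded upstream and never appear in these lists."""
    return {
        res_id: 1 if any(math.dist(a, b) <= cutoff
                         for a in atoms for b in ligand_atoms) else 0
        for res_id, atoms in protein_atoms.items()
    }

def is_informative(positions):
    """Exclude contacts with fewer than 2 interacting residues, or whose
    interacting residues are all consecutive in sequence."""
    pos = sorted(positions)
    if len(pos) < 2:
        return False
    return any(b - a > 1 for a, b in zip(pos, pos[1:]))
```

For chains bound to several ligands, this test would simply be repeated per chain–ligand pair, matching the independent-instance treatment described in the text.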

Figure 2.

Figure 2

Example of class construction for Interaction_full 4CPM:A:G39_H. Chain A (from Influenza B virus) interacts with the compound of interest, G39 (ligand chain H), oseltamivir acid (DrugBank ID: DB02600). The distances between all ligand heavy atoms and protein chain residue heavy atoms were then calculated. The complex class was defined based on the protein residue–ligand atom distance. For instance, in this complex, the NH2 atom of arginine at position 116 interacts with the ligand, with several atoms (shown in the left matrix) at distances shorter than 4.5 Å. Consequently, ARG116 is classified as interacting because at least one of its heavy atoms contacts the ligand (right table). The image was rendered using Protein Imager [10] and BioRender [52].

Feature representation

To describe each compound of interest, we employed Mordred descriptors derived from the SMILES of the compounds. Mordred [53] was selected because of its comprehensive coverage of ligand properties and its extensive use in DTI studies. A total of 1,613 distinct features were extracted for each compound, encompassing a broad range of two-dimensional (2D) descriptors. These include atom and bond counts, acidic and basic group counts, hydrogen bond donor and acceptor counts, molecular weight, rotatable bond counts, and topological indices, such as connectivity and charge-related descriptors. These features capture the fundamental molecular characteristics that influence ligand–protein binding. For example, hydrogen bond acceptors and donors, van der Waals volume, and topological polar surface area (TPSA) are critical determinants of ligand–protein interactions [35].

For each protein chain residue, DL-based embeddings were extracted using ESM2 and ProtTrans software. Two independent datasets were constructed: one combining Mordred molecular descriptors with ESM2 embeddings (Mordred + ESM2) and another combining Mordred descriptors with ProtTrans embeddings (Mordred + ProtTrans). This design enabled a direct comparison of the representations that provided superior performances in downstream predictive tasks.

Both ESM2 and ProtTrans are large-scale PLMs trained on millions to billions of sequences designed to capture contextual, physicochemical, and evolutionary information directly from amino acid sequences. ESM2 [54] generated 1,280-dimensional residue-level embeddings by processing each sequence through a transformer model trained on the UniRef50 database. Each residue was represented by a vector encoding both local biochemical properties and global sequence context, capturing patterns linked to secondary structure, long-range dependencies, and potential interaction motifs. It learns implicit evolutionary and structural constraints without requiring multiple sequence alignments, providing rich residue-level representations that approximate features typically derived from homology or experimental data.

ProtTrans [55], based on transformer architectures such as BERT and T5 and trained on over 2.1 billion sequences, provided 1,024-dimensional residue-level embeddings. These embeddings captured long-range sequence dependencies and context-aware amino acid representations that reflected folding tendencies, solvent accessibility, intrinsic disorder, and interaction propensities. Similar to ESM2, ProtTrans embeddings are alignment-free and highly scalable, making them particularly suitable for large-scale viral proteome studies and high-throughput applications.

To ensure feature quality, all NaN, infinite, and zero-variance features were removed, as they provided no informative contribution to model learning [56]. From the initial 1,613 Mordred ligand descriptors, 1,079 informative features were retained, while all protein embedding dimensions were preserved. The full Mordred + ESM2 and Mordred + ProtTrans datasets consisted of 2,359 and 2,103 descriptors, respectively. A reduced dataset (at 90% sequence similarity) was later generated (see the “Dataset splitting design” section), in which 1,073 of the original 1,613 Mordred features were retained.
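The feature quality control step can be sketched as a column-wise pass over the descriptor matrix. This pure-Python version is illustrative; a production pipeline would more likely use a NumPy or pandas equivalent.

```python
# Hypothetical sketch of the feature QC step: drop any descriptor column
# containing NaN or infinite values, or with zero variance (constant).
import math

def clean_features(matrix, names):
    """matrix: list of rows (one per residue-ligand sample);
    names: descriptor names, one per column.
    Returns the cleaned matrix and the names of retained columns."""
    keep = []
    for j in range(len(names)):
        col = [row[j] for row in matrix]
        if any(math.isnan(v) or math.isinf(v) for v in col):
            continue  # uninformative: undefined values
        if max(col) == min(col):
            continue  # uninformative: zero variance
        keep.append(j)
    cleaned = [[row[j] for j in keep] for row in matrix]
    return cleaned, [names[j] for j in keep]
```

Applied to the Mordred block this kind of pass would shrink 1,613 descriptors to the 1,079 informative ones reported above, while the PLM embedding dimensions, being dense and variable, pass through unchanged.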

Dataset deduplication and quality control

The PDB is highly redundant, with multiple entries representing identical or near-identical protein–ligand complexes [57]. To remove this redundancy, hierarchical deduplication was performed at both the intra-PDB and inter-PDB levels using the 15,695 retained Interactions_full instances. Interaction patterns, defined by the residues and positions involved in ligand binding, were used as the basis for comparison. First, intra-PDB deduplication was applied to chains in the same PDB entry. When 2 chains shared the same sequence, ligand ID, and ligand chain, only 1 interaction was retained if the residue interaction pattern was identical, whereas both were retained if the patterns differed. Similarly, for chains with the same sequence and ligand ID but different ligand chains, only 1 interaction was retained when the residue patterns were identical, whereas both interactions were preserved if they differed. If either the sequence or ligand ID varied, all interactions were retained. Inter-PDB deduplication was then performed across different PDB entries. When 2 chains had the same sequence and ligand ID, regardless of the ligand chain, only 1 interaction was retained if the residue interaction pattern was identical, with preference given to the structure of the highest quality (i.e., the numerically lowest resolution value), whereas both were preserved if the patterns differed. As before, when either the sequence or ligand ID varied, all interactions were retained. In total, 2,484 redundant Interactions_full were removed during intra-PDB deduplication, and a further 518 were removed during inter-PDB deduplication.
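The core of the deduplication logic can be sketched as a keyed pass that collapses interactions sharing the same sequence, ligand, and residue pattern. This simplification merges the intra- and inter-PDB passes into one step and uses our own record layout; the real pipeline additionally distinguishes ligand chains within a PDB entry.

```python
# Hypothetical sketch of interaction deduplication. Interactions with the
# same (sequence, ligand, interaction pattern) collapse to one instance,
# preferring the best (numerically lowest) resolution; different patterns
# are all retained as distinct binding sites.
def deduplicate(interactions):
    """interactions: list of dicts with keys 'sequence', 'lig_base',
    'pattern' (frozenset of interacting residue positions), and
    'resolution' (Å). Returns the deduplicated list."""
    best = {}
    for it in interactions:
        key = (it["sequence"], it["lig_base"], it["pattern"])
        if key not in best or it["resolution"] < best[key]["resolution"]:
            best[key] = it  # keep the higher-quality structure
    return list(best.values())
```

Because the pattern is part of the key, two chains binding the same ligand at different sites survive deduplication as independent instances, which is the behavior the text describes.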

The final dataset comprised 10,933 unique PDB chains, corresponding to 7,401 unique PDB entries. At the ligand level, there were 4,914 LIG_base identifiers. Overall, the dataset included 12,693 unique Interactions_full, encompassing a total of 173,320 interacting residues and 3,617,599 non-interacting residues. The dataset exhibited an ~1:21 class imbalance (4.57% interacting residues vs. 95.43% non-interacting residues), a characteristic pattern in biological datasets in which ligand-binding residues are naturally rare (see the “Results” and “ViralBindPredict Full and ViralBindPredict NoRed dataset” sections). Despite this imbalance, the large number of interacting samples ensured a robust foundation for predictive modeling, enabling the reliable identification of LBS. The overall workflow for constructing the full ViralBindPredict database is illustrated in Fig. 3. This database will hereafter be referred to as ViralBindPredict Full, as a non-redundant version clustered at 90% sequence identity was subsequently generated and will be referred to as ViralBindPredict NoRed (as explained in the next section).

Figure 3.

Figure 3

Workflow for constructing the ViralBindPredict Full database. The ViralBindPredict workflow comprises 6 sequential steps: (1) retrieval of viral structures from the Protein Data Bank (PDB); (2) conversion of non-standard amino acids to their standard counterparts and filtering of protein chains to retain only viral chains composed of standard amino acids and longer than 30 residues; (3) ligand curation to remove ions, metals, artifacts, uncommon atoms, conflicts, and invalid SMILES, retaining 4,972 valid LIG_base entries; (4) construction of binary interaction classes based on interatomic distances (≤4.5 Å = interacting) and exclusion of non-biologically relevant interaction patterns; (5) extraction of ligand and protein features using Mordred, ESM2, and ProtTrans; and (6) deduplication of redundant protein–ligand pairs (intra-PDB and inter-PDB) to yield the final curated ViralBindPredict Full dataset. The image was rendered using BioRender [52].

Dataset splitting design

Five dataset splitting strategies were evaluated to assess generalization under different splitting and redundancy assumptions: (i) an original standard split (BASE), used as a baseline; (ii) 90% clustering combined with scaffold filtering (Cluster90%); (iii) 40% clustering combined with scaffold filtering (Cluster40%); (iv) redundancy removal at 90% followed by 90% clustering and scaffold filtering (NoRed90%→Cluster90%); and (v) redundancy removal at 90% followed by 40% clustering and scaffold filtering (NoRed90%→Cluster40%) (Fig. 4). The BASE split served as a performance reference (see Supplementary Section 1, Supplementary Tables S1 and S2), where no sequence redundancy removal or clustering was performed. This split preserves natural data redundancy (similar protein sequences and ligand scaffolds may occur across sets) and was used to assess the overfitting risk and contrast with stricter splits.

Figure 4.

Figure 4

Overview of dataset clustering and splitting logic for ViralBindPredict. The ViralBindPredict Full dataset comprised 10,933 PDB chains (3,778 unique sequences). Protein chains were clustered using MMseqs2 at 2 sequence identity thresholds of 40% (80% coverage) and 90% (80% coverage), resulting in 393 and 612 clusters, respectively. The ViralBindPredict NoRed dataset was generated by selecting 1 representative sequence per 90% cluster and retaining all chains corresponding to those sequences, yielding 1,494 PDB chains (612 sequences). The NoRed90% dataset was subsequently reclustered at 40% sequence identity, producing 394 clusters for downstream analysis. The lower panel illustrates the splitting logic. Interactions were grouped by PDB:SEQ:LIG_full identifiers and assigned to 3 disjoint conditions: blind protein, blind ligand, and blind sets, representing protein-, ligand-, or combined-level blind splits, respectively. In the blind protein set condition, 10% of the clusters were randomly sampled, and all chains from the corresponding PDBs were included to ensure complete PDB-level separation. In the blind ligand set condition, 10% of the Bemis–Murcko (BM) scaffolds were sampled to create ligand-disjoint splits. The blind set condition included interactions involving both the blind proteins and blind ligands. The remaining interactions were divided into training, validation, and testing subsets after excluding the blind proteins and ligands and identifying the collateral overlaps. The image was rendered using BioRender [52].

Viral proteins frequently appear as near-identical homologues arising from strain diversification, adaptive mutations, or engineered constructs of the same protein [58, 59]. Although they share a conserved global fold, they differ in the surface-level determinants of binding, such as electrostatics, hydrophobicity, and loop flexibility, which govern ligand recognition and host interaction [60]. This intrafamily diversity is relevant to antiviral discovery, which often aims to predict interactions across variants within the same or closely related viral families rather than across unrelated viruses [61, 62]. In the ViralBindPredict full dataset, ~28% of all PDB chains originated from a single organism (SARS-CoV-2), illustrating the high redundancy of viral proteomes. A 90% sequence identity threshold thus represents a realistic evolutionary boundary [30, 63, 64]. It removes trivial duplicates that could cause information leakage while mostly preserving biologically relevant variation across and within viral families, allowing models to learn transferable interaction principles across closely related strains.

Accordingly, the Cluster90% and NoRed90%→Cluster90% regimes provide the most interpretable and robust evaluation settings; they prevent sequence and scaffold leakage, retain meaningful evolutionary continuity, maintain training diversity, and align with the generalization range of PLMs [65]. All 10,933 PDB:Chain entries (corresponding to 3,778 unique sequences) were clustered using MMseqs2 [66, 67] at 90% sequence identity and 80% coverage, yielding 612 clusters (1–1,862 chains each, average 17) [68, 69]. To rigorously assess model generalization, 3 additional partitions were created alongside the standard training, validation, and testing sets: the blind protein set, the blind ligand set, and the blind set. Throughout this process, the integrity of the interaction clusters (PDB:SEQ:LIG_full) was preserved (Supplementary Table S2). All chains and interactions belonging to the same interaction cluster were always assigned to the same split so that no interaction cluster was fragmented across the training, validation, testing, blind protein, blind ligand, and blind sets. This enforced independence across splits and prevented the leakage of identical protein–ligand binding sites. The blind protein set captures interactions involving proteins that are not observed during training. To construct this set, 10% of the MMseqs2 clusters (61 of 612) were randomly chosen. All chains belonging to the sampled clusters were initially marked as blind proteins (not observed during training). To guarantee complete PDB-level independence, this selection was expanded to include all chains from any PDB ID entry that contained at least 1 blind protein.

In parallel, the blind ligand set represents interactions involving unknown ligand chemotypes. Ligands were grouped by exact Bemis–Murcko (BM) scaffolds derived from canonical SMILES using RDKit. Of the 4,914 LIG_base observed in the dataset, 4,833 were assigned a valid BM scaffold, yielding 2,652 unique scaffolds. We randomly sampled 10% of these scaffolds (265 scaffolds), which corresponded to 414 LIG_base. These LIG_base were marked as blind ligands. Ligands without a valid scaffold were not eligible for scaffold-based blinding but were retained in the remaining splits to avoid the discarding of chemically relevant interactions. The blind set represents the strictest scenario, containing interactions in which both partners were unseen: a protein chain from the blind protein pool and a ligand belonging to the blind ligand scaffold pool. Interactions in which both criteria were met were assigned to the blind set (93 Interactions_full). Interactions that met only one of these criteria were assigned to either the blind protein set (1,453 Interactions_full; blind proteins) or the blind ligand set (1,031 Interactions_full; blind ligands). After removing these, 10,116 Interactions_full remained as candidates for training, validation, and testing.
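A minimal sketch of the scaffold-based blinding step, assuming Bemis–Murcko scaffolds have already been computed (in the paper, with RDKit's MurckoScaffold on canonical SMILES); the ligand IDs, scaffold strings, and `sample_blind_scaffolds` helper are illustrative:

```python
import random
from collections import defaultdict

def sample_blind_scaffolds(ligand_to_scaffold, fraction=0.10, seed=42):
    """Sample a fraction of unique BM scaffolds; any ligand whose
    scaffold falls in the sampled pool becomes a blind ligand.

    `ligand_to_scaffold` maps LIG_base IDs to scaffold SMILES; None
    marks ligands without a valid scaffold, which are never blinded
    but stay eligible for the remaining splits.
    """
    by_scaffold = defaultdict(set)
    for lig, scaf in ligand_to_scaffold.items():
        if scaf is not None:
            by_scaffold[scaf].add(lig)
    scaffolds = sorted(by_scaffold)
    rng = random.Random(seed)
    n_blind = max(1, round(fraction * len(scaffolds)))
    blind_scaffolds = set(rng.sample(scaffolds, n_blind))
    blind_ligands = set().union(*(by_scaffold[s] for s in blind_scaffolds))
    return blind_scaffolds, blind_ligands

# Toy mapping: 3 valid scaffolds over 5 ligands; "XYZ" has no scaffold
lig2scaf = {"ATP": "c1ccccc1", "GTP": "c1ccccc1", "UTP": "C1CCNCC1",
            "XYZ": None, "ABC": "c1ccncc1"}
blind_scafs, blind_ligs = sample_blind_scaffolds(lig2scaf, fraction=0.34)
```

Sampling at the scaffold level (rather than the ligand level) is what prevents two ligands sharing a scaffold from straddling the training/blind boundary.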

Before splitting the 10,116 Interactions_full, we performed 2 safety passes to prevent leakage. First, any interaction whose protein chain belonged to a blind MMseqs2 cluster (or to a PDB ID already assigned to a blind protein) was excluded from the training/validation pools. Second, any interaction in which a ligand shared a BM scaffold with a blind ligand was excluded from training/validation. Residual interactions that were not initially labeled as blind entities but still shared a PDB ID with a blind protein were labeled as collateral and subsequently reassigned to the testing set (170 Interactions_full). This collateral assignment ensured that no information from the blind proteins or ligands leaked back into the training/validation pools while preserving these interactions for evaluation. Consequently, no PDB ID contributing to the blind protein or blind sets appeared in the training or validation sets. After these exclusions, 9,946 Interactions_full remained as clean candidates for the analysis and were divided into training, validation, and testing sets at the interaction group level. We allowed biologically distinct binding sites of the same protein–ligand pair to appear in the training, validation, and testing sets. For example, the training set may include 1gx6:A:UTP_A, whereas the testing set includes 1gx6:A:UTP_B (Supplementary Table S2). These correspond to alternative binding sites of the same PDB chain and LIG_base, which are functionally distinct, and after deduplication, they no longer share identical interaction patterns. This produced final splits of 7,957 training, 995 validation, and 940 testing Interactions_full. Additionally, the 170 collateral interactions described above were incorporated into the testing set, resulting in 1,164 testing Interactions_full (Supplementary Tables S1 and S2).
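The two safety passes plus the collateral reassignment reduce to simple set logic. The triple-based record and helper below are a hypothetical simplification of the full PDB:SEQ:LIG_full bookkeeping:

```python
def leakage_safety_passes(interactions, blind_clusters, blind_pdbs,
                          blind_scaffolds):
    """Drop any training/validation candidate that touches a blind entity.

    Each interaction is a hypothetical (pdb_id, protein_cluster, scaffold)
    triple. Interactions sharing only a PDB ID with a blind protein are
    flagged as 'collateral' and routed to the testing set.
    """
    clean, collateral = [], []
    for inter in interactions:
        pdb_id, cluster, scaffold = inter
        if cluster in blind_clusters or scaffold in blind_scaffolds:
            continue                      # true blind member: excluded here
        if pdb_id in blind_pdbs:
            collateral.append(inter)      # reassigned to the testing set
        else:
            clean.append(inter)
    return clean, collateral

interactions = [("1abc", "c1", "s1"),   # blind cluster -> excluded
                ("1abc", "c2", "s3"),   # shares PDB with blind protein
                ("2xyz", "c3", "s2"),   # blind scaffold -> excluded
                ("3def", "c4", "s3")]   # clean candidate
clean, collateral = leakage_safety_passes(
    interactions, blind_clusters={"c1"}, blind_pdbs={"1abc"},
    blind_scaffolds={"s2"})
```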
By construction, no MMseqs2 protein clusters or PDB IDs assigned to blind protein or blind sets appeared in the training set, no BM scaffold assigned to blind ligand or blind sets appeared in training, and proteins and ligands in the blind sets were never observed in training under the same sequence cluster or scaffold definition (Supplementary Tables S1 and S2).

To examine the influence of redundancy, a non-redundant dataset (NoRed90%→Cluster90%) was generated by retaining only 1 representative sequence per MMseqs2 cluster at 90% identity and 80% sequence coverage (Fig. 4). One representative sequence was retained from each of the 612 clusters, and all PDB chains containing those sequences were included, yielding 1,494 PDB chains from 1,176 PDB IDs and 1,734 protein–ligand Interactions_full. This reduced dataset is referred to as ViralBindPredict NoRed. Among the 766 LIG_base, 43 (5.6%) lacked valid scaffolds, leaving 723 ligands with 461 unique BM scaffolds. Of these scaffolds, 46 (10%) were sampled, resulting in the identification of 70 blind ligands. The same splitting procedure described above was then applied, ensuring that the MMseqs2 protein clusters remained fully disjoint between the training and blind or blind protein sets. The resulting subsets contained 17 Interactions_full in the blind set, 112 in the blind protein set, and 174 in the blind ligand set, while the remaining 1,431 Interactions_full formed the training/validation/testing sets (1,145 training, 143 validation, and 143 testing) (Supplementary Table S1).
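A minimal sketch of this representative-selection step, assuming clusters are keyed by their MMseqs2 representative; note that pulling back every chain carrying a representative sequence is why 612 representatives expand to 1,494 chains. Identifiers and sequences are illustrative:

```python
def nonredundant_chains(clusters, seq_of_chain):
    """Keep the representative sequence of each MMseqs2 cluster, then
    include every PDB chain whose sequence matches a representative."""
    rep_seqs = {seq_of_chain[rep] for rep in clusters}
    return sorted(chain for chain, seq in seq_of_chain.items()
                  if seq in rep_seqs)

clusters = {"1abc_A": ["1abc_A", "2xyz_B"], "3def_C": ["3def_C"]}
seq_of_chain = {"1abc_A": "MKVT", "2xyz_B": "MKVT",
                "3def_C": "GGSA", "5jkl_A": "PPQR"}
chains = nonredundant_chains(clusters, seq_of_chain)
```

Here `2xyz_B` survives redundancy removal because it carries the same sequence as the representative `1abc_A`, while `5jkl_A` (no representative sequence) is dropped.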

Finally, to evaluate the limits of model generalization beyond remote homology and viral family boundaries, 2 additional splits, Cluster40% and NoRed90%→Cluster40%, were constructed at a stricter 40% identity threshold [29, 70, 71], shifting the task toward cross-family generalization. At ≤40% sequence identity, proteins may belong to distinct viral families with different folds and binding architectures, although highly divergent members of the same family may still be retained [72–74]. This makes the split biologically extreme because the viral protein space is inherently narrow and taxonomically biased, with most available structures derived from a few extensively studied pathogens, including SARS-CoV-2, HIV, influenza viruses, and bacteriophages. Two consequences follow. First, enforcing a 40% sequence identity cutoff further fragmented the data, reducing the number of MMseqs2 clusters from 612 at 90% identity to 393 at 40% identity (Fig. 4). Second, PLMs such as ESM2 and ProtTrans perform best when evolutionary continuity exists [65]; below 40% identity, homologous context and fold similarity diminish, forcing the model to extrapolate beyond its intended biological regime. Therefore, 40% identity splits simulate out-of-distribution scenarios (remote or different viral families) that are valuable for testing model limits. At this 40% threshold, the cluster sizes ranged from 1 to 2,049 chains (average = 27). The same splitting logic applied in Cluster90% was followed. A total of 2,652 unique BM scaffolds were identified, and 10% of these (265 scaffolds) were randomly sampled, resulting in the identification of 443 blind ligands. The resulting blind subsets contained 123 Interactions_full (both protein and ligand unseen), 1,437 blind protein interactions, and 883 blind ligand Interactions_full, leaving 10,250 Interactions_full as candidates for the training/validation/testing sets.
These were partitioned into 8,178 training, 1,023 validation, and 1,049 testing Interactions_full, including 28 collateral cases that were reassigned from the main pool to the testing set to maintain strict PDB-level isolation. For the NoRed90%→Cluster40% split, the already reduced ViralBindPredict NoRed dataset was used as input. Its 1,494 protein chain sequences were reclustered at 40% sequence identity using MMseqs2, yielding 394 clusters (1–188 chains each; average 3). Of these clusters, 10% (39 clusters) were randomly selected to define the blind protein set, from which the blind protein and blind subsets were derived, following the same logic as in the 90% regime. Among the 766 ligands, 723 (94.4%) possessed valid BM scaffolds, yielding 461 unique scaffolds. From these, 10% (46 scaffolds) were randomly sampled to produce 86 blind ligands. The resulting subsets contained 46 blind set Interactions_full (both protein and ligand unseen), 274 blind protein set Interactions_full, and 124 blind ligand Interactions_full, leaving 1,290 Interactions_full as candidates for supervised splitting. These were partitioned into 1,032 training, 129 validation, and 129 testing Interactions_full, with no collateral cases (Supplementary Tables S1 and S2). The positive and negative interaction distributions for all splitting strategies are summarized in Supplementary Table S3.

Standardization

The protein and ligand features were standardized to ensure comparable scales. This step was essential because ML models, particularly DL architectures, are sensitive to differences in the feature magnitude. Without standardization, features with larger numeric ranges can dominate the learning process, slow convergence, and reduce the generalization. The StandardScaler [75] was fitted exclusively to the training set, where the feature-wise means and standard deviations were calculated. No information from the validation, testing, or blind sets was used for fitting. The fitted scaler was consistently applied to all other subsets (validation, testing, blind protein, blind ligand, and blind sets).
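The train-only fitting discipline can be illustrated with a hand-rolled scaler; the paper uses scikit-learn's StandardScaler in exactly this fit-on-train, transform-everywhere fashion, and the zero-variance guard below is an added convenience:

```python
import math

def fit_scaler(train_rows):
    """Compute per-feature mean/std on the TRAINING set only."""
    n = len(train_rows)
    d = len(train_rows[0])
    means = [sum(r[j] for r in train_rows) / n for j in range(d)]
    # Population std; constant features fall back to 1.0 to avoid /0
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in train_rows) / n)
            or 1.0 for j in range(d)]
    return means, stds

def transform(rows, means, stds):
    """Apply the training-set statistics to any split (validation,
    testing, blind protein, blind ligand, blind) without refitting."""
    return [[(r[j] - means[j]) / stds[j] for j in range(len(r))]
            for r in rows]

train = [[1.0, 10.0], [3.0, 30.0]]
means, stds = fit_scaler(train)
val_scaled = transform([[2.0, 40.0]], means, stds)
```

Fitting on the training set alone is what keeps validation, testing, and blind statistics from leaking into the learned scaling.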

Model hyperparameter optimization

Two approaches were benchmarked: LightGBM (ML) implemented using the LightGBM package [76] and an MLP (DL) implemented using the PyTorch package [77]. LightGBM is well suited to tabular descriptors, such as Mordred features, whereas MLPs can leverage high-dimensional protein embeddings (ESM2, ProtTrans) together with ligand descriptors, capturing complex nonlinear interactions. Using both provided complementary ML and DL baselines and allowed us to disentangle improvements driven by the model architecture from those arising from the dataset design.

For each dataset splitting strategy and feature backbone (Mordred + ESM2 and Mordred + ProtTrans), the hyperparameters were optimized with Optuna [78] using Bayesian optimization over 200 trials. The optimization targeted the maximization of the F1-score on the validation set, which was selected as the tuning criterion given the inherent class imbalance of the dataset. F1 provides a balanced measure of precision and recall, directly assessing the model’s ability to recover true binders without inflating false positives. To further address class imbalance, the pos_weight parameter in the MLP and scale_pos_weight in LightGBM were tuned within the range of 1–25. The full hyperparameter configurations, parameter descriptions, and search spaces are presented in Supplementary Table S4. In addition, a targeted grid-based hyperparameter stability analysis was performed for the most realistic splitting strategies (Cluster90% and NoRed90%→Cluster90%) to explicitly assess the robustness of the Optuna-selected optimal hyperparameter configurations to systematic perturbations of core hyperparameters (Supplementary Section 2, Supplementary Figs. S1–S3, and Supplementary Table S5).
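A stdlib stand-in for the tuning loop, capturing the selection criterion (maximize validation F1) and the 1–25 pos_weight range; the real search uses Optuna's Bayesian (TPE) sampler, and the search space and toy objective below are purely illustrative:

```python
import random

def random_search(train_and_eval, n_trials=200, seed=0):
    """Pick the configuration maximizing validation F1.

    `train_and_eval` is a hypothetical callback that trains a model
    with `cfg` and returns its validation F1. pos_weight is sampled
    in [1, 25] to counter class imbalance, as in the paper.
    """
    rng = random.Random(seed)
    best_f1, best_cfg = -1.0, None
    for _ in range(n_trials):
        cfg = {
            "lr": 10 ** rng.uniform(-4, -2),
            "hidden": rng.choice([128, 256, 512]),
            "pos_weight": rng.uniform(1.0, 25.0),
        }
        f1 = train_and_eval(cfg)
        if f1 > best_f1:
            best_f1, best_cfg = f1, cfg
    return best_cfg, best_f1

# Toy objective that prefers pos_weight near 10 (illustrative only)
best_cfg, best_f1 = random_search(
    lambda cfg: 1.0 - abs(cfg["pos_weight"] - 10.0) / 25.0, n_trials=50)
```

Swapping the random sampler for Optuna's `study.optimize` changes how configurations are proposed, not the selection criterion.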

Model evaluation

The final models were retrained using the optimal hyperparameters identified during tuning and subsequently evaluated using independent testing, blind protein, blind ligand, and blind sets. The model performance was quantified using several complementary metrics: precision, recall, F1-score, Matthews correlation coefficient (MCC), and area under the precision–recall curve (AUC-PR). These metrics were selected because they provide a comprehensive assessment of performance under class imbalance, a characteristic feature of binding-site datasets, where interacting residues represent only a small fraction of all residues. Precision and recall capture the model’s ability to avoid false positives and recover true binding residues, whereas the F1-score summarizes their trade-off. The MCC was included as a balanced correlation measure that remained reliable even for skewed class distributions. The AUC-PR complements these metrics by evaluating the precision–recall behavior across all classification thresholds, offering a more informative measure than ROC-based metrics in imbalanced settings. Therefore, F1 and MCC are discussed in detail in the main text, whereas recall, precision, and AUC-PR are provided in the tables and SI for completeness.
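The threshold metrics reported here follow their standard confusion-matrix definitions, which can be sketched as:

```python
import math

def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and MCC from binary labels/predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

# Imbalanced toy example: 3 interacting residues among 10
m = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                   [1, 1, 0, 1, 0, 0, 0, 0, 0, 0])
```

Unlike accuracy, all four quantities involve the minority-class counts directly, which is why they remain informative at the ~4–5% positive rates of these splits (AUC-PR additionally sweeps the decision threshold).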

Ablation and feature permutation studies

Ablation experiments were performed using the top-performing models from the Cluster40%, Cluster90%, NoRed90%→Cluster90%, and NoRed90%→Cluster40% splitting strategies to evaluate the relative contributions of the protein and ligand feature sets to the predictive performance. Because the ViralBindPredict full and ViralBindPredict NoRed datasets are residue-wise, where each residue is paired with its local protein embedding and the corresponding ligand descriptors, complete ligand-only ablation does not yield meaningful insights. Removing protein features eliminates the residue-level context essential for identifying binding sites. Nevertheless, ligand descriptors may still influence predictions by helping the model distinguish between distinct binding sites within the same protein or across proteins that interact with chemically diverse ligands.

To quantify this contribution, complementary ablation experiments were performed by comparing full-feature models (protein + ligand) with protein-only variants. In the protein-only configuration, residue embeddings from ESM2 or ProtTrans, depending on the best-performing models under these splitting strategies, were employed without concatenated ligand features. The model performance was then re-evaluated on identical dataset splits using the same hyperparameters that were previously optimized for the full-feature configuration (Supplementary Table S4). This allowed us to assess whether the ligand descriptors contributed additional discriminative power beyond sequence-derived protein embeddings.
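Operationally, the protein-only ablation amounts to dropping the ligand block from each concatenated residue vector before re-evaluation; the dimensions below are illustrative (e.g., an ESM2 embedding followed by Mordred descriptors):

```python
def split_features(x, protein_dim):
    """Split one concatenated residue vector into its protein-embedding
    part and its ligand-descriptor part."""
    return x[:protein_dim], x[protein_dim:]

def protein_only(X, protein_dim):
    """Protein-only ablation: keep residue embeddings, drop the
    concatenated ligand descriptors."""
    return [x[:protein_dim] for x in X]

# Toy vectors: 3 protein dims + 2 ligand dims per residue
X = [[0.1, 0.2, 0.3, 9.0, 8.0], [0.4, 0.5, 0.6, 7.0, 6.0]]
Xp = protein_only(X, protein_dim=3)
```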

To interpret the learned residue–ligand interaction patterns and assess biological plausibility, model explainability analyses were performed on the top-performing MLP configurations for the Cluster90%, Cluster40%, NoRed90%→Cluster90%, and NoRed90%→Cluster40% splitting strategies using the testing, blind protein, blind ligand, and blind evaluation sets. Model interpretability was assessed using feature permutation importance, an architecture-agnostic approach that quantifies the contribution of each input feature to the model’s predictive performance. For each feature, its values were randomly shuffled across residues, and the resulting drop in the F1 score was measured relative to the baseline performance. Larger drops indicate greater feature relevance. Ten independent permutations were conducted per feature to ensure reliable estimates.
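The permutation procedure is architecture-agnostic and can be sketched generically; the toy model, score function, and data below are illustrative, and the paper measures the drop in F1 rather than accuracy:

```python
import random

def permutation_importance(predict, X, y, score, n_repeats=10, seed=0):
    """Mean drop in score when one feature column is shuffled.

    `predict` maps a feature matrix to predictions; a larger mean
    drop means the model leans more on that feature.
    """
    rng = random.Random(seed)
    base = score(y, predict(X))
    drops = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        total = 0.0
        for _ in range(n_repeats):
            shuffled = col[:]
            rng.shuffle(shuffled)
            Xp = [row[:j] + [v] + row[j + 1:]
                  for row, v in zip(X, shuffled)]
            total += base - score(y, predict(Xp))
        drops.append(total / n_repeats)
    return drops

# Toy model that uses only feature 0; accuracy as the score
X = [[0, 5], [1, 9], [0, 2], [1, 7], [0, 1], [1, 8]]
y = [0, 1, 0, 1, 0, 1]
acc = lambda yt, yp: sum(a == b for a, b in zip(yt, yp)) / len(yt)
drops = permutation_importance(lambda M: [r[0] for r in M], X, y, acc)
```

As expected, shuffling the unused second feature leaves the score untouched (drop of exactly zero), while shuffling the informative first feature degrades it.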

Results

ViralBindPredict Full and ViralBindPredict NoRed datasets

The ViralBindPredict Full and NoRed datasets are unique collections of viral information extracted from the PDB database, including viral proteins that interact with compounds of interest, such as peptides, saccharides, and nonpolymer compounds. The ViralBindPredict full database corresponds to the total dataset used for training and evaluating the BASE, Cluster90%, and Cluster40% splitting strategies. It comprises 12,693 protein chain–ligand complexes (Interactions_full) derived from 7,401 unique PDB entries, classifying protein chain residues as either interacting or non-interacting with their corresponding ligands. Across all complexes, 173,320 residues were annotated as interacting and 3,617,599 residues as non-interacting, yielding a total of 3,790,919 residue samples. These interactions involved 10,933 unique protein chains and 4,914 distinct ligands (LIG_base), providing a comprehensive and biologically diverse representation of viral protein–ligand interfaces (Fig. 5). Additionally, the ViralBindPredict NoRed dataset, used for the training and evaluation of the NoRed90%→Cluster90% and NoRed90%→Cluster40% regimes, was generated after 90% sequence redundancy removal using MMseqs2. It contains 1,176 unique proteins (PDB IDs), 1,494 protein chains, 766 unique LIG_base, and 1,734 Interactions_full, comprising a total of 534,683 non-interacting residues and 21,259 interacting residues.

Figure 5.


Overview of the ViralBindPredict Full dataset. The figure summarizes the key dataset statistics, including the number of PDB entries, unique LIG_base, protein chains, protein chain–ligand complexes (Interactions_full), and the total, positive, and negative residue samples used for model training. Additionally, the ViralBindPredict NoRed dataset, generated after 90% sequence redundancy removal using MMseqs2, contains 1,176 unique proteins (PDB IDs), 1,494 protein chains (PDB Chains), 766 unique ligands (LIG_base), and 1,734 protein–ligand complexes (Interactions_full), comprising 534,683 non-interacting residues and 21,259 interacting residues. The image was rendered using BioRender [52].

In the ViralBindPredict Full dataset, protein chains are predominantly short viral proteins, with 49.54% containing 0–250 amino acids and 35.77% containing 250–500 amino acids (Fig. 6A). The structural resolution was generally high, with 56.75% of entries resolved between 1 and 2 Å and 35.81% between 2 and 3 Å, reflecting the overall quality of structural data (Fig. 6B). Most PDB structures were determined using X-ray crystallography (94.03%), followed by electron microscopy (5.78%), while other experimental methods accounted for <0.2% of the entries (Fig. 6C). Protein–ligand complexes (Interactions_full) typically involve 9–17 interacting residues, ranging from 2 to 47 interacting residues per complex (Fig. 6D). Supplementary Fig. S4 shows the results for the ViralBindPredict NoRed dataset.

Figure 6.


Structural and compositional characteristics of the ViralBindPredict Full dataset. (A) Distribution of protein chain lengths. (B) Resolution distribution of PDB structures. (C) Experimental methods used for PDB structure determination. (D) Distribution of interacting residues per protein–ligand complex (Interactions_full), illustrating typical interaction interface sizes (median of 14). (E) Ligand chemical classifications from _chem_comp.pdbx_type (TYPE) and _chem_comp.type (CHEM_TYPE) for LIG_base. The category saccharide includes d-saccharide, l-saccharide, and saccharide. Entries labeled “?” had missing TYPE values but were retained because their CHEM_TYPE corresponded to an allowed class (verified after manual review).

The ligands were predominantly non-polymer compounds (~97%), with smaller fractions of peptide-like (1.8%) and saccharide (1.1%) molecules (Fig. 6E). To characterize their physicochemical diversity, we analyzed the atom-type composition, hydrogen-bonding capacity, molecular weight, TPSA, heavy-atom count, and ring count (Fig. 7). Carbon (70.2%) and oxygen (14.0%) were the most abundant atomic types, followed by nitrogen (11.9%). Phosphorus (1.3%), sulfur (1.2%), chlorine (0.8%), and other elements were rare, reflecting the predominantly organic and drug-like nature of these compounds (Fig. 7A). TPSA values showed a median of 83 Å², indicating moderate polarity and membrane permeability (Fig. 7B). The molecular weight values had a mean of 378.5 Da, with a minimum of 57 and a maximum of 2,133, which is consistent with small- to medium-sized bioactive molecules (Fig. 7C). The LogP values had a mean of 2.3, while the numbers of hydrogen bond acceptors and donors had means of 5 and 2.4, respectively, both well within Lipinski’s thresholds (Fig. 7D–F). The ring-count distribution had a median of 3 rings, which is typical of aromatic and heterocyclic scaffolds (Fig. 7G), and the heavy-atom count had a mean of 26 (Fig. 7H). Overall, 70.39% of the molecules met Lipinski’s criteria for drug-like properties, confirming that the dataset spanned a chemically diverse and pharmacologically relevant space. BM analysis revealed strong scaffold redundancy, with the top 10 scaffolds covering ~12% of all ligands, and the benzene ring scaffold alone representing 5.5% (Fig. 7I). This redundancy further corroborates the importance of scaffold-based splitting in blind ligand evaluation. Comparable distributions were observed for the reduced 90% similarity dataset (ViralBindPredict NoRed) (Supplementary Fig. S5).
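As a worked illustration of the rule-of-five filter (using the common "at most one violation" convention; the exact criterion applied in the paper is not specified), with the dataset means reported above as a sample input:

```python
def lipinski_ro5(mw, logp, hbd, hba):
    """Rule-of-five drug-likeness check: a compound passes if it
    violates at most one of the four Lipinski criteria."""
    violations = sum([
        mw > 500,      # molecular weight (Da)
        logp > 5,      # lipophilicity
        hbd > 5,       # hydrogen-bond donors
        hba > 10,      # hydrogen-bond acceptors
    ])
    return violations <= 1

# Dataset means reported above: MW 378.5 Da, LogP 2.3, HBD 2.4, HBA 5
passes = lipinski_ro5(378.5, 2.3, 2.4, 5)
```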

Figure 7.


Physicochemical characterization of ligands in the ViralBindPredict Full dataset. (A) Atom-type composition (%). (B) TPSA. (C) Molecular weight (Da). (D) Hydrogen-bond donors. (E) Hydrogen-bond acceptors. (F) LogP. (G) Ring count. (H) Heavy atom count. (I) Frequency of the top 10 Bemis–Murcko (BM) scaffolds. The dashed line represents the Lipinski thresholds defining drug-likeness boundaries according to the rule of five.

Performance of MLP and LightGBM models using 2 feature representations

The LightGBM (ML) and MLP (DL) models were benchmarked and optimized through 200 Optuna trials to determine the most effective architecture for predicting viral PLIs. The testing set served as the primary reference for model selection within each splitting strategy, representing a moderately generalized scenario. The true generalization capacity was subsequently evaluated using blind sets, which constitute the most stringent and biologically challenging assessment conditions in the field. In addition, we compared 2 datasets differing in feature composition: Mordred + ESM2, which combines ligand descriptors with transformer-based protein embeddings, and Mordred + ProtTrans, which pairs the same ligand descriptors with ProtTrans protein representations. This comparison aimed to determine which feature backbone better supports generalization and captures residue-level binding patterns, given the residue-wise nature of the dataset used. The results for the BASE split across both models and feature sets are presented in Supplementary Table S6, while Table 1 summarizes the performance under the Cluster90% split for the testing, blind ligand, blind protein, and blind sets. Table 2 presents the corresponding results under the NoRed90%→Cluster90% split, and additional results for the Cluster40% and NoRed90%→Cluster40% splits are provided in Supplementary Tables S7 and S8, respectively. Collectively, these results enable a systematic assessment of model performance on unseen proteins (blind protein set), novel ligands (blind ligand set), and entirely unseen protein–ligand pairs (blind set), thereby demonstrating the robustness and potential applicability of this approach for antiviral discovery and drug repurposing. This framework supports evaluation under biologically stringent similarity thresholds and at different levels of difficulty.

Table 1.

Performance of MLP and LightGBM models on the Cluster90% split using the Mordred + ESM2 and Mordred + ProtTrans feature sets. The models were evaluated using testing, blind ligand, blind protein, and blind sets. Each entry reports the F1 score, recall, precision, MCC, and AUC-PR. The best-performing metrics for each feature set and model (MLP or LightGBM) within a given evaluation set are highlighted in bold, whereas the overall best-performing combination across all feature sets and models for that set is shown in bold italics.

Set Features Size Positive rate Method F1 Recall Precision MCC AUC-PR
Testing Mordred + ESM2 1,164 4.21% MLP 0.7909 0.8193 0.7644 0.7819 0.8665
LightGBM 0.7575 0.7183 0.8011 0.7486 0.8438
Mordred + ProtTrans 1,164 4.21% MLP 0.7902 0.7977 0.7828 0.7809 0.8626
LightGBM 0.7685 0.7676 0.7693 0.7583 0.8405
Blind ligand Mordred + ESM2 1,031 4.21% MLP 0.7159 0.6954 0.7377 0.7041 0.7639
LightGBM 0.7019 0.6514 0.7608 0.6921 0.7603
Mordred + ProtTrans 1,031 4.21% MLP 0.7171 0.7052 0.7294 0.7050 0.7736
LightGBM 0.6983 0.6758 0.7224 0.6859 0.7576
Blind protein Mordred + ESM2 1,453 4.72% MLP 0.5717 0.4860 0.6941 0.5640 0.5889
LightGBM 0.3809 0.2603 0.7096 0.4146 0.4785
Mordred + ProtTrans 1,453 4.72% MLP 0.5548 0.4406 0.7490 0.5592 0.5900
LightGBM 0.4387 0.3140 0.7277 0.4627 0.5461
Blind Mordred + ESM2 93 5.01% MLP 0.4974 0.4100 0.6323 0.4890 0.4887
LightGBM 0.2683 0.1762 0.5627 0.2966 0.3663
Mordred + ProtTrans 93 5.01% MLP 0.4729 0.3623 0.6806 0.4784 0.4816
LightGBM 0.3809 0.2785 0.6023 0.3898 0.4714

Table 2.

Performance of MLP and LightGBM models on the NoRed90%→Cluster90% split using the Mordred + ESM2 and Mordred + ProtTrans feature sets. The models were evaluated using testing, blind ligand, blind protein, and blind sets. Each entry reports the F1 score, recall, precision, MCC, and AUC-PR. The best-performing metrics for each feature set and model (MLP or LightGBM) within a given evaluation set are highlighted in bold, whereas the overall best-performing combination across all feature sets and models for that set is shown in bold italics.

Set Features Size Positive rate Method F1 Recall Precision MCC AUC-PR
Testing Mordred + ESM2 143 4.05% MLP 0.6495 0.6761 0.6249 0.6347 0.6807
LightGBM 0.6459 0.6466 0.6451 0.6309 0.6636
Mordred + ProtTrans 143 4.05% MLP 0.6165 0.5689 0.6727 0.6040 0.6409
LightGBM 0.6309 0.6190 0.6432 0.6157 0.6357
Blind ligand Mordred + ESM2 174 3.64% MLP 0.5817 0.5432 0.6260 0.5685 0.5790
LightGBM 0.5672 0.5190 0.6252 0.5549 0.5920
Mordred + ProtTrans 174 3.64% MLP 0.5427 0.4875 0.6120 0.5311 0.5511
LightGBM 0.5218 0.4654 0.5937 0.5100 0.5484
Blind protein Mordred + ESM2 112 4.15% MLP 0.6430 0.6019 0.6902 0.6303 0.6130
LightGBM 0.5888 0.4913 0.7345 0.5871 0.5834
Mordred + ProtTrans 112 4.15% MLP 0.6445 0.5771 0.7297 0.6356 0.6311
LightGBM 0.5790 0.5060 0.6765 0.5699 0.5858
Blind Mordred + ESM2 17 4.70% MLP 0.7439 0.6489 0.8714 0.7419 0.7993
LightGBM 0.6186 0.4716 0.8986 0.6399 0.7319
Mordred + ProtTrans 17 4.70% MLP 0.5280 0.4184 0.7152 0.5309 0.5460
LightGBM 0.4098 0.2660 0.8929 0.4762 0.4752

The MLP architecture demonstrated consistently strong and stable performance across the evaluation regimes, with pronounced advantages under the Cluster90% and NoRed90%→Cluster90% conditions. To enable direct comparison across dataset splits and feature representations, the MLP performance was summarized in terms of F1 and MCC (Fig. 8) and AUC-PR (Supplementary Fig. S6) across all split configurations. This visualization captures the evolution of predictive performance under increasingly stringent evaluation regimes, where lower clustering thresholds (e.g., 40% sequence identity) impose greater sequence dissimilarity between training and blind proteins, thereby simulating realistic discovery scenarios characterized by limited homologous templates. Figure 8 illustrates the model robustness under escalating split difficulty and enables a direct comparison between the 2 feature combinations, Mordred + ESM2 and Mordred + ProtTrans, in the MLP framework. Among the reported metrics, the MCC provides a balanced measure of predictive agreement across varying class distributions, making it particularly suitable for evaluating the consistency of the model under different data regimes. In contrast, AUC-PR offers complementary insight into the model’s ability to recover true positives under a strong class imbalance (Supplementary Fig. S6), whereas F1 reflects the threshold-specific performance optimized for the best precision–recall trade-off.

Figure 8.


Performance comparison of the Mordred + ESM2 and Mordred + ProtTrans MLP models under different dataset-splitting strategies. (A) F1 scores; (B) MCC are reported for the testing, blind ligand, blind protein, and blind datasets across 5 splitting strategies: BASE, Cluster 90%, Cluster 40%, NoRed90% → Cluster 90%, and NoRed90% → Cluster 40%. The lines represent the Mordred + ESM2 and Mordred + ProtTrans models.

Discussion

The ViralBindPredict datasets (ViralBindPredict Full and ViralBindPredict NoRed) provide unique, large-scale resources dedicated exclusively to viral PLIs, addressing a critical gap in existing bioinformatics datasets. Through systematic curation, filtering, and annotation of viral complexes, they established a high-quality benchmark for the development and evaluation of ML and DL models for antiviral drug discovery.

We benchmarked the LightGBM and MLP models using 2 representation backbones (Mordred + ProtTrans and Mordred + ESM2) across multiple ViralBindPredict data partitioning strategies and evaluated them on 4 types of evaluation sets: the testing set (mild generalization), the blind ligand set (novel ligands), the blind protein set (novel proteins), and the blind set (novel ligands and proteins) (Supplementary Table S1). In the testing set, Interactions_base entries may be shared with the training set but exhibit different interaction patterns (Supplementary Table S2); this split assesses whether the model can generalize to unseen residue-level interaction patterns, and it may also include entirely new Interactions_base and Interactions_full entries while still sharing some proteins and ligands with the training set. Blind sets, sometimes referred to in the literature as cold splits, are rarely used, and when they are, they typically focus only on blind protein or blind ligand cases. In contrast, the blind set used here represents the most challenging and informative scenario, in which both proteins and ligands are entirely novel. This is particularly relevant because many ligands share underlying scaffolds, many PDB chains are sequence redundant, and protein–ligand complexes are frequently repeated in structural databases. Therefore, a robust assessment of residue-level PLIs requires restrictive splitting strategies beyond simple random partitioning to yield realistic and unbiased estimates of model performance. Five different splitting strategies were assessed: BASE, Cluster90%, NoRed90%→Cluster90%, Cluster40%, and NoRed90%→Cluster40%, designed to increase biological difficulty by reducing both ligand scaffold overlaps and protein sequence similarity between training and evaluation. All evaluation sets (testing and blind) were imbalanced, with positive interaction rates typically between ~3.6% and ~5.7%, indicating that any effective model must be capable of identifying rare binders.
In the BASE split, positive rates ranged from 4.17 to 5.71%; in Cluster90%, from 4.21 to 5.01%; in Cluster40%, from 4.24 to 5.29%; in NoRed90%→Cluster90%, from 3.64 to 4.70%; and in NoRed90%→Cluster40%, from 3.61 to 4.84%. Given that the overall positive rate in the full dataset was 4.57%, these small variations indicate that even as the splits became biologically stricter, removing sequence and scaffold redundancy, the class proportions remained largely consistent.

The baseline performance under the BASE-like split represents the upper bound of the achievable metrics when the training and evaluation data remain closely related (Supplementary Table S6). In the testing set (5.71% positives; 4.40% in training), LightGBM achieved the highest scores, with an F1 of 0.8254 and an MCC of 0.8149 for Mordred + ProtTrans, and an F1 of 0.8184 and an MCC of 0.8078 for Mordred + ESM2, indicating a strong performance when evaluated using familiar interaction patterns. The MLP performed slightly lower but comparably well, with an F1 of 0.8027 and an MCC of 0.7943 for Mordred + ProtTrans, and an F1 of 0.7999 and an MCC of 0.7898 for Mordred + ESM2, confirming that the task remains learnable when the model encounters new binding sites within known proteins and ligands (Supplementary Table S6). When ligands were held out (blind ligand, 4.17% positives), the MLP with Mordred + ESM2 achieved the best performance, with an F1 of 0.7511 and an MCC of 0.7402, outperforming both LightGBM and the MLP with ProtTrans features. A similar trend was observed in the blind protein set (4.63% positives), where the MLP achieved an F1 of 0.7510 and an MCC of 0.7406 for Mordred + ESM2, and an F1 of 0.7472 and an MCC of 0.7383 for Mordred + ProtTrans. Even in the most demanding blind scenario, where both proteins and ligands were unseen (5.25% positives), the MLP maintained robust performance, with an F1 of 0.7423 and an MCC of 0.7280 for Mordred + ESM2, and an F1 of 0.7520 and an MCC of 0.7390 for Mordred + ProtTrans, whereas LightGBM showed a clear performance drop (Supplementary Table S6). These results indicate that while LightGBM excels under mild generalization (testing set), MLP generalizes more effectively to unseen proteins and ligands, likely because of its capacity to capture nonlinear and higher-order residue–ligand interaction patterns. 
However, the BASE configuration does not rigorously prevent information leakage because proteins with high sequence identity or ligands with shared scaffolds can appear across the partitions. The high and comparable performance observed for the blind protein and blind sets further suggests that redundancy among viral sequences can inflate results, as proteins with similar sequences tend to preserve comparable binding surfaces, and the related ligands often share conserved pharmacophoric features. Therefore, a reliable performance assessment requires more restrictive splitting strategies that minimize sequence and scaffold redundancy.
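One way to realize such a leakage-controlled split is sketched below, using hypothetical precomputed `cluster` (protein sequence-cluster) and `scaffold` (ligand scaffold) identifiers; this greedy scheme is illustrative only and is not the ViralBindPredict pipeline itself:

```python
def leakage_controlled_split(interactions, test_frac=0.2):
    """Assign interactions to train/test so that no protein cluster and
    no ligand scaffold appears on both sides of the split.

    `interactions` is a list of dicts with precomputed 'cluster' and
    'scaffold' keys. Greedy sketch: reserve some clusters for the test
    set, then drop any training interaction whose scaffold leaked into
    the test side.
    """
    clusters = sorted({i["cluster"] for i in interactions})
    n_test = max(1, int(len(clusters) * test_frac))
    test_clusters = set(clusters[:n_test])
    test = [i for i in interactions if i["cluster"] in test_clusters]
    test_scaffolds = {i["scaffold"] for i in test}
    train = [i for i in interactions
             if i["cluster"] not in test_clusters
             and i["scaffold"] not in test_scaffolds]
    return train, test

toy = [{"cluster": c, "scaffold": s} for c, s in
       [("A", "s1"), ("A", "s2"), ("B", "s1"), ("B", "s3"), ("C", "s3")]]
train, test = leakage_controlled_split(toy, test_frac=0.34)
```

Note that enforcing both constraints discards some interactions (here, the B/s1 pair), which is the price of leakage control.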

When difficulty is increased using Cluster90% (90% cluster + scaffold split), high-identity protein leakage (>90% sequence identity) is explicitly prevented, and ligand scaffold overlap is systematically controlled. This configuration increases the biological diversity between the training and evaluation proteins without necessarily restricting them to distinct families, as clustering at 90% identity primarily limits near-duplicate sequences. Consequently, proteins across splits may still share related folds but exhibit greater sequence and surface variability than BASE-like partitions. Therefore, this setup mimics more realistic discovery scenarios in which evaluation proteins may be related to, but not identical to those seen during training. Under these conditions, the model must generalize across proteins that may preserve similar global architectures but differ in local binding environments, reflecting the level of evolutionary and chemical variability typically encountered in real-world applications. On the testing set under Cluster90%, which had a 4.21% positive rate (comparable to 4.64% in the training split), the MLP achieved an F1 of 0.7909 and an MCC of 0.7819 with Mordred + ESM2, and an F1 of 0.7902 and an MCC of 0.7809 with Mordred + ProtTrans. These values are only slightly lower than the BASE testing set performance (F1 of 0.8254 and MCC of 0.8149 for LightGBM with Mordred + ProtTrans), indicating that enforcing 90% sequence clustering and ligand scaffold separation does not substantially impair generalization, at least when predicting new Interactions_full (Table 1). In the blind ligand set under Cluster90% (4.21% positives), where ligands were scaffold-separated but proteins could still share moderate similarity with those in training, the performance remained strong and consistent across both feature sets. 
The MLP with Mordred + ProtTrans achieved an F1 score of 0.7171 and an MCC of 0.7050, whereas the MLP with Mordred + ESM2 achieved an F1 score of 0.7159 and an MCC of 0.7041. This suggests that the model effectively learned transferable ligand-binding patterns, maintaining predictive accuracy even for novel chemotypes when the protein context was known. In contrast, in the blind protein set under Cluster90% (4.72% positives), where the protein sequences are novel (clustered at <90% identity) but ligand scaffolds are not restricted, performance dropped markedly. The best MLP model with Mordred + ESM2 achieved an F1 score of 0.5717 and an MCC of 0.5640, whereas the ProtTrans variant yielded an F1 score of 0.5548 and an MCC of 0.5592 (Table 1). This decline, from ~0.74 MCC in the BASE blind protein set to ~0.56 here, reflects the difficulty of generalizing binding patterns to proteins that are not nearly identical and may vary in key surface residues and binding-site geometry. Viral proteins often contain short, variable, surface-exposed regions (loops or basic/hydrophobic patches) that mediate ligand or peptide binding, and even small sequence changes in these regions can strongly affect binding. Finally, in the blind set under Cluster90% (5.01% positives), where both the protein and ligand are novel, performance dropped further. The MLP with Mordred + ESM2 achieved an F1 score of 0.4974 and an MCC of 0.4890, whereas the ProtTrans variant achieved an F1 score of 0.4729 and an MCC of 0.4784. The moderate MCC (~0.49) indicates that the model retained a predictive signal above random chance, capturing transferable interaction features. Across all Cluster90% splits, the MLP consistently outperformed LightGBM, confirming it as the best-performing model under this more stringent regime. Moreover, ESM2-derived protein features contributed slightly greater predictive strength than ProtTrans embeddings.
However, the drop from ~0.74 MCC in the BASE to ~0.49 under this fully blind regime underscores how overly permissive splits can inflate the apparent performance in drug discovery, masking the real challenge of generalizing to entirely new viral targets and chemotypes (Fig. 8). The precision–recall curves further corroborate these trends, showing a progressive decline in precision at higher recall levels as the split difficulty increases from the testing set to the blind protein and fully blind configurations (Supplementary Fig. S7). While the model maintained strong discrimination on the testing and blind ligand sets, the curves for the blind protein and blind sets exhibited lower and noisier profiles, consistent with the observed MCC, F1, and AUC-PR reductions. This pattern highlights the greater challenge of generalizing to unseen protein interfaces compared to unseen ligands, reflecting the higher structural and evolutionary diversity of viral proteins. Importantly, the model still performed well above random baselines, even under the strictest Cluster90% blind conditions, indicating that it captured transferable sequence–ligand interaction features rather than memorizing specific complexes. These results underscore the biological realism of the Cluster90%-based evaluation, providing a more reliable estimate of the expected performance in real-world antiviral discovery, where both target proteins and chemical scaffolds are often novel entities.
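The AUC-PR values underlying these curves can be computed with standard tooling. The sketch below uses toy scores rather than the study's predictions; on imbalanced data the random baseline equals the positive rate, so any curve well above that line indicates learned signal:

```python
from sklearn.metrics import auc, precision_recall_curve

def auc_pr(y_true, scores):
    """Area under the precision-recall curve, the threshold-free
    counterpart of the F1/MCC point metrics reported above.
    """
    precision, recall, _ = precision_recall_curve(y_true, scores)
    return auc(recall, precision)

# Toy scores: a perfect ranker reaches AUC-PR = 1.0, while the random
# baseline for this label vector would be 2/6 ~= 0.33.
score = auc_pr([0, 0, 1, 0, 1, 0], [0.1, 0.2, 0.9, 0.3, 0.8, 0.05])
```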

The Cluster40% (40% sequence identity + scaffold split) configuration represents a substantially harsher biological generalization test (Supplementary Table S7). In this setting, sequences in the evaluation sets (blind protein and blind) shared <40% identity with any protein in the training data, thereby predominantly assessing cross-family or remote homology generalization rather than intra-family transfer. On the testing set (4.41% positives in testing vs. 4.58% in training), the MLP performance remained superficially strong, with an F1 of 0.7921 and an MCC of 0.7831 for Mordred + ProtTrans, and an F1 of 0.7876 and an MCC of 0.7778 for Mordred + ESM2, values close to those under Cluster90% and even BASE. However, the performance declined once novel ligand scaffolds were introduced: on the blind ligand set, the MLP achieved an MCC of 0.6240, whereas LightGBM performed slightly worse. The most pronounced performance drop occurred in the blind protein split under Cluster40% (4.24% positives), where generalization becomes particularly challenging. The MLP with ESM2 achieved an F1 of 0.3299 and an MCC of 0.3218, reflecting the difficulty of predicting interactions for entirely new protein clusters; the ProtTrans version was even weaker (F1 = 0.2621, MCC = 0.2866) (Supplementary Table S7). This difficulty has a clear biological basis: in this setting, the model must predict the binding of genuinely novel protein sequences, where even the local 3D environments and electrostatic landscapes of the binding sites can differ substantially. Viral proteins evolve under strong selective pressure to alter host interactions and evade immune recognition, with diversification often concentrated in exposed, interaction-prone areas. The observed MCC of ~0.32 suggests that predicting interactions for an unseen or very remote viral protein family is intrinsically more difficult than predicting new ligands for known viral targets.
The superior performance of MLP over LightGBM on the blind protein set further supports the enhanced capacity of MLP to generalize to previously unseen proteins. In the blind set under Cluster40% (4.24% positives), where both proteins and ligands are unseen at this stringent level, performance declined further. The best MLP configuration (Mordred + ESM2) achieved an F1 score of 0.3129 and an MCC of 0.2972, whereas the Mordred + ProtTrans-based model performed slightly worse (F1 = 0.2680, MCC = 0.3045). This setting represents a true out-of-distribution regime in which the model must infer interaction propensities between 2 entirely novel biological entities without shared scaffolds or closely related protein clusters. Mordred + ProtTrans performed best on the testing set. However, the superior performance of the ESM2-based model on the blind and blind protein sets aligns with its residue-level contextual representations, which enable it to capture the compatibility between local sequence environments and ligand physicochemical properties, even when the protein family has not been encountered during training. At the same time, this evaluation regime exposes the intrinsic limitations of sequence-based models. As mentioned earlier, PLMs such as ESM2 and ProtTrans encode global sequence patterns and residue-level contextual information that are highly informative for viral proteins; however, their ability to generalize diminishes as sequence identity and evolutionary context decrease [65]. Despite this, both Cluster90% and Cluster40% reveal an important observation: predicting interactions for novel ligands (blind ligand set) is considerably easier than for novel proteins (blind protein set). This is expected because the dataset operates at the residue level, and protein features, being structurally and contextually complex, tend to drive higher variability and greater challenges in generalization.

Finally, we evaluated globally redundancy-reduced variants, the non-redundant 90% + cluster regimes (NoRed90%→Cluster90% and NoRed90%→Cluster40%) (Table 2). In these settings, the dataset was first compressed by retaining only 1 representative per 90% protein sequence cluster before performing subsequent clustering and scaffold-based splitting. This 2-stage approach removes near-duplicate protein sequences prior to evaluation, preventing the model from overfitting to repeated or nearly identical interaction pairs and ensuring that performance reflects genuine generalization rather than memorization of highly similar protein–ligand pairs, albeit at the cost of a substantially smaller training set. For example, under NoRed90%→Cluster90%, the testing set had only 143 samples (4.05% positives vs. 3.76% positives in a much smaller 1145-sample training set) (Table 2). This global reduction in redundancy led to a noticeable decrease in the absolute performance of the testing set. Using ESM2 features, the MLP achieved an F1 of 0.6495 and an MCC of 0.6347, whereas the ProtTrans-based model performed slightly worse (F1 = 0.6165, MCC = 0.6040), substantially lower than the ~0.79 F1 / ~0.78 MCC observed under less stringent splits (Fig. 8). This decline was expected, as reducing the training size and sequence redundancy limits the presence of highly similar protein–ligand pairs originating from homologous proteins. Consequently, the model can no longer rely on near-duplicate examples and must instead infer broader, more transferable interaction patterns, even in the testing set, which does not guarantee novelty in proteins or ligands. On the blind ligand set within this reduced-redundancy setting (3.64% positives, 174 interactions), the MLP with Mordred + ESM2 again achieved the best performance (F1 = 0.5817, MCC = 0.5685), outperforming the ProtTrans variant (F1 = 0.5427 and MCC = 0.5311).
This represents a controlled and biologically consistent decline relative to the Cluster90% regime, reflecting the compounded challenge of novel ligand scaffolds combined with reduced redundancy in the training data. However, an interesting deviation was observed in the NoRed90%→Cluster90% condition. For the blind protein and blind splits, the performance (F1 = 0.6430, MCC = 0.6303 for Mordred + ESM2; F1 = 0.6445, MCC = 0.6356 for Mordred + ProtTrans) was notably higher than that under the corresponding Cluster90% regime, where the MCC values were ~0.56 and 0.49, respectively (Fig. 8). On the blind set, the performance of Mordred + ESM2 exceeded that of the testing set, achieving an MCC of 0.7419 and an F1 score of 0.7439 (Table 2). This apparent improvement, despite the higher theoretical difficulty of the split, likely reflects the combined effects of reduced dataset noise and sample composition. The global redundancy filtering step may have preferentially retained clearer, high-confidence interaction pairs, enabling the model to learn more transferable binding principles and improve generalization across new protein clusters. Nonetheless, the limited dataset size, particularly for the blind set (17 samples), may artificially inflate the reported performance. Despite these caveats, ESM2-based models outperformed ProtTrans in all but the blind protein set, with particularly strong gains in blind set evaluations, reflecting the superior residue-level contextualization and evolutionary coverage of ESM2 embeddings in capturing transferable biochemical patterns across diverse viral proteins. Notably, on the blind set, performance improved from F1 of 0.5280 and MCC of 0.5309 for the MLP Mordred + ProtTrans model to F1 of 0.7439 and MCC of 0.7419 for the MLP Mordred + ESM2 model (Table 2). The precision–recall curves (Supplementary Fig. 
S8) further support these observations, showing consistently high precision at moderate recall levels across all evaluation splits, particularly in the blind and blind protein sets. The smooth decline in the testing and blind ligand curves indicates stable discrimination even under scaffold-separated conditions, whereas the blind protein and fully blind set curves remain above the random baseline, confirming meaningful generalization despite reduced redundancy and training size. Interestingly, the blind and blind protein curves displayed less noise and higher precision than those in the Cluster90% regime (Supplementary Fig. S7), consistent with the hypothesis that the redundancy-filtered dataset retains clearer and more reliable interaction examples. These trends reinforce that the apparent performance gain on the blind set likely stems from improved data quality and a better signal-to-noise ratio rather than from data leakage, thereby supporting the robustness of the ESM2 embeddings in learning transferable interaction features under global redundancy control.
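The representative-selection step of the NoRed90% preprocessing can be sketched as follows. Keeping the longest chain per cluster is an assumption made here for illustration, not necessarily the criterion used in the actual pipeline:

```python
def reduce_redundancy(chains, clusters):
    """Keep a single representative chain per sequence cluster.

    `chains` maps a chain ID to its sequence; `clusters` maps a chain ID
    to its (precomputed) 90%-identity cluster ID. The longest sequence
    in each cluster is retained as the representative.
    """
    best = {}
    for chain_id, seq in chains.items():
        cid = clusters[chain_id]
        if cid not in best or len(seq) > len(best[cid][1]):
            best[cid] = (chain_id, seq)
    return {chain_id: seq for chain_id, seq in best.values()}

chains = {"1abc:A": "MKV" * 40, "2xyz:B": "MKV" * 39, "3pqr:C": "GHST" * 20}
clusters = {"1abc:A": 0, "2xyz:B": 0, "3pqr:C": 1}
nonredundant = reduce_redundancy(chains, clusters)
```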

Combining NoRed90% preprocessing with stricter 40% clustering (NoRed90%→Cluster40%) substantially increased the biological and predictive difficulty of the task (Supplementary Table S8). In the testing set (3.75% positives, 129 samples; 3.84% in training), the performance remained moderate, with the MLP achieving an F1 of 0.6770 and an MCC of 0.6643 using Mordred + ProtTrans, and an F1 of 0.6747 and an MCC of 0.6620 using Mordred + ESM2, again demonstrating the superior performance of the MLP compared to that of LightGBM. In the blind ligand set (4.15% positives, 124 samples), LightGBM slightly outperformed MLP, suggesting that under conditions of extreme data sparsity and high scaffold novelty, simpler models may benefit from stronger regularization and reduced variance. As noted earlier, predicting interactions for new ligands is less challenging than for new proteins, particularly in residue-wise datasets, where ligand features tend to be more informative and transferable than protein representations. However, in the blind protein (3.61% positives, 274 samples) and blind (4.84% positives, 46 samples) sets, the MLP again prevailed, with the Mordred + ESM2 configuration achieving an F1 of 0.2993 and an MCC of 0.2817 in the blind protein and an F1 of 0.3155 and an MCC of 0.3062 in the blind set. Although these absolute values are modest, they represent the predictive capability for completely novel viral protein clusters (≤40% identity) and unseen ligand scaffolds (Supplementary Table S8).

The Cluster40% and NoRed90%→Cluster40% variants yielded the lowest overall performances across the blind and blind protein sets. This outcome can be attributed partly to the limited ability of PLMs to generalize when sequence similarity and evolutionary context are drastically reduced [65], as mentioned earlier, and to the extreme sequence collapse produced by such a stringent threshold. When clustering the 10,933 PDB chain entries, only 393 representative clusters were obtained at 40% identity compared to 612 at 90%, indicating that many viral proteins merged into a small number of distant clusters under such conditions. This reduction constrains sequence diversity, making the learning task difficult because viral proteins often share conserved folds and functional motifs with over 40% identity. Nonetheless, these results remain informative, illustrating the lower limit of model generalization when both protein sequences and ligand scaffolds are highly dissimilar. These scenarios reflect the practical challenges of predicting interactions for completely novel viral species. As structural coverage and viral sequence diversity continue to expand, these stricter evaluation regimes will become increasingly relevant and are likely to yield improved results with larger and more balanced datasets.
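The effect of the identity threshold on cluster counts can be illustrated with a toy greedy clustering. `SequenceMatcher.ratio()` is only a crude stand-in for alignment-based identity (dedicated tools such as MMseqs2 or CD-HIT are typically used in practice), so this is purely illustrative:

```python
from difflib import SequenceMatcher

def greedy_cluster(sequences, min_identity):
    """Greedy clustering sketch: each sequence joins the first
    representative it matches at or above `min_identity`; otherwise it
    founds a new cluster. Returns the cluster representatives.
    """
    representatives = []
    for seq in sequences:
        for rep in representatives:
            if SequenceMatcher(None, seq, rep).ratio() >= min_identity:
                break
        else:
            representatives.append(seq)
    return representatives

seqs = ["MKVLAAGITG", "MKVLAAGVTG", "MKVLSAGVTG", "GHSTPQRWEY"]
# Lowering the identity threshold merges near-identical sequences into
# fewer clusters, mirroring the 612 -> 393 collapse reported at 90% vs. 40%.
reps_high = greedy_cluster(seqs, 0.95)
reps_low = greedy_cluster(seqs, 0.80)
```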

Across all partitioning strategies, the testing and blind ligand sets exhibited similar performance profiles, both maintaining relatively high F1 (≈0.64–0.79) and MCC (≈0.62–0.78) values, consistent with the fact that these subsets retain familiar protein contexts, even when the ligand scaffolds are novel (Fig. 8). In contrast, the introduction of new protein clusters (blind proteins) caused the largest performance decline, with MCC values dropping to ≈0.56 under Cluster90% and ≈0.32 under Cluster40%. This indicates that protein novelty poses a substantially greater challenge than ligand novelty, which aligns with the biological reality that viral proteins often exhibit rapidly evolving, non-conserved surface regions critical for binding. To rule out potential bias arising from the overrepresentation of specific viral families, we evaluated the model performance separately for each viral family across the blind protein and blind sets for the considered splitting strategies, confirming that the observed trends were not driven by a single viral family (Supplementary Section 3 and Supplementary Table S9).
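The per-family analysis amounts to stratifying residue-level predictions by family annotation before scoring. A minimal sketch with hypothetical family labels (per-family F1 shown for brevity; the same grouping applies to MCC):

```python
from collections import defaultdict

def per_family_f1(records):
    """Per-viral-family F1 sketch: group residue-level predictions by
    family annotation and score each group separately, so that one
    overrepresented family cannot mask poor performance on the rest.

    `records` is an iterable of (family, y_true, y_pred) triples.
    """
    grouped = defaultdict(lambda: [0, 0, 0])  # family -> [tp, fp, fn]
    for family, t, p in records:
        g = grouped[family]
        if t == 1 and p == 1:
            g[0] += 1
        elif t == 0 and p == 1:
            g[1] += 1
        elif t == 1 and p == 0:
            g[2] += 1
    return {fam: (2 * tp / (2 * tp + fp + fn) if tp else 0.0)
            for fam, (tp, fp, fn) in grouped.items()}

records = [("Retroviridae", 1, 1), ("Retroviridae", 0, 0),
           ("Retroviridae", 1, 0), ("Coronaviridae", 1, 1),
           ("Coronaviridae", 0, 1)]
scores = per_family_f1(records)
```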

ESM2-based models generally outperformed ProtTrans-based models, especially under Cluster90% and NoRed90%→Cluster90%. Representative examples of well-predicted interactions further illustrate these trends. For the Cluster90% regime, several Interaction_full entries from the testing set, such as 1b11:A:3TL_D and 1c6y:A:MK1_C, were perfectly predicted (F1 = 1.0). Similarly, 1d4i:A:BEG_C on the blind ligand set and 1f8c:A:4AM_G and 3so9:B:017_C on the blind protein set exhibited completely accurate interaction prediction (Fig. 9). In the blind set, the highest-scoring examples were 5epn:A:5R2_D (F1 = 0.9362) and 3sv6:A:SV6_B (F1 = 0.9189). For the NoRed90%→Cluster90% regime, examples of perfectly predicted Interaction_full entries include 7i73:A:A1B5F_C and 3spk:A:TPV_C in the testing set, 1hsh:D:MK1_F in the blind ligand set, and 1c6z:A:ROC_C and 3ucb:B:017_C in the blind protein set. In the blind set, 1 interaction (1c6y:A:MK1_C) achieved a perfect prediction (F1 = 1.0), followed by 2p3d:A:3TL_C (F1 = 0.8824) and 1c6y:B:MK1_C (F1 = 0.8667) (Fig. 9). To complement these interaction-level results, residue-level structural analyses of a representative interaction shared across all splitting strategies are reported in the Supplementary Information (Supplementary Section 4 and Supplementary Table S10).

Figure 9.

Representative examples of predicted PLIs under the Cluster90% and NoRed90%→Cluster90% evaluation regimes. Left: Cluster90% regime showing examples from the blind protein (3so9:B:017_C; F1 = 1.0000) and blind (5epn:A:5R2_D; F1 = 0.9362) sets. Right: NoRed90%→Cluster90% regime showing blind protein (1c6z:A:ROC_C; F1 = 1.0000) and blind (1c6y:A:MK1_C; F1 = 1.0000) interactions. Each complex depicts the ligand in color and the protein as a transparent surface with the backbone atoms. The image was rendered using Protein Imager [10] and BioRender [52].

To assess the influence and contribution of ligand information, ablation studies were performed using the best-performing MLP models on the testing set for each splitting strategy (excluding the BASE split), where only protein features were retained across all evaluation regimes (Supplementary Table S11). Across testing evaluations, the removal of Mordred features led to moderate but consistent decreases in F1 and MCC (by 0.01–0.09), confirming that ligand physicochemical descriptors provide complementary information that enhances the model’s ability to distinguish binders in a given protein context. The only exception occurred in the NoRed90%→Cluster40% split, where the performance was slightly higher (in terms of MCC and F1, but not AUC-PR) without ligand features, likely reflecting the limited number of interactions and higher noise under this highly stringent, data-sparse condition. For blind protein evaluations, performance consistently decreased (except in NoRed90%→Cluster40%) when ligand features were excluded, indicating that ligand information becomes increasingly critical as protein novelty increases and that sequence-derived embeddings alone may be insufficient to capture binding compatibility. Conversely, in blind ligand evaluations, the removal of ligand features often led to small increases in MCC and F1 (except under Cluster90%), suggesting that for unseen ligands, protein features already encode sufficient interaction context, and explicit ligand descriptors contribute less under scaffold novelty. On the blind set, performance generally increased after removing ligand features, except under Cluster90%, where both MCC and AUC-PR decreased, and Cluster40%, where AUC-PR declined but MCC and F1 improved, highlighting the context-dependent role of ligand information (Supplementary Table S11). Permutation-based feature importance analyses further supported these findings (Supplementary Figs. S9–S12).
Residue-level protein embeddings consistently dominated the importance rankings across the evaluation regimes, confirming their primary role in the generalization of unseen proteins and chemotypes. Ligand descriptors contributed in a context-dependent manner: under Cluster90%, several ligand features (e.g., ATSC2c and nP) ranked among the most influential on the blind ligand set, producing notable F1 decreases upon permutation and mirroring the ablation results where ligand removal most consistently reduced performance (Supplementary Table S11 and Supplementary Fig. S9). In contrast, under NoRed90%→Cluster90%, ligand features appeared only marginally among the top-ranked features and were absent from the blind protein and blind set rankings, indicating a reduced reliance on ligand information when residue embeddings dominate the predictions (Supplementary Table S11 and Supplementary Fig. S10). Under the more stringent Cluster40% setting, ligand features showed measurable importance across all evaluation regimes, with particularly strong contributions in the blind ligand evaluation, and this influence was translated into improved performance on the fully blind set (Supplementary Table S11 and Supplementary Fig. S11). Under the NoRed90%→Cluster40% configuration, ligand descriptors exhibited high relative importance on the blind ligand set, with a weaker influence on the blind and blind protein sets (Supplementary Fig. S12).
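Permutation importance of the kind reported here can be obtained with scikit-learn: shuffling an informative feature column degrades the score, whereas shuffling a noise column does not. The sketch below uses a toy model and synthetic data, not the study's MLP pipeline:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Toy data: feature 0 carries the label signal, feature 1 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Each column is shuffled n_repeats times; the mean score drop is that
# feature's importance, which is how rankings like those above are built.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
```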

The evaluation of ViralBindPredict against general-purpose binding-site predictors is inherently limited by the absence of benchmarks tailored to the unique properties of viral proteins. Existing datasets and evaluation frameworks have primarily been developed for stable, well-conserved proteins, in which canonical folds and persistent binding pockets dominate [79–81]. These benchmarks fail to reflect the dynamic nature and functional diversity of viral proteomes, rendering them unsuitable for assessing models specifically trained on viral datasets.

Viral proteins evolve under intense selective pressure from host immune defenses, leading to rapid sequence diversification and the emergence of unique structural traits. These include transient host–virus interfaces, conformational plasticity, and molecular mimicry of host motifs, which deviate sharply from the predictable and modular architectures that are typical of most non-viral proteins [32, 33]. General LBS models, trained predominantly on non-viral datasets, are therefore biased toward static, well-resolved binding sites and fail to capture the variability and unconventional recognition strategies that characterize viral proteins [27]. The shortage of high-resolution viral structures further compounds this problem [82], as most ML methods for PLI prediction rely on abundant, high-quality structural data that are rarely available for viral targets. This disparity underscores the need for specialized sequence-based models capable of functioning effectively in data-sparse regimes and generalizing across rapidly mutating viral proteomes.

ViralBindPredict was designed to meet this need by explicitly focusing on the structural and functional idiosyncrasies of viral proteins. Its domain specialization enables the identification of sequence-encoded signatures that drive viral–host interactions and the localization of LBS that general models fail to detect. While this focus naturally limits its direct applicability to non-viral datasets, it is essential for advancing antiviral discovery and mechanistic understanding of viral infection. Evaluating such models using non-viral benchmarks obscures their true utility and reinforces the need for virus-specific evaluation criteria.

In addition to predictive accuracy, ViralBindPredict provides actionable insights into antiviral drug design. This enables the identification of key binding residues and supports the rational development or repurposing of small molecules that modulate the viral targets. These capabilities are particularly relevant for pandemic preparedness, where the rapid annotation of newly sequenced viral proteins can prioritize druggable targets and guide the repositioning of approved therapeutics. Given the high cost and slow pace of traditional antiviral drug discovery, such computational acceleration directly addresses a major bottleneck in the field. The COVID-19 pandemic underscored the urgency of developing AI-driven frameworks capable of identifying therapeutic targets in real time [83], and ViralBindPredict represents a substantial step toward that goal.

Therefore, ViralBindPredict introduces DL models based on protein sequences that effectively capture virus-specific features, enabling accurate LBS prediction in fast-evolving viral proteins. Their demonstrated robustness under stringent data partitions (Cluster90% and NoRed90% → Cluster90%) highlights both their methodological soundness and translational potential. These models constitute a valuable resource for antiviral research, accelerating the identification of therapeutic targets and contributing to the broader effort of preparedness against emerging viral threats.

Conclusion and future perspectives

Identifying LBS in viral proteins is essential for antiviral drug discovery and understanding virus–host interactions. Traditional approaches, such as molecular docking or template-based modeling, depend heavily on high-resolution structures, which are often unavailable for viral proteins because of their high mutability, structural flexibility, and limited crystallographic coverage. This lack of structural information hampers accurate site characterization and delays the development of effective antivirals.

ViralBindPredict overcomes these limitations by leveraging DL to predict viral protein LBS directly from amino acid sequences, thereby eliminating the need for explicit 3D structures. By specializing exclusively in viral proteins, it captures the sequence-level determinants of ligand recognition that are poorly represented in general protein datasets. This makes the framework particularly suited to fast-evolving viral proteomes and time-critical scenarios, such as emerging outbreaks, in which structural data are scarce.

The central contribution of this study is the introduction of the first 2 curated viral protein–ligand benchmarks, ViralBindPredict Full and ViralBindPredict NoRed, with multiple leakage-controlled partitions. These datasets systematically quantify how redundancy, sequence clustering, and scaffold overlap influence the apparent generalization of the model. Our results demonstrate that traditional random or redundancy-permissive splits markedly overestimate model performance, whereas cluster- and scaffold-aware regimes (Cluster90% and NoRed90% → Cluster90%) provide biologically realistic assessments. The 90% sequence identity threshold effectively balances evolutionary diversity with statistical robustness, preserving interstrain variability while preventing near-duplicate data leakage. In contrast, the 40% clustering regimes serve as extreme generalization tests that, while informative, are biologically stringent, given the compact evolutionary space of the viral proteins.

Model analysis revealed that protein embeddings, particularly those derived from ESM2, are the dominant determinants of generalization, driving predictive accuracy under stringent blind protein and blind pair conditions. This highlights that viral LBS prediction depends primarily on protein context rather than ligand chemistry and that high-quality residue-level embeddings are crucial for success.

Nevertheless, current PLMs, such as ESM2 and ProtTrans, remain limited in their ability to infer structural and dynamic contexts solely from sequences, particularly in the case of extreme sequence divergence. Future models incorporating geometric or structure-aware pre-training may better capture the flexible and adaptive nature of viral interfaces. Similarly, the ligand descriptors used here encode general physicochemical properties but lack 3D conformational details, constraining their ability to model the shape complementarity and steric effects in flexible viral pockets. Future studies could integrate graph-based or learned molecular embeddings to improve the representation of ligand geometries and enhance interpretability. Incorporating geometric or predicted structural features from tools such as AlphaFold or ESMFold could further bridge the gap between sequence-based inference and spatial precision.

Although ViralBindPredict currently focuses on PLIs, excluding protein–nucleic acid complexes and metal-binding proteins to maintain dataset coherence, extending the framework to include these interactions would broaden its biological relevance. The intrinsic class imbalance of the dataset, which reflects the natural scarcity of binding residues, remains a challenge because it can reduce recall and obscure rare but biologically relevant interactions. This can be mitigated through recall-optimized training strategies, uncertainty-aware learning, or adaptive resampling. Moreover, expanding the coverage of underrepresented viral families will further improve the generalization and robustness of the model.
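Random oversampling of the minority (binding) class is one simple form of the adaptive resampling suggested above; recall-weighted losses or uncertainty-aware training are alternatives. This sketch is illustrative and is not part of the current framework:

```python
import random

def oversample_minority(pairs, seed=0):
    """Duplicate minority-class examples until the classes are balanced.

    `pairs` is a list of (sample, label) tuples with binary labels.
    """
    rng = random.Random(seed)
    pos = [p for p in pairs if p[1] == 1]
    neg = [p for p in pairs if p[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

# ~5% positives, as in the residue-level dataset.
pairs = [(i, 1 if i < 5 else 0) for i in range(100)]
balanced = oversample_minority(pairs)
```

Oversampling trades class balance for duplicated examples, so it is usually combined with regularization to avoid memorizing the repeated positives.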

Taken together, this study establishes the first biologically meaningful leakage-controlled framework for evaluating viral PLI predictors. It defines Cluster90% and NoRed90% → Cluster90% as realistic, evolution-aware benchmarks, demonstrates the resilience of MLP architectures combined with ESM2 embeddings, and underscores the importance of virus-specific evaluation standards that reflect the evolutionary and biochemical distinctiveness of viral proteomes. ViralBindPredict lays the foundation for sequence-driven antiviral discovery, enabling the rapid and data-efficient identification of druggable viral sites and guiding both drug repurposing and de novo antiviral design. As viral structural coverage expands and PLMs evolve to integrate geometric information, frameworks such as ViralBindPredict will become central to bridging artificial intelligence and experimental virology, accelerating therapeutic innovation, and strengthening preparedness against future pandemics.

Availability of supporting source code and requirements

Project name: VirHostAI

Project homepage: https://github.com/MoreiraLAB/VirHostAI

Operating system(s): Platform independent

Programming language: Python (version 3.11.3)

Other requirements: Python packages listed in requirements.txt or requirements_cpu.txt

See the DOME-ML registry for more details [84].

Supplementary Material

giag010_Supplemental_File
giag010_Authors_Response_To_Reviewer_Comments_original_submission
giag010_Authors_Response_To_Reviewer_Comments_revision_1
giag010_GIGA-D-25-00033_original_submission
giag010_GIGA-D-25-00033_Revision_1
giag010_GIGA-D-25-00033_Revision_2
giag010_Reviewer_1_Report_original_submission (Reviewer 1, 24 February 2025)
giag010_Reviewer_1_Report_Revision_1 (Reviewer 1, 6 November 2025)
giag010_Reviewer_2_Report_original_submission (Reviewer 2, 3 March 2025)
giag010_Reviewer_2_Report_Revision_1 (Reviewer 2, 21 November 2025)
giag010_Reviewer_3_Report_original_submission (Reviewer 3, 3 March 2025)

Contributor Information

A M B Amorim, CNC—Center for Neuroscience and Cell Biology, Center for Innovative Biomedicine and Biotechnology, University of Coimbra, Rua Larga - Faculdade de Medicina, Polo 1, 3004–504 Coimbra, Portugal; Department of Life Sciences, University of Coimbra, Calçada Martim de Freitas, 3000-456 Coimbra, Portugal; PURR.AI, Rua Pedro Nunes, IPN Incubadora, Ed C, 3030-199 Coimbra, Portugal; CoLAB4Ageing, Sítio Paço das Escolas, Reitoria da Universidade de Coimbra, 3000-530 Coimbra, Portugal.

C Marques-Pereira, CNC—Center for Neuroscience and Cell Biology, Center for Innovative Biomedicine and Biotechnology, University of Coimbra, Rua Larga - Faculdade de Medicina, Polo 1, 3004–504 Coimbra, Portugal; Department of Life Sciences, University of Coimbra, Calçada Martim de Freitas, 3000-456 Coimbra, Portugal.

T Almeida, INESC-ID Lisboa, R. Alves Redol 9, 1000-029, Lisbon, Portugal.

N Rosário-Ferreira, PURR.AI, Rua Pedro Nunes, IPN Incubadora, Ed C, 3030-199 Coimbra, Portugal; CoLAB4Ageing, Sítio Paço das Escolas, Reitoria da Universidade de Coimbra, 3000-530 Coimbra, Portugal.

H S Pinto, INESC-ID Lisboa, R. Alves Redol 9, 1000-029, Lisbon, Portugal; Instituto Superior Técnico, University of Lisbon, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal.

C Vaz, INESC-ID Lisboa, R. Alves Redol 9, 1000-029, Lisbon, Portugal; Instituto Superior de Engenharia de Lisboa, Instituto Politécnico de Lisboa, R. Conselheiro Emídio Navarro 1, 1959-007, Lisbon, Portugal.

A Francisco, INESC-ID Lisboa, R. Alves Redol 9, 1000-029, Lisbon, Portugal; Instituto Superior Técnico, University of Lisbon, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal.

I S Moreira, CNC—Center for Neuroscience and Cell Biology, Center for Innovative Biomedicine and Biotechnology, University of Coimbra, Rua Larga - Faculdade de Medicina, Polo 1, 3004–504 Coimbra, Portugal; Department of Life Sciences, University of Coimbra, Calçada Martim de Freitas, 3000-456 Coimbra, Portugal; CoLAB4Ageing, Sítio Paço das Escolas, Reitoria da Universidade de Coimbra, 3000-530 Coimbra, Portugal.

Additional files

Supplementary Section 1

Supplementary Section 2

Supplementary Section 3

Supplementary Section 4

Supplementary Table S1. Composition of all dataset splits across the splitting strategies and subsets. The columns summarize the number of protein chains, PDB entries, ligands, and interactions retained after deduplication and splitting of the dataset. Lig_full represents chain-specific ligand instances, and Lig_base denotes distinct ligand identifiers without chain suffixes. Interaction_base refers to unique protein–ligand pairs (PDB:Chain:LIG_base), and Interactions_full represents individual binding site instances (PDB:Chain:LIG_full). The interaction cluster groups all redundant interactions involving the same protein–ligand site (PDB:SEQ:LIG_full). Sequence–LIG_base indicates the number of unique sequence–ligand pairs, and MMseqs2_clusters denotes the number of sequence clusters at the applied identity threshold. Each LIG_base corresponds to a unique canonical SMILES string.

Supplementary Table S2. Overlap analysis across dataset-splitting strategies. The columns summarize the overlap of protein chains, PDB entries, ligands, and interactions retained after deduplication and splitting. Lig_full represents chain-specific ligand instances, and Lig_base denotes distinct ligand identifiers without chain suffixes. Interaction_base refers to unique protein–ligand pairs (PDB:Chain:LIG_base), and Interactions_full represents individual binding-site instances (PDB:Chain:LIG_full). The interaction cluster groups all redundant interactions involving the same protein–ligand site (PDB:SEQ:LIG_full). Sequence-LIG_base indicates the number of unique sequence–ligand pairs, and MMseqs2_clusters denotes the number of sequence clusters at the specified identity threshold. Each LIG_base corresponds to a unique canonical SMILES string.

Supplementary Table S3. Distribution of interacting and non-interacting residues across dataset-splitting strategies. The columns report the number of interacting residues (positive class), non-interacting residues (negative class), and total residues in each subset. The % Positive column indicates the proportion of interacting residues relative to the total.

Supplementary Table S4. Ranges of hyperparameters and corresponding values tested for the LightGBM and MLP models, along with the best hyperparameter configurations identified for each dataset splitting strategy and feature space. The columns list the optimized hyperparameter names, descriptions, and search ranges or discrete values explored during the optimization. The remaining columns report the best-performing configurations for each split and feature combination (ESM2 and ProtTrans embeddings). For the MLP models, training used the BCEWithLogitsLoss function.

Supplementary Table S5. Hyperparameter grids used to assess local performance stability around the optimal Optuna configuration for the Cluster90% (28 combinations) and NoRed90%→Cluster90% (160 combinations) data splits. For each model, all non-listed hyperparameters were fixed to the values obtained from the best Optuna trial (Supplementary Table S4). The optimal Optuna value for each core hyperparameter was explicitly included in the grid to enable the direct evaluation of the model’s robustness to small deviations from the optimum. All experiments were conducted using fixed random seeds.

Supplementary Table S6. Performance of MLP and LightGBM models on the BASE split using the Mordred + ESM2 and Mordred + ProtTrans feature sets. The models were evaluated using testing, blind ligand, blind protein, and blind sets. Each entry reports the F1 score, recall, precision, MCC, and AUC-PR. The best-performing metrics for each feature set and model (MLP or LightGBM) within a given evaluation set are highlighted in bold, whereas the overall best-performing combination across all feature sets and models for that set is shown in bold italics.

Supplementary Table S7. Performance of MLP and LightGBM models on the Cluster40% split using the Mordred + ESM2 and Mordred + ProtTrans feature sets. The models were evaluated using testing, blind ligand, blind protein, and blind sets. Each entry reports the F1 score, recall, precision, MCC, and AUC-PR. The best-performing metrics for each feature set and model (MLP or LightGBM) within a given evaluation set are highlighted in bold, whereas the overall best-performing combination across all feature sets and models for that set is shown in bold italics.

Supplementary Table S8. Performance of MLP and LightGBM models on the NoRed90%→Cluster40% split using the Mordred + ESM2 and Mordred + ProtTrans feature sets. The models were evaluated using testing, blind ligand, blind protein, and blind sets. Each entry reports the F1 score, recall, precision, MCC, and AUC-PR. The best-performing metrics for each feature set and model (MLP or LightGBM) within a given evaluation set are highlighted in bold, whereas the overall best-performing combination across all feature sets and models for that set is shown in bold italics.

Supplementary Table S9. Family-wise model performance on the blind protein and blind sets for the Cluster90%, Cluster40%, NoRed90%→Cluster90%, and NoRed90%→Cluster40% splitting strategies. The F1 score, recall, precision, MCC, and AUC-PR were reported for each viral family. The metrics were computed independently for each family. Viral family annotations were obtained from the UniProt database.

Supplementary Table S10. Residue-level overlap between model-predicted and experimentally defined interacting residues across splitting strategies. The representative protein–ligand complex 3qb8:A:COA_C was selected because it was shared across all evaluated blind protein-splitting strategies. For each interaction and model configuration, the table reports the total number of experimentally defined interacting residues, the number of residues predicted to be interacting, and the corresponding counts of true positives (TP), false negatives (FN), and false positives (FP). The amino acid compositions of the TP, FN, and FP residues are shown in parentheses. A total of 197 residues were analyzed for 3qb8:A:COA_C.

Supplementary Table S11. Ablation study using protein-only features, evaluated on the testing set across all splitting strategies. Each best-performing model (MLP) was retrained using protein-only embeddings (ESM2 or ProtTrans) to quantify the contribution of ligand descriptors to predictive performance. The reported metrics include the F1 score, recall, precision, MCC, and AUC-PR.

Supplementary Figure S1. Optuna hyperparameter importance plots based on fANOVA analysis. (A) Cluster90% split. (B) NoRed90%→Cluster90% split.

Supplementary Figure S2. Grid-based sensitivity analysis of the optimizer and dropout hyperparameters with batch normalization fixed to True for the Cluster90% splitting strategy. Heatmaps showing the F1 scores on the testing set and 3 blind evaluation settings (blind protein, blind ligand, and blind sets). Batch normalization was fixed because all configurations with batch_norm = False resulted in a zero F1 score across the evaluation sets. The star indicates the optimal Optuna configuration.

Supplementary Figure S3. Grid-based sensitivity analysis of the learning rate and number of training epochs with all other hyperparameters fixed to the Optuna-selected configuration for the NoRed90%→Cluster90% splitting strategy. Heatmaps report F1 scores on the testing set and 3 blind evaluation settings (blind protein, blind ligand, and blind sets). The star indicates the Optuna-optimal configuration. The learning rate is shown on a logarithmic scale.

Supplementary Figure S4. Structural and compositional characteristics of the ViralBindPredict NoRed dataset. (A) Distribution of protein chain lengths. (B) Resolution distribution of PDB structures. (C) Experimental methods used for PDB structure determination. (D) Distribution of interacting residues per protein–ligand complex (Interactions_full), illustrating typical interaction interface sizes. (E) Ligand chemical classifications from _chem_comp.pdbx_type (TYPE) and _chem_comp.type (CHEM_TYPE) for LIG_base. The saccharide category groups the D-saccharide, L-saccharide, and saccharide chemical types. Entries labeled “?” had missing TYPE values but were retained because their CHEM_TYPE corresponded to an allowed class (verified after manual review).

Supplementary Figure S5. Physicochemical characterization of ligands in the ViralBindPredict NoRed dataset. (A) Atom-type composition (%). (B) Topological polar surface area (TPSA), median of 73 and mean of 104.2. (C) Molecular weight (Da), median of 290.4 and mean of 343.4. (D) Hydrogen-bond donors, median of 2 and mean of 2.5. (E) Hydrogen-bond acceptors, median of 4 and mean of 5.4. (F) LogP, median of 1.3 and mean of 1.4. (G) Ring count, median of 2 and mean of 2.4. (H) Heavy atom count, median of 21 and mean of 23.5. (I) Frequency of the top 10 BM scaffolds. The orange dashed line represents the Lipinski thresholds defining drug-likeness boundaries according to the rule of five. Overall, 557 of the 766 LIG_base (72.72%) complied with Lipinski’s rule of five.
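The Lipinski thresholds referenced in this caption can be made concrete with a short check. The sketch below is our own illustration (not ViralBindPredict code) using the standard rule-of-five cutoffs; it follows the common convention that a compound remains drug-like with at most one violation, which may differ from the exact compliance criterion applied in the study.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count violations of Lipinski's rule-of-five thresholds."""
    return sum([
        mol_weight > 500,   # molecular weight <= 500 Da
        logp > 5,           # octanol-water partition coefficient <= 5
        h_donors > 5,       # hydrogen-bond donors <= 5
        h_acceptors > 10,   # hydrogen-bond acceptors <= 10
    ])

def is_drug_like(mol_weight, logp, h_donors, h_acceptors, max_violations=1):
    # Common convention: at most one violated criterion is tolerated.
    return lipinski_violations(mol_weight, logp, h_donors, h_acceptors) <= max_violations
```

For example, a ligand with the median values reported above (MW 290.4 Da, LogP 1.3, 2 donors, 4 acceptors) incurs zero violations and is classified as drug-like.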

Supplementary Figure S6. Performance comparison of the Mordred + ESM2 and Mordred + ProtTrans models under different dataset-splitting strategies. AUC-PR is reported for the testing, blind ligand, blind protein, and blind datasets across 5 splitting strategies: BASE, Cluster90%, Cluster40%, NoRed90%→Cluster90%, and NoRed90%→Cluster40%. The green and orange lines represent the Mordred + ESM2 and Mordred + ProtTrans models, respectively.

Supplementary Figure S7. Precision–recall curves under the Cluster90% regime. Precision–recall curves for the testing, blind ligand, blind protein, and blind evaluation sets using the MLP Cluster90% model with Mordred + ESM2 features.

Supplementary Figure S8. Precision–recall curves under the NoRed90%→Cluster90% regime. Precision–recall curves for the testing, blind ligand, blind protein, and blind evaluation sets using the MLP NoRed90%→Cluster90% model with Mordred + ESM2 features.

Supplementary Figure S9. Top 30 most performance-dropping features identified by permutation importance (10 random permutations) for the Cluster90% splitting strategy (testing, blind ligand, blind protein, and blind sets). The bars represent the decrease in the F1 score (ΔF1) after the random shuffling of each feature, with larger drops indicating greater feature relevance.

Supplementary Figure S10. Top 30 most performance-dropping features identified by permutation importance (10 random permutations) for the NoRed90%→Cluster90% splitting strategy (testing, blind ligand, blind protein, and blind sets). The bars represent the decrease in the F1 score (ΔF1) after the random shuffling of each feature, with larger drops indicating greater feature relevance.

Supplementary Figure S11. Top 30 most performance-dropping features identified by permutation importance (10 random permutations) for the Cluster40% splitting strategy (testing, blind ligand, blind protein, and blind sets). The bars represent the decrease in the F1 score (ΔF1) after the random shuffling of each feature, with larger drops indicating greater feature relevance.

Supplementary Figure S12. Top 30 most performance-dropping features identified by permutation importance (10 random permutations) for the NoRed90%→Cluster40% splitting strategy (testing, blind ligand, blind protein, and blind sets). The bars represent the decrease in the F1 score (ΔF1) after the random shuffling of each feature, with larger drops indicating greater feature relevance.

Abbreviations

AUC-PR: area under the precision–recall curve; BM: Bemis–Murcko; DL: deep learning; DNA: deoxyribonucleic acid; DTI: drug–target interactions; LBS: ligand-binding sites; MCC: Matthews correlation coefficient; ML: machine learning; MLP: multilayer perceptron; mmCIF: Macromolecular Crystallographic Information File; PDB: Protein Data Bank; PLI: protein–ligand interaction; PLM: protein language model; RNA: ribonucleic acid; SMILES: Simplified Molecular Input Line Entry System.

Author contributions

A. M. B. Amorim (Data curation [lead], Formal analysis [lead], Methodology [equal], Software [lead], Visualization [lead], Writing – original draft [equal], Writing – review & editing [equal]), C. Marques-Pereira (Methodology [equal], Writing – original draft [equal]), T. Almeida (Methodology [equal], Writing – original draft [equal]), N. Rosário-Ferreira (Conceptualization [equal], Supervision [equal], Writing – review & editing [equal]), H. S. Pinto (Supervision [equal], Writing – review & editing [equal]), C. Vaz (Conceptualization [equal], Supervision [equal], Writing – review & editing [equal]), A. Francisco (Conceptualization [equal], Funding acquisition [equal], Supervision [equal], Writing – review & editing [equal]), I. S. Moreira (Conceptualization [lead], Funding acquisition [lead], Supervision [lead], Writing – review & editing [lead]).

Funding

This work was co-funded by the EU Recovery and Resilience Facility and Portuguese national funds via FCT—Fundação para a Ciência e a Tecnologia, under projects LA/P/0058/2020 [DOI: 10.54499/LA/P/0058/2020], UID/PRR/4539/2025 [DOI: 10.54499/UID/PRR/04539/2025], UID/04539/2025, and 2024.07255.IACDC [DOI: 10.54499/2024.07255.IACDC]. A.M.B.A. and C.M.P. were also supported by the FCT through Ph.D. scholarships 2024.03147.BDANA and 2020.07766.BD, respectively. PURR.AI also acknowledges the COMPETE2030-FEDER-01475900 SII & DT Individual—MPr-2023-09 program under contract 18415. The INESC-ID work was supported by national funds through the FCT under projects UID/50021/2025 and UID/PRR/50021/2025.

Data availability

All additional supporting data are available in the GigaScience repository, GigaDB [85].

Competing interests

Irina S. Moreira and Nícia Rosário-Ferreira are co-founders of PURR.AI, and Ana M. B. Amorim is a member of its team. They declare that the company has not received any external funding and that there are no patents, products under development, or marketed products associated with this study that could be construed as conflicts of interest.

References

  • 1. Spear  PG. Herpes simplex virus: receptors and ligands for cell entry. Cell Microbiol. 2004;6:401–10. 10.1111/j.1462-5822.2004.00389.x. [DOI] [PubMed] [Google Scholar]
  • 2. Schlicksup  CJ, Zlotnick  A. Viral structural proteins as targets for antivirals. Curr Opin Virol. 2020;45:43–50. 10.1016/j.coviro.2020.07.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Zlotnick  A. To build a virus capsid. An equilibrium model of the self assembly of polyhedral protein complexes. J Mol Biol. 1994;241:59–67. 10.1006/jmbi.1994.1473. [DOI] [PubMed] [Google Scholar]
  • 4. Johnston  IG, Louis  AA, Doye  JPK. Modelling the self-assembly of virus capsids. J Phys: Condens Matter. 2010;22:104101. 10.1088/0953-8984/22/10/104101. [DOI] [PubMed] [Google Scholar]
  • 5. Perlmutter  JD, Hagan  MF. Mechanisms of virus assembly. Annu Rev Phys Chem. 2015;66:217–39. 10.1146/annurev-physchem-040214-121637. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Garmann  RF, Comas-Garcia  M, Knobler  CM, et al.  Physical principles in the self-assembly of a simple spherical virus. Acc Chem Res. 2016;49:48–55. 10.1021/acs.accounts.5b00350. [DOI] [PubMed] [Google Scholar]
  • 7. Tanner  EJ, Liu  H-M, Oberste  MS, et al.  Dominant drug targets suppress the emergence of antiviral resistance. eLife. 2014;3:e03830. 10.7554/eLife.03830. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Romero  JR. Pleconaril: a novel antipicornaviral drug. Expert Opin Investig Drugs. 2001;10:369–79. 10.1517/13543784.10.2.369. [DOI] [PubMed] [Google Scholar]
  • 9. Purdy  MD, Shi  D, Chrustowicz  J, et al.  MicroED structures of HIV-1 Gag CTD-SP1 reveal binding interactions with the maturation inhibitor bevirimat. Proc Natl Acad Sci USA. 2018;115:13258–63. 10.1073/pnas.1806806115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Tomasello  G, Armenia  I, Molla  G. The protein imager: a full-featured online molecular viewer interface with server-side HQ-rendering capabilities. Bioinformatics. 2020;36:2909–11. 10.1093/bioinformatics/btaa009. [DOI] [PubMed] [Google Scholar]
  • 11. Chen  R, Liu  X, Jin  S, et al.  Machine learning for drug–target interaction prediction. Molecules. 2018;23:2208. 10.3390/molecules23092208. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Wu  Z, Li  W, Liu  G, et al.  Network-based methods for prediction of drug–target interactions. Front Pharmacol. 2018;9:1134. 10.3389/fphar.2018.01134. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Wen  M, Zhang  Z, Niu  S, et al.  Deep-learning-based drug–target interaction prediction. J Proteome Res. 2017;16:1401–9. 10.1021/acs.jproteome.6b00618. [DOI] [PubMed] [Google Scholar]
  • 14. Mao  X, Chu  Y, Wei  D. Designed with interactome-based deep learning. Nat Chem Biol. 2024;20:1399–401. 10.1038/s41589-024-01754-7. [DOI] [PubMed] [Google Scholar]
  • 15. Zhou  Z, Chen  J, Lin  S, et al.  GRATCR: epitope-specific T cell receptor sequence generation with data-efficient pre-trained models. IEEE J Biomed Health Inform. 2025. 10.1109/JBHI.2024.3514089. [DOI] [PubMed] [Google Scholar]
  • 16. Yu  H, Chen  J, Xu  X, et al.  A systematic prediction of multiple drug–target interactions from chemical, genomic, and pharmacological data. PLoS One. 2012;7:e37608. 10.1371/journal.pone.0037608. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Cheng  F, Zhou  Y, Li  J, et al.  Prediction of chemical-protein interactions: multitarget-QSAR versus computational chemogenomic methods. Mol BioSyst. 2012;8:2373. 10.1039/c2mb25110h. [DOI] [PubMed] [Google Scholar]
  • 18. Gilson  MK, Liu  T, Baitaluk  M, et al.  BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res. 2016;44:D1045–53. 10.1093/nar/gkv1072. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Tian  K, Shao  M, Wang  Y, et al.  Boosting compound-protein interaction prediction by deep learning. Methods. 2016;110:64–72. 10.1016/j.ymeth.2016.06.024. [DOI] [PubMed] [Google Scholar]
  • 20. Zhang  Y, Li  J, Lin  S, et al.  An end-to-end method for predicting compound-protein interactions based on simplified homogeneous graph convolutional network and pre-trained language model. J Cheminform. 2024;16:67. 10.1186/s13321-024-00862-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Thafar  MA, Olayan  RS, Albaradei  S, et al.  DTi2Vec: drug–target interaction prediction using network embedding and ensemble learning. J Cheminform. 2021;13:71. 10.1186/s13321-021-00552-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Liao  Z, Huang  X, Mamitsuka  H, et al.  Drug3D-DTI: improved drug–target interaction prediction by incorporating spatial information of small molecules. In: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Houston, TX: IEEE; 2021:340–47. [Google Scholar]
  • 23. Wu  Q, Peng  Z, Zhang  Y, et al.  COACH-D: improved protein–ligand binding sites prediction with refined ligand-binding poses through molecular docking. Nucleic Acids Res. 2018;46:W438–42. 10.1093/nar/gky439. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Skolnick  J, Brylinski  M. FINDSITE: a combined evolution/structure-based approach to protein function prediction. Briefings Bioinf. 2009;10:378–91. 10.1093/bib/bbp017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Fang  Y, Jiang  Y, Wei  L, et al.  DeepProSite: structure-aware protein binding site prediction using ESMFold and pretrained language model. Bioinformatics. 2023;39:btad718. 10.1093/bioinformatics/btad718. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Jiménez  J, Doerr  S, Martínez-Rosell  G, et al.  DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics. 2017;33:3036–42. 10.1093/bioinformatics/btx350. [DOI] [PubMed] [Google Scholar]
  • 27. Kandel  J, Tayara  H, Chong  KT. PUResNet: prediction of protein–ligand binding sites using deep residual neural network. J Cheminform. 2021;13:65. 10.1186/s13321-021-00547-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Durairaj  J, Adeshina  Y, Cao  Z, et al.  PLINDER: the protein–ligand interactions dataset and evaluation resource. bioRxiv. [Preprint]. [Google Scholar]
  • 29. Cui  Y, Dong  Q, Hong  D, et al.  Predicting protein–ligand binding residues with deep convolutional neural networks. BMC Bioinf. 2019;20:93. 10.1186/s12859-019-2672-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Komiyama  Y, Banno  M, Ueki  K, et al.  Automatic generation of bioinformatics tools for predicting protein–ligand binding sites. Bioinformatics. 2016;32:901–7. 10.1093/bioinformatics/btv593. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Kauffman  C, Karypis  G. LIBRUS: combined machine learning and homology information for sequence-based ligand-binding residue prediction. Bioinformatics. 2009;25:3099–107. 10.1093/bioinformatics/btp561. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Goswami  S, Samanta  D, Duraivelan  K. Molecular mimicry of host short linear motif-mediated interactions utilised by viruses for entry. Mol Biol Rep. 2023;50:4665–73. 10.1007/s11033-023-08389-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Peck  KM, Lauring  AS. Complexities of viral mutation rates. J Virol. 2018;92:e01031–17. 10.1128/JVI.01031-17. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Berman  HM, Westbrook  J, Feng  Z, et al.  The Protein Data Bank. Nucleic Acids Res. 2000;28:235–42. 10.1093/nar/28.1.235. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. RCSB Protein Data Bank: PDB entries with novel ligands now distributed only in PDBx/mmCIF and PDBML file formats. https://www.rcsb.org/news/feature/2023. Accessed 2 September 2025. [Google Scholar]
  • 36. Zhang  C, Zhang  X, Freddolino  L, et al.  BioLiP2: an updated structure database for biologically relevant ligand–protein interactions. Nucleic Acids Res. 2024;52:D404–12. 10.1093/nar/gkad630. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Wei  H, Wang  W, Peng  Z, et al.  Q-BioLiP: a comprehensive resource for quaternary structure-based protein–ligand interactions. Genomics Proteomics Bioinformatics. 2024;22:qzae001. 10.1093/gpbjnl/qzae001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Ben Chorin  A, Masrati  G, Kessel  A, et al.  ConSurf-DB: an accessible repository for the evolutionary conservation patterns of the majority of PDB proteins. Protein Sci. 2020;29:258–67. 10.1002/pro.3779. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Chemical Component Dictionary. https://www.wwpdb.org/data/ccd. Accessed 2 September 2025. [Google Scholar]
  • 40. Anand  P, Nagarajan  D, Mukherjee  S, et al.  PLIC: protein–ligand interaction clusters. Database (Oxford). 2014;2014:bau029. 10.1093/database/bau029. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Worldwide Protein Data Bank: Data item _chem_comp.Type. https://mmcif.wwpdb.org/dictionaries/mmcif_ma.dic/Items/_chem_comp.type.html. Accessed 2 September 2025. [Google Scholar]
  • 42. Heller  SR, McNaught  A, Pletnev  I, et al.  InChI, the IUPAC International Chemical Identifier. J Cheminform. 2015;7:23. 10.1186/s13321-015-0068-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Wang  Y, Sun  K, Li  J, et al.  A workflow to create a high-quality protein–ligand binding dataset for training, validation, and prediction tasks. Digital Discovery. 2025;4:1209–20. 10.1039/d4dd00357h. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Clark  JJ, Orban  ZJ, Carlson  HA. Predicting binding sites from unbound versus bound protein structures. Sci Rep. 2020;10:15856. 10.1038/s41598-020-72906-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Suresh  MX, Gromiha  MM, Suwa  M. Development of a machine learning method to predict membrane protein–ligand binding residues using basic sequence information. Adv Bioinformatics. 2015;2015:1–7. 10.1155/2015/843030. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Gao  M, Skolnick  J. The distribution of ligand-binding pockets around protein–protein interfaces suggests a general mechanism for pocket formation. Proc Natl Acad Sci USA. 2012;109:3784–9. 10.1073/pnas.1117768109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Cao  C, Xu  S. Improving the performance of the PLB index for ligand-binding site prediction using dihedral angles and the solvent-accessible surface area. Sci Rep. 2016;6:33232. 10.1038/srep33232. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Hata  H, Phuoc Tran  D, Marzouk Sobeh  M, et al.  Binding free energy of protein/ligand complexes calculated using dissociation Parallel Cascade Selection Molecular Dynamics and Markov state model. Biophys Physicobiol. 2021;18:305–16. 10.2142/biophysico.bppb-v18.037. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Siebenmorgen  T, Menezes  F, Benassou  S, et al.  MISATO: machine learning dataset of protein–ligand complexes for structure-based drug discovery. Nat Comput Sci. 2024;4:367–78. 10.1038/s43588-024-00627-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Jian  J-W, Elumalai  P, Pitti  T, et al.  Predicting ligand binding sites on protein surfaces by 3-dimensional probability density distributions of interacting atoms. PLoS One. 2016;11:e0160315. 10.1371/journal.pone.0160315. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Yang  J, Roy  A, Zhang  Y. BioLiP: a semi-manually curated database for biologically relevant ligand–protein interactions. Nucleic Acids Res. 2013;41:D1096–103. 10.1093/nar/gks966. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. BioRender. https://www.biorender.com. Accessed 2 November 2025. [Google Scholar]
  • 53. Moriwaki  H, Tian  Y-S, Kawashita  N, et al.  Mordred: a molecular descriptor calculator. J Cheminform. 2018;10:4. 10.1186/s13321-018-0258-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Lin  Z, Akin  H, Rao  R, et al.  Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123–30. 10.1126/science.ade2574. [DOI] [PubMed] [Google Scholar]
  • 55. Elnaggar  A, Heinzinger  M, Dallago  C, et al.  ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell. 2022;44:7112–27. 10.1109/tpami.2021.3095381. [DOI] [PubMed] [Google Scholar]
  • 56. Caniceiro  AB, Amorim  AMB, Rosário-Ferreira  N, et al.  GPCR-A17 MAAP: mapping modulators, agonists, and antagonists to predict the next bioactive target. J Cheminform. 2025;17:102. 10.1186/s13321-025-01050-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Bushuiev  A, Bushuiev  R, Sedlar  J, et al.  Revealing data leakage in protein interaction benchmarks. arXiv [cs.LG]. https://arxiv.org/abs/2404.10457. [Google Scholar]
  • 58. Smug  BJ, Szczepaniak  K, Rocha  EPC, et al.  Ongoing shuffling of protein fragments diversifies core viral functions linked to interactions with bacterial hosts. Nat Commun. 2023;14:7460. 10.1038/s41467-023-43236-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Nomburg  J, Doherty  EE, Price  N, et al.  Birth of protein folds and functions in the virome. Nature. 2024;633:710–7. 10.1038/s41586-024-07809-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Ataya  F, Alamro  A, Alghamdi  A, et al.  SARS-CoV-2 spike mutations alter structure and energetics to modulate ACE2 binding immune evasion and viral adaptation. Sci Rep. 2025;15:37546. 10.1038/s41598-025-15979-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Tuci  S, Mercorelli  B, Loregian  A. Antiviral drug repurposing: different approaches and the case of antifungal drugs. Pharmacol Ther. 2025;273:108903. 10.1016/j.pharmthera.2025.108903. [DOI] [PubMed] [Google Scholar]
  • 62. Shuler  G, Hagai  T. Rapidly evolving viral motifs mostly target biophysically constrained binding pockets of host proteins. Cell Rep. 2022;40:111212. 10.1016/j.celrep.2022.111212. [DOI] [PubMed] [Google Scholar]
  • 63. Liu  H, Su  M, Lin  H-X, et al.  Public data set of protein–ligand dissociation kinetic constants for quantitative structure-kinetics relationship studies. ACS Omega. 2022;7:18985–96. 10.1021/acsomega.2c02156. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Xu  M, Zhang  Z, Lu  J, et al.  PEER: a comprehensive and multi-task benchmark for Protein sEquence undERstanding. arXiv [cs.LG]. [Google Scholar]
  • 65. Kabir  A, Moldwin  A, Bromberg  Y, et al.  In the twilight zone of protein sequence homology: do protein language models learn protein structure?. Bioinform Adv. 2024;4:vbae119. 10.1093/bioadv/vbae119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Steinegger  M, Söding  J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35:1026–8. 10.1038/nbt.3988. [DOI] [PubMed] [Google Scholar]
  • 67. Gabler  F, Nam  S-Z, Till  S, et al.  Protein sequence analysis using the MPI Bioinformatics Toolkit. Curr Protoc Bioinformatics. 2020;72:e108. 10.1002/cpbi.108. [DOI] [PubMed] [Google Scholar]
  • 68. Mirdita  M, von den Driesch  L, Galiez  C, et al.  Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017;45:D170–6. 10.1093/nar/gkw1081. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69. Yize  J, Xinze  L, Yuanyuan  Z, et al.  PoseX: AI defeats physics approaches on protein–ligand cross docking. arXiv [cs.LG ]. [Google Scholar]
  • 70. Yang  J, Shen  C, Huang  N. Predicting or pretending: artificial intelligence for protein–ligand interactions lack of sufficiently large and unbiased datasets. Front Pharmacol. 2020;11:69. 10.3389/fphar.2020.00069. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71. Hu  X, Wang  K, Dong  Q. Protein ligand-specific binding residue predictions by an ensemble classifier. BMC Bioinf. 2016;17:470. 10.1186/s12859-016-1348-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Rost  B. Twilight zone of protein sequence alignments. Protein Eng. 1999;12:85–94. 10.1093/protein/12.2.85. [DOI] [PubMed] [Google Scholar]
  • 73. Roessler  CG, Hall  BM, Anderson  WJ, et al.  Transitive homology-guided structural studies lead to discovery of Cro proteins with 40% sequence identity but different folds. Proc Natl Acad Sci USA. 2008;105:2343–8. 10.1073/pnas.0711589105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Tian  W, Skolnick  J. How well is enzyme function conserved as a function of pairwise sequence identity?. J Mol Biol. 2003;333:863–82. 10.1016/j.jmb.2003.08.057. [DOI] [PubMed] [Google Scholar]
  • 75. StandardScaler. scikit-learn. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html. Accessed 27 October 2025. [Google Scholar]
  • 76. LightGBM 4.6.0 documentation. https://lightgbm.readthedocs.io/en/stable/Python-Intro.html. Accessed 27 October2025.
  • 77. PyTorch Foundation. PyTorch. https://pytorch.org. Accessed 27 October 2025. [Google Scholar]
  • 78. Akiba  T, Sano  S, Yanase  T, et al.  Optuna. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York, NY: ACM; 2019:2623–31. [Google Scholar]
  • 79. Hussein  HAM, Thabet  AA, Wardany  AA, et al.  SARS-CoV-2 outbreak: role of viral proteins and genomic diversity in virus infection and COVID-19 progression. Virol J. 2024;21:75. 10.1186/s12985-024-02342-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80. Verma  J, Subbarao  N. A comparative study of human betacoronavirus spike proteins: structure, function and therapeutics. Arch Virol. 2021;166:697–714. 10.1007/s00705-021-04961-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81. Bhella  D. Virus proteins and nucleoproteins: an overview. Subcell Biochem. 2018;88:1–18. 10.1007/978-981-10-8456-0_1. [DOI] [PubMed] [Google Scholar]
  • 82. del Sol  A, Carbonell  P. The modular organization of domain structures: insights into protein–protein binding. PLoS Comput Biol. 2007;3:e239. 10.1371/journal.pcbi.0030239. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83. Kumar  M, Baig  MS, Bhardwaj  K. Advancements in the development of antivirals against SARS–coronavirus. Front Cell Infect Microbiol. 2025;15:1520811. 10.3389/fcimb.2025.1520811. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84. DOME-ML Annotations for “ViralBindPredict: empowering viral protein–ligand binding sites through deep learning and protein sequence-derived insights.” https://registry.dome-ml.org/review/hxgpadxn62. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85. Amorim  AMB, Marques-Pereira  C, Almeida  T, et al.  Supporting data for “ViralBindPredict: empowering viral protein–ligand binding sites through deep learning and protein sequence-derived insights”. GigaScience. Database. 2026. 10.5524/102798. [DOI] [PMC free article] [PubMed]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Citations

  1. Durairaj J, Adeshina Y, Cao Z, et al. PLINDER: the protein–ligand interactions dataset and evaluation resource. bioRxiv.
  2. Amorim AMB, Marques-Pereira C, Almeida T, et al. Supporting data for “ViralBindPredict: empowering viral protein–ligand binding sites through deep learning and protein sequence-derived insights”. GigaScience Database. 2026. 10.5524/102798.

Supplementary Materials

giag010_Supplemental_File
giag010_Authors_Response_To_Reviewer_Comments_original_submission
giag010_Authors_Response_To_Reviewer_Comments_revision_1
giag010_GIGA-D-25-00033_original_submission
giag010_GIGA-D-25-00033_Revision_1
giag010_GIGA-D-25-00033_Revision_2
giag010_Reviewer_1_Report_original_submission (Reviewer 1, 2/24/2025)
giag010_Reviewer_1_Report_Revision_1 (Reviewer 1, 11/6/2025)
giag010_Reviewer_2_Report_original_submission (Reviewer 2, 3/3/2025)
giag010_Reviewer_2_Report_Revision_1 (Reviewer 2, 11/21/2025)
giag010_Reviewer_3_Report_original_submission (Reviewer 3, 3/3/2025)

Data Availability Statement

All additional supporting data are available in the GigaScience repository, GigaDB [85].


Articles from GigaScience are provided here courtesy of Oxford University Press