Significance
Machine learning has the potential to accelerate drug discovery by enabling rapid, high-quality predictions of protein–ligand binding affinities, but current models can fail unpredictably when applied to novel targets unseen during training. This generalizability problem limits their real-world utility. We hypothesize that this failure stems from models developing a bias toward structure-specific correlations that competes with learning the fundamentals of molecular interactions. We introduce COnvolutional Representation of Distance-dependent Interactions with Attention Learning (CORDIAL), an interaction-only deep learning framework that overcomes this limitation by focusing exclusively on the physicochemical properties of the protein–ligand interface. Validation simulating prospective screening against new protein families shows that CORDIAL uniquely maintains predictive performance. This work provides a machine learning solution for generalizable structure-based affinity ranking and offers a blueprint for developing reliable models.
Keywords: deep learning, computer-aided drug design, generalizability, virtual screening
Abstract
Rapid and accurate estimation of protein–ligand binding affinities is crucial for early-stage drug discovery, yet hindered by a trade-off between the accuracy of gold-standard physics-based methods and the speed of simpler empirical scoring functions. Machine learning (ML) promised to bridge this gap, but its potential is unrealized due to limited model generalizability. Current ML models often fail when predicting affinities for novel proteins or chemical series unseen during training. We hypothesize that this failure stems from a competition within these models during training, where the learning of spurious correlations from structural motifs prevalent in the training data competes with the learning of transferable, physicochemical principles governing molecular interaction. Here, we introduce COnvolutional Representation of Distance-dependent Interactions with Attention Learning (CORDIAL), a deep learning framework designed with an inductive bias toward learning the distance-dependent physicochemical interaction signatures between proteins and ligands, explicitly avoiding direct parameterization of their chemical structures. This interaction-only approach proves effective. Through leave-superfamily-out validation that simulates encounters with novel protein families, we demonstrate that CORDIAL maintains predictive performance and calibration. This contrasts with diverse contemporary ML models, whose predictive ability is degraded under these conditions. Our results highlight the value of encoding appropriate task-specific physicochemical principles into ML architectures and offer a validated strategy for developing generalizable models for structure-based drug discovery.
Therapeutic drug discovery is a multidisciplinary process focused on the identification and optimization of compounds capable of modulating biological targets to treat disease. Small molecules represent the majority of FDA-approved therapeutics (1). In recent years, there has been substantial progress in expanding the searchable chemical space for drug discovery through approaches such as DNA-encoded libraries and ultra-large make-on-demand chemical libraries. These advancements aim to increase the diversity and quality of chemical matter accessible for hit identification, demanding faster and more reliable computational screening methods.
The identification of high-quality hits, i.e., those compounds characterized by high potency, selectivity, favorable physicochemical properties, metabolic stability, low toxicity, and good absorption, distribution, metabolism, and excretion profiles, at the early stages of drug discovery is critical for reducing the cost of lead optimization and accelerating Investigational New Drug development and clinical trials (2). Computer-aided drug design has emerged as an important tool in this process and has the potential to significantly enhance early-stage drug discovery.
Unfortunately, the ability to computationally estimate the strength of protein-small molecule interactions accurately and at scale remains limited. While statistical physics-based approaches, such as alchemical free energy perturbation (FEP), offer accuracy, their computational cost prohibits exploration of vast chemical spaces (3, 4). Conversely, efficient docking scores suffer from limited accuracy and reliability (5, 6). Machine learning (ML) emerged as a powerful paradigm to potentially bridge this accuracy-speed gap. However, despite a decade of intense effort and architectural innovation, ML models for affinity prediction have failed to generalize reliably beyond their training distributions (7–11).
This generalizability failure represents a critical barrier to the field’s progress. We posit that this stems from models learning spurious correlations from structural motifs prevalent in limited training data that compete with the transferable principles of intermolecular interaction. This issue is exacerbated by widespread reliance on validation strategies that fail to adequately probe for out-of-distribution (OOD) generalization. Standard random k-fold cross-validation, for instance, overestimates real-world performance by ensuring training and test sets are drawn from the same data distribution. Other methods are similarly flawed. Temporal splits can contain high target and chemical scaffold similarity, particularly within large, well-studied protein families and proteins of long-term therapeutic interest. Leave-one-protein-out protocols suffer from data leakage when other members of the same protein family remain in the training set. Sequence-based splits can also be insufficient for structure-based models, as proteins with low sequence identity can share identical folds, creating a risk of data contamination from chemotypes known to target a particular fold (12–14). These challenges in developing robust validation strategies have fueled considerable debate about the true utility of published models.
To address these challenges in both modeling and validation, we make two primary contributions. First, we demonstrate the utility of a stringent validation benchmark for protein–ligand affinity prediction, CATH-based (15) Leave-Superfamily-Out (LSO), designed to simulate prospective screening. By withholding entire protein homologous superfamilies and their implicitly associated chemical scaffolds (12–14) from the training set, the LSO protocol provides a more robust measure of a model’s ability to generalize to novel protein architectures and chemistries. Second, we present a conceptual alternative to structure-centric embeddings. We introduce a feature extraction strategy designed to leverage protein–ligand interaction space exclusively and overcome representation-based bias. Our approach embeds the protein–ligand system by creating interaction radial distribution functions (RDFs) from the distance-dependent cross-correlations of fundamental chemical properties between protein–ligand atom pairs. This representation intentionally prevents the model from directly learning parameters tied to specific chemical substructures, forcing it instead to learn generalizable relationships within the interaction space. To effectively process these structured interaction RDFs, we developed a tailored neural network architecture, COnvolutional Representation of Distance-dependent Interactions with Attention Learning (CORDIAL), which utilizes 1D convolutions for local interaction-specific distance-dependent learning and axial attention for global distance and property interactions. To ensure robust training and evaluation under this framework, we constructed an augmented dataset combining structural data with extensive biochemical activity data from public repositories.
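As an illustration of the featurization idea, the accumulation of distance-binned property cross-correlations into an interaction RDF matrix can be sketched in a few lines of Python. The property vectors, distance range, and bin width below are placeholders for illustration; the actual feature set and discretization used by CORDIAL are specified in Materials and Methods.

```python
import numpy as np

def interaction_rdf(prot_xyz, prot_props, lig_xyz, lig_props,
                    r_max=16.0, dr=0.5):
    """Accumulate distance-binned cross-correlations of atomic properties.

    prot_props / lig_props: (n_atoms, n_props) arrays of per-atom
    physicochemical properties (e.g., partial charge, hydrophobicity).
    Returns a (n_bins, n_prot_props * n_lig_props) distance-feature matrix.
    """
    n_bins = int(r_max / dr)
    rdf = np.zeros((n_bins, prot_props.shape[1] * lig_props.shape[1]))

    # All pairwise protein-ligand distances, binned at dr resolution
    d = np.linalg.norm(prot_xyz[:, None, :] - lig_xyz[None, :, :], axis=-1)
    bins = (d / dr).astype(int)

    for i in range(prot_xyz.shape[0]):
        for j in range(lig_xyz.shape[0]):
            b = bins[i, j]
            if b < n_bins:
                # Outer product of property vectors = all cross-correlations
                rdf[b] += np.outer(prot_props[i], lig_props[j]).ravel()
    return rdf

# Toy example: 5 protein atoms, 4 ligand atoms, 2 properties per atom
rng = np.random.default_rng(0)
prot_xyz = rng.normal(scale=3.0, size=(5, 3))
lig_xyz = rng.normal(scale=3.0, size=(4, 3))
mat = interaction_rdf(prot_xyz, np.ones((5, 2)), lig_xyz, np.ones((4, 2)))
```

Because the matrix indexes only distances and property products, no parameter of a downstream model is ever tied to a specific atom identity or substructure.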
This interaction-only strategy stands in contrast to prevailing paradigms that rely on graph-based or voxel-based representations of molecular structure. Voxel-based 3D convolutional neural networks (3D-CNNs) are a natural extension of the grid-based scoring functions first pioneered in docking programs like DOCK, which used precalculated grids to rapidly evaluate van der Waals and electrostatic interaction energies (16). Modern 3D-CNNs, introduced through work including KDEEP and AtomNet (17–19), replace these empirical energy terms with learned convolutional filters. More recently, graph neural networks (GNNs) have driven significant advances in other areas of chemistry, perhaps most notably in the development of learned force fields (20–23), which naturally motivates their extension to structure-based affinity prediction (24, 25).
While powerful, we hypothesize that the inductive biases of these architectures, which are well suited for learning from chemical topology or explicit 3D structure, can inadvertently encourage models to learn spurious correlations tied to specific, recurring substructures rather than the underlying principles of interaction, especially when generalizing to novel targets. Indeed, while these approaches and others have demonstrated some efficacy, successes have been mixed. The performance of both 3D-CNN and GNN architectures often degrades significantly on OOD benchmarks designed to test generalization to novel protein families. Some analyses suggest that certain models derive much of their predictive power from ligand features alone, with minimal contribution from the protein’s structure (7–11, 26, 27). Such results raise the concern that high performance reported on standard, in-distribution test sets may be overly optimistic and not reflective of performance in real-world prospective applications. In the absence of sufficient type and quantity of data to prevent these biases, it is possible that OOD inference will remain challenging for these architectures.
This hypothesis, that architectural bias is the primary driver of generalization failure, motivates a direct comparison. We therefore leverage our validation framework to study in-distribution (random split) versus out-of-distribution (CATH-LSO) performance across structure-centric and interaction-only ML architectures. The results demonstrate that while all models perform well on the in-distribution task, the predictive capacity of 3D-CNN and GNN models is notably reduced on the CATH-LSO benchmark. In contrast, our interaction-only model, CORDIAL, uniquely maintains predictive performance and calibration, demonstrating that its physicochemically intuitive inductive bias allows it to learn transferable principles of binding. This work demonstrates an effective approach for developing generalizable machine learning models for molecular structure-based drug design.
Results
Our objective is to develop an unbiased, structure-based model for predicting protein–ligand binding affinities suitable for virtual high-throughput screening. Given the vastness of estimated druglike chemical space (28), models parameterizing structures risk learning biases from limited training data. In the absence of sufficient data, the lack of task-specific inductive bias in general model architectures is difficult to overcome. We hypothesize that models trained strictly on a projection of the interaction space defined by pairwise interacting atoms between two molecules will generalize more effectively than those whose parameters are directly tuned on chemical structures.
We develop an embedding and modeling approach using distance-dependent interaction graphs that we refer to as CORDIAL (Fig. 1). We trained CORDIAL to classify interaction strength against 8 ordinally ranked affinity thresholds (pKd 1 to pKd 8). Labels are cumulative; e.g., pKd 4 requires also meeting pKd 1, 2, and 3. All comparison models are trained using this identical ordinal classification scheme (for full details on featurization, architecture, and training, see Materials and Methods).
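The cumulative labeling scheme can be made concrete with a short sketch (the encoding function below is illustrative, not the training code itself):

```python
import numpy as np

def ordinal_labels(pkd, thresholds=range(1, 9)):
    """Encode an affinity as cumulative binary threshold labels.

    A complex with pKd 4.2 meets thresholds 1 through 4, so its label
    vector is [1, 1, 1, 1, 0, 0, 0, 0].
    """
    return np.array([int(pkd >= t) for t in thresholds])

print(ordinal_labels(4.2))  # [1 1 1 1 0 0 0 0]
```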
Fig. 1.
Distance-dependent interaction graph learning. (A) Initial protein–ligand complex. (B) Protein and ligand structures visualized as collection of atoms. (C) Calculation of chemical property cross-correlations between protein and ligand atoms at discrete distance intervals. (D) Accumulation of chemical property cross-correlation radial distribution functions into distance-feature matrix. (E) Application of 1D convolutions to each feature column independently to learn feature-specific smoothing as a function of distance. (F) Application of axial attention to enable distance dependencies and feature mixing. (G) Sample input distance-feature matrix. (H) Sample distance-feature matrix postconvolutional block. (I) Sample distance-feature matrix postattention.
Assessing Ordinal Affinity Ranking on Novel Protein Families.
We evaluate CORDIAL’s performance against representatives from each of several common deep learning architectures for protein–ligand binding affinity prediction. Specifically, we train a prototypical voxel grid-based 3D-CNN with a conventional and well-established architecture (17, 19). We also train a modern graph attention network (GAT) that makes use of radial atomic environment vectors as node features (24). Each model is evaluated using both a random 5% validation split and the more stringent CATH-LSO validation protocol (see Materials and Methods for details). As a control, we also include VinaScore from GNINA as a non-ML baseline.
To assess each model’s ability to discriminate between affinity classes, we measured the one-vs.-rest (OvR) receiver operating characteristic area under the curve (ROC AUC) for each of the eight ordinally ranked affinity thresholds (Fig. 2). This metric evaluates how well a model can distinguish complexes that meet a specific affinity threshold (e.g., pKd 5) from all other complexes. The results are presented as two sets of distributions for each model: performance on the 5% random split validation sets (gray boxplots), which reflects in-distribution performance, and performance across the ten CATH-LSO test sets (colored boxplots), which reflects out-of-distribution generalization.
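For readers unfamiliar with the metric, the OvR ROC AUC for each cumulative threshold can be computed as follows. This is an illustrative numpy implementation based on the Mann–Whitney identity, not the evaluation code used in this work.

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC AUC via the Mann-Whitney U statistic (ties get half credit)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    # Fraction of (positive, negative) pairs ranked correctly
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def ovr_auc_per_threshold(pkd_true, prob_matrix, thresholds=range(1, 9)):
    """One-vs.-rest AUC for each cumulative affinity threshold.

    prob_matrix[:, k] is the model's probability that a complex meets
    threshold k + 1.
    """
    return [roc_auc(pkd_true >= t, prob_matrix[:, k])
            for k, t in enumerate(thresholds)]
```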
Fig. 2.
Affinity threshold discrimination. We train ordinally ranked multiclass classifiers for protein–ligand complex affinity ranking with several unique structure-based deep learning architectures: (A) a 3D convolutional neural network (3D-CNN; pink/red), (B) a graph attention network (GAT; gold/green), (C) CORDIAL (blue/green), which takes as input the distance-dependent interaction radial distribution function matrix and processes it with successive grouped convolutions and axial attention before terminating in an MLP prediction head. (D) As an additional control, we included performance of VinaScore as implemented in GNINA (purple). VinaScore was used without modification for all benchmarks. Discrimination between activity thresholds is measured with two unique sets of validation splits. Our primary measure is the one-vs.-rest (OvR) ROC AUC for each affinity threshold across the 10 CATH-LSO (i.e., homologous superfamily) test sets (colored box-and-whisker plots). For comparison to evaluate model generalizability, we also measure performance on a 5% randomly held-out validation set of complexes from each of our CATH training sets (gray box-and-whisker plots). For both evaluations, each independent test set is internally bootstrapped to balance the result labels such that the number of samples per label is equal to the maximum sample count of all the labels.
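The bootstrap label balancing described in the caption amounts to upsampling every affinity class to the majority-class count; a minimal sketch, with the random-seed handling as our own assumption:

```python
import numpy as np

def bootstrap_balance(labels, rng=None):
    """Return indices that upsample every class to the largest class count."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        members = np.flatnonzero(labels == c)
        # Sample with replacement up to the majority-class count
        idx.append(rng.choice(members, size=n_max, replace=True))
    return np.concatenate(idx)
```

Applying the returned index array to both labels and predictions yields a class-balanced evaluation set, so that no single affinity bin dominates the metrics.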
Our results reveal considerable divergence in model generalization capabilities when subjected to realistic validation scenarios (Fig. 2). On standard random-split validation, GAT achieves the highest apparent performance (ROC AUC 0.95 to 1.0), followed by CORDIAL and 3D-CNN. All models outperform the VinaScore baseline on this in-distribution task.
However, the transition to the stringent CATH-LSO validation exposes weaknesses in models relying on direct structural parameterization. The benchmark deep learning models exhibit substantial performance degradation from their random split baseline. Both 3D-CNN and GAT show the most significant loss of performance at lower affinity thresholds (pKd 1 to 4) and the highest affinity threshold (pKd 8), with GAT’s median ROC AUC collapsing by over 0.3 at most thresholds. Their similar failure modes suggest that for both architectures, learned correlations specific to training set substructures compete with the learning of transferable interaction principles, hindering generalization to novel superfamilies. As expected, the non-ML VinaScore baseline performs consistently across both random and LSO splits; its predefined functional form and small number of tunable parameters make its performance relatively stable across data partitions.
In contrast to the other ML models, CORDIAL’s predictive performance remains largely comparable between the random and CATH-LSO splits, with only a minor drop in median ROC AUC. VinaScore and CORDIAL both exhibit higher variance in their CATH-LSO predictions compared to the random split. This increased variance is an expected consequence of the LSO benchmark because each test set is more homogeneous than a random split and presents the model under evaluation with a unique physicochemical challenge. Notably, CORDIAL consistently outperforms VinaScore in its discriminative capacity across all affinity thresholds on the CATH-LSO splits. Overall, CORDIAL’s performance under OOD conditions is consistent with the hypothesis that focusing learning exclusively on the interaction space can lead to a more generalizable model. Its improvement over VinaScore further demonstrates the potential utility of generalizable ML models compared to conventional physics-based empirical score functions.
While ROC AUC assesses overall affinity threshold discrimination, a visual inspection of the normalized confusion matrices provides a more granular view of predictive accuracy on a per-split basis (Fig. 3). In these plots, the diagonal from Bottom Left to Top Right represents the positive predictive value (PPV) for each affinity threshold, with darker blue indicating higher PPV. The first column for each model shows aggregate performance on the in-distribution random splits reflective of training fit, while the subsequent columns show performance on the ten OOD CATH-LSO splits. On the in-distribution random split, all the deep learning models perform well, exhibiting clear diagonal dominance. GAT clearly performs the best on the random split validation. These data indicate that GAT, 3D-CNN, and CORDIAL are all capable of effectively learning the ordinal classification task from the training data.
Fig. 3.
Affinity threshold classification. Comparison of GNINA (purple), CORDIAL (blue/green), GAT (gold/green), 3D-CNN (pink/red) on aggregate random split and individual CATH-LSO tests. The x- and y-axes correspond to predicted and true affinity thresholds, respectively. The color bar is scaled from 0.0 (white) to 0.5 (blue), and each predicted-value column is normalized to sum to 1. The diagonal line from Bottom Left to Top Right therefore corresponds to the positive predictive value (PPV), and ideal performance would be a value of 1 along the diagonal and 0 otherwise. Training validation splits are included for model generalizability comparisons (i.e., how well each model performs on individual CATH-LSO sets compared to its ideal performance on in-distribution data). For both evaluations, each independent validation/test set is internally bootstrapped to balance the result labels such that the number of samples per label is equal to the maximum sample count of all the labels.
However, a clear divergence emerges on the OOD CATH-LSO splits. Both the GAT and 3D-CNN models exhibit an increase in off-diagonal predictions, indicating a higher rate of misclassification when faced with novel protein families. The GAT model shows varied performance, with strong diagonal prediction on some superfamilies but a high concentration of predictions in off-diagonal bins on others (e.g., 3.30.565.10 and 1.20.920.10). The 3D-CNN model shows a more consistent, but diffuse, pattern of misclassifications across most splits. In contrast, CORDIAL’s confusion matrices remain more diagonally dominant across the CATH-LSO splits, showing a performance pattern more similar to the VinaScore baseline, but with higher overall accuracy (Fig. 3).
We quantify the visual trends from the confusion matrix using several ordinal classification metrics, which compare in-distribution (random split) and OOD (CATH-LSO) performance (Fig. 4). We evaluate the quadratic weighted kappa (QWK), which measures ordinal agreement; the mean absolute error (MAE), which quantifies the average error in predicted affinity bins; and accuracy within one affinity bin. On the in-distribution random splits, all deep learning models show strong performance, with high QWK, high accuracy, and low MAE, indicating they have all successfully learned the ranking task to outperform the VinaScore baseline.
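The three metrics can be computed directly from the predicted and true affinity bins; the implementation below is an illustrative sketch, not the evaluation code used here.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=8):
    """Cohen's kappa with a quadratic penalty for ordinal disagreement."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Observed confusion matrix
    O = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    # Expected matrix from the marginal label distributions
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    i, j = np.indices((n_classes, n_classes))
    W = (i - j) ** 2 / (n_classes - 1) ** 2   # quadratic disagreement weights
    return 1.0 - (W * O).sum() / (W * E).sum()

def ordinal_metrics(y_true, y_pred, n_classes=8):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {
        "qwk": quadratic_weighted_kappa(y_true, y_pred, n_classes),
        "mae": np.abs(y_true - y_pred).mean(),
        "acc_within_1": (np.abs(y_true - y_pred) <= 1).mean(),
    }
```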
Fig. 4.
Aggregate ranking metrics. Comparison of GNINA (purple), CORDIAL (blue/green), GAT (gold/green), 3D-CNN (pink/red) on accuracy, accuracy within ±1 affinity threshold bin, quadratic weighted kappa (QWK), and mean absolute error (MAE) with respect to affinity threshold bin classification. For each model, we compare the 5% random split validation set performance (translucent box-and-whisker plots with dashed edges) to the test-time CATH-LSO split performance (solid box-and-whisker plots with solid edges) as a measure of model generalizability. For both random split and CATH-LSO evaluations, each independent validation/test set is internally bootstrapped to balance the result labels such that the number of samples per label is equal to the maximum sample count of all the labels.
On the OOD CATH-LSO splits, however, the performance of the structure-centric models degrades. Both GAT and 3D-CNN show a drop in median QWK to the 0.2 to 0.3 range, bringing their performance in line with the VinaScore baseline and reflecting a diminished capacity for correct ordinal ranking on novel targets. In contrast, CORDIAL maintains a median QWK of approximately 0.65 and an MAE of approximately 1.5 affinity bins, outperforming all other models, including VinaScore. CORDIAL’s ordinal performance is further confirmed by its accuracy within one bin, which remains higher than the other methods with a median of approximately 0.6.
Finally, to provide a more granular, per-protein assessment of performance, we analyze model predictions on ten diverse, representative protein targets, one from each held-out CATH-LSO test set (Fig. 5). These confusion matrices are analogous to those in the overall analysis, but show performance on a single protein’s congeneric ligand series without the label balancing used in the aggregate benchmarks. The 1D histograms show the true label distribution for each target, highlighting the class imbalance inherent to real-world test datasets. Note that even though these are protein-specific assessments, the models in question were trained on the corresponding CATH-LSO training datasets, which exclude homologous proteins (i.e., the training sets are the same as in Fig. 3).
Fig. 5.
Affinity threshold classification on sample proteins from CATH-LSO splits. Comparison of GNINA (purple), CORDIAL (blue/green), GAT (gold/green), and 3D-CNN (pink/red) on aggregate random split and individual CATH-LSO tests. The x- and y-axes correspond to predicted and true affinity thresholds, respectively. The color bar is scaled from 0.0 (white) to 0.5 (blue), and each predicted-value column is normalized to sum to 1. The diagonal line from Bottom Left to Top Right therefore corresponds to the positive predictive value (PPV), and ideal performance would be a value of 1 along the diagonal and 0 otherwise. Training validation splits are included for model generalizability comparisons (i.e., how well each model performs on individual CATH-LSO sets compared to its ideal performance on in-distribution data). Testing is performed without label balancing, and the corresponding true label counts are depicted as gray 1D histograms. In this particular benchmark, each CATH-LSO ID corresponds to a single protein and PDB ID. For that PDB ID, we took as the test set the PDBbind2020 protein–ligand complex as well as the associated congeneric ligand series from PubChem. Even though each test set is restricted to a single protein from the CATH-LSO split, the entire CATH-LSO ID was held out during training to avoid contamination with similar proteins and small molecules. The CATH-LSO ID to PDB ID mapping is as follows: 1.10.510.10–2ITZ, epidermal growth factor receptor; 2.40.60.10–2IQG, beta secretase 1; 2.40.10.10–3QTO, thrombin; 2.60.40.10–6AWO, sodium-dependent serotonin transporter; 1.10.565.10–1ZUC, progesterone receptor; 3.10.200.10–4YXI, carbonic anhydrase 2; 1.20.920.10–4HBY, bromodomain-containing protein 4; 3.30.565.10–6ELO, heat shock protein HSP90; 1.10.1300.10–1Q9M, cAMP-specific phosphodiesterase PDE4D2; and 3.40.190.10–1PBQ, n-methyl-d-aspartate receptor subunit 1.
The results on these individual targets largely mirror the aggregate findings. The GAT and 3D-CNN models often produce more diffuse predictions, indicating a higher rate of misclassification. In contrast, CORDIAL’s predictions for these targets form a clearer and more concentrated diagonal, signifying more robust ordinal ranking. The analysis also reveals the limitations of all current approaches on certain targets. A notable example is carbonic anhydrase 2 (4YXI, 3.10.200.10), where all models, including CORDIAL and VinaScore, exhibit poor performance exemplified by highly diffuse confusion matrices. This result highlights the particular difficulty of this superfamily, which may possess unique structural or physicochemical features that make it a challenging case for single-point binding affinity models. Overall, these data reinforce results from the aggregate ranking and are consistent with our hypothesis that structure-centric models may struggle on OOD tasks because learned substructural biases compete with the principles of molecular interaction, leading them to fail unpredictably. CORDIAL’s explicit restriction to the interaction space appears to mitigate this failure mode, leading to more robust generalization.
Analysis of Model Calibration on Out-of-Distribution Targets.
A model’s utility in virtual screening depends not only on its ranking ability but also on its calibration. Calibration is the degree to which the predicted probabilities correspond to the true likelihood of activity. Here, we evaluate model calibration on the OOD CATH-LSO splits to simulate performance on novel targets (Fig. 6).
Fig. 6.
Model calibration on out-of-distribution test sets. The plots show the observed accuracy (fraction of positives) versus the model’s predicted probability for each affinity threshold. For each probability bin, the solid line represents the mean accuracy calculated across the ten CATH-LSO test sets, while the shaded region represents the corresponding SD. An ideally calibrated model would follow the dashed diagonal line. The analysis is performed on bootstrap-balanced test sets, with model logits transformed to probabilities using a sigmoid function. Models are colored as follows: GNINA (purple), CORDIAL (blue/green), GAT (gold/green), and 3D-CNN (pink/red).
These calibration plots show the observed accuracy (fraction of correct classifications) as a function of the model’s predicted probability, averaged across the ten CATH-LSO splits. The shaded regions represent the SD across these splits, indicating the consistency of the calibration. For a perfectly calibrated model, the predictions would fall along the dashed diagonal line, where a predicted probability of, for example, 80% would correspond to an 80% true positive rate.
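The binning behind these plots can be sketched as follows; the 10-bin discretization is an assumption for illustration.

```python
import numpy as np

def calibration_curve(y_true, prob, n_bins=10):
    """Observed fraction of positives per predicted-probability bin."""
    y_true = np.asarray(y_true)
    prob = np.asarray(prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.clip(np.digitize(prob, edges) - 1, 0, n_bins - 1)
    mean_pred, frac_pos = [], []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():  # skip empty probability bins
            mean_pred.append(prob[mask].mean())
            frac_pos.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(frac_pos)
```

A perfectly calibrated model produces points lying on the diagonal, i.e., the fraction of positives in each bin equals the mean predicted probability of that bin.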
The results reveal a clear distinction in calibration performance between the models. The GNINA VinaScore baseline shows mixed performance; while poorly calibrated for lower affinity bins, it becomes increasingly linear in the pKd 4 to 7 range, though with a slope consistently lower than ideal. The benchmark deep learning models, GAT and 3D-CNN, are poorly calibrated across most affinity thresholds. Their prediction curves are largely flat, showing little correlation with the ideal diagonal, which indicates that their output probabilities are not reliable estimators of the true likelihood of activity.
In contrast, CORDIAL’s predictions are better calibrated. For the first six affinity thresholds (pKd 1 to 6), its performance is largely linear and tracks the ideal diagonal, though it exhibits some systematic overconfidence as the affinity threshold increases. The calibration curve becomes less stable for the pKd 7 threshold, particularly at higher probability estimates. For the highest affinity threshold, pKd 8, the model’s output probabilities are compressed such that accuracy does not reliably exceed 0.3 independent of the predicted probability (Fig. 6).
CORDIAL’s effective calibration is consistent with our primary hypothesis that focusing on the interaction space promotes generalizability. The calibration performance is also likely partially attributable to a training strategy tailored for the ordinal nature of the task. By treating each cumulative affinity threshold as an independent binary classification problem with its own binary cross-entropy loss, the model learns a well-calibrated probability for each level.
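A minimal numpy sketch of this per-threshold loss, assuming one logit per cumulative threshold (the actual training uses a deep learning framework):

```python
import numpy as np

def ordinal_bce_loss(logits, pkd, thresholds=np.arange(1, 9)):
    """Sum of per-threshold binary cross-entropies.

    logits: (n, 8) raw model outputs, one per cumulative threshold.
    pkd:    (n,) true affinities, converted to cumulative binary targets.
    """
    targets = (pkd[:, None] >= thresholds[None, :]).astype(float)
    p = 1.0 / (1.0 + np.exp(-logits))   # per-threshold sigmoid probability
    eps = 1e-12                          # numerical safety for the logs
    bce = -(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))
    return bce.sum(axis=1).mean()
```

Because each threshold contributes its own binary loss, each sigmoid output is trained as a stand-alone probability, which is the property the calibration analysis above probes.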
Feature Saliency Analysis.
Finally, we evaluate the extent to which CORDIAL learns chemically meaningful patterns. To do so, we perform a feature saliency analysis to understand which features most influence its predictions (Fig. 7). The analysis calculates the mean signed gradient of the model’s output probability with respect to each input feature, averaged over a large, random sample of the training set. The resulting heatmaps display this saliency for the 24 most abundant interaction features as a function of distance for each of the eight affinity thresholds. Red indicates a positive gradient, where an increase in the feature’s magnitude increases the predicted probability, while blue indicates a negative gradient.
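For intuition, the signed gradient has a closed form for a toy logistic model, where each feature's saliency inherits the sign of its weight; the sketch below illustrates the quantity being mapped (CORDIAL's saliencies are computed by backpropagation through the full network, not this toy model).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_signed_saliency(X, w, b=0.0):
    """Mean gradient of a logistic model's probability w.r.t. each input.

    For p = sigmoid(w @ x + b), dp/dx = p * (1 - p) * w, so each feature's
    saliency carries the sign of its weight, scaled by prediction confidence.
    """
    p = sigmoid(X @ w + b)                       # (n,) predicted probabilities
    grads = (p * (1 - p))[:, None] * w[None, :]  # (n, d) signed gradients
    return grads.mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w = np.array([2.0, -1.0, 0.0])
s = mean_signed_saliency(X, w)
# Saliency signs mirror the weights: positive, negative, ~zero
```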
Fig. 7.
Distance-resolved feature saliency for predicted affinity thresholds. Each subplot displays a heatmap of the mean signed gradient (saliency) for the top 24 most abundant interaction features, as a function of distance (Å), for a specific binding affinity threshold (pKd 1 to pKd 8). Rows correspond to distinct physicochemical feature pairs (Materials and Methods), and columns represent distance bins from 0.0 to 16.0 Å grouped in 0.5 Å intervals. The value of each feature on the y-axis is a nonnegative sum representing the total magnitude of a specific interaction type. The color indicates how increasing this magnitude influences the model’s prediction. Red (positive gradient) indicates that increasing the magnitude of the specified interaction is associated with a higher probability for the corresponding activity threshold. Blue (negative gradient) indicates that increasing the magnitude of the specified interaction is associated with a lower probability for the corresponding activity threshold. Saliency plots are generated using a random sample of 81,920 protein–ligand complexes from the fully trained CORDIAL training set.
The analysis reveals that the model learns physically intuitive principles that evolve logically across the affinity spectrum. The model’s behavior at the lowest affinity threshold (pKd 1) reflects this level’s role as a gate for acceptable steric and chemical interactions. It learns a general penalty for close atomic contacts, which is evident from the strong and consistent negative (blue) gradients observed across numerous features at short distances (<2.5 Å).
As the affinity threshold increases into the mid-range (pKd 4 to 6), the saliency patterns shift. The short-range negative gradients become less prominent, and distinct positive (red) gradients emerge at physically realistic interaction distances (approximately 2.5 to 5.0 Å). These gradients often exhibit continuity across adjacent distance bins, forming smooth patterns rather than noisy, isolated signals. This indicates the model has learned to associate specific interaction types at optimal distances with a higher probability of binding.
At the highest affinity thresholds (pKd 7 and 8), the saliency maps become sparser and the overall gradient magnitudes decrease. This trend is consistent with the idea that high-affinity binding relies on a few, highly optimized interactions. Importantly, the clear, distance-dependent, and chemically rational evolution of these patterns provides evidence that CORDIAL’s predictions are based on a consistent interpretation of distance-dependent interactions.
Discussion
Accurate and efficient prediction of protein–ligand binding affinities is a cornerstone of modern drug discovery, yet current computational methods face limitations. While physics-based approaches offer accuracy, their cost is prohibitive for screening vast chemical libraries, and faster scoring functions often lack accuracy and reliability. ML promised a solution, but widespread adoption has been stalled by poor model generalizability. Models trained on existing datasets frequently fail when applied prospectively to novel protein targets or chemical matter, limiting their real-world impact. We hypothesized that this stems from a competition within the models, where the learning of spurious correlations from structural motifs prevalent in the training data competes with the learning of the transferable, physicochemical principles governing molecular interaction. This work presents a framework to address this challenge through two primary contributions. The first is a challenging validation approach, CATH-LSO, designed to evaluate generalization capabilities. The second is an interaction-only deep learning model, CORDIAL, which serves as a test of the hypothesis that a more constrained, physically informed inductive bias can overcome this limitation.
Our results demonstrate a clear divergence in performance between models with different inductive biases when subjected to the stringent LSO benchmark. While contemporary graph-based and voxel-based architectures perform well on in-distribution random splits, their predictive power degrades on OOD tasks. In contrast, CORDIAL’s performance is maintained. This finding highlights a fundamental trade-off between architectural flexibility and data requirements. Highly flexible architectures, such as GNNs and 3D-CNNs, are powerful but may require massive and diverse datasets to optimize their target objective in a generalizable weight space (i.e., to implicitly learn to prioritize physical interactions over structural shortcuts). Our work presents an implementation of the alternative. By employing a specialized architecture with a strong, task-appropriate inductive bias, the model has a reduced ability to learn from competing correlations. The primary benefit of this approach is the potential to train robustly generalizable models from more moderately sized datasets. The corresponding trade-off is that such specialization requires more human-driven feature and architecture engineering and may ultimately have lower expressive power than a more general model trained on a sufficiently vast and comprehensive dataset.
The practical utility of a computational model in a drug discovery campaign is determined by more than just its ranking ability on benchmarks. Model calibration is critical for interpreting the results in prospective screens. For example, if a model is able to rank-order compounds correctly, but produces a compressed and uninformative range of scores, it can be difficult to distinguish an exceptional hit from an average one. In such cases, it then becomes necessary to perform expensive, per-target recalibration with known data. CORDIAL’s reliable calibration, even on unseen protein families, means that its confidence scores are more directly translatable to a true likelihood of activity. This feature is useful for building trust in ML-driven predictions and streamlining the path from large-scale virtual screening to experimental validation.
Finally, we find it important to note several limitations to our current work. While our CATH-LSO protocol is a rigorous benchmark, its stringency could be further improved. CATH classifies individual domains, but our targets can be multidomain complexes. Our protocol holds out proteins containing a domain from one of the ten most abundant superfamilies in the base PDBbind dataset; however, a multidomain protein in the test set could share a secondary domain with proteins still present in the training set. Future benchmarks could adopt an even stricter protocol by excluding any protein that shares any domain with a held-out superfamily, or by explicitly identifying the relevant ligand-binding domain as that to be held out. Designing challenging benchmarks is a critical step in the field. The development of the Critical Assessment of Structure Prediction challenge, for example, provided the consistent feedback and guidance the field required to ultimately address a fundamental challenge in structural biology (29, 30). We hope this manuscript will serve as a catalyst for developing more rigorous benchmarks for affinity prediction to facilitate similar breakthroughs.
The CORDIAL implementation itself also has limitations. It currently lacks explicit pose discrimination capabilities and sacrifices some geometric resolution for efficiency by relying on 1D distance profiles. Future work could incorporate additional geometric information or enforce relationships between predicted pose confidence and affinity. Additionally, the fixed atomic features could be extended to learnable atom-pair embeddings, and the data curation strategy could be improved to more explicitly handle complexities such as protein mutations in PubChem. Finally, our substructure-based data augmentation strategy can be improved, such as through the use of target-specific pharmacophore constraints.
Overall, we demonstrate an approach that seems promising for overcoming the generalizability barrier in ML-based affinity prediction. This work provides a validated strategy for developing reliable models by aligning the inductive bias with the fundamental principles of the system being studied. While not intended to replace statistical physics-based methods such as relative binding free energy FEP calculations, the principles demonstrated here offer a potentially effective approach for accelerating hit discovery and building more trustworthy computational tools for structure-based drug design.
Materials and Methods
Dataset Preprocessing.
The primary dataset was the full set of protein–ligand complexes from PDBbind2020 (31). All complexes underwent a structural refinement protocol to ensure consistency and quality. Ligand molecules were first neutralized and hydrogenated using the BioChemical Library (BCL) (32). Protonation states and formal charges at physiologic pH were then assigned using Schrödinger's Epik (version 7) (33). Finally, a restrained energy minimization (5 kcal/(mol·Å²)) was performed with the Universal Force Field as implemented in RDKit to refine hydrogen atom placement without perturbing the heavy-atom coordinates of the experimental binding pose. Protein structures were prepared by first removing all existing hydrogen atoms with the "reduce" program in AmberTools 2022. New hydrogen atoms were subsequently added and the structure was subjected to a restrained energy minimization using the Rosetta REF2015 energy function with a coordinate constraint weight of 5.0 on all heavy atoms.
Dataset Augmentation with PubChem Bioassays.
The PDBbind2020 dataset was augmented with bioactivity data from PubChem to expand the chemical diversity of the training set (SI Appendix, Fig. S1). The augmentation protocol involved several steps. First, the PDBbind2020 database was filtered to include only complexes with ligands containing 100 or fewer heavy atoms. For each retained entry, the receptor protein’s GeneID was used to extract all confirmatory bioassay data (Kd, Ki, or IC50 measurements) from PubChem.
Next, binding poses for these bioassay compounds were generated. For each PDBbind complex, we identified bioassay compounds targeting the same protein that shared at least 40% Tanimoto similarity in their maximum common substructure (based on bond order/aromaticity and element type) with the reference PDBbind ligand. The coordinates of the matched substructure atoms from the bioassay compound were mapped onto the reference ligand’s pose. These mapped atoms were held fixed while a conformational ensemble of the remaining flexible portions of the molecule was generated using BCL::Conf (34).
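The substructure-matching criterion can be sketched with RDKit's rdFMCS module. The function name and the exact comparison flags below are our assumptions about how "bond order/aromaticity and element type" matching maps onto the library's options; this is a sketch, not the released implementation.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

def mcs_tanimoto(ref_mol, query_mol):
    """Tanimoto similarity over the maximum common substructure, matching
    element type and exact bond order/aromaticity (our reading of the
    described 40% pairing criterion; function name is ours)."""
    res = rdFMCS.FindMCS(
        [ref_mol, query_mol],
        atomCompare=rdFMCS.AtomCompare.CompareElements,
        bondCompare=rdFMCS.BondCompare.CompareOrderExact,
    )
    n_mcs = res.numAtoms
    # Tanimoto over heavy atoms: |MCS| / (|A| + |B| - |MCS|)
    return n_mcs / (ref_mol.GetNumAtoms() + query_mol.GetNumAtoms() - n_mcs)

ref = Chem.MolFromSmiles("c1ccccc1CC(=O)O")    # phenylacetic acid
qry = Chem.MolFromSmiles("c1ccccc1CCC(=O)O")   # one-carbon homolog
sim = mcs_tanimoto(ref, qry)
print(sim >= 0.40)                             # homolog passes the cutoff
```

Compounds passing the 0.40 cutoff would then have their matched substructure atoms mapped onto the reference pose, as described above.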
The resulting poses were refined using a RosettaLigand protocol (35). The restricted conformational ensemble of the ligand was sampled with minimal allowed rotational or translational motion of the ligand (temperature factor 0.6, cycles 500, move_distance 0.001, angle 0.025, rmsd 1.0), followed by a high-resolution refinement involving side-chain repacking and energy minimization (cycles 24, repacking every third cycle). This entire process was repeated 10 times starting from the previous best pose. The best-scoring pose for each compound, based on the Rosetta interaction energy, was selected.
A final augmentation step was performed to generate negative examples. The original PDBbind ligands were redocked using a standard RosettaLigand docking protocol (36). Poses with a heavy-atom RMSD greater than 2.0 Å from the crystallographic pose were labeled as inactive decoys, while poses within 1.0 Å were retained as near-native examples with the original affinity label. This process provided physically realistic negative examples.
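The RMSD-based labeling rule can be sketched in a few lines; the function name is ours, and the treatment of the 1.0–2.0 Å band (discarded) is our inference from the text.

```python
import numpy as np

def label_redocked_pose(pose_xyz, ref_xyz, affinity_label):
    """Label a redocked pose per the augmentation rule described:
    >2.0 A heavy-atom RMSD from the crystal pose -> inactive decoy;
    <=1.0 A -> near-native, keeps the original affinity label;
    1.0-2.0 A -> returns None (our inference: not used for training).
    Assumes a fixed atom correspondence (no symmetry correction)."""
    diff = pose_xyz - ref_xyz
    rmsd = np.sqrt((diff ** 2).sum(axis=1).mean())
    if rmsd > 2.0:
        return ("inactive", None)
    if rmsd <= 1.0:
        return ("near-native", affinity_label)
    return None

ref = np.zeros((10, 3))
near = ref + 0.3                 # uniform 0.3 A shift -> RMSD ~0.52 A
far = ref + 2.0                  # uniform 2.0 A shift -> RMSD ~3.46 A
print(label_redocked_pose(near, ref, 7.2))   # ('near-native', 7.2)
print(label_redocked_pose(far, ref, 7.2))    # ('inactive', None)
```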
Creating CATH Homologous Superfamily Train/Test Splits.
Model generalization to novel protein families was evaluated using a validation protocol based on the CATH (Class, Architecture, Topology, and Homologous superfamily) hierarchical protein structure classification database (version 4.3) (37). For each PDBbind2020 complex, the target protein was first mapped to its corresponding CATH domain classification. Complexes were then grouped by their CATH homologous superfamily identifiers, creating distinct protein family clusters. The ten largest CATH superfamilies in the dataset were selected for LSO validation. For each superfamily evaluation, a test set was constructed comprising all complexes belonging to that superfamily, and these were removed from the training data. This exclusion applied to the original PDBbind complexes as well as all of their augmented variants from PubChem and redocking. These excluded complexes constituted an independent test set (SI Appendix, Fig. S1).
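The held-out split logic can be sketched as follows; the record fields (`superfamily`, `pdb_id`, `augmented_from`) are illustrative names, not the released data schema.

```python
from collections import defaultdict

def make_lso_splits(complexes, n_superfamilies=10):
    """Sketch of the CATH leave-superfamily-out protocol. `complexes` is a
    list of dicts with a CATH 'superfamily' id and an optional
    'augmented_from' field linking PubChem/redocked variants to their
    parent PDB entry (field names are ours)."""
    by_sf = defaultdict(list)
    for c in complexes:
        by_sf[c["superfamily"]].append(c)
    # The ten most abundant superfamilies define the held-out test sets.
    largest = sorted(by_sf, key=lambda s: len(by_sf[s]), reverse=True)[:n_superfamilies]
    splits = {}
    for sf in largest:
        test = by_sf[sf]
        held_out_parents = {c["pdb_id"] for c in test}
        # Exclude the superfamily AND all augmented variants of its members.
        train = [c for c in complexes
                 if c["superfamily"] != sf
                 and c.get("augmented_from") not in held_out_parents]
        splits[sf] = (train, test)
    return splits

data = [
    {"pdb_id": "1abc", "superfamily": "A"},
    {"pdb_id": "2xyz", "superfamily": "B"},
    {"pdb_id": "aug1", "superfamily": "B", "augmented_from": "1abc"},
]
train_a, test_a = make_lso_splits(data, n_superfamilies=2)["A"]
print([c["pdb_id"] for c in train_a])  # ['2xyz']: aug1 excluded with its parent
```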
This CATH-LSO strategy addresses key limitations of conventional cross-validation approaches. Random k-fold cross-validation can overestimate real-world performance because the training and test sets are drawn from the same data distribution. Temporal splits can contain high target and scaffold similarity. Leave-one-protein-out protocols suffer from data leakage when other members of the same protein family remain in the training set. Sequence-based splits are effective when sequence conservation between training and test proteins is negligible, but proteins with low sequence identity can nevertheless share the same fold, risking leakage of small-molecule chemotypes that preferentially target that fold. CATH-based LSO validation, in contrast, explicitly evaluates a model's ability to extrapolate to novel regions of both protein and chemical space, which more accurately reflects the challenges of prospective virtual screening.
Distance-Dependent Interaction Embeddings.
We developed an interaction graph featurization approach that captures intermolecular interactions through distance-binned atomic property correlations, preserving important chemical interaction patterns across spatial dimensions.
Interaction graph construction.
For each protein–ligand complex, we first construct an interaction graph $G = (V_P \cup V_L, E)$, where vertices represent protein and ligand atoms and edges connect protein–ligand atom pairs within a distance cutoff $d_{\mathrm{cut}} = 16$ Å. The adjacency matrix for protein atoms $i$ and ligand atoms $j$ is computed using

$$A_{ij} = \begin{cases} 1, & \lVert \mathbf{r}_i - \mathbf{r}_j \rVert \le d_{\mathrm{cut}} \\ 0, & \text{otherwise} \end{cases} \qquad [1]$$

Graph construction was optimized by reducing the graph to only retain interacting atoms:

$$V_P' = \Big\{\, i \in V_P : \sum_{j} A_{ij} > 0 \,\Big\} \qquad [2]$$

$$V_L' = \Big\{\, j \in V_L : \sum_{i} A_{ij} > 0 \,\Big\} \qquad [3]$$
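Under this definition (adjacency within a cutoff, pruning of non-interacting atoms), the interaction graph can be computed with a few lines of numpy; the function name and return values are ours.

```python
import numpy as np

def interaction_graph(protein_xyz, ligand_xyz, cutoff=16.0):
    """Build the protein-ligand interaction adjacency and prune atoms
    with no edge within the cutoff (a minimal numpy sketch; the released
    implementation may differ)."""
    # Pairwise protein-ligand distances, shape (n_p, n_l).
    d = np.linalg.norm(protein_xyz[:, None, :] - ligand_xyz[None, :, :], axis=-1)
    adj = d <= cutoff
    # Retain only atoms participating in at least one edge.
    keep_p = adj.any(axis=1)
    keep_l = adj.any(axis=0)
    return adj[np.ix_(keep_p, keep_l)], d[np.ix_(keep_p, keep_l)], keep_p, keep_l

prot = np.array([[0.0, 0, 0], [100.0, 0, 0]])  # second atom far outside cutoff
lig = np.array([[3.0, 0, 0]])
adj, dist, keep_p, keep_l = interaction_graph(prot, lig)
print(adj.shape)  # (1, 1): the distant protein atom is pruned
```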
Atomic properties and their implementation.
We calculate atomic properties for both protein and ligand atoms. These atomic properties are subsequently used to compute atom pair chemical property cross correlations. Unless otherwise stated, all atomic properties were computed using RDKit.
- Charge features:
  – Gasteiger charges
  – Polarized Gasteiger charges: Gasteiger charges scaled by atomic polarizability
  – Formal charges
- Hydrogen features:
  – Hydrogen atom ternary (is_h_ternary): 1 for hydrogen atoms, -1 otherwise
  – H-bond donor/acceptor ternary (is_h_bond_donor_ternary): 1 for N/O atoms bonded to H (donors), -1 for N/O without H (acceptors), 0 otherwise
- Hydrophobicity features:
  – Hydrophobic ternary (is_hydrophobic_ternary): 1 for hydrophobic atoms (carbon without polar neighbors, CF3 groups), -1 for hydrophilic atoms (N, O, polar groups), 0 for neutral atoms
  – Polarized hydrophobicity (polarized_hydrophobic_ternary): hydrophobicity weighted by polarizability
- Ring features:
  – Ring membership (is_in_ring_ternary): 1 for ring atoms, -1 otherwise
  – Aromatic ring (is_in_aromatic_ring_ternary): 1 for atoms in aromatic rings, -1 otherwise
- Electronic features:
  – Electronegativity: Pauling electronegativity values from lookup table
  – Carbon electronegativity difference: difference between the atom's electronegativity and carbon's (2.55)
- Steric features:
  – Van der Waals (VDW) radius
  – Polarized VDW radius: VDW radius scaled by atomic polarizability
  – Polarizability: atomic polarizability from literature values
Chemical property correlation radial distribution function features.
The feature representation was designed to capture how different interaction patterns vary spatially. By computing pairwise correlations between atomic properties and binning them by distance, this approach creates profiles that quantify how much of each interaction type occurs at various distances, aggregated across the entire complex. This encodes both the energetic and spatial aspects of molecular recognition without relying on specific structural motifs.
The features were separated into three categories: symmetric signed pairs, symmetric unsigned pairs, and asymmetric signed pairs.
- Symmetric signed pairs (8 pairs × 3 sign bins = 24 features):
  - Electrostatic interactions:
    – Polarized Gasteiger charges × Polarized Gasteiger charges
    – Gasteiger charges × Gasteiger charges
    – Formal charge × Formal charge
  - Hydrogen-bonding and steric interactions:
    – Hydrogen atom ternary × Hydrogen atom ternary
    – H-bond donor ternary × H-bond donor ternary
  - Hydrophobic and aromatic interactions:
    – Polarized hydrophobicity × Polarized hydrophobicity
    – Aromatic ring ternary × Aromatic ring ternary
    – Carbon electronegativity difference × Carbon electronegativity difference
- Symmetric unsigned pairs (4 pairs × 1 bin = 4 features):
  - Steric and scaling factors:
    – Electronegativity × Electronegativity
    – Polarized VDW radius × Polarized VDW radius
    – VDW radius × VDW radius
    – Polarizability × Polarizability
- Asymmetric pairs (12 pairs × 3 sign bins = 36 features):
  - Electrostatic interactions:
    – Electronegativity × Gasteiger charges
    – Gasteiger charges × Electronegativity
  - Hydrophobic directionality:
    – Carbon electronegativity difference × Polarized hydrophobicity
    – Polarized hydrophobicity × Carbon electronegativity difference
  - π-system interactions:
    – Ring ternary × Polarized hydrophobicity (π-LJ)
    – Polarized hydrophobicity × Ring ternary (LJ-π)
    – Aromatic ring ternary × Polarized hydrophobicity (π-LJ)
    – Polarized hydrophobicity × Aromatic ring ternary (LJ-π)
  - Cation-π interactions:
    – Aromatic ring ternary × Polarized Gasteiger charges (π-Cation)
    – Polarized Gasteiger charges × Aromatic ring ternary (Cation-π)
  - π-steric interactions:
    – Aromatic ring ternary × Polarized VDW radius (π-LJ)
    – Polarized VDW radius × Aromatic ring ternary (LJ-π)
For signed pairs, three separate histograms were created to capture negative–negative, positive–positive, and mixed-sign interactions, preserving the directional nature of forces. Each feature contribution was calculated for a property pair $(p, q)$ and distance bin $b$ as

$$F_{pq}(b) = \sum_{\substack{i \in V_P',\, j \in V_L' \\ b(d_{ij}) = b}} \lvert p_i \, q_j \rvert \qquad [4]$$

For signed pairs, the contribution is computed as

$$F_{pq}^{(s)}(b) = \sum_{\substack{i \in V_P',\, j \in V_L' \\ b(d_{ij}) = b,\; (p_i, q_j) \in s}} \lvert p_i \, q_j \rvert \qquad [5]$$

where pairs are binned into three sign categories $s$: (1) both properties negative ($p_i < 0$ and $q_j < 0$), (2) both properties positive ($p_i > 0$ and $q_j > 0$), and (3) mixed-sign interactions ($p_i \, q_j < 0$).
This binning strategy heuristically encodes physical principles into the feature representation. For example, by maintaining separate histograms for attractive and repulsive electrostatic interactions, their distinct physical signatures are preserved in the interaction profile rather than being canceled out during aggregation.
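The signed-histogram accumulation can be sketched as follows, assuming each feature is a nonnegative sum of property-product magnitudes as described in the Fig. 7 caption; variable names are ours.

```python
import numpy as np

def signed_pair_histograms(p, q, dist, n_bins=64, bin_width=0.25):
    """Accumulate distance-binned property-product features for one signed
    property pair: separate histograms for (-,-), (+,+), and mixed-sign
    atom pairs, so attraction and repulsion never cancel (a sketch).
    p: (n_p,) protein property; q: (n_l,) ligand property;
    dist: (n_p, n_l) protein-ligand distances in Angstroms."""
    b = np.clip(np.round(dist / bin_width).astype(int), 0, n_bins - 1)
    prod = p[:, None] * q[None, :]
    neg = (p[:, None] < 0) & (q[None, :] < 0)
    pos = (p[:, None] > 0) & (q[None, :] > 0)
    mixed = ~neg & ~pos
    hists = np.zeros((3, n_bins))
    for k, mask in enumerate((neg, pos, mixed)):
        # Nonnegative magnitudes accumulated per sign category and bin.
        np.add.at(hists[k], b[mask], np.abs(prod[mask]))
    return hists

p = np.array([1.0, -1.0])
q = np.array([2.0])
dist = np.array([[0.0], [0.5]])
h = signed_pair_histograms(p, q, dist)
print(h[1, 0], h[2, 2])  # 2.0 2.0
```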
Distance binning and feature assembly.
Distances were discretized into 64 uniform bins of width 0.25 Å, spanning the full 16 Å cutoff. This discretization was performed by rounding each distance to the nearest bin center, corresponding to the following definition for bin $b$:

$$b(d) = \min\!\left( \left\lfloor \frac{d}{\Delta} + \frac{1}{2} \right\rfloor,\; N_b - 1 \right) \qquad [6]$$

where $\Delta = 0.25$ Å and $N_b = 64$. For each property pair, histograms were created over these distance bins, resulting in a 64 × 64 distance-feature matrix. Prior to training, this matrix was flattened, and each of the 4,096 features was normalized to have zero mean and unit SD across the training set. To clearly summarize:
1. The 64 × 64 distance-feature matrix is first constructed for every sample.
2. For the entire training set, these matrices are conceptually flattened into vectors (of length 4,096) and then Z-score normalized across the dataset. This global normalization preserves the relative magnitudes of signals between different feature columns, which is important information.
3. Finally, the normalized vectors are reshaped back into their 64 × 64 matrix structure before being fed into the CORDIAL model.
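The three steps above can be sketched directly; `eps` is our addition to guard against constant features.

```python
import numpy as np

def normalize_features(train_mats, eps=1e-8):
    """Global per-position Z-scoring of the 64x64 matrices as described:
    flatten to 4,096-vectors, standardize each position across the training
    set, reshape back (a sketch; eps guards constant features)."""
    x = train_mats.reshape(len(train_mats), -1)       # (N, 4096)
    mean, std = x.mean(axis=0), x.std(axis=0)
    z = (x - mean) / (std + eps)                      # preserves relative scale
    return z.reshape(train_mats.shape), mean, std

mats = np.random.default_rng(0).normal(size=(100, 64, 64))
normed, mean, std = normalize_features(mats)
print(normed.shape)  # (100, 64, 64)
```

The saved `mean` and `std` would be reused to transform validation and test samples with the training-set statistics.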
CORDIAL Architecture and Training.
The CORDIAL architecture was developed to process these interaction matrices. The model accepts precomputed 64 × 64 distance-feature matrices and uses a three-stage pipeline to capture local and global interaction patterns.
Feature-specific 1D convolutions: The first stage employs parallel 1D depthwise separable convolutions (kernel size 7 bins; receptive field 1.75 Å) across the distance dimension for each feature channel independently. This design preserves feature independence, allowing each chemical feature to learn its own distance-dependent patterns. The implementation uses grouped convolutions with batch normalization, GELU activation, and 5% spatial dropout.
Axial self-attention: The second stage implements axial self-attention to learn global interactions between distance bins (rows) and features (columns). This approach avoids flattening the 2D interaction space into a 1D sequence. Row-wise attention with positional encodings is first applied to capture long-range dependencies between distance bins. Column-wise attention is then applied to the output of the row-wise attention to learn higher-order relationships between different chemical interaction types. The attention implementation uses a prenorm architecture with layer normalization and dropout (15%).
Multilayer perceptron: The final stage uses a two-layer MLP (256 and 256 neurons) with Mish activation and 25% dropout to produce an 8-dimensional output vector. The model was trained as an ordinal classifier on eight cumulative affinity thresholds (pKd 1 to pKd 8).
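The three stages can be assembled into a minimal runnable sketch. The per-cell embedding width, attention head count, positional encoding scheme, and MLP input wiring below are illustrative assumptions, not the published hyperparameters.

```python
import torch
import torch.nn as nn

class CordialSketch(nn.Module):
    """Minimal sketch of CORDIAL's three-stage pipeline: feature-wise 1D
    depthwise convolution -> axial (row, then column) self-attention ->
    MLP with 8 ordinal logits. Sizes here are illustrative assumptions."""

    def __init__(self, n_feat=64, n_bins=64, d_model=8, mlp_dim=256):
        super().__init__()
        # Stage 1: one kernel per feature channel (groups=n_feat); kernel
        # size 7 bins = a 1.75 A receptive field at 0.25 A per bin.
        self.conv = nn.Sequential(
            nn.Conv1d(n_feat, n_feat, kernel_size=7, padding=3, groups=n_feat),
            nn.BatchNorm1d(n_feat), nn.GELU(), nn.Dropout1d(0.05),
        )
        self.embed = nn.Linear(1, d_model)           # scalar cell -> d_model
        self.row_pos = nn.Parameter(torch.zeros(n_bins, d_model))
        self.row_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.col_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.head = nn.Sequential(
            nn.Linear(n_feat * n_bins * d_model, mlp_dim), nn.Mish(),
            nn.Dropout(0.25), nn.Linear(mlp_dim, 8),  # 8 pKd thresholds
        )

    def forward(self, x):                            # x: (B, n_feat, n_bins)
        B, F, D = x.shape
        h = self.embed(self.conv(x).unsqueeze(-1))   # (B, F, D, d)
        # Row-wise attention over distance bins, pre-norm residual.
        r = h.reshape(B * F, D, -1) + self.row_pos
        q = self.norm1(r)
        r = r + self.row_attn(q, q, q, need_weights=False)[0]
        # Column-wise attention over feature types.
        c = r.reshape(B, F, D, -1).permute(0, 2, 1, 3).reshape(B * D, F, -1)
        q = self.norm2(c)
        c = c + self.col_attn(q, q, q, need_weights=False)[0]
        return self.head(c.reshape(B, -1))           # 8 ordinal logits

model = CordialSketch().eval()                       # eval: stable BN for demo
with torch.no_grad():
    logits = model(torch.randn(2, 64, 64))
print(logits.shape)  # torch.Size([2, 8])
```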
CORDIAL was trained with a binary cross-entropy with logits loss function for each of the eight outputs, using the AdamW optimizer with weight decay. The initial learning rate was 0.001. A maximum of 1,000 epochs was performed with a batch size of 16,384. We implemented early stopping with 20-epoch patience, monitoring the validation loss on a 5% random split of the training data. The learning rate was halved after every 5 epochs without validation loss improvement.
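The optimizer, scheduler, and loss wiring described above map onto standard PyTorch components; the AdamW β parameters and weight decay value are not reproduced here, and the demo model is a stand-in.

```python
import torch

def training_setup(model):
    """Wiring matching the described protocol: AdamW at lr=1e-3,
    per-threshold BCE-with-logits loss, LR halved after 5 epochs without
    validation improvement (early stop after 20 is handled in the loop).
    AdamW betas/weight decay values are omitted, not reproduced."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
        opt, mode="min", factor=0.5, patience=5)
    loss_fn = torch.nn.BCEWithLogitsLoss()   # applied to all 8 ordinal logits
    return opt, sched, loss_fn

model = torch.nn.Linear(16, 8)               # stand-in for CORDIAL
opt, sched, loss_fn = training_setup(model)
logits = model(torch.randn(4, 16))
targets = torch.ones(4, 8)                   # threshold-encoded labels
loss = loss_fn(logits, targets)
loss.backward()
opt.step()
sched.step(loss.item())                      # in practice: validation loss
```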
3D-CNN Architecture and Training.
A conventional 3D-CNN was implemented as a representative voxel-based model for comparison, inspired by the style of architectures like KDEEP (17).
Protein–ligand complexes were represented as voxel grids with a resolution of 0.5 Å per voxel, centered on the ligand’s geometric center. This provided a 12 Å observation window. Protein atoms within 6.0 Å of any ligand atom were included in the grid. A set of 16 atomic features was calculated for all atoms in the binding site. To prevent cancellation effects when multiple atoms occupy the same voxel, these features were represented as 32 dual channels, separating their positive and negative components.
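The dual-channel voxelization can be sketched as follows with toy features; the text's 16 atomic features yield 32 channels, and the grid here is assumed pre-centered on the ligand.

```python
import numpy as np

def voxelize(coords, features, box=12.0, res=0.5):
    """Voxelize atoms into a grid at 0.5 A/voxel (assumed centered on the
    ligand's geometric center), splitting each signed feature into
    positive/negative channels so co-located atoms cannot cancel
    (a sketch; the 16 concrete features are defined in the text)."""
    n = int(box / res)                           # 24 voxels per axis
    n_feat = features.shape[1]
    grid = np.zeros((2 * n_feat, n, n, n))
    idx = np.floor((coords + box / 2) / res).astype(int)
    inside = ((idx >= 0) & (idx < n)).all(axis=1)
    for (i, j, k), f in zip(idx[inside], features[inside]):
        grid[:n_feat, i, j, k] += np.clip(f, 0, None)    # positive part
        grid[n_feat:, i, j, k] += np.clip(-f, 0, None)   # negative part
    return grid

coords = np.array([[0.0, 0, 0], [20.0, 0, 0]])   # second atom outside the box
feats = np.array([[1.0, -0.5], [1.0, 1.0]])      # two toy features per atom
g = voxelize(coords, feats)
print(g.shape)  # (4, 24, 24, 24)
```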
The 3D-CNN architecture begins with a batch normalization layer applied to the input channels. The core of the model consists of a sequence of three convolutional blocks. Each block is composed of a 3D convolution, a batch normalization layer, and a ReLU activation function, followed by a "MaxPool3d" layer (kernel size 2, stride 2) for spatial downsampling. The number of channels increases through the blocks (16, 32, and then 64). The output from the final convolutional block is flattened into a 1D vector and passed to a two-layer MLP (256 and 128 neurons) to produce the 8-dimensional ordinal output vector.
The 3D-CNN was trained with a binary cross-entropy with logits loss function for each of the eight outputs, using the AdamW optimizer with weight decay. The initial learning rate was 0.005. Due to the large memory footprint of the model, the 3D-CNN was trained in chunks. Each training dataset was split into 10 evenly sized chunks with at least 10% of the samples overlapping between chunks. For each chunk, a maximum of 100 epochs was performed with a batch size of 1,536. We implemented early stopping with 20-epoch patience, monitoring the validation loss on a 5% random split of the training data. The learning rate was halved after every 5 epochs without validation loss improvement.
GAT Architecture and Training.
A graph attention network (GAT) was implemented based on the Atomic Environment Vector-Protein Ligand Interaction Graph (AEV-PLIG) architecture described by Valsson et al. (24). This ligand-centric GNN encodes protein context through radial atomic environment vectors (AEVs). Ligand atoms are represented as graph nodes with intrinsic atomic features (e.g., identity, hybridization, etc.), and covalent bonds as edges. The protein environment is captured in the node features via AEVs, which describe the distribution of 22 ECIF protein atom types within a 5.1 Å cutoff of each ligand atom using 16 Gaussian-binned distance shells.
The model backbone consists of five GATv2 layers, followed by a global pooling step and a three-layer MLP (1,024, 512, and 256 neurons) that produces the 8-dimensional ordinal output. Training followed the same protocol as CORDIAL except with a batch size of 8,192, an initial learning rate of 0.0002, and weight decay of 0.0001.
Feature Saliency Analysis.
To interpret the features most influential to CORDIAL’s predictions, a feature saliency analysis was performed. Saliency was calculated as the gradient of the model’s predicted probability for a given affinity threshold with respect to each input feature. The final saliency maps represent the mean signed gradient, averaged over a random sample of 81,920 complexes from the training set. This analysis provides a measure of how a small change in the magnitude of a specific feature at a specific distance affects the model’s output.
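The saliency computation reduces to a single backward pass per batch; below is a sketch with a stand-in linear model (the batching over the full 81,920-complex sample is omitted).

```python
import torch

def mean_signed_saliency(model, x, threshold_idx):
    """Mean signed gradient of the predicted probability for one affinity
    threshold with respect to each input feature, averaged over the batch:
    the quantity plotted in Fig. 7 (a sketch; sampling/batching omitted)."""
    x = x.clone().requires_grad_(True)
    # Sum over the batch so one backward pass yields per-sample gradients.
    prob = torch.sigmoid(model(x))[:, threshold_idx].sum()
    prob.backward()
    return x.grad.mean(dim=0)                # (n_feat, n_bins) signed map

# Stand-in model: flattened 64x64 input -> 8 threshold logits.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64 * 64, 8))
sal = mean_signed_saliency(model, torch.randn(32, 64, 64), threshold_idx=0)
print(sal.shape)  # torch.Size([64, 64])
```

Positive entries of the returned map correspond to red (probability-increasing) cells in Fig. 7, negative entries to blue.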
Statistical Analysis.
We evaluate model performance using several complementary metrics designed specifically for ordinal classification tasks. Given the ordered nature of the binding affinity thresholds (pKd 1 to pKd 8), traditional binary classification metrics are insufficient to capture the ordinal relationships between classes.
Ordinal accuracy.
Measures the exact match between predicted and true ordinal levels. For a prediction $\hat{y}_i$ and true value $y_i$, the ordinal accuracy is defined as:

$$\mathrm{ACC} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i] \qquad [7]$$

where $\mathbb{1}[\cdot]$ is the indicator function and $N$ is the number of samples.
Within-one accuracy.
Extends ordinal accuracy by considering predictions within one level of the true value as correct. This metric acknowledges that in binding affinity prediction, small deviations may be acceptable:

$$\mathrm{ACC}_{\pm 1} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[\, |\hat{y}_i - y_i| \le 1 \,\big] \qquad [8]$$
Mean absolute error (MAE).
Quantifies the average magnitude of prediction errors:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |\hat{y}_i - y_i| \qquad [9]$$

where values range from 0 to $K - 1$ for $K$ ordinal levels.
Quadratic weighted kappa (QWK).
Measures the agreement between predicted and true ordinal levels while accounting for both chance agreement and the magnitude of disagreements. For an ordinal classification task with $K$ levels, QWK is calculated as:

$$\kappa = 1 - \frac{\sum_{i,j} w_{ij} O_{ij}}{\sum_{i,j} w_{ij} E_{ij}} \qquad [10]$$

where:
- $w_{ij} = (i - j)^2 / (K - 1)^2$ is the weight matrix element representing the squared distance between levels $i$ and $j$, normalized by the maximum possible distance
- $O_{ij}$ is the observed proportion of cases where the true level is $i$ and the predicted level is $j$
- $E_{ij}$ is the expected proportion of such cases occurring by chance

QWK ranges from -1 to 1, where 1 indicates perfect agreement, 0 indicates agreement equivalent to random chance, and negative values indicate systematic disagreement. The quadratic weighting scheme ensures that predictions further from the true value are penalized more heavily than near misses.
All metrics were computed using threshold-encoded predictions, where each sample is represented as a binary vector $\mathbf{t} \in \{0, 1\}^8$ with $t_k$ indicating whether the sample meets the criterion for ordinal level $k$. For example, $\mathbf{t} = (1, 1, 1, 0, 0, 0, 0, 0)$ represents a sample that meets the criteria for levels 1 to 3 but not level 4.
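The four metrics can be transcribed directly into numpy; for brevity this sketch operates on integer ordinal levels (0 to K-1) rather than threshold-encoded vectors.

```python
import numpy as np

def ordinal_metrics(y_true, y_pred, k=8):
    """Ordinal accuracy, within-one accuracy, MAE, and quadratic weighted
    kappa for integer ordinal levels 0..k-1 (a direct transcription;
    the 0-based level indexing is ours)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = np.mean(y_pred == y_true)
    within1 = np.mean(np.abs(y_pred - y_true) <= 1)
    mae = np.mean(np.abs(y_pred - y_true))
    # QWK from the normalized confusion matrix O and chance matrix E.
    o = np.zeros((k, k))
    np.add.at(o, (y_true, y_pred), 1)
    o /= o.sum()
    e = np.outer(o.sum(axis=1), o.sum(axis=0))      # chance agreement
    i, j = np.indices((k, k))
    w = (i - j) ** 2 / (k - 1) ** 2                 # quadratic weights
    qwk = 1 - (w * o).sum() / (w * e).sum()
    return float(acc), float(within1), float(mae), float(qwk)

levels = [0, 1, 2, 3, 4, 5, 6, 7]
print(ordinal_metrics(levels, levels))  # (1.0, 1.0, 0.0, 1.0)
```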
Hardware.
All models were trained and evaluated on a server equipped with two AMD EPYC 9754 128-core processors, 1.5 TB of system RAM, and eight 48-GB NVIDIA L40S GPUs.
Supplementary Material
Appendix 01 (PDF)
Acknowledgments
This work was supported by the NIH under award number DP1DA058349 (B.P.B.). I thank the members of the Center for AI in Protein Dynamics (CAIPD) and Department of Pharmacology at Vanderbilt University for helpful discussions. I also thank members of the Vanderbilt University Center for Structural Biology and CAIPD for technical computing support.
Author contributions
B.P.B. designed research; performed research; contributed new reagents/analytic tools; analyzed data; and wrote the paper.
Competing interests
The author declares no competing interest.
Footnotes
This article is a PNAS Direct Submission.
Data, Materials, and Software Availability
Source code, model weights, and sample scripts for CORDIAL are freely available with an Apache-2.0 license at https://github.com/bpBrownLab/CORDIAL (38).
Supporting Information
References
- 1.Seoane-Vazquez E., Rodriguez-Monguio R., Powers J. H., Analysis of US Food and Drug Administration new drug and biologic approvals, regulatory pathways, and review times, 1980–2022. Sci. Rep. 14, 3325 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Schneider G., Automating drug discovery. Nat. Rev. Drug Discov. 17, 97–113 (2018). [DOI] [PubMed] [Google Scholar]
- 3.Zou J., Tian C., Simmerling C., Blinded prediction of protein-ligand binding affinity using amber thermodynamic integration for the 2018 d3r grand challenge 4. J. Comput. Mol. Des. 33, 1021–1029 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Wang L., et al. , Accurate and reliable prediction of relative ligand binding potency in prospective drug discovery by way of a modern free-energy calculation protocol and force field. J. Am. Chem. Soc. 137, 2695–2703 (2015). [DOI] [PubMed] [Google Scholar]
- 5.Su M., et al. , Comparative assessment of scoring functions: The casf-2016 update. J. Chem. Inf. Model. 59, 895–913 (2019). [DOI] [PubMed] [Google Scholar]
- 6.Smith S. T., Meiler J., Assessing multiple score functions in Rosetta for drug discovery. PLoS One 15, e0240450 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Sieg J., Flachsenberg F., Rarey M., In need of bias control: Evaluating chemical data for machine learning in structure-based virtual screening. J. Chem. Inf. Model. 59, 947–961 (2019). [DOI] [PubMed] [Google Scholar]
- 8.Yang J., Shen C., Huang N., Predicting or pretending: Artificial intelligence for protein-ligand interactions lack of sufficiently large and unbiased datasets. Front. Pharmacol. 11, 69 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Volkov M., et al. , On the frustration to predict binding affinities from protein-ligand structures with deep neural networks. J. Med. Chem. 65, 7946–7958 (2022). [DOI] [PubMed] [Google Scholar]
- 10.Chatterjee A., et al. , Improving the generalizability of protein-ligand binding predictions with AI-Bind. Nat. Commun. 14, 1989 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Mastropietro A., Pasculli G., Bajorath J., Learning characteristics of graph neural networks predicting protein-ligand affinities. Nat. Mach. Intell. 5, 1427–1436 (2023). [Google Scholar]
- 12.Jacob L., Vert J. P., Protein-ligand interaction prediction: An improved chemogenomics approach. Bioinformatics 24, 2149–2156 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Kinjo A. R., Nakamura H., Comprehensive structural classification of ligand-binding motifs in proteins. Structure 17, 234–246 (2009). [DOI] [PubMed] [Google Scholar]
- 14.Andersson C. D., Chen B. Y., Linusson A., Mapping of ligand-binding cavities in proteins. Proteins 78, 1408–1422 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Orengo C. A., et al. , CATH–a hierarchic classification of protein domain structures. Structure 5, 1093–1108 (1997). [DOI] [PubMed] [Google Scholar]
- 16.Ewing T. J., Makino S., Skillman A. G., Kuntz I. D., Dock 4.0: Search strategies for automated molecular docking of flexible molecule databases. J Comput. Aided Mol. Des. 15, 411–428 (2001). [DOI] [PubMed] [Google Scholar]
- 17.Jiménez J., Škalič M., Martínez-Rosell G., De Fabritiis G., Kdeep: Protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks. J. Chem. Inf. Model. 58, 287–296 (2018). [DOI] [PubMed] [Google Scholar]
- 18.Wallach I., Dzamba M., Heifets A., AtomNet: A deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv [Preprint] (2015). 10.48550/arXiv.1510.02855 (Accessed 30 March 2025). [DOI]
- 19.Ragoza M., Hochuli J., Idrobo E., Sunseri J., Koes D. R., Protein-ligand scoring with convolutional neural networks. J. Chem. Inf. Model. 57, 942–957 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Schütt K. T., et al. , “A continuous-filter convolutional neural network for modeling quantum interactions” in Advances in Neural Information Processing Systems (NeurIPS), Guyon I., et al., Eds. (2017), vol. 30.
- 21.Unke O. T., Meuwly M., Physnet: A neural network for predicting energies, forces, dipole moments, and partial charges. J. Chem. Theory Comput. 15, 3678–3693 (2019). [DOI] [PubMed] [Google Scholar]
- 22.Smith J. S., Isayev O., Roitberg A. E., Ani-1: An extensible neural network potential with DFT accuracy at force field computational cost. Sci. Data 4, 170193 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Batzner S., et al., E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nat. Commun. 13, 2453 (2022).
- 24.Valsson Í., et al., Narrowing the gap between machine learning scoring functions and free energy perturbation using augmented data. Commun. Chem. 8, 41 (2025).
- 25.Cao D., et al., Generic protein–ligand interaction scoring by integrating physical prior knowledge and data augmentation modelling. Nat. Mach. Intell. 6, 688–700 (2024).
- 26.Libouban P. Y., Aci-Sèche S., Gómez-Tamayo J. C., Tresadern G., Bonnet P., The impact of data on structure-based binding affinity predictions using deep neural networks. Int. J. Mol. Sci. 24, 16120 (2023).
- 27.Wang D. D., Wu W., Wang R., Structure-based, deep-learning models for protein–ligand binding affinity prediction. J. Cheminf. 16, 2 (2024).
- 28.Polishchuk P. G., Madzhidov T. I., Varnek A., Estimation of the size of drug-like chemical space based on GDB-17 data. J. Comput. Aided Mol. Des. 27, 675–679 (2013).
- 29.Kryshtafovych A., Schwede T., Topf M., Fidelis K., Moult J., Critical assessment of methods of protein structure prediction (CASP)—Round XIV. Proteins Struct. Funct. Bioinf. 89, 1607–1617 (2021).
- 30.Jumper J., et al., Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
- 31.Liu Z., et al., PDB-wide collection of binding data: Current status of the PDBbind database. Bioinformatics 31, 405–412 (2015).
- 32.Brown B. P., et al., Introduction to the biochemical library (BCL): An application-based open-source toolkit for integrated cheminformatics and machine learning in computer-aided drug discovery. Front. Pharmacol. 13, 833099 (2022).
- 33.Johnston R. C., et al., Epik: pKa and protonation state prediction through machine learning. J. Chem. Theory Comput. 19, 2380–2388 (2023).
- 34.Mendenhall J., Brown B. P., Kothiwale S., Meiler J., BCL::Conf: Improved open-source knowledge-based conformation sampling using the crystallography open database. J. Chem. Inf. Model. 61, 189–201 (2020).
- 35.Meiler J., Baker D., ROSETTALIGAND: Protein–small molecule docking with full side-chain flexibility. Proteins 65, 538–548 (2006).
- 36.DeLuca S., Khar K., Meiler J., Fully flexible docking of medium sized ligand libraries with ROSETTALIGAND. PLoS One 10, e0132508 (2015).
- 37.Orengo C. A., et al., The CATH protein family database: A resource for structural and functional annotation of genomes. Proteomics 2, 11–21 (2002).
- 38.Brown B. P., CORDIAL: Convolutional representation of distance-dependent interactions with attention learning. GitHub. https://github.com/bpBrownLab/CORDIAL. Deposited 20 September 2025.
Supplementary Materials
Appendix 01 (PDF)
Data Availability Statement
Source code, model weights, and sample scripts for CORDIAL are freely available under the Apache-2.0 license at https://github.com/bpBrownLab/CORDIAL (38).