. 2026 Feb 20;66(5):2551–2559. doi: 10.1021/acs.jcim.5c02734

SwinSite: 3D Structure-Based Prediction of Protein–Ligand Binding Sites Using a Combined Vision Transformer and Convolution Model

Dongwoo Kim , Juyong Lee ‡,†,§,*
PMCID: PMC12977039  PMID: 41717955

Abstract

Accurate identification of protein–ligand binding sites is an essential step in structure-based drug discovery. Herein, we present SwinSite, a deep learning framework that leverages a hybrid architecture combining 3D convolutional neural networks and hierarchical vision transformer modules to predict ligand binding sites based on a 3D structure of a target protein. SwinSite encodes spatial information by voxelizing a protein structure into 3D grids centered around surface residues, allowing for a detailed spatial representation of the protein’s surface environment. By combining local feature extraction with hierarchical self-attention via shifted windows, SwinSite effectively captures both fine-grained geometric features and long-range dependencies. Evaluations on multiple benchmark data sets demonstrate that SwinSite outperforms existing CNN- and GNN-based ligand binding site detection methods consistently, highlighting its robustness and generalization ability.



Introduction

Identifying protein–ligand binding sites is a key task in computational drug discovery. Conventional physics-based docking methods require the identification of ligand binding sites for an accurate docking prediction. Accurate identification of ligand binding sites can substantially reduce the search space for docking and virtual screening, thereby lowering computational cost and accelerating early stage structure-based drug discovery pipelines.

Inaccurate pocket localization at this early stage can propagate to downstream steps by expanding the effective search space for docking calculations, thereby increasing the computational cost. Consequently, robust binding site detection is not merely auxiliary but often rate-limiting for end-to-end docking performance. With the growing availability of high-resolution protein structures enabled by advances in structure prediction (most notably AlphaFold2 and RoseTTAFold, and more recently AlphaFold3), many deep learning methods have been developed to predict binding pockets directly from protein tertiary structures.

To tackle the ligand binding site prediction problem, convolutional neural network (CNN)-based models such as PUResNet, Kalasanty, and DeepSurf have been proposed; these voxelize protein surfaces and learn spatial features using 3D convolutional layers. These models encode the spatial distribution of the atoms or residues constituting ligand binding sites by calculating their distances from the centers of grid cells (voxels). This voxel-based representation not only allows atomic-level feature encoding but also enables detailed modeling of the spatial environment surrounding a protein, which is crucial for accurately identifying functional binding regions. In addition to CNN-based models, graph neural network (GNN)-based methods, such as SiteRadar and LigBind, have also emerged; these model residue-level or atom-level interactions using GNNs. Such approaches leverage the inherent topology of protein structures to capture evolutionary and physicochemical distribution patterns and surface information.

Although CNNs are effective at capturing local geometric patterns within voxelized 3D inputs, their reliance on stacked convolutional blocks restricts their receptive field to local information, requiring increased depth to model global dependencies. This architectural constraint makes it difficult for CNNs to efficiently capture long-range spatial relationships, which are often crucial for identifying ligand binding sites in complex protein structures. Moreover, an increase in model depth leads to a larger model size, which can incur a substantial computational burden and a risk of overfitting.

Vision Transformers (ViTs) have shown remarkable performance in 2D image understanding by enabling global contextual modeling via self-attention. They offer a potential solution to CNNs’ locality limitations. However, directly applying ViTs to protein structures introduces new challenges: (1) proteins exhibit heterogeneous spatial scales due to their sequence length variation; (2) 3D voxel grids are required to be of high resolution; and (3) each spatial unit should contain rich physicochemical features beyond RGB channels. These characteristics of protein structures compared to 2D images can easily lead to high memory consumption when combined with conventional ViT models.

Nevertheless, ViTs differ fundamentally from CNNs in how they process spatial information. While CNNs progressively expand their receptive field through stacked convolutions, ViTs partition an image (or a 3D voxel grid) into patches and model pairwise dependencies across all tokens simultaneously. As a result, ViTs can capture long-range relationships from early layers, maintaining a broad effective receptive field even in shallow architectures. This global attention mechanism allows ViTs to remain robust to background clutter or noisy spatial features, an advantageous property when predicting binding sites that are often formed by spatially distant residues.

To address these limitations, we explore the use of the Swin Transformer for the ligand binding site prediction model. Compared with the original ViT, Swin Transformer is better suited for 3D protein modeling because it limits attention computation to local windows, introduces a shifted-window mechanism that enables information exchange across adjacent regions, and provides a hierarchical architecture that captures multiscale structural patterns. These capabilities align well with the characteristics of binding site prediction, which requires efficient integration of both the detailed local cavity geometry and the broader global protein structural context.

In this work, we propose SwinSite, a hybrid deep learning architecture that integrates 3D convolutional neural networks with hierarchical Swin Transformer blocks for structure-based protein–ligand binding site prediction. SwinSite follows a U-Net-style encoder–decoder design, in which convolutional layers capture fine-grained geometric cues and Swin Transformer modules model long-range spatial dependencies through window-based self-attention. This combination allows SwinSite to effectively encode both local cavity geometry and global structural context, two complementary factors that are essential for accurate pocket identification. Across multiple benchmark data sets, SwinSite achieves consistently higher success rates than existing CNN- and geometry-based methods, along with strong alignment between its binding confidence scores and spatial accuracy, indicating improved robustness and more precise spatial localization.

Methods

Data Set Preparation

We used a subset of the scPDB data set, consisting of 5020 protein–ligand complexes, which were previously curated for training PUResNet. This subset was generated by removing incomplete structures, complexes with low-quality or nonbiological ligands, ambiguous or ill-defined pockets, and highly redundant sequences clustered at 90% sequence identity.

The center of mass of a protein was aligned to the origin, and the structure was randomly rotated using one of the 24 predefined orthogonal matrices (axis-aligned and mirror-symmetric). The resulting coordinates were voxelized into a 3D grid with an extent of ±35 Å, a resolution of 1.5 Å, and a grid shape of 96 × 96 × 96. Each voxel stores a feature vector representing its local environment.
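As an illustration, the centering, rotation, and voxel-index mapping described above can be sketched as follows. This is a minimal sketch, not the authors' code: the rotation set is assumed to be the 24 proper axis-aligned rotations (signed permutation matrices with determinant +1), and atoms falling outside the box are simply clamped here.

```python
import itertools
import numpy as np

RESOLUTION = 1.5   # Angstrom per voxel (value from the paper)
GRID = 96          # voxels per dimension (value from the paper)

def cube_rotations():
    """Enumerate the 24 proper axis-aligned rotations of a cube as signed
    permutation matrices with determinant +1 (our guess at the paper's
    predefined orthogonal set)."""
    mats = []
    for perm in itertools.permutations(range(3)):
        for signs in itertools.product((-1.0, 1.0), repeat=3):
            m = np.zeros((3, 3))
            for row, (col, s) in enumerate(zip(perm, signs)):
                m[row, col] = s
            if np.linalg.det(m) > 0:  # keep rotations, drop reflections
                mats.append(m)
    return mats

def voxelize_indices(coords, rotation):
    """Center coordinates on their mean, rotate, and map each atom to the
    index of its nearest voxel on the GRID^3 grid."""
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)   # center of mass at the origin
    rotated = centered @ rotation.T
    idx = np.round(rotated / RESOLUTION).astype(int) + GRID // 2
    return np.clip(idx, 0, GRID - 1)          # clamp atoms outside the box
```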

Each heavy atom in a protein was represented by an 18-dimensional feature vector. This includes information about the atom’s type (carbon, nitrogen, oxygen, etc.), hybridization state, the number of heavy and heteroatom neighbors, estimated partial charge, and functional group characteristics (such as aromaticity or hydrogen bonding potential) (Table S1).

We used a Gaussian-weighted embedding scheme where each atom contributes to neighboring voxels using a 3D Gaussian kernel, producing a smooth spatial representation. The weight assigned to each voxel was computed as shown in eq 1

w(d) = exp(−d^2/(2σ^2))    (1)

where d denotes the Euclidean distance between the atom center and the voxel center, and σ controls the smoothness of the spatial distribution, which was set to 1.0 Å during training.
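Eq 1 and the atom-to-voxel accumulation can be sketched as follows. The neighborhood cutoff of 2 voxels around each atom is an illustrative assumption; the paper does not state how far each atom's contribution extends.

```python
import numpy as np

SIGMA = 1.0  # Angstrom, the value used during training

def gaussian_weight(d, sigma=SIGMA):
    """Eq 1: contribution weight of an atom to a voxel at distance d."""
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

def splat_atom(grid, atom_xyz, resolution=1.5, cutoff=2):
    """Accumulate one atom's Gaussian weights into nearby voxels; the
    neighborhood cutoff (in voxels) is an illustrative assumption."""
    atom_xyz = np.asarray(atom_xyz, dtype=float)
    center = np.round(atom_xyz / resolution).astype(int)
    for off in np.ndindex(2 * cutoff + 1, 2 * cutoff + 1, 2 * cutoff + 1):
        v = center + np.array(off) - cutoff
        if np.all(v >= 0) and np.all(v < np.array(grid.shape)):
            d = np.linalg.norm(atom_xyz - v * resolution)  # atom to voxel center
            grid[tuple(v)] += gaussian_weight(d)
    return grid
```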

Both input protein structures and ground truth labels were represented on a three-dimensional voxel grid of size 96 × 96 × 96, defined in a fixed spatial frame centered on the protein structure. For ground truth label generation, ligand heavy atoms were projected onto this voxel grid using the Gaussian-weighted embedding scheme. The resulting label field was subsequently expanded using morphological dilation with a spherical structuring element of radius 2 voxels in each dimension, defined in eq 2.

S = {(x, y, z) ∈ Z^3 | (x/r)^2 + (y/r)^2 + (z/r)^2 ≤ 1},  r = 2    (2)

The dilation operation assigns to each voxel v the maximum Gaussian value within its spherical neighborhood N_S(v), as defined in eq 3

D(v) = max_{u ∈ N_S(v)} G(u)    (3)

where G(u) is the Gaussian-weighted ligand voxel value at position u. The final ground truth mask was obtained by combining the original Gaussian map and its dilated counterpart via element-wise addition, followed by clipping to [0, 1], as defined in eq 4

Y(v) = min(G(v) + D(v), 1.0)    (4)

At our grid resolution of 1.5 Å per voxel, dilation with a radius of 2 voxels corresponds to an expansion of approximately 3.0 Å around each ligand atom. Voxels within this distance were labeled as positive, while all remaining voxels were labeled as negative. This Gaussian-dilation formulation yields smooth boundary transitions and preserves a natural core–center structure around the ligand, providing a more stable supervisory signal for training.
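Eqs 2–4 can be sketched directly in NumPy. This is a minimal sketch: np.roll wraps around the grid boundary, which a real implementation would avoid by padding.

```python
import numpy as np

def spherical_offsets(r=2):
    """Eq 2: integer voxel offsets inside a sphere of radius r voxels."""
    return [(x, y, z)
            for x in range(-r, r + 1)
            for y in range(-r, r + 1)
            for z in range(-r, r + 1)
            if (x / r) ** 2 + (y / r) ** 2 + (z / r) ** 2 <= 1.0]

def make_label(gaussian_map, r=2):
    """Eqs 3-4: grey dilation of the Gaussian ligand map with the spherical
    element (eq 3), then element-wise addition and clipping to [0, 1] (eq 4).
    Note: np.roll wraps at the boundary; real code would pad instead."""
    g = np.asarray(gaussian_map, dtype=float)
    dilated = np.zeros_like(g)
    for dx, dy, dz in spherical_offsets(r):
        dilated = np.maximum(dilated, np.roll(g, (dx, dy, dz), axis=(0, 1, 2)))
    return np.minimum(g + dilated, 1.0)
```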

Model Architecture

SwinSite is based on a U-Net-style architecture consisting of encoder, bottleneck, and decoder blocks designed for structure-based ligand binding site prediction (Figure 1). The input to the model is a 3D grid of voxels with dimensions 96 × 96 × 96 × 18, where each of the 18 channels encodes atomic-level properties such as atom type, hybridization, partial charge, and aromaticity. A detailed description of each component of SwinSite is given below.

Figure 1. Overall architecture of the SwinSite model for 3D protein–ligand binding site prediction. (a) Fundamental components of SwinSite. The Patch Merging3D and Patch Expanding3D modules perform spatial downsampling and upsampling, respectively, while the Conv Block3D extracts local geometric features through stacked 3D convolutions. (b) The U-shaped encoder–decoder architecture of SwinSite. The encoder alternates between Conv Block3D and Swin Block3D modules to capture both local and global contextual information, while the decoder reconstructs spatial resolution using skip connections and patch-expanding operations. (c) Internal structure of the Swin Block3D, which alternates window-based attention (W-MSA) and shifted-window attention (SW-MSA) layers. Shifting the window partition between successive layers enables information exchange across neighboring windows without computing global attention over the full 3D volume.

Encoder

The encoder consists of three primary modules. First, the input tensor is passed through a PatchMerging3D layer, which reduces spatial resolution and increases the channel depth before 3D convolution and Transformer processing (Figure 1a). The processed tensor is then independently processed by ConvBlock3D and SwinBlock3D layers. The ConvBlock3D layer applies two 3D convolutional layers, followed by LayerNorm and PReLU activation with a residual connection (Figure 1a). The SwinBlock3D layer implements the Swin Transformer V2 architecture adapted for 3D inputs (Figure 1c).

Each SwinBlock3D contains alternating Window-based multihead self-attention (W-MSA) and shifted window-based multihead self-attention (SW-MSA) layers. W-MSA applies self-attention within local nonoverlapping windows, while SW-MSA shifts the window partitions between layers to facilitate information exchange across neighboring regions. The shifted-window mechanism allows each voxel to attend not only to its immediate local neighborhood but also to adjacent regions across the layers. By stacking window-based and shifted-window attention blocks, the model gradually integrates local geometric details with a broader spatial context in a computationally efficient manner.

Relative positional encodings are computed based on voxel offsets and incorporated into the attention weights through a learned bias matrix, as defined in eq 5

Attention(Q, K, V) = Softmax(QK^T/√d + B)V    (5)

where Q, K, and V are the query, key, and value matrices obtained by linearly projecting the input features of dimension dim to head_dim × heads and splitting into heads groups, with d being the head dimension (set to 32), and B is the relative position bias matrix of size n × n, where n is the number of voxels within a local window (n = w_x w_y w_z), for which we use w_x = w_y = w_z = 3 (i.e., a 3 × 3 × 3 window). The bias matrix B is constructed by indexing a learnable embedding table of size (2w_x − 1) × (2w_y − 1) × (2w_z − 1), where w_x, w_y, and w_z denote the window sizes along each spatial dimension.
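Eq 5 for a single head over the n = 27 voxel tokens of one 3 × 3 × 3 window can be sketched as follows. Head splitting and the learnable bias-table lookup are omitted; the projection weights and bias passed in are placeholders, not the trained parameters.

```python
import numpy as np

def window_attention(x, wq, wk, wv, bias):
    """Eq 5 for one head within a single window:
    Softmax(Q K^T / sqrt(d) + B) V, where B is the relative position bias."""
    q, k, v = x @ wq, x @ wk, x @ wv
    d = q.shape[-1]                                       # head dimension
    scores = q @ k.T / np.sqrt(d) + bias                  # (n, n) logits + bias
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)        # softmax over keys
    return attn @ v
```

A bias that strongly favors one key makes every token attend to that key, which is a quick way to sanity-check the softmax normalization.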

Our model employs a hidden dimension of 96 with 3, 6, 9, and 12 attention heads at progressively deeper encoder stages.

Decoder

The decoder reconstructs the spatial resolution using PatchExpanding3D layers implemented with ConvTranspose3D layers (Figure 1a). The upsampled feature maps are then refined by ConvBlock3D layers with PReLU activations and SwinBlock3D layers. Skip connections from the encoder are fused at each corresponding decoding level to retain spatial context and facilitate feature reuse (Figure 1b).

Output

The final decoder layer produces a feature map of shape 96 × 96 × 96 × C, which is passed through a 1 × 1 × 1 convolutional layer to generate a single-channel output of shape 96 × 96 × 96 × 1. Each voxel represents the predicted likelihood of being part of a ligand binding site.

To convert voxel-wise predictions into pocket-level candidates, we applied a postprocessing pipeline based on three-dimensional image segmentation. After thresholding the voxel-wise probability map to obtain a binary mask, small holes within the predicted binding regions were filled using a three-dimensional morphological closing operation. Next, connected voxel regions touching the boundary of the grid were removed to exclude incomplete pocket predictions. Spatially connected voxel clusters were then identified, and only connected components containing at least 70 voxels were retained as valid pocket candidates. The binding site confidence score was computed as the average score over voxels covering the site, where the site corresponds to a connected voxel component.
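The thresholding, connected-component filtering, and scoring steps can be sketched as follows. This is a simplified sketch: the morphological closing and border-component removal are omitted, and the 0.5 probability threshold is an assumed value not stated in the text (the 70-voxel size cutoff is from the paper).

```python
import numpy as np
from collections import deque

def extract_pockets(prob, threshold=0.5, min_voxels=70):
    """Turn a voxel-wise probability map into scored pocket candidates:
    threshold, find 26-connected components, drop components smaller than
    min_voxels, and score each pocket by its mean voxel probability."""
    mask = prob >= threshold
    seen = np.zeros(prob.shape, dtype=bool)
    neighbors = [(dx, dy, dz)
                 for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)
                 if (dx, dy, dz) != (0, 0, 0)]
    pockets = []
    for start in zip(*np.nonzero(mask)):
        if seen[start]:
            continue
        seen[start] = True
        queue, component = deque([start]), []
        while queue:                                  # BFS over the component
            v = queue.popleft()
            component.append(v)
            for dx, dy, dz in neighbors:
                u = (v[0] + dx, v[1] + dy, v[2] + dz)
                if all(0 <= u[i] < prob.shape[i] for i in range(3)) \
                        and mask[u] and not seen[u]:
                    seen[u] = True
                    queue.append(u)
        if len(component) >= min_voxels:              # size filter (paper: 70)
            score = float(np.mean([prob[v] for v in component]))
            pockets.append((score, component))
    return sorted(pockets, key=lambda p: p[0], reverse=True)
```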

Loss Function

SwinSite is trained using focal loss to address the severe class imbalance in voxel-wise binary classification, as defined in eq 6

L_focal = α(1 − e^(−L_BCE))^γ · L_BCE    (6)

where α = 1, γ = 0.2, and L_BCE is the binary cross-entropy loss defined in eq 7

L_BCE = −y log ŷ − (1 − y) log(1 − ŷ)    (7)

where y is the ground-truth binding-site label and ŷ is the binding-site likelihood predicted by the model. The modulation term downweights easy examples, effectively guiding the model to focus on difficult positive samples.
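Eqs 6 and 7 can be written out directly. Note that for y = 1, e^(−L_BCE) = ŷ, so the modulation term reduces to (1 − ŷ)^γ, the familiar focal-loss form: easy (high-ŷ) positives are downweighted, while hard (low-ŷ) positives retain nearly their full loss.

```python
import numpy as np

def bce(y, p, eps=1e-7):
    """Eq 7: binary cross-entropy between label y and predicted likelihood p."""
    p = np.clip(p, eps, 1.0 - eps)                 # avoid log(0)
    return -y * np.log(p) - (1.0 - y) * np.log(1.0 - p)

def focal_loss(y, p, alpha=1.0, gamma=0.2):
    """Eq 6 with the paper's settings alpha = 1 and gamma = 0.2. The factor
    (1 - exp(-L_BCE)) equals (1 - p) when y = 1, so easy, well-classified
    positives are downweighted."""
    l = bce(y, p)
    return alpha * (1.0 - np.exp(-l)) ** gamma * l
```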

Training Procedure

We trained the model to perform binary voxel-wise classification using the focal loss (eq 6). Training was conducted using the AdamW optimizer and a cosine annealing scheduler. The model was trained for up to 700 epochs with early stopping (patience = 20). We performed 4-fold cross-validation on the training data, where each fold used an internal validation split for early stopping and model selection (no benchmark test sets were used for tuning or selection). All hyperparameters were optimized exclusively on the validation splits, and the test sets were strictly held out and used only for final evaluation. The final prediction was generated by mean-pooling the sigmoid outputs of the four trained models.

Benchmark Sets

We evaluated SwinSite on diverse benchmark data sets widely used in binding site prediction, including DT198 (124 drug–target complexes), ASTEX85 (124 protein–ligand pairs), COACH420 (293 proteins with mixed ligands), HOLO4K (4735 holo protein–ligand complexes), and PDBbind2020 (refined set, 5316 protein–ligand complexes). These data sets cover a broad spectrum of protein families and ligand types, ranging from small curated benchmarks to large-scale collections, and have been commonly adopted in the literature for fair performance comparison with widely used CNN-based methods as well as traditionally strong baseline models.

During test set construction, CD-HIT was applied to remove identical protein sequences, while no additional sequence identity threshold was imposed in the main experiments. We note that sequence-based filtering alone is insufficient to fully eliminate structural redundancy, and controlling structural overlap across proteins remains challenging in practice. This limitation is shared by existing benchmark data sets and prior binding site prediction studies.

Structure-aware benchmarking frameworks such as LIGYSIS provide a principled approach for mitigating such data leakage and are particularly important for evaluating generalization across structurally related proteins. In this study, however, we adopted conventional benchmark data sets and test set configurations to enable direct performance comparison with existing methods.

Evaluation Procedure

Prediction performance was assessed using two standard criteria. The first is the distance-to-closest-center (DCC), which measures the Euclidean distance between the predicted pocket center and the center (centroid) of the bound ligand, with a cutoff of 4 Å to determine success. The second is the distance-to-closest-atom (DCA), which measures the minimum distance between the predicted pocket center and the nearest ligand atom using a 4 Å threshold. For both metrics, we report Top-1, Top-3, and Top-(n + k) success rates, where a case is considered correct if at least one of the top-ranked predicted pockets meets the respective cutoff. Here, n denotes the number of true binding pockets annotated for a given protein, and k is set to 2 in our experiments.
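The two metrics can be sketched as follows (hypothetical helper functions for illustration; coordinates are in Å):

```python
import numpy as np

def dcc(pocket_center, ligand_coords):
    """Distance-to-closest-center: predicted pocket center vs ligand centroid."""
    centroid = np.asarray(ligand_coords, dtype=float).mean(axis=0)
    return float(np.linalg.norm(np.asarray(pocket_center, dtype=float) - centroid))

def dca(pocket_center, ligand_coords):
    """Distance-to-closest-atom: pocket center vs the nearest ligand atom."""
    diff = np.asarray(ligand_coords, dtype=float) - np.asarray(pocket_center, dtype=float)
    return float(np.linalg.norm(diff, axis=1).min())

def is_success(distance, cutoff=4.0):
    """Both metrics use a 4 A cutoff to call a prediction successful."""
    return distance <= cutoff
```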

This unified evaluation setup enables a consistent comparison between geometry-based methods (FPocket and P2Rank), voxel-based CNN models (PUResNet, Kalasanty, and DeepPocket), and our SwinSite model. To compare SwinSite under a standardized input representation, we selected a subset of benchmark protein–ligand complexes that were commonly used by both PUResNet and Kalasanty. For a fair comparison, we evaluated FPocket, P2Rank, PUResNet, Kalasanty, and DeepPocket using their released pretrained weights with each model's native preprocessing pipeline.

Results and Discussion

Comparative Evaluation with Existing Methods

Overall, SwinSite achieved strong performance across diverse benchmark data sets under the DCC criterion (Table 1). While the absolute performance differs across data sets, SwinSite consistently matches or exceeds CNN-based baselines such as PUResNet, Kalasanty, and DeepPocket in both Top-1 and Top-3 success rates, showing steady improvements on both small curated sets (ASTEX85, DT198, and COACH420) and large-scale data sets (HOLO4K and PDBbind2020). Rather than relying on a substantial gain on a single data set, SwinSite exhibits stable and competitive accuracy across multiple benchmark data sets with diverse protein targets. This consistent improvement demonstrates that incorporating hierarchical self-attention enables more effective identification of binding regions, particularly in cases with complex spatial arrangements or long-range structural dependencies.

Table 1. Top-1 and Top-3 DCC-Based Success Rates (%).

method ASTEX85 DT198 COACH420 HOLO4K PDBbind2020
Top-1
FPocket 26.61 20.00 22.60 22.30 22.02
P2Rank 36.29 46.96 50.00 45.72 50.05
DeepPocket 43.55 38.26 44.18 37.23 46.96
Kalasanty 36.29 26.96 38.01 31.75 32.35
PUResNet 39.52 39.13 51.03 42.27 48.56
SwinSite-CNN 42.74 38.26 49.32 37.55 45.10
SwinSite 44.35 51.30 49.66 45.26 51.87
Top-3
FPocket 37.10 30.43 31.51 34.30 35.79
P2Rank 45.16 51.30 53.08 58.34 61.69
DeepPocket 53.23 42.61 53.77 53.46 61.52
Kalasanty 39.52 27.83 38.70 35.84 35.26
PUResNet 42.74 40.00 52.74 45.55 52.30
SwinSite-CNN 42.74 38.26 49.66 39.61 46.57
SwinSite 50.00 54.78 55.14 55.64 61.99

To further disentangle the effect of architectural design from data and training settings, we compared SwinSite with a CNN variant, denoted SwinSite-CNN. This variant was trained under identical conditions, using the same voxelized input representation, training data, and optimization settings, the only difference being the removal of the SwinBlock3D modules from the architecture shown in Figure 1. The consistent performance gap between SwinSite and SwinSite-CNN indicates that the integration of hierarchical self-attention enables more effective identification of binding regions, particularly in cases involving complex spatial arrangements or long-range structural dependencies.

In terms of DCA, SwinSite demonstrated competitive performance against both geometry-based methods and CNN-based models across all benchmark data sets (Table 2). For the Top-1 evaluation, SwinSite achieved the highest success rates on the ASTEX85 (60.48%) and DT198 (73.04%) data sets. On the remaining benchmark sets (COACH420, HOLO4K, and PDBbind2020), SwinSite showed slightly lower Top-1 success rates than P2Rank, with performance differences ranging from approximately 1.1% to 2.7% (Table 2).

Table 2. Top-1 and Top-3 DCA-Based Success Rates (%).

method ASTEX85 DT198 COACH420 HOLO4K PDBbind2020
Top-1
FPocket 30.65 29.57 36.64 35.40 33.72
P2Rank 57.26 72.17 73.63 66.71 70.68
DeepPocket 54.03 61.74 65.75 60.77 68.33
Kalasanty 56.45 53.91 61.99 52.33 54.39
PUResNet 54.84 71.30 67.12 61.03 64.17
SwinSite-CNN 57.26 63.48 66.78 57.20 61.56
SwinSite 60.48 73.04 70.89 65.58 68.46
Top-3
FPocket 43.55 47.83 52.74 55.84 55.52
P2Rank 69.35 77.39 80.82 86.17 85.97
DeepPocket 67.74 73.04 81.51 82.63 85.95
Kalasanty 61.29 54.78 63.01 58.63 58.83
PUResNet 58.06 72.17 69.18 65.87 68.93
SwinSite-CNN 58.06 63.48 67.47 60.16 63.87
SwinSite 68.55 79.13 76.37 79.04 80.50

Table 3. Top-(n + 2) Success Rates.

  ASTEX85 DT198 COACH420 HOLO4K PDBbind2020
method DCC DCA DCC DCA DCC DCA DCC DCA DCC DCA
FPocket 57.33 65.33 30.43 47.83 34.07 57.04 41.05 63.85 35.79 55.52
P2Rank 69.33 93.33 51.30 77.39 57.41 87.78 66.82 93.03 61.69 85.97
DeepPocket 78.67 90.67 42.61 73.04 58.15 88.15 64.00 91.28 61.52 85.95
Kalasanty 61.33 82.67 27.83 54.78 41.85 68.15 43.54 67.58 35.26 58.83
PUResNet 66.67 81.33 40.00 72.17 57.04 74.81 57.87 79.55 52.30 68.93
SwinSite-CNN 69.33 84.00 38.26 63.48 53.70 72.96 49.67 73.09 46.57 63.87
SwinSite 74.67 94.67 54.78 79.13 59.63 82.59 66.02 88.55 61.99 80.50

When the Top-3 predicted pockets were considered, SwinSite exhibited consistently improved performance across all data sets. SwinSite achieved the highest Top-3 DCA success rate on the DT198 data set (79.13%), while maintaining competitive performance on COACH420, HOLO4K, and PDBbind2020, where its success rates were within approximately 5.1% to 7.1% of the best-performing methods. On ASTEX85 and HOLO4K, SwinSite achieved slightly lower Top-3 success rates than P2Rank, by 0.8% and 7.1%, respectively.

The relatively lower DCA performance of SwinSite compared with its DCC results reflects the different aspects of binding site prediction captured by the two evaluation metrics. While DCC primarily evaluates the accuracy of pocket center localization, DCA is more sensitive to the spatial extent and surface coverage of the predicted binding regions. As SwinSite is optimized to precisely localize pocket centers through compact volumetric predictions, it may achieve strong DCC performance while exhibiting slightly lower DCA scores, which favors broader spatial coverage around ligand atoms. This distinction highlights that DCC and DCA capture complementary properties of binding site prediction rather than indicating a limitation of the proposed approach.

Overall, the consistent performance of SwinSite across both small curated data sets (ASTEX85 and DT198) and large heterogeneous benchmarks (COACH420, HOLO4K, and PDBbind2020) demonstrates the robustness and generalization capability of the proposed method. By integrating 3D CNNs with Swin Transformer blocks, SwinSite effectively balances local feature extraction and global spatial reasoning, resulting in stable and competitive DCA performance across diverse protein families. Additional performance results under a stricter 50% sequence identity threshold are provided in the Supporting Information (Tables S4–S6).

Because fixed Top-K metrics do not reflect the variable number of annotated binding sites across proteins, we additionally evaluated all methods using the Top-(n + k) metric with k = 2, where n denotes the number of true binding sites for each protein, as considered in previous benchmarking studies.

Under this more flexible evaluation protocol, SwinSite demonstrated consistently strong performance across all benchmark data sets. In terms of DCC, SwinSite achieved the highest success rates on DT198, COACH420, and PDBbind2020, while showing competitive performance on ASTEX85 and HOLO4K. For DCA, SwinSite achieved the best performance on ASTEX85 and DT198 and remained comparable to P2Rank and DeepPocket on larger and more heterogeneous benchmarks, such as HOLO4K and PDBbind2020.

Overall, the Top-(n + 2) results confirm that the performance gains of SwinSite are not limited to fixed Top-K settings but extend to an evaluation protocol that accounts for variability in the number of true binding pockets across proteins.

Validation of the Scoring Function via Score–Accuracy Correlation

To validate the informativeness of SwinSite's confidence scores, we analyzed the relationship between the binding site score (0–1) and the localization error (DCC) on the PDBbind2020 test set (n = 4678), retaining the Top-1 pocket per protein. Across all predictions, a strong negative correlation (Spearman ρ = −0.576, Pearson r = −0.589; both p < 0.001) was observed, confirming that higher confidence corresponds to smaller DCC (Figure 2a).
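The reported correlations can be recomputed in outline as follows. This is a minimal sketch: Spearman ρ is computed as the Pearson correlation of ranks, without the tie correction that a library routine such as scipy.stats.spearmanr would apply.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def spearman(a, b):
    """Spearman rho as the Pearson correlation of ranks (no tie handling)."""
    ranks = lambda x: np.argsort(np.argsort(x)).astype(float)
    return pearson(ranks(np.asarray(a)), ranks(np.asarray(b)))
```

Applied to per-protein (score, DCC) pairs, a negative value indicates that higher confidence scores coincide with smaller localization errors.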

Figure 2. Validation of the scoring function via score–accuracy correlation. (a) Correlation between SwinSite binding site scores and localization accuracy (DCC) on the PDBbind2020 test set (n = 4678). Each point represents the Top-1 predicted pocket per protein. A negative correlation is observed between confidence score and localization error (Spearman ρ = −0.5757, Pearson r = −0.5883). The dashed green line marks the success threshold at 4 Å. (b) High-quality prediction region: a refined subset of predictions characterized by high binding site scores (≥0.5) and accurate predictions (DCC ≤ 8 Å). This high-quality subset contains 2994 predictions, accounting for 64.0% of the total data set. Within this region, the score–accuracy correlation is weaker (Spearman ρ = −0.28). The y-axis is restricted to the 0.5–0.85 range to enhance visualization of the score distribution in this high-confidence subset. Points are colored by local density, and the horizontal line at score = 0.5 denotes the confidence threshold.

We then focused our analysis on the high-quality prediction region (2994 predictions, 64.0% of the data set), defined as cases with a high confidence score (≥0.5) and an accurate prediction within the broader binding pocket environment (DCC ≤ 8 Å). The 8 Å threshold was chosen because it encompasses the "second-shell" residues, which, in contrast to the first-shell direct contacts (typically ≈4 Å), form a critical network that determines the binding pocket's stability and the ligand's selectivity. Within this high-confidence subset, the score's reliability is clearly demonstrated by the prediction density distribution (Figure 2b). The plot shows that the highest-scoring predictions (e.g., scores ≈ 0.8) are not scattered but form a dense concentration precisely at the lowest DCC values (≈1.5 Å). This analysis confirms that SwinSite's binding site score serves as a robust and reliable indicator, capable of effectively identifying the most trustworthy predictions.

Qualitative Comparison with CNN-Based Models

A quantitative win–loss analysis demonstrates that SwinSite accounts for the majority of head-to-head disagreements with CNN-based baseline models across all benchmark data sets. When all test sets are aggregated at a DCC threshold of 8 Å, SwinSite accounts for 83.8% of the cases in which the compared methods differ (Table S3). These results indicate that SwinSite consistently matches or exceeds the performance of the existing CNN-based baselines when operating under equivalent voxel-based input representations.

In other words, the results suggest that the SwinSite architecture effectively complements the local geometric modeling capabilities of CNNs with an improved integration of global spatial relationships.

Here, the 4 Å threshold follows the commonly used criterion for assessing binding site localization accuracy, while the 8 Å threshold serves as an auxiliary reference to distinguish predictions that are substantially displaced from the true binding site. Accordingly, representative examples were selected to illustrate typical success and failure cases observed across the benchmark data sets, rather than extreme or outlier predictions.

To qualitatively assess the prediction accuracy of SwinSite relative to CNN-based baselines under identical input representations, we examined representative examples from a shared test set used by PUResNet and Kalasanty. For each case, a prediction was considered successful if the DCC was less than 4 Å.

Figure 3a highlights the binding site prediction results for PDB ID: 1YGC, corresponding to human coagulation factor VIIa in complex with the selective inhibitor G17905.

Figure 3. Representative binding site predictions. Visual comparison of SwinSite (red), PUResNet (yellow), and Kalasanty (green) on shared test cases. (a) PDB ID: 1YGC, (b) PDB ID: 2G25, and (c) PDB ID: 1TJW. The ground-truth ligand is shown in gray.

For PDB ID: 2G25 (Figure 3b), which represents the E1 component of the Escherichia coli pyruvate dehydrogenase complex bound to a thiamin diphosphate reaction intermediate analogue, SwinSite successfully identifies the correct binding pocket.

A more challenging example is PDB ID: 1TJW (Figure 3c), corresponding to a duck δ2-crystallin mutant with bound argininosuccinate, which serves as a structural homologue of argininosuccinate lyase.

These results highlight SwinSite's robust and consistent performance across the test proteins evaluated in this study. Its transformer-based architecture facilitates more accurate spatial reasoning in complex structural environments, offering clear advantages over baseline models that rely on local CNN layers. The qualitative findings presented in Figure 3 are in agreement with the higher DCC and DCA success rates of SwinSite discussed previously.

Evaluation of Feature Importance through the Ablation Study

To evaluate the relative importance of the atomic features used in SwinSite, we conducted a systematic channel-wise ablation study (Figure 4). At inference time, each input channel was independently masked to zero while keeping the model parameters fixed, enabling a direct assessment of its contribution to binding site prediction. For the heavy-degree feature, which encodes the local atomic bonding environment of each atom, permutation-based masking rather than zero masking was applied to prevent complete disruption of this information.
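The two masking schemes can be sketched as follows (a hypothetical helper; the channel indices and the 96 × 96 × 96 × 18 channels-last tensor layout follow the convention described above):

```python
import numpy as np

def ablate_channel(x, channel, mode="zero", rng=None):
    """Mask one input channel of a channels-last voxel feature tensor at
    inference time: 'zero' sets the channel to zero; 'permute' shuffles its
    values across voxels, preserving the marginal distribution (used in the
    paper for the heavy-degree feature)."""
    out = np.array(x, copy=True)
    if mode == "zero":
        out[..., channel] = 0.0
    elif mode == "permute":
        rng = rng if rng is not None else np.random.default_rng(0)
        flat = out[..., channel].ravel()
        out[..., channel] = rng.permutation(flat).reshape(x.shape[:-1])
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out
```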

Figure 4. Evaluation of feature importance through the ablation study. (a) Performance drop when each feature channel is removed, averaged over five benchmark data sets. (b) Data set-specific Top-1 DCC success rates (cutoff = 4 Å) and model prediction recall for baseline and ablation settings.

Model accuracy was measured using the Top-1 DCC success rate (cutoff = 4 Å) across five benchmark data sets: ASTEX85, COACH420, DT198, HOLO4K, and PDBbind2020. For completeness, the corresponding group-wise ablation results (geometric, chemical, atom-type, and charge) are provided in the Supporting Information.
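The Top-1 DCC success rate amounts to scoring only the best-ranked pocket per target and averaging the hits over a benchmark set. A minimal sketch, again using the ligand's geometric center as the reference point and with illustrative (not repository) function and variable names:

```python
import numpy as np

def top1_dcc_success_rate(ranked_centers, ligand_coords, cutoff=4.0):
    """Top-1 DCC success rate over a benchmark set.

    ranked_centers: dict mapping target id -> list of predicted pocket
    centers, best-ranked first; ligand_coords: dict mapping the same ids
    -> (N, 3) ligand heavy-atom coordinates. Only the top-ranked pocket
    is scored, matching the Top-1 criterion used in the text.
    """
    hits = 0
    for target, centers in ranked_centers.items():
        lig = np.asarray(ligand_coords[target], dtype=float)
        d = np.linalg.norm(np.asarray(centers[0], dtype=float) - lig.mean(axis=0))
        hits += int(d < cutoff)
    return hits / len(ranked_centers)
```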

The results reveal a clear hierarchy among the input features, consistent with previously observed group-wise trends. Geometric descriptors were found to be indispensable. Removing either hybridization (e.g., sp, sp2, sp3) or heavy-degree (the number of covalently bonded heavy atoms) caused the largest performance degradation and resulted in zero success rates and low recall values across all benchmark data sets. This confirms that the local bonding geometry and heavy-atom neighborhood density are the most critical determinants for pocket recognition.

Chemical and atom-type features also substantially influenced prediction accuracy, although to a lesser extent than geometric cues. Among atom-type channels, removing atom:O resulted in the largest performance drop (39.7%), followed by atom:C (25.45%), atom:N (14.88%), and atom:S (10.35%). This asymmetric sensitivity reflects the biochemical roles of polar and heteroatom-rich regions, such as backbone carbonyls and polar side chains, which strongly define the cavity boundaries.

Chemical descriptors including heterodegree, H-bond donor, aromatic, H-bond acceptor, and hydrophobic features produced intermediate performance drops (approximately 15–23%), indicating that hydrogen-bonding patterns, aromatic stacking surfaces, and hydrophobic organization refine the predicted binding region once the geometric scaffold has been identified. In contrast, the Ring channel contributed comparatively little (10.43%), suggesting partial redundancy with aromatic and atom-type information.

Notably, while ablation of nongeometric channels reduced Top-1 DCC success rates, the model often retained the ability to localize binding regions at a coarse level, indicating a stronger impact on pocket ranking accuracy than on overall binding site recall.

Overall, these results indicate that SwinSite relies predominantly on geometric descriptors to identify cavity-like structural motifs, while chemical and atom-type features enhance specificity and boundary precision. The full group-wise ablation analysis is provided in the Supporting Information.

Computational Cost and Efficiency

Transformer-based architectures are often associated with increased computational cost. SwinSite was therefore designed to incorporate Transformer modules while keeping training and inference requirements feasible. All models were trained on NVIDIA RTX 6000 Ada GPUs (48 GB of memory). Training used 4-fold cross-validation; each fold required approximately 4 days, and the four folds were trained in parallel on four GPUs, giving an overall wall-clock training time of about 4 days.

At inference time, SwinSite requires approximately 1.7 s per protein target when using a 4-fold ensemble. In contrast, the SwinSite-CNN variant, in which the Transformer blocks are removed, runs approximately 1.2× faster, requiring about 1.4 s per protein target. These runtimes cover the complete prediction process from the voxelized input to pocket score generation. Although SwinSite incurs additional computational cost compared to purely CNN-based models, the overhead remains moderate. The cost introduced by the Swin Transformer blocks is mitigated by the window-based self-attention mechanism and the hierarchical design: by restricting attention computation to local 3D windows and progressively aggregating information across scales, SwinSite avoids the quadratic complexity of global self-attention. As a result, the model achieves improved predictive performance without an excessive computational burden.
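The complexity argument can be made concrete by counting attention pairs: global self-attention over T tokens costs on the order of T² pairs, while window attention costs T·M³ for an M×M×M window. The grid and window sizes below are illustrative, not SwinSite's actual configuration.

```python
# Token-pair counts for self-attention over a 3D voxel grid.
def attn_pairs_global(grid):
    t = grid[0] * grid[1] * grid[2]   # T tokens in the grid
    return t * t                      # every token attends to every token

def attn_pairs_windowed(grid, m):
    t = grid[0] * grid[1] * grid[2]
    return t * m ** 3                 # each token attends only within its M^3 window

grid, m = (16, 16, 16), 4             # 4096 tokens, 64-token windows (hypothetical sizes)
ratio = attn_pairs_global(grid) // attn_pairs_windowed(grid, m)
print(ratio)                          # windowed attention needs 64x fewer pairs here
```

Because the window size M stays fixed while the grid grows, the windowed cost scales linearly in the number of tokens rather than quadratically.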

Implications for Biological and Drug Discovery Applications

As a future direction, the insights gained from SwinSite highlight the potential benefits of extending structure-based binding site prediction toward multimodal frameworks. While SwinSite focuses on 3D voxelized representations to capture geometric and physicochemical properties of protein surfaces, binding site formation is also influenced by sequence context and evolutionary constraints that are not explicitly encoded in the current model. Recent advances in protein language models have demonstrated that sequence- and evolution-derived representations can capture complementary information related to functional residues and conserved motifs. Integrating such representations with structure-based features, through either joint embedding spaces or cross-modal attention mechanisms, may enable more robust and generalizable binding site prediction. These directions suggest that future architectures could move beyond vision-transformer-based frameworks toward integrated models that leverage both structural and sequence-level signals.
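One way to realize the cross-modal attention mentioned above is to let structure-derived tokens attend over sequence-model embeddings. The sketch below is a speculative illustration of that idea under our own simplifying assumptions (single head, NumPy, no learned projections); it is not part of SwinSite.

```python
import numpy as np

def cross_modal_attention(struct_tokens, seq_tokens):
    """Single-head cross-attention: structure-derived tokens (queries)
    attend over sequence-model tokens (keys and values). Both inputs
    share embedding dimension d; shapes are (T_struct, d) and (T_seq, d).
    """
    d = struct_tokens.shape[-1]
    scores = struct_tokens @ seq_tokens.T / np.sqrt(d)   # (T_struct, T_seq)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                   # row-wise softmax
    return w @ seq_tokens                                # (T_struct, d)
```

Each output row is a convex combination of sequence embeddings, so structural tokens are enriched with whatever conservation or functional signal the sequence model encodes.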

Conclusion

We presented SwinSite, a hybrid 3D deep learning framework that integrates convolutional and Transformer-based feature extraction for protein–ligand binding site prediction. By combining local geometric encoding with hierarchical self-attention, SwinSite effectively captures both fine-grained structural patterns and long-range contextual relationships within protein surfaces. Compared to conventional CNN-based methods, the Swin Transformer backbone provides broader receptive fields and better global awareness, leading to more accurate localization of binding pockets across diverse protein families. The consistent performance gains across various benchmark sets demonstrate that hierarchical attention can complement convolutional representations in a computationally efficient manner. Overall, SwinSite provides a balanced and generalizable framework for structure-based binding site prediction, suggesting that ViT-based architectures can serve as a promising foundation for future developments in 3D molecular modeling.

Supplementary Material

ci5c02734_si_001.pdf (1.2MB, pdf)

Acknowledgments

This work was supported by Seoul National University (370C-20220109 and AI-Bio Research Grant 0413-20230053), the National Research Foundation of Korea (Grant nos. RS-2023-00256320, 2022M3E5F3081268, and 2022R1C1C1005080), and the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2023-00220628, Artificial intelligence for prediction of structure-based protein interaction reflecting physicochemical principles). This research was also supported by the Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (No. RS-2024-00352229). This work was also supported by the Korea Drug Development Fund funded by the Ministry of Science and ICT, Ministry of Trade, Industry, and Energy, and Ministry of Health and Welfare (RS-2023-00217308).

The source code for SwinSite is available at: https://github.com/ding-oh/SwinSite.

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.5c02734.

  • Detailed definitions of atomic input voxel features and channel representations; hyperparameter settings for SwinSite attention blocks; win–loss comparison tables against CNN-based baseline models across benchmark data sets; group-wise ablation study results; and additional performance evaluations under a 50% sequence identity threshold, including Top-1, Top-3, and Top-(n + 2) DCC and DCA success rates (PDF)

The authors declare no competing financial interest.

References

  1. Rees D. C., Congreve M., Murray C. W., Carr R. Fragment-based lead discovery. Nat. Rev. Drug Discovery. 2004;3:660–672. doi: 10.1038/nrd1467.
  2. Murray C. W., Rees D. C. The rise of fragment-based drug discovery. Nat. Chem. 2009;1:187–192. doi: 10.1038/nchem.217.
  3. Pliushcheuskaya P., Künze G. Evaluation of Small-Molecule Binding Site Prediction Methods on Membrane-Embedded Protein Interfaces. J. Chem. Inf. Model. 2025;65:6949–6967. doi: 10.1021/acs.jcim.5c00336.
  4. Ghersi D., Sanchez R. Improving accuracy and efficiency of blind protein–ligand docking by focusing on predicted binding sites. Proteins: Struct., Funct., Bioinf. 2009;74:417–424. doi: 10.1002/prot.22154.
  5. Jumper J., Evans R., Pritzel A., et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2.
  6. Baek M., DiMaio F., Anishchenko I., et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373:871–876. doi: 10.1126/science.abj8754.
  7. Abramson J., Adler J., Dunger J., et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630:493–500. doi: 10.1038/s41586-024-07487-w.
  8. Sim J., Kim D., Kim B., Choi J., Lee J. Recent advances in AI-driven protein–ligand interaction predictions. Curr. Opin. Struct. Biol. 2025;92:103020. doi: 10.1016/j.sbi.2025.103020.
  9. Kandel J., Tayara H., Chong K. T. PUResNet: prediction of protein–ligand binding sites using deep residual neural network. J. Cheminf. 2021;13:65. doi: 10.1186/s13321-021-00547-7.
  10. Stepniewska-Dziubinska M. M., Zielenkiewicz P., Siedlecki P. Kalasanty: Ligand binding site prediction by 3D segmentation of protein structures. Bioinformatics. 2019;35:i531–i539.
  11. Mylonas S. K., Axenopoulos A., Daras P. DeepSurf: a surface-based deep learning approach for the prediction of ligand binding sites on proteins. Bioinformatics. 2021;37:1681–1690. doi: 10.1093/bioinformatics/btab009.
  12. Evteev S. A., Ereshchenko A. V., Ivanenkov Y. A. SiteRadar: Utilizing Graph Machine Learning for Precise Mapping of Protein–Ligand-Binding Sites. J. Chem. Inf. Model. 2023;63:1124–1132. doi: 10.1021/acs.jcim.2c01413.
  13. Xia Y., Pan X., Shen H.-B. LigBind: identifying binding residues for over 1000 ligands with relation-aware graph neural networks. J. Mol. Biol. 2023;435:168091. doi: 10.1016/j.jmb.2023.168091.
  14. Luo W., Li Y., Urtasun R., Zemel R. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NeurIPS), 2016; pp 4905–4913.
  15. Schiebel J., Radeva N., Krimmer S. G., Wang X., Stieler M., Ehrmann F. R., Fu K., Metz A., Huschmann F. U., Weiss M. S., Mueller U., Heine A., Klebe G. Six Biophysical Screening Methods Miss a Large Proportion of Crystallographically Discovered Fragment Hits: A Case Study. ACS Chem. Biol. 2016;11:1693–1701. doi: 10.1021/acschembio.5b01034.
  16. Utgés J. S., MacGowan S. A., Ives C. M., Barton G. J. Classification of likely functional class for ligand binding sites identified from fragment screening. Commun. Biol. 2024;7:320. doi: 10.1038/s42003-024-05970-8.
  17. Dosovitskiy A., Beyer L., Kolesnikov A., et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv. 2020:arXiv:2010.11929. doi: 10.48550/arXiv.2010.11929.
  18. Raghu M., Unterthiner T., Kornblith S., Zhang C., Dosovitskiy A. Do vision transformers see like convolutional neural networks? In Advances in Neural Information Processing Systems, 2021; Vol. 34, pp 12116–12128.
  19. Liu Z., Lin Y., Cao Y., Hu H., Wei Y., Zhang Z., Lin S., Guo B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021; pp 10012–10022.
  20. Desaphy J., Bret G., Rognan D., Kellenberger E. sc-PDB: a 3D-database of ligandable binding sites: 10 years on. Nucleic Acids Res. 2015;43:D399–D404. doi: 10.1093/nar/gku928.
  21. Stepniewska-Dziubinska M. M., Zielenkiewicz P., Siedlecki P. Development and evaluation of a deep learning model for protein–ligand binding affinity prediction. Bioinformatics. 2018;34:3666–3674. doi: 10.1093/bioinformatics/bty374.
  22. Gonzalez R. C., Woods R. E. Digital Image Processing; Prentice Hall: Upper Saddle River, NJ, 2008.
  23. Ronneberger O., Fischer P., Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. Med. Image Comput. Comput. Assist. Interv. 2015;9351:234–241. doi: 10.1007/978-3-319-24574-4_28.
  24. Liu Z., Hu H., Lin Y., Yao Z., Cao Y., Zhang Z., Huang B., Guo B. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022; pp 12009–12019.
  25. van der Walt S., Schönberger J. L., Nunez-Iglesias J., Boulogne F., Warner J. D., Yager N., Gouillart E., Yu T. scikit-image: image processing in Python. PeerJ. 2014;2:e453. doi: 10.7717/peerj.453.
  26. Zhang Z., Li Y., Lin B., Schroeder M., Huang B. Identification of cavities on protein surface using multiple computational approaches for drug binding site prediction. Bioinformatics. 2011;27:2083–2088. doi: 10.1093/bioinformatics/btr331.
  27. Hartshorn M. J., Verdonk M. L., Chessari G., Brewerton S. C., Mooij W. T., Mortenson P. N., Murray C. W. Diverse, high-quality test set for the validation of protein–ligand docking performance. J. Med. Chem. 2007;50:726–741. doi: 10.1021/jm061277y.
  28. Yang J., Roy A., Zhang Y. Protein–ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment. Bioinformatics. 2013;29:2588–2595. doi: 10.1093/bioinformatics/btt447.
  29. Schmidtke P., Souaille C., Estienne F., Baurin N., Kroemer R. T. Large-Scale Comparison of Four Binding Site Detection Algorithms. J. Chem. Inf. Model. 2010;50:2191–2200. doi: 10.1021/ci1000289.
  30. Wang R., Fang X., Lu Y., Wang S. The PDBbind database: collection of binding affinities for protein–ligand complexes with known three-dimensional structures. J. Med. Chem. 2004;47:2977–2980. doi: 10.1021/jm030580l.
  31. Li W., Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–1659. doi: 10.1093/bioinformatics/btl158.
  32. Utgés J. S., Barton G. J. Comparative evaluation of methods for the prediction of protein–ligand binding sites. J. Cheminf. 2024;16:126. doi: 10.1186/s13321-024-00923-z.
  33. Utgés J. S., MacGowan S. A., Barton G. J. LIGYSIS-web: a resource for the analysis of protein–ligand binding sites. Nucleic Acids Res. 2025;53:W351–W360. doi: 10.1093/nar/gkaf411.
  34. Krivák R., Hoksza D. P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J. Cheminf. 2018;10:39. doi: 10.1186/s13321-018-0285-8.
  35. Le Guilloux V., Schmidtke P., Tuffery P. Fpocket: An open source platform for ligand pocket detection. BMC Bioinf. 2009;10:168. doi: 10.1186/1471-2105-10-168.
  36. Aggarwal R., Gupta A., Chelur V., Jawahar C. V., Priyakumar U. D. DeepPocket: Ligand Binding Site Detection and Segmentation Using 3D Convolutional Neural Networks. J. Chem. Inf. Model. 2022;62:1367–1379. doi: 10.1021/acs.jcim.1c00799.
  37. Gao J., Wang S., Wang Y., Chen Y. Second-shell residues of protein–ligand binding sites. Briefings Bioinf. 2021;22:bbab087. doi: 10.1093/bib/bbab087.
  38. Olivero A. G., et al. A selective, slow binding inhibitor of factor VIIa binds to a nonstandard active site conformation and attenuates thrombus formation in vivo. J. Biol. Chem. 2005;280:9160–9169. doi: 10.1074/jbc.M409068200.
  39. Arjunan P., Sax M., Brunskill A., Chandrasekhar K., Nemeria N., Zhang S., Jordan F., Furey W. A thiamin-bound, pre-decarboxylation reaction intermediate analogue in the pyruvate dehydrogenase E1 subunit induces large-scale disorder-to-order transformations in the enzyme. J. Biol. Chem. 2006;281:15296–15303. doi: 10.1074/jbc.M600656200.
  40. Sampaleanu L. M., Codding P. W., Lobsanov Y. D., Tsai M., Smith G. D., Horvatin C., Howell P. L. Structural studies of duck delta2 crystallin mutants provide insight into the role of Thr161 and the 280s loop in catalysis. Biochem. J. 2004;384:437–447. doi: 10.1042/BJ20040656.




Articles from Journal of Chemical Information and Modeling are provided here courtesy of American Chemical Society
