Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2025 Aug 7;41(8):btaf449. doi: 10.1093/bioinformatics/btaf449

Leveraging large language models for literature-driven prioritization of protein binding pockets

Roman Stratiichuk 1,2, Mykola Melnychenko 3,, Ihor Koleiev 4,5, Taras Voitsitskyi 6,7, Vladyslav Husak 8,9, Nazar Shevchuk 10, Zakhar Ostrovsky 11, Volodymyr Bdzhola 12, Semen Yesylevskyy 13,14,15,16, Serhii Starosyla 17, Alan Nafiiev 18
Editor: Jonathan Wren
PMCID: PMC12371332  PMID: 40795239

Abstract

Motivation

Accurately identifying and prioritizing protein binding pockets is a foundational element of small-molecule drug discovery. Defining these known pockets currently relies on a laborious manual process of extracting key residue data from selected publications, reconciling inconsistent terminology, and independently computing volumetric representations. This manual curation to ensure biological relevance is time-consuming, error-prone, and represents a major bottleneck for efficient, high-throughput drug discovery.

Results

We present a novel approach for the identification and prioritization of protein binding pockets for small molecules by combining geometric pocket detection with large language models (LLMs). Our method leverages Fpocket to generate candidate pockets, which are then validated against published experimental data extracted from research articles using LLM with a series of prompts fine-tuned to identify and extract residue-level information associated with experimentally confirmed binding sites. We developed a curated benchmark dataset of diverse proteins and associated literature to train and evaluate the LLM’s performance in paper relevance assessment and pocket extraction.

Availability and implementation

The developed benchmark dataset and methodology are freely available at the GitHub repository (https://github.com/receptor-ai/LLM-benchmark-dataset) and Zenodo (DOI: 10.5281/zenodo.15798647).

Graphical abstract

Graphical Abstract.

Graphical Abstract

1 Introduction

The protein binding pockets are the regions on the protein surface where the ligands—such as small molecules, peptides, nucleic acids, or other proteins—can bind. They play a critical role in facilitating biological processes and are essential for understanding protein function. Accurate identification of these sites is crucial for the development of new drugs and the rational design of novel proteins with specific functions.

Over the years, various computational methods have been developed to predict protein binding pockets. Geometry-based methods rely on the structural identification of cavities or clefts on the protein surface that may accommodate ligands. The most widely used method of this family is Fpocket (Le Guilloux et al. 2009), which uses alpha spheres (Liang et al. 1998) to represent the shape of the pocket cavities.

Energy-based methods, such as EASYMIFs and SITEHOUND (Ghersi and Sanchez 2009), use clustering and filtering of the molecular interaction fields (MIFs) to locate the spatial regions on the protein surface that are energetically favorable for ligand binding. Despite their advantages in accounting for biologically relevant interactions, these techniques are computationally more expensive and require careful parameter tuning.

Methods like ConCavity (Capra et al. 2009) combine a geometry-based approach with a sequence conservation analysis to prioritize the tentative pockets based on their evolutionary significance. However, such techniques often overlook unique non-conserved binding pockets and perform poorly for proteins without a large number of homologs.

In recent years, pocket detection techniques based on machine learning and deep learning have emerged. Such methods as DeepSite (Jiménez et al. 2017), DeepPocket (Aggarwal et al. 2022), PUResNet (Kandel et al. 2021), and PUResNetV2.0 (Jeevan et al. 2024) integrate diverse geometric, energetic, and evolutionary features, resulting in improved prediction accuracy. Methods like FRASE-bot (An et al. 2024) apply deep learning to identify and score small molecular fragments that bind to different regions of the protein surface, which is often beneficial for proteins without well-defined pockets.

Finally, hybrid methods, such as DELIA (Xia et al. 2020), PGpocket (Zhao et al. 2024), and P2Rank (Krivák and Hoksza 2018), combine multiple predictive approaches to overcome the limitations of individual techniques. However, this has the disadvantage of higher computational demands and a large number of parameters.

Despite the large diversity of available techniques, most of them tend to identify more pocket-like regions than justified by experimental data. Nowadays, it is generally straightforward for a researcher to find a bunch of cavities, pits, and grooves on the surface of any protein that is capable of binding to different kinds of ligands, while only some of them are functionally significant in the biological context. This redundancy of pocket detection is desirable in certain scenarios, such as identification of the previously unknown allosteric, hidden or “cryptic” binding pockets, but becomes a significant obstacle otherwise.

One of the most common tasks in computational drug discovery is docking the ligands into the known binding site of the target protein. In order to set up the docking procedure, the researcher has to know the exact geometric extents of the binding pocket—something that is rarely explicitly annotated either in public databases or in research papers. The most obvious solution is to run one of the pocket prediction techniques, which provides such information. However, the result often includes redundant predicted pockets that partially overlap with the desired spatial or sequence region, so it is unclear which of them should be used as is, merged or discarded. Correct post-processing of such predictions usually requires extensive analysis of the research papers, which is both time-consuming and error-prone. This issue becomes particularly pressing in the high-throughput or time-constrained scenarios in commercial drug discovery when human assessment of the binding pockets becomes the major bottleneck and a source of errors.

In this article, we propose a hybrid approach that combines the geometry-based detection of the pockets with the large language models (LLMs), which assess them against the research papers and retain biologically relevant ones. We intend to test whether modern LLMs are powerful and accurate enough to substitute human researchers in binding pocket validation and prioritization tasks.

Specifically, we use Fpocket to generate multiple pocket candidates and LLMs to identify those that are supported by literature evidence. The LLMs are used to extract information from the literature about the residues associated with experimentally confirmed binding sites, which is then used to select the relevant pockets (Fig. 1).

Figure 1.

Figure 1.

The overview of the proposed pipeline for the hybrid binding pocket detection. Fpocket—the geometry-based method—predicts numerous pocket candidates. The LLM analyses scientific articles and selects biologically relevant pockets. The results are then mapped and merged so that LLM-extracted pockets are primarily used as filters for Fpocket predictions. Finally, the Fpocket pocket representation is converted into the spatial grid that undergoes postprocessing and refinement.

We have created an open-source benchmark dataset to evaluate LLM performance specifically for protein binding site detection tasks.

2 Materials and methods

2.1 Dataset

At the time of writing, there are no datasets for evaluating the performance of the protein binding pocket triaging and post-processing. To address this problem, we created a curated and manually annotated dataset that can be used as a public benchmark for LLM-based extraction of pocket-related information from the literature.

We selected 10 diverse proteins: DNA polymerase alpha catalytic subunit; Tyrosine-protein kinase ABL1; 5-hydroxytryptamine receptor 2A; Muscarinic acetylcholine receptor M3; Voltage-gated sodium channel type 7; Voltage-gated sodium channel from American cockroach; Gamma-aminobutyric acid receptor; GTPase KRas; Dihydroorotate dehydrogenase; and Mixed lineage kinase domain-like protein.

For each of these targets, we found 2–5 peer-reviewed research articles (31 in total) and 3D structures from the Protein Data Bank (PDB) (see Supplementary Information, available as supplementary data at Bioinformatics online).

Although the structure and functional features of popular proteins could be described in hundreds of papers, there is an empirical observation that only a handful of them usually contain well-defined information about the binding pockets. To simulate a real-world user workflow for our pipeline, we selected input articles for each target directly from top web search results, without imposing other selection criteria. Selected articles represent different complexity tiers for extracting pocket-related information:

  • Articles that do not describe binding sites are used as a negative control and allow false positives to be caught (11 articles).

  • Articles that describe only one binding site for a given protein target. This is the easiest tier, which reveals LLM’s general under-standing of structural biology concepts (17 articles).

  • Articles that describe multiple binding pockets within a given target protein to evaluate the ability of LLM to precisely attribute specific amino acids to distinct pockets (3 articles).

The resulting benchmark dataset comprises JSON files, one for each selected paper. These files detail the small-molecule binding sites explicitly described in the text, including the site’s name, description, and constituent amino acids.

We utilize a GROBID (https://github.com/kermitt2/grobid) PDF-to-text tool, which extracts text from PDFs into a markdown file with high accuracy for most papers, maintaining the document structure with a title, sections, and figure and table captions. Some manual editing was required to clean up the title and section formatting and to remove irrelevant sections, such as Acknowledgements and Author Contributions.

2.2 LLM-based binding pocket extraction

In order to keep our technique reasonably fast and cost-effective, we omitted advanced reasoning architectures like Chain of Thought (CoT) (Wei et al. 2023) or Tree of Thought (ToT) (Yao et al. 2023)—methods designed to guide complex, multi-step problem-solving. Instead, we relied on direct prompting, using examples and detailed instructions to steer the model. The final workflow incorporates three sequential steps: (i) paper relevance assessment, (ii) pocket extraction, and (iii) pocket refinement, conducted by the same LLM model (Fig. 2). The full text of each article was processed in full since all papers fit into the context window of modern LLMs.

Figure 2.

Figure 2.

Overview of the LLM pocket extraction pipeline utilizing direct prompting. The pipeline processes complete articles through three LLM stages: (i) filtering removes irrelevant papers lacking target relevance or residue-level descriptions. (ii) Extraction, guided by extensive prompt optimization for format and accuracy, identifies small molecule binding site details (name, description, residues). (iii) Refinement further improves data quality by correcting errors and omissions.

In the first step, we assess whether the provided paper is relevant for searching for the pockets for a given target protein, i.e. whether it describes the correct protein and whether it contains a description of the binding pockets at the residue level.

The pocket extraction step focuses on parsing the text and identifying amino acid residues comprising distinct binding pockets. Each pocket is assigned a name, description, and a list of residues in a predefined format. Significant prompt optimization efforts were focused on this step to ensure both strict output formatting and the correct identification of binding pockets.

The subsequent pocket refinement step evaluates the extraction results to correct any remaining inaccuracies, such as identifying and including missing residues, rectifying improperly segmented binding sites, and filtering out binding sites irrelevant for small molecule binding (such as those involved in PPI or DNA/RNA binding). Validation checks were performed during extraction and refinement, specifically to guarantee the correct and consistent formatting of residue details (chain ID, residue name, and residue ID).

2.3 Pocket extraction accuracy metrics

To assess the performance of our pipeline, we calculated the following pocket- and paper-based metrics:

Pocket Number Accuracy— for each paper, if the number of annotated and extracted pockets match, then the prediction is correct (1), otherwise it is incorrect 0. Based on these values for each pocket, the final accuracy per paper is calculated.

Pocket Recall— represents the percentage of annotated pockets that have been correctly extracted by the LLM:

Pocket Recall=nmatched annotated pocketsnannotated pockets

Pocket Specificity (True Negative Rate, TNR): Reflects how often the pipeline correctly avoids predicting a pocket when none is annotated in the ground truth. This is calculated as 1 - False Positive Rate (FPR). The FPR indicates the frequency at which the pipeline generates a fake pocket that does not correspond to any actual annotated pocket.

Pocket Specificity=1-nmissesnextracted pockets

where nmisses —the number of parsed pockets with no correspondence to any annotated pocket.

To evaluate performance at the amino acid level within the pockets, we track the F1 score, recall, and precision metrics for the lists of amino acids in annotated and extracted pockets. The primary optimization objectives for our pipeline are maximizing the F1 score at the amino acid level, minimizing the Pocket False Positive Rate (FPR), and maximizing Pocket Accuracy and Pocket Recall. While other metrics are monitored to evaluate overall pocket extraction quality, they do not guide the optimization process.

2.4 Pocket clustering and post-processing

The list of amino acids extracted by the LLM from the papers may not map directly into a specific PDB structure due to the sequence differences (e.g. insertions/deletions and engineered mutations) or errors in residue numbers, names, or chain IDs of the LLM extracted residues. Therefore, an additional algorithmic mapping step is required to mitigate these issues (Fig. 3). It allows handling such cases as assigning residues consistently across identical chains in homomers and correctly identifying binding sites situated at the interfaces between different chains, which are otherwise very hard to address on the level of LLM extraction.

Figure 3.

Figure 3.

Workflow of the residue mapping algorithm. (i) LLM-extracted residues are assigned to potential PDB chains. (ii) All mapped residues in the PDB structure undergo spatial clustering. (iii) Resulting residue clusters are matched against LLM-defined pockets using an overlap threshold. (iv) Matched clusters containing residues from multiple chains are identified as interface pockets, and their residues are output. (v) Otherwise, original mapped residues are used.

First, chain correspondence between the paper’s residue descriptions and the specific PDB structure is established using a sequence similarity threshold. Following this, spatial clustering is performed on all identified pocket residues within the PDB to detect potential inter-chain interfaces.

A key step involves determining if these spatially clustered inter-chain residue groups match any specific pockets extracted by the LLM using a different sequence similarity threshold. If a match is found, the amino acids from that corresponding cluster are used, defining a unique inter-chain pocket. If no such match is identified, the LLM-extracted residues for the pocket are instead mapped onto all potentially relevant PDB chain identifiers, indicating the possibility of multiple, non-unique pockets located on different protein subunits.

Although clustering can be applied to identify all binding sites, we found this approach yields suboptimal results when distinct binding sites are located in close spatial proximity. However, clustering is particularly effective at capturing identical binding sites formed at the interfaces between subunits in homo- or hetero-multimers.

For this task, we used the MeanShift algorithm, which is particularly advantageous as it does not require the number of clusters to be specified beforehand. While our experiments showed that other methods (e.g. DBSCAN) could also be effective, they typically require more extensive, dataset-specific parameter tuning to achieve comparable results.

Manipulations with protein structures were performed with custom scripts based on MolAR molecular modeling library (Yesylevskyy 2025).

2.5 Construction of pocket volumetric representation

Since residue information alone lacks the spatial detail to define a volumetric binding pocket, a geometry-based method provides this necessary geometric representation, while the LLM output subsequently serves as a filter to select the most plausible candidates among these geometric predictions. For geometry-based prediction, we use Fpocket, a method that identifies potential binding sites using Voronoi tessellation and alpha spheres. These predictions are subsequently filtered based on the LLM output, retaining only those pockets whose alpha spheres are located near the LLM-extracted residues.

The resulting alpha spheres may contact zero, one, or multiple atoms from the same residue resulting in multiple ways of performing the Fpocket binding site selection. We introduced a threshold for residue match that accepts only those residues that have a certain percentage of atoms in contact with alpha spheres from the same pocket. Additionally, we use a threshold pocket match parameter, which defines the minimum Jaccard index for residue overlap between the LLM extracted and Fpocket binding pockets.

If multiple LLM-extracted pockets overlap with the same geometric pocket, they are merged into a single pocket if the residue overlap between them exceeds a predefined threshold. This also allows blending the data from different publications describing the same binding pocket on the target protein but reporting different or incomplete sets of associated residues.

The space bounded by the binding pocket surface can be filled with either alpha-spheres or grid points, depending on the technique used. Multiple geometry-based approaches, such as CB-Dock2 (Liu et al. 2022), CASTp 3.0 (Tian et al. 2018), PyVOL (https://schlessinger-lab.github.io/pyvol/), and Fpocket itself use spheres to represent protein cavities. Grid representation is used by DoGSite3 (Graef et al. 2023) and POVME 2.0 (Durrant et al. 2014).

The representation of alpha spheres used by Fpocket is not very practical for our purposes because merging or splitting the pockets represented by spheres is non-trivial and involves unnecessarily complex algorithms. Thus, we convert the pockets to a grid representation, as shown in Fig. 4.

Figure 4.

Figure 4.

Mapped and filtered pockets for the inhibitor site of Dihydroorotate dehydrogenase (PDB: 6OC0) visualized using (a) Fpocket alpha spheres representation; (b) grid representation.

We define a rectangular grid with 1.5 Å steps whose boundaries correspond to the minimum and maximum coordinates of the binding site heavy atoms along each axis. Only grid points located inside the alpha spheres are retained.

The alpha spheres may protrude significantly from the protein cavities into the bulk solvent. The grid representations of such spheres overestimate the pocket volume and may lead to artifacts if this volume is used for molecular docking by allowing much bulkier compounds that can actually fit into the pocket. To address this, we construct a convex hull on the pocket-heavy atoms and remove all grid points that fall outside it (Fig. 5).

Figure 5.

Figure 5.

Filtering pocket grid points using a convex hull defined by pocket atoms. Grid points falling outside the hull boundary are discarded, while those inside are retained, resulting in a corrected pocket representation without protruding parts.

The resulting grid is further refined by deleting the points located closer than the Van der Waals radius to any protein atom. This step is not strictly necessary since the docking algorithms prevent steric clashes anyway, but it improves the visual quality of the pocket representation.

All parameter values are available in the Supplementary Information, available as supplementary data at Bioinformatics online.

3 Results

3.1 LLM pipeline performance

Due to the rapid advancement of LLMs, our development and testing focused exclusively on the model versions at the time of writing, and results are reported accordingly. Our goal is to show that a reliable pocket prioritization pipeline can be built with the LLM technologies at the time of writing. All steps of the LLM pipeline are performed by the GPT-4o-mini model.

The dataset was initially evaluated by the baseline set of prompts (see Supplementary Information, available as supplementary data at Bioinformatics online). Then, the prompts for filtering and extraction stages were optimized using the following strategies:

  • Adding explicit instructions to ensure that LLM identifies only binding sites relevant to small molecules, while actively excluding other functional regions such as protein-protein interaction (PPI) sites and areas involved in DNA/RNA binding.

  • Adding directives to only include amino acids explicitly involved in pocket formation or ligand interaction.

  • Detailed specifications of the response format and quality stand-ards, including guidelines that strongly reward correct residue reporting while penalizing hallucinated or inaccurate entries.

  • Enforcement of strict formatting rules during residue extraction, ensuring the correct parsing and consistent representation of chain identifiers, residue names, and residue IDs.

  • Enforcement of strict formatting rules during residue extraction, ensuring the correct parsing and consistent representation of chain identifiers, residue names, and residue IDs.

These changes were implemented sequentially, with monitoring performance following each prompt adjustment. Adjustments were only retained if they improved the metrics and/or correctness of output formatting. The final prompts are available in the Supplementary Information, available as supplementary data at Bioinformatics online.

The initial Baseline configuration utilized non-optimized prompts for extraction and filtering. This yielded a Pocket Number Accuracy of 0.484, Pocket Recall of 0.958, and Pocket Specificity of 0.46 (Fig. 6).

Figure 6.

Figure 6.

GPT-4o-mini performance metrics for three pipeline configurations: Baseline (filter and extraction steps with initial prompts), Optimized (filter and extraction steps with optimized prompts), and Full pipeline (adding a refinement step to the Optimized configuration).

Optimization of the extraction step (while keeping the filtering unchanged) boosted performance significantly. It resulted in a Pocket Number Accuracy of 0.581, perfect Pocket Recall (1.0), and Pocket Specificity of 0.558. However, manual inspection revealed that this configuration sometimes incorrectly split single pockets into multiple entries.

The addition of the Refinement step addressed this issue, while further improving Pocket Number Accuracy to 0.71 and Pocket Specificity to 0.657 and maintaining perfect Pocket Recall. This final step also enhanced residue-level metrics, as the refinement process often incorporates missing amino acids.

The filtering step was also evaluated separately as a binary classification task to determine paper relevance based on the presence of binding site descriptions. The model achieved an accuracy of 0.87, a recall of 1.0, a precision of 0.833, and a False Positive Rate (FPR) of 0.363 while retaining FNR at 0. The perfect recall indicates the model effectively identifies all of the relevant papers. However, the precision and FPR suggest that approximately 36% of papers flagged as relevant are false positives that do not get filtered out at this stage. We considered this acceptable because the inclusion of potentially irrelevant papers is unlikely to predict fake pockets that are hard to spot on post-processing.

3.2 Volumetric pocket representations

We visually assessed the residue mapping algorithm’s performance for each target structure, with the quality of the final mapping assessed manually by a human expert. This assessment focused primarily on ensuring correct residue assignments across all relevant PDB chains, as these would influence the pocket’s volumetric representation. To achieve optimal performance across the whole dataset, we empirically set the threshold to 0.6 for chain matching and 0.7 for cluster matching. Figure 7 shows mapped residues for the GABAA receptor PDB structure, where the residue mapping algorithm identified two equivalent binding pockets at distinct subunit interfaces. The output includes atom coordinates for the pocket’s constituent residues along with the corresponding pocket name and description extracted by the LLM (Kim et al. 2020).

Figure 7.

Figure 7.

The proposed algorithm’s residue mapping shows equivalent etomidate binding pockets in the GABAA receptor (PDB: 6X3T) at (a) the A:B interface and (b) the C:D interface.

We generated initial geometric pocket representations using Fpocket with default parameters. However, because Fpocket sometimes fragments larger sites, a post-processing step is implemented to merge spatially adjacent pockets meeting a predefined empirical residue overlap threshold. Kinase structures frequently necessitate such merging, as Fpocket often splits its ATP-binding sites into two or more smaller subpockets that must be combined to accurately represent the complete catalytic site (Fig. 8) (Murphy et al. 2014).

Figure 8.

Figure 8.

Mixed lineage kinase domain-like pseudokinase (PDB: 7MON) with (a) selected pockets as alpha spheres and (b) the final pocket after merging and post-processing.

The same merging strategy addresses another challenge: consolidating information from multiple papers. Pockets identified from different sources might describe the same physical site, but often list different constituent residues. The sequence-level comparisons could be attempted; however, merging based on the spatial proximity of the pockets is a more robust approach for identifying and unifying equivalent sites across various publications.

This merging process, however, cannot address the cases where Fpocket initially fails to separate distinct binding sites and outputs them incorrectly as a large single pocket. This limitation was observed with the M2 receptor (Kruse et al. 2014), where Fpocket did not distinguish between known separate binding sites (Fig. 9). Despite these challenges, the residue mapping algorithm performed reliably for binding sites contained within a single chain or distinct protein domain.

Figure 9.

Figure 9.

Muscarinic acetylcholine receptor M2 (PDB: 4MQT) with detected and mapped pocket that combines the orthosteric and extracellular vestibule binding sites.

For multimeric assemblies, the merging algorithm only performs reliably assuming correct initial residue identification by the LLM. For instance, the human Nav1.7-VSD4-NavAb chimera (PDB: 5EK0) is a tetramer complex where each of the four chains incorporates a human NaV1.7 VSD4 domain introduced via protein engineering. In this case, our algorithm correctly located four identical binding pockets—one within each of the VSD4 domains (Fig. 10) that correspond to the known deep binding site of the GX-936 warhead in the VSD4 extracellular cleft (Ahuja et al. 2015).

Figure 10.

Figure 10.

GX-936 binding pocket located at the VSD4 domain of the human NaV1.7 (PDB: 5EK0).

4 Discussion

The identification and prioritization of protein binding pockets is a critical step in small-molecule drug discovery. While numerous computational methods exist, practical workflows often face challenges, particularly when focusing on biologically validated binding sites relevant for virtual screening or druggability assessment, rather than discovering entirely novel pockets. Selecting and defining these known sites typically involves a laborious process: manually analyzing multiple research papers to identify key residues, dealing with inconsistent terminologies and data presentation across publications, computing volumetric representations (which are rarely provided directly), and mapping the literature-derived residue information onto these geometric pockets to filter for biological relevance. This manual workflow is time-consuming and prone to errors, especially in high-throughput industrial settings.

Although databases such as BioLip2 (Zhang et al. 2024) and Q-BioLip (Wei et al. 2024) provide a convenient interface for exploring the binding pockets of protein structures in PDB, they mark the residues as belonging to the pocket by using a distance to the crystallized ligand. This rough approximation does not take into account transient dynamic contacts and misses the residues of larger pockets crystallized with small ligands inside. It is also not applicable if no co-crystallized ligands are available. Furthermore, BioLip only uses literature information from the PubMed abstracts to confirm the biological relevance of the ligand itself without parsing any information about the pocket residues. In contrast, we aim to identify the binding pockets by collecting information about their residues from the literature, regardless of the availability of co-crystallized ligands. This includes information obtained by such techniques as mutational scanning or fluorescent probing and aggregates information for multiple ligands of different sizes and chemical classes, which is not possible when relying on PDB structures only. That is why our approach is potentially more robust and covers a large set of target proteins without crystallized ligands, which occur routinely in practical drug discovery scenarios.

The integration of our technique with BioLip is possible and may be useful for validating residues extracted by LLM and retrieving activity constants or other cross-referenced data present in the database. The general approach, in this case, would be to use the BioLip binding pocket as a starting point and to extend it with information about the residues extracted from the papers. Although such a combined approach might be of interest in certain cases, it is beyond the scope of the current work.

This article investigates the potential of automating this workflow using a hybrid approach combining geometry-based pocket detection with LLMs for literature analysis and pocket validation. We aimed to determine if current LLM technology can effectively substitute human researchers in the binding pocket validation and prioritization tasks outlined above. Our proposed pipeline integrates Fpocket for initial geometric pocket prediction with an LLM (specifically, GPT-4o-mini) for extracting and validating binding site information from scientific literature, followed by algorithmic mapping, merging, and refinement steps to produce accurate volumetric pocket representations.

A direct comparison with human researchers is not feasible because we concentrated on the time and accuracy of LLM-based pocket extraction and mapping. While a competent researcher can achieve near-perfect results with sufficient time (typically on the scale of hours), our automated pipeline produces its results in seconds or minutes (depending on the number and complexity of articles provided). However, its accuracy is not perfect (Fig. 6). Human post-processing of the results is likely to eliminate most of the mistakes while taking much less time than the complete human-based pipeline. Reliably quantifying this would require a large cohort of human experts and a much larger dataset, which is beyond the scope of this work.

While the automated parsing of complex scientific PDFs remains a non-trivial task, we used a robust semi-automated workflow to mitigate this issue. We utilized the GROBID tool, which reliably extracts structured text, including figure and table captions, from research articles. This was followed by a light manual curation step focused on standardizing formatting and removing non-essential sections. Although this process minimizes most parsing artifacts, the model’s performance is inherently dependent on the quality of the text extraction. A key remaining challenge, common to text-based pipelines, is the inability to directly interpret data from within tables or graphical elements themselves, a task that is beyond the scope of this work.

Our results indicate that the LLM pipeline, particularly after prompt optimization and the inclusion of a refinement step, can effectively identify relevant papers and extract pocket information with high accuracy. The refinement stage proved valuable in correcting initial extraction errors, such as improper pocket splitting or missing residues. However, some challenges remain: the LLM filtering step still passes some irrelevant papers through, and accurately interpreting the nuances of binding site descriptions spread across varied writing styles and sections within papers requires careful handling. Explicit instructions were needed to ensure that LLM focused only on small-molecule binding sites.

Despite its limited number of articles describing multiple binding pockets, our dataset allows us to capture the major hurdles that arise when processing complex biological texts, such as distinguishing between multiple binding sites, correctly associating specific residue lists with the correct pocket, and resolving inconsistent terminology used to describe the same site. Although increasing the number of papers is likely to reveal some other corner cases, we did not aim to find and address all of them in this paper.

The subsequent algorithmic mapping and clustering successfully bridged the gap between literature descriptions and specific PDB structures, handling inconsistencies like sequence differences, numbering errors, and multimeric interfaces. This step demonstrated its utility in identifying equivalent pockets across different subunits, as seen in the GABAA receptor and Nav1.7 chimera examples. Nevertheless, its accuracy depends on the initial LLM extraction quality and the structural similarity between the PDB file and the experimental construct described. Multimeric proteins remain challenging in this regard due to possible confusion of residue attribution between the chains and the positioning of the pockets between multiple subunits.

Generating the final volumetric representation involves filtering Fpocket’s geometric predictions using LLM-validated residues. Converting to a grid representation and subsequent post-processing, including merging fragmented pockets and applying convex hull truncation, significantly improved the final pocket definition. This merging also effectively consolidated information from multiple publications. Still, the pipeline inherits Fpocket’s limitations, such as its occasional inability to separate closely adjacent sites, as observed with the M2 receptor.

Despite existing limitations, including the reliance on the specific LLM used (GPT-4o-mini) and the limited benchmark dataset, our hybrid approach shows considerable promise. Future work could involve exploring more advanced LLM techniques, integrating alternative geometric predictors, expanding the benchmark dataset, and developing methods to automatically resolve structural discrepancies. Overall, leveraging LLMs to automate the analysis of research literature for binding pocket identification appears to be a valuable direction, potentially streamlining the drug discovery workflows.

5 Conclusion

This study introduces a practical hybrid approach for identifying and prioritizing protein binding pockets. By integrating geometric detection with LLM-based evidence extraction, our pipeline is designed to assist researchers by automating the most laborious steps of literature analysis and pocket validation. This method effectively reduces the manual burden in high-throughput drug discovery scenarios, enabling experts to focus on the most promising, literature-supported pockets. While human oversight remains essential for complex cases, this work represents a significant step toward more efficient, semi-automated pocket identification and prioritization workflows. The benchmark dataset is made publicly available to further support future advancements in this field.

Supplementary Material

btaf449_Supplementary_Data

Contributor Information

Roman Stratiichuk, Receptor.AI Inc., London N1 7GU, United Kingdom; Department of Biophysics and Medical Informatics, Educational and Scientific Centre “Іnstitute of Biology and Medicine”, Taras Shevchenko Kyiv National University, Kyiv 01601, Ukraine.

Mykola Melnychenko, Receptor.AI Inc., London N1 7GU, United Kingdom.

Ihor Koleiev, Receptor.AI Inc., London N1 7GU, United Kingdom; Department of Physics of Biological Systems, Institute of Physics of The National Academy of Sciences of Ukraine, Kyiv 03038, Ukraine.

Taras Voitsitskyi, Receptor.AI Inc., London N1 7GU, United Kingdom; Department of Physics of Biological Systems, Institute of Physics of The National Academy of Sciences of Ukraine, Kyiv 03038, Ukraine.

Vladyslav Husak, Receptor.AI Inc., London N1 7GU, United Kingdom; Department of Cellular, Computational and Integrative Biology, The University of Trento, Povo, Trento 38123, Italy.

Nazar Shevchuk, Receptor.AI Inc., London N1 7GU, United Kingdom.

Zakhar Ostrovsky, Receptor.AI Inc., London N1 7GU, United Kingdom.

Volodymyr Bdzhola, Institute of Molecular Biology and Genetics of The National Academy of Sciences of Ukraine, Kyiv 03143, Ukraine.

Semen Yesylevskyy, Receptor.AI Inc., London N1 7GU, United Kingdom; Department of Physics of Biological Systems, Institute of Physics of The National Academy of Sciences of Ukraine, Kyiv 03038, Ukraine; Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, Prague 6 CZ-166 10, Czech Republic; Department of Physical Chemistry, Faculty of Science, Palacký University Olomouc, Olomouc 771 46, Czech Republic.

Serhii Starosyla, Receptor.AI Inc., London N1 7GU, United Kingdom.

Alan Nafiiev, Receptor.AI Inc., London N1 7GU, United Kingdom.

Author contributions

Roman Stratiichuk (Data curation [equal], Formal analysis [equal], Validation [equal]), Mykola Melnychenko (Data curation [equal], Formal analysis [equal], Investigation [lead], Software [lead], Writing—original draft [equal]), Ihor Koleiev (Data curation [equal], Validation [equal]), Taras Voitsitskyi (Data curation [equal], Validation [equal]), Vladyslav Husak (Data curation [equal], Validation [equal]), Nazar Shevchuk (Data curation [equal], Validation [equal]), Zakhar Ostrovsky (Data curation [equal], Software [equal]), Volodymyr Bdzhola (Data curation [supporting], Validation [supporting]), Semen Yesylevskyy (Formal analysis [equal], Methodology [equal], Project administration [equal], Supervision [equal], Writing—original draft [equal]), Serhii Starosyla (Conceptualization [equal], Project administration [equal], Supervision [equal]), and Alan Nafiiev (Conceptualization [lead], Project administration [lead], Supervision [equal])

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest: All authors are employees of Receptor. AI INC. S.S., A.N., and S.Y. have shares in Receptor.AI INC.

Funding

S.Y. has received funding through the grant MSMT-355/2025-16 from the Ministry of Education, Youth and Sport of the Czech Republic and through the MSCA4Ukraine project 101101923, which is funded by the European Union. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the MSCA4Ukraine Consortium as a whole nor any individual member institutions of the MSCA4Ukraine Consortium can be held responsible for them.

Data availability

The data underlying this article are available at the GitHub repository (https://github.com/receptor-ai/LLM-benchmark-dataset) and Zenodo (DOI: https://doi.org/10.5281/zenodo.15798647).

References

  1. Aggarwal R, Gupta A, Chelur V  et al.  DeepPocket: ligand binding site detection and segmentation using 3D convolutional neural networks. J Chem Inf Model  2022;62:5069–79. 10.1021/acs.jcim.1c00799 [DOI] [PubMed] [Google Scholar]
  2. Ahuja S, Mukund S, Deng L  et al.  Structural basis of Nav1.7 inhibition by an isoform-selective small-molecule antagonist. Science  2015;350:aac5464. 10.1126/science.aac5464 [DOI] [PubMed] [Google Scholar]
  3. An Y, Lim J, Glavatskikh M  et al.  In silico fragment-based discovery of CIB1-directed anti-tumor agents by FRASE-Bot. Nat Commun  2024;15:5564. 10.1038/s41467-024-49892-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Capra JA, Laskowski RA, Thornton JM  et al.  Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure. PLoS Comput Biol  2009;5:e1000585. 10.1371/journal.pcbi.1000585 [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Durrant JD, Votapka L, Sørensen J  et al.  POVME 2.0: an enhanced tool for determining pocket shape and volume characteristics. J Chem Theory Comput  2014;10:5047–56. 10.1021/ct500381c [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Ghersi D, Sanchez R.  Improving accuracy and efficiency of blind protein‐ligand docking by focusing on predicted binding sites. Proteins  2009;74:417–24. 10.1002/prot.22154 [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Graef J, Ehrt C, Rarey M.  Binding site detection remastered: enabling fast, robust, and reliable binding site detection and descriptor calculation with DoGSite3. J Chem Inf Model  2023;63:3128–37. 10.1021/acs.jcim.3c00336 [DOI] [PubMed] [Google Scholar]
  8. Jeevan K, Palistha S, Tayara H  et al.  PUResNetV2.0: a deep learning model leveraging sparse representation for improved ligand binding site prediction. J Cheminform  2024;16:66. 10.1186/s13321-024-00865-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Jiménez J, Doerr S, Martínez-Rosell G  et al.  DeepSite: protein-Binding site predictor using 3D-convolutional neural networks. Bioinformatics  2017;33:3036–42. 10.1093/bioinformatics/btx350 [DOI] [PubMed] [Google Scholar]
  10. Kandel J, Tayara H, Chong KT.  PUResNet: prediction of protein–ligand binding sites using deep residual neural network. J Cheminform  2021;13:65. 10.1186/s13321-021-00547-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Kim JJ, Gharpure A, Teng J  et al.  Shared structural mechanisms of general anaesthetics and benzodiazepines. Nature  2020;585:303–8. 10.1038/s41586-020-2654-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Krivák R, Hoksza D.  P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J Cheminform  2018;10:39. 10.1186/s13321-018-0285-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Kruse AC, Kobilka BK, Gautam D  et al.  Muscarinic acetylcholine receptors: novel opportunities for drug development. Nat Rev Drug Discov  2014;13:549–60. 10.1038/nrd4295 [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Le Guilloux V, Schmidtke P, Tuffery P.  Fpocket: an open source platform for ligand pocket detection. BMC Bioinformatics  2009;10:168. 10.1186/1471-2105-10-168 [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Liang J, Edelsbrunner H, Woodward C.  Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Sci  1998;7:1884–97. 10.1002/pro.5560070905 [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Liu Y, Yang X, Gan J  et al.  CB-Dock2: improved protein–ligand blind docking by integrating cavity detection, docking and homologous template fitting. Nucleic Acids Res  2022;50:W159–64. 10.1093/nar/gkac394 [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Murphy JM, Lucet IS, Hildebrand JM  et al.  Insights into the evolution of divergent nucleotide-binding mechanisms among pseudokinases revealed by crystal structures of human and mouse MLKL. Biochem J  2014;457:369–77. 10.1042/BJ20131270 [DOI] [PubMed] [Google Scholar]
  18. Tian W, Chen C, Lei X  et al.  CASTp 3.0: computed atlas of surface topography of proteins. Nucleic Acids Res  2018;46:W363–7. 10.1093/nar/gky473 [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Wei H, Wang W, Peng Z  et al.  Q-BioLiP: a comprehensive resource for quaternary structure-based protein–ligand interactions. Genomics Proteomics Bioinform  2024;22:qzae001. 10.1093/gpbjnl/qzae001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Wei J, Wang X, Schuurmans D  et al. Chain-of-thought prompting elicits reasoning in large language models. arXiv, arXiv:2201.11903, 2023, preprint: not peer reviewed.
  21. Xia C-Q, Pan X, Shen H-B.  Protein-Ligand binding residue prediction enhancement through hybrid deep heterogeneous learning of sequence and structure data. Bioinformatics  2020;36:3018–27. 10.1093/bioinformatics/btaa110 [DOI] [PubMed] [Google Scholar]
  22. Yao S, Yu D, Zhao J  et al. Tree of thoughts: deliberate problem solving with large language models. arXiv, arXiv:2305.10601, 2023, preprint: not peer reviewed.
  23. Yesylevskyy S. MolAR: memory‐safe library for analysis of MD simulations written in rust. J Comput Chem 2025;46:e27536. 10.1002/jcc.27536 [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Zhang C, Zhang X, Freddolino L  et al.  BioLiP2: an updated structure database for biologically relevant Ligand-Protein interactions. Nucleic Acids Res  2024;52:D404–12. 10.1093/nar/gkad630 [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Zhao Y, He S, Xing Y  et al.  A point cloud graph neural network for protein–ligand binding site prediction. Int J Mol Sci  2024;25:9280. 10.3390/ijms25179280 [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

btaf449_Supplementary_Data

Data Availability Statement

The data underlying this article are available at the GitHub repository (https://github.com/receptor-ai/LLM-benchmark-dataset) and Zenodo (DOI: https://doi.org/10.5281/zenodo.15798647).


Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES