
This is a preprint. It has not yet been peer reviewed by a journal.


bioRxiv [Preprint]. 2025 Jun 18:2025.06.14.659707. [Version 1] doi: 10.1101/2025.06.14.659707

Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction

Saro Passaro 1,2,*, Gabriele Corso 1,2,*, Jeremy Wohlwend 1,2,*, Mateo Reveiz 1,2,*, Stephan Thaler 3,4,*, Vignesh Ram Somnath 5, Noah Getz 1,2, Tally Portnoi 1,2, Julien Roy 3,4, Hannes Stark 1,2, David Kwabi-Addo 1,2, Dominique Beaini 3,4, Tommi Jaakkola 1,2, Regina Barzilay 1,2
PMCID: PMC12262699  PMID: 40667369

Abstract

Accurately modeling biomolecular interactions is a central challenge in modern biology. While recent advances, such as AlphaFold3 and Boltz-1, have substantially improved our ability to predict biomolecular complex structures, these models still fall short in predicting binding affinity, a critical property underlying molecular function and therapeutic efficacy. Here, we present Boltz-2, a new structural biology foundation model that exhibits strong performance for both structure and affinity prediction. Boltz-2 introduces controllability features including experimental method conditioning, distance constraints, and multi-chain template integration for structure prediction, and is, to our knowledge, the first AI model to approach the performance of free-energy perturbation (FEP) methods in estimating small molecule–protein binding affinity. Crucially, it achieves strong correlation with experimental readouts on many benchmarks, while being at least 1000× more computationally efficient than FEP. By coupling Boltz-2 with a generative model for small molecules, we demonstrate an effective workflow to find diverse, synthesizable, high-affinity binders, as estimated by absolute FEP simulations on the TYK2 target. To foster broad adoption and further innovation at the intersection of machine learning and biology, we are releasing Boltz-2 weights, inference, and training code under a permissive open license, providing a robust and extensible foundation for both academic and industrial research.

1. Introduction

Complex biological processes are governed by interactions between biomolecules, including proteins, DNA, RNA, and small molecules. In this work, we introduce Boltz-2, a new foundation model for elucidating biomolecular interactions. Building on its predecessors, AlphaFold3 [Abramson et al., 2024] and Boltz-1 [Wohlwend et al., 2025], Boltz-2 improves structural accuracy across modalities, extends predictions from static complexes to dynamic ensembles, and sets a new standard in physical grounding. However, its key distinctive feature is its ability to predict binding affinity, which measures how tightly small molecules attach to proteins. This measure is critical for understanding whether a drug will act on its intended target and be potent enough to produce a therapeutic effect.

Despite its importance in drug design, in-silico affinity prediction remains an open challenge. To date, the most accurate techniques are atomistic simulations like free-energy perturbations (FEP). However, they are far too slow and expensive to be used at scale. Faster methods, such as docking, are not precise enough to give a reliable signal. In fact, no AI-based model has yet matched the accuracy of FEP methods or laboratory assays for binding affinity prediction.

Boltz-2 overcomes this long-standing performance/compute time trade-off. This advancement builds on two complementary developments: data curation and representation learning. Finding the right training signal for this task is a known barrier. While large amounts of binding data are publicly available, in their raw form they are not suitable for training due to experimental differences and noise. To this end, we standardized millions of biochemical assay measurements, tailoring data curation, sampling and supervision to extract the useful signal from the data.

In terms of representation learning, affinity prediction builds on the latent representation driving the cofolding process. This representation inherently encodes rich information about biomolecular interactions. Therefore, Boltz-2’s improvements in binding affinity prediction are driven by advances in structural modeling. These stem from: (1) extending training data beyond static structures to include experimental and molecular dynamics ensembles; (2) significantly expanding distillation datasets across diverse modalities; and (3) enhancing user control through conditioning on experimental methods, user-defined distance constraints, and multi-chain template integration.

The power of Boltz-2 to accurately predict affinity is evident in multiple discovery contexts:

  • Hit-to-lead and lead optimization: Boltz-2 significantly outperforms deep learning baselines on the FEP+ benchmark [Ross et al., 2023] and approaches the accuracy of FEP-based methods, while being over 1000 times faster (see Figure 1). On the CASP16 affinity track, retrospective evaluation shows that Boltz-2 outperforms all submitted competition entries out of the box.

  • Hit discovery: The model discriminates binders from decoys in high-throughput screens and achieves substantial enrichment gains on the MF-PCBA benchmark [Buterez et al., 2023], outperforming both docking and machine learning (ML) approaches.

  • De-novo generation: Coupled with a generative model [Cretu et al., 2024], Boltz-2 enables discovery of new binders. In a prospective screening against the TYK2 target, this pipeline is able to generate diverse, synthesizable, high-affinity binders, as estimated by absolute binding free energy (ABFE) simulations [Wu et al., 2025].

Figure 1:

Boltz-2 presents a strong accuracy / speed trade-off for affinity prediction. Plot based on a 4-target subset (CDK2, TYK2, JNK1, P38) of the protein-ligand-benchmark [Hahn et al., 2022] for which baseline data are available for all methods. Full results in Figure 6.

Compared to Boltz-1, Boltz-2 improves crystallographic structure prediction across modalities, with notable gains on challenging targets such as antibody–antigen complexes. When benchmarked against molecular dynamics simulations, Boltz-2 matches the performance of recent specialized models, such as AlphaFlow [Jing et al., 2024] and BioEmu [Lewis et al., 2025], in predicting key dynamic properties like Root Mean Square Fluctuation (RMSF).

Alongside this manuscript, we are releasing Boltz-2’s model weights, inference pipeline and training code under a permissive open license. By making Boltz-2 freely available, we aim to accelerate progress across both academic and industrial efforts on tackling complex diseases and designing novel biomolecules. We also hope Boltz-2 will serve as a robust and extensible foundation for the growing machine learning community working at the interface of computation and biology, catalyzing further innovation in structure prediction, molecular design, and beyond.

2. Data

Aggregating and curating data are two of the most important steps in training strong foundation models. In this section, we summarize the training datasets and the key decisions made during data collection and preprocessing. Additional details are provided in Appendix A.

Structural Data

For the structure model, we increased the diversity of biomolecules and data sources compared to Boltz-1. Unlike Boltz-1, which trained on a single structure per system, we supervise Boltz-2 using ensembles coming from both experimental techniques, such as NMR, and computational ones, such as molecular dynamics. The experimental training data comprises structures in the Protein Data Bank (PDB) [Berman et al., 2000] released before 2023-06-01. For molecular dynamics, we collected poses from the trajectories released as part of three large-scale open efforts: MISATO [Siebenmorgen et al., 2024], ATLAS [VanderMeersche et al., 2024], and mdCATH [Mirarchi et al., 2024]. Our goal is to expose Boltz-2 not only to single equilibrium points from crystal structures but also to local fluctuations and global structural ensembles.

To further improve the model’s understanding of local dynamics, we supervise the model’s single representation at the end of the trunk of the architecture to predict B-factors coming from both experimental methods as well as molecular dynamics trajectories.

In addition, we employ distillation to increase the size and diversity of the training data and its supervision signal. Distillation augments the original training set with high-confidence outputs from other models. Specifically, we use high-confidence AlphaFold2 predictions on single-chain monomers [Varadi et al., 2022], like many previous models. Additionally, we employ high-confidence Boltz-1 predictions across a wide variety of systems, including single-chain RNA, protein–DNA, ligand–protein, MHC–peptide, and MHC–peptide–TCR complexes.

Binding Affinity Data

Millions of binding affinity data points have been publicly released on central databases, such as PubChem [Kim et al., 2023] or ChEMBL [Zdrazil et al., 2024]; however, they have been notoriously difficult to combine into a single dataset for training due to variations in protocols and experimental noise [Landrum and Riniker, 2024].

Our data curation strategy focuses on: (1) retaining only the higher-quality assays, (2) mitigating overfitting to data biases by, for example, generating synthetic decoys, (3) ensuring structural quality by filtering targets with low confidence score, and (4) applying PAINS (pan-assay interference compounds) filters [Baell and Holloway, 2010] and discarding ligands with more than 50 heavy atoms.

Binding affinity predictions support two distinct tasks: hit discovery, where the goal is to identify likely binders across large chemical libraries, and hit-to-lead or lead optimization, where fine-grained affinity differences guide compound refinement. These use cases place different demands on the data: the former demands large-scale binary labeled data that distinguishes actives from inactives, while the latter requires precise, quantitative affinity measurements to resolve subtle activity differences. To support both settings, we curate a hybrid dataset comprising both binary and continuous labels. A summary of the resulting data is shown in Table 1.

Table 1:

Summary statistics of the affinity training dataset used in our model. Each row corresponds to a different data source or curation strategy. The table reports the number of binders, decoys, unique protein clusters at 90% sequence identity (referred to as Targets in the table), and compounds. Supervision indicates whether the data is used to supervise the binary and/or affinity value head. Values in parentheses show the corresponding statistics prior to applying the structural quality filter, which excludes examples with ipTM below 0.75.

Source | Type | Supervision | # Binders | # Decoys | # Targets | # Compounds
ChEMBL and BindingDB | optimization | values | 1.2M (1.45M) | 0 | 2k (2.5k) | 600k (700k)
PubChem small assays | hit-discovery | both | 10k (13k) | 50k (70k) | 250 (300) | 20k (25k)
PubChem HTS | hit-discovery | binary | 200k (400k) | 1.8M (3.5M) | 300 (500) | 400k (450k)
CeMM Fragments | hit-discovery | binary | 25k (45k) | 115k (200k) | 1.3k (2.5k) | 400 (400)
MIDAS Metabolites | hit-discovery | binary | 2k (3.5k) | 20k (35k) | 60 (100) | 400 (400)
ChEMBL and BindingDB | synthetic decoys | binary | 0 | 1.2M (1.45M) | 2k (2.5k) | 600k (700k)

For the binding affinity regression values (e.g., Ki, Kd, IC50, AC50, EC50, XC50), we gather data from PubChem [Kim et al., 2023], ChEMBL [Zdrazil et al., 2024], and BindingDB [Liu et al., 2007]. We retain only assays that target a single protein and are categorized as either biochemical or functional, excluding any labeled as low-confidence or unreliable. All affinity values are standardized to log10 scale, derived from values measured in μM. Assays with insufficient data or a low affinity standard deviation are discarded to encourage learning of intra-assay, rather than inter-assay, differences in values.
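The standardization described above is a straightforward unit conversion followed by a log transform. As a minimal sketch (the function name, unit table, and exact set of supported units are illustrative, not Boltz-2's actual code):

```python
import math

# Conversion factors from common assay units to micromolar (μM); illustrative set.
_TO_UM = {"M": 1e6, "mM": 1e3, "uM": 1.0, "nM": 1e-3, "pM": 1e-6}

def standardize_affinity(value: float, unit: str) -> float:
    """Convert a measured affinity (e.g., Ki, Kd, IC50) to log10 of its value in μM."""
    if value <= 0:
        raise ValueError("affinity must be positive")
    return math.log10(value * _TO_UM[unit])
```

For example, a 10 nM Kd corresponds to 0.01 μM, i.e., a standardized value of -2; lower values therefore indicate tighter binding.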

For the binary affinity classification data, we gather data from PubChem HTS (high-throughput screening) assays [Kim et al., 2023], a fragment screening dataset from CeMM [Offensperger et al., 2024], and MIDAS, a protein–metabolite interactome dataset from the University of Utah [Hicks et al., 2023]. For PubChem HTS, we retain only assays that include at least 100 compounds and exhibit a hit rate below 10%, helping to filter out noisy screens. To reduce false-positive labels introduced by HTS noise, we check for the presence of an associated quantitative affinity measurement (e.g., Ki, Kd, or XC50) in independent assays. Lastly, we augment the binary classification dataset with synthetic decoys created by shuffling binders identified in hit-to-lead screens across different targets, keeping the false-negative rate low by ensuring that each decoy has a Tanimoto similarity below 0.3 to all known binders associated with similar proteins. This expands the pool of negative examples, improves coverage of the chemical space surrounding each protein target, and helps mitigate spurious correlations present in HTS assays.
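The Tanimoto filter above can be sketched in a few lines. In practice one would compute fingerprints with a cheminformatics toolkit such as RDKit; here fingerprints are represented abstractly as sets of on-bits, and both function names are illustrative:

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def accept_decoy(candidate_fp: set, known_binder_fps: list, threshold: float = 0.3) -> bool:
    """Accept a shuffled binder as a synthetic decoy for a target only if it is
    dissimilar (Tanimoto < threshold) to every known binder of similar proteins."""
    return all(tanimoto(candidate_fp, fp) < threshold for fp in known_binder_fps)
```

Rejecting any candidate too similar to a known binder is what keeps mislabeled decoys (i.e., false negatives) rare.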

3. Architecture

As shown in Figure 2, Boltz-2’s architecture comprises four main components: the trunk, the denoising module with additional steering components, the confidence module, and the affinity module. Below, we highlight the major differences compared to the Boltz-1 and Boltz-1x architectures, mostly related to the controllability components and the affinity module. Appendix B provides a detailed description of each component.

Figure 2:

Boltz-2 model architecture diagram.

Trunk optimization

The trunk is the most resource-intensive component of the model, largely due to the pairwise stack and triangular operations. We significantly improve training and inference runtime, as well as memory consumption, by using mixed precision (bfloat16) and the trifast kernel for triangle attention. This also allows us to scale the crop size during training to 768 tokens, as done by AlphaFold3.

Physical quality

Co-folding models such as AlphaFold3, Chai-1, and Boltz-1 often produce structures with physical inaccuracies such as steric clashes and incorrect stereochemistry [Abramson et al., 2024, Buttenschoen et al., 2024]. To address this, we recently introduced Boltz-steering (as part of the Boltz-1x release) — an inference-time method that applies physics-based potentials, which improves physical plausibility without sacrificing accuracy. We also integrate this approach within Boltz-2 to obtain Boltz-2x.

Controllability

A frequent request from Boltz-1 users was more precise control over the model's predictions, allowing them to test hypotheses or incorporate prior knowledge without costly retraining or fine-tuning. To enable better controllability of the poses, we integrate three new components in Boltz-2: method conditioning, template conditioning and steering, and contact and pocket conditioning. Method conditioning allows users to specify the type of structure determination method (e.g., X-ray crystallography, NMR, or molecular dynamics) that the predictions should align with, capturing the many nuances of each method (see Section 5.2). Template conditioning integrates structures of similar complexes, helping the model without retraining [Jumper et al., 2021]. Unlike previous approaches, we allow users to either enforce strict observance of the templates via steering or use soft conditioning as in previous methods. As a departure from previous work, our templating approach also natively supports multimeric templates. Finally, contact and pocket conditioning allow users to specify particular distance constraints, whether derived from experimental techniques or human intuition.

Affinity module

The affinity module consists of a PairFormer and two heads: one predicting binding likelihood, the other regressing continuous affinity values. During training, we supervise the affinity value head using a mixture of related, but non-identical biochemical quantities (including Ki, Kd, and IC50), all converted to the logarithmic scale using μM as the standardized unit. While some of these measures are related through the Cheng–Prusoff equation, they arise from different experimental contexts. As such, the predicted value should be viewed as a general measure of binding strength that supports ranking and can be approximately interpreted as an IC50-like value. The module operates on Boltz-2's structural predictions, leveraging the pair representation and the predicted coordinates refined by a PairFormer model focused exclusively on the protein–ligand and intra-ligand interactions. These interactions are then aggregated to produce both a binding likelihood and an affinity value.

4. Training

The training of the model can be divided into three phases: structure training, confidence training, and affinity training. We further discuss how we use Boltz-2 to train a generative model for efficient exploration of the synthesizable chemical space. Full details on these components can be found in Appendix C.

Structure and Confidence training

The structure and confidence training largely follows Boltz-1, with a few exceptions. (1) Computational optimizations allowed us to train the model for more iterations and larger crops. (2) Ensembles from experimental methods and molecular dynamics were supervised with an aggregated distogram to reduce variance. (3) The trunk’s final representation was also supervised to predict the B-factor of each token.

Affinity training

Affinity training is performed after structure and confidence training, with gradients detached from the trunk. The pipeline incorporates several key components designed to improve generalization and scalability: pre-computation and cropping of binding pockets to focus on the most relevant interactions, pre-processing of trunk representations, and a custom sampling strategy that balances binders and decoys while prioritizing informative, high-contrast assays. Batches are constructed to focus on local chemical variation. Supervision is applied jointly across binary and continuous affinity tasks using robust loss functions designed to mitigate the effects of experimental noise and assay heterogeneity. Continuous values are supervised using a Huber loss applied to both absolute affinity values and, with stronger weight, to the pairwise intra-assay differences. We observed best performance when training a single affinity value head on all available affinity measurements (e.g., Ki, Kd, IC50, AC50, EC50, and XC50). Although these metrics reflect different underlying biochemical quantities, Ki and IC50 values are related through the Cheng–Prusoff equation, and when comparing affinity values within the same assay, the pairwise-differences loss effectively cancels out the correction term, so assays can be combined [Ross et al., 2023]. Binary classification is supervised using a focal loss [Lin et al., 2017] to address class imbalance and reduce overfitting. The final training objective is a weighted combination of the classification and regression losses, designed to balance the different tasks.
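The loss structure described above can be sketched as follows. This is a minimal pure-Python illustration of the standard Huber and focal losses and of the intra-assay pairwise-difference term; the function names and the relative weights (w_abs, w_pair) are illustrative, not the values used to train Boltz-2:

```python
import math

def huber(err: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic near zero, linear in the tails (robust to assay noise)."""
    a = abs(err)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def focal_loss(p: float, label: int, gamma: float = 2.0) -> float:
    """Focal loss for binary labels; down-weights easy, well-classified examples."""
    pt = p if label == 1 else 1.0 - p
    return -((1.0 - pt) ** gamma) * math.log(max(pt, 1e-12))

def affinity_regression_loss(pred, target, w_abs=1.0, w_pair=2.0):
    """Huber on absolute log-affinity values plus a (more strongly weighted) Huber
    on all pairwise differences within one assay; assay-level offsets such as the
    Cheng-Prusoff correction cancel in the difference term."""
    n = len(pred)
    loss = w_abs * sum(huber(p - t) for p, t in zip(pred, target)) / n
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if pairs:
        loss += w_pair * sum(
            huber((pred[i] - pred[j]) - (target[i] - target[j])) for i, j in pairs
        ) / len(pairs)
    return loss
```

Note how a constant offset added to every prediction in an assay changes the absolute term but leaves every pairwise difference unchanged, which is why heterogeneous assays can share one value head.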

Training a molecular generator with Boltz-2

As part of our evaluation, Boltz-2 is used to train a molecular generator to produce small molecules with high binding scores. Our generative agent (SynFlowNet [Cretu et al., 2024]) employs a GFlowNet [Bengio et al., 2021] loss function, enabling it to sample from arbitrary and multi-modal score distributions. Within this framework, the molecular generator undergoes off-policy training: batches of candidate molecules are asynchronously submitted to Boltz-2 workers for scoring, and the results are then incorporated into a replay buffer for the generative agent. The binding score (reward) for the agent is a strictly positive metric derived from a combination of both the binding likelihood and affinity values predicted by Boltz-2. The training procedure also incorporates basic drug-likeness properties through medicinal chemistry filters.
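The reward construction above requires only that the score be strictly positive, since GFlowNets sample proportionally to reward. The exact combination used in our pipeline is detailed in the appendix; the sketch below shows one illustrative form (the function name and the alpha parameter are hypothetical) that multiplies the predicted binding likelihood by an exponential bonus for stronger predicted affinity:

```python
import math

def binding_reward(p_bind: float, log10_affinity_um: float, alpha: float = 1.0) -> float:
    """Strictly positive reward for a GFlowNet agent: the predicted binding
    likelihood scaled by an exponential bonus for stronger (more negative
    log10 μM) predicted affinity. Illustrative form, not Boltz-2's exact reward."""
    return max(p_bind, 1e-6) * math.exp(-alpha * log10_affinity_um)
```

A strictly positive, monotone reward like this lets the agent sample from a multi-modal score landscape rather than collapsing onto a single optimum.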

5. Evaluation

In this section, we evaluate Boltz-2 in various settings, including crystal structure prediction, local protein dynamics, binding likelihood and affinity prediction, and virtual screening. For the affinity measurements, all cross-assay averages are weighted by the number of compounds in the assay, and error bars are computed as the bootstrap standard deviation.
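The aggregation protocol above can be made concrete with a short sketch: a compound-count-weighted mean of per-assay Pearson correlations, with uncertainty from bootstrap resampling over assays (function names are illustrative; the paper's exact bootstrap scheme may differ in detail):

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = math.sqrt(sum((a - mx) ** 2 for a in x))
    vy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (vx * vy)

def weighted_mean_r(assays):
    """Average per-assay Pearson r, weighted by the number of compounds per assay.
    `assays` is a list of (predictions, targets) tuples, one entry per assay."""
    num = sum(len(p) * pearson(p, t) for p, t in assays)
    den = sum(len(p) for p, _ in assays)
    return num / den

def bootstrap_sd(assays, n_boot=1000, seed=0):
    """Bootstrap standard deviation of the weighted mean, resampling whole assays."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [assays[rng.randrange(len(assays))] for _ in assays]
        stats.append(weighted_mean_r(sample))
    m = sum(stats) / n_boot
    return math.sqrt(sum((s - m) ** 2 for s in stats) / n_boot)
```

Resampling at the assay level (rather than the compound level) reflects that assays, not individual compounds, are the independent units in cross-assay evaluation.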

5.1. Boltz-2 improves over Boltz-1 on structure prediction

PDB evaluation set.

We evaluated the performance of Boltz-2 (and its version with physicality steering potentials enabled, Boltz-2x), comparing it against Boltz-1, Chai-1 [Chai et al., 2024], ProteinX [Chen et al., 2025], and AlphaFold3, across a wide variety of complexes submitted to the Protein Data Bank in 2024 and 2025 that were significantly different from any structure any of the models had seen in their training sets. The results, presented in Figure 3, show that, across modalities, Boltz-2 matches or moderately improves over the performance of Boltz-1. Among the modalities where the improvements are strongest are RNA chains and DNA–protein complexes. These are the two modalities where we most significantly augmented the available PDB data with large distillation sets, suggesting that the distillation strategy could be important for improving these models beyond what available experimental data allows. Compared to other methods, Boltz-2 performs competitively, edging out the other commercially available models Chai-1 and ProteinX, while lagging slightly behind AlphaFold3. As expected, thanks to Boltz-steering, Boltz-1x and Boltz-2x obtain significantly better physicality metrics both for small-molecule conformations and for steric clashes at interfaces.

Figure 3:

Evaluation of the performance of Boltz-2 against existing co-folding models on a diverse set of unseen complexes. Error bars indicate 95% confidence intervals.

Antibody benchmark.

One modality where researchers have highlighted a performance gap between AlphaFold3 and the commercially-available models is antibody-antigen structure prediction, especially when looking at the generalization to unseen antigens. This observation is also reflected in the results from our antibody benchmark shown in Figure 4. However, we also observe a moderate improvement of Boltz-2 over Boltz-1, narrowing the gap between the proposed open models and proprietary ones, such as AlphaFold3.

Figure 4:

Left: Performance of different co-folding methods on a challenging antibody benchmark. Boltz-2 shows an improvement over Boltz-1 while still lagging behind AlphaFold3. Right: Retrospective results for the Polaris-ASAP competition, with Boltz-2 matching the performance of the top 5 contenders without any fine-tuning or physics relaxation. Error bars indicate 95% confidence intervals.

Polaris-ASAP challenge (SARS-CoV-2 and MERS-CoV)

We further evaluated the model on the recent Polaris-ASAP Discovery competition on ligand pose estimation. This benchmark comprised ligands bound to either the SARS-CoV-2 or the MERS-CoV main protease, generated by ASAP Discovery as part of their antiviral drug discovery campaigns. On top of the PDB, 770 additional structures of similar ligands bound to these proteins were provided as a training set. This challenge saw a clear success of co-folding models over more traditional physics-based and ML tools, with all of the top-6 entries being fine-tuned Boltz-1 or AlphaFold3 models (some with additional physics-based relaxation). Boltz-2 shows a clear improvement over Boltz-1 and matches the top performers in the challenge, without any fine-tuning or physics-based relaxation (Figure 4, right).

5.2. Boltz-2 can better capture local protein dynamics

To validate the impact of MD method conditioning and evaluate the model's ability to capture local dynamics of protein structures, we evaluated Boltz-2 on the held-out clusters of the mdCATH and ATLAS datasets. The results, presented in Figure 5 and Appendix E.1, show that (1) MD conditioning has a clear effect on the predicted ensembles, leading to more diverse structures that better capture the conformational diversity of the simulations, and (2) Boltz-2 with MD conditioning is competitive on various metrics with specialized models such as BioEmu [Lewis et al., 2025] and AlphaFlow [Jing et al., 2024]. Looking at RMSF, a standard measure of local dynamics, Boltz-2 MD ensembles generally obtain stronger correlations with the ground-truth simulation and lower errors than Boltz-1, BioEmu, and AlphaFlow. In addition to training on MD ensembles, Boltz-2's performance may also benefit from supervision on both experimental and computational B-factor estimates, which are specifically designed to capture local structural dynamics. Looking at recall lDDT, Boltz-2 modestly outperforms Boltz-1 and improves over AlphaFlow and BioEmu. Conditioning on MD allows Boltz-2 to increase the diversity of its samples while retaining precision; BioEmu and AlphaFlow, however, produce still greater diversity, aligning more closely with the reference diversity of the simulations.
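RMSF, the headline metric above, has a standard definition: the per-residue root-mean-square deviation from that residue's mean position over the ensemble. A minimal pure-Python sketch (the function name and list-of-frames data layout are illustrative):

```python
import math

def rmsf(trajectory):
    """Per-residue root mean square fluctuation over an ensemble.
    `trajectory` is a list of frames; each frame is a list of (x, y, z)
    coordinates, one per residue (assumed already aligned to a reference)."""
    n_frames = len(trajectory)
    n_res = len(trajectory[0])
    out = []
    for i in range(n_res):
        # Mean position of residue i across all frames.
        mean = [sum(frame[i][d] for frame in trajectory) / n_frames for d in range(3)]
        # Mean squared deviation from that mean position.
        msd = sum(
            sum((frame[i][d] - mean[d]) ** 2 for d in range(3)) for frame in trajectory
        ) / n_frames
        out.append(math.sqrt(msd))
    return out
```

Comparing predicted and simulated ensembles then reduces to correlating these per-residue profiles, which is what the per-target RMSF correlation in Figure 5 reports.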

Figure 5:

Per-target RMSF correlation metrics and lDDT metrics on the held-out clusters from the mdCATH and ATLAS molecular dynamics datasets.

5.3. Boltz-2 approaches FEP accuracy on public benchmarks

Accurately ranking analogues within a chemical series is a critical challenge in hit-to-lead and lead optimization. Distinguishing subtle differences in binding affinity among closely related analogues is essential for guiding molecular refinement and progressing candidates through the pipeline. Traditional free energy simulation methods can often offer the required precision, but are too computationally expensive for more widespread use. Boltz-2 addresses this problem as it allows accurate affinity predictions at a fraction of the computational cost, enabling rapid prioritization in structure-guided optimization workflows.

To evaluate Boltz-2’s affinity prediction ability, we benchmarked it across a suite of hit-to-lead and lead-optimization datasets. Summary results are presented in Figure 6, while expanded tables and scatter plots are available in Appendices E.2.1 and E.2.2.

Figure 6:

Pearson correlation averaged over each assay on our four affinity value test sets. Error bars represent bootstrap estimates of the standard error.

We evaluate the model on two subsets of the FEP+ benchmark [Ross et al., 2023]: the OpenFE dataset, consisting of 876 high-quality hit-to-lead measurements [Gowers et al., 2023], and a focused 4-target subset [Hahn et al., 2022], where more physics-based baselines are available, including absolute FEP (ABFE) [Wu et al., 2025] and Fragment Molecular Orbital (FMO) [Nishimoto and Fedorov, 2016, Guareschi et al., 2023], a semi-empirical quantum mechanics-based scoring function. The training sets are filtered to exclude proteins with ≥ 90% sequence identity to any protein in the FEP+ benchmark, ensuring that we benchmark on unseen proteins. Additionally, we assess the impact of compound similarity in Figure D.2.1. On the 4-target FEP subset, Boltz-2 achieves an average Pearson correlation of 0.66, outperforming all available inexpensive physical methods and ML baselines. Remarkably, Boltz-2 approaches state-of-the-art free energy simulations, while running more than 1,000× faster, providing a strong speed-accuracy tradeoff (Figure 1). Even on the full OpenFE benchmark set, Boltz-2 approaches the performance of OpenFE, a widely adopted open-source relative FEP method.

Additionally, we include the CASP16 affinity challenge [Gilson et al., 2025], a rigorous blind benchmark featuring 140 protein–ligand pairs across two targets. Here, while participants were given several weeks and used a range of ad-hoc machine learning and physics-based tools, we ran Boltz-2 out-of-the-box with no fine-tuning or input curation. Yet, Boltz-2 outperforms all top-ranking participants by a clear margin.

We also evaluated the model on eight blinded internal assays from Recursion that reflect complex real-world medicinal chemistry projects. Here, the model still outperforms the other ML baselines by a large margin, achieving a Pearson correlation > 0.55 on 3 of the 8 assays, but has limited performance on the other 5. Such variation is also typical of FEP methods, which are known to perform weakly on some protein classes, such as GPCRs, without custom input preparation [Deflorian et al., 2020]. We include these results as a reminder that strong performance on public benchmarks does not always immediately translate to all the complexities of real-world drug discovery without further work to understand the relative strengths and weaknesses of a given approach.

5.4. Boltz-2 enables accurate large-scale virtual screening

Accurate virtual screening remains one of the most impactful challenges in early-stage drug discovery. The ideal method must scale across vast chemical libraries while reliably identifying active compounds against diverse protein targets. Boltz-2 offers a promising solution to this problem, combining speed and precision in a unified affinity prediction framework.

To assess its utility in realistic screening settings, we first evaluated Boltz-2 on retrospective benchmarks derived from the MF-PCBA dataset [Buterez et al., 2023], which includes high-quality biochemical assays spanning diverse protein families. Performance was assessed using metrics tailored to hit discovery: average precision (AP), enrichment factor at top-ranked percentiles, and AUROC. The results highlight Boltz-2's ability to retrieve actives from large, imbalanced datasets (Figure 7). On this benchmark, Boltz-2 substantially outperforms prior machine learning approaches, the widely used ipTM score, and docking, nearly doubling the average precision and achieving an enrichment factor of 18.4 at a 0.5% threshold (Table 13).

Figure 7:

Left: Average precision averaged over the assays in the MF-PCBA test set. Error bars represent bootstrap estimates of the standard error. Right: Enrichment factors, computed at top-K thresholds with K = 0.5%, 1%, 2%, and 5%.

To evaluate Boltz-2 in prospective settings, we performed a virtual screen against the kinase target TYK2, a protein well characterized in both ML and physics-based modeling benchmarks. We selected TYK2 for two main reasons. First, TYK2 is in the test set of the Boltz-2 affinity model, avoiding data leakage from known binders. Second, in the absence of experimental data, we validate the compounds selected by Boltz-2 with a single repeat of Boltz-ABFE [Wu et al., 2025], our recently developed absolute FEP pipeline for estimating ABFE values without experimental crystal structures, and Boltz-ABFE performs very well on this target. Indeed, based on the protein-ligand-benchmark [Hahn et al., 2022], Boltz-ABFE achieves a Pearson R = 0.95, a centered MAE of 0.42 kcal/mol, and a comparatively small offset of 0.92 kcal/mol with respect to experiment, supporting our confidence in this procedure as a validation step for TYK2-targeting virtual screens.

In these screens, we use a combination of the Boltz-2 predicted binding likelihood and affinity as a screen score for small molecules. We started by screening two commercially available compound libraries from Enamine: the Hit Locator Library (HLL, 460,160 compounds) and the Kinase Library (64,960 compounds). Boltz-2 successfully prioritized high-affinity ligands: based on ABFE estimates, 8 of the top 10 compounds from HLL and all 10 compounds from the Kinase Library are predicted to bind, while all 10 random compounds are predicted to be non-binders (Figure 8).

Figure 8:

Virtual screening experiment performed on the TYK2 protein. Left: The Boltz-2 screen scores of the final set of compounds of each virtual screening stream correlate (R=0.74) with the absolute binding free energy (ABFE) estimates ΔG. Right: Distribution of the ABFE-predicted ΔG for the compounds proposed by the different screening strategies.

We further extended this screening pipeline with a generative approach. Boltz-2 was coupled with SynFlowNet [Cretu et al., 2024], a GFlowNet-based molecular generator designed to sample molecules from Enamine's 76B REAL space (details in Appendices B.6 and C.3). This generative screen offers a scalable alternative to fixed libraries by exploring synthesizable chemical space beyond off-the-shelf compounds. After scoring, filtering, and diversity selection, we submitted 10 de novo candidates for ABFE simulation (see Appendix E.3). All selected compounds from the SynFlowNet stream are predicted to bind TYK2, with higher affinity on average than the fixed-library screens, while requiring a substantially smaller computational budget (117k Boltz-2 evaluations for SynFlowNet versus 460k for HLL). Visualizations of all the selected compounds and of the top-2 ABFE-scored ligand-protein complexes for each stream are presented in Appendix E.3.3. Finally, in Appendix E.3.4, we further assess the novelty of the SynFlowNet-generated compounds by examining their Tanimoto similarity to known binders from the PDB that are part of the structure module training data, and find that the generated compounds do not exhibit significant similarity to public TYK2 binders. We note that these results might be optimistic, given that Boltz-2 performs well on this target according to the protein-ligand benchmark data [Hahn et al., 2022], achieving a Pearson R = 0.83.
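The diversity-selection step and the Tanimoto novelty check can be illustrated with a toy greedy selector over set-based fingerprints. The 0.6 similarity cutoff, the set representation (RDKit bit-vector fingerprints would be used in practice), and the greedy scheme are illustrative assumptions; the paper's actual filtering and selection procedure is described in Appendix E.3:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints represented as sets
    of hashed substructure features."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def diverse_pick(scored_fps, n, max_sim=0.6):
    """Greedy selection: walk candidates from best score down, keeping a
    compound only if it stays below max_sim Tanimoto to everything kept."""
    kept = []
    for name, score, fp in sorted(scored_fps, key=lambda t: -t[1]):
        if all(tanimoto(fp, k_fp) < max_sim for _, _, k_fp in kept):
            kept.append((name, score, fp))
        if len(kept) == n:
            break
    return [name for name, _, _ in kept]

# Toy candidates: (name, screen score, fingerprint-as-feature-set).
cands = [("a", 3.0, {1, 2, 3}), ("b", 2.5, {1, 2, 3, 4}), ("c", 2.0, {7, 8, 9})]
picked = diverse_pick(cands, n=2)
```

The same `tanimoto` function, applied between a generated compound and each known PDB binder, is the shape of the novelty check: low maximum similarity to known binders indicates a genuinely new chemotype.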

Together, these results demonstrate how Boltz-2 enables structure-based prioritization at scale. By addressing both performance and scalability, Boltz-2 expands the scope of target-based in-silico optimization across hit discovery, hit-to-lead, and lead optimization.

6. Limitations

Despite the progress made in this work for structure and binding affinity prediction, we acknowledge several remaining limitations of the model that we aim to address in future work.

Molecular Dynamics

While the model shows clear improvements over Boltz-1, it does not significantly outperform other baselines such as AlphaFlow or BioEmu. The current model used a relatively small MD dataset, introduced only at the later stages of training, with minor architectural changes to account for multiple conformations. Further improving its capabilities will require changes both to the modeling approach and to the datasets used.

Remaining challenges for structure prediction

While we see a consistent improvement in structure prediction performance from Boltz-1 to Boltz-2, the model does not deviate significantly from the structure prediction performance of its predecessors. This similarity is primarily due to the use of largely identical structural training data, a similar architectural design, and outstanding limitations in predicting complex interactions, particularly within large complexes. In addition, the model still often fails to capture large conformational changes, such as those that can be induced by binding.

Accurate structures for affinity predictions

Boltz-2 relies on predicted 3D protein–ligand structures and reliable trunk features as input to the affinity module. If the model fails to identify the correct pocket, or inaccurately reconstructs the binding interface or the conformational state of the protein, downstream affinity predictions are unlikely to be reliable. This is particularly relevant in biological contexts where cofactors are essential for binding: in its current form, the affinity module does not explicitly handle such cofactors, including ions, water, or multimeric binding partners. Finally, an insufficiently large affinity crop size could be limiting if important long-range interactions are truncated, or if the crop does not include the corresponding pocket for each binder, e.g., in the case of both orthosteric and allosteric modulators.

Understanding the range of applicability of the affinity module

Despite the progress on affinity prediction, Figures 12-14 show that performance varies strongly between assays. Further work is needed to determine the source of this variance: whether it stems from inaccuracies in predicted structures, limited generalization to distinct protein families, or insufficient robustness to out-of-distribution small molecules.

7. Conclusion

We introduce Boltz-2, a new structural biology foundation model that advances the frontiers of both structure and affinity prediction. Boltz-2 builds on the co-folding capabilities of its predecessor with improved physical plausibility, fine-grained controllability, and a better understanding of local dynamics. Our results show that Boltz-2 performs competitively across a broad range of structure prediction tasks, including challenging modalities and conformational ensembles derived from MD. Crucially, Boltz-2 is, to our knowledge, the first AI model to approach the accuracy of FEP methods for predicting binding affinities on the FEP+ benchmark, while offering orders-of-magnitude gains in computational efficiency. For affinity prediction, Boltz-2 demonstrates strong retrospective and prospective performance across hit discovery, hit-to-lead, and lead optimization settings, as observed on many assays spanning public benchmarks, private benchmarks, and virtual screening workflows. Coupled with a generative model for small molecules, Boltz-2 enables an end-to-end framework for de novo binder generation, which we validate through ABFE simulations on the TYK2 protein. Despite these advances, several limitations remain, as outlined above. Addressing them will require future work on expanding and curating training data, refining the model architecture, and integrating additional biochemical context.

By releasing Boltz-2 and its training pipeline under a permissive license, we aim to support the growing community working at the intersection of AI and molecular science. We hope Boltz-2 will serve as a foundation for further advances in drug discovery, protein design, and synthetic biology, expanding the boundaries of what is computationally possible in biomolecular modeling.

Supplementary Material

Acknowledgment

We would like to thank Tim O’Donnell, Richard Qi, Ji Won Kim, Sergey Ovchinnikov, Rachel Wu, Felix Faltings, Kyle Swanson, Andreas Krause, Matteo Aldeghi, Francesco Capponi, Demitri Nava, Dylan Reid, Miruna Cretu, and Liam Atkinson for invaluable discussions and feedback around this work. We are grateful to all the members of the Boltz community, many of whom have contributed fixes, features, or helpful feedback that have helped improve the public repository and this project.

We would like to thank Recursion’s High Performance Computing team for their work in establishing, maintaining, solving, and optimizing the GPU usage through the duration of the project, most notably Caden Ellis and Joshua Fryer. We would like to thank Recursion’s physics-based team for helping evaluate poses generated by the model and benchmark the affinity against physical baselines such as docking and FEP, notably Geoff Wood, Gail Bartlett, Arnaldo Filho, Richard Bradshaw, Zhiyi Wu, as well as Thomas Grigg and Oliver Scott for helping with purchasability validation checks. We would like to thank the Valence Labs team for helping us with feedback on the data curation, model training, model evaluation, medicinal chemistry and for reading and proofing the paper, notably Austin Tripp, Michel Moreau, Cristian Gabellini, Lu Zhu, Prudencio Tossou, and Emmanuel Noutahi.

Part of the GPU resources necessary to complete the project were provided by National Energy Research Scientific Computing Center (NERSC), a Department of Energy Office of Science User Facility, via NERSC award GenAI@NERSC. The team from MIT was also supported by the Abdul Latif Jameel Clinic for Machine Learning in Health, the NSF Expeditions grant (award 1918839: Collaborative Research: Understanding the World Through Code), the DTRA Discovery of Medical Countermeasures Against New and Emerging (DOMANE) Threats program, the MATCHMAKERS project supported by the Cancer Grand Challenges partnership (funded by Cancer Research UK (CGCATF-2023/100001), the National Cancer Institute (OT2CA297463) and The Mark Foundation for Cancer Research), the Machine Learning for Pharmaceutical Discovery and Synthesis (MLPDS) consortium, the Centurion Foundation and the BT Charitable Foundation.

Footnotes

1. Code, weights and data available at https://github.com/jwohlwend/boltz.

2. Preprint available soon.

References

  1. Abramson Josh, Adler Jonas, Dunger Jack, Evans Richard, Green Tim, Pritzel Alexander, Ronneberger Olaf, Willmore Lindsay, Ballard Andrew J, Bambrick Joshua, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 2024.
  2. Baell Jonathan B and Holloway Georgina A. New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. Journal of Medicinal Chemistry, 53, 2010.
  3. Bengio Emmanuel, Jain Moksh, Korablyov Maksym, Precup Doina, and Bengio Yoshua. Flow network based generative models for non-iterative diverse candidate generation. Advances in Neural Information Processing Systems, 34, 2021.
  4. Bento AP, Hersey A, Félix E, Landrum G, Gaulton A, Atkinson, Bellis LJ, De Veij M, and Leach AR. An open source chemical structure curation pipeline using RDKit. Journal of Cheminformatics, 2020.
  5. Berman Helen M, Westbrook John, Feng Zukang, Gilliland Gary, Bhat Talapady N, Weissig Helge, Shindyalov Ilya N, and Bourne Philip E. The Protein Data Bank. Nucleic Acids Research, 28, 2000.
  6. Buterez David, Janet Jon Paul, Kiddle Steven J, and Liò Pietro. MF-PCBA: Multifidelity high-throughput screening benchmarks for drug discovery and machine learning. Journal of Chemical Information and Modeling, 63, 2023.
  7. Buttenschoen Martin, Morris Garrett M, and Deane Charlotte M. PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences. Chemical Science, 15, 2024.
  8. Case David A, Aktulga Hasan Metin, Belfon Kellon, Cerutti David S, Cisneros G Andrés, Cruzeiro Vinícius Wilian D, Forouzesh Negin, Giese Timothy J, Götz Andreas W, Gohlke Holger, et al. AmberTools. Journal of Chemical Information and Modeling, 63, 2023.
  9. Chai Discovery, Boitreaud Jacques, Dent Jack, McPartlon Matthew, Meier Joshua, Reis Vinicius, Rogozhnikov Alex, and Wu Kevin. Chai-1: Decoding the molecular interactions of life. bioRxiv, 2024.
  10. Chen Wei, Cui Di, Jerome Steven V, Michino Mayako, Lenselink Eelke B, Huggins David J, Beautrait Alexandre, Vendome Jeremie, Abel Robert, Friesner Richard A, et al. Enhancing hit discovery in virtual screening through absolute protein–ligand binding free-energy calculations. Journal of Chemical Information and Modeling, 63, 2023.
  11. Chen X, Zhang Y, Lu C, Ma W, Guan J, Gong C, Yang J, Zhang H, Zhang K, et al. Protenix: advancing structure prediction through a comprehensive AlphaFold3 reproduction. bioRxiv, 2025.
  12. Cho Yehlin, Pacesa Martin, Zhang Zhidian, Correia Bruno, and Ovchinnikov Sergey. BoltzDesign1: Inverting all-atom structure prediction model for generalized biomolecular binder design. bioRxiv, 2025.
  13. Cretu Miruna, Harris Charles, Igashov Ilia, Schneuing Arne, Segler Marwin, Correia Bruno, Roy Julien, Bengio Emmanuel, and Liò Pietro. SynFlowNet: Design of diverse and novel molecules with synthesis constraints. arXiv preprint arXiv:2405.01155, 2024.
  14. Deflorian Francesca, Perez-Benito Laura, Lenselink Eelke B, Congreve Miles, van Vlijmen Herman W. T., Mason Jonathan S., de Graaf Chris, and Tresadern Gary. Accurate prediction of GPCR ligand binding affinity with free energy perturbation. Journal of Chemical Information and Modeling, 60, 2020.
  15. Du Yuanqi, Jamasb Arian R, Guo Jeff, Fu Tianfan, Harris Charles, Wang Yingheng, Duan Chenru, Liò Pietro, Schwaller Philippe, and Blundell Tom L. Machine learning-aided generative molecular design. Nature Machine Intelligence, 6, 2024.
  16. Friesner Richard A, Banks Jay L, Murphy Robert B, Halgren Thomas A, Klicic Jasna J, Mainz Daniel T, Repasky Matthew P, Knoll Eric H, Shelley Mee, Perry Jason K, et al. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. Journal of Medicinal Chemistry, 47, 2004.
  17. Gilson Michael, Eberhardt Jerome, Skrinjar Peter, Durairaj Janani, Robin Xavier, and Kryshtafovych Andriy. Assessment of pharmaceutical protein-ligand pose and affinity predictions in CASP16. Assessment, 15, 2025.
  18. Goncharov Mikhail, Bagaev Dmitry, Shcherbinin Dmitrii, Zvyagin Ivan, Bolotin Dmitry, Thomas Paul G, Minervina Anastasia A, Pogorelyy Mikhail V, Ladell Kristin, McLaren James E, et al. VDJdb in the pandemic era: a compendium of T cell receptors specific for SARS-CoV-2. Nature Methods, 19, 2022.
  19. Gowers R. J., Alibay I., Swenson D. W. H., Henry M. M., Ries B., Baumann H. M., and Eastwood J. R. B. The Open Free Energy library (v0.14.0). 10.5281/zenodo.8344248, 2023.
  20. Guareschi Riccardo, Lukac Iva, Gilbert Ian H, and Zuccotto Fabio. SophosQM: accurate binding affinity prediction in compound optimization. ACS Omega, 8, 2023.
  21. Hahn David F, Bayly Christopher I, Boby Melissa L, Macdonald Hannah E Bruce, Chodera John D, Gapsys Vytautas, Mey Antonia SJS, Mobley David L, Benito Laura Perez, Schindler Christina EM, et al. Best practices for constructing, preparing, and evaluating protein-ligand binding affinity benchmarks [article v1.0]. Living Journal of Computational Molecular Science, 4, 2022.
  22. Hauser Maria, Steinegger Martin, and Söding Johannes. MMseqs software suite for fast and deep clustering and searching of large protein sequence sets. Bioinformatics (Oxford, England), 32, 2016.
  23. Heather James M, Spindler Matthew J, Alonso Marta Herrero, Shui Yifang Ivana, Millar David G, Johnson David S, Cobbold Mark, and Hata Aaron N. Stitchr: stitching coding TCR nucleotide sequences from V/J/CDR3 information. Nucleic Acids Research, 50, 2022.
  24. Hicks Kevin G, Cluntun Ahmad A, Schubert Heidi L, Hackett Sean R, Berg Jordan A, Leonard Paul G, Aleixo Mariana A Ajalla, Zhou Youjia, Bott Alex J, Salvatore Sonia R, et al. Protein-metabolite interactomics of carbohydrate metabolism reveal regulation of lactate dehydrogenase. Science, 379, 2023.
  25. Horton Josh. The Free Energy of Everything: Benchmarking OpenFE. https://blog.omsf.io/the-free-energy-of-everything-benchmarking-openfe, 2025.
  26. Jensen Jan H. A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space. Chemical Science, 10, 2019.
  27. Jing Bowen, Berger Bonnie, and Jaakkola Tommi. AlphaFold Meets Flow Matching for Generating Protein Ensembles, 2024.
  28. Jolma Arttu, Yin Yimeng, Nitta Kazuhiro R, Dave Kashyap, Popov Alexander, Taipale Minna, Enge Martin, Kivioja Teemu, Morgunova Ekaterina, and Taipale Jussi. DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature, 527, 2015.
  29. Jumper John, Evans Richard, Pritzel Alexander, Green Tim, Figurnov Michael, Ronneberger Olaf, Tunyasuvunakool Kathryn, Bates Russ, Žídek Augustin, Potapenko Anna, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596, 2021.
  30. Kalvari Ioanna, Nawrocki Eric P, Ontiveros-Palacios Nancy, Argasinska Joanna, Lamkiewicz Kevin, Marz Manja, Griffiths-Jones Sam, Toffano-Nioche Claire, Gautheret Daniel, Weinberg Zasha, et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Research, 49, 2021.
  31. Kanev Georgi K, de Graaf Chris, Westerman Bart A, de Esch Iwan JP, and Kooistra Albert J. KLIFS: an overhaul after the first 5 years of supporting kinase research. Nucleic Acids Research, 49, 2021.
  32. Kim Sunghwan, Chen Jie, Cheng Tiejun, Gindulyte Asta, He Jia, He Siqian, Li Qingliang, Shoemaker Benjamin A, Thiessen Paul A, Yu Bo, et al. PubChem 2023 update. Nucleic Acids Research, 51, 2023.
  33. Krishnan Sowmya Ramaswamy, Roy Arijit, and Gromiha M Michael. R-SIM: a database of binding affinities for RNA-small molecule interactions. Journal of Molecular Biology, 435, 2023.
  34. Kuzmanic Antonija, Pannu Navraj S., and Zagrovic Bojan. X-ray refinement significantly underestimates the level of microscopic heterogeneity in biomolecular crystals. Nature Communications, 5, 2014.
  35. Landrum Gregory A and Riniker Sereina. Combining IC50 or Ki values from different sources is a source of significant noise. Journal of Chemical Information and Modeling, 64, 2024.
  36. Langevin Maxime, Vuilleumier Rodolphe, and Bianciotto Marc. Explaining and avoiding failure modes in goal-directed generation of small molecules. Journal of Cheminformatics, 14, 2022.
  37. Lewis Sarah, Hempel Tim, Jiménez-Luna José, Gastegger Michael, Xie Yu, Foong Andrew Y. K., Satorras Victor García, Abdin Osama, Veeling Bastiaan S., Zaporozhets Iryna, Chen Yaoyi, Yang Soojung, Schneuing Arne, Nigam Jigyasa, Barbero Federico, Stimper Vincent, Campbell Andrew, Yim Jason, Lienen Marten, Shi Yu, Zheng Shuxin, Schulz Hannes, Munir Usman, Tomioka Ryota, Clementi Cecilia, and Noé Frank. Scalable emulation of protein equilibrium ensembles with generative deep learning, 2025.
  38. Li Min, Lu Zhangli, Wu Yifan, and Li YaoHang. BACPI: a bi-directional attention neural network for compound–protein interaction and binding affinity prediction. Bioinformatics, 38, 2022.
  39. Lin Tsung-Yi, Goyal Priya, Girshick Ross, He Kaiming, and Dollár Piotr. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
  40. Liu Tiqing, Lin Yuhmei, Wen Xin, Jorissen Robert N, and Gilson Michael K. BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic Acids Research, 35, 2007.
  41. Liu Zhihai, Li Yan, Han Li, Li Jie, Liu Jie, Zhao Zhixiong, Nie Wei, Liu Yuchen, and Wang Renxiao. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics, 31, 2015.
  42. Malkin Nikolay, Jain Moksh, Bengio Emmanuel, Sun Chen, and Bengio Yoshua. Trajectory balance: Improved credit assignment in GFlowNets. Advances in Neural Information Processing Systems, 35, 2022.
  43. McGann Mark. FRED pose prediction and virtual screening accuracy. Journal of Chemical Information and Modeling, 51, 2011.
  44. Miller Bill R III, McGee T Dwight Jr, Swails Jason M, Homeyer Nadine, Gohlke Holger, and Roitberg Adrian E. MMPBSA.py: an efficient program for end-state free energy calculations. Journal of Chemical Theory and Computation, 8, 2012.
  45. Mirarchi Antonio, Giorgino Toni, and De Fabritiis Gianni. mdCATH: A Large-Scale MD Dataset for Data-Driven Computational Biophysics. Scientific Data, 11, 2024.
  46. Mirdita Milot, Schütze Konstantin, Moriwaki Yoshitaka, Heo Lim, Ovchinnikov Sergey, and Steinegger Martin. ColabFold: making protein folding accessible to all. Nature Methods, 19, 2022.
  47. Nishimoto Yoshio and Fedorov Dmitri G. The fragment molecular orbital method combined with density-functional tight-binding and the polarizable continuum model. Physical Chemistry Chemical Physics, 18, 2016.
  48. Offensperger Fabian, Tin Gary, Duran-Frigola Miquel, Hahn Elisa, Dobner Sarah, am Ende Christopher W, Strohbach Joseph W, Rukavina Andrea, Brennsteiner Vincenth, Ogilvie Kevin, et al. Large-scale chemoproteomics expedites ligand discovery and predicts ligand behavior in cells. Science, 384, 2024.
  49. Pacesa Martin, Nickel Lennart, Schellhaas Christian, Schmidt Joseph, Pyatova Ekaterina, Kissling Lucas, Barendse Patrick, Choudhury Jagrity, Kapoor Srajan, Alcaraz-Serna Ana, et al. BindCraft: one-shot design of functional protein binders. bioRxiv, 2024.
  50. Renz Philipp, Van Rompaey Dries, Wegner Jörg Kurt, Hochreiter Sepp, and Klambauer Günter. On failure modes of molecule generators and optimizers. 2020.
  51. Rettie Stephen A, Campbell Katelyn V, Bera Asim K, Kang Alex, Kozlov Simon, Bueso Yensi Flores, De La Cruz Joshmyn, Ahlrichs Maggie, Cheng Suna, Gerben Stacey R, et al. Cyclic peptide structure prediction and design using AlphaFold2. Nature Communications, 16, 2025.
  52. Ross Gregory A, Lu Chao, Scarabelli Guido, Albanese Steven K, Houang Evelyne, Abel Robert, Harder Edward D, and Wang Lingle. The maximal and current accuracy of rigorous protein-ligand binding free energy calculations. Communications Chemistry, 6, 2023.
  53. Segler Marwin HS, Kogej Thierry, Tyrchan Christian, and Waller Mark P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science, 4, 2018.
  54. Siebenmorgen Till, Menezes Filipe, Benassou Sabrina, Merdivan Erinc, Didi Kieran, Mourão André Santos Dias, Kitel Radoslaw, Liò Pietro, Kesselheim Stefan, Piraud Marie, Theis Fabian J., Sattler Michael, and Popowicz Grzegorz M. MISATO: machine learning dataset of protein–ligand complexes for structure-based drug discovery. Nature Computational Science, 4, 2024.
  55. Steinegger Martin and Söding Johannes. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35, 2017.
  56. Sun Zhoutong, Liu Qian, Qu Ge, Feng Yan, and Reetz Manfred T. Utility of B-factors in protein science: interpreting rigidity, flexibility, and internal motion and engineering thermostability. Chemical Reviews, 119, 2019.
  57. Vander Meersche Yann, Cretin Gabriel, Gheeraert Aria, Gelly Jean-Christophe, and Galochkina Tatiana. ATLAS: protein flexibility description from atomistic molecular dynamics simulations. Nucleic Acids Research, 52, 2024.
  58. Varadi Mihaly, Anyango Stephen, Deshpande Mandar, Nair Sreenath, Natassia Cindy, Yordanova Galabina, Yuan David, Stroe Oana, Wood Gemma, Laydon Agata, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 50, 2022.
  59. Vita Randi, Blazeska Nina, Marrama Daniel, IEDB Curation Team Members, Duesing Sebastian, Bennett Jason, Greenbaum Jason, De Almeida Mendes Marcus, Mahita Jarjapu, Wheeler Daniel K, et al. The Immune Epitope Database (IEDB): 2024 update. Nucleic Acids Research, 53, 2025.
  60. Walters Pat. Generative molecular design isn't as easy as people make it look, 2024a.
  61. Walters Pat. Silly things large language models do with molecules, 2024b.
  62. Wang Huiwen. Prediction of protein–ligand binding affinity via deep learning models. Briefings in Bioinformatics, 25, 2024.
  63. Wang Lingle, Wu Yujie, Deng Yuqing, Kim Byungchan, Pierce Levi, Krilov Goran, Lupyan Dmitry, Robinson Shaughnessy, Dahlgren Markus K, Greenwood Jeremy, et al. Accurate and reliable prediction of relative ligand binding potency in prospective drug discovery by way of a modern free-energy calculation protocol and force field. Journal of the American Chemical Society, 137, 2015.
  64. Wohlwend Jeremy, Corso Gabriele, Passaro Saro, Getz Noah, Reveiz Mateo, Leidal Ken, Swiderski Wojtek, Atkinson Liam, Portnoi Tally, Chinn Itamar, Silterra Jacob, Jaakkola Tommi, and Barzilay Regina. Boltz-1: Democratizing Biomolecular Interaction Modeling, 2025.
  65. Wu Zhiyi, Koenig Gerhard, Boresch Stefan, and Cossins Benjamin. Optimizing absolute binding free energy calculations for production usage. ChemRxiv preprint chemrxiv-2025-q08ld, 2025.
  66. Yin Yimeng, Morgunova Ekaterina, Jolma Arttu, Kaasinen Eevi, Sahu Biswajyoti, Khund-Sayeed Syed, Das Pratyush K, Kivioja Teemu, Dave Kashyap, Zhong Fan, et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science, 356, 2017.
  67. Yun Seongjun, Jeong Minbyul, Kim Raehyun, Kang Jaewoo, and Kim Hyunwoo J. Graph transformer networks. Advances in Neural Information Processing Systems, 32, 2019.
  68. Zdrazil Barbara, Felix Eloy, Hunter Fiona, Manners Emma J, Blackshaw James, Corbett Sybilla, de Veij Marleen, Ioannidis Harris, Lopez David Mendez, Mosquera Juan F, et al. The ChEMBL database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Research, 52, 2024.
  69. Zhang Shengyu, Huo Donghui, Horne Robert I., Qi Yumeng, Ojeda Sebastian Pujalte, Yan Aixia, and Vendruscolo Michele. Sequence-based drug design using transformers. bioRxiv, 2023.


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
