mAbs. 2025 Aug 14;17(1):2545601. doi: 10.1080/19420862.2025.2545601

What does AlphaFold3 learn about antibody and nanobody docking, and what remains unsolved?

Fatima N Hitawala, Jeffrey J Gray
PMCID: PMC12360200  PMID: 40814020

ABSTRACT

Antibody therapeutic development is a major focus in healthcare. To accelerate drug development, significant efforts have been directed toward the in silico design and screening of antibodies, for which high modeling accuracy is necessary. To probe AlphaFold3’s (AF3) capabilities and limitations, we tested AF3’s ability to capture the fine details and interplay between antibody structure prediction and antigen docking accuracy. With one seed, AF3 achieves a 10.2% and 13.3% high-accuracy docking success rate for antibodies and nanobodies, respectively. AF3-like models Boltz-1 and Chai-1 achieve 4.08% and 0% high-accuracy rates for antibodies, and 5% and 3.33% for nanobodies, respectively. With twenty seeds, AF3 achieves a median unbound CDR H3 RMSD accuracy of 2.9 Å and 2.2 Å for antibodies and nanobodies, respectively. Both AF3-like models Boltz-1 and Chai-1 improve further on antibodies (2.08 Å and 2.71 Å, respectively), but do poorly on nanobodies (3.78 Å and 3.63 Å). CDR H3 accuracy boosts AF3 complex prediction accuracy, with antigen context improving CDR H3 accuracy, particularly for loops longer than 15 residues. Combining ipTM-HA and I-pLDDT with ΔGB improves discriminative power for correctly docked antibody and nanobody complexes. However, AF3’s 65% failure rate for antibody and nanobody docking (with single seed sampling) demonstrates a need to further improve antibody modeling tools.

KEYWORDS: Antibody docking, nanobody docking, AlphaFold3, benchmark, ranking protocol

Introduction

Antibodies (Abs) play a critical role in the immune system, and the development of antibody and nanobody therapeutics is of major interest due to their ability to target cancer, autoimmune, cardiovascular, and infectious diseases. Therapeutic advantages include their soluble nature, tunable affinity, high tolerance by the human body, and manufacturability.1 The antigen (Ag) binding interface of an antibody (nanobody) is composed of six (three) hypervariable loops, called the complementarity determining region (CDR) loops. The third loop on the heavy chain of the antibody (CDR H3) is particularly diverse and typically has the highest number of contacts with the epitope.2 CDR loops sometimes undergo conformational changes upon binding to an antigen.3 Designing antibodies is challenging, primarily due to potential off-target effects4 and the substantial time and resources required for developability testing.1 Due to the flexibility and importance of antibody CDR loops, modeling structural movement and docking is highly valuable, and significant effort has been put into developing antibody and antibody-antigen complex structure predictors.5,6

Traditional Rosetta-based antibody-antigen docking algorithms use ensembles of homology models of antibodies, and sampling of rigid backbones, loop conformation, and VH-VL relative orientations.5 This general protocol had a 20% rate of successfully docking antibody-antigen complexes.7 (Docking accuracy is quantified by binning the fraction of native residue-residue contacts and the interface and ligand RMSDs to the native structure into incorrect, acceptable, medium, and high-accuracy categories, following the standards of the Critical Assessment of Predicted Interactions (CAPRI) blind challenge. These measures have since been condensed into a single correlated score, DockQ, where a higher score indicates better accuracy.)8,9 Several other physics- and structure-based Ab docking methods have been published with similar performance.10–13 While these methods are generalizable, the calculations are time consuming. Success rates are also limited by the ability to accurately model CDR H3.
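As a minimal sketch of the binning described above, the following maps a DockQ score to a quality category using the cutoffs quoted later in this paper (0.23, 0.49, 0.80). The function names `capri_category` and `success_rates` are illustrative and are not part of the DockQ software or the authors' code; treating a score exactly at a cutoff as belonging to the higher category is an assumption.

```python
def capri_category(dockq: float) -> str:
    """Map a DockQ score in [0, 1] to a CAPRI-style quality category.

    Cutoffs follow the values quoted in the text (0.23, 0.49, 0.80).
    Scores exactly on a cutoff go to the higher category (an assumption).
    """
    if not 0.0 <= dockq <= 1.0:
        raise ValueError("DockQ must lie in [0, 1]")
    if dockq >= 0.80:
        return "high"
    if dockq >= 0.49:
        return "medium"
    if dockq >= 0.23:
        return "acceptable"
    return "incorrect"


def success_rates(scores):
    """Return (overall, high-accuracy) success rates as fractions.

    'Overall' counts every prediction that is acceptable or better.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("no scores given")
    categories = [capri_category(s) for s in scores]
    overall = sum(c != "incorrect" for c in categories) / n
    high = sum(c == "high" for c in categories) / n
    return overall, high
```

Applied to one top-ranked DockQ score per target, `success_rates` would yield the kind of overall and high-accuracy percentages reported in the Results.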

Machine learning methods are faster and have significantly improved antibody-antigen and nanobody-antigen complex prediction. There are a variety of methods, with some focusing on only structure prediction or docking,14–18 while others combine the tasks.19–21 Tested architectures have included convolutional neural networks, transformers, diffusion models, and normalizing flow models for structure and complex prediction.14,16,17,20–24 While these methods have improved protein-protein docking success rates, AlphaFold2.3-Multimer (AF2.3-M) still had a poor 20% success rate for antibody-antigen docking.7,25

While AF2 and the sub-series of models that were based on the AF2 architecture (AF2.x-M) established a robust algorithm for predicting structure from processed sequence context, the limitations in docking and structure prediction of some protein families, e.g. antibodies and nanobodies to antigens, have led to developments focused on improving the processed sequence context and sample diversity. Multiple Sequence Alignment (MSA) sub-sampling can extract conformational change information from sequence data,23 while massive sampling with increased diversity via tuned dropout rates performed the best in CASP15.26 AF3 is a culmination of many of these methods.

Until AlphaFold3 (AF3), the highest reported success rate for antibody docking was 43% by AlphaRED,25 a hybrid model using AlphaFold2-Multimer (AF2-M) predicted complexes and confidence measures with Rosetta-based replica exchange docking. Then, in May 2024, DeepMind reported a notable 60% success rate for AF3 when 1,000 seeds were sampled, with AF3 trained on the same antibody dataset as AF2-M.19,27

To understand the source of improvement and where AF3 still has limitations, here we thoroughly assess AF3’s ability to dock antibody-antigen and nanobody-antigen complexes and predict unbound antibody and nanobody structures. To discern the effects of the limited experimental structures provided in the PDB, we study the interplay of the CDR H3 loop and Ab-Ag (and Nb-Ag) docking using structures from a redundancy-filtered bespoke dataset after the 2021 AF3 training cutoff. With this dataset we similarly benchmark newer AF3-like models Boltz-128 and Chai-1.29 Using the DockQ score, we conduct an uncertainty quantification analysis to determine the best combination of confidence metrics for antibody and nanobody complexes. For reproducibility and standardized benchmarking, we make the benchmark data and code available at https://github.com/NooriFatima/AF3_AbNb_Benchmark.

Results

AF3 outperforms previous state-of-the-art antibody docking methods

To compare AF3 and AF3-like models (Boltz-1 and Chai-1) to previous state-of-the-art models, we first curated a benchmark set of bound and unbound antibodies and nanobodies from SAbDab structures filtered by AF3’s training set cutoff date, followed by quality, sequence, and structure redundancy filtering, resulting in 49 bound and 13 unbound antibodies, and 60 bound and 10 unbound nanobodies (details in Methods). We limited our analysis to Boltz-1 as Boltz-2’s training cutoff date (June 1, 2023) is after AF3’s (September 30, 2021). For every target sequence, we ran three seeds in AF3 to account for the variance generated by the diffusion model. The AF3 paper reports an overall success rate of 60% for Ab-Ag docking; however, this was obtained using 1,000 seeds.27 On the AF3 server, the daily job limit prevented us from running a greater number of seeds.

We sought to compare AF3 to its predecessors (AF2.3-M and AF2-M), AlphaRED (which previously achieved the highest reported Ab-Ag docking success rate), and newly released models with similar architectures: Boltz-1 and Chai-1. We compared AF3, AF2.3-M, Boltz-1, and Chai-1 on our curated benchmark using the top-ranked decoy from the first seed, as AlphaRED is run on a single decoy generated from AF2.3-M (ranking protocol for AF3, Boltz-1, and Chai-1 in Methods). As seen in Figure 1(A), AF3 improves over AF2.3-M in both the percentage of high-accuracy docking (DockQ ≥ 0.80) and overall docking accuracy (DockQ > 0.23). For high-accuracy docking, AF2.3-M has a low success rate of 2.4%, while AF3 has a considerably higher success rate of 10.2%. AF3’s overall success rate (DockQ > 0.23) of 34.7% also improves on AF2.3-M’s 23.4%. Surprisingly, Boltz-1 and Chai-1 perform poorly compared to both AF3 and AF2.3-M on our docking benchmark, with overall success rates of 20.4% for both, and high-accuracy success rates of 4.1% and 0% respectively. The Boltz-128 and Chai-129 papers report higher success rates; besides the different test sets, we used the default setting of three recycles, while the paper evaluations used 10 and 4, respectively. Increased recycling has been found to improve performance,19 although the ablations in the Boltz-1 paper show minimal impact above three recycles.28 Further, the three models vary in MSA depth, which has also been shown to be a key factor in prediction quality for antibody-antigen docking.30

Figure 1.

A panel of figures; the first figure depicts a stacked column plot comparing the performance of state-of-the-art docking protocols on antibody and nanobody docking. The second and sixth columns, denoting AF3 performance on antibodies and nanobodies respectively, are much higher than the first and fifth denoting AF2.3-M, the third and seventh denoting Boltz-1, and the fourth and eighth denoting Chai-1. The second, third, and fourth figures are examples of incorrect, acceptable, and high accuracy docked antibody-antigen complexes.

The success rates in antibody (Ab) and nanobody (Nb) docking for state-of-the-art models. (A) Performance of AF3 on antibody-antigen docking (N = 49) and nanobody-antigen docking (N = 60) against AF2.3-M, Boltz-1, and Chai-1 with curated dataset of recent novel complex structures. DockQ scores for the bound antibodies and nanobodies are binned into incorrect, acceptable, medium, and high categories based on CAPRI classifications, and represented in a stacked column plot.9 (B, C, D) Ab protein complex structures of example incorrect, acceptable, and high accuracy predictions. Experimental crystal structures are gray, predicted antibodies are blue, the predicted antigen is seagreen; crystal CDR H3 loop (defined by Chothia numbering) is dark gray and predicted CDR H3 loop is dark blue.

In Figure 1(B-D), we show docked structures with highlighted CDR H3 loops with increasing docking accuracy to illustrate failed and correct docking predictions in different CAPRI quality categories. In the acceptable docking accuracy example, the CDR H3 loop structure is incorrect (5.3 Å RMSD). In the high-accuracy case, the CDR H3 loop structure is almost exactly correct (0.76 Å). To understand the difference between docking performance of AF3 and the related models (Boltz-1, Chai-1), we inspected an example complex that AF3 docks with high accuracy. In Figure S1A, B, C, D we show one example docked complex by AF3, AF2.3-M, Boltz-1, and Chai-1 respectively against the ground truth structure. AF2.3-M, Boltz-1 and Chai-1 all dock the complex inaccurately, despite Boltz-1 and Chai-1 having sub-angstrom accuracy on the CDR H3 loop. In these cases, we find that all three methods struggle with accurately predicting the antigen, making docking difficult.

For nanobodies, AF3 achieves a 13.3% success rate for highly accurate complexes, with a lower overall success rate (31.6%) than for antibodies (34.7%). Boltz-1 has a better overall success rate (23.3%) and high-accuracy success rate (5.0%) for nanobodies than for antibodies, while Chai-1’s overall nanobody success rate decreases to 15.0%, with 3.3% high-accuracy docking. Docked Nb-Ag complexes shown in Figure S2 illustrate examples of incorrect and highly accurate nanobody-binding modes and the resulting structures. Again in the nanobody examples, the high-accuracy docking case has a sub-angstrom CDR H3 RMSD, while the medium and acceptable cases show more error (over 2 Å RMSD).

We then compared AF3 to AlphaRED and AF2-M on the DB5.5 docking dataset. This alternate test set was chosen (acknowledging that there is contamination of targets in the AF training sets) because it would be time-consuming to run the computationally intensive AlphaRED refinement stages on the new benchmark set. Figure S3 shows that while the overall success rates of AlphaRED and AF3 are close (44.3% and 50.8% respectively), AF3 has more high-accuracy successes compared to AlphaRED. The fact that AF3’s overall success rate improves on the DB5.5 docking dataset (50.8%) over our curated, independent benchmark set (34.7%) shows that AF3 perhaps unsurprisingly achieves a higher success rate on targets that may have been in its training set.

High docking accuracy requires high CDR H3 accuracy, but not vice-versa

The antibody maturation process improves binding affinity for an expressed antigen31 so that an unbound antibody can target the antigen by complementing the epitope.32 As the hypervariable H3 loop often makes the majority of contacts between the antibody and antigen,2 in traditional methods, correctly modeling the CDR H3 loop has been pivotal in improving docking quality.33 This effect may be stronger for nanobodies, as they do not have a light chain with which to split the contacts to the antigen.

To understand the correlation between modeling the CDR H3 loop and docking accuracy in AF3 predictions, we measured the CDR H3 RMSDs and DockQ scores for antibody and nanobody complexes across three seeds and five diffusion samples (decoys) per target from our curated benchmark and visualized their joint and marginal distributions (Figure 2). The marginal DockQ distribution (y-axis on right) is bimodal, implying that AF3 tends to dock a complex either with high accuracy or incorrectly. A CDR H3 loop RMSD threshold of 1 Å has previously been found to be sufficient to support rigid-body docking of Ab-Ag complexes.33 In the joint distribution, the predictions that are docked with high accuracy (DockQ ≥ 0.8, corresponding to the green box) are heavily clustered within the bounds of CDR H3 RMSD ≤ 1 Å (the pink box). However, predictions with CDR H3 RMSD ≤ 1.0 Å (pink box) are scattered over the full range of possible DockQ scores, implying that a correct CDR H3 loop is insufficient to push the prediction to sample the correct antigen-binding interface.

Figure 2.

A large density of points in the scatter plot spans the range of CDR H3 RMSDs in the lowest docking accuracy class, but the range constricts towards low CDR H3 RMSD as docking accuracy goes up.

Joint distribution of DockQ scores and CDR H3 loop RMSD of predicted antibody-antigen complexes with marginal distributions of both variables. CAPRI classification zones are separated by dashed lines, with the high-accuracy docking complex region (DockQ ≥ 0.8) shaded in green, the sub-angstrom CDR H3 loop RMSD region shaded in pink, and the intersection shaded in purple. The conditional probability of the CDR H3 loop RMSD being less than 1.0 Å given a highly accurate complex is the number of points in the intersection of both events (purple) over the number of total points with highly accurate docking (green). The conditional probability of a highly accurate complex given a less than 1.0 Å H3 loop RMSD is the number of points in the intersection (purple) over the number of points in the sub-angstrom H3 loop RMSD region (pink).

To quantify this observation, we calculate the conditional probabilities. To examine the effect of one variable (either CDR H3 RMSD or DockQ score) on the joint concentration of points defined by both variables, we begin by examining conditional probabilities at two specific thresholds: CDR H3 RMSD ≤ 1.0 Å (which has previously been observed to be critical for traditional antibody docking protocols) and DockQ ≥ 0.8 (the high-accuracy CAPRI cutoff). The p(CDR H3 RMSD ≤ 1.0 Å | DockQ ≥ 0.8) is 86.0%, implying that a correct CDR H3 loop is also critical for high-quality docking predictions for AF3. Similarly for nanobodies, p(CDR H3 RMSD ≤ 1.0 Å | DockQ ≥ 0.8) is 84.7%. The converse does not appear to be true: the p(DockQ ≥ 0.8 | CDR H3 RMSD ≤ 1.0 Å) is 55.8% for antibodies and 36.2% for nanobodies, showing that a correct CDR H3 loop does not guarantee high-accuracy docking.
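The two conditional probabilities are simple counting ratios over the decoys, as the Figure 2 caption describes. A minimal sketch follows, assuming each decoy is available as a (CDR H3 RMSD, DockQ) pair; the function name and the toy numbers in the usage note are illustrative, not benchmark data.

```python
def conditional_probabilities(points, rmsd_cut=1.0, dockq_cut=0.8):
    """Estimate p(RMSD <= cut | DockQ >= cut) and the converse by counting.

    points: iterable of (cdr_h3_rmsd, dockq) pairs, one per decoy.
    Returns (p_rmsd_given_dockq, p_dockq_given_rmsd); a probability is NaN
    when its conditioning set is empty.
    """
    points = list(points)
    n_low_rmsd = sum(1 for r, _ in points if r <= rmsd_cut)      # pink region
    n_high_dockq = sum(1 for _, d in points if d >= dockq_cut)   # green region
    n_both = sum(1 for r, d in points
                 if r <= rmsd_cut and d >= dockq_cut)            # purple region
    nan = float("nan")
    p_rmsd_given_dockq = n_both / n_high_dockq if n_high_dockq else nan
    p_dockq_given_rmsd = n_both / n_low_rmsd if n_low_rmsd else nan
    return p_rmsd_given_dockq, p_dockq_given_rmsd
```

For example, with the toy decoys `[(0.5, 0.9), (0.8, 0.85), (0.7, 0.3), (0.9, 0.1), (2.5, 0.82), (4.0, 0.05)]`, two of the three high-DockQ decoys have a sub-angstrom loop (2/3), but only two of the four sub-angstrom loops dock with high accuracy (1/2), which mirrors the asymmetry reported above.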

To determine the effect of different thresholds of CDR H3 RMSD on docking success rate, we plot p(DockQ ≥ D | CDR H3 RMSD ≤ T Å), where D takes the values of each CAPRI quality category cutoff (0.23, 0.49, 0.8) and T is varied along the x-axis over the range of all CDR H3 RMSDs in the dataset (Figure 3). For the incorrect cases, we plot p(DockQ < D | CDR H3 RMSD ≤ T Å), where D = 0.23 is the DockQ cutoff for incorrect docking per CAPRI standards. The probabilities of successful docking drop sharply as CDR H3 RMSD increases over the first few angstroms. The probability of an acceptable-accuracy or better complex drops below the probability of an incorrect complex at a CDR H3 RMSD value of 3.1 Å (antibodies) and 2.1 Å (nanobodies). For nanobodies, Figure S4B shows the joint and marginal distributions, where the DockQ marginal distribution is also bimodal. Figure 3 shows similar trends to antibodies for the conditionals. Supplemental files (Files S1, S2) contain the calculated probability distributions so that users can determine the stringency of the CDR H3 RMSD cutoff needed for the docking probability and DockQ score desired for their datasets.
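The threshold sweep can be sketched as follows, again assuming (CDR H3 RMSD, DockQ) pairs per decoy. Both function names are hypothetical, and the crossover rule (the first threshold at which incorrect decoys outnumber acceptable-or-better ones) is one plausible reading of the comparison above, not the authors' stated procedure.

```python
def conditional_curve(points, dockq_cut, thresholds, incorrect=False):
    """Return [(T, p)] with p = p(DockQ >= cut | RMSD <= T).

    With incorrect=True, p = p(DockQ < cut | RMSD <= T) instead.
    p is None when no decoy has RMSD <= T.
    """
    curve = []
    for t in thresholds:
        subset = [d for r, d in points if r <= t]
        if not subset:
            curve.append((t, None))
            continue
        if incorrect:
            hits = sum(d < dockq_cut for d in subset)
        else:
            hits = sum(d >= dockq_cut for d in subset)
        curve.append((t, hits / len(subset)))
    return curve


def crossover(points, thresholds, cut=0.23):
    """First threshold T where incorrect decoys outnumber successful ones."""
    for t in thresholds:
        subset = [d for r, d in points if r <= t]
        n_bad = sum(d < cut for d in subset)
        if subset and n_bad > len(subset) - n_bad:
            return t
    return None
```

Calling `conditional_curve` once per CAPRI cutoff (0.23, 0.49, 0.8), plus once with `incorrect=True`, gives the four curves of a plot like Figure 3.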

Figure 3.

A two-panel line-plot depicting the conditional probability of docking accuracy given a CDR H3 RMSD threshold for antibodies and nanobodies respectively. The probability of correct complexes drops for both antibodies and nanobodies, but increases for incorrect complexes as the CDR H3 RMSD threshold increases.

Conditional probabilities of docking accuracy per CDR H3 RMSD threshold for antibodies and nanobodies. The probabilities of correctly docked antibody and nanobody complexes are calculated for DockQ scores corresponding to acceptable (D ≥ 0.23), medium (D ≥ 0.49), and high-accuracy (D ≥ 0.80) CAPRI categories, while the probabilities of incorrect complexes are calculated via DockQ score < 0.23, covering a CDR H3 RMSD range from 0–10 Å for antibodies and nanobodies.

AF3 outperforms AF2.3-M, AF2-M, and IgFold in predicting unbound Fv structures

Considering the impact of CDR H3 loop accuracy on overall docking success, we sought to evaluate AF3’s predictive accuracy for unbound CDR H3 loops. That is, we examined the accuracy when predicting antibodies alone, without antigen. To compare AF3 against previous state-of-the-art structure prediction models, we filtered IgFold’s curated benchmark of 196 unbound antibody variable fragments and 70 unbound nanobodies14 by the new training cutoff date used in AF2.3-M and AF3 (September 30, 2021), resulting in 63 antibodies and 23 nanobodies. For the comparison we used one seed with AF3, Boltz-1, and Chai-1 and used the top-ranked decoy (ranking methodology for each model in Methods) from the five decoys predicted, staying consistent with IgFold’s evaluation method.

Figure 4 shows CDR H3 accuracies. For antibodies, AF2.3-M has the lowest median CDR H3 RMSD of the three previous state-of-the-art models at 2.83 Å (IgFold performs similarly at 3.0 Å), while AF3 achieves 2.52 Å. Boltz-1 improves the median further, to 2.08 Å, while Chai-1 does not perform as well as the other diffusion architectures, achieving a median of 2.71 Å. Thus, relative to AF2.3-M, AF3 has improved CDR H3 loop prediction by 0.31 Å, and Boltz-1 by an impressive 0.75 Å. AF3 does not significantly improve upon its predecessors for nanobody CDR H3 loops, decreasing median RMSD slightly from AF2-M’s 3.05 Å to 3.01 Å. Boltz-1 (3.78 Å) and Chai-1 (3.63 Å) fail to reach AF3’s level of performance, having CDR H3 median RMSDs closer to IgFold’s, although the ability of the statistical tests to detect differences between models is limited by the small sample sizes of the antibody and nanobody groups.

Figure 4.

A three-part panel of letter-value plots. The first panel compares the antibody structure predictor performance of six models on the CDR H3 loop. The second panel measures the RMSD of the CDR H3 loop between bound-unbound antibody-antigen complexes, and the third panel compares the CDR H3 loop RMSD values docked nanobody-antigen complexes of the same six models as the first panel.

Performance of AF3, Boltz-1, and Chai-1 on predicting unbound CDR H3 loop structures of 63 antibodies and 23 nanobodies compared to previous models. The letter-value plot represents the traditional quartiles of the data, with additional boxes representing additional density. (A) AF3 and Boltz-1 improve the median to 2.52 Å and 2.08 Å, respectively. For reference, the RMSD between experimental unbound and antigen-bound antibody CDR H3 loops has a median of 0.5 Å. (B) AF3 minimally improves on the median of AF2-M (the lowest of IgFold, AF2-M and AF2.3-M), reaching 3.01 Å. All CDRs are defined using the Chothia numbering scheme, and p ≤ 0.05 corresponds to *, p ≤ 0.01 corresponds to **, p ≤ 0.001 corresponds to ***, and p ≤ 0.0001 corresponds to ****.

To probe the effect of target similarity to the training set, we plotted CDR H3 RMSDs as a function of the edit distance (substitutions plus indels, details in Methods) between the concatenated CDRs of each evaluated data point and those of its closest training sample, including 133 additional antibody and 47 additional nanobody structures solved between the AF3 and IgFold cutoff dates (July-September 2021) with very close or identical sequences to the training set. We observe a positive trend between edit distance and CDR H3 loop RMSD for antibodies for post-AF3-cutoff structures, but not for nanobodies (Figure S5A). There is a much heavier density of sub-angstrom CDR H3 RMSD nanobodies in the pre-cutoff group (20.3%) compared to the post-cutoff group (4.2%). Binning pre- and post-cutoff sets (i.e., small and large edit distances) (Figure S5B), there is a wide range of CDR H3 RMSDs for AF3 predictions on pre-cutoff proteins. The medians of pre-cutoff samples are lower (2.02 Å (Abs), 1.80 Å (Nbs)) compared to post-cutoff evaluation data points (2.52 Å (Abs), 3.01 Å (Nbs)).
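An edit distance counting substitutions plus indels is the Levenshtein distance. A minimal sketch of the nearest-training-sample calculation follows, assuming each structure is represented by its list of CDR sequences; the helper names and the short sequences in the usage note are made up for illustration, and the exact procedure in Methods may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_training_distance(query_cdrs, training_cdrs):
    """Smallest edit distance from the query's concatenated CDRs to any
    training entry's concatenated CDRs. Raises ValueError if the training
    set is empty."""
    query = "".join(query_cdrs)
    return min(edit_distance(query, "".join(t)) for t in training_cdrs)
```

For instance, `closest_training_distance(["GFT", "ARD"], [["GYT", "ARDY"]])` returns 2 (one substitution and one insertion), while an identical training entry gives 0.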

The accuracy plateau reached by IgFold, AF2-M, and AF2.3-M suggests questions about the limit of accuracy possible for the loop, considering that it may simply be mobile in solution. AF3 and Boltz-1’s results prove that better predictions are possible. Further, a recent survey of 177 pairs of bound-unbound antibody complexes reported that in 70.6% of antibody CDR H3 loops, binding-induced conformational changes are under 1 Å.3 Figure 4 also shows the CDR H3 RMSD between antigen-bound and unbound antibodies (Xtal B-U column); these measurements suggest that antibody loop prediction may still be able to be far more accurate.

Antigen context affects antibody CDR H3 loop prediction accuracy

Using our curated benchmark set to observe how the CDR H3 loop’s structure is affected by antigen context, we compared the RMSD of all bound CDR H3 loops against all unbound CDR H3 loops, separately analyzing those of antibodies versus nanobodies. To further determine whether there are learned biases toward bound structures due to their over-representation in the training set, we also predict the experimentally bound structures without providing antigen context and then calculate their RMSDs with respect to the experimental bound conformation. We divide the RMSD into the global H3 RMSD, calculated by superposing heavy chains, and the local H3 RMSD, calculated by superposing only the CDR H3 residues. The global RMSD captures the loop shape and placement, while the local RMSD represents the loop’s shape only. As the generative nature of the model introduces high amounts of variation in the structures, the median CDR H3 loop RMSD can change from one seed to the next, so we first determine how many seeds are necessary for the median CDR H3 loop RMSD to vary minimally. We generated 100 seeds per target, calculated their RMSDs, then plotted the variance in the medians against the number of seeds. We found that 20 seeds are sufficient (Figure S6). Upon visualization of per-seed RMSD distributions of all CDR loops and framework regions for antibodies and nanobodies (Figs. S7, S8, S9, S10), we found that unbound antibody and nanobody predictions have greater per-seed RMSD variation than their bound counterparts.
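One way to sketch the seed-convergence check is to resample subsets of the available per-seed RMSDs and track how much the median fluctuates at each subset size. The resampling scheme here (random subsets without replacement, 200 repeats, a fixed random seed) is an assumption made for illustration; the source does not specify how the variance of the medians was computed.

```python
import random
import statistics


def median_variance_by_seed_count(rmsds, seed_counts, n_resamples=200, rng_seed=0):
    """For each n in seed_counts, draw n_resamples random subsets of n
    per-seed RMSD values (without replacement), take each subset's median,
    and return {n: population variance of those medians}.

    Every n must satisfy 1 <= n <= len(rmsds).
    """
    rng = random.Random(rng_seed)  # fixed seed keeps the sketch reproducible
    result = {}
    for n in seed_counts:
        medians = [statistics.median(rng.sample(rmsds, n))
                   for _ in range(n_resamples)]
        result[n] = statistics.pvariance(medians)
    return result
```

Plotting the returned variances against `n` and looking for where the curve flattens would give a criterion of the kind used above to settle on 20 seeds; when `n` equals the number of available seeds, every subset is the full set and the variance is zero by construction.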

Figure 5(A) shows that providing antigen context improves CDR H3 loop shape and position prediction accuracy for antibodies. The local (1.99 Å) and global (3.05 Å) CDR H3 RMSDs of predicted unbound structures are higher than the local (1.67 Å) and global (2.51 Å) RMSDs of predicted bound structures, with significance of p = 3.50×10⁻⁴ and p = 3.47×10⁻⁶, respectively.

Figure 5.

A two-part panel of letter-value plots. The first panel compares the local versus global CDR H3 loop RMSD of antibodies between known unbound, known bound, and bound without antigen context. The second compares the local and global CDR H3 loop RMSDs of the same set of antibody structures broken into short, medium, and long residue length bins.

Effect of antigen context and loop length on antibody CDR H3 loop prediction accuracy. (A) Top-ranked known unbound structures, bound structures predicted with antigen, known bound structures predicted without antigen with number of decoys annotated above letter-value plots. Global RMSD calculated after superposition of the VH domain; local RMSD calculated by superposing the loop residues only. (B, C) Effect of loop length and antigen context on antibody CDR H3 loop prediction accuracy. Top-ranked known unbound structures, bound structures predicted with antigen, known bound structures predicted without antigen. Short loops are defined as less than 10 residues, medium loops between 10 to 15 residues, and long loops longer than 15 residues, with number of decoys above each letter-value plot. p ≤ 0.05 corresponds to *, p ≤ 0.01 corresponds to **, p ≤ 0.001 corresponds to ***, and p ≤ 0.0001 corresponds to ****.

For nanobodies, Figure S11A shows that the local CDR H3 RMSD is improved by the antigen context when comparing median RMSDs of the known bound structures (0.80 Å) to the predicted unbound structures (1.13 Å) (p = 1.55×10⁻⁵). The loop’s position (global RMSD) also improves when provided antigen context when comparing the known bound structures to both the known (p = 1.5×10⁻³) and predicted unbound structures (p = 2.34×10⁻⁵). The local CDR H3 RMSD decreases from 1.13 Å for the predicted unbound structures to 0.80 Å for the known bound structures, but the global CDR H3 RMSD only decreases from 1.88 Å to 1.70 Å, demonstrating that for nanobodies, antigen context improves the loop’s shape to a greater extent than the position of the loop.

Previous studies report CDR H3 loop RMSD increasing as loop length increases due to the degrees of freedom granted to the loop.14 Locally, short antibody CDR H3 loops are minimally affected by the addition of antigen context, as the CDR H3 RMSD is already sub-angstrom (Figure 5(B)). For long loops, however, bound structures (1.57 Å) have lower local CDR H3 loop RMSDs compared to both known (3.34 Å) and predicted unbound (2.78 Å) structures, with significances of p = 3.47×10⁻⁶ and p = 3.47×10⁻⁶, respectively. The difference in global CDR H3 loop RMSDs is more drastic; known bound structures have a lower median CDR H3 RMSD of 3.17 Å, compared to known unbound structures (5.71 Å) and predicted unbound structures (4.85 Å), with significances of p = 6.74×10⁻⁶ and p = 7.40×10⁻³, respectively.

Nanobodies show no effect on short loops, as known bound and predicted unbound have sub-angstrom RMSDs locally and globally (Figure S11B). However, the medium-length loops surprisingly show a significantly higher (p = 8.75×10⁻⁴) global CDR H3 RMSD for known bound structures (2.65 Å) compared to known unbound structures (2.12 Å), while long loops show improvement given antigen context. Upon investigation, we found that seven out of the 36 structures in the medium length group bind to the spike glycoprotein, an antigen seen during training. However, the common epitope for these structures is inside the receptor binding domain (RBD), which is accessible only in its open conformation.34,35 AF3 repeatedly predicts the closed conformation, which removes the spatial constraints that would otherwise aid CDR H3 prediction accuracy. Further, the spike protein has multiple sites amenable to antibody and nanobody binding.34 Supporting this, we see that for antibodies and nanobodies, predicted structures whose antigens were seen during training have larger global and local CDR H3 RMSDs, as a large portion of the overlapping antigens are the spike glycoprotein (Figure S12). Thus we find that long loops benefit the most from added antigen context so long as the antigen is correctly predicted. Consistent improvement in CDR H3 loop accuracy in known bound structures compared to known bound structures predicted without antigen context implies that AF3 leverages biophysical context for better predictions.

A combination of predicted confidence interface metrics and Rosetta energies refines blind prediction scoring

Researchers need to know when to trust AF3 Ab-Ag and Nb-Ag structural models. Thus, we investigated confidence metrics and their respective cutoff values for scoring blind Ab-Ag and Nb-Ag docking using our curated benchmark set. We use the top-ranked prediction for twenty seeds for all targets to simulate the confidence filtration common in blind prediction ranking. Using twenty seeds helps ensure a sufficient dataset size for analysis.

We first analyzed the ipTM score between the heavy chain and antigen (ipTM-HA) for both antibodies and nanobodies (output by AF3), because we have found that AF3 has a strong calibration of the ipTM (Figure S13) as it is part of the training loss.27 To effectively compare the discriminative power of single and combined metrics, we use precision-recall (PR) curves and the average precision (AP) metric (see Methods) for antibodies and nanobodies (Figure 6). To determine the best threshold per confidence metric, we find where precision and recall are jointly maximized. For antibodies and nanobodies, ipTM-HA is the best single discriminatory metric, with an average precision of 0.458 (threshold of ipTM-HA ≥ 0.49) and 0.450 (threshold of ipTM-HA ≥ 0.40) respectively.
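A minimal sketch of the precision-recall analysis follows. It assumes the common step-wise definition of average precision, AP = Σ (Rₙ − Rₙ₋₁)·Pₙ over descending score thresholds, and uses the F1 score as a stand-in for "jointly maximized" precision and recall. Both choices are assumptions, since the exact definitions are in Methods and not reproduced here, and the function names are illustrative.

```python
def precision_recall_points(scores, labels):
    """Return [(threshold, precision, recall)] for each distinct score,
    highest first, where a prediction is accepted if score >= threshold.

    labels: 1/True for a correctly docked complex, 0/False otherwise.
    """
    total_pos = sum(1 for y in labels if y)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points, tp, fp, i = [], 0, 0, 0
    while i < len(pairs):
        thr = pairs[i][0]
        # Consume all predictions tied at this score before recording a point.
        while i < len(pairs) and pairs[i][0] == thr:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((thr, tp / (tp + fp), tp / total_pos))
    return points


def average_precision(scores, labels):
    """Step-wise AP: sum of (recall increase) * (precision at that step)."""
    ap, prev_recall = 0.0, 0.0
    for _, precision, recall in precision_recall_points(scores, labels):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def best_f1_threshold(scores, labels):
    """Threshold whose (precision, recall) pair maximizes F1 (an assumed
    proxy for the joint maximization described in the text)."""
    def f1(point):
        _, p, r = point
        return 2 * p * r / (p + r) if p + r else 0.0
    return max(precision_recall_points(scores, labels), key=f1)[0]
```

Here `scores` would be a confidence metric such as ipTM-HA for each top-ranked prediction and `labels` would mark whether that prediction passes a DockQ success cutoff; which cutoff defines a positive is left to the caller.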

Figure 6.

A two-panel set of precision-recall curves comparing various ranking metrics for antibody-antigen and nanobody-antigen complexes respectively.

Precision-recall curves comparing single and combined scoring metrics for Ab-Ag and Nb-Ag docking using AF3.

We conducted a similar analysis using Rosetta-based binding energies (ΔGB) (calculations in Methods), as it has been used to reliably rank protein structures.36 Due to the high structural accuracy of AF3, we calculated ΔGB directly on the predicted complexes. For both antibodies and nanobodies, ΔGB has low average precision scores of 0.188 and 0.178 respectively. Using precision and recall values, we determined the best ΔGB cutoff threshold to be −30 REU and 4 REU respectively.

In Figure 6, we show the same analysis for the interface-predicted local distance test (I-pLDDT) as I-pLDDT is part of the training loss for the full model,27 and it has been reported to be a stronger discriminator than ipTM for AlphaFold2-Multimer25,30 (Methods). We found that I-pLDDT performs the best after ipTM-HA for both antibodies and nanobodies, with average precisions of 0.368 and 0.336 and I-pLDDT cutoffs of 85 and 87, respectively.

We next analyzed the averaged pLDDT of the CDR H3 loop residues (avg. H3-pLDDT) (Methods), due to the correlation we determined between CDR H3 accuracy and docking for antibodies and nanobodies. This metric is the third best discriminatory metric, performing better for nanobodies with average precision scores of 0.287 and 0.316 for antibodies and nanobodies, respectively. The optimal cutoff is an avg. H3-pLDDT of 86 for both.

Combining scoring metrics provides the chance to tighten filtration and optimize the likelihood of sampling a correctly predicted complex. We determined the best metric combination for antibodies and nanobodies by conducting a joint optimization procedure (Methods). The resulting AP score is unaffected by shorter curves, so we directly compared single and combined metrics with the raw score.

The metric combination with the highest average precision score is ipTM-HA and ΔGB (AP = 0.412) for antibodies and I-pLDDT and ΔGB (AP = 0.743) for nanobodies, demonstrating that ΔGB provides useful complementary data to AF3’s confidence metrics to enhance discrimination. For antibodies, the optimal ipTM-HA cutoff is 0.49, combined with a ΔGB cutoff of −47 REU. For nanobodies, the combination metric cutoffs for I-pLDDT and ΔGB are 87 and −4 REU, respectively. While the AP score of the best combined metric for antibodies is lower than that of ipTM-HA by itself, the optimal operating point of the combination (R = 0.8, P = 1.0 at ipTM-HA = 0.49 and ΔGB = −47 REU) has higher precision (fewer false positives) than that of ipTM-HA alone (R = 0.88, P = 0.94 at ipTM-HA = 0.49). For nanobodies, the AP score of the combined metric is higher than that of all other metrics. Other tested thresholds for combined metrics are shown in Figure S14, where discriminative power improves as threshold stringency increases. Data files with all tested single and combined metrics, along with their precision-recall values and average precision scores, have been provided so that users can determine optimal thresholds specific to their needs (Files S3, S4).

Discussion

AF3 builds on AF2’s performance by expanding capabilities toward general chemical structures, reducing emphasis on MSA processing and replacing the residue frame-based structure module with an atomic-precision diffusion model. We demonstrate that these architectural changes help AF3 dramatically improve the success rate of high-quality docked antibody-antigen complexes from 2.4% (AF2.3-M) to 10.2%; however, AF3 (with one seed) still leaves 65% of the targets incorrect. Further, the new AF3-like models Boltz-1 and Chai-1 perform poorly on antibody and nanobody docking compared to AF3. From the bimodal distribution of DockQ scores seen in the marginal distribution for antibodies in Figure 2 and for nanobodies in Figure S4, we see that AF3 either has difficulty sampling the correct binding interfaces or generates high-quality complexes. AF3-like models, in contrast, have difficulty with accurate antigen prediction, leading to poorer performance despite accurate CDR H3 loops (Figure S1). The AF3 paper notes that during training local structures are learned quickly, while global structures are learned at a slower pace.27 The success rate of the model has been reported to increase to 60% when evaluating 1,000 seeds,27 a finding similar to the AF2 massive sampling study where artificial diversity was injected by tuning dropout rates.26 The impressive rate of high-accuracy docked complex predictions coincides with the lower CDR H3 loop RMSD of unbound and bound antibodies and nanobodies as well as the observed improvement of long loop CDR H3 prediction accuracy when provided with antigen context, implying that AF3 has learned better docking biophysics.

Despite the improvement in CDR H3 loop RMSD, the findings of Liu et al.3 indicate that loop prediction accuracy should be able to improve further. The conformational ensembles in that study3 were derived from crystal structures, which provide a limited view of protein flexibility compared to NMR or molecular dynamics-simulated structures. However, bound crystal structures have been found to predominantly express CDR H3 loop conformations that optimize antigen binding,37 and molecular dynamics studies of unbound antibodies affected by crystal packing have found that the predominant conformation is the bound conformation.37,38 This result supports the theory of conformational selection, where a binding event results in the selection of one CDR H3 loop conformation out of a preexisting set.37 Thus, these crystal-structure ensembles can approximate conformational ensembles. A limitation of our work is that after removing data contaminated with respect to AF3’s training set, an insufficient number of conformational ensembles remained. This barred us from assessing AF3’s ability to generate accurate bound-unbound conformational ensembles of antibodies and nanobodies, and thus the limits of its understanding of Ab-Ag and Nb-Ag binding.

AF3 outputs a well-calibrated bespoke rank-score in the prediction’s confidence files, but often decoys will have the same rank score, causing difficulty in evaluation. We evaluated the robustness of the confidence metrics generated by AF3 against a benchmark of sequences and structures nonhomologous to the model’s training data to see if prediction ranking could be improved using additional metrics of CDR H3 loops and biophysical interface energies. We find that combining Rosetta-based binding energies with ipTM-HA improves discrimination.

When we started this study, we were limited by the number of jobs available per day on the AF3 server and by the small dataset size due to data contamination concerns. Our results show that while AF3 is considerably more accurate in modeling antibody and nanobody structures and docked complexes than previous approaches, the 65% failure rate for both antibody and nanobody docking with single-seed predictions leaves room for improvement; even greater improvement is needed in AF3-like open-source models. Researchers can now improve upon this baseline by running more seeds on the local AF3 setup that the authors of AF3 have generously made available.39 However, we note that this improvement may not be replicable in Boltz-1 and Chai-1, as recent evaluations on antibody-antigen complexes show conflicting results in improvements from multi-seed sampling for these models.40,41 Our curated benchmark set (with predicted structures for AF3, Boltz-1, and Chai-1) is available so that new methods replicating AF3,28,29,42 and additional seed sampling in local AF3 can be tested. We now have more powerful tools than ever for Ab engineering, which opens promising avenues to design and engineer Ab therapeutics and understand immunology better.

Methods and materials

Individual structure evaluation set

To create an immunoglobulin structure prediction dataset, we pulled structures from SAbDab (May 31, 2024 for Abs, June 4, 2024 for Nbs),43 and temporally separated evaluation structures using AF3’s Sept. 30, 2021 training dataset cutoff date. We separated all antibody copies in the remaining PDBs, then conducted a quality filtration similar to that of Liu et al.,3 in which we removed PDBs that had a resolution worse than 2.8 Å or had missing residues in the CDR loops, identified by comparing the atomic sequence and the sequence residues using the Bio.Seq Python package.44 We generated ensembles of sequences by clustering with MMseqs2 at a 90% sequence identity cutoff.45 Using these ensembles, we then used the Kabsch alignment algorithm to calculate the structural redundancy of variable heavy and light chains between pairs of structures.14 We kept pairs of structures that had heavy and light chain RMSD ≥ 1 Å, but only one representative out of a pair of redundant structures. To ensure that we did not lose structures with high CDR H3 loop diversity despite similar heavy chain structures, we recovered structures from pairs with heavy chain RMSD < 1 Å but CDR H3 RMSD ≥ 1 Å. Finally, we filtered the remaining structures based on sequence redundancy against each other and against AF3’s training set, with sequence identity cutoffs of 99% and 95%, respectively. We conducted multiple sequence alignment on the heavy and light chains separately via Abalign,46 and then used a custom Python function to calculate the sequence identities. To ensure that we did not retain structures that had identical CDRs to the training set, we concatenated all CDRs of the structure, conducted an MMseqs2 search against concatenated CDRs from the AF3 training set, and removed any benchmark sequences that were identical to training data CDR combinations. The number of structures at each step of this process is shown in Figure 7.
We characterize the benchmark sequences via sequence identity of both concatenated CDRs and the CDR H3 sequence against the benchmark set and AF3’s training set, as well as antigen overlap and the distribution of light chain isotypes in the benchmark and training sets in Figure 8. We did not remove CDR H3 loops identical to those in the training data, in order to retain structural differences granted by non-CDR H3 loops.

Figure 7.

A two-part panel; the first panel shows a pyramid diagram with the number of antibody and nanobody structures remaining after every filtration step is conducted as the pyramid reaches the tip. The final number of each is in the topmost section of the plot, the second panel has stacked boxes with the number of bound antibodies above the number of unbound antibodies, next to the number of bound nanobodies above the number of unbound nanobodies.

(A) Dataset curation for evaluating antibody structure prediction, number of structures at each step. (B) Bins breaking down the number of bound and unbound structures in antibodies and nanobodies.

Figure 8.

A four-part panel describing the sequences and composition of the benchmark set compared to the training set. The first panel consists of histograms comparing the sequence identity of concatenated CDRs of benchmark sequences to training sequences of antibodies and nanobodies, with the peak of both histograms near 90%. The second panel consists of two histograms describing CDR H3 sequence similarity within the benchmark and against the training set for both antibodies and nanobodies. Similarity within the benchmark peaks at 50% sequence identity with small peaks at 100% for conformational ensembles, while the similarity to the training set peaks around 60%. The third plot compares the distribution of kappa and lambda light chain subtypes in the benchmark and training set, which are similar. The final plot shows antibodies and nanobodies having a significant portion of 100% antigen sequence identity between benchmark and training set antigens.

Dataset characterization of benchmark set. (A) Distribution of benchmark concatenated CDRs sequence identity to closest match in training set. (B) Distributions comparing the sequence identity of benchmark CDR H3 sequences against the benchmark set and AF3’s training set. (C) Count-plot comparing the percent of kappa and lambda light chain subtypes in the benchmark set and training set. (D) Histogram comparing the sequence identity of antigens in the benchmark set against those in AF3’s training set for antibodies and nanobodies respectively.

To prevent the RMSD calculations from being confounded by small hinge movements in the loop connecting the variable and constant regions, we cropped to the variable fragment region using a custom function. Finally, we renumbered the structures using the AbNum webserver47 using the Chothia scheme.

AF2.3-M predictions

We used a local ColabFold installation with AlphaFold-Multimer version 2.3 downloaded from https://github.com/YoshitakaMo/localcolabfold. We predicted a single decoy for each target and did not use templates. Similar to the curated benchmark, we cropped the predictions to the variable fragment region using a custom function and renumbered the structures using the AbNum webserver47 using the Chothia scheme.

AF3 predictions

We used the AF3 server (https://alphafoldserver.com) to generate decoys. The server generates five decoys (diffusion samples) per seed. To test the diversity produced by seeds, we predicted three seeds per target, with the seed number pre-set to one, two, or three using the JSON file upload option. We cropped the predictions to the variable fragment region using a custom function and renumbered the structures with the AbNum webserver47 using the Chothia scheme. Because we generated these AF3 predictions through the server, the model weights may have changed during data collection, which could make it difficult to reproduce our results for the first three seeds.

As CDR H3 RMSD varies by seed, to ensure stable and reproducible data analysis we used the local AF3 setup (https://github.com/google-deepmind/alphafold3) to generate another 97 seeds per target on April 8, 2025 using JSON files. The predicted structures were then cropped and renumbered with the same method described above.

Boltz-1 predictions

We use a local Boltz-1 installation from PyPI (version 2.2.0)28 to generate five diffusion samples using a single (random) seed. We use the default three recycling rounds (which is also consistent with AF3), the default 200 diffusion steps, and MSAs auto-generated by the ColabFold server.48 We renumber the structures using the AbNum webserver47 with the Chothia scheme.

Chai-1 predictions

We use a local Chai-1 setup installed from PyPI (version 0.6.1)29 to generate five decoys from a single (random) seed. We use three recycling rounds and 200 diffusion steps, and auto-generate MSAs using the ColabFold server.48 Chai-1 has no flag to select MSA depth. We use pre-cropped sequences from the crystal structures to match the length of the ground-truth and AF3 structures, then renumber the predictions using the AbNum webserver with the Chothia scheme.47

Selecting the top-ranked decoy

We choose the top-ranked AF3 prediction per seed via the calculated rank score (a bespoke combination of ipTM, pTM, and disorder confidence measures) that AF3 provides in the confidence summary files with its predictions. If multiple predictions share the same rank score, we select the prediction with the highest ipTM-HA (as it is applicable to both antibodies and nanobodies). If multiple predictions share both the top rank score and the top ipTM-HA, we select one from the pool at random, as we aim to simulate blind prediction scoring.
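The tie-breaking order described above can be sketched as follows. This is a minimal illustration, not the authors' code: the dictionary keys `name`, `ranking_score`, and `iptm_ha` are hypothetical stand-ins for values parsed from the confidence summary files, and the decoys are toy values.

```python
import random


def select_top_decoy(decoys, rng=None):
    """Pick one decoy: highest rank score, then highest ipTM-HA, then random."""
    rng = rng or random.Random()
    # Step 1: keep every decoy that shares the best rank score.
    best_rank = max(d["ranking_score"] for d in decoys)
    pool = [d for d in decoys if d["ranking_score"] == best_rank]
    # Step 2: break rank-score ties with the heavy chain-antigen ipTM.
    if len(pool) > 1:
        best_iptm = max(d["iptm_ha"] for d in pool)
        pool = [d for d in pool if d["iptm_ha"] == best_iptm]
    # Step 3: any remaining tie is resolved at random (blind selection).
    return rng.choice(pool)


# Toy example: two decoys tie on rank score, ipTM-HA separates them.
decoys = [
    {"name": "model_0", "ranking_score": 0.71, "iptm_ha": 0.40},
    {"name": "model_1", "ranking_score": 0.83, "iptm_ha": 0.52},
    {"name": "model_2", "ranking_score": 0.83, "iptm_ha": 0.61},
]
top = select_top_decoy(decoys, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the final random tie-break reproducible across runs.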

For predictions from the local AF3 setup, AF3 specifies a single decoy out of the five predicted via ranking_score in the score file, so we forego the secondary filtration step using ipTM-HA.

The top-ranked decoys for Boltz-1 and Chai-1 are chosen similarly, by taking the decoy with the highest overall rank score in the provided score files: confidence_score for Boltz-1 and aggregate_score for Chai-1. These ranking scores combine confidence measures in a way similar to AF3’s.

RMSD calculations

To calculate RMSD, we used PyRosetta with its AntibodyInfo class.49 For calculating the CDR H3 global RMSD, we used the CDR backbone function, and for CDR H3 local RMSD, we extracted sub-poses for the CDR H3 loops also using the AntibodyInfo class, then calculated the Cα loop RMSD using the Rosetta Scoring class.

ipTM extraction

The TM-score measures topological similarity of two proteins, such that incorrect loop movements do not result in a severe scoring penalty, and is not dependent on protein size, making it more robust than the RMSD. The ipTM score is the predicted TM-score between two specified chains which estimates the difference between the predicted complex and the unknown ground truth.50 AF3 outputs an ipTM score per chain. For our work we use either the score between the heavy chain and antigen, or light chain and antigen.
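For reference, the TM-score from the cited work of Zhang and Skolnick50 can be written as below. The symbol names are our notation: L_target is the length of the reference protein, L_aligned is the number of aligned residue pairs, and d_i is the distance between the i-th aligned pair after superposition.

```latex
\mathrm{TM\text{-}score} = \max\left[ \frac{1}{L_{\mathrm{target}}}
  \sum_{i=1}^{L_{\mathrm{aligned}}}
  \frac{1}{1 + \left( d_i / d_0(L_{\mathrm{target}}) \right)^{2}} \right],
\qquad
d_0(L_{\mathrm{target}}) = 1.24 \sqrt[3]{L_{\mathrm{target}} - 15} - 1.8
```

Each residue pair contributes at most 1 to the sum, so a badly placed loop lowers the score only slightly, and the length-dependent scale d_0 is what removes the dependence on protein size.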

Binding energy calculations

To calculate binding energies, we used PyRosetta’s InterfaceAnalyzer with the REF2015 score function.49 We set the InterfaceAnalyzer to pack the complex before and after separating the binding partners to ensure that steric violations do not affect the rigid and flexible docking signals.

Figures and statistical analysis

We conducted Mann-Whitney U and Pearson correlation statistical analyses using the SciPy package51 in Python, and generated letter-value plots, scatter plots, strip plots, and regression plots using the Seaborn package52 in Python. We used PyMOL 3.0 (Schrödinger, Inc.)53 for protein structure visualization images.

Edit distance calculations

We calculate edit distances between concatenated CDR sequences using the Levenshtein python package54 via the distance function, which calculates the number of substitutions and indels required to convert one sequence to the other.
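To make the quantity concrete, the sketch below is a plain-Python dynamic-programming implementation of the Levenshtein distance. It is a stand-in for illustration only and is not the package's `distance` function that the analysis used.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    # Keep the shorter string as the row to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # deleting i characters of a reaches the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Classic check: kitten -> sitting needs two substitutions and one insertion.
example = levenshtein("kitten", "sitting")
```

For concatenated CDRs, the two arguments would simply be the joined CDR strings of the two antibodies being compared.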

Precision-recall (PR) curve

The PR curve is used to compare the ability of the confidence metrics to discriminate between correctly docked (DockQ ≥ 0.23) and incorrectly docked (DockQ < 0.23) complexes. We generate operating points by computing the precision and recall at a particular threshold for the predictive metric. Both values are calculated via counts of true positive ($T_p$), false positive ($F_p$), and false negative ($F_n$) data points. The recall is defined as $T_p/(T_p+F_n)$, while the precision is defined as $T_p/(T_p+F_p)$. Discriminative power increases with increasing recall and precision.

As we vary the thresholds separating correct and incorrect cases for each metric, we generate a characteristic curve. The strength of the metric’s discrimination is measured via the average precision score. The average precision (AP) score is calculated as $\mathrm{AP} = \sum_n (R_n - R_{n-1}) P_n$, where $R_n$ and $P_n$ are the recall and precision values at the $n$-th threshold.
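The average precision sum can be sketched in plain Python as below. This is an illustrative implementation under stated assumptions: higher metric values are taken to be better, a complex counts as correct when DockQ is at least 0.23 (the standard acceptable-quality cutoff), and the scores and DockQ values are toy numbers rather than results from this study.

```python
def pr_curve(scores, labels):
    """Return (threshold, precision, recall) tuples, strictest threshold first."""
    n_pos = sum(1 for y in labels if y)
    if n_pos == 0:
        raise ValueError("at least one correct complex is needed")
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        # tp + fp >= 1 because t is itself one of the scores.
        points.append((t, tp / (tp + fp), tp / n_pos))
    return points


def average_precision(scores, labels):
    """AP = sum over thresholds n of (R_n - R_{n-1}) * P_n, with R_0 = 0."""
    ap, previous_recall = 0.0, 0.0
    for _, precision, recall in pr_curve(scores, labels):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Toy data: a confidence metric and the DockQ of each predicted complex.
scores = [0.9, 0.8, 0.6, 0.5, 0.3, 0.2]
dockq = [0.80, 0.55, 0.10, 0.30, 0.05, 0.02]
labels = [q >= 0.23 for q in dockq]
ap = average_precision(scores, labels)
```

For metrics where lower values are better, such as binding energies, negating the values before the call gives the same construction.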

Combined metrics are equivalent to a double filtering strategy. We use a union operator to combine an interface metric (ipTM or I-pLDDT) set at the optimized cutoff value with another measure, such as ΔGB or avg. H3-pLDDT. We generate a range of values of interface metrics to test based on the data, then find the range of values of the second measure by determining what range provides a true positive, false negative and false positive value. Thus each combination metric PR curve is generated via a single interface metric threshold with a range of values for the second threshold. We use the raw AP score to determine the best metric and do not normalize it. We are able to use the PR curves to investigate why a metric is poorer than another.

Interface and average H3 pLDDT calculations

We calculate the I-pLDDT by averaging the residue pLDDTs for all residues within a 10 Å radius of the binding interface.

We extracted the pLDDTs of the Cα atoms of each H3 loop from the B-factor column using PyRosetta. We then averaged them in Python.
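The averaging step can be sketched without PyRosetta by reading the B-factor column of fixed-width PDB ATOM records. This is an illustrative stand-in, not the authors' function. It assumes the prediction is in PDB format with pLDDT stored as the B-factor, that the heavy chain ID is "H", and that CDR H3 spans Chothia residues 95 to 102; `atom_line` is a hypothetical helper that only builds toy records for the demonstration.

```python
def mean_ca_plddt(pdb_lines, chain, first_res, last_res):
    """Average the B-factor (pLDDT) of CA atoms in one chain and residue range."""
    values = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        # PDB fixed columns: atom name 13-16, chain 22, residue number 23-26.
        if line[12:16].strip() != "CA" or line[21] != chain:
            continue
        # Insertion-coded residues (e.g. 100A) share the integer residue number.
        if first_res <= int(line[22:26]) <= last_res:
            values.append(float(line[60:66]))  # B-factor, columns 61-66
    if not values:
        raise ValueError("no CA atoms found in the requested range")
    return sum(values) / len(values)


def atom_line(serial, name, chain, resseq, bfactor):
    """Hypothetical helper: build a fixed-width ATOM record with toy coordinates."""
    return "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, name, "ALA", chain, resseq, 0.0, 0.0, 0.0, 1.0, bfactor)


# Toy structure: only the CA atoms of H95, H100, and H102 fall inside the loop.
demo = [
    atom_line(1, " CA", "H", 94, 50.0),
    atom_line(2, " N", "H", 95, 10.0),
    atom_line(3, " CA", "H", 95, 80.0),
    atom_line(4, " CA", "H", 100, 90.0),
    atom_line(5, " CA", "H", 102, 70.0),
    atom_line(6, " CA", "H", 103, 20.0),
    atom_line(7, " CA", "L", 95, 5.0),
]
h3_plddt = mean_ca_plddt(demo, chain="H", first_res=95, last_res=102)
```

The same function could serve for an interface average if the residue selection were replaced by a distance-based interface definition, which is not shown here.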

Supplementary Material

Ab_conditional_probabilities.csv
Nb_conditional_probabilities.csv
Ab_Scoring_AP.csv
KMAB_A_2545601_SM4086.csv (561.6KB, csv)
Nb_Scoring_AP.csv

Acknowledgments

We thank the AlphaFold team for making the server available, Gray lab members for help in generating data from the server, and Samuel W. Canner and Emeline Haroldsen for comments on the manuscript.

Funding Statement

This work was supported by the National Institutes of Health (NIH) [R35 GM141881].

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data and code availability

We make the curated benchmark dataset, PDB IDs used in prediction and their sequences, and the AF3 predictions available here: https://zenodo.org/records/16426003. The code we created for quick and standardized calculations and to visualize our results can be found here: https://github.com/NooriFatima/AF3_AbNb_Benchmark.

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/19420862.2025.2545601

References

  • 1. Chungyoun MF, Gray JJ. AI models for protein design are driving antibody engineering. Curr Opin Biomed Eng. 2023. Dec 1;28:100473. doi: 10.1016/j.cobme.2023.100473.
  • 2. Zhao L, Wong L, Li J. Antibody-specified B-cell epitope prediction in line with the principle of context-awareness. IEEE/ACM Trans Comput Biol Bioinform. 2011. Nov;8(6):1483–17. doi: 10.1109/TCBB.2011.49.
  • 3. Liu C, Denzler LM, Hood OEC, Martin ACR. Do antibody CDR loops change conformation upon binding? mAbs. 2024. Jan;16(1):2322533. doi: 10.1080/19420862.2024.2322533.
  • 4. Chames P, Van Regenmortel M, Weiss E, Baty D. Therapeutic antibodies: successes, limitations and hopes for the future. Br J Pharmacol. 2009. May;157(2):220–233. doi: 10.1111/j.1476-5381.2009.00190.x.
  • 5. Weitzner BD, Jeliazkov JR, Lyskov S, Marze N, Kuroda D, Frick R, Adolf-Bryfogle J, Biswas N, Dunbrack RL, Gray JJ. Modeling and docking of antibody structures with Rosetta. Nat Protoc. 2017. Feb;12(2):401–416. doi: 10.1038/nprot.2016.180.
  • 6. Hummer AM, Abanades B, Deane CM. Advances in computational structure-based antibody design. Curr Opin Struct Biol. 2022. June;74(102379):102379. doi: 10.1016/j.sbi.2022.102379.
  • 7. Ambrosetti F, Jiménez-García B, Roel-Touris J, Bonvin AMJJ. Modeling antibody-antigen complexes by information-driven docking. Structure. 2020. Jan 7;28(1):119–129.e2. doi: 10.1016/j.str.2019.10.011.
  • 8. Basu S, Wallner B, Levy YK. DockQ: a quality measure for protein-protein docking models. PLOS ONE. 2016. Aug 25;11(8):e0161879. doi: 10.1371/journal.pone.0161879.
  • 9. Collins KW, Copeland MM, Brysbaert G, Wodak SJ, Bonvin AMJJ, Kundrotas PJ, Vakser IA, Lensink MF. CAPRI-Q: the CAPRI resource evaluating the quality of predicted structures of protein complexes. J Mol Biol. 2024. Sep 1;436(17):168540. doi: 10.1016/j.jmb.2024.168540.
  • 10. Sircar A, Gray JJ. SnugDock: paratope structural optimization during antibody-antigen docking compensates for errors in antibody homology models. PLOS Comput Biol. 2010. Jan 22;6(1):e1000644. doi: 10.1371/journal.pcbi.1000644.
  • 11. Jiménez-García B, Roel-Touris J, Romero-Durana M, Vidal M, Jiménez-González D, Fernández-Recio J, Valencia A. LightDock: a new multi-scale approach to protein–protein docking. Bioinformatics. 2018. Jan 1;34(1):49–55. doi: 10.1093/bioinformatics/btx555.
  • 12. Chen R, Weng Z. Docking unbound proteins using shape complementarity, desolvation, and electrostatics. Proteins. 2002. May 15;47(3):281–294. doi: 10.1002/prot.10092.
  • 13. Kozakov D, Hall DR, Xia B, Porter KA, Padhorny D, Yueh C, Beglov D, Vajda S. The ClusPro web server for protein-protein docking. Nat Protoc. 2017. Feb;12(2):255–278. doi: 10.1038/nprot.2016.169.
  • 14. Ruffolo JA, Chu L-S, Mahajan SP, Gray JJ. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nat Commun. 2023. Apr 25;14(1):2389. doi: 10.1038/s41467-023-38063-x.
  • 15. Abanades B, Ki Wong W, Boyles F, Georges G, Bujotzek A, Deane CM. ImmuneBuilder: deep-learning models for predicting the structures of immune proteins. Commun Biol. 2023. May 29;6(1):1–8. doi: 10.1038/s42003-023-04927-7.
  • 16. Corso G, Stärk H, Jing B, Barzilay R, Jaakkola T. DiffDock: diffusion steps, twists, and turns for molecular docking. arXiv. 2023. Feb 11. http://arxiv.org/abs/2210.01776.
  • 17. Jing B, Berger B, Jaakkola T. AlphaFold meets flow matching for generating protein ensembles. arXiv. 2024. Sep 2. http://arxiv.org/abs/2402.04845.
  • 18. Chu L-S, Ruffolo JA, Harmalkar A, Gray JJ. Flexible protein–protein docking with a multitrack iterative transformer. Protein Sci. 2024;33(2):e4862. doi: 10.1002/pro.4862.
  • 19. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021. Aug;596(7873):583–589. doi: 10.1038/s41586-021-03819-2.
  • 20. Verma Y, Heinonen M, Garg V. AbODE: Ab initio antibody design using conjoined ODEs. arXiv. 2023. May 31. http://arxiv.org/abs/2306.01005.
  • 21. Jin W, Wohlwend J, Barzilay R, Jaakkola T. Iterative refinement graph neural network for antibody sequence-structure co-design. arXiv. 2022. Jan 27. http://arxiv.org/abs/2110.04624.
  • 22. Peng J, Xu J. RaptorX: exploiting structure information for protein alignment by statistical inference. Proteins. 2011;79(S10):161–171. doi: 10.1002/prot.23175.
  • 23. Rao R, Liu J, Verkuil R, Meier J, Canny JF, Abbeel P, Sercu T, Rives A. MSA Transformer. bioRxiv. 2021. Feb 13. doi: 10.1101/2021.02.12.430858.
  • 24. Giulini M, Xu X, Bonvin AMJJ. Improved structural modelling of antibodies and their complexes with clustered diffusion ensembles. bioRxiv. 2025. Feb 28. doi: 10.1101/2025.02.24.639865.
  • 25. Harmalkar A, Lyskov S, Gray JJ. Reliable protein-protein docking with AlphaFold, Rosetta, and replica-exchange. eLife. 2024. Feb 9. 13. doi: 10.7554/eLife.94029.3.
  • 26. Wallner B, Kelso J. AFsample: improving multimer prediction with AlphaFold using massive sampling. Bioinformatics. 2023. Sep 2. 39(9). doi: 10.1093/bioinformatics/btad573.
  • 27. Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, Ronneberger O, Willmore L, Ballard AJ, Bambrick J, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024. June;630(8016):493–500. doi: 10.1038/s41586-024-07487-w.
  • 28. Wohlwend J, Corso G, Passaro S, Reveiz M, Leidal K, Swiderski W, Portnoi T, Chinn I, Silterra J, Jaakkola T, et al. Boltz-1 democratizing biomolecular interaction modeling. bioRxiv. 2024. Dec 27. doi: 10.1101/2024.11.19.624167.
  • 29. Boitreaud J, Dent J, McPartlon M, Meier J, Reis V, Rogozhnikov A, Wu K. Chai-1: decoding the molecular interactions of life. bioRxiv. 2024. Oct 11. doi: 10.1101/2024.10.10.615955.
  • 30. Yin R, Pierce BG. Evaluation of AlphaFold antibody-antigen modeling with implications for improving predictive accuracy. Protein Sci. 2024. Jan;33(1):e4865. doi: 10.1002/pro.4865.
  • 31. Mishra AK, Mariuzza RA. Insights into the structural basis of antibody affinity maturation from next-generation sequencing. Front Immunol. 2018. Feb 1;9:117. doi: 10.3389/fimmu.2018.00117.
  • 32. Conti S, Ovchinnikov V, Faris JG, Chakraborty AK, Karplus M, Sprenger KG. Multiscale affinity maturation simulations to elicit broadly neutralizing antibodies against HIV. PLOS Comput Biol. 2022. Apr 20;18(4):e1009391. doi: 10.1371/journal.pcbi.1009391.
  • 33. Marze NA, Roy Burman SS, Sheffler W, Gray JJ. Efficient flexible backbone protein-protein docking for challenging targets. Bioinformatics. 2018. Oct 15;34(20):3461–3469. doi: 10.1093/bioinformatics/bty355.
  • 34. Xiang Y, Huang W, Liu H, Sang Z, Nambulli S, Tubiana J, Kevin L Williams WPD Jr, Schneidman-Duhovny D, Wilson IA, Taylor DJ, et al. Superimmunity by pan-sarbecovirus nanobodies. Cell Rep. 2022. June 28;39(13):111004. doi: 10.1016/j.celrep.2022.111004.
  • 35. Shi W, Cai Y, Zhu H, Peng H, Voyer J, Rits-Volloch S, Cao H, Mayer ML, Song K, Xu C, et al. Cryo-EM structure of SARS-CoV-2 postfusion spike in membrane. Nature. 2023. Jul;619(7969):403–409. doi: 10.1038/s41586-023-06273-4.
  • 36. Alford RF, Leaver-Fay A, Jeliazkov JR, O’Meara MJ, DiMaio FP, Park H, Shapovalov MV, Douglas Renfrew P, Mulligan VK, Kappel K, et al. The Rosetta all-atom energy function for macromolecular modeling and design. J Chem Theory Comput. 2017. June 13;13(6):3031–3048. doi: 10.1021/acs.jctc.7b00125.
  • 37. Fernández-Quintero ML, Kraml J, Georges G, Liedl KR. CDR-H3 loop ensemble in solution - conformational selection upon antibody binding. mAbs. 2019. June 9;11(6):1077–1088. doi: 10.1080/19420862.2019.1618676.
  • 38. Fernández-Quintero ML, Pomarici ND, Math BA, Kroell KB, Waibl F, Bujotzek A, Georges G, Liedl KR. Antibodies exhibit multiple paratope states influencing VH-VL domain orientations. Commun Biol. 2020. Oct 20;3(1):589. doi: 10.1038/s42003-020-01319-z.
  • 39. Callaway E. AI protein-prediction tool AlphaFold3 is now more open. Nature. 2024. Nov;635(8039):531–532. doi: 10.1038/d41586-024-03708-4.
  • 40. Xu S, Feng Q, Qiao L, Wu H, Shen T, Cheng Y, Zheng S, Sun S. FoldBench: An all-atom benchmark for biomolecular structure prediction. bioRxiv. 2025. May 27. doi: 10.1101/2025.05.22.655600.
  • 41. Fromm S, Ludaic M, Elofsson A. Evaluating deep learning based structure prediction methods on antibody-antigen complexes. bioRxiv. 2025. Jul 11. doi: 10.1101/2025.07.11.662141.
  • 42. Passaro S, Corso G, Wohlwend J, Reveiz M, Thaler S, Ram Somnath V, Getz N, Portnoi T, Roy J, Stark H, et al. Boltz-2: towards accurate and efficient binding affinity prediction. bioRxiv. 2025. June 18. doi: 10.1101/2025.06.14.659707.
  • 43. Dunbar J, Krawczyk K, Leem J, Baker T, Fuchs A, Georges G, Shi J, Deane CM. SAbDab: the structural antibody database. Nucleic Acids Res. 2014. Jan 1;42(D1):D1140–D1146. doi: 10.1093/nar/gkt1043.
  • 44. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009. June 1;25(11):1422–1423. doi: 10.1093/bioinformatics/btp163.
  • 45. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017. Nov;35(11):1026–1028. doi: 10.1038/nbt.3988.
  • 46. Zong F, Long C, Hu W, Chen S, Dai W, Xiao Z-X, Cao Y. Abalign: a comprehensive multiple sequence alignment platform for B-cell receptor immune repertoires. Nucleic Acids Res. 2023. May 19;51(W1):W17–W24. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10320167/.
  • 47. Abhinandan KR, Martin ACR. Analysis and improvements to Kabat and structurally correct numbering of antibody variable domains. Mol Immunol. 2008. Aug;45(14):3832–3839. doi: 10.1016/j.molimm.2008.05.022.
  • 48. Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022. June;19(6):679–682. doi: 10.1038/s41592-022-01488-1.
  • 49. Chaudhury S, Lyskov S, Gray JJ. PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics. 2010. Mar 1;26(5):689–691. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2828115/.
  • 50. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins. 2004. Dec 1;57(4):702–710. doi: 10.1002/prot.20264.
  • 51. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020. Mar;17(3):261–272. doi: 10.1038/s41592-019-0686-2.
  • 52. Waskom M. Seaborn: statistical data visualization. J Open Source Softw. 2021. Apr 6;6(60):3021. doi: 10.21105/joss.03021.
  • 53. Schrödinger LLC. The PyMOL molecular graphics system, version 1.8. 2015. Nov.
  • 54. Bachmann M. Python-levenshtein: Levenshtein distance and string similarity. 2024. https://github.com/maxbachmann/python-Levenshtein.

Articles from mAbs are provided here courtesy of Taylor & Francis
