Abstract
Deep learning has significantly enhanced protein structure prediction, and AlphaFold2 marked a particular milestone among these methods for predicting protein monomer and complex structures. AlphaFold3 represents a pivotal further advancement in biomolecular structure prediction, extending beyond proteins to model diverse assemblies. Although AlphaFold3 has attracted a huge number of users, third-party benchmarks that fairly assess its performance are still lacking. In this work, we benchmark AlphaFold3 on nine datasets (protein monomers, orphan proteins, alternative conformations, protein multimers, peptide-protein complexes, antigen-antibody complexes, RNA monomers, RNA multimers, and protein-nucleic acid complexes) against AlphaFold2, AlphaFold-Multimer, RoseTTAFoldNA, RhoFold+, NuFold, and trRosettaRNA. For protein monomers, AlphaFold3 demonstrates improved local structural accuracy over AlphaFold2, though global accuracy gains are limited. In modeling general protein complexes, AlphaFold3 surpasses AlphaFold-Multimer in local structural prediction. For peptide-protein complexes, the two methods are nearly indistinguishable, whereas on antigen-antibody complexes, AlphaFold3 is significantly superior. AlphaFold3 shows substantial superiority over RoseTTAFoldNA in protein-nucleic acid predictions, with significant gains in TM-score, local distance difference test scores, and interaction network fidelity scores, whereas for RNA multimers its advantage is limited to significant gains in local distance difference test scores. For RNA monomers, trRosettaRNA achieves higher global prediction accuracy. These results highlight AlphaFold3’s ability to predict both structural detail and interactions, positioning it as a versatile tool for diverse biomolecular systems and suggesting promising applications in structural biology and molecular interaction research, while at the same time highlighting areas ripe for continuing improvements in performance.
Keywords: protein structure prediction, protein complex structure prediction, RNA structure prediction, RNA multimer structure prediction, protein-nucleic acid complex structure prediction
Introduction
Protein structure prediction refers to the computational process of determining the three-dimensional shape of a protein from its amino acid sequence [1, 2]. To date, protein structure prediction has been a topic of scientific inquiry for over 50 years. In recent years, the rapid development of artificial intelligence and the successful application of deep learning technologies have led to remarkable progress in protein structure prediction and related fields [3]. From early attempts at predicting backbone torsion angles [4, 5] and directly constructing structures from these predictions [5], to later advances in inter-residue distance prediction, these efforts made the first end-to-end predictions by deep learning possible and laid the foundation for the eventual arrival of AlphaFold2 [6, 7]. A pivotal breakthrough occurred with the integration of co-evolutionary information from multiple sequence alignments (MSAs) coupled with deep learning [8, 9]. This approach substantially improved the accuracy of protein spatial restraint prediction [9–11], enhancing protein structure prediction by guiding folding processes through algorithms like the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method or Replica Exchange Monte Carlo (REMC) simulations [10, 12–15].
The first generation of AlphaFold adopted a similar strategy, achieving top performance in tertiary structure modeling and ranking as the leading human group in the 13th Critical Assessment of protein Structure Prediction (CASP13) [16]. A milestone was reached at CASP14 in 2020 with the advent of AlphaFold2 [7], which incorporated spatial restraint learning and protein folding into a unified end-to-end self-attention mechanism. This innovation advanced prediction accuracy to unprecedented levels, producing models comparable to experimentally determined structures for many single-domain proteins [17]. Following the success of AlphaFold2, numerous methods with similar end-to-end frameworks [18–21] or incorporating modified versions of AlphaFold2 have been developed [22–25], demonstrating excellent performance in predicting structures for multidomain and orphan proteins.
The DeepMind team further extended AlphaFold2 to predict the structures of protein multimers, yielding AlphaFold-Multimer [26], which also performed well in protein multimer structure prediction. In CASP15, methods making use of AlphaFold-Multimer as a pipeline component achieved remarkable results; for instance, DMFold achieved first place in the multimer structure prediction category by integrating diverse MSA generation strategies with the AlphaFold-Multimer computational core [27], and AFsample [28] employed a massive sampling strategy using various AlphaFold2 configurations, likewise yielding strong performance.
Despite these advances, challenges persist in predicting the structures of more complex biomolecular systems, such as RNA molecules and protein-nucleic acid interactions, which are essential in cellular processes including gene regulation, genetic information transfer, replication, transcription, and translation. Although a few deep learning-based methods have been developed for ab initio RNA structure prediction [29–33], blind tests in CASP15-16 suggest these methods are nowhere near parity with more conventional methods combining human input and classical molecular modeling, indicating significant potential for further advancements [34]. Similarly, deep learning-based methods like RoseTTAFoldNA [31] and RoseTTAFold All-Atom [35] have been developed to predict protein-nucleic acid complex structures, but their predictive accuracy remains unsatisfactory.
In May 2024, AlphaFold3 was released, expanding capabilities beyond protein structures to model biomolecular complexes involving proteins, nucleic acids, small molecules, ions, and modified residues [36]. AlphaFold3 builds on the architecture of AlphaFold2, with the main differences including the replacement of the structure module with a diffusion module that predicts accurate atomic coordinates from noisy coordinates, along with an extended tokenization scheme to accommodate a broader range of molecular types and simplified MSA handling to enhance efficiency.
The AlphaFold3 authors report substantial improvements in accuracy over many previous specialized tools, showing it is possible to accurately predict the structures of diverse biomolecular systems within a unified framework [36]. However, these results are based on an internal version and evaluation of AlphaFold3. In this work, we evaluated the performance of the AlphaFold3 server and standalone package against AlphaFold2 and AlphaFold-Multimer on datasets of protein monomers (general monomers and orphan proteins), complexes (general protein multimers, peptide-protein complexes, antigen-antibody complexes) and alternative conformations. We compared AlphaFold3 with RoseTTAFoldNA [18, 31], NuFold [37], RhoFold+ [33], and trRosettaRNA [32] on RNA monomers, and with RoseTTAFoldNA on datasets of RNA multimers and protein-nucleic acid complexes. For protein monomers, including both general monomers and orphan proteins, AlphaFold3 demonstrates improved local structural accuracy compared with AlphaFold2, though global accuracy shows no significant difference. AlphaFold3 also shows an improved capability for modeling multiconformation proteins in some cases, but further improvement in developing and scoring diverse ensembles of candidate structures is clearly required. For antigen-antibody complexes, AlphaFold3 significantly outperforms AlphaFold-Multimer in all metrics. In contrast, for general protein complexes, its advantage is limited to local accuracy, and for peptide-protein complexes, the two methods perform comparably. For RNA multimers and protein-nucleic acid complexes, AlphaFold3 significantly surpasses RoseTTAFoldNA in both global and local prediction accuracy. However, for RNA monomers, trRosettaRNA achieves higher global accuracy, whereas AlphaFold3 remains superior in local accuracy.
Material and methods
Datasets
To evaluate the performance of AlphaFold3 in modeling biomacromolecules and their interactions, nine datasets were constructed from the recently released entries in the Protein Data Bank (PDB). Those nine datasets include: a protein monomer dataset (Benchmark-I), an orphan protein dataset (Benchmark-II), a protein multi-conformation dataset (Benchmark-III), a protein complex dataset (Benchmark-IV), a peptide-protein complex dataset (Benchmark-V), an antigen-antibody complex dataset (Benchmark-VI), an RNA dataset (Benchmark-VII), an RNA multimer dataset (Benchmark-VIII), and a protein-nucleic acid complex dataset (Benchmark-IX). The training dataset for AlphaFold3 (v3.0.1) and AlphaFold-Multimer (v2.3.0) contains all structures released in the PDB before September 30, 2021. The training dataset for AlphaFold2 (v2.3.0) contains all structures released in the PDB before April 30, 2018.
Protein monomer-related benchmark datasets
Benchmark-I: the protein monomer benchmark dataset was constructed based on the following four criteria: (i) all structures were released after 1 January 2024; (ii) structures with sequence similarity >40% and coverage >80% to those in the AlphaFold3 training dataset were excluded; (iii) each of the structures must cover at least 80% of the residues in the corresponding crystal structure; (iv) none of the remaining structures have more than 40% sequence similarity and 80% coverage to each other, after redundancy was removed using CD-HIT [38]. As a result, 150 monomers, ranging from 60 to 1700 amino acids, were selected for the final benchmark dataset (see Table S1). Furthermore, those 150 monomer proteins were categorized into 108 single-domain proteins and 42 multidomain proteins based on DomainParser [39, 40], which applies graph algorithms to identify domain boundaries based on connectivity of residue interactions. These 150 monomers were further classified into 86 “easy” targets and 64 “hard” targets according to LOMETS3 [41].
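The first three selection criteria above amount to a chain of filters over candidate PDB entries. The following is a minimal sketch of that logic, assuming a hypothetical `Candidate` record whose fields (release date, best identity and coverage against the training set, and residue coverage) have already been computed; the field names, the function `passes_benchmark_i`, and the toy entries are illustrative and not part of the authors' pipeline. Redundancy removal with CD-HIT (criterion iv) is not reproduced here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    pdb_id: str              # hypothetical identifier, for illustration only
    release_date: date
    train_identity: float    # best sequence identity to the training set, 0..1
    train_coverage: float    # alignment coverage of that best hit, 0..1
    residue_coverage: float  # fraction of residues covered (criterion iii), 0..1


RELEASE_CUTOFF = date(2024, 1, 1)


def passes_benchmark_i(c: Candidate) -> bool:
    """Apply criteria (i)-(iii) as described in the text."""
    released_late = c.release_date > RELEASE_CUTOFF
    # Criterion (ii): excluded only if BOTH identity >40% and coverage >80%.
    similar_to_training = c.train_identity > 0.40 and c.train_coverage > 0.80
    well_covered = c.residue_coverage >= 0.80
    return released_late and not similar_to_training and well_covered


candidates = [
    Candidate("T001", date(2024, 3, 5), 0.25, 0.90, 0.95),    # passes all filters
    Candidate("T002", date(2023, 11, 20), 0.10, 0.50, 0.99),  # released too early
    Candidate("T003", date(2024, 6, 1), 0.55, 0.85, 0.90),    # too similar to training set
    Candidate("T004", date(2024, 2, 14), 0.30, 0.95, 0.70),   # insufficient residue coverage
]
selected = [c.pdb_id for c in candidates if passes_benchmark_i(c)]
print(selected)
```

Keeping each criterion as a named boolean makes it easy to report how many candidates are lost at each step.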
Benchmark-II: The orphan protein benchmark dataset was constructed based on the following five steps: (i) all protein sequences deposited in the PDB after January 2024 were collected; (ii) redundancy reduction was applied at a sequence identity threshold of 40% with a minimum alignment coverage of 80%; (iii) each sequence was searched against the UniRef30 database using HHblits with default parameters to identify potential homologues; (iv) proteins without homologous sequences were considered as orphans; (v) each of the structures must cover at least 80% of the residues in the corresponding crystal structure. This procedure yielded a final benchmark set comprising 113 orphan proteins, ranging from 41 to 1012 amino acids (see Table S2).
Benchmark-III: to evaluate the performance of AlphaFold3 in modeling proteins with multiple alternative conformations, 60 sets of structures were selected from the M-SADA benchmark [42] (see Table S3). Each such case contains only two selected structures that share identical sequences but have the most distinct conformations (TM-score < 0.80).
Protein multimer-related benchmark datasets
Benchmark-IV: 206 protein multimers, with amino acid lengths ranging from 100 to 3000 and containing up to 11 chains, were selected for Benchmark-IV. The selection was based on the following four criteria: (i) all structures were released after 1 January 2024; (ii) multimers with any component chain having sequence similarity >40% with any chain from the AlphaFold3 training dataset were excluded; (iii) the remaining complexes must cover at least 80% of residues in the corresponding crystal structure; (iv) finally, multimers were grouped by the number of component chains, then clustered by US-align [43] using TM-score = 0.5 as threshold, and only one complex was selected from each cluster. Those 206 protein multimers were further classified into 35 heteromers and 171 homomers, where a heteromer refers to a complex containing at least two different component proteins, and a homomer consists of multiple copies of identical component proteins (see Table S4).
Benchmark-V: the peptide-protein complex benchmark dataset was constructed through the following steps: (i) all peptide-protein complexes deposited in the PDB after January 2024 were collected; (ii) complexes were retained only if the protein chain contained more than 40 amino acids, and the peptide chain comprised 5–30 residues; (iii) redundancy among protein chains was assessed using CD-HIT, and complexes were removed if the protein sequences shared more than 40% identity with at least 80% coverage; (iv) the remaining complexes were further filtered by comparing protein sequences against the AlphaFold3 training set using CD-HIT with 40% sequence identity and 80% coverage cutoffs. Following this procedure, 80 nonredundant peptide-protein complexes, ranging from 69 to 1172 amino acids, were selected to form the benchmark dataset (see Table S5).
Benchmark-VI: the antigen-antibody complex dataset was constructed according to the following criteria: (i) all antigen-antibody complexes released after 30 September 2021 were collected; (ii) only complexes comprising paired VH and VL domains bound to a single-chain protein antigen were retained; (iii) complexes were excluded if their antigen chain showed more than 40% sequence identity to antigen chains from antigen-antibody complexes in the AlphaFold3 training dataset; (iv) the remaining complexes were clustered based on antigen chain sequence similarity (40% sequence identity cutoff) to remove redundancy, and only one representative complex was selected from each cluster. Finally, 102 antigen-antibody complexes, ranging from 360 to 2168 amino acids, were included in the Benchmark-VI dataset (see Table S6).
Nucleotide-related benchmark datasets
Benchmark-VII: 52 RNA structures, with nucleotide lengths ranging from 40 to 1000, were selected for the Benchmark-VII dataset. The RNA structures were chosen by the following four criteria: (i) all structures were released after May 1, 2022 (i.e. after the training-set cutoffs of all tested methods); (ii) the RNA must cover at least 80% of the nucleotides in the corresponding crystal structure; (iii) RNAs with high similarity to the AlphaFold3 or RoseTTAFoldNA [31] training datasets were excluded, with similarity checked by BLASTN [44] using an e-value threshold of 10; (iv) none of the remaining structures have more than 80% sequence similarity to each other, after redundancy was removed using CD-HIT-EST [38]. The 52 RNA targets were categorized as “easy” targets or “hard” targets based on the structural similarity to the best identified template (see Table S7). BLASTN was used to search for hits in the AlphaFold3 training dataset, and the highest-identity sequence with <80% identity was selected to calculate the TM-score by US-align [43]. Targets with a TM-score >0.50 were classified as “easy”, while those without matches or with lower TM-scores were deemed “hard”, resulting in 30 “hard” targets and 22 “easy” targets.
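The “easy”/“hard” assignment described above can be sketched as a two-step rule: pick the highest-identity hit below the 80% identity cap, then threshold its TM-score at 0.50. In the sketch below, the hit list, the template names, and the `tm_score_to` callback are hypothetical stand-ins for BLASTN output and a US-align run; only the decision rule is taken from the text.

```python
from typing import Callable, List, Optional, Tuple

Hit = Tuple[str, float]  # (template id, sequence identity in percent)


def pick_template(hits: List[Hit], max_identity: float = 80.0) -> Optional[str]:
    """Return the highest-identity hit whose identity is below the cap, if any."""
    eligible = [h for h in hits if h[1] < max_identity]
    if not eligible:
        return None
    return max(eligible, key=lambda h: h[1])[0]


def classify_target(hits: List[Hit],
                    tm_score_to: Callable[[str], float],
                    tm_cutoff: float = 0.50) -> str:
    """Label a target 'easy' or 'hard' following the rule in the text."""
    template = pick_template(hits)
    if template is None:
        return "hard"  # no usable match in the training set
    return "easy" if tm_score_to(template) > tm_cutoff else "hard"


# Toy data: identities and TM-scores are invented for illustration.
toy_hits = [("tplA", 92.0), ("tplB", 71.5), ("tplC", 40.0)]
toy_tm = {"tplB": 0.62, "tplC": 0.31}
print(classify_target(toy_hits, lambda t: toy_tm[t]))
print(classify_target([], lambda t: toy_tm[t]))
```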
Benchmark-VIII: The RNA multimer dataset was constructed according to the following criteria: (i) only RNA multimers released after 1 May 2023 were considered; (ii) assemblies containing more than one RNA chain and no non-RNA polymers were retained; (iii) all RNA component chains were required to contain more than 10 nucleotides, and only multimers with a total residue count <1000 were selected; (iv) complexes in which all RNA component chains showed sequence similarity (BLASTN e-value < 10) to chains within a single complex in the AlphaFold3 training dataset were excluded; (v) the remaining multimers were grouped by the number of component chains, then clustered by US-align using TM-score = 0.5 as threshold, and only the largest multimer from each cluster was retained. Following these steps, the Benchmark-VIII dataset comprises 23 RNA multimers, with sizes ranging from 26 to 814 nucleotides (see Table S8).
Benchmark-IX: The protein-nucleic acid complex dataset was constructed according to the following five criteria: (i) all protein-nucleic acid complexes were released after May 1, 2023; (ii) only complexes with a total residue count <1000, and at least 10 nucleotides per chain, were selected to ensure RoseTTAFoldNA compatibility; (iii) complexes in which all protein component chains had sequence similarity >40% and all nucleic acid component chains had sequence similarity (BLASTN e-value < 10) to any entity in the AlphaFold3 training dataset were excluded; (iv) in the remaining complexes, the corresponding components of the crystal structure must cover at least 80% of the amino acids and nucleotides; (v) complexes were grouped by the number of component protein, RNA, and/or DNA chains, then clustered by US-align using TM-score = 0.5 as a threshold, and the largest complex was selected from each cluster. This resulted in 59 protein-nucleic acid complexes in our Benchmark-IX dataset (see Table S9).
Metrics
The TM-score [43, 45, 46] and local distance difference test (LDDT) [47] are used to evaluate the global and local accuracy of predicted models, respectively, making them suitable for assessing the modeling quality of protein monomers, multimers, RNA, and protein-nucleic acid complexes. DockQ [48] and the fraction of native contacts (FNAT) [49] are used to evaluate the quality of interactions in protein multimers and protein-nucleic acid complexes, respectively. Additionally, interaction network fidelity (INF) [50] is utilized to assess the secondary structure prediction accuracy of RNA monomers. CDR-TMscore is introduced to specifically assess the accuracy of CDR loop modeling in antibodies. We detail each of these measures, and their applications in our benchmarking, below.
TM-score is a metric for evaluating the topological similarity between protein structures, ranging from 0 to 1, with a score of 1 indicating a perfect structural match. A TM-score >0.5 typically indicates that the two structures share the same global topology [46]. The TM-score is defined as:
$$\text{TM-score} = \max\left[\frac{1}{N}\sum_{i=1}^{T}\frac{1}{1+\left(d_i/d_0\right)^{2}}\right] \tag{1}$$
where N represents the length of the experimental structure, T is the number of residues aligned to the reference structure, di denotes the distance between the i-th pair of aligned residues, and d0 is a scaling factor used to normalize the matching difference. “Max” refers to the maximum value obtained after optimal spatial superposition. The TM-score is calculated for all protein monomers, multimers, RNA, and protein-nucleic acid complexes using US-align [43], with the options “-TMscore 7 -ter 0”.
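To make the roles of N, T, di, and d0 concrete, the sketch below evaluates the bracketed sum of Eq. 1 for a single, fixed superposition. It does not search for the optimal superposition, which is the part US-align performs. The helper `protein_d0` uses the length-dependent scale commonly quoted for proteins in the TM-score literature; that formula and the floor for very short chains are assumptions of this sketch (the text does not state them), and nucleic acids may use a different scale.

```python
from typing import Sequence


def protein_d0(n_residues: int) -> float:
    """Length-dependent distance scale commonly used for proteins (assumed here)."""
    if n_residues <= 21:
        # Assumed floor for very short chains, where the cube-root formula gets too small.
        return 0.5
    return 1.24 * (n_residues - 15) ** (1.0 / 3.0) - 1.8


def tm_score_term_sum(distances: Sequence[float], n_reference: int, d0: float) -> float:
    """Evaluate the bracketed sum of Eq. 1 for one fixed superposition.

    distances   -- d_i for the T aligned residue pairs, in angstroms
    n_reference -- N, the length of the experimental structure
    """
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / n_reference


d0 = protein_d0(100)
perfect = tm_score_term_sum([0.0] * 100, 100, d0)     # all 100 residues aligned exactly
half_aligned = tm_score_term_sum([d0] * 50, 100, d0)  # 50 residues aligned, each at distance d0
print(round(d0, 3), perfect, half_aligned)
```

Because the sum is divided by N rather than T, unaligned residues lower the score: here, aligning only half of the residues at a distance of d0 contributes 0.5 per aligned pair and 0 for the rest.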
The LDDT is a superposition-free score that evaluates local distance differences of all atoms in a model, ranging from 0 to 1, where a score of 1 indicates perfect agreement between the predicted and experimental structures at the local level. The LDDT can be calculated as:
$$\mathrm{LDDT} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{4\,|M_i|}\sum_{j\in M_i}\sum_{k=1}^{4} H\!\left(t_k-\Delta_{ij}\right) \tag{2}$$
where N represents the length of the experimental structure, $M_i$ represents the set of residues j whose distance $d_{ij}$ from residue i is <15 Å, $d_{ij}$ is the distance of residue pair (i, j) in the experimental structure, $\Delta_{ij}$ represents the distance deviation between residue pair (i, j) in the predicted and experimental structures, $t_k$ represents the different cutoffs, and H(x) represents the unit step function. Here, $t_1$, $t_2$, $t_3$, and $t_4$ are 0.5 Å, 1 Å, 2 Å, and 4 Å, respectively. LDDT was calculated using the OpenStructure software with default settings.
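As an illustration of the definition above, the following sketch computes a simplified, residue-level LDDT from one representative coordinate per residue. It is not the OpenStructure implementation used in this work, which scores all atoms and applies additional checks. The function name `residue_lddt`, the strict comparison against each cutoff, and the decision to skip residues with no neighbours within 15 Å are choices made for this sketch.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]
THRESHOLDS = (0.5, 1.0, 2.0, 4.0)  # distance-deviation cutoffs in angstroms
INCLUSION_RADIUS = 15.0            # neighbours are defined on the experimental structure


def residue_lddt(pred: Sequence[Point], ref: Sequence[Point]) -> float:
    """Simplified LDDT using one coordinate per residue (e.g. C-alpha)."""
    if len(pred) != len(ref):
        raise ValueError("predicted and reference structures must have equal length")
    per_residue = []
    for i in range(len(ref)):
        neighbours = 0
        preserved = 0
        for j in range(len(ref)):
            if i == j:
                continue
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= INCLUSION_RADIUS:
                continue
            neighbours += 1
            deviation = abs(math.dist(pred[i], pred[j]) - d_ref)
            preserved += sum(deviation < t for t in THRESHOLDS)
        if neighbours:  # residues without neighbours are skipped in this sketch
            per_residue.append(preserved / (4 * neighbours))
    return sum(per_residue) / len(per_residue) if per_residue else 0.0


# Toy chain: six points spaced 3.8 angstroms apart along the x-axis.
ref = [(3.8 * i, 0.0, 0.0) for i in range(6)]
shifted = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in ref]  # rigid translation
stretched = [(1.2 * x, y, z) for x, y, z in ref]             # distorted geometry
print(residue_lddt(ref, ref), residue_lddt(shifted, ref), residue_lddt(stretched, ref))
```

The translated copy scores the same as the reference itself because only internal distances enter the score, which is what makes LDDT superposition-free.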
DockQ is a metric used to evaluate the quality of predicted protein–protein interactions, as a composite score that integrates Fnat, interface root-mean-square deviation (iRMS), and ligand root-mean-square deviation (LRMS). These individual metrics capture different aspects of docking accuracy, and DockQ combines them into a single score to provide an overall measure of the quality of a docking prediction. The DockQ is defined as Eq. 3,
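$$\mathrm{DockQ} = \frac{F_{\mathrm{nat}} + \mathrm{RMS}_{\mathrm{scaled}}(\mathrm{LRMS}, d_1) + \mathrm{RMS}_{\mathrm{scaled}}(\mathrm{iRMS}, d_2)}{3} \tag{3}$$
$$\mathrm{RMS}_{\mathrm{scaled}}(\mathrm{RMS}, d_i) = \frac{1}{1+\left(\mathrm{RMS}/d_i\right)^{2}} \tag{4}$$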
where Fnat is defined as the fraction of interfacial contacts of the experimental structure preserved in the interface of the predicted complex, LRMS is calculated for the backbone of the shorter chain (ligand) of the model after superposition of the longer chain (receptor) [51], and iRMS is calculated by superposing the backbone atoms of the receptor-ligand interface residues in the experimental structure onto their equivalents in the predicted complex. Here, a pair of residues on different sides of the interface was considered to be in contact if any of their atoms were within 5 Å [51], and the receptor-ligand interface in the experimental structure is redefined at a relatively relaxed atomic contact cutoff of 10 Å [48]. RMSscaled represents the scaled RMS deviations (Eq. 4), and di is a scaling factor, d1 for LRMS and d2 for iRMS, optimized to d1 = 8.5 Å and d2 = 1.5 Å [48]. DockQ ranges from 0 to 1, where a score of 1 represents a perfect match and a score < 0.23 is generally considered to indicate an incorrect model. A DockQ score between 0.23 and 0.49 indicates acceptable quality, a score between 0.49 and 0.80 represents medium quality, and a score ≥ 0.80 represents high quality [48]. DockQ was calculated using the package provided by [52], with default options.
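Given precomputed Fnat, iRMS, and LRMS values, the composite score and its quality bands reduce to a few lines. The sketch below uses the scaling factors and thresholds quoted above; the function names are illustrative, and the handling of scores that fall exactly on 0.49 (assigned to the higher band) is an assumption, since the text does not specify it. Computing Fnat, iRMS, and LRMS from coordinates is left to the DockQ package.

```python
def rms_scaled(rms: float, d: float) -> float:
    """Scaled RMS deviation (Eq. 4): 1 for a perfect match, tending to 0 as rms grows."""
    return 1.0 / (1.0 + (rms / d) ** 2)


def dockq(fnat: float, irms: float, lrms: float,
          d1: float = 8.5, d2: float = 1.5) -> float:
    """Composite DockQ score (Eq. 3); d1 scales LRMS and d2 scales iRMS."""
    return (fnat + rms_scaled(lrms, d1) + rms_scaled(irms, d2)) / 3.0


def dockq_quality(score: float) -> str:
    """Map a DockQ score to the quality bands quoted in the text."""
    if score < 0.23:
        return "incorrect"
    if score < 0.49:  # assumption: exactly 0.49 falls in the higher band
        return "acceptable"
    if score < 0.80:
        return "medium"
    return "high"


# Toy inputs: half of the native contacts recovered, iRMS = d2 and LRMS = d1.
example = dockq(fnat=0.5, irms=1.5, lrms=8.5)
print(example, dockq_quality(example))
```

In this toy case each of the three terms equals 0.5, which shows how an RMS deviation equal to its scaling factor contributes exactly half of the maximum.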
The definition of FNAT is identical to Fnat in the discussion of Eq. 3 above, but FNAT is specifically calculated using DockQ to evaluate the accuracy of predicted interactions in protein-nucleic acid complexes.
The INF, which is used only for RNA targets, measures the similarity between the interactions in the experimental structure and those in the predicted structure. The entire RNA structure can be considered as a large interaction network composed of Watson-Crick-Franklin interactions, non-Watson-Crick-Franklin interactions, and base stacking interactions [53]. INF is defined as the Matthews correlation coefficient between the predicted interactions and experimental interactions, where a higher score indicates stronger agreement between the interaction patterns of the predicted and experimental structures. Here, INF is computed using ClaRNA [54] with default options.
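Once interactions have been annotated in both structures, the INF comparison reduces to set arithmetic. In the RNA assessment literature the Matthews correlation coefficient is commonly approximated by the geometric mean of precision and sensitivity, because true negatives are not counted; the sketch below follows that convention as an assumption. Interaction annotation (done with ClaRNA in this work) is not reproduced: the interaction tuples, their type labels, and the function name `inf_score` are illustrative.

```python
import math
from typing import Set, Tuple

# (residue i, residue j, interaction type), with i < j assumed for a unique key
Interaction = Tuple[int, int, str]


def inf_score(predicted: Set[Interaction], reference: Set[Interaction]) -> float:
    """Geometric mean of precision (PPV) and sensitivity (STY) over interactions."""
    tp = len(predicted & reference)  # interactions found in both structures
    fp = len(predicted - reference)  # predicted but absent from the experiment
    fn = len(reference - predicted)  # experimental interactions that were missed
    if tp == 0:
        return 0.0
    ppv = tp / (tp + fp)
    sty = tp / (tp + fn)
    return math.sqrt(ppv * sty)


# Toy annotation: three of four reference interactions are recovered.
reference = {(1, 20, "WC"), (2, 19, "WC"), (3, 18, "WC"), (5, 6, "stack")}
predicted = {(1, 20, "WC"), (2, 19, "WC"), (4, 17, "WC"), (5, 6, "stack")}
print(inf_score(predicted, reference))
```

Restricting both sets to a single interaction type before calling the function gives a per-type score, for example for canonical base pairs only.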
The CDR-TMscore measures the structural accuracy of the antibody complementarity-determining regions (CDRs), with the calculation performed as follows. For each predicted antigen-antibody complex, the model is first superposed onto the native structure based on the antigen, and then the TM-score of six loops in the VH and VL domains is calculated. Higher CDR-TMscore values indicate closer agreement between the predicted and experimental CDR conformations.
Results and discussion
We compared the performance of AlphaFold3 with AlphaFold2 (version 2.3) for protein monomers and multimers to assess improvements in modeling accuracy, with AlphaFold2 using a template library with a maximum template date of 30 September 2021, the same library used for AlphaFold3 [36]. The AlphaFold3 results were obtained from the AlphaFold3 server and AlphaFold3 standalone package. For RNA monomers, modeling quality was compared among AlphaFold3, RoseTTAFoldNA [31], RhoFold+ [33], NuFold [37], and trRosettaRNA [32]. For protein-nucleic acid complexes and RNA multimers, comparisons were limited to AlphaFold3 and RoseTTAFoldNA. Unless otherwise specified, all methods were run with default parameters.
Overall performance on protein monomer
We conducted a comparative analysis of AlphaFold3 and AlphaFold2 on 150 protein monomers (see Datasets for Benchmark-I) to assess their modeling performance in terms of global (TM-score) and local (LDDT) structure prediction. For each protein, the top-ranked models from both methods, determined by their default ranking scores, were selected for comparison. Overall, the server version of AlphaFold3, referred to as “AlphaFold3-server”, produced models with higher TM-scores than AlphaFold2 in 57.3% (86 out of 150) of the cases (Fig. 1A). However, the difference in global modeling quality between the two methods is not statistically significant, and the actual performance difference in terms of TM-score on this dataset is negligible: the average TM-scores were 0.817 and 0.812 for AlphaFold3-server and AlphaFold2, respectively (P = .16, Wilcoxon signed rank test, which is also used throughout this work for similar comparisons unless otherwise noted; see Table S10 for confidence intervals and other statistics). Similarly, the AlphaFold3 standalone package, referred to as “AlphaFold3-local”, outperformed AlphaFold2 in 51.3% (77 out of 150) of cases in terms of TM-score. However, this improvement was again marginal in magnitude and not statistically significant: the average TM-score for AlphaFold3-local was 0.812, compared to 0.812 for AlphaFold2 (P = .44; see Table S10). In terms of LDDT scores, AlphaFold3 showed larger and more robust improvements: AlphaFold3-server outperformed AlphaFold2 in 65.3% (98 out of 150) of the cases (Fig. 1B), with average LDDTs of 0.792 and 0.787 for AlphaFold3-server and AlphaFold2, respectively (P = 2.67 × 10⁻³; Table S10). Similarly, AlphaFold3-local achieved an average LDDT score of 0.790 and outperformed AlphaFold2 in 63.3% (95 out of 150) of cases (P = 3.77 × 10⁻³; Table S10). Although the differences in average scores appear small, they are statistically significant, providing stronger evidence that AlphaFold3 yields improved local structure prediction compared to AlphaFold2.
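The head-to-head win rates and paired significance tests reported above can be reproduced from per-target scores along the following lines. This is a minimal, dependency-free sketch: it drops zero differences, assigns average ranks to ties, and uses the normal approximation without tie or continuity corrections, all of which are assumptions, since the text does not state which variant of the Wilcoxon signed rank test was used. The per-target scores are toy values, not data from this study; for small samples such as this toy set, an exact test from a statistics package would be preferable.

```python
import math
from typing import Sequence, Tuple


def win_rate(a: Sequence[float], b: Sequence[float]) -> float:
    """Fraction of paired targets on which method a scores strictly higher than b."""
    return sum(x > y for x, y in zip(a, b)) / len(a)


def wilcoxon_signed_rank(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (W+, two-sided p-value) using the normal approximation."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the block while the absolute differences are tied.
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    return w_plus, math.erfc(abs(z) / math.sqrt(2.0))


# Toy per-target scores for two methods (invented for illustration).
method_a = [0.82, 0.91, 0.77, 0.65, 0.88, 0.93]
method_b = [0.80, 0.86, 0.70, 0.66, 0.84, 0.83]
print(win_rate(method_a, method_b), wilcoxon_signed_rank(method_a, method_b))
```

As a check on the arithmetic in the text, 86 wins out of 150 targets corresponds to the reported 57.3%. A paired test is appropriate here because both methods are scored on the same targets.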
Figure 1.
Comparison of AlphaFold3 vs. AlphaFold2 performance for protein monomer structure prediction (Benchmark-I). (A) and (B) Head-to-head comparison between AlphaFold3 and AlphaFold2 for TM-score and LDDT. (C) and (D) Head-to-head comparisons on 108 single-domain proteins for TM-score and LDDT. (E) and (F) Head-to-head comparisons on 42 multi-domain proteins for TM-score and LDDT. Here, the dot, triangle, and square represent the types of proteins: the dot for all protein monomers, the triangle for single-domain proteins, and the square for multi-domain proteins. The colors denote the comparisons: blue represents the comparison between AlphaFold3-server and AlphaFold2, while red indicates the comparison between AlphaFold3-local and AlphaFold2. (G) A representative example shows AlphaFold3 can build better full-length models in some isolated cases. A phage endolysin (PDB 8TW1) is shown, with the experimental structure colored in green, the AlphaFold3-server model in blue, the AlphaFold3-local model in hot pink, and the AlphaFold2 model in gray.
To further investigate the performance on different protein types, the 150 protein monomers were classified into 108 single-domain and 42 multi-domain proteins using DomainParser [40]. For the 108 single-domain proteins, AlphaFold3-server, AlphaFold3-local, and AlphaFold2 showed comparable performance, with no substantive differences (see Fig. 1C and D; see Table S11 for comparative statistics). Among the 42 multi-domain proteins, AlphaFold3 demonstrated slight improvements over AlphaFold2 in both global and local structure prediction (Fig. 1E and F), but only the improvement in local structural prediction accuracy was statistically significant (P = 5.53 × 10⁻⁴ for AlphaFold3-server and P = 3.47 × 10⁻⁴ for AlphaFold3-local; Table S12).
A noteworthy case study of the improved performance of AlphaFold3 relative to AlphaFold2 comes from the thermostable phage endolysin Lys2972 (which contains two domains; PDB ID: 8TW1 [55]). AlphaFold3 achieved substantially higher TM-scores (0.916 for the server version and 0.873 for the local version) for the full-length models compared to 0.562 for AlphaFold2 (Fig. 1G). While AlphaFold2 accurately predicted the individual domain structures, with TM-scores of 0.990 for the first domain and 0.840 for the second domain, it failed to model the correct orientation between the two domains. In contrast, AlphaFold3-server and AlphaFold3-local successfully predicted both the domain structures and their correct orientation, highlighting the enhanced capability of AlphaFold3 in modeling domain-domain interactions. In addition, for the three-domain target 8JGO, we observed a notable TM-score difference (~0.300) between AlphaFold3-server (0.934) and AlphaFold3-local (0.636). To determine whether this difference was inherent or random, we ran each version 20 times, generating 100 models. As shown in Fig. S1, both versions produced nearly identical TM-score and LDDT distributions. After 20 runs, their top-ranked models had identical TM-scores of 0.633. However, the TM-scores ranged widely from 0.6 to 1.0, suggesting that for this target, AlphaFold3’s predictions may be unstable, and its built-in model ranking struggles to distinguish high-quality models.
To provide a more comprehensive view of AlphaFold3’s capabilities, we also analyzed performance on “easy” versus “hard” targets, with classification guided by LOMETS3 [41]. These two classes of proteins are primarily distinguished by the availability of templates in the PDB and are often used to assess differences in modeling performance on targets of varying difficulty, such as in CASPs [23]. On the 86 “easy” targets, AlphaFold3 (both server and local versions) achieved higher average TM-scores (0.911 and 0.907, respectively) and LDDT values (both 0.860) compared with AlphaFold2 (TM-score = 0.906; LDDT = 0.854; see Table S13 and Fig. S2). However, statistical significance was observed only for the improvements in LDDT. In contrast, on the 64 “hard” targets, the performance of AlphaFold3 was largely comparable to that of AlphaFold2, with average TM-scores of 0.691 (AlphaFold3-server), 0.685 (AlphaFold3-local), and 0.687 (AlphaFold2), and nearly identical LDDT values (0.700, 0.697, and 0.697, respectively; see Table S14 and Fig. S2). The proportion of correctly predicted targets was marginally higher for AlphaFold2 (75.0%) than for AlphaFold3 (71.9% for server; 70.3% for local; see Table S14 and Fig. S2), although the differences did not reach statistical significance. In addition, these results indicate that both AlphaFold3 and AlphaFold2 perform considerably better on “easy” targets than on “hard” targets, with performance metrics on the easy set exceeding those on the hard set by ~20%–30%.
Furthermore, to more rigorously assess the performance of AlphaFold3 and AlphaFold2, we also minimized contamination from close homologs. Specifically, we selected 52 monomers from Benchmark-I that shared <25% sequence identity (ignoring coverage) with the AlphaFold3 training set and with each other, and compared the performance of AlphaFold3 and AlphaFold2 on this subset. On this subset, the performance of AlphaFold3 was largely comparable to that of AlphaFold2 (see Table S15 and Fig. S2), with no statistically significant differences in either global or local structural accuracy, closely resembling their performance on the “hard” targets. These results indicate that, after close homologs were rigorously excluded, AlphaFold3 does not confer a measurable advantage over AlphaFold2, with all three methods performing at a comparable level on this strictly filtered benchmark. For LDDT, AlphaFold3 significantly outperformed AlphaFold2 on the entire dataset, while no difference was seen on the more rigorous subset. For TM-score, no significant differences were observed across datasets.
Overall, AlphaFold3 and AlphaFold2 exhibit comparable performance in the global structure prediction of protein monomers, with no significant differences in their TM-scores. However, AlphaFold3 demonstrates a statistically significant improvement over AlphaFold2 in terms of LDDT. For single-domain proteins, AlphaFold3 does not provide a statistically significant improvement in either TM-score or LDDT compared to AlphaFold2. For multidomain proteins, by contrast, AlphaFold3 demonstrates a statistically significant yet minor improvement in LDDT over AlphaFold2. In addition, both AlphaFold2 and AlphaFold3 exhibit markedly different performance between the easy and hard target sets, with overall accuracy on the easy targets being ~20%–30% higher than that on the hard targets.
Single-sequence mode and orphan proteins
To evaluate modeling performance in the absence of any MSAs or templates, and thus to test prediction accuracy for entirely novel proteins, we compared AlphaFold2 and AlphaFold3 in single-sequence mode (i.e. without the use of MSAs or templates) across 150 protein monomers (Fig. S3 and Table S16; detailed results in Table S17). Overall, AlphaFold3 exhibited superior accuracy compared to AlphaFold2. The average TM-score was 0.413 for AlphaFold3-local and 0.353 for AlphaFold2 (P = 2.92 × 10−7). Similarly, the average LDDT values were 0.433 for AlphaFold3-local and 0.377 for AlphaFold2 (P = 1.96 × 10−9). Moreover, 26.0% of AlphaFold3-local predictions achieved TM-scores ≥0.5, compared to only 16.0% for AlphaFold2. Scatter plots of head-to-head comparisons (Fig. 2A and B) further illustrate this improvement, showing a systematic shift toward higher TM-scores and LDDT values for AlphaFold3. AlphaFold3 uses only four layers of EvoFormer for MSA processing and 48 layers of PairFormer for handling single- and pairwise information, which may reduce its reliance on MSA information. As a result, AlphaFold3 performs better than AlphaFold2 in single-sequence mode. Nevertheless, both models exhibited substantial difficulties in single-sequence mode, with most predictions falling below TM-score thresholds indicative of reliable folding. This underscores the intrinsic challenge of protein structure prediction in the absence of evolutionary and template information.
Figure 2.
Comparison of AlphaFold3 vs. AlphaFold2 performance for protein monomer structure prediction (Benchmark-I and Benchmark-II). (A) and (B) Head-to-head comparison between AlphaFold3 and AlphaFold2 for TM-score and LDDT on 150 protein monomers using single-sequence mode (without MSA or templates). (C) and (D) Head-to-head comparison between AlphaFold3 and AlphaFold2 for TM-score and LDDT on 102 orphan proteins. (E) and (F) Head-to-head comparisons on 14 orphan proteins without any detected MSA from AlphaFold3 pipeline.
While single-sequence mode represents the most extreme case of structure prediction without evolutionary or template information, orphan proteins provide a more realistic but still challenging test. We therefore compared the performance of AlphaFold3 and AlphaFold2 on a benchmark of 113 orphan proteins (Benchmark-II). AlphaFold3-server generated models with higher TM-scores than AlphaFold2 in 57.5% of cases (65/113; Fig. 2C). However, the overall difference in global accuracy was small and not statistically significant: the average TM-scores were 0.877 and 0.864 for AlphaFold3-server and AlphaFold2, respectively (P = .08; Table S18). AlphaFold3-local produced higher TM-scores in 52.2% of cases (59/113), with mean values of 0.871 versus 0.864 for AlphaFold2 (P = .36; Table S18), indicating no significant advantage. In contrast, improvements in local structural accuracy were consistent. AlphaFold3-server achieved higher LDDT scores in 60.2% of cases (68/113; Fig. 2D), with average values of 0.800 versus 0.786 for AlphaFold2 (P = .02; Table S18), suggesting a statistically significant but modest gain. AlphaFold3-local also outperformed AlphaFold2 in 60.2% of cases (68/113), with mean LDDT values of 0.794 versus 0.786 (P = .09; Table S18), indicating a trend toward improved local accuracy, although the difference did not reach statistical significance. Among the 113 orphan proteins, the 14 proteins without any detected homologous sequence in the MSA from the AlphaFold3 pipeline were selected for further analysis. On these 14 orphan proteins, AlphaFold2 showed slightly higher accuracy than AlphaFold3 in both global and local structure prediction (Fig. 2E and F), but the differences were not statistically significant (see Table S19). Thus, on this more stringent subset of our benchmark dataset, AlphaFold2 holds only a minor, nonsignificant advantage over AlphaFold3.
In addition, we found that, among these 14 proteins, a small number of homologous sequences could be detected by the AlphaFold2 pipeline for four cases. Notably, on these four targets AlphaFold2 consistently outperformed AlphaFold3. After excluding them, the remaining 10 proteins yielded mean TM-scores of 0.884, 0.878, and 0.869 for AlphaFold3-server, AlphaFold3-local, and AlphaFold2, respectively; the corresponding mean LDDT values were 0.793, 0.792, and 0.790. These observations further underscore the importance of detectable homology for accurate structure prediction of orphan proteins.
Overall, AlphaFold3 performs better than AlphaFold2 in single-sequence mode, likely because it simplifies MSA processing and thus reduces reliance on evolutionary information. However, for orphan proteins with only limited MSA data available, the performance of AlphaFold3 and AlphaFold2 shows no substantial difference.
Analysis of alternative conformation modeling
A multiconformation protein is a protein that can adopt different stable structures, which may be at equilibrium in the same environment or, as is often the case, may occur in response to environmental changes, ligand binding, or regulatory signals [56]. These conformations are essential for proteins with multiple functional states, such as many enzymes and receptors [42]. For example, a protein may switch between active and inactive forms, or between different binding affinities, depending on its conformational state [57]. These structural variations may play a crucial role in understanding a protein’s full range of functions and interactions within a biological system.
To evaluate the performance of the AlphaFold3 in modeling proteins with alternative conformations, nine sets of structures, each containing two distinct states from the M-SADA benchmark, were selected (see Dataset for Benchmark-III). Since the AlphaFold3 server can generate only five models per submission, both the AlphaFold3-server and AlphaFold3-local pipelines were run 20 times per protein sequence with different random seeds to increase the likelihood of sampling alternative states, generating a total of 100 models per version. Similarly, an equal number of models were generated by AlphaFold2 for a fair comparison, and the TM-score between each model and the two experimental states within the same protein sequence was calculated (Fig. 3 and Fig. S3).
Figure 3.
Comparison of AlphaFold3 vs. AlphaFold2 performance for predictions of multiple conformation proteins (Benchmark-III). (A) shows the best TM-scores of models in two different states for each protein across these methods. The details on each target can be found in Supplementary Fig. S3 (A) to (G1). (B) displays the best models for state 1, where each model is selected based on the highest difference between its TM-score to state 1 and TM-score to state 2. The y-axis, “specificity for state 1,” represents the value of (TM-score to state 1—TM-score to state 2) for the selected model of each protein, while the x-axis shows its TM-score to state 1. (C) displays the best models for state 2, where each model is chosen based on the highest difference between its TM-score to state 2 and TM-score to state 1. The y-axis, “specificity for state 2,” represents the value of (TM-score to state 2—TM-score to state 1) for the selected model of each protein, while the x-axis shows its TM-score to state 2. The models in the yellow-shaded region represent specific and highly accurate models for state 1 or state 2. (D) shows AlphaFold3’s ability to generate models approximating distinct conformational states and the PDB ID of this protein is 4TU7. The experimental structures of different states are colored in yellow and green, the AlphaFold3-server model in blue, the AlphaFold3-local model in hot pink, and the AlphaFold2 model in gray.
We observed highly varied results for different targets comprising the Benchmark-III dataset. Overall, the results on this dataset are summarized in Fig. 3A and details can be found in Table S3 and Fig. S3. For AlphaFold3-server, both conformational states were accurately predicted for 14 out of 60 proteins (23.3%), only one state was correctly predicted for 30 proteins (50.0%), and neither state was predicted for 16 proteins (26.7%). Similarly, for AlphaFold3-local, both conformational states were accurately predicted for 12 proteins (20.0%), only one state was correctly predicted for 32 proteins (53.3%), and neither state was predicted for 16 proteins (26.7%). For AlphaFold2, both states were accurately predicted for nine proteins (15.0%), one state was correctly predicted for 29 proteins (48.3%), and neither state was predicted for 22 proteins (36.7%). Moreover, when we consider the ability of the different pipelines to generate structures specific for one state or the other (i.e. those with a TM-score above 0.8 for one state and simultaneously below 0.8 for the other), AlphaFold3-server generated accurate and distinguishable models for a total of 32 states, AlphaFold3-local for 54, and AlphaFold2 for 43 (in each case out of a possible 120 states), as shown in the yellow-shaded regions of Fig. 3B and C. The results indicate that for target 1qr4 (chains A and B; Fig. S3H) [58], multiple runs of AlphaFold3-server, AlphaFold3-local, and AlphaFold2 generated models aligning with both conformational states, where the TM-scores of the best models for chains A and B across all methods are close to or exceed 0.900 (and thus, the models generally could not be distinguished between the two target conformations). In contrast, for target 4tu7 (chains A and B; Fig. S3E1) [59], only AlphaFold3-server generated models that aligned with both conformational states, with the TM-scores of its best models for chains A and B exceeding 0.850. 
Protein 4tu7 is a homodimer with two domains per chain, exhibiting two distinct structures for its constituent monomers due to different domain orientations, resulting in a TM-score of only 0.530 between their experimental monomer structures (Fig. 3D). AlphaFold3-server generated accurate models for both native structures, with its best models achieving TM-scores of 0.916 for chain A and 0.853 for chain B. In contrast, AlphaFold2’s best models had TM-scores of 0.772 for chain A and 0.574 for chain B, indicating that AlphaFold3-server can predict different inter-domain orientations in distinct states for this target, whereas AlphaFold2 cannot. Like AlphaFold2, AlphaFold3-local failed for chain B (TM-score of only 0.697), although it accurately predicted chain A with a TM-score of 0.882.
These results indicate that AlphaFold3’s predictions are relatively stable in a single state, leaving considerable room for improvement in predicting alternative conformations, and not substantively exceeding the performance of AlphaFold2. Recent research shows that AlphaFold2 can predict different conformational states by modifying MSA and template features [3, 60], a strategy that may also be applicable to future work with the standalone version of AlphaFold3. In the meantime, users should proceed with caution when using AlphaFold3 to model proteins which may have multiple conformational states, as the sensitivity to detect alternative conformations without modified/enhanced sampling approaches may be limited.
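The state-specificity criterion used for Fig. 3B and C (selecting, for each state, the model with the largest TM-score gap in favor of that state, and calling a model specific when its TM-score is above 0.8 for one state and below 0.8 for the other) can be illustrated with a minimal Python sketch. The function names and the example TM-scores are hypothetical and are not taken from the benchmark pipeline.

```python
from typing import List, Tuple

# Each sampled model is summarized as (TM-score to state 1, TM-score to state 2).
Model = Tuple[float, float]


def _indices(state: int) -> Tuple[int, int]:
    """Map a state label (1 or 2) to (index of that state, index of the other)."""
    if state not in (1, 2):
        raise ValueError("state must be 1 or 2")
    return (0, 1) if state == 1 else (1, 0)


def best_model_for_state(models: List[Model], state: int) -> Tuple[Model, float]:
    """Return the model with the largest TM-score gap in favor of `state`,
    together with that gap (the "specificity" plotted on the y-axis)."""
    i, j = _indices(state)
    best = max(models, key=lambda m: m[i] - m[j])
    return best, best[i] - best[j]


def is_state_specific(model: Model, state: int, cutoff: float = 0.8) -> bool:
    """True if the model scores above `cutoff` for `state` and below it for the other."""
    i, j = _indices(state)
    return model[i] > cutoff and model[j] < cutoff


# Hypothetical TM-scores for three sampled models of one protein (illustrative only).
models = [(0.92, 0.55), (0.85, 0.83), (0.60, 0.88)]
best1, spec1 = best_model_for_state(models, 1)
best2, spec2 = best_model_for_state(models, 2)
print(best1, round(spec1, 2), is_state_specific(best1, 1))
print(best2, round(spec2, 2), is_state_specific(best2, 2))
```

In this toy example the second model scores above 0.8 against both states and is therefore accurate but not specific, which mirrors the 1qr4 case described above.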
Overall performance on protein multimers
We evaluated the performance advancements of AlphaFold3 in predicting protein complex structures by comparing it with its predecessor, AlphaFold-Multimer [7, 26], on a dataset of 206 protein multimers (see Dataset for Benchmark-IV). Our assessment focused on three key aspects: global topology modeling quality (reflected by TM-score), local modeling quality (reflected by LDDT), and interface modeling quality (reflected by DockQ).
On the Benchmark-IV dataset, AlphaFold3-server and AlphaFold3-local generated models with higher TM-scores than AlphaFold-Multimer for 49.0% (101/206) and 52.4% (108/206) of the targets, respectively. Conversely, AlphaFold-Multimer achieved higher TM-scores for 51.0% (105/206) and 47.6% (98/206) of the targets compared to AlphaFold3-server and AlphaFold3-local, respectively. Among these, AlphaFold3-server and AlphaFold3-local outperformed AlphaFold-Multimer by more than 0.05 TM-score on 19 and 22 targets, respectively (Fig. 4A). In comparison, AlphaFold-Multimer achieved this on 21 targets against AlphaFold3-server and 20 targets against AlphaFold3-local. AlphaFold3-server achieved an average TM-score of 0.821, a 1.1% improvement over AlphaFold-Multimer’s average TM-score of 0.812. However, the Wilcoxon signed-rank test (P = .49; see Table S20) indicates that there is no statistically significant difference between them. Similarly, AlphaFold3-local showed no significant difference from AlphaFold-Multimer (average TM-score = 0.824, P = .74; see Table S20). Additionally, AlphaFold3-server, AlphaFold3-local, and AlphaFold-Multimer produced a comparable number of “correct global fold” models (TM-score > 0.5), with 174, 178, and 173, respectively (see Table S21). Taken together, these findings suggest that AlphaFold-Multimer is on par with the AlphaFold3 pipelines for multimer targets in terms of global fold quality.
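A head-to-head tally of this kind (how often each method scores higher, and how often it does so by more than 0.05) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the dictionary keys, and the four example TM-scores are hypothetical, and how exact ties were handled in our counts is not specified here.

```python
from typing import Dict, Sequence


def head_to_head(a: Sequence[float], b: Sequence[float],
                 margin: float = 0.05) -> Dict[str, int]:
    """Count per-target wins of method A and method B on paired scores,
    plus wins by more than `margin`."""
    if len(a) != len(b):
        raise ValueError("score lists must be paired (same length)")
    tally = {"a_higher": 0, "b_higher": 0, "tied": 0,
             "a_by_margin": 0, "b_by_margin": 0}
    for x, y in zip(a, b):
        d = x - y
        if d > 0:
            tally["a_higher"] += 1
            if d > margin:
                tally["a_by_margin"] += 1
        elif d < 0:
            tally["b_higher"] += 1
            if -d > margin:
                tally["b_by_margin"] += 1
        else:
            tally["tied"] += 1
    return tally


# Hypothetical TM-scores for four targets (illustrative only, not benchmark data).
method_a = [0.95, 0.80, 0.60, 0.88]
method_b = [0.85, 0.82, 0.60, 0.30]
tally = head_to_head(method_a, method_b)
print(tally)
```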
Figure 4.
Comparison of AlphaFold3 vs. AlphaFold-Multimer performance for protein complex structure prediction (Benchmark-IV). We show head-to-head comparisons between AlphaFold3 and AlphaFold-Multimer for TM-score (A), LDDT (B), and DockQ (C), respectively. Here, the colors denote the comparisons: blue represents the comparison between AlphaFold3-server and AlphaFold-Multimer, while red indicates the comparison between AlphaFold3-local and AlphaFold-Multimer.
Local modeling quality was evaluated using LDDT scores, where AlphaFold3-server and AlphaFold3-local showed modest improvements of 1.1% and 1.2%, respectively, over AlphaFold-Multimer. The average LDDT scores were 0.860 for AlphaFold3-server, 0.861 for AlphaFold3-local, and 0.851 for AlphaFold-Multimer. Despite the close average LDDT scores, the Wilcoxon signed-rank tests indicate that the differences between AlphaFold3 and AlphaFold-Multimer are statistically significant for both AlphaFold3 versions, with P = 2.42 × 10−4 for AlphaFold3-server and P = 8.55 × 10−4 for AlphaFold3-local (Fig. 4B and Table S20). These results suggest that AlphaFold3 consistently outperforms AlphaFold-Multimer, even if the improvement is small in magnitude.
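To make concrete how a small mean difference can still be statistically significant when it is consistent across paired targets, the following minimal Python sketch implements a Wilcoxon signed-rank test with the normal approximation. It is a simplified illustration, not the implementation used for the reported P-values: it drops zero differences, assigns average ranks to ties, and omits tie and continuity corrections. The function name and the example scores are hypothetical.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (W_plus, two-sided P) for paired samples via the normal approximation.

    Simplifications (assumptions of this sketch): zero differences are dropped,
    tied absolute differences share an average rank, and no tie or continuity
    correction is applied. The approximation is rough for small samples.
    """
    if len(a) != len(b):
        raise ValueError("samples must be paired (same length)")
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, p_value


# Hypothetical paired LDDT values: method A is slightly but consistently higher.
baseline = [0.5] * 10
improved = [0.5 + 0.01 * k for k in range(1, 11)]
w_plus, p_value = wilcoxon_signed_rank(improved, baseline)
print(w_plus, p_value)
```

Because the test depends on the signs and ranks of the paired differences rather than on their magnitude, gains of about 1% in mean LDDT can reach significance when most targets shift in the same direction.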
Interface modeling quality was evaluated by DockQ scores. AlphaFold3-server achieved an average DockQ score of 0.589, a 3.7% improvement over AlphaFold-Multimer’s 0.568 (see Table S20 and Fig. 4C). When considering DockQ scores ≥0.23 as indicative of a “correct interface”, AlphaFold3-server correctly modeled 78.2% (161 out of 206) of protein complexes, compared to 71.8% (148 out of 206) for AlphaFold-Multimer (see Table S21). Despite some improvements in DockQ, a two-sample proportion test (P = .14) and a Wilcoxon signed-rank test (P = .19; see Table S20) indicate that the difference is not statistically significant. The results for AlphaFold3-local are similar to those for AlphaFold3-server. This suggests that the observed improvement may result from random variation rather than a consistent performance advantage. For the 206 targets, the correlation between the interface predicted TM-score (ipTM) and DockQ was calculated for all methods, showing strong correlations: 0.81 for AlphaFold3-server, 0.79 for AlphaFold3-local, and 0.76 for AlphaFold-Multimer (Fig. S4). These strong correlations indicate that ipTM is a reliable predictor of interface modeling quality across all methods, with AlphaFold3 demonstrating slightly stronger correlation than AlphaFold-Multimer.
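The comparison of “correct interface” rates (161/206 versus 148/206) can be checked with a two-sample proportion test. The sketch below assumes a pooled two-sided z-test, which is one common form of this test; the exact variant used for the reported P-value is not stated here, and the function name is hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> Tuple[float, float]:
    """Pooled two-sample z-test for proportions; returns (z, two-sided P)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Counts from the text: DockQ >= 0.23 for 161/206 (AlphaFold3-server)
# versus 148/206 (AlphaFold-Multimer).
z, p_value = two_proportion_z_test(161, 206, 148, 206)
print(round(z, 2), round(p_value, 2))  # approximately z = 1.48, P = 0.14
```

Under this assumption the test gives P ≈ .14, consistent with the value reported above.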
Although AlphaFold3-server and AlphaFold3-local show no overall significant difference, we observed notable performance discrepancies on certain targets. For example, AlphaFold3-server outperformed the local version on 8VA1 (TM-score: 0.976 versus 0.304), while the reverse was true for 8JF2 (0.289 versus 0.779). To assess variability, we ran both versions 20 times on each target. As shown in Fig. S5, LDDT distributions were consistently tight and similar, but TM-scores varied widely (∼0.2–1.0). For 8VA1, the top model from server runs reached 0.992, while the local version achieved only 0.666. For 8JF2, both versions did produce some high-quality models, but only rarely, and they failed to select them as the top-ranked. These results highlight potential instability in AlphaFold3 on certain targets and limitations in its model ranking; based on our numerical experiments, it appears likely that the AlphaFold3-server and -local versions are sampling from the same model space, but for some targets may yield different quality models when limited numbers of candidate structures are generated.
In summary, while AlphaFold3 demonstrates comparable (but not substantively improved) performance to AlphaFold-Multimer in terms of global fold and interface quality, it does exhibit a small but significant increase in modeling quality for local structures.
Comparison of homomeric and heteromeric protein complex modeling
The 206 protein complexes in the multimeric Benchmark-IV dataset were categorized into 35 heteromers and 171 homomers, where heteromers consist of at least two different proteins, and homomers are solely composed of multiple copies of a single protein. We compared AlphaFold3 and AlphaFold-Multimer’s performance across both groups to identify any differences in their predictive capabilities.
In terms of global modeling accuracy, AlphaFold3 exhibited slightly better performance on average for both heteromers and homomers, though the differences were not statistically significant in either case (see Table S22 for heteromers and Table S23 for homomers). The pipelines also achieved rough parity in the head-to-head comparisons (Fig. 5A and D), with only a few targets exhibiting substantial differences (ΔTM-score > 0.05; see Tables S24–S27). Additionally, the number of “correct fold” and “high quality” models produced by AlphaFold3 was only modestly different from those of AlphaFold-Multimer for both heteromers and homomers (see Tables S28 and S29). These results indicate that AlphaFold3 does not achieve a significant enhancement in global modeling quality compared to AlphaFold-Multimer, regardless of whether the targets are heteromers or homomers.
Figure 5.
Comparison of AlphaFold3 vs. AlphaFold-Multimer performance for heteromer and homomer structure prediction (Benchmark-IV). (A) and (B) Head-to-head comparison of TM-score between AlphaFold3 and AlphaFold-Multimer on 35 heteromers and 171 homomers. (C) and (D) Head-to-head comparison of LDDT between AlphaFold3 and AlphaFold-Multimer on heteromers and homomers. (E) and (F) Head-to-head comparison of DockQ between AlphaFold3 and AlphaFold-Multimer on heteromers and homomers. Here, the square and triangle represent the types of protein complexes: the square for heteromers and the triangle for homomers. The colors denote the comparisons: blue represents the comparison between AlphaFold3-server and AlphaFold-Multimer, while red indicates the comparison between AlphaFold3-local and AlphaFold-Multimer. (G) Structural models of the homodimer 8FV3 generated by AlphaFold3 and AlphaFold-Multimer. (H) Structural models of the homodimer 8JGX generated by AlphaFold3 and AlphaFold-Multimer. In panels G and H, the experimental structure is shown in surface representation and colored gray, while the predicted model is displayed in cartoon representation, with the two chains colored sand and teal.
For local modeling accuracy on heteromers, the average LDDT values for both AlphaFold3-server and AlphaFold3-local were marginally higher than that of AlphaFold-Multimer, but the difference was not statistically significant in either case (Fig. 5B and Table S22). On homomers, we saw a quantitatively small but statistically significant improvement of both AlphaFold3 pipelines over AlphaFold-Multimer (Fig. 5E and Table S23).
For protein–protein interaction modeling quality on heteromers, AlphaFold3-server achieved an average DockQ score of 0.640, representing a 9.4% improvement over AlphaFold-Multimer (P = .04; see Table S22); similarly, AlphaFold3-local achieved an average DockQ score of 0.656, outperforming AlphaFold-Multimer with a 12.1% improvement (P = 4.25 × 10−3). In contrast, for homomers, AlphaFold3-server’s DockQ score averaged 0.579, only 2.7% higher than AlphaFold-Multimer (P = .72), and AlphaFold3-local had an average DockQ = 0.578 (P = .76; see Table S23). These results indicate that AlphaFold3 demonstrated a relatively greater improvement in heteromeric interactions compared to homomers, suggesting that its advantage in interface modeling is more pronounced in heteromer modeling (Fig. 5C and F).
Overall, AlphaFold3 displayed improved accuracy in local structure for homomers and in interface prediction for heteromers. However, both methods showed comparable performance in global structure prediction. It should also be noted that, as can be seen in Figs 4 and 5, for the vast majority of cases the performance of the AlphaFold3 and AlphaFold-Multimer pipelines is quite similar, and the differences are driven by a small minority of cases, with a nearly even mix of individual cases in which AlphaFold3 or AlphaFold-Multimer was better (see Tables S24–S27). The complementary strengths of AlphaFold3 and AlphaFold-Multimer, particularly evident in certain cases like homodimers, suggest that combining their predictions could lead to better overall results. For example, for the homodimer 8FV3 (Fig. 5G) [61], AlphaFold3-server achieved a high TM-score (0.974) and DockQ (0.735), indicating accurate predictions of both the overall structure and protein–protein interactions, supported by high confidence scores (pTM = 0.820 and ipTM = 0.770). In contrast, AlphaFold-Multimer’s TM-score (0.475) and DockQ (0.010), despite a monomeric TM-score of 0.885, indicate that it failed to predict the correct subunit orientation, with low ipTM (0.398) and pTM (0.628) scores reflecting this low-confidence structure. For the homodimer 8JGX (Fig. 5H) [62], the situation was reversed: AlphaFold3-server did not accurately model the complex, with TM-score and DockQ values of 0.507 and 0.159, respectively, due to inaccuracies in monomer prediction. Similarly, AlphaFold3-local failed to model the complex accurately, yielding a TM-score of 0.506 and a DockQ of 0.159. Conversely, AlphaFold-Multimer performed well, with TM-score and DockQ values of 0.950 and 0.767, respectively. In both cases, confidence scores (pTM and ipTM) effectively distinguish the superior model, even when comparing across the two programs.
These two cases demonstrate that each method has distinct advantages depending on the target, and this complementary behavior underscores the potential benefit of integrating the results from both models. Nevertheless, for both the 35 heteromers and the 171 homomers, pooling the models from AlphaFold-Multimer and AlphaFold3 and selecting the final model by the highest confidence score (calculated as 0.8 × ipTM + 0.2 × pTM, as used in AlphaFold-Multimer) did not lead to a significant improvement in TM-score (see Table S30). This highlights the importance of developing efficient estimation of model accuracy (EMA) methods specifically for the AlphaFold-Multimer and AlphaFold3 pipelines to better select high-quality structural models.
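The confidence-based selection described above can be written out as a short Python sketch. The weighting 0.8 × ipTM + 0.2 × pTM is taken from the text; the function names and the dictionary layout are hypothetical, and the example uses the ipTM and pTM values reported for 8FV3.

```python
from typing import Dict, Tuple


def confidence(iptm: float, ptm: float) -> float:
    """Ranking confidence as used in AlphaFold-Multimer: 0.8 * ipTM + 0.2 * pTM."""
    return 0.8 * iptm + 0.2 * ptm


def select_model(candidates: Dict[str, Tuple[float, float]]) -> Tuple[str, float]:
    """Pick the candidate with the highest confidence.

    `candidates` maps a label to an (ipTM, pTM) pair.
    """
    if not candidates:
        raise ValueError("no candidate models supplied")
    best = max(candidates, key=lambda name: confidence(*candidates[name]))
    return best, confidence(*candidates[best])


# ipTM and pTM values reported in the text for the homodimer 8FV3.
candidates_8fv3 = {
    "AlphaFold3-server": (0.770, 0.820),
    "AlphaFold-Multimer": (0.398, 0.628),
}
chosen, score = select_model(candidates_8fv3)
print(chosen, round(score, 3))
```

For 8FV3 this rule selects the AlphaFold3-server model (confidence of about 0.78 versus about 0.44), which is also the model with the higher TM-score and DockQ.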
As a special class of heteromeric protein complexes, peptide-protein assemblies play crucial roles in cellular regulation and peptide-based drug development. We therefore further evaluated the performance of AlphaFold3 and AlphaFold-Multimer on a benchmark of 80 peptide-protein complexes (Benchmark-V). Overall, the three methods exhibited nearly identical performance, with average TM-scores of ~0.91 to 0.92, average LDDT values of ~0.86, and average DockQ scores of ~0.63 (Fig. S6 and Table S31). The head-to-head comparison scatterplots revealed strong concordance between AlphaFold3 and AlphaFold-Multimer predictions in terms of TM-score and LDDT. Moreover, success rates across different quality thresholds were also comparable: more than 95% of predictions achieved a TM-score ≥0.5, and ~85% achieved a DockQ ≥0.23 (see Table S32). However, when focusing on high-quality models, over 85% of predictions reached a TM-score ≥0.80, whereas only ~40% achieved a DockQ ≥0.80; that is, the success rate was reduced by roughly half when evaluated by DockQ rather than TM-score at the same threshold. This discrepancy arises because DockQ specifically measures the accuracy of the predicted interaction interface, whereas TM-score and LDDT are computed over the whole structure. In peptide-protein complexes, the peptide represents only a small fraction of the overall structure. Thus, whole-structure metrics such as TM-score and LDDT are dominated by the accuracy of the protein scaffold and are largely insensitive to errors in peptide orientation or binding pose. Consequently, AlphaFold3 and AlphaFold-Multimer exhibit highly similar performance when assessed by TM-score or LDDT, reflecting comparable accuracy in modeling the protein component. However, interface-focused evaluation using DockQ reveals substantial variability across targets.
Although the overall average DockQ scores are comparable and show no significant difference, individual complexes often display large discrepancies between the two methods (Fig. S6C). This highlights that while both models achieve similar global accuracy, their ability to capture peptide-protein interaction details is not always consistent, which is different from the performance on the 35 general protein hetero-oligomers.
Performance on antigen-antibody complexes
Antigen-antibody recognition is a biologically important problem of broad interest, yet accurate structural modeling remains challenging. To evaluate potential improvements, we constructed a curated benchmark dataset of 102 antigen-antibody complexes (see Dataset for Benchmark-VI) and compared the performance of AlphaFold3 with AlphaFold-Multimer. Performance was evaluated using multiple metrics: global structural accuracy (TM-score), local modeling quality (LDDT), and interface quality (DockQ). To specifically assess the accuracy of the antibody binding regions, we further introduced CDR-TMscore (see Metrics).
AlphaFold3 demonstrated clear improvements over AlphaFold-Multimer in global structural accuracy. Both the server and local versions achieved higher average TM-scores (0.739 and 0.735, respectively), representing improvements of 8.1% and 7.5% over AlphaFold-Multimer’s 0.683, with significant differences (P = 6.82 × 10−4 and 3.41 × 10−4 for the server and local versions, respectively, relative to AlphaFold-Multimer; see Table S33 and Fig. 6A). In pairwise comparisons, AlphaFold3 outperformed AlphaFold-Multimer in 65 (server) and 66 (local) complexes. Using a stricter threshold of TM-score improvement >0.05, AlphaFold3 surpassed AlphaFold-Multimer in 33 (server) and 31 (local) cases, versus only 11 and 9 for AlphaFold-Multimer, underscoring its enhanced reliability in global modeling.
Figure 6.
Comparison of AlphaFold3 vs. AlphaFold-Multimer performance for antigen-antibody complex structure prediction (Benchmark-VI). (A)–(D) Head-to-head comparison of TM-score, LDDT, DockQ, and CDR-TMscore between AlphaFold3 and AlphaFold-Multimer on the 102 antigen-antibody complexes. (E) Structural models of the 8ts0 generated by AlphaFold3 and AlphaFold-Multimer. In panel E, the experimental structure is shown in surface representation and colored gray, while the predicted model is displayed in cartoon representation, with teal for antigen, sand for heavy chain, and hot pink for light chain, respectively.
In terms of local modeling quality, improvements were modest. AlphaFold3-server and AlphaFold3-local achieved average LDDT scores of 0.828 and 0.828, respectively, representing slight improvements of 1.9% and 2.0%, over AlphaFold-Multimer (0.812). Despite the small magnitude of the improvements, the differences were statistically significant for both AlphaFold3 versions, with P = 3.92 × 10−9 for AlphaFold3-server and P = 2.80 × 10−9 for AlphaFold3-local (see Table S33 and Fig. 6B). Overall, these results indicate that AlphaFold3 provides slightly enhanced local structural accuracy compared with AlphaFold-Multimer.
Notably, the most substantial gains were observed in interface modeling. AlphaFold3 achieved higher average DockQ scores of 0.422 (server) and 0.418 (local), corresponding to 29.8% and 28.6% improvements over AlphaFold-Multimer (0.325), with statistically significant differences (P = 2.36 × 10−5 and 2.00 × 10−5, respectively; see Table S33 and Fig. 6C). Nevertheless, absolute interface modeling accuracy remained limited: using the conventional DockQ score threshold of 0.23 to define a “correct interface”, AlphaFold3 correctly modeled only 27.5% (server) and 28.4% (local) of antigen-antibody complexes, whereas AlphaFold-Multimer performed even worse, with just 9.8% correctly modeled. This highlights that high-fidelity prediction of antigen-antibody interfaces continues to pose a major challenge.
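The percentage improvements quoted in this section are consistent with a simple relative gain of the mean score over the baseline mean. The short Python sketch below assumes that definition (it is inferred from the reported numbers rather than stated explicitly) and applies it to the mean DockQ values given above; the function name is hypothetical.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Percentage gain of `new` over `baseline` (assumed definition)."""
    if baseline == 0:
        raise ValueError("baseline must be nonzero")
    return 100.0 * (new - baseline) / baseline


# Mean DockQ scores from the text: AlphaFold3-server, AlphaFold3-local,
# and AlphaFold-Multimer on the antigen-antibody benchmark.
server_gain = round(relative_improvement(0.422, 0.325), 1)
local_gain = round(relative_improvement(0.418, 0.325), 1)
print(server_gain, local_gain)  # 29.8 28.6
```

Because the means in the text are rounded to three decimals, values recomputed this way can differ from the reported percentages in the last digit for some metrics.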
Given the critical role of complementarity determining region (CDR) loops in antigen recognition, we further evaluated their structural accuracy using CDR-TMscore. AlphaFold3-server and AlphaFold3-local achieved higher average scores (0.183 and 0.176, respectively; see Table S33 and Fig. 6D) compared with AlphaFold-Multimer (0.074), with significant differences (P = 2.27 × 10−4 and P = 8.94 × 10−6). When applying a threshold of CDR-TMscore improvement >0.05, AlphaFold3-server and AlphaFold3-local outperformed AlphaFold-Multimer for 33 and 30 complexes, respectively, whereas AlphaFold-Multimer was better for only 9 and 8 complexes. Despite these significant improvements, the overall accuracy remained limited, with correct CDR folds (CDR-TMscore ≥0.5) achieved in only 18.6% (server) and 17.6% (local) of cases, and AlphaFold-Multimer performed even worse at just 7.8%.
Metrics such as CDR-TMscore, which specifically evaluate antigen-antibody interface quality, provide a more accurate assessment of binding site predictions. The complex 8ts0, comprising the human ASGR1 CRD bound to the 8M24 Fab, illustrates this point. Both versions of AlphaFold3 generated near-native global models (TM-score = 0.974 and 0.976), whereas AlphaFold-Multimer achieved a TM-score of 0.810, which was also acceptable in terms of global accuracy. The LDDT scores were similarly high, with only minor differences between AlphaFold3 (0.957 and 0.949) and AlphaFold-Multimer (0.910). However, striking differences emerged at the binding interface. As shown in Fig. 6E, all predicted models were superposed onto the native structure based on the antigen. AlphaFold3 nearly perfectly reproduced the CDR loops involved in antigen binding, achieving CDR-TMscores of 0.840 (server) and 0.887 (local), whereas AlphaFold-Multimer completely failed to predict the binding sites, with a CDR-TMscore of only 0.003.
In summary, the benchmark analysis demonstrates that AlphaFold3 exhibits clear improvements over AlphaFold-Multimer in the modeling of antigen-antibody complexes, particularly at the interface level. Nevertheless, AlphaFold3 continues to face challenges in accurately recapitulating the fine-grained conformational details of antigen-antibody interfaces, and achieving high-fidelity prediction of these interaction interfaces remains a significant challenge.
Performance on RNA
On the constructed nonredundant RNA dataset comprising 52 RNA monomers (see Dataset for Benchmark-VII), we evaluated the modeling performance of five deep learning-based RNA structure prediction methods: AlphaFold3, RoseTTAFoldNA, RhoFold+, NuFold, and trRosettaRNA. These 52 RNA targets are further divided into 22 “easy” targets and 30 “hard” targets, based on the quality of the detected templates. For all five methods, the top models were selected for comparison using each method’s default scoring criterion.
In terms of global modeling accuracy, AlphaFold3 occupied an intermediate position. trRosettaRNA achieved the highest mean TM-score (0.548), followed by NuFold (0.525). AlphaFold3 yielded mean TM-scores of 0.510 (server) and 0.504 (local), slightly trailing the top two methods, but clearly outperforming RoseTTAFoldNA (0.449) and RhoFold+ (0.431), as shown in Table S34. This performance trend was consistent across difficulty categories. On the 30 “hard” targets, trRosettaRNA again led with a mean TM-score of 0.457, while AlphaFold3 and NuFold showed comparable performance (both around 0.40), RoseTTAFoldNA and RhoFold+ still exhibited substantially lower accuracy, as shown in Table S35. On the 22 “easy” targets, all methods improved markedly, with NuFold achieving the highest mean TM-score (0.700), followed by trRosettaRNA (0.672) and AlphaFold3 (0.654 and 0.652), as shown in Table S36 and Fig. 7.
Figure 7.
Comparison of the performance of AlphaFold3, RoseTTAFoldNA, NuFold, RhoFold+, and trRosettaRNA in RNA structure prediction (Benchmark-VII). (A–C) Distributions of TM-scores, LDDT, and INF for all methods on 52 RNA monomers. (D–O) Head-to-head comparisons between AlphaFold3 and RoseTTAFoldNA, NuFold, RhoFold+, and trRosettaRNA in terms of TM-score, LDDT and INF, respectively. Here, the square and triangle represent the types of RNA targets: the square for “easy” targets, and the triangle for “hard” targets. The colors denote the comparisons: blue represents the comparison between AlphaFold3-server and other methods, while red indicates the comparison between AlphaFold3-local and other methods. (P–V) Structures of the RNA target (7URM) from experiment (P), AlphaFold3-server (Q), AlphaFold3-local (R), RoseTTAFoldNA (S), NuFold (T), RhoFold+ (U), and trRosettaRNA (V). To show the details of the predicted models more intuitively and compare them with the native structure, the native structure is displayed as a gray cartoon and the predicted models are shown in color.
In contrast, local modeling accuracy, assessed using the LDDT metric, revealed a clear advantage for AlphaFold3. Both versions achieved the highest average LDDT scores (0.713 for server and 0.715 for local), representing improvements of at least 13.0% over the next-best method. Wilcoxon signed-rank tests indicated that both versions yielded significantly higher local modeling accuracy than each of the four competing methods. After Bonferroni correction for multiple comparisons within each version, all pairwise comparisons remained significant (adjusted P < 10⁻⁴). Moreover, when comparing per-target performance against the best result among the other four specialized RNA predictors, AlphaFold3 maintained a clear advantage: AlphaFold3-server yielded a higher LDDT in 39 of 52 cases (75.0%), and AlphaFold3-local did so in 41 cases (78.8%). These results demonstrate that AlphaFold3 consistently provides superior local structural accuracy for RNA monomers compared to the other four methods.
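The statistical procedure described above (a paired Wilcoxon signed-rank test per competing method, followed by Bonferroni correction) can be illustrated with a minimal Python sketch. This is not the analysis code used in this study: the function names are hypothetical, the test uses the large-sample normal approximation without tie or continuity corrections (an assumption made for brevity), and the per-target scores are made-up numbers rather than benchmark data.

```python
import math

def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test on paired scores.

    Assumption: large-sample normal approximation, no tie or continuity
    correction. Returns (W_plus, p_value).
    """
    # Paired differences; zero differences are dropped, as is conventional.
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank |d| in ascending order, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of N(0, 1)
    return w_plus, p_value

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Illustrative (made-up) per-target LDDT values, not data from this study.
method_a = [0.71, 0.65, 0.80, 0.58, 0.74, 0.69, 0.77, 0.62, 0.83, 0.70]
method_b = [0.60, 0.61, 0.66, 0.55, 0.59, 0.63, 0.60, 0.60, 0.64, 0.62]
w_plus, p = wilcoxon_signed_rank(method_a, method_b)

# Four pairwise comparisons, as for one AlphaFold3 version against four
# competing methods; the last three p-values are placeholders.
adjusted = bonferroni([p, 0.5, 0.02, 0.001])
print(w_plus, p, adjusted)
```

With only ten paired targets the normal approximation is rough; for real analyses an established statistics package with exact or tie-corrected p-values would be the safer choice.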
RNA secondary structure modeling quality was evaluated using the INF metric. Consistent with the LDDT results, AlphaFold3 again demonstrated leading performance, with both versions achieving the highest mean INF scores (0.814 for server and 0.816 for local), at least 2.9% above the next-best method. Wilcoxon signed-rank tests confirmed that this advantage was statistically significant across all pairwise comparisons (adjusted P < 10⁻² after Bonferroni correction). Moreover, AlphaFold3 yielded higher INF scores than the best-performing alternative in 34 of 52 targets (65.4%) for both versions, reinforcing its consistent lead in capturing RNA secondary structure.
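For readers unfamiliar with INF, the metric is commonly defined as the geometric mean of the positive predictive value and the sensitivity of the predicted interaction set relative to the native one. The sketch below illustrates that definition only; the function name and the toy base-pair lists are hypothetical, and which interaction types are counted (for example canonical pairs, noncanonical pairs, or stacking) depends on the annotation tool, which is not modeled here.

```python
import math

def interaction_network_fidelity(native_pairs, predicted_pairs):
    """INF = sqrt(PPV * sensitivity) over sets of interacting residue pairs."""
    # Treat each interaction as unordered, so (1, 20) equals (20, 1).
    native = {frozenset(pair) for pair in native_pairs}
    predicted = {frozenset(pair) for pair in predicted_pairs}
    tp = len(native & predicted)   # interactions present in both
    fp = len(predicted - native)   # predicted but not native
    fn = len(native - predicted)   # native but missed
    if tp == 0:
        return 0.0
    ppv = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return math.sqrt(ppv * sensitivity)

# Toy example (not benchmark data): three of four native pairs recovered,
# plus one spurious pair.
native = [(1, 20), (2, 19), (3, 18), (4, 17)]
predicted = [(20, 1), (2, 19), (3, 18), (5, 16)]
inf_score = interaction_network_fidelity(native, predicted)
print(inf_score)
```

Because INF penalizes both missed and spurious interactions, a model that pairs bases into incorrect helices scores low even when many native pairs are present.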
To illustrate the performance differences among methods on a representative RNA target, we consider chain “v” of PDB 7URM [63], which is the P-site bound structure of the synthetic tRNAUTu1A. In this case, AlphaFold3 outperformed all other methods, achieving the highest TM-scores (0.727 for server and 0.613 for local), LDDT scores (0.796 and 0.797), and INF scores (0.93 and 0.94). Notably, AlphaFold3 is able to model the overall structure with reasonable accuracy, even correctly capturing the highly flexible variable loop, as shown in Fig. 7Q and R. INF values >0.90 indicate that the secondary structure was also predicted with high accuracy, including the crucial elbow region interactions. The other three methods (trRosettaRNA, NuFold, and RhoFold+) also generated acceptable models. In contrast, RoseTTAFoldNA struggled with this target, yielding a substantially lower TM-score of 0.312 and an LDDT of 0.477. As shown in Fig. 7S, the structure predicted by RoseTTAFoldNA deviates markedly from the native structure. While RoseTTAFoldNA correctly identified the acceptor arm, T-arm, and anticodon stem, it failed to model the variable loop and D-arm, incorrectly modeling the bases as forming double helices. These incorrect predictions of local interactions led to a substantially reduced INF score of 0.66.
There is no significant overall difference between AlphaFold3-server and AlphaFold3-local on the 52 RNA targets. However, we observed that AlphaFold3-server performs considerably better on some targets, such as 7URM_v, while AlphaFold3-local outperforms the server version on others, such as 8UPT_A. To further analyze these differences, we ran both versions 20 times on these two targets (Fig. S7). The TM-score distributions of models generated by AlphaFold3-server and AlphaFold3-local are very similar, with most values concentrated between 0.5 and 0.7. After 20 runs, the TM-score gap between the top-ranked models from the two versions was reduced from 0.114 to 0.037 for 7URM_v, and from 0.065 to 0.019 for 8UPT_A. Thus, as in the case of the protein models noted above, both the -server and -local implementations appear to sample from similar model distributions, but, especially for difficult targets, substantial variations may arise in the generated structures when the number of models tested is small. Likewise, improved EMA methods would be of substantial benefit in ensuring that the truly best model is chosen from each pool of candidate structures.
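The repeated-run comparison above can be sketched in a few lines of Python. The sketch assumes each run is summarized by a pair (ranking score, TM-score to the native structure); the function names and all numbers are hypothetical and serve only to show how the gap between top-ranked models can shrink as the pool grows, and how a ranking loss (best sampled TM-score minus top-ranked TM-score) quantifies what a better EMA method could recover.

```python
def top_ranked_tm(models, n_runs=None):
    """TM-score of the model with the highest ranking score.

    models: list of (ranking_score, tm_score) tuples, one per run.
    n_runs: if given, only the first n_runs models are considered.
    """
    pool = models if n_runs is None else models[:n_runs]
    return max(pool, key=lambda m: m[0])[1]

def ranking_loss(models):
    """Best sampled TM-score minus the TM-score of the top-ranked model."""
    best_tm = max(tm for _, tm in models)
    return best_tm - top_ranked_tm(models)

# Made-up (ranking_score, tm_score) pools, not data from this study.
server_runs = [(0.62, 0.70), (0.58, 0.55), (0.66, 0.64), (0.60, 0.68)]
local_runs = [(0.55, 0.58), (0.67, 0.63), (0.61, 0.69), (0.59, 0.52)]

# Gap between the two versions after one run and after all runs.
gap_single = abs(top_ranked_tm(server_runs, 1) - top_ranked_tm(local_runs, 1))
gap_all = abs(top_ranked_tm(server_runs) - top_ranked_tm(local_runs))
print(gap_single, gap_all, ranking_loss(server_runs), ranking_loss(local_runs))
```

In this toy example the single-run gap is large while the gap over the full pools is small, and both pools contain a better model than the one selected by the ranking score.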
Overall, the results from the 52 RNA targets indicate that AlphaFold3 achieves moderate performance in global fold accuracy compared with other methods but demonstrates clear advantages in local structural fidelity and secondary structure modeling. These findings underscore AlphaFold3’s potential as a powerful tool for RNA structure prediction.
Performance on RNA multimer
RNA multimer structure prediction was recently introduced as a new category in CASP16, reflecting growing interest in modeling RNA–RNA interactions. To assess AlphaFold3’s performance on this task, a benchmark dataset of 23 RNA multimers was constructed (see Dataset for Benchmark-VIII). RoseTTAFoldNA was selected as the control method.
In global structure prediction, AlphaFold3 showed modest improvements over RoseTTAFoldNA. The server and local versions of AlphaFold3 achieved average TM-scores of 0.300 and 0.299, respectively, corresponding to relative improvements of 23.9% and 23.5% over RoseTTAFoldNA’s average TM-score of 0.242. AlphaFold3 yielded substantially better models (ΔTM-score > 0.05) for 11 (server) and 10 (local) targets, compared to 5 and 6 targets where RoseTTAFoldNA showed a comparable advantage (see Table S37 and Fig. 8A). However, these differences did not reach statistical significance (P = 9.18 × 10⁻² and 1.19 × 10⁻¹). Critically, both methods struggled to produce accurate global folds: only two of the 23 complexes were modeled correctly (TM-score ≥ 0.5) by AlphaFold3, while RoseTTAFoldNA failed to generate any correct global folds. These results underscore the considerable difficulty of de novo RNA multimer structure prediction.
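The head-to-head counts and the correct-fold count used above follow simple per-target rules, sketched below in Python. The function names are hypothetical and the score lists are made-up values; only the two thresholds (a ΔTM-score margin of 0.05 and a TM-score cutoff of 0.5) are taken from the text.

```python
def head_to_head(scores_a, scores_b, margin=0.05):
    """Count targets where one method beats the other by more than margin."""
    wins_a = sum(1 for a, b in zip(scores_a, scores_b) if a - b > margin)
    wins_b = sum(1 for a, b in zip(scores_a, scores_b) if b - a > margin)
    return wins_a, wins_b

def count_correct_folds(scores, cutoff=0.5):
    """Number of targets whose TM-score reaches the correct-fold cutoff."""
    return sum(1 for s in scores if s >= cutoff)

# Made-up per-target TM-scores for two methods, not data from this study.
tm_a = [0.62, 0.31, 0.28, 0.22, 0.35, 0.19]
tm_b = [0.40, 0.30, 0.36, 0.21, 0.24, 0.26]

wins_a, wins_b = head_to_head(tm_a, tm_b)
print(wins_a, wins_b, count_correct_folds(tm_a), count_correct_folds(tm_b))
```

Targets whose scores differ by no more than the margin are counted for neither method, which is why the two win counts need not sum to the number of targets.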
Figure 8.
Comparison of AlphaFold3 vs. RoseTTAFoldNA performance for RNA multimer structure prediction (Benchmark-VIII). (A) and (B) Head-to-head comparisons between AlphaFold3 and RoseTTAFoldNA in terms of TM-score and LDDT, respectively, for 23 RNA multimers. (C) An example of the RNA multimer (PDB ID: 8VVJ) structures built by AlphaFold3 and RoseTTAFoldNA. Here, experimental structures are colored gray, and predicted structures are colored differently according to different chains. The structures of the left part are predicted by AlphaFold3-server, the structures of the middle part are predicted by AlphaFold3-local, and the structures of the right part are predicted by RoseTTAFoldNA.
In terms of local modeling quality, AlphaFold3 demonstrated consistent and statistically significant improvements. The server and local versions of AlphaFold3 achieved average LDDT scores of 0.564 and 0.571, representing improvements of 22.2% and 23.6% over RoseTTAFoldNA’s average LDDT of 0.462 (P = 2.14 × 10⁻³ and 6.71 × 10⁻³, see Table S37 and Fig. 8B). AlphaFold3 outperformed RoseTTAFoldNA in 17 of the 23 cases, whereas RoseTTAFoldNA was superior in only 6. This indicates that while global topology remains challenging to capture, AlphaFold3 provides more reliable local structural detail in RNA multimers.
The RNA homodimer 8VVJ represents one of the few cases in which AlphaFold3 achieved high-quality predictions, as shown in Fig. 8C. Both the server and local versions yielded models with TM-scores of 0.758 and 0.735, respectively, substantially higher than the TM-score of 0.439 obtained by RoseTTAFoldNA. AlphaFold3 also demonstrated improved local accuracy, with LDDT scores of 0.752 (server) and 0.755 (local) compared to 0.637 for RoseTTAFoldNA. AlphaFold3 recapitulated a near-native global fold and correctly modeled key intermolecular interactions, whereas RoseTTAFoldNA failed to capture the correct dimeric arrangement.
In summary, while AlphaFold3 exhibits clear gains over RoseTTAFoldNA in local modeling and succeeds on a small subset of RNA multimers, it still faces fundamental limitations in reliably predicting the global architecture of RNA multimers. The inclusion of RNA multimers in CASP16 marks an important step toward community-wide progress, but significant methodological innovations will be required to achieve consistently accurate structure prediction for this challenging class of biomolecules.
Performance on protein-nucleic acid complex
To evaluate AlphaFold3’s performance in protein-nucleic acid complex structure prediction, a dataset of 59 protein-nucleic acid complexes was constructed (see Dataset for Benchmark-IX). RoseTTAFoldNA served as the control method, as it is currently (to our knowledge) the only other end-to-end tool capable of modeling such complexes. Consistent with the above results on RNA monomer targets, the AlphaFold3 model for each target was chosen based on the default ranking score, while for RoseTTAFoldNA, the top-ranked model was selected according to the default pLDDT score.
Overall, AlphaFold3 demonstrated superior global modeling quality over RoseTTAFoldNA, as the average TM-score achieved by AlphaFold3-server was 0.745, which is 14.1% higher than RoseTTAFoldNA’s average of 0.653 (P = 1.04 × 10⁻⁵; see Table S38). As depicted in Fig. 9A, AlphaFold3-server produced models with higher TM-scores for 42 out of the 59 targets, with 17 targets showing improvements of more than 0.1 TM-score units, while RoseTTAFoldNA surpassed AlphaFold3-server by more than 0.1 TM-score units in only two targets. Results from AlphaFold3-local are similar to AlphaFold3-server, with an average TM-score of 0.741 (P = 6.45 × 10⁻⁶; see Table S38). These results indicate that AlphaFold3 generally outperforms RoseTTAFoldNA in global structural predictions.
Figure 9.
Comparison of AlphaFold3 vs. RoseTTAFoldNA performance for protein-nucleic acid complex structure prediction (Benchmark-IX). (A) and (B) Head-to-head comparisons between AlphaFold3 and RoseTTAFoldNA in terms of TM-score and LDDT, respectively, for 59 protein-nucleic acid complexes. Here, the colors denote the comparisons: blue represents the comparison between AlphaFold3-server and RoseTTAFoldNA, while red indicates the comparison between AlphaFold3-local and RoseTTAFoldNA. (C) Distributions of TM-scores for constituent protein, RNA, and DNA chains extracted from protein-nucleic acid complexes predicted by AlphaFold3 and RoseTTAFoldNA. The figure shows the TM-scores for each constituent type, with the mean represented by a long black line and the median by a short white line. (D) An example of the protein-nucleic acid complex (PDB ID: 8A0X) structures built by AlphaFold3 and RoseTTAFoldNA. Here, the experimental structure is colored gray, and predicted structures are colored differently according to different chains. The structures in the upper row are predicted by AlphaFold3-server, those in the middle row by AlphaFold3-local, and those in the lower row by RoseTTAFoldNA. The structure of the subunit is extracted from the whole complex structure.
The enhanced global modeling quality of AlphaFold3 can be attributed to two main factors: improved local structure and subunit modeling, and more accurate protein-nucleic acid interface predictions. Regarding local structural accuracy, AlphaFold3-server achieved an average LDDT of 0.774 (P = 1.75 × 10⁻¹⁰; see Table S38), and AlphaFold3-local achieved an average LDDT of 0.753 (P = 1.75 × 10⁻¹⁰; see Table S38), representing increases of 14.3% and 11.2%, respectively, over RoseTTAFoldNA’s average of 0.677. Figure 9B illustrates that AlphaFold3 consistently demonstrated superior LDDT values across nearly all targets. In terms of subunit modeling quality, the average TM-score of each type of subunit is shown in Fig. 9C and Table S39. For 122 protein subunits of protein-nucleic acid complexes, AlphaFold3-server and AlphaFold3-local achieved average TM-scores of 0.875 (P = 1.04 × 10⁻¹⁰) and 0.873 (P = 2.66 × 10⁻¹⁰; see Table S38), respectively, which are slightly higher than RoseTTAFoldNA’s average of 0.835. Notably, for RNA and DNA subunits, AlphaFold3-server achieved TM-scores of 0.401 and 0.309, respectively, surpassing RoseTTAFoldNA by 35.0% (RNA TM-score of 0.297) and 15.7% (DNA TM-score of 0.267). This indicates that AlphaFold3’s advantage in predicting protein-nucleic acid interactions may be partly due to its ability to better model subunit structures, particularly nucleic acid structures, consistent with findings regarding RNA monomers in the last section. A targeted analysis of the structure of the heterotetrameric transcription factor HigAB in complex with DNA (PDB code 8A0X [64]) provides an example consistent with this conclusion (Fig. 9D), as AlphaFold3 accurately modeled the DNA “bending” induced by protein binding, achieving a TM-score close to 0.600 for the DNA subunit, while RoseTTAFoldNA failed to do so, resulting in a TM-score of 0.343.
Furthermore, for the four individual protein subunits of this complex, AlphaFold3-server’s TM-scores were higher than those of RoseTTAFoldNA by 25.4%, 25.5%, 4.0%, and 3.0%, respectively. AlphaFold3-local achieved a performance similar to the server version. This superior modeling of the subunits contributed to a more accurate global fold for AlphaFold3, which attained an overall TM-score >0.850, compared to RoseTTAFoldNA’s 0.311 for this target.
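The relative improvements quoted above for Benchmark-IX are computed as (AlphaFold3 mean − RoseTTAFoldNA mean) / RoseTTAFoldNA mean. The short Python sketch below recomputes them from the rounded means reported in the text; the function name and dictionary labels are hypothetical. Percentages elsewhere in the paper may differ in the last digit when recomputed this way, presumably because they were derived from unrounded means.

```python
def relative_improvement(new, baseline):
    """Relative improvement of new over baseline, in percent."""
    return (new - baseline) / baseline * 100.0

# (AlphaFold3 mean, RoseTTAFoldNA mean) as reported in the text.
reported_means = {
    "complex TM-score (server)": (0.745, 0.653),
    "complex LDDT (server)": (0.774, 0.677),
    "complex LDDT (local)": (0.753, 0.677),
    "RNA subunit TM-score (server)": (0.401, 0.297),
    "DNA subunit TM-score (server)": (0.309, 0.267),
}

improvements = {
    label: round(relative_improvement(af3, rfna), 1)
    for label, (af3, rfna) in reported_means.items()
}
for label, percent in improvements.items():
    print(f"{label}: {percent:.1f}%")
```

The recomputed values (14.1%, 14.3%, 11.2%, 35.0%, and 15.7%) agree with the percentages stated in the text.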
In addition to improved subunit modeling, AlphaFold3 demonstrated superior accuracy in predicting protein-nucleic acid interfaces. As seen in Table S38, the average FNAT scores achieved by AlphaFold3-server and AlphaFold3-local were 0.410 (P = 2.50 × 10⁻⁶) and 0.417 (P = 2.50 × 10⁻⁶), respectively, compared to 0.186 for RoseTTAFoldNA. As shown in Fig. 10A, ~35.6% of structures predicted by AlphaFold3-server successfully captured more than half of the native contacts between protein and nucleic acid, in contrast to only 16.9% for RoseTTAFoldNA. Complex 8PI9 [65] offers a clear example of AlphaFold3’s superior interface modeling (Fig. 10B). All methods achieved TM-scores >0.9 for all protein subunits and above 0.5 for the DNA duplexes. However, AlphaFold3 accurately captured the protein-nucleic acid interface with an FNAT around 0.850, in contrast to RoseTTAFoldNA’s FNAT of 0.013. This enhanced interface accuracy contributed to AlphaFold3’s superior overall modeling quality for this complex, achieving a TM-score >0.950, compared to RoseTTAFoldNA’s 0.447.
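FNAT is the fraction of native interface contacts that are reproduced in a predicted model. The Python sketch below illustrates the two quantities used above, the per-target FNAT and the share of targets with FNAT above 0.5. It assumes contacts have already been extracted as (protein residue, nucleotide) identifier pairs; the contact-distance cutoff, the identifier format, the function names, and all values are hypothetical and are not taken from the evaluation pipeline of this study.

```python
def fnat(native_contacts, predicted_contacts):
    """Fraction of native interface contacts present in the prediction."""
    native = set(native_contacts)
    if not native:
        raise ValueError("native interface has no contacts")
    return len(native & set(predicted_contacts)) / len(native)

def fraction_above(values, cutoff=0.5):
    """Share of targets whose score exceeds the cutoff."""
    return sum(1 for v in values if v > cutoff) / len(values)

# Toy contact lists: (protein residue, nucleotide) identifiers.
native_contacts = [("A:45", "C:7"), ("A:46", "C:7"), ("A:49", "C:8"), ("B:12", "D:3")]
predicted_contacts = [("A:45", "C:7"), ("A:46", "C:7"), ("A:49", "C:8"), ("B:30", "D:9")]

target_fnat = fnat(native_contacts, predicted_contacts)

# Made-up per-target FNAT values for a small set of complexes.
share = fraction_above([target_fnat, 0.20, 0.60, 0.013])
print(target_fnat, share)
```

Spurious predicted contacts do not lower FNAT, so it is usually read alongside global measures such as TM-score rather than on its own.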
Figure 10.
FNAT comparison of AlphaFold3 vs. RoseTTAFoldNA performance for protein-nucleic acid complex structure prediction (Benchmark-IX). (A) Head-to-head comparisons between AlphaFold3 and RoseTTAFoldNA in terms of FNAT for 59 protein-nucleic acid complexes. Here, the colors denote the comparisons: blue represents the comparison between AlphaFold3-server and RoseTTAFoldNA, while red indicates the comparison between AlphaFold3-local and RoseTTAFoldNA. (B) An example of the protein-nucleic acid complex (PDB ID: 8PI9) structures built by AlphaFold3 and RoseTTAFoldNA. Here, experimental structures are colored grey, and predicted structures are colored differently according to different chains. The structures of the left part are predicted by AlphaFold3-server, structures of the middle part are predicted by AlphaFold3-local, and the structures of the right part are predicted by RoseTTAFoldNA. The structure of the subunit is extracted from the whole complex structure.
When run only once, AlphaFold3-server and AlphaFold3-local show no significant overall difference on the 59 protein-nucleic acid complexes, except in LDDT. To explore this further, we analyzed two targets (8QA9 and 8P5Q) by running each version 20 times (Fig. S8). For both targets, TM-score and LDDT distributions were highly similar across versions. After 20 runs, the differences between top-ranked models were greatly reduced. These findings highlight that AlphaFold3 can be unstable on some complexes, but repeated runs may mitigate this variability. They also suggest room for improvement in the model ranking system for such cases.
In summary, AlphaFold3 outperforms RoseTTAFoldNA in predicting protein-nucleic acid complex structures, exhibiting significant improvements in global and local modeling quality, as well as in modeling protein-nucleic acid interfaces. The enhanced performance is attributed to AlphaFold3’s superior ability to model both the subunits, particularly nucleic acids, and the interactions between proteins and nucleic acids. These findings suggest that AlphaFold3 currently offers a more accurate and reliable tool for modeling protein-nucleic acid complexes.
Running time
In this section, we compare the runtimes of the methods evaluated in this study across different benchmark datasets. On benchmark-I (protein monomers), AlphaFold3-local achieved a substantially shorter average runtime (~20 min) than AlphaFold2 (~160 min) (Fig. S9). In single-sequence mode on benchmark-I, AlphaFold3-local required only ~3 min on average, compared with ~19 min for AlphaFold2 (Fig. S10). On benchmark-II (orphan proteins), AlphaFold3-local ran in ~20 min on average, whereas AlphaFold2 required ~74 min (Fig. S11). On benchmark-IV (protein multimers), AlphaFold3-local again exhibited markedly faster performance, with an average runtime of ~31 min compared with ~297 min for AlphaFold-Multimer (Fig. S12). For benchmark-V (peptide-protein complexes), AlphaFold3 completed predictions in ~34 min on average, compared with ~272 min for AlphaFold-Multimer (Fig. S13). On benchmark-VI (antigen-antibody complexes), the average runtimes of AlphaFold3 and AlphaFold-Multimer were ~47 and ~382 min, respectively (Fig. S14). On benchmark-VII (RNA), the average runtimes of AlphaFold3, RoseTTAFoldNA, RhoFold+, NuFold, and trRosettaRNA were ~10, ~445, ~402, ~531, and ~426 min, respectively (Fig. S15). On benchmark-VIII (RNA multimers), AlphaFold3 required ~11 min on average, compared with ~799 min for RoseTTAFoldNA (Fig. S16). Finally, on benchmark-IX (protein-nucleic acid complexes), AlphaFold3 and RoseTTAFoldNA required ~23 and ~996 min, respectively (Fig. S17). These results demonstrate that AlphaFold3 achieves faster performance than all other methods across every benchmark set. All results were obtained on an NVIDIA A100 40 GB GPU and an AMD EPYC 7763 64-Core Processor CPU, with each job allocated 8 CPU cores and 40 GB of system memory.
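To put the runtimes above on a common scale, the following Python sketch converts the reported average runtimes into approximate speedup factors (comparison method runtime divided by AlphaFold3 runtime). The variable names are hypothetical, and because the inputs are rounded averages, the resulting factors are rough estimates only.

```python
# (benchmark, comparison method, AlphaFold3 minutes, comparison minutes),
# using the approximate averages reported in the text.
reported_minutes = [
    ("benchmark-I", "AlphaFold2", 20, 160),
    ("benchmark-I, single-sequence", "AlphaFold2", 3, 19),
    ("benchmark-II", "AlphaFold2", 20, 74),
    ("benchmark-IV", "AlphaFold-Multimer", 31, 297),
    ("benchmark-V", "AlphaFold-Multimer", 34, 272),
    ("benchmark-VI", "AlphaFold-Multimer", 47, 382),
    ("benchmark-VII", "RoseTTAFoldNA", 10, 445),
    ("benchmark-VII", "RhoFold+", 10, 402),
    ("benchmark-VII", "NuFold", 10, 531),
    ("benchmark-VII", "trRosettaRNA", 10, 426),
    ("benchmark-VIII", "RoseTTAFoldNA", 11, 799),
    ("benchmark-IX", "RoseTTAFoldNA", 23, 996),
]

# Speedup factor: how many times longer the comparison method ran.
speedups = [
    (benchmark, method, other / af3)
    for benchmark, method, af3, other in reported_minutes
]
for benchmark, method, factor in speedups:
    print(f"{benchmark} vs {method}: ~{factor:.1f}x")
```

Under these rounded inputs, every factor is greater than 1, with the smallest on benchmark-II and the largest on benchmark-VIII.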
Conclusions
Although it is challenging to achieve high-accuracy predictions on all types of biomacromolecules and their interactions, AlphaFold3 uses a unified model architecture in attempting to reach this goal. Based on both the original AlphaFold3 paper and our independent benchmarks, AlphaFold3 has certain advantages over existing advanced methods. In this work, we evaluated and analyzed the performance of AlphaFold3 across five different types of benchmark sets including protein monomers, protein complexes, RNA monomers, RNA multimers, and protein-nucleic acid complexes, and compared it to current state-of-the-art methods.
In modeling protein monomers, AlphaFold3 demonstrates comparable global modeling accuracy to AlphaFold2 and exhibits a small but statistically significant improvement in local structure prediction, particularly for multidomain proteins. It is worth noting that the AlphaFold2 (monomer) model weights were trained on PDB structures released prior to 30 April 2018. For protein monomer prediction, the conclusions might differ if AlphaFold2 were retrained on the same dataset as AlphaFold3; however, no substantial differences are expected, as both methods already achieve high accuracy on monomeric proteins. When used in single-sequence mode, AlphaFold3 achieves substantially better overall accuracy than AlphaFold2, although its absolute accuracy is 20%–30% lower than when run with default settings. This indicates that AlphaFold3 simplifies MSA processing and thereby reduces its dependence on MSA information. For orphan proteins with only limited MSAs, AlphaFold3-server shows a small but statistically significant improvement in local structural prediction, while no significant differences are observed in other aspects. Despite these improvements, the overall enhancement in modeling accuracy for protein monomers is modest, suggesting that the prediction accuracy for these structures may have reached a plateau, at least for current modeling architectures and database sizes. This plateau makes further significant gains challenging, especially for single-domain proteins; it is notable, however, that the integration of deeper MSAs and Monte Carlo sampling has been shown to yield substantial performance improvements over that achieved by AlphaFold3 [23].
In predicting alternative conformations for multiconformation proteins, AlphaFold3 shows a superior ability compared to AlphaFold2 in some instances. However, for most cases, its predictions remain stable in a single state, indicating considerable room for improvement in capturing the full range of conformational variability inherent in certain proteins, and perhaps the need for more targeted developments aimed at sampling, identifying, and scoring alternative protein conformations.
For protein complexes, compared with AlphaFold-Multimer, AlphaFold3 achieves statistically significant improvements in local structure accuracy, as evidenced by higher LDDTs, especially for homomers. For heteromers, AlphaFold3 achieves statistically significant improvements in interface modeling quality, as evidenced by higher DockQ. For peptide-protein complexes, a special type of heteromeric protein complex, AlphaFold3 and AlphaFold-Multimer exhibit very similar performance across TM-score, LDDT, and DockQ metrics. In contrast, for antigen-antibody complexes, AlphaFold3 significantly outperforms AlphaFold-Multimer in terms of TM-score, LDDT, DockQ, and CDR-TMscore. This superiority may be related to AlphaFold3’s reduced dependence on MSAs. Nevertheless, AlphaFold3 still faces challenges in accurately capturing the fine-grained conformational details of antigen-antibody interfaces. Moreover, the stronger correlation between AlphaFold3’s predicted interface metrics (ipTM) and actual interface quality also indicates a more reliable prediction of protein-protein interactions. However, both methods display comparable performance in global structure prediction, with neither consistently outperforming the other across all targets, and each substantively outperforming the other in a nontrivial number of cases. The complementary strengths observed between AlphaFold3 and AlphaFold-Multimer suggest that integrating their predictions could enhance overall modeling accuracy. This potential synergy underscores the value of combining different predictive approaches to improve results. However, integrating models from AlphaFold3 and AlphaFold-Multimer based on the highest confidence score did not significantly improve the overall quality of the final models, which underscores the need for developing effective model quality assessment methods tailored to these prediction pipelines.
While the structure of single-domain proteins can be accurately predicted, the situation for protein complex structure prediction is different. Current methods such as AlphaFold-Multimer and AlphaFold3 have enabled steady progress, yet the overall accuracy remains limited. In CASP16, more than 30% of oligomer targets, especially antigen-antibody complexes, remained highly challenging, with most groups only correctly predicting about a quarter of such cases [66]. This is consistent with our benchmark results on antigen-antibody complexes, where AlphaFold3 produced correct interfaces for only ~30% of the targets and AlphaFold-Multimer succeeded on merely ~10%. Even when correct models were present among sampled structures, ranking and selection remain major bottlenecks. Additional difficulties persist in modeling high-order assemblies, as shown in CASPs, although such targets were limited in our benchmark dataset. Thus, complex structure prediction is still far from solved. Substantial methodological innovation, beyond incremental improvements of AlphaFold, is required to achieve reliable solutions.
For RNA monomers, AlphaFold3 ranked in an intermediate position: trRosettaRNA achieved the highest mean TM-score, followed by NuFold and AlphaFold3. However, AlphaFold3 remained the best performer across the other evaluation metrics. On RNA multimers, AlphaFold3 shows clear gains over RoseTTAFoldNA in local modeling and succeeds on a small subset of targets. However, fewer than 10% of RNA multimers were predicted with correct folding (TM-score > 0.5), underscoring that significant methodological innovations will be required to achieve consistently accurate structure prediction for this challenging class of biomolecules. For protein-nucleic acid complexes, AlphaFold3 significantly outperforms RoseTTAFoldNA, achieving higher TM-scores, LDDTs, and FNAT scores. Due to the high quality of the subunit models that it generates, AlphaFold3 also demonstrates superior performance over RoseTTAFoldNA in global modeling quality. The enhanced accuracy is also attributed to AlphaFold3’s better modeling of local structure and the interactions between proteins and nucleic acids. However, the average TM-scores for AlphaFold3’s RNA and protein-nucleic acid complex predictions are 0.510 and 0.745, which are much lower than those (0.898 and 0.821) for protein monomers and protein complexes, respectively, highlighting a gap in accuracy compared to protein structure predictions and suggesting substantial room for improvement in RNA and protein-nucleic acid complex modeling.
We also analyzed the differences between AlphaFold3-server and AlphaFold3-local on several targets. These discrepancies appear to stem from random variation and limitations in built-in model ranking capabilities, as repeated runs showed that both versions produced highly similar distributions of predicted models in all cases that we considered.
Taken together, our findings demonstrate that AlphaFold3 represents a meaningful advancement in the field of biomacromolecule structure prediction, offering a unified framework capable of modeling a diverse range of biomacromolecules and their complexes with improved accuracy. The improvement of AlphaFold3 over AlphaFold2, AlphaFold-Multimer, and RoseTTAFoldNA, however, ranged from minor (for protein monomers and RNA multimers) to substantial (for antigen-antibody complexes and protein-nucleic acid complexes), and in the case of protein multimers there was no notable difference in performance between the newer and older pipelines apart from antigen-antibody complexes. These findings highlight AlphaFold3’s potential as a more accurate and reliable tool for modeling a wide range of complex biomacromolecular systems. However, future efforts should focus on addressing the remaining challenges, such as improving predictions of alternative conformations, enhancing model ranking performance, and further increasing the accuracy of modeling RNA and protein-nucleic acid complexes, in order to fully realize the capabilities of AlphaFold3 in structural biology.
Key Points
This study provides a comprehensive benchmark of AlphaFold3, comparing it with AlphaFold2 and AlphaFold-Multimer on protein monomers (general proteins and orphan proteins), multimers (general multimers, peptide-protein complexes, and antigen-antibody complexes), and alternative conformations; with RoseTTAFoldNA, RhoFold+, NuFold, and trRosettaRNA on RNA monomers; and with RoseTTAFoldNA on RNA multimers and protein-nucleic acid complexes.
AlphaFold3 improves local structural detail for general and orphan protein monomers with only marginal global gains over AlphaFold2, but in single sequence input it achieves clear superiority in both global and local accuracy.
AlphaFold3 surpasses AlphaFold-Multimer in local structural accuracy for general protein complexes, performs comparably on peptide-protein complexes, and demonstrates clear superiority on antigen-antibody complexes.
The new diffusion-based architecture can capture distinct conformational states for certain proteins, but it still struggles to generate a broad, well-scored ensemble of alternative structures.
AlphaFold3 shows advantages over RoseTTAFoldNA, achieving significantly higher local accuracy for RNA multimers, and both global and local accuracy improvements for protein-RNA/DNA complexes. For RNA monomers, trRosettaRNA achieves higher global accuracy, whereas AlphaFold3 shows superior performance in local accuracy and interaction network fidelity.
Supplementary Material
Acknowledgements
We would like to thank Dr. Lydia Freddolino for her insightful suggestions on this work.
Contributor Information
Chunxiang Peng, Department of Biological Chemistry, University of Michigan, 1136 Catherine Street, Ann Arbor, MI 48109-1085, United States.
Wentao Ni, NITFID, School of Statistics and Data Science, AAIS, LPMC and KLMDASR, Nankai University, 94 Weijin Road, Nankai District, Tianjin 300071, China.
Quancheng Liu, Gilbert S Omenn Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI 48109-2218, United States.
Gang Hu, NITFID, School of Statistics and Data Science, AAIS, LPMC and KLMDASR, Nankai University, 94 Weijin Road, Nankai District, Tianjin 300071, China.
Wei Zheng, NITFID, School of Statistics and Data Science, AAIS, LPMC and KLMDASR, Nankai University, 94 Weijin Road, Nankai District, Tianjin 300071, China.
Conflict of interest: All authors declare that they have no conflicts of interest.
Funding
This work is supported in part by the National Natural Science Foundation of China (12426303 to W.Z., 92370128 and 12326611 to G.H.), the Fundamental Research Funds for the Central Universities (054-63253109 to W.Z.), and the Tianjin Science and Technology Program (24ZXZSSS00320 to G.H. and W.Z.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Data availability
All benchmark data are freely available at: https://zenodo.org/records/15502855
The tools and resources used in this study include:
US-align (https://seq2fun.dcmb.med.umich.edu/US-align),
LDDT calculation (https://github.com/metalcycling/openstructure),
DockQ (https://github.com/bjornwallner/DockQ),
INF calculation (https://github.com/mmagnus/rna-tools),
AlphaFold2/AlphaFold-Multimer (https://github.com/google-deepmind/alphafold),
AlphaFold3 server (https://alphafoldserver.com/),
AlphaFold3 standalone package (https://github.com/google-deepmind/alphafold3),
RoseTTAFoldNA (https://github.com/uw-ipd/RoseTTAFold2NA),
RhoFold+ (https://github.com/ml4bio/RhoFold),
NuFold (https://github.com/kiharalab/NuFold), and
trRosettaRNA (https://yanglab.qd.sdu.edu.cn/trRosettaRNA).
References
- 1. Zhang Y. Protein structure prediction: when is it useful? Curr Opin Struct Biol 2009;19:145–55. 10.1016/j.sbi.2009.02.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Pearce R, Zhang Y. Toward the solution of the protein structure prediction problem. J Biol Chem 2021;297:100870. 10.1016/j.jbc.2021.100870. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Peng CX, Liang F, Xia YH, et al. Recent advances and challenges in protein structure prediction. J Chem Inf Model 2023;64:76–95. 10.1021/acs.jcim.3c01324. [DOI] [PubMed] [Google Scholar]
- 4. Lyons J, Dehzangi A, Heffernan R., et al. Predicting backbone cα angles and dihedrals from protein sequences by stacked sparse auto-encoder deep neural network. J Comput Chem 2014;35:2040–6. 10.1002/jcc.23718. [DOI] [PubMed] [Google Scholar]
- 5. Faraggi E, Yang YD, Zhang SS, et al. Predicting continuous local structure and the effect of its substitution for secondary structure in fragment-free protein structure prediction. Structure 2009;17:1515–27. 10.1016/j.str.2009.09.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Zhou YQ, Litfin T, Zhan J. 3=1+2: how the divide conquered de novo protein structure prediction and what is next? Natl Sci Rev 2023;10:10. 10.1093/nsr/nwad259. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9. 10.1038/s41586-021-03819-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New York, NY: IEEE, 2016, 770–8.
- 9. Wang S, Sun SQ, Li Z, et al. Accurate De novo prediction of protein contact map by ultra-deep learning model. PLoS Comput Biol 2017;13:e1005324. 10.1371/journal.pcbi.1005324. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Yang JY, Anishchenko I, Park H, et al. Improved protein structure prediction using predicted interresidue orientations. Proc Natl Acad Sci USA 2020;117:1496–503. 10.1073/pnas.1914677117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Li Y, Zhang C, Yu DJ, et al. Deep learning geometrical potential for high-accuracy ab initio protein structure prediction. iScience 2022;25:104425. 10.1016/j.isci.2022.104425. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Peng CX, Zhou XG, Zhang GJ. De novo protein structure prediction by coupling contact with distance profile. IEEE/ACM Trans Comput Biol Bioinform 2022;19:395–406. 10.1109/TCBB.2020.3000758. [DOI] [PubMed] [Google Scholar]
- 13. Xu JB. Distance-based protein folding powered by deep learning. Proc Natl Acad Sci USA 2019;116:16856–65. 10.1073/pnas.1821309116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Mortuza SM, Zheng W, Zhang CX, et al. Improving fragment-based ab initio protein structure assembly using low-accuracy contact-map predictions. Nat Commun 2021;12:5011. 10.1038/s41467-021-25316-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Zheng W, Zhang CX, Li Y, et al. Folding non-homologous proteins by coupling deep-learning contact maps with I-TASSER assembly simulations. Cell Reports Methods 2021;1:100014. 10.1016/j.crmeth.2021.100014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Senior AW, Evans R, Jumper J, et al. Improved protein structure prediction using potentials from deep learning. Nature 2020;577:706–10. 10.1038/s41586-019-1923-7. [DOI] [PubMed] [Google Scholar]
- 17. Pereira J, Simpkin AJ, Hartmann MD, et al. High-accuracy protein structure prediction in CASP14, proteins: structure. Functi,on, and Bioinformatics 2021;89:1687–99. 10.1002/prot.26171. [DOI] [PubMed] [Google Scholar]
- 18. Baek M, DiMaio F, Anishchenko I, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021;373:871–6. 10.1126/science.abj8754. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Kandathil SM, Greener JG, Lau AM, et al. Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins. Proc Natl Acad Sci USA 2022;119:e2113348119. 10.1073/pnas.2113348119. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Lin ZM, Akin H, Rao RS, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023;379:1123–30. 10.1126/science.ade2574. [DOI] [PubMed] [Google Scholar]
- 21. Wang WK, Peng ZL, Yang JY. Single-sequence protein structure prediction using supervised transformer protein language models. Nature Computational Science 2022;2:804–14. 10.1038/s43588-022-00373-3. [DOI] [PubMed] [Google Scholar]
- 22. Peng ZL, Wang WK, Wei H, et al. Improved protein structure prediction with trRosettaX2, AlphaFold2, and optimized MSAs in CASP15, proteins: structure. Function, and Bioinformatics 2023;91:1704–11. 10.1002/prot.26570. [DOI] [PubMed] [Google Scholar]
- 23. Zheng W, Wuyun Q, Freddolino PL, et al. Integrating deep learning, threading alignments, and a multi-MSA strategy for high-quality protein monomer and complex structure prediction in CASP15, proteins: structure. Functi,on, and Bioinformatics 2023;91:1684–703. 10.1002/prot.26585. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Oda T. Improving protein structure prediction with extended sequence similarity searches and deep-learning-based refinement in CASP15, proteins: structure. Function, and Bioinformatics 2023;91:1712–23. 10.1002/prot.26551. [DOI] [PubMed] [Google Scholar]
- 25. Peng CX, Zhou XG, Xia YH, et al. Structural analogue-based protein structure domain assembly assisted by deep learning. Bioinformatics 2022;38:4513–21. 10.1093/bioinformatics/btac553. [DOI] [PubMed] [Google Scholar]
- 26. Evans R, O’Neill M, Pritzel A, et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv 2022:2021.2010.2004.463034. [Google Scholar]
- 27. Zheng W, Wuyun QQG, Li Y, et al. Improving deep learning protein monomer and complex structure prediction using DeepMSA2 with huge metagenomics data. Nat Methods 2024;21:279–89. 10.1038/s41592-023-02130-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Wallner B. AFsample: improving multimer prediction with AlphaFold using massive sampling. Bioinformatics 2023;39: btad573. 10.1093/bioinformatics/btad573. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Pearce R, Omenn GS, Zhang Y. De novo RNA tertiary structure prediction at atomic resolution using geometric potentials from deep learning. bioRxiv 2022: 2022.2005.2015.491755. [Google Scholar]
- 30. Li Y, Zhang CX, Feng CJ, et al. Integrating end-to-end learning with deep geometrical potentials for ab initio RNA structure prediction, Nat Commun 2023;14:14. 10.1038/s41467-023-41303-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Baek M, McHugh R, Anishchenko I, et al. Accurate prediction of protein-nucleic acid complexes using RoseTTAFoldNA. Nat Methods 2024;21:117–21. 10.1038/s41592-023-02086-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Wang WK, Feng C, Han R, et al. trRosettaRNA: automated prediction of RNA 3D structure with transformer network. Nat Commun 2023;14:14. 10.1038/s41467-023-42528-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Shen T, Hu ZH, Sun SQ, et al. Accurate RNA 3D structure prediction using a language model-based deep learning approach. Nat Methods 2024;21:2287–98. 10.1038/s41592-024-02487-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Das R, Kretsch RC, Simpkin AJ, et al. Assessment of three-dimensional RNA structure prediction in CASP15, proteins: structure. Function, and Bioinformatics 2023;91:1747–70. 10.1002/prot.26602. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Krishna R, Wang J, Ahern W, et al. Generalized biomolecular modeling and design with RoseTTAFold all-atom. Science 2024;384:384. 10.1126/science.adl2528. [DOI] [PubMed] [Google Scholar]
- 36. Abramson J, Adler J, Dunger J, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024;630:493–500. 10.1038/s41586-024-07487-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37. Kagaya Y, Zhang ZC, Ibtehaz N, et al. NuFold: end-to-end approach for RNA tertiary structure prediction with flexible nucleobase center representation, Nat Commun 2025;16. 10.1038/s41467-025-56261-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Li WZ, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658–9. 10.1093/bioinformatics/btl158. [DOI] [PubMed] [Google Scholar]
- 39. Xu Y, Xu D, Gambow HN. Protein domain decomposition using a graph-theoretic approach. Bioinformatics 2000;16:1091–104. 10.1093/bioinformatics/16.12.1091. [DOI] [PubMed] [Google Scholar]
- 40. Guo JT, Xu D, Kim D, et al. Improving the performance of DomainParser for structural domain partition using neural network. Nucleic Acids Res 2003;31:944–52. 10.1093/nar/gkg189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41. Zheng W, Wuyun QQG, Zhou XG, et al. LOMETS3: integrating deep learning and profile alignment for advanced protein template recognition and function annotation. Nucleic Acids Res 2022;50:W454–64. 10.1093/nar/gkac248. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Peng C, Zhou X, Liu J, et al. Multiple conformational states assembly of multidomain proteins using evolutionary algorithm based on structural analogues and sequential homologues. Fundamental Research 2024. 10.1016/j.fmre.2024.05.003. [DOI] [Google Scholar]
- 43. Zhang CX, Shine M, Pyle AM, Zhang Y. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nat Methods 2022;19:1109–15. 10.1038/s41592-022-01585-1. [DOI] [PubMed] [Google Scholar]
- 44. Ye J, McGinnis S, Madden TL. BLAST: improvements for better sequence analysis. Nucleic Acids Res 2006;34:W6–9. 10.1093/nar/gkl164. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality, proteins: structure. Function, and Bioinformatics 2004;57:702–10. 10.1002/prot.20264. [DOI] [PubMed] [Google Scholar]
- 46. Xu JR, Zhang Y. How significant is a protein structure similarity with TM-score=0.5? Bioinformatics 2010;26:889–95. 10.1093/bioinformatics/btq066. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Mariani V, Biasini M, Barbato A, et al. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 2013;29:2722–8. 10.1093/bioinformatics/btt473. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Basu S, Wallner B. DockQ: a quality measure for protein-protein docking models. PLoS One 2016;11:e0161879. 10.1371/journal.pone.0161879. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49. Das S, Chakrabarti S. Classification and prediction of protein-protein interaction interface using machine learning algorithm. Sci Rep 2021;11:1761. 10.1038/s41598-020-80900-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50. Parisien M, Cruz JA, Westhof E, et al. New metrics for comparing and assessing discrepancies between RNA 3D structures and models. RNA 2009;15:1875–85. 10.1261/rna.1700409. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. Méndez R, Leplae R, De Maria L, et al. Assessment of blind predictions of protein–protein interactions: current status of docking methods, proteins: structure. Functi,on, and Bioinformatics 2003;52:51–67. 10.1002/prot.10393. [DOI] [PubMed] [Google Scholar]
- 52. Mirabello C, Wallner B. DockQ v2: improved automatic quality measure for protein multimers, nucleic acids, and small molecules. Bioinformatics 2024;40. 10.1093/bioinformatics/btae586. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53. Magnus M, Antczak M, Zok T, et al. RNA-puzzles toolkit: a computational resource of RNA 3D structure benchmark datasets, structure manipulation, and evaluation tools. Nucleic Acids Res 2020;48:576–88. 10.1093/nar/gkz1108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Walen T, Chojnowski G, Gierski P, et al. ClaRNA: a classifier of contacts in RNA 3D structures based on a comparative analysis of various classification schemes. Nucleic Acids Res 2014;42:e151. 10.1093/nar/gku765. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55. Oechslin F, Zhu XJ, Morency C, et al. Fermentation practices select for thermostable Endolysins in Phages. Mol Biol Evol 2024;41:msae055. 10.1093/molbev/msae055. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56. Ha JH, Loh SN. Protein conformational switches: from nature to design. Chem 2012;18:7984–99. 10.1002/chem.201200348. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57. Li JX, Wang L, Zhu ZF, et al. Exploring the alternative conformation of a known protein structure based on contact map prediction. J Chem Inf Model 2023;64:301–15. 10.1021/acs.jcim.3c01381. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58. Bisig D, Weber P, Vaughan L, et al. Purification, crystallization and preliminary crystallographic studies of a two fibronectin type-III domain segment from chicken tenascin encompassing the heparin- and contactin-binding regions. Acta Crystallogr D Biol Crystallogr 1999;55:1069–73. 10.1107/S090744499900284X. [DOI] [PubMed] [Google Scholar]
- 59. Agrawal AA, McLaughlin KJ, Jenkins JL, et al. Structure-guided U2AF65 variant improves recognition and splicing of a defective pre-mRNA. Proc Natl Acad Sci USA 2014;111:17420–5. 10.1073/pnas.1412743111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60. Heo L, Feig M. Multi-state modeling of G-protein coupled receptors at experimental accuracy. Proteins: Structure, Function, and Bioinformatics 2022;90:1873–85. 10.1002/prot.26382. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61. Wittlinger F, Ogboo BC, Shevchenko E, et al. Linking ATP and allosteric sites to achieve superadditive binding with bivalent EGFR kinase inhibitors. Communications Chemistry 2024;7:7. 10.1038/s42004-024-01108-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62. Dai S, Wang BQ, Ye R, et al. Structural evolution of bacterial polyphosphate degradation enzyme for phosphorus cycling. Adv Sci 2024;11:2309602. 10.1002/advs.202309602. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63. Prabhakar A, Krahn N, Zhang JJ, et al. Uncovering translation roadblocks during the development of a synthetic tRNA. Nucleic Acids Res 2022;50:10201–11. 10.1093/nar/gkac576. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64. Hadzi S, Zivic Z, Kovacic M, et al. Fuzzy recognition by the prokaryotic transcription factor HigA2 from vibrio cholerae. Nat Commun 2024;15:3105. 10.1038/s41467-024-47296-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65. Kind L, Molnes J, Tjora E, et al. Molecular mechanism of HNF-1A-mediated HNF4A gene regulation and promoter-driven HNF4A-MODY diabetes. JCI Insight 2024;9:e175278. 10.1172/jci.insight.175278. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66. Zhang J, Yuan R, Kryshtafovych A, et al. Assessment of protein complex predictions in CASP16: are we making progress? bioRxiv 2025:2025.2005.2029.656875. [DOI] [PMC free article] [PubMed] [Google Scholar]