2025 May 22;93(10):1747–1766. doi: 10.1002/prot.26844

Using AlphaFold and Symmetrical Docking to Predict Protein–Protein Interactions for Exploring Potential Crystallization Conditions

Kuan‐Ju Liao 1, Yuh‐Ju Sun 1
PMCID: PMC12433261  PMID: 40401365

ABSTRACT

Protein crystallization remains a major bottleneck in X‐ray crystallography due to difficulties in achieving favorable molecular arrangements within the crystal lattice. While protein–protein interactions at molecular packing interfaces are crucial for determining crystallization conditions, methods for predicting crystal packing interfaces and systematically exploring crystallization conditions remain limited. In this study, we present MASCL (Molecular Assembly Simulation in Crystal Lattice), a novel approach that integrates AlphaFold with symmetrical docking to simulate crystal packing. To evaluate packing quality, we introduced PackQ, a stringent metric based on the DockQ framework, where models with scores above 0.36 are considered successful. In benchmark tests on P41212 and P43212 space groups, MASCL successfully predicted packing interfaces for 26.8% and 30.1% of targets within the top 100 models. When focusing on models with successfully predicted initial crystallographic dimeric assemblies (DockQ ≥ 0.23), success rates improved to 57.9% and 39.8% within the top 25 models, respectively. Additionally, we developed AAI‐PatchBag, a patch‐based method using physicochemical descriptors to assess molecular interface similarity. Compared to conventional condition‐searching strategies like sequence alignment, structure superposition, and shape comparison, AAI‐PatchBag reduced the number of trials required to identify potential crystallization conditions. Applied to lysozyme crystallization, AAI‐PatchBag efficiently identified conditions yielding crystals with the desired packing. Overall, MASCL and AAI‐PatchBag advance the prediction of protein–protein interactions within the crystal lattice and facilitate the identification of potential crystallization conditions through molecular packing interface similarity, contributing to a deeper understanding of protein crystallization.

Keywords: AlphaFold, crystal packing interface, crystallization condition exploration, molecular surface, protein–protein interaction, X‐ray crystallography

1. Introduction

In the 14th Critical Assessment of Protein Structure Prediction (CASP14), AlphaFold2 (AF2), a neural network‐based system, revolutionized single‐chain protein structure prediction, achieving high global distance test total scores (GDT_TS) and low all‐atom root‐mean‐square deviation (RMSD) [1, 2]. Subsequent works, including the integration of AF2 into the molecular docking pipeline [3, 4] and the development of AlphaFold‐Multimer [5] and AlphaFold3 (AF3) [6], have further extended its application to protein complex prediction. Despite these computational advancements, X‐ray crystallography remains indispensable in structural biology [7, 8, 9], offering atomic‐level resolution that is crucial for elucidating detailed protein architectures [10, 11]. Moreover, the process of determining protein‐ligand interactions often begins with crystallizing the apo form of a target protein, followed by soaking or co‐crystallization experiments to reveal drug‐binding interfaces under conditions similar to initial crystallization [12, 13]. These high‐resolution structural insights are essential for understanding protein‐ligand interactions, facilitating the design of targeted drugs. Furthermore, X‐ray crystallography captures distinct conformational states that static structure prediction methods may miss, resulting in deeper insights into protein function and regulation [7, 8].

However, protein crystallization persists as a significant challenge in X‐ray crystallography [14, 15]. The formation of high‐quality crystals requires achieving favorable molecular packing within the crystal lattice, which depends on optimizing energetic interactions between protein molecules while reducing protein surface entropy [16, 17, 18]. Predicting protein–protein interactions at crystal packing interfaces is crucial for bridging the gap between computational structure prediction and experimental crystallization. A better understanding of these molecular interactions would allow crystallographers to fine‐tune reagent combinations to promote essential lattice‐forming contacts while mitigating unfavorable interactions [19, 20, 21]. Such insights also enable the modification of protein surfaces, such as the removal of disordered regions [22, 23] or mutation of surface residues [18, 21, 24, 25, 26, 27], to minimize aggregation and enhance crystallization likelihood. Additionally, analyzing these interactions helps assess whether small‐molecule drugs in co‐crystallization or soaking experiments might disrupt the original lattice structure [13], improving the design of crystallization conditions for both apo proteins and protein–drug complexes.

A recently proposed strategy for enhancing crystallization success involves molecular glues, which introduce new interaction sites on protein surfaces via macrocycles to induce molecular assembly [28]. Meanwhile, in computational crystallography, Li et al. reported a hierarchical computational strategy that successfully designed near‐atomic‐accuracy protein crystals in polyhedral cages for the F4132 and I432 space groups, employing de novo sequence design to establish stable interfaces [29]. While this approach represents a significant achievement, it remains computationally intensive, restricted to specific polyhedral architectures, and focused on space groups that are relatively uncommon in practical crystallography (Figure S1C,D). Consequently, no reliable methods currently exist to predict crystal packing interfaces other than experimentally growing crystals, leaving protein crystallization as a time‐consuming and empirical process [14, 15].

In addition to high‐throughput screening of commercial crystallization kits, leveraging knowledge from the literature of similar proteins through sequence alignment, structure superposition, or three‐dimensional (3D) shape comparison is one of the few approaches for searching potential crystallization conditions [30, 31, 32]. Several alignment‐free protein interface comparison algorithms, such as PatchBag [33] and PPI‐Surfer [34], have also been proposed to evaluate the similarity of interfaces rather than of overall protein structures. Both represent the exposed residues or protein–protein interaction surfaces as vectors: PatchBag counts the occurrences of pre‐defined local surface patches, whereas PPI‐Surfer uses 3D Zernike Descriptors (3DZDs). The distance between two such vectors then measures how closely the interfaces resemble each other. Though these techniques equip crystallographers with potential tools to draw inspiration for crystallization from homologous and even non‐homologous proteins, sequence‐based searches have proven ineffective in identifying crystallization conditions [35], and alternative approaches have yet to be systematically explored. Therefore, a generalized method for identifying potential crystallization conditions is still lacking.

To address these challenges, our study focuses on two critical questions: (1) Can crystal packing within a lattice be accurately simulated? and (2) Can crystal packing interfaces serve as better predictors of crystallization conditions? Here, we report MASCL (Molecular Assembly Simulation in Crystal Lattice), the first protocol designed for crystal packing prediction using AF3 combined with a symmetrical docking pipeline (DIPER). By emphasizing the importance of crystallographic symmetry in filtering out incompatible molecular packings, MASCL achieved success rates of 26.8% and 30.1% for crystal packing in two representative space groups, P41212 and P43212, respectively. Notably, in cases where initial crystallographic dimeric assemblies were accurately predicted, the average success rate increased to 57.9% for P41212 and 39.8% for P43212 within the top 25 models, highlighting the potential of MASCL for accurate crystal packing prediction. Additionally, we introduce AAI‐PatchBag, a novel computational tool that integrates 69 physicochemical descriptors from the AAIndex [36] into PatchBag, providing a comprehensive characterization of protein crystal packing interfaces. AAI‐PatchBag outperformed traditional condition‐searching methods, including sequence alignment, structural superposition, and 3D shape comparison, in ranking potential crystallization conditions from the Protein Data Bank (PDB). A case study on hen egg‐white lysozyme demonstrated the utility of AAI‐PatchBag, successfully identifying top‐ranking crystallization conditions that led to the efficient production of desired crystals.

Together, MASCL and AAI‐PatchBag present a proof‐of‐concept framework for utilizing crystal packing interfaces to predict crystallization conditions, enhancing our understanding of protein crystallization and paving the way for broader applications in structural studies.

2. Results

2.1. Prevalence and Importance of 2‐Fold Symmetry in Crystal Lattices

Protein crystals, as highly ordered molecular assemblies characterized by specific rotational and translational symmetries [37], can be modeled by repeating discrete lattice units through symmetry operators and finite translations (Figure S1A,B). To better understand the molecular symmetry underlying protein crystal packing, we applied symmetry operators to construct the complete protein crystal lattices and analyzed the distribution of space groups in single‐chain human proteins from the PDB database. Our analysis revealed that the rankings of the top 12 space groups—P212121, P21, C2, C2221, P43212, P3121, P41212, P3221, I222, P21212, P6322, and P6122—remained consistent, with only slight variations in proportions when redundant samples were excluded based on sequence identity (Figure S1C,D). Notably, the non‐uniform distribution of space group frequencies shown in Figure S1C aligns with previous findings [38], which suggest that constraints on rigid‐body degrees of freedom during nucleation favor certain space groups. Among these, many feature 2‐fold rotational symmetry, which may explain why pairwise assemblies are frequently observed in protein crystals. Indeed, our analysis found that approximately 66.6% of our similarity‐reduced PDB set contains crystallographic dimeric assemblies (hereafter referred to as “similarity‐reduced 2‐fold proteins”) after crystal packing construction (Figure S1E). This high prevalence of 2‐fold symmetry provided a valuable mathematical constraint that allowed us to exclude incompatible packing arrangements for similarity‐reduced 2‐fold proteins, thereby enabling crystal packing simulation through multiple rounds of molecular docking.
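As an illustration of how symmetry operators and finite translations generate lattice copies, the following minimal Python sketch applies one operator to fractional coordinates. It assumes a tetragonal cell and uses the operator y, x, −z, a 2‐fold axis present in both P41212 and P43212. The function names and the whole‐cell shift argument are hypothetical and are not part of MASCL.

```python
# Operators act on fractional coordinates as x' = R x + t.
# "y, x, -z": the diagonal 2-fold axis shared by P41212 and P43212.
TWOFOLD_DIAGONAL = (((0, 1, 0), (1, 0, 0), (0, 0, -1)), (0.0, 0.0, 0.0))


def apply_operator(frac_xyz, operator, cell_shift=(0, 0, 0)):
    """Map one fractional coordinate through an operator plus a whole-cell shift."""
    rotation, translation = operator
    return tuple(
        sum(r * x for r, x in zip(row, frac_xyz)) + t + s
        for row, t, s in zip(rotation, translation, cell_shift)
    )


def frac_to_cart(frac_xyz, a, c):
    """Tetragonal cell (a = b, all angles 90 degrees): scale each fractional axis."""
    x, y, z = frac_xyz
    return (x * a, y * a, z * c)


# Symmetry mate of one point, placed in the neighboring cell along a.
mate = apply_operator((0.10, 0.20, 0.30), TWOFOLD_DIAGONAL, cell_shift=(1, 0, 0))
```

Applying a 2‐fold operator twice returns the starting coordinates, which gives a quick sanity check when operators are entered by hand.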

To take advantage of this symmetry constraint, we emphasized “symmetry handling” throughout our MASCL protocol, significantly reducing the vast conformational space required in conventional rigid‐body docking [39]. As outlined in Figure 1, we first predicted crystallographic dimeric models and subsequently assembled them into higher‐order tetrameric protein complexes while ensuring symmetry validation—mimicking the stepwise formation of the protein crystal lattice. By fitting symmetry operators to compute cell parameters, we were able to construct the entire crystal packing structure. Of note, a key component of our approach was using AF3 to predict pairwise assemblies within crystal lattices. While AF3—trained on protein structures in the PDB—consistently generated assemblies with standard 2‐fold symmetry, it lacked the ability to model assemblies with screw symmetries, preventing it from accurately addressing space groups containing screw symmetry along all three axes (e.g., P212121). As a result, we restricted our proof‐of‐concept study to space groups that exclusively exhibit standard 2‐fold symmetry.

FIGURE 1.

Workflow of the MASCL protocol for crystal packing prediction. (A) Two complementary approaches were used in the initial step of the MASCL protocol to predict crystallographic dimeric assemblies. In the first approach, AlphaFold2 (AF2) monomeric structures were retrieved from the AlphaFold Protein Structure Database, with regions scoring below pLDDT 70 truncated to improve docking accuracy. These AF2‐truncated structures were docked using C2‐DIPER across 3281 2‐fold‐like (180° ± 4°) symmetries, and the top models were selected based on energy rankings. In the second approach, AlphaFold3 (AF3) predicted five crystallographic dimeric models for each target, with regions below pLDDT 50 trimmed to enhance accuracy for subsequent docking. The top crystallographic dimeric models from both approaches were combined for further studies. (B) A second round of docking (C4‐DIPER) was performed on the crystallographic dimeric models to predict tetrameric assemblies, exploring 3395 4‐fold‐like (90° ± 4°) symmetries. Further refinement using SurfOrca, a grid docking and symmetry correction tool, mitigated atomic clashes and aligned the models with symmetry operators for accurate cell parameter determination. (C) The SurfOrca‐refined tetrameric models were replicated through a set of finite translations based on cell parameters to construct minimal protein crystal packings (MPCPs). The top MPCPs were selected for detailed evaluation.

To mitigate the computational demand of iterative molecular docking, we carefully balanced sample size with the inclusion of multiple symmetries in our choice of representative space groups. P41212 and P43212, two of the top five space groups for similarity‐reduced 2‐fold proteins (n = 190 and 183, respectively), were selected as exemplary subsets for subsequent benchmark experiments (Figures S1–S3).

2.2. Prediction of Crystallographic Dimeric Assembly Using C2‐DIPER and AF3

The initial step in our MASCL protocol involves predicting the pairwise molecular assembly within crystal lattices (Figure 1A). To achieve this, we employed two complementary approaches: (1) 2‐fold symmetry‐based docking (C2‐DIPER) utilizing AF2 monomeric structures, and (2) crystallographic dimeric model prediction with AF3.

For the C2‐DIPER method, AF2 monomeric structures were obtained from the AlphaFold Protein Structure Database using UniProt IDs recorded in PDB files. After removing low‐confidence regions with average predicted Local Distance Test (pLDDT) scores below 70 to enhance docking accuracy [4], 95.1% of the AF2‐truncated models achieved acceptable or better accuracy (RMSD95 < 5 Å) [2] (see Section 5.3.1 in Materials and Methods; Figure S4, Table S1). C2‐DIPER docking was then applied using pre‐optimized potential coefficients and a reduced conformational space of 3281 2‐fold‐like (180° ± 4°) symmetries, significantly decreasing computational load compared to the standard PIPER setup (70 000 conformations) [39]. The top 20 crystallographic dimeric models were selected based on energy‐based ranking following hierarchical clustering. In parallel, AF3 was used to predict five crystallographic dimeric models for each target protein. After trimming relatively low‐confidence regions with pLDDT scores below 50, five refined AF3 models were generated for each target. Recognizing the strengths of both methods—C2‐DIPER's broader exploration of conformational space and AF3's high‐quality predictions—we combined the top models from each method, yielding a final pool of 25 models (top 5 from AF3 and top 20 from C2‐DIPER) for further analysis.
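The confidence‐based truncation can be sketched as follows. AlphaFold Protein Structure Database PDB files store per‐residue pLDDT in the B‐factor column, so a simple filter can drop low‐confidence residues. This sketch is a simplification: it filters residue by residue, whereas the text above describes removing regions by their average pLDDT, and the function name is hypothetical.

```python
def truncate_low_plddt(pdb_text, cutoff=70.0):
    """Keep ATOM records of residues whose mean pLDDT (B-factor column) is >= cutoff."""
    per_residue = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Chain ID (column 22) plus residue number and insertion code (columns 23-27).
            key = (line[21], line[22:27])
            per_residue.setdefault(key, []).append(float(line[60:66]))
    keep = {key for key, values in per_residue.items()
            if sum(values) / len(values) >= cutoff}
    return "\n".join(
        line for line in pdb_text.splitlines()
        if line.startswith("ATOM") and (line[21], line[22:27]) in keep
    )
```

Calling the same function with `cutoff=50.0` would correspond to the looser threshold used for the AF3 dimeric models.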

A comparative performance analysis of AF3, C2‐DIPER, and the combined AF3 + C2‐DIPER approach is shown in Figure 2A,B and Table S1. Among the top 5 models, AF3 exhibited a higher success rate for crystallographic dimeric assembly prediction (DockQ ≥ 0.23) [40], achieving 32.6% and 31.7% success for the P41212 and P43212 subsets, respectively, compared to 19.5% and 30.1% using C2‐DIPER. AF3 also demonstrated better docking quality, producing more high‐quality models in both P41212 (AF3: 15.3% vs. C2‐DIPER: 2.6%) and P43212 (AF3: 17.5% vs. C2‐DIPER: 5.5%) subsets. However, AF3's limited exploration of conformational space restricted its ability to capture diverse crystallographic dimeric interactions, capping its overall prediction accuracy at approximately 30%. In contrast, C2‐DIPER's performance improved significantly when more models were considered. By the top 15 models, C2‐DIPER surpassed AF3 in success rate (P41212: 33.7%; P43212: 48.1%), albeit with slightly reduced docking quality. Notably, the integrated AF3 + C2‐DIPER method outperformed either method alone in terms of both top 5 and top 15 models. When considering all 25 models, the integrated approach achieved a success rate exceeding 60%, effectively doubling the performance of AF3's top 5 predictions. This allowed us to capture the majority of potential dimeric interactions within crystal lattices. Additionally, the integrated method improved docking quality, with average DockQ scores of 0.527 for the P41212 subset and 0.593 for the P43212 subset, markedly higher than C2‐DIPER alone (0.406 and 0.456, respectively; Figure 2C).

FIGURE 2.

Performance of crystallographic dimeric assembly predictions. (A, B) Crystallographic dimeric assembly prediction accuracies for the P41212 (A) and P43212 (B) subsets using three approaches: AF3, C2‐DIPER, and the combined AF3 + C2‐DIPER. Models with DockQ scores of at least 0.23 are considered successful, with further classifications as acceptable (0.23 ≤ DockQ < 0.49), medium (0.49 ≤ DockQ < 0.80), or high (DockQ ≥ 0.80). (C) Comparison of crystallographic dimeric assembly prediction accuracies in terms of DockQ scores between C2‐DIPER and AF3 + C2‐DIPER for the P41212 and P43212 subsets. The interquartile range (with darker color) and median DockQ score for each approach are shown. (D, E) Representative predicted crystallographic dimeric models for the P41212 (D) (PDB: 5CG5, 2ILR, 3MNG, 6BT1) and P43212 (E) (PDB: 1IAT, 4FLB, 4IZE, 4MUM) subsets, comparing AF3 and C2‐DIPER models against ground truth PDB reference structures. PDB models are shown in green, AF3 models in cyan, and C2‐DIPER models in marine, with DockQ scores provided to indicate model quality.
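The DockQ classes and the top‐N success rate used in this benchmark can be expressed in a few lines. The sketch below assumes that a target counts as successful when at least one of its first N ranked models reaches the threshold, which is the usual convention but is an assumption here. The label "incorrect" for scores below 0.23 and the function names are likewise illustrative.

```python
def dockq_class(score):
    """Map a DockQ score to the quality classes used in the figure."""
    if score >= 0.80:
        return "high"
    if score >= 0.49:
        return "medium"
    if score >= 0.23:
        return "acceptable"
    return "incorrect"


def top_n_success_rate(ranked_scores_per_target, n, threshold=0.23):
    """Fraction of targets with at least one model >= threshold in their top n."""
    hits = sum(
        1 for scores in ranked_scores_per_target.values()
        if any(s >= threshold for s in scores[:n])
    )
    return hits / len(ranked_scores_per_target)
```

Passing `threshold=0.36` with PackQ scores would give the corresponding success rate for packing models.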

Representative examples of successful crystallographic dimeric assembly predictions by the integrated method are shown in Figure 2D,E. The top‐ranked models from C2‐DIPER (blue) closely aligned with the PDB structures (green) with DockQ scores ≥ 0.23 in all cases. In contrast, AF3 incorrectly predicted crystallographic dimeric assemblies for structures such as 2ILR, 6BT1, and 4IZE, with DockQ scores below 0.1. This demonstrates the value of integrating AF3 and C2‐DIPER for more accurate crystallographic dimeric assembly predictions in crystal packing. The superior performance of the integrated method underscores the complementary nature of AF3 and C2‐DIPER. While AF3 excels in generating high‐quality models, its limited exploration of conformational space restricts its overall accuracy. We hypothesize that AF3's training on PDB data, which predominantly includes protein complexes and multi‐domain proteins rather than molecular interactions within crystal lattices, may have introduced a bias that limits its ability to predict specific molecular interactions within crystal lattices [41]. On the other hand, C2‐DIPER, using conventional docking with symmetry considerations, effectively examines a broader conformational space, exploring alternative molecular assemblies that AF3 might miss.

Taken together, the integrated AF3 + C2‐DIPER method consistently outperforms individual methods (AF3 and C2‐DIPER). By leveraging the strengths of both deep learning‐based structure prediction and traditional docking techniques, we enhance the reliability of crystallographic dimeric assembly predictions, capturing the majority of crystallographic dimeric interactions.

2.3. Simulation of Tetrameric Assembly Using C4‐DIPER and SurfOrca

To advance our predictions to higher‐order tetrameric protein complexes in the P41212 and P43212 space groups, we applied a second round of molecular docking—referred to as C4‐DIPER—to the top 25 crystallographic dimeric models from our previous analysis (Figure 1B). C4‐DIPER explored 3395 4‐fold‐like (90° ± 4°) symmetries to simulate tetrameric assemblies. Further refinement of these models was performed using SurfOrca, a grid docking and symmetry correction tool designed to mitigate atomic clashes and fit models into symmetry operators for accurate cell parameter calculations (Materials and Methods 5.2, Figure S3). After clustering and symmetry corrections, we selected the top 100 models from each crystallographic dimeric assembly prediction method (AF3, C2‐DIPER, and AF3 + C2‐DIPER) for detailed analysis.
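The angular criterion behind "2‐fold‐like" (180° ± 4°) and "4‐fold‐like" (90° ± 4°) rotations can be checked from the trace of a rotation matrix. The sketch below shows only that criterion. How DIPER enumerates its 3281 or 3395 rotations is not described in this section, and the function names are hypothetical.

```python
import math


def rotation_angle_deg(rotation):
    """Rotation angle in degrees from the trace of a 3x3 rotation matrix."""
    trace = rotation[0][0] + rotation[1][1] + rotation[2][2]
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_theta))


def is_n_fold_like(rotation, n_fold, tolerance_deg=4.0):
    """True if the rotation angle lies within tolerance of 360 / n_fold degrees."""
    return abs(rotation_angle_deg(rotation) - 360.0 / n_fold) <= tolerance_deg


def rotation_about_z(angle_deg):
    """Helper for the example: rotation by angle_deg about the z axis."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))
```

For example, `is_n_fold_like(rotation_about_z(92.0), 4)` accepts a rotation that is 2° off an exact 4‐fold, while a 100° rotation would be rejected.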

Recognizing a potential overestimation of DockQ for packing quality assessment (Materials and Methods 5.3.2, Figure S6), we introduced a more stringent evaluation criterion, CAPRI*, by raising the thresholds for fnat and replacing LRMS with RMSD95 in the CAPRI definition (Table S2) [42]. Additionally, we developed PackQ, a novel continuous scoring metric specifically designed to assess crystal packing, based on DockQ's training rationale [40]. Models with PackQ scores above 0.36 were considered successful.

As depicted in Figure 3A,B and Table S3, the integrated AF3 + C2‐DIPER approach exhibited better overall success rates compared to both individual methods across the P41212 and P43212 subsets. For the top 50 and top 100 models, the integrated approach achieved success rates of 26.8% and 31.1% for P41212, and 32.8% and 39.9% for P43212 targets, respectively. This marked a significant improvement over AF3 (24.1% and 26.2% for P41212; 28.3% and 30.4% for P43212) and C2‐DIPER (13.2% and 16.3% for P41212; 21.9% and 31.1% for P43212) alone. Consistently, the integrated approach also produced a greater number of medium‐ and high‐quality docking models, highlighting its superior performance not only in the success rate of tetrameric assembly predictions but also in the overall quality of the models.

FIGURE 3.

Evaluation of tetrameric assembly predictions. (A, B) Success rates of tetrameric assembly predictions for the P41212 (A) and P43212 (B) subsets using AF3, C2‐DIPER, and AF3 + C2‐DIPER. Models with PackQ scores above 0.36 are classified as successful. The adjacent bar chart shows detailed prediction accuracy in terms of PackQ scores: acceptable (0.36 ≤ PackQ < 0.54), medium (0.54 ≤ PackQ < 0.67), and high (PackQ ≥ 0.67). (C) Relationship between DockQ scores of crystallographic dimeric models and PackQ scores of tetrameric assemblies, with a black line indicating the correlation trend. (D) Comparison of the tetrameric assembly accuracies (measured by PackQ scores) for the P41212 and P43212 subsets across different crystallographic dimeric assembly prediction accuracies (measured by DockQ scores): acceptable (0.23 ≤ DockQ < 0.49), medium (0.49 ≤ DockQ < 0.80), and high (DockQ ≥ 0.80). (E, F) Success rates of tetrameric assembly predictions for the P41212 (E) and P43212 (F) subsets, focusing on accurately predicted crystallographic dimeric models, categorized by DockQ accuracy classes: acceptable, medium, and high. (G, H) Representative SurfOrca‐refined tetrameric assemblies for the P41212 (G) (PDB: 5CG5, 2ILR, 3MNG, 6BT1) and P43212 (H) (PDB: 1IAT, 4FLB, 4IZE, 4MUM) subsets using AF3 + C2‐DIPER. PDB reference models are shown in green, and AF3 + C2‐DIPER models in marine, with PackQ scores to demonstrate model accuracy.

Furthermore, additional analysis revealed a positive correlation between the DockQ scores of crystallographic dimeric models and the PackQ scores of the resulting tetrameric assemblies (Figure 3C,D), suggesting that accurate crystallographic dimeric assembly predictions improve the precision of tetrameric simulations. Specifically, when crystallographic dimeric models achieved acceptable or better accuracy (DockQ ≥ 0.23), the average success rate for the top 15 tetrameric predictions (PackQ ≥ 0.36) substantially increased to 74.3% for P41212 and 67.5% for P43212 (Figure 3E,F). Remarkably, in cases where crystallographic dimeric models were highly accurate (DockQ ≥ 0.8), the success rate rose to approximately 90% for both subsets. These findings underscore the critical role of precise crystallographic dimeric assembly predictions in simulating higher‐order assemblies. As crystallographic dimeric assembly precision improves, particularly with advancements in complex prediction tools like AF3, the reliability of tetrameric assembly simulations is expected to increase accordingly.

Representative examples of eight high‐accuracy tetrameric models derived from the integrated approach are shown in Figure 3G,H. As illustrated, the AF3 + C2‐DIPER models (blue) aligned well with the experimentally determined PDB structures (green), with PackQ scores exceeding 0.55. Collectively, these results demonstrate that tetrameric assembly can be accurately predicted using our integrated AF3 + C2‐DIPER approach, particularly when the underlying crystallographic dimeric models exhibit high docking quality.

2.4. Prediction of Protein Crystal Packing Using MASCL

In the final step of our MASCL protocol (Figure 1C), we constructed minimal protein crystal packings (MPCPs), consisting of a central monomer surrounded by interacting molecules (defined as any pair of alpha carbons from the two molecules within 10 Å), by assembling the tetrameric models using cell parameters determined by SurfOrca. As shown in Figure 4A and Table S4, the integrated AF3 + C2‐DIPER approach achieved success rates of 26.8% for P41212 and 30.1% for P43212 targets (PackQ ≥ 0.36) when considering the top 100 models. These rates consistently outperformed both AF3 (23.0% and 25.5%) and C2‐DIPER (12.1% and 17.5%), while maintaining docking quality comparable to AF3. Notably, the prediction power of AF3 plateaued after the top 100 models, whereas the integrated approach continued to identify successful crystal packing predictions beyond the top 100, suggesting that refining the scoring function for packing results might further improve the performance of the integrated method.
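The neighbor criterion that defines an MPCP (any pair of alpha carbons from the two molecules within 10 Å) can be sketched as a brute‐force distance check. The function names and the dictionary of symmetry mates are hypothetical. Treating "within" as less than or equal to the cutoff is an assumption, and a production implementation would normally use a spatial index rather than all‐pairs distances.

```python
import math


def is_interacting(ca_central, ca_neighbor, cutoff=10.0):
    """True if any C-alpha pair between the two molecules lies within the cutoff (in Å)."""
    return any(math.dist(p, q) <= cutoff for p in ca_central for q in ca_neighbor)


def select_mpcp_neighbors(ca_central, symmetry_mates, cutoff=10.0):
    """Names of symmetry mates that contact the central monomer."""
    return [
        name for name, coords in symmetry_mates.items()
        if is_interacting(ca_central, coords, cutoff)
    ]
```

The central monomer together with the mates returned by `select_mpcp_neighbors` would form the minimal packing used for evaluation.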

FIGURE 4.

Prediction of minimal protein crystal packings using MASCL. (A) Success rates of minimal protein crystal packing (MPCP) predictions for the P41212 and P43212 subsets utilizing AF3, C2‐DIPER, and AF3 + C2‐DIPER. Models with PackQ scores above 0.36 are deemed successful. The bar charts below quantify the detailed prediction accuracy in terms of PackQ scores: acceptable (0.36 ≤ PackQ < 0.54), medium (0.54 ≤ PackQ < 0.67), and high (PackQ ≥ 0.67). (B) Success rates of MPCP predictions for the P41212 and P43212 subsets, focusing on accurately predicted crystallographic dimeric models, divided by DockQ accuracy classes: acceptable (0.23 ≤ DockQ < 0.49), medium (0.49 ≤ DockQ < 0.80), and high (DockQ ≥ 0.80). (C) Comparison of MPCP prediction accuracies (PackQ scores) across different crystallographic dimeric assembly prediction classes (DockQ scores): acceptable (Acpt), medium (Medi), and high (High) for the P41212 and P43212 subsets. (D, E) Representative MPCP models predicted using AF3 + C2‐DIPER for the P41212 (D) (PDB: 5CG5, 2ILR, 3MNG) and P43212 (E) (PDB: 4FLB, 4IZE, 4MUM) subsets. PDB reference models are shown in green, with AF3 + C2‐DIPER models in marine. RMSD95s and PackQ scores are provided to indicate model accuracy.

Furthermore, in line with earlier findings (Figure 3E,F), when focusing on successfully predicted crystallographic dimeric models (DockQ ≥ 0.23), the average success rate for MPCP construction significantly increased to 57.9% for P41212 and 39.8% for P43212 in the top 25 models (Figure 4B). Figure 4C further supports this trend, showing that crystallographic dimeric models with medium or higher prediction accuracy consistently led to better MPCP predictions, achieving higher PackQ scores across both subsets. These results demonstrate that the accuracy of crystallographic dimeric models is also crucial for successful crystal packing predictions, indicating that advancements in crystallographic dimeric assembly prediction techniques hold the promise of further enhancing MASCL's ability to accurately model molecular interactions within crystal lattices.

Six exemplary MPCPs are shown in Figure 4D,E. These models accurately captured the molecular interactions between the central molecule and its surrounding neighbors at the packing interface, with PackQ scores exceeding 0.6. In particular, the models for 5CG5 from the P41212 subset and 4IZE from the P43212 subset, both with PackQ scores above 0.7, were near‐perfect predictions, validating the capability of using the MASCL protocol to accurately predict protein crystal packings. However, it is important to note that while some lattice packings deviated from their corresponding PDB reference structures and were classified as inaccurate, they still represented plausible crystal packing. We hypothesize that multiple lattice arrangements may exist for the same protein, some of which may remain undiscovered due to suboptimal crystallization conditions [43]. A deeper understanding of the forces driving these packings, coupled with optimized reagent combinations, could unveil additional accurate crystal packing configurations.

Altogether, the integrated method performed robustly in crystal packing prediction, particularly when the crystallographic dimeric models were highly accurate. This answers our first question: protein packing in the crystal lattice can be simulated.

2.5. Development of AAI‐PatchBag for Characterizing Protein Crystal Packing Interfaces

Since understanding the molecular packing interface within the crystal lattice is crucial for identifying the essential protein–protein interactions driving crystal formation, we next sought to address another key question: can analyzing crystal packing interfaces provide a more effective approach for exploring crystallization conditions? While existing methods focus primarily on geometric or structural features, these alone may not fully capture the complexity of molecular packing interactions. To overcome this limitation, we developed AAI‐PatchBag, a novel extension of the PatchBag methodology, which integrates physicochemical descriptors to provide a more detailed and comprehensive representation of protein crystal packing interfaces.

A schematic overview of the AAI‐PatchBag method is presented in Figure 5. To generate protein surfaces for analysis, we first constructed the minimal protein crystal packings (MPCPs) from similarity‐reduced PDB structures by applying appropriate symmetry operators (Figure 5A). The molecular surface of each central monomer within the MPCP was computed using the EDTSurf program [44]. We then represented the surface through the alpha carbon (Cα) coordinates of exposed residues, defined as those with any atom within 3 Å of the surface. These exposed surfaces were further categorized into two interaction types: (1) the protein–solvent exposure interface, consisting of regions exposed to solvent and not in direct contact with other proteins, and (2) the protein–protein interaction interface, comprising residues whose heavy atoms are within 5 Å of neighboring molecules (Figure 5B).
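The split of exposed residues into the two interface types can be sketched as below. The sketch assumes the exposed residues have already been identified (for example from an EDTSurf surface) and are passed in with their heavy‐atom coordinates. The function name, the residue labels, and the inclusive 5 Å comparison are illustrative assumptions.

```python
import math


def split_interface(exposed_residues, neighbor_atoms, cutoff=5.0):
    """Partition exposed residues into (protein-protein, protein-solvent) lists.

    exposed_residues: dict mapping a residue label to its heavy-atom coordinates.
    neighbor_atoms: heavy-atom coordinates of all neighboring molecules.
    """
    interaction, solvent = [], []
    for res_id, atoms in exposed_residues.items():
        touching = any(
            math.dist(a, b) <= cutoff for a in atoms for b in neighbor_atoms
        )
        (interaction if touching else solvent).append(res_id)
    return interaction, solvent
```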

FIGURE 5.

Framework of AAI‐PatchBag for protein crystal packing interface comparison. (A–D) Schematic diagram of AAI‐PatchBag: (A) protein crystal packing construction and representation of the packing interface using alpha carbon (Cα) atoms, (B) separation of the crystal packing interface into the protein–solvent exposure interface and the protein–protein interaction interface, (C) interface characterization using pre‐established geometric shape libraries to encode the molecular shapes as PatchBag vectors, and (D) extension of the 1D geometric shape vector into two‐dimensional matrices incorporating both geometric shapes and multiple physicochemical properties. (E) Comparison of two protein packing interfaces using conventional global descriptors and the local patch descriptor (AAI‐PatchBag), with differences quantified by weighted Euclidean distances.

Following the PatchBag framework [33], we decomposed these interfaces into overlapping C α‐based surface patches. For the protein–solvent exposure interface, each patch comprised a central exposed C α atom and its five nearest C α neighbors, while for the protein–protein interaction interface, each patch also included the nearest C α atom from the interacting neighboring molecule. The orientation of each patch was determined by averaging the normal vectors of all residues within the patch. We then randomly selected 8000 patches from all the C α‐represented surface patches generated from the non‐redundant similarity‐reduced PDB dataset, and performed k‐medoids clustering (k = 300) of these patches based on their minimal RMSD for all possible combinations of C α to create a representative library of geometric shapes (Figure 5C). Note that the angle between the orientations of patches in the same cluster was constrained to be smaller than 90°.
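
The patch construction described above can be sketched as below. This is a simplified illustration under stated assumptions: the function names are hypothetical, neighbors are found by brute force, and the orientation averaging, RMSD-based k‐medoids clustering, and 90° orientation constraint are not reproduced.

```python
import math

def build_patches(ca_coords, n_neighbors=5):
    """Return one patch per exposed C-alpha atom as a list of atom indices.

    Each patch holds the central atom index followed by the indices of its
    n_neighbors nearest C-alpha atoms (five in the text).
    """
    patches = []
    for i, center in enumerate(ca_coords):
        others = sorted(
            (j for j in range(len(ca_coords)) if j != i),
            key=lambda j: math.dist(center, ca_coords[j]),
        )
        patches.append([i] + others[:n_neighbors])
    return patches

def nearest_partner_index(center, partner_ca):
    """Index of the closest C-alpha atom in a neighboring molecule.

    For interaction-interface patches, the text adds this atom to the patch.
    """
    return min(range(len(partner_ca)), key=lambda j: math.dist(center, partner_ca[j]))
```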

To more comprehensively capture the molecular interaction among crystal packing interfaces, we developed AAI‐PatchBag to enhance the descriptive power of each patch beyond geometric features (Figure 5D). AAI‐PatchBag incorporates 69 physicochemical properties from the AAIndex (AAI) database [36], which provides numerical indices for various physicochemical attributes of amino acids (Table S5). These features, including attributes like hydrophobicity, charge, and polarity, were integrated with the shape descriptors, extending the original PatchBag one‐dimensional shape‐based vector into 69 two‐dimensional descriptors: one dimension encodes the occurrences of geometric shape (as in PatchBag), while the other captures the graded physicochemical properties for each patch. This enhanced representation allows for more precise characterization of the interactions occurring at protein packing surfaces, advancing beyond the limitations of purely geometric methods.

Consequently, for each newly provided protein crystal packing, we extracted surface patches and approximated them to their closest counterparts in the representative library of geometric shapes based on minimal RMSD. Through AAI‐PatchBag, we eventually represented the protein crystal packing interface as multiple distinct two‐dimensional descriptors, encapsulating both geometric and physicochemical features (Figure 5D). Combined with conventional global descriptors (sequence, C α coordinates, and 3DZD [45]), the overall similarities between two protein crystal packing interfaces can be quantified using weighted pairwise distances of conventional descriptors and the corresponding features in AAI‐PatchBag, offering a more robust strategy for comparing crystal packing interfaces (Figure 5E).
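
To make the two‐dimensional descriptor concrete, the sketch below builds one shape‐by‐property count matrix for a single physicochemical index and compares two such matrices. It rests on assumptions not stated in this section: each patch is summarized by one property value (for example the mean AAIndex value of its residues), property values are graded by fixed bin edges, and a single scalar weight is applied per descriptor. All names are hypothetical.

```python
import math

def aai_patchbag_matrix(shape_ids, patch_values, n_shapes, bin_edges):
    """Count patches by (library shape index, property grade).

    shape_ids: library shape index assigned to each patch (closest by RMSD).
    patch_values: one physicochemical value per patch (assumed summary).
    bin_edges: ascending thresholds separating the property grades (assumed).
    """
    n_bins = len(bin_edges) + 1
    matrix = [[0] * n_bins for _ in range(n_shapes)]
    for shape, value in zip(shape_ids, patch_values):
        grade = sum(value >= edge for edge in bin_edges)  # bin index
        matrix[shape][grade] += 1
    return matrix

def weighted_distance(m1, m2, weight=1.0):
    """Weighted Euclidean distance between two descriptor matrices."""
    squared = sum(
        (a - b) ** 2 for row1, row2 in zip(m1, m2) for a, b in zip(row1, row2)
    )
    return weight * math.sqrt(squared)
```

With 69 AAIndex properties, one such matrix would be built per property, and the per‐descriptor distances would be combined using the fitted weights.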

2.6. Performance Evaluation of AAI‐PatchBag in Identifying Crystallization Conditions

To evaluate the utility of AAI‐PatchBag in identifying potential crystallization conditions, we compared its ability to distinguish conditions similar to the ground truth PDB conditions from dissimilar ones against three conventional protein similarity descriptors: sequence identities, RMSDs, and Euclidean distances of 3DZDs. Pairwise dissimilarities of crystallization conditions among 2393 non‐redundant PDB samples were computed, setting a dissimilarity threshold of 1.0 based on the distribution (Figure S7B,D, Supplementary File S2). Conditions with dissimilarities below this threshold were classified as similar and likely conducive to successful crystallization. Logistic regression was employed to optimize the weights of AAI‐PatchBag features in differentiating between these conditions within a training set comprising 80% of 2393 non‐redundant PDB samples (Figure 6A). The classification performance was also evaluated on the remaining 20% of this dataset to validate the model's generalization (Figure S8B).

FIGURE 6.

Performance of AAI‐PatchBag in identifying potential crystallization conditions. (A) Receiver operating characteristic curves comparing AAI‐PatchBag's ability to distinguish crystallization conditions similar to the ground truth PDB conditions, against sequence identities, RMSDs, 3DZD pairwise distances, and random guessing. (B) Precision‐recall curves evaluating the performance of AAI‐PatchBag in retrieving crystallization conditions close to the actual PDB‐deposited conditions, compared to sequence identities, RMSDs, 3DZD pairwise distances, and random guessing. (C) Comparison of rankings for crystallization conditions, both similar and dissimilar to the reference PDB conditions, using AAI‐PatchBag, sequence identities, RMSDs, and 3DZD pairwise distances. (D, E) Cumulative plots showing the efficacy of AAI‐PatchBag, sequence identities, RMSDs, and 3DZD pairwise distances in identifying potential conditions similar to the PDB references from the PDB crystallization condition pool, consisting of the test set including (D) or excluding (E) homologous samples (sequence identity > 0.3). (F) Quantification of the reduction in the number of screened conditions upon finding potential hits for successful crystallizations, focusing on the test set that excludes homologous samples (E).

As shown in Figure 6A, AAI‐PatchBag outperformed conventional descriptors, achieving an area under the curve (AUC) of 0.66 in receiver operating characteristic (ROC) analysis, significantly higher than the average AUC of 0.52 for traditional methods, including sequence identities, RMSDs, and Euclidean distances of 3DZDs. Similarly, the precision‐recall (PR) curves (Figures 6B and S8B) demonstrated the improved discrimination ability of AAI‐PatchBag, with an AUC more than double that of random guessing or conventional approaches. This superior performance highlights AAI‐PatchBag's capability in identifying crystallization conditions that closely resemble successful experimental setups. Furthermore, unlike sequence identities [35], RMSDs, or 3DZD‐based methods, which failed to distinguish between similar and dissimilar conditions, AAI‐PatchBag consistently ranked similar crystallization conditions higher (Figures 6C and S8C), suggesting that AAI‐PatchBag is a more reliable approach for guiding crystallization experiments by focusing on molecular packing interface similarities.
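
For readers less familiar with the ROC AUC values quoted above, the sketch below computes the AUC through its rank interpretation: the probability that a randomly chosen similar‐condition pair receives a higher score than a randomly chosen dissimilar‐condition pair. The function name and the convention that higher scores mean "more similar" are assumptions for illustration; this is not the authors' evaluation code.

```python
def roc_auc(scores_similar, scores_dissimilar):
    """ROC AUC via pairwise comparison (equivalent to the Mann-Whitney U statistic).

    scores_similar: scores assigned to pairs with similar conditions.
    scores_dissimilar: scores assigned to pairs with dissimilar conditions.
    Ties between a similar and a dissimilar score count as half a win.
    """
    wins = 0.0
    for s in scores_similar:
        for d in scores_dissimilar:
            if s > d:
                wins += 1.0
            elif s == d:
                wins += 0.5
    return wins / (len(scores_similar) * len(scores_dissimilar))
```

An AUC of 0.5 corresponds to random guessing, which puts the reported 0.52 for conventional descriptors and 0.66 for AAI‐PatchBag in context.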

FIGURE 7.

Successful crystallization of hen egg‐white lysozyme under conditions identified using AAI‐PatchBag. (A) 3D structures of hen egg‐white lysozyme (PDB: 4QEQ), human lysozyme C (PDB: 1JKC), human protein adenylyltransferase (PDB: 6I7L), human bile acid receptor (PDB: 3FLI), human dihydroorotate dehydrogenase (PDB: 4OQV), and human chitotriosidase‐1(PDB: 4WKA). (B) Representative images of hen egg‐white lysozyme crystals grown under different crystallization conditions (as deposited in the PDB for 1JKC, 6I7L, 3FLI, and 4WKA) along with their respective cell parameters.

To better quantify AAI‐PatchBag's efficiency in identifying potential crystallization conditions, we applied the method to a newly curated dataset and tracked the cumulative identification of potential crystallization conditions during the screening process. As depicted in Figure 6D,E, AAI‐PatchBag identified a greater number of similar conditions earlier in the screening compared to conventional methods, both in the general test set and in a more stringent leave‐homologous‐out test set, where proteins sharing over 30% sequence identity were excluded. Quantitatively, AAI‐PatchBag efficiently reduced the number of attempts required to identify 1, 3, and 5 potential successful crystallization hits by 39.0%, 25.4%, and 13.9%, respectively (Figure 6F). This reduction underscores AAI‐PatchBag's ability to streamline the crystallization condition screening process, reducing experimental effort while improving the likelihood of success. In summary, our results demonstrate that AAI‐PatchBag outperforms conventional methods, providing a more robust and efficient tool for ranking crystallization conditions, with the potential to enhance the success rate of crystallization experiments.
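
The "number of attempts required to identify k hits" can be made concrete with a short sketch. The function names and the representation of a ranked screen as a list of booleans are assumptions; the baseline against which the reported reductions were computed is the one shown in Figure 6F and is not reproduced here.

```python
def attempts_to_reach(ranked_is_hit, k):
    """Number of ranked conditions screened before k hits are found.

    ranked_is_hit: booleans in screening order (True = similar condition).
    Returns None if fewer than k hits exist in the list.
    """
    hits = 0
    for n, is_hit in enumerate(ranked_is_hit, start=1):
        hits += bool(is_hit)
        if hits == k:
            return n
    return None

def percent_reduction(baseline_attempts, method_attempts):
    """Relative saving in screening attempts compared with a baseline ranking."""
    return 100.0 * (baseline_attempts - method_attempts) / baseline_attempts
```

For example, needing 61 attempts where a baseline needs 100 corresponds to a 39% reduction.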

2.7. Systematic Exploration of Crystallization Conditions for Hen Lysozyme Using AAI‐PatchBag

To further validate the effectiveness of AAI‐PatchBag in identifying crystallization conditions from a given PDB reference database, we conducted a case study on hen egg‐white lysozyme (HEWL), a classical crystallization model protein. First, we applied AAI‐PatchBag to characterize the packing interfaces of both tetragonal HEWL (PDB: 4QEQ; Figure 7A) and a qualified non‐redundant PDB reference dataset (see Supplementary File S1). We then computed interface dissimilarities between HEWL and each reference PDB entry, ranking them based on similarity. The crystallization conditions associated with the highest‐ranking proteins were considered potential leads for HEWL.

Crystallization experiments were subsequently conducted using the conditions corresponding to the top 20 ranked PDB entries, leading to successful crystallization in five cases (Figure 7A, Table 2). Among these, four conditions (PDB: 1JKC, 6I7L, 3FLI, and 4WKA) produced high‐quality tetragonal crystals (P43212) with unit cell parameters closely matching those of the input structure (PDB: 4QEQ), as confirmed by X‐ray diffraction analysis (Figure 7B). The fifth condition (PDB: 4OQV) resulted in urchin‐like microcrystals, which could not be definitively assigned a space group due to suboptimal crystal quality. Interestingly, only one of these five conditions involved a moderately homologous protein (PDB: 1JKC, a human‐derived lysozyme, ~59.2% sequence identity to HEWL), while the other four conditions (PDB: 6I7L, 3FLI, 4OQV, and 4WKA) originated from proteins with low homology to HEWL, as assessed by sequence alignment and RMSD95 (Figure 7A; see Supplementary File S3). These results suggest that AAI‐PatchBag effectively identifies crystallization conditions independently of conventional sequence or structural homology measures.

TABLE 2.

Top 20 crystallization conditions for hen egg‐white lysozyme (PDB: 4QEQ) predicted using AAI‐PatchBag.

PDB ID UniProt ID Crystallization condition
Input
4QEQ P00698 9% w/v sodium chloride, 0.1 M sodium acetate, pH 4.9, VAPOR DIFFUSION, HANGING DROP, temperature 293 K
Output
1JKC* P61626 5 M NH4NO3, 20 mM sodium acetate, pH 4.5
6I7L* Q9BVA6 1.5 M NaCl, 10% Ethanol
1LYY P61626 0.2 M (NH4)2SO4, 30% PEG 8000, pH 4.0
5ZM5 Q9BXW6 1.2 M Sodium Citrate, 100 mM Sodium Cacodylate, pH 6.2
5GL7 P56192 200 mM magnesium chloride, 5 mM DTT, 21% PEG 8000, 100 mM Tris pH 7.5
3DK9 P00390 0.23 M (NH4)2SO4, 0.1 M potassium phosphate, pH 7.0
2AA2 P08235 0.9 M Li2SO4, 2% PEG2KMME, 0.1 M hepes, pH 7.5
7REI Q96SZ5 0.2 M Li2SO4, 20% w/v PEG 3350, 0.1 M Bis‐Tris pH 5.5
4OQV* Q02127 1.76 M (NH4)2SO4, 1.9 M NaCl, 0.1 M Sodium Acetate, pH 5.4
3FLI* Q96RI1 22% PEG4K, 150 mM sodium acetate, 75 mM Tris, pH 8.5
7E9V P30085 2.5 M (NH4)2SO4, 0.1 M Tris pH 8.5
4JKQ Q92736 200 mM ammonium formate, 21% PEG 3350, 20 mM TRIS–HCl, 150 mM NaCl, 10% glycerol, 100 mM HEPES, pH 8.0
4WKA* Q13231 23% (w/v) polyethylene glycol (PEG) 3350, 0.2 M potassium sodium tartrate (PST) at pH 7.2
2UXW P49748 15% PEG3350, 0.1 M SODIUM SUCCINATE PH7.0, pH 7.00
5HKX P22681 10% (w/v) PEG 5000 MME, 0.1 M MES, 12% (v/v) 1‐propanol
3FAY P46940 500 mM MgCl2, 20% PEG 2000 methyl ether, 100 mM Tris HCL, pH 8.5
4PWN Q9H4A3 19% PEG 3350, 0.35 M potassium phosphate, pH 7.4
5UMS Q08945 28% polyethylene glycol monomethyl ether 2000, 100 mM Bis‐Tris, pH 6.5
2DH2 P08195 25% PEG 4000, 0.2 M SODIUM ACETATE, 0.1 M TRIS BUFFER, pH 8.5
3B0I P00709 2.0 M (NH4)2SO4, 0.01 M CaCl2, 0.1 M MES, pH 6.0

Note: Conditions marked with an asterisk (*) led to successful crystallization.

Given that HEWL is well known for crystallizing under diverse conditions [46], we analyzed 971 PDB entries containing the classical HEWL sequence [47] deposited before January 1, 2025 (Supplementary File S4). Among these, 857 structures belonged to the tetragonal space group, with 801 entries providing complete crystallization condition details. Despite the large number of depositions, reported tetragonal HEWL crystallization conditions clustered into 12 major groups based on their principal precipitants, with ~80% of cases utilizing a “sodium chloride + acetate buffer” system. The remaining conditions exhibited greater variability but predominantly employed reagents found in the 12 major groups. These findings suggest that, while HEWL can crystallize under a broad range of conditions, the reagents used for tetragonal HEWL crystallization are constrained to a relatively narrow set.

Next, we specifically compared the AAI‐PatchBag‐derived conditions with the previously reported tetragonal HEWL conditions (Table 2, Supplementary File S4). Among the three successful conditions from non‐homologous proteins (PDB: 6I7L, 3FLI, and 4WKA), one (PDB: 6I7L; 1.5 M NaCl + 10% ethanol) matched a known tetragonal crystallization setup (PDB: 6AGN; see Supplementary File S4), confirming that AAI‐PatchBag can recover well‐established crystallization conditions. More notably, the other two conditions (PDB: 3FLI and 4WKA) represented novel crystallization conditions that, to our knowledge, had not been previously associated with tetragonal HEWL crystallization (Table 2, Supplementary File S4). These results underscore the advantage of using crystal packing interface similarity, rather than conventional sequence or structural homology, to identify both established and novel crystallization conditions.

In conclusion, our findings demonstrate that analyzing crystal packing arrangement for a given protein can improve the success rate of crystallization by utilizing molecular packing interface similarities through AAI‐PatchBag, holding promise in accelerating the identification of potential conditions for future crystallization studies.

3. Discussion

In this study, we developed MASCL, the first computational protocol specifically designed to predict protein crystal packing by integrating AlphaFold with symmetrical docking. With a focus on molecular symmetry handling, we addressed the challenge of excessive docking decoys [48] by filtering out incompatible molecular packings, which significantly improved both accuracy and the efficiency of batch processing across multiple docking rounds. Benchmarking on non‐redundant P41212 and P43212 space groups, MASCL achieved success rates of 26.8% and 30.1%, respectively, for predicting packing interfaces within the top 100 minimal protein crystal packing (MPCP) models. When focusing on successfully predicted crystallographic dimeric models (DockQ ≥ 0.23), the average success rates for MPCP predictions increased to 57.9% and 39.8%, highlighting the critical role of crystallographic dimeric model accuracy in crystal packing predictions and suggesting that future advancements in crystallographic dimeric assembly prediction techniques could further improve crystal packing predictions.

Prior studies have emphasized the prevalence of 2‐fold symmetry in crystal packing and examined how crystallographic dimers compare to biological dimers [49, 50, 51, 52, 53]. These earlier findings show that biological (specific) dimer interfaces tend to be larger, more compact, and enriched in hydrogen bonds per polar atom compared with non‐specific (crystal‐packing) interfaces. Such insights into interface size and buried surface area may help refine MASCL by improving its discrimination between crystallographic and biologically relevant dimers [54, 55]. Furthermore, the inherently small, fragmented nature of many crystal packing interfaces may explain why focusing on localized interface regions using small surface patches is more effective than analyzing entire protein surfaces.

Recent advances in Small Angle X‐ray Scattering (SAXS) offer another promising avenue for enhancing packing prediction accuracy [56, 57, 58]. By comparing experimental SAXS profiles with theoretical scattering profiles derived from docked structures, SAXS could be integrated into the MASCL protocol to refine the crystallographic dimeric assemblies, improving accuracy in subsequent MPCP predictions. Additionally, some predicted lattice configurations, though different from PDB references, may represent valid but unobserved crystal packings resulting from different crystallization conditions. A deeper understanding of the molecular interactions governing these packing configurations, along with optimized reagent combinations, could help uncover alternative lattice arrangements. Moreover, while neural network‐based scoring functions have shown promise in docking complex models [59, 60, 61], there is currently no available neural network‐based program specifically designed for scoring the crystal packing model. We believe that integrating neural network‐based scoring functions could further increase the overall success rate of protein crystal packing predictions. Furthermore, although our study focused on P41212 and P43212 space groups, MASCL has the potential to be applied to other space groups, as 66% of all space groups exhibit 2‐fold symmetry (Figure S1E), indicating room for expansion.

In parallel, we introduced AAI‐PatchBag to evaluate protein crystal packing similarities. AAI‐PatchBag surpassed conventional metrics, including sequence identities, RMSDs, and 3DZD pairwise distances, in distinguishing crystallization conditions similar to the ground truth PDB condition, making it a promising tool for efficiently identifying potential crystallization setups. Beyond crystal packing comparison, AAI‐PatchBag, as an advanced version of PatchBag, could be extended to broader applications such as evaluating protein–protein or protein‐ligand interactions, offering valuable utility in structural biology research [33, 34, 62, 63]. Notably, over 40% of crystallization conditions for single‐chain human proteins deposited in the PDB are incomplete or contain rarely used reagents (Figure 7B). AAI‐PatchBag can help retrieve or even identify new crystallization conditions based on known crystal packing interfaces. However, AAI‐PatchBag has limitations that leave room for improvement. First, typing errors and missing information in the crystallization conditions deposited in the PDB are not uncommon. Such mistakes can sometimes be corrected when they are noticed, but erroneous or hard‐to‐repeat crystallization conditions that act as outliers are often difficult to detect. Second, the inherent variability in crystallization conditions, such as tolerances for deviations in reagent concentration and the interchangeability of similar reagents, introduces complexity into condition similarity calculations. Therefore, using a non‐linear model, such as a neural network, may be a potential approach to further enhance the correlation between the crystal packing interface and the PDB crystallization condition.

In summary, MASCL and AAI‐PatchBag open new avenues for understanding molecular interactions within protein crystal lattices and optimizing protein crystallizations, laying the foundation for future advancements in structural biology.

4. Conclusions

X‐ray crystallography is a critical technique in structural biology for elucidating protein structures, yet protein crystallization continues to be a significant bottleneck. Understanding the molecular forces driving crystal packing could revolutionize protein crystallization; however, existing methods for predicting crystal packing interactions are still limited. In this study, we introduced two innovative computational tools, MASCL and AAI‐PatchBag, to address this challenge. MASCL integrates AlphaFold's advanced protein structure prediction with crystallographic symmetry constraints to simulate protein arrangements within crystal lattices, enabling the prediction of protein–protein interactions at the molecular packing interface prior to practical experiments. In addition, we developed AAI‐PatchBag, an algorithm that characterizes protein surfaces as geometric patches with distinct physicochemical descriptors, allowing for precise quantification of interface similarities. By comparing packing interfaces with known structures in the PDB, AAI‐PatchBag significantly reduced the number of trials required for successful crystallization compared to conventional methods. Furthermore, we validated the utility of AAI‐PatchBag through hen lysozyme crystallization, efficiently obtaining crystals with the desired packing among the top‐ranking conditions. Collectively, our study demonstrates MASCL's capability in predicting protein crystal packing and presents AAI‐PatchBag as a more efficient approach for exploring crystallization conditions, which enhances our understanding of protein crystallization and advances the field by minimizing the trial‐and‐error nature of condition identification.

5. Materials and Methods

5.1. Non‐Redundant Single‐Chain Human Protein Dataset

For the space group analysis, we downloaded 16 190 PDB files of single‐chain human proteins determined by X‐ray crystallography and deposited in the PDB (https://www.rcsb.org/) prior to May 27, 2022. After excluding human‐associated proteins, such as viral proteins, 16 123 single‐chain human proteins remained (referred to as the “single‐chain human protein dataset”; Figure S1C). To reduce homology redundancy, the dataset was clustered and filtered based on sequence identity, resulting in a non‐redundant set of 3781 proteins, each sharing less than 70% sequence identity to others (Figure S1D). For each cluster, the structure with the highest resolution was selected as the representative sample. A complete list of PDB entries for each dataset is provided in Supplementary File S1.

5.2. MASCL Pipeline

The MASCL pipeline predicts protein crystal packing by first generating crystallographic dimeric models and then assembling them into higher‐order complexes with symmetry examination, simulating the stepwise formation of the protein crystal lattice (Figure 1). This pipeline consists of three main steps: crystallographic dimeric model prediction, C4‐DIPER for tetrameric assembly simulation, and SurfOrca for symmetry correction and crystal packing construction.

5.2.1. Crystallographic Dimeric Model Prediction

5.2.1.1. AlphaFold2

All AlphaFold2 (AF2) predicted structures were retrieved from the AlphaFold monomer V2.0 predictions in the AlphaFold Protein Structure Database (https://alphafold.ebi.ac.uk/) based on the UniProt ID listed in each PDB file, as of January 25, 2023. Structures corresponding to the sequences in the PDB files were extracted, and low‐confidence regions, defined as five or more consecutive residues with an average pLDDT score below 70, were removed to generate AF2‐truncated monomeric models for improved performance in the subsequent C2‐DIPER docking.
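
One possible reading of the low‐confidence rule is sketched below: any window of five consecutive residues whose mean pLDDT falls below 70 is flagged for removal. The sliding‐window interpretation and the function name are assumptions; the text does not specify how region boundaries were determined, and other readings (for example, runs in which every residue is below the cutoff) are equally plausible.

```python
def low_confidence_mask(plddt, window=5, cutoff=70.0):
    """Flag residues that fall inside any low-confidence window.

    plddt: per-residue pLDDT scores in sequence order.
    Returns a list of booleans, True where the residue would be removed.
    """
    remove = [False] * len(plddt)
    for start in range(len(plddt) - window + 1):
        segment = plddt[start:start + window]
        if sum(segment) / window < cutoff:
            for i in range(start, start + window):
                remove[i] = True
    return remove
```

Note that under this reading a few confident residues flanking a disordered stretch can also be trimmed, since they share a window with low‐scoring neighbors.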

5.2.1.2. AlphaFold3

AlphaFold3 (AF3) crystallographic dimeric models for the P41212 and P43212 benchmark datasets were generated using the latest online version, AlphaFold Server Beta (https://alphafoldserver.com/). For each target, sequences from the PDB files were input, with the random seed set to auto and the molecular copy number set to 2. Five crystallographic dimeric models were generated per target and ranked by the AF3 ranking scores. Relatively confident regions, defined as consecutive residues with an average pLDDT score above 50, were extracted for subsequent docking experiments.

5.2.1.3. C2‐DIPER

The C2‐DIPER docking pipeline is mainly divided into the following steps. First, docked conformations of the preprocessed AF2‐truncated model were generated using PIPER, a Fast Fourier Transform‐based protein docking program that explores possible complex configurations on a grid with a 1.2 Å step size. Note that only 3281 2‐fold‐like (180° ± 4°) conformations were used in our C2‐DIPER docking procedure.
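
A "2‐fold‐like" filter of this kind can be illustrated by extracting the rotation angle from a 3 × 3 rotation matrix and keeping rotations within ±4° of 180°. This is a generic sketch: how PIPER enumerates rotations and how the authors selected the 3281 conformations is not described here, and the function names are hypothetical.

```python
import math

def rotation_angle_deg(R):
    """Rotation angle (degrees) of a 3x3 rotation matrix given as nested lists.

    Uses the identity trace(R) = 1 + 2*cos(theta).
    """
    trace = R[0][0] + R[1][1] + R[2][2]
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard rounding error
    return math.degrees(math.acos(cos_theta))

def is_n_fold_like(R, n=2, tol=4.0):
    """True if the rotation angle is within tol degrees of 360/n."""
    return abs(rotation_angle_deg(R) - 360.0 / n) <= tol
```

The same test with n = 4 corresponds to the 4‐fold‐like (90° ± 4°) criterion used later for tetrameric complexes.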

Next, the pairwise potential ($E_{\mathrm{total}}$) of each docked model was computed using an energy‐based scoring function:

$$E_{\mathrm{total}} = w_1 E_{\mathrm{attr}} + w_2 E_{\mathrm{rep}} + w_3 E_{\mathrm{elect}} + w_4 E_{\mathrm{GB}} + w_5 E_{\mathrm{DARS}}$$

where $E_{\mathrm{attr}}$ and $E_{\mathrm{rep}}$ are the attractive and repulsive van der Waals interactions, $E_{\mathrm{elect}}$ is the electrostatic term, $E_{\mathrm{GB}}$ represents Generalized Born (GB) electrostatic effects, and $E_{\mathrm{DARS}}$ accounts for desolvation contributions.

The initial weights were set using a default balanced set ($w_1 = -0.4$, $w_2 = 0.4$, $w_3 = 600$, $w_4 = 0$, and $w_5 = 1$) and iteratively refined utilizing logistic regression on a training set (80% of similarity‐reduced 2‐fold proteins, excluding P41212 and P43212 samples; Figure S5A). Decoys with C α RMSD < 5 Å from the native structures were considered acceptable models, and training continued until accuracy convergence (Figure S5B). The optimized coefficients from the fourth round of refinement were selected for future docking experiments:

$$E_{\mathrm{total}} = 0.43\,E_{\mathrm{attr}} + 1.34\,E_{\mathrm{rep}} + 341\,E_{\mathrm{elect}} + 9.28\,E_{\mathrm{GB}} + 0.6\,E_{\mathrm{DARS}}$$
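
As a worked illustration of the optimized scoring function, the sketch below evaluates the weighted sum for a decoy and ranks decoys by it. The dictionary keys, the function names, and the convention that a lower potential is more favorable are assumptions for illustration; the individual energy terms would come from the docking program and are not computed here.

```python
# Optimized coefficients quoted in the text (fourth refinement round).
OPTIMIZED_WEIGHTS = {
    "attr": 0.43,
    "rep": 1.34,
    "elect": 341.0,
    "gb": 9.28,
    "dars": 0.6,
}

def total_energy(terms, weights=OPTIMIZED_WEIGHTS):
    """Weighted sum of the five energy terms for one docked model."""
    return sum(weights[name] * terms[name] for name in weights)

def rank_decoys(decoys, weights=OPTIMIZED_WEIGHTS):
    """Sort decoys from lowest to highest total potential (assumed best first)."""
    return sorted(decoys, key=lambda terms: total_energy(terms, weights))
```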

After scoring, the docked models were ranked by their pairwise potentials. The top 600 models were subjected to agglomerative hierarchical clustering, using pairwise C α RMSD as the distance metric. The clustering radius was set to 5 Å C α RMSD, and the models with the lowest potential and average inter‐cluster C α RMSD within each cluster were selected as the representatives. Finally, the representative clustered models were re‐ranked based on both cluster size and average potential of the cluster, and the top 25 models for each preprocessed AF2‐truncated sample were output as the final crystallographic dimeric models.

5.2.2. C4‐DIPER

A similar pipeline was applied for C4‐DIPER docking. The top 5, 25, and 25 crystallographic dimeric models from AF3, C2‐DIPER, and AF3 + C2‐DIPER, respectively, were used as inputs, and PIPER was employed to generate 3395 4‐fold‐like (90° ± 4°) tetrameric complexes for each crystallographic dimeric model. Since the symmetry operators in space groups P41212 and P43212 consist of 4 crystallographic dimeric pairs (paired serial numbers: 1–7, 2–8, 3–5, and 4–6; Figure S3), one pairwise subunit (docking target) inside the tetrameric complex was assigned to the 1–7 pair, while another pairwise subunit (docking ligand) was fitted into a neighboring pair (3–5 or 4–6) to calculate the potential cell parameter pool. Complexes with unreasonable cell parameters, which led to severe (> 5%) atomic clashes upon crystal packing construction, were discarded. Afterward, the remaining qualified tetrameric complexes (i.e., those with appropriate cell parameters) were ranked by their pairwise potentials, and hierarchical clustering was performed using the same method as in C2‐DIPER to identify the top 25 tetrameric complexes for each input. In total, 625 models (25*25) were generated for each input crystallographic dimeric model and re‐ranked based on a comprehensive scoring of crystallographic dimeric assembly prediction rankings and C4‐DIPER docking results. These resulting C4‐DIPER tetrameric complexes from the three methods (AF3, C2‐DIPER, and AF3 + C2‐DIPER) were then selected for further refinement.

5.2.3. SurfOrca

To address potential atomic clashes following crystal packing construction, docked decoys were generated for each C4‐DIPER tetrameric complex by translating the docking ligand (i.e., one pairwise subunit within the tetrameric complex) through a series of independent steps with a 1.2 Å increment, within an 8 Å radius from the center of the docking ligand. The fine‐tuned tetrameric complexes from this local grid docking were adjusted to strictly conform to the theoretical 2‐ and 4‐fold symmetry constraints (as opposed to “2‐ and 4‐fold‐like”) imposed by the symmetry operators of space groups P41212 and P43212, and the precise cell parameters were determined accordingly (Figure S3). Next, symmetry‐refined tetrameric models with no atomic clashes (defined as no atoms within 2.8 Å of each other) were identified, and the model with the minimal RMSD relative to the input C4‐DIPER tetrameric complex was selected as the representative SurfOrca‐refined tetrameric model. This symmetry‐refined and clash‐corrected model was then used to construct the minimal protein crystal packing (MPCP) by repeating the model through a set of finite translations based on the corresponding cell parameters. The top 100 SurfOrca‐refined tetrameric models and the top 200 MPCPs were output for subsequent analyses in Figures 3 and 4.
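
The local grid search and the clash criterion can be sketched as follows. This assumes a regular Cartesian grid of translations and a brute‐force intermolecular distance check; the actual sampling scheme, symmetry enforcement, and cell‐parameter determination in SurfOrca are not reproduced, and the function names are hypothetical.

```python
import math

def translation_offsets(step=1.2, radius=8.0):
    """Grid translations (in angstroms) lying within a sphere of the given radius."""
    n = int(radius // step)  # number of grid steps per direction
    offsets = []
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            for k in range(-n, n + 1):
                v = (i * step, j * step, k * step)
                if math.sqrt(sum(c * c for c in v)) <= radius:
                    offsets.append(v)
    return offsets

def has_clash(atoms_a, atoms_b, min_dist=2.8):
    """True if any atom pair across the two molecules is closer than min_dist."""
    return any(math.dist(a, b) < min_dist for a in atoms_a for b in atoms_b)
```

Each offset would be applied to the docking ligand, the symmetry‐refined model rebuilt, and only clash‐free candidates retained before choosing the one closest to the input complex.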

5.3. Evaluation Metrics

5.3.1. Model Quality Metrics

To assess monomeric model accuracy, the root‐mean‐square deviation of 95% of C α atoms with the lowest alignment error, termed RMSD95, was calculated to quantify the backbone differences between the predicted model and the native structure [2]. Prediction accuracy was categorized into four classes: high (RMSD95 ≤ 1 Å), medium (1 Å < RMSD95 ≤ 3 Å), acceptable (3 Å < RMSD95 ≤ 5 Å), and incorrect (RMSD95 > 5 Å).
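
A simplified RMSD95 calculation is sketched below. It assumes the predicted and native C α coordinates are already superposed and matched residue by residue; the superposition procedure of the cited method [2] is not reproduced, and the function names are hypothetical.

```python
import math

def rmsd95(pred, native, fraction=0.95):
    """RMSD over the best-fitting fraction of matched C-alpha atoms.

    pred, native: equal-length lists of (x, y, z) tuples, already superposed.
    """
    squared = sorted(math.dist(p, q) ** 2 for p, q in zip(pred, native))
    keep = max(1, math.floor(fraction * len(squared) + 1e-9))
    return math.sqrt(sum(squared[:keep]) / keep)

def classify_monomer(value):
    """Accuracy class from the RMSD95 thresholds given in the text (angstroms)."""
    if value <= 1.0:
        return "high"
    if value <= 3.0:
        return "medium"
    if value <= 5.0:
        return "acceptable"
    return "incorrect"
```

Discarding the worst 5% of atoms keeps a single flexible loop or terminus from dominating the score.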

For crystallographic dimeric models, the DockQ score, a continuous interface docking quality metric combining fraction of native contacts (f nat), LRMS, and IRMS, was utilized to evaluate docking quality. Models were classified into four quality levels: high (DockQ ≥ 0.80), medium (0.49 ≤ DockQ < 0.80), acceptable (0.23 ≤ DockQ < 0.49), and incorrect (DockQ < 0.23). A docked model achieving acceptable or higher accuracy (DockQ ≥ 0.23) was considered successful for the given target.

5.3.2. Packing Quality Metrics

Certain tetrameric models with acceptable DockQ scores displayed significant deviations in global molecular arrangement compared to the reference structure (Figure S6A). These models performed well in predicting crystallographic dimeric interfaces (Figure S6B) but failed to accurately predict interactions between different crystallographic dimeric pairs within crystal packing (Figure S6C). When DockQ was used to evaluate overall packing quality, these inaccuracies were averaged out, leading to overestimated DockQ scores that did not adequately reflect poor packing accuracy.

To address this, elevating the thresholds for f nat and incorporating global molecular assembly considerations became necessary when assessing packing quality. Therefore, we modified the CAPRI assessment by raising the cutoffs for f nat and replacing LRMS with RMSD95, creating the CAPRI* criterion (Table 1). Packing models were then classified into four tiers: incorrect [f nat < 0.3 or (RMSD95 > 8.0 & IRMS > 4.0)], acceptable [(f nat ≥ 0.3 & f nat < 0.5) and (RMSD95 ≤ 8.0|IRMS ≤ 4.0) or (f nat ≥ 0.5 & RMSD95 > 5.0 & IRMS > 2.0)], medium [(f nat ≥ 0.5 & f nat < 0.7) and (RMSD95 ≤ 5.0|IRMS ≤ 2.0) or (f nat ≥ 0.7 & RMSD95 > 3.0 & IRMS > 1.0)], and high [(f nat ≥ 0.7) & (RMSD95 ≤ 3.0|IRMS ≤ 1.0)].
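
The CAPRI* tiers can be expressed as a cascade of checks, sketched below. Because the bracketed conditions overlap in places, the order of evaluation (incorrect first, then high, then medium, with acceptable as the remainder) is an assumption modeled on how the standard CAPRI classes are usually applied; the function name is hypothetical.

```python
def capri_star_class(fnat, rmsd95, irms):
    """Classify a packing model under the CAPRI* criterion.

    fnat: fraction of native contacts; rmsd95 and irms in angstroms.
    """
    if fnat < 0.3 or (rmsd95 > 8.0 and irms > 4.0):
        return "incorrect"
    if fnat >= 0.7 and (rmsd95 <= 3.0 or irms <= 1.0):
        return "high"
    if fnat >= 0.5 and (rmsd95 <= 5.0 or irms <= 2.0):
        return "medium"
    return "acceptable"
```

For example, a model with f nat = 0.8, RMSD95 = 4.0 Å, and IRMS = 1.5 Å has enough native contacts for the high tier but misses both distance cutoffs, so it is classed as medium.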

TABLE 1.

Comparison of docking and packing accuracy criteria in terms of CAPRI, DockQ, CAPRI*, and PackQ score.

Class | CAPRI: f nat | CAPRI: LRMS (Å) | CAPRI: IRMS (Å) | DockQ score
High | x ≥ 0.5 | x ≤ 1 | or x ≤ 1 | x ≥ 0.80
Medium | x ≥ 0.3 | 1 < x ≤ 5 | or 1 < x ≤ 2 | 0.49 ≤ x < 0.80
Acceptable | x ≥ 0.1 | 5 < x ≤ 10 | or 2 < x ≤ 4 | 0.23 ≤ x < 0.49
Incorrect | x < 0.1 | | | x < 0.23

Class | CAPRI*: f nat | CAPRI*: overall RMSD95 (Å) | CAPRI*: IRMS (Å) | PackQ score
High | x ≥ 0.7 | x ≤ 3 | or x ≤ 1 | x ≥ 0.67
Medium | x ≥ 0.5 | 3 < x ≤ 5 | or 1 < x ≤ 2 | 0.54 ≤ x < 0.67
Acceptable | x ≥ 0.3 | 5 < x ≤ 8 | or 2 < x ≤ 4 | 0.36 ≤ x < 0.54
Incorrect | x < 0.3 | | | x < 0.36

Note: The top table shows the classification of docking accuracy for crystallographic dimeric models, evaluated according to CAPRI definition and DockQ score (Materials and Methods 5.3.1). The bottom table presents the classification of packing accuracy for C4‐DIPER, SurfOrca‐refined, and MPCP models, evaluated using CAPRI* definition and PackQ score (Materials and Methods 5.3.2). x denotes the value of f nat, LRMS, IRMS, DockQ score, or PackQ score.

Following the design rationale of the DockQ score, we also devised a novel continuous quality measure of the packing interface, termed PackQ, defined by the following equation.

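$$\mathrm{PackQ}(f_{\mathrm{nat}}, \mathrm{RMSD95}, \mathrm{IRMS}, d_1, d_2) = \frac{1}{3}\left[ f_{\mathrm{nat}} + \frac{1}{1 + \left(\mathrm{RMSD95}/d_1\right)^2} + \frac{1}{1 + \left(\mathrm{IRMS}/d_2\right)^2} \right]$$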

The scaling parameters d 1 and d 2 were optimized for the training set (randomly divided, 70% of all crystallographic dimeric, tetrameric, and MPCP models generated in this study) to achieve the best classification performance according to the CAPRI* metric. For optimization, we first defined C 1, C 2, and C 3 as three cutoffs to distinguish the quality classes between incorrect and acceptable (C 1), acceptable and medium (C 2), and medium and high (C 3), respectively. Next, we calculated the F1‐score of all pairs of d 1 (0.5–10 Å with step size of 0.5), d 2 (0.5–5 Å with step size of 0.5), C 1 (<C 2, 0.1–0.5 with step size of 0.01), C 2 (<C 3, 0.4–0.7 with step size of 0.01) and C 3 (0.6–0.9 with step size of 0.01) to quantify the overall classification performance for four different classes. The set of parameters resulting in the maximum average F1‐score (0.742) was then determined as the optimized parameters (d 1 = 10.0, d 2 = 1.5, C 1 = 0.36, C 2 = 0.54, and C 3 = 0.67).

Finally, the packing quality is classified into four categories: incorrect (PackQ < 0.36), acceptable (0.36 ≤ PackQ < 0.54), medium (0.54 ≤ PackQ < 0.67), or high (PackQ ≥ 0.67) (Table 1, Figure S6D). As shown in Figure S6E, while PackQ is positively correlated with DockQ, it provides a more stringent evaluation, excluding models with acceptable DockQ scores (DockQ ≥ 0.23) but poor overall packing accuracy (PackQ < 0.36) (Figure S6A–C).
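As a worked example of the PackQ definition and cutoffs, consider a hypothetical model (values chosen only for illustration, not taken from the benchmark) with fnat = 0.6, RMSD95 = 4.0 Å, and IRMS = 1.5 Å, scored with the optimized parameters d1 = 10.0 Å and d2 = 1.5 Å:

```latex
\[
\mathrm{PackQ} = \frac{1}{3}\left[ 0.6 + \frac{1}{1 + (4.0/10.0)^{2}} + \frac{1}{1 + (1.5/1.5)^{2}} \right]
= \frac{1}{3}\left( 0.6 + 0.862 + 0.5 \right) \approx 0.654
\]
```

Because 0.54 ≤ 0.654 < 0.67, this model would be classified as medium, which agrees with the CAPRI* criteria for the same values (fnat ≥ 0.5 with 3 < RMSD95 ≤ 5 or 1 < IRMS ≤ 2).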

5.4. Protein Similarity Comparison

Three conventional descriptors were used to assess protein similarity: sequence identity, Cα RMSD, and the 3D Zernike Descriptor (3DZD). First, sequences extracted from PDB files were globally aligned using the Needleman‐Wunsch algorithm implemented in MATLAB, and the resulting sequence identity was used as the sequence‐based similarity measure. Second, the Cα coordinates of the aligned residues were used to calculate the RMSD, providing a similarity measure based on backbone geometry. Third, the 3D shape of each protein molecule was characterized: protein surfaces were generated using the EDTSurf program (https://zhanggroup.org/EDTSurf/), and the 3DZD, a rotation‐invariant, moment‐based mathematical descriptor, was applied to encode each shape as a vector. The 3DZD computations in this work were performed using a program generously shared by Dr. Houdayer (https://github.com/jerhoud/zernike3d). The dissimilarity (distance) between two protein molecules was then calculated as the Euclidean distance between their 3DZD vectors.
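To make two of these measures concrete, the following Python sketch performs a global Needleman‐Wunsch alignment, derives a sequence identity from it, and computes a Euclidean distance between two descriptor vectors. It is an illustration under stated assumptions: the match/mismatch/gap scores are simple placeholders rather than the settings used in MATLAB, identity is defined here as identical columns divided by alignment length, and the sequences and vectors are toy values rather than real 3DZD output.

```python
import math


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


def sequence_identity(aligned_a, aligned_b):
    """Fraction of alignment columns holding the same residue (assumed definition)."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return same / len(aligned_a)


def descriptor_distance(u, v):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.dist(u, v)


# Toy inputs for illustration only.
seq_a, seq_b = needleman_wunsch("KVFGRCELAA", "KVFERCELA")
print(seq_a, seq_b, sequence_identity(seq_a, seq_b))
print(descriptor_distance([0.0, 0.0, 0.0], [1.0, 2.0, 2.0]))
```

The Cα RMSD step is omitted here because it would additionally require superposing the aligned coordinates, which the text does not specify in detail.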

5.5. Extraction and Preprocessing of PDB Crystallization Conditions

Crystallization details recorded in PDB REMARK280 typically include reagents' names and concentrations, pH, temperature, and crystallization method (e.g., sitting or hanging drop) (Figure S7A). These pieces of information have no fixed order and are often separated by spaces, commas, semicolons, or various units. Additionally, different research groups may use inconsistent naming, abbreviations, or even misspellings for the same reagent, posing challenges for standardized parsing. Figure S7B illustrates our parsing workflow:

  1. REMARK280 records are first divided into fragments using common delimiters, such as “spaces,” “commas,” “and,” and “units.”

  2. We compiled a list of frequently used reagents and their aliases in a reference table (Supplementary File S5). Each alias serves as a probe to identify the reagent in the REMARK280 record and map it to a standardized name.

  3. Once a reagent is recognized, the parser searches the preceding text for concentration units (e.g., %, M, mM, mg/mL) and retrieves the numeric value to determine the reagent's concentration. This yields a complete “reagent factor” that combines a concentration, unit, and standardized chemical name.

  4. Similar strategies are applied to extract pH values and temperature information by detecting strings such as “pH” and temperature units (e.g., °C, °F, K).

  5. The parser also captures the crystallization methods (e.g., sitting drop, hanging drop) by scanning for keywords such as “vapor,” “diffusion,” “lipidic cubic phase,” “sit,” “hang,” and “batch.”

  6. After each recognized string is extracted, it is removed from the original REMARK280 entry. If all relevant fields are eventually identified (leaving no unrecognized text), the parsing is deemed successful. Otherwise, any leftover text suggests unrecognized reagents, misspellings, or non‐standard formatting, and those entries are classified as failed.

  7. Parsed crystallization conditions are then classified based on the level of detail (Figure S7C): Qualified (including complete reagent factors and other crystallization details), w/o Conc. (lacking concentration information), pH & Temp only (containing only pH and temperature information) and Null (no relevant data).
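The workflow above can be sketched in Python as follows. This is a simplified illustration and not a port of Xtal_Cond_Parser.m: the alias table is a tiny hypothetical stand‐in for Supplementary File S5, the regular expressions assume that the number and unit immediately precede the reagent name, the keyword "drop" is added to the method keywords for convenience, and only Kelvin temperatures are handled.

```python
import re

# Hypothetical alias table (the real one is Supplementary File S5).
ALIASES = {
    "polyethylene glycol 3350": "PEG 3350",
    "peg 3350": "PEG 3350",
    "sodium chloride": "sodium chloride",
    "nacl": "sodium chloride",
    "tris-hcl": "Tris",
    "tris": "Tris",
}
UNITS = {"%": "%", "mg/ml": "mg/mL", "mm": "mM", "m": "M"}
NUMBER = r"(\d+(?:\.\d+)?)"
UNIT = r"(%|mg/ml|mm|m)"  # longer units first so "mm" is not read as "m"
METHOD = re.compile(
    r"\b(vapor|diffusion|lipidic cubic phase|sit\w*|hang\w*|batch|drop)\b"
)


def _cut(text, match):
    """Remove a recognized span so that only unrecognized text is left over."""
    return text[:match.start()] + " " + text[match.end():]


def parse_remark280(record):
    """Parse one free-text crystallization record into structured fields."""
    text = record.lower()
    reagents = []
    # Try long aliases first so "tris-hcl" is consumed before "tris".
    for alias in sorted(ALIASES, key=len, reverse=True):
        pattern = NUMBER + r"\s*" + UNIT + r"\s+" + re.escape(alias) + r"\b"
        match = re.search(pattern, text)
        if match:
            value, unit = float(match.group(1)), UNITS[match.group(2)]
            reagents.append((value, unit, ALIASES[alias]))
            text = _cut(text, match)

    ph = None
    match = re.search(r"\bph\s*" + NUMBER, text)
    if match:
        ph = float(match.group(1))
        text = _cut(text, match)

    temperature_k = None
    match = re.search(r"\btemperature\s*" + NUMBER + r"\s*k\b", text)
    if match:
        temperature_k = float(match.group(1))
        text = _cut(text, match)

    methods = METHOD.findall(text)
    text = METHOD.sub(" ", text)

    # Anything that survives (besides delimiters) was not recognized.
    leftover = re.sub(r"[\s,;]+", " ", text).strip()
    return {
        "reagents": reagents,
        "ph": ph,
        "temperature_k": temperature_k,
        "methods": methods,
        "leftover": leftover,
        "success": leftover == "",
    }


example = parse_remark280(
    "20% PEG 3350, 0.2 M NaCl, 0.1 M Tris-HCl, pH 8.5, "
    "vapor diffusion, sitting drop, temperature 293K"
)
print(example)
```

Removing each recognized span before checking for leftovers mirrors step 6: a record containing a reagent that is missing from the alias table leaves text behind and is therefore flagged as failed instead of being silently accepted.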

After parsing, each condition is converted into a 245‐element vector (Supplementary File S6) that encompasses crystallization methods, pH, temperature (in Kelvin), and the concentrations of 241 common reagents (standardized to micromolar or percentage units). Our parsing script, Xtal_Cond_Parser.m, is publicly available in our GitHub repository: https://github.com/KJ-Liao/MASCL/tree/main/Crystallization_Condition_Parsing.

Crystallization conditions for the single‐chain human protein dataset were extracted and parsed from the downloaded PDB files (corresponding entries can be found in Supplementary File S1). After scaling, the dissimilarities between PDB crystallization conditions were calculated using weighted Euclidean distances (Figure S7B). Detailed extraction results are shown in Figure S7C. Of the 16 123 available single‐chain human proteins, 9120 (approximately 56.6%) had detailed crystallization information (i.e., no blank, unknown, or missing fields) and were considered qualified. To optimize the training of weights for AAI‐PatchBag, these 9120 qualified PDB samples were further reduced to 2393 non‐redundant samples (hereafter, “qualified non‐redundant PDB samples”; Supplementary File S2) based on sequence identity. This dataset was split into training and validation sets at an 8:2 ratio. Additionally, a newly curated test set of 500 samples was randomly selected from the remaining qualified PDB samples, excluding those already included in the training and validation sets.
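A weighted Euclidean distance between scaled condition vectors can be sketched as below. The text does not state the scaling method or the weights, so the min‐max scaling, the four‐element toy vectors (standing in for the 245‐element vectors), and the weight values are all assumptions made for illustration.

```python
import math


def min_max_scale(vectors):
    """Scale each dimension to [0, 1] across all vectors (assumed scaling)."""
    n_dim = len(vectors[0])
    lows = [min(v[k] for v in vectors) for k in range(n_dim)]
    highs = [max(v[k] for v in vectors) for k in range(n_dim)]
    scaled = []
    for v in vectors:
        scaled.append([
            (v[k] - lows[k]) / (highs[k] - lows[k]) if highs[k] > lows[k] else 0.0
            for k in range(n_dim)
        ])
    return scaled


def weighted_euclidean(u, v, weights):
    """sqrt(sum_k w_k * (u_k - v_k)^2) for equal-length vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))


# Toy condition vectors: [pH, temperature (K), reagent 1 (%), reagent 2 (M)].
conditions = [
    [4.5, 277.0, 0.0, 0.0],
    [8.5, 293.0, 20.0, 0.2],
    [6.5, 293.0, 10.0, 0.1],
]
weights = [2.0, 0.5, 1.0, 1.0]  # hypothetical weights, not the trained ones

scaled = min_max_scale(conditions)
for i in range(len(scaled)):
    for j in range(i + 1, len(scaled)):
        print(i, j, weighted_euclidean(scaled[i], scaled[j], weights))
```

Scaling before weighting keeps dimensions with large numeric ranges, such as temperature in Kelvin, from dominating the distance purely because of their units.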

5.6. Crystallization of Hen Egg White Lysozyme

Hen egg white lysozyme [47] (Sigma‐Aldrich, CAS No. 12650‐88‐3) was used at a concentration of 20 mg/mL. Crystallization trials were conducted at 20°C using the sitting‐drop vapor‐diffusion method. Each reservoir solution was prepared according to the components listed in Table 2, and protein and reservoir solutions were mixed at a 1:1 ratio (0.5 μL protein + 0.5 μL reservoir). Crystals typically appeared within 2–5 days and were harvested in liquid nitrogen without cryoprotectant. Diffraction data were collected under cryogenic conditions at the Taiwan Photon Source (TPS) 05A beamline of the National Synchrotron Radiation Research Center (NSRRC) in Hsinchu, Taiwan. All X‐ray diffraction data were processed using HKL‐2000.

5.7. Computational Resources

All experiments were conducted on a system running Ubuntu 22.04 LTS with an Intel Core i7‐12700K processor (12 cores, 20 threads, 3.60 GHz), 32 GB RAM (with 64 GB recommended for high molecular weight proteins and large minimal protein crystal packing models), and a GeForce RTX 2060 SUPER (8 GB) GPU, primarily used to accelerate symmetry examinations in MATLAB. PIPER docking typically completes within 2–5 min per sample, with symmetry checks in MATLAB requiring a similar runtime (depending on the molecular size). The training phase of AAI‐PatchBag, which involves surface patch extraction and clustering based on inter‐patch distances, is computationally intensive and varies depending on the number of selected patches and clusters. For the parameters used in our study (8000 randomly selected patches, 300 medoids), inter‐patch distance calculations and patch approximation each required approximately 6–10 h. However, once the AAI‐PatchBag model is trained, a new query—from inputting a single sample to ranking potential crystallization conditions—completes within a few minutes.

Author Contributions

Kuan‐Ju Liao: conceptualization, formal analysis, investigation, methodology, visualization, writing – original draft, writing – review and editing. Yuh‐Ju Sun: conceptualization, funding acquisition, writing – review and editing, project administration, supervision.

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

Data S1. Supporting Information.

Acknowledgments

We are grateful for the access to the synchrotron radiation beamlines TPS05A and TLS15A at the National Synchrotron Radiation Research Center (NSRRC) in Taiwan and the in‐house X‐ray facilities in the Macromolecular X‐ray Crystallographic Center of National Tsing Hua University.

Liao K.‐J. and Sun Y.‐J., “Using AlphaFold and Symmetrical Docking to Predict Protein–Protein Interactions for Exploring Potential Crystallization Conditions,” Proteins: Structure, Function, and Bioinformatics 93, no. 10 (2025): 1747–1766, 10.1002/prot.26844.

Funding: This work was supported by the National Science and Technology Council, Taiwan: [113‐2311‐B‐007‐002, 113‐2311‐B‐007‐007, and 112‐2740‐B‐007‐001 to Y.J.S.], and National Tsing Hua University, Taiwan: [113QF009E1 to Y.J.S.].

Contributor Information

Kuan‐Ju Liao, Email: vamos0527@gmail.com.

Yuh‐Ju Sun, Email: yjsun@life.nthu.edu.tw.

Data Availability Statement

The 16 123 PDB files of single‐chain human proteins used in this study were downloaded from the PDB (https://www.rcsb.org/). All unprocessed AF2‐predicted structures were obtained from the AlphaFold Protein Structure Database (https://alphafold.ebi.ac.uk/). Supplementary Files S1–S6 are available in our GitHub repository: https://github.com/KJ-Liao/MASCL/tree/main/Others. Source data are provided in this paper. The PIPER docking program is available as a 64‐bit executable for academic users via the ClusPro website after registration (https://cluspro.org/downloads.php). Exemplary scripts and related settings for symmetrical docking, including docking conformations and refined coefficients, are provided as text files in our GitHub repository: https://github.com/KJ-Liao/MASCL/tree/main. Code and brief instructions for docked model selection, SurfOrca refinement, and minimal protein crystal packing construction are also available in the same repository.

References

  • 1. Pereira J., Simpkin A. J., Hartmann M. D., Rigden D. J., Keegan R. M., and Lupas A. N., “High‐Accuracy Protein Structure Prediction in CASP14,” Proteins 89, no. 12 (2021): 1687–1699.
  • 2. Jumper J., Evans R., Pritzel A., et al., “Highly Accurate Protein Structure Prediction With AlphaFold,” Nature 596, no. 7873 (2021): 583–589.
  • 3. Ghani U., Desta I., Jindal A., et al., “Improved Docking of Protein Models by a Combination of Alphafold2 and ClusPro,” bioRxiv (2021): 459290, 10.1101/2021.09.07.459290.
  • 4. Luo Q., Wang S., Li H. Y., Zheng L., Mu Y., and Guo J., “Benchmarking Reverse Docking Through AlphaFold2 Human Proteome,” Protein Science 33, no. 10 (2024): e5167.
  • 5. Evans R., O'Neill M., Pritzel A., et al., “Protein Complex Prediction With AlphaFold‐Multimer,” bioRxiv (2021): 463034, 10.1101/2021.10.04.463034.
  • 6. Abramson J., Adler J., Dunger J., et al., “Accurate Structure Prediction of Biomolecular Interactions With AlphaFold 3,” Nature 630, no. 8016 (2024): 493–500.
  • 7. Masrati G., Landau M., Ben‐Tal N., Lupas A., Kosloff M., and Kosinski J., “Integrative Structural Biology in the Era of Accurate Structure Prediction,” Journal of Molecular Biology 433, no. 20 (2021): 167127.
  • 8. Bertoline L. M. F., Lima A. N., Krieger J. E., and Teixeira S. K., “Before and After AlphaFold2: An Overview of Protein Structure Prediction,” Frontiers in Bioinformatics 3 (2023): 1120370.
  • 9. Terwilliger T. C., Liebschner D., Croll T. I., et al., “AlphaFold Predictions Are Valuable Hypotheses and Accelerate but Do Not Replace Experimental Structure Determination,” Nature Methods 21, no. 1 (2024): 110–116.
  • 10. Shi Y., “A Glimpse of Structural Biology Through X‐Ray Crystallography,” Cell 159, no. 5 (2014): 995–1014.
  • 11. Dobson C. M., “Biophysical Techniques in Structural Biology,” Annual Review of Biochemistry 88 (2019): 25–33.
  • 12. Maveyraud L. and Mourey L., “Protein X‐Ray Crystallography and Drug Discovery,” Molecules 25, no. 5 (2020): 1030.
  • 13. Wienen‐Schmidt B., Oebbeke M., Ngo K., Heine A., and Klebe G., “Two Methods, One Goal: Structural Differences Between Cocrystallization and Crystal Soaking to Discover Ligand Binding Poses,” ChemMedChem 16, no. 1 (2021): 292–300.
  • 14. McPherson A. and Gavira J. A., “Introduction to Protein Crystallization,” Acta Crystallographica, Section F: Structural Biology Communications 70, no. 1 (2014): 2–20.
  • 15. Holcomb J., Spellmon N., Zhang Y., Doughan M., Li C., and Yang Z., “Protein Crystallization: Eluding the Bottleneck of X‐Ray Crystallography,” AIMS Biophysics 4, no. 4 (2017): 557–575.
  • 16. Carugo O. and Argos P., “Protein‐Protein Crystal‐Packing Contacts,” Protein Science 6, no. 10 (1997): 2261–2263.
  • 17. Hermann J., Bischoff D., Grob P., et al., “Controlling Protein Crystallization by Free Energy Guided Design of Interactions at Crystal Contacts,” Crystals 11, no. 6 (2021): 588.
  • 18. Derewenda Z. S., “Rational Protein Crystallization by Mutational Surface Engineering,” Structure 12, no. 4 (2004): 529–535.
  • 19. Fusco D., Headd J. J., De Simone A., Wang J., and Charbonneau P., “Characterizing Protein Crystal Contacts and Their Role in Crystallization: Rubredoxin as a Case Study,” Soft Matter 10, no. 2 (2014): 290–302.
  • 20. Fusco D., Barnum T. J., Bruno A. E., et al., “Statistical Analysis of Crystallization Database Links Protein Physico‐Chemical Features With Crystallization Mechanisms,” PLoS One 9, no. 7 (2014): e101123.
  • 21. Derewenda Z. S. and Godzik A., “The “Sticky Patch” Model of Crystallization and Modification of Proteins for Enhanced Crystallizability,” Methods in Molecular Biology 1607 (2017): 77–115, 10.1007/978-1-4939-7000-1_4.
  • 22. Oldfield C. J., Xue B., Van Y. Y., et al., “Utilization of Protein Intrinsic Disorder Knowledge in Structural Proteomics,” Biochimica et Biophysica Acta 1834, no. 2 (2013): 487–498, 10.1016/j.bbapap.2012.12.003.
  • 23. Johnson D. E., Xue B., Sickmeier M. D., et al., “High‐Throughput Characterization of Intrinsic Disorder in Proteins From the Protein Structure Initiative,” Journal of Structural Biology 180, no. 1 (2012): 201–215.
  • 24. Longenecker K. L., Garrard S. M., Sheffield P. J., and Derewenda Z. S., “Protein Crystallization by Rational Mutagenesis of Surface Residues: Lys to Ala Mutations Promote Crystallization of RhoGDI,” Acta Crystallographica, Section D: Biological Crystallography 57, no. Pt 5 (2001): 679–688.
  • 25. Mateja A., Devedjiev Y., Krowarsch D., et al., “The Impact of Glu→Ala and Glu→Asp Mutations on the Crystallization Properties of RhoGDI: The Structure of RhoGDI at 1.3 Å Resolution,” Acta Crystallographica, Section D: Biological Crystallography 58, no. Pt 12 (2002): 1983–1991.
  • 26. Czepas J., Devedjiev Y., Krowarsch D., Derewenda U., Otlewski J., and Derewenda Z. S., “The Impact of Lys→Arg Surface Mutations on the Crystallization of the Globular Domain of RhoGDI,” Acta Crystallographica, Section D: Biological Crystallography 60, no. 2 (2004): 275–280.
  • 27. Banayan N. E., Loughlin B. J., Singh S., et al., “Systematic Enhancement of Protein Crystallization Efficiency by Bulk Lysine‐To‐Arginine (KR) Substitution,” Protein Science 33, no. 3 (2024): e4898.
  • 28. Ramberg K. O., Engilberge S., Skorek T., and Crowley P. B., “Facile Fabrication of Protein‐Macrocycle Frameworks,” Journal of the American Chemical Society 143, no. 4 (2021): 1896–1907.
  • 29. Li Z., Wang S., Nattermann U., et al., “Accurate Computational Design of Three‐Dimensional Protein Crystals,” Nature Materials 22, no. 12 (2023): 1556–1563.
  • 30. Dale G. E., Oefner C., and D'Arcy A., “The Protein as a Variable in Protein Crystallization,” Journal of Structural Biology 142, no. 1 (2003): 88–97, 10.1016/s1047-8477(03)00041-8.
  • 31. Delucas L. J., Hamrick D., Cosenza L., et al., “Protein Crystallization: Virtual Screening and Optimization,” Progress in Biophysics and Molecular Biology 88, no. 3 (2005): 285–309.
  • 32. Lu H. M., Yin D. C., Liu Y. M., Guo W. H., and Zhou R. B., “Correlation Between Protein Sequence Similarity and Crystallization Reagents in the Biological Macromolecule Crystallization Database,” International Journal of Molecular Sciences 13, no. 8 (2012): 9514–9526.
  • 33. Budowski‐Tal I., Kolodny R., and Mandel‐Gutfreund Y., “A Novel Geometry‐Based Approach to Infer Protein Interface Similarity,” Scientific Reports 8, no. 1 (2018): 8192.
  • 34. Shin W. H., Kumazawa K., Imai K., Hirokawa T., and Kihara D., “Quantitative Comparison of Protein‐Protein Interaction Interface Using Physicochemical Feature‐Based Descriptors of Surface Patches,” Frontiers in Molecular Biosciences 10 (2023): 1110567.
  • 35. Abrahams G. J. and Newman J., “BLASTing Away Preconceptions in Crystallization Trials,” Acta Crystallographica, Section F: Structural Biology Communications 75, no. Pt 3 (2019): 184–192, 10.1107/S2053230X19000141.
  • 36. Kawashima S., Pokarowski P., Pokarowska M., Kolinski A., Katayama T., and Kanehisa M., “AAindex: Amino Acid Index Database, Progress Report 2008,” Nucleic Acids Research 36, no. Database issue (2008): D202–D205.
  • 37. Evans P. R., “An Introduction to Data Reduction: Space‐Group Determination, Scaling and Intensity Statistics,” Acta Crystallographica, Section D: Biological Crystallography 67, no. Pt 4 (2011): 282–292, 10.1107/S090744491003982X.
  • 38. Wukovitz S. W. and Yeates T. O., “Why Protein Crystals Favour Some Space‐Groups Over Others,” Nature Structural Biology 2, no. 12 (1995): 1062–1067, 10.1038/nsb1295-1062.
  • 39. Kozakov D., Brenke R., Comeau S. R., and Vajda S., “PIPER: An FFT‐Based Protein Docking Program With Pairwise Potentials,” Proteins 65, no. 2 (2006): 392–406.
  • 40. Basu S. and Wallner B., “DockQ: A Quality Measure for Protein‐Protein Docking Models,” PLoS One 11, no. 8 (2016): e0161879.
  • 41. Xia Y., Zhao K., Liu D., Zhou X., and Zhang G., “Multi‐Domain and Complex Protein Structure Prediction Using Inter‐Domain Interactions From Deep Learning,” Communications Biology 6, no. 1 (2023): 1221.
  • 42. Lensink M. F., Mendez R., and Wodak S. J., “Docking and Scoring Protein Complexes: CAPRI 3rd Edition,” Proteins 69, no. 4 (2007): 704–718, 10.1002/prot.21804.
  • 43. Eyal E., Gerzon S., Potapov V., Edelman M., and Sobolev V., “The Limit of Accuracy of Protein Modeling: Influence of Crystal Packing on Protein Structure,” Journal of Molecular Biology 351, no. 2 (2005): 431–442.
  • 44. Xu D. and Zhang Y., “Generating Triangulated Macromolecular Surfaces by Euclidean Distance Transform,” PLoS One 4, no. 12 (2009): e8140.
  • 45. Houdayer J. and Koehl P., “Stable Evaluation of 3D Zernike Moments for Surface Meshes,” Algorithms 15, no. 11 (2022): 406.
  • 46. Newman J., Xu J., and Willis M. C., “Initial Evaluations of the Reproducibility of Vapor‐Diffusion Crystallization,” Acta Crystallographica, Section D: Biological Crystallography 63, no. Pt 7 (2007): 826–832.
  • 47. Canfield R. E., “The Amino Acid Sequence of Egg White Lysozyme,” Journal of Biological Chemistry 238 (1963): 2698–2707.
  • 48. Moal I. H., Torchala M., Bates P. A., and Fernandez‐Recio J., “The Scoring of Poses in Protein‐Protein Docking: Current Capabilities and Future Directions,” BMC Bioinformatics 14 (2013): 286.
  • 49. Janin J. and Rodier F., “Protein‐Protein Interaction at Crystal Contacts,” Proteins 23, no. 4 (1995): 580–587.
  • 50. Janin J., “Specific Versus Non‐Specific Contacts in Protein Crystals,” Nature Structural Biology 4, no. 12 (1997): 973–974.
  • 51. Dasgupta S., Iyer G. H., Bryant S. H., Lawrence C. E., and Bell J. A., “Extent and Nature of Contacts Between Protein Molecules in Crystal Lattices and Between Subunits of Protein Oligomers,” Proteins 28, no. 4 (1997): 494–514.
  • 52. Ponstingl H., Henrick K., and Thornton J. M., “Discriminating Between Homodimeric and Monomeric Proteins in the Crystalline State,” Proteins 41, no. 1 (2000): 47–57.
  • 53. Bahadur R. P., Chakrabarti P., Rodier F., and Janin J., “A Dissection of Specific and Non‐Specific Protein‐Protein Interfaces,” Journal of Molecular Biology 336, no. 4 (2004): 943–955.
  • 54. Liu S., Li Q., and Lai L., “A Combinatorial Score to Distinguish Biological and Nonbiological Protein‐Protein Interfaces,” Proteins 64, no. 1 (2006): 68–78.
  • 55. Bahadur R. P. and Zacharias M., “The Interface of Protein‐Protein Complexes: Analysis of Contacts and Prediction of Interactions,” Cellular and Molecular Life Sciences 65, no. 7–8 (2008): 1059–1072.
  • 56. Yang S., “Methods for SAXS‐Based Structure Determination of Biomolecular Complexes,” Advanced Materials 26, no. 46 (2014): 7902–7910.
  • 57. Jimenez‐Garcia B., Pons C., Svergun D. I., Bernado P., and Fernandez‐Recio J., “pyDockSAXS: Protein‐Protein Complex Structure by SAXS and Computational Docking,” Nucleic Acids Research 43, no. W1 (2015): W356–W361.
  • 58. Xia B., Mamonov A., Leysen S., et al., “Accounting for Observed Small Angle X‐Ray Scattering Profile in the Protein‐Protein Docking Server ClusPro,” Journal of Computational Chemistry 36, no. 20 (2015): 1568–1572.
  • 59. Wang X., Terashi G., Christoffer C. W., Zhu M., and Kihara D., “Protein Docking Model Evaluation by 3D Deep Convolutional Neural Networks,” Bioinformatics 36, no. 7 (2020): 2113–2118.
  • 60. Wang X., Flannery S. T., and Kihara D., “Protein Docking Model Evaluation by Graph Neural Networks,” Frontiers in Molecular Biosciences 8 (2021): 647915.
  • 61. Han Y., He F., Chen Y., Qin W., Yu H., and Xu D., “Quality Assessment of Protein Docking Models Based on Graph Neural Network,” Frontiers in Bioinformatics 1 (2021): 693211.
  • 62. Hu B., Zhu X., Monroe L., Bures M. G., and Kihara D., “PL‐PatchSurfer: A Novel Molecular Local Surface‐Based Method for Exploring Protein‐Ligand Interactions,” International Journal of Molecular Sciences 15, no. 9 (2014): 15122–15145.
  • 63. Shin W. H., Christoffer C. W., Wang J., and Kihara D., “PL‐PatchSurfer2: Improved Local Surface Matching‐Based Virtual Screening Method That Is Tolerant to Target and Ligand Structure Variation,” Journal of Chemical Information and Modeling 56, no. 9 (2016): 1676–1691.

Articles from Proteins are provided here courtesy of Wiley
