Briefings in Bioinformatics. 2025 Oct 30;26(5):bbaf573. doi: 10.1093/bib/bbaf573

Artificial intelligence-driven framework for discovering synthetic binding protein-like scaffolds from the entire protein universe

Zixin Duan 1, Yafeng Liang 2, Jing Zhang 3, Zijian Chen 4, Liangcai Gu 5, Weiwei Xue 6
PMCID: PMC12573432  PMID: 41165486

Abstract

Compared to traditional sequence-based methods, artificial intelligence (AI) approaches offer distinct advantages, such as significantly improved structural recognition efficiency and the ability to overcome inherent limitations of sequence alignment. Here, we introduce an AI-driven framework designed to discover synthetic binding protein (SBP)-like scaffolds from the entire known proteome. The framework integrates the deep learning-based FoldSeek with our in-house developed holistic protein attributes assessment (HP2A) algorithm, and enables subsequent protein function annotation and evolutionary analysis. As a proof of concept, four representative SBPs (Affibody, Anticalin, DARPin, and Fynomer) were used as queries to discover SBP-like scaffolds. The results demonstrate that some of the identified SBP-like proteins, despite their low sequence similarity (identity ≤0.3), exhibit significant structural resemblance to the templates (template modeling score (TM-score) ≥0.5), highlighting the large sequence space available within a specific protein scaffold. Statistical analysis identifies key biophysical properties that contribute to privileged scaffold functionality. Additionally, evolutionary insights derived from potential SBP-like scaffolds provide valuable guidance for protein binder design, as validated through targeted sequence analysis and in silico site-directed mutagenesis. This work highlights the potential of our framework to facilitate the discovery of high-quality engineered protein scaffolds, paving the way for the development of novel SBPs.

Keywords: synthetic binding protein, engineered protein scaffolds, protein evolution, protein design, artificial intelligence

Introduction

Engineered protein scaffolds possess several key characteristics, such as small size, low immunogenicity, high stability, and precise targeting capabilities, that make them attractive for biomedical applications [1]. Leveraging the conserved sequences and structures of these scaffolds, researchers have developed a vast array of synthetic binding proteins (SBPs) for the precise regulation of biological functions, thereby advancing protein-based diagnostics and therapeutics [1]. To date, ~68 distinct scaffold types have been identified, with representative examples including Affibody, Anticalin, DARPin, and Fynomer (Fig. 1) [2]. However, the design of SBPs has several limitations. First, the 3D structures of protein scaffolds lack diversity, restricting SBPs to variants derived from fixed protein scaffolds and resulting in limited structural variability [3]. Second, designing a novel SBP with stability, adaptability, and functionality remains challenging due to the inherent randomness and limitations of current approaches, which primarily rely on directed mutagenesis or computational design without a clear, systematic strategy for mutation and optimization [4]. Therefore, it is necessary to develop a strategy to mine for structurally diverse SBP-like scaffolds, as well as to establish a rational SBP design strategy to overcome these limitations.

Figure 1.

Schematic and 3D structures of four SBP scaffolds (Affibody, Anticalin, DARPin, and Fynomer), highlighting the mutated and conserved regions with pink, red, and green colors.

Structures of Affibody, Anticalin, DARPin, and Fynomer. In the schematic representations, pink indicates the mutated regions introduced during scaffold design. In the 3D structures, red denotes conserved amino acids and green denotes variable ones, as revealed by Consurf analysis combining SBPs and PHPSs.

Artificial intelligence (AI) has been increasingly applied across bioinformatics to address diverse challenges. For instance, transformer-based natural language processing models have been explored for proteome bioinformatics to capture complex sequence–function relationships [5], and machine learning pipelines have been developed to predict protein crystallization propensity with improved accuracy over traditional approaches [6]. These studies exemplify the broad potential of AI methods in protein science and motivate the development of AI-driven strategies for scaffold discovery. The functionality of a protein, such as protein–protein interactions (PPIs), is intimately linked to its 3D structure [7]. Historically, techniques like X-ray crystallography have been used to determine protein structures, but these methods are often time-consuming and resource-intensive [8]. To address these limitations, physics-based homology modeling methods (e.g. MODELLER [9]) were developed, enabling structure prediction based on templates available in the Protein Data Bank (PDB). However, these methods face challenges when no suitable template exists. More recently, deep learning-based AI models have revolutionized the field by enabling rapid and accurate structure predictions. Notably, AlphaFold has achieved near-experimental accuracy in predicting protein structures [10]. The publicly available AlphaFold protein structure database (AlphaFold DB) now contains over 214 million entries, covering the human proteome and those of 47 other organisms that are critical to research and global health [11], which has opened the door to exploring the “sequence–structure–function” landscape of the entire protein universe [12]. Nevertheless, it is important to note that AlphaFold structures are still computational predictions and may not fully capture functionally relevant conformations, particularly for intrinsically disordered regions or multiprotein complexes.

Homologous proteins, which share a common ancestry, often exhibit similarities in sequence, structure, and function due to gene duplication or speciation events [13]. These shared characteristics pave the way for mining engineered protein scaffolds from large protein databases using bioinformatics methods. Traditional search tools like BLAST [14] operate primarily at the sequence level [15, 16], identifying homologous proteins or regions with sequence similarity. However, evolutionary point mutations can result in proteins with similar 3D structures despite significant sequence divergence, which means sequence-based methods may overlook these relationships [17]. To overcome this limitation, structural similarity search tools were developed to compare the 3D conformations of proteins, using similarity scores to detect distant homologs. For instance, TM-align employs dynamic programming to align protein backbones with high accuracy, though at a considerable computational cost [18]. In contrast, FoldSeek breaks down structures into an alphabet-based representation and utilizes deep learning techniques, thereby facilitating high-throughput and accurate identification of structurally similar proteins [19]. The SCOP2 [20] classification provides an expert-curated, knowledge-based organization of PDB proteins using a directed acyclic graph rather than a rigid hierarchy. It separates structural and evolutionary relationships, introduces new categories such as Protein types and Evolutionary events, and allows multiple relationship paths, enabling more consistent classification of complex protein similarities. Beyond these tools, ProBiS-2012 [21] detects structurally similar proteins by representing their surfaces as graphs, where vertices are functional groups of surface residues and edges are distances between them. It then applies a maximum clique algorithm to find the largest matching subgraphs between the query and database proteins, capturing both geometric and physicochemical similarities, and maps structural conservation onto the query to highlight potential binding or functional sites. Additionally, specialized web servers like PROSCA have been developed for structure-based searches, facilitating the identification of human protein scaffolds for binder engineering and design [22]. All of the above approaches are based on sequence, structural, or evolutionary similarity, and do not systematically take into account the biophysical properties of proteins when mining for similar proteins.

The stability and adaptability of a protein scaffold in engineering and design are collectively determined by its structural conformation, physicochemical properties, and dynamic behavior [23]. These features also provide critical insights into a protein’s functionality and evolvability [24]. In other words, proteins sharing similar biophysical characteristics across multiple dimensions are likely to exhibit analogous functional capabilities and evolutionary potential. For instance, one study employed independent component analysis combined with elastic net regularization to develop a model predicting scaffold-binding capabilities based on specific biophysical properties [24]. Furthermore, research has revealed that the dynamic flexibility and functional specificity of proteins correlate with species complexity, suggesting that phylogenetic proximity is associated with the statistical biophysical proximity of proteomes [25]. Consequently, beyond the structural similarity, a thorough evaluation of these properties is essential for assessing the quality of potential protein scaffolds identified from protein databases. However, such biophysical property analyses have not yet been systematically integrated into protein mining tools.

In this study, we propose a unified AI-driven framework to rapidly identify structurally similar proteins (or fragments), taking into account both structural similarity and biophysical property similarity, as engineered protein scaffolds from the entire known proteome (Fig. 2). A key component of the framework is the incorporation of FoldSeek [19], a high-throughput structural alignment tool selected over alternative methods due to its superior computational efficiency and scalability, which make it particularly suitable for large-scale protein comparisons without compromising alignment accuracy. Structural similarity alone, however, is insufficient to ensure scaffold utility. Therefore, the framework further integrates a systematic evaluation of biophysical and functional properties via our in-house developed holistic protein attributes assessment (HP2A) algorithm. HP2A performs multiparametric profiling of proteins by computing a curated set of biophysical descriptors and identifies potential high-quality protein scaffolds (PHPSs) by quantitatively assessing their global similarity to known SBPs across these attributes. For proof-of-concept validation, we selected four representative SBPs (i.e. Affibody, Anticalin, DARPin, and Fynomer) based on two considerations: (i) they are well-characterized and widely studied SBPs, and (ii) they exhibit distinct secondary structures (Alpha-helix and Beta-sheet), enabling assessment of our framework across different structural classes. This selection provides a meaningful demonstration of our approach while illustrating its potential applicability to the broader set of SBPs in the SYNBIP database. Furthermore, by analyzing the sequence space and biophysical properties of identified potential scaffolds, we gain valuable insights from natural knowledge for the rational design and optimization of protein binders.

Figure 2.

A flowchart illustrating the five-step framework of this study: from constructing a benchmark dataset of proteins to identifying PHPSs and deriving design principles.

The framework presented in this study enables the rapid identification of structurally similar proteins (or fragments) as potential engineered protein scaffolds from the entire known proteome. (1) Based on SBPs from the SYNBIP database, a benchmark dataset of SBP-like proteins was constructed using FoldSeek, UniProt, and AlphaFold; (2) the biophysical properties of SBPs and SBP-like proteins were calculated; (3) statistical analyses were performed on these biophysical properties; (4) PHPSs were identified using the HP2A strategy; and (5) by integrating SBPs and PHPSs, the design principles of SBPs were derived from natural proteins, focusing on sequence features and conservation.

While our framework does not directly generate novel folds or experimentally validate binding innovations, it provides a systematic and scalable strategy to uncover structurally diverse scaffold candidates from the entire proteome. Such candidates expand the repertoire beyond well-characterized scaffolds and establish a foundation for subsequent mechanistic investigations and functional engineering. From a protein science perspective, this approach enables a deeper exploration of the sequence–structure–function landscape, offering natural design principles that can guide the rational development of next-generation binding proteins.

Materials and methods

Construction of dataset

The structural data for the Affibody, Anticalin, DARPin, and Fynomer scaffolds were retrieved from the SYNBIP database [26]. For each scaffold, we conducted structural similarity searches using FoldSeek [19] against the AlphaFold-UniProt100 database [11], targeting all SBP structures derived from its monomers. To ensure both meaningful structural alignment and preservation of scaffold integrity, we retained only those hits with a template modeling score (TM-score) >0.5 [18]. Previous studies indicate that TM-score values of ~0.5 generally correspond to root mean square deviation (RMSD) values of ~6.5 Å when modeling full-length structures using tools such as MODELLER [9]. Additionally, we required the retained query fragment proportion to exceed 0.75 to avoid excessively truncated matches. These criteria reflect the confidence limits of fold-recognition algorithms and balance alignment quality with coverage. In cases where multiple identical results were found within the same protein, only the result with the highest TM-score was retained.
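The filtering rules above can be illustrated with a minimal sketch. It is not the authors' script: the `Hit` record and its field names are hypothetical (they are not FoldSeek column names), the "retained query fragment proportion" is assumed to mean the fraction of query residues covered by the alignment, and deduplication is simplified to keeping the best-scoring hit per target protein.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One structural hit. Field names are hypothetical, not FoldSeek columns."""
    target_id: str     # accession of the database protein
    tm_score: float    # TM-score of the structural alignment
    query_start: int   # first aligned query residue (1-based, inclusive)
    query_end: int     # last aligned query residue (1-based, inclusive)
    query_length: int  # full length of the query SBP


def query_coverage(hit: Hit) -> float:
    """Fraction of the query covered by the aligned fragment (assumed definition)."""
    return (hit.query_end - hit.query_start + 1) / hit.query_length


def filter_hits(hits, min_tm=0.5, min_coverage=0.75):
    """Keep hits with TM-score > min_tm and coverage > min_coverage,
    then keep only the highest-scoring hit per target protein."""
    best = {}
    for hit in hits:
        if hit.tm_score <= min_tm or query_coverage(hit) <= min_coverage:
            continue
        current = best.get(hit.target_id)
        if current is None or hit.tm_score > current.tm_score:
            best[hit.target_id] = hit
    return sorted(best.values(), key=lambda h: h.tm_score, reverse=True)


# Toy hits against a 58-residue query (values are made up).
hits = [
    Hit("P1", 0.62, 1, 58, 58),
    Hit("P1", 0.71, 3, 58, 58),   # better hit in the same protein
    Hit("P2", 0.48, 1, 58, 58),   # fails the TM-score threshold
    Hit("P3", 0.80, 20, 58, 58),  # coverage 39/58 is below 0.75
]
kept = filter_hits(hits)
```

In this toy input only the second `P1` hit survives, since it passes both thresholds and has the higher TM-score of the two hits in that protein.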

To facilitate practical application of the identified SBP-like scaffolds, we further processed and standardized the FoldSeek output. Specifically, we obtained protein names and corresponding species information from the UniProt [27] database, extracted the associated structures and sequences from AlphaFold, and integrated these data with the similarity fragment regions and TM-scores provided by FoldSeek. The curated dataset is now available for access and download through our SYNBIP database [26].

Calculation of biophysical properties of proteins

The biophysical properties of proteins are critical for understanding their structure, stability, and functional potential. To achieve a comprehensive assessment, we selected a panel of descriptors including Radius of Gyration (Rg), Solvent Accessible Surface Area (SASA), Assortativity (ρ), Modularity (Q), Coefficient of Variation of the Hydrophobicity Profile (CVHP), Fractal Dimension (d), λ₁ (Principal Eigenvalue), Zipf Coefficient (z, 25%), Contact Order (CO), Long-range Contact Degree, Buried Non-Polar Surface Area (Buried NPSA), and Asphericity. These descriptors were selected because they collectively capture the key determinants of protein folding, stability, and binding interface characteristics (Table S1).

Structural descriptors were calculated using several well-established Python scripts mainly from [24, 25], with parameters for the biophysical property calculations adopted from these scripts. The Radius of Gyration (Rg) was computed with MDTraj’s [28] “compute_rg” function, the SASA was determined using MDTraj’s “shrake_rupley” function, and the asphericity was obtained via MDTraj’s “asphericity” function. Topological properties, including the degree assortativity (ρ) and modularity (Q), were derived by constructing a contact network from ProDy’s [29] Kirchhoff matrix and analyzing it with NetworkX [30] and the community package. The fractal dimension (d) was obtained from the slope of a log–log fit between the cutoff radius and the average neighbor count. The Kirchhoff matrix was then eigen-decomposed to yield its eigenvalues; the first nonzero eigenvalue (λ₁) was used as a descriptor. Furthermore, Zipf coefficients were fitted using the powerlaw [31] package on the inverse of the normalized eigenvalue data, with coefficients obtained for the top 25% of the distribution. The CVHP was calculated by generating a hydropathy profile using Biopython’s [32] ProteinAnalysis. In addition, PyMOL [33] was employed to compute several contact and surface area descriptors, including contact degree, CO, long-range contact degree, as well as the charged, polar, and hydrophobic components of SASA, and the Buried NPSA.
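To make two of these descriptors concrete, the following standard-library sketch shows what a radius of gyration and a Kirchhoff (contact) matrix are. It does not reproduce the MDTraj or ProDy calls used in the study. Equal atomic masses, coordinates in Å, one point per residue, and the 10 Å contact cutoff are all assumptions made for illustration; the paper does not state its cutoff here.

```python
import math


def radius_of_gyration(coords):
    """Equal-mass radius of gyration of a list of (x, y, z) points."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    squared = sum(
        sum((c[k] - center[k]) ** 2 for k in range(3)) for c in coords
    )
    return math.sqrt(squared / n)


def kirchhoff_matrix(coords, cutoff=10.0):
    """Contact (Kirchhoff) matrix: -1 for each pair within the cutoff,
    and the number of contacts of each point on the diagonal."""
    n = len(coords)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                gamma[i][j] = gamma[j][i] = -1
                gamma[i][i] += 1
                gamma[j][j] += 1
    return gamma


# Toy "backbone": four points on a line, 3.8 Å apart (made-up geometry).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
rg = radius_of_gyration(coords)
gamma = kirchhoff_matrix(coords)
degrees = [gamma[i][i] for i in range(len(coords))]
```

Every row of the resulting matrix sums to zero, which is why its smallest eigenvalue is zero and the first nonzero eigenvalue is the one used as a descriptor. The contact network analyzed for assortativity and modularity is the graph whose edges are the -1 entries.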

Holistic protein attributes assessment algorithm

We developed an HP2A algorithm to identify PHPSs from SBP-like candidates. HP2A integrates multiple biophysical descriptors (Section “Calculation of biophysical properties of proteins”) into a unified similarity score. To ensure comparability across parameters, each descriptor was standardized using the mean and standard deviation derived from known SBPs. The standardized Euclidean distance was then used to compute similarity between each candidate scaffold and known SBPs:

$$x_i = \frac{v_i - \mu_i}{\sigma_i} \tag{1}$$
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{2}$$

Here, v_i is the raw value of descriptor i, μ_i and σ_i are the mean and standard deviation of that descriptor over the known SBPs, x and y are the standardized parameter vectors of the candidate scaffold and reference SBP, respectively, and n is the total number of parameters. The final HP2A score S(x) is defined as the minimum distance to any reference SBP:

$$S(x) = \min_{y \in \mathrm{SBPs}} d(x, y) \tag{3}$$

Candidates were ranked by HP2A scores, and the top 200 were selected as PHPSs. All HP2A scripts and example datasets are publicly available at https://github.com/Tuan-Space/HP2A, enabling full reproducibility.

While other multi-attribute scoring or machine learning approaches (e.g. clustering, anomaly detection, and manifold learning) could also be applied, we focused on the Euclidean similarity approach because it is straightforward, interpretable, and directly comparable to reference SBPs in the multidimensional biophysical space.
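The scoring scheme described in this section can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the code in the HP2A repository: the function names are hypothetical, the sample standard deviation is assumed (the text does not say whether the sample or population form is used), and descriptors with zero variance among the SBPs are not handled.

```python
import math
from statistics import mean, stdev


def standardize(vector, means, stds):
    """Z-score each descriptor using statistics from the reference SBPs."""
    return [(v - m) / s for v, m, s in zip(vector, means, stds)]


def hp2a_scores(sbps, candidates):
    """Return {candidate name: minimum standardized Euclidean distance to any SBP}.

    sbps: list of descriptor vectors for known SBPs.
    candidates: dict mapping candidate name to its descriptor vector.
    """
    columns = list(zip(*sbps))
    means = [mean(col) for col in columns]
    stds = [stdev(col) for col in columns]  # sample SD: an assumption
    references = [standardize(v, means, stds) for v in sbps]
    scores = {}
    for name, vector in candidates.items():
        x = standardize(vector, means, stds)
        scores[name] = min(math.dist(x, y) for y in references)
    return scores


# Toy data with two descriptors per protein (values are made up).
sbps = [[10.0, 100.0], [12.0, 120.0], [14.0, 140.0]]
candidates = {"cand_A": [12.0, 120.0], "cand_B": [18.0, 100.0]}

scores = hp2a_scores(sbps, candidates)
ranked = sorted(scores, key=scores.get)  # smallest distance first
top_candidates = ranked[:200]            # the study keeps the top 200
```

A smaller score means the candidate sits closer to at least one known SBP in the standardized descriptor space, so `cand_A`, which coincides with a reference SBP, ranks first.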

Statistical analysis and visualization of protein properties

We performed a comprehensive statistical analysis of SBPs, SBP-like fragments, and PHPSs across each scaffold type. Specifically, we generated correlation heatmaps to examine pairwise parameter relationships, conducted principal component analysis (PCA) for visualization, and produced scatter plots comparing sequence identity and TM-score. The identity scores and TM-scores were calculated between SBP-like fragments and their corresponding SBPs. Additionally, we created boxplots to compare the physicochemical parameters of SBPs and PHPSs. To improve clarity in visualizing parameter distributions, each parameter in the boxplots was linearly scaled using the maximum and minimum values observed in SBP-like fragments. All visualizations were generated using Python 3.10.

Protein–protein interactions analysis and visualization

To investigate potential interactions between PHPSs and target proteins, we first utilized the STRING database [34] to identify PHPSs that might interact with these proteins. Our focus was on the PHPSs, which showed the greatest similarity to SBP properties, to uncover potential PPIs. From this pool, we selected four PHPSs that had interactions recorded in the STRING database, choosing one representative PHPS for each SBP scaffold type. Many PHPSs in the top 200 lacked recorded interactions in STRING; therefore, this selection allowed us to explore the potential interaction patterns and characteristics of PHPSs in a manageable and representative manner (Table S2).

Then, the sequences of the 4 selected PHPSs and their target proteins (10 PHPS-target protein pairs) were submitted to AlphaFold 3 [35] to predict the 3D structures of the interaction complexes. To refine the predicted complex structures, we employed RosettaDock 4.0 [36] for a detailed docking analysis. The number of docking iterations was set to 1000, and a scoring file was generated for each complex. This file contains various metrics, including docking scores and RMSD values, that are used to evaluate the quality and stability of each conformation.

Finally, we extracted two key indicators, I_sc and I_rms, to analyze the scoring results [37]. The I_sc metric is employed to evaluate the strength of interactions between proteins. More negative values of I_sc signify more favorable interactions, implying a tighter binding affinity. Meanwhile, the I_rms metric assesses the fluctuation in the structure of the interaction region, reflecting the deviation of the interface area between the docked complex and the reference structure after docking. Lower I_rms values indicate a structure that more closely resembles the reference, which corresponds to higher precision in the docking process. To visually illustrate the interaction strength and structural fluctuations across various conformations, we used Python to generate scatter plots, with the horizontal axis denoting I_rms (Å) and the vertical axis representing I_sc (kcal/mol).
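As a rough illustration of how these two indicators can be combined, the sketch below ranks decoys by I_sc and counts near-native ones among the top five, following the success criterion stated in the caption of Figure 5 (interface RMSD ≤4.0 Å and N5 ≥ 3). Treating "near-native" as I_rms ≤4.0 Å is our reading of that caption, and the function name and tuple layout are hypothetical; parsing of the actual RosettaDock score file is not shown.

```python
def evaluate_docking(decoys, near_native_cutoff=4.0, top_n=5, min_hits=3):
    """Count near-native decoys among the best-scoring ones.

    decoys: list of (i_sc, i_rms) pairs from one docking run.
    Returns (number of near-native decoys in the top_n, success flag).
    """
    ranked = sorted(decoys, key=lambda d: d[0])  # most negative I_sc first
    top = ranked[:top_n]
    n_near = sum(1 for _, i_rms in top if i_rms <= near_native_cutoff)
    return n_near, n_near >= min_hits


# Toy decoys as (I_sc, I_rms in Å); values are made up.
decoys = [
    (-35.2, 1.1),
    (-33.8, 2.4),
    (-31.0, 6.7),
    (-30.5, 3.9),
    (-29.9, 8.2),
    (-12.0, 0.9),  # near-native but poorly scored, so outside the top five
]
n5, success = evaluate_docking(decoys)
```

In the toy run three of the five best-scoring decoys are near-native, so the run counts as successful under this criterion.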

Evolutionary analysis

To explore the evolutionary relationships between SBPs and PHPSs for guiding protein binder design, we combined their sequences, removed duplicates, and analyzed their similarities and differences at the sequence level. Since protein design involves manipulating a protein’s sequence to achieve desired functional or structural properties [38], sequence-level analysis provides a more appropriate framework for this investigation. We first performed an amino acid frequency analysis on these raw sequences. Multiple sequence alignment (MSA) was then conducted using MUSCLE [39], and the resulting alignment was employed to construct a phylogenetic tree using FastTree [40], which was subsequently visualized with iTOL [41].

To refine the alignment, we then removed insertion mutations by excluding alignment positions where >90% of the sequences contained gaps. Using the WebLogo tool [42], we generated sequence logos to highlight conserved regions. Additionally, we selected three physicochemical properties from the AAindex database [43], including the hydrophobicity index [44] (to assess hydrophilic/hydrophobic characteristics), the average accessible surface area [45] (reflecting spatial properties), and the net charge [46] (indicating electrostatic properties). To intuitively present the hydrophobicity index, we applied linear scaling by dividing it by the range between the maximum and minimum values. The average accessible surface area was linearly scaled to a range of 0–1 using min–max normalization. These indices were then computed for each amino acid position, averaged across sequences, and visualized as line plots. Detailed numerical values for these indices are provided in Table S3.

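$$H'_a = \frac{H_a}{H_{\max} - H_{\min}} \tag{4}$$
$$A'_a = \frac{A_a - A_{\min}}{A_{\max} - A_{\min}} \tag{5}$$

Here, H_a and A_a are the hydrophobicity index and the average accessible surface area of amino acid a, and the maximum and minimum values are taken over all amino acids.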
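The gap filtering and the two scalings described above can be sketched as follows. This is a minimal standard-library illustration, not the authors' pipeline: the function names are hypothetical, the toy property values are invented rather than taken from AAindex, and gaps are simply skipped when averaging a column.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.9):
    """Remove alignment columns in which more than max_gap_fraction of sequences have a gap."""
    n_seq = len(alignment)
    keep = [
        col
        for col in range(len(alignment[0]))
        if sum(seq[col] == "-" for seq in alignment) / n_seq <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]


def range_scale(index):
    """Divide each value by (max - min), as described for the hydrophobicity index."""
    span = max(index.values()) - min(index.values())
    return {aa: value / span for aa, value in index.items()}


def min_max_scale(index):
    """Map values onto [0, 1], as described for the accessible surface area."""
    lo, hi = min(index.values()), max(index.values())
    return {aa: (value - lo) / (hi - lo) for aa, value in index.items()}


def column_means(alignment, index):
    """Average a per-residue property over each alignment column, skipping gaps."""
    means = []
    for column in zip(*alignment):
        values = [index[aa] for aa in column if aa in index]
        means.append(sum(values) / len(values) if values else float("nan"))
    return means


# Toy alignment and an invented property table (not AAindex values).
alignment = ["AK-L", "AR-L", "GK-I", "AKWL"]
toy_index = {"A": 1.0, "G": 0.0, "K": -2.0, "R": -3.0, "L": 3.0, "I": 4.0}

# A 0.5 threshold is used only so the toy example drops a column;
# the study excludes columns with more than 90% gaps.
trimmed = drop_gappy_columns(alignment, max_gap_fraction=0.5)
profile = column_means(trimmed, min_max_scale(toy_index))
```

The resulting `profile` holds one averaged value per retained alignment position, which is the quantity plotted as a line in this kind of analysis.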

Then, we conducted a conservation analysis using the ConSurf tool [47]. The refined MSA served as input, and one SBP sequence was randomly selected for visualization. The corresponding 3D protein structure was then uploaded to map conserved regions, providing insights into structurally and functionally important sites.

Synthetic binding protein inherent interaction redesign

We aimed to derive an evolutionary rule from nature to guide the optimization of SBPs by comparing and analyzing the sequence characteristics of naturally occurring PHPSs and artificially designed SBPs. By aligning the sequences of SBPs and PHPSs, we can identify conserved residue types or conserved physicochemical properties at specific positions, which represent knowledge from nature, enabling the selection of amino acids with appropriate physicochemical properties for site-directed mutagenesis. To validate this design principle, we selected and downloaded five complexes from SYNBIP, specifically complexes mt000110, mt000124, mt000128, mt000142, and mt000163, under the Affibody category. These complexes were chosen because they share identical conserved site amino acids in their SBPs, while exhibiting distinct binding site geometries on their respective targets. Based on the average values of three amino acid properties at the conserved sites identified by ConSurf, we selected amino acids with similar characteristics from AAindex and incorporated the conserved amino acids from WebLogo analysis [42] to conduct single-point mutation analysis. For comparison, we also selected amino acids that run counter to this natural evolutionary knowledge for mutation. The mCSM-PPI2 [48] and DDMut-PPI [49] tools were then employed to investigate changes in the binding affinity between the SBP and its target upon mutation.

Results

Species distribution and diversity of synthetic binding protein-like fragments

The fold types and structures of the four representative SBPs (Affibody, Anticalin, DARPin, and Fynomer) studied in this work are shown in Fig. 1. Starting from the dataset constructed by the FoldSeek [19] search, we removed redundant entries and cleaned the data. The number and species distribution of SBPs and SBP-like fragments for each scaffold are summarized in Table 1. Although our search was not conducted at the single-genome level, which may lead to some unavoidable redundancy, the results clearly highlight the natural diversity and widespread occurrence of these scaffolds. For instance, SBP-like fragments are found across a broad range of species, and the Fynomer scaffold alone yielded 523,567 proteins from 29,685 species. This underscores the enormous potential for discovering novel SBPs from natural resources.

Table 1.

Quantitative and species-level statistics for Affibody, Anticalin, DARPin, and Fynomer

| Scaffold | Number of SBPs | Number of SBP-like scaffolds | Number of species | Top five species (percentage) | Human percentage |
| --- | --- | --- | --- | --- | --- |
| Affibody | 49 | 1975 | 660 | Staphylococcus aureus (14.48%), Mycoplasmopsis synoviae (4.91%), Streptococcus agalactiae (1.57%), Dolosigranulum pigrum (1.32%), Ligilactobacillus salivarius (0.91%) | 0 |
| Anticalin | 39 | 108,792 | 18,378 | Acidobacteriota bacterium (1.27%), Ixodes ricinus (0.93%), Mesorhizobium sp. (0.59%), Bacteroidota bacterium (0.56%), Gemmatimonadota bacterium (0.43%) | 0.05% |
| DARPin | 118 | 374,869 | 18,364 | Trichomonas vaginalis (0.65%), Giardia intestinalis (0.53%), Aphanomyces astaci (0.46%), Macrostomum lignano (0.44%), Rotaria sp. Silwood1 (0.40%) | 0.16% |
| Fynomer | 9 | 523,567 | 29,685 | Araneus ventricosus (0.66%), Tanacetum cinerariifolium (0.58%), Helicobacter pylori (0.38%), Salmo trutta (0.38%), Chloroflexota bacterium (0.36%) | 0.20% |

Sequence space and biophysical properties of synthetic binding protein-like fragments

Firstly, we found that most SBP-like fragments share low sequence similarity with known SBPs, with identity values predominantly ranging between 0.1 and 0.4 (Fig. 3a–d). This finding demonstrates that structure-based similarity searches can overcome the limitations of traditional sequence-based methods, thereby uncovering a larger pool of potential homologous proteins. Furthermore, while high sequence similarity reliably predicts high structural similarity, proteins with high structural similarity do not necessarily exhibit high sequence similarity. This observation aligns with the evolutionary concept that the accumulation of point mutations can widely alter sequences while preserving overall structure and function.

Figure 3.

A series of twelve statistical plots (scatter plots, PCA visualizations, and heatmaps) analyzing the structural similarity and biophysical metrics of four SBPs (Affibody, Anticalin, DARPin, and Fynomer) and their naturally occurring, similar fragments.

Statistics of SBPs and SBP-like scaffold fragments. Scatter plots of identity and TM-score for SBP-like scaffolds corresponding to Affibody (a), Anticalin (b), DARPin (c), and Fynomer (d); PCA visualization of the metrics corresponding to Affibody (e), Anticalin (f), DARPin (g), and Fynomer (h); heatmap of correlations between metrics corresponding to Affibody (i), Anticalin (j), DARPin (k), and Fynomer (l).

SBPs are widely utilized due to their excellent properties. On the PCA plot of the metrics, SBPs tend to concentrate within a relatively confined region (Fig. 3e–h). This concentration is partly attributable to the fact that SBPs are generated through mutations at fixed scaffold positions—resulting in high sequence and structural similarity—and partly because these regions may reflect the intrinsic high quality of SBPs. In contrast, the metric distribution of SBP-like fragments is much more dispersed, with the SBP data representing only a small segment of that overall space. Thus, by focusing on this specific subset within the SBP-like fragments, we may uncover novel candidates for high-quality protein scaffolds.

Due to the substantial sequence and structural differences among the four scaffolds, the interrelationships among the metrics also vary (Fig. 3i–l). In particular, the positive correlations observed among buried NPSA, SASA, modularity (Q), radius of gyration (Rg), asphericity, Zipf coefficient (z), and assortativity (ρ) suggest that scaffolds with more extensive hydrophobic core burial and surface exposure tend to adopt more modular, expanded, and anisotropic folds, features that can enhance overall stability and provide versatile binding interfaces. Conversely, the negative correlations of the principal eigenvalue (λ₁) and CO with these same parameters imply that proteins with lower λ₁ values and fewer contacts often form more compact, tightly packed conformations, potentially trading dynamic flexibility for structural rigidity. Identity is weakly correlated or uncorrelated with most other metrics, whereas the TM-score exhibits stronger correlations with them. Moreover, the correlation between TM-score and identity is not particularly strong. These observations suggest that traditional sequence- and structure-based analyses are insufficient for a comprehensive characterization of proteins and fail to provide biophysical insights. Beyond structural similarity, incorporating biophysical similarity in protein selection may enable the identification of truly high-quality proteins from a more holistic perspective.

Using standardized Euclidean distance, we selected the 200 proteins whose indices are closest to SBPs as PHPSs (Fig. 4). The boxplot (Fig. 4a–d) and PCA (Fig. 4i–l) visualization demonstrate that the parameters of these PHPSs lie around the distribution of the SBPs. Notably, many of the PHPSs exhibit low sequence similarity yet high structural similarity (Fig. 4e–h). This observation suggests that a protein’s biophysical properties are more strongly determined by its structure, while its sequence mainly serves to form specific structural and functional characteristics. Thus, analyses that focus on structure and parameters provide a more accurate depiction of protein function.

Figure 4.

A comprehensive statistical comparison of SBPs, their similar natural fragments, and PHPSs, presented through distribution plots, TM-score/identity scatter plots, and PCA visualizations for the four protein scaffolds.

Statistics of SBPs, SBP-like scaffold fragments, and PHPSs. Distribution of the biophysical metrics for SBPs and PHPSs, including Affibody (a), Anticalin (b), DARPin (c), and Fynomer (d); distribution of TM-score and identity of SBP-like scaffold fragments and of PHPSs, each measured against SBPs, for Affibody (e), Anticalin (f), DARPin (g), and Fynomer (h); PCA visualization of biophysical metrics for SBPs, SBP-like scaffold fragments, and PHPSs, including Affibody (i), Anticalin (j), DARPin (k), and Fynomer (l).

Protein function analysis of the representative synthetic binding protein-like scaffolds

To gain deeper insight into whether proteins containing SBP-like domains possess biological activity and whether this activity is mediated specifically by the SBP-like fold, we selected several such proteins from the PHPSs dataset with documented interactions in the STRING database and performed docking analyses against their respective partners. The near-native conformation of each representative SBP-like protein bound to its interactors was successfully identified, as shown in Fig. 5. The pLDDT values also indicate high model confidence, particularly at the binding interface. In the Anticalin- and DARPin-derived examples, the SBP-like domain essentially constitutes the entire protein. In the Affibody-, Anticalin-, and DARPin-derived scaffolds, the SBP-like motif engages directly with the target. These observations indicate that, within a full-length protein, the SBP-like fold may act as the core determinant of binding affinity, suggesting that de novo-designed synthetic proteins could likewise be engineered to target these partners by incorporating or modifying SBP-like elements. However, exceptions can occur: for instance, in proteins containing Fynomer-like domains, the functional binding core is formed by an α-helical bundle, and the Fynomer-like portion does not contribute directly to target affinity (Fig. S1).

Figure 5.

Molecular models showing successful protein–protein docking between representative potential scaffolds (shown as ribbons) and their binding partners (shown as surfaces), with high-confidence binding interfaces highlighted in red.

Protein–protein docking analysis verifies the function of the representative SBP-like scaffolds. An inset panel shows the near-native conformation for each docking. In RosettaDock, a successful docking was defined as having an interface RMSD (I_rmsd) of ≤4.0 Å and at least three near-native conformations in the top five scoring structures (N5 ≥ 3). The proteins shown as ribbons represent PHPSs, while the proteins displayed in surface mode are their potential partners. A color gradient from blue (low confidence) to red (high confidence) represents pLDDT values ranging from 0 to 100. The predicted complexes show high confidence, especially in their binding interface.
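The success criterion stated in the caption (I_rmsd ≤ 4.0 Å and N5 ≥ 3) can be written as a short check. This is a sketch under stated assumptions: the input is a hypothetical list of (score, interface RMSD) pairs, and a lower score is taken to mean a better-ranked model, as is conventional for Rosetta energies. It does not call RosettaDock itself.

```python
def docking_success(models, irmsd_cutoff=4.0, top_n=5, min_near_native=3):
    """Apply the N5 criterion to a set of docking models.

    models: list of (score, interface_rmsd) tuples; lower score = better rank (assumption).
    Returns (success, n_near_native_in_top_n).
    """
    top = sorted(models, key=lambda m: m[0])[:top_n]
    n_near = sum(1 for _, irmsd in top if irmsd <= irmsd_cutoff)
    return n_near >= min_near_native, n_near


# Hypothetical scores and interface RMSDs (in Å); not results from the study.
toy_models = [(-30.0, 1.2), (-28.0, 2.5), (-27.0, 8.0), (-26.0, 3.9), (-25.0, 10.0), (-10.0, 0.5)]
print(docking_success(toy_models))
```

Note that the low-RMSD model with the poor score (-10.0) does not count, because only the five best-scoring structures enter N5.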

Verification of the synthetic binding protein design principle from natural evolution knowledge

Although protein structure has been shown to better represent biophysical properties and functions, sequence information remains the most intuitive basis for protein design. To explore the similarities and differences between SBPs and PHPSs, we performed a combined analysis of sequence evolution and conservation for both groups. For brevity, only Affibody is presented here as a detailed example (Fig. 6); identical analytical procedures were applied to Anticalin, DARPin, and Fynomer, with the results provided in the Appendix (Figs. S2–S13).

Figure 6.

A multipanel sequence-based analysis for the Affibody scaffold, including a phylogenetic tree, conservation map, hydrophobicity/accessibility/charge plots, sequence logo, and amino acid frequency distribution.

Sequence-based analysis of SBPs and PHPSs in Affibody. Phylogenetic tree of SBPs and PHPSs (a); conservation analysis of Affibody using ConSurf (b); average values of hydrophobicity index, accessible surface area, and net charge calculated for each position in the refined MSA of SBPs and PHPSs (c); WebLogo analysis of the refined MSA for SBPs and PHPSs (d); frequency distribution of each amino acid in SBPs and PHPSs (e).

Following the procedure described in the Methods section, we generated MSAs and constructed phylogenetic trees for both SBPs and PHPSs (Fig. 6a). Interestingly, the SBPs are nested within the evolutionary framework of the PHPSs, indicating that proteins designed under controlled conditions still exhibit inherent evolutionary relationships with natural proteins. Moreover, analysis using ConSurf [47] to display conserved regions in Affibody revealed a clear trend: the protein core is highly conserved, while the peripheral regions are more variable (Fig. 6b and Fig. S3). This pattern reflects a design principle whereby internal conservation ensures the maintenance of specific intramolecular interactions and structure formation, whereas external variability allows for diverse biological processes and interactions with other biomolecules. This observation aligns with findings in SYNBIP [26], where multiple SBPs developed from a common structural backbone exhibit varied biological functions. In the subsequent analyses, we focus on designing the conserved sites of SBPs based on evolutionary insights from nature to enhance affinity across diverse binding interfaces.

Subsequently, we conducted a residue-level property analysis for the aligned SBPs and PHPSs in Affibody, integrating conservation information from ConSurf [47] and WebLogo [42]. For instance, in SBP000460, residues at positions 22, 23, 26, 37, 48, and 52 are conserved (Fig. 6b), corresponding to positions 22, 23, 26, 37, 50, and 54 in Fig. 6c and d; these residues contribute to the formation of a stable structure. Specifically, based on knowledge of the natural evolutionary process, residue 22 is characterized by hydrophobicity, a low solvent-accessible surface area due to shielding by neighboring residues, lack of charge (Fig. 6c), and an enrichment in leucine (L) (Fig. 6d). Therefore, when designing position 22, it is advisable to consider replacing it with amino acids that share these three properties. Additionally, within the Affibody scaffold, cysteine (C) is notably disfavored (Fig. 6e), and thus its inclusion should be avoided when designing SBPs based on Affibody. Consequently, A, F, I, L, and V were all considered as potential residues at position 22. A similar analysis was applied to the other conserved sites, resulting in a total of 5 usable conserved sites and 14 mutations. One position was discarded because no amino acids with fully similar properties were available, or because no conserved amino acids other than the one found in the SBP occurred at this position, making it impossible to identify suitable mutations.
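The reasoning for position 22 can be illustrated with a simplified filter. This sketch is not the authors' procedure: it assumes the Kyte-Doolittle hydropathy scale as the hydrophobicity index, treats only D, E, K, and R as charged, ignores solvent accessibility, and takes the set of residues observed in the alignment column as a hypothetical input chosen to mirror the text. It is only meant to reproduce the position-22 example and would not necessarily match the selections at other positions.

```python
# Kyte-Doolittle hydropathy index (positive values indicate hydrophobic residues).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGED = frozenset("DEKR")  # simplification: histidine treated as uncharged


def candidate_substitutions(wild_type, observed, disfavored=frozenset("C")):
    """Propose substitutions that keep the hydrophobic/charged class of the wild type.

    wild_type:  one-letter code of the residue in the SBP
    observed:   residues seen at this column of the MSA (hypothetical input)
    disfavored: residues to avoid scaffold-wide (cysteine in the Affibody example)
    """
    wt_hydrophobic = KYTE_DOOLITTLE[wild_type] > 0
    wt_charged = wild_type in CHARGED
    keep = []
    for aa in sorted(observed):
        if aa == wild_type or aa in disfavored:
            continue
        same_hydropathy_class = (KYTE_DOOLITTLE[aa] > 0) == wt_hydrophobic
        same_charge_class = (aa in CHARGED) == wt_charged
        if same_hydropathy_class and same_charge_class:
            keep.append(aa)
    return keep


# Illustrative column content for position 22 (leucine in the SBP).
print(candidate_substitutions("L", set("AFILVCK")))
```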

To assess the generalizability of optimizing SBPs based on evolutionary knowledge, we selected five high-quality complexes with entirely distinct binding interfaces from the SYNBIP database. For the 14 mutations, we used mCSM-PPI2 and DDMut-PPI to evaluate changes in affinity before and after mutation (Table 2). Notably, across all five complexes, certain mutations designed to optimize SBP internal interactions based on evolutionary principles led to increased predicted affinity.

Table 2.

In silico site-directed mutagenesis following the principles of natural evolution via DDMut-PPI and mCSM-PPI2 analysis.

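All values are ΔΔGpred for the indicated complex; DDMut = DDMut-PPI, mCSM = mCSM-PPI2.

| Position | Origin | Mutant | MT000110 DDMut | MT000110 mCSM | MT000124 DDMut | MT000124 mCSM | MT000128 DDMut | MT000128 mCSM | MT000142 DDMut | MT000142 mCSM | MT000163 DDMut | MT000163 mCSM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | LEU | ALA | −0.629 | −0.367 | −0.653 | −0.429 | −1.16 | −0.608 | −1.144 | −1.116 | −0.487 | −0.52 |
| 22 | LEU | ILE | −0.342 | **0.09** | −0.3 | −0.093 | −0.609 | **0.355** | −0.684 | −1.225 | −0.291 | **0.183** |
| 22 | LEU | PHE | −0.38 | **0.661** | −0.406 | **0.798** | −0.925 | **0.488** | −0.893 | **0.541** | −0.301 | **0.68** |
| 22 | LEU | VAL | −0.419 | −0.323 | −0.398 | −0.377 | −0.738 | −0.463 | −0.88 | −1.171 | −0.336 | −0.357 |
| 23 | ASN | HIS | −0.253 | −0.026 | −0.121 | −0.019 | −0.538 | −0.046 | **0.129** | −0.17 | −0.203 | −0.06 |
| 23 | ASN | SER | −0.154 | −0.217 | −0.097 | −0.242 | −0.257 | −0.451 | −0.014 | −0.5 | −0.134 | −0.221 |
| 23 | ASN | THR | −0.211 | −0.056 | −0.128 | −0.058 | −0.252 | −0.098 | **0.246** | −0.589 | −0.173 | **0.038** |
| 26 | GLN | ASN | −0.591 | −0.901 | −0.519 | −0.125 | −0.784 | −0.267 | −0.446 | −0.561 | −0.495 | −0.367 |
| 26 | GLN | ASP | −0.9 | −0.764 | −0.76 | −0.089 | −0.959 | −0.154 | −0.372 | −0.645 | −0.878 | −0.314 |
| 26 | GLN | GLU | −0.343 | −0.079 | −0.612 | **0.107** | −0.92 | −0.029 | −0.122 | −0.172 | −0.731 | **0.001** |
| 26 | GLN | HIS | −0.641 | −0.153 | −0.545 | **0.137** | −0.81 | **0.115** | −0.49 | **0.124** | −0.482 | **0.121** |
| 37 | ASP | ALA | −0.531 | −0.269 | −0.347 | −0.157 | −0.32 | −0.245 | −0.293 | −0.133 | −0.445 | −0.162 |
| 52 | ASN | ASP | −0.516 | −0.091 | −0.186 | −0.206 | −0.292 | −0.171 | −0.369 | −0.311 | −0.523 | −0.224 |
| 52 | ASN | HIS | −0.44 | **0.106** | −0.56 | **0.169** | −0.746 | **0.156** | −0.862 | **0.097** | −0.565 | **0.195** |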

Bold values indicate ΔΔG > 0, corresponding to an increased binding affinity between the two proteins upon mutation

For comparison, we introduced mutations that contradicted natural evolutionary principles to examine their effects on target affinity. Taking position 22 as an example, natural evolution favors hydrophobicity, low solvent-accessible surface area, and the absence of charge (Fig. 6c), and notably disfavors cysteine (C) across all sites (Fig. 6e). Thus, we selected amino acids with markedly different properties (i.e. C, D, E, K, and R) as replacements. The same approach was applied to the 4 other positions, resulting in a total of 26 mutations. Using mCSM-PPI2 and DDMut-PPI to evaluate affinity changes, we observed, as expected, that all 26 mutations led to varying degrees of predicted affinity reduction (Table 3). Furthermore, by comparing the “evolution-consistent” and “evolution-opposing” mutations, we found that the predicted ΔΔG values of the evolution-consistent mutations were statistically significantly higher than those of the evolution-opposing mutations at some of these sites (Fig. 7). This result supports both the accuracy of our mutation site selection and the effectiveness of natural evolution-guided strategies for protein optimization.

Table 3.

In silico site-directed mutagenesis not following the principles of natural evolution via DDMut-PPI and mCSM-PPI2 analysis

All values are ΔΔGpred for the indicated complex; DDMut = DDMut-PPI, mCSM = mCSM-PPI2.

| Position | Origin | Mutant | MT000110 DDMut | MT000110 mCSM | MT000124 DDMut | MT000124 mCSM | MT000128 DDMut | MT000128 mCSM | MT000142 DDMut | MT000142 mCSM | MT000163 DDMut | MT000163 mCSM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | LEU | ARG | −1.178 | −0.408 | −1.347 | −0.464 | −1.253 | −0.449 | −2.545 | −1.268 | −1.065 | −0.456 |
| 22 | LEU | ASP | −1.367 | −0.444 | −1.264 | −0.634 | −1.838 | −0.635 | −1.702 | −1.503 | −1.262 | −0.548 |
| 22 | LEU | CYS | −0.344 | −0.35 | −0.367 | −0.481 | −0.422 | −0.523 | −0.578 | −1.025 | −0.272 | −0.487 |
| 22 | LEU | GLU | −1.099 | −0.491 | −1.118 | −0.62 | −1.712 | −0.619 | −1.343 | −1.52 | −1.075 | −0.577 |
| 22 | LEU | LYS | −0.866 | −0.308 | −0.918 | −0.358 | −1.261 | −0.431 | −1.262 | −1.117 | −0.804 | −0.448 |
| 23 | ASN | CYS | −0.233 | −0.426 | −0.145 | −0.415 | −0.611 | −0.599 | −0.171 | −0.806 | −0.247 | −0.34 |
| 26 | GLN | ALA | −0.632 | −0.952 | −0.525 | −0.121 | −0.744 | −0.171 | −0.383 | −0.684 | −0.614 | −0.53 |
| 26 | GLN | CYS | −0.592 | −1.157 | −0.629 | −0.22 | −0.746 | −0.433 | −0.431 | −0.757 | −0.59 | −0.598 |
| 26 | GLN | ILE | −0.488 | −0.811 | −0.423 | −0.188 | −0.603 | −0.325 | −0.409 | −0.46 | −0.503 | −0.39 |
| 26 | GLN | LEU | −0.458 | −0.61 | −0.553 | −0.13 | −0.781 | −0.241 | −0.219 | −0.466 | −0.543 | −0.263 |
| 26 | GLN | MET | −0.485 | −0.669 | −0.497 | −0.195 | −0.715 | −0.215 | −0.328 | −0.552 | −0.501 | −0.395 |
| 26 | GLN | PHE | −0.579 | −0.676 | −0.551 | −0.18 | −0.634 | −0.231 | −0.601 | −0.33 | −0.719 | −0.171 |
| 26 | GLN | VAL | −0.613 | −0.799 | −0.648 | −0.167 | −0.769 | −0.284 | −0.671 | −0.502 | −0.751 | −0.379 |
| 37 | ASP | CYS | −0.314 | −0.344 | −0.211 | −0.494 | −0.262 | −0.543 | −0.176 | −0.387 | −0.244 | −0.468 |
| 37 | ASP | ILE | −0.417 | −0.283 | −0.269 | −0.209 | −0.36 | −0.281 | −0.216 | −0.154 | −0.303 | −0.191 |
| 37 | ASP | LEU | −0.295 | −0.228 | −0.295 | −0.263 | −0.283 | −0.283 | −0.192 | −0.208 | −0.342 | −0.192 |
| 37 | ASP | MET | −0.464 | −0.354 | −0.296 | −0.445 | −0.319 | −0.467 | −0.243 | −0.314 | −0.367 | −0.342 |
| 37 | ASP | PHE | −0.148 | −0.101 | −0.247 | −0.169 | −0.296 | −0.134 | −0.156 | −0.075 | −0.295 | −0.095 |
| 37 | ASP | VAL | −0.368 | −0.191 | −0.297 | −0.233 | −0.327 | −0.325 | −0.242 | −0.17 | −0.423 | −0.227 |
| 52 | ASN | ALA | −0.644 | −0.393 | −0.589 | −0.221 | −0.836 | −0.274 | −0.965 | −0.298 | −0.915 | −0.357 |
| 52 | ASN | CYS | −0.471 | −0.49 | −0.401 | −0.415 | −0.643 | −0.385 | −0.607 | −0.453 | −0.639 | −0.421 |
| 52 | ASN | ILE | −1.22 | −0.169 | −1.234 | −0.105 | −1.511 | −0.196 | −1.711 | −0.169 | −1.679 | −0.14 |
| 52 | ASN | LEU | −0.422 | −0.386 | −0.638 | −0.217 | −0.589 | −0.314 | −0.717 | −0.25 | −0.687 | −0.277 |
| 52 | ASN | MET | −0.413 | −0.238 | −0.424 | −0.206 | −0.574 | −0.23 | −0.585 | −0.246 | −0.591 | −0.233 |
| 52 | ASN | PHE | −0.733 | −0.369 | −0.548 | −0.164 | −0.487 | −0.223 | −1.009 | −0.279 | −0.969 | −0.127 |
| 52 | ASN | VAL | −0.9 | −0.341 | −0.649 | −0.195 | −0.87 | −0.263 | NA | −0.228 | −0.916 | −0.213 |

Figure 7.

Bar graphs comparing the predicted ΔΔG for two different types of mutations using two computational tools (mCSM-PPI2 and DDMut-PPI), with asterisks denoting statistical significance.

Cross-site ΔΔG predictions for evolution-consistent and evolution-opposing mutations: (a) mCSM-PPI2 and (b) DDMut-PPI. Significance by t-test: *P < .01, **P < .05, ***P < .005.
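The comparison summarized in Fig. 7 can be sketched with a two-sample t statistic. The variant of the t-test and the way values were pooled across complexes are not specified here, so this example assumes Welch's unequal-variance form and uses only the position-22 DDMut-PPI predictions for MT000110 from Tables 2 and 3. The helper name `welch_t` is hypothetical, and a P value is deliberately omitted because it requires the t-distribution CDF.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for two samples."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances
    se2_a, se2_b = var_a / len(a), var_b / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1))
    return t, df


# Position 22, MT000110, DDMut-PPI predictions.
consistent = [-0.629, -0.342, -0.38, -0.419]        # Table 2: L22A, L22I, L22F, L22V
opposing = [-1.178, -1.367, -0.344, -1.099, -0.866]  # Table 3: L22R, L22D, L22C, L22E, L22K

t_stat, dof = welch_t(consistent, opposing)
print(f"mean consistent = {statistics.fmean(consistent):.4f}")
print(f"mean opposing   = {statistics.fmean(opposing):.4f}")
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

A positive t statistic means the evolution-consistent substitutions have higher (less destabilizing) predicted ΔΔG on average than the evolution-opposing ones for this subset.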

Discussion

The identification of SBP-like scaffolds across diverse species underscores the untapped potential of natural protein space for synthetic binder design. Our structure-based approach highlights a critical advantage over traditional sequence-driven methods: it enables the discovery of scaffolds that preserve functional folds despite low sequence similarity. This finding emphasizes that structural conservation, rather than sequence identity, is often the key determinant of binding capability, a principle that can guide future protein engineering efforts.

By integrating structural alignment with a multidimensional evaluation of biophysical properties, we demonstrate that scaffold quality is determined by a complex interplay of metrics such as solvent accessibility, hydrophobic core burial, and modularity. These parameters provide insights into scaffold stability and versatility, offering predictive power beyond what sequence- or structure-based similarity alone can deliver. The identification of PHPSs closely aligned with SBPs in biophysical space illustrates the potential to pinpoint high-quality scaffolds that are structurally competent but sequence-divergent, expanding the design repertoire for synthetic binders.

Functional analyses further reinforce the modularity of SBP-like domains. In most cases, these domains serve as the primary determinant of binding affinity, suggesting that rationally designed SBPs could exploit these conserved structural motifs to target diverse partners. Notably, exceptions such as Fynomer-derived proteins indicate that scaffold context and the contribution of other structural elements can modulate function, highlighting the need for empirical validation in protein design.

Our evolutionary and mutagenesis analyses provide a complementary framework for rational design. Conserved residues within the scaffold core maintain structural integrity, while peripheral variability allows functional diversification. By aligning protein engineering strategies with these evolutionary insights, we can improve binding affinity and stability, demonstrating a robust design principle informed by nature. Overall, our findings provide a holistic strategy for identifying and optimizing SBP scaffolds, bridging structural bioinformatics, evolutionary knowledge, and biophysical characterization.

While our pipeline provides an efficient in silico strategy for scaffold discovery, it currently relies exclusively on computational predictions (e.g. AlphaFold3 for structure modeling, RosettaDock for interaction prediction, and mCSM-PPI2 and DDMut-PPI for predicting the effects of mutations on binding affinity). Although these approaches greatly improve scaffold identification efficiency, the functional relevance and stability of the predicted SBP-like scaffolds remain to be experimentally confirmed. In particular, AlphaFold structures, while highly accurate, are still predictions and may not fully capture functionally relevant conformations, especially in disordered regions or protein complexes. Therefore, future work should incorporate systematic experimental validation, including protein expression and purification, structural verification (e.g. X-ray crystallography, cryo-EM, or NMR), and biophysical binding assays (e.g. SPR or ITC). Extending this framework to a broader range of protein families, and potentially integrating alternative deep learning models for biophysical property prediction, will further enhance scaffold discovery. By coupling large-scale computational exploration with targeted laboratory validation, the translational potential of SBP-like scaffolds can be robustly assessed and optimized.

Conclusion

In this study, we introduced a deep learning-based structural alignment approach for the rapid identification of SBP-like scaffolds. By integrating structural similarity searches with the HP2A framework, we systematically evaluated scaffold potential across multiple dimensions and uncovered novel SBP-like candidates with desirable biophysical properties. Our findings highlight the advantages of structure-based search methodologies over traditional sequence-based approaches, enabling a more comprehensive exploration of the proteome for potential scaffolds. Moreover, statistical and sequence analyses suggest that scaffold function is predominantly determined by structural conservation rather than sequence identity. While the current results are based on in silico predictions, this study establishes a workflow for scaffold discovery that can serve as a foundation for future experimental validation. With such validation and further refinement of computational models, this framework has the potential to support rational scaffold design and contribute to applications in biotechnology, synthetic biology, and ultimately therapeutic development.

Key Points

  • The framework integrates the structural alignment tool Foldseek with the holistic protein attributes assessment (HP2A) algorithm for evaluating biophysical properties, uncovering synthetic binding protein (SBP)-like scaffolds across the proteome and overcoming the limitations of sequence-based searches.

  • The method can identify scaffolds with <30% sequence identity to SBPs but with a TM-score >0.5, revealing a diverse sequence space of SBP-like scaffolds.

  • Analysis of the identified SBP-like scaffolds reveals key biophysical properties and evolutionary patterns to guide optimized protein binder design.

Supplementary Material

Supplementary_data_bbaf573

Contributor Information

Zixin Duan, School of Pharmaceutical Sciences, Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, Chongqing University, Chongqing 401331, China.

Yafeng Liang, School of Pharmaceutical Sciences, Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, Chongqing University, Chongqing 401331, China.

Jing Zhang, School of Pharmaceutical Sciences, Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, Chongqing University, Chongqing 401331, China.

Zijian Chen, School of Pharmaceutical Sciences, Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, Chongqing University, Chongqing 401331, China.

Liangcai Gu, Department of Biochemistry, University of Washington, Seattle, WA 98195, United States.

Weiwei Xue, School of Pharmaceutical Sciences, Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, Chongqing University, Chongqing 401331, China.

Author contributions

W.X. conceived the project. Z.D. designed the framework, performed experiments, and analyzed the results. Y.L., J.Z., Z.C., and L.G. contributed to the data analysis. W.X. and L.G. supervised the project. Z.D. and Y.L. developed the first draft of the paper. All authors contributed to writing and improving the paper and approved the submission.

Conflict of interest

None declared.

Funding

This work was supported by the Natural Science Foundation of China (grant no. U23A20502), and the Open Project of Central Nervous System Drug Key Laboratory of Sichuan Province (grant no. 240023-01SZ).

Data availability

The original SBP data used in this study were obtained from https://synbip.idrblab.net. All SBP-like protein benchmark datasets are also publicly available at https://synbip.idrblab.net. The source code of the HP2A algorithm for multiparametric profiling of proteins is freely available on GitHub at https://github.com/Tuan-Space/HP2A.

References

  • 1. Gebauer M, Skerra A. Engineered protein scaffolds as next-generation therapeutics. Annu Rev Pharmacol Toxicol 2020;60:391–415. doi: 10.1146/annurev-pharmtox-010818-021118
  • 2. Wang X, Li F, Qiu W. et al. SYNBIP: synthetic binding proteins for research, diagnosis and therapy. Nucleic Acids Res 2022;50:D560–70. doi: 10.1093/nar/gkab926
  • 3. Jiang H, Jude KM, Wu K. et al. De novo design of buttressed loops for sculpting protein functions. Nat Chem Biol 2024;20:974–80. doi: 10.1038/s41589-024-01632-2
  • 4. Li Y, Duan Z, Li Z. et al. Data and AI-driven synthetic binding protein discovery. Trends Pharmacol Sci 2025;46:132–44. doi: 10.1016/j.tips.2024.12.002
  • 5. Le NQK. Leveraging transformers-based language models in proteome bioinformatics. Proteomics 2023;23:2300011. doi: 10.1002/pmic.202300011
  • 6. Le NQK, Li W, Cao Y. Sequence-based prediction model of protein crystallization propensity using machine learning and two-level feature selection. Brief Bioinform 2023;24:bbad319. doi: 10.1093/bib/bbad319
  • 7. Redfern OC, Dessailly B, Orengo CA. Exploring the structure and function paradigm. Curr Opin Struct Biol 2008;18:394–402. doi: 10.1016/j.sbi.2008.05.007
  • 8. Narasimhan S. Determining protein structures using X-ray crystallography. Methods Mol Biol (Clifton NJ) 2024;2787:333–53. doi: 10.1007/978-1-0716-3778-4_23
  • 9. Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993;234:779–815. doi: 10.1006/jmbi.1993.1626
  • 10. Jumper J, Evans R, Pritzel A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9. doi: 10.1038/s41586-021-03819-2
  • 11. Varadi M, Bertoni D, Magana P. et al. AlphaFold protein structure database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Res 2024;52:D368–75. doi: 10.1093/nar/gkad1011
  • 12. Barrio-Hernandez I, Yeo J, Jänes J. et al. Clustering predicted structures at the scale of the known protein universe. Nature 2023;622:637–45. doi: 10.1038/s41586-023-06510-w
  • 13. Boger RS, Chithrananda S, Angelopoulos AN. et al. Functional protein mining with conformal guarantees. Nat Commun 2025;16:85. doi: 10.1038/s41467-024-55676-y
  • 14. Altschul SF, Gish W, Miller W. et al. Basic local alignment search tool. J Mol Biol 1990;215:403–10. doi: 10.1016/S0022-2836(05)80360-2
  • 15. Steinegger M, Meier M, Mirdita M. et al. HH-Suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 2019;20:1–15.
  • 16. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 2017;35:1026–8. doi: 10.1038/nbt.3988
  • 17. Liu W, Wang Z, You R. et al. PLMSearch: protein language model powers accurate and fast sequence search for remote homology. Nat Commun 2024;15:2775. doi: 10.1038/s41467-024-46808-5
  • 18. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 2005;33:2302–9. doi: 10.1093/nar/gki524
  • 19. van Kempen M, Kim SS, Tumescheit C. et al. Fast and accurate protein structure search with Foldseek. Nat Biotechnol 2024;42:243–6. doi: 10.1038/s41587-023-01773-0
  • 20. Andreeva A, Howorth D, Chothia C. et al. SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res 2014;42:D310–4. doi: 10.1093/nar/gkt1242
  • 21. Konc J, Janežič D. ProBiS-2012: web server and web services for detection of structurally similar binding sites in proteins. Nucleic Acids Res 2012;40:W214–21. doi: 10.1093/nar/gks435
  • 22. Wang X, Zhang Y, Li Z. et al. PROSCA: an online platform for humanized scaffold mining facilitating rational protein engineering. Nucleic Acids Res 2024;52:W272–9. doi: 10.1093/nar/gkae384
  • 23. McConnell A, Batten SL, Hackel BJ. Determinants of developability and evolvability of synthetic miniproteins as ligand scaffolds. J Mol Biol 2023;435:168339. doi: 10.1016/j.jmb.2023.168339
  • 24. Golinski AW, Holec PV, Mischler KM. et al. Biophysical characterization platform informs protein scaffold evolvability. ACS Comb Sci 2019;21:323–35. doi: 10.1021/acscombsci.8b00182
  • 25. Tang Q-Y, Ren W, Wang J. et al. The statistical trends of protein evolution: a lesson from AlphaFold database. Mol Biol Evol 2022;39:msac197. doi: 10.1093/molbev/msac197
  • 26. Li Y, Li F, Duan Z. et al. SYNBIP 2.0: epitopes mapping, sequence expansion and scaffolds discovery for synthetic binding protein innovation. Nucleic Acids Res 2024;53:D595–603. doi: 10.1093/nar/gkae893
  • 27. UniProt: the universal protein knowledgebase in 2025. Nucleic Acids Res 2025;53:D609–17. doi: 10.1093/nar/gkae1010
  • 28. McGibbon RT, Beauchamp KA, Harrigan MP. et al. MDTraj: a modern open library for the analysis of molecular dynamics trajectories. Biophys J 2015;109:1528–32. doi: 10.1016/j.bpj.2015.08.015
  • 29. Zhang S, Krieger JM, Zhang Y. et al. ProDy 2.0: increased scale and scope after 10 years of protein dynamics modelling with python. Bioinformatics (Oxf Engl) 2021;37:3657–9. doi: 10.1093/bioinformatics/btab187
  • 30. Hagberg A, Swart PJ, Schult DA. Exploring network structure, dynamics, and function using NetworkX. Proceedings of the 7th Python in Science conference (SciPy 2008); Varoquaux G, Vaught T, Millman J (Eds.), 11–15.
  • 31. Alstott J, Bullmore E, Plenz D. Powerlaw: a python package for analysis of heavy-tailed distributions. PLoS One 2014;9:e85777. doi: 10.1371/journal.pone.0085777
  • 32. Cock PJ, Antao T, Chang JT. et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics (Oxf Engl) 2009;25:1422. doi: 10.1093/bioinformatics/btp163
  • 33. DeLano WL. PyMOL: An Open-Source Molecular Graphics Tool, 2002. http://www.ccp4.ac.uk/newsletters/newsletter40/11_pymol.pdf
  • 34. Szklarczyk D, Kirsch R, Koutrouli M. et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res 2023;51:D638–46. doi: 10.1093/nar/gkac1000
  • 35. Abramson J, Adler J, Dunger J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024;630:493–500. doi: 10.1038/s41586-024-07487-w
  • 36. Marze NA, Roy Burman SS, Sheffler W. et al. Efficient flexible backbone protein-protein docking for challenging targets. Bioinformatics (Oxf Engl) 2018;34:3461–9. doi: 10.1093/bioinformatics/bty355
  • 37. Chaudhury S, Berrondo M, Weitzner BD. et al. Benchmarking and analysis of protein docking performance in Rosetta v3.2. PLoS One 2011;6:e22477. doi: 10.1371/journal.pone.0022477
  • 38. Huang P-S, Boyken SE, Baker D. The coming of age of de novo protein design. Nature 2016;537:320–7. doi: 10.1038/nature19946
  • 39. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004;32:1792–7. doi: 10.1093/nar/gkh340
  • 40. Price MN, Dehal PS, Arkin AP. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol Biol Evol 2009;26:1641–50. doi: 10.1093/molbev/msp077
  • 41. Letunic I, Bork P. Interactive tree of life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res 2021;49:W293–6. doi: 10.1093/nar/gkab301
  • 42. Crooks GE, Hon G, Chandonia J-M. et al. WebLogo: a sequence logo generator. Genome Res 2004;14:1188–90. doi: 10.1101/gr.849004
  • 43. Kawashima S, Pokarowski P, Pokarowska M. et al. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 2007;36:D202–5. doi: 10.1093/nar/gkm998
  • 44. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol 1982;157:105–32. doi: 10.1016/0022-2836(82)90515-0
  • 45. Janin J, Wodak S, Levitt M. et al. Conformation of amino acid side-chains in proteins. J Mol Biol 1978;125:357–86. doi: 10.1016/0022-2836(78)90408-4
  • 46. Klein P, Kanehisa M, DeLisi C. Prediction of protein function from sequence properties. Discriminant analysis of a data base. Biochim Biophys Acta 1984;787:221–6. doi: 10.1016/0167-4838(84)90312-1
  • 47. Ashkenazy H, Abadi S, Martz E. et al. ConSurf 2016: an improved methodology to estimate and visualize evolutionary conservation in macromolecules. Nucleic Acids Res 2016;44:W344–50. doi: 10.1093/nar/gkw408
  • 48. Rodrigues CHM, Myung Y, Pires DEV. et al. mCSM-PPI2: predicting the effects of mutations on protein-protein interactions. Nucleic Acids Res 2019;47:W338–44. doi: 10.1093/nar/gkz383
  • 49. Zhou Y, Myung Y, Rodrigues CHM. et al. DDMut-PPI: predicting effects of mutations on protein–protein interactions using graph-based deep learning. Nucleic Acids Res 2024;52:W207–14. doi: 10.1093/nar/gkae412


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
