Significance
Computational de novo protein design strategies to generate synthetic proteins outside of the natural repertoire stand as a rigorous test of our understanding of protein folding and have had considerable impact on biotechnological applications. However, current de novo design methods are severely hindered by the difficulty of creating designable protein backbone templates that can be realized with natural amino acids. We present an algorithm that automatically generates novel proteins from a minimal string definition with native-like geometries and show that current methods are insufficient for the generation of designable backbones. The variety of de novo designed proteins and the computational method should shed light on basic protein topological principles and facilitate the generation of de novo proteins to enable the exploration of the protein universe.
Keywords: protein design, computational biology, de novo design, protein topological descriptors
Abstract
De novo protein design enables the exploration of novel sequences and structures absent from the natural protein universe. De novo design also stands as a stringent test for our understanding of the underlying physical principles of protein folding and may lead to the development of proteins with unmatched functional characteristics. The first fundamental challenge of de novo design is to devise “designable” structural templates leading to sequences that will adopt the predicted fold. Here, we built on the TopoBuilder (TB) de novo design method to automatically assemble structural templates with native-like features starting from string descriptors that capture the overall topology of proteins. Our framework eliminates the dependency on hand-crafted and fold-specific rules through an iterative, data-driven approach that extracts geometrical parameters from structural tertiary motifs. We evaluated the TopoBuilder framework by designing sequences for a set of five protein folds, and experimental characterization revealed that several sequences were folded and stable in solution. The TopoBuilder de novo design framework will be broadly useful to guide the generation of artificial proteins with customized geometries, enabling the exploration of the protein universe.
Evolution has only explored a small subset of all possible amino acid (AA) sequences and structures (1). The space of viable protein sequences, i.e., sequences that have a global free energy minimum representing a well-folded native state, is small. Such a notion has been supported by several experimental studies showing that most random AA sequences have a rough energy landscape with many local minima representing aggregated or misfolded states (2–5).
De novo design strategies stand as an essential tool to aid the exploration of the sequence space and thereby enable the creation of new protein structures and functions. Classical de novo protein design generally entails two iterative steps: First, target folds are modeled (backbone generation); and second, an AA sequence that stabilizes the lowest free energy state of the target backbone conformation is searched (sequence design). Despite multiple successes (6–9), de novo design remains a challenging problem for protein designers given that it stands as a stringent test for our understanding of the principles that govern protein structures.
Successful structure-based de novo design largely relies on the crafting of “designable” protein backbones, meaning physically realistic and strainless backbones that are compatible with sequences that will yield a protein fold with a well-defined energy minimum (10–15). The designability of a protein backbone is generally proxied by the number of sequences that it can support (10, 16, 17). For example, some natural protein structures can accommodate more sequences than the average and are thought to be more robust against random mutations, and therefore thermodynamically more stable, which favors evolutionary stability (16). Generating designable backbones is important, as one would like to a priori limit the sampling space to engineer only reasonable shapes with inherent structure-to-sequence compatibility and discard presumptive nonviable structures. Many de novo design approaches are likely to fail due to the lack of designability of the starting structural templates, requiring multiple iterative rounds of human-guided and experimental optimizations (4, 5).
Quantifying the designability of protein backbones is difficult (18, 19). Even recent energy functions fail to reliably capture global designability aspects of protein backbones but excel in assessing high-resolution details such as van der Waals forces, steric repulsion, electrostatic interactions, and hydrogen bonds (20). To facilitate the de novo design process at early stages, it would be necessary to have low-resolution energy functions that could accurately capture the physicochemical determinants of realistic structures at the backbone level (21).
There has been considerable progress in developing parametric functions and general principles for describing ideal and less symmetric protein structures (17, 22, 23). Often secondary structure elements (SSEs) are connected to create tertiary structural topologies by packing α-helices on paired β-strands through the control of the loop length and ABEGO residue torsion structure. This approach made it possible to design a set of ideal protein structures, including triose-phosphate isomerase (TIM) barrels (24), β-barrels (25, 26), jellyrolls (27), and immunoglobulin-like domains (28). Nonetheless, parametric definitions are often specifically framed for distinct protein classes or architecture types and cannot be generalized to other architectural configurations, e.g., the Crick coiled-coil generating equations (29) or descriptive parametric models of β-barrels (30, 31).
In this work, we enhanced the capabilities of the TopoBuilder (TB) framework by introducing a data-driven correction module to generate native-like backbones from a simplified description of a protein topology that we term “sketch” (Fig. 1A and SI Appendix, Fig. S1). This module is applicable to any protein topology that can be described by arranging ideal SSEs in layers (32, 33). The correction module generates parametric refinements that geometrically optimize the SSEs of protein backbones toward native-like configurations, rendering them more designable. The set of corrections includes translational and rotational parameters jointly capturing key geometric features such as distances and angles of native tertiary motifs. To further aid the topology assembly step, we use structural fragments from naturally occurring loops to connect two subsequent SSEs. We evaluated the general framework by de novo designing five different folds and found that even a minimal set of corrections to the protein backbones is sufficient to improve sequence sampling and achieve a better sequence-to-structure compatibility and sequence quality overall according to a variety of computational metrics. Finally, we experimentally characterized 54 designs and obtained multiple sequences that were folded and stable in solution, including topologies that have been particularly difficult for computational design such as all-β structures.
Results
Evolution proceeds incrementally through a random and sparse sampling of the possible sequence space, which in turn populates the protein structure space. However, nature seems to show a tendency to reuse the same protein structures repeatedly based on the observation that the discovery of new protein folds has become rare (34–36). In some regards, the mapping of structural space poses a number of challenges, since it depends on the structural definition and coarseness of the structures. For a more systematic exploration of the structural space, Taylor (37) and Taylor et al. (38) defined an idealized SSE lattice representation that can easily be captured through a simple string descriptor called “forms” (Fig. 1A). The form parametrization describes proteins as layered topologies, with each layer being composed of a defined number of either α-helices or hydrogen bonded β-strands (Fig. 1A). Although constrained by its grid-like tabular system that only allows particular folds to be parametrized and excludes folds in barrel shapes (e.g., β-barrels) or folds with noncanonical distorted SSEs (e.g., kinked helices), a wide range of structural configurations can systematically be defined, potentially allowing a full exploration of a protein topological space at orders of magnitude larger than the natural space currently characterized (Fig. 1 B and C). While forms are well suited for protein topology comparison and classification (38), using them for de novo design is challenging due to the loss of crucial structural and sequence features, including native tertiary configurations of SSEs and side-chain representations.
Sketching Native-Like Protein Backbones from Forms.
Given the form description, the TopoBuilder starts by placing ideal SSEs at their respective relative positions as specified by the form description, creating a three-dimensional (3D) backbone object containing only SSEs, which we refer to as a sketch. We define the layer stacking along the z axis (Fig. 2A), with interlayer separations of 8 Å for β-sheets, 10 to 11 Å for α-helices, and mixed structures (39). The y axis aligns with the directionality of the SSEs (Fig. 2A), and the intralayer spacing between adjacent SSEs is defined along the x axis and is typically 10 Å for α-helices and 4.85 Å for β-strands (40).
The sketch representation does not contain SSE connecting loops and has no sequence information. Furthermore, the naive sketch presents a rather nonnative configuration of the SSEs composing the protein structure, which strongly suggests that such structures are likely to be nondesignable. Fine-grained structural details at the secondary structure and tertiary levels, such as β-sheet pleatings and curvatures and α-helical packing, are absent. These features are difficult to sample automatically and correctly with low-resolution scoring functions.
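The placement rule described above can be illustrated with a minimal sketch. It is not the TopoBuilder implementation: the list-of-strings layer encoding, the `SSE` class, and the function names are hypothetical, and we assume that the 8 Å separation applies only between two β-layers while every other layer pair uses 10 Å (the lower end of the quoted 10 to 11 Å range). Only SSE centers are placed; the SSE direction is implied to run along the y axis.

```python
from dataclasses import dataclass

# Spacings quoted in the text, in Å. The list-of-strings layer encoding is a
# hypothetical stand-in for a form string, not the TopoBuilder input format.
INTRALAYER_DX = {"H": 10.0, "E": 4.85}


@dataclass
class SSE:
    name: str
    kind: str      # "H" for α-helix, "E" for β-strand
    center: tuple  # (x, y, z) in Å


def interlayer_dz(lower: str, upper: str) -> float:
    """Assumed rule: 8 Å between two β-layers, 10 Å for any other pair."""
    both_beta = set(lower) == {"E"} and set(upper) == {"E"}
    return 8.0 if both_beta else 10.0


def place_sketch(layers):
    """Place one SSE center per character; layers stack along z, SSEs along x."""
    sketch, z = [], 0.0
    for li, layer in enumerate(layers):
        if li > 0:
            z += interlayer_dz(layers[li - 1], layer)
        x = 0.0
        for si, kind in enumerate(layer):
            if si > 0:
                x += INTRALAYER_DX[kind]
            sketch.append(SSE(f"L{li}{kind}{si}", kind, (x, 0.0, z)))
    return sketch


# A ubiquitin-like layout: one helix layer over a four-stranded sheet.
for sse in place_sketch(["H", "EEEE"]):
    print(sse)
```

Such a naive grid is flat and untwisted by construction, which is the configuration the corrections described in the following paragraphs are meant to refine.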
We hypothesized that grosso modo parametric corrections per SSE could incorporate global native structural features and improve the sketches’ designability (Fig. 2A and SI Appendix, Fig. S2 A–E). To do so, we implemented a module that computes on-the-fly geometrical statistical corrections for each SSE from native structures (Materials and Methods). Briefly, the sketch is first divided into two-layer components based on adjacent layers. These substructures are then iteratively queried against a database of natural protein structures using the software MASTER (41, 42), and structural geometry parameters calculated from the retrieved matches are used to correct the relative positioning of the SSEs in the sketch. The iterative nature of the process along the layers results in a hierarchical refining procedure where the previously corrected substructures help to contextualize the correction in the next layer for coherent native-like SSE placements (SI Appendix, Fig. S3). The main innovation relative to the first TopoBuilder method (32) is that we now incorporate a new module to perform structural corrections on the models according to features observed in native structures. Our previous TopoBuilder generated structural ensembles sampled with parametric moves of the structural elements for which there were no guarantees of being in the designable space. To implement the new module, we mine layered tertiary structural motifs from native proteins to derive parameter combinations that are difficult to either find randomly or calculate using current scoring functions. Hence, our new module removes the manual and random nature of the search for native-like structural parameters for the generation of designable backbone templates.
A small set of parametric corrections includes two rotational parameters and one translational parameter per SSE (Fig. 2A and SI Appendix, Fig. S2 A–E). We compute the twist angle (ζ) between the vector pointing along the length of the SSE and the plane spanned by the layer. Between two adjacent layers, we express the angle (ɛ) as the shear between the layer planes, and the interlayer distance (dz) is the distance from one layer to the next one. The individual parameters shift the SSEs from a naive configuration toward a native arrangement. For example, a β-sheet in the naive configuration is “flat” because of the initial placement of the SSEs that are fully ideal and aligned next to each other. The naturally occurring twist within β-sheets can be approximated by twist angles ζ (43). Hence, the geometric corrections attempt to optimize the global arrangement of the SSEs and generate topologies with native-like features.
Backbone Assembly and Sequence Design of Native-Like Sketches.
To obtain sequence-designed structures, the native-like sketches are subjected to several steps: 1) loop building to connect the SSEs; 2) structural diversification starting from the initial native-like sketch; and 3) sequence sampling and selection of best scoring designs. All these steps were performed using tools provided by the Rosetta software suite (44), and more details are given below and in Materials and Methods (Fig. 2B and SI Appendix, Fig. S1 A–F, 3 and 4).
To build fully connected structural templates, we queried native loop segments that can bridge the gaps between the SSEs that compose the sketch (Materials and Methods). We avoided time-consuming computational loop closure algorithms by leveraging structural information of the retrieved loops and generating structural fragments (3-mers and 9-mers) (ABEGO torsions) (SI Appendix, Fig. S4). We used the previously developed Rosetta FunFolDes (FFD) (45, 46) protocol to generate an initial set of polyvaline backbone conformations using Cα-based distance restraints calculated between all different SSEs and performed sequence design. The structural fragments impose native backbone signatures at the local level, while the overall topology is tightly controlled by the distance restraints. We modified the Rosetta energy function at every stage of the folding simulation to include hydrogen bonding and SSE pairing terms favoring the correct pairing between β-strands (47). Each assembled backbone is fitted with a set of optimal sequences via the Rosetta FastDesign protocol (44). During this stage, amino acid sampling restrictions per position were added, such as layer definitions (core, surface, or boundary, profiles from structural fragments) (48, 49), and secondary structure type assignments (α-helix, loop, or β-strand). A bonus term enhancing secondary structure formation at defined positions was included in the energy function at the desired protein segments (50).
De Novo Design of Five Protein Folds.
To showcase the TopoBuilder de novo design framework and assess its performance, we attempted to de novo design five folds of distinct structural complexities (Fig. 2C). We selected four native folds: a two-layered α/β ubiquitin-like fold, a two-layered β-sandwich Ig-like fold, a two-layered β-sandwich jellyroll, and a three-layered α/β/α Rossmann-like fold. In order to investigate the generalization capability of the framework to the space of novel folds, we included a two-layered α/β Top7-like fold. Of note, the Top7 structure (Protein Data Bank [PDB] 1QYS) was excluded from all databases used during the correction searches in order to avoid any biases coming from the solved structure. The ubiquitin-like fold is composed of a helix packed onto a four-stranded β-sheet. Both terminal β-strands pair in a parallel direction and are located in the center of the sheet, making nonlocal hydrogen bond contacts, while the edge strands form a β-α-β motif. The jellyroll and the Ig-like have both nonlocal β-β motifs. Our drafted jellyroll has three β-arcade motifs and the Ig-like fold is made from a β-arcade on one and a long β-arch on the other side. The architecture of the intended Rossmann fold contains a four-stranded central β-sheet that is flanked by two helices on the top and two helices at the bottom and can be decomposed into three interlocked β-α-β motifs. Lastly, the Top7 fold is defined by two interlocked β-α-β motifs with two additional terminal core strands. Each of the folds has its unique complexity with specific tertiary motifs that need to be arranged realistically with the correct geometries. Especially nonlocal interactions and connections have been difficult to build and design, and multiple detailed analyses were needed to define effective design rules for single folds and domains (26–28, 51, 52).
For the generation of the backbones, we manually specified default SSE lengths of 8 residues for long β-strands and 5 residues for short β-strands or edge strands, 17 residues for long α-helices, and 13 residues for short α-helices. Thus, we did not employ SSE lengths of specific native examples but rather used these as a rough guide for the overall topology. To probe the impact and contribution of the parametric corrections, we first performed baseline design simulations guided only by an uncorrected sketch that we refer to as “naive sketch” (Fig. 3A and SI Appendix, Fig. S11). We then corrected the sketch using several combinations of parameters (termed “native-like sketch”) in order to assess their importance and find a minimal and optimal parameter combination (Fig. 3A and SI Appendix, Fig. S11). In the first scenario, we solely used the twist angle ζ-correction, the second scenario consisted of the corrections ζ + dz, and the third scenario simulated with ζ + dz + ɛ (SI Appendix, Figs. S5–S7). For each of the different scenarios, a total of 1,000 decoys were generated.
Corrections Induce Native Features in Idealized Folds.
For the native folds, the corrections were derived from a set of ∼15 to 25 structurally distinct proteins (SI Appendix, Fig. S8 A–C). The Top7 fold–derived sketch had a lower number of matches that were extracted from larger protein domains with similar SSE dispositions and connectivities (SI Appendix, Fig. S8 A–C). We compared the native-like and the naive decoys for each of the five selected folds (Fig. 3B and SI Appendix, Fig. S9 A–E). The modes of the ζ-angle distributions (Fig. 3C) show that native-like decoys have a twist in β-strands and native side-to-side configurations for α-helices, while the modes from the distributions of the naive decoys retain a ζ-angle around 0°, indicating that no twist was induced in the strands during the fragment assembly folding simulations. Interestingly, for native-like decoys ζ-angle deviations of 10 to 20° were observed, indicating that these offsets arranged in a suitable manner can be sufficient to induce twisted and native-like β-sheets.
Similarly, the ɛ-angle (Fig. 3D) and the dz distance (Fig. 3E), which improve the layer packing geometry, follow the native distribution peaks for the native-like but not for the naive sketch–derived decoys, showing that fragment insertion protocols are insufficient to correct global topological features. Of note, the dz distance seems to be of major importance to improve the packing between all-β layers and less crucial for α/β-layer packing. Importantly, a relaxation without restraints after the folding and design simulations also did not yield native geometries in naive decoys, suggesting that geometric corrections were necessary to guide the folding trajectories of the native-like sketches.
Assessing Designability through Sequence-To-Structure Compatibility.
To assess the difference between sequences from the naive and native-like designs, we first used BLASTp (53) to search for similar sequences in the natural repertoire. However, the few hits (E value <0.01) found did not match the target fold, showing that there were no evident fold signatures in either of the design sets (SI Appendix, Fig. S10 A and B). Therefore, we used two orthogonal deep learning protein structure prediction engines, trRosetta (trR) (54) and AlphaFold (AF) (55), to predict structural models for all designed sequences (without multiple-sequence alignment [MSA] generation, i.e., in single-sequence input mode). We computed the template modeling (TM) scores and root mean square deviations (RMSDs) between the TopoBuilder designs and the predicted structures by trR and AF (Fig. 4 A and B and SI Appendix, Fig. S11 A and B). We hypothesized that, if our native-like backbones have improved designability, they could lead to sequences with stronger propensities for the respective fold and consequently to more accurate structure predictions in contrast to the naive sketch–derived designs.
The comparisons of the naive and native-like design sets in terms of accuracies for the structure prediction simulations do not show clear differences. For the naive designs, we observed low RMSDs of ∼2 Å for trR and ∼1.8 Å for AF, while the native-like designs achieved RMSDs of ∼1.5 Å and ∼1.2 Å for trR and AF, respectively (SI Appendix, Fig. S12A). Similarly, for TM scores the naive design sets peaked at around 0.7 for trR and 0.8 for AF, and the native-like design sets reached TM scores of ∼0.8 for trR and ∼0.9 for AF (SI Appendix, Fig. S12B).
Despite the lack of clear preference at the level of individual design metrics, we noticed that comparing the naive- and the native-like derived sequences based on two metrics shows population differences (Fig. 4). We compare the AF predicted local distance difference test (plDDT) versus the TM score between the TopoBuilder (TB) and the trR models [TM-score(TB,trR)] or AF models [TM-score(TB,AF)].
To quantify the upper right quadrant (double positive) population, we set the plDDT threshold to a minimum of 70 and adjusted the TM-score thresholds to 0.6. Analyzing the population difference of the double positives, i.e., sequences with plDDT >70 and TM-score(TB,trR) >0.6, shows an enrichment of the native-like designs ranging from 1.9× in the Rossmann-fold designs to 92× in the ubiquitin-like designs (Fig. 4A). Similarly, we then compared the plDDT against the TM score between the TB and the AF models [TM-score(TB,AF)] and observed similar enrichments ranging from 1.2× for the Top7 designs to 9.7× for the jellyroll designs (Fig. 4B). Lastly, we compared the TM-score(TB,AF) versus the TM-score(TB,trR) to assess the agreement between the two prediction methods, fixing the TM-score(TB,AF) and TM-score(TB,trR) thresholds to 0.6. We saw a clear enhancement of the native-derived double positives ranging from 1.3× for the Rossmann designs to 12× for the Ig-like designs (Fig. 4C). Importantly, we observed that the ζ-parameter seems to be important to improve designability while the dz- and ɛ-parameters are more fold dependent (SI Appendix, Figs. S5–S7). For example, β-sandwich architectures such as the Ig-like fold and the jellyroll tend to have higher enrichment ratios of TM-score(TB,trR) versus TM-score(TB,AF) when all three parameters (ζ + dz + ɛ) are used (SI Appendix, Fig. S7).
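The double-positive enrichment can be made concrete with a short sketch. The thresholds (plDDT >70, TM score >0.6) are those stated above; the function names and the toy score pairs are hypothetical, and we assume the enrichment is the ratio of double-positive fractions between the two design sets, which equals the ratio of counts when both sets have the same size.

```python
def double_positive_fraction(designs, plddt_min=70.0, tm_min=0.6):
    """Fraction of (plDDT, TM-score) pairs in the upper right quadrant."""
    hits = sum(1 for plddt, tm in designs if plddt > plddt_min and tm > tm_min)
    return hits / len(designs)


def enrichment(native_like, naive, plddt_min=70.0, tm_min=0.6):
    """Assumed definition: ratio of double-positive fractions."""
    f_native = double_positive_fraction(native_like, plddt_min, tm_min)
    f_naive = double_positive_fraction(naive, plddt_min, tm_min)
    if f_naive == 0.0:
        return float("inf")  # no naive design passes both thresholds
    return f_native / f_naive


# Toy (plDDT, TM-score) pairs for illustration only, not data from this work.
naive = [(65, 0.5), (72, 0.65), (80, 0.4), (60, 0.7)]
native_like = [(75, 0.7), (82, 0.8), (68, 0.9), (90, 0.65)]
print(enrichment(native_like, naive))  # 3 of 4 versus 1 of 4 pass
```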
The prediction metrics obtained from AF tend to generally be stricter with fewer designs passing the required threshold compared to those of trRosetta. We speculate that this could either be due to AF’s better performance and sensitivity to small changes that would render a sequence nonviable, or because AF was run in single-sequence mode with only a single model prediction instead of all the generated solutions.
To further assess the increased performance induced by the corrections, we projected the three score pairs: 1) plDDT and TM-score(TB,trR); 2) plDDT and TM-score(TB,AF); and 3) TM-score(TB,trR) and TM-score(TB,AF) onto their respective diagonal and computed the receiver operating characteristic (ROC) curve and the area under the curve (AUC) (SI Appendix, Fig. S13 A–C). The ROC–AUC indicates the degree of separation between the naive- and native-derived projected distributions independent of an arbitrarily set threshold. Most ROC–AUC values lie between 0.63 and 0.7 across the three different score pairs, additionally showing that our corrections improved the de novo design of proteins (SI Appendix, Fig. S13 A–C).
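A minimal sketch of this threshold-free comparison follows, under stated assumptions: both scores are first brought to a common 0 to 1 scale (e.g., plDDT divided by 100), the projection onto the diagonal is the scalar (x + y)/√2, and the AUC is computed with the rank-based (Mann–Whitney) formulation rather than by tracing the ROC curve explicitly. Function names and toy values are hypothetical.

```python
import math


def project_on_diagonal(pairs):
    """Scalar projection of (x, y) score pairs onto the unit diagonal (1, 1)/sqrt(2)."""
    return [(x + y) / math.sqrt(2.0) for x, y in pairs]


def roc_auc(positives, negatives):
    """Rank-based AUC: probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy pairs (both scores scaled to 0..1), for illustration only.
native_like = [(0.9, 0.8), (0.7, 0.7), (0.6, 0.5)]
naive = [(0.8, 0.5), (0.5, 0.4), (0.4, 0.3)]
auc = roc_auc(project_on_diagonal(native_like), project_on_diagonal(naive))
print(round(auc, 3))
```

An AUC of 0.5 would indicate indistinguishable distributions, and values approaching 1 indicate that native-like designs consistently score higher along the diagonal.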
The superior structural metrics produced by trR and AF for the sequences generated on native-like TB backbones suggest that these sequence improvements arise from more designable backbones with native-like features.
Experimental Validation of Novel Sequences.
We next sought to experimentally test whether the TB designs were folded and stable in solution. We investigated the top TB models by TM-score(TB,trR), obtained the synthetic genes for 54 designs, and expressed and purified them from Escherichia coli (Materials and Methods). From the tested designs, three ubiquitin-like, four Rossmann-like, three Ig-like, one jellyroll type, and two Top7-like fold proteins were expressed in soluble form. A total of two Rossmann designs, one Ig-like fold design, and two Top7 designs (Fig. 5 A and B) had size exclusion chromatography-multiangle light scattering (SEC-MALS) peaks (Fig. 5C) with the apparent molecular weights of monomers or small oligomeric species. The species corresponding to the monomeric or small oligomeric species (dimer or trimer) were examined by circular dichroism (CD) spectroscopy (Fig. 5D). In all cases, the CD spectra were consistent with the respective target structures, with the characteristic profiles of α/β- and mainly β-proteins. The designs were thermostable with melting temperatures above 90 °C (SI Appendix, Fig. S14).
As a retrospective analysis of the experimentally characterized sequences, we performed a deeper analysis of the predictions performed by trR (Fig. 5A) and AF (Fig. 5B). The five designs (Rossmann, Ig-like, and Top7) recapitulated the target structures accurately, strongly indicating that our designs folded in the desired conformation. To evaluate the structural similarity relative to the natural repertoire, we queried the PDB (56) and the AF (57) (Fig. 5E) databases for similar protein structures. While for both natural folds (Rossmann-like and Ig-like) we found first hits at ∼4 Å RMSD, the Top7 designs are far away from any natural protein folds, with the first hit at ∼5.5 Å RMSD.
To reveal potential underlying sequence clusters, we used the found structural matches from the PDB and AF databases and performed pairwise alignments. We then calculated the Blosum62 distances for each alignment (i.e., the sum of each individual Blosum62 score), and clustered them hierarchically (SI Appendix, Figs. S15A and S16A). Similarly, to categorize the conformations we calculated pairwise RMSDs followed by a hierarchical clustering (SI Appendix, Figs. S15B and S16B). Our designs are generally well integrated within the hierarchical cluster trees showing native compatibility. To search for close members in sequence and structure space jointly, we gathered the sequence and structure features and projected the data into two dimensions through a principal component analysis (PCA) (SI Appendix, Figs. S15C and S16C). We observed that our designs are close to native clusters, further indicating their sequence and structure nativeness. For the native folds, the matches are of the same fold family. Interestingly, des_rssmnn_113 has a de novo designed Rossmann fold (PDB 2LV8) and natural Rossmann domains (PDB 1MZP and 4IZ6) as cluster members; hence, the design likely incorporated general native sequence and structure features. The des_tr7_30 based on the Top7 de novo designed fold has native cluster members that structurally fit well, but have different connectivities (e.g., PDB 6NR1, 4QTP, or 4QDJ).
We further investigated whether the five successful native-like designs are structurally similar to any of the naive-derived designs. To do so, we used MASTER to search the naive pool with the native-like designs and found that all native-like–derived designs differ substantially from the models of the naive counterparts with RMSDs ranging between 4 and 6 Å depending on the fold (SI Appendix, Fig. S17). The only exception was the Top7 fold where we did not observe significant differences, which is likely related to being a de novo fold composed of highly idealized structural elements.
Taken together, the data indicate that five designs across three different folds designed using native-like templates adopted stable monomeric or dimeric states with the expected secondary structure content and accurate structural predictions by AF.
Discussion
The TopoBuilder de novo design method enables the generation of artificial proteins from a minimal string description (form). The form description drafts the overall target topology and enables a fast and systematic fold-space exploration (38, 58). Combined with the TopoBuilder de novo design framework, virtually any protein form description can be constructed and designed.
Our computational and experimental assessments show that geometrical corrections inferred from native structural submotifs that compose the folds provide enough information to improve the designability of protein backbones (33). When analyzing the designed sequences with state-of-the-art structure prediction tools, we identified a larger fraction of successfully recovered structures from sequences derived from corrected (native-like) backbones than those from naive backbones. The experimental characterization of multiple designs shows that the TopoBuilder de novo design framework generates realistic designs that adopt the target fold and are thermodynamically stable.
Ultimately, our analysis shows that the current scoring functions and fragment assembly methods are insufficient for the generation of designable backbones without the guidance of natively arranged SSEs. Many of the de novo protein design rules rely on the generation of structured loops to guide the SSEs’ placements. Here, we present an alternative and complementary solution that is fully automated upon the definition of the length of the SSE. Instead of focusing on structured loops, we optimize and correct the global placements of SSEs and thereby implicitly guide the loop geometries.
Our strategy should further make de novo design accessible to nonexperts and improve and streamline future protein design efforts. The insights we gained from the parameters for a variety of complex fold examples can be harnessed to support the future discovery and understanding of protein architectural principles. Our work also opens possibilities for computational protein designers who may want to design for function, such as the scaffolding of functional proteins via incorporating known or predicted functional sites or protein assemblies with de novo designed domains.
Materials and Methods
Computation of Number of Architectures and Topologies from Forms.
To approximate the number of architectures and topologies from a form description (Fig. 1C), we treated all SSE elements as being of the same type. Thus, if a topology consists of five SSE elements (n = 5), we did not distinguish between architectural and topological variations in the number of helical (H) and strand (E) elements. We related the layered architectures to free polyominoes without holes (to discard underpacked architectures from the count), which can be found under A000104 (https://oeis.org/A000104). This enabled a simple and fast lookup of the number of architectures for a form with n SSEs. The total number of topologies for one architecture can be computed through n!.
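As a worked illustration of this counting, the sketch below hard-codes the first terms of OEIS A000104 and multiplies the number of architectures by n!. Taking the product as the total number of topologies over all architectures is our reading of the two statements above; the function names are hypothetical.

```python
import math

# First terms of OEIS A000104 (free polyominoes without holes), indexed by the
# number of cells n = 0..10. Each cell stands for one SSE in a layered form.
A000104 = [1, 1, 1, 2, 5, 12, 35, 107, 363, 1248, 4460]


def n_architectures(n: int) -> int:
    """Number of hole-free layered arrangements of n SSEs (table lookup)."""
    return A000104[n]


def n_topologies(n: int) -> int:
    """Assumed total: architectures times the n! SSE orderings per architecture."""
    return n_architectures(n) * math.factorial(n)


for n in range(3, 9):
    print(n, n_architectures(n), n_topologies(n))
```

For n = 5 this gives 12 architectures and 12 × 120 = 1,440 topologies.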
Topological Refiner.
Structural subunits within a particular protein sketch are queried using MASTER against a database of native proteins. The database consisted of structures of 70 to 250 residues. To speed up the matching procedure, we only searched over fold-relevant structures by filtering the database-based SSE content and structural features, e.g., for Ig-like and jellyroll folds we only quired β-sandwich architectures, and for ubiquitin-like, Rossmann-like, and Top7 folds, we quired two or three layer α/β architectures. The RMSD thresholds for the selection of matches ranged from 2 to 3 Å. We processed each MASTER match by fitting a vector along each SSE (SI Appendix, Fig. S2 A–C) via performing a PCA (for more details see ref. 37) over all Cα atoms within the SSE. Naturally, the first (major) eigenvector returned by the PCA points along the SSE length.
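The vector fit can be sketched with NumPy as follows. This is an illustration rather than the code used in this work: the function name is hypothetical, the PCA is carried out through a singular value decomposition of the centered Cα coordinates, and orienting the axis from the first to the last residue is our own convention, since the sign of a principal axis is otherwise arbitrary.

```python
import numpy as np


def sse_axis(ca_coords):
    """Return (center, unit axis) of an SSE from its Cα coordinates via PCA."""
    coords = np.asarray(ca_coords, dtype=float)
    center = coords.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(coords - center, full_matrices=False)
    axis = vt[0]
    # Assumed convention: point from the first toward the last residue.
    if np.dot(axis, coords[-1] - coords[0]) < 0:
        axis = -axis
    return center, axis


# Toy strand: eight Cα atoms along y with a small zigzag in x (illustrative values).
strand = [(0.3 * (-1) ** i, 3.3 * i, 0.0) for i in range(8)]
center, axis = sse_axis(strand)
print(center, axis)
```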
For each full layer (including all SSEs) we compute the first three eigenvectors (SI Appendix, Fig. S2D). This results in a local coordinate system for the layer in which the first eigenvector lies along the y axis (along the length of the SSEs), the second eigenvector along the x axis (toward the side), and the third eigenvector along the z axis. The first two eigenvectors define the 1–2 (major-side) plane. The 1–3 (major-perpendicular) plane, spanned by the first and third eigenvectors, splits the layer in half along the length of the SSEs. Lastly, the 2–3 (perpendicular-side) plane, spanned by the second and third eigenvectors, halves the layer roughly through the center of each SSE. Having abstracted from atoms to simple geometric objects such as vectors and planes, multiple parameters can be efficiently computed (Fig. 2A). Considering two adjacent layers, one can compute several geometric features. Here, we use the ζ-angles, which are the angles between the first SSE eigenvectors and the layer (1–2) plane. The ɛ-angles, which are the shear angles between the two layers, can be calculated as the mean angle of all first SSE eigenvectors with the corresponding 1–3 eigenplane. The interlayer distance dz can be computed as the mean across all SSE center distances to adjacent layers. Lastly, the shear distances are calculated as the distances between the 2–3 planes and the SSE centers.
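The plane-based parameters reduce to two primitives: the angle between a vector and a plane, and the distance from a point to a plane. A minimal NumPy sketch follows, with hypothetical function names; each eigenplane is represented by the layer center and its normal, i.e., the eigenvector that does not span the plane.

```python
import numpy as np


def layer_frame(ca_coords):
    """PCA frame of a layer: (center, e1, e2, e3), sorted by decreasing variance."""
    coords = np.asarray(ca_coords, dtype=float)
    center = coords.mean(axis=0)
    _, eigvecs = np.linalg.eigh(np.cov(coords - center, rowvar=False))
    # eigh sorts eigenvalues in ascending order, so reverse the columns.
    return center, eigvecs[:, 2], eigvecs[:, 1], eigvecs[:, 0]


def angle_to_plane(vec, normal):
    """Angle in degrees between a vector and the plane with the given normal."""
    v = np.asarray(vec, dtype=float)
    n = np.asarray(normal, dtype=float)
    cos_to_normal = abs(np.dot(v, n)) / (np.linalg.norm(v) * np.linalg.norm(n))
    return float(np.degrees(np.arcsin(np.clip(cos_to_normal, 0.0, 1.0))))


def distance_to_plane(point, plane_point, normal):
    """Unsigned distance from a point to the plane through plane_point."""
    n = np.asarray(normal, dtype=float)
    offset = np.asarray(point, dtype=float) - np.asarray(plane_point, dtype=float)
    return float(abs(np.dot(offset, n)) / np.linalg.norm(n))
```

Under these conventions, a ζ-angle would be `angle_to_plane(sse_vector, e3)`, an ɛ-angle contribution `angle_to_plane(sse_vector, e2)`, and a shear distance `distance_to_plane(sse_center, center, e1)`. Measuring the interlayer distance as the distance of each SSE center to the 1–2 plane of the adjacent layer is one possible reading of the text and is an assumption here.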
Loop Assembler.
To connect the SSEs within the sketches, we developed a workflow to harness structural features from the loops that connect secondary structures in native proteins (SI Appendix, Fig. S4). For each gap, we performed an independent MASTER search with the two flanking SSEs, so that the algorithm iteratively matched pairs of consecutive SSEs against a database of protein structures. For each gap, matches with an RMSD smaller than 3.2 Å were kept and clustered based on their loop lengths (the maximum length allowed was seven residues), and the most populated cluster lengths were selected. Subsequently, the selected loops were filtered with respect to their ABEGO torsion profiles (59). Loops sharing the same ABEGO torsion bin at every residue were treated as duplicates and removed from the set, leaving a single loop per ABEGO profile. Structural fragments of sizes 3 and 9 (3mers and 9mers) were generated for the loops and SSE alignment regions in agreement with the ABEGO distributions observed in the loop matches. These fragments were then used in subsequent fragment assembly steps to generate fully connected folded structures. With this strategy we avoided explicit loop building and closure sampling to add the loops to the sketch, as this requires computationally expensive structural sampling without the guarantee of closing the gap effectively.
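The filtering logic for one gap can be sketched as follows. The `LoopMatch` record, the `top_lengths` parameter, and the choice of keeping the lowest-RMSD representative per ABEGO profile are assumptions of this sketch; only the 3.2 Å cutoff and the seven-residue maximum come from the text.

```python
from collections import Counter, namedtuple

# Hypothetical record for one MASTER match of a gap:
# rmsd in angstroms, abego as a string with one letter per loop residue.
LoopMatch = namedtuple("LoopMatch", ["rmsd", "abego"])


def select_loops(matches, rmsd_cutoff=3.2, max_len=7, top_lengths=1):
    """Keep one loop per ABEGO profile among the most populated loop lengths."""
    kept = [m for m in matches
            if m.rmsd < rmsd_cutoff and len(m.abego) <= max_len]
    if not kept:
        return []
    # Cluster by loop length and pick the most populated lengths.
    length_counts = Counter(len(m.abego) for m in kept)
    best_lengths = {length for length, _ in length_counts.most_common(top_lengths)}
    # Deduplicate by ABEGO profile; sorting first keeps the lowest RMSD.
    unique = {}
    for match in sorted(kept, key=lambda m: m.rmsd):
        if len(match.abego) in best_lengths and match.abego not in unique:
            unique[match.abego] = match
    return list(unique.values())
```

The surviving ABEGO profiles would then guide fragment picking for the loop and its flanking SSE residues.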
Folding of Polyvaline Backbones from Sketches.
Structural backbones were generated using the Rosetta FFD (NubInitioMover) (47) fragment assembly folding simulations. The 3mer and 9mer structural fragments derived from the loop assembler were used for the folding simulations, introducing local native-like interactions and structural patterns in the loop regions. To control the global conformation, Cα–Cα distance restraints between all SSEs of the sketch were used during the folding trajectories. We modified the energy function during the four stages of the folding simulations by including short- and long-range hydrogen bonding terms and SSE formation–enhancing terms to yield compact backbones with favorable nonlocal interactions and global tertiary geometries. We also imposed secondary structure assignments on each residue, derived from the secondary structure definition of the sketch, to bias the simulations.
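To make the restraint setup concrete, the sketch below lists Cα–Cα target distances between residues that belong to different SSEs of a sketch. The function name, the half-open index ranges, and the tuple output are hypothetical; the text does not specify the functional form of the restraints or how they are passed to Rosetta, so neither is shown.

```python
import numpy as np


def sse_distance_restraints(ca_coords, sse_ranges):
    """List (i, j, distance) for C-alpha pairs lying in different SSEs.

    ca_coords: (N, 3) array-like of sketch C-alpha coordinates.
    sse_ranges: list of (start, end) residue index ranges, 0-based, end exclusive.
    """
    coords = np.asarray(ca_coords, dtype=float)
    restraints = []
    for a, (start_a, end_a) in enumerate(sse_ranges):
        # Only pair SSE a with later SSEs so each pair is listed once.
        for start_b, end_b in sse_ranges[a + 1:]:
            for i in range(start_a, end_a):
                for j in range(start_b, end_b):
                    distance = float(np.linalg.norm(coords[i] - coords[j]))
                    restraints.append((i, j, distance))
    return restraints
```

Loop residues are left out of `sse_ranges`, so they stay unrestrained and are shaped by the fragments alone.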
Sequence Design and Structural Relaxation of Backbones.
We used the Rosetta software to design sequences for each of the generated polyvaline backbones followed by structural relaxations (44). The sequence design operations at each residue position were restricted through 1) SSE propensity calculated from the sketch; 2) “layer” assignment calculated from the backbone conformations where the “core,” “boundary,” and “surface” layers are defined through the number of Cβ neighbors; and 3) an AA sequence profile derived from the structural fragments to upweight frequent AAs. The structural relaxations were performed under long-range Cα–Cα distance restraints derived from the backbone conformations. Additionally, we added a topological bonus term to the energy function to favor conformational changes with correct strand and helix pairings and helix-sheet packing. The topological restraints were extracted from the sketch. A total of 200 polyvaline backbones were generated and five sequences were designed for each backbone. This led to a total of 1,000 different sequences per sketch.
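The layer assignment in point 2 can be illustrated with a simple neighbor count. The radius and both thresholds below are hypothetical placeholder values chosen for this sketch; they are not the values used in the study or in Rosetta.

```python
import numpy as np


def assign_layers(cb_coords, radius=10.0, core_min=20, surface_max=12):
    """Label each residue as core, boundary, or surface by C-beta neighbor count.

    radius, core_min, and surface_max are hypothetical illustration values.
    """
    coords = np.asarray(cb_coords, dtype=float)
    # Pairwise C-beta distance matrix of shape (N, N).
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Count neighbors within the radius, excluding the residue itself.
    counts = (dists < radius).sum(axis=1) - 1
    layers = []
    for count in counts:
        if count >= core_min:
            layers.append("core")
        elif count <= surface_max:
            layers.append("surface")
        else:
            layers.append("boundary")
    return layers
```

In a design run, such labels would typically restrict the allowed amino acids per position, for example hydrophobic residues in the core and polar residues on the surface.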
Structure Predictions Using trRosetta and AlphaFold.
We used the trR (54) and AF (55) deep neural networks to predict structural models from the designed sequences. We predicted structural models for all sequences (1,000 per sketch) using trR and AF in single-sequence mode, i.e., omitting the time-consuming step of generating MSAs. Additionally, we only generated a single AF model (instead of five), making the computation five times faster. All inference calculations were parallelized onto 1,000 central processing unit (CPU) cores.
We aligned each trR and AF model onto the respective TB model using the TM-align algorithm (60). The algorithm iteratively optimizes the superposition of segments with similar local structures. The superposition between the two structural models is evaluated through the TM-score, a measure of the distance between Cα atoms of aligned residues in target and template, normalized by protein length. TM-align returns the TM-score and a best-fit RMSD.
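For reference, the TM-score of a single, fixed superposition can be written in a few lines using the standard length-dependent scale d0 = 1.24 (L − 15)^(1/3) − 1.8 Å. The sketch below is a simplification: TM-align additionally searches over alignments and superpositions to maximize this value, and it treats very short chains specially, which is not reproduced here. The function names are hypothetical.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (angstroms) of the TM-score."""
    if target_length < 19:
        # Simplification of this sketch: d0 is not positive below 19 residues.
        raise ValueError("this sketch only handles chains of 19 or more residues")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: C-alpha distances (angstroms) of the aligned residue pairs.
    target_length: length of the structure used for normalization.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length
```

Identical structures give a score of 1, and unaligned residues lower the score because the sum runs only over aligned pairs while the normalization uses the full target length.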
Protein Expression and Purification.
The 54 best designs by TM-score(TB,trR) were selected for experimental validation. DNA sequences of the designs were purchased from Twist Bioscience. For bacterial expression, the DNA fragments were cloned via Gibson cloning into a pET11b vector followed by a terminal His-tag and transformed into E. coli BL21(DE3). Expression was conducted in Terrific Broth supplemented with ampicillin (100 μg/mL). Cultures were inoculated at an optical density at 600 nm (OD600) of 0.1 from an overnight culture and incubated in a shaker at 37 °C and 220 rpm. After reaching an OD600 of 0.6, expression was induced by the addition of 0.4 mM isopropyl β-D-1-thiogalactopyranoside (IPTG) and cells were further incubated overnight at 20 °C. Cells were harvested by centrifugation and pellets were resuspended in lysis buffer (50 mM Tris, pH 7.5, 500 mM NaCl, 5% glycerol, 1 mg/mL lysozyme, 1 mM phenylmethylsulfonyl fluoride [PMSF], 4 μg/mL DNase). Resuspended cells were sonicated and clarified by centrifugation. Ni-NTA purification of sterile-filtered (0.22 μm) supernatant was performed using a 5-mL HisTrap FF column on an ÄKTA pure system (GE Healthcare). Bound proteins were eluted using an imidazole concentration of 500 mM. Concentrated proteins were further purified by size exclusion chromatography on a HiLoad 16/600 Superdex 75 pg column (GE Healthcare) using phosphate-buffered saline (PBS) (pH 7.4) as mobile phase.
Circular Dichroism Spectroscopy.
Far-UV circular dichroism spectra were collected between wavelengths of 190 and 250 nm on a Jasco J-815 circular dichroism spectrometer in a 1-mm path-length quartz cuvette. Proteins were diluted in 10 mM PBS at concentrations between 20 and 40 μM. Wavelength spectra were averaged from two scans with a scanning speed of 20 nm/min and a response time of 0.125 s. The thermal denaturation curves were collected by measuring the change in ellipticity at 220 nm from 20 to 90 °C with 2 or 5 °C increments.
SEC-MALS.
Multiangle light scattering was used to assess the monodispersity and molecular weight of the proteins. Samples containing 80 to 100 μg of protein in PBS buffer (pH 7.4) were injected into a Superdex 75 10/300 GL column (GE Healthcare) using a high-performance liquid chromatography (HPLC) system (Ultimate 3000, Thermo Scientific) at a flow rate of 0.5 mL/min coupled in-line to a multiangle light-scattering device (miniDAWN TREOS, Wyatt). Static light-scattering signal was recorded from three different scattering angles. The scatter data were analyzed by ASTRA software (version 6.1, Wyatt).
Acknowledgments
We thank the members of the Protein Design and Immunoengineering group (Laboratory of Protein Design & Immunoengineering [LPDI], École polytechnique fédérale de Lausanne [EPFL], Lausanne) for helpful discussions. We thank Arne Schneuing for critical reading of the manuscript. We thank EPFL's Scientific IT and Application Support Center for their support on the computational infrastructure. We thank the Protein Production and Structure Core facility at EPFL for their support on the protein biophysical characterization experiments. B.E.C. is a grantee of the European Research Council (starting grant 716058), the Swiss National Science Foundation, and the Biltema Foundation. Parts of the computational simulations were performed at the Swiss National Supercomputing Centre (through a grant obtained by B.E.C.). Z.H. and S.R. are supported by a grant from the National Center of Competence in Research in Chemical Biology.
Footnotes
The authors declare no competing interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2206111119/-/DCSupplemental.
Data, Materials, and Software Availability
The core code for the TopoBuilder method can be found at https://github.com/LPDI-EPFL/topobuilder (61). The code for the design pipeline and datasets can be found at https://github.com/LPDI-EPFL/tbpipeline (62). Detailed documentation can be found at https://topobuilder.readthedocs.io/en/master/ (63).
References
- 1. Huang P.-S., Boyken S. E., Baker D., The coming of age of de novo protein design. Nature 537, 320–327 (2016).
- 2. LaBean T. H., Kauffman S. A., Butt T. R., Libraries of random-sequence polypeptides produced with high yield as carboxy-terminal fusions with ubiquitin. Mol. Divers. 1, 29–38 (1995).
- 3. Davidson A. R., Sauer R. T., Folded proteins occur frequently in libraries of random amino acid sequences. Proc. Natl. Acad. Sci. U.S.A. 91, 2146–2150 (1994).
- 4. Rocklin G. J., et al., Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 357, 168–175 (2017).
- 5. Chevalier A., et al., Massively parallel de novo protein design for targeted therapeutics. Nature 550, 74–79 (2017).
- 6. Joh N. H., et al., De novo design of a transmembrane Zn2+-transporting four-helix bundle. Science 346, 1520–1524 (2014).
- 7. Thomson A. R., et al., Computational design of water-soluble α-helical barrels. Science 346, 485–488 (2014).
- 8. Bale J. B., et al., Accurate design of megadalton-scale two-component icosahedral protein complexes. Science 353, 389–394 (2016).
- 9. Jacobs T. M., et al., Design of structurally distinct proteins using strategies inspired by evolution. Science 352, 687–690 (2016).
- 10. Govindarajan S., Goldstein R. A., Why are some protein structures so common? Proc. Natl. Acad. Sci. U.S.A. 93, 3341–3345 (1996).
- 11. Li H., Helling R., Tang C., Wingreen N., Emergence of preferred structures in a simple model of protein folding. Science 273, 666–669 (1996).
- 12. Koehl P., Levitt M., De novo protein design. I. In search of stability and specificity. J. Mol. Biol. 293, 1161–1181 (1999).
- 13. Kuhlman B., Baker D., Native protein sequences are close to optimal for their structures. Proc. Natl. Acad. Sci. U.S.A. 97, 10383–10388 (2000).
- 14. Miller J., Zeng C., Wingreen N. S., Tang C., Emergence of highly designable protein-backbone conformations in an off-lattice model. Proteins 47, 506–512 (2002).
- 15. Zhang J., Zheng F., Grigoryan G., Design and designability of protein-based assemblies. Curr. Opin. Struct. Biol. 27, 79–86 (2014).
- 16. Helling R., et al., The designability of protein structures. J. Mol. Graph. Model. 19, 157–167 (2001).
- 17. Grigoryan G., DeGrado W. F., Probing designability via a generalized model of helical bundle geometry. J. Mol. Biol. 405, 1079–1100 (2011).
- 18. England J. L., Shakhnovich E. I., Structural determinant of protein designability. Phys. Rev. Lett. 90, 218101 (2003).
- 19. Pan F., Zhang Y., Liu X., Zhang J., Estimating the designability of protein structures. bioRxiv [Preprint] (2021). 10.1101/2021.11.03.467111 (Accessed November 4, 2021).
- 20. Simons K. T., Bonneau R., Ruczinski I., Baker D., Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins 37 (suppl. 3), 171–176 (1999).
- 21. Leelananda S. P., Jernigan R. L., Kloczkowski A., Predicting designability of small proteins from graph features of contact maps. J. Comput. Biol. 23, 400–411 (2016).
- 22. Koga N., et al., Principles for designing ideal protein structures. Nature 491, 222–227 (2012).
- 23. Minami S., et al., Exploration of novel αβ-protein folds through de novo design. bioRxiv [Preprint] (2021). 10.1101/2021.08.06.455475.
- 24. Huang P.-S., et al., De novo design of a four-fold symmetric TIM-barrel protein with atomic-level accuracy. Nat. Chem. Biol. 12, 29–34 (2016).
- 25. Vorobieva A. A., et al., De novo design of transmembrane β barrels. Science 371, eabc8182 (2021).
- 26. Dou J., et al., De novo design of a fluorescence-activating β-barrel. Nature 561, 485–491 (2018).
- 27. Marcos E., et al., De novo design of a non-local β-sheet protein with high stability and accuracy. Nat. Struct. Mol. Biol. 25, 1028–1034 (2018).
- 28. Chidyausiku T. M., et al., De novo design of immunoglobulin-like domains. bioRxiv [Preprint] (2021). 10.1101/2021.12.20.472081.
- 29. Huang P.-S., et al., High thermodynamic stability of parametrically designed helical bundles. Science 346, 481–485 (2014).
- 30. Offer G., Hicks M. R., Woolfson D. N., Generalized Crick equations for modeling noncanonical coiled coils. J. Struct. Biol. 137, 41–53 (2002).
- 31. Novotný J., Bruccoleri R. E., Newell J., Twisted hyperboloid (Strophoid) as a model of β-barrels in proteins. J. Mol. Biol. 177, 567–573 (1984).
- 32. Sesterhenn F., et al., De novo protein design enables the precise induction of RSV-neutralizing antibodies. Science 368, eaay5051 (2020).
- 33. Yang C., et al., Bottom-up de novo design of functional proteins with complex structural features. Nat. Chem. Biol. 17, 492–500 (2021).
- 34. Levitt M., Growth of novel protein structural data. Proc. Natl. Acad. Sci. U.S.A. 104, 3183–3188 (2007).
- 35. Chothia C., Gough J., Genomic and structural aspects of protein evolution. Biochem. J. 419, 15–28 (2009).
- 36. Chothia C., Gough J., Vogel C., Teichmann S. A., Evolution of the protein repertoire. Science 300, 1701–1703 (2003).
- 37. Taylor W. R., A ‘periodic table’ for protein structures. Nature 416, 657–660 (2002).
- 38. Taylor W. R., Chelliah V., Hollup S. M., MacDonald J. T., Jonassen I., Probing the ‘dark matter’ of protein fold space. Structure 17, 1244–1252 (2009).
- 39. Chothia C., Finkelstein A. V., The classification and origins of protein folding patterns. Annu. Rev. Biochem. 59, 1007–1039 (1990).
- 40. Cohen F. E., Sternberg M. J. E., Taylor W. R., Analysis and prediction of protein β-sheet structures by a combinatorial approach. Nature 285, 378–382 (1980).
- 41. Zhou J., Grigoryan G., Rapid search for tertiary fragments reveals protein sequence-structure relationships. Protein Sci. 24, 508–524 (2015).
- 42. Zhou J., Grigoryan G., A C++ library for protein sub-structure search. bioRxiv [Preprint] (2020). 10.1101/2020.04.26.062612.
- 43. Taylor W. R., et al., Prediction of protein structure from ideal forms. Proteins 70, 1610–1619 (2008).
- 44. Leman J. K., et al., Macromolecular modeling and design in Rosetta: Recent methods and frameworks. Nat. Methods 17, 665–680 (2020).
- 45. Bonet J., et al., Rosetta FunFolDes - A general framework for the computational design of functional proteins. bioRxiv [Preprint] (2018). 10.1101/378976.
- 46. Correia B. E., et al., Proof of principle for epitope-focused vaccine design. Nature 507, 201–206 (2014).
- 47. Bonet J., et al., Rosetta FunFolDes - A general framework for the computational design of functional proteins. PLOS Comput. Biol. 14, e1006623 (2018).
- 48. Huang P.-S., et al., RosettaRemodel: A generalized framework for flexible backbone protein design. PLoS One 6, e24109 (2011).
- 49. Bonet J., Harteveld Z., Sesterhenn F., Scheck A., Correia B. E., rstoolbox - A Python library for large-scale analysis of computational protein design data and structural bioinformatics. BMC Bioinformatics 20, 240 (2019).
- 50. Bhardwaj G., et al., Accurate de novo design of hyperstable constrained peptides. Nature 538, 329–335 (2016).
- 51. Koga R., et al., Robust folding of a de novo designed ideal protein even with most of the core mutated to valine. Proc. Natl. Acad. Sci. U.S.A. 117, 31149–31156 (2020).
- 52. Koga N., et al., Role of backbone strain in de novo design of complex α/β protein structures. Nat. Commun. 12, 3921 (2021).
- 53. Altschul S. F., Gish W., Miller W., Myers E. W., Lipman D. J., Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
- 54. Yang J., et al., Improved protein structure prediction using predicted interresidue orientations. Proc. Natl. Acad. Sci. U.S.A. 117, 1496–1503 (2020).
- 55. Jumper J., et al., Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
- 56. Berman H. M., et al., The protein data bank. Nucleic Acids Res. 28, 235–242 (2000).
- 57. Varadi M., et al., AlphaFold Protein Structure Database: Massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50(D1), D439–D444 (2022).
- 58. Taylor W. R., Exploring protein fold space. Biomolecules 10, 2 (2020).
- 59. Lin Y.-R., et al., Control over overall shape and size in de novo designed proteins. Proc. Natl. Acad. Sci. U.S.A. 112, E5478–E5485 (2015).
- 60. Zhang Y., Skolnick J., TM-align: A protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
- 61. Harteveld Z., et al., A generic framework for hierarchical de novo protein design. GitHub. https://github.com/LPDI-EPFL/topobuilder. Deposited 23 September 2022.
- 62. Harteveld Z., et al., A generic framework for hierarchical de novo protein design. GitHub. https://github.com/LPDI-EPFL/tbpipeline. Deposited 23 September 2022.
- 63. Harteveld Z., et al., A generic framework for hierarchical de novo protein design. TopoBuilder. https://topobuilder.readthedocs.io/en/master/. Deposited 23 September 2022.