SUMMARY:
We report two generative deep learning models that predict amino acid sequences and 3D protein structures based on secondary structure design objectives, specified either as overall content or per residue. Both models are robust to imperfect inputs and offer de novo design capacity, as they can discover new protein sequences not yet found in natural systems. The residue-level secondary structure design model generally yields higher accuracy and more diverse sequences. These findings suggest unexplored opportunities for protein designs and functional outcomes within the vast space of amino acid sequences beyond known proteins. Our models, built on an attention-based diffusion architecture and trained on a dataset extracted from experimentally determined 3D protein structures, offer numerous downstream applications in the conditional generative design of various biological or engineering systems. Future work may include additional conditioning objectives and an exploration of functional properties of the generated proteins beyond structural objectives.
Keywords: Protein design, deep learning, diffusion model, attention models, transformer
eTOC blurb
Designing proteins beyond naturally existing ones holds great potential for numerous scientific and engineering applications; however, to date, it remains prohibitively expensive. Here, we leverage attention-based diffusion models to efficiently generate novel protein sequences with prescribed secondary structures. Our models handle varied design objectives robustly and predict de novo protein sequences that have not yet been discovered in nature, opening many avenues for discovering superior protein materials and systems; the approach can be extended beyond structural objectives.
Graphical Abstract

1. Introduction
Proteins are critical biological building blocks that constitute fundamental functions of all life, as well as significant biomaterials emerging from natural evolution, including silks, collagens, complex assemblies such as cells, and tissue assemblies such as skin.1–4 These various structures and their associated outstanding properties depend on the underlying sequences of amino acids (AAs) and the subsequently folded three-dimensional (3D) structures. For millennia, it has been a popular and fruitful approach to take inspiration from nature for how to design materials,2,5 including recent lessons learned from proteins (e.g., designing synthetic materials inspired by nacre6,7 and silks8,9). Furthermore, considering that there exist 20^100 possible AA sequences for a 100-residue protein and that natural evolution has sampled only a small fraction of these, there remains a broad unexplored design space and hence significant potential for discovering de novo proteins with potentially unprecedented properties and functions.10 However, this enormous design space, together with the costs associated with experimental testing, is also why great challenges remain in finding appropriate tools to design de novo protein sequences that yield a set of targeted structural features or properties.8,9,11–13
In the present work, we are particularly focused on the mechanical properties of proteins, for which secondary structure content is key (previous studies have demonstrated that elementary units of secondary structure and their interactions within one chain can play key roles in determining the mechanical properties of protein materials). For instance, alpha-helical proteins14 tend to yield stretchy materials, beta-sheet-rich ones yield rather rigid materials,15–17 and combinations, as in silks, provide simultaneous rigidity, strength, and toughness via a nanocomposite design strategy.15,17,18 Further, using protein molecular building blocks as a means to construct higher-level hierarchical materials offers high degrees of design freedom and flexibility to generate targeted mechanical properties, such as flaw tolerance,19–21 or tunable properties like stiffness or toughness,22–24 among many others. For example, for tissue repair and regenerative medicine, the durability of protein-based implants can be tuned by designing the content and types of the secondary structures.25–31 For silk proteins, by combining sequences with different ratios of beta-sheet, beta-turn and random coil content, the mechanical properties of the resulting materials can be varied over an expanded range.26,32,33 Therefore, the significant role played by secondary structure content in influencing the mechanical properties of protein materials and complex tissue systems suggests that this feature is a valuable target or condition for the de novo design of protein mechanical properties.
The application of machine learning (ML) approaches to protein studies in recent years has provided effective avenues to predict structure and properties/functions based on protein sequences. Taking the folding problem of predicting 3D structures for given sequences as an example, the deep learning-based tool AlphaFold 2 represented a breakthrough in achieving competitive accuracy while bypassing the expensive and time-consuming measurements of conventional experimental methods.34 Using the latest AlphaFold model, the structures of ~200 million proteins, covering the human proteome as well as those of several other organisms, have been predicted, going far beyond the pace of experimental methods.35 Furthermore, end-to-end models based on deep learning (DL) that predict secondary structures and properties for given sequences have also been developed.36,37 For instance, from their primary sequences, the secondary structure types and contents can be predicted with good accuracy using various ML models, including feed-forward neural networks,38 recurrent neural networks,39 deep convolutional networks,40 and transformer-based language models.41,42 The development of these deep learning models greatly reduces the cost of screening large numbers of protein sequences.
In contrast, the inverse design of de novo proteins that satisfy targeted features presents unique challenges and remains largely open for exploration, even with deep learning models. On the one hand, stochastic search algorithms constructed using handcrafted optimization functions and sampling approaches are often adopted for such inverse designs.43,44 For example, a combination of genetic algorithms and deep learning-based predictors has been used to search for protein sequences that yield a desired ratio of secondary structure content.45 However, even with an efficient deep learning-based predictor, the iterative search process can still be time-consuming, and convergence of the iterations does not necessarily correlate with the quality and variety of the discovered sequences.
On the other hand, compared to applications in image generation, generative models have not yet been broadly generalized to protein structures, and their potential for solving such problems remains largely unexplored. Various deep learning methods have been used to generate images and image-like field data, including autoencoders,46,47 generative adversarial nets (GANs)48–50 and transformer-based diffusion models.51,52 For example, diffusion model architectures,51 such as DALL-E 2,53 Imagen54 and Latent/Stable Diffusion,55 have recently produced state-of-the-art performance in text-to-image generation tasks, with an unprecedented degree of photorealism and language understanding. For engineering applications, conditional GAN models have been demonstrated to be capable of generating stress/strain fields from modulus fields56,57 and vice versa;58,59 and progressive transformer diffusion models can learn and predict the various behaviors during dynamic fracture processes.60 In comparison, protein sequences and structures show different formats and features from those of images, graphs, or image-like fields. Attempts to bridge such generative models to protein studies remain few but are growing rapidly. For example, variational autoencoders have been adapted to generate diverse new protein structures without designed conditions;61 equivariant denoising diffusion probabilistic models have been developed to generate proteins following a given topology and constraints;62 and another diffusion model has been applied to construct scaffold structures that support a desired motif in proteins.63 Given these recent successes, it is promising to explore how to leverage these generative models from the image domain and adapt them to handle de novo protein design effectively and efficiently.
In this paper, we propose an attention-based diffusion model for proteins and report a generative deep learning model that predicts AA sequences and 3D protein structures based on secondary structure design objectives (Figure 1). By proposing a U-Net architecture that handles one-dimensional (1D) protein data, we construct a model that takes a conditioning description of the desired secondary structures as input and produces various sequences of AAs from random noise vector sources (akin to the schematic shown in Figure 2a). This is achieved via an attention-based diffusion model that is trained by minimizing the L2 difference between the actual and predicted noise levels (eq. (4)), realizing a step-wise denoising strategy using a U-net architecture with alternating convolutional and attention layers, as depicted in Figure 2b-c.
Figure 1:

Overview of the model developed here. The model takes a conditioning description as input and produces, via a conditional attention-based diffusion model (Figure 2), an amino acid sequence. We use OmegaFold64 and AlphaFold34,65 to then predict 3D structures of the resulting proteins. Two models are trained: first, a model that takes fractional secondary structure content as input and then predicts sequences (model A); second, a model that takes per-residue secondary structure information as input and predicts sequences (model B). The generative model can be used to produce a number of samples that can be analyzed for further down-selection (e.g., we can select samples that best meet the target conditioning input, or that are the most novel, etc.).
Figure 2:

Overview of the attention-based diffusion model. a, Visualization of how noise and conditioning data are transformed into a solution. During training, pairs of conditioning data and the resulting output are used. b, Illustration of the Markov chain of noising (top) and denoising (bottom). c, Depiction of the 1D U-net architecture that translates an input $I_i$ into an output $O_i$ under a condition set $C_i$. The model features 1D convolutional layers, as well as self-/cross-attention layers, as shown on the right.
We then integrate the model with folding prediction methods to determine the 3D structures of the resulting sequences, classify their secondary structures, and compare these outputs with the input conditions. Finally, the designed sequences are checked against known proteins to analyze their novelty. After training the model on a set of Protein Data Bank (PDB) proteins, we demonstrate that the models are able to generate various de novo protein sequences with stable structures that closely follow the given secondary structure conditions, thus bypassing the iterative search process of previous optimization methods.43–45 Built upon the capability of our model and the known significance of secondary structures for the mechanical properties of proteins, we expect our model to be useful in numerous applications in the conditional generative design of various scientific and engineering protein systems.
2. Results and Discussion
Figure 1 depicts an overview of the model developed here, generating novel protein structures based on conditioning parameters. In this process, the model takes a conditioning description as input and produces, via a conditional attention-based diffusion model (Figure 2), an amino acid sequence that is then used to construct a 3D protein model. We use OmegaFold64 and AlphaFold34,65 to predict 3D structures of the resulting proteins. Two models are developed, trained and applied: first, a model that takes fractional secondary structure content as input and then predicts sequences (model A); second, a model that takes per-residue secondary structure information as input and predicts sequences (model B). In both models, the predictions/sequences are constructed from random signals, under conditioning, by reversing the diffusion process in a step-by-step fashion, as summarized in Figure 2a-b. The deep neural network is tasked with identifying the added noise at each step so that it can be removed successfully. Details of the models and the training procedure are included in the Experimental Procedures section, and Table 1 shows a summary of the eight secondary structure parameters used here, following the Define Secondary Structure of Proteins (DSSP) convention.66 Additional aspects, including an analysis of the distribution of the data in the training set, are included in the Supplementary Material, specifically Figures S1, S2 and S3.
Table 1:
Secondary structure code, following the DSSP convention.66
| DSSP code | Description |
|---|---|
| H | Alpha-helix (AH) |
| E | Extended parallel and/or anti-parallel beta-sheet (BS) conformation |
| T | Hydrogen bonded turns (3, 4 or 5 turn) |
| ~ | Unstructured |
| B | Beta-bridge (single pair beta-sheet hydrogen bond formation) |
| G | 3₁₀ helix |
| I | Pi-helix |
| S | Bend |
Figure 3 shows results for protein generation based on fractional secondary structure content (model A). Figure 3a-f shows a variety of representative cases, including high beta-sheet content (a), a mix of alpha-helix and beta-sheet (b, f), pure alpha-helical content (c), alpha-helical content with significant disorder (d), and a completely disordered protein (e). The left column shows the conditioning vector (of length eight, reflecting the eight types of secondary structure content shown in Table 1) and the resulting amino acid sequences. The columns in the middle depict the resulting protein structures as predicted by OmegaFold, in two renderings. The right column shows a comparison of secondary structure fractions between the input conditions and those reconstructed from the generated sequences.
Figure 3:

Results for protein generation based on fractional secondary structure content (model A). Panels a-f show a variety of representative cases, including high beta-sheet content (a), a mix of alpha-helix and beta-sheet (b, f), pure alpha-helical content (c), alpha-helical content with significant disorder (d), and a completely disordered protein (e). The fractional patterns used here are similar to those observed in known proteins and relate to mechanical properties. The left column shows the conditioning vector (of length eight, reflecting the eight types of secondary structure content shown in Table 1) and the resulting amino acid sequences. The two columns in the middle depict the resulting protein structures as predicted by OmegaFold, in two renderings (middle left: pLDDT score; middle right: secondary structure analysis via PyMol; yellow=beta-sheet, red=helix, green=disordered). The right column shows the comparison of secondary structure fractions between the input conditions and those reconstructed from the generated sequences. Additional designs are depicted in Figure S7.
We now focus on one of the predictions, the beta-sheet structure in Figure 3a, and explore the assembly of this protein into higher-order arrangements. These assemblies are not predicted by our model; rather, we explore whether the predicted beta-strands would assemble into higher-order filamentous structures. Figure 4 shows the analysis of the results from Figure 3a in greater detail. Figure 4a-b shows a comparison of the predictions between OmegaFold and AlphaFold; the results are comparable, indicating that both folding methods yield similar structures. Figure 4c shows the structure prediction of an assembly of three of these beta-sheet building blocks, predicted using AlphaFold-Multimer.67 Figure 4d shows, similarly, the assembly geometry of eight beta-sheets. It is notable that the model arranges the chains into three sub-parts, consisting of two mini beta-barrels and one bivalent assembly in the middle, while for the individual chains, the designed secondary structures are preserved.
Figure 4:

Analysis of the results from Figure 3a in greater detail, exploring how beta-strands – which are known to assemble into higher-scale structural assemblies such as amyloid filaments or other fibrous structures – yield such emergent structures. a-b, Comparison of the predictions between OmegaFold and AlphaFold; the results are comparable. Panel c shows the structure prediction of an assembly of three of these beta-sheet building blocks, predicted using AlphaFold. Panel d shows, similarly, the assembly geometry of eight beta-sheets. It is notable that the model arranges the chains into three sub-parts, consisting of two mini beta-barrels and one bivalent assembly in the middle. The designed protein fragments are preserved well when multiple chains aggregate, which indicates that, in some cases, the current design models may be applied even to multimers, especially beta-sheet-rich protein assemblies. We use the AlphaFold-Multimer model for folding prediction of protein complexes (in c-d).
Next, we analyze the predicted amino acid sequences to assess whether, and to what extent, they represent novel sequences or closely related forms of existing and/or known proteins. This is done using a basic local alignment search tool (BLAST) analysis.68 Table 2 shows the results of the BLAST analysis for the various cases from model A. We find that, generally, the model predicts structures that are similar to existing protein sequences, as can be seen from the BLAST results. However, some generated sequences (as for the 2nd and 3rd cases) do not exist in the PDB-based training set. While there is some novelty in the sequences, and there are measures one could take to drive further variation, we focused on ways to ensure a greater diversity of sequence predictions (strategies to enhance this could be to increase the conditioning dropout probability or to add noise to the conditioning vector during training).
Table 2:
Results of the BLAST68 analysis for the various cases from model A. Generally, the model predicts structures that are similar to existing protein sequences, as can be seen from the BLAST results. However, some generated sequences (as for the 2nd and 3rd cases) do not exist in the PDB-based training set.
| Conditioning | Sequence | Most significant BLAST alignment (among PDB proteins) | Most significant BLAST alignment (beyond PDB proteins) |
|---|---|---|---|
| [0, 0.7, 0.07, 0.1, 0.01, 0.02, 0.01, 0.11] | GFCRCLCRRGICRCICTR | 100% query cover, 94.44% identical with PDB ID 1HVZ | N/A |
| [0.2, 0.2, 0.07, 0.3, 0.01, 0.02, 0.01, 0.11] | IPCACFRNYVPVCGSDGKVYGNPCMLNLAAYVKVTGLKLRFSGRLPVSNREYQ | -- | 94% query cover, 76.00% identical with GenBank: AHW57452.1 |
| [0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] | RDQSLEARLVELETRLSFFEQALTELSEALADARLTGARNAELIRWLLEDL | -- | 100% query cover, 94.12% with NCBI Reference Sequence: WP_011036690.1 |
| [0.5, 0.0, 0.0, 0.2, 0.0, 0.0, 0.1, 0.1] | GSSGSSGVIRQMLQELVNAGCDQEMAGRALKQTGSRSIEAALEYIAKMSGPSSG | 100% query cover, 96.30% identical with PDB ID 2COS | N/A |
| [0.01, 0.1, 0.0, 0.5, 0.0, 0.01, 0.1, 0.5 ] | RQARWELAFDLD | 91% query cover, 90.91% identical with PDB ID 1ID6 | N/A |
| [0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11] | GSSGSSGTGEKPYICIECGKVFVVNSFLARHGKTHTGEKPSGPSSG | 80% query cover, 80.43% identical with PDB ID 2ENE | N/A |
This is accomplished via the second model, Model B, where we use residue-level conditioning. In this scenario, each residue is conditioned based on one of the eight DSSP secondary structure codes. Figure 5 depicts results for protein generation based on residue-level secondary structure content implemented in model B. We consider five cases with different distributions of secondary structure, including predominant beta-sheet (a), a long alpha-helix with a breaker in the center (b), a small alpha-helix (c), a sandwich alpha-helix/beta-sheet structure (a beta-sheet centered between two helical domains) (d), and a partially disordered helical protein (e). The folded results (left column in Figure 5) reveal generally good agreement with the design objectives specified on the right (blue font), and confirm that the model enables us to design specific geometric details and localizations of structure. Even though these proteins are de novo sequences (see BLAST analysis in Table 3), OmegaFold (and AlphaFold) reach relatively high pLDDT scores (denoting a per-residue estimate of prediction confidence in a range from 0 to 100); the scores are typically largest for the alpha-helical structures.
Figure 5:

Results for protein generation based on residue-level secondary structure content (model B). We consider five cases with different distributions of secondary structure, including predominant beta-sheet (a), a long alpha-helix with a breaker in the center (b), a small alpha-helix (c), a sandwich alpha-helix/beta-sheet structure (a beta-sheet centered between two helical domains) (d), and a partially disordered helical protein (e). The right column shows the input secondary structure sequences (in blue), generated amino acid sequences (in black) and secondary structure contents reconstructed from the generated proteins (in red). The left column depicts the resulting protein structures rendered by pLDDT score. The secondary structures of the folded results show good agreement with the design objectives, and the per-residue accuracies (defined in eq. (7)) of these cases are 44% (a), 79% (b), 81% (c), 60% (d) and 73% (e), confirming that the model enables us to design specific geometric details and localizations of structure. Even though these proteins are de novo sequences (see analysis in Table 3), OmegaFold (and AlphaFold) reach relatively high pLDDT scores; typically largest for the alpha-helical structures. Several additional designs are depicted in Figures S4–S6.
Table 3:
Results of the BLAST analysis for the various cases for model B. In the interest of space, the secondary structure design objective is not shown; rather, the case ID (a to e) refers to the cases shown in Figure 5. The analysis shows that the model generates protein structures that reflect the design objectives well (see analysis in Figure 5), and which are also de novo sequences that have little similarity with existing amino acid sequences. The BLAST results indicate similarities of around 50–60% for most cases, with one case reaching 85.71% identity over a 66% query cover. Most other cases have much smaller similarities and smaller query covers.
| Case | Sequence | Most significant BLAST alignment (among PDB proteins) | Most significant BLAST alignment (beyond PDB proteins) |
|---|---|---|---|
| a | ACICANYDYLGRKCVCIRCGTPGLFSVKCYVPRL | -- | 82% query cover, 43.24% identical with GenBank: MBR4744867.1 |
| b | ASTFEEAIEIKRVLQIIGSKLARSGETSPEEIEEILIIFEVIKAVLKA | -- | 60% query cover, 57.58% identical with GenBank: MBR3311276.1 |
| c | GPDAVLFAGFKTGLAKMLKNR | -- | 66% query cover, 85.71% identical with GenBank: MCI9553838.1 |
| d | ASNETKLEEQKQDAIKECESEVTCIKIIKANGEDGIGVILCDSGGDVVADKVKIAQDGDNEGFAKSSQLHARIAELN | -- | 45% query cover, 44.68% identical with NCBI Reference Sequence: WP_257531070.1 |
| e | ASNETKLEEQKQDAIKECESEVTCIKIIKANGEDGIGVILCDSGGDVVADKVKIAQDGDNEGFAKSSQLHARIAELN | -- | 41% query cover, 61.54% identity with GenBank: TBR12876.1 |
Building on these predictions, we now conduct a more detailed analysis of two of the designs. Figure 6 shows a detailed analysis of two designs (Figure 5a and e), including a comparison between OmegaFold and AlphaFold predictions. The indication of the residue numbers (C- and N-terminus, respectively) and the localization of specific secondary structures show excellent agreement with the design objective. The top row in each case depicts the structure color-coded by residue number (rainbow plot), and the bottom row a secondary structure color coding.
Figure 6:

Detailed analysis of two designs (Figure 5a and e), including a comparison between OmegaFold and AlphaFold predictions (the design objective is noted at the top of each subplot). The indication of the residue number (C- and N-terminus, respectively) and the localization of specific secondary structure shows excellent agreement with the design objective. The top row in each case depicts the structure color coded by residue number (rainbow plot), and the bottom row a secondary structure color coding.
Table 3 summarizes the results of the BLAST analysis for the various cases for model B. The analysis shows that the model generates protein structures that reflect the design objectives well (see analysis in Figure 5), and which are also de novo sequences that have little similarity with existing amino acid sequences. We find that the BLAST results indicate similarities of around 50–60% for most cases, with one case reaching 85.71% identity over a 66% query cover. It is further noted that most of the other cases have much smaller similarities and smaller query covers. This result is noteworthy as it provides strong evidence that Model B is capable of discovering new protein designs. The difference in explorative potential is likely due to the way novel sequences are generated, using conditioning dropout. Since the conditioning sequences are short, there is limited variability as the model explores new designs that conform less strongly to the conditioning task.
Several deep learning-based protein design systems have emerged recently.69–71 Complementing these alternative approaches, our model offers a unique design perspective focused on overall secondary structure content (model A) or residue-specific secondary structure (model B). We briefly discuss the differences with regard to other generative models. For example, to generate desired sequences, ProteinMPNN69 requires atomic details of the protein backbone design as input, including distances between Cα-Cα atoms, relative Cα-Cα-Cα frame orientations and rotations, and backbone dihedral angles. In contrast, our model only conditions the secondary structure types of the generated sequences at the global (model A) or residue (model B) level, leaving the detailed atomic coordinate information unspecified. Indeed, for one choice of secondary structure content, in terms of either global fractions or a secondary structure type sequence, there can exist many different backbone atomic structures. Thus, the two are protein design tools working at different levels, with different respective advantages.
Another model is Chroma,70 which adopts a graph neural network representation of proteins and is based on a diffusion process between the protein backbone and the noisy structure of a collapsed polymer system, instead of Gaussian noise. In the RFdiffusion model,71 the folding tool RoseTTAFold72 is fine-tuned as the denoising network. Both models pay particular attention to the backbone structures of the designed protein. Thus, various conditions regarding the atomic structure of the backbone can be implemented for both monomer and protein complex designs, making them general-purpose structure-focused protein design tools. In contrast, our model specifically focuses on the mapping between secondary structure and primary sequence, bypassing the construction of the atomic details of the backbone. The rules of which sequences can be generated are learned implicitly by our model. Correspondingly, users can directly work at the residue level (model B) as well as the monomer level (model A) and skip the detailed choice of conditions at the atomic level. One can also use Model A to obtain a first protein sequence estimate, and then refine the design using Model B by specifying residue-level secondary structure detail.
To test and apply our models, we note that there is a lack of clear/explicit rules on which secondary structure distributions are physically possible and which are not, given the finite set of amino acids as building blocks. Therefore, it remains unclear and challenging to directly construct an exhaustive set of physically possible input conditions to test our models. Instead, in the current work, we focus on testing our models with secondary structure patterns that have been observed in known proteins and relate to mechanical properties. In Figures 3 and 5, we manually constructed the input conditions so that they resemble typical secondary structure patterns in known PDB proteins, especially those known to affect the deformation process and mechanical properties, such as beta-sheet and alpha-helix. Here, we intentionally included noise or even errors in the input conditions to test and demonstrate the behavior and robustness of our model. For example, for Model A, the sum of the secondary structure fractions in some of the tested conditions in Figure 3 deviates slightly from 1. However, even with those imperfect inputs, the model is still able to generate sequences that respect the relative concentrations of the different secondary structure types, as shown in Figure 3 (right column). Moreover, the generated proteins have a correct fractional content that does add up to one. To expand on this, Figure S8 shows an analysis of the rigor of input conditioning in Model A. We examine the difference between conditioning fractions that do not necessarily add up to one ($\sum_j c_j \neq 1$, Figure S8a) and conditioning that does add up to one ($\sum_j c_j = 1$, Figure S8b). Similar predictions are obtained in most cases, except for the second-to-last case. This result shows that the model has a certain degree of robustness and can deal with unphysical input (e.g., fractional secondary structure content that does not add up to one) and still produce reliable results.
While the rules of protein construction are complex, especially for secondary structures that involve long-range interactions (such as beta-sheets, where close or distant parts of the protein combine to form the 3D structure), primarily locally defined structures such as alpha-helices do not have such constraints. Hence, we conducted a case study to explore whether the model can handle systematic variations of the input. Applying Model B, we generate a series of alpha-helix (AH) sequences of increasing length. As shown in Figure S4, as we systematically increase the number of helical residues in the input, the model predicts a length-dependent conformational transition, from an unstructured chain, to a straight AH segment, to multiple AH segments with kinks, and to a slightly curved continuous AH segment. This transition indicates that the trained model not only seeks to respect the input conditions but also yields to the underlying constraints of physically possible secondary structures, learned implicitly during training.
In another test, Figure S5 depicts designed sequences based on secondary structure sequences identified from recently deposited PDB entries, taken from the CASP15 (a, PDB ID 7ROA) and CASP14 (b, PDB ID 7JTL) target sets. The generated proteins, shown in the right panel, are de novo sequences with no significant overlap with any existing known sequences. This shows that the model can generate new protein sequences – different from those generated via evolutionary mechanisms – and offer alternative, or candidate, sequence options for structural templates. In addition to this result, Figure S6 shows results for designed sequences based on several design targets of varying complexity (hard to easier, from top to bottom). The left panels show the error value over the iterations (we repeat generation until a maximum number of iterations is reached, or until the design error is below a critical value). The critical error is set to 0.1 in all cases; only the alpha-helix structures reach that goal. Generally, error values fluctuate around a mean value, reflecting the general level of challenge of the design problem. Furthermore, since we trained our model using a classifier-free strategy, we can explore how removing conditioning information affects the results. In Figure S7 we show the results of these experiments, which demonstrate that increasing the guidance parameter $s$ from 1 to 30 offers additional tuning of the predictions (for $s = 1$ the model is fully conditioned, and it becomes less conditioned as the parameter increases).
3. Conclusions
We report a method to generate novel protein designs, based on an attention-based diffusion model. Two variants of the model are presented: one that conditions predictions on the overall fractional content of secondary structure (Model A), and a second model based on per-residue secondary structure conditioning (Model B). The results show that Model B is more effective in generating de novo sequences that are not found in nature, or that have not been discovered yet (see analysis in Table 3). The model is capable of predicting a variety of new sequences; while they reveal some similarities with existing sequences (on the order of 50–60% similarity), they introduce significant amounts of new design cues. This level of variation can be regarded as an interesting measure to add to the diversity of natural protein designs.
Future work could include further refinement of sequences to meet additional criteria, for instance biological activity. Other critical steps include experimental validation of the designs, especially for cases that have little similarity with known proteins. Also, it may be interesting to add more explorative capacity to Model A, to achieve greater sequence diversity, much more distinct from existing proteins, at levels similar to what we achieved with Model B. One way to achieve this could be to use Model B to generate a greater variety of protein sequences, then fold the proteins and use these new data points to expand the existing dataset. Millions of new protein structures predicted by computational methods are already available and could also be added to the training dataset; so far, we have excluded these predicted proteins because we wanted to focus the training procedure on experimental data. One may also use integrated optimization algorithms where sequence predictions are accepted, rejected or altered based on performance criteria, such as invoking structure prediction tools or methods to directly assess secondary structure content in an end-to-end fashion. These strategies may offer interesting research avenues and the possibility of achieving multiple objective functions.
There are other future research avenues in which this work can be taken, including the use of inpainting strategies where part of the sequence is provided as a boundary condition and the model is asked to fill in a missing part. If conditioning strategies are used as proposed in this paper, such models would likely be able to solve protein design tasks of the kind where a domain is sought to be altered, for instance for mechanical or other (e.g., biological) property optimization. A useful application of this strategy could be, for instance, the design of silk materials where key domains are strengthened by adding larger beta-sheet-forming domains, whereas others are rendered stretchier via the use of helical domains. The sandwich design shown in Figure 5d is an example that indicates the feasibility of such design strategies. An important next step is the experimental synthesis of such protein structures and the use of mechanical analysis tools like optical tweezers to assess outcomes. The tool reported here may also be combined with other generative protein models,69–71 including those focused on predicting sequences that meet a certain geometric/shape demand.
Our model can generate de novo proteins following desired secondary structures. As another future research topic, the novelty of these new sequences may lead to superior mechanical properties and related functions that go beyond what has been observed in natural proteins. For example, studies of natural proteins have demonstrated that the mechanical stiffness, strength and toughness of silk show a size effect, depending on the length of the beta-strands in silk proteins.73 Combining this mechanistic knowledge with our generative model, one can systematically generate new sequences that yield the secondary structures of rationally optimized designs and verify the performance via simulations and experiments.
While we have explored the potential to study how beta-strand monomers assemble into larger assemblies (Figure 4), the models and scope of this work are strictly limited to single-chain proteins. It has been demonstrated that individual pieces of secondary structure and their interactions can play important roles in determining the mechanical properties of proteins and protein materials. For example, the unfolding of individual alpha-helices in vimentin intermediate filaments contributes mainly to their great extensibility;19 the interaction between parallel beta-strands in beta-sheet-rich proteins governs their rupture strength.74 Our models have generated sequences with similar patterns (e.g., Figure 3a and 3c, or Figure 5c and 5a) effectively under different formats of conditioning.
We also identified ways to increase the de novo nature of predicted sequences, especially using Model A, which tends to synthesize less diverse sequences compared to Model B. With strategies such as classifier-free guidance and early stopping during training, we can generate significantly more diverse sequences (another option, not yet explored here, would be to vary the number of sampling steps, which can be a powerful way to generate a greater variety of generative results). A visual summary of the sequence alignments, as examples for some of the designs, is shown in Figures S9 and S10. Figure S11 shows sample alignments that explain the nature of the variations in the generated sequences, providing visual depictions of the results and the novel nature of the sequences. As evidenced by these analyses, the generative algorithm realized in Model A has the capacity to discover sequence designs from a deep reservoir of patterns, some of which have also been discovered via natural processes.
As another future direction, it may be interesting to generalize our models towards the design of multimers under similar secondary structure conditions. One straightforward strategy is to introduce trivial linkers that represent the break between individual chains and further train the model with available data on protein complexes, as a similar scheme has been used to generalize AlphaFold 2 to multimer tasks.75,76 A more systematic way to generalize our current models for multimer design would be to include more conditions, including those specifying the cross-chain geometry in protein complexes, like binding residues and residue ties for symmetric or repeat protein designs.69 Next, we could envision combining the current models and their possible generalizations for multimer tasks to generate large numbers of sequences and screen for those that undergo conformational changes in terms of secondary structure when forming multimer complexes. As these require significant additional research (including the development of a proper dataset), we leave them to future studies.
4. Experimental Procedures
Resource Availability
Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Professor Markus Buehler (mbuehler@MIT.EDU).
Materials availability
This study did not generate new unique reagents.
Data availability
Data and codes, as well as trained weights, are available on GitHub at https://github.com/lamm-mit/ProteinDiffusionGenerator or will be provided by the lead contact upon reasonable request.
4.1. Dataset
We use the same dataset utilized in earlier work (see ref. 45). Figure S1 shows a statistical analysis of the dataset used to train model B. Figure S1a summarizes the sequence length distributions for lengths up to 128 amino acids. Figure S1b depicts the secondary structure coding statistics. Figure S1c shows the amino acid residue statistics.
The right panel in Figure S1 shows the tokenizer dictionary for both secondary structure and amino acid codes. The associations listed there are used to translate string characters (for both amino acid codes and secondary structure codes) into numerical integer values. We then normalize the data to lie between 0 and 1, resulting in floating-point numbers for input and output.
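To make the encoding concrete, the following is a minimal sketch of this tokenize-and-normalize step; the vocabularies and maximum length are illustrative assumptions, since the actual dictionaries are produced by the trained Keras tokenizer (Figure S1, right panel).

```python
AA_VOCAB = "ACDEFGHIKLMNPQRSTVWY"   # 20 amino acid one-letter codes (assumed ordering)
SS_VOCAB = "HET~BGIS"               # 8 DSSP codes of Table 1 (assumed ordering)

aa_to_int = {c: i + 1 for i, c in enumerate(AA_VOCAB)}  # token 0 is reserved for padding
ss_to_int = {c: i + 1 for i, c in enumerate(SS_VOCAB)}

def encode(seq, table, max_len=128):
    """Map characters to integer tokens, normalize to [0, 1], and pad with 0."""
    scale = float(max(table.values()))
    vals = [table[c] / scale for c in seq]
    return vals + [0.0] * (max_len - len(vals))

print(encode("HHHH~~EE", ss_to_int)[:10])  # normalized, padded conditioning sequence
```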
For Model A, the input vector is the fractional content of the 8 different types of DSSP secondary structures:
$$ c_j = \frac{N_j}{N}, \qquad j = 1, \dots, 8 \tag{1} $$

where $N_j$ is the number of residues with type-$j$ secondary structure, and $N$ is the total number of amino acids in a protein.
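A short sketch showing how such a conditioning vector can be computed from a DSSP string is given below (the ordering of the eight codes is assumed to follow Table 1):

```python
from collections import Counter

DSSP_CODES = "HET~BGIS"  # ordering assumed to follow Table 1

def ss_fractions(dssp_string):
    """Conditioning vector of eq. (1): c_j = N_j / N for the eight DSSP types."""
    counts = Counter(dssp_string)
    n = len(dssp_string)
    return [counts.get(code, 0) / n for code in DSSP_CODES]

# A mostly alpha-helical protein with some turns and disorder:
print(ss_fractions("HHHHHHHHHH~~TT~~HHHHHHHH"))  # fractions sum to 1 by construction
```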
Unused tokens in the predicted sequence (both models) and in the conditioning sequence (model B) are identified with token 0 for padding.
Figure S2 shows the distribution of the fractional contents of the eight types of secondary structure in the training set used for Model A. Note that, for many types of secondary structure, the distribution is not uniform and quite skewed with respect to the fraction between 0 and 1. These distributions likely reflect chemical and/or biological/evolutionary principles that shaped the protein designs. There exist relatively few sequences with a higher fraction of beta-bridge (>0.4), beta-sheet (>0.9), 3₁₀ helix (>0.8), hydrogen-bonded turn (>0.8), pi-helix (>0.5), or bend (>0.7). Figure S3 depicts plots of the fractional content distributions of different pairs of secondary structures in the training set. These pair-correlation plots are useful for understanding naturally occurring combinations of secondary structure contents.
4.2. Design of the neural network architectures
All code is developed in PyTorch,77 except for the tokenizer that is developed based on, and trained, using TensorFlow Keras.78
The model is built on a one-dimensional U-Net architecture composed of convolutional and transformer layers with skip connections (Figure 2c). U-Nets are a type of neural network that features the same input dimension as output dimension, commonly used in problems such as image segmentation. The U-Net implemented here features a more complex architecture, as described in 54,79, including the use of ResNet blocks, attention modules, and skip connections (see Table 4 for details). As depicted in Figure 2c, the U-net features self- and cross-attention blocks. These are used to contextualize the denoising process with both the conditioning data and the diffusion time step. The use of transformer-type architectures provides a meaningful way to also learn long-range relationships in the construction of amino acid sequences and how residues interact to satisfy various physical/chemical constraints; during training, these relationships are all learned.
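As a rough sketch of the two building-block types (not the exact implementation, which follows 54,79), the following shows how 1D convolutional ResNet blocks and attention blocks can be composed in PyTorch; the channel dimension is assumed divisible by the number of normalization groups:

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """1D ResNet-style block: two convolutions wrapped in a skip connection."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.GroupNorm(8, dim),
            nn.SiLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, x):  # x: (batch, dim, seq_len)
        return x + self.net(x)

class AttentionBlock(nn.Module):
    """Self-attention along the sequence axis; with cond given, acts as cross-attention."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, cond=None):
        h = x.transpose(1, 2)             # -> (batch, seq_len, dim)
        kv = h if cond is None else cond  # cond: (batch, cond_len, dim)
        out, _ = self.attn(h, kv, kv)
        return x + out.transpose(1, 2)

x = torch.randn(4, 256, 128)       # (batch, channels, sequence length)
x = ResidualConvBlock(256)(x)
x = AttentionBlock(256)(x)         # self-attention; pass cond for cross-attention
```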
Table 4:
Parameters used in the progressive transformer diffusion model (parameters for 1D U-net, the integrated architecture, and additional parameters are provided).
| Neural network component | Parameter | Value |
|---|---|---|
| U-net | U-net dimension | 128/512/768 (ratio-based model, Model A); 256 (residue-level model, Model B) |
| | Dimension multipliers | 1, 2, 4, 8 |
| | ResNet blocks | 1 |
| | Attention heads | 8 |
| | Feed forward multiplier | 2.0 |
| Overall architecture | Sequence length | 64/128 |
| | Cond. drop probability | 0.2/0.1 |
| | Sample steps | 96/64 |
| | σ_min | 0.002 |
| | σ_max | 80, 160 |
| | σ_data | 0.5 |
| | ρ | 7 |
| | P_mean | −1.2 |
| | P_std | 1.2 |
| | S_churn | 80 |
| | S_tmin | 0.05 |
| | S_tmax | 50 |
| | S_noise | 1.003 |
| Optimizer | Adam | learning rate = 1e-4, epsilon = 1e-8, betas = (0.9, 0.99) |
| Additional parameters | Batch size | 256 |
| Additional parameters (Model A) | Condition embedding dimension | 512 |
| | Positional encoding depth | 128 |
| | Signal embedding depth | 368 |
The U-Net is used to translate the secondary structure fraction vector (model A) or input sequence (model B) into the final field output via a learned denoising process. Figure 2b visualizes the denoising process, where the top defines a Markov chain operator q that adds Gaussian noise step-by-step (according to a defined noise schedule $\beta_i$ that determines how much noise $\varepsilon_i$ is added at each step $i$), translating the original sequence $X_0$ (left) into pure noise, $X_F$ (right), via:

$$ q(X_i \mid X_{i-1}) = \mathcal{N}\!\left(X_i;\; \sqrt{1-\beta_i}\,X_{i-1},\; \beta_i \mathbf{I}\right) \tag{2} $$
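Because the composition of the Gaussian steps in eq. (2) is itself Gaussian, $X_i$ can be sampled from $X_0$ in a single shot; a sketch of this standard DDPM identity, used to generate training pairs, is:

```python
import torch

def q_sample(x0, i, alphas_cumprod):
    """Sample X_i ~ q(X_i | X_0) in closed form (standard DDPM identity for eq. (2)).
    alphas_cumprod holds the cumulative products of (1 - beta) over the noise schedule."""
    eps = torch.randn_like(x0)   # the noise the network is later trained to predict
    a_bar = alphas_cumprod[i]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps, eps
```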
The deep neural network is then trained to reverse this process, identifying an operator p that maximizes the likelihood of the training data. This offers a means to translate noise into solutions, realizing the transition illustrated in Figure 2a (noise to solution) in a step-by-step fashion, as indicated in the lower row of Figure 2b:

$$ p_\theta(X_{i-1} \mid X_i) = \mathcal{N}\!\left(X_{i-1};\; \mu_\theta(X_i, i),\; \Sigma_\theta(X_i, i)\right) \tag{3} $$
Specifically, the U-Net is tasked with predicting the added noise; correspondingly, the L2 distance between the actual added noise $\varepsilon_i$ and the predicted noise $\varepsilon_\theta(X_i, i)$ is adopted as the loss function for the training process:

$$ \mathcal{L} = \mathbb{E}_{X_0,\, i,\, \varepsilon_i}\left[ \left\lVert \varepsilon_i - \varepsilon_\theta(X_i, i) \right\rVert_2^2 \right] \tag{4} $$
That way, the trained model can predict the added noise. Knowing this quantity then allows us to realize a numerical solution to the problem stated in eq. (3), used to generate the next iteration of the denoised sequence:

$$ X_{i-1} = \frac{1}{\sqrt{\alpha_i}} \left( X_i - \frac{1-\alpha_i}{\sqrt{1-\bar{\alpha}_i}}\, \varepsilon_\theta(X_i, i) \right) + \sigma_i z, \qquad z \sim \mathcal{N}(0, \mathbf{I}) \tag{5} $$

with $\alpha_i = 1 - \beta_i$ and $\bar{\alpha}_i = \prod_{k \le i} \alpha_k$. In eq. (5), the sequence $X_i$ at step $i$ is transformed by removing the predicted noise. This process is performed iteratively, whereby the neural network predicts, given the current state $X_i$, the noise to be removed at the given time step of the denoising process (see Figure 2b).
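In code, the denoising iteration of eqs. (3)–(5) takes the following form (a DDPM-style sketch; the actual sampler follows the improved schedule of 80, and the `model(x, t, cond)` call is a placeholder for the conditional U-Net):

```python
import torch

@torch.no_grad()
def p_sample_loop(model, cond, shape, betas):
    """Iteratively remove predicted noise, X_F -> X_0 (eqs. (3)-(5))."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                              # start from pure noise X_F
    for i in reversed(range(len(betas))):
        eps_hat = model(x, t=i, cond=cond)              # predicted noise at step i
        coef = (1.0 - alphas[i]) / (1.0 - alphas_cumprod[i]).sqrt()
        x = (x - coef * eps_hat) / alphas[i].sqrt()     # mean of p(X_{i-1} | X_i), eq. (5)
        if i > 0:
            x = x + betas[i].sqrt() * torch.randn_like(x)  # stochastic term sigma_i * z
    return x
```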
We use the improved noise schedule, sampling and training processes proposed in 80, since they provide enhanced and computationally efficient predictions, specifically obtaining results within just 64 (model A) or 96 (model B) denoising steps. Table 4 provides details about the model architecture parameters. The implementation is based on the code published at 79, but extended with a new U-Net architecture that handles one-dimensional sequence data with higher-order embeddings.
The conditional encoder scales the sequence data values to lie between 0 and 1 before feeding them into the embedding dimension (values are unscaled again for reverse tokenization and analysis).
In Model A, the conditioning is fed via embeddings that are used for cross-attention with the input, after being expanded into a higher-order embedding dimension through linear layers. Trainable positional encoding is used (an ordinal ordering of each conditioning parameter, from 1 to 8 as defined in eq. (1), is designated and then encoded via a fully connected embedding layer). In Model B, the conditioning is provided as sequence conditioning, where the conditioning sequence of secondary structure encodings is concatenated to the input, i.e., the noise vector. We find the latter strategy to work better when the conditioning signal has similar dimensionality as the output (which is the case in Model B, since the conditioning and prediction are of the same dimension). In both cases, this yields a conditional algorithm,

$$ p_\theta(X_{i-1} \mid X_i, c) \tag{6} $$

so that the model can produce samples that meet target features defined by the conditioning $c$.
Model A tends to generate less novel, diverse sequences, and we explored ways to address this limitation. One possible way is to use early stopping to avoid overfitting, as has also been proposed in 81. The results shown in Figure S8 are obtained, for instance, using a model with a relatively large 768-dimensional U-net and early stopping. This strategy, combined with variations of the classifier-free guidance82 parameter as presented in Figure S7, can be a means to increase the creative capacity of this model. Classifier-free guidance is achieved by training the model with conditional dropout; in our case, the conditioning information is removed 20% of the time, so that we can make predictions both with conditioning, $\varepsilon_c$, and without conditioning, $\varepsilon_u$. During sampling, the predictions are then combined according to $\tilde{\varepsilon} = \varepsilon_u + s\,(\varepsilon_c - \varepsilon_u)$, where $s$ is a conditioning parameter that determines how the conditional and unconditional solutions are mixed ($s = 1$ yields the fully conditioned model, and for larger values, less conditioning is obtained). Other than in Figure S7, all results in this paper are obtained using $s = 1$.
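The mixing rule above translates directly into code (a sketch under the notation introduced here; the noise-prediction call is again a placeholder):

```python
def guided_noise(model, x, t, cond, s=1.0):
    """Classifier-free guidance: mix conditional and unconditional noise predictions.
    s = 1 recovers the fully conditioned model."""
    eps_cond = model(x, t=t, cond=cond)
    eps_uncond = model(x, t=t, cond=None)  # conditioning dropped, as during training
    return eps_uncond + s * (eps_cond - eps_uncond)
```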
The entire prediction pipeline involves all of the steps shown in Figure 1: first, taking a conditioning parameter, using the diffusion model to make conditional predictions, and then predicting the folded 3D protein structure using OmegaFold.64 For validation we also implement folding predictions using alternative methods, including AlphaFold 2 via ColabFold and some testing with trRosetta. For details regarding the primary folding strategies used, see Section 4.4.
As indicated in Figure 1, both models sample solutions and are hence capable of generating a set of possible solutions to the same design problem. As a systematic way to obtain the best possible solution, we implement an iterative algorithm as outlined in Figure 1. We repeat generation until we reach a maximum number of iterations, or until we are below a critical error value (see Figure S6 for an analysis of errors over iterations). Alternative approaches may target those designs that are the most novel, or that strike a compromise between novelty and meeting the design demand.
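A minimal sketch of this generate-fold-score loop follows; all callables are placeholders for the components in Figure 1 (the diffusion sampler, the OmegaFold + DSSP analysis step, and an error metric such as $1 - A_{\mathrm{pr}}$ from eq. (7)):

```python
def design_protein(sample_sequence, fold_and_classify, error_fn, target,
                   max_iters=20, err_crit=0.1):
    """Iterative down-selection: sample candidates until the design error is small."""
    best_seq, best_err = None, float("inf")
    for _ in range(max_iters):
        seq = sample_sequence(target)        # conditional generation (diffusion model)
        realized = fold_and_classify(seq)    # fold, then extract secondary structure
        err = error_fn(target, realized)
        if err < best_err:
            best_seq, best_err = seq, err
        if best_err < err_crit:              # stop once the design goal is met
            break
    return best_seq, best_err
```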
4.3. Training and validation
We use an Adam optimizer,83 with Table 4 listing key model and training hyperparameters. Figure 7 shows several validation cases, comparing ground truth and predictions, for two cases (a and b) for Model B (results are similar for Model A). These are protein sequences taken from the validation set (10% of the total data available); hence, they are not de novo sequences and do not merit further analysis. Both cases show that the model can accurately predict sequences according to specific secondary structure content. For cases where multiple sequences exist that correspond to the same or similar secondary structure input, multiple, varied predictions are made.
Figure 7:

Validation cases, comparing ground truth and predictions, for two cases (a and b). These are protein sequences taken from the validation set (10% of the total data available); hence, they are not de novo sequences. Both cases show that the model can accurately predict sequences according to specific secondary structure content. For cases where multiple sequences exist that correspond to the same or similar secondary structure input, multiple, varied predictions are made.
To measure the conditioning capability of model B, we define the per-residue accuracy, $A_{\mathrm{pr}}$, as the fraction of residues with the designed secondary structure types in the generated protein sequence, i.e.,

$$ A_{\mathrm{pr}} = \frac{N_{\mathrm{match}}}{N} \tag{7} $$

where $N_{\mathrm{match}}$ is the number of residues with the same secondary structure type as the input condition, and $N$ is the total number of amino acids in the protein. $A_{\mathrm{pr}}$ lies between 0 and 1, and the error is defined as $1 - A_{\mathrm{pr}}$.
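A direct implementation of this metric, operating on equal-length DSSP strings with padding assumed to be removed beforehand, reads:

```python
def per_residue_accuracy(designed_ss, realized_ss):
    """A_pr of eq. (7): fraction of residues whose realized DSSP code matches the input."""
    assert len(designed_ss) == len(realized_ss)
    matches = sum(d == r for d, r in zip(designed_ss, realized_ss))
    return matches / len(designed_ss)

# The design error used for down-selection is then:
# err = 1.0 - per_residue_accuracy(designed_ss, realized_ss)
```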
4.4. Protein folding
We implement OmegaFold64 directly in our model architecture for rapid prediction of protein structures from the sequence. OmegaFold offers a rapid alternative as it does not require Multiple Sequence Alignment (MSA), yet produces results of accuracy similar to AlphaFold 2, trRosetta, and related state-of-the-art methods.
To check the results, we further use AlphaFold 234 for monomers and AlphaFold-Multimer67 for multimers, via the ColabFold65 implementation. We use the pdb70 template set and MMseqs2 (UniRef and Environmental), with 6 cycles. The highest-ranked prediction is used for the analysis in this paper. Additional comparisons are conducted using trRosetta84 for validation.
The inclusion of folding tools and the comparison of several such prediction strategies are aimed at validating whether the generated sequences are likely to fulfill the designed secondary structure conditioning parameters. While the ultimate validation of our model may require producing and examining these protein sequences experimentally, the application of state-of-the-art in silico folding tools, like AlphaFold 2, provides a useful benchmarking pathway of higher efficiency and lower cost, which has been shown to be a viable strategy in recent works. As shown in Figures 3 and 5, the majority of the folded structures achieve a relatively high pLDDT score (~70), which indicates that the predicted structures are expected to be modeled well and remain stable. Furthermore, to fold de novo sequences with potentially limited MSA information, we use OmegaFold, which is designed to make predictions directly from the primary sequence, accurately and efficiently without MSAs. We find good agreement between high-confidence folded structures predicted by these two methods for the sequences we generated, including the de novo designs. This gives us confidence that the novel sequences generated by our model are likely to deliver the desired secondary structures.
4.5. BLAST analysis
The basic local alignment search tool (BLAST)68 analysis for the various cases is conducted using the blastp (protein-protein BLAST) algorithm, and the non-redundant protein sequences (nr) database. Summary results are shown in Tables 2 and 3, and detailed results are provided in Figures S9–S11.
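For reference, an equivalent search can be scripted against the NCBI web service, e.g., via Biopython (a sketch; the searches reported here were run through the standard blastp interface):

```python
from Bio.Blast import NCBIWWW, NCBIXML

seq = "GFCRCLCRRGICRCICTR"                    # first designed sequence in Table 2
handle = NCBIWWW.qblast("blastp", "nr", seq)  # blastp against the nr database
record = NCBIXML.read(handle)
for aln in record.alignments[:3]:             # top three alignments
    hsp = aln.hsps[0]
    print(aln.title, f"{hsp.identities}/{hsp.align_length} identities")
```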
4.6. Visualization
We use PyMol85 and Py3DMol86 for visualization of the protein structures.
4.7. Software versions and hardware
We use Python 3.8.12, PyTorch 1.10 77 with CUDA (CUDA version 11.6), and a NVIDIA RTX A6000 with 48 GB VRAM for training and inference.
Supplementary Material
Highlights.
Diffusion models can efficiently generate proteins with desired secondary structures.
De novo protein sequences not yet discovered in nature can be generated.
The models remain robust to imperfect or even unrealistic design goals.
The models can be extended to generate de novo proteins for other properties.
THE BIGGER PICTURE.
The design of de novo protein sequences holds great potential for achieving superior combinations of novel functions and mechanical properties beyond known, natural proteins. However, the tremendous number of possible sequences and the cost of experimental testing make the effective search for, and validation of, superior de novo protein candidates extremely challenging. Here, we leverage a diffusion model-based deep learning framework to efficiently generate novel protein sequences that meet a desired overall secondary structure fractional content or per-residue type of secondary structure. The generated sequences show novelty beyond existing, natural ones. By robustly generating various novel sequences with the desired structural features, our model provides rapid strategies for target-guided de novo protein design, which can lead to the discovery of superior protein materials for various biological and engineering applications and can be extended to other design objectives in future work.
Acknowledgments
We acknowledge support from the MIT-IBM Watson AI Lab, USDA (2021-69012-35978), DOE-SERDP (WP22-S1-3475), and ARO (79058LSCSB, W911NF-22-2-0213 and W911NF2120130). Additional support from NIH (U01EB014976 and R01AR077793) and ONR (N00014-19-1-2375 and N00014-20-1-2189) is acknowledged.
Footnotes
Declaration of interests
The authors declare no competing interests.
Supplementary materials
Additional figures, PDB files, and other materials are provided as Supplementary Materials.
• Files with PDB proteins structures generated by Model A and Model B are included as Supplementary Material.
References
- 1. López Barreiro D, Yeo J, Tarakanova A, Martin-Martinez FJ, and Buehler MJ (2019). Multiscale Modeling of Silk and Silk-Based Biomaterials—A Review. Macromol Biosci 19, 1800253. 10.1002/mabi.201800253.
- 2. Gronau G, Krishnaji ST, Kinahan ME, Giesa T, Wong JY, Kaplan DL, and Buehler MJ (2012). A review of combined experimental and computational procedures for assessing biopolymer structure–process–property relationships. Biomaterials 33, 8240–8255. 10.1016/j.biomaterials.2012.06.054.
- 3. Vepari C, and Kaplan DL (2007). Silk as a biomaterial. Prog Polym Sci 32, 991–1007. 10.1016/j.progpolymsci.2007.05.013.
- 4. Ling S, Kaplan DL, and Buehler MJ (2018). Nanofibrils in nature and materials engineering. Nature Reviews Materials 3, 1–15. 10.1038/natrevmats.2018.16.
- 5. Wegst UGK, Bai H, Saiz E, Tomsia AP, and Ritchie RO (2014). Bioinspired structural materials. Nature Materials 14, 23–36. 10.1038/nmat4089.
- 6. Gu GX, Takaffoli M, and Buehler MJ (2017). Hierarchically Enhanced Impact Resistance of Bioinspired Composites. Advanced Materials 29, 1700060. 10.1002/adma.201700060.
- 7. Barthelat F, Yin Z, and Buehler MJ (2016). Structure and mechanics of interfaces in biological materials. Nature Reviews Materials 1, 1–16. 10.1038/natrevmats.2016.7.
- 8. Huang W, Tarakanova A, Dinjaski N, Wang Q, Xia X, Chen Y, Wong JY, Buehler MJ, and Kaplan DL (2016). Design of Multistimuli Responsive Hydrogels Using Integrated Modeling and Genetically Engineered Silk–Elastin-Like Proteins. Adv Funct Mater 26, 4113–4123. 10.1002/adfm.201600236.
- 9. Krishnaji ST, Bratzel G, Kinahan ME, Kluge JA, Staii C, Wong JY, Buehler MJ, and Kaplan DL (2013). Sequence–Structure–Property Relationships of Recombinant Spider Silk Proteins: Integration of Biopolymer Design, Processing, and Modeling. Adv Funct Mater 23, 241–253. 10.1002/adfm.201200510.
- 10. Huang PS, Boyken SE, and Baker D (2016). The coming of age of de novo protein design. Nature 537, 320–327. 10.1038/nature19946.
- 11. Paladino A, Marchetti F, Rinaldi S, and Colombo G (2017). Protein design: from computer models to artificial intelligence. Wiley Interdiscip Rev Comput Mol Sci 7, e1318. 10.1002/wcms.1318.
- 12. Wang J, Cao H, Zhang JZH, and Qi Y (2018). Computational Protein Design with Deep Learning Neural Networks. Scientific Reports 8, 1–9. 10.1038/s41598-018-24760-x.
- 13. Qin Z, Wu L, Sun H, Huo S, Ma T, Lim E, Chen PY, Marelli B, and Buehler MJ (2020). Artificial intelligence method to design and fold alpha-helical structural proteins from the primary amino acid sequence. Extreme Mech Lett 36, 100652. 10.1016/j.eml.2020.100652.
- 14. Ackbarow T, Chen X, Keten S, and Buehler MJ (2007). Hierarchies, multiple energy barriers, and robustness govern the fracture mechanics of α-helical and β-sheet protein domains. Proc Natl Acad Sci U S A 104, 16410–16415. 10.1073/pnas.0705759104.
- 15. Qin Z, and Buehler MJ (2010). Cooperative deformation of hydrogen bonds in beta-strands and beta-sheet nanocrystals. Phys Rev E Stat Nonlin Soft Matter Phys 82, 061906. 10.1103/physreve.82.061906.
- 16. Xu Z, and Buehler MJ (2010). Mechanical energy transfer and dissipation in fibrous beta-sheet-rich proteins. Phys Rev E Stat Nonlin Soft Matter Phys 81, 061910. 10.1103/physreve.81.061910.
- 17. Knowles TPJ, and Buehler MJ (2011). Nanomechanics of functional and pathological amyloid materials. Nature Nanotechnology 6, 469–479. 10.1038/nnano.2011.102.
- 18. Hu X, Kaplan D, and Cebe P (2006). Determining beta-sheet crystallinity in fibrous proteins by thermal analysis and infrared spectroscopy. Macromolecules 39, 6161–6170. 10.1021/ma0610109.
- 19. Qin Z, Kreplak L, and Buehler MJ (2009). Hierarchical Structure Controls Nanomechanical Properties of Vimentin Intermediate Filaments. PLoS One 4, e7294. 10.1371/journal.pone.0007294.
- 20. Ackbarow T, Sen D, Thaulow C, and Buehler MJ (2009). Alpha-Helical Protein Networks Are Self-Protective and Flaw-Tolerant. PLoS One 4, e6015. 10.1371/journal.pone.0006015.
- 21. Spivak DI, Giesa T, Wood E, and Buehler MJ (2011). Category Theoretic Analysis of Hierarchical Protein Materials and Social Networks. PLoS One 6, e23911. 10.1371/journal.pone.0023911.
- 22. Studart AR (2013). Biological and Bioinspired Composites with Spatially Tunable Heterogeneous Architectures. Adv Funct Mater 23, 4423–4436. 10.1002/adfm.201300340.
- 23. Keten S, Chou CC, van Duin ACT, and Buehler MJ (2012). Tunable nanomechanics of protein disulfide bonds in redox microenvironments. J Mech Behav Biomed Mater 5, 32–40. 10.1016/j.jmbbm.2011.08.017.
- 24. Wray LS, Rnjak-Kovacina J, Mandal BB, Schmidt DF, Gil ES, and Kaplan DL (2012). A silk-based scaffold platform with tunable architecture for engineering critically-sized tissue constructs. Biomaterials 33, 9214–9224. 10.1016/j.biomaterials.2012.09.017.
- 25. Dinjaski N, Ebrahimi D, Qin Z, Giordano JEM, Ling S, Buehler MJ, and Kaplan DL (2018). Predicting rates of in vivo degradation of recombinant spider silk proteins. J Tissue Eng Regen Med 12, e97–e105. 10.1002/term.2380.
- 26. Keten S, and Buehler MJ (2010). Nanostructure and molecular mechanics of spider dragline silk protein assemblies. J R Soc Interface 7, 1709–1721. 10.1098/rsif.2010.0149.
- 27. Xiao S, and Gräter F (2013). Dissecting the structural determinants for the difference in mechanical stability of silk and amyloid beta-sheet stacks. Physical Chemistry Chemical Physics 15, 8765–8771. 10.1039/C3CP00067B.
- 28. Keten S, and Buehler MJ (2008). Geometric confinement governs the rupture strength of H-bond assemblies at a critical length scale. Nano Lett 8, 743–748. 10.1021/nl0731670.
- 29. Ackbarow T, Keten S, and Buehler MJ (2008). A multi-timescale strength model of alpha-helical protein domains. Journal of Physics: Condensed Matter 21, 035111. 10.1088/0953-8984/21/3/035111.
- 30. Keten S, Rodriguez Alvarado JF, Müftü S, and Buehler MJ (2009). Nanomechanical characterization of the triple β-helix domain in the cell puncture needle of bacteriophage T4 virus. Cell Mol Bioeng 2, 66–74. 10.1007/s12195-009-0047-9.
- 31. Buehler MJ, and Yung YC (2009). Deformation and failure of protein materials in physiologically extreme conditions and disease. Nature Materials 8, 175–188. 10.1038/nmat2387.
- 32. Jaleel Z, Zhou S, Martín-Moldes Z, Baugh LM, Yeh J, Dinjaski N, Brown LT, Garb JE, and Kaplan DL (2020). Expanding Canonical Spider Silk Properties through a DNA Combinatorial Approach. Materials 13, 3596. 10.3390/ma13163596.
- 33. Hayashi CY, Shipley NH, and Lewis RV (1999). Hypotheses that correlate the sequence, structure, and mechanical properties of spider silk proteins. Int J Biol Macromol 24, 271–275. 10.1016/S0141-8130(98)00089-0.
- 34. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589. 10.1038/s41586-021-03819-2.
- 35. Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, Yuan D, Stroe O, Wood G, Laydon A, et al. (2022). AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res 50, D439–D444. 10.1093/nar/gkab1061.
- 36. Liu FYC, Ni B, and Buehler MJ (2022). PRESTO: Rapid protein mechanical strength prediction with an end-to-end deep learning model. Extreme Mech Lett 55, 101803. 10.1016/j.eml.2022.101803.
- 37. Khare E, Gonzalez-Obeso C, Kaplan DL, and Buehler MJ (2022). CollagenTransformer: End-to-End Transformer Model to Predict Thermal Stability of Collagen Triple Helices Using an NLP Approach. ACS Biomater Sci Eng 8, 4301–4310. 10.1021/acsbiomaterials.2c00737.
- 38. Zhang B, Li J, and Lü Q (2018). Prediction of 8-state protein secondary structures by a novel deep learning architecture. BMC Bioinformatics 19, 1–13. 10.1186/s12859-018-2280-5.
- 39. Pollastri G, and McLysaght A (2005). Porter: a new, accurate server for protein secondary structure prediction. Bioinformatics 21, 1719–1720. 10.1093/bioinformatics/bti203.
- 40. Mirabello C, and Pollastri G (2013). Porter, PaleAle 4.0: high-accuracy prediction of protein secondary structure and relative solvent accessibility. Bioinformatics 29, 2056–2058. 10.1093/bioinformatics/btt344.
- 41. Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M, et al. (2022). ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans Pattern Anal Mach Intell 44, 7112–7127. 10.1109/tpami.2021.3095381.
- 42. Høie MH, Kiehl EN, Petersen B, Nielsen M, Winther O, Nielsen H, Hallgren J, and Marcatili P (2022). NetSurfP-3.0: accurate and fast prediction of protein structural features by protein language models and deep learning. Nucleic Acids Res 50, W510–W515. 10.1093/nar/gkac439.
- 43. Lew AJ, and Buehler MJ (2021). A deep learning augmented genetic algorithm approach to polycrystalline 2D material fracture discovery and design. Appl Phys Rev 8, 041414. 10.1063/5.0057162.
- 44. Khare E, Yu CH, Gonzalez Obeso C, Milazzo M, Kaplan DL, and Buehler MJ (2022). Discovering design principles of collagen molecular stability using a genetic algorithm, deep learning, and experimental validation. Proc Natl Acad Sci U S A 119, e2209524119. 10.1073/pnas.2209524119.
- 45. Yu CH, Chen W, Chiang YH, Guo K, Martin Moldes Z, Kaplan DL, and Buehler MJ (2022). End-to-End Deep Learning Model to Predict and Design Secondary Structure Content of Structural Proteins. ACS Biomater Sci Eng 8, 1156–1165. 10.1021/acsbiomaterials.1c01343.
- 46. Hinton GE, and Zemel RS (1993). Autoencoders, Minimum Description Length and Helmholtz Free Energy. Adv Neural Inf Process Syst 6.
- 47. Dong G, Liao G, Liu H, and Kuang G (2018). A Review of the Autoencoder and Its Variants: A Comparative Perspective from Target Recognition in Synthetic-Aperture Radar Images. IEEE Geosci Remote Sens Mag 6, 44–68. 10.1109/mgrs.2018.2853555.
- 48. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, and Bengio Y (2020). Generative adversarial networks. Commun ACM 63, 139–144. 10.1145/3422622.
- 49. Makoś MZ, Verma N, Larson EC, Freindorf M, and Kraka E (2021). Generative adversarial networks for transition state geometry prediction. J Chem Phys 155, 024116. 10.1063/5.0055094.
- 50. Lebese T, Mellado B, and Ruan X (2021). The use of Generative Adversarial Networks to characterise new physics in multi-lepton final states at the LHC. International Journal of Modern Physics A. 10.48550/arxiv.2105.14933.
- 51. Ho J, Jain A, and Abbeel P (2020). Denoising Diffusion Probabilistic Models. Adv Neural Inf Process Syst 33, 6840–6851.
- 52. Yang L, Zhang Z, Song Y, Hong S, Xu R, Zhao Y, Shao Y, Zhang W, Cui B, and Yang M-H (2022). Diffusion Models: A Comprehensive Survey of Methods and Applications. 10.48550/arxiv.2209.00796.
- 53. Marcus G, Davis E, and Aaronson S (2022). A very preliminary analysis of DALL-E 2. 10.48550/arxiv.2204.13807.
- 54. Saharia C, Chan W, Saxena S, Li L, Whang J, Denton E, Ghasemipour SKS, Ayan BK, Mahdavi SS, Lopes RG, et al. (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. 10.48550/arxiv.2205.11487.
- 55. Rombach R, Blattmann A, Lorenz D, Esser P, and Ommer B (2022). High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10674–10685. 10.48550/arxiv.2112.10752.
- 56. Yang Z, Yu CH, Guo K, and Buehler MJ (2021). End-to-end deep learning method to predict complete strain and stress tensors for complex hierarchical composite microstructures. J Mech Phys Solids 154, 104506. 10.1016/j.jmps.2021.104506.
- 57. Yang Z, Yu CH, and Buehler MJ (2021). Deep learning model to predict complex stress and strain fields in hierarchical composites. Sci Adv 7. 10.1126/sciadv.abd7416.
- 58. Buehler MJ (2022). FieldPerceiver: Domain agnostic transformer model to predict multiscale physical fields and nonlinear material properties through neural ologs. Materials Today 57, 9–25. 10.1016/j.mattod.2022.05.020.
- 59. Ni B, and Gao H (2021). A deep learning approach to the inverse problem of modulus identification in elasticity. MRS Bull 46, 19–25. 10.1557/s43577-020-00006-y.
- 60. Buehler MJ (2022). Modeling Atomistic Dynamic Fracture Mechanisms Using a Progressive Transformer Diffusion Model. Journal of Applied Mechanics 89. 10.1115/1.4055730.
- 61. Lin Z, Sercu T, LeCun Y, and Rives A (2021). Deep generative models create new and diverse protein structures.
- 62. Anand N, and Achim T (2022). Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models. 10.48550/arxiv.2205.15019.
- 63. Trippe BL, Yim J, Tischer D, Baker D, Broderick T, Barzilay R, and Jaakkola T (2022). Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem. 10.48550/arxiv.2206.04119.
- 64. Wu R, Ding F, Wang R, Shen R, Zhang X, Luo S, Su C, Wu Z, Xie Q, Berger B, et al. (2022). High-resolution de novo structure prediction from primary sequence. bioRxiv, 2022.07.21.500999. 10.1101/2022.07.21.500999.
- 65. Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, and Steinegger M (2022). ColabFold: making protein folding accessible to all. Nature Methods 19, 679–682. 10.1038/s41592-022-01488-1.
- 66. Kabsch W, and Sander C (1983). Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637. 10.1002/bip.360221211.
- 67. Evans R, O'Neill M, Pritzel A, Antropova N, Senior A, Green T, Žídek A, Bates R, Blackwell S, Yim J, et al. (2022). Protein complex prediction with AlphaFold-Multimer. bioRxiv, 2021.10.04.463034. 10.1101/2021.10.04.463034.
- 68. Altschul SF, Gish W, Miller W, Myers EW, and Lipman DJ (1990). Basic local alignment search tool. J Mol Biol 215, 403–410. 10.1016/S0022-2836(05)80360-2.
- 69. Dauparas J, Anishchenko I, Bennett N, Bai H, Ragotte RJ, Milles LF, Wicky BIM, Courbet A, de Haas RJ, Bethel N, et al. (2022). Robust deep learning–based protein sequence design using ProteinMPNN. Science 378, 49–56. 10.1126/science.add2187.
- 70. Ingraham J, Baranov M, Costello Z, Frappier V, Ismail A, Tie S, Wang W, Xue V, Obermeyer F, Beam A, et al. (2022). Illuminating protein space with a programmable generative model. bioRxiv, 2022.12.01.518682. 10.1101/2022.12.01.518682.
- 71. Watson JL, Juergens D, Bennett NR, Trippe BL, Yim J, Eisenach HE, Ahern W, Borst AJ, Ragotte RJ, Milles LF, et al. (2022). Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. bioRxiv, 2022.12.09.519842. 10.1101/2022.12.09.519842.
- 72. Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, Wang J, Cong Q, Kinch LN, Dustin Schaeffer R, et al. (2021). Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876. 10.1126/science.abj8754.
- 73. Keten S, Xu Z, Ihle B, and Buehler MJ (2010). Nanoconfinement controls stiffness, strength and mechanical toughness of β-sheet crystals in silk. Nature Materials 9, 359–367. 10.1038/nmat2704.
- 74. Keten S, and Buehler MJ (2008). Asymptotic strength limit of hydrogen-bond assemblies in proteins at vanishing pulling rates. Phys Rev Lett 100, 198301. 10.1103/physrevlett.100.198301.
- 75. Moriwaki Y. "AlphaFold2 can also predict heterocomplexes. All you have to do is input the two sequences you want to predict and connect them with a long linker." Twitter. https://twitter.com/Ag_smith/status/1417063635000598528.
- 76. Baek M. "Adding a big enough number for 'residue_index' feature is enough to model hetero-complex using AlphaFold (green&cyan: crystal structure / magenta: predicted model w/ residue_index modification). #AlphaFold #alphafold2." Twitter. https://twitter.com/minkbaek/status/1417538291709071362.
- 77. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Adv Neural Inf Process Syst 32.
- 78. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, et al. (2016). TensorFlow: a system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16).
- 79. lucidrains/imagen-pytorch: Implementation of Imagen, Google's Text-to-Image Neural Network, in PyTorch. https://github.com/lucidrains/imagen-pytorch.
- 80. Karras T, Aittala M, Aila T, and Laine S (2022). Elucidating the Design Space of Diffusion-Based Generative Models. 10.48550/arxiv.2206.00364.
- 81. Nichol AQ, and Dhariwal P (2021). Improved Denoising Diffusion Probabilistic Models. In Proceedings of the 38th International Conference on Machine Learning, 8162–8171.
- 82. Ho J, and Salimans T (2022). Classifier-Free Diffusion Guidance. 10.48550/arxiv.2207.12598.
- 83. Kingma DP, and Ba JL (2014). Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015. 10.48550/arxiv.1412.6980.
- 84. Du Z, Su H, Wang W, Ye L, Wei H, Peng Z, Anishchenko I, Baker D, and Yang J (2021). The trRosetta server for fast and accurate protein structure prediction. Nature Protocols 16, 5634–5651. 10.1038/s41596-021-00628-9.
- 85. Schrödinger, LLC (2015). The PyMOL Molecular Graphics System, Version 1.8.
- 86. Rego N, and Koes D (2015). 3Dmol.js: molecular visualization with WebGL. Bioinformatics 31, 1322–1324. 10.1093/bioinformatics/btu829.
Data Availability Statement
Data and code, as well as trained model weights, are available on GitHub at https://github.com/lamm-mit/ProteinDiffusionGenerator or will be provided by the lead contact upon reasonable request.
