Abstract
Structural proteins are the basis of many biomaterials and key construction and functional components of all life. Further, it is well-known that the diversity of proteins’ function relies on their local structures derived from their primary amino acid sequences. Here we report a deep learning model to predict the secondary structure content of proteins directly from primary sequences. Understanding the secondary structure content of proteins is crucial to designing proteins with targeted material functions, especially mechanical properties. Using convolutional and recurrent architectures and natural language models, our deep learning model predicts the content of two essential types of secondary structures, alpha helix and beta sheet. The training data is collected from the Protein Data Bank and contains many existing protein geometries. We find that our model can learn the hidden features as patterns of input sequences that can then be directly related to secondary structure content. The alpha helix and beta sheet content predictions show excellent agreement with training data and newly deposited protein structures that were recently identified and which were not included in the original training set. We further demonstrate the features of the model by a search for de novo protein sequences that optimize max/min alpha-helix/beta-sheet content and compare the predictions with folded models of these sequences based on AlphaFold2. Excellent agreement is found, underscoring that our model has predictive potential for designing proteins with specific secondary structures and could be widely applied to biomedical industries, including protein biomaterial designs and regenerative medicine applications.
Keywords: Deep Learning, protein structure, secondary structure, structural proteins, artificial intelligence, materiomics
1. Introduction
Proteins form the basis for one of the most abundant biomaterials occurring in nature, as well as essential building blocks that constitute fundamental functions of all living organisms [1]. Their function depends mainly on secondary structures [2], which can be challenging to identify using traditional experimental or computational methods [3,4]. Deep learning now provides a novel tool to elucidate the relationships between input sequences and predicted secondary structures, and can hence be used as a means to predict other properties for these proteins or to optimize sequences in a materials-by-design paradigm [5-8]. For example, the secondary structure has been directly linked to mechanical properties as well as rates of degradation of protein materials [9-21].
The use of deep learning for predicting secondary structures has become a vital research area in recent years. With powerful computational capability and the emergence of Graphics Processing Unit (GPU) acceleration (the use of GPU compute power to significantly speed up computational tasks), deep learning models are now capable of finding hidden patterns in complex datasets. Several models have been applied to predict secondary structures from primary sequences, such as support vector machines, feed-forward neural networks [22,23], recurrent neural networks [24,25], and deep convolutional networks [26,27]. However, previous studies have focused on determining the secondary structure of each residue, typically the three general states (Q3) or the finer-grained eight states (Q8) predicted by conventional Dictionary of Secondary Structure of Proteins (DSSP) methods that operate on full molecular geometric models of proteins. The average accuracy for the Q3 model, for instance, is about 80%, with the highest accuracy about 85% [28]. In contrast to these earlier models that focus on residue-level predictions, our approach aims to predict overall features of the folded protein, estimating the amount of alpha-helix vs. beta-sheet content a protein has. Such overarching predictions can be powerful in protein design, where biochemists are interested in screening a very large number of protein designs against a target function.
For biomaterials, the design and control of secondary structures can be used to tune or predict the durability of protein-based implants for tissue repairs or regenerative medicine [11,16,29-34]. For example, increased content of beta sheet secondary structure in silk protein material systems is directly linked to stronger materials with slower rates of enzymatic digestion; thus, more durable materials with a longer lifetime in vivo ([9,10,14-16]). Similarly, collagen-based protein biomaterials are inherently longer lasting in vivo and more mechanically robust when a higher content of alpha helix is present, reflective of organized hierarchical structures, in contrast to the denatured, gelatin-like formats of collagen, where rapid degradation and low mechanics are inherent limitations ([17]). It is also well understood that beta-sheet structures such as those found in amyloid fibrils play an important role in generating very strong and tough biomaterials [35-39].
Furthermore, recent progress in the study of structure-function relationships of synthetic polypeptide polymers paves the way to develop material platforms for drug delivery with controllable, precise, and directed self-assembly/release properties. As an example, the secondary structure content of silk fibroin nanoparticles has been shown to govern the release of small model drug molecules: more crystalline (beta sheet-rich) structures demonstrated a faster release rate [40]. Another example of how control of secondary structure can be applied to biomaterial design is the fine tuning of a material's mechanical properties. The mechanical strength and extensibility of silk proteins are determined by the amounts of β-sheet nanocrystals and non-crystalline amorphous regions within the protein, respectively [41,42]. By combining sequences from different silk types with varying ratios of β-sheet, β-turn and random coil content, the mechanical properties of the resulting materials were expanded [43].
In this paper, we report a deep learning model to predict proteins' overall secondary structure content directly from the sequence of amino acids, focusing on the overall content in a given sequence. Based on training data collected from the Protein Data Bank [44] (125,955 structures; for details see the Materials and Methods section), our model learns the hidden features as patterns of secondary structures. The predictions of alpha helix and beta sheet ratio show good agreement with the training and testing data as well as with newly deposited protein structures. Our model therefore shows potential for the design of fibrous proteins with specific secondary structures and could be widely applied to needs in medicine and, more broadly, wherever protein-based material applications are relevant (e.g., food technologies, environmental monitoring and devices, energy storage and utility, and others) [45,46]. Moreover, due to the known significance of secondary structures for the mechanical properties of proteins, the algorithm may be useful for the design of structural and mechanically relevant proteins [11-13,18-21]. A few examples are provided in this paper.
2. Results
The details of the model and the training and validation process are summarized in the Materials and Methods section. The model is trained on data from existing protein structures (more than 120,000 proteins) and their secondary structure content. A summary of the input data is shown in Figure 2. Figure 3 shows the overall process used here for training and testing the end-to-end deep learning model. Figure 3(a) shows the flowchart of the training process. The input data is the protein sequence, and the deep learning model consists of convolution blocks, bidirectional long short-term memory (LSTM) blocks, and fully connected layers towards the end. The outputs of the deep learning model are two scalars, corresponding to the alpha helix and beta sheet content, respectively. Figure 3(b) depicts the steps taken to test the prediction ability of the deep learning model. The protein sequences are fed through the deep learning model, and the final outputs are the alpha helix and beta sheet content corresponding to each protein sequence, which can then be compared to the ground truth (experimental data or computational predictions using other methods).
Figure 2.
Summary of the data distribution of alpha helix and beta sheet ratios derived from the Protein Data Bank (PDB) database [44]. The histograms show the frequency of different alpha helix and beta sheet ratios. Panel A shows that most sequences have an alpha helix ratio between 30% and 50%, while a small fraction of sequences exceed 80%. Panel B shows that most sequences have a beta sheet ratio below 30%. The overall shapes of the alpha helix and beta sheet distributions are broadly similar. Panels (C) and (D) depict the ratio distributions of each component.
Figure 3:
The overall process for training and testing the end-to-end deep learning model. Panel A shows the flowchart of the training process. The input data is the sequence of the protein, and the structure of the deep learning model includes convolution blocks, bidirectional LSTM blocks, and fully connected layers. The outputs of the deep learning model are the alpha helix and beta sheet content, respectively, characterizing a global protein property. Panel B depicts the steps for testing the prediction ability of the deep learning model (by comparing the predictions against proteins not included in the training). The protein sequences flow through the deep learning model, and the final outputs are the alpha helix and beta sheet ratios corresponding to the input protein sequences. In future work we anticipate coupling the predictive model with an optimization algorithm such as genetic evolution or simulated evolution, to identify new sequences that meet certain design criteria.
Training results
The history of the training process is depicted in Figure 4(a), which shows the MSE loss for the training and validation datasets. Both the training and validation errors decrease significantly and converge toward a mean square error (MSE) loss below 0.005 after 50 epochs. Furthermore, the R2 scores of the alpha-helix and beta-sheet ratios on the training data are 0.99 and 0.96, respectively, close to 1. Hence, the difference between the ground-truth labels and the predicted values is small, and the model is considered reliable.
Prediction capacity
To visualize the prediction ability of the deep learning model, scatter plots are chosen to provide a direct comparison between predictions and ground truth. As can be seen in Figure 4(b-c), the testing and training points roughly overlap. This indicates that the predictions on the test set conform closely to the ground-truth labels. The comparison for beta sheet content does not overlap as closely as the testing points for alpha helix, but still agrees well. Overall, the distributions of both training and testing points are concentrated in the diagonal zone of the diagrams, meaning that the gaps between the labels and the ratios predicted by the model are narrow. The R2 scores of alpha-helix and beta-sheet content on the testing dataset are 0.99 and 0.98, respectively; thus, the model has a strong ability to predict alpha helix and beta sheet ratios.
Figure 4.
Panel A: Mean square error (MSE) loss during the training process. The MSE loss decreased rapidly at the beginning of training and then converged to a low value after 60 epochs, indicating convergence. Panels B-C depict the prediction results of the deep learning model for alpha helix and beta sheet content, respectively, compared against ground truth. Blue dots denote the training data and red dots indicate the testing data. The distribution of regression results is close to the diagonal line, suggesting good agreement.
Predictions and validation
To test the end-to-end model against protein sequences outside our dataset, we chose several large SARS and COVID-19 proteins that were recently deposited in the Protein Data Bank and fed these protein sequences to the deep learning model [47-51]. Table 1 lists several protein sequences from PDB sources together with the alpha helix and beta sheet ratios predicted by our deep learning model. The results demonstrate that the proportions of the various secondary structures calculated using other tools, such as PredictProtein, agree reasonably well with our model and the PDB results.
Table 1:
The table reports an analysis of various different protein structures in the Protein Data Bank (PDB) with a range of sequence lengths and secondary structures. The charts in the right column report the alpha helix and beta sheet ratios (PDB ground truth vs. earlier model PredictProtein [54] vs. our new deep learning model, marked ‘ML’). The table includes newly deposited proteins that were not yet available during training, assessing the predictive power against new experimental data.
Systematic sequence variation for protein design
As a simple case study, we take the sequence of hen egg-white lysozyme (PDB ID: 194L):
KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL
This starting sample sequence is predicted to feature 29.99% alpha helix and 6.7% beta sheet content. The PDB entry features 38% alpha helix and 3.1% beta sheet, which is close. Using a replacement algorithm, we now systematically replace all amino acids of a certain type in the sequence with every other possible amino acid. For example, we start with A and replace all residues of that amino acid type with each of the 20 natural amino acids (in the order AGILPVFWYDERHKSTCMNQ, where A=0 and Q=19 in the associated plots depicted in Figure 5). This results in 20 new sequences: in the first, all A residues are replaced with A (the same sequence as the original, since we are replacing A with A); in the second, all A residues are replaced with G; in the third, all A residues are replaced with I; and so on. This process is repeated in a nested loop for all amino acid types, resulting in 20×20=400 new sequence candidates (noting that the diagonal sequences are identical to the original but included for completeness, so there are actually only 380 new sequences).
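The nested replacement loop described above can be sketched as follows, using the lysozyme sequence and the substitution alphabet given in the text:

```python
# Systematic point-mutation scan: for each residue type (column) and each
# substitution (row), replace every occurrence in the wildtype sequence.
# The alphabet order matches the one used in Figure 5 (A=0 ... Q=19).
ALPHABET = "AGILPVFWYDERHKSTCMNQ"

WILDTYPE = ("KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWW"
            "CNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL")

def generate_variants(sequence, alphabet=ALPHABET):
    """Return the 20x20 grid of candidate sequences keyed by
    (residue type replaced, replacement amino acid)."""
    variants = {}
    for target in alphabet:           # residue type being replaced
        for replacement in alphabet:  # amino acid substituted in
            variants[(target, replacement)] = sequence.replace(target, replacement)
    return variants

variants = generate_variants(WILDTYPE)  # 400 candidates; diagonal = wildtype
```

Each candidate would then be fed through the trained model to obtain its predicted alpha-helix and beta-sheet content.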
Figure 5:
Protein design example, showing the effect of systematic point mutations on alpha-helix and beta-sheet content. Panel A shows the original sequence of lysozyme (PDB ID: 194L) and an image of the molecular structure of the wildtype protein as found in nature. Panel B shows the effect of systematically substituting amino acids of a certain type throughout the entire sequence (left: AH content, right: BS content). In the plot, the substitution numbers range from 0-19 and reflect the order of substitutions AGILPVFWYDERHKSTCMNQ (i.e., A=0…Q=19). Going from top to bottom in each column, the plot indicates how the secondary structure content changes if all A are replaced with A, then G, then I, and so on. Across columns, the residue type being replaced is varied: in the first column all A residues are replaced, in the second column all G residues, then I, and so on. As the plots show, while the protein remains largely alpha-helical for most changes, a few sequence mutations lead to significant changes in the protein secondary structure content. These max/min results are extracted using a min/max algorithm and then folded using AlphaFold2, as depicted in panel C. The changes in secondary structure are clearly visible, confirming the predictions from our model and the optimization scheme used here.
Figure 5 plots the results of how these replacements affect the alpha-helix (AH) and beta-sheet (BS) content. We then analyze this data and identify mutations that lead to the greatest variations, e.g., the highest/lowest alpha-helix or beta-sheet content.
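Once each variant has a predicted alpha-helix and beta-sheet content, extracting the extreme cases is a simple argmin/argmax over the candidate grid. The `predictions` dictionary below is a stand-in holding only the four values quoted in the following paragraphs; in practice it would contain the model's output for all 400 variants:

```python
# Stand-in predictions {(replaced residue, substitution): (AH, BS)} using
# the values quoted in the text; the real dictionary would be filled by
# running the trained model on every variant.
predictions = {
    ("A", "I"): (0.0880, 0.3397),
    ("D", "M"): (0.5673, 0.3500),
    ("D", "L"): (0.5145, 0.0305),
    ("A", "V"): (0.1090, 0.3638),
}

# Identify the mutations giving extreme secondary structure content.
min_ah = min(predictions, key=lambda k: predictions[k][0])  # lowest alpha-helix
max_ah = max(predictions, key=lambda k: predictions[k][0])  # highest alpha-helix
min_bs = min(predictions, key=lambda k: predictions[k][1])  # lowest beta-sheet
max_bs = max(predictions, key=lambda k: predictions[k][1])  # highest beta-sheet
```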
The lowest alpha-helix content is achieved by replacing A residues with I, which yields an alpha-helix content of 8.8% and a beta-sheet content of 33.97%. The sequence is:
KVFGRCELIIIMKRHGLDNYRGYSLGNWVCIIKFESNFNTQITNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSILLSSDITISVNCIKKIVSDGNGMNIWVIWRNRCKGTDVQIWIRGCRL
The highest alpha-helix content is achieved by replacing D residues with M, which yields an alpha-helix content of 56.73% and a beta-sheet content of 35%. The sequence is:
KVFGRCELAAAMKRHGLMNYRGYSLGNWVCAAKFESNFNTQATNRNTMGSTMYGILQINSRWWCNMGRTPGSRNLCNIPCSALLSSMITASVNCAKKIVSMGNGMNAWVAWRNRCKGTMVQAWIRGCRL
The lowest beta-sheet content is achieved by replacing D residues with L, which yields an alpha-helix content of 51.45% and a beta-sheet content of 3.05%. The sequence is:
KVFGRCELAAAMKRHGLLNYRGYSLGNWVCAAKFESNFNTQATNRNTLGSTLYGILQINSRWWCNLGRTPGSRNLCNIPCSALLSSLITASVNCAKKIVSLGNGMNAWVAWRNRCKGTLVQAWIRGCRL
The highest beta-sheet content is achieved by replacing A residues with V, which yields an alpha-helix content of 10.9% and a beta-sheet content of 36.38%. The sequence is:
KVFGRCELVVVMKRHGLDNYRGYSLGNWVCVVKFESNFNTQVTNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSVLLSSDITVSVNCVKKIVSDGNGMNVWVVWRNRCKGTDVQVWIRGCRL
Figure 5C shows folded predictions of these new sequences based on AlphaFold2 [5,52], confirming the changes in secondary structure predicted by our model and obtained from the optimization steps pursued here. As we seek to minimize the alpha-helix content, the resulting proteins feature more beta-sheet content and the alpha-helices disappear. A similar behavior is observed for the case where a maximum beta-sheet content is sought, where alpha-helices disappear as well. When we search for variations with maximum alpha-helix content, the beta-sheets disappear and the protein resembles a closely packed alpha-helix design. Similar behavior is seen when minimizing beta-sheet content. This example demonstrates the utility of our model to provide design suggestions for novel proteins and illustrates that the new designs are anticipated to fold as predicted.
The potential of the method for explorative design, beyond searching only for min/max cases, is demonstrated in Figure 6, which plots the distribution of alpha-helix and beta-sheet content in the lysozyme-based protein variations described in the previous paragraphs. The plot shows that the variations explored in this case study cover a variety of potential target secondary structure content levels that can be exploited in protein design. The mean values of the two peaks align well with the wildtype structure of lysozyme, as expected.
Figure 6:
Distribution of alpha-helix and beta-sheet content in the lysozyme-based protein variation results shown in Figure 5. This plot shows that the variations explored in that case study result in a variety of potential target secondary structure content levels that can be explored in protein design. The mean values of the two peaks align well with the wildtype structure of lysozyme.
Future work could explore experimental synthesis of such protein designs and further analysis of the resulting properties, including mechanical or enzymatic characteristics. We also anticipate that the tool developed here could be used in conjunction with other protein property prediction methods to assess a range of potential target functions.
3. Discussion and Conclusion
Our results show that the end-to-end approach reported here is a powerful tool to relate primary sequences to secondary structure content and, ultimately, function. Our performance assessments of small and large proteins, as shown in Table 1, reveal excellent agreement with experimental data. The additional explorations shown in Figure 5 illustrate how the method can quickly explore vast search spaces and identify new sequence candidates to evolve proteins towards a desired property space. We hence envision that our tool, given its simplicity and effectiveness, can be readily applied within an optimization scheme, for instance using genetic algorithms to simulate evolutionary changes [53].
Generally, the problem solved in protein folding is to predict the folded protein structure from the primary structure. Several existing software packages and structural analysis tools predict protein structures. For comparison and validation, we used one of these prediction tools, PredictProtein [54] by Rostlab et al., as an alignment template for comparison. This approach incorporates over 30 tools to predict protein structures and derive information such as secondary structure content. In the procedure implemented in this prediction server, homologous sequences are first searched in the sequence libraries Swiss-Prot [55] and TrEMBL [56], then multiple sequence alignments are performed, and finally the aligned results are used for structure prediction through deep learning. Our model differs from this approach: we do not need a set of different tools and databases. Instead, we feed sequences directly into our model and predict the content of the secondary protein structure, offering a novel and direct way to predict structural features and other properties from a de novo, end-to-end perspective. This enables a quick search for novel sequences and associated properties, as outlined in Figure 5. As the results depicted in that figure show, the model is capable of predicting the secondary structure content of completely new proteins (for which no templates exist) quite well, as evidenced by comparison with the de novo folding predictions, and can explore a wide array of secondary structure variations (Figure 6). A more detailed analysis of such predictions and their general validity, including experimental validation, is left to future work.
Predicting structural and mechanical features of proteins based on the sequence is an essential step in rational protein design. Utilizing an end-to-end approach to solve this challenge, our tool offers a rapid computational means for design iteration and target assessment. Furthermore, since secondary structures are particularly critical for mechanical functions [11,13,38], the approach reported here offers a novel means to estimate the properties of mechanically relevant proteins. Such predictions are beneficial in the design of materials with improved mechanical properties, changes in degradability in vitro and in vivo, binding efficacy for molecular recognition or diagnostics, and many related applications.
4. Materials and methods
Data statistics
The dataset used for this work was developed from the Protein Data Bank (PDB) [57], following previous work from our group [6]. The database contained 125,955 protein sequences; for each, we recorded the protein ID, length, primary structure, and the secondary structure, which was labeled afterward. Protein primary structures are linear sequences of amino acids written as different letters, which represent the natural amino acids and other residues. The segments of secondary structure and the folded structure can be predicted from this array of amino acids. The distribution of the input data is shown in Figure 2. In the dataset, the average protein sequence length is 644 amino acids; the shortest protein is composed of only 11 amino acids, while the longest features 19,350 amino acids. The standard deviation of the length is 855 amino acids.
The data distributions of alpha-helix and beta-sheet ratios are depicted in Figure 2(b) and (c). The beta sheet content is less than 30% in most sequences, with about 20,000 sequences under 10%. The alpha helix ratio is typically higher than the beta sheet ratio: most sequences feature an alpha helix content between 30% and 50%, but a small fraction of sequences have more than 80% alpha helix content. Of note, the data also show a high proportion of sequences with a low ratio, under 5%, of both alpha helix and beta sheet. Because the database contains many sequences, each with a different length, structural organization, and secondary structure content, we can efficiently analyze the relationships between the primary structure and the secondary structures across diverse sequences.
Data preprocessing
In order to analyze protein sequences effectively, tokenization was adopted in the present study using the Keras [58] Tokenizer module [59] to segment each amino acid. The primary structure of a protein is represented by 20 different single-letter codes for numerical convenience. Tokenization is a common way to pre-process sequences before feeding them into a ML/DL model. The tokenizer separates the entire protein sequence into letters and transforms each letter into a numeric digit for further computation.
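The tokenization step can be illustrated as follows. This is a minimal plain-Python sketch rather than the Keras Tokenizer itself (which assigns indices by letter frequency); the fixed alphabet order and index assignment here are assumptions for illustration only:

```python
# Character-level tokenization: map each one-letter amino acid code to an
# integer token. Index 0 is reserved for padding, as in Keras.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
token_index = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def tokenize(sequence):
    """Transform a primary sequence string into a list of integer tokens."""
    return [token_index[aa] for aa in sequence]

tokens = tokenize("KVFGRC")  # first residues of the lysozyme example
```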
Data labeling
The Protein Data Bank data was labeled using the widely used Define Secondary Structure of Proteins (DSSP) algorithm [60,61] to determine the ratios of alpha helix and beta sheet, providing ground truth data for training. DSSP is a quasi-standard algorithm for classifying the secondary structure conformation of amino acid residues in a protein structure, assigning a secondary structure state directly to each residue. The DSSP algorithm uses atomic-resolution three-dimensional structure coordinates in the PDB format. It relies on hydrogen bond recognition based on an electrostatic definition, together with the calculation of main-chain and side-chain dihedral angles, to obtain the conformational parameters that assign each amino acid residue to a certain secondary structure. The data was labeled before being fed into our model for training or testing.
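As an illustration of this labeling step, a per-residue DSSP assignment string can be reduced to the two content labels used for training. Which DSSP codes are grouped into each class is an assumption in this sketch (H counted as alpha helix, E as extended beta strand):

```python
# Reduce a per-residue DSSP string to (alpha helix ratio, beta sheet ratio).
# Assumption: 'H' is counted as alpha helix and 'E' as beta strand; other
# codes (turns, bends, coil) contribute to neither class.
def secondary_structure_content(dssp_string):
    """Return the fraction of helix and strand residues in a DSSP string."""
    n = len(dssp_string)
    alpha = dssp_string.count("H") / n
    beta = dssp_string.count("E") / n
    return alpha, beta

# Toy example: 4 helix residues, 2 strand residues, 2 coil, out of 8.
alpha, beta = secondary_structure_content("HHHHEE--")
```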
Deep Learning Model
As summarized in Table 2, the deep learning model is an end-to-end regression model to predict protein secondary structure ratios. Both natural language data and protein sequences share the same sequential character. As a result, we adopt an NLP-like architecture [62] to construct our model, which includes an embedding layer, a batch normalization layer, 1D convolutional (Conv1D) layers, bidirectional long short-term memory (BiLSTM) [63] units, a GaussianNoise layer, and several dropout and dense layers.
Table 2:
Overview of the deep neural network model proposed here, featuring an embedding layer to process the amino acid letters as a natural language model, a batch normalization layer, 1D convolutional (Conv1D) layers, Bi-directional Long Short-Term Memory Units (BiLSTM) units, a GaussianNoise layer, and several dropout layers and dense layers for the final prediction of the secondary structure content.
| Layer | Parameter |
|---|---|
| Embedding | 345 |
| Batch Normalization | 60 |
| Convolution 1D | 38528 |
| Convolution 1D | 81984 |
| Bidirectional LSTM | 66048 |
| Bidirectional LSTM | 41216 |
| Dense | 33280 |
| Dropout | 0 |
| Dense | 65664 |
| Dropout | 0 |
| Dense | 8256 |
| Dropout | 0 |
| Dense | 1040 |
| Gaussian Noise | 0 |
| Dense | 34 |
Total parameters: 336,455
Trainable parameters: 336,425
Non-trainable parameters: 30
First, we process the sequence data through a tokenizer into an embedding layer with 15 embedding dimensions to change our data from characters to vectors, as is common in natural language models. Then the data passes through a batch normalization layer to aid numerical convergence. Next, the data flows into two Conv1D layers to capture protein sequential features and associated hierarchical patterning. After passing through the Conv1D layers, the data is fed into two BiLSTM units to extract the sequential relations of the proteins bidirectionally. Next, we use a GaussianNoise layer for regularization. Finally, the data flows through several fully connected and dropout layers to prevent our model from overfitting. The last layer of the model is a dense layer that outputs two scalar values representing the content of the two secondary structures predicted here.
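The layer sizes can be inferred from the parameter counts in Table 2 (e.g., the embedding layer's 345 parameters imply 23 tokens × 15 dimensions, and the two BiLSTM blocks imply 64 and 32 units per direction). The TensorFlow/Keras sketch below reproduces the reported total of 336,455 parameters; the activation functions, dropout rates, noise level, and padded sequence length are illustrative assumptions, not values stated in the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN = 1024  # padded input length (assumed; the parameter counts are length-independent)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    layers.Embedding(23, 15),                                      # 23 tokens x 15 dims = 345 params
    layers.BatchNormalization(),                                   # 4 x 15 = 60 params (30 non-trainable)
    layers.Conv1D(128, 20, padding="same", activation="relu"),     # 38,528 params
    layers.Conv1D(64, 10, padding="same", activation="relu"),      # 81,984 params
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),  # 66,048 params
    layers.Bidirectional(layers.LSTM(32)),                         # 41,216 params
    layers.Dense(512, activation="relu"),                          # 33,280 params
    layers.Dropout(0.2),
    layers.Dense(128, activation="relu"),                          # 65,664 params
    layers.Dropout(0.2),
    layers.Dense(64, activation="relu"),                           # 8,256 params
    layers.Dropout(0.2),
    layers.Dense(16, activation="relu"),                           # 1,040 params
    layers.GaussianNoise(0.1),
    layers.Dense(2, activation="sigmoid"),                         # 34 params: AH and BS content
])
model.compile(optimizer="adam", loss="mse")  # Adam + MSE, per the Training and Testing section
```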
Training and Testing
The dataset is split into 80% training and 20% testing sets. All training and testing processes are conducted on a single GeForce RTX 2080Ti GPU. We implement the deep learning model in TensorFlow [64] with Keras, a high-level application programming interface (API). We set a batch size of 512, adopted the Adam optimizer to train the model, and measure the performance of the model prediction by the mean square error (MSE) loss.
Evaluation and Validation
To quantify the difference between the labels and the ratios predicted by the model, the coefficient of determination (R2 score) was chosen to measure the model's performance. The R2 score, which represents the prediction capacity of a regression model, is determined by:
R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²  (1)

where yᵢ is the label of each data point as the ground truth, ȳ denotes the average of the yᵢ, and ŷᵢ is the predicted value.
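Equation (1) can be implemented directly; a minimal version:

```python
# Coefficient of determination: R^2 = 1 - SS_res / SS_tot.
def r2_score(y_true, y_pred):
    """R^2 between ground-truth labels and model predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((y - mean) ** 2 for y in y_true)               # total sum of squares
    return 1.0 - ss_res / ss_tot
```

Perfect predictions give R² = 1, while always predicting the mean gives R² = 0, which is why the scores of 0.96-0.99 reported above indicate a reliable fit.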
Protein folding
Newly designed sequences are folded using AlphaFold2 [5,52], a state-of-the-art folding method. AMBER is used to relax the predicted structures, and the algorithm is executed using ColabFold [65]. No templates are used, to assess de novo predictions. Instead, the single-sequence MSA mode is used, as it is most suitable for de novo proteins and allows for a relevant comparison of the wildtype vs. mutants. Images of proteins are visualized using PyMol [66].
Supplementary Material
Figure 1.
Hierarchical architecture of proteins and relevant overarching features of a protein molecule, here focusing on alpha-helix and beta-sheet content. In the formation of proteins, the primary structure controls the folding of secondary structures and the overall protein (panel A). In this work we use the primary structure to predict secondary structure content, specifically the alpha helix and beta sheet content, as depicted in panel B, towards engineering functions of proteins and protein materials. One particular example examined in this paper is the optimization of secondary structure content, e.g., maximum/minimum alpha-helix or beta-sheet content.
Acknowledgements:
We acknowledge support from the MIT-IBM AI lab, ONR (N000141612333 and N000141912375), AFOSR (FATE MURI FA9550-15-1-0514), NIH (U01 EB014976), as well as ARO (W911NF1920098). Further support was provided by the Ministry of Science and Technology in Taiwan (MOST 109-2222-E-006-005-MY2 and MOST 109-2224-E-007-003- ). We would like to thank the Google Cloud computing program (GCP) of the MIT-AI lab and the NCREE and NTU joint AI center for providing computational resources. Some of the figures were created with BioRender.com.
Footnotes
Supplementary Data: Five PDB files are attached, featuring the wildtype lysozyme folded by AlphaFold2, and the four variations described in the paper. Link: https://www.dropbox.com/s/dj9pgwnti5qvlgh/SI%20MATERIAL%20-%20PDB%20files.zip?dl=0
References
- [1]. Vepari C, Kaplan DL, Silk as a biomaterial, Prog. Polym. Sci. 32 (2007) 991–1007.
- [2]. Mirabello C, Wallner B, rawMSA: proper Deep Learning makes protein sequence profiles and feature extraction obsolete, (n.d.). 10.1101/394437.
- [3]. Dror RO, Dirks RM, Grossman JP, Xu H, Shaw DE, Biomolecular simulation: a computational microscope for molecular biology, Annu. Rev. Biophys. (2012). 10.1146/annurev-biophys-042910-155245.
- [4]. Paci E, Karplus M, Unfolding proteins by external forces and temperature: The importance of topology and energetics, Proc. Natl. Acad. Sci. U. S. A. 97 (2000) 6521–6526. 10.1073/pnas.100124597.
- [5]. Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, Qin C, Žídek A, Nelson AWR, Bridgland A, Penedones H, Petersen S, Simonyan K, Crossan S, Kohli P, Jones DT, Silver D, Kavukcuoglu K, Hassabis D, Improved protein structure prediction using potentials from deep learning, Nature 577 (2020) 706–710. 10.1038/s41586-019-1923-7.
- [6]. Qin Z, Wu L, Sun H, Huo S, Ma T, Lim E, Chen PY, Marelli B, Buehler MJ, Artificial intelligence method to design and fold alpha-helical structural proteins from the primary amino acid sequence, Extrem. Mech. Lett. 36 (2020) 100652. 10.1016/j.eml.2020.100652.
- [7]. Xu J, Distance-based Protein Folding Powered by Deep Learning, bioRxiv (2018) 465955. 10.1101/465955.
- [8]. Franjou SL, Milazzo M, Yu C-H, Buehler MJ, Sounds interesting: can sonification help us design new proteins?, Expert Rev. Proteomics 16 (2019). 10.1080/14789450.2019.1697236.
- [9]. Kambe Y, Mizoguchi Y, Kuwahara K, Nakaoki T, Hirano Y, Yamaoka T, Beta-sheet content significantly correlates with the biodegradation time of silk fibroin hydrogels showing a wide range of compressive modulus, Polym. Degrad. Stab. 179 (2020) 109240. 10.1016/j.polymdegradstab.2020.109240.
- [10]. Hu Y, Zhang Q, You R, Wang L, Li M, The relationship between secondary structure and biodegradation behavior of silk fibroin scaffolds, Adv. Mater. Sci. Eng. 2012 (2012). 10.1155/2012/185905.
- [11]. Keten S, Buehler MJ, Nanostructure and molecular mechanics of spider dragline silk protein assemblies, J. R. Soc. Interface 7 (2010). 10.1098/rsif.2010.0149.
- [12]. Keten S, Xu Z, Ihle B, Buehler MJ, Nanoconfinement controls stiffness, strength and mechanical toughness of β-sheet crystals in silk, Nat. Mater. 9 (2010) 359–367. 10.1038/nmat2704.
- [12].Keten S, Xu Z, Ihle B, Buehler MJ, Nanoconfinement controls stiffness, strength and mechanical toughness of B-sheet crystals in silk, Nat. Mater 9 (2010) 359–367. 10.1038/nmat2704. [DOI] [PubMed] [Google Scholar]
- [13].Giesa T, Jagadeesan R, Spivak DI, Buehler MJ, Matriarch: A Python Library for Materials Architecture, ACS Biomater. Sci. Eng (2015) 150901085625000. 10.1021/acsbiomaterials.5b00251. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [14].Shang K, Rnjak-Kovacina J, Lin Y, Hayden RS, Tao H, Kaplan DL, Accelerated In Vitro Degradation of Optically Clear Low β -Sheet Silk Films by Enzyme-Mediated Pretreatment, Transl. Vis. Sci. Technol 2 (2013) 2. 10.1167/tvst.2.3.2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [15].Drnovšek N, Kocen R, Gantar A, Drobnič-Košorok M, Leonardi A, Križaj I, Rečnik A, Novak S, Size of silk fibroin β-sheet domains affected by Ca2+, J. Mater. Chem. B 4 (2016) 6597–6608. 10.1039/c6tb01101b. [DOI] [PubMed] [Google Scholar]
- [16].Dinjaski N, Ebrahimi D, Qin Z, Giordano JEM, Ling S, Buehler MJ, Kaplan DL, Predicting rates of in vivo degradation of recombinant spider silk proteins, J. Tissue Eng. Regen. Med 12 (2018) e97–e105. 10.1002/term.2380. [DOI] [PubMed] [Google Scholar]
- [17].Rabotyagova OS, Cebe P, Kaplan DL, Collagen structural hierarchy and susceptibility to degradation by ultraviolet radiation, Mater. Sci. Eng. C 28 (2008) 1420–1429. 10.1016/j.msec.2008.03.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [18].Sikora M, Sułkowska JI, Cieplak M, Mechanical strength of 17 134 model proteins and cysteine slipknots, PLoS Comput. Biol 5 (2009) 1000547. 10.1371/journal.pcbi.1000547. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [19].Perticaroli S, Nickels JD, Ehlers G, O’Neill H, Zhang Q, Sokolov AP, Secondary structure and rigidity in model proteins, Soft Matter. 9 (2013) 9548–9556. 10.1039/c3sm50807b. [DOI] [PubMed] [Google Scholar]
- [20].Hsu CC, Buehler MJ, Tarakanova A, The Order-Disorder Continuum: Linking Predictions of Protein Structure and Disorder through Molecular Simulation, Sci. Rep 10 (2020). 10.1038/s41598-020-58868-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [21].Qin Z, Buehler MJ, Cooperative deformation of hydrogen bonds in beta-strands and beta-sheet nanocrystals, Phys. Rev. E - Stat. Nonlinear, Soft Matter Phys 82 (2010). 10.1103/PhysRevE.82.061906. [DOI] [PubMed] [Google Scholar]
- [22].Di lena P, Nagata K, Baldi P, Deep architectures for protein contact map prediction, Bioinformatics. 28 (2012) 2449–2457. 10.1093/bioinformatics/bts475. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [23].Zhang B, Li J, Lü Q, Prediction of 8-state protein secondary structures by a novel deep learning architecture, BMC Bioinformatics. 19 (2018) 293. 10.1186/s12859-018-2280-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [24].Pollastri G, McLysaght A, Porter: A new, accurate server for protein secondary structure prediction, Bioinformatics. 21 (2005) 1719–1720. 10.1093/bioinformatics/bti203. [DOI] [PubMed] [Google Scholar]
- [25].Pollastri S, Perchiazzi N, Lezzerini M, Plaisier JR, Cavallo A, Dalconi MC, Gandolfi NB, Gualtieri AF, The crystal structure of mineral fibres. 1. Chrysotile, Period. Di Mineral 85 (2016) 249–259. 10.2451/2016PM655. [DOI] [Google Scholar]
- [26].Mirabello C, Pollastri G, Porter, PaleAle 4.0: High-accuracy prediction of protein secondary structure and relative solvent accessibility, Bioinformatics. 29 (2013) 2056–2058. 10.1093/bioinformatics/btt344. [DOI] [PubMed] [Google Scholar]
- [27].Kaleel M, Torrisi M, Mooney C, Pollastri G, PaleAle 5.0: prediction of protein relative solvent accessibility by deep learning, Amino Acids. 51 (2019) 1289–1296. 10.1007/s00726-019-02767-6. [DOI] [PubMed] [Google Scholar]
- [28].Torrisi M, Pollastri G, Le Q, Deep learning methods in protein structure prediction, Comput. Struct. Biotechnol. J 18 (2020) 1301–1310. 10.1016/j.csbj.2019.12.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [29].Xiao S, Xiao S, Gräter F, Dissecting the structural determinants for the difference in mechanical stability of silk and amyloid beta-sheet stacks, Phys. Chem. Chem. Phys 15 (2013) 8765–8771. 10.1039/C3CP00067B. [DOI] [PubMed] [Google Scholar]
- [30].Keten S, Buehler MJ, Geometric confinement governs the rupture strength of H-bond assemblies at a critical length scale, in: Mater. Res. Soc. Symp. Proc, 2008. [DOI] [PubMed] [Google Scholar]
- [31].Ackbarow T, Keten S, Buehler MJ, A multi-timescale strength model of alpha-helical protein domains, J. Phys. Condens. Matter. 21 (2009). 10.1088/0953-8984/21/3/035111. [DOI] [PubMed] [Google Scholar]
- [32].Keten S, Rodriguez Alvarado JF, M?ft? S, Buehler MJ, Nanomechanical characterization of the triple ?-helix domain in the cell puncture needle of bacteriophage T4 virus, Cell. Mol. Bioeng 2 (2009). 10.1007/s12195-009-0047-9. [DOI] [Google Scholar]
- [33].Buehler MJ, Yung YC, Deformation and failure of protein materials in physiologically extreme conditions and disease, Nat. Mater (2009). 10.1038/nmat2387. [DOI] [PubMed] [Google Scholar]
- [34].Buehler MJ, Nanomechanical sonification of the 2019-nCoV coronavirus spike protein through a materiomusical approach, (2020). http://arxiv.org/abs/2003.14258 (accessed April 16, 2020). [Google Scholar]
- [35].Hartl FU, Protein misfolding diseases, Annu. Rev. Biochem 86 (2017) 21–26. 10.1146/ANNUREV-BIOCHEM-061516-044518. [DOI] [PubMed] [Google Scholar]
- [36].Knowles T, … V. M-N reviews M., undefined 2014, The amyloid state and its association with protein misfolding diseases, Nature.Com. (n.d.). [DOI] [PubMed] [Google Scholar]
- [37].Knowles TPJ, Buehler MJ, Nanomechanics of functional and pathological amyloid materials, Nat. Nanotechnol 6 (2011). 10.1038/nnano.2011.102. [DOI] [PubMed] [Google Scholar]
- [38].Solar M, Buehler MJ, Comparative analysis of nanomechanics of protein filaments under lateral loading, Nanoscale. 4 (2012). 10.1039/c1nr11260k. [DOI] [PubMed] [Google Scholar]
- [39].Hu X, Kaplan D, Cebe P, Dynamic protein-water relationships during β-sheet formation, Macromolecules. (2008). 10.1021/ma071551d. [DOI] [Google Scholar]
- [40].Lammel A, Hu X, Park S-H, Kaplan DL, Scheibel T, Controlling silk fibroin particle features for drug delivery, (n.d.). 10.1016/j.biomaterials.2010.02.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [41].Hayashi CY, Shipley NH, Lewis RV, Hypotheses that correlate the sequence, structure, and mechanical properties of spider silk proteins, in: Int. J. Biol. Macromol, Elsevier, 1999: pp. 271–275. 10.1016/S0141-8130(98)00089-0. [DOI] [PubMed] [Google Scholar]
- [42].Keten S, Buehler MJ, Nanostructure and molecular mechanics of spider dragline silk protein assemblies, J. R. Soc. Interface 7 (2010) 1709–1721. 10.1098/rsif.2010.0149. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [43].Jaleel Z, Zhou S, Martín-Moldes Z, Baugh LM, Yeh J, Dinjaski N, Brown LT, Garb JE, Kaplan DL, Expanding canonical spider silk properties through a DNA combinatorial approach, Materials (Basel). 13 (2020) 3596. 10.3390/MA13163596. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [44].Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Feng Z, Gilliland GL, Iype L, Jain S, Fagan P, Marvin J, Padilla D, Ravichandran V, Schneider B, Thanki N, Weissig H, Westbrook JD, Zardecki C, The protein data bank, Acta Crystallogr. Sect. D Biol. Crystallogr 58 (2002) 899–907. 10.1107/S0907444902003451. [DOI] [PubMed] [Google Scholar]
- [45].Buehler MJ, Yung YC, Deformation and failure of protein materials in physiologically extreme conditions and disease, Nat. Mater 8 (2009). 10.1038/nmat2387. [DOI] [PubMed] [Google Scholar]
- [46].Qin Z, Cranford S, Ackbarow T, Buehler MJ, Robustness-strength performance of hierarchical alpha-helical protein filaments, Int. J. Appl. Mech 1 (2009). 10.1142/S1758825109000058. [DOI] [Google Scholar]
- [47].Deng Y, Liu J, Zheng Q, Yong W, Lu M, Structures and polymorphic interactions of two heptad-repeat regions of the SARS virus S2 protein, Structure. 14 (2006) 889–899. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [48].Jin Z, Du X, Xu Y, Deng Y, Liu M, Zhao Y, Zhang B, Li X, Zhang L, Peng C, Structure of M pro from SARS-CoV-2 and discovery of its inhibitors, Nature. 582 (2020) 289–293. [DOI] [PubMed] [Google Scholar]
- [49].Kirchdoerfer RN, Ward AB, Structure of the SARS-CoV nsp12 polymerase bound to nsp7 and nsp8 co-factors. Nat Commun 10: 2342, (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [50].Liu L, Wang P, Nair MS, Yu J, Rapp M, Wang Q, Luo Y, Chan JF-W, Sahi V, Figueroa A, Potent neutralizing antibodies against multiple epitopes on SARS-CoV-2 spike, Nature. 584 (2020) 450–456. [DOI] [PubMed] [Google Scholar]
- [51].Walls AC, Park Y-J, Tortorici MA, Wall A, McGuire AT, Veesler D, Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein, Cell. 181 (2020) 281–292. e6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [52].Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D, Highly accurate protein structure prediction with AlphaFold, Nature. (2021) 1–12. 10.1038/s41586-021-03819-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [53].Yu C-H, Qin Z, Buehler MJ, Artificial intelligence design algorithm for nanocomposites optimized for shear crack resistance, Nano Futur. 3 (2019). 10.1088/2399-1984/ab36f0. [DOI] [Google Scholar]
- [54].Yachdav G, Kloppmann E, Kajan L, Hecht M, Goldberg T, Hamp T, Hönigschmid P, Schafferhans A, Roos M, Bernhofer M, Richter L, Ashkenazy H, Punta M, Schlessinger A, Bromberg Y, Schneider R, Vriend G, Sander C, Ben-Tal N, Rost B, PredictProtein - An open resource for online prediction of protein structural and functional features, Nucleic Acids Res. 42 (2014). 10.1093/nar/gku366. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [55].Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O’Donovan C, Phan I, Pilbout S, Schneider M, The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003, Nucleic Acids Res. 31 (2003) 365–370. 10.1093/nar/gkg095. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [56].Magrane M, Consortium UP, UniProt Knowledgebase: A hub of integrated protein data, Database. 2011 (2011). 10.1093/database/bar009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [57].Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE, The Protein Data Bank, Nucleic Acids Res. 28 (2000) 235–242. 10.1093/nar/28.1.235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [58].Ketkar N, Ketkar N, Introduction to Keras, in: Deep Learn. with Python, Apress, 2017: pp. 97–111. 10.1007/978-1-4842-2766-4_7. [DOI] [Google Scholar]
- [59].Kudo T, Richardson J, SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, in: Proc. 2018 Conf. Empir. Methods Nat. Lang. Process. Syst. Demonstr., Association for Computational Linguistics, Stroudsburg, PA, USA, 2018: pp. 66–71. 10.18653/v1/D18-2012. [DOI] [Google Scholar]
- [60].Joosten RP, Te Beek TAH, Krieger E, Hekkelman ML, Hooft RWW, Schneider R, Sander C, Vriend G, A series of PDB related databases for everyday needs, Nucleic Acids Res. 39 (2010) D411–D419. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [61].Kabsch W, Sander C, Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features, Biopolymers. 22 (1983) 2577–2637. 10.1002/bip.360221211. [DOI] [PubMed] [Google Scholar]
- [62].Oore S, Simon I, Dieleman S, Eck D, Simonyan K, This time with feeling: learning expressive musical performance, Neural Comput. Appl 32 (2020) 955–967. 10.1007/s00521-018-3758-9. [DOI] [Google Scholar]
- [63].Zhou P, Shi W, Tian J, Qi Z, Li B, Hao H, Xu B, Attention-based bidirectional long short-term memory networks for relation classification, in: Proc. 54th Annu. Meet. Assoc. Comput. Linguist. (Volume 2 Short Pap., 2016: pp. 207–212. [Google Scholar]
- [64].Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, Tensorflow: A system for large-scale machine learning, in: 12th {USENIX} Symp. Oper. Syst. Des. Implement. ({OSDI} 16), 2016: pp. 265–283. [Google Scholar]
- [65].Mirdita M, Ovchinnikov S, Steinegger M, ColabFold - Making protein folding accessible to all, BioRxiv. (2021) 2021.08.15.456425. 10.1101/2021.08.15.456425. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [66].Schrödinger LLC, The PyMOL molecular graphics system, version 1.8, (2015). [Google Scholar]