Author manuscript; available in PMC: 2023 Jun 1.
Published in final edited form as: Nat Chem. 2022 Oct 31;14(12):1427–1435. doi: 10.1038/s41557-022-01055-3

Machine learning overcomes human bias in the discovery of self-assembling peptides

Rohit Batra 1,2, Troy D Loeffler 1,3, Henry Chan 1,3, Srilok Srinivasan 1, Honggang Cui 4, Ivan V Korendovych 5, Vikas Nanda 6, Liam C Palmer 7, Lee A Solomon 8, H Christopher Fry 1, Subramanian K R S Sankaranarayanan 1,3
PMCID: PMC9844539  NIHMSID: NIHMS1859483  PMID: 36316409

Abstract

Peptide materials have a wide array of functions, from tissue engineering and surface coatings to catalysis and sensing. Tuning the sequence of amino acids that comprise the peptide modulates peptide functionality, but a small increase in sequence length leads to a dramatic increase in the number of peptide candidates. Traditionally, peptide design is guided by human expertise and intuition and typically yields fewer than ten peptides per study, but these approaches are not easily scalable and are susceptible to human bias. Here we introduce a machine learning workflow, AI-expert, that combines Monte Carlo tree search and random forest with molecular dynamics simulations to develop a fully autonomous computational search engine to discover peptide sequences with high potential for self-assembly. We demonstrate that AI-expert can efficiently search large spaces of tripeptides and pentapeptides. AI-expert predicts self-assembly on par with or better than our human experts and suggests several non-intuitive sequences with high self-assembly propensity, outlining its potential to overcome human bias and accelerate peptide discovery.


Nature generates innumerable functional materials in living systems in the form of proteins and their supramolecular assemblies. Examples include collagen (extended triple helices forming the fibrous base component of skin, hair and nails)1,2, silk proteins3 and light-harvesting reaction centres4. Investigations into such naturally occurring supramolecular assemblies have inspired the design of novel biomolecular materials5–9. For example, the formation of a plaque-nucleating region in neurodegenerative diseases (such as Alzheimer’s) has been attributed to (but not limited to10) the presence of a diphenylalanine (FF) amino-acid sequence in the amyloid beta peptide11. As a result, FF-containing peptide sequences have been explored in sensing12, as biocompatible implants13, semiconductors14 and piezoelectrics15,16, and for drug release17. Similarly, other works have employed self-assembling small peptides (<10 amino acids) for various chemical and biological applications, such as catalysis, light harvesting, scaffold hydrogels and conductivity18–23. Importantly, in all cases, the emergent functionality of a peptide is an outcome of its self-assembled architecture, with the unique property being lost when there is no assembly. The self-assembled architecture and thus its functionality depend strongly on the amino-acid sequence, and the derivation of this has traditionally relied on the examination of natural sequences, human expertise, experience and intuition. Thus, researchers rationally design novel peptide sequences to either replicate, tailor or improve natural properties or investigate emerging functionalities as a result of their assembled structure6,24,25.

Traditional approaches of peptide design utilize hydrophobicity scales determined from the partitioning of amino acids into hydrophilic or hydrophobic environments (for example, the Wimley–White scale26,27) and secondary structure propensity tables obtained from the occurrence of any given amino acid in an α-helix or β-sheet fold (for example, Chou–Fasman28). This often introduces a bias towards high β-sheet propensity amino acids with moderate to high hydrophobicity (for example, valine, isoleucine and phenylalanine) in the design of supramolecular peptides. Another source of bias comes from the commonly employed patterning strategies, such as pnnnp or npnpn (p = polar, n = non-polar), which reliably lead to β-sheet-rich nanostructured materials. The principal reason why designers knowingly resort to such (biased) approaches is that the design space of peptides can become exorbitantly large: the number of possible peptide sequences equals $20^n$, where n is the number of amino acids in the peptide chain and the factor 20 arises from the library of commonly available amino acids29, as shown in Fig. 1. Although short sequences such as tripeptides (n = 3) with 8,000 combinations are somewhat tractable, the move to pentapeptides (n = 5) opens up 3.2 million possibilities. This not only precludes any rigorous experimental study of the complete peptide design space, but also suggests that a large fraction of the possible peptide sequences remain unexplored. Although a brute-force computational search based on coarse-grained (CG) molecular dynamics (MD) simulations provides a pathway to overcome this search bias and has been notably successful in identifying several self-assembling and hydrogelating tripeptides (from a total of 8,000 cases)29, it cannot be extended to larger sequence lengths (n > 3) owing to the high computational costs.
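To make the combinatorial growth concrete, the following minimal sketch evaluates the size of the sequence space for a few chain lengths. The function name `n_sequences` is ours and does not come from the original work; the only input taken from the text is the 20-letter amino-acid library.

```python
# Size of the peptide sequence space, 20**n, for a chain of n residues.
N_AMINO_ACIDS = 20  # library of commonly available amino acids (from the text)


def n_sequences(length: int) -> int:
    """Number of distinct peptide sequences of the given length."""
    return N_AMINO_ACIDS ** length


for n in (2, 3, 5):
    print(f"n = {n}: {n_sequences(n):,} sequences")
```

Adding just two residues, from n = 3 to n = 5, multiplies the number of candidates by 400, which is why an exhaustive simulation campaign that is feasible for tripeptides does not carry over to pentapeptides.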

Fig. 1 | Workflow adopted to discover self-assembling pentapeptides using inputs from human experts and the developed AI-expert.

The search space of peptides grows drastically with its sequence length owing to the presence of 20 amino acids. Although 8,000 possible tripeptides can be explored (computationally) for assembly using a brute-force approach, the space of 3.2 million (M) pentapeptides is intractable. The human experts use rational design approaches, such as hydrophobicity scales, charge balance, patterning (npnpn: n, non-polar; p, polar) and their own individual experiences to design self-assembling pentapeptides. Six of the eleven synthesized pentapeptides suggested by the six different human experts were found to aggregate, although the proposed sequences suggested human bias toward V, F, K and E amino acids. By contrast, the developed AI-expert combines Monte Carlo tree search (MCTS) (A), MD simulations (B) and a peptide structure based scoring function (C) to efficiently search for self-assembling peptides. Six of nine synthesized peptides from the AI-expert were found to aggregate. Beyond being able to recover some intuitive sequences (FFEKF), the AI-expert suggested some novel/unusual sequences (SYCGY) involving diverse amino acids (RWLDY), reflecting its advantage of overcoming human bias. Molecular representations and AFM images for a few promising pentapeptides from both categories are shown.

A major challenge in peptide design lies in efficiently navigating this elaborate search space of amino-acid sequences and proposing a subset with the most promising possibilities. Artificial intelligence (AI) and machine learning-based strategies make this possible by balancing the exploration-versus-exploitation tradeoff30–32. In this article, we introduce an ‘AI-expert’ that combines recent advances in decision trees (Monte Carlo tree search (MCTS) algorithm33,34) with CG MD simulations to identify pentapeptides with high aggregation propensities (APs) in water (Fig. 1). Operating in an autonomous manner, AI-expert utilizes the MCTS algorithm to make an informed decision on which peptide sequence(s) to evaluate next using the MD simulations, with the score of the modelled peptide(s) provided as feedback to guide future searches. In contrast to brute-force or grid-based approaches, in which every possibility is investigated, the MCTS streamlines the search by focusing on the most promising areas of the search space, that is, with high-scoring (exploitation) and diverse (exploration) sequences. An additional performance boost to the MCTS algorithm is provided by introducing a novel uniqueness function within the MCTS objective function, and by utilizing a random forest (RF)-based surrogate model to bypass some of the expensive MD simulation evaluations. Inspired by past work on the design of di- and tripeptides29,35, our scoring system consists of the solvent-accessible surface areas and the Wimley–White scale to respectively quantify the computational AP and hydrophobicity of a peptide. Although the former is based on the structure of a peptide obtained only after time-intensive MD simulations, the latter is computationally cost-effective and can be evaluated instantly given only the peptide sequence.

Of the 3.2 million possible pentapeptides, AI-expert sampled and evaluated roughly 6,600 cases using computations (MD simulations). The top 100 pentapeptides identified in this way were further modelled for longer timescales (200 ns) using more rigorous MD simulation parameters to improve the AP estimates. Nine top-scoring AI sequences were screened for actual synthesis and experimental investigation, and six of these were found to aggregate based on light scattering and atomic force microscopy (AFM) measurements. In comparison, six of eleven pentapeptide sequences suggested by the human experts (peptide synthesis experts with many years of experience) were found to aggregate. We discuss these findings in the context of prevalent bias (and thus similarity) in the sequences proposed by the human experts, the ability of the AI-expert to recover existing materials knowledge by reproducing sequences similar to the human experts and, most importantly, the power of AI-expert to overcome human bias by discovering previously unknown and completely non-intuitive self-assembling peptide sequences (for example, SYCGY) in an efficient manner. We provide our perspectives and propose a path forward where the performance of the AI-expert can be enhanced by fusing information from parallel computations and experiments, and by improving the MCTS scoring function to include other structural factors from existing protein databases.

Results and discussion

AI-expert for peptide discovery

We have developed a workflow, henceforth referred to as the AI-expert, to discover self-assembling peptides. The workflow consists of a search algorithm (that is, MCTS) interfaced with the GROMACS MD simulation software36,37 to model the structure–functionality relationship of a peptide, given only its amino-acid sequence. The role of MCTS is to intelligently and efficiently generate peptide sequences, sampled from the overall search space, that have high self-assembling scores. The MD simulations via GROMACS provide a relatively inexpensive method to estimate the AP of the peptide sequence proposed by the MCTS algorithm, and thus provide feedback to improve the quality of the peptide search. It should be noted that AI-expert autonomously switches between the two stages of peptide generation (MCTS) and evaluation (MD simulations) without any human intervention. Furthermore, in contrast to supervised learning algorithms, AI-expert generates its training data on-the-fly, which helps it to avoid any form of bias arising from past databases.

MCTS is a powerful algorithm for planning, optimization and learning tasks because of its generality, low computational requirements and a theoretical bound on the exploration-versus-exploitation tradeoff34,38,39. It has been particularly successful when applied to problems involving an extremely large search space40–42, making it the model of choice for this work. Its details are covered in the Methods, but we briefly note that it searches in a tree-structured fashion where every node (or tree leaf) contains a unique peptide sequence (for example, VKVKV) and its associated score (Fig. 1). Moreover, these nodes contain connections in a special configuration such that a parent node is connected to several child nodes with slightly different peptide sequences. This gives a meaningful structure to the overall tree, with the high-scoring child node generally belonging to the tree branch that contains other relatively high-scoring parent nodes. To advance the search, MCTS utilizes a tree policy and a rollout policy. The former selects the most promising node, and the latter samples the nearby space (using Monte Carlo trials) of the selected node by introducing small perturbations, referred to as rollouts. The upper confidence bound (UCB) for parameters43 is a popular choice of tree policy given by

$$\mathrm{UCB}(\theta_j) = \min\left(r_1, r_2, \ldots, r_{n_i}\right) + c \, f(\theta_j) \sqrt{\frac{\ln N_i}{n_i}} \qquad (1)$$

where $\theta_j$ represents node j in the MCTS structure, r denotes the score (or reward) of a given rollout, c (> 0) is the exploration constant, $n_i$ is the number of rollout samples taken by node $\theta_j$ and all of its child nodes, and $N_i$ is the corresponding count for the parent node of $\theta_j$. In this work, the scoring (or reward) function is chosen to balance the AP and the hydrophobicity of the peptides using the form $r_i = (\mathrm{AP}'_i)^{\alpha} \times (\log P'_i)^{\beta}$, where the prime symbol denotes normalized values, and α and β are the weighting coefficients. $f(\theta_j)$ is the uniqueness function, specifically introduced in this work to drive the search towards diverse sequences. The policy in equation (1) tries to balance the search between those nodes that have either returned the maximum score (left term) or have not been explored enough (right term). In contrast, the rollout policy introduces random, but controlled, perturbations (from a node) to sample new sequences.
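The two ingredients of the tree policy described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names are ours, the normalized AP and logP values are assumed to be non-negative, and the exploration term is assumed to take the standard UCB square-root form, $c\,f(\theta_j)\sqrt{\ln N_i / n_i}$.

```python
import math


def reward(ap_norm: float, logp_norm: float, alpha: float, beta: float) -> float:
    """Score r = AP'^alpha * logP'^beta.

    Assumption: both normalized inputs are non-negative, so that a
    fractional exponent such as beta = 0.5 stays real-valued.
    """
    if ap_norm < 0 or logp_norm < 0:
        raise ValueError("normalized AP and logP are assumed non-negative")
    return (ap_norm ** alpha) * (logp_norm ** beta)


def exploration_term(c: float, uniqueness: float, n_parent: int, n_node: int) -> float:
    """Exploration bonus c * f * sqrt(ln(N_i) / n_i) (assumed standard UCB form)."""
    if c <= 0 or n_parent < 1 or n_node < 1:
        raise ValueError("expected c > 0 and positive visit counts")
    return c * uniqueness * math.sqrt(math.log(n_parent) / n_node)


# Reward with alpha = 2 and beta = 0.5 for made-up normalized inputs.
print(reward(0.5, 0.25, alpha=2, beta=0.5))  # 0.125

# A rarely sampled node receives a larger bonus than a heavily sampled one.
print(exploration_term(1.0, 1.0, n_parent=100, n_node=4))
print(exploration_term(1.0, 1.0, n_parent=100, n_node=25))
```

Because the uniqueness value multiplies the whole exploration bonus, a node whose peptide resembles many others already in the tree (small f) is revisited less eagerly than an equally sampled node holding an unusual sequence.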

We introduce two modifications to improve the efficiency of MCTS: the uniqueness function $f(\theta_j)$ and an RF-model-guided rollout policy. The uniqueness function enhances the effect of the exploration term in equation (1), further motivating MCTS to select those nodes that represent diverse peptides. This pushes the search into new regions (or diverse sequences) that have not been explored before. For this work, we used Morgan circular fingerprints44 to numerically represent the peptides, followed by the Dice similarity measure to compute the uniqueness of a peptide in relation to others in the MCTS structure (Methods). The second important modification is the use of a surrogate RF model to quickly predict the AP of a peptide given its sequence. This eliminates the need to perform computationally expensive MD simulations during rollouts, especially for cases that are predicted to have very low AP values. However, care should be taken to only partly replace the MD simulations with the RF model, as the surrogate model is only approximate and could miss out on promising cases that are different from the data used for its training. Accordingly, we use the RF model here to only guide the rollout policy such that half of the rollouts correspond to cases that are predicted to have a high AP, and the remaining half come from random perturbations, as in the traditional MCTS setting (Methods). For both cases, the AP value used in equation (1) is obtained only after actual MD simulations. It should be noted that the RF model is trained in an online fashion, with the RF model being regularly updated as more training data from the MD simulations become available during the MCTS run. Details of the input features and training parameters of the RF model are provided in the Methods.
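The half-guided, half-random rollout split can be illustrated with a short sketch. Several details here are assumptions made for illustration only: perturbations are taken to be single-residue substitutions, `toy_surrogate` is a placeholder standing in for the trained RF model, and the helper names are ours.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues


def point_mutants(seq: str) -> list[str]:
    """All sequences that differ from `seq` at exactly one position."""
    return [
        seq[:i] + aa + seq[i + 1:]
        for i in range(len(seq))
        for aa in AMINO_ACIDS
        if aa != seq[i]
    ]


def select_rollouts(seq, surrogate, n_rollouts, rng):
    """Pick half of the rollouts by surrogate ranking and half at random."""
    candidates = point_mutants(seq)
    n_guided = n_rollouts // 2
    # Guided half: candidates the surrogate predicts to have the highest AP.
    guided = sorted(candidates, key=surrogate, reverse=True)[:n_guided]
    # Random half: uniform draw from the candidates not already chosen.
    chosen = set(guided)
    pool = [c for c in candidates if c not in chosen]
    randoms = rng.sample(pool, n_rollouts - n_guided)
    return guided, randoms


def toy_surrogate(seq: str) -> float:
    """Placeholder for the RF model (assumption): counts aromatic residues."""
    return float(sum(seq.count(aa) for aa in "FWY"))


guided, randoms = select_rollouts("VKVKV", toy_surrogate, 4, random.Random(0))
print("guided:", guided)
print("random:", randoms)
```

In this sketch the surrogate only decides which sequences are proposed. As stated in the text, the AP that enters equation (1) would still come from an MD simulation of every selected sequence, so an inaccurate surrogate costs efficiency rather than correctness.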

Validation for tripeptides

We first consider the space of tripeptides as a demonstration of the ability of AI-expert to accelerate the search for self-assembling peptides. There are two reasons that dictate this choice. First, tripeptides have a computationally manageable space of 8,000 (= $20^3$) sequences. Second, a rigorous past study already exists on the use of MD simulations to explore self-assembling tripeptides using a brute-force approach29. In fact, the previous works on di- and tripeptides confirm that MD simulations based on the MARTINI CG force field45–47 are reliable enough to predict self-assembly when coupled with the metrics AP and hydrophobicity29,35.

To measure the performance gain of AI-expert over a brute-force approach, we first performed MD simulations for all 8,000 cases and sorted them based on their score, $r_{\mathrm{tri}} = (\mathrm{AP}')^{2} \times \log P'$ (α = 2, β = 1, as in past work). See Methods for details on the MD simulations and the computation methodology for AP and logP. Some of the top-scoring tripeptides are shown in Fig. 2a and Extended Data Fig. 1. Although the specific rank order of our results may differ slightly from the previous brute-force investigation29, the overall AP versus hydrophobicity trends in both studies match well. The differences can be traced to the slight variations in the AP computations introduced due to either the stochasticity of the MD simulations or the software choice for the AP computations (Supplementary Information). Nonetheless, the trends in the identified top-scoring tripeptides are similar. For example, YKD, EWK and KYE all follow the general trend of charge-balanced peptides (K/D or K/E) plus an aromatic residue, whereas KYY and SYY are amphiphilic peptides displaying a pair of aromatic residues. We also note that the specific rank order of the tripeptides does not influence the conclusions made in this study regarding the search efficiency of AI-expert, as discussed next.

Fig. 2 | Performance comparison of the different search strategies for the space of tripeptides.

a, Molecular representation of example top-scoring tripeptides along with their computed scores. The numbers in parentheses indicate (left) the aggregation propensity (AP) and (right) the hydrophobicity (logP) values. Amino-acid (AA) colour coding: acidic AAs, red; basic AAs, blue; polar AAs, yellow; aromatic AAs, orange. b, Comparison of the number of trials needed to search the highest-scoring tripeptide from the complete space of 8,000 cases. AI-expert utilizing MCTS or MCTS + RF search strategies on average takes a substantially lower number of trials in comparison to a random or an exhaustive search to find the highest-scoring tripeptide (SYY). c, Comparison of the score of peptide sequences generated using a random, MCTS or MCTS + RF search strategy. Solid lines denote the respective normalized density. The AI-expert with MCTS + RF scheme is most efficient in identifying high-scoring peptides, as a larger fraction of its generated peptide population have high scores.

Figure 2b compares the time taken by the different methods to identify the highest-scoring tripeptide (SYY). It can be seen that the AI-expert that uses RF-boosted MCTS (labelled MCTS + RF) on average takes a substantially lower number of trials to identify the highest-scoring SYY sequence compared to a purely random rollout policy-based MCTS. This suggests that the developed rollout policy utilizing the RF model indeed helps AI-expert to efficiently identify high-scoring peptides. Furthermore, AI-expert, with or without the RF model, performs substantially better than a random or a brute-force search, requiring ~4,000 and 8,000 trials, respectively. Similarly, Fig. 2c compares the quality of the peptide population generated using a random search or by the AI-expert utilizing MCTS or MCTS + RF scheme. It is evident that the MCTS + RF scheme samples high-scoring tripeptides most frequently, followed by the MCTS and then the random search. Overall, these results validate that AI-expert can efficiently identify high-scoring peptides without resorting to a time-intensive brute-force search.
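The ~4,000-trial figure for a random search is what a simple expectation gives. If sequences are drawn uniformly without repetition from N = 8,000 candidates (a sampling assumption that the text does not state explicitly), the single best sequence is equally likely to appear at any position in the draw order, so the mean number of trials is (N + 1)/2. The sketch below checks this in closed form and with a small simulation; the function names are ours.

```python
import random


def expected_trials(n_candidates: int) -> float:
    """Mean draws needed to hit one target when sampling without replacement."""
    return (n_candidates + 1) / 2


def simulate_trials(n_candidates: int, n_runs: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    # The position of the target in a random draw order is uniform on 1..N.
    total = sum(rng.randint(1, n_candidates) for _ in range(n_runs))
    return total / n_runs


print(expected_trials(8000))         # 4000.5
print(simulate_trials(8000, 20000))  # close to 4000.5
```

An exhaustive search, by contrast, always pays for all 8,000 evaluations, so any strategy that finds SYY in substantially fewer than ~4,000 trials on average is doing better than chance.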

Screening of pentapeptides

Having validated the efficiency of AI-expert for tripeptides, next we use it to discover self-assembling pentapeptides, which have 3.2 million (M) ($20^5$) possible sequences. Such a large search space renders a brute-force search impossible and motivates the need for an AI-guided search. The AI-expert with MCTS + RF scheme was deployed with slightly different settings of the reward function, that is, $r_{\mathrm{penta}} = (\mathrm{AP}')^{2} \times (\log P')^{0.5}$ (α = 2, β = 0.5), to bias the search towards pentapeptides that are neither too hydrophilic (easily soluble) nor too hydrophobic (difficult to form hydrogels). This adjustment in the reward function is necessary, because a majority of the amino acids are hydrophilic, and the naive use of $r_{\mathrm{tri}}$ for pentapeptide design will incorrectly assign high scores to hydrophilic pentapeptides (discussed later). Results for the ~6,600 pentapeptides evaluated using MCTS + RF (with $r_{\mathrm{penta}}$) are shown in Fig. 3a. It can be seen that AI-expert found a high occurrence of moderately hydrophobic pentapeptides with logP between 0 and −4, although with a broad peak. The top 100 peptides from this set, ranked by their reward score and subject to an additional constraint of −0.6 < logP < 2, were selected for longer MD simulations (200 ns) to improve the AP estimates (see Supplementary Information for the complete list). From a computational viewpoint, substantial aggregation was observed in all of the selected 100 cases (a few example pentapeptide structures are shown in the top row of Fig. 3b).
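The screening step described above, a hydrophobicity window followed by a top-N cut on the reward, amounts to a filter and a sort. The sketch below shows one way to express it; the `Candidate` record, the `screen` function and all numerical values are illustrative placeholders and are not data from the study.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str      # peptide label (placeholder)
    score: float   # reward value, for example r_penta
    logp: float    # hydrophobicity estimate


def screen(candidates, top_n=100, logp_min=-0.6, logp_max=2.0):
    """Keep candidates with logp_min < logP < logp_max, best score first."""
    in_window = [c for c in candidates if logp_min < c.logp < logp_max]
    return sorted(in_window, key=lambda c: c.score, reverse=True)[:top_n]


# Placeholder values for illustration only (not results from the study).
pool = [
    Candidate("pep_a", 0.10, 0.5),
    Candidate("pep_b", 0.90, 3.1),   # high score but outside the logP window
    Candidate("pep_c", 0.40, -0.2),
    Candidate("pep_d", 0.70, 1.9),
]
print([c.name for c in screen(pool, top_n=2)])  # ['pep_d', 'pep_c']
```

Applying the window before ranking means that a peptide with a very high reward but an extreme logP, like `pep_b` here, never reaches the more expensive 200 ns simulations.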

Fig. 3 | Screening of pentapeptides from AI-expert and human experts.

a, Left: results of the MCTS + RF-based computational search of AI-expert using the scoring function $r_{\mathrm{penta}}$. A broad peak with logP between 0 and −4 indicates the generation of moderately hydrophobic peptides that display a good balance between aggregation propensity (AP) and hydrophobicity. The AP results are based on shorter MD simulations (50 ns). The probability density function of AI-expert-proposed pentapeptides is estimated using kernel density estimation. Right: top peptides that were screened by AI-expert (top 100 using $r_{\mathrm{penta}}$), suggested by the human experts, and those that were selected for synthesis. The AP results are based on longer MD simulations (200 ns). b, MD simulation results (200 ns) for example top-scoring pentapeptides from the AI-expert (top row) and the human experts (bottom row), showing different levels of aggregation.

Similar to AI-expert, several human experts were asked to suggest their own sequence of pentapeptides that they expected to assemble. A set of simple guidelines was supplied (Methods). In response, a total of 29 pentapeptides were collected. Many literature examples of self-assembling pentapeptides include N- and C-termini modification (acetylated or carbamidated, respectively) to facilitate assembly48–53. However, in this work, the human experts were directed to leave the pentapeptide termini unmodified, in alignment with the workflow adopted for AI-expert. Analogous to the AI-expert pentapeptides, AP and logP values for these sequences were evaluated using MD simulations and hydrophobicity scales, respectively.

The results for the top 100 pentapeptides from AI-expert (red markers) and the 29 sequences from human experts (green markers) are shown in Fig. 3a (right panel). Also captured are the results of the candidates that were synthesized (black markers) and those that were found to aggregate (filled markers) based on light scattering and microscopy measurements. A detailed comparison of the pentapeptides suggested by AI-expert and the human experts is provided later in Extended Data Fig. 2. Here, however, we make the following observations. First, the top sequences screened by AI-expert lie in a relatively smaller logP range than those proposed by the human experts. This is because AI-expert screens candidates only on the basis of the scoring function, whereas the human experts rely on a multitude of factors, such as patterning, hydrophobicity scales and individual past experiences. Second, the AI-expert-suggested sequences, in general, show higher AP values than those of the human experts. This implies that, at least from a computational modelling viewpoint, AI-expert has indeed found pentapeptide sequences with a higher degree of aggregation. This is also evident from the example pentapeptide structures obtained after longer MD simulations (200 ns) comparing the AI-expert (top row) and human expert (bottom row) sequences in Fig. 3b. Third, many sequences that were computationally found to have high AP values did not display any assembly upon experimental synthesis. These cases highlight the limitations of the MARTINI force field to capture accurate aggregation behaviour in peptides, or the inadequacy/simplicity of the reward function used in this work, which consists of just the AP and logP values. Finally, pentapeptides that were (experimentally) observed to aggregate belonged to a narrow range of hydrophobicity (−5 < −logP < 3) and AP (1.5–2.5), signalling the importance of these theoretically derived values in identifying novel peptide sequences for self-assembly. 
The observed narrow range of hydrophobicity (or −logP) agrees well with the convention of balancing the hydrophobic and hydrophilic content in peptide sequences; if amino acids are too hydrophobic (or −logP values are too positive), peptides begin to precipitate out of solution or are rendered entirely insoluble in water even at low concentrations, and if amino acids are too hydrophilic (or −logP values are too negative), then the peptides remain as water-soluble monomers. Similarly, the computed AP values are also a good indicator of peptide assembly, as no peptide sequence with a low AP value was observed to assemble.

Discovery of self-assembling pentapeptides

This section covers details of the 20 synthesized pentapeptides and the observed self-assembled structures. Eleven of the 29 sequences from the human experts and 9 of the 100 AI-expert-suggested sequences were prepared using a solid-phase peptide synthesizer (SPPS; Methods) with the termini of the peptides kept unmodified (that is, the amine and carboxyl groups of the final product were unprotected). Given that we are interested in the ability of peptides to aggregate and/or assemble, it is important here to distinguish between two seemingly analogous terms: aggregation and assembly. Aggregation implies the lack of noticeable structure, and assembly implies the presence of nano-, meso- and microscale features like micelles, vesicles, fibres and sheets. Thus, for a detailed analysis of the aggregated/assembled structure, and to find experimental quantities analogous to the computed logP and AP values, liquid chromatography (LC), MS, infrared spectroscopy, AFM and opacity measurements were taken for each of the synthesized pentapeptides.

First, all the synthesized peptides were analysed for purity and mass by HPLC and MS. The retention time (RT) was recorded and was found to correlate with hydrophobicity (logP); a linear relationship between RT and logP is visible in Fig. 4a, although some deviations are also noted. Importantly, most of the peptides that show aggregation (filled circles) displayed high RT.

Fig. 4 | Experimental measurements of self-assembly in pentapeptides.

a, RT and opacity (OD 800 nm) measurements for the 20 synthesized pentapeptides suggested by the AI and human experts. Peptides that were found to aggregate are shown as filled circles. Although RT was found to correlate linearly with logP (indicated by a dashed linear fit line), OD at 800 nm was analogous to the computed AP values. b, AFM images (representative of three trials yielding similar results) for example pentapeptides synthesized in this work, along with their molecular representation and pattern of mixed polar (p) and non-polar (n) amino acids. Aggregates in the form of fibres, sheets/tapes and other irregular shapes are visible. AA colour coding: acidic AAs, red; basic AAs, blue; polar AAs, yellow; aromatic AAs, orange. c, Infrared spectroscopy measurements for the 20 synthesized pentapeptides as suggested by the AI and human experts. The peak near 1,600 cm−1 highlights the formation of secondary structures (β-sheets) in many of the human-expert-selected systems, which was largely missing in the AI-recommended systems. d, Photographs of example gels, solutions and suspensions formed by the different peptides.

To determine assembly, all 20 peptides were dissolved in water at 2 wt% and the pH was adjusted to 7. After 24 h, the solutions either remained clear, grew cloudy or gelled upon adjusting the pH (Fig. 4d). The samples’ opacity (absorption at 800 nm) was monitored with a plate reader (optical density (OD) at 800 nm) and peptides with OD800nm > 0.1 (for water, OD800nm = 0.04) were considered to aggregate (Extended Data Fig. 2). Overall, six candidates (VVVVV, FKFEF, VKVEV, VKVFF, KFFFE and KFAFD) from the human experts and six (SYCGY, FKIDF, FFEKF, KWEFY, RWLDY and KWMDF) from AI-expert were found to aggregate. One peptide from the human experts, RVSVD, yielded high opacity values after one week and was not considered to be a ‘positive hit’. Only a modest match between the measured OD800nm and the computed AP was observed, as shown in Fig. 4a. Peptides that yielded OD800nm > 0.1 had an AP value > 1.8, although some peptides with low opacity (OD800nm < 0.05) also had AP > 1.8, indicating that the computed AP is not always a good predictor for aggregation. In this regard, we also caution that the opacity measurements do not always indicate assembly. For example, micelles at the nanoscale remain translucent and would not yield high values at OD800nm.

We thus further analysed the secondary structures of the aggregated peptides using Fourier-transform infrared (FTIR) spectroscopy. In the amide I region of the spectrum, peptide/protein in D2O shows signature FTIR vibrations of 1,675 cm−1 and 1,627 cm−1 that are representative of a β-sheet conformation54. Although we observed (Fig. 4c) the former peak in almost all the samples investigated (2 wt% in D2O, pH 7), the latter peak, which is attributed to a β-sheet composition, was observed in the following peptides: FKIDF, KFFFE, FKFEF, VVVVV and VKVFF. In addition to the samples in solution, dried films cast from diluted stock solutions (10 μl of a 0.2 wt% solution onto a CaF2 window) yielded the β-sheet signature of the amide I vibration at 1,627 cm−1 in more peptides (Supplementary Information). Nine of eleven peptides from the human experts indicated β-sheet formation as opposed to three of nine AI peptides. This reflects the bias of the users towards peptides with β-sheet formation, as will be discussed in the next section.

To investigate the morphology of the designed pentapeptides, AFM was used on the dried sample films (Methods). We note that the dried films could contain structural artefacts, but in most cases the microscopy data correlate well with our solution studies–only three cases are believed to form aggregates as a result of the drying (KVKVK, RVSVD and VKVKV). Among the cases recommended by AI-expert, SYCGY yielded microscale-length needles with widths on the order of hundreds of nanometres (Fig. 4b). Nanoscale structures were found in peptides rich in the aromatic amino acids, tryptophan and phenylalanine: KWEFY (fibres), FFEKF (spheres) and FKIDF (platelets). This is a design feature similar to that found in a past study on tripeptide series by the Ulijn and Tuttle groups29. Two sequences that introduce aliphatic amino acids (Leu and Met) in the middle of the sequences, RWLDY and KWMDF, yield spherical structures with varying degrees of aggregation. Interestingly, WKPYY indicated no assembly via our experimental protocol, but large aggregates were observed in solution as well as in AFM studies. Thus, among the selected nine cases from this list, only PPPHY and PTPCY did not indicate any discernible structure and resembled dried organic matter on a substrate.

Human-expert-designed phenylalanine-rich peptides, similar to those identified by AI-expert, demonstrated a high propensity for forming nanostructures. Nanoscale fibres were discovered for KFAFD, FKFEF, VKVFF (spherical bundles) and KFFFE (nanoplates). Many of the valine-rich peptides from human experts form fibrous structures upon drying (VKVKV, KVKVK and RVSVD), but no evidence for solution structures was observed. Interestingly, large platelets with 25-nm height were observed for the relatively hydrophobic (logP = −2.3) VVVVV. Even though it is highly hydrophilic (logP = 5.05), VKVEV gelled on increasing the pH to 7. Thus, 6 of the 11 pentapeptides suggested by human experts formed nanostructures, most of which formed a β-sheet conformation.

Performance comparison of AI-expert and human experts

In terms of the overall ability of AI-expert to predict the assembly of pentapeptides, it performs on par with or slightly better than our human experts. As shown in Fig. 5a, the success rate of AI-expert (using rpenta) is 66.67%, as compared to 54.5% for human experts. We, however, argue that aggregation success rate alone is not a sufficient metric to evaluate performance. It is important to realize that, in contrast to human experts, AI-expert had no direct feedback from any actual experiments and merely relied on computationally derived quantities, such as AP and logP, to make predictions. Indeed, if all the synthesized peptides are rank-ordered on the basis of their computational score (that is, rpenta), seven of the top eight peptides are from AI-expert and only one is from the human experts (Fig. 5b). This means that, in the ideal case where the computational scoring function is a perfect indicator of assembly, AI-expert would have performed much better than the human experts. Thus, efforts are needed to either modify the scoring function (for example, adding other structural factors or manipulating the weighting scheme) or improve the performance of the force fields to accurately relate the computational score with peptide assembly.

Fig. 5 |. Performance comparison of AI-expert and human experts.

a,b, The performances of AI-expert and human experts are evaluated in terms of the aggregation success rate (a) and computational and experimental scores (b) of the proposed peptides. c, Correlation between the computational and experimental scores, with (bottom) and without (top) the β-sheet factor. Although a high computational score does not necessarily indicate aggregation, the experimental score beyond a threshold of 0.01 (dotted lines) captures the peptide aggregation extremely well.

As another form of performance evaluation, we devised an experimental scoring (ExpScore) metric that incorporates information from the characterization measurements. ExpScore was defined as the product of the normalized RT (analogous to logP) and the normalized OD800nm (analogous to AP) and captured the aggregation in peptides extremely well (Methods). As seen in Fig. 5b, even on the basis of ExpScore, AI-expert performs on par with human experts, with both suggesting four of the eight top-scoring peptides. Nevertheless, the highest-scoring case (SYCGY) was predicted by AI-expert, further corroborating the ability of AI-expert in peptide discovery.

The diversity of the proposed peptide sequences is another important performance metric. Many of the sequences proposed by human experts revolved around the use of only four amino acids: phenylalanine (F), valine (V), lysine (K) and glutamic acid (E) (Extended Data Fig. 3). This is not only reflective of human bias, but in some sense is also the reason for the good success rate of the human experts (for instance, similar pairs such as VKVFF/VKVEV and KFFFE/FKFEF are each counted as positive hits). AI-expert, on the other hand, suggested very diverse sequences covering more than ten distinct amino acids. Furthermore, it suggested sequences that are very unusual, such as SYCGY, or that include many distinct amino acids, for example, FKIDF, RWLDY and KWEFY. None of these sequences is likely to be recommended by a human expert, which suggests the power of AI-expert to overcome human bias, identify novel sequences and plausibly unearth new protein chemistry.

Besides discovering novel chemistry, AI-expert automatically (re-)discovered a few known rational design approaches. For example, a notable percentage of the peptides (~50%) determined by AI-expert were either charge-neutral or charge-balanced such that a positively charged amino acid like lysine or arginine was frequently paired with a negatively charged amino acid like glutamic or aspartic acid. Such inclusion of salt-bridges/electrostatic pairing in peptide design is standard practice. Another area where both the human experts and AI-expert agreed was the incorporation of phenylalanine-rich peptides balanced with the charged pair of lysine and glutamic/aspartic acid, for example, KFFFE, FKFEF (humans) and FKIDF, FFEKF (AI). Phenylalanine is well known to form β-sheet-rich structures that further stabilize supramolecular assemblies via π–π interactions. The generation of such sequences via AI-expert is rather encouraging.
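The charge-pairing pattern described above can be checked mechanically. The following is a minimal sketch and not part of the published workflow: the function name `charge_profile` is hypothetical, and the rule that only K/R count as positive and D/E as negative at neutral pH (histidine and the zwitterionic termini are ignored) is a simplifying assumption made for illustration.

```python
# Assumption: side-chain charges at neutral pH only; His and the termini are ignored.
POSITIVE = set("KR")  # lysine, arginine
NEGATIVE = set("DE")  # aspartic acid, glutamic acid

def charge_profile(sequence):
    """Return (n_positive, n_negative, label) for a one-letter peptide sequence."""
    n_pos = sum(res in POSITIVE for res in sequence)
    n_neg = sum(res in NEGATIVE for res in sequence)
    if n_pos == 0 and n_neg == 0:
        label = "uncharged side chains"
    elif n_pos == n_neg:
        label = "charge-balanced"
    else:
        label = "net charged"
    return n_pos, n_neg, label

# Sequences quoted in the text.
for seq in ["KFFFE", "FKFEF", "FKIDF", "FFEKF", "SYCGY", "KVKVK"]:
    print(seq, charge_profile(seq))
```

Under these assumptions the phenylalanine-rich examples from both sources are classified as charge-balanced, whereas SYCGY carries no charged side chains at all.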

However, there are many deviations between the human experts and AI-expert. Using rpenta, some unusual high-scoring sequences were observed. These sequences lacked electrostatic pairs and incorporated uncharged/polar amino acids (for example, SYCGY and PPPHY). Incorporation of cysteine can be challenging due to its ability to crosslink and form cysteine bridges. For self-assembly, this can be beneficial, but the simulations do not predict disulfide bond formation, rendering the experimental and simulation results difficult to compare. Another deviation between AI-expert and the rational design approach is with respect to the amino acids valine and proline. Very few peptides suggested by AI-expert contained valine, yet human experts show a proclivity for it (for example, KVKVK, RVSVD and VKVEK). Generally, valine is employed due to its high β-sheet propensity, which often leads to long-range ordering in self-assembly55. The trend was opposite in the case of proline; although several of the AI-expert sequences were dominated by this amino acid, none of the human experts believed it to help with any sort of assembly. The added observation that none of the proline-containing pentapeptides aggregated points to a limitation of AI-expert.

Plausible sources of bias that can influence the results presented in this study should be highlighted. Foremost is the limited number of human experts who were included in this study. Although the inclusion of a larger cohort of human experts would have definitely helped to obtain more rigorous comparison results between the human experts and AI-expert, we ensured that the human experts selected in this study represent a diverse group of researchers affiliated (both now and in the past) to a variety of US institutions with several years (>10) of experience in the field of peptide design. Thus, if years of research is considered as an indicator of knowledge in the domain of peptide design, then the selected human experts would represent a population with much better chance of proposing highly aggregating sequences than a population that consists of a mix of graduate students, post-docs and scientists. Second, an additional selection had to be made from the top candidates proposed by AI-expert (9/100) and human experts (11/29) owing to the excessive experimental cost of studying aggregation in peptides. We detail our rationale for selection in both cases in the Supplementary Information, based on the diversity of sequences in terms of various aspects (amino acids, logP, β-sheet propensity and charge, among others). Additionally, we note that, because the peptide synthesis and characterization study was conducted in a highly reproducible and objective manner, knowledge of the source (AI-expert or human experts) of a sequence is not expected to impact the results of this study. Thus, our efforts regarding inclusion of a diverse group of human experts and the selection of diverse sequences from the suggested top-candidate list of AI and human experts are expected to mitigate the biases in this study.

AI-expert improvement opportunities

A critical component behind the success of AI-expert is its scoring function, which needs to be designed very carefully. To show how dramatically it can influence the performance of AI-expert, we again used AI-expert to design pentapeptides, but this time with the reward as rtri. This mostly yielded highly hydrophilic candidates with a logP value between 2 and 6, as shown in Supplementary Fig. 1. Although scoring highly (based on rtri), these candidates are expected to be soluble in water and not show any assembly. An analogous screening procedure was followed, wherein ten candidates were synthesized from the list of top 100 cases based on longer MD simulations (200 ns). Only two of ten cases, that is, KFFFDY and FFEKF, yielded aggregates, giving a success rate of only 20% (Supplementary Information). Thus, the selection of an instructive scoring function is essential for AI-expert, and the weighting parameters α and β need to be adjusted carefully according to the sequence length, n, of the peptide.

We noted previously that AI-expert did not receive any feedback from actual experiments, and only a modest correlation between the computational (rpenta) and experimental scores was found, as shown in Fig. 5c. This observation, along with the general tendency of human experts to incorporate amino acids with high β-sheet propensity (for example, valine and phenylalanine), as well as the unfitting proclivity of AI-expert to proline, provide an opportunity to improve the performance of AI-expert.

Chou and Fasman have reported β-sheet propensity as a statistical distribution of amino-acid conformers in the Protein Data Bank28, and the approach is updated periodically55. Being a quantitative measure, this can be used to modify the scoring function, that is, $r_{\mathrm{Score}} = \mathrm{AP}^{\alpha} \times (\log P)^{\beta} \times (\log \sigma)^{\gamma}$, where σ is the reported β-sheet propensity factor of weight γ. Figure 5c compares the correlation between the computational and experimental scores with (γ = 1) and without (γ = 0) the β-sheet propensity factor. The vertical lines in the bottom panel of Fig. 5c are proportional to the change in the peptide score upon addition of the β-sheet propensity factor and denote its effect on the peptide score. It can be seen that, with the β-sheet propensity factor, the computational score for the proline-containing sequences, WKPYY, PTPCY and PPPHY, which did not assemble, decreased substantially. Similarly, the human-expert-suggested peptides that had low computational and experimental scores, and did not aggregate (VKVKV), continued to display low scores, even after inclusion of the β-sheet propensity factor. Furthermore, many of the peptides that were found to aggregate either did not show any change (VVVVV, VKVEV) or their scores decreased marginally (KFFFE, SYCGY, FFEKF). Overall, this results in an improved ranking of the peptides, and we believe that this improved scoring system (rScore) could be used for peptide discovery in the future. However, caution should be exercised in selection of the γ factor, as a high γ value will steer the search towards selected amino acids (for example, V, I and F) or to peptides with only β-sheet conformations, and thus introduce unwanted bias in the search and minimize amino-acid diversity.

Our future vision is the development of a fully autonomous peptide design platform in which AI-expert interacts with a robotic platform capable of synthesizing and characterizing new sequences, whose feedback is directly digested by AI-expert to suggest new sequences, so that the search progresses in an iterative manner. To accelerate this process, inputs from simulations could also be utilized to avoid low-scoring peptides. Another limitation of the current scheme is that AI-expert cannot predict the morphology (fibre, β-sheet, tapes and so on) of self-assembled nanostructures. Improvements in the reward function to include additional information from the simulations, such as the number and aspect ratio of the peptide clusters, their morphology and their moments of inertia, among others, are needed to enhance the abilities of AI-expert. We also note that the scheme, and the accompanying codes, presented here are generalizable to peptide design problems of variable sequence length. In the future, the introduction of the β-sheet factor or the incorporation of direct feedback from (high-throughput) experiments is expected to improve the ability of AI-expert to predict peptide assemblage into well-ordered structures. Although this study was exclusively focused on sequences that show self-assembly, the same approach could be utilized to understand trends and biases in the peptide sequences that are not prone to assembly.

Conclusions

AI methodologies are incredibly useful for guiding scientists towards identifying novel short self-assembling peptides and are being considered as the future of synthesis and molecular design. AI-facilitated peptide discovery is necessary because of the intractable search space ($20^n$, where n is the peptide length). Here we have developed AI-expert to evaluate the aggregation propensity of 6,600 out of 3.2 million possible pentapeptides using MD simulations and hydrophobicity (logP) scales. In addition, we have queried expert peptide designers to provide promising sequences. The top nine sequences from AI-expert and eleven candidates from the human experts were synthesized and characterized. An experimental scoring system (sample opacity versus HPLC RT) that reflected the AI scoring system (aggregation propensity versus hydrophobicity) was critical in identifying failures and successes in the approaches of both AI-expert and the human experts. Overall, AI-expert performed on par or slightly better (with a 66.7% success rate) than our human experts (54.5%). Not only did AI-expert recover known design strategies, such as the identification of charge-balanced phenylalanine-rich peptides (AI, FKIDF and FFEKF; human, KFFFE and FKFEF), it also found novel sequences that deviate notably from the traditional approach (for example, SYCGY).
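The size of the search space quoted above follows directly from the 20-letter amino-acid alphabet. A short sketch of the arithmetic (the helper name `n_sequences` is introduced here only for illustration):

```python
def n_sequences(length, alphabet_size=20):
    """Number of distinct peptides of a given length over the genetically encoded alphabet."""
    return alphabet_size ** length

tripeptides = n_sequences(3)     # brute-force tripeptide set
pentapeptides = n_sequences(5)   # pentapeptide search space
evaluated = 6600                 # pentapeptides evaluated by AI-expert (from the text)

fraction = evaluated / pentapeptides
print(tripeptides, pentapeptides, f"{fraction:.2%}")
```

The 6,600 evaluated pentapeptides therefore correspond to roughly 0.2% of the full space.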

Human bias was demonstrated to favour pentapeptides with high β-sheet propensity scores and was used as an opportunity to improve the AI scoring metric. Including the β-sheet factor to the AI score shifted the rankings in the correct direction, but still did not fully resemble the experimental ranking. Future efforts will focus on the application of high-throughput peptide synthesis coupled to the developed experimental scoring system to provide an experimental feedback loop to AI-expert beyond the currently implemented theoretical metrics (AP and logP). A similar AI strategy could be extended to screen small libraries of peptides for more specific applications. Although this study demonstrates the success of AI-expert in discovering self-assembling peptides, it can be extended to discover functional peptide assemblies for applications involving light-harvesting, catalysis, mechanical stability and conductivity.

Online content

Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41557-022-01055-3.

Methods

MCTS

AI-expert generates promising peptide sequences using the MCTS, which utilizes a tree-structure-based search to balance the exploration-versus-exploitation tradeoff. MCTS builds a shallow tree of nodes, each containing a peptide sequence, that are inter-connected in a parent–child manner. The tree structure is made meaningful by ensuring that each child node contains a sequence that is only a minor perturbation of the sequence of the parent node. Thus, similar peptide sequences occur in a tree branch. The MCTS consists of four key stages: (1) selection: based on a tree policy, select the leaf node that has the highest current score; (2) expansion: add a child node (with a slightly different sequence from the parent node) to the selected leaf after taking a possible action; (3) simulation: from the selected node, perform Monte Carlo trials of possible actions using a rollout policy to estimate the associated expected reward; (4) back-propagation: pass the rewards generated by the simulated episodes to update the scores of all the parent nodes encountered while moving up the tree. Here, we emphasize the distinction between score and reward. The former is computed using the full equation (1), but the latter represents only the left term of this equation. Starting from a random peptide sequence assigned to the root node, the MCTS iterates between these four stages as guided by the tree and the rollout policies. This results in continual growth of the search tree (expansion) in regions that have high scores either due to high rewards (exploitation) or due to their uniqueness (exploration). An advantage of the MCTS is that if the search becomes trapped in a suboptimal point, it can quickly jump to other regions in the search space by growing other branches of the tree that have a high exploration score.
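The four stages can be illustrated with a compact, self-contained sketch. This is not the authors' implementation. The reward (`toy_reward`, the fraction of F residues) is a hypothetical stand-in for the MD-derived reward, the score is a generic UCB1 expression rather than equation (1) with its uniqueness function, and the `max_children` limit used to decide when a node counts as fully expanded is an assumption of this sketch.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

class Node:
    """One tree node holding a peptide sequence and its running statistics."""
    def __init__(self, sequence, parent=None):
        self.sequence = sequence
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

def toy_reward(sequence):
    # Hypothetical stand-in for the simulation-derived reward.
    return sequence.count("F") / len(sequence)

def mutate(sequence, rng):
    # Minor perturbation: redraw one randomly chosen position.
    pos = rng.randrange(len(sequence))
    return sequence[:pos] + rng.choice(AMINO_ACIDS) + sequence[pos + 1:]

def ucb_score(node, c):
    # Generic UCB1 form: mean reward (exploitation) plus a visit-count bonus (exploration).
    mean_reward = node.total_reward / node.visits
    bonus = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return mean_reward + bonus

def select(root, c, max_children):
    # Descend through fully expanded nodes, always taking the best-scoring child.
    node = root
    while len(node.children) >= max_children:
        node = max(node.children, key=lambda child: ucb_score(child, c))
    return node

def mcts(n_iterations=200, length=5, c=1.0, max_children=3, n_rollouts=10, seed=0):
    rng = random.Random(seed)
    root = Node("".join(rng.choice(AMINO_ACIDS) for _ in range(length)))
    best_seq, best_reward = root.sequence, toy_reward(root.sequence)
    for _ in range(n_iterations):
        leaf = select(root, c, max_children)                    # (1) selection
        child = Node(mutate(leaf.sequence, rng), parent=leaf)   # (2) expansion
        leaf.children.append(child)
        rewards = []                                            # (3) simulation
        for _ in range(n_rollouts):
            trial = mutate(child.sequence, rng)
            reward = toy_reward(trial)
            rewards.append(reward)
            if reward > best_reward:
                best_seq, best_reward = trial, reward
        expected = sum(rewards) / len(rewards)
        node = child                                            # (4) back-propagation
        while node is not None:
            node.visits += 1
            node.total_reward += expected
            node = node.parent
    return best_seq, best_reward

print(mcts())
```

Because each child differs from its parent by a single residue, branches of the tree collect similar sequences, which is the structural property the text relies on.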

The UCB tree policy (equation (1)) used in this work is discussed in the main text, along with its reward and the uniqueness function. For the rollout policy, random perturbations were introduced to the peptide sequence of the selected node depending on its depth in the tree; the greater the depth, the smaller the perturbations. For example, in the case of pentapeptides, at depth 0 (or seed nodes), all five residues were generated randomly. However, with each increase in depth level, the number of residues that were allowed to change decreased by one; for example, at depth levels 1 and 2, four and three residues, respectively, were changed randomly with respect to the selected node. Furthermore, only one of the amino acids was randomly perturbed during rollout if the depth level was equal to or greater than the sequence length.
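The depth-dependent rollout rule can be written down in a few lines. The sketch below is an illustration under stated assumptions, not the published code: the function names are hypothetical, the positions to change are drawn uniformly at random, and a redrawn residue is allowed to coincide with the original one.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def n_positions_to_change(depth, length):
    """All positions at depth 0, one fewer per level, never fewer than one."""
    return max(1, length - depth)

def perturb(sequence, depth, rng):
    """Redraw a depth-dependent number of randomly chosen positions."""
    n_change = n_positions_to_change(depth, len(sequence))
    positions = rng.sample(range(len(sequence)), n_change)
    residues = list(sequence)
    for pos in positions:
        residues[pos] = rng.choice(AMINO_ACIDS)  # may redraw the same residue
    return "".join(residues)

rng = random.Random(0)
for depth in range(7):
    print(depth, n_positions_to_change(depth, 5), perturb("FFEKF", depth, rng))
```

For a pentapeptide this gives 5, 4, 3, 2 and 1 perturbed positions at depths 0 to 4, and a single position at every deeper level.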

In this work, ten seed nodes at depth 0 with random peptide sequences initiated the search. During each rollout, ten Monte Carlo runs were conducted to obtain the expected reward of the selected node, the results of which were back-propagated to all parent nodes to update their scores. In the scenario when no RF model was used, all of the ten Monte Carlo runs were performed on randomly generated sequences. In contrast, within the MCTS + RF scheme five of the ten runs were performed on randomly generated sequences, and the remaining five were screened from a pool of 500 random sequences based on their reward as approximated by the RF model. Exploration constant c was set to 10. This value was chosen based on the optimization study on the dataset of tripeptides, as discussed in Extended Data Fig. 4. The uniqueness function f(θj) was computed using the Dice similarity measure of the Morgan circular fingerprint of a peptide, as implemented in the open-source rdkit library56 with radius parameter m = 3 and using the feature-based invariance57 (useFeatures=True).
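The split between random and surrogate-screened rollout candidates in the MCTS + RF scheme can be sketched as follows. This is an interpretation rather than the authors' code: "screened" is taken to mean keeping the five pool members with the highest predicted reward, and `toy_surrogate` is a hypothetical placeholder for the trained RF model.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_sequence(length, rng):
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

def rollout_candidates(surrogate, rng, length=5, n_random=5, n_screened=5, pool_size=500):
    """Return n_random random sequences plus the n_screened best of a random pool."""
    randoms = [random_sequence(length, rng) for _ in range(n_random)]
    pool = [random_sequence(length, rng) for _ in range(pool_size)]
    # Rank the pool by the surrogate's predicted reward and keep the top entries.
    screened = sorted(pool, key=surrogate, reverse=True)[:n_screened]
    return randoms + screened

def toy_surrogate(sequence):
    # Hypothetical placeholder for the RF prediction: fraction of aromatic residues.
    return sum(res in "FWY" for res in sequence) / len(sequence)

rng = random.Random(0)
candidates = rollout_candidates(toy_surrogate, rng)
print(candidates)
```

Only the ten returned sequences would then be passed to the expensive MD evaluation, while the remaining pool members are discarded on the basis of the cheap prediction.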

MD simulation

Peptide (tri- or penta-) coordinate files were created using VMD scripting tools58 and converted to CG representation in the MARTINI force field (version 2.2) using the open-source script martinize.py59. Analogous to the previous study on tripeptides29, the secondary structure input flag -ss = EEE was used. As this choice was consistent for all peptide sequences studied, it is not expected to bias our search.

Using the GROMACS code (version 5.1.2)36,37, 180 (300) zwitterionic pentapeptides (tripeptides) were randomly placed in a periodic cubic box of dimensions 13 × 13 × 13 nm³, resulting in a peptide concentration of 0.14 (0.23) mol l⁻¹ in standard CG water. Lennard-Jones interactions were shifted to zero in the range 0.9–1.2 nm, and electrostatic interactions in the range 0.0–1.2 nm for all simulations (no particle mesh Ewald method was used). A relative dielectric constant of ϵr = 15 was used in standard CG water simulations for screening of the electrostatic interactions, and 2.5 was used for simulations in polarizable water. To model the peptide structure, the simulations were conducted in a series. First, the box was energy-minimized for 10,000 steps or until forces on atoms converged to under 200 pN. Next, the minimized box was equilibrated at constant volume (NVT) for 15,000 steps of 6.125 fs, using v-rescale temperature coupling (τT = 0.1 ps) at ~303 K. Finally, the resulting structure was equilibrated for 2 × 10⁶ steps of 6.125 fs using the Berendsen algorithms60 to keep temperature (τT = 1.25 ps) and pressure (τP = 3 ps) around 303 K and 1 bar, respectively. Bond lengths in aromatic side chains and the backbone–side chain bonds in I, V and Y were constrained using the LINCS algorithm61. The total simulation time for MCTS evaluation was 12.25 ns, which, owing to the speed factor of the CG potentials62,63, was tantamount to an ‘effective time’ of ~50 ns. For longer MD simulations on the screened top 100 pentapeptides, the water in the solvated energy-minimized box obtained after NVT simulation was converted to polarizable water64 to better account for charge screening. This system was then energy-minimized again, and run in the NPT ensemble for 8 × 10⁶ steps, or an effective time of ~200 ns.
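The concentrations and simulation lengths quoted above can be cross-checked with simple arithmetic. The helper names below are introduced only for this illustration.

```python
AVOGADRO = 6.02214076e23  # molecules per mole

def molar_concentration(n_molecules, box_edge_nm):
    """Concentration in mol/l for n_molecules in a cubic box with the given edge length."""
    volume_litres = (box_edge_nm * 1e-9) ** 3 * 1e3  # nm -> m, then m^3 -> litres
    return n_molecules / AVOGADRO / volume_litres

def simulated_time_ns(n_steps, timestep_fs=6.125):
    """Nominal simulated time in nanoseconds."""
    return n_steps * timestep_fs * 1e-6  # fs -> ns

print(round(molar_concentration(180, 13.0), 2))  # pentapeptides
print(round(molar_concentration(300, 13.0), 2))  # tripeptides
print(simulated_time_ns(2_000_000), simulated_time_ns(8_000_000))
```

The two concentrations round to 0.14 and 0.23 mol l⁻¹, and the step counts give nominal times of 12.25 ns and 49 ns. Comparing these with the quoted effective times (~50 ns and ~200 ns) implies a CG speed factor of about four.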

The GROMACS sasa tool was used to compute AP values as the ratio of solvent-accessible surface area of the structures obtained at the start and finish of the MD runs. The hydrophobicity of a peptide was computed using the Wimley–White whole-residue scale26,27, formulated as $\log P = \sum_{a \in S} \Delta G_{\mathrm{water\text{-}oct},a}$, where the summation runs over all amino acids $a$ in peptide sequence $S$. The AP and logP values were normalized using the expression $x' = (x - x_{\min})/(x_{\max} - x_{\min})$, where $x$, $x_{\min}$ and $x_{\max}$ respectively denote the original peptide AP/logP value, and the associated possible minimum and maximum values. For logP, the minimum and maximum values were computed by assuming all amino acids in the sequence to be either W (−2.09) or D (3.64). In the case of AP, the results on tripeptides provided minimum and maximum values of 0.97 and 2.7, respectively.
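A minimal sketch of these definitions is given below. Only the W and D values are quoted in this section; the V entry is inferred here from the logP of −2.3 reported for VVVVV, and the remaining residues are deliberately left out rather than guessed. Function names are hypothetical.

```python
# Per-residue water-octanol free energies. W and D are quoted in the text;
# V = -0.46 is an inference from logP(VVVVV) = -2.3.
WW_OCTANOL = {"W": -2.09, "D": 3.64, "V": -0.46}

def aggregation_propensity(sasa_initial, sasa_final):
    """AP as the ratio of solvent-accessible surface area at the start and finish of a run."""
    return sasa_initial / sasa_final

def log_p(sequence, scale=WW_OCTANOL):
    """Sum of per-residue contributions over the sequence."""
    return sum(scale[res] for res in sequence)

def normalize(x, x_min, x_max):
    """Min-max normalization x' = (x - x_min) / (x_max - x_min)."""
    return (x - x_min) / (x_max - x_min)

def log_p_bounds(length, scale=WW_OCTANOL):
    """Extremes obtained from an all-W and an all-D sequence of the given length."""
    return length * scale["W"], length * scale["D"]

lo, hi = log_p_bounds(5)
print(lo, hi)
print(normalize(log_p("VVVVV"), lo, hi))
print(normalize(aggregation_propensity(2.0, 1.0), 0.97, 2.7))
```

For pentapeptides the logP bounds are therefore about −10.45 and 18.2, so every normalized logP falls between 0 (all W) and 1 (all D).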

Surrogate RF model for peptide AP

The RF regression algorithm, as implemented in scikit-learn65, was used to learn the AP of peptides. Accuracy comparison studies using AP data on tripeptides revealed the superior performance of the RF algorithm over other regression algorithms, such as kernel ridge regression and gradient boosting. Thus, the accuracy of the RF model and the consideration that the RF training time does not prohibitively increase with increasing training data size (making it suitable for on-the-fly training) made RF the algorithm of choice. RF is an ensemble of decision trees that averages predictions from a large group of ‘weak models’ to result, overall, in a better prediction. The RF hyperparameter, that is, the number of weak estimators, was set to 100 based on preliminary results using the dataset of tripeptides. As an input to the RF model, a three-level hierarchical set of features, based on past experience66,67, that capture different geometric and chemical information about the peptides at multiple length scales (atomic, morphological) was considered. Further details on the model input features are provided in the Supplementary Information. The RF model was trained to minimize the mean absolute error. To estimate prediction errors on unseen data and showcase the improvement in the model performance with increasing training data, learning curves were generated by varying the sizes of the training and test sets. These results are included in Extended Data Fig. 5. Statistically meaningful results were obtained by averaging over ten different random test–train splits.

Screening guidelines for human experts

Pentapeptide sequence design was requested by e-mail from seven experts. Five experts responded. The experts were chosen based on their published track record in designing small peptides for self-assembly and/or expertise in computational protein/peptide design. The experts are included as co-authors in this publication. They were given minimal guidance in an effort to minimize design biasing. The guidelines were as follows. (1) Rationally design a self-assembling pentapeptide that is unmodified (that is, select only from the 20 genetically encoded amino acids, no modified termini). (2) The peptide should assemble at neutral pH. (3) What morphology will the assembly yield? The five experts submitted multiple sequences, 29 in total. From these, we chose 11 based on the diversity of the sequences, as detailed in Supplementary Table 2. To avoid the influence of the sequences proposed by either other human experts or AI-expert, the human experts were blinded to the candidate sequences proposed by all other sources in this study.

Peptide synthesis

Pentapeptide sequences were obtained from either AI-expert or the human experts as described in the main text. In an effort to minimize post-synthesis purification via lengthy HPLC methods, our SPPS methods were optimized to yield crude peptides with >95% purity. The SPPS of pentapeptides was carried out using Fmoc chemistry (CS Bio Co. automated peptide synthesizer, CS136XT). Preloaded Wang resin (0.1-mmol synthetic scale, Chem Impex) was used as the solid support. A solution of 20% piperidine (Sigma Aldrich) in dimethylformamide (DMF; Fisher Chemical, bioreagent grade) was used as the deprotecting reagent with subsequent 5- and 20-min deprotection times. Coupling was executed using tenfold equivalents of standard Fmoc-protected amino acids (1 mmol, Chem Impex) and stoichiometric equivalents of diisopropylethylamine (1 mmol, Sigma Aldrich) and O-benzotriazole-N,N,N′,N′-tetramethyl-uronium-hexafluoro-phosphate (1 mmol, Chem Impex) in DMF with a 90-min coupling time. Final Fmoc deprotection was made following the same deprotection protocol as stated above.

Upon completion of the synthesis, the resin was transferred to a 20-ml scintillation vial equipped with a stir bar. The peptide side chains were deprotected and the crude peptide removed from the peptidyl resin with a standard trifluoroacetic acid solution (10 ml, 95% TFA, 2.5% triisopropylsilane, 2.5% water) and stirred for 3 h. If a cysteine residue was present in the sequence, the deprotection solution was adjusted with ethane dithiol (EDT, Sigma Aldrich) (10 ml, 95% TFA, EDT, 2.5% triisopropylsilane, 2.5% water). The resulting solution was filtered (fritted peptide reaction vessel equipped with a side arm, Chem Glass) into a clean 20-ml glass vial. The crude peptide was precipitated out of solution via dropwise addition of the TFA solution into cold diethyl ether (90 ml). The suspension in diethyl ether was transferred into two centrifuge tubes (50-ml Falcon tubes). The precipitate was pelleted using a centrifuge. The off-white to white precipitate was washed three times with cold diethyl ether, yielding the crude material. Once dry, the material was reconstituted in water and then lyophilized to obtain a white powder.

Sample preparation

The lyophilized powder of each peptide was weighed and dissolved in MilliQ water (R = 18.2 MΩ) or deuterium oxide (D2O, Sigma Aldrich) for solution infrared experiments. The pH was adjusted to 7 with ammonium hydroxide (1 M NH4OH or 1 M ND4OD prepared by diluting ammonium hydroxide in water or deuterium oxide). The sample was noted to either remain in solution, precipitate or gel immediately after adjusting the pH and after 24 h. The samples were either used as prepared or diluted for further characterization.

Experimental measurements

LCMS was employed not only for peptide identification and purity analysis but also for quantifying the hydrophobicity as reported by RT in a standardized method. An Agilent Technologies HPLC Workstation (Agilent 1260 Infinity equipped with an autosampling unit and multiwavelength detector) equipped with a C18 column (Jupiter Proteo 10 × 250 mm, Phenomenex) was utilized. A linear purification method was employed using a polar mobile phase water (0.1% TFA) with a 4% (vol/vol) per minute increase of the non-polar mobile phase acetonitrile (0.1% TFA). The sample was prepared at a concentration of 300 μg ml⁻¹ in water (0.1% TFA), with injection volumes of 0.9 ml. The RTs were recorded using Agilent OpenChem software (Extended Data Fig. 2). Advion Expression CMS (ESI-MS) was used to determine the correct mass of the isolated peptides (Supplementary Table 2).

Sample opacity was used as an indicator of aggregation or assembly. We added 100 μl of each sample to a 96-well plate and analysed the absorption at 800 nm (OD800nm, Tecan Platereader, Magellan Software).

ExpScore is defined as the product of the normalized retention time (RT′) and the normalized sample opacity (OD800nm′). The RT was normalized to the lowest (15 min) and highest whole-number RT (22 min). OD800nm was normalized to the value collected for water (0.04) and the highest value observed for the peptides (1.82). A complete table is provided in Extended Data Fig. 2.
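The ExpScore definition translates directly into code. The sketch below uses the normalization bounds given above and the 0.01 threshold from the Fig. 5 caption; the function names and the example inputs are hypothetical, and values outside the bounds are not clipped, which is an assumption of this sketch.

```python
RT_MIN, RT_MAX = 15.0, 22.0   # minutes, normalization bounds from the text
OD_MIN, OD_MAX = 0.04, 1.82   # water baseline and highest observed OD800nm
THRESHOLD = 0.01              # aggregation threshold from the Fig. 5 caption

def exp_score(rt_minutes, od800):
    """Product of the min-max normalized retention time and sample opacity."""
    rt_norm = (rt_minutes - RT_MIN) / (RT_MAX - RT_MIN)
    od_norm = (od800 - OD_MIN) / (OD_MAX - OD_MIN)
    return rt_norm * od_norm

def indicates_aggregation(rt_minutes, od800, threshold=THRESHOLD):
    return exp_score(rt_minutes, od800) > threshold

# Hypothetical measurements, not taken from Extended Data Fig. 2.
print(exp_score(18.5, 0.93), indicates_aggregation(18.5, 0.93))
print(exp_score(20.0, 0.04), indicates_aggregation(20.0, 0.04))
```

A sample that is as clear as water scores zero regardless of its retention time, so a hydrophobic but fully dissolved peptide is not counted as aggregating.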

A Thermofisher Nicolet FTIR spectrometer was used for analysis of the mid-infrared region, that is, the amide region for peptides. Each spectrum is an average of 16 scans with a resolution of 4 cm⁻¹ and background-corrected for D2O. Each sample (10 μl of a 2 wt% solution) was dropcast on a CaF2 plate equipped with a 0.025-mm Teflon spacer in a solution infrared cell (Sigma Aldrich). A second CaF2 plate was placed on top of the Teflon spacer, and the assembly was sealed. For the dried-film measurements, the same spectrometer and settings were used as for the solution FTIR, except that the spectra were background-corrected for CaF2 only. Each sample was diluted tenfold to 0.2 wt%, and 10 μl was dropcast onto a CaF2 plate and dried (~30 min).

AFM results were obtained on a Bruker MultiMode 8 microscope using the Scanasyst mode. A silicon tip on a nitride lever was used (Scanasyst-Air Probe, Bruker). Each sample was diluted tenfold to 0.2 wt%, and 100 μl was dropcast onto a freshly cleaved mica disk (top layer removed with scotch tape) affixed to a stainless-steel disk. After 2 min, the solution was removed by wicking away with filter paper. Images (10 μm × 10 μm and 2 μm × 2 μm) were collected at a scan rate of 1 Hz.

Extended Data

Extended Data Fig. 1 |. Top scoring tripeptides.

Top-ranked tripeptides identified using the brute-force computational search on 8,000 candidates. The score is based on the reward function rtri. Abbreviations: AP, aggregation propensity; logP, hydrophobicity.

Extended Data Fig. 2 |. Overall results for the synthesized pentapeptides.

Computational (AP, logP) and experimental (LC(RT), OD800nm) measurements, along with the associated reward scores (rpenta, rtri) and experimental score (ExpScore) are provided. β-sheet scale corrected rpenta and rtri scores, respectively titled rpentawB and rtriwB, are also included. Cases where aggregation (Agg.) was observed are marked 1 with a bold font.

Extended Data Fig. 3 |. Diversity of pentapeptides proposed by our human experts.

Frequency of occurrence of (left panel) amino acids in the 29 sequences proposed by the human experts and (right panel) the overall charge distribution of those sequences. It is evident that human experts preferred to include V, K and F amino acids and overall charge-neutral pentapeptide sequences. The complete list of the pentapeptides proposed by the human experts and the rationale for choosing or rejecting a sequence for synthesis is provided in Supplementary Information Table S2.

Extended Data Fig. 4 |. MCTS hyperparameter study.

Effect of the exploration constant c in Eq. 1 on the search efficiency of AI-expert for the case of tripeptides with (a) just the MCTS scheme and (b) the MCTS+RF scheme. The boxplots showcase the number of runs needed to find the topmost scoring tripeptide. The lower and upper bounds of each box represent the 25th and 75th percentiles, the middle line is the median, the upper whisker extends to the last datum less than the 75th percentile + 1.5(IQR), the lower whisker extends to the first datum greater than the 25th percentile − 1.5(IQR), and data beyond the whiskers are plotted as individual points. Here, IQR signifies the interquartile range, given by the 75th − 25th percentile. The results are based on n=10 statistically independent runs. The numbers of trials needed using a brute-force or random search (on average) are also shown using dotted lines. The MCTS+RF scheme performs the best: not only is it less sensitive to the choice of the c parameter, it also finds the topmost scoring tripeptide more efficiently. The MCTS+RF scheme with c = 10 was found to be most efficient and thus was selected for the pentapeptide search.

Extended Data Fig. 5 |. Machine learning aggregation propensity.

Performance of the random forest (RF) model in predicting the computed aggregation propensity (AP) of (a) tripeptides and (b) pentapeptides. In both cases, the improvement in RF model performance with increasing training-set size is shown (left panel), along with an example parity plot of the test data when it constitutes 20% of the total dataset. In (a), n = 10 statistically independent runs with a random test-train split (from 8,000 total cases) were performed, and data are presented as mean values ± 1.5 SD. In (b), the test-train split (from ~6,600 total cases using rpenta) was performed chronologically rather than randomly, to capture the progressive improvement of the RF model during the MCTS run. Because the training data within the MCTS+RF scheme were generated in an online fashion, the RF training set consists of AP values evaluated in the early stages of the MCTS run, whereas the test set contains AP values evaluated in the later stages of the run. Abbreviations: MAE, mean absolute error; SD, standard deviation.
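The split used in (b) differs from a random split in that earlier evaluations train the model and later evaluations test it. A minimal sketch of that idea follows, under stated assumptions: the records are already ordered by evaluation time, the function names are hypothetical, the (sequence, AP) pairs are invented for illustration, and a mean-value predictor stands in for the RF model so that the example stays dependency-free.

```python
def chronological_split(records, test_fraction=0.2):
    """Split time-ordered records into an early training part and a late test part."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = max(1, round(len(records) * test_fraction))
    return records[:-n_test], records[-n_test:]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical (sequence, AP) pairs in the order they were evaluated.
history = [
    ("AAAAA", 1.0),
    ("KVFVE", 1.6),
    ("FFFFF", 2.4),
    ("GGGGG", 1.1),
    ("VKVFF", 2.0),
]
train, test = chronological_split(history, test_fraction=0.2)

# Placeholder model (an assumption, not the RF): predict the mean training AP.
baseline = sum(ap for _, ap in train) / len(train)
mae = mean_absolute_error([ap for _, ap in test], [baseline] * len(test))
```

Replacing the placeholder with a trained regressor and repeating the evaluation for growing training sets would produce a learning curve of the kind shown in the left panels.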

Acknowledgements

Work performed at the Center for Nanoscale Materials, a US Department of Energy (DOE) Office of Science User Facility, was supported by the US DOE, Office of Basic Energy Sciences, under contract no. DE-AC02-06CH11357, and additionally supported by the University of Chicago and the DOE under DOE contract no. DE-AC02-06CH11357 awarded to UChicago Argonne, LLC, operator of the Argonne National Laboratory. This material is based on work supported by the DOE, Office of Science, BES Data, Artificial Intelligence and Machine Learning at DOE Scientific User Facilities programme (Digital Twins). We gratefully acknowledge the computing resources provided on Bebop, a high-performance computing cluster operated by the Laboratory Computing Resource Center (LCRC) at Argonne National Laboratory. S.K.R.S.S. acknowledges support from the UIC faculty start-up fund. We acknowledge T. Tuttle for sharing computational data on tripeptides.

Footnotes

Competing interests

The authors declare no competing interests.

Extended data is available for this paper at https://doi.org/10.1038/s41557-022-01055-3.

Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41557-022-01055-3.

Peer review information Nature Chemistry thanks Shuguang Zhang, Jin Kim Montclare, Fabien Plisson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Reprints and permissions information is available at www.nature.com/reprints.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this Article.

Code availability

The codes underlying the AI-expert framework are freely available for general use under a Creative Commons Attribution 4.0 International license and are deposited at https://doi.org/10.5281/zenodo.6564202.

Data availability

The data that support the findings of this study are available in the Extended Data figures (for synthesized pentapeptides), the Supplementary Information (for AI-expert-proposed pentapeptides) and the accompanying code repository at https://doi.org/10.5281/zenodo.6564202 (for tripeptides). Source data are provided with this paper.

References

1. Zhu S et al. Self-assembly of collagen-based biomaterials: preparation, characterizations and biomedical applications. J. Mater. Chem. B 6, 2650–2676 (2018).
2. Sorushanova A et al. The collagen suprafamily: from biosynthesis to advanced biomaterial development. Adv. Mater. 31, 1801651 (2019).
3. Lewis RV. Spider silk: ancient ideas for new biomaterials. Chem. Rev. 106, 3762–3774 (2006).
4. Scholes GD, Fleming GR, Olaya-Castro A & Van Grondelle R. Lessons from nature about solar light harvesting. Nat. Chem. 3, 763–774 (2011).
5. Luo Q, Hou C, Bai Y, Wang R & Liu J. Protein assembly: versatile approaches to construct highly ordered nanostructures. Chem. Rev. 116, 13571–13632 (2016).
6. Wei G et al. Self-assembling peptide and protein amyloids: from structure to tailored function in nanotechnology. Chem. Soc. Rev. 46, 4661–4708 (2017).
7. Ulijn RV & Smith AM. Designing peptide based nanomaterials. Chem. Soc. Rev. 37, 664–675 (2008).
8. Adler-Abramovich L & Gazit E. The physical properties of supramolecular peptide assemblies: from building block association to technological applications. Chem. Soc. Rev. 43, 6881–6893 (2014).
9. Wang M et al. Nanoribbons self-assembled from short peptides demonstrate the formation of polar zippers between β-sheets. Nat. Commun. 9, 5118 (2018).
10. Lakshmanan A et al. Aliphatic peptides show similar self-assembly to amyloid core sequences, challenging the importance of aromatic interactions in amyloidosis. Proc. Natl Acad. Sci. USA 110, 519–524 (2013).
11. Brahmachari S, Arnon ZA, Frydman-Marom A, Gazit E & Adler-Abramovich L. Diphenylalanine as a reductionist model for the mechanistic characterization of β-amyloid modulators. ACS Nano 11, 5960–5969 (2017).
12. Yemini M, Reches M, Rishpon J & Gazit E. Novel electrochemical biosensing platform using self-assembled peptide nanotubes. Nano Lett. 5, 183–186 (2005).
13. Zohrabi T, Habibi N, Zarrabi A, Fanaei M & Lee LY. Diphenylalanine peptide nanotubes self-assembled on functionalized metal surfaces for potential application in drug-eluting stent. J. Biomed. Mater. Res. A 104, 2280–2290 (2016).
14. Tao K, Makam P, Aizen R & Gazit E. Self-assembling peptide semiconductors. Science 358, eaam9756 (2017).
15. Yan X, Zhu P & Li J. Self-assembly and application of diphenylalanine-based nanostructures. Chem. Soc. Rev. 39, 1877–1890 (2010).
16. Kholkin A, Amdursky N, Bdikin I, Gazit E & Rosenman G. Strong piezoelectricity in bioinspired peptide nanotubes. ACS Nano 4, 610–614 (2010).
17. Yan X et al. Transition of cationic dipeptide nanotubes into vesicles and oligonucleotide delivery. Angew. Chem. Int. Ed. 119, 2483–2486 (2007).
18. Zhao X et al. Molecular self-assembly and applications of designer peptide amphiphiles. Chem. Soc. Rev. 39, 3480–3498 (2010).
19. Zelzer M & Ulijn RV. Next-generation peptide nanomaterials: molecular networks, interfaces and supramolecular functionality. Chem. Soc. Rev. 39, 3351–3357 (2010).
20. Cui H, Webber MJ & Stupp SI. Self-assembly of peptide amphiphiles: from molecules to nanostructures to biomaterials. Peptide Sci. Original Res. Biomol. 94, 1–18 (2010).
21. Rufo CM et al. Short peptides self-assemble to produce catalytic amyloids. Nat. Chem. 6, 303–309 (2014).
22. Gelain F, Luo Z & Zhang S. Self-assembling peptide EAK16 and RADA16 nanofiber scaffold hydrogel. Chem. Rev. 120, 13434–13460 (2020).
23. Solomon LA et al. Tailorable exciton transport in doped peptide-amphiphile assemblies. ACS Nano 11, 9112–9118 (2017).
24. Palmer LC & Stupp SI. Molecular self-assembly into one-dimensional nanostructures. Acc. Chem. Res. 41, 1674–1684 (2008).
25. Zhang S. Discovery and design of self-assembling peptides. Interface Focus 7, 20170028 (2017).
26. White SH & Wimley WC. Hydrophobic interactions of peptides with membrane interfaces. Biochim. Biophys. Acta Biomembr. 1376, 339–352 (1998).
27. Wimley WC, Creamer TP & White SH. Solvation energies of amino acid side chains and backbone in a family of host-guest pentapeptides. Biochemistry 35, 5109–5124 (1996).
28. Chou PY & Fasman GD. Prediction of protein conformation. Biochemistry 13, 222–245 (1974).
29. Frederix PW et al. Exploring the sequence space for (tri-) peptide self-assembly to design and discover new hydrogels. Nat. Chem. 7, 30–37 (2015).
30. Batra R, Song L & Ramprasad R. Emerging materials intelligence ecosystems propelled by machine learning. Nat. Rev. Mater. 6, 655–678 (2021).
31. Balachandran PV, Kowalski B, Sehirlioglu A & Lookman T. Experimental search for high-temperature ferroelectric perovskites guided by two-step machine learning. Nat. Commun. 9, 1668 (2018).
32. Lookman T, Balachandran PV, Xue D, Hogden J & Theiler J. Statistical inference and adaptive design for materials discovery. Curr. Opin. Solid State Mater. Sci. 21, 121–128 (2017).
33. Sutton RS & Barto AG. Reinforcement Learning: An Introduction (MIT Press, 2011).
34. Browne CB et al. A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intell. AI Games 4, 1–43 (2012).
35. Frederix PW, Ulijn RV, Hunt NT & Tuttle T. Virtual screening for dipeptide aggregation: toward predictive tools for peptide self-assembly. J. Phys. Chem. Lett. 2, 2380–2384 (2011).
36. Bekker H et al. in Physics Computing Vol. 92 (eds DeGroot RA & Nadrchal J) 252–256 (World Scientific, Singapore, 1993).
37. Abraham MJ et al. GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1, 19–25 (2015).
38. Coulom R. Efficient selectivity and backup operators in Monte-Carlo tree search. In Proc. 5th International Conference on Computers and Games 72–83 (Springer, 2006).
39. Kocsis L & Szepesvári C. Bandit based Monte-Carlo planning. In Proc. 15th European Conference on Machine Learning 282–293 (Springer, 2006).
40. Silver D et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
41. Dieb TM, Ju S, Shiomi J & Tsuda K. Monte Carlo tree search for materials design and discovery. MRS Commun. 9, 532–536 (2019).
42. Srinivasan S et al. Artificial intelligence-guided de novo molecular design targeting COVID-19. ACS Omega 6, 12557–12566 (2021).
43. Liu Y-C & Tsuruoka Y. Modification of improved upper confidence bounds for regulating exploration in Monte-Carlo tree search. Theor. Comput. Sci. 644, 92–105 (2016).
44. Rogers D & Hahn M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
45. Monticelli L et al. The Martini coarse-grained force field: extension to proteins. J. Chem. Theory Comput. 4, 819–834 (2008).
46. Singh G & Tieleman DP. Using the Wimley-White hydrophobicity scale as a direct quantitative test of force fields: the Martini coarse-grained model. J. Chem. Theory Comput. 7, 2316–2324 (2011).
47. de Jong DH, Periole X & Marrink SJ. Dimerization of amino acid side chains: lessons from the comparison of different force fields. J. Chem. Theory Comput. 8, 1003–1014 (2012).
48. Tang JD, Mura C & Lampe KJ. Stimuli-responsive, pentapeptide, nanofiber hydrogel for tissue engineering. J. Am. Chem. Soc. 141, 4886–4899 (2019).
49. Clarke DE, Parmenter CD & Scherman OA. Tunable pentapeptide self-assembled β-sheet hydrogels. Angew. Chem. Int. Ed. 57, 7709–7713 (2018).
50. Reches M, Porat Y & Gazit E. Amyloid fibril formation by pentapeptide and tetrapeptide fragments of human calcitonin. J. Biol. Chem. 277, 35475–35480 (2002).
51. Guterman T et al. Real-time in-situ monitoring of a tunable pentapeptide gel-crystal transition. Angew. Chem. 131, 16016–16022 (2019).
52. Tsiolaki PL, Hamodrakas SJ & Iconomidou VA. The pentapeptide LQVVR plays a pivotal role in human cystatin C fibrillization. FEBS Lett. 589, 159–164 (2015).
53. Krysmann MJ et al. Self-assembly and hydrogelation of an amyloid peptide fragment. Biochemistry 47, 4597–4605 (2008).
54. Kong J & Yu S. Fourier transform infrared spectroscopic analysis of protein secondary structures. Acta Biochim. Biophys. Sin. 39, 549–559 (2007).
55. Fujiwara K, Toda H & Ikeguchi M. Dependence of α-helical and β-sheet amino acid propensities on the overall protein fold type. BMC Struct. Biol. 12, 18 (2012).
56. RDKit open source toolkit for cheminformatics; http://www.rdkit.org/
57. Gobbi A & Poppinger D. Genetic optimization of combinatorial libraries. Biotechnol. Bioeng. 61, 47–54 (1998).
58. Humphrey W, Dalke A & Schulten K. VMD: visual molecular dynamics. J. Mol. Graph. 14, 33–38 (1996).
59. martinize.py; http://cgmartini.nl/index.php/tools2/proteins-and-bilayers/204-martinize
60. Berendsen HJ, Postma JV, van Gunsteren WF, DiNola A & Haak JR. Molecular dynamics with coupling to an external bath. J. Chem. Phys. 81, 3684–3690 (1984).
61. Hess B. P-LINCS: a parallel linear constraint solver for molecular simulation. J. Chem. Theory Comput. 4, 116–122 (2008).
62. Marrink SJ, Risselada HJ, Yefimov S, Tieleman DP & De Vries AH. The Martini force field: coarse grained model for biomolecular simulations. J. Phys. Chem. B 111, 7812–7824 (2007).
63. Marrink SJ, De Vries AH & Mark AE. Coarse grained model for semiquantitative lipid simulations. J. Phys. Chem. B 108, 750–760 (2004).
64. Yesylevskyy SO, Schäfer LV, Sengupta D & Marrink SJ. Polarizable water model for the coarse-grained Martini force field. PLoS Comput. Biol. 6, e1000810 (2010).
65. Pedregosa F et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
66. Batra R et al. Screening of therapeutic agents for COVID-19 using machine learning and ensemble docking studies. J. Phys. Chem. Lett. 11, 7058–7065 (2020).
67. Kim C, Chandrasekaran A, Huan TD, Das D & Ramprasad R. Polymer genome: a data-powered polymer informatics platform for property predictions. J. Phys. Chem. C 122, 17575–17585 (2018).
