Abstract
Modern proteins are remarkable polymers built from a 20-amino-acid alphabet, shaped by billions of years of evolution. Yet in Earth’s prebiotic era, several amino acids – particularly the canonical basic residues lysine, arginine, and histidine – were likely scarce, unlike the more readily available acidic amino acids. Moreover, protein-length polymers were inaccessible before ribosomal synthesis emerged, and peptides were probably short, statistical, and non-templated. How the earliest proteins and enzymes emerged under these constraints remains a central question in origins-of-life research.
Here, we synthesize random peptide libraries that span a broad electrostatic spectrum and systematically interrogate their properties. The data indicate that a prebiotically plausible acidic alphabet stands out in its propensity for secondary structure and higher-order soluble assembly via formation of β-sheets. These assemblies arise from highly heterogeneous sequences, plausibly reflecting the statistical diversity of early Earth peptides, and differ from amyloid structures in both solubility and morphology. Our results further show that the acidic random peptides have inherent capacity to bind certain metal ions, implying their potential to contribute to prebiotic catalysis. Using a large language model for structural prediction, we further show that peptides composed of this acidic alphabet exhibit a strong propensity for compact conformations.
Altogether, this study showcases that unevolved sequences of prebiotically-abundant amino acids can readily produce foldable self-assembling polymers, potentially providing a steppingstone toward the first proteins, prior to the onset of purifying selection.
Keywords: Peptide assembly, prebiotic chemistry, catalysis, protein structure, language models, peptide libraries, random sequences
Introduction
Biochemistry researchers and students alike have long asked, Why did life since the last universal common ancestor (LUCA) ‘select’ the twenty α-amino acids as the established language of building proteins and peptides(1-5)? The groundbreaking work of Miller and Urey (6, 7) in the 1950s provided early evidence that unsupervised gas phase reactions under reducing conditions could have provided abundant access to some α-amino acids early in Earth’s history. Even at this incipient stage in the field of prebiotic chemistry, an apparent asymmetry was noted: the modern acidic amino acids – L-aspartic acid and L-glutamic acid – were generated in high yields, whereas the modern basic amino acids – L-arginine, L-histidine, and L-lysine – were not detected. The absence of these essential building blocks was rightly perceived as a roadblock for an early emergence of functional proteins (2), given that extant proteins are comprised of a balanced mix of acidic and basic residues, and basic residues are almost always involved in interactions with nucleotides and nucleic acids (2, 8). Analyses of the amino acids that are carried to Earth by carbonaceous meteorites (9) as well as recent examination of the Bennu and Ryugu asteroids (which were not subjected to terrestrial exposure) (10, 11) have only reinforced the view that the modern acidic amino acids were plentiful whereas the modern basic amino acids were unavailable in prebiotic sources. The overlap between the amino acid compositions from gas-phase atmospheric chemistry and extraterrestrial cosmic material have led to propositions of a consensus ‘early’ amino acid alphabet (12) that comprised of ten α-amino acids: alanine (Ala), aspartic acid (Asp), glutamic acid (Glu), glycine (Gly), leucine (Leu), isoleucine (Ile), proline (Pro), serine (Ser), threonine (Thr), and valine (Val) (Figure 1, grey and red circles).
Figure 1. Design and evaluation of random peptide libraries using distinct amino acid alphabets.
(a) Chemical structures of 16 α-amino acids assembled through solid phase peptide synthesis to generate random 25-mer peptides utilising one of nine distinct alphabets (collections of amino acids). Dap, 2,3-diaminopropionic acid; Dab, 2,4-diaminobutyric acid; Aad, 2-aminoadipic acid; Api, 2-aminopimelic acid. (b) Methods used in this study to interrogate random peptide libraries.
Whether such a reduced chemical repertoire could have served as a practical inventory for building up the earliest biopolymers is not apparent, given the essential roles aromatic, basic, and sulfur-containing amino acids play in modern protein biochemistry. Recent efforts from us and others (13, 14) have at least substantiated this possibility by showing that these ‘early acidic’ proteins can adopt conformations with secondary structure, can form strong associations with RNA mediated by divalent cation coordination (15), and strikingly, are much less dependent on chaperones to adopt soluble stable conformations (14, 16, 17). A recent report has shown that distinct small protein folds can be reliably designed with the reduced alphabet using the modern de novo design methods (18).
On the other hand, firm claims are hard to make in prebiotic chemistry because one must always be cognizant of the large space of potential alternative scenarios that are not reflected in extant biology, but which nevertheless might have taken place or been significant in the distant past. For instance, even though there is spare evidence for L-lysine and L-arginine in prebiotic sources, modest levels of L-2,3-diaminopropionic acid (Dap), and L-2,4-diaminobutyric acid (Dab) (Figure 1, cyan circles) are detected in meteorite sources among many other noncanonical amino acids (19). It is possible that these noncanonical basic amino acids played a key role in an earlier phase of biochemical history, only later to be displaced by their peers with longer side chains. The existing biological record provides no way to disambiguate between these possibilities.
Recently, we proposed that foldability could have provided a selective force to govern which among the vast array of available amino acids became promoted to the biotic pool (20). The model is premised on the fact that now many prebiotically-plausible routes to condense α-amino acids into peptides have been documented, e.g., COS-catalysis (21), wet-dry cycling (22), and thiol catalysis (23, 24). Building blocks that provided peptides with a propensity to fold/compact or oligomerize/assemble would have been adaptive by generating sequences with greater stability and function. On the other hand, monomers that impair foldability would have been selected against. This model provides a satisfying explanation for why the canonical alphabet contains branched aliphatic amino acids (Val, Leu, Ile) even though their unbranched isomers (norvaline, norleucine) were possibly more prebiotically abundant: the former are more adept at producing secondary structures than the latter (especially β-sheets). However, the possible adaptive properties of different charge profiles remain elusive.
Here, we extensively characterise nine random peptide libraries (heterogenous mixtures of sequences), each composed of various combinations of acidic and basic amino acids (some canonical, some prebiotic, and some counterfactual, Figure 1), to interrogate effects on solubility, folding propensity, and oligomerisation potential. The data shows that acidic alphabets are remarkable in their capacity to promote folding as well as oligomer assembly mediated through β-sheets. We find unambiguously that any inclusion of basic amino acid residues – either in isolation or alongside acidic residues – diminishes these properties. The assemblies formed by the prebiotically plausible acidic peptides are quite thermostable but do not resemble amyloid structure (based on both solubility and morphology). Altogether, these findings point to acidic polypeptides as a potential intermediate in the early evolution of peptides and proteins. Though alien in some ways compared to today’s charge-balanced polypeptides, we observe that the early acidic alphabet could provide key advantages that facilitated the initial emergence of biopolymers.
Results
Synthesis of Charge-Variant Libraries.
Libraries of 25-mer peptides, constructed as random combinations of amino acids within the library, were synthesised by standard Fmoc-based solid phase peptide synthesis using the isokinetic mixtures as previously reported (20) (see Methods and Supplemental Data S1). Each library sampled one of nine different possible alphabets (Figure 1a), which could comprise either 8, 10, or 12 constituent α-amino acids. The common core of all libraries (library “8E”) only uses the eight uncharged canonical and prebiotic amino acids (grey), which we synthesised as a control. We considered two basic alphabets: 8E+Lys,Arg adds the two modern basic amino acids with long side chains (dark blue), whereas 8E+Dap,Dab adds the two most prebiotically available basic amino acids (cyan). We considered two acidic alphabets: 8E+Asp,Glu which adds the two canonical and prebiotically plausible acidic amino acids (red), and 8E+Aad,Api (L-α-2-aminoadipic acid, L-α-2-aminopimelic acid; orange) which is a ‘counterfactual’ alphabet. These amino acids were not prebiotically abundant but mimic the longer side chains of Lys and Arg and provide orthogonal means to address whether Asp and Glu provide unique advantages over hypothetical alternatives. Finally, we also considered four mixed alphabets in which acidic and basic amino acids coexist with varying side chain lengths.
We established the quality of these libraries with both MALDI mass spectrometry (Supplemental Figure S1) and HPLC amino acid analysis (Supplemental Figure S2, measurements compiled in Supplemental Data S2). MALDI showed distributions of precursor masses quite similar to theoretical predictions for each library (Supplemental Figure S1). For instance, the predicted average molecular weight for a 25-mer peptide in the 8E+Dap,Dab library is 2330 Da; we observe that it is 2288 Da (difference 42 Da or ~0.44 amino acids), which is similar for the other libraries. Amino acid analysis showed correct incorporation of the amino acids in each library and close-to-uniform distributions in the amino acids’ stoichiometry (Supplemental Figure S2). For instance, since the 8E+Lys,Arg,Aad,Api library has 12 amino acids, each amino acid should represent ~8.3% of the total amino acid content theoretically; we observe that each amino acid is present at 8 ± 3% of the content, and this is similar for the other eight libraries. We conclude that the isokinetic mix strategy can produce reasonably representative random peptide libraries, even spanning those with distinct isoelectric points and noncanonical amino acids.
An array of biophysical and computational methods was then used to characterize the libraries (Figure 1b).
Solubility Profiles of Charge-Variant Libraries.
Lyophilised peptide libraries were resuspended in aqueous Britton-Robinson (ABP) buffers of various pH’s and supplemented with NaCl to either a medium (50 mM) or high (500 mM) ionic strength. After mixing, the solutions were centrifuged at 21,000 x g for 30 min to remove insoluble material, and the solubility of the alphabets were assessed by measuring peptide concentration in the soluble fraction by UV-Vis absorption spectroscopy (Supplemental Figure S3, measurements compiled in Supplemental Data S2).
Similarly to what we previously observed in 8E+Asp,Glu (20), the acidic 8E+Aad,Api alphabet showed poor solubility at low pH, and high solubility at neutral and high pH (Supplemental Figure S3). These results can be rationalised since carboxylic acids are expected to be protonated and neutral at pH below 5, thus promoting aggregation. To our surprise, all other six libraries that contained basic amino acids were highly soluble at all pH’s (Supplemental Figure S3), when we might have expected 8E+Lys,Arg and 8E+Dap,Dab to mirror the acidic libraries and be insoluble at higher pH’s (although the pKA of Arg is further away from physiological pH).
In general, ionic strength has minimal effect on solubility, with the only exception being the acidic alphabets (8E+Asp,Glu and 8E+Aad,Api) at physiological pH, where high salt concentration is deleterious (Supplemental Figure S3). In other words, it is only the acidic alphabets that evidence the Hofmeister salting-out effect typically ascribed to globular proteins (25, 26). This result suggests that the acidic alphabets have a rather unique balance of hydrophobicity and hydrophilicity, which can promote aggregation under some conditions (i.e., low pH, high ionic strength) and potentially folding under other conditions (i.e., neutral pH, medium ionic strength).
Soluble Aggregate Formation in Charge-Variant Libraries.
Size exclusion chromatography (SEC) can provide an orthogonal means to assess aggregation, and unlike centrifugal pelleting assays, which measure precipitating aggregation, SEC is sensitive to the formation of soluble oligomers/aggregates. We examined all the peptide libraries in the pH range between 3–11, finding that in all cases, the majority of the peptide mass eluted as a single peak with an elution volume between 13–16 ml (Figure 2). These data indicate that most of the soluble peptides are monomeric and do not form oligomers or soluble aggregates.
Figure 2. The aggregation propensity of eight peptide libraries.
Aggregation propensity of the indicated libraries was measured in 20 mM ABP buffer at different pH’s and at medium ionic strength (50 mM NaCl) with size-exclusion chromatography (SEC). SEC chromatograms report absorbance at 215 nm as a function of elution volume. The data for 8E+Asp,Glu library were taken from previous work (20).
The two acidic libraries (8E+Asp,Glu and 8E+Aad,Api) showed the greatest pH-dependence on their elution profile in a predictable fashion. At pH 11, the peptides eluted at 13 ml, implying larger hydrodynamic radii when the molecules are the most polyanionic (and most likely unfolded). At pH 3, where these peptides are more neutralised and higher levels of precipitation are observed (cf. Supplemental Figure S3), the soluble peptides eluted at 16 ml, implying more compact structures with lower hydrodynamic radii. Curiously, the two basic libraries (8E+Lys,Arg and 8E+Dap,Dab) did not show the expected reciprocal behaviour of becoming more compact at high pH, where the peptides would become more charge neutralised. Instead, the position of the primary peak remained at ca. 16 ml, though the chromatographic feature does broaden and narrow in a pH-dependent manner (Figure 2). The SEC profiles of the four libraries with both acidic and basic residues responded to pH in a similar, if slightly attenuated, way as the acidic libraries.
We detected soluble aggregates forming as a separate feature at low elution volume (8 ml), but only in the two acidic libraries and only under pH’s where these libraries did not produce precipitating aggregates (Figure 2 and Supplemental Figure S3). This is a key observation as self-assembly into larger structures has been posited as an important property for prebiotic peptides (27, 28).
These experiments were repeated at high ionic strength (500 mM NaCl, Supplemental Figure S4), which produced SEC chromatograms that were largely similar but differed in two key respects. Firstly, the pH-associated changes to the peak position were reduced, which is reasonable because the salt screens repulsive ion-ion interactions that would promote peptide chain radius expansion. Secondly, and more importantly, the soluble aggregate feature we detected in 8E+Asp,Glu is greatly diminished at 500 mM NaCl, implying that these assemblies involve electrostatic interactions that are screened out at high ionic strength. Notably, the ability of 8E+Asp,Glu peptides to form assemblies is strongly dependent on their length, with assembly formation not observed for 8- and 15-mer versions of the library (Supplemental Figure S5). This indicates that a critical peptide length is necessary to sustain cooperative interactions and the formation of higher-order architectures.
Beta Sheets Mediate Soluble Assembly of Prebiotic Acidic Peptides.
To interrogate the nature of the peptide assemblies formed in the 8E+Asp,Glu library, we performed SEC preparatively and isolated the early-eluting peaks (Figure 3). The peptide mixture was analysed by CD spectroscopy, and remarkably, this subpopulation evinced clear signature for β-sheets, including a clear negative peak at 215 nm, and positive peak at 192 nm. When the subpopulation was hydrolysed and analysed with amino acid analysis, we saw an enrichment for the aliphatic branched amino acids, such as Val, Leu, and Ile, and depletion of Pro and the charged amino acids, Asp and Glu. This apparent sensitivity of the assembly formation to the library composition was further confirmed by synthesizing the 8E+Asp,Glu library with individually skewed amino acid ratios (Supplemental Figure S6 and Table S1).
Figure 3. The resolution of a β-sheet rich subpopulation of peptides from the soluble aggregate fraction of the 8E+Asp,Glu library.
The 8E+Asp,Glu library was dissolved in 10 mM ABP buffer (pH 7.4) and resolved by size-exclusion chromatography. The soluble aggregate fraction (lime) at 8.5 ml was isolated and analyzed by far-UV circular dichroism (CD) and HPLC amino acid analysis. The far-UV CD spectrum possesses a negative peak at 215 nm and a positive peak at 192 nm, signaling β-sheet secondary structure. The HPLC amino acid analysis showed enrichment in branched amino acids (Val (V), Leu (L), and Ile (I)) and depletion of Pro (P) and acidic residues (Asp (D), Glu (E)). The monomer fraction (red) at 13.5 ml possesses a far-UV CD spectrum and amino acid content similar to that of the input peptide library (blue).
This shift in the amino acid frequencies matches closely with the amino acids known to have higher (Val, Leu, Ile) and lower propensity (Glu) to occur in β-sheets (29). On the other hand, the late-eluting peak, which we can assign to monomeric peptides, possessed a CD spectrum and an amino acid composition relatively similar to the original unseparated peptide library (Figure 3), though with an even more negative peak at 200 nm (corresponding to loops and coils) and a narrower shoulder at 222 nm (corresponding to regular secondary structures, α-helices primarily).
The remarkable overall folding potential of the 8E+Asp,Glu library comes into starker focus when compared with the other libraries (Supplemental Figure S7, spectral data compiled in Supplemental Data S2). Even the chemically similar 8E+Aad,Api alphabet shows a diminished spectral feature at 215 and 222 nm suggesting peptides with attenuated secondary structure when the negatively charged carboxylic acids are placed further away from the backbone. Moreover, we find that all six alphabets containing basic residues, either alone or in combination with acidic residues, exhibit strikingly diminished capacity to form secondary structures (Supplemental Figure S7). We were surprised by these results particularly in relation to the four mixed-charge libraries, as they have isoelectric point distributions more representative of naturally evolved proteins. These findings are invariant to pH, and indeed it is only the acidic alphabets whose secondary structure capacity show any pronounced dependence on pH. As a control to ensure that the folding properties were not influenced by the ABP buffer (which enabled the interrogation of a broad pH range), we recorded CD spectra for 8E+Asp,Glu, 8E+Lys,Arg, and 8E+Lys,Arg,Asp,Glu in Tris-HCl (pH 7.4) and found this change resulted in no significant differences with respect to ABP buffer (Supplemental Figure S8). As a further control, we acquired CD spectra of these random peptide libraries with varying amounts of 2,2,2-trifluoroethanol (TFE) added (between 10-90% (v/v)) as a cosolvent in order to induce α-helical structure (Supplemental Figure S9). All the alphabets responded to TFE induction, with pronounced negative features at 205 and222 nm appearing in all cases with TFE concentrations of 50% and higher. These experiments serve as useful controls that show all libraries have, in principle, the capacity to fold, and none of the noncanonical amino acids impose steric constraints that explicitly prevent it.
The peptide assembly is thermally stable but distinct from amyloid structures.
We used mass photometry to estimate the aggregation number of the peptide assemblies in the 8E+Asp,Glu library. These experiments identified a non-Gaussian peak with a maximum at ca. 80 kDa (ca. 32 peptides, Supplemental Figure S10b) and a fat tail extending to 400 kDa, which is removed entirely in the late-eluting SEC fraction (Supplemental Figure S10c) and enriched substantially in the early-eluting SEC fraction (Figure 4a). Hence, we conclude that the 8E+Asp,Glu alphabet admits a rather intriguing characteristic: a sub-population of peptides that spontaneously assemble into larger soluble structures consisting primarily of β-sheets. Most intriguingly, this property emerges in random sequence space, though only with the appropriate alphabet.
Figure 4. Size, morphology and stability of the 8E+Asp,Glu assemblies.
(a) The mass histogram of the isolated assembly fraction of 8E+Asp,Glu library in 10 mM Tris-HCl (pH 7.4), 50 mM NaCl assessed by mass photometry. (b) Temperature dependent far-UV circular dichroism (CD) of assembly fraction in 10 mM ABP (pH 7.4) performed in temperature range shown on scale in panel c. (c) Size exclusion chromatography of isolated assembly fraction performed in 10mM ABP (pH 7.4) after 1 hour incubation at 20, 70, 80 and 90 °C. (d) Far-UV CD of assembly fraction in 10mM ABP (pH 7.4), with increasing concentration of guanidine hydrochloride (Gu*HCl) as indicated in the panel. (e) Transmission electron micrograph of the isolated assembly fraction of 8E+Asp,Glu library. The electron micrographs were obtained for 20 μg/ml fraction samples in 10 mM Tris-HCl (pH 7.4).
To evaluate the thermal stability of the 8E+Asp,Glu assemblies, we analyzed the isolated assembly fraction using temperature-dependent CD spectroscopy. The analysis revealed that the characteristic β-sheet structure of the purified assemblies was retained up to 90 °C, showing remarkable resistance to thermal denaturation (Figure 4b). This observation was further supported by size exclusion chromatography with the purified assemblies performed right after 1 hour incubation at 70 °C, 80 °C, and 90 °C, which showed only minor signs of disassembly (Figure 4c). Similarly, the assemblies are resistant to chemical degradation by GuHCl (Figure 4d).
Additionally, we collected transmission electron micrographs of the 8E+Asp,Glu assemblies and compared them with the monomer SEC fraction and total fractions of the 8E+Asp,Glu and 8E+Asp,Glu,Lys,Arg libraries (Figure 4e). Globular particles were observed in the assembled peptide species (detectable in the total 8E+Asp,Glu fraction and enriched in the early-eluting SEC fraction), but no fibrillar structures that would potentially resemble amyloids were observed in the micrographs.
Charge-variant libraries bind metal ions with different preferences.
To assess the inherent propensities of random peptide sequences of different compositions to bind metal ions (Fe2+, Ca2+, Zn2+, Ni2+, Co2+), we monitored the fluorescence of the metal-sensitive dye Fura-2 in the presence and absence of selected peptide libraries (Figure 5). Libraries 8E+Asp,Glu and 8E+Asp,Glu,Lys,Arg were used to evaluate the influence of differently charged residues on metal binding, whereas the 19F library - representing the complete set of amino acids (excepting Cys, excluded for synthetic reasons) - served as a control that included additional potential ligands such as histidine.
Figure 5. Metal ion binding propensity.
Chelation of biologically relevant metal ions (Ca2+, Zn2+, Ni2+, Co2+, and Fe2+) was measured for three canonical amino acid alphabets (where 19F stands for the full set of canonical amino acids excepting Cys) using a fluorescent reporter-based assay. EDTA was used as a positive control. Error bars indicate the standard deviation from triplicate measurements.
All random-sequence libraries exhibited measurable metal-binding propensities, which increased with the compositional complexity of the libraries (Figure 5). Among the tested ions, 8E+Asp,Glu displayed the strongest affinity for Ca2+, consistent with Hard and Soft Acids and Bases (HSAB) theory (30), and this binding was only marginally enhanced by the addition of further amino acid types. Notably, Asp and Glu are also the most frequent Ca2+-binding residues in modern proteins (31). In contrast, Zn2+- an ion intermediate between hard and soft acids—showed unexpectedly strong binding to the 8E+Asp,Glu library, with further enhancement observed for the 19F library, likely reflecting the contribution of residues such as His. The substantial Zn2+ affinity of 8E+Asp,Glu is intriguing given that stable coordination of transition metals in modern proteins typically requires softer donor groups (31). As anticipated from HSAB principles, binding of Fe2+, Ni2+, and Co2+ was limited for 8E+Asp,Glu but markedly increased in the 19F library.
Large Language Models Recapitulate Structure-Forming Potential of Acidic Residues.
Obtaining structural information on our random peptide libraries is challenging on two levels: (i) the vast compositional diversity of the libraries (essentially every sequence in the mixture is unique) and (ii) these short unevolved 25-mer peptides are expected to be dynamic and conformationally heterogenous. Recently, large language models have been developed for protein structure prediction (32, 33). These models do not rely on multiple sequence alignments, and hence it has been argued that they have learned a more generic relationship between sequence and structure that is not dependent on biological sequences or even biological amino acid compositions. We therefore carried out ESMFold structure predictions on 50,000 unique and randomly generated 25-mer sequences from four alphabets comprising only canonical amino acids (8E, 8E+Asp,Glu, 8E+Lys,Arg, and 8E+Lys,Arg,Asp,Glu), as ESMFold does not support noncanonical amino acids.
In consonance with our experimental studies, ESMFold predicts that peptide compositions drawing from the prebiotically plausible 8E+Asp,Glu alphabet are slightly more compact (median Rg = 1.63 nm) than peptides drawing from the basic alphabet (median Rg = 1.82 nm) or the charge-balanced alphabet (median Rg = 1.69 nm) (Supplemental Figure S11a). On the other hand, ESMFold predicts that the 8E library, which possesses no charged residues, generates the most compact structures (Supplemental Figure S11a, median Rg = 1.36 nm). At face value, this result appears to be in direct contrast with CD spectroscopy (Supplemental Figure S11b) of this peptide library, which evidences very little secondary structure. However, fluorometric solubility assays show that the 8E library has poor solubility at all pH’s (Supplemental Figure S11b), which can be ascribed to its very hydrophobic composition (results for all other assays on the 8E peptide library are given in Supplemental Figure S12). One can reconcile these findings by suggesting that 8E does possess many sequences that can fold in principle, but those peptides also aggregate and are therefore depleted from the soluble fraction on which CD spectra were acquired.
We next assessed the secondary structures in the generated structural models of these random sequences with DSSP4.4 (34) (Supplemental Figure S11c). In general, coil conformations are the most common followed by helices; β-sheets appear to be quite rare amongst random peptides. The 8E library showed the highest helical content (19.6%, consistent with its lower Rg values), but surprisingly, the 8E+Asp,Glu sequences had the lowest helical content (9.7%). On the other hand, 8E+Asp,Glu exhibited higher levels of bends and turns (10.3%, which can help a chain compact by turning on itself), consistent with the radius of gyration calculations (35).
To support these patterns in secondary structure, we also analyzed 100,000 random sequences from each of the four alphabets with three different sequence-based secondary structure prediction algorithms - Spider3 (36), Jnet (37), and Gor4 (38) (Supplemental Figure S13). While the Gor4 results are rather inconclusive, Jnet and Spider3 qualitatively agree with ESMFold structural models in that they also predict the highest secondary structure propensity in the hydrophobic 8E library, and the lowest propensity in the 8E+Asp,Glu library.
ESMFold (Supplemental Figure S11c) and the secondary structure prediction algorithms (Supplemental Figure S13) are most inconsistent with experiments in their prediction for higher levels of α-helices in the 8E+Lys,Arg peptide library, which is not evidenced in CD spectra (cf. Supplemental Figure S7). We wondered if perhaps, given its directive to “force” proteins fold, ESMFold predicts extended alpha helices for 8E+Lys,Arg sequences that in reality are unstable in aqueous solution given their lack of hydrophobic packing interactions. To test this, we applied a generative model for protein thermodynamic stability prediction (39). Correlating predicted ΔG against predicted Rg revealed that 8E+Asp,Glu peptides – whilst not more stable overall – were more likely to positively correlate stability with compaction (Supplemental Figure S14). In other words, 8E+Asp,Glu prebiotic alphabet admits a clearer structure-stability relationship whereas 8E+Lys,Arg in particular does not. In further support of this idea, when we reconsider CD spectra in which folding was induced with TFE (Supplemental Figure S9), it is clear that both basic libraries (8E+Lys,Arg and 8E+Dap,Dab) revealed greater inducibility, with lower secondary structure than 8E+Asp,Glu at TFE at 20% (v/v) and below, but then surpassing 8E+Asp,Glu at TFE at 30% (v/v) and higher. In summary, the experiments and generative modeling support the premise that 8E+Lys,Arg has a higher helical propensity in principle, that is not realised in practice in aqueous solution due to its lower capacity to form compact structures. The acidic library, on the other hand, can better stabilise secondary structures by adopting more compact conformations.
Discussion
Our research into the properties of prebiotically-plausible amino acid alphabets began with the assumption that a severe reduction in the chemical inventory would imply proteins with more limited folding and function compared to modern proteins with their access to aromatic, basic, amidic, and sulfur-containing building blocks. Although such a rudimentary alphabet has been shown to meet simple requirements – such as the ability to fold (18, 20), bind RNA (15), or bind organic cofactors (40) – its potential role before the advent of templated synthesis has remained unclear. In light of the results presented here, the early 8E+Asp,Glu alphabet appears to constitute a uniquely adaptive pool of building blocks, well suited to act as a transitional pool toward the emergence of structured, function-bearing nanostructures. These adaptive features stem from the ability of statistical peptides of this composition to self-assemble and coordinate metal ions.
Among the tested charge-variant libraries, 8E+Asp,Glu possesses substantial secondary structure propensity whilst also maintaining high solubility, a delicate “balancing act” that it pulls off, unlike the hydrophobic 8E alphabet and the modern alphabet, cf. Supplemental Figure S11-15 and previous work (14, 20), which are highly prone to aggregate. The fact that 8E+Aad,Api alphabet recapitulates most of the key biophysical features of 8E+Asp,Glu shows that there is some leeway in the chemical space that admits peptides that can fold and assemble. These results suggest that side-chain length is perhaps less critical of a modulator than we had initially hypothesised (20). Together, these findings suggest that 8E+Asp,Glu was a strong candidate for assisting the emergence of proteins prior to purifying selection. The modern 20-letter alphabet has great functional breadth, but this is realized through specific sequences that have been finessed by selection and frequently rely on chaperones to fold (14, 41). Random sequences utilising this more advanced amino acid repertoire, on the other hand, suffer from aggregation. An amino acid set that promotes high solubility and foldability across unevolved sequences would have played a key adaptive role during the early stages of life’s emergence, for which the early 8E+Asp,Glu alphabet would be more suitable.
Both experiments and AI modelling agree that β-sheets are rare amongst random peptide sequences as monomers. We report here, however, that a subset of 8E+Asp,Glu peptides self-assemble into higher-order oligomers with pronounced β-sheet content. This finding is particularly germane to the theory that amyloids could have played a key role during the origin of life (the amyloid world hypothesis), with their ability to form complex structures and propagate using short peptides as monomers (27). Nevertheless, the formation of amyloids typically depends on specific, repetitive sequence patterns—features that would have been difficult to achieve under prebiotic conditions.
Compositionally amphiphilic peptides (similar to those found in the 8E+Asp,Glu library) have been explored by design in the field of peptide nanotechnology, assembling into supramolecular structures rich in β-sheets (42). Our results here constitute, to the best of our knowledge, the first observation of soluble oligomeric β-sheet assemblies arising from random and highly heterogenous peptide sequences comprising early amino acids, i.e. sequences that would plausibly be formed preceding templated ribosomal synthesis and the onset of cellular life. Whilst we would refrain from referring to these species as fibrils or amyloids (their molecular weights span between 80–400 kDa, they are soluble, and do not show fibrillar morphology in electron micrographs), they provide evidence that beta structures were also available to early oligomeric peptides, extending previous observations on β-content in insoluble protein (43). Importantly, self-assembled peptides stemming from relatively short sequences have been implied as plausible sites of (prebiotic) catalysis and interaction scaffolds, in the absence of longer polypeptides and tertiary folding (42, 44-47). Intermolecular assemblies of peptides lead to net release of water molecules at nonpolar interfaces and can lead to increased reaction rates (48). Besides the involvement of the non-polar amino acids (Val, Leu, and Ile, all of which are enriched in the assemblies) in stabilization of the peptide nanostructures, the intermolecular interaction between the β-sheets are likely further supported by the metal ions that bind substantially to the 8E+Asp,Glu library (such as Ca2+) as implied by the assembly weakening in high ionic strength. The self-assembling capacity of random 8E+Asp,Glu peptides emerges with peptides longer than 20 amino acids (Supplemental figure S5). Such a length threshold could have represented a critical step in the transition from short oligomers towards structurally more defined, self-organizing peptide systems capable of exhibiting primitive functional properties. Moreover, the high thermal and chemical stability of the observed assemblies suggests that such β-sheet-based architectures could have endured the fluctuating conditions of the prebiotic environment, and provided them with a fitness advantage.
The lower affinity of Asp and Glu residues for transition metals, as predicted by the HSAB principle, aligns well with the reducing and alkaline nature of prebiotic environments, where ions such as Ni2+ or Fe2+ would have been scarce due to their tendency to precipitate under such conditions. Conversely, the binding of alkaline earth metals such as Ca2+ and Mg2+, which are both hard ions, may have promoted primitive catalytic activity and contributed to the emergence of the earliest metalloproteins, which were likely smaller and relied on simpler coordination environments compared to modern metalloproteins, which use Cys and His residues in evolutionarily optimized geometries.
A view of the protein universe based solely on extant proteins might suggest that polypeptide sequences which fold, assemble, and stay soluble are rare, and once discovered, are conserved, thereby explaining the relatively small number of unique fold topologies used in Nature (49). Nevertheless, this is a view based on the modern amino acid alphabet. An early acidic amino acid alphabet can produce random polypeptides with these properties with higher probability. With this view in mind, the emergence of polymers that can fold and self-assemble from a primordial soup rich in α-amino acids begins to appear less of a statistical anomaly, but rather, quite plausible.
Materials and Methods
Solid-phase synthesis.
Synthesis was conducted using the standard Fmoc peptide synthesis methodology using the SPENSER automated synthesizer (IOCB Prague, Czech Republic). The synthesis of the libraries was conducted on 384-well filter plates (Cytiva 5072-N) with a reaction scale of 1280 nmol per well. Incubations were carried out using an apparatus that alternated between N2 overpressure (0.05 atm) and suction (−0.75 atm) to mix the well contents, with a period of 5 s between 200 ms suction pulses. To start the synthesis, Fmoc-protected amino acids were weighed (see Supplemental Data S1), combined, and then dissolved to form a 0.3 M solution with 0.375 M OxymaPure in dimethylformamide (DMF) to form an isokinetic mixture. Deprotection was performed by sequential addition of 6 × 40 μL of 20% piperidine in DMF, with each addition incubated for 30 min. Wells were washed using 35, 80, and 6 × 35 μL of DMF. Supernatants were removed via suction applied for 1 min. Coupling was carried out using 2 × 45 μL of isokinetic mixture, and 375 mM N,N’-diisopropylcarbodiimide in DMF and incubated for 3 h each time, followed by washing, deprotection and washing. After the final deprotection cycle, the resin was thoroughly rinsed with DMF and all liquid contents were removed via suction applied for 1 min. After that, resin was taken out of the 384-well filter plate via centrifugation at 4000 × g for 5 minutes and combined. The dried, combined resin underwent treatment with 10 ml of Mixture K (trifluoroacetic acid–water–triisopropylsilane, 95:2.5:2.5, v/v) per gram of resin for 2 h. The resin was then filtered out, and the resulting peptide solution was precipitated with diethyl ether (20 ml/1 ml Mixture K) and centrifuged at 4000 × g for 5 minutes. The resulting pellet was suspended in 1 mM HCl and subjected to lyophilization. Before conducting subsequent experiments, all peptide libraries were further lyophilized thrice from 1 mM HCl overnight, establishing Cl− as the counterion.
Quality control.
The molecular weight distributions of combinatorial peptide libraries were confirmed by mass spectrometry using UltrafleXtreme MALDI-TOF/TOF mass spectrometer (Bruker Daltonics, Germany) according to the standard procedure. Prior to the experiment, peptide libraries were dissolved in acetonitrile:water (1:1) mixture to the final concentration of 10 mg/ml. 2,5-Dihydroxybenzoic acid (DHB) was used as a matrix.
Prior to the amino acid analysis, the library samples were hydrolyzed in 6 M hydrochloric acid at 110 °C for 20 h and the hydrolysate was evaporated and reconstituted with 0.1 M hydrochloric acid containing the internal standard. Amino acid analysis was performed on Agilent 1260 HPLC (Agilent Technologies, Germany) equipped with an UV-Vis detector using automated ninhydrin derivatization.
Solubility and aggregation analysis.
Prior to analysis, a series of universal Britton-Robinson (ABP) buffers was prepared by mixing 20 mM acetic acid, 20 mM phosphoric acid and 20 mM boric acid and adjusting pH with 5 M NaOH. The ionic strength was adjusted with NaCl to conductivity that corresponds to the conductivity of 50 mM or 500 mM NaCl. Peptide libraries were thoroughly resuspended in autoclaved MilliQ water to the final concentration of 5 mg/ml. The peptide suspensions were diluted into the corresponding ABP buffer to the final concentration of 0.5 mg/ml and incubated for 30 min with mild shaking (500 rpm) at room temperature. After incubation, the insoluble fraction was separated by centrifugation at 21,300 g for 30 min at 4 °C, and the supernatant (cleared sample) was subsequently used for solubility and aggregation analyses.
For all peptide libraries except 8E library, the amount of soluble peptides was estimated by measuring absorption of peptide bonds at 215 nm on a Nanodrop 2000 (Thermo Fisher Scientific, USA).
Due to its low solubility, the relative solubility of 8E peptide library was estimated with a fluorescamine assay(50) in a 96-well plate format. In brief, 90 μl of the peptide library solution was mixed with 30 μl of 3 mg/ml fluorescamine in DMSO, and the resulting mixture was incubated for 30 min in the dark at room temperature in a dark-bottom 96-well plate. The fluorescence was then recorded on Spark microplate reader (Tecan, Switzerland) using the following parameters: excitation at 365 nm, emission at 470 nm. 0.5 mg/ml solution of 8E peptide library in DMSO was used as a standard. All solubility measurements were conducted in technical duplicate for every library at every pH at two ionic strengths.
The relative amount of soluble aggregates was estimated by size-exclusion chromatography. A 100 μl aliquot of clarified sample (following centrifugation) was loaded onto a Superdex 75 Increase 10/300 GL column (Cytiva, USA) that was pre-equilibrated with two bed volumes of the corresponding 20 mM ABP buffer with either 50 or 500 mM NaCl. The peptides were eluted from the column isocratically on NGC Quest Plus System (Bio-Rad, USA), by one bed volume of the buffer at 0.5 ml/min flow rate at room temperature, and the eluted peptides were detected by absorption at 215 nm. The aggregation analysis was performed by quantifying the relative intensity underneath the peaks at ~8 or ~14 ml elution volume in the acquisition software (ChromLab Software v. 6.1). Size exclusion chromatography runs were conducted once for every library at every pH at two ionic strengths.
Secondary structure analysis.
The CD spectra were recorded using a Chirascan-plus spectrophotometer (Applied Photophysics, UK) over the wavelength range 190–260 nm in steps of 1 nm with an averaging time of 1 s per step. Cleared samples at 0.2 mg/ml nominal concentration in 1 mm path-length quartz cells were placed into a cell holder, and spectra were recorded at room temperature. The CD signal was obtained as ellipticity in units of millidegrees, and the resulting spectra were averaged from two scans and buffer-spectrum subtracted. All CD measurements were performed twice, and the resulting ratio of ellipticity at 222 nm to ellipticity at 200 nm was averaged and used to estimate the secondary structure content in peptide libraries.
To estimate the effect of pH on secondary structure, 5 mg/ml suspensions of peptide libraries in Milli-Q water were diluted into 10 mM ABP buffers at pH 3.0, 5.0, 7.4, 9.0, and 11.0 to the final nominal concentration of 0.2 mg/ml, gently mixed at room temperature for 30 min, and then centrifuged at 21,300 g for 15 min at 4 °C in order to remove insoluble fraction.
To estimate the effect of 2,2,2-trifluoroethanol on secondary structure, 5 mg/ml suspensions of peptide libraries in MilliQ water were diluted in a series of 10 mM ABP (pH 7.4) supplemented with 0–90% (v/v) 2,2,2-trifluoroethanol.
To estimate the effect of buffer on secondary structure, 5 mg/ml suspensions of peptide libraries in MilliQ water were diluted in 10 mM ABP (pH 7.4) or 10 mM Tris-HCl (pH 7.4).
Isolation and analysis of total, aggregate and monomer fractions from 8E+Asp,Glu library.
The 800 μl aliquot of cleared 0.5 mg/ml 8E+Asp,Glu library solution in 10 mM ABP (pH 7.4) was loaded onto a Superdex 75 Increase 10/300 GL column (Cytiva, USA) that was pre-equilibrated with two bed volumes of 10 mM ABP buffer (pH 7.4). The peptides were eluted from the column by one bed volume of the buffer at 0.5 ml/min flow rate at room temperature, and the eluted peptides were detected by absorption of peptide bonds at 215 nm. The aggregate (~8.5 ml) and monomer (~13.5 ml) fractions were isolated and concentrated to 0.2 mg/ml using Amicon Ultra centrifugal filters (3 kDa MWCO). The total fraction was prepared by direct dilution of 0.5 mg/ml 8E+Asp,Glu library solution with 10 mM ABP (pH 7.4) to 0.2 mg/ml. The secondary structure analysis and HPLC amino acid analysis of total, aggregate and monomer fractions was performed as described above. The collection of the total, aggregate, and monomer fractions from the 8E+Asp,Glu library and the HPLC amino acid analysis on those fractions was conducted in technical triplicate. CD measurements and secondary structure analysis were conducted in technical duplicate.
The mass photometry data were collected using a TwoMP system (Refeyn, UK)(51). Prior to analysis, total, aggregate and monomer fractions were diluted with 10 mM Tris-HCl (pH 7.4), 50 mM NaCl to the final concentration of 20 μg/ml (~8 μM). 17.5 μl of 10 mM Tris-HCl (pH 7.4), 50 mM NaCl was placed in a well of glass casket (Refeyn, UK) to focus the objective and subsequently mixed with 2.5 μl of the sample to achieve the final concentration of 2.5 μg/ml (~1 μM). Mass photometry data was acquired and analyzed using AcquireMP and DiscoverMP software (Refeyn, UK), respectively. Mass calibration was performed using 10 nM bovine serum albumin (BSA) in 10 mM Tris-HCl (pH 7.4), 50 mM NaCl as a standard. The mass photometry analysis was performed once.
The morphology of monomer, aggregate and total fractions of 8E+Asp,Glu peptide library was studied by transmission electron microscopy (TEM) using negative staining. Prior to analysis, the 800 μl aliquot of cleared 0.5 mg/ml 8E+Asp,Glu library solution in 10 mM Tris (pH 7.4) was resolved into monomer and aggregate fractions according to the aforementioned protocol. The total fraction was prepared by direct dilution of 0.5 mg/ml 8E+Asp,Glu library solution with 10 mM Tris (pH 7.4) to 0.2 mg/ml. Additionally, the total fraction of 8E+Lys,Arg,Asp,Glu peptide library was used as a reference library with no observable aggregation. For the analysis, 0.2 mg/ml samples in 10 mM Tris (pH 7.4) were diluted with buffer to 20 μg/ml final concentration. The parlodion-carbon coated, 200 mesh copper grids (SPI supplies, USA) were glow discharged with GLOQUBE plus instrument (Quorum Technologies, UK) and soaked with 20 μl of the sample solution for 1 min at room temperature. The excess sample was soaked away with filter paper (Whatman), and the grids were subsequently stained two times with 20 μl of 2% (w/v) freshly filtered uranyl acetate in water for 1 min at room temperature. The grids were dried overnight at room temperature and examined in JEM-2100PLUS transmission electron microscope (JEOL, Japan) equipped with TemCam-XF416 CMOS camera (TVIPS, Norway) that was operated at 200 kV.
The stability of 8E+Asp,Glu aggregate fraction was assessed by CD spectroscopy measured upon titration with guanidine hydrochloride and temperature-dependent CD spectroscopy. The isolated aggregate fraction (prepared in the same way as described above) in 10 mM ABP (pH 7.4) was concentrated to 0,1 mg/ml using Amicon Ultra centrifugal filters (3 kDa MWCO). The ECD spectra were measured using the Jasco 1500 spectropolarimeter equipped with the Peltier holder PTC-517. Temperature dependence of CD spectra was measured in spectral range 200-280 nm from 10 °C to 90 °C with the step of 10 °C in 0.5 mm path length quartz cell (scanning speed of 10 nm/min, response time of 8 seconds, scanning step 0.5 nm, 1 accumulation). For the guanidine hydrochloride titration experiment, a concentration range of 1–2 M was employed. After baseline correction, the final spectra were expressed as molar ellipticity θ (deg·cm2·dmol−1) per residue.
To support the CD results and prove the stability of 8E+Asp,Glu assemblies, isolated fraction (prepared in the same way as described above) in 10 mM ABP (pH 7.4) was split in 4 aliquots (900 ml) and incubated separately at 20, 70, 80 and 90 °C for 1 hour prior to analytical SEC. 800 μl aliquot of sample was loaded onto a Superdex 75 Increase 10/300 GL column (Cytiva, USA). The peptides were eluted from the column isocratically on NGC Quest Plus System (Bio-Rad, USA), by one bed volume of the buffer (10mM ABP, pH 7.4) at 0.5 ml/min flow rate at room temperature, and the eluted peptides were detected by absorption at 215 nm.
Metal chelation assay using Fura-2 fluorescent reporter
Metal chelation assays were performed by monitoring the fluorescence of the metal-sensitive dye Fura-2 in the presence of various divalent metal ions and competing peptide ligands. The assay was conducted in a 384-well microplate format, with all measurements performed in triplicate.
Each well contained a total volume of 21.08 μl. The final well composition consisted of 10 μl of metal ion solution in 100 mM HEPES buffer at pH 7.4, 10 μl of Na5.Fura-2 solution in the same buffer (both titrated to pH 7.4 using NaOH), and 1.08 μl of a peptide library stock solution in 10 mM DMF (final concentration of DMF was 5.3% v/v).
The metal ion concentration (Fe2+, Ca2+, Zn2+, Ni2+, and Co2+) in each well was 16.6 μM, while the Fura-2 concentration was 2.2 μM. Peptides were added at a final concentration of 510 μM to assess their ability to competitively bind metal ions and displace Fura-2 from the complex. Notably, the Fe2+ stock appeared partially oxidized, as indicated by an EDTA titration, which affected the assay only by lowering the effective Fe2+ concentration, since Fe3+ is not detectable by Fura-2.
Four different peptides libraries were tested for their metal chelation ability: 10E (assumed MW 2474.24), 10KR (MW 2574.56), 12DEKR (MW 2656.94), and 19F (MW 3008.93). All peptides were added to achieve the same final molar concentration of 510 μM, with corresponding stock concentrations prepared in the range of 1.3 to 1.5 mg/ml depending on peptide molecular weight.
Fluorescence changes of the Fura-2–metal complexes at Ex 363 nm, Em 512 nm for Fe2+ and Co2+, and Ex 335 nm, Em 505 nm for Ca2+, Zn2+ and Ni2+, were monitored following peptide addition to determine competitive chelation behavior. Fractional metal ion chelation (Chelation %) was calculated by linear normalization as:
Where is the fluorescence intensity of the mixture containing Fura-2, peptide library, and metal ions; is the intensity of Fura-2 with the peptide library alone (100 % chelation control); and is the intensity of Fura-2 with the metal ions alone (0 % chelation control). Plotted values show medians of triplicate measurements, with standard deviations as error bars.
Radius of gyration, secondary structure annotation, and unfolding free-energy analysis on predicted peptide structures.
Fifty thousand (50,000) unique and random 25-mer peptide sequences were generated in-silico by uniformly sampling the amino acid alphabets of each library (8E, 8E+Asp,Glu, 8E+Lys,Arg, 8E+Lys,Arg,Asp,Glu and the full canonical alphabet of 20 amino acids). The heavy atom coordinates of all two hundred fifty thousand (250,000) peptide sequences were predicted using ESMFold-v1(32). These structures were used to calculate the radius of gyration (GROMACS 2024.2), annotate their secondary structure (DSSP 4.4), and estimate the unfolding free energy(39). All histograms and kernel density estimations were calculated using Python v3.12.0 and SciPy v1.13.0.
Gor4, Jnet, and Spider3 secondary structure predictions.
For each of the four libraries (8E, 8E+Asp,Glu, 8E+Lys,Arg, 8E+Lys,Arg,Asp,Glu), one hundred thousand (100,000) unique random 25-mer peptide sequences were generated in silico by equimolar random sampling of the amino acids. For these sets of 100,000 sequences, the secondary structure was predicted by the Spider3(36), Jnet(37), Gor4(38) algorithms. In all three cases, only the list of sequences was the input information for the prediction. The secondary structure propensity of each library was then calculated as a mean of the fraction of the amino acids in each sequence predicted to be either α-helix or β-sheet for all 100,000 sequences.
Supplementary Material
Significance Statement.
Modern proteins rely on a 20-letter amino acid alphabet to build the intricate structures essential for life. Yet, on the early Earth, many of these amino acids - especially the basic ones - were likely absent, and primitive peptides probably formed as random sequences rather than from genetic templates. Could such simple unevolved peptides already provide biological organization? We find that random peptides made only from prebiotically plausible amino acids spontaneously fold and assemble into stable, soluble β-sheet–rich nanostructures. This surprising capacity for self-organization suggests that even simple, early peptides could have provided the first scaffolds for molecular interactions, laying groundwork for the emergence of biological complexity.
Acknowledgements
We thank Dr. Radko Souček and Dr. Kateřina Nováková for their technical support during the quality control of the peptide libraries. We thank John Cava for assistance with installation of ESMFold on compute clusters. In addition, we would like to thank Prof. Stephen Freeland for helpful discussions about this work. This work was supported by a Research Grant from HFSP https://doi.org/10.52044/HFSP.RGEC272023.pc.gr.168579. S.D.F. acknowledges support from the NIH Director’s New Innovator Award (DP2-GM140926), a Cottrell Scholar Award, and a Sloan Fellowship for support. E.M.S. thanks the Program in Molecular Biophysics training grant (NIH-T32GM135131). M.M. and K.H. acknowledge CMS CF Biophysical techniques of CIISB, Instruct-CZ Centre, supported by MEYS CR Infrastructure project LM2018127 and European Regional Development Fund-Project, “UP CIISB” (No. CZ.02.1.01/0.0/0.0/18_046/0015974). Computational resources were provided by the e-INFRA CZ project (ID:90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic.
Footnotes
Competing Interest Statement: The authors have no competing interests to declare
Data availability
To facilitate resynthesis, all product codes and loadings for Fmoc-protected amino acids used for solid phase peptide synthesis are reported as Supplemental Data S1. To facilitate reanalysis, raw CD spectral data, solubility data, and amino acid analysis are supplied as Supplemental Data S2. Computational results including predicted radius of gyration and secondary structure for 50,000 random sequences for 5 alphabets are supplied in Supplemental Data S3.
References
- 1.Weber A. L., Miller S. L., Reasons for the occurrence of the twenty coded protein amino acids. J Mol Evol 17, 273–284 (1981). [DOI] [PubMed] [Google Scholar]
- 2.Cleaves H. J. II, The origin of the biologically coded amino acids. Journal of Theoretical Biology 263, 490–498 (2010). [DOI] [PubMed] [Google Scholar]
- 3.Philip G. K., Freeland S. J., Did Evolution Select a Nonrandom “Alphabet” of Amino Acids? Astrobiology 11, 235–240 (2011). [DOI] [PubMed] [Google Scholar]
- 4.Mayer-Bacon C., Agboha N., Muscalli M., Freeland S., Evolution as a Guide to Designing xeno Amino Acid Alphabets. International Journal of Molecular Sciences 22, 2787 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Brown S. M., Mayer-Bacon C., Freeland S., Xeno Amino Acids: A Look into Biochemistry as We Do Not Know It. Life 13, 2281 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Miller S. L., Urey H. C., Organic Compound Synthesis on the Primitive Earth. Science 130, 245–251 (1959). [DOI] [PubMed] [Google Scholar]
- 7.Johnson A. P., et al. , The Miller Volcanic Spark Discharge Experiment. Science 322, 404–404 (2008). [DOI] [PubMed] [Google Scholar]
- 8.Blanco C., Bayas M., Yan F., Chen I. A., Analysis of Evolutionarily Independent Protein-RNA Complexes Yields a Criterion to Evaluate the Relevance of Prebiotic Scenarios. Current Biology 28, 526–537.e5 (2018). [DOI] [PubMed] [Google Scholar]
- 9.Elsila J. E., et al. , Meteoritic Amino Acids: Diversity in Compositions Reflects Parent Body Histories. ACS Cent. Sci. 2, 370–379 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Glavin D. P., et al. , Abundant ammonia and nitrogen-rich soluble organic matter in samples from asteroid (101955) Bennu. Nat Astron 9, 199–210 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Naraoka H., et al. , Soluble organic molecules in samples of the carbonaceous asteroid (162173) Ryugu. Science 379, eabn9033 (2023). [DOI] [PubMed] [Google Scholar]
- 12.Higgs P. G., Pudritz R. E., A Thermodynamic Basis for Prebiotic Amino Acid Synthesis and the Nature of the First Genetic Code. Astrobiology 9, 483–490 (2009). [DOI] [PubMed] [Google Scholar]
- 13.Despotović D., et al. , Polyamines Mediate Folding of Primordial Hyperacidic Helical Proteins. Biochemistry 59, 4456–4462 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Tretyachenko V., et al. , Modern and prebiotic amino acids support distinct structural profiles in proteins. Open Biology 12, 220040 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Giacobelli V. G., et al. , In Vitro Evolution Reveals Noncationic Protein–RNA Interaction Mediated by Metal Ions. Mol Biol Evol 39, msac032 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Tanaka J., Doi N., Takashima H., Yanagawa H., Comparative characterization of random-sequence proteins consisting of 5, 12, and 20 kinds of amino acids. Protein Science 19, 786–795 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Newton M. S., Morrone D. J., Lee K.-H., Seelig B., Genetic Code Evolution Investigated through the Synthesis and Characterisation of Proteins from Reduced-Alphabet Libraries. ChemBioChem 20, 846–856 (2019). [DOI] [PubMed] [Google Scholar]
- 18.Giacobelli V. G., et al. , Ancient amino acid sets enable stable protein folds. [Preprint] (2025). Available at: https://www.biorxiv.org/content/10.1101/2025.10.29.685319v1 [Accessed 30 October 2025]. [Google Scholar]
- 19.Meierhenrich U. J., Muñoz Caro G. M., Bredehöft J. H., Jessberger E. K., Thiemann W. H.-P., Identification of diamino acids in the Murchison meteorite. Proceedings of the National Academy of Sciences 101, 9182–9186 (2004). [Google Scholar]
- 20.Makarov M., et al. , Early Selection of the Amino Acid Alphabet Was Adaptively Shaped by Biophysical Constraints of Foldability. J. Am. Chem. Soc. 145, 5320–5329 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Leman L., Orgel L., Ghadiri M. R., Carbonyl Sulfide-Mediated Prebiotic Formation of Peptides. Science 306, 283–286 (2004). [DOI] [PubMed] [Google Scholar]
- 22.Forsythe J. G., et al. , Ester-Mediated Amide Bond Formation Driven by Wet–Dry Cycles: A Possible Path to Polypeptides on the Prebiotic Earth. Angewandte Chemie International Edition 54, 9871–9875 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Canavelli P., Islam S., Powner M. W., Peptide ligation by chemoselective aminonitrile coupling in water. Nature 571, 546–549 (2019). [DOI] [PubMed] [Google Scholar]
- 24.Foden C. S., et al. , Prebiotic synthesis of cysteine peptides that catalyze peptide ligation in neutral water. Science 370, 865–869 (2020). [DOI] [PubMed] [Google Scholar]
- 25.Zhang Y., Cremer P. S., Interactions between macromolecules and ions: the Hofmeister series. Current Opinion in Chemical Biology 10, 658–663 (2006). [DOI] [PubMed] [Google Scholar]
- 26.Kherb J., Flores S. C., Cremer P. S., Role of Carboxylate Side Chains in the Cation Hofmeister Series. J. Phys. Chem. B 116, 7389–7397 (2012). [DOI] [PubMed] [Google Scholar]
- 27.Greenwald J., Kwiatkowski W., Riek R., Peptide Amyloids in the Origin of Life. Journal of Molecular Biology 430, 3735–3750 (2018). [DOI] [PubMed] [Google Scholar]
- 28.Frenkel-Pinter M., Samanta M., Ashkenasy G., Leman L. J., Prebiotic Peptides: Molecular Hubs in the Origin of Life. Chem. Rev. 120, 4707–4765 (2020). [DOI] [PubMed] [Google Scholar]
- 29.Malkov S. N., Živković M. V., Beljanski M. V., Hall M. B., Zarić S. D., A reexamination of the propensities of amino acids towards a particular secondary structure: classification of amino acids based on their chemical structure. J Mol Model 14, 769–775 (2008). [DOI] [PubMed] [Google Scholar]
- 30.Pearson R. G., Hard and Soft Acids and Bases. J. Am. Chem. Soc. 85, 3533–3539 (1963). [Google Scholar]
- 31.Dokmanić I., Šikić M., Tomić S., Metals in proteins: correlation between the metal-ion type, coordination number and the amino-acid residues involved in the coordination. Acta Cryst D 64, 257–263 (2008). [DOI] [PubMed] [Google Scholar]
- 32.Lin Z., et al. , Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023). [DOI] [PubMed] [Google Scholar]
- 33.Jing X., Wu F., Luo X., Xu J., Single-sequence protein structure prediction by integrating protein language models. Proceedings of the National Academy of Sciences 121, e2308788121 (2024). [Google Scholar]
- 34.Joosten R. P., et al. , A series of PDB related databases for everyday needs. Nucleic Acids Research 39, D411–D419 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Abraham M. J., et al. , GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1–2, 19–25 (2015). [Google Scholar]
- 36.Heffernan R., et al. , Single-sequence-based prediction of protein secondary structures and solvent accessibility by deep whole-sequence learning. Journal of Computational Chemistry 39, 2210–2216 (2018). [DOI] [PubMed] [Google Scholar]
- 37.Cuff J. A., Barton G. J., Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins: Structure, Function, and Bioinformatics 40, 502–511 (2000). [Google Scholar]
- 38.Garnier J., Osguthorpe D. J., Robson B., Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. Journal of Molecular Biology 120, 97–120 (1978). [DOI] [PubMed] [Google Scholar]
- 39.Cagiada M., Ovchinnikov S., Lindorff-Larsen K., Predicting absolute protein folding stability using generative models. Protein Science 34, e5233 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Sanchez-Rocha A. C., Makarov M., Pravda L., Novotný M., Hlouchová K., Coenzyme-Protein Interactions since Early Life. eLife 13 (2024). [Google Scholar]
- 41.Kim Y. E., Hipp M. S., Bracher A., Hayer-Hartl M., Hartl F. U., Molecular Chaperone Functions in Protein Folding and Proteostasis. Annual Review of Biochemistry 82, 323–355 (2013). [Google Scholar]
- 42.Hendricks M. P., Sato K., Palmer L. C., Stupp S. I., Supramolecular Assembly of Peptide Amphiphiles. Acc. Chem. Res. 50, 2440–2448 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Greenwald J., Friedmann M. P., Riek R., Amyloid Aggregates Arise from Amino Acid Condensations under Prebiotic Conditions. Angewandte Chemie International Edition 55, 11609–11613 (2016). [DOI] [PubMed] [Google Scholar]
- 44.Rout S. K., Friedmann M. P., Riek R., Greenwald J., A prebiotic template-directed peptide synthesis based on amyloids. Nat Commun 9, 234 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Maury C. P. J., Amyloid and the origin of life: self-replicating catalytic amyloids as prebiotic informational and protometabolic entities. Cell. Mol. Life Sci. 75, 1499–1507 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Wieczorek R., Adamala K., Gasperi T., Polticelli F., Stano P., Small and Random Peptides: An Unexplored Reservoir of Potentially Functional Primitive Organocatalysts. The Case of Seryl-Histidine. Life (Basel) 7, 19 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Rufo C. M., et al. , Short peptides self-assemble to produce catalytic amyloids. Nature Chem 6, 303–309 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Edri R., et al. , From Polymerization-Enabled Folding and Assembly to Chemical Evolution: Key Processes for Emergence of Functional Polymers in the Origin of Life. Astrobiology (2025). [Google Scholar]
- 49.Bordin N., Sillitoe I., Lees J. G., Orengo C., Tracing Evolution Through Protein Structures: Nature Captured in a Few Thousand Folds. Front. Mol. Biosci. 8 (2021). [Google Scholar]
- 50.Udenfriend S., et al. , Fluorescamine: A Reagent for Assay of Amino Acids, Peptides, Proteins, and Primary Amines in the Picomole Range. Science 178, 871–872 (1972). [DOI] [PubMed] [Google Scholar]
- 51.Young G., et al. , Quantitative mass imaging of single biological macromolecules. Science 360, 423–427 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
To facilitate resynthesis, all product codes and loadings for Fmoc-protected amino acids used for solid phase peptide synthesis are reported as Supplemental Data S1. To facilitate reanalysis, raw CD spectral data, solubility data, and amino acid analysis are supplied as Supplemental Data S2. Computational results including predicted radius of gyration and secondary structure for 50,000 random sequences for 5 alphabets are supplied in Supplemental Data S3.





