Author manuscript; available in PMC: 2025 Nov 14.
Published in final edited form as: Chem. 2024 Sep 6;10(11):3444–3458. doi: 10.1016/j.chempr.2024.07.025

A High-Throughput Workflow to Analyze Sequence-Conformation Relationships and Explore Hydrophobic Patterning in Disordered Peptoids

Erin C Day 1, Supraja S Chittari 1, Keila C Cunha 3, Roy J Zhao 3, James N Dodds 1, Delaney C Davis 1, Erin S Baker 1, Rebecca B Berlow 2, Joan-Emma Shea 3, Rishikesh U Kulkarni 4, Abigail S Knight 1,5,*
PMCID: PMC11580747  NIHMSID: NIHMS2022358  PMID: 39582487

SUMMARY

Understanding how a macromolecule’s primary sequence governs its conformational landscape is crucial for elucidating its function, yet these design principles are still emerging for macromolecules with intrinsic disorder. Herein, we introduce a high-throughput workflow that implements a practical colorimetric conformational assay, introduces a semi-automated sequencing protocol using MALDI-MS/MS, and develops a generalizable sequence-structure algorithm. Using a model system of 20mer peptidomimetics containing polar glycine and hydrophobic N-butylglycine residues, we identified nine classifications of conformational disorder and isolated 122 unique sequences across varied compositions and conformations. Conformational distributions of three compositionally identical library sequences were corroborated through atomistic simulations and ion mobility spectrometry coupled with liquid chromatography. A data-driven strategy was developed using existing sequence variables and data-derived ‘motifs’ to inform a machine learning algorithm towards conformation prediction. This multifaceted approach enhances our understanding of sequence-conformation relationships and offers a powerful tool for accelerating the discovery of materials with conformational control.

Keywords: peptoids, intrinsic disorder, high-throughput characterization, oligomer sequencing, data-driven analysis


eTOC Blurb

In this article, Day et al. develop a workflow for the high-throughput characterization of disordered oligomers for data-driven sequence analysis. Screening an immobilized library of peptoids varied in hydrophobic composition and patterning with a solvatochromic dye, they identify >100 sequences across several classifications of degree of disorder to fuel their data-driven model. The development of this versatile platform provides unique insights into sequence-structure relationships, especially on the challenging frontier of small, disordered materials.

INTRODUCTION

As originally proposed by Anfinsen in an initial hypothesis on protein folding, the conformation and function of a protein are governed by its primary sequence.1,2 While a comprehensive framework exists for describing interactions in crystalline protein domains, the guidelines for tailoring the conformation of macromolecules with intrinsic disorder, and thus dynamic interactions, are still emerging. Notably, up to 44% of the human proteome contains disordered segments exceeding 30 residues in length, and these proteins play pivotal roles in cellular signaling and disease progression.3 Additionally, extending beyond canonical proteins, peptidomimetics and synthetic copolymers are predominantly disordered macromolecules but are attractive as therapeutic and delivery agents due to their protease resistance and broad chemical diversity. Thus, a fundamental understanding of sequence-conformation relationships across the spectrum of disordered materials is imperative for the development of therapies targeting diseases involving disordered proteins and for the advancement of next-generation biotechnologies.

Recent efforts have elucidated the role of charged residues, hydrophobicity, and monomer patterning on macromolecule conformation and function through limited systematic studies.4–6 For example, for polyelectrolytes, the number of blocks7,8 (i.e., blockiness) has proven effective in elucidating structure-function relationships in charge-driven adhesion. A similar parameter was shown to correlate with simulated polymers with covalent crosslinks.9 Similarly, the quantitative sequence descriptors κ (kappa)10,11 and sequence charge decoration (SCD)12–16 have been developed to describe charge-driven compaction in intrinsically disordered proteins (IDPs).10–17 For hydrophobic collapse, which is critical to compaction and folding of proteins3,18–21 and synthetic polymers,22–24 sequence hydropathy decoration (SHD) was shown to be successful in modeling the compaction of IDPs.17 However, studies have yet to examine how and when these parameters can be extrapolated beyond their original contexts. Beyond development of these intuitive descriptors, machine learning and data-driven approaches have also facilitated the elucidation of sequence-structure-function relationships for well-defined protein structures25 and simulations of disordered proteins and polyelectrolytes.26–30 However, experimental characterization, often resource-intensive and reliant on specialized equipment and methods, allows for the study of only a limited number of materials at a time (typically <20; Figure 1a) and further constricts the ability to directly apply feature learning strategies that have been successful in adjacent fields. Therefore, predicting sequence-structure relationships in disordered materials, either by adapting empirical descriptors31 or by developing data-driven strategies, is a current frontier.32,33 A high-throughput approach to conformational characterization would enable further elucidation of sequence-structure relationships in these systems.34

Figure 1. Overview of strategies for macromolecule sequence analysis.

(a) Schematics of existing sequence-conformation analyses including systematic and low-throughput characterization (top) and iterative optimization of a single sequence (bottom). (b) Schematic of the high-throughput workflow for the analysis of a sequence space developed herein that incorporates a one-bead one-compound library, a colorimetric assay, and motif analysis.

Motivated by this gap, we have developed a workflow for characterization of disordered macromolecules consisting of a hydrophobic monomer (H) and a polar monomer (P), a model HP system.35–38 This workflow leverages our accessible colorimetric conformational assay39 and is validated through atomistic modeling and ion mobility spectrometry (IMS). HP polymers have been studied computationally, and iterative optimization can yield a hydrophobic pattern that leads to a compact disordered structure via surface recoloring (i.e., KK patterning by Khalatur and Khokhlov, Figure 1a).36,40,41 Our rapid approach, which uses the solvatochromic probe Reichardt’s dye to characterize immobilized sequences, is unique in yielding both sequences with desired conformations and statistical modeling approaches to describe sequence-structure relationships (Figure 1b). The development of the high-throughput workflow includes: 1) establishing a sequencing protocol with PeptoidSeq, a script that expedites sequence analysis, and matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) to evaluate over 100 sequences, 2) creating an image analysis protocol, ColorClassify,42 to analyze over 1000 library members, describing the conformational propensities of a disordered macromolecule library, and 3) developing a data-driven approach to elucidate sequence-conformation relationships, MotifFold.43 Ultimately, this high-throughput analysis of sequence-dependent conformations exemplifies a workflow to uncover design principles connecting sequence and conformation. These relationships will facilitate the identification of connections between conformation and function for disordered macromolecules, playing a critical role in the development of next-generation therapeutics and biomimetic materials.

RESULTS AND DISCUSSION

A workflow for sequence identification of 20mer peptoids

We investigated HP peptoids through synthesis and characterization of a 20mer library with a binary monomer scope, glycine (gly) and N-butylglycine (Nbu) (Figure 2a). This builds on previous work, which described characterization of four model sequences of the same composition.39 Peptoids are an ideal model system due to their modularity, sequence definition, and conformational flexibility, facilitating access to a broad range of disordered conformational ensembles.44,45 Further, they have been used in high-throughput one-bead one-compound (OBOC) libraries46–48 and to study the impact of sequence on solution conformation.39,41

Figure 2. A workflow for sequence identification of 20mer peptoids.

(a) Schematic of a peptoid with the charged hydrophilic linker immobilized on PEGA resin. (b) A representative intact MALDI-TOF MS spectrum (left) annotated with the expected mass and a dissociation MALDI-MS/MS spectrum with the 20 necessary y-ions observed (right). Blue and purple labels show glycine terminal and N-butylglycine terminal fragments, respectively. (c) Schematic of the workflow for the rapid identification of unknown 20mer peptoids.

Peptoids can be sequenced with tandem mass spectrometry, which leverages the dissociation of amide bonds to generate fragment ions;49 the uncharged nature of our sequences required the design of a charged linker to increase ionization. The presence of tertiary amides within the peptoid backbone results in a higher abundance of y-ions compared to other fragments.49 Therefore, we designed a C-terminal linker to further amplify the detection of y-ions generated during collision-induced dissociation with MALDI-MS/MS analyses (Figure 2a). We attached positively charged lysine residues, protected by hydrazine-labile protecting groups (ivDde), to a poly(ethylene glycol)-crosslinked poly(acrylamide) (PEGA) resin,50 as this orthogonal protecting group allows for side-chain deprotection independently from the cleavage of the Rink amide resin (Figures S1–3).51 Additionally, we incorporated two hydrophilic oligo(ethylene glycol) spacer units52 to increase the mass of the ions of interest above 500 Da, isolating desired peaks from those generated by the MALDI matrix (Figure 2b). Traditional solid-phase synthesis coupling methods were implemented to add 20mer peptoids, including both peptoid submonomer synthesis53,54 (Nbu) and Fmoc-protected amino acid coupling55 (gly). Following acid cleavage from the PEGA resin, the linker enabled the observation of all twenty y-ions necessary for identifying the peptoid sequence.

Previous work had demonstrated the efficacy of solvatochromic probes in investigating the microenvironments formed by macromolecules,41,56–58 with Reichardt’s dye being particularly notable for facilitating the rapid conformational analysis of immobilized peptoids.39 Reichardt’s dye undergoes a colorimetric shift in response to environmental changes, where darker hues correspond to a more hydrophobic environment.56,59 The color observed following the incubation of immobilized disordered peptoids with Reichardt’s dye has been found to correlate with macromolecule conformation and hydrophobic microenvironment.39 To ensure that the self-consistent qualitative trends of the colorimetric assay were not impacted by the inclusion of the linker, four previously characterized peptoids (i.e., a diblock, an alternating, and two randomized sequences) were synthesized on the linker and exposed to the conformational assay with Reichardt’s dye (Figure S4). After deprotection of the lysine residues with hydrazine (5% in DMF, 6 × 30 min), the resin was incubated with Reichardt’s dye (150 μM in 100 mM HEPES, pH 7.8). The peptoids with the additional C-terminal linker were less saturated with dye but displayed the same qualitative trends across sequences (i.e., alternating was the darkest and diblock was the palest; Figure S5).

To determine the sequence of an unknown 20mer, we first manually isolated single beads of PEGA resin, each containing many copies of a single sequence, under an optical microscope. Subsequently, the peptoid was cleaved from the resin using trifluoroacetic acid, and the resulting material was dried, dissolved in MALDI matrix, and applied to a MALDI plate (Figure 2c). MALDI-MS was used to identify the m/z of the 20mer, and therefore its composition; then MALDI-MS/MS was used to identify the sequence.

To increase the throughput of sequence identification after obtaining spectra, we developed a semi-automated tool: PeptoidSeq. This tool searches for one of two monomers (e.g., Nbu or gly) at each position and assigns residues sequentially.60 To address any ambiguity where a specific position is challenging to assign due to low signal or the presence of peaks of both possible fragment masses, users can provide input, and the tool can generate multiple candidate sequences. This analysis pipeline efficiently identified over 100 unknown 20mers (Figure 2c), performing five times faster than manual identification.
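The sequential assignment described above can be illustrated with a minimal Python sketch. This is not the PeptoidSeq code: the function name `sequence_candidates`, the peak-list input, the starting linker m/z, and the 0.3 Da tolerance are all assumptions made for illustration. It walks the y-ion ladder outward from the C-terminal linker, looking at each step for a peak one glycine or one N-butylglycine residue mass higher, and keeps every branch that is supported by a peak so that ambiguous positions yield multiple candidates.

```python
# Monoisotopic residue masses in Da (glycine: C2H3NO; N-butylglycine: C6H11NO).
GLY = 57.02146
NBU = 113.08406

def sequence_candidates(peaks, start_mz, length, tol=0.3):
    """Return candidate sequences (N- to C-terminus, G = gly, B = Nbu).

    peaks    : observed singly charged y-ion m/z values (hypothetical input)
    start_mz : m/z of the linker-only fragment that anchors the ladder
    length   : number of residues to assign (20 for the library discussed here)
    tol      : matching tolerance in Da (an assumed value)
    """
    candidates = [("", start_mz)]
    for _ in range(length):
        extended = []
        for calls, mz in candidates:
            for letter, mass in (("G", GLY), ("B", NBU)):
                target = mz + mass
                # Keep this branch only if a peak supports the next y-ion.
                if any(abs(p - target) <= tol for p in peaks):
                    extended.append((calls + letter, target))
        candidates = extended
    # y-ions grow from the C-terminus, so reverse to report N-to-C order.
    return [calls[::-1] for calls, _ in candidates]

# Usage with a synthetic 4-residue ladder built on a made-up linker mass.
ladder = [657.02146, 714.04292, 827.12698, 940.21104]
print(sequence_candidates(ladder, 600.0, 4))
```

An empty list means no complete ladder was found within tolerance, which is where user input would be needed in a semi-automated setting.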

Conformational analysis of a model peptoid library with Reichardt’s dye

To establish design principles connecting macromolecule sequence to the compactness of the disordered conformational ensemble, we employed split-and-pool synthesis to create a one-bead one-compound (OBOC) library of over a million (2^20) unique sequences containing either gly or Nbu at each of the 20 positions. This split-and-pool technique employs minimal synthetic steps to create a large, immobilized library where each bead contains many copies of a single sequence.61–63 Upon incubating the library with Reichardt’s dye, a wide range of color intensities was observed, suggesting a variety of hydrophobic microenvironments and conformational ensembles were generated by the 20mer peptoids (Figures 3a and S6).39,56,64 While the library presented as a broad continuous spectrum of saturation, we sorted a subset of the library (177 beads of PEGA resin) into nine color groups based on their visual similarity using an optical microscope to facilitate identification of trends between sequence and conformation (Figures 3b and S7). Image analysis of the beads within each of the nine color groups revealed a decrease in the scalar magnitude of a 3D vector composed of R, G, and B values with an increase in color intensity (Figure S8 and Table S1). Each of the color categories had multiple library members, suggesting each conformational classification can be accessed by many distinct sequences (Figure 3c).
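The scalar magnitude mentioned above can be sketched as the Euclidean norm of a bead's mean (R, G, B) vector. The helper name `rgb_magnitude` and the idea of averaging over a list of pixel tuples are assumptions for illustration, not the published image-analysis protocol; the pixel values in the usage lines are invented.

```python
from math import sqrt

def rgb_magnitude(pixels):
    """Euclidean norm of the mean (R, G, B) vector of a bead's pixels.

    pixels: iterable of (R, G, B) tuples, e.g. 0-255 values (hypothetical input).
    A darker, more intensely stained bead gives a smaller magnitude.
    """
    pixels = list(pixels)
    n = len(pixels)
    mean = [sum(p[c] for p in pixels) / n for c in range(3)]
    return sqrt(sum(v * v for v in mean))

# Usage: a pale bead versus a dark bead (made-up pixel values).
pale = rgb_magnitude([(200, 200, 200), (210, 205, 198)])
dark = rgb_magnitude([(50, 40, 60), (55, 42, 58)])
print(pale > dark)
```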

Figure 3. Conformational analysis of a model peptoid library with Reichardt’s dye.

(a) A representative photograph of the library of 20mer peptoids after incubation with Reichardt’s dye. (b) Representative images of the nine manually sorted color groups. (c) The population distributions of library members quantified via manual sorting and an image analysis tool.

Although 177 beads capture a substantial subset of macromolecules, this constitutes less than 0.02 percent of the entire sequence design space. To encompass a broader sample, we developed a machine-learning image analysis tool (ColorClassify).42 Using images from the nine color groups for training, the tool employed a gradient boosting classifier, and a model accuracy of 79% was calculated through k-fold cross-validation. Due to a limited population proportion in the two darkest color groups, those data were combined into one group for the automated analysis. This model was then used to classify an image containing 1404 beads of unknown colors (Figures S6 and S9). The distribution of color intensities exhibited similar trends to the manually sorted sample, especially for the most saturated color groups (Figure 3c). Discrepancies in the palest beads likely arise from the increased relative impact of image imperfections (e.g., particulates and shadows) on the low saturation samples, leading to an overpopulation of group 3 and underpopulation of groups 1 and 2. The similarity between these two image populations demonstrates that the smaller library was sampled in an unbiased manner and is representative of the larger conformational space.

Following manual colorimetric sorting with an optical microscope, we isolated, cleaved, and sequenced individual beads, resulting in 122 peptoid sequences across the compositional library space (Figures 4a and S10–15, Tables S2–S6). Peptoids composed of 5–15 Nbu residues constitute 98.8% of the 20mer library space, and we find those compositions are accordingly represented (Figure S16). As anticipated, there is a clear positive correlation between the number of hydrophobic residues and the dye color intensity, attributed to the inherent increase in microenvironment hydrophobicity with a greater fraction of hydrophobic residues. However, we also observed a range of colors within peptoids of the same composition, suggesting differences in conformation and hydrophobic microenvironments are available to different compositionally identical sequences.39 For example, at 50 mol% hydrophobicity, peptoids were identified in 5 of the 9 different color groups (Figure 4a left, column with 10 hydrophobic residues), underscoring the significance of sequence in understanding the conformations of disordered materials.
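The 98.8% figure follows from counting binary sequences by composition: the number of 20mers with exactly k Nbu residues is the binomial coefficient C(20, k), out of 2^20 total sequences. A short Python check is given below; the function name `composition_fraction` is chosen here for illustration.

```python
from math import comb

def composition_fraction(n_positions, k_min, k_max):
    """Fraction of all binary sequences of length n_positions that contain
    between k_min and k_max (inclusive) hydrophobic residues."""
    total = 2 ** n_positions
    in_range = sum(comb(n_positions, k) for k in range(k_min, k_max + 1))
    return in_range / total

# Share of the 20mer library with 5-15 Nbu residues, as a percentage.
print(round(100 * composition_fraction(20, 5, 15), 1))  # 98.8

# Number of distinct sequences at 50 mol% hydrophobicity (10 of 20 positions).
print(comb(20, 10))  # 184756
```

The second number shows how large the sequence space remains even at a single fixed composition.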

Figure 4. Quantitative conformational analysis of three library members.

(a) Heatmap of the compositional space of the library that was sampled (n = 122). Composition is compared to the color of the bead with Reichardt’s dye, and the three library members selected for further analysis are indicated. Schematics of the three sequences, Lib-A-C, are shown on the right, with a light background to highlight the absence of conformations with those measurements. (b) Probability distributions comparing end-to-end distances and Rg(bkb) of Lib-A-C and representative snapshots from the most populated conformational clusters. (c) Probability distributions from atomistic modeling simulations include radius of gyration of the peptoid backbone (Rg (bkb)) and intramolecular hydrogen bonds for Lib-A-C. (d) A plot of reversed-phase liquid chromatography retention time and collisional cross section from ion mobility spectrometry. Error bars represent standard deviation of multiple runs (n = 5).

Quantitative conformational analysis of three library members

We selected three library members with 50 mol% hydrophobic composition to analyze with lower-throughput techniques to generate quantitative conformational descriptors from disparate colorimetric classifications. This composition aimed to target macromolecules with conformational variability that would not aggregate in solution, facilitating further characterization. The three sequences selected for this analysis were: Lib-A (from the darkest color bin), Lib-C (from the lightest color bin), and Lib-B (from an intermediate color bin) (Figures 4a and S17). We conducted atomistic modeling to identify quantitative conformational descriptors to correlate with our experimental trends. Replica exchange molecular dynamics (MD) was performed for 600 ns using a force field adapted from GROMOS 53A6, specifically tailored for peptoid conformations (Figure 4b).65 Simulations of each peptoid as a single chain in bulk water revealed distributions of radii of gyration, end-to-end distances, numbers of hydrogen bonds, solvent accessible surface area, and site-specific monomer distances across 50,000 conformations (the last 500 ns analyzed) for each sequence (Figures 4b and S18–26, Tables S7–9). Corroborating our experimental observations, Lib-A exhibited the smallest average radius of gyration of the peptoid backbone (i.e., excluding side-chains, Rg (bkb)) of 0.62 nm and the smallest average end-to-end distance (Ree = 1.0 nm). Heatmaps comparing these parameters for Lib-A-C show that the full distribution of conformations for Lib-A is tightly clustered around more compact structures, while Lib-B displays intermediate characteristics, and Lib-C has the broadest distribution with the largest Rg (bkb) (0.65 nm and 0.64 nm for Lib-C and Lib-B, respectively) and the most extended Ree (1.4 nm and 1.2 nm for Lib-C and Lib-B, respectively). Further, when clustering conformations using a 1.4 Å cut-off,66 Lib-A exhibited the fewest clusters (5781), suggesting self-similar conformations. In contrast, Lib-B had 10599 clusters and Lib-C had 11030 clusters (Figure 4b). As these values are impacted by both the clustering method selected and the RMSD cut-off threshold, we implemented two additional clustering methods and an additional cut-off (1.0 Å). Consistent across these analyses, Lib-A had the fewest clusters (i.e., most similar conformations), and Lib-C had the most clusters (i.e., highest degree of disorder; Tables S9–10).
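For readers less familiar with the two size descriptors reported here, the following Python sketch computes a radius of gyration and an end-to-end distance for a single conformation. It assumes equal weighting of all backbone atoms (no mass weighting) and takes plain coordinate tuples as input; both choices are simplifications for illustration and may differ from how the simulation analysis was carried out.

```python
from math import dist, sqrt

def radius_of_gyration(coords):
    """Root-mean-square distance of points from their centroid.

    coords: list of (x, y, z) backbone positions in nm (hypothetical input).
    All atoms are weighted equally, which is an assumption of this sketch.
    """
    n = len(coords)
    center = [sum(c[i] for c in coords) / n for i in range(3)]
    return sqrt(sum(dist(c, center) ** 2 for c in coords) / n)

def end_to_end(coords):
    """Distance between the first and last backbone positions."""
    return dist(coords[0], coords[-1])

# Usage on a toy, fully extended three-point chain (made-up coordinates).
chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(radius_of_gyration(chain), end_to_end(chain))
```

Averaging these two quantities over all sampled conformations of a sequence gives ensemble values of the kind compared for Lib-A, Lib-B, and Lib-C.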

Additional interactions measured included intramolecular hydrogen bonds, hydrogen bonds formed with water, and solvent accessible area. Lib-A had the highest average number of intramolecular hydrogen bonds (3.74) compared to Lib-B and Lib-C (3.04 and 2.89, respectively; Figure 4c and Table S8). Correspondingly, Lib-B and Lib-C had a greater average number of hydrogen bonds with water (23.0 and 23.9, respectively) than the more compact Lib-A (21.0). Lib-A also had the smallest average solvent-accessible surface area (SASA) of 16.5 nm² (Figure S23). Lib-B was slightly larger (16.9 nm²), followed by Lib-C with an SASA of 17.0 nm². Across all metrics, Lib-A displayed the smallest and least disordered conformational ensemble of the three selected library sequences, followed by Lib-B, and Lib-C exhibits the most extended and disordered conformational ensemble. These findings, along with previous reports of compositionally identical sequences, suggest that macromolecules composed of Nbu and gly with less conformational disorder have both smaller radii of gyration and end-to-end distances. Notably, despite having the fewest hydrophobic blocks (3) among the three library sequences, Lib-B displays an intermediate Rg (bkb). This underscores that sequence parameters, beyond intuitive descriptors, significantly influence conformational ensembles and emphasizes the need for further exploration of sequence-conformation relationships in disordered materials.

To further probe the conformational ensembles of these peptoids experimentally, we employed ion mobility spectrometry coupled with liquid chromatography and mass spectrometry (LC-IMS-MS) and diffusion ordered spectroscopy NMR (DOSY) of synthesized and purified Lib-A, Lib-B, and Lib-C (Figure S27). IMS has previously distinguished conformations of disordered polymer architectures67 and compositionally identical oligomeric peptoids. Further, the combination of experimental and computational techniques has demonstrated that the solution-phase conformation difference can be captured with this gas-phase technique.69,70 Through analysis of the three library sequences via IMS (Figure S28), we observed that Lib-A and Lib-B had similar collisional cross section (CCS) values within experimental error (461.6 ± 0.4 and 462.0 ± 0.3 Å², respectively, Figure 4d). However, Lib-C displayed a larger CCS of 463.1 ± 0.3 Å², indicative of a less compact conformation. Reversed-phase LC has previously been shown to differentiate linear and branched small molecules69 and peptoid 20mers;39 however, Lib-A and Lib-B also displayed similar retention times of 9.56 ± 0.03 min and 9.55 ± 0.02 min, respectively. Lib-C eluted later, at 9.68 ± 0.02 min, aligning with its larger CCS. This further confirms our conclusion from the library and simulations that Lib-C is the least compact of the three library sequences. Lib-B had the largest hydrophobic block, which could have a larger structural impact on these analyses and explain why Lib-A and Lib-B appear more similar here than in the simulated descriptive radii.

Interestingly, Lib-A was more compact than an alternating 20mer;39 however, in contrast to the alternating sequence, Lib-A did not aggregate at high solution concentrations (10 mg/mL; Figure S29). Although DOSY NMR did not yield different coefficients for Lib-A-C (Figures S30–31, Table S11), the coefficients do suggest negligible aggregation. A comprehensive suite of techniques is essential to unveil a complete description of a conformational ensemble including parameters such as relevant radii and aggregation propensity.39 Nonetheless, this high-throughput workflow rapidly highlights the accessible conformational diversity within a simple two-component library. Thus, we have demonstrated that the integration of an OBOC library with our convenient colorimetric assay can accelerate our understanding of conformational differences generated by variations in hydrophobic patterning and composition.

Data-driven sequence analysis

While macromolecules are often distinguished by differences in molecular weight and composition, limited quantitative methods exist to describe multiple sequences of the same monomer composition. Using the 122 peptoid sequences identified from the library and colorimetric output, we sought to correlate sequence with conformation with the two original darkest color groups combined due to limited relative abundance in the population. For crystalline proteins, extensive experimental databases have allowed machine learning algorithms such as AlphaFold25 to be successful in representation learning of sequence and chemical information. However, for non-canonical structures (e.g., peptidomimetics), data-driven representation learning strategies27,31 to embed critical sequence features are currently limited by a lack of substantial experimental datasets or reliance on coarse-grained simulation data. A generalizable, data-driven strategy capable of quantifying sequence-structure – and eventually sequence-function – relationships in soft materials beyond natural proteins remains elusive.

The limited chemical scope of our peptoid library and intuitive relationships between composition and conformation offer a unique opportunity to develop and validate a data-driven workflow for non-native backbones. Towards this, we here demonstrate a data-driven workflow (MotifFold), in which peptoid sequences are divided into motifs, similar to how a sentence is divided into words (Figure 5a).26,71–73 Motifs were defined as segments of length n residues, and a complete set of 2^n unique motifs was generated to describe the library members. Each of the 122 sequences was partitioned into 20 − (n − 1) motifs, with each motif position assigned to one of the 2^n motifs. To generate model inputs, we represented this motif information through a frequency embedding strategy, a commonly used encoding strategy where a feature vector is composed of the frequency of occurrences of each motif within each sequence.72–77 A gradient boosting regression algorithm was selected to preserve relationships between the color classes and capture nonlinearities within the data,78,79 although we also evaluated the performance of a gradient boosting classifier as an alternative strategy. Each color class was represented by a single value calculated from principal component analysis (PCA). The PCA inputs consisted of averaged RGB values for each color class, as images of individual beads corresponding to the sequences were not captured to improve characterization throughput. This PCA output explained >98% of the variance in the dataset. We screened input features generated from motif lengths n = 2–9 and found that n = 4 demonstrated the best model performance as quantified by RMSE for a regressor (Table S12) and log-loss for a classifier (Table S13). The motif embedding uniquely enables further investigation of the relative importance of feature inputs by calculating Shapley additive explanations (SHAP values) (Figure 5b). Interestingly, we find that the highest-scoring motifs, arranged closer to the top of the plot, are not solely grouped by specific local hydropathy values. For example, we find that “GGGG,” “BGBB,” and “BBBB” all appear as motifs with high importance, suggesting that both composition and residue sequence inform the colorimetric output of the assay.
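The frequency embedding described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MotifFold implementation: the function names are hypothetical, and sequences are written with the one-letter codes G (gly) and B (Nbu) used for the motifs in the text. A window of length n slides one residue at a time, giving 20 − (n − 1) = 17 overlapping motifs per 20mer when n = 4, and the feature vector counts how often each of the 2^n = 16 possible motifs occurs.

```python
from itertools import product

def motif_vocabulary(n, alphabet="GB"):
    """All len(alphabet)**n possible motifs of length n, in a fixed order."""
    return ["".join(p) for p in product(alphabet, repeat=n)]

def frequency_embedding(seq, n=4, alphabet="GB"):
    """Count occurrences of every length-n motif in seq (stride of one residue).

    Returns a list aligned with motif_vocabulary(n, alphabet).
    """
    vocab = motif_vocabulary(n, alphabet)
    counts = dict.fromkeys(vocab, 0)
    for i in range(len(seq) - n + 1):
        counts[seq[i:i + n]] += 1
    return [counts[m] for m in vocab]

# Usage: an alternating 20mer contains only the motifs GBGB and BGBG.
vocab = motif_vocabulary(4)
features = frequency_embedding("GB" * 10)
print({m: c for m, c in zip(vocab, features) if c})
```

Because only counts are kept, two sequences with the same motifs in a different order map to the same vector, which is the loss of positional information noted for this embedding.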

Figure 5. Data-driven sequence analysis.

(a) (left) Data-driven “motif” extraction strategy by parsing individual sequences. (right) Established methods for quantitative analysis of binary sequences by composition, blockiness (i.e., number of blocks, β9), and monomer patterning (i.e., sequence hydropathy decoration, SHD17). (b) SHAP analysis of motif features in a feature embedding strategy. Color indicates feature importance. (c) Goodness-of-fit plot for a model fit to frequency embedded motif features as well as existing descriptors hydrophobicity, blockiness, and SHD.

While the frequency embedding leveraged in this analysis facilitates ease of interpretability, this strategy does not preserve positioning information of each motif. To probe alternative embedding strategies, we performed James-Stein and one-hot encoding to transform motif information into feature inputs while embedding motif positioning information (Table S14).74,80 James-Stein encoding, a type of target-encoding, excels at capturing important patterns within the dataset and handling high-cardinality features, while one-hot encoding is a more interpretable strategy. However, either strategy can be non-robust and prone to overfitting in different ways. In either case, evaluation of the model performance reveals moderate predictiveness, which encourages the implementation across diverse datasets including canonical and noncanonical backbones. However, we hesitate to evaluate the effectiveness of these strategies in preserving motif positioning effects with the limited dataset.

To further improve the performance of the model, we sought to leverage existing descriptors of sequence composition and patterning as additional features. A blockiness parameter (β) has been defined across a compositional landscape and validated on coarse-grained crosslinking simulations of polymers9 (Figure 5a, right). We identified existing metrics of residue patterning developed for disordered proteins such as sequence hydropathy decoration (SHD),17 sequence charge decoration (SCD),12–16 and kappa (κ).12–16 We selected metrics that intuitively applied to our system: hydrophobicity as a global composition metric in addition to blockiness9 and SHD17 to quantitate hydrophobic patterning (Table S15), and we benchmarked model performance with comparisons against these known and previously validated descriptors. We observed an improvement in model performance as compared to motif features alone (RMSE = 0.623) (Figures 5c and S32). This analysis confirmed that, while hydrophobicity was one of the most important features according to chemical intuition, it was outranked by SHD, underscoring the importance of patterning-specific features in addition to the potential to adapt metrics originally developed for canonical backbones. Further, the improvement of the RMSE with the addition of motif features also emphasizes the importance of data-driven feature generation.
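As a rough illustration of the patterning descriptors discussed above, the Python sketch below counts blocks (runs of identical residues) and evaluates a hydropathy decoration of the commonly cited form SHD = (1/N) Σ_{i<j} (λ_i + λ_j) |j − i|^(−1). Several things here are assumptions: the simple run count stands in for the blockiness parameter β, whose exact published definition is not reproduced; the binary hydropathy values (1 for Nbu, 0 for gly) are placeholders rather than the values used in the study; and the function names are hypothetical.

```python
from itertools import groupby

def count_blocks(seq, residue=None):
    """Number of runs of identical residues in seq.

    If residue is given, count only the runs made of that residue
    (e.g. "B" for hydrophobic blocks).
    """
    runs = [key for key, _ in groupby(seq)]
    if residue is None:
        return len(runs)
    return sum(1 for key in runs if key == residue)

def shd(seq, hydropathy, exponent=-1.0):
    """Sequence hydropathy decoration in its commonly cited form:
    (1/N) * sum over pairs i < j of (lambda_i + lambda_j) * |j - i|**exponent.
    """
    n = len(seq)
    lam = [hydropathy[r] for r in seq]
    total = sum(
        (lam[i] + lam[j]) * (j - i) ** exponent
        for i in range(n)
        for j in range(i + 1, n)
    )
    return total / n

# Usage with placeholder hydropathies on two sequences of equal composition.
scale = {"B": 1.0, "G": 0.0}
for s in ("BBBBBBBBBBGGGGGGGGGG", "BG" * 10):
    print(s, count_blocks(s), round(shd(s, scale), 3))
```

Comparing a diblock with an alternating sequence of the same composition shows how such descriptors separate patterning from overall hydrophobicity.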

Here, we contribute a data-driven workflow for elucidating sequence-structure relationships in disordered peptoids. We developed a motif featurization strategy through a reduced model system that interrogates the role of sequence in hydrophobic collapse. Further improvement was demonstrated by harnessing chemical intuition and existing descriptors developed for disordered proteins. We anticipate the motifs will be uniquely valuable descriptors in systems where it is challenging to identify which physical interactions are dictating the desired property, whether solution conformation or a target function (e.g., ligand binding). Further, the motif embedding uniquely enables the identification of key motifs that can guide rational design.

Conclusions

As macromolecule sequence, structure, and function are intrinsically connected, it is critical to identify design principles that connect sequence to disordered conformation in order to develop biomimetic materials and macromolecule therapeutics that address modern societal challenges. Using a practical colorimetric assay, we revealed a spectrum of color intensities and corresponding conformations within a two-component one-bead one-compound library composed of glycine and N-butylglycine. The distribution of conformational ensembles was analyzed via manual sorting using an optical microscope and automated image classification, ColorClassify. Using a sequencing workflow expedited by a script, PeptoidSeq, we identified 122 sequences with varying degrees of disorder and compactness. Further insights were achieved by selecting three library sequences with identical compositions for analysis with atomistic simulations and LC-IMS-MS. These analyses corroborated the high-throughput results and revealed additional conformational descriptors, such as relevant radii and aggregation propensity. This high-throughput analysis yielded unique sequences with differentiable compactness, which can be leveraged in the design of novel receptors. Additionally, it establishes an efficient workflow that can rapidly describe the conformational variability within a library. Furthermore, we introduce a data-driven workflow for sequence analysis, MotifFold, which captures sequence-structure relationships that can be generalized to other native and non-native backbones such as peptides and peptoids, respectively. We also demonstrate how our models are improved through the addition of existing sequence and composition quantitation strategies as feature inputs. This backbone-agnostic sequence analysis can be translated easily to other classes of sequence-defined molecules (e.g., peptidomimetics, oligourethanes, oligothiophenes) and biomacromolecules (e.g., disordered peptides, oligonucleotides).

Through this study, we contribute a comprehensive high-throughput platform spanning synthesis, characterization, and sequence analysis. Having validated this platform on a model system, we look to expanding this strategy to additional backbones and monomer types with chemically diverse residues (e.g., aromatic, chiral, and charged) that remain poorly understood empirically. Further, we anticipate this workflow will enable innovations beyond sequence-structure relationships. For example, the design of therapeutics, such as ligand development for disease targets and disruption of protein-protein interactions, is at the forefront of medical biology, but generating principles that inform this design remains a substantial challenge. Designing and interrogating data-driven strategies from well-described problems such as folding will enable the modular development of materials with target functions across broad environmental and health applications, and the impact will continue to accelerate as characterization databases expand.

EXPERIMENTAL PROCEDURES

Resource availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Abigail Knight (aknight@unc.edu).

Materials availability

All materials generated in this study are available from the lead contact without restriction.

Data and code availability

The datasets generated during this study are available in the supplemental document and additional image and MALDI-MS files. The programs used for this analysis are available on GitHub at https://github.com/UNC-Knight-Lab.

The Bigger Picture.

The sequence, structure, and function of biological and synthetic macromolecules are inherently intertwined. Machine learning algorithms such as AlphaFold are powerful for crystalline proteins but face limitations in characterizing intrinsically disordered proteins, which constitute nearly 50% of the human proteome. Additionally, these algorithms are challenged by unnatural backbones that play crucial roles in biomaterials and therapeutics. Disordered materials present obstacles for structure prediction due to the ensemble of conformations and limited access to uniform datasets. We propose a data-driven workflow to predict the compactness of peptidomimetics using existing descriptors and a machine learning analysis. As a platform to advance our understanding of sequence-structure relationships across key scientific frontiers, this is a critical step towards the de novo design of disordered materials with complex functions.

Highlights.

  • A colorimetric assay characterizes the conformational distribution of a peptoid library

  • A semi-automated protocol identifies the sequences of more than one hundred peptoids

  • A high-throughput workflow yielded unique sequences with differentiable compactness

  • A data-driven workflow for sequence analysis captures sequence-structure relationships

ACKNOWLEDGMENTS

This material is based upon experimental work (A.S.K.) supported by the Air Force Office of Scientific Research under Award Number FA9550-20-1-0172 and statistical analyses (A.S.K.) supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences under Award Number DE-SC0021295. E.C.D. acknowledges support from the National Defense Science & Engineering Graduate (NDSEG) Fellowship Program, and S.S.C. acknowledges the National Science Foundation Graduate Research Fellowship (DGE-2040435). J.N.D. and E.S.B. acknowledge support from the National Institute of Environmental Health Sciences (P42 ES027704), the National Institute of General Medical Sciences (RM1 GM145416) and a cooperative agreement with the U.S. Environmental Protection Agency (STAR RD 84003201). R.B.B. gratefully acknowledges institutional support from the Department of Biochemistry and Biophysics and the Dean’s Office at the UNC School of Medicine as well as the Lineberger Comprehensive Cancer Center. We acknowledge the NMR core laboratory supported by National Science Foundation award number CHE-1828183. We additionally thank Stuart Parnham for expert advice on experimental design and the Biomolecular NMR Laboratory, which receives funding from the National Cancer Institute of the National Institutes of Health under award number P30CA016086. J.E.S. acknowledges support from the Center for Scientific Computing at the California Nanosystems Institute (CNSI, NSF grant CNS-1725797). This work (J.E.S.) used the Extreme Science and Engineering Discovery Environment, which is supported by the National Science Foundation grant ACI-1548562 (MCA05S027). J.E.S. acknowledges support from the NSF (MCB-1716956). This work (J.E.S.) was partially supported by the National Science Foundation through the Materials Research Science and Engineering Center (MRSEC) at UC Santa Barbara: NSF DMR-2308708 (IRG-2).
Microscopy was performed at the UNC Neuroscience Microscopy Core (RRID:SCR_019060), supported, in part, by funding from the NIH-NINDS Neuroscience Center Support Grant P30 NS045892 and the NIH-NICHD Intellectual and Developmental Disabilities Research Center Support Grant P50 HD103573. MALDI TOF analysis was performed in the laboratory of Dr. Timothy Haystead at Duke University School of Medicine, and we acknowledge David Loiselle for valuable training and conversations. We thank Michael Connolly and the Hicks research group for valuable conversations about mass spectrometry sequencing strategies.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

DECLARATION OF INTERESTS

The authors declare no competing interests.

SUPPLEMENTAL INFORMATION

Supporting information can be found online and includes a PDF containing the experimental details, synthetic procedures, and supplemental figures and tables including LC-MS, MALDI-TOF, NMR, and other characterization.

Data S1. High-resolution images of library members

Data S2. MALDI-TOF-TOF raw data files (.t2d)

REFERENCES

  • (1) Anfinsen CB; Haber E (1961) Studies on the Reduction and Re-Formation of Protein Disulfide Bonds. Journal of Biological Chemistry, 236, 5, 1361–1363. 10.1016/S0021-9258(18)64177-8.
  • (2) Anfinsen CB; Redfield RR (1956) Protein Structure in Relation to Function and Biosynthesis. In Advances in Protein Chemistry; Elsevier, 11, pp 1–100. 10.1016/S0065-3233(08)60420-9.
  • (3) Van Der Lee R; Buljan M; Lang B; Weatheritt RJ; Daughdrill GW; Dunker AK; Fuxreiter M; Gough J; Gsponer J; Jones DT; Kim PM; Kriwacki RW; Oldfield CJ; Pappu RV; Tompa P; Uversky VN; Wright PE; Babu MM (2014) Classification of Intrinsically Disordered Regions and Proteins. Chemical Reviews, 114, 13, 6589–6631. 10.1021/cr400525m.
  • (4) Austin MJ; Rosales AM (2019) Tunable Biomaterials from Synthetic, Sequence-Controlled Polymers. Biomaterials Science, 7, 2, 490–505. 10.1039/C8BM01215F.
  • (5) Perry SL; Sing CE (2020) 100th Anniversary of Macromolecular Science Viewpoint: Opportunities in the Physics of Sequence-Defined Polymers. ACS Macro Letters, 216–225. 10.1021/acsmacrolett.0c00002.
  • (6) Yang C; Wu KB; Deng Y; Yuan J; Niu J (2021) Geared Toward Applications: A Perspective on Functional Sequence-Controlled Polymers. ACS Macro Lett., 10, 243–257. 10.1021/acsmacrolett.0c00855.
  • (7) Lytle TK; Chang L-W; Markiewicz N; Perry SL; Sing CE (2019) Designing Electrostatic Interactions via Polyelectrolyte Monomer Sequence. ACS Central Science, 5, 4, 709–718. 10.1021/acscentsci.9b00087.
  • (8) Chang Q; Jiang J (2021) Adsorption of Block-Polyelectrolytes on an Oppositely Charged Surface. Macromolecules, 54, 9, 4145–4153. 10.1021/acs.macromol.1c00165.
  • (9) Patel RA; Colmenares S; Webb MA (2023) Sequence Patterning, Morphology, and Dispersity in Single-Chain Nanoparticles: Insights from Simulation and Machine Learning. ACS Polym. Au, 3, 3, 284–294. 10.1021/acspolymersau.3c00007.
  • (10) Das RK; Pappu RV (2013) Conformations of Intrinsically Disordered Proteins Are Influenced by Linear Sequence Distributions of Oppositely Charged Residues. Proceedings of the National Academy of Sciences, 110, 33, 13392–13397. 10.1073/pnas.1304749110.
  • (11) Das RK; Ruff KM; Pappu RV (2015) Relating Sequence Encoded Information to Form and Function of Intrinsically Disordered Proteins. Current Opinion in Structural Biology, 32, 102–112. 10.1016/j.sbi.2015.03.008.
  • (12) Sherry KP; Das RK; Pappu RV; Barrick D (2017) Control of Transcriptional Activity by Design of Charge Patterning in the Intrinsically Disordered RAM Region of the Notch Receptor. Proceedings of the National Academy of Sciences of the United States of America, 114, 44, E9243–E9252. 10.1073/pnas.1706083114.
  • (13) Sawle L; Ghosh K (2015) A Theoretical Method to Compute Sequence Dependent Configurational Properties in Charged Polymers and Proteins. The Journal of Chemical Physics, 143, 8, 085101. 10.1063/1.4929391.
  • (14) Statt A; Casademunt H; Brangwynne CP; Panagiotopoulos AZ (2020) Model for Disordered Proteins with Strongly Sequence-Dependent Liquid Phase Behavior. The Journal of Chemical Physics, 152, 7, 075101. 10.1063/1.5141095.
  • (15) Rana U; Brangwynne CP; Panagiotopoulos AZ (2021) Phase Separation vs Aggregation Behavior for Model Disordered Proteins. Journal of Chemical Physics, 155, 12. 10.1063/5.0060046.
  • (16) Cohan MC; Shinn MK; Lalmansingh JM; Pappu RV (2022) Uncovering Non-Random Binary Patterns Within Sequences of Intrinsically Disordered Proteins. Journal of Molecular Biology, 434, 2, 167373. 10.1016/j.jmb.2021.167373.
  • (17) Zheng W; Dignon G; Brown M; Kim YC; Mittal J (2020) Hydropathy Patterning Complements Charge Patterning to Describe Conformational Preferences of Disordered Proteins. Journal of Physical Chemistry Letters, 11, 9, 3408–3415. 10.1021/acs.jpclett.0c00288.
  • (18) Kyte J; Doolittle RF (1982) A Simple Method for Displaying the Hydropathic Character of a Protein. Journal of Molecular Biology, 157, 1, 105–132. 10.1016/0022-2836(82)90515-0.
  • (19) Dill KA; MacCallum JL (2012) The Protein-Folding Problem, 50 Years On. Science, 338, 6110, 1042–1046. 10.1126/science.1219021.
  • (20) Milles S; Salvi N; Blackledge M; Jensen MR (2018) Characterization of Intrinsically Disordered Proteins and Their Dynamic Complexes: From in Vitro to Cell-like Environments. Progress in Nuclear Magnetic Resonance Spectroscopy, 109, 79–100. 10.1016/j.pnmrs.2018.07.001.
  • (21) Sormanni P; Piovesan D; Heller GT; Bonomi M; Kukic P; Camilloni C; Fuxreiter M; Dosztanyi Z; Pappu RV; Babu MM; Longhi S; Tompa P; Dunker AK; Uversky VN; Tosatto SCE; Vendruscolo M (2017) Simultaneous Quantification of Protein Order and Disorder. Nature Chemical Biology, 13, 4, 339–342. 10.1038/nchembio.2331.
  • (22) Barbee MH; Wright ZM; Allen BP; Taylor HF; Patteson EF; Knight AS (2021) Protein-Mimetic Self-Assembly with Synthetic Macromolecules. Macromolecules, 54, 8, 3585–3612. 10.1021/acs.macromol.0c02826.
  • (23) Ruan Z; Li S; Grigoropoulos A; Amiri H; Hilburg SL; Chen H; Jayapurna I; Jiang T; Gu Z; Alexander-Katz A; Bustamante C; Huang H; Xu T (2023) Population-Based Heteropolymer Design to Mimic Protein Mixtures. Nature, 615, 7951, 251–258. 10.1038/s41586-022-05675-0.
  • (24) Warren JL; Dykeman-Bermingham PA; Knight AS (2021) Controlling Amphiphilic Polymer Folding beyond the Primary Structure with Protein-Mimetic Di(Phenylalanine). Journal of the American Chemical Society, 143, 33, 13228–13234. 10.1021/JACS.1C05659.
  • (25) Jumper J; Evans R; Pritzel A; Green T; Figurnov M; Ronneberger O; Tunyasuvunakool K; Bates R; Žídek A; Potapenko A; Bridgland A; Meyer C; Kohl SAA; Ballard AJ; Cowie A; Romera-Paredes B; Nikolov S; Jain R; Adler J; Back T; Petersen S; Reiman D; Clancy E; Zielinski M; Steinegger M; Pacholska M; Berghammer T; Bodenstein S; Silver D; Vinyals O; Senior AW; Kavukcuoglu K; Kohli P; Hassabis D (2021) Highly Accurate Protein Structure Prediction with AlphaFold. Nature, 596 (7873), 583–589. 10.1038/s41586-021-03819-2.
  • (26) Webb MA; Jackson NE; Gil PS; de Pablo JJ (2020) Targeted Sequence Design within the Coarse-Grained Polymer Genome. Science Advances, 6, 43, eabc6216. 10.1126/sciadv.abc6216.
  • (27) Bhattacharya D; Kleeblatt DC; Statt A; Reinhart WF (2022) Predicting Aggregate Morphology of Sequence-Defined Macromolecules with Recurrent Neural Networks. Soft Matter, 18, 27, 5037–5051. 10.1039/D2SM00452F.
  • (28) Wheatle BK; Fuentes EF; Lynd NA; Ganesan V (2020) Design of Polymer Blend Electrolytes through a Machine Learning Approach. Macromolecules, 53, 21, 9449–9459. 10.1021/acs.macromol.0c01547.
  • (29) Jablonka KM; Jothiappan GM; Wang S; Smit B; Yoo B (2021) Bias Free Multiobjective Active Learning for Materials Design and Discovery. Nat Commun, 12, 1, 2312. 10.1038/s41467-021-22437-0.
  • (30) Tesei G; Trolle AI; Jonsson N; Betz J; Knudsen FE; Pesce F; Johansson KE; Lindorff-Larsen K (2024) Conformational Ensembles of the Human Intrinsically Disordered Proteome. Nature, 626 (8000), 897–904. 10.1038/s41586-023-07004-5.
  • (31) Patel RA; Borca CH; Webb MA (2022) Featurization Strategies for Polymer Sequence or Composition Design by Machine Learning. Mol. Syst. Des. Eng, 7, 6, 661–676. 10.1039/D1ME00160D.
  • (32) An Y; Webb MA; Jacobs WM (2024) Active Learning of the Thermodynamics-Dynamics Trade-off in Protein Condensates. Sci. Adv, 10, 1. 10.1126/sciadv.adj2448.
  • (33) Lotthammer JM; Ginell GM; Griffith D; Emenecker RJ; Holehouse AS (2024) Direct Prediction of Intrinsically Disordered Protein Conformational Properties from Sequence. Nat Methods, 21, 3, 465–476. 10.1038/s41592-023-02159-5.
  • (34) DeStefano AJ; Segalman RA; Davidson EC (2021) Where Biology and Traditional Polymers Meet: The Potential of Associating Sequence-Defined Polymers for Materials Science. JACS Au, 1, 10, 1556–1571. 10.1021/jacsau.1c00297.
  • (35) Chan HS; Dill KA (1991) “Sequence Space Soup” of Proteins and Copolymers. The Journal of Chemical Physics, 95, 5, 3775–3787. 10.1063/1.460828.
  • (36) Ashbaugh HS (2009) Tuning the Globular Assembly of Hydrophobic/Hydrophilic Heteropolymer Sequences. Journal of Physical Chemistry B, 113, 43, 14043–14046. 10.1021/jp907398r.
  • (37) Guseva E; Zuckermann RN; Dill KA (2017) Foldamer Hypothesis for the Growth and Sequence Differentiation of Prebiotic Polymers. Proceedings of the National Academy of Sciences, 114 (36), E7460–E7468. 10.1073/pnas.1620179114.
  • (38) Faizullina K; Burovski E (2021) Globule-Coil Transition in the Dynamic HP Model. Journal of Physics: Conference Series, 1740, 1. 10.1088/1742-6596/1740/1/012014.
  • (39) Day EC; Cunha KC; Zhao J; DeStefano AJ; Dodds JN; Yu MA; Bemis JR; Han S; Baker ES; Shea J-E; Berlow RB; Knight AS (2024) Insights into Conformational Ensembles of Compositionally Identical Disordered Peptidomimetics. Polymer Chemistry. 10.1039/d4py00341a.
  • (40) Khokhlov AR; Khalatur PG (1999) Conformation-Dependent Sequence Design (Engineering) of AB Copolymers. Physical Review Letters, 82, 17, 3456–3459. 10.1103/PhysRevLett.82.3456.
  • (41) Murnen HK; Khokhlov AR; Khalatur PG; Segalman RA; Zuckermann RN (2012) Impact of Hydrophobic Sequence Patterning on the Coil-to-Globule Transition of Protein-like Polymers. Macromolecules, 45, 12, 5229–5236. 10.1021/ma300707t.
  • (42) ColorClassify. https://github.com/UNC-Knight-Lab/ColorClassify.
  • (43) MotifFold. https://github.com/UNC-Knight-Lab/MotifFold.
  • (44) Rosales AM; Segalman RA; Zuckermann RN (2013) Polypeptoids: A Model System to Study the Effect of Monomer Sequence on Polymer Properties and Self-Assembly. Soft Matter, 9, 35, 8400–8414. 10.1039/c3sm51421h.
  • (45) Knight AS; Zhou EY; Francis MB; Zuckermann RN (2015) Sequence Programmable Peptoid Polymers for Diverse Materials Applications. Advanced Materials, 27, 38, 5665–5691. 10.1002/adma.201500275.
  • (46) Banville SC; Zuckermann RN (1996) Synthesis of N-Substituted Glycine Peptoid Libraries. Methods in Enzymology, 267, 437–447.
  • (47) Knight AS; Zhou EY; Pelton JG; Francis MB (2013) Selective Chromium(VI) Ligands Identified Using Combinatorial Peptoid Libraries. Journal of the American Chemical Society, 135, 46, 17488–17493. 10.1021/ja408788t.
  • (48) Green RM; Bicker KL (2021) Discovery and Characterization of a Rapidly Fungicidal and Minimally Toxic Peptoid against Cryptococcus Neoformans. ACS Med. Chem. Lett, 12, 9, 1470–1477. 10.1021/acsmedchemlett.1c00327.
  • (49) Ren J; Tian Y; Hossain E; Ho JS; Mann YS; Zhang Y; Browne MD; Connolly MD; Zuckermann RN (2020) Mass Spectrometry Studies of the Fragmentation Patterns and Mechanisms of Protonated Peptoids. Biopolymers, 111, 7, e23358. 10.1002/bip.23358.
  • (50) Meldal M (1992) Pega: A Flow Stable Polyethylene Glycol Dimethyl Acrylamide Copolymer for Solid Phase Synthesis. Tetrahedron Letters, 33 (21), 3077–3080. 10.1016/S0040-4039(00)79604-3.
  • (51) Chhabra SR; Hothi B; Evans DJ; White PD; Bycroft BW; Chan WC (1998) An Appraisal of New Variants of Dde Amine Protecting Group for Solid Phase Peptide Synthesis. Tetrahedron Letters, 39, 12, 1603–1606. 10.1016/S0040-4039(97)10828-0.
  • (52) Paulick MG; Hart KM; Brinner KM; Tjandra M; Charych DH; Zuckermann RN (2006) Cleavable Hydrophilic Linker for One-Bead-One-Compound Sequencing of Oligomer Libraries by Tandem Mass Spectrometry. J. Comb. Chem, 8, 3, 417–426. 10.1021/cc0501460.
  • (53) Zuckermann RN; Kerr JM; Moos WH; Kent SBH (1992) Efficient Method for the Preparation of Peptoids [Oligo(N-Substituted Glycines)] by Submonomer Solid-Phase Synthesis. Journal of the American Chemical Society, 114, 26, 10646–10647. 10.1021/ja00052a076.
  • (54) Culf AS; Ouellette RJ (2010) Solid-Phase Synthesis of N-Substituted Glycine Oligomers (α-Peptoids) and Derivatives. Molecules, 15, 8, 5282–5335. 10.3390/molecules15085282.
  • (55) Stewart JM (1976) Solid Phase Peptide Synthesis. Journal of Macromolecular Science: Part A - Chemistry, 10 (1–2), 259–288. 10.1080/00222337608068099.
  • (56) Terashima T; Sugita T; Fukae K; Sawamoto M (2014) Synthesis and Single-Chain Folding of Amphiphilic Random Copolymers in Water. Macromolecules, 47, 2, 589–600. 10.1021/ma402355v.
  • (57) Pan Y; Ford WT (1999) Dendrimers with Both Hydrophilic and Hydrophobic Chains at Every End. Macromolecules, 32, 16, 5468–5470. 10.1021/ma990675q.
  • (58) Macquarrie DJ; Tavener SJ; Gray GW; Heath PA; Rafelt JS; Saulzet SI; Hardy JJE; Clark JH; Sutra P; Brunel D; Di Renzo F; Fajula F (1999) The Use of Reichardt’s Dye as an Indicator of Surface Polarity. New Journal of Chemistry, 23, 7, 725–731. 10.1039/a901563i.
  • (59) Reichardt C (1994) Solvatochromic Dyes as Solvent Polarity Indicators. Chemical Reviews, 94, 8, 2319–2358. 10.1021/cr00032a005.
  • (60) PeptoidSeq. https://github.com/UNC-Knight-Lab/PeptoidSeq.
  • (61) Sarkar M; Pascal BD; Steckler C; Aquino C; Micalizio GC; Kodadek T; Chalmers MJ (2013) Decoding Split and Pool Combinatorial Libraries with Electron-Transfer Dissociation Tandem Mass Spectrometry. Journal of the American Society for Mass Spectrometry, 24 (7), 1026–1036. 10.1007/s13361-013-0633-x.
  • (62) Murrell E; Luyt LG (2020) Incorporation of Fluorine into an OBOC Peptide Library by Copper-Free Click Chemistry toward the Discovery of PET Imaging Agents. ACS Comb. Sci, 22, 3, 109–113. 10.1021/acscombsci.9b00146.
  • (63) Lam KS; Lebl M; Krchňák V (1997) The “One-Bead-One-Compound” Combinatorial Library Method. Chem. Rev, 97, 2, 411–448. 10.1021/cr9600114.
  • (64) Fetsch C; Flecks S; Gieseler D; Marschelke C; Ulbricht J; van Pée K-H; Luxenhofer R (2015) Self-Assembly of Amphiphilic Block Copolypeptoids with C2-C5 Side Chains in Aqueous Solution. Macromolecular Chemistry and Physics, 216, 5, 547–560. 10.1002/macp.201400534.
  • (65) Wonderly WR; Cristiani TR; Cunha KC; Degen GD; Shea J; Waite JH (2020) Dueling Backbones: Comparing Peptoid and Peptide Analogues of a Mussel Adhesive Protein. Macromolecules, 53, 16, 6767–6779. 10.1021/acs.macromol.9b02715.
  • (66) Daura X; Gademann K; Jaun B; Seebach D; Van Gunsteren WF; Mark AE (1999) Peptide Folding: When Simulation Meets Experiment. Angew. Chem. Int. Ed, 38, 1–2, 236–240.
  • (67) Foley CD; Zhang B; Alb AM; Trimpin S; Grayson SM (2015) Use of Ion Mobility Spectrometry-Mass Spectrometry to Elucidate Architectural Dispersity within Star Polymers. ACS Macro Letters, 4, 7, 778–782. 10.1021/acsmacrolett.5b00299.
  • (68) Weber P; Hoyas S; Halin É; Coulembier O; De Winter J; Cornil J; Gerbaux P (2022) On the Conformation of Anionic Peptoids in the Gas Phase. Biomacromolecules, 23 (3), 1138–1147. 10.1021/acs.biomac.1c01442.
  • (69) Dodds JN; Hopkins ZR; Knappe DRU; Baker ES (2020) Rapid Characterization of Per- and Polyfluoroalkyl Substances (PFAS) by Ion Mobility Spectrometry–Mass Spectrometry (IMS-MS). Anal. Chem, 92, 6, 4427–4435. 10.1021/acs.analchem.9b05364.
  • (70) Bleiholder C; Bowers MT (2017) The Solution Assembly of Biological Molecules Using Ion Mobility Methods: From Amino Acids to Amyloid β-Protein. Annual Rev. Anal. Chem, 10, 1, 365–386. 10.1146/annurev-anchem-071114-040304.
  • (71) Jaeger S; Fulle S; Turk S (2018) Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. J. Chem. Inf. Model, 58, 1, 27–35. 10.1021/acs.jcim.7b00616.
  • (72) Alipanahi B; Delong A; Weirauch MT; Frey BJ (2015) Predicting the Sequence Specificities of DNA- and RNA-Binding Proteins by Deep Learning. Nat Biotechnol, 33 (8), 831–838. 10.1038/nbt.3300.
  • (73) Grabherr MG; Haas BJ; Yassour M; Levin JZ; Thompson DA; Amit I; Adiconis X; Fan L; Raychowdhury R; Zeng Q; Chen Z; Mauceli E; Hacohen N; Gnirke A; Rhind N; Di Palma F; Birren BW; Nusbaum C; Lindblad-Toh K; Friedman N; Regev A (2011) Full-Length Transcriptome Assembly from RNA-Seq Data without a Reference Genome. Nat Biotechnol, 29, 7, 644–652. 10.1038/nbt.1883.
  • (74) Bahai A; Asgari E; Mofrad MRK; Kloetgen A; McHardy AC (2021) EpitopeVec: Linear Epitope Prediction Using Deep Protein Sequence Embeddings. Bioinformatics, 37, 23, 4517–4525. 10.1093/bioinformatics/btab467.
  • (75) Selva Birunda S; Kanniga Devi R (2021) A Review on Word Embedding Techniques for Text Classification. In Innovative Data Communication Technologies and Application; Raj JS, Iliyasu AM, Bestak R, Baig ZA, Eds.; Lecture Notes on Data Engineering and Communications Technologies; Springer Singapore: Singapore; Vol. 59, pp 267–281. 10.1007/978-981-15-9651-3_23.
  • (76) Asudani DS; Nagwani NK; Singh P (2023) Impact of Word Embedding Models on Text Analytics in Deep Learning Environment: A Review. Artif Intell Rev, 56, 9, 10345–10425. 10.1007/s10462-023-10419-1.
  • (77) Pargent F; Pfisterer F; Thomas J; Bischl B (2022) Regularized Target Encoding Outperforms Traditional Methods in Supervised Machine Learning with High Cardinality Features. Comput Stat, 37, 5, 2671–2692. 10.1007/s00180-022-01207-6.
  • (78) Svetnik V; Wang T; Tong C; Liaw A; Sheridan RP; Song Q (2005) Boosting: An Ensemble Learning Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Model, 45 (3), 786–799. 10.1021/ci0500379.
  • (79) Boldini D; Grisoni F; Kuhn D; Friedrich L; Sieber SA (2023) Practical Guidelines for the Use of Gradient Boosting for Molecular Property Prediction. J Cheminform, 15, 1, 73. 10.1186/s13321-023-00743-7.
  • (80) Mikolov T; Chen K; Corrado G; Dean J (2013) Efficient Estimation of Word Representations in Vector Space. Preprint on arXiv. 10.48550/arXiv.1301.3781.
