Proceedings of the National Academy of Sciences of the United States of America
2024 Mar 7;121(11):e2311726121. doi: 10.1073/pnas.2311726121

Machine learning to predict continuous protein properties from binary cell sorting data and map unseen sequence space

Marshall Case a, Matthew Smith a,b, Jordan Vinh c, Greg Thurber a,c,1
PMCID: PMC10945751  PMID: 38451939

Significance

We demonstrate that, surprisingly, information obtained from simple sorting experiments coupled with linear machine learning models consistently predicts continuous protein properties across multiple protein engineering tasks. The ability to readily predict protein fitness from sequence for one or more objectives (affinity, fluorescence, and specificity) can reduce the cost and increase the scale of experimental measurements of protein fitness while retaining the accuracy of more complex experimental methods. This manuscript further provides a powerful protein optimization method that harnesses information from commonly obtained cell sorting data to design high-fitness agents that lie beyond experimentally measured sequence space.

Keywords: protein engineering, machine learning, directed evolution

Abstract

Proteins are a diverse class of biomolecules responsible for wide-ranging cellular functions, from catalyzing reactions to recognizing pathogens. The ability to evolve proteins rapidly and inexpensively toward improved properties is a common objective for protein engineers. Powerful high-throughput methods like fluorescent activated cell sorting and next-generation sequencing have dramatically improved directed evolution experiments. However, it is unclear how to best leverage these data to characterize protein fitness landscapes more completely and identify lead candidates. In this work, we develop a simple yet powerful framework to improve protein optimization by predicting continuous protein properties from simple directed evolution experiments using interpretable, linear machine learning models. Importantly, we find that these models, which use data from simple but imprecise experimental estimates of protein fitness, have predictive capabilities that approach more precise but expensive data. Evaluated across five diverse protein engineering tasks, continuous properties are consistently predicted from readily available deep sequencing data, demonstrating that protein fitness space can be reasonably well modeled by linear relationships among sequence mutations. To prospectively test the utility of this approach, we generated a library of stapled peptides and applied the framework to predict affinity and specificity from simple cell sorting data. We then coupled integer linear programming, a method to optimize protein fitness from linear weights, with mutation scores from machine learning to identify variants in unseen sequence space that have improved and co-optimal properties. This approach represents a versatile tool for improved analysis and identification of protein variants across many domains of protein engineering.


A longstanding goal of biochemistry has been to map the sequence of a protein to its structure and function (1). However, the complex biophysics that govern the protein fitness landscape, including how a protein folds and how its structure influences function, make the coupling of sequence to function an extremely difficult task. Thus, protein engineers often focus on a much smaller subdomain of the protein fitness landscape, using the confined resources of experimental protein science to explore variants close to a known functional protein with the goal of incrementally improving function. A common and extremely powerful approach is directed evolution, where, for example, a protein is encoded by DNA, expressed by cells, and assayed by magnetic or fluorescent activated cell sorting (MACS or FACS) and, more recently, next generation sequencing (NGS) to identify variants with improved fitness. While these techniques represent powerful tools in the protein engineering arsenal, it is unclear how to best leverage information from deep sequencing toward the optimization of protein variants. A method capable of generating both fitness estimates and predictions of sequences with higher activity would greatly expand the power and efficiency of directed evolution experiments.

The combination of directed evolution and NGS has enabled protein engineers to rapidly evaluate millions to billions of protein variants in a highly focused manner. With maintenance of the genotype–phenotype connection, any technique that manipulates DNA in a high-throughput manner can be applied to design focused protein variant libraries and assay protein function (2, 3). Techniques like mRNA display and phage display can evaluate the largest libraries, although the small size of their display particles precludes sorting approaches such as FACS (4). Cell surface display techniques, such as bacterial or yeast display, enable facile measurement of the interaction between protein variants and soluble proteins, which can be used for assaying binding affinity in high-throughput sorting and sequencing technologies (5). Coupling FACS with cell surface display technologies allows for the selection of rare protein variants among a large library with extreme selectivity (6, 7). These techniques have enabled a wide range of protein engineering campaigns, from affinity maturation of protein–protein interactions to highly enantioselective enzymes (8, 9). However, one challenge with these large libraries is identifying the best lead molecules from the hundreds to thousands of observed sequences in the final sorted population. Traditional approaches for lead molecule identification select variants according to their frequency in the enriched library under the assumption that higher abundance is indicative of higher function (10–12). One downside to this approach is that optimal rare variants are excluded from selection, and more complex descriptions of how mutations contribute to protein function are difficult to ascertain (11). Application of NGS to the output pool of a protein variant sort improves the accuracy of clone frequency estimates, but frequency rarely correlates with protein properties directly (13–15).
These challenges arise from sources of error that are difficult to eliminate: variation in cell-to-cell growth, PCR/cloning biases, sequencing errors, and FACS instrument noise (16, 17). With additional sequencing of the input library, enrichment ratios can be calculated, which can improve the accuracy of protein property prediction (18, 19). Despite these improvements, there is still little consensus on the best experimental design and analysis of these directed evolution experiments.

Several approaches have been proposed to mitigate these sources of error and enable the prediction of quantitative protein properties from high-throughput sorting experiments. Deep mutational scanning (DMS) measures the enrichment of many variants. However, several challenges exist: accuracy in resolving affinity is often limited to a narrow linear region (~10X dynamic range), the results are sensitive to sorting conditions, stability, and expression effects, and the outcomes can differ from true quantitative measurements of binding affinity (equilibrium dissociation constants, or KD's) (20, 21). Sort-seq aims to address noise from sorting by using multiple bins across the entire fluorescence channel, followed by deep sequencing, to infer the distribution of each sequence in fluorescent space (22). These techniques, while successful at mapping large regions of the fitness landscape, require more sorting time and 8- to 12-fold increased deep sequencing throughput, and still have a narrow range of resolution. Several more sophisticated sorting techniques address these issues: SORTCERY creates a rank ordering of affinities by sorting cells according to their binding and expression at a single concentration (23); amped SORTCERY further improves this technique by converting rank order to free energy changes with titration standards (24); and TiteSeq sorts protein variants at multiple ligand concentrations across multiple bins and fits the affinity to the fraction bound (21). These methods leverage additional sorting and sequencing to improve the predicted outcomes and yield measurements of properties that rival the accuracy of traditional, solution-based protein methods (such as biolayer interferometry or surface plasmon resonance).
In this work, we seek to utilize deep sequencing with interpretable machine learning approaches to determine if we can predict properties of proteins that are continuous in nature and can vary many orders of magnitude (such as binding affinity, fluorescence, and specificity) from sorting experiments that only generate binary data (whether a cell is sorted or not). Based on the positive results, we then used the weights of the machine learning models to determine whether we could extrapolate into unseen sequence space toward higher fitness protein variants.

Results

Overview of the Method.

Despite significant efforts to gather quantitative data from high-throughput sorting, most directed evolution campaigns rely on basic metrics of protein fitness. We utilized a simple workflow to extract quantitative protein properties (measurements along a continuous range of values that may correlate with precise measurements of protein fitness, such as the dissociation constant (Kd, in molar units) for binding affinity) from NGS datasets while keeping the experimental design simple and affordable (Fig. 1). To accomplish this task, we generated binary labels from enrichment ratios, trained machine learning models using these binary labels to infer continuous protein properties (25), and optimized protein sequence and function beyond experimentally sampled space into unseen sequence space (24). We hypothesized that continuous protein properties can be obtained from simple sorting and sequencing analyses for three primary reasons. First, because cell sorting is a stochastic process, cells sorted into discrete bins are sampled from an underlying continuous distribution. Thus, cells sorted in a binary manner may allow inference of this distribution (26). Second, biased sampling toward the most and least functional variants may allow models to “interpolate” function of intermediate fitness. Finally, sampling many epistatically interacting motifs may allow inference between them (27). We also hypothesized this approach would work across multiple protein engineering objectives, including affinity maturation, fluorescence, measuring the sensitivity of fitness to mutational burden (for example, via deep mutational scanning), and specificity.
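The first hypothesis, that binary gate assignments sampled stochastically from an underlying continuous distribution still carry continuous information, can be illustrated with a small simulation. Everything below is invented for illustration (an additive toy landscape, a logistic sorting-noise model, and a hand-rolled two-class LDA projection); it is not the paper's data or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy additive landscape (an illustrative assumption, not the paper's data):
# 20 variable positions, 4 possible residues each.
n_pos, n_aa, n_seq = 20, 4, 5000
true_w = rng.normal(size=(n_pos, n_aa))
seqs = rng.integers(0, n_aa, (n_seq, n_pos))
fitness = true_w[np.arange(n_pos), seqs].sum(axis=1)

# Stochastic binary sort: the chance of landing in the positive gate
# rises smoothly with the hidden continuous fitness.
p_pos = 1.0 / (1.0 + np.exp(-(fitness - fitness.mean())))
label = rng.random(n_seq) < p_pos

# One-hot encode sequences and fit a two-class LDA projection by hand:
# w = Sigma^-1 (mu1 - mu0), with a small ridge term because one-hot
# features are collinear.
X = np.zeros((n_seq, n_pos * n_aa))
X[np.arange(n_seq)[:, None], np.arange(n_pos) * n_aa + seqs] = 1.0
mu0, mu1 = X[~label].mean(axis=0), X[label].mean(axis=0)
Xc = np.concatenate([X[~label] - mu0, X[label] - mu1])
cov = Xc.T @ Xc / len(Xc) + 1e-3 * np.eye(X.shape[1])
w = np.linalg.solve(cov, mu1 - mu0)
projection = X @ w  # the model's internal continuous score

def spearman(a, b):
    """Rank correlation without external dependencies."""
    ra, rb = a.argsort().argsort(), b.argsort().argsort()
    return np.corrcoef(ra, rb)[0, 1]

# Despite training only on binary gate labels, the projection tracks
# the hidden continuous fitness.
print(round(spearman(projection, fitness), 2))
```

The ranking recovered from binary labels alone correlates strongly with the simulated continuous fitness, which is the behavior the paper observes empirically across its five datasets.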

Fig. 1.


Overview of the protein engineering workflow. A library of protein variants is expressed on the surface of cells and sorted according to its fluorescent property (whether it binds fluorescently labeled molecules, its intrinsic fluorescence, etc.) (A). Sorted cells are deep sequenced, sequences are aligned, and, using machine learning, continuous scores for each mutation and the overall sequence are calculated (B). Next, the scores for each mutation inferred from these data are used to inform protein fitness landscapes and to optimize protein properties by solving linear equations that maximize one or more fitness objectives (binding affinity, binding specificity, fluorescence, etc.).

To validate the approach, we aggregated data from multiple protein engineering campaigns that fulfilled two criteria: 1) they had many data points of multi-mutant proteins from a sorting campaign and 2) they had measured many continuous protein properties among these variants. These datasets were the fitness landscape of GFP (28), the directed evolution of a fluorescein-binding scFv (21), and the fitness landscape of the SARS-CoV-2 Spike protein (29, 30). Because the co-optimization of multiple properties is often needed, we also gathered datasets from campaigns designing high-affinity and high-specificity monoclonal antibodies (25) and highly specific peptides against three B cell lymphoma 2 (Bcl-2) proteins (24).

Data Processing Pipeline for Varying Protein Variant Libraries and Hyperparameter Optimization.

The modular data processing and machine learning pipeline to analyze protein variant libraries consists of multiple steps (Fig. 2). First, a library of protein variants is sorted and deep sequenced, and the ratio of the positive to negative sample frequencies is calculated for all observed sequences. If a sequence found in the positive gate was unobserved in the negative gate, the ratio was set to the maximum observed; conversely, if a sequence found in the negative gate was unobserved in the positive gate, the ratio was set to the minimum ratio observed. Labels ("1" for high-performing variants and "0" for low-performing variants) were assigned by determining a cutoff based on the average ratio (percentile ≥0.8 and ≤0.2, respectively) across the replicates in which each sequence appeared. We hypothesized this label assignment balances the information gained from enrichment ratios while still including clones that were overwhelmingly enriched or depleted.
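The ratio-and-label step above can be sketched in a few lines. The sequences and read counts below are invented placeholders, and the single-replicate percentile cutoffs stand in for the replicate-averaged thresholds described in the text.

```python
import numpy as np

# Hypothetical read counts per unique sequence in the positive and
# negative sort gates (names and numbers are illustrative).
pos_counts = {"AKLV": 480, "AKLI": 12, "GKLV": 95, "AQLV": 3}
neg_counts = {"AKLV": 5, "AKLI": 300, "GKLV": 80, "TKLV": 40}

seqs = sorted(set(pos_counts) | set(neg_counts))
pos_tot = sum(pos_counts.values())
neg_tot = sum(neg_counts.values())

# Frequency ratio positive/negative for sequences seen in both gates.
raw = {}
for s in seqs:
    p = pos_counts.get(s, 0) / pos_tot
    n = neg_counts.get(s, 0) / neg_tot
    if p > 0 and n > 0:
        raw[s] = p / n

# Sequences seen in only one gate get the extreme observed ratio,
# as in the pipeline described above.
hi, lo = max(raw.values()), min(raw.values())
ratios = {s: raw.get(s, hi if pos_counts.get(s, 0) > 0 else lo) for s in seqs}

# Binary labels at percentile cutoffs (>=80th -> 1, <=20th -> 0);
# mid-range sequences are left unlabeled and dropped from training.
vals = np.array([ratios[s] for s in seqs])
hi_cut, lo_cut = np.percentile(vals, 80), np.percentile(vals, 20)
labels = {s: 1 if ratios[s] >= hi_cut else 0 if ratios[s] <= lo_cut else None
          for s in seqs}
print(labels)  # {'AKLI': 0, 'AKLV': 1, 'AQLV': 1, 'GKLV': None, 'TKLV': 0}
```

The percentile arguments are the tunable hyperparameters examined later in the section.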

Fig. 2.


Deep sequencing, data pre-processing, and machine learning overview. The most and least functional protein variants from the binary sort are sequenced (A), and the ratio of sequence reads in the positive versus negative gate is calculated independently for each replicate (B). Binary labels are assigned to each unique sequence according to its ratio of read counts between the positive and negative bins (C); the label thresholds are easily modified depending on the library construction, sorting strategy, and sequencing data quality. Protein sequences are one-hot encoded for machine interpretability (D) before being used to train a linear discriminant analysis (LDA) model (E), which is evaluated on a hold-out test set (F). Then, to evaluate whether the LDA model's internal continuous projection is correlated with actual continuous values, protein properties are quantified either using a complex sorting method (such as SORTCERY, Sort-Seq, or TiteSeq) or from low-throughput measurements (such as cell surface flow cytometry titrations) (G and H). Finally, the internal continuous projection from the LDA model is used to predict these continuous protein properties (I).

Armed with a dataset of sequences and binary function labels, an LDA machine learning model was trained because it fulfilled two criteria: it could classify sequences by their function labels, and it had an internal continuous projection that could be correlated with continuous properties. Because LDA models project high-dimensional sequence data to maximize class separation, the final projection of the model is a continuous representation. This approach has been used previously to correlate with continuous properties (25). The model was trained and tested by splitting the sequencing data into train and test sets randomly (80:20 train:test). High performance among these models indicates that the model is learning to distinguish high-performing from low-performing sequences, but it does not directly indicate that the continuous values predicted are biologically meaningful. To evaluate whether the trained weights correlate with meaningful continuous properties, a subset of the sequences was assayed for its property with a lower-throughput but more accurate technique: for all but the Makowski dataset, a quantitative cell sorting experiment; for the Makowski dataset, a low-throughput measurement of affinity or specificity via flow cytometry with individual sequences. We then evaluated the prediction of continuous protein properties by comparing the continuous values from LDA with the actual continuous measurements.
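The one-hot encoding (Fig. 2D) and random 80:20 split feeding the LDA model can be sketched as follows. The 4-mer sequences and labels are illustrative placeholders, not data from the study.

```python
import numpy as np

alphabet = "ACDEFGHIKLMNPQRSTVWY"
aa_index = {a: i for i, a in enumerate(alphabet)}

def one_hot(seq):
    """Flatten an amino acid sequence into a one-hot vector (Fig. 2D)."""
    x = np.zeros(len(seq) * len(alphabet))
    for pos, aa in enumerate(seq):
        x[pos * len(alphabet) + aa_index[aa]] = 1.0
    return x

# Hypothetical labeled sequences from the binary-label step.
data = [("AKLV", 1), ("AQLV", 1), ("AKLI", 0), ("TKLV", 0)]
X = np.stack([one_hot(s) for s, _ in data])
y = np.array([lab for _, lab in data])

# Random 80:20 train/test split, as described above.
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train_idx, test_idx = idx[:n_train], idx[n_train:]
print(X.shape, len(train_idx), len(test_idx))  # (4, 80) 3 1
```

The encoded matrix can then be passed to any LDA implementation; the class decision gives the classification metrics reported in SI Appendix, Table S2, while the underlying continuous projection is what gets correlated with measured properties.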

To evaluate the performance of the label assignment and training process, we also investigated two hyperparameters: the label assignment threshold and the minimum number of replicates. First, we hypothesized that while splitting the positive and negative labels at the 50th percentile would increase the number of labeled sequences and therefore the dataset size, sorting noise around the midpoint would confound information gained from binary ratios. To evaluate this hypothesis, we assigned labels to each of the datasets at varying cutoff thresholds and measured the performance against an unseen test set (SI Appendix, Figs. S1–S5). We also hypothesized that removing sequences with few replicates would reduce noise from sorting and improve the performance of the model. To evaluate this hypothesis, we truncated each of the datasets based on the number of replicates for each measured sequence and measured the performance of the model as above (SI Appendix, Figs. S1–S5). We found that as the dataset size decreased due to higher stringency, the performance of the models decreased. However, as we reduced the stringency, model performance was relatively unaffected, suggesting that LDA is well equipped to model noisy data. These parameters were chosen to balance the size of the dataset, the strictness of inclusion, and the confidence of the sequencing data. Having easily modifiable parameters for label assignment serves as both a tool for sequencing quality processing and a powerful hyperparameter in the subsequent machine learning steps.

Binary Sorts with NGS Predict Protein Properties with Equal Correlation Power.

To evaluate whether the LDA models trained on binary sorting data inferred meaningful features of protein fitness, we curated five datasets as described in the methods (see SI Appendix, Table S1 for dataset summaries). Using data from each of these, we compared the actual measurements of protein fitness (continuous properties, such as Kd for binding affinity or ratio of Kd’s for specificity) to their predicted values from LDA models trained on binary sorting data (quantitative properties, but lacking biologically meaningful units) as shown in Fig. 3 A, Left for the Sarkisyan dataset. As a point of comparison, we next sought to determine the performance of a comparable model but trained on accurate, continuous data. This training data is more expensive and/or complicated to obtain but is presumably more information rich. Therefore, we hypothesized models trained on continuous data would have stronger correlative power. To evaluate this hypothesis, we trained Ridge regression models, which have been previously shown to be powerful linear models that are not prone to over-fitting and whose trained weights are biologically meaningful (31). We then compared both LDA and Ridge models' predictions of continuous properties (Fig. 3 A, Right). Surprisingly, for the Sarkisyan et al. dataset, the LDA models performed similarly to the Ridge regression models as evidenced by a similar Spearman’s ρ (0.846 for the LDA model and 0.855 for the Ridge regression model).
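The Ridge baseline trained on continuous labels has a simple closed form, w = (XᵀX + λI)⁻¹Xᵀy. The sketch below fits it on a simulated additive landscape (all values invented for illustration); the paper's actual Ridge models were fit to the measured continuous properties.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy additive landscape (illustrative): fitness is a noisy linear sum of
# per-mutation effects over 15 positions x 4 residues.
n_pos, n_aa, n = 15, 4, 4000
w_true = rng.normal(size=(n_pos, n_aa))
seqs = rng.integers(0, n_aa, (n, n_pos))
y = w_true[np.arange(n_pos), seqs].sum(axis=1) + rng.normal(0, 0.5, n)

# One-hot design matrix.
X = np.zeros((n, n_pos * n_aa))
X[np.arange(n)[:, None], np.arange(n_pos) * n_aa + seqs] = 1.0

# Closed-form Ridge fit on the continuous labels:
# w = (X^T X + lambda * I)^-1 X^T y
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
predicted = X @ w_ridge

def spearman(a, b):
    """Rank correlation without external dependencies."""
    ra, rb = a.argsort().argsort(), b.argsort().argsort()
    return np.corrcoef(ra, rb)[0, 1]

print(round(spearman(predicted, y), 2))
```

The regularization term λ keeps the fit stable despite the collinear one-hot columns, which is part of why Ridge resists overfitting on this kind of encoding.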

Fig. 3.


Predictions from models trained on binary data are highly correlated with continuous protein properties and equally powerful as models trained on continuous data. Evaluated on the Sarkisyan 2016 data, LDA models trained on binary data (A, Left) or Ridge models trained on continuous data (A, Right) are highly correlated with GFP fluorescence (measured via Sort-Seq). Across all five protein engineering datasets, LDA models trained on binary data are equally predictive of continuous properties (B). Error bars represent SD among n = 5 cross-validation folds.

We then tested this hypothesis on the other four datasets. First, we observed that LDA models achieved high classification performance on the held-out test set for all datasets (see SI Appendix, Table S2 for accuracy, precision, recall, and F1 score) and were not overfit, as evidenced by similar performance on the training and test sets. Next, we observed that LDA projections were highly correlated with continuous measurements, as evidenced by Spearman's ρ between 0.5 and 0.85 (Fig. 3B; additionally, see SI Appendix, Figs. S1–S5 for hyperparameter effects on performance). To estimate model sensitivity to dataset splitting, we performed fivefold cross-validation (Materials and Methods) on each training dataset (SI Appendix, Fig. S6). Strikingly, for each of the datasets, we observed no significant difference in correlation between models trained on continuous data and those trained on binary data (significance was measured as a t-test on the unbounded Z transform of the Spearman ρ) (32).
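The comparison of two correlations via the unbounded Z transform can be sketched as below. The sample sizes are placeholders, and this simple two-sample normal approximation stands in for the paper's exact cross-validation-aware procedure.

```python
import math

def fisher_z(r):
    """Unbounded Z transform of a correlation coefficient: atanh(r)."""
    return 0.5 * math.log((1 + r) / (1 - r))

def compare_correlations(r1, n1, r2, n2):
    """Two-sided z-test that two independent correlations differ.

    Uses the standard normal approximation with SE = sqrt(1/(n-3))
    per sample (an assumption; the paper's replicate structure may
    be handled differently).
    """
    z1, z2 = fisher_z(r1), fisher_z(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided P value from the normal CDF via erf.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Illustrative values from Fig. 3A (LDA rho = 0.846, Ridge rho = 0.855);
# n = 500 sequences per model is an assumed sample size.
z, p = compare_correlations(0.846, 500, 0.855, 500)
print(round(z, 3), round(p, 3))
```

With correlations this close, the test does not reject the null that the binary- and continuous-trained models are equally predictive, mirroring the result reported above.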

Encouraged by the overall success of the correlations, we also investigated their magnitude, which was consistently high with two outliers. The Adams dataset had a significantly lower predictive value of ρ ~ 0.5. We suspect this decrease in performance has two sources: noise in the dataset due to an abundance of unresolvable low-affinity variants (see SI Appendix, Figs. S7 and S8 for correlation plots and statistics for each dataset, respectively), and the absence of discrimination between binding affinity and expression level in the dataset, which can attribute higher affinity to sequences with higher display and vice versa (17). The Makowski specificity dataset also had lower than average performance; we hypothesize this model was limited by the difficult nature of measuring antibody off-target binding (25, 33, 34).

We next characterized the model performance as a function of both the proportion of functional sequences and the distribution of mutations within a dataset. Depending on the sensitivity of a given protein to increasing mutational burden, it is unsurprising that many mutations result in a complete loss of function (28). However, we wanted to investigate how the presence of sequences with unquantifiable and low function affected the ability of the model to predict fitness among functional variants. To test this, we truncated each of the datasets by removing the non-functioning sequences (based on their actual, continuous property) and reevaluated the performance of the model (SI Appendix, Fig. S8). Interestingly, despite the correlation changing considerably for different datasets with many non-functioning sequences, the direction of change was not consistent. For example, in the Adams dataset, the removal of non-functioning sequences increased the Spearman correlation coefficient from 0.5 to 0.7, suggesting that the non-functioning sequences had noisy labels that resulted in the underfitting of the model. However, for the Tessier dataset (affinity) and the Sarkisyan dataset, the magnitude decreased from 0.8 to 0.6, indicating that either the inclusion of non-functioning sequences during model training improved regression performance, or that the model performance was overestimated by the rank order correlation statistic with the inclusion of many negative datapoints. Overall, the relatively high correlations even after retrospective truncation of non-functional sequences indicate that the model is capable of capturing differences within the high fitness population in addition to identifying low-fitness sequences.

We also sought to characterize how the distribution of mutations relative to the wild-type sequence affected model performance. As a sequence accumulates mutations, the likelihood that the resulting fitness is a linear combination of independent effects becomes increasingly low, depending on the magnitude of epistasis present in the dataset (27, 28). However, the relatively high performance of the LDA models in predicting fitness with linear weights does not address how well the model predicts fitness far from the wild-type sequence. To determine the accuracy of the predictions as a function of mutational burden, we first looked at the distributions of mutations in the Sarkisyan dataset (SI Appendix, Table S1 and Fig. S9). We then compared the predicted fitness at increasing mutational burden. As expected, both the average and predicted fitness decreased as additional mutations were added, though the magnitude of decrease widened at intermediate mutational burden before the complete loss of function at more extreme values (greater than fifteen mutations). We also observed that the average error of prediction (as measured by the sum of squared error of predicted fitness) increased as more mutations accumulated, suggesting that epistasis plays a larger role in these cases and/or that extremely mutated proteins were poorly modeled during model training.
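The error-versus-burden analysis amounts to binning squared prediction error by Hamming distance from wild type. The sketch below uses simulated values (not the Sarkisyan measurements) in which prediction noise is made to grow with mutational burden, reproducing the qualitative trend described above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical measured fitness and model predictions for 2,000 variants,
# each with a mutational burden (Hamming distance from wild type).
n = 2000
dist = rng.integers(1, 16, n)                      # mutations per variant
measured = -0.3 * dist + rng.normal(0, 1.0, n)     # fitness falls with burden
# Simulated predictions whose error grows with mutational burden.
predicted = measured + rng.normal(0, 0.1 * dist, n)

# Mean squared prediction error per mutational-burden bin.
for d in range(1, 16):
    mask = dist == d
    mse = np.mean((predicted[mask] - measured[mask]) ** 2)
    print(d, round(mse, 2))
```

In the simulated output, error climbs steadily with burden, the signature the paper attributes to epistasis and to sparse sampling of heavily mutated sequences in training.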

To test whether linear models were limiting the predictive capabilities of continuous properties, we also tested fully connected, feed-forward neural networks, which have been shown to similarly identify continuous values from binary data (25). While non-linear models may capture higher-order epistatic behavior, these models generally performed as strongly as LDA models (SI Appendix, Fig. S10). Over this wide range of protein engineering objectives, the LDA approach consistently predicts continuous properties and has comparable accuracy to models trained on more sophisticated sequencing and sorting data.

Prediction of Stapled Peptide Affinity and Specificity from Binary Labels.

To apply this method prospectively to a new dataset following the promising retrospective analysis, we chose Bcl-2 stapled peptide antagonists as our design case. In addition to requiring non-natural amino acids, making them incompatible with modeling approaches based on naturally evolved proteins, these peptides are well suited for this approach because we can evaluate not just a single property but the tradeoff between affinity and specificity. We generated a dataset of Bcl-2 stapled peptide variants that were sorted over several rounds (Fig. 4A) using bacterial cell surface display (35, 36). This library was designed based on naturally occurring peptide sequences, SPOT arrays of BIM mutants, and previously designed high-affinity or high-specificity BH3 variants (Case 2023, manuscript in progress) (SI Appendix, Tables S3–S6) (24, 37, 38). Because bacterial surface display libraries are limited in size compared to the theoretical diversity of BH3 peptides (~10^30), mutations were prioritized that were predicted to govern specificity between Mcl-1 and Bcl-xL. The final library of ~10^9 variants was transformed into bacteria (Fig. 4B) and sorted against either Mcl-1 or Bcl-xL with a combination of three MACS and four FACS rounds (Fig. 4C). The magnetic cell sorting was performed until the library was sufficiently reduced in diversity for analysis with FACS, which offers more precise control over property selection. We deep sequenced these pools to isolate highly active peptides (Fig. 4D), which enabled an understanding of sequence trends that governed high affinity and specificity (see SI Appendix, Fig. S11 for sequence trends) and provided a source of data to train and evaluate the capabilities of LDA models to predict peptide function (Fig. 4 E and F). We observed high correlation for both the Mcl-1 and Bcl-xL LDA models (Spearman's ρ of 0.893 for Mcl-1 and 0.708 for Bcl-xL).

Fig. 4.


Prospective analysis of Bcl-2 pro-apoptotic stapled peptides via bacterial surface display, deep sequencing, and machine learning. A combinatorial mutagenesis library of stapled BIM variants was designed, including staple locations (Left) and sequence (red positions fixed, blue positions variable; Right) (A), transformed into bacteria (B), and sorted using a combination of MACS (C) and FACS toward Bcl-xL and Mcl-1 (two members of the Bcl-2 family) in parallel. The library was next-generation sequenced (NGS) to calculate frequencies of each unique sequence along the sorting progression (D). Finally, an LDA model was trained on the binary labels from NGS and used to predict the continuous binding of 57 peptide variants, which were selected randomly from FACS rounds 2 to 4, for both Mcl-1 (E) and Bcl-xL (F). Mcl-1 or Bcl-xL binding is measured as binding normalized to expression on the surface of bacteria (as previously used to measure continuous binding for many sequences) (25). Error bars represent SD among n = 3 technical replicates.

To generate training data, we aggregated all four rounds of FACS and the expression-positive MACS sorts, hypothesizing that this would provide additional confidence for both "hits" and expressing but non-binding sequences. The ratio of these counts was computed as described above and used to generate labels for LDA training and testing (see SI Appendix, Fig. S11 for logoplots of negative and positive sequences). First, we observed that LDA models had high classification performance and were not overfit (see SI Appendix, Table S7 for performance statistics and SI Appendix, Fig. S12 for hyperparameter effects). We then tested the ability of LDA to predict continuous properties by randomly sampling 57 sequences among the FACS sorts, measuring their continuous binding via flow cytometry (here, continuous binding is the ratio of normalized binding to expression, which has previously been used as a proxy for Kd) (25), and measuring the correlation between predicted LDA binding and the sequences' actual binding (Fig. 4 E and F) (see SI Appendix, Fig. S13 for sequences and data). We observed strong correlative power between LDA projections and continuous measurements of peptide affinity: Spearman ρ of ~0.9 and ~0.7 for Mcl-1 and Bcl-xL, respectively (P < 0.00001). Finally, we sorted the final round of sorted cells via SORTCERY for a comparison with high-throughput, semi-quantitative measurements of binding affinity. Surprisingly, the binary sorting data coupled with an LDA model trained on NGS data had better performance than selecting clones using a multi-gate sorting scheme (SORTCERY) from the final rounds of sorting for Mcl-1 or Bcl-xL specificity (SI Appendix, Fig. S14), suggesting that the information contained in simple sorting experiments provides a powerful method to predict continuous protein properties.

Optimization of Stapled Peptides using Machine Learning and Integer Linear Programming.

While directed evolution campaigns may yield the desired properties after sorting, sequencing, and modeling, it is also possible that further optimization is necessary. In such cases, protein engineers rely on a combination of manual and automated approaches to further optimize lead candidates (8, 24, 39, 40). We sought to explore how our modeling workflow could score not only entire sequences but also the contribution of individual amino acids to the protein property of interest, potentially enabling the generation of new, unsampled sequences. Because linear models (such as LDA and Ridge) have associated weights for each amino acid and sequence position, the same scoring tools used to find the best-measured clones can also be used to score sequences that have never been evaluated experimentally. We therefore applied an optimization approach that can utilize discrete inputs for continuous properties and explore unseen sequence space: integer linear programming (ILP) (Fig. 5), which has previously been applied to design specific linear peptides toward the Bcl-2 proteins (24). To establish the baseline specificity from sorting, we further characterized variants from the final round of sorting that were predicted to be specific for Bcl-xL or Mcl-1. Interestingly, most peptides from the Bcl-xL library were highly specific (SI Appendix, Fig. S15), while fewer from Mcl-1 performed favorably (~80% had significant off-target binding, Fig. 5A). We hypothesized we could recover specific Mcl-1 clones by optimizing sequences from sorting and sequencing data that otherwise yielded mixed results. We solved the ILP model three times: once for Bcl-xL-specific peptides, again for Mcl-1-specific peptides, and finally for bispecific peptides (see Materials and Methods for more details). Out of thirty sequences predicted to have high activity for Mcl-1 (SI Appendix, Fig. S16 and Table S8), we randomly selected two sequences for low-throughput flow cytometry analysis.
Strikingly, we observed that the optimized Mcl-1 sequences displayed similar or improved specificity compared to the highest activity clones assayed experimentally.

Fig. 5.

Extrapolation of interpretable machine learning model weights to generate highly specific Mcl-1 inhibitors. Of 20 sequences randomly selected from the final two rounds of sorting toward Mcl-1, many did not display high levels of specificity toward Mcl-1 when measured in low throughput binding assays (A). Normalized binding is defined as the ratio of binding to expression (as previously used to measure continuous binding) (25). Error bars represent SD among n = 3 technical replicates. We hypothesized the weights from LDA machine learning could be used to design peptides with high affinity to Mcl-1 or Bcl-xL (B). To optimize the sequences, we applied ILP (C) to maximize the likelihood a peptide binds Mcl-1 while minimizing its binding to Bcl-xL. ILP identified numerous sequences that were predicted to be highly specific (D) that were not among the 105 sequences assayed experimentally. Two variants were randomly chosen among this set and were found to be as specific as the best clones identified from sorting (E). Asterisks indicate that the binding affinity was unquantifiable. Error bars represent SD among n = 3 technical replicates.

For Bcl-xL specificity, we initially selected sequences by minimizing Mcl-1 binding while maximizing Bcl-xL binding, but this resulted in peptides that did not bind either Mcl-1 or Bcl-xL in the experimental range tested (SI Appendix, Fig. S17 and Table S8); it has been previously shown that subtle differences in ILP setup can affect the outcome (24). We suspect this failure was due to the model being overly sensitive to the Asp mutation at position 4b, which was the only consistently sampled mutation with a high score for both Bcl-xL and Mcl-1, albeit slightly higher for Mcl-1. To utilize ILP for Bcl-xL-specific binders, we instead maximized Bcl-xL binding first and then chose the sequences with the lowest Mcl-1 scores. This approach preserved Bcl-xL binding and successfully identified highly specific Bcl-xL-binding peptides (SI Appendix, Table S9 and Fig. S18).

While our sorting campaign was originally designed to identify highly specific peptides, we also pursued bispecific peptides. This approach serves as evidence that the model can interpolate in sequence–function space, and such peptides could also function as therapeutics in diseases driven by both Bcl-2 proteins. Sequences were identified by maximizing both Mcl-1 and Bcl-xL binding, yielding peptides with relatively high affinity for both targets that differed substantially in sequence from wild type (BIM) (SI Appendix, Fig. S19 and Table S8). Therefore, ILP was able to identify sequences specific for Mcl-1, Bcl-xL, or both.

To show the generalizability of ILP for generating functional protein variants, we additionally set up the optimization problem using the Makowski dataset (SI Appendix, Fig. S20). We defined the objective of this optimization as the minimization of off-target binding while maintaining affinity. We solved the model and compared the sequences predicted to be most functional to those described in the original manuscript. We analyzed the similarity between ILP-predicted sequences and those originally identified as co-optimal by Makowski and co-authors by comparing the frequency of mutations in each set (N = 100 for ILP and N = 41 for experimentally validated). Importantly, the mutational frequencies of the ILP and experimentally optimal sequences were more highly correlated (R² = 0.71) than those of the ILP sequences and the naïve mutational frequency (R² = 0.45). Furthermore, the comparison of ILP and experimentally optimal clones did not significantly differ from equality according to the confidence intervals of the fit model parameters.
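The mutation-frequency comparison described above can be sketched in a few lines. This is an illustrative reimplementation (not the authors' code), assuming sequences are supplied as equal-length aligned strings:

```python
import numpy as np
from collections import Counter

def mutation_frequencies(seqs):
    """Frequency of each (position, residue) pair across aligned sequences."""
    counts = Counter((i, aa) for s in seqs for i, aa in enumerate(s))
    n = len(seqs)
    return {k: v / n for k, v in counts.items()}

def frequency_r2(set_a, set_b):
    """Pearson R^2 between the mutation-frequency vectors of two sequence sets."""
    fa, fb = mutation_frequencies(set_a), mutation_frequencies(set_b)
    keys = sorted(set(fa) | set(fb))
    x = np.array([fa.get(k, 0.0) for k in keys])
    y = np.array([fb.get(k, 0.0) for k in keys])
    return float(np.corrcoef(x, y)[0, 1] ** 2)
```

Applied to the ILP-designed and experimentally validated sets, `frequency_r2` would return the kind of R² values reported above.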

Discussion

In this work, we developed a method that leverages NGS data from simple binary sorting experiments with machine learning to infer continuous protein properties (such as Kd, measured in molar units, for binding affinities). The ability to measure continuous properties is extremely important for many protein engineering tasks, where the ability to distinguish small changes in fitness may be lost with less sensitive methods. This method can also be applied to extend property prediction beyond sequences directly observed in the library (Fig. 1). The workflow consists of two parts: the label assignment process from deep sequencing data, and the use of linear machine learning models to predict continuous protein properties from binary data (Fig. 2). Currently, there is a lack of consensus on how to best analyze directed evolution data for lead molecule selection and protein optimization. This lack of consensus likely arises from variations in how experiments are set up, which depend on the surface display platform, sequencing instrumentation, FACS instrumentation, the design of sort gates, and sequencing depth, among other factors. Compared to typical enrichment ratio analysis, this technique provides a practical yet powerful alternative that requires only a simple binary classification from any sorting experiment. Likewise, it does not require a more involved experimental design to collect multiple gates and/or labeling conditions. By defining a ratio of frequencies based on any two gates (positive/negative sort, input/output sort, etc.) and binarizing the ratios into "1" and "0", any directed evolution experiment can be transformed into a dataset for downstream analysis. The transformation to binary labels is important because the next component of the workflow is the use of linear machine learning models (LDA) that can predict continuous properties from directed evolution data (25). Binarization likely mitigates the noise in enrichment ratios, while the information contained in the labels and sorted protein sequences enables the continuous transformation learned by the machine learning models.
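As a concrete illustration, the ratio-and-binarize step can be sketched as follows. This is a minimal reimplementation (not the authors' pipeline), assuming per-gate frequency dictionaries; the zero-frequency handling and 20% cutoffs mirror the description in Materials and Methods:

```python
def binarize_ratios(pos_freq, neg_freq, frac=0.2):
    """Assign binary labels from positive- and negative-gate frequencies.

    The top `frac` of enrichment ratios are labeled 1, the bottom `frac`
    are labeled 0, and the middle sequences are left unlabeled.
    """
    seqs = sorted(set(pos_freq) | set(neg_freq))
    observed = [pos_freq[s] / neg_freq[s] for s in seqs
                if pos_freq.get(s, 0) > 0 and neg_freq.get(s, 0) > 0]
    hi, lo = max(observed), min(observed)
    ratios = {}
    for s in seqs:
        p, n = pos_freq.get(s, 0.0), neg_freq.get(s, 0.0)
        if n == 0:
            ratios[s] = hi   # absent from the negative gate -> max observed ratio
        elif p == 0:
            ratios[s] = lo   # absent from the positive gate -> min observed ratio
        else:
            ratios[s] = p / n
    ordered = sorted(seqs, key=lambda s: ratios[s])
    k = max(1, int(frac * len(ordered)))
    labels = {s: 1 for s in ordered[-k:]}
    labels.update({s: 0 for s in ordered[:k]})
    return labels
```

The resulting label dictionary can be paired with one-hot encoded sequences to train a classifier such as LDA.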

To test the generalizability of this method, we curated data from five large protein engineering campaigns: the fluorescent landscape of avGFP (28), the directed evolution of a fluorescein-binding scFv (21), the RBD affinity landscape toward the SARS-CoV-2 Spike protein (29, 30), high-affinity and high-specificity Fabs (25), and the design of highly specific peptides against Bcl-2 proteins (Fig. 3) (24). Proteins in these data vary in complexity from short alpha-helical peptides to large globular proteins and in objective from protein fluorescence to multi-objective affinity and specificity optimization. Furthermore, each of these datasets varied in both sorting strategy and complexity: Makowski sorted for the top ~5% of antibody variants while Adams quantified the binding of an entire family of fluorescein binders. While many of the projects relied on complex sorting techniques to obtain quantitative protein labels, we simulated simple binary sorting experiments by limiting the sequencing data (Materials and Methods). We then evaluated the predictive power of LDA models trained on these simple sorting experiments and observed both impressive classification performance and strong prediction of continuous properties from LDA binary projections. Interestingly, models trained on binary data were highly correlated with continuous data (Spearman correlation coefficients ranged from 0.5 to 0.9). Furthermore, when we compared the predictive power of LDA models trained on binary data to regression models trained on continuous data, we observed no increase in rank order performance, suggesting that models trained on simple sorting experiments yield comparable information to models trained on data from experiments that generate hundreds to thousands of continuous measurements (21, 23, 24).
It is only when the datasets are truncated to focus on smaller regions of protein fitness that a difference between the binary sorting and the presumably more information-rich continuous datasets can be seen (SI Appendix, Fig. S8). Likewise, the accuracy of the method is fairly independent of the size of the positive and negative sorting gates and read depth as long as sufficient sequences are available for analysis, making it more robust to experimental variation. These results provide limits and guidance on the utility of the approach.

Next, we sought to explore how this workflow could be used for prospective analysis in addition to retrospective analysis (Fig. 4). We hypothesized that because the workflow is agnostic to protein type and display platform, any directed evolution campaign with sufficient sorting and sequencing data is a suitable environment for testing. As such, we chose to analyze libraries of stapled peptides, an important class of proteins formed by covalent crosslinking of two amino acids (41). Stapled peptides are being explored as therapeutics for previously "undruggable" disease-related proteins, which reside inside the cell and cannot be targeted by small-molecule drugs (42). Stabilized peptide engineering by Escherichia coli display (SPEED) has previously been demonstrated to accelerate the development of stapled peptides by displaying them on the surface of bacteria, where libraries of peptides varying simultaneously in sequence and staple location can be optimized for protein–peptide interactions (35, 36). One additional challenge in the optimization of stapled peptides is their reliance on non-natural amino acids, which generally renders them incompatible with models trained on naturally occurring sequences (43–46). We built on previous work by generating a library of randomized stapled peptides toward two Bcl-2 proteins, an important class of apoptosis regulatory proteins that is responsible for cancer cell immortality (47).

We sorted this library against two important members: Mcl-1 and Bcl-xL (48), each of which drives apoptosis resistance in different diseases (49). Selective targeting among Bcl-2 proteins is an outstanding goal in drug targeting but is difficult due to the highly homologous nature of these proteins. After several rounds of cell sorting and subsequent deep sequencing, we trained LDA models on a subset of the binary sequencing data, evaluated the models on the hold-out test set, and generally observed high classification performance. We then measured the binding of 57 sequences from various rounds of sorting with low throughput flow cytometry experiments and observed that many of the clones did not demonstrate favorable affinity or specificity properties when sampling from these enriched libraries. However, we did observe a high degree of correlative power between LDA projections and continuous peptide binding. Importantly, these models were able to identify molecules within the set of experimentally observed sequences that were highly specific but may not have been selected as lead compounds due to their rarity (11). Several sequences along the Pareto frontier, or the boundary of co-optimality where an increase in one property leads to a decrease in the other, were expressed in bacteria and assayed via flow cytometry. We also characterized several clones that were bispecific, which could have applications in specific diseases but also serve as a test of whether the model can interpolate function where it was not directly engineered via cell sorting. The specificities of these peptides agreed with model predictions, indicating the model was able to identify functional and rare peptides from across the specificity landscape.
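Identifying the Pareto frontier from two measured properties is straightforward; a minimal sketch, assuming both properties (e.g., affinity and specificity scores) are to be maximized:

```python
def pareto_front(points):
    """Indices of points on the Pareto frontier when maximizing both
    coordinates (e.g., on-target affinity and specificity)."""
    front = []
    for i, (x1, y1) in enumerate(points):
        # a point is dominated if another point is at least as good in both
        # properties and strictly better in at least one
        dominated = any(
            x2 >= x1 and y2 >= y1 and (x2 > x1 or y2 > y1)
            for j, (x2, y2) in enumerate(points) if j != i
        )
        if not dominated:
            front.append(i)
    return front
```

Clones whose indices appear in the returned list lie on the boundary of co-optimality described above.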

Finally, we sought to use the interpretive nature of the linear machine learning models to explore unseen sequence space and generate highly diverse and novel sequences (Fig. 5). To accomplish this task, we used ILP and the coefficients from machine learning to mathematically optimize peptide sequences beyond the properties that were experimentally observed (from deep sequencing or flow cytometry) (24). We hypothesized that such an approach could recover functional peptides with consistency where sorting did not; while the final round of experimental Bcl-xL sorting yielded consistently high affinity and specificity variants (SI Appendix, Fig. S15), the Mcl-1 sort had a small fraction of sequence variants with desired properties. We thus prioritized the design of Mcl-1 binders and identified a new peptide sequence that improved peptide properties beyond the experimentally measured Pareto front. Importantly, this variant demonstrated specificity at least as potent as the most specific clone identified from experimental work.

To test whether our sequence optimization workflow generalizes beyond small alpha helices, we also applied ILP to the Makowski dataset (SI Appendix, Fig. S20). While antibodies have been the subject of optimization using highly sophisticated models (50–53), we hypothesized that the high performance of linear machine learning models would make this dataset amenable to ILP optimization. Like Bcl-2 inhibitors, antibodies need to demonstrate properties beyond high affinity to be considered therapeutic, and ILP is uniquely suited to tackle co-optimization (54). We observed that the set of sequences predicted to be co-optimal by ILP were similar to the most optimal clones identified experimentally. Furthermore, their lead antibody identified as co-optimal (EM1) was among the set of antibodies predicted by ILP. Makowski and co-authors used a comparatively small library (~10⁶) relative to their experimentally measured sequences (~10⁴), resulting in a more confident sampling of mutated amino acids experimentally. In contrast, the library of stapled peptides we designed had a much larger ratio of design space (~10⁹) to experimentally measured sequences (~10⁵), making this library suitable for extrapolation beyond experimentally measured space using machine learning. For protein variant libraries where mutations are sufficiently independent (minimal higher-order epistatic interactions), a strategic subsampling of design space can be advantageous for subsequent protein optimization with linear models (55, 56) and can help to de-risk sorting campaigns, as exploration through the full design space can improve function beyond that of the variants originally assayed.

The use of ML with NGS data from binary sorting campaigns has many advantages, but the approach also has a few limitations. First, it is important to note that LDA projections are correlated with, but not predictive of, continuous measurements. Therefore, LDA-informed properties may not match 1:1 with continuous properties. However, because many protein engineering campaigns do not seek to quantify the exact magnitude of fitness, but rather seek to maximize or minimize a property or trade-off between properties, this correlation can still provide direct insight into protein fitness and accelerate optimization efforts. Second, this approach relies on sequences being aligned in a manner consistent with their function; sequences of varying length would need to be aligned, and extra tokens would have to be added to represent insertions and deletions. Mutations in different contexts (such as binding different epitopes or having different chromophores) would likely not be well described by the same weight matrices. Next, we also found that ILP optimization was sensitive to model weights, as evidenced by the initial failure to generate highly specific Bcl-xL peptides. Two approaches to address this are incorporating uncertainty into model predictions, which could yield more confident extrapolation into unseen sequence space (39), or selecting a range of sequences from multiple modes of optimization simultaneously (24). In the case of designing Bcl-2 family inhibitors, despite identifying peptides with high specificity toward Mcl-1 and Bcl-xL, more work is needed to yield effective peptide therapeutics: it is equally important to show these peptides do not bind the other three Bcl-2 family members (49). Because this approach is amenable to higher-dimension multi-objective optimization, we expect that optimizing specificity for five proteins with this approach is possible. Finally, datasets with greater extents of epistasis may be poorly modeled by linear models. The datasets utilized in this work had an average of 7 mutations each. To address epistasis at higher mutational burdens, this approach could be scaled to use combinations of features, enabling the model to directly weight contributions from second- (or higher-) order mutation combinations. Such higher-order models have previously been shown to yield better results than their linear counterparts, though the computational resources needed for training and optimization are significantly increased (24).
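For instance, second-order terms can be added by augmenting the one-hot matrix with pairwise products of its columns. This is a generic sketch of the idea, not the specific feature scheme of ref. 24:

```python
import numpy as np
from itertools import combinations

def add_pairwise_features(X):
    """Augment a one-hot feature matrix with all pairwise products of its
    columns, letting a linear model weight second-order mutation combinations."""
    pairs = [X[:, i] * X[:, j] for i, j in combinations(range(X.shape[1]), 2)]
    return np.hstack([X, np.column_stack(pairs)])
```

The feature count grows quadratically with sequence length, which is the source of the increased computational cost noted above.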

Despite these limitations, the ability to score sequences beyond those observed experimentally is important because drug-like properties that are not easily assayable by high-throughput techniques (immunogenicity, stability, cell permeability, etc.) are often highly dependent on sequence and may need further optimization (54, 57–59). For example, minimization of positive charge in CDR regions of antibodies has been shown to minimize off-target binding (34), while selective placement of hydrophobicity and positive charge has been shown to improve cell penetration for stapled peptides (58, 60). This combined machine learning and optimization approach provides a powerful method to identify highly functional protein variants if experimentally measured clones do not meet fitness criteria or further sequence optimization is necessary.

In summary, the data processing and modeling workflow designed in this work is a versatile tool for the improved analysis and identification of protein variants across many domains of protein engineering: it utilizes machine learning and NGS data to predict continuous properties from binary sorting data and to optimize new protein variants in previously unseen sequence space.

Materials and Methods

Curation of NGS Data for Validation.

Five datasets were used to test the simple method of using binary labels to predict continuous properties. The datasets and brief descriptions are given below. The metadata and distribution of mutations for each dataset are shown in SI Appendix, Table S1 and Fig. S21, respectively.

Adams et al. (21).

NGS data were downloaded from their GitHub repository: https://github.com/jbkinney/16_titeseq. The read counts and CDR1H and CDR3H sequences for each clone were extracted and aligned using in-house python scripts. Read counts were converted to frequencies.

Starr et al. (29) and Greaney et al. (30).

NGS data were downloaded from their GitHub repository: https://github.com/jbloomlab/SARS-CoV-2-RBD_DMS_variants. The data for the Delta mutation are stored in a different repository: https://github.com/jbloomlab/SARS-CoV-2-RBD_Delta. Due to limits in Illumina paired-end read length, each sequence was given a unique molecular barcode; the barcodes were sequenced at high depth, while each full-length sequence was sequenced together with its unique barcode separately. The sequences and their TiteSeq profiles were associated with their corresponding barcodes, and read counts were converted to frequencies. In the current method, sequences with more than one mutation were not discarded.

Makowski et al. (25).

Processed data were downloaded from their GitHub repository: https://github.com/Tessier-Lab-UMich/Emi_Pareto_Opt_ML. Raw data were available from their repository.

Sarkisyan et al. (28).

The GFP sequence is too long for high-depth Illumina sequencing, and therefore, the authors gave each sequence a unique molecular barcode. We downloaded the accurate full-length protein sequences, their matching unique barcodes, and the high-depth sequencing of Sort-seq data from their repository: https://figshare.com/articles/dataset/Local_fitness_landscape_of_the_green_fluorescent_protein/3102154. The read accuracy on the barcodes was low, and the authors used a Levenshtein distance of ≤1 to connect barcodes that were close but not identical to the full protein sequence. We used the Levenshtein module (available at https://pypi.org/project/python-Levenshtein/) with in-house python scripts to cluster sequences to their barcodes. After clustering, sequences and their barcodes were merged with their Sort-seq distributions as for the Starr dataset. Read counts were converted to frequencies.
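The clustering step can be sketched with a plain-Python edit distance standing in for the faster python-Levenshtein C module; this is illustrative only, and the rule of discarding ambiguous reads is an assumption:

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming; the python-Levenshtein module
    used in the paper provides an equivalent, faster C implementation."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def assign_barcodes(reads, known_barcodes, max_dist=1):
    """Map each sequenced barcode read to a known barcode within edit
    distance `max_dist`, discarding reads that match more than one barcode."""
    assigned = {}
    for r in reads:
        hits = [b for b in known_barcodes if levenshtein(r, b) <= max_dist]
        if len(hits) == 1:
            assigned[r] = hits[0]
    return assigned
```

Each assigned read can then inherit the Sort-seq distribution of its parent barcode.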

Jenson et al. (24).

NGS data were obtained from their GitHub repository: https://github.com/KeatingLab/sortcery_design. The peptides’ short lengths permitted high-depth deep sequencing and thus counts were directly converted to frequencies without further preprocessing.

Binarization of FACS/NGS Data.

The variety of factors that influence the design of an experiment makes it challenging to generalize a sorting and sequencing workflow for any given protein engineering campaign. Each of these projects was analyzed by a different group, using different cell sorters, expression platforms, sequencing instruments, and protein types, among other parameters (see SI Appendix, Table S1 for dataset property summaries). Thus, controlling each of those parameters in our data processing workflow was an important consideration toward the application of this approach to existing datasets and new targets alike. Many of the experiments use sophisticated sorting techniques to infer quantitative protein properties. We simulated a simple binary sort experiment by truncating the dataset such that it only includes the top or bottom 20% of sorted sequences (or as close as possible). This subsample of sequencing data approximates a simple sorting campaign from these quantitative sorting techniques. For example, the Sarkisyan dataset contains sequencing data of GFP variants that were sorted into 8 bins; to simulate a simple binary sort, we aggregated the top two bins as positive and the bottom two bins as negative. For TiteSeq experiments (Starr, Greaney, and Adams datasets), we only included data from sorts that used ligand concentrations near the average KD of the library (10⁻⁹, 10⁻⁹, and 10⁻⁸ M, respectively). Because the KD of a library can be readily obtained from low-throughput flow cytometry experiments, sorting at the KD of the library is a feasible approach to yield the largest separation between high- and low-affinity variants (17). For the Makowski dataset, data were provided as a positive and negative dataset with varying cutoffs for each selection. We additionally computed the correlation between replicate continuous measurements (the number of replicates varies from one to three depending on the dataset) (SI Appendix, Fig. S22). This correlation ideally serves as an upper limit for model performance: experimental noise should limit the model predictions. Interestingly, for some datasets, the modeling techniques used in this work are more correlated with the average experimental measurement than experimental replicates are to one another. If experimental measurements are noisy but unbiased, the average of multiple replicates approaches the true value, which could explain the increase in performance when using the mean.
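The bin-aggregation step used to simulate a binary sort from multi-bin Sort-seq data can be sketched as follows (a simplified illustration; columns are assumed to be ordered from low to high signal):

```python
import numpy as np

def simulate_binary_sort(bin_counts):
    """Collapse multi-bin Sort-seq read counts (rows = sequences, columns =
    bins ordered low-to-high signal) into positive/negative gate frequencies
    by pooling the top two and bottom two bins."""
    counts = np.asarray(bin_counts, dtype=float)
    neg = counts[:, :2].sum(axis=1)    # bottom two bins -> negative gate
    pos = counts[:, -2:].sum(axis=1)   # top two bins -> positive gate
    return pos / pos.sum(), neg / neg.sum()
```

The returned frequency vectors feed directly into the ratio-and-binarize label assignment described below.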

Label Assignment.

In all cases, in-house python scripts were used to perform the data preparation and modeling on each of the datasets. Scikit-learn (https://scikit-learn.org/stable/) was used for LDA, one-hot encoding, scaling label vectors, and other pre-processing steps. Pandas (https://pandas.pydata.org/) and NumPy (https://numpy.org/) were used to handle sequencing and numerical data. PyTorch (https://pytorch.org/) was used to train neural network models.

First, sequences were one-hot encoded, eliminating positions that were not randomized in the study or that appeared with very low abundance (only one reported sequence with a mutation at a given position). Then, we calculated the frequencies of each sequence in the high- and low-property pools, and a multiple sequence alignment (MSA) was performed to ensure every vector was the same length and columns corresponded to the correct residues. The ratio of positive sort frequency to negative sort frequency was calculated for each unique sequence and replicate (some datasets had two replicates, while others had different numbers of replicates per sequence). The average ratio across all replicates was used to assign a positive (1) or negative (0) label. The correlation of ratios between each replicate (for all five datasets) is shown in SI Appendix, Fig. S22. The magnitude of this correlation serves as an indication of the disagreement between replicates and the possible improvement in data accuracy from averaging over multiple replicates. After computing these correlations, we find that the magnitude of replicate correlation (as measured by the Spearman correlation coefficient) is highly variable across the datasets used in this study, ranging from 0.23 for the Sarkisyan dataset to 0.82 for the Jenson (Mcl-1) dataset.

To minimize sorting noise, instead of splitting the positive and negative labels at the median ratio, we set the labels as the top or bottom 20% of sequences. Any sequences with zero frequency in the low-property pool were set to the maximum ratio observed, and any sequences with zero frequency in the high-property pool were set to the minimum ratio observed. The one-hot encoded protein sequences and their labels were then divided into an 80:20 training:test split. The test set was held aside until all analyses were complete and used to validate the model training process. In later analyses, to explore the hyperparameter space of these cutoff parameters, we tested all combinations of the read count, replicate count, and ratio percentile and measured the change in modeling performance. Sensitivity to training:test splitting and to the ratio of positive to negative labels was tested by performing five-fold cross-validation using scikit-learn's ShuffleSplit function.
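The splitting scheme can be sketched with scikit-learn directly; the data here are random stand-ins for the one-hot encoded sequences and labels:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

# toy stand-ins for one-hot encoded sequences (100 x 12) and binary labels
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 12)).astype(float)
y = rng.integers(0, 2, size=100)

# 80:20 training:test split; the test set is held aside for final validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# five-fold ShuffleSplit to test sensitivity to the choice of split
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in splitter.split(X)]
```

Each ShuffleSplit fold redraws a random 20% hold-out, so model performance can be averaged over the five folds.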

Machine Learning Method.

LDA was identified as a standout machine learning method for several reasons. Most importantly, LDA is a classification model, meaning that it is trained on data with binary labels. Because modern FACS experiments are naturally performed in a binary manner (each cell is either collected or not collected based on its fluorescence), a model that is trained on similarly structured data is a natural starting point. However, because our objective was to predict continuous properties from binary data, the critical advantage of LDA is that it contains an internal, continuous representation of each data point. Because the final decision boundary drawn by the LDA model to predict binary labels is drawn in this continuous space, the model learns to quantify each data point (as a regression model would) despite not having quantitative labels. While we also evaluated other models that use an internal continuous representation for classification (such as support vector machines), these models were prohibitively computationally expensive to train and did not converge for datasets with greater than 10,000 samples. An exhaustive search of hyperparameters for support vector machines, such as kernel, C, or epsilon, did not improve training time (SI Appendix, Table S10).

There were several other advantages to using LDA models. First, this method has previously been used to predict continuous properties from binary sorting data (25). Next, hyperparameter optimization for this model was straightforward, as the scikit-learn implementation of LDA has very few parameters: the solver ("svd" was the only one to converge consistently), n_components (which is fixed to 1 for projection to a single dimension to correlate with protein properties), and tol (which did not change the outcome). Another benefit of using LDA is its simplicity; the linear nature of the model allows for the direct interpretation of how certain residues contribute to function. Finally, the transform method was used to project data into the one-dimensional LDA space after training, and this projection was used as the continuous property metric.
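The fit/transform workflow can be sketched in a few lines of scikit-learn; the data are synthetic stand-ins in which feature 0 drives the label:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# toy one-hot-like data in which feature 0 drives a noisy binary label
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 10)).astype(float)
y = (X[:, 0] + rng.normal(0, 0.3, 200) > 0.5).astype(int)

lda = LinearDiscriminantAnalysis(solver="svd", n_components=1)
lda.fit(X, y)

# transform() gives the 1-D projection used as the continuous property metric
projection = lda.transform(X).ravel()

# per-feature weights allow direct interpretation of residue contributions
weights = lda.coef_.ravel()
```

Because feature 0 carries the signal in this toy example, its weight dominates the coefficient vector, mirroring how informative residue positions stand out in the trained models.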

To establish a baseline comparison between a model trained on binary data and one trained on actual continuous data, we sought a comparable regression model, because LDA is a classification model and does not have a regression analog. Thus, we used ridge regression, a modified version of linear regression that penalizes large weights, to compare LDA projections to models trained on continuous data. Ridge regression has also been shown to be a powerful modeling technique for protein engineering tasks (31).
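The comparison can be sketched on a synthetic additive landscape; this is an illustrative setup, not the paper's datasets:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(300, 8)).astype(float)
true_fitness = X @ rng.normal(size=8)   # an additive fitness landscape

# binarize at the median, mimicking a two-gate sort
y_binary = (true_fitness > np.median(true_fitness)).astype(int)

# LDA sees only the binary labels; ridge sees the continuous values
lda_proj = LinearDiscriminantAnalysis(solver="svd").fit(X, y_binary).transform(X).ravel()
ridge_pred = Ridge(alpha=1.0).fit(X, true_fitness).predict(X)

# rank-order agreement of each model with the continuous ground truth
rho_lda, _ = spearmanr(lda_proj, true_fitness)
rho_ridge, _ = spearmanr(ridge_pred, true_fitness)
```

On such an additive landscape, the LDA projection trained only on binary labels recovers much of the rank ordering that the continuous-label ridge model achieves, consistent with the comparison reported in the Results.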

Neural network models were used to evaluate whether non-linear models would capture additional useful information that linear models are unable to capture, as proposed previously (25). Standard fully connected, feed-forward networks were used with dropout (p = 0.5), as shown to be effective in the literature (61). The hidden size and number of layers were set to 128 and 2, respectively, based on previous work using feed-forward neural networks to predict protein fitness (25). For all datasets, we used 700 epochs, a batch size of 32, binary cross-entropy loss as the loss function, and a stochastic gradient descent optimizer with a learning rate of 0.01. Training was done on an Nvidia Tesla V100 and typically took between 5 min and 2 h depending on the size and complexity of the dataset.
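The described architecture can be sketched in PyTorch as follows; this is an illustrative reconstruction from the stated hyperparameters, not the authors' training script, and the input width is a toy value:

```python
import torch
import torch.nn as nn

class FitnessNet(nn.Module):
    """Feed-forward classifier with the stated hyperparameters:
    two hidden layers of width 128 and dropout p = 0.5."""
    def __init__(self, n_features, hidden=128, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        # return raw logits; BCEWithLogitsLoss applies the sigmoid internally
        return self.net(x).squeeze(-1)

model = FitnessNet(n_features=20)   # n_features = one-hot encoding length (toy value)
loss_fn = nn.BCEWithLogitsLoss()    # binary cross-entropy on logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```

A standard loop over 700 epochs with batch size 32 would then call `optimizer.step()` after each backward pass.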

Stapled Peptide Cell Sorting, Sequencing, and Flow Cytometry.

Experimental stapled peptide libraries targeting Bcl-2 proteins were used to evaluate the computational methods on experimental data generated in this study. These libraries were sorted and sequenced as described previously (Case 2023, preprint on bioRxiv). In brief, a combinatorial library of mutants of BIM (a non-specific pro-apoptotic peptide) was designed and transformed into bacteria that display stapled peptides using click chemistry and non-natural amino acid incorporation (see SI Appendix, Table S3 for mutagenesis codons, SI Appendix, Table S4 for sampled amino acids, and SI Appendix, Table S5 for library primers) (35, 36). This library was sorted using a combination of MACS and FACS as follows: one round of expression MACS, two rounds of affinity MACS, two rounds of affinity FACS, and two rounds of specificity FACS. Two such libraries were sorted in parallel: one toward Bcl-xL and another toward Mcl-1. These libraries were deep sequenced using an Illumina NovaSeq S4, demultiplexed, merged using NGMerge, and analyzed using in-house python scripts (see SI Appendix, Table S6 for NGS primers) (62). Each peptide sequence was identified by aligning the DNA with the scaffold eCPX protein and then translating the peptides in the corresponding open reading frame. Peptide sequences and their frequencies were aligned across all rounds of sorting, and sequences that had mutations not specified by the original library design were removed (~10% of all sequences). Sequences from the four rounds of FACS were denoted as "hits" and sequences from the expression sort were denoted as "not hits" (see SI Appendix, Table S1 for dataset summary). Then, the data were used to train LDA models using the same method as for the other datasets.

A smaller number of peptide sequences were expressed on the surface of bacteria and measured in low-throughput flow cytometry experiments. To evaluate whether LDA projections were predictive of continuous properties, we selected 57 stapled peptides from various rounds of sorting (Mcl-1 FACS 2, 3, or 4, and Bcl-xL FACS 2 or 4) to capture a wide distribution of specificities: if sorting enriched toward higher-performing sequences, peptides from later rounds of sorting should be more specific than those from earlier rounds. We then expressed these peptides on the surface of bacteria and measured their binding at the approximate KD of the wild-type sequence in triplicate (1 nM and 10 nM for Mcl-1 and Bcl-xL, respectively). Fraction bound was calculated by normalizing to expression and dividing by a saturated binder (BIM-p5 at 250 nM) (35). LDA projections were calculated and compared to continuous values identically to the other datasets.

SORTCERY.

To obtain continuous estimates of binding properties from cell sorting, peptides from the final round of FACS for both Mcl-1 and Bcl-xL were evaluated using SORTCERY. Peptides were incubated with either Mcl-1 or Bcl-xL at 1 nM and sorted into 12 bins following the protocol from Reich et al. (23). Briefly, cells labeled with target Bcl-2 protein and anti-HA display tag were sorted into 12 bins along the “axis of affinity”, the diagonal gates that resolve fraction bound. To compare the SORTCERY values with those measured from binary sorting, we computed the gate score of each sequence as described in the original work (23). Cells from each gate were collected individually and processed for deep sequencing as described previously. The deep sequencing data from these experiments were processed identically to the stapled peptide library data described above.
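The exact gate-score definition is given in Reich et al. (23); as a sketch, assuming the score is the mean gate index weighted by a sequence's read fraction across the 12 affinity gates:

```python
import numpy as np

def gate_score(read_counts):
    """Mean gate index weighted by normalized read fraction.

    read_counts: reads for one sequence across the 12 affinity gates,
    assumed already corrected for total reads per gate.
    """
    gates = np.arange(1, len(read_counts) + 1)
    frac = np.asarray(read_counts, dtype=float)
    frac = frac / frac.sum()
    return float((gates * frac).sum())

# A sequence concentrated in the high-affinity gates scores higher than
# one spread across the low-affinity gates.
tight = [0, 0, 0, 0, 0, 0, 0, 0, 1, 5, 20, 10]
loose = [10, 20, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(gate_score(tight), gate_score(loose))
```

A single weighted mean per sequence converts the 12-bin read distribution into the continuous value that is compared against the binary-sort LDA projections.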

Sequence Optimization Via ILP.

To optimize protein sequences, we applied ILP, an optimization approach that maximizes a linear objective over discrete variables subject to linear constraints. Compared to other techniques that maximize an objective given an input, ILP scales more efficiently to large numbers of samples and does not rely on iterative predict-and-test loops that require additional experimental resources (8, 39, 40). Furthermore, ILP is directly amenable to multi-objective optimization through the addition of inequality constraints (24). We implemented this problem using the PuLP Python module (63). First, we defined the objective as maximizing the dot product of the model coefficient vector and the one-hot encoded sequence, with positions and amino acids constrained as defined by the library design. This objective maximizes the confidence of binding for a given sequence. Next, we constrained the optimization by allowing only one amino acid at each position, requiring that each peptide contained two azidohomoalanine residues (responsible for peptide stapling), and requiring that the two stapled residues were spaced as specified by the library design (i, i+7). Finally, we formulated the problem as a multi-objective problem by adding the constraint that the dot product of the off-target coefficients and the peptide sequence fell in the non-binding regime (set to the bottom twentieth percentile of binding).
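This formulation maps directly onto PuLP. The following is a minimal sketch under toy assumptions — random stand-in coefficients, an 8-residue peptide in which "Z" represents azidohomoalanine (so positions 0 and 7 are the only valid i, i+7 staple pair), and an arbitrary off-target threshold — not the paper's actual model:

```python
import random
import pulp

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY") + ["Z"]  # "Z" = azidohomoalanine
LENGTH = 8  # toy length; only one valid i, i+7 staple pair exists

# Hypothetical linear model coefficients: w[(i, aa)] is the weight for
# amino acid aa at position i (random here purely for illustration).
random.seed(0)
w_on = {(i, aa): random.uniform(-1, 1) for i in range(LENGTH) for aa in AMINO_ACIDS}
w_off = {(i, aa): random.uniform(-1, 1) for i in range(LENGTH) for aa in AMINO_ACIDS}

prob = pulp.LpProblem("peptide_design", pulp.LpMaximize)
x = pulp.LpVariable.dicts("x", (range(LENGTH), AMINO_ACIDS), cat="Binary")

# Objective: dot product of on-target coefficients and the one-hot sequence.
prob += pulp.lpSum(w_on[i, aa] * x[i][aa] for i in range(LENGTH) for aa in AMINO_ACIDS)

# Exactly one amino acid at each position.
for i in range(LENGTH):
    prob += pulp.lpSum(x[i][aa] for aa in AMINO_ACIDS) == 1

# Exactly two staple residues, spaced i, i+7 (hardcoded for this toy length;
# a longer library needs one pair-selection variable per valid start position).
prob += pulp.lpSum(x[i]["Z"] for i in range(LENGTH)) == 2
prob += x[0]["Z"] + x[7]["Z"] == 2

# Multi-objective constraint: off-target score held in a "non-binding" regime
# (threshold 0.0 is arbitrary; the paper uses the 20th percentile of binding).
prob += pulp.lpSum(w_off[i, aa] * x[i][aa] for i in range(LENGTH) for aa in AMINO_ACIDS) <= 0.0

prob.solve(pulp.PULP_CBC_CMD(msg=False))
best = "".join(aa for i in range(LENGTH) for aa in AMINO_ACIDS if x[i][aa].value() > 0.5)
print(pulp.LpStatus[prob.status], best)
```

Because both objectives are linear in the one-hot variables, adding the off-target inequality turns single-objective affinity maximization into co-optimization of affinity and specificity without changing the solver.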

Supplementary Material

Appendix 01 (PDF)

Acknowledgments

We thank the members of the Thurber Lab for their helpful insights with making figures. This work was supported by NIH R35 GM128819 (G.T.).

Author contributions

M.C. and G.T. designed research; M.C. and J.V. performed research; M.C., M.S., and J.V. analyzed data; and M.C., M.S., and G.T. wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

This article is a PNAS Direct Submission.

Data, Materials, and Software Availability

Next Generation Sequencing data have been deposited in The National Center for Biotechnology Information [ID number: 1065831; Accession: SRA, SRP484190 (64); BioProject (65)].

Supporting Information

References

  • 1.Anfinsen C. B., Principles that govern the folding of protein chains. Science 181, 223–232 (1973). [DOI] [PubMed] [Google Scholar]
  • 2.Cobb R. E., Chao R., Zhao H., Directed evolution: Past, present, and future. AIChE J. 59, 1432–1440 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Lerner S. A., Wu T. T., Lin E. C. C., Evolution of a catabolic pathway in bacteria. Science 146, 1313–1315 (1964). [DOI] [PubMed] [Google Scholar]
  • 4.Roberts R. W., Szostak J. W., RNA-peptide fusions for the in vitro selection of peptides and proteins. Proc. Natl. Acad. Sci. U.S.A. 94, 12297–12302 (1997). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Smith G. P., Filamentous fusion phage: Novel expression vectors that display cloned antigens on the virion surface. Science 228, 1315–1317 (1985). [DOI] [PubMed] [Google Scholar]
  • 6.Freudl R., MacIntyre S., Degen M., Henning U., Cell surface exposure of the outer membrane protein OmpA of Escherichia coli K-12. J. Mol. Biol. 188, 491–494 (1985). [DOI] [PubMed] [Google Scholar]
  • 7.Boder E. T., Wittrup K. D., Yeast surface display for screening combinatorial polypeptide libraries. Nat. Biotechnol. 15, 553–557 (1997). [DOI] [PubMed] [Google Scholar]
  • 8.Yang K. K., Wu Z., Arnold F. H., Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019). [DOI] [PubMed] [Google Scholar]
  • 9.Liu B., Yeast surface display: Methods, protocols, and applications. Methods Mol. Biol. 1319, 3–39 (2015). [Google Scholar]
  • 10.Barreto K., et al. , Next-generation sequencing-guided identification and reconstruction of antibody CDR combinations from phage selection outputs. Nucleic Acids Res. 47, e50 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.D’Angelo S., et al. , From deep sequencing to actual clones. Protein Eng., Des. Sel. 27, 301–307 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Ravn U., et al. , By-passing in vitro screening—Next generation sequencing technologies applied to antibody display and in silico candidate selection. Nucleic Acids Res. 38, 1–11 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Rubin A. F., et al. , A statistical framework for analyzing deep mutational scanning data. Genome Biol. 18, 1–15 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Kowalsky C. A., et al. , Rapid fine conformational epitope mapping using comprehensive mutagenesis and deep sequencing. J. Biol. Chem. 290, 26457–26470 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Fowler D. M., Araya C. L., Gerard W., Fields S., Enrich: Software for analysis of protein function by enrichment and depletion of variants. Bioinformatics 27, 3430–3431 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Derda R., et al. , Diversity of phage-displayed libraries of peptides during panning and amplification. Molecules 16, 1776–1803 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Chao G., et al. , Isolating and engineering human antibodies using yeast surface display. Nat. Protoc. 1, 755–768 (2006). [DOI] [PubMed] [Google Scholar]
  • 18.Kelil A., Gallo E., Banerjee S., Adams J. J., Sidhu S. S., CellectSeq: In silico discovery of antibodies targeting integral membrane proteins combining in situ selections and next-generation sequencing. Commun. Biol. 4, 561 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Maranhão A. Q., et al. , Discovering selected antibodies from deep-sequenced phage-display antibody library using ATTILA. Bioinform Biol. Insights 14, 1–8 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Fowler D. M., Fields S., Deep mutational scanning: A new style of protein science. Nat. Methods 11, 801–807 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Adams R. M., Mora T., Walczak A. M., Kinney J. B., Measuring the sequence-affinity landscape of antibodies with massively parallel titration curves. eLife 5, 1–27 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Kinney J. B., Murugan A., Callan C. G., Cox E. C., Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl. Acad. Sci. U.S.A. 107, 9158–9163 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Reich L., Dutta S., Keating A. E., SORTCERY—A high-throughput method to affinity rank peptide ligands. J. Mol. Biol. 427, 2135–2150 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Jenson J. M., et al. , Peptide design by optimization on a data parameterized protein interaction landscape. Proc. Natl. Acad. Sci. U.S.A. 115, E10342–E10351 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Makowski E. K., et al. , Co-optimization of therapeutic antibody affinity and specificity using machine learning models that generalize to novel mutational space. Nat. Commun. 13, 1–14 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Trippe B. L., et al. , Randomized gates eliminate bias in sort-seq assays. Protein Sci. 31, 1–8 (2022). [Google Scholar]
  • 27.Somermeyer L. G., et al. , Heterogeneity of the GFP fitness landscape and data-driven protein design. eLife 11, e75842 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Sarkisyan K. S., et al. , Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Greaney A. J., et al. , Comprehensive mapping of mutations in the SARS-CoV-2 receptor-binding domain that affect recognition by polyclonal human plasma antibodies. Cell Host Microbe 29, 463–476 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Starr T. N., et al. , Shifting mutational constraints in the SARS-CoV-2 receptor-binding domain during viral evolution. Science 377, 420–424 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Hsu C., Nisonoff H., Fannjiang C., Listgarten J., Learning protein fitness models from evolutionary and assay-labeled data. Nat. Biotechnol. 40, 1114–1122 (2022). [DOI] [PubMed] [Google Scholar]
  • 32.Raghunathan T. E., Rosenthal R., Rubin D. B., Comparing correlated but nonoverlapping correlations. Psychol. Methods 1, 178–183 (1996). [Google Scholar]
  • 33.Makowski E. K., Wu L., Desai A. A., Tessier P. M., Highly sensitive detection of antibody nonspecific interactions using flow cytometry. MAbs 13, 1–11 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Makowski E. K., et al. , Reduction of therapeutic antibody self-association using yeast-display selections and machine learning. MAbs 14, 1–15 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Case M., Navaratna T., Vinh J., Thurber G. M., Rapid evaluation of staple placement in stabilized alpha helices using bacterial surface display. ACS Chem. Biol. 18, 905–914 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Navaratna T., et al. , Directed evolution using stabilized bacterial peptide display. J. Am. Chem. Soc. 142, 1882–1894 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Dutta S., et al. , Determinants of BH3 binding specificity for Mcl-1 vs. Bcl-xL. J. Mol. Biol. 398, 747–762 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Dutta S., et al. , Potent and specific peptide inhibitors of human pro-survival protein bcl-xl. J. Mol. Biol. 427, 1241–1253 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Romero P. A., Krause A., Arnold F. H., Navigating the protein fitness landscape with Gaussian processes. Proc. Natl. Acad. Sci. U.S.A. 110, 193–201 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Wu Z., Jennifer Kan S. B., Lewis R. D., Wittmann B. J., Arnold F. H., Machine learning-assisted directed protein evolution with combinatorial libraries. Proc. Natl. Acad. Sci. U.S.A. 116, 8852–8858 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Walensky L. D., et al. , Activation of apoptosis in vivo by a hydrocarbon-stapled BH3 helix. Science 305, 1466–1470 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Walensky L. D., Bird G. H., Hydrocarbon-stapled peptides: Principles, practice, and progress. J. Med. Chem. 57, 6275–6288 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Rives A., et al. , Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U.S.A. 118, e2016239118 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Alley E. C., Khimulya G., Biswas S., AlQuraishi M., Church G. M., Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Riesselman A. J., Ingraham J. B., Marks D. S., Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Shin J. E., et al. , Protein design and variant prediction using autoregressive generative models. Nat. Commun. 12, 1–11 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Adams J. M., Cory S., The Bcl-2 protein family: Arbiters of cell survival. Science 281, 1322–1326 (1998). [DOI] [PubMed] [Google Scholar]
  • 48.Shamas-Din A., Kale J., Leber B., Andrews D. W., Mechanisms of action of Bcl-2 family proteins. Cold Spring Harb. Perspect. Biol. 5, 1–21 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Czabotar P. E., Lessene G., Strasser A., Adams J. M., Control of apoptosis by the BCL-2 protein family: Implications for physiology and therapy. Nat. Rev. Mol. Cell Biol. 15, 49–63 (2014). [DOI] [PubMed] [Google Scholar]
  • 50.Hie B. L., et al. , Efficient evolution of human antibodies from general protein language models. Nat. Biotechnol. 1–23 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Kang Y., Leng D., Guo J., Pan L., Sequence-based deep learning antibody design for in silico antibody affinity maturation. arXiv [Preprint] (2021). 10.48550/arXiv.2103.03724 (Accessed 14 January 2024). [DOI]
  • 52.Ruffolo J. A., Gray J. J., Sulam J., Deciphering antibody affinity maturation with language models and weakly supervised learning arXiv [Preprint] (2021). 10.48550/arXiv.2112.07782 (Accessed 14 January 2024). [DOI]
  • 53.Amimeur T., et al. , Designing feature-controlled humanoid antibody discovery libraries using generative adversarial networks. bioRxiv [Preprint] (2020). 10.1101/2020.04.12.024844 (Accessed 10 July 2023). [DOI]
  • 54.Jain T., et al. , Biophysical properties of the clinical-stage antibody landscape. Proc. Natl. Acad. Sci. U.S.A. 114, 944–949 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Taguchi A. T., et al. , Comprehensive prediction of molecular recognition in a combinatorial chemical space using machine learning. ACS Comb. Sci. 22, 500–508 (2020). [DOI] [PubMed] [Google Scholar]
  • 56.Mason D. M., et al. , Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning. Nat. Biomed. Eng. 5, 600–612 (2021). [DOI] [PubMed] [Google Scholar]
  • 57.Chu Q., et al. , Towards understanding cell penetration by stapled peptides. MedChemComm 6, 111–119 (2015). [Google Scholar]
  • 58.Bird G. H., et al. , Biophysical determinants for cellular uptake of hydrocarbon-stapled peptide helices. Nat. Chem. Biol. 12, 845–852 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Bird G. H., et al. , Hydrocarbon double-stapling remedies the proteolytic instability of a lengthy peptide therapeutic. Proc. Natl. Acad. Sci. U.S.A. 107, 14093–14098 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Chandramohan A., et al. , Design-rules for stapled alpha-helical peptides with on-target in vivo activity: Application to Mdm2/X dual antagonists. bioRxiv [Preprint] (2023). 10.1101/2023.02.25.530030 (Accessed 10 July 2023). [DOI]
  • 61.Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R., Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014). [Google Scholar]
  • 62.Gaspar J. M., NGmerge: Merging paired-end reads via novel empirically-derived models of sequencing errors. BMC Bioinf. 19, 1–9 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Mitchell S., et al. , Optimization with PuLP (Version 2.8.0, Github, 2009).
  • 64.Case M., Vinh J., Smith M., Thurber G., Machine Learning to Predict Continuous Protein Properties from Binary Cell Sorting Data and Map Unseen Sequence Space. Sequence Read Archive. https://trace.ncbi.nlm.nih.gov/Traces/?view=study&acc=SRP484190. Deposited 14 January 2024. [DOI] [PMC free article] [PubMed]
  • 65.Mitchell S., et al. , Optimization with PuLP (Version 2.8, 2024). Github. https://github.com/coin-or/pulp. Accessed 4 February 2024.
