Structure-based design of combinatorial mutagenesis libraries

Deeptak Verma; Gevorg Grigoryan; Chris Bailey-Kellogg

doi:10.1002/pro.2642

2015 Mar 2;24(5):895–908. doi: 10.1002/pro.2642

PMCID: PMC4420537 PMID: 25611189

Abstract

The development of protein variants with improved properties (thermostability, binding affinity, catalytic activity, etc.) has greatly benefited from the application of high-throughput screens evaluating large, diverse combinatorial libraries. At the same time, since only a very limited portion of sequence space can be experimentally constructed and tested, an attractive possibility is to use computational protein design to focus libraries on a productive portion of the space. We present a general-purpose method, called “Structure-based Optimization of Combinatorial Mutagenesis” (SOCoM), which can optimize arbitrarily large combinatorial mutagenesis libraries directly based on structural energies of their constituents. SOCoM chooses both positions and substitutions, employing a combinatorial optimization framework based on library-averaged energy potentials in order to avoid explicitly modeling every variant in every possible library. In case study applications to green fluorescent protein, β-lactamase, and lipase A, SOCoM optimizes relatively small, focused libraries whose variants achieve energies comparable to or better than previous library design efforts, as well as larger libraries (previously not designable by structure-based methods) whose variants cover greater diversity while still maintaining substantially better energies than would be achieved by representative random library approaches. By allowing the creation of large-scale combinatorial libraries based on structural calculations, SOCoM promises to increase the scope of applicability of computational protein design and improve the hit rate of discovering beneficial variants. While designs presented here focus on variant stability (predicted by total energy), SOCoM can readily incorporate other structure-based assessments, such as the energy gap between alternative conformational or bound states.

Keywords: combinatorial library, structure-based protein design, cluster expansion, high-throughput screening, protein design space

Introduction

Computational protein design focuses experimental effort on beneficial regions of sequence space, by modeling protein sequence-structure-function relationships and optimizing variants with desired properties. In contexts where a limited number of alternatives are to be experimentally tested, computational methods have yielded a wide range of novel proteins: new sequences for existing structures¹^,² and new sequences for previously unobserved structures;³^,⁴ enzymes with altered substrate specificity,⁵ grafted activity,⁶ and entirely new catalytic activity;⁶^,⁷ variants with optimized binding affinity;⁸^–¹⁰ new pairs of binding partners with targeted affinity and specificity;¹¹ variants with improved thermostability¹² or immunogenicity;¹³^,¹⁴ and so on. In contexts where screening and selection techniques are available to assess larger libraries of diverse alternatives, computational methods have generated functionally enriched sets of candidates that likewise have led to the identification of a wide range of novel proteins, with altered activity;¹⁵^,¹⁶ improved activity;¹⁷^,¹⁸ targeted binding affinity and selectivity;¹⁹ and enhanced thermostability.²⁰^,²¹ We seek here the best of both worlds, bringing structure-based design techniques to bear in library-based experimental contexts.

The use of detailed structural modeling has been critical in many individual protein design studies, but unfortunately such modeling does not readily scale to the design of large combinatorial libraries, which can include millions or even billions of variants (with room for expansion, as stochastic library construction techniques already generate orders of magnitude more). It would thus be computationally very demanding to model at an atomistic level even the variants in a single library, much less over the space of possible libraries that must be considered in the design process. Consequently, computational structure-based approaches in library optimization have been limited to smaller mutagenesis libraries,¹⁵^,²² for guiding the selection of degenerate codons encoding frequently occurring choices from individual protein designs,²³ or indirect application in contact assessment for recombination libraries.¹⁷^,²⁴^–²⁶ However, modern library techniques are enabling larger library sizes, and it has been shown that large mutational loads can lead to beneficial variant properties.²⁷^,²⁸ Thus the integration of structure-based design with large-scale library-based experiments promises mutual benefits: a high library hit rate due to enrichment of structurally stable variants, the ability to scale to large library size, and a tractable structure-based design objective of generating diverse stable variants (leaving more poorly understood and modeled objectives to the screening process).

This article presents a novel approach to structure-based protein library design that enables optimization of large combinatorial mutagenesis libraries (in library space) directly based on the structural properties of their members. Our method, Structure-based Optimization of Combinatorial Mutagenesis (SOCoM), builds on OCoM,²⁹ a method to optimize combinatorial mutagenesis libraries based on a one- and two-body sequence potential derived from a multiple sequence alignment (MSA) for the target protein. The key enabling insight in the jump from OCoM to SOCoM is the use of Cluster Expansion (CE)¹¹^,³⁰ to transform structure-based evaluation into a function of amino acid sequence that can be efficiently assessed and optimized. As shown in previous studies, CE improves calculation time by orders of magnitude without a significant loss in accuracy.³¹ It has been successfully applied in a number of significant individual protein design studies.¹¹^,³⁰^,³¹ However, even with CE-based efficient evaluation of variant energies, it is computationally infeasible to enumerate and explicitly evaluate every variant in every possible library. Instead, design must be performed in library space, treating libraries as “points” that can be efficiently represented, evaluated, and optimized. To address this issue, SOCoM leverages the OCoM-based formulation of library design, specifying a library in terms of choices of mutations at choices of positions and assessing it in terms of average variant quality, under the hypothesis that a better average leads to better individuals (shown to hold in the results presented here).

Results

Figure 1 summarizes the SOCoM approach. The library design space is defined in terms of possible positions at which to introduce mutations and possible choices of amino acids at those positions; the goal is to choose a subset of positions and substitutions to be “mixed and matched” in a library. As used in previous studies,²⁹ each set of amino acid choices that make up these libraries is referred to as a “tube.” A tube can either specify any mixture of point mutations or, if extended to multisets, degenerate oligonucleotides. Using sets of tubes for library construction, the total number of variants within a library is equal to the product of the size of the tubes used (thus implicitly controlling the overall library size). For example, in the left-hand library in Figure 1, there are two sites, one incorporating {R,K,H} and the other {S,T}, leading to the six variants listed; in the middle library one site has {R,K} and the other {I,L}, leading to the four variants. The variants define a distribution of energies, but to enable efficient evaluation and optimization in library space, SOCoM assesses libraries in terms of their average CE-based energies without explicitly enumerating the variants. SOCoM employs an integer linear programming framework to select a specific library (positions and substitutions) that optimizes this library-averaged score. This library is predicted to be enriched in stable variants (each a combination of some of the mutations) that can be experimentally evaluated for other properties of interest.
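The library-size rule and the library-averaged score described above can be illustrated with a minimal Python sketch. It assumes that every variant in a library is equally likely (tubes are plain sets, not multisets) and that the score is an up to two-body potential; the lookup tables `one_body` and `two_body` and all energy values are hypothetical stand-ins for CE terms, not values from this study. Under those assumptions the average over all variants decomposes into per-tube and per-tube-pair averages, so no enumeration is needed.

```python
from itertools import product
from math import prod


def library_size(tubes):
    """Number of variants in a library: the product of the tube sizes."""
    return prod(len(tube) for tube in tubes)


def variant_energy(variant, one_body, two_body):
    """Energy of one variant under a one- and two-body potential."""
    energy = sum(one_body[i][aa] for i, aa in enumerate(variant))
    for (i, j), table in two_body.items():
        energy += table.get((variant[i], variant[j]), 0.0)  # absent pairs count as 0
    return energy


def library_average(tubes, one_body, two_body):
    """Average variant energy, computed tube by tube without enumeration."""
    total = 0.0
    for i, tube in enumerate(tubes):
        total += sum(one_body[i][aa] for aa in tube) / len(tube)
    for (i, j), table in two_body.items():
        pairs = [(a, b) for a in tubes[i] for b in tubes[j]]
        total += sum(table.get(pair, 0.0) for pair in pairs) / len(pairs)
    return total


# Left-hand library of Figure 1: two sites, {R,K,H} and {S,T}.
tubes = [["R", "K", "H"], ["S", "T"]]
# Hypothetical energy terms, for illustration only.
one_body = [{"R": -1.0, "K": -0.5, "H": 0.2}, {"S": -0.3, "T": 0.1}]
two_body = {(0, 1): {("R", "S"): -0.4, ("K", "T"): 0.3}}

size = library_size(tubes)  # 3 * 2 = 6 variants
averaged = library_average(tubes, one_body, two_body)

# Brute-force check: enumerate every variant and average explicitly.
variants = list(product(*tubes))
brute_force = sum(variant_energy(v, one_body, two_body) for v in variants) / len(variants)
print(size, averaged, brute_force)
```

The tube-wise computation touches each one-body and two-body table once, whereas the brute-force check grows with the product of the tube sizes, which is why only the former remains practical for large libraries.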

Overview of SOCoM. (Left) The library design space specifies possible positions that could be mutated and amino acids that could be incorporated at those positions. Each library is defined by choices for positions and amino acids. A Cluster Expansion model maps amino acid sequences of potential library variants to structural properties, here energies. (Right) Library designs, specifying subsets of positions and amino acids, are optimized by an integer linear programming method. To enable efficient evaluation of large combinatorial libraries while optimizing over the massive design space, energy evaluation is based on library-averaged energy scores, computed without enumerating all variants.

Validation of the SOCoM method requires demonstrating that it meets the stated goal of generating library designs enriched in variants with good scores. Here we use Rosetta energy as the score. While our method can be applied with any structure-based score via CE decomposition, Rosetta has certainly proved its utility in structure-based protein design, particularly with diverse variants, as SOCoM now enables for structure-based libraries.

In order to assess SOCoM's ability to correctly optimize combinatorial mutagenesis libraries, we applied it to three different proteins previously targeted by library studies: green fluorescent protein (GFP), β-lactamase, and lipase A. We first performed smaller-scale focused design and compared SOCoM-optimized libraries with those from earlier library studies by Treynor et al.¹⁵ (GFP core), Hayes et al.³² (β-lactamase active site), and Sandström et al.³³ (lipase A active site). Since library space is sufficiently large that experimental results provide little insight on our designs (and again, would reflect only on the scoring function employed), we focus on the demonstration that SOCoM does indeed produce libraries enriched in variants with good scores. We then allowed SOCoM to optimize larger libraries (previously unattainable by structure-based library design) and studied scalability and implications for library design, as well as how SOCoM-optimized libraries compared against representative randomly designed libraries.

Focused designs

For a direct comparison with the previous methods, in each case the same mutable positions were targeted as in the earlier studies; point mutation libraries were restricted to incorporate at most three mutations at a site in addition to wild-type, optimized by SOCoM by choosing from position-specific sets prefiltered for homology (i.e., for each position, all subsets of up to three mutations found sufficiently frequently in homologs); and the library size was restricted to generate the same number of variants. Libraries constructed using degenerate oligos are also presented. As a matter of general practice, mutations to/from proline and cysteine were not considered for the SOCoM designs, avoiding the need for more detailed modeling of their structural impact.

Green fluorescent protein

GFP has revolutionized microscopy and enabled significant advances in cell biology by highlighting patterns of gene expression, protein localization and protein association.³⁴^–³⁷ Newly designed GFP variants have enabled the identification of new proteins³⁸ as well as better visualization of subcellular structures in cells.³⁹ The generation of a large library of stable variants to be screened/selected for desired activities could further expand the palette of applications.

Following Treynor et al.,¹⁵ 512-member libraries were optimized for wild-type GFP from Aequorea victoria, allowing mutations at positions 57 to 72, which form the longest stretch of contiguous core residues.⁴⁰ Targeting core positions could be disruptive;⁴¹^,⁴² however, such mutations have a higher likelihood of directly affecting the fluorescence properties of the chromophore region⁴³ and hence producing better-differentiated libraries.

Figure 2 summarizes the libraries and constituent variant energies designed by both SOCoM and the ORBIT methods applied by Treynor et al. The table in Figure 2(a) compares mutations selected by the methods. For example, positions 57 and 67 remain unmutated by all methods, and all methods choose S72A. At position 63, both C^ORBIT and SOCoM-degenerate oligos opt for T63A, SOCoM-point mutations selects T63E, and DBIS^ORBIT leaves it unmutated. Divergence between the ORBIT library plans and the SOCoM ones is observed at both T59, where the ORBIT plans choose the relatively conservative S while the SOCoM ones choose I or E, and similarly Q69, where ORBIT plans incorporate L but SOCoM uses E. We next explore the energetic implications of these choices.

GFP core libraries. (a) Mutational choices made by previous methods (Treynor et al.) and by SOCoM. (b) Histograms of the AMBER energies (postminimization) of structural models built for the variants comprising the libraries. (c) Position-specific contributions to the AMBER energies, averaged over the libraries.

We enumerated all 512 variants in the DBIS^ORBIT library and all 512 variants in the SOCoM degenerate oligo library, and used Rosetta to construct models for them all. Our designs score more favorably in terms of Rosetta energies (Supporting Information Fig. 1), but this is somewhat to be expected as SOCoM optimizes for Rosetta energy (albeit library averaged). Thus in order to provide a more unbiased comparison in assessing relative energetic favorability of the variants (predictive of stability, our goal), we subjected each model to energy minimization via Tinker⁴⁴ according to the AMBER⁴⁵ force field using the Generalized Born implicit solvent model. We present analyses based on these AMBER energies, as a somewhat distinct evaluation. Corresponding Rosetta-based evaluations are provided in the Supporting Information (Supporting Information Fig. 1(a)) and illustrate similar conclusions.

Figure 2(b) illustrates the distributions of AMBER energies over the two libraries. Clearly, SOCoM variants tend to have better energies, on average scoring more favorably by 71 kcal/mol relative to DBIS^ORBIT variants; the difference in distributions is statistically significant (Wilcoxon P value <0.001). Further, whereas most SOCoM variants (59%) are scored with lower energies than the wild-type sequence, only 6% of the DBIS^ORBIT variants are in this category. The standard deviation of the energies captures one aspect of library diversity, and we see that the SOCoM library is slightly more diverse under this notion, at 43 kcal/mol, compared with 40 kcal/mol for DBIS^ORBIT.
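The kind of comparison reported above (fraction of variants scoring better than wild type, and a test for a difference between two energy distributions) can be sketched as follows. This is an illustration under stated assumptions: the text says only "Wilcoxon", and the sketch assumes the two-sample rank-sum form with a normal approximation and no tie correction in the variance. The function names and the energy values are hypothetical and are not data from this study.

```python
from math import erfc, sqrt


def fraction_better(energies, wild_type_energy):
    """Fraction of variants scoring below (more favorably than) wild type."""
    return sum(e < wild_type_energy for e in energies) / len(energies)


def rank_sum_test(x, y):
    """Two-sided rank-sum test; returns (U statistic for x, approximate P value)."""
    combined = sorted((v, g) for g, values in enumerate((x, y)) for v in values)
    n = len(combined)
    rank_sum_x = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j for a block of ties
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, erfc(abs(z) / sqrt(2))  # two-sided normal tail probability


# Hypothetical energies (kcal/mol) relative to a wild-type reference of 0.
library_a = [-120.0, -95.0, -80.0, -60.0, -10.0, 15.0]
library_b = [-20.0, 5.0, 30.0, 45.0, 60.0, 90.0]

better_a = fraction_better(library_a, 0.0)
better_b = fraction_better(library_b, 0.0)
u_stat, p_value = rank_sum_test(library_a, library_b)
print(better_a, better_b, u_stat, p_value)
```

For libraries with hundreds of variants, as in the comparisons here, the normal approximation is reasonable; for very small samples an exact test would be the safer choice.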

SOCoM optimally selects mutations for the overall library score, summed over variants and positions. While Figure 2(b) broke the score down by variant, it is also interesting to characterize it by position-specific energy contributions. Figure 2(c) presents the energetic contributions for each position, averaged over all library variants. It also indicates the contributions from those positions in the wild-type. We see that SOCoM variants are also much better than DBIS^ORBIT variants at this finer resolution, with SOCoM's variants aiding nine positions at an average of −7.8 kcal/mol energy with respect to wild-type, versus DBIS^ORBIT's variants benefiting only eight positions at an average of only −2.9 kcal/mol. SOCoM variants tend to substantially improve energies involving positions 61, 62, 64, and 65, but only marginally those at 57, 67, 68, 70, and 71. Note that the improvements can be observed even at positions that are left wild-type, due to interactions with mutated residues; for example, the wild-type valine at position 68 has an energetic contribution of −23.3 kcal/mol, which is further enhanced among variants by an additional −0.98 kcal/mol.

β-lactamase

The family of β-lactamases comprises a diverse group of enzymes that hydrolyze the β-lactam ring of penicillin-like drugs, providing bacteria with antibiotic resistance⁴⁶ and thus forming an important drug target.⁴⁷ β-lactamase may also be put to productive use, for example, in anti-cancer ADEPT therapies¹³ and in measuring gene expression levels in cells.⁴⁸ β-lactamases have been widely used as model systems in the development of combinatorial library construction methods²⁹^,⁴⁹ due to inexpensive, high-throughput activity screens. β-lactamase libraries can provide insights on hydrolysis activity of β-lactam derivatives. As such, site saturation studies have helped to characterize functional and stability contributions of active site residues.⁵⁰ Also, site-saturation methods are widely popular for understanding diversity in the spectrum of substrate recognition.⁵¹

We follow the general specification of Hayes et al.,³² designing libraries targeting the active site of TEM-1 β-lactamase from Escherichia coli, thereby potentially altering enzyme activity and substrate specificity.⁵²^,⁵³ Specifically, the mutable positions include a total of 19 residues within 5 Å of β-lactamase active site residues S70, K73, S130, E166, and K234. Mutations at catalytic residue sites (S70, K73, S130, and E166) themselves were not considered. For a direct comparison with the Hayes et al. library, at most four mutations were allowed at each position in addition to wild-type, optimally selected by SOCoM from prefiltered position-specific sets of allowed mutations.

Figure 3 summarizes the library designs by both SOCoM and the Protein Design Automation (PDA) method employed by Hayes et al.³² (panel a), AMBER energies of sampled sets of library members (panel b), and position-specific contributions among those library members (panel c). Since the libraries contain too many members to allow explicit structural modeling of each, 1000 were randomly chosen from each library. Considering all 19 positions, SOCoM provides an average stabilizing effect of −2.4 kcal/mol (with 11 of 19 being stabilized) whereas PDA provides an average destabilizing effect of 0.5 kcal/mol (though 10 of 19 are stabilized). Positions with major differences (>10 kcal/mol) in the stability effects include M69, F72, Y105, N132, N170, D214, and S235. Some of the mutated residues have similar stabilizing (Y105N, I127L, and G236S) and destabilizing (K234I) trends within SOCoM and PDA, with respect to the wild-type.

β-lactamase active site libraries. (a) Mutational choices made by previous methods (Hayes et al.) and by SOCoM. (b) Histograms of the AMBER energies (postminimization) of structural models built for the variants comprising the libraries. (c) Position-specific contributions to the AMBER energies, averaged over the libraries.

The energy distributions in Figure 3(b) show that SOCoM has better scoring β-lactamase variants than PDA does. The wild-type β-lactamase AMBER energy is referenced at 0 kcal/mol, and in comparison, ∼58% of the SOCoM variants are better while only ∼36% of the PDA ones are. The mean (±SD) Δenergy of SOCoM library is −16 (±69) kcal/mol, whereas the PDA library's mean Δenergy is 32 (±86) kcal/mol and the distributions are significantly different with a P value of less than 0.001. As discussed for GFP, these results are with respect to the minimized AMBER energy (not the objective function for either method). The Rosetta energies optimized by SOCoM (via cluster expansion and library averaging) display similar trends (Supporting Information Fig. 1(b)).

Lipase A

Candida antarctica lipase A is a thermostable enzyme that exhibits many different properties such as activity towards tertiary alcohols,⁵⁴ high enantioselectivity towards β-amino acids,⁵⁵ and sn-2 fatty acid preference of triglycerides.⁵⁶ Recent studies have shown that combinatorial reshaping of the lipase A substrate pocket leads to highly active enantioselective variants. Such enzymes can be employed in industrial processes under mild reaction conditions providing beneficial outcomes for production of pharmaceuticals.⁵⁷

The Lipase A library design was focused on the residues targeted in the Sandström et al.³³ enantioselective library: F149, I150, P215, T221, L225, F233, A234, G237, and F431. These active site proximal residues were identified from ibuprofen ester binding and include all nonconserved residues within 4 Å from the bound ester, that is, those with a higher chance of influencing lipase A catalytic properties.⁵⁸ Our conservation analysis identified ∼4 allowed mutations at each of these sites.

Figure 4 summarizes the designs and constituent variant energies. Sandström et al. constructed their combinatorial library using a knowledge-based approach, and it was surprising to note that SOCoM, without such prior knowledge, captured some of the mutations identified as being important for enantioselectivity, including F149Y and I150N.⁵⁹ Other similar mutations selected by Sandström et al. and SOCoM libraries include L225V (degenerate oligos) and G237A (point mutations). Analyzing the localized AMBER energies shows that in general SOCoM mutations are more stabilizing at five out of nine positions, and make an average stability contribution of −20.75 kcal/mol versus −11.48 kcal/mol in the Sandström et al. library. The wild-type residue at position 149 exhibits a low energy state which Sandström et al. improve by incorporating tyrosine (knowledge-based), while SOCoM improves further by providing additional sets of stabilizing mutations (i.e., asparagine and tryptophan). While SOCoM avoids mutations to/from proline, interestingly the P215A mutation of Sandström et al. seems to have little effect.

Lipase A active site libraries. (a) Mutational choices made by previous methods (Sandström et al.) and by SOCoM. (b) Histograms of the AMBER energies (postminimization) of structural models built for the variants comprising the libraries. (c) Position-specific contributions to the AMBER energies, averaged over the libraries.

The library energy distributions [Fig. 4(b)] clearly show that SOCoM yields more variants predicted to be more stable. The distributions are significantly different with a Wilcoxon P value <0.001. Wild-type lipase A has an AMBER Δenergy of −44 kcal/mol and 72% of SOCoM variants are better, as compared with 59% of the Sandström et al. ones. It is also interesting to observe that most of the stabilized variants from Sandström et al. overlap with the bimodal tail region of SOCoM variants, suggesting that SOCoM provides equal quality but with a higher deviation yielding greater diversity. Supporting Information Figure 1(c) shows that the same trends hold under the Rosetta energy.

Complete designs

In addition to the smaller-scale focused library designs, a set of larger-scale libraries was also designed considering mutations at any position, over the range of 10, 15, 20, 25, and 30 mutated sites. Such scalability is not computationally tractable by any other structure-based design method, but when an appropriate screen/selection is available, a scale up improves the chances to find beneficial variants, either at higher mutational loads themselves,²⁷^,²⁸ or even just as different subsets of the larger (but targeted) sequence space being explored experimentally.

Figure 5 shows the trends in average library energy at different numbers of mutated sites (optimally selected by SOCoM throughout the entire protein). A stabilizing trend is observed when increasing the number of sites from 10 to 30. However, the average stabilizing contribution at each position decreases when increasing the number of mutations. For example, a 10-site β-lactamase library provides an average stability of 1 kcal/mol while a 30-site library provides only 0.8 kcal/mol on average per mutation. Also, the average energetic favorability of point mutations over degenerate codon GFP libraries is ∼4 kcal/mol.

Trends in energies and sizes of SOCoM-based complete library designs over different numbers of mutated sites using (a) point mutations and (b) degenerate oligos.

These optimally constructed libraries were tested for stability by library enumeration. The average number of sequences over all three proteins' point mutation libraries is 8.4 × 10⁴ for the 10-site library, 2.8 × 10⁹ for the 20-site, and 5.7 × 10¹³ for the 30-site. Because complete enumeration of such libraries is computationally infeasible, we sampled 1000 variants from each library for each of the proteins and compared them with 20 mutated-site random libraries representative of approaches regularly employed in library-based protein engineering. As such, 20 random positions with Shannon entropy scores between 1 and 2 were chosen⁶⁰ from the MSA. Given these positions, we evaluated libraries composed of (A) all 20 amino acids at each of the 20 positions, mimicking NNK-based libraries;⁶¹ (B) only those amino acids represented in the MSA, representing more finely-targeted random libraries. For a direct comparison with SOCoM optimized libraries, we picked 1000 random variants from each and computed the CE-based energies.
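The two ingredients of this comparison, scoring MSA columns by Shannon entropy and drawing a fixed number of variants from a library too large to enumerate, can be sketched as below. Several details are assumptions rather than statements of the protocol used here: the logarithm base (base 2 is assumed), the treatment of gaps (ignored), and sampling with replacement by choosing one residue per tube uniformly. Function names and the toy inputs are hypothetical.

```python
import random
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (bits, assumed base 2) of one MSA column, ignoring gaps."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def positions_in_entropy_range(msa, low=1.0, high=2.0):
    """Indices of columns whose entropy lies between low and high (inclusive)."""
    length = len(msa[0])
    columns = ["".join(seq[i] for seq in msa) for i in range(length)]
    return [i for i, col in enumerate(columns) if low <= column_entropy(col) <= high]


def sample_variants(tubes, k, seed=0):
    """Draw k variants uniformly (with replacement) without enumerating the library."""
    rng = random.Random(seed)
    return [tuple(rng.choice(tube) for tube in tubes) for _ in range(k)]


# Toy alignment: column 0 is conserved, column 1 has two residue types,
# column 2 has four residue types.
toy_msa = ["AAC", "AAD", "ACE", "ACF"]
selected = positions_in_entropy_range(toy_msa)

toy_tubes = [["R", "K", "H"], ["S", "T"], ["I", "L"]]
sampled = sample_variants(toy_tubes, k=5)
print(selected, sampled)
```

Because each site is drawn independently, the cost of sampling grows with the number of sites rather than with the library size, so the same routine applies to a 10-site or a 30-site design.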

Figure 6 illustrates the distributions of energies (with wild-type reference energy 0) for the various libraries. For all three proteins, the optimized libraries display a stabilizing trend while the variants of the random libraries have higher energy scores with respect to the wild-type. Apparently, entropy-based mutagenesis libraries produce few beneficial variants in terms of predicted stability, whereas SOCoM designed libraries provide many more and much better variants.

Histograms of the AMBER energies (postminimization) of structural models built for sampled variants from different complete library designs.

Calculation time

SOCoM is fast, requiring only minutes on commodity hardware (an off-the-shelf 2012 MacBook Pro) to optimize each library presented so far. This excludes the time for Rosetta modeling and CE training, which requires far more CPU time, but then the resulting model may be used repeatedly by all subsequent library designs. Tens of thousands of models must be generated by Rosetta in order to properly fit the CE parameters. Supporting Information Table 1 provides details on the resulting design space scalability and calculation times. For example, constructing all 30,000 Rosetta models for Lipase-A would require 810,000 CPU minutes. This task was accomplished in 13.5 h on a 25-node cluster (having four cores in each node). Interestingly, the optimization time does depend on the number of positions being considered and the number of choices at each. For example, 10-site complete design libraries required 1 min for GFP, 3 min for β-lactamase, and 29 min for lipase A. While we expected an increasing calculation time with increasing numbers of mutated sites, the differences were only of a few milliseconds. Consequently SOCoM can readily scale to design massive libraries. For example, as proof of principle we designed a 30-site GFP library, with 1 trillion unique variants, in less than a minute.

Discussion

We have introduced a novel method for optimization of combinatorial mutagenesis libraries, the first such method to perform large-library structure-based design. SOCoM works directly in library space, choosing positions and mutations to produce a set of variants enriched, on average, in a defined structure-based property. This stands in contrast to typical approaches, which work primarily at the level of individual variants and then generate a library from them, without care for the overall quality of the library members.

We have applied our method to a number of proteins that have been previously targeted via combinatorial mutagenesis, including green fluorescent protein, β-lactamase and Candida antarctica lipase A. We demonstrated the ability of SOCoM to effectively optimize libraries over a range of scenarios that would be appropriate in different contexts: with point mutations or degenerate codons, under different numbers of mutated sites, with or without focusing on particular positions, and controlling the overall library size. Comparison of results reveals that SOCoM can efficiently generate better libraries than other existing methods. Our results also suggest that higher numbers of mutated sites enable spreading out the energetically favorable mutations more broadly, while generating a diverse group of variants better than random libraries. In a study of scalability, we have found that we can optimize billion-member libraries in only a few minutes.

Our method is general, enabling the optimization of arbitrary structure-based properties of members of combinatorial mutagenesis libraries, constrained by key criteria such as library size and composition. Further, because it is expressed as an integer linear program, SOCoM enables the incorporation of additional constraints such as diversity,⁶² potentially set in the Pareto optimization framework⁶³ enabling simultaneous optimization of multiple such criteria. Ultimately, designing libraries in this fashion promises to result in improved success rates in discovering beneficial variants.

Methods

Given a target protein, the goal is to design combinatorial mutagenesis libraries optimizing the structural energy of the constituent variants (Fig. 1). A preprocessing phase identifies allowed mutations and constructs energy-based potentials by which to score possible libraries. Then an optimization phase identifies the best protein library in terms of its average score over library members. The key input to SOCoM is a structure of the target protein; a multiple sequence alignment (MSA) of its homologs can be used to help identify allowed mutations. Additional parameters for preprocessing refine and control the allowed mutations and the terms to be included in the potentials. Parameters for optimization specify the number of sites to mutate, whether to use point mutations or degenerate oligos, the allowed number of mutations per site, and the total library size. Default values for these parameters, used in the presented results, are indicated.
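To make the optimization phase concrete, the following toy sketch selects the library with the best (lowest) average score subject to a fixed number of mutated sites, a cap on mutations per site, and a cap on library size. It is not the integer linear program that SOCoM uses: exhaustive search is swapped in here purely for illustration and only works for very small design spaces. The function names, the treatment of library size as an upper bound, and all energy values are assumptions.

```python
from itertools import combinations, product
from math import prod


def candidate_tubes(wild_type, allowed, max_mutations):
    """Wild type alone, or wild type plus 1..max_mutations allowed substitutions."""
    tubes = [(wild_type,)]
    for k in range(1, max_mutations + 1):
        for subset in combinations(allowed, k):
            tubes.append((wild_type,) + subset)
    return tubes


def average_energy(tubes, one_body, two_body):
    """Library-averaged one- and two-body score, computed without enumeration."""
    total = sum(sum(one_body[i][aa] for aa in t) / len(t) for i, t in enumerate(tubes))
    for (i, j), table in two_body.items():
        pairs = [(a, b) for a in tubes[i] for b in tubes[j]]
        total += sum(table.get(p, 0.0) for p in pairs) / len(pairs)
    return total


def best_library(wild_type, allowed, one_body, two_body, n_sites, max_mutations, max_size):
    """Exhaustive stand-in for the ILP: try every combination of tubes."""
    options = [candidate_tubes(wt, allowed[i], max_mutations)
               for i, wt in enumerate(wild_type)]
    best = None
    for tubes in product(*options):
        n_mutated = sum(1 for t in tubes if len(t) > 1)
        if n_mutated != n_sites or prod(len(t) for t in tubes) > max_size:
            continue
        score = average_energy(tubes, one_body, two_body)
        if best is None or score < best[0]:
            best = (score, tubes)
    return best


# Hypothetical three-residue target with hypothetical energies.
wild_type = "AGL"
allowed = [["S", "T"], ["A"], ["I", "V"]]
one_body = [{"A": 0.0, "S": -1.0, "T": 0.5},
            {"G": 0.0, "A": -0.2},
            {"L": 0.0, "I": -0.6, "V": -0.8}]
two_body = {(0, 2): {("S", "V"): 1.0}}  # S at site 0 clashes with V at site 2

best_score, best_tubes = best_library(wild_type, allowed, one_body, two_body,
                                      n_sites=2, max_mutations=2, max_size=4)
print(best_score, best_tubes)
```

In this toy case the pairwise penalty steers the choice away from combining the two individually best substitutions, which is the kind of coupling a two-body potential is meant to capture. The number of tube combinations grows exponentially with the number of positions, so a search like this cannot replace a proper combinatorial optimization method at realistic scale.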

Algorithm

Allowed mutations

It can be helpful both computationally and experimentally to prefilter the mutations to be considered at each position, rather than considering all 19 other amino acids. For example, a multiple sequence alignment of homologs may reveal which positions are highly conserved (and to which amino acids); this information may complement the structure-based analysis by encoding information about function,⁶⁴ allosteric pathways,⁶⁵ folding,⁶⁶ etc. The prefiltering also helps keep under control the potential combinatorial explosion in the design space.

SOCoM includes preprocessing scripts to aid the specification of a position-specific list of allowed mutations defining the choices to consider for inclusion in a library. Ultimately it is up to the user to define which sites and amino acids to consider for library optimization, but the following approaches are generally useful.

A standard approach to prefiltering mutations is to select those of sufficient frequency in an MSA of homologs, assuming that evolutionarily-accepted mutations are likely to be favorable. Here, an MSA is preprocessed to identify sequences that are not too gappy (default at most 25%), that are sufficiently similar to the wild type (default at least 35% identity), and are sufficiently different from each other (default at most 95% identity). A background distribution (default McCaldon and Argos⁶⁷) provides thresholds, potentially scaled to be more/less conservative (default unscaled); a substitution at a position is deemed acceptable if its MSA frequency exceeds the background threshold for the amino acid type.
Structure-based allowed mutations (in addition to homology-based allowed mutations) are those whose secondary structure Chou-Fasman propensities⁶⁸ are similar enough (default propensity cutoff of 1.50 or more) to those of the corresponding wild-type residues in the target structure. In addition, by default mutations to and from proline and cysteine are excluded, due to their larger structural impact.
Target-based positions and allowed mutations are those manually indicated based on previous experimental results or insights into structure and function (e.g., to avoid or to focus on active site regions, depending on goals).
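The homology-based prefilter described above can be sketched in a few lines of Python. This is a minimal illustration, not SOCoM's preprocessing script: the function names, the greedy redundancy filter, the way identity is computed over non-gap columns, and the uniform 0.05 background (a placeholder, not the McCaldon and Argos values) are all assumptions made for the example.

```python
from collections import Counter

def identity(a, b):
    """Fraction of columns, gap-free in both sequences, at which they agree."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x == y for x, y in pairs) / len(pairs) if pairs else 0.0

def filter_msa(msa, wild_type, max_gap=0.25, min_wt_id=0.35, max_pair_id=0.95):
    """Keep sequences that are not too gappy, close enough to the wild type,
    and not near-duplicates of a sequence already kept (greedy; an assumption)."""
    kept = []
    for seq in msa:
        if seq.count("-") / len(seq) > max_gap:
            continue
        if identity(seq, wild_type) < min_wt_id:
            continue
        if any(identity(seq, k) > max_pair_id for k in kept):
            continue
        kept.append(seq)
    return kept

def allowed_mutations(msa, wild_type, background, scale=1.0):
    """Per position, list non-wild-type amino acids whose column frequency
    exceeds the (optionally scaled) background threshold."""
    allowed = {}
    for i, wt in enumerate(wild_type):
        column = [s[i] for s in msa if s[i] != "-"]
        counts = Counter(column)
        allowed[i] = sorted(
            aa for aa, c in counts.items()
            if aa != wt and c / len(column) > scale * background[aa]
        )
    return allowed

# Toy data; the background is a uniform placeholder, not a published table.
wild_type = "ACDE"
msa = ["ACDE", "ACDF", "GCDE", "A---", "ACDE"]
background = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}

kept = filter_msa(msa, wild_type)
allowed = allowed_mutations(kept, wild_type, background)
```

In this toy alignment the gappy sequence and the exact duplicate are dropped, and mutations are proposed only at the two positions that vary among the retained sequences.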

Energy-based potential

As discussed in the Introduction, Cluster Expansion (CE) provides an efficient yet accurate representation of a given protein property, expressed as a sequence potential (i.e., a function of the amino acid sequence). SOCoM applies CE to derive an up to two-body potential Ψ(S) characterizing the energy of the structure adopted by sequence S = {s₁, s₂, …, s_N}, where s_i represents the amino acid at position i and N is the total number of amino acids:

$$\Psi(S) = \sum_{i=1}^{N} \psi_i(s_i) + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \psi_{i,j}(s_i, s_j) \tag{1}$$

Here ψ_i(s_i) is the energetic contribution of amino acid s_i at position i and ψ_{i,j}(s_i, s_j) is the contribution due to the pair-wise interaction between amino acid s_i at position i and s_j at j. (The constant contribution from CE is distributed among self terms.³¹) To derive the terms in this potential, CE starts with a large training set of sequence:energy pairs, and considers which possible “clusters” (positions i for terms ψ_i and position pairs i, j for terms ψ_{i,j}) to include in the model (the rest are 0). Here we train CE based on structures and energies predicted by Rosetta⁶⁹ for a large set of randomly sampled sequences; any method or force field (and indeed any structural properties) can be used for energy calculation and subsequently for CE training. Only those sequences with sufficiently good energies (below a desired energy cutoff value, e.g., 0 kcal/mol) are included in the training set. It is recommended to generate a set of low-energy sequences numbering at least (# one-body terms) × (# AAs) + 2 × (# two-body terms) × (# AAs)², where the number of amino acids represents the average number of choices considered per site. This corresponds to ∼25,000 variants for a protein of 200 or so residues with approximately four allowed amino acids at each mutable position and ∼800 contact-derived two-body terms. All single-position clusters and all pair clusters between residues deemed to be in sufficient contact are considered. Inter-residue contacts are quantified by placing all rotamers of all amino acids at both positions, discarding rotamers with backbone clashes, and counting the fraction of remaining rotamer pairs with closest heavy-atom distances below 3 Å. The rotamer library by Lovell et al.⁷⁰ is used, and if this fraction is above 0.05, the two positions are considered in contact.
CE terms are then optimized as previously described.⁷¹ The potential Ψ enables the rapid calculation of the overall energy of any given sequence S, and the predictive quality of the potential can be assessed by cross-validation before proceeding to use the potential prospectively. We show below how to build on the CE-based model to enable efficient evaluation of a combinatorial library of variants.
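To make the form of the potential and the training-set guideline concrete, the following sketch evaluates an up to two-body potential for a sequence and computes the recommended minimum number of training sequences. The data layout (a list of per-position dictionaries and a dictionary of pair tables), the function names, and the toy energies are assumptions for illustration; taking 200 one-body clusters for a protein of about 200 residues is likewise an assumed reading of the example in the text.

```python
def ce_energy(seq, one_body, two_body):
    """Psi(S): one-body terms plus the retained two-body terms.
    Pair terms absent from the model contribute 0."""
    energy = sum(one_body[i].get(aa, 0.0) for i, aa in enumerate(seq))
    for (i, j), table in two_body.items():
        energy += table.get((seq[i], seq[j]), 0.0)
    return energy

def recommended_training_size(n_one_body, n_two_body, aas_per_site):
    """(# one-body terms) x (# AAs) + 2 x (# two-body terms) x (# AAs)^2."""
    return n_one_body * aas_per_site + 2 * n_two_body * aas_per_site ** 2

# Toy two-position model with made-up energies.
one_body = [{"A": -1.0, "G": 0.5}, {"L": -2.0, "V": -1.5}]
two_body = {(0, 1): {("A", "L"): -0.5}}

e_al = ce_energy("AL", one_body, two_body)   # -1.0 + -2.0 + -0.5
e_gv = ce_energy("GV", one_body, two_body)   # 0.5 + -1.5, no pair term
n_train = recommended_training_size(200, 800, 4)
```

With these inputs the guideline gives 26,400 sequences, in line with the roughly 25,000 quoted above.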

As discussed in the Introduction, with sufficient training data the CE approach yields remarkably good predictions of structural properties based on amino acid sequence. It is particularly appropriate for library optimization, as the accuracy is certainly good enough to be used to guide library screening, and yet it is many orders of magnitude more efficient than atomistic evaluation of individual variants, and thus enables optimization of massive libraries over massive design spaces.

Tube-based library representation and evaluation

As illustrated in the right panel of Figure 1, a combinatorial mutagenesis library is specified in terms of particular amino acids (including wild-type) at particular mutation sites; all combinations are to be constructed. The illustrated libraries have two mutated sites each, with a choice from a set of two or three amino acids incorporated at each site. More generally, if there are M sites to be mutated with amino acids T₁ = {a₁₁, a₁₂, …} to be used at the first position, T₂ = {a₂₁, a₂₂, …} at the second, …, T_M = {a_M₁, a_M₂, …} at the Mth, then the variants comprising the library are T₁ × T₂ × … × T_M = {{a₁₁, a₂₁, …, a_M₁}, …, {a₁₁, a₂₁, …, a_M₂}, {a₁₁, a₂₂, …, a_M₁}, …, {a₁₂, a₂₂, …, a_M₂}, …}. The amino acid sets T_i can specify point mutations or, if extended to multisets, degenerate oligonucleotides (i.e., a codon mixture can include multiple “copies” of the same amino acid encoded by different codons). Thus for short we call these amino acid (multi-)sets “tubes,” as introduced previously in the OCoM method.²⁹

Using the tube-based representation of a library, library design can be formulated as selecting a set of positions and corresponding set of tubes, from predetermined position-specific sets of allowed tubes. Thus it is analogous to single variant design, except using a tube alphabet rather than an amino acid or rotamer alphabet. The allowed tubes are predetermined based on possible mutations. One allowed tube contains just the wild-type, that is, no mutations at the site. When using point mutations, SOCoM enumerates for each position all available amino acid combinations that include the wild-type plus one or more allowed mutations, up to a user-specified maximum number of amino acids per tube (the prefiltering of allowed amino acids for each position keeps the combinatorics of this enumeration under control). When using degenerate oligos, SOCoM considers all degenerate codons that include the wild-type and no stop codons. Degenerate codons that include additional amino acids beyond the desired ones are eliminated if there are too many undesired ones; the default, used in this study, is to allow only desired amino acids. When multiple tubes encode the same proportions of amino acids, only the smallest one is used.
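For the point-mutation case, the enumeration of allowed tubes at one position can be sketched as follows. The function name and the representation of a tube as a frozenset are assumptions for the example, and the sketch covers only point mutations, not degenerate codons.

```python
from itertools import combinations

def enumerate_tubes(wild_type, allowed, max_size):
    """All tubes at one position: the wild-type-only tube, plus the wild type
    combined with one or more allowed mutations, up to max_size amino acids."""
    tubes = [frozenset([wild_type])]
    for k in range(1, max_size):            # k = number of mutations in the tube
        for combo in combinations(sorted(allowed), k):
            tubes.append(frozenset((wild_type,) + combo))
    return tubes

# Wild type A with three prefiltered mutations, at most three amino acids per tube.
tubes = enumerate_tubes("A", ["G", "S", "T"], max_size=3)
```

Here three allowed mutations and a maximum tube size of three give seven tubes: one wild-type-only tube, three with a single mutation, and three with two mutations. The count grows combinatorially with the number of allowed mutations, which is why prefiltering matters.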

The previous section presented an energy-based potential for evaluating individual variants. Considering that this evaluation must be done throughout optimization, it would be extremely inefficient to enumerate all the variants in a combinatorial library and separately score them, in order to compute an aggregate score for the library. Instead, following the tube potential approach of OCoM²⁹ (which was in turn based on analogous evaluations of recombination-based libraries²⁴,²⁶), SOCoM computes an average score over the variants in a library by summing position-specific tube-averaged potentials. The intuition is that each amino acid in a tube occurs the same number of times in the library (in the same combinations with the amino acids in the other tubes). To compute an average one-body score in the library, a sum can be taken of the position-specific amino acid contributions for each variant, and an average can then be taken over the variants. But since each amino acid at a position occurs in the same number of variants, the sum and average can be rearranged to instead sum position-specific average amino acid contributions. The same idea works for two-body terms. The tube-averaged potentials are thus computed as follows.
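The rearrangement can be written out explicitly. The following is a sketch under the definitions above, assuming each member of a tube is equally represented in the library (for multiset tubes, repeated amino acids are counted with multiplicity); the symbols φ_i, φ_{i,j}, and 𝓛 (the set of variants in the library) are notation chosen here for illustration.

```latex
\begin{align}
\frac{1}{|\mathcal{L}|} \sum_{S \in \mathcal{L}} \sum_{i} \psi_i(s_i)
  &= \sum_{i} \frac{1}{|T_i|} \sum_{a \in T_i} \psi_i(a)
  && \text{(each } a \in T_i \text{ occurs in } |\mathcal{L}|/|T_i| \text{ variants)} \\
\phi_i(T_i) &= \frac{1}{|T_i|} \sum_{a \in T_i} \psi_i(a) \\
\phi_{i,j}(T_i, T_j) &= \frac{1}{|T_i|\,|T_j|} \sum_{a \in T_i} \sum_{b \in T_j} \psi_{i,j}(a, b)
\end{align}
```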

An entire library T = {T₁, T₂, …, T_M} can then be evaluated by summing the tube-averaged potentials.
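The claim that summing tube-averaged terms reproduces the average energy over all variants can be checked numerically on a small example. In this sketch the function names, the tuple representation of tubes, and the toy energies are assumptions; the brute-force routine exists only to confirm the shortcut and is exactly the enumeration the tube-based evaluation avoids.

```python
from itertools import product

def tube_one_body(psi_i, tube):
    """Average one-body contribution over the amino acids in a tube."""
    return sum(psi_i[a] for a in tube) / len(tube)

def tube_two_body(psi_ij, tube_i, tube_j):
    """Average pair contribution over all amino acid pairs from two tubes."""
    total = sum(psi_ij.get((a, b), 0.0) for a in tube_i for b in tube_j)
    return total / (len(tube_i) * len(tube_j))

def library_score(tubes, one_body, two_body):
    """Library-averaged energy from tube-averaged terms (no enumeration)."""
    score = sum(tube_one_body(one_body[i], t) for i, t in enumerate(tubes))
    for (i, j), psi_ij in two_body.items():
        score += tube_two_body(psi_ij, tubes[i], tubes[j])
    return score

def brute_force_mean(tubes, one_body, two_body):
    """Reference value: score every variant explicitly, then average."""
    variants = list(product(*tubes))
    total = 0.0
    for v in variants:
        total += sum(one_body[i][a] for i, a in enumerate(v))
        total += sum(p.get((v[i], v[j]), 0.0) for (i, j), p in two_body.items())
    return total / len(variants)

# Toy library: two sites, 2 x 3 = 6 variants, made-up energies.
one_body = [{"A": -1.0, "G": 0.5}, {"L": -2.0, "V": -1.5, "I": -1.0}]
two_body = {(0, 1): {("A", "L"): -0.5, ("G", "V"): 1.0}}
tubes = [("A", "G"), ("L", "V", "I")]

fast = library_score(tubes, one_body, two_body)
slow = brute_force_mean(tubes, one_body, two_body)
```

The two values agree up to floating-point rounding, while the tube-based evaluation scales with the number of tube pairs rather than with the number of variants.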

Library optimization

The tube-based library representation and evaluation provides the means for specifying a library optimization problem: given position-specific sets of allowed tubes and a desired library size (number of mutated positions and total number of variants), choose positions and tubes yielding a set of variants with optimal average energy. Since the energy includes two-body terms, the problem has the same form as side-chain packing, which has been shown to be NP-hard⁷² and can be readily reduced to this problem. Following the lead of OCoM,²⁹ SOCoM formulates the optimization as an integer linear program for optimization via the IBM ILOG CPLEX solver. We present here only the single optimal design for each problem specification, though we note that SOCoM can readily generate a series of suboptimal library designs by invoking the optimizer once per identified design, adding constraints in subsequent rounds to ensure uniqueness of solutions.
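The problem statement can be made concrete with a toy exhaustive search. This is explicitly not the integer linear program that SOCoM solves: it enumerates every combination of tubes, which is feasible only for tiny instances, and is meant solely to show what is being optimized and constrained. All names and energies are hypothetical, and a position is counted as mutated when its tube holds more than one amino acid (every tube is assumed to contain the wild type).

```python
from itertools import product
from math import prod

def library_score(tubes, one_body, two_body):
    """Average energy over all variants, computed from tube-averaged terms."""
    score = sum(sum(one_body[i][a] for a in t) / len(t) for i, t in enumerate(tubes))
    for (i, j), psi in two_body.items():
        pair_sum = sum(psi.get((a, b), 0.0) for a in tubes[i] for b in tubes[j])
        score += pair_sum / (len(tubes[i]) * len(tubes[j]))
    return score

def best_library(allowed_tubes, one_body, two_body,
                 min_mut, max_mut, min_size, max_size):
    """Try one tube per position; keep the lowest-scoring feasible library."""
    best = None
    for choice in product(*allowed_tubes):
        n_mut = sum(len(t) > 1 for t in choice)   # positions not wild-type-only
        size = prod(len(t) for t in choice)       # number of variants
        if not (min_mut <= n_mut <= max_mut and min_size <= size <= max_size):
            continue
        s = library_score(choice, one_body, two_body)
        if best is None or s < best[0]:
            best = (s, choice)
    return best

# Toy instance: two positions with wild types A and L, made-up energies.
one_body = [{"A": 0.0, "G": -1.0, "S": 2.0}, {"L": 0.0, "V": -0.5}]
two_body = {(0, 1): {("G", "V"): 3.0}}
allowed_tubes = [
    [("A",), ("A", "G"), ("A", "S"), ("A", "G", "S")],
    [("L",), ("L", "V")],
]

small = best_library(allowed_tubes, one_body, two_body, 1, 2, 2, 4)
larger = best_library(allowed_tubes, one_body, two_body, 1, 2, 4, 4)
```

In this example the best library of two to four variants mutates only the first position, whereas requiring exactly four variants forces both positions to be mutated and worsens the average energy because of the unfavorable G/V pair term.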

The choice of tubes comprising a library is represented by a set of binary variables: x_{i,t} indicates whether or not tube t is present at position i. To account for the two-body terms in the potential, pairwise binary variables y_{i,j,t,u} are derived from these single-position variables, indicating whether or not both tube t is at i and tube u is at j. Using these two sets of binary variables to represent the composition of a library, Eq. 4 can be rewritten as:

This is the objective function, to be minimized. To ensure a valid library, the following constraints impose the selection of exactly one tube at a position (Eq. 6) and consistency of the pairwise variables with the single-position ones (Eqs. 7 and 8).

The number of mutated positions and total library size are constrained to user-specified ranges. Equation 9 ensures that between µ and M positions have a selected tube that is not just the wild-type residue. Equation 10 ensures that the product of the number of amino acids in the selected tubes (expressed as the sum of the logs, so as to be linear) is between λ and Λ.
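Taken together, the objective and constraints described in this section can be written as an integer linear program along the following lines. This is a sketch consistent with the verbal description rather than a transcription of Eqs. 5 to 10: the symbols φ_i, φ_{i,j} (tube-averaged potentials) and 𝒯_i (the allowed tubes at position i) are notation assumed here, the consistency constraints are shown in one standard form, and the size bounds are interpreted as limits on the product of tube sizes, hence logarithms on both sides.

```latex
\begin{align}
\min_{x,\,y}\quad
  & \sum_{i} \sum_{t \in \mathcal{T}_i} \phi_i(t)\, x_{i,t}
    + \sum_{i<j} \sum_{t \in \mathcal{T}_i} \sum_{u \in \mathcal{T}_j}
      \phi_{i,j}(t,u)\, y_{i,j,t,u} \\
\text{s.t.}\quad
  & \sum_{t \in \mathcal{T}_i} x_{i,t} = 1
    && \forall i \\
  & \sum_{u \in \mathcal{T}_j} y_{i,j,t,u} = x_{i,t}
    && \forall i<j,\ t \in \mathcal{T}_i \\
  & \sum_{t \in \mathcal{T}_i} y_{i,j,t,u} = x_{j,u}
    && \forall i<j,\ u \in \mathcal{T}_j \\
  & \mu \le \sum_{i} \sum_{t \in \mathcal{T}_i,\ |t|>1} x_{i,t} \le M \\
  & \log \lambda \le \sum_{i} \sum_{t \in \mathcal{T}_i} x_{i,t} \log |t| \le \log \Lambda \\
  & x_{i,t},\ y_{i,j,t,u} \in \{0, 1\}
\end{align}
```

Because every tube contains the wild type, the condition |t| > 1 selects exactly the tubes that are not wild-type-only, and since exactly one tube is chosen per position, the sum of log |t| over selected tubes equals the log of the total number of variants.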

SOCoM is implemented in Python, and the source code is freely available for academic use. It interfaces to the CPLEX IP solver, which is freely available from IBM for academic use.