Skip to main content
PLOS One logoLink to PLOS One
. 2013 Jul 3;8(7):e68069. doi: 10.1371/journal.pone.0068069

Fitness Loss and Library Size Determination in Saturation Mutagenesis

Yuval Nov 1,*
Editor: Alexandre G de Brevern2
PMCID: PMC3700877  PMID: 23844158

Abstract

Saturation mutagenesis is a widely used directed evolution technique, in which a large number of protein variants, each having random amino acids in certain predetermined positions, are screened in order to discover high-fitness variants among them. Several metrics for determining the library size (the number of variants screened) have been suggested in the literature, but none of them incorporates the actual fitness of the variants discovered in the experiment. We present the results of an extensive simulation study, which is based on probabilistic models for protein fitness landscape, and which investigates how the result of a saturation mutagenesis experiment – the fitness of the best variant discovered – varies as a function of the library size. In particular, we study the loss of fitness in the experiment: the difference between the fitness of the best variant discovered, and the fitness of the best variant in variant space. Our results are that the existing criteria for determining the library size are conservative, so smaller libraries are often satisfactory. Reducing the library size can save labor, time, and expenses in the laboratory.

Introduction

Directed evolution has emerged as an indispensable vehicle for engineering biocatalysts, therapeutics, and other protein-based tools. Through iterative rounds of mutagenesis and selection that mimic natural evolution, directed evolution “breeds” proteins with improved or novel chemical properties [1]. A key concept in evolutionary biology is that of fitness – the capacity of an organism to survive and reproduce. This meaning of the term originates from the phrase “survival of the fittest,” coined by Herbert Spencer and adopted by Charles Darwin. John Maynard Smith extended the notion of the so-called “fitness landscape” of entire organisms to the molecular world of proteins [2], and more recent writers have used the term “fitness” in the context of protein engineering, to denote any desirable protein property that is subject to optimization [3][5]; we follow this usage below.

Though the very essence of directed evolution protocols depends on screening large numbers of mutants, recent works increasingly advocate relatively small, yet carefully designed libraries. Such targeted libraries enjoy a significantly higher fraction of functional mutants, compared to completely random libraries, and thus achieve remarkable experimental results. Various principles are used in the design of targeted libraries, including reduced amino-acid alphabets [6], inference on ancestral sequences [7], and combinatorial spiking [8]; for a recent review, see Goldsmith and Tawfik [9]. In this work we provide another argument in favor of smaller libraries – one which is related to an experimental procedure called saturation mutagenesis.

Saturation mutagenesis is a simple in vitro directed evolution technique, whereby the amino acids at certain preselected positions along a protein sequence are replaced by random ones, in a manner called randomization. A large number of the resulting mutants are generated to form a library, and are then screened, in the hope of discovering among them a highly improved variant of the wildtype protein. Saturation mutagenesis has been used successfully to improve a host of desirable protein properties, including thermostability, specificity, catalytic activity, enantioselectivity, binding affinity, and photostability [10][14]. The method can also be used iteratively, to probe efficiently larger portions of protein space [15].

All randomization schemes for saturation mutagenesis originate at the DNA level. Most commonly, the three wild-type nucleotides at each randomized codon are replaced with random ones. If all four nucleotides are allowed at each of the codon’s three positions – a randomization scheme known as NNN – then all 43 = 64 possible codons may be formed. Other randomization schemes use reduced codon sets, and include NNB, NNS, and NNK (where N = A/C/G/T, B = C/G/T, S = C/G, K = G/T), which are all superior to NNN randomization as they induce a lower probability of introducing a premature stop codon, as well as a more uniform probability distribution across the twenty amino acids [16]. Still better schemes eliminate completely, or almost completely, both codon bias and the stop codon problem; this can be achieved either by replacing entire wild-type codons by random ones [17], or as shown recently, by using an appropriately weighted mixture of primers [18], [19].

Regardless of the chosen randomization scheme, the larger the library, the higher the likelihood of exploring more distinct protein variants at the screening stage, and hence the higher the probability of discovering a novel, high-fitness variant. On the other hand, generating and screening large libraries is laborious, time consuming, and expensive. To balance judiciously the benefits and drawbacks associated with the library size, one needs a metric – a performance measure that can be calculated before the actual fitness of the library’s variants is observed – and understand how this metric depends on the library size.

Two well known such metrics are: (i) the probability that all possible variants are represented in the library (“probability of full coverage”); and (ii) the expected percentage of all possible variants that are represented in the library (“expected coverage”). Exact and approximate mathematical expressions for these two metrics can be found in [20][22].

Recently, another metric has been proposed: the probability that the library contains at least one of the k top-performing variants (in terms of their fitness), for some small integer k [23]. The probability that the library contains the best variant – i.e., the case k = 1 – is clearly a meaningful metric; under very mild assumptions on the distribution of fitness, this case can be shown to be the same as metric (i) above. The rationale behind this metric for k ≥2 is that often, especially when the number of all possible distinct variants is very large, there are several variants whose fitness satisfies the design requirements, and discovering any of these top-performing variants will be considered a success. Using a metric of the latter type, even with k as low as 2, allows the designer to reduce significantly the library size, and thus to save laboratory resources and labor.

However, none of the above metrics incorporates directly the actual fitness of the variants to be discovered in the process, which is arguably the most relevant metric. The goal of this work is to fill this gap, and to study in silico how the fitness of the best variant discovered in a saturation mutagenesis experiment varies as a function of the library size. We wish to emphasize that it is not the intention of this work that protein engineering experimentalists should repeat themselves any of the computational or mathematical procedures described below (running simulations, estimating parameters, etc.); rather, we aim to derive general guidelines that are applicable to a wide range of saturation mutagenesis experiments, and that can be used by those practitioners.

To attain this study’s goal, one needs a model for “protein landscape” – the relationship between a protein’s sequence and fitness. This relationship is notoriously complex, and indeed, several attempts have been made to model it mathematically. We use for this purpose the probabilistic approach, i.e., view the sequence–fitness relationship as a realization of a random phenomenon, so that the fitness of each variant, before its exact value is revealed in an experiment, is only known to have a certain distribution. Among the probabilistic models for the sequence–fitness relationship are Kauffman and Weinberger’s NK model [24], Aita and Husimi’s “rough Mount Fuji” model [25], the model of Nov and Wein [26], and machine-learning techniques such as Fox et al.’s ProSAR [27].

Our main modeling vehicle in this work is the aforementioned model of Nov and Wein; this model was devised specifically for protein design purposes, and was used successfully to improve the reduction activity of the E. coli enzyme ChrR [28], and the oxidizing activity of the toluene-4-monooxygenase (T4MO) enzyme [29]. To ground our simulation results in experimental data, we use below the parameters estimated in these studies, as well as those estimated from another empirical work [30]. For further support of our conclusions and as a control, we also use the NK model and the “rough Mount Fuji” model.

Methods

Probability

Let the randomized protein positions be labeled i = 1,…., M, and let Qa ,i be the probability that the randomization will result in amino acid a at position i. The probabilities Qa ,i are induced by the randomization scheme and the genetic code: for example, under NNN randomization at position i, since 4 out of the 64 codons encode alanine, Q Ala,i = 4/64. The probability that the randomization will result in a nonsense mutation (a premature stop codon) at the codon corresponding to amino acid i is Q stp,i = 1 − ∑a Qa ,i; depending on the randomization scheme, this probability may be zero or positive.

Denote by Ai the number of amino acids with positive probability to appear at position i, i.e., Ai = |{a : Qa ,i > 0}|. Note that Ai < 20 in some randomization schemes [6], [31]. We use the term variant space to denote the set of all possible distinct protein variants that may be formed in the experiment; the number of elements in this space is n = A 1 A 2AM. Assuming that the randomizations across the M positions are independent of each other, the probability that a random variant will have amino acid a 1 in randomized position 1, amino acid a 2 in randomized position 2, etc., is Inline graphic. Let p 1,…, pn be an enumeration of these product probabilities, corresponding to the n distinct variants in variant space. The probability that a random variant will have at least one nonsense mutation, and will thus be completely dysfunctional, is Inline graphic.

Let L be the library size, i.e., the number of protein variants whose fitness is measured, let Xv, v = 1,…, n, be the number of copies of variant v that are included in the library, and let X stp be the number of variants with a nonsense mutation. Then the random vector (X 1,…, Xn, X stp) has a multinomial distribution with parameters L, p 1,…, pn, p stp. Denote by V the (random) subset of variant space that is represented in the library: V = {v : Xv ≥ 1}. In this notation, the probability of full coverage is P(|V| = n) and the expected coverage is E(|V|/n).

Let Fv be the fitness of variant v. We denote by v * the variant with the highest fitness discovered in the experiment (i.e., the best among those that happened to be represented in the library), and by v ** the globally fittest variant in variant space (i.e., the best among all distinct variants that could possibly be generated in the experiment). Formally,

graphic file with name pone.0068069.e003.jpg (1)

There are two sources of randomness contributing to the variability of Inline graphic. The first is the aforementioned randomness of the set V, which is experimental in origin, i.e., stems from the randomness of the set of variants chosen by the experimenter to constitute the library. The second is the randomness of the mapping v Inline graphic Fv (the protein landscape), which is natural in origin: even when the sequence of a variant is known, its fitness is determined by the laws of chemistry in such an intricate way, that we resort to model it probabilistically. Only the second type of randomness plays a role in the randomness of Inline graphic. While the first type of randomness requires only elementary probability to model and analyze, the second, as stated earlier, is more intricate and challenging to study. Three approaches to modeling the relationship between sequence and fitness are outlined in the next subsection.

Assuming that fitness is a continuous characteristic, so that there are no fitness ties, and assuming that no variant may be considered a priori to be better than another, the distribution of the fitness ranking (from best to worst) across variant space is uniform over all n! permutations of its elements. Let Tk be the event “the library contains at least one of the top k variants in variant space,” and let Inline graphic be the minimal library size required to ensure a probability α of discovering at least one of the top k variants, i.e., Inline graphic. See [23] for methods to compute P(Tk) and Inline graphic.

Sequence–fitness Models

The model of nov and wein

The model of Nov and Wein [26] aims to capture the following three characteristics of the sequence–fitness relationship:

  • Wild-type dominance. Mutations in the wild type are more likely to decrease fitness, rather than to increase it, and the more mutations a variant has, the lower is its resulting fitness, on average [32][35].

  • Clustering. Given a favorable mutation at a certain position, other mutations at this position are more likely to be favorable as well. Indeed, the basic rationale behind the saturation mutagenesis approach is that certain “hotspots” are more likely than others to accommodate favorable mutations [36].

  • Additivity. The change in fitness due to multiple mutations equals approximately to the sum of the corresponding individual fitness changes [37][39]. Importantly, additivity may be weak when measured using the raw fitness values, but can become stronger if the measurements are transformed to a different scale; see below.

In this model, the raw fitness values are transformed so that the wild type, denoted by Inline graphic, has zero fitness: Inline graphic. The fitness of any other variant Inline graphic is viewed as a realization of a random variable Fv, and the collection of the random variables (F 1,…, Fn) is a realization of a random vector whose distribution is governed, in a manner described below, by four parameters: the drift m, the position variance Inline graphic, the amino-acid variance Inline graphic, and the non-additivity variance Inline graphic.

A mutation at position i, which replaces the wild-type amino acid Inline graphic with another amino acid a, contributes a random quantity Inline graphic to the fitness of v. The M means μ1,…, µM are not constants, but rather, independent random variables themselves (in a way resembling the random effect model from statistics), having a Inline graphic distribution, where m < 0. The fitness of a variant v that has amino acid a 1 in position 1, amino acid a 2 in position 2, etc., is then given by

graphic file with name pone.0068069.e018.jpg (2)

where ε1,…, εn are Inline graphic random variables, independent of each other, of the fi ,a’s, and of the µi’s.

By setting the drift m to be negative, the model captures the “wild-type dominance” characteristic. By having the 19 random variables Inline graphic at each position i share a common random mean µi, we establish a positive correlation among them, which captures the clustering characteristic. Finally, the additive structure of (2) captures the additivity characteristic, with Inline graphic determining its magnitude.

It can be shown that the resulting collection of random fitness variables (F 1,…, Fn) is a multivariate Gaussian process, and that for two variants v and v′, having respectively the amino acids Inline graphic and Inline graphic at the randomized positions, this process satisfies

graphic file with name pone.0068069.e024.jpg
graphic file with name pone.0068069.e025.jpg (3)
graphic file with name pone.0068069.e026.jpg

where Inline graphic is the Hamming distance between v and Inline graphic (the number of positions in which v and the wild type differ); Inline graphic (the number of positions in which both v and v′ differ from the wild type); and Inline graphic (the number of positions in which v and v′ have the same mutation).

For transforming the raw fitness values into the values Fv, we use in this work mainly the logarithmic transformation (relative to the wild type), in which a raw fitness x is transformed by Inline graphic, where Inline graphic is the wild-type’s raw fitness. The scaling by Inline graphic guarantees that Inline graphic, as required by the model. This logarithmic transformation was used in the two empirical studies involving the model of Nov and Wein [28], [29].

The four parameters of the model used below were estimated using the maximum likelihood approach: Given a vector of fitness observations Inline graphic for N variants Inline graphic, the estimated parameters are those that maximize the likelihood of F, i.e., that solve the maximization problem

graphic file with name pone.0068069.e037.jpg

where Inline graphic and Inline graphic are the mean vector and covariance matrix of Inline graphic, respectively, whose entries satisfy equation (3). See [26] for more details about this estimation techniques, and [28], [29] for the method used to address sampling bias problems.

The NK Model

The NK model, introduced by Kauffman and Weinberger [24], is a spin-glass-like model of random epistatic interactions, conceived to study evolutionary adaptive walks in protein space. According to this model (and using its original notation, which is different from that used in the previous subsection), the total fitness of a protein variant is the average Inline graphic of the fitness contributions Wi from the protein’s N positions. The fitness contribution of each position is a U(0, 1) random variable that depends not only on the amino acid at that position, but also on the amino acids at K < N other positions.

The value of K determines the ruggedness of the protein landscape: When K = 0, the fitness contributions from the N positions are independent of each other, there is a strong positive correlation between the fitness of neighboring variants (i.e., variants whose sequences differ by a single amino acid), and the resulting smooth landscape contains a single peak. At the other extreme, when K = N − 1, the fitness values of all 20N variants in protein space are independent of each other, resulting in a highly rugged landscape with many local peaks.

The K positions influencing the fitness contribution of each position i may be either randomly chosen, or the nearest K positions flanking position i in the primary sequence. We follow the simulations of Kauffman and Weinberger, and use the latter option.

The Rough Mount Fuji Model

According to the “rough Mount Fuji” model of Aita and Husimi [26], and using its original notation, the fitness of a variant having sequence P is the sum W PP, where W P is an additive term and ωP is a non-additive term. The non-additive term is modeled as a N(0, ϱ2) random variable, and the additive term is the sum of the fitness contributions from the protein’s ν positions: Inline graphic, where Inline graphic is the jth amino acid in the sequence P, and the fitness contribution of amino acid αi at position j is wji) = εi/10, i = 0, 1,…, 19. The parameter ε < 0 is the mean fitness contribution resulting from placing an “incorrect” amino acid (i.e., i ≠ 0) at each position.

By placing the “correct” amino acid (i = 0) at each of the ν positions, the additive term W P attains its maximal value, zero. However, the final fitness of a variant having only “correct” amino acids (i.e., with Inline graphic for all j) may not be the global optimum, because of the non-additive terms ωP. The ratio |ε|/ϱ is a measure of smoothness of the fitness landscape, with low values corresponding to a rugged landscape and high values corresponding to a smooth one.

Computation and Simulation

The Inline graphic values and the “full coverage” probabilities were computed by a C computer program called TopLib; see stat.haifa.ac.il/∼yuval/toplib for documentation. The simulations involving the sequence–fitness models were coded in the R programming language (The R Foundation for Statistical Computing, www.r-project.org); each simulation was based on 5000 replications. When using the model of Nov and Wein, the simulated landscapes were initially generated following equation (2) and then exponentially transformed, so that the resulting values need to be logarithmically transformed to fit the model.

Results

The goal of this work is to study the relationship between the library size L and the loss of fitness in a saturation mutagenesis experiment, i.e., the difference between the fitness of the globally fittest variant, Inline graphic, and the fitness of the best variant discovered in the experiment, Inline graphic (see (1)).

Both Inline graphic and Inline graphic depend on the sequence–fitness model used, and in particular, on the model’s parameters; Inline graphic (but not Inline graphic) depends also on the randomization scheme (NNN, NNK, etc.). We now explore, via computer simulations, how the loss of fitness depends on L under the various models and parameters.

Studying Loss of Fitness

We first study how Inline graphic converges to Inline graphic as L increases. Figure 1 depicts this convergence under NNK randomization of M = 1, 2 and 3 positions; the underlying probabilistic model is that of Nov and Wein, with the parameters estimated in the T4MO study [29]: the variance parameters are Inline graphic, Inline graphic, Inline graphic, and the drift parameter was taken to be the median of the values used in the study, m = −0.3. Still in accordance with that study, it is assumed that the raw activity values need to be logarithmically transformed in order to meet the model’s probabilistic assumptions.

Figure 1. Expected fitness as a function of library size.

Figure 1

Expected fitness of the best variant discovered in the experiment, Inline graphic, as a function of the library size L, when randomizing M = 1, 2, 3 NNK positions. The horizontal dashed lines denote the expected fitness of the best variant in variant space, Inline graphic. The dotted lines correspond to the library sizes required to ensure a 0.95 probability of discovering the top variant (Inline graphic), any of the top two (Inline graphic), and any of the top three (Inline graphic). The solid line correspond to Inline graphic, the library size required to reach 95% of the distance from the lowest point of the curve (at L = 1) to Inline graphic. Simulations are based on the model of Nov and Wein, with the T4MO parameter and a logarithmic transformation.

In each of the three panels of Figure 1, the horizontal upper dashed line intersects the vertical axis at Inline graphic, and thus serves as an upper bound for the curve of Inline graphic. The three dotted lines correspond to Inline graphic, Inline graphic, and Inline graphic, i.e., to the library sizes required to ensure a 0.95 probability of discovering the best variant (rightmost line), any of the top two variants (middle), and any of the top three (left). The solid line corresponds to a library size which we denote by Inline graphic: the size required to reach 95% of the distance from the expected fitness of a single randomized variant (i.e., the lowest point of the curve, corresponding to L = 1) to Inline graphic. This Inline graphic serves as a rough indication for a reasonable library size, which nearly exhausts, on average, the fitness that may be extracted from the fitness landscape.

It is evident that as L increases, the difference between Inline graphic and Inline graphic becomes negligible. For example, when randomizing M = 2 positions (middle panel of Figure 1), a library size of about 1,000 appears to be satisfactory: When Inline graphic variants, we have Inline graphic, a loss of about 0.8% relative to Inline graphic, so the requirement P(T 1) = 0.95 seems overly stringent. When Inline graphic or Inline graphic, we have Inline graphic (3.7% loss) or 4.72 (6.9% loss), respectively. When Inline graphic, we have Inline graphic = 4.86 (4.1% loss), which indeed appears reasonable, though it is up to the experiment designer to decide what magnitude of loss is considered negligible. The library size required to achieve a 0.95 probability of “full coverage” (metric (i) mentioned in the Introduction) is 8,128 – almost an order of magnitude more than the ∼1,000 that appear to suffice, and is therefore extremely wasteful.

Figure 1 also shows that as expected, Inline graphic increases sharply in absolute numbers as the number of randomized positions increases, from 50 to 842 to 15,562, corresponding to M = 1, 2, 3, respectively. However, Inline graphic decreases relative to the landmarks Inline graphic: when M = 1, Inline graphic is about half way between Inline graphic and Inline graphic; when M = 2, Inline graphic is close to Inline graphic; and when M = 3, Inline graphic is about half way between Inline graphic and Inline graphic.

For the remainder of this work we fix the number of randomized positions at M = 2, and study how the choice of model and its parameters influence the difference between Inline graphic and Inline graphic, as a function of L.

The only other study in which all four parameters of the model of Nov and Wein were estimated is [26]. That study analyzed data which was used to improve the binding affinity of an αvβ3-specific humanized monoclonal antibody called Vitaxin [30]. The estimated parameters are m = −1.35, Inline graphic, Inline graphic, Inline graphic, and the raw data were logarithmically transformed. The left panel of Figure 2 shows again the convergence of Inline graphic to Inline graphic when randomizing two NNK positions, this time under the Vitaxin parameters. In this case, Inline graphic =  932, slightly above Inline graphic =  875. The main factor contributing tothis moderate increase in Inline graphic, relative to the corresponding value under the T4MO parameters (middlepanel of Figure 1), is the higher proportion of the non-additivity variance from the total fitness variance:when randomizing two positions, this proportion is Inline graphic =  875. The main factor contributing tothis moderate increase in Inline graphic (see equation (3)) inthe Vitaxin parameters, whereas it is only 0.034 in the T4MO parameters. To demonstrate the influenceof high non-additivity variance, we repeated the T4MO simulation under the (unrealistic and extreme) parameterization in which this proportion is 1, without changing the marginal distribution of the fitnessvalues. This was done by keeping m  =  −0.3, setting Inline graphic, and letting Inline graphic, which is thefitness variance of an arbitrary variant (with two randomized positions) under the T4MO parameters.The results are shown in the right panel of Figure 2; as expected, Inline graphic has significantly increased, to 1,277.It can be shown mathematically, using Slepian's inequality [40], that a high proportion of non-additivityvariance yields more conservative estimates for the library size, so that the right panel of Figure 2 is themost challenging scenario.

Figure 2. Expected fitness as a function of library size, other landscape parameters.

Figure 2

Expected fitness of the best variant discovered in the experiment, Inline graphic, as a function of the library size L, when randomizing two NNK positions. Left: Vitaxin parameters. Right: same marginal fitness distribution as with the T4MO parameters, but the proportion of non-additivity variance equals 1.

We now turn to modeling the fitness landscape by models other than the model of Nov and Wein. Figure 3 shows the results when randomizing two NNK positions using the NK model, with parameters N = 24 (the median value of the N’s used in the simulations in [24]) and K = 6 (the median value of the K’s for N = 24, in that study). Given K, it still remains to determine the degree of overlap betweenthe K+ 1 positions influencing the first randomized position, and the K + 1 influencing the second. Weconsider therefore three scenarios: maximal overlap of K  =  6 positions, moderate overlap of 3 positions,and no overlap. The resulting values of Inline graphic are 446, 491, and 358, respectively.

Figure 3. Expected fitness as a function of library size, the NK model.

Figure 3

Expected fitness of the best variant discovered in the experiment, Inline graphic, as a function of the library size L, when randomizing two NNK positions, under the NK model. Parameters are N = 24 and K = 6. The number of overlapping positions among those influencing the fitness contributions from the two randomized positions is 6, 3, and 0. Given K, it still remains to determine the degree of overlap between the K +1 positions influencing the first randomized position, and the K +1 influencing the second. We consider therefore three scenarios: maximal overlap of K = 6 positions, moderate overlap of 3 positions, and no overlap. The resulting values of Inline graphic are 446, 491, and 358, respectively.

Figure 4 shows the results when modeling the fitness landscape through the rough Mount Fuji model. The parameters are ν = 60, ε = −1, and ϱ = 0.5, 1, and 2 (the values used in [25]). The resulting valuesof Inline graphic are 543, 663, and 644, corresponding to the three values of ϱ.

Figure 4. Expected fitness as a function of library size, the rough Mount Fuji model.

Figure 4

Expected fitness of the best variant discovered in the experiment, Inline graphic, as a function of the library size L, when randomizing two NNK positions, under the rough Mount Fuji model. Parameters are ν = 60, ε = −1, and ϱ = 0.5, 1, 2.

Thus, under both the NK model and the rough Mount Fuji model, Inline graphic is significantly smaller than the value 842, obtained under the model of Nov and Wein, so the latter model is more conservative. All the results presented so far concerned only mean values. We define the percent loss of fitness to be.

graphic file with name pone.0068069.e115.jpg (4)

which is a random variable taking values in the set [0, 100], and whose distribution strongly depends on the library size L. Clearly, lower values of Λ are desirable, with Λ = 0 corresponding to the (ideal) event Inline graphic.

Figure 5 shows the loss probabilities Inline graphic for Inline graphic, and 20% (under the model of Nov and Wein, T4MO parameters, and a logarithmic transformation). For example, when the library size is Inline graphic, there is a 0.069 probability that the fitness of the best variant discovered will be lower than that of the best variant in variant space by more than 20% (the dotted lines).

Figure 5. Probability of losing more than 5%, 10%, 20% of fitness.

Figure 5

The probability that the fitness of the best variant discovered Inline graphic will be at least 5%, 10%, and 20% lower than that of the best variant in variant space Inline graphic. Randomization, fitness model, parameters, and transformation as in Figure 1.

Figure 6 shows histograms depicting the distribution of Λ for three library sizes: Inline graphic, Inline graphic, and Inline graphic. The spiking, leftmost column in each of the three histograms corresponds almost in its entirety to the case Λ = 0. As expected, the height of this column increases as L increases, and the distribution reflected by the remaining columns becomes stochastically closer to zero.

Figure 6. Distribution of loss of fitness as a function of library size.

Figure 6

Distribution of the loss of fitness for three library sizes: Inline graphic, Inline graphic, and Inline graphic. Randomization, fitness model, parameters, and transformation as in Figure 1.

Discussion

It is difficult to establish universal, sharp guidelines for the optimal library size in a saturation mutagenesis experiment, as the requirements from the design process, the screening costs, and the fitness landscape of the specific protein at hand vary from one experiment to another. However, judging from the results of the extensive simulation study presented above, it seems that when randomizing two positions, a rough yet reasonable rule of thumb is to choose a library size no larger than a point about half way between Inline graphic and Inline graphic, i.e., between the library size required to ensure a 0.95 probability of discovering the best variant, and that of discovering any of the top two variants. Using a similar methodology, one can establish similar rough guidelines for library size determination when randomizing M ≠ 2 positions. We have done so (results not shown) for the cases M = 1 and M = 3: When randomizing a single position, we recommend a library no larger than about Inline graphic variants, and when randomizing three positions, about Inline graphic variants.

These recommendations are conservative, as they are based on the model that provides the largest library sizes among the three models considered (the model of Nov and Wein), and on the most challenging scenario from a wide range of model parameters (right panel of Figure 2).

In [29] it was suggested that the logarithmic transformation is perhaps “too concave” for converting the raw fitness measurements, and that to achieve a better fit to the model of Nov and Wein, a power transformation such as the Box–Cox transformation [41] with power 0 < λ ≤1 should be used instead (such transformations are used routinely in statistics to improve model fit, and λ has no physicochemical interpretation). However, since a Box–Cox transformation with 0 < λ ≤1 is “less concave” than the logarithmic transformation, the resulting library sizes will be smaller; thus, assuming that a logarithmic transformation is required, as was done in this work, is again conservative.

The library sizes we recommend are lower than those actually chosen by some practitioners, sometimes by a very large margin. In two unrelated experiments, each involving randomization of a single NNS position, Boersma et al. [42] screened 1,250 variants and Maeda et al. [43] screened 360 variants, where ∼80 variants seem to suffice. Similarly, Champion et al. [44] screened ∼8,000 variants when randomizing two NNN positions, where ∼2,000 seem to suffice.

One may use the approach described in this work to compare the efficiency of the various randomization schemes. We have done so, and the results are consistent with previously published findings [16], [23]: the most efficient schemes are those that induce a 1/20 probability for each of the amino acids [17], [18], next is NNK (or NNS, which is identical in this sense), then NNB, and the least efficient scheme is NNN.

Another approach to protein engineering, which often complements directed evolution, is rational design. Here, scientists use elaborate computational algorithms and detailed structural knowledge of the protein at hand, in order to introduce targeted mutations that are predicted to improve the protein's fitness. When taking a pure rational design approach, library size is minimal, as only the targeted mutants need to be generated via site-directed mutagenesis. Often, rational design and directed evolution are combined into the so-called semi-rational approach, whereby computational tools dictate which positions are best to be explored by saturation mutagenesis [45]. The library size analysis presented in this work is relevant to such experiments. Some rational design studies aim at designing highly stable, “ideal” protein structures; such structures may be used as scaffolds for the next generation of engineered proteins, over which saturation mutagenesis could be employed to focus the desired function [46].

Conclusions

Our simulation results indicate that existing criteria and practices for determining the library size in saturation mutagenesis experiments are often too conservative: smaller libraries, sometimes by an order of magnitude, suffice to exhaust almost all the fitness that the fitness landscape has to offer. This observation is in line with other recent works, which advocate the use of “small but smart” libraries [9]. The reduction in library size carries clear benefits in laboratory expenses, time, and labor.

Acknowledgments

I thank three referees whose comments significantly improved this paper.

Funding Statement

The author has no support or funding to report.

References

  • 1. Romero PA, Arnold FH (2009) Exploring protein fitness landscapes by directed evolution. Nature Reviews Molecular Cell Biology 10: 866–876. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Maynard Smith J (1970) Natural selection and the concept of a protein space. Nature 225: 563–564. [DOI] [PubMed] [Google Scholar]
  • 3. Dandekar T, Argos P (1992) Potential of genetic algorithms in protein folding and protein engineering simulations. Protein Engineering 5: 637–645. [DOI] [PubMed] [Google Scholar]
  • 4. Voigt CA, Kauffman S, Wang Z-G (2001) Rational evolutionary design: the theory of in vitro protein evolution. Advances in protein chemistry 55: 79–160. [DOI] [PubMed] [Google Scholar]
  • 5. Reetz MT (2013) The Importance of Additive and Non-Additive Mutational Effects in Protein Engineering. Angewandte Chemie 52: 2658–2666. [DOI] [PubMed] [Google Scholar]
  • 6. Reetz MT, Kahakeaw D, Lohmer R (2008) Addressing the numbers problem in directed evolution. ChemBioChem 9: 1797–1804. [DOI] [PubMed] [Google Scholar]
  • 7. Alcolombri U, Elias M, Tawfik DS (2011) Directed Evolution of Sulfotransferases and Paraoxonases by Ancestral Libraries. Journal of Molecular Biology 411: 837–853. [DOI] [PubMed] [Google Scholar]
  • 8. Herman A, Tawfik DS (2007) Incorporating Synthetic Oligonucleotides via Gene Reassembly (ISOR): a versatile tool for generating targeted libraries. Protein Engineering, Design and Selection 20: 219–226. [DOI] [PubMed] [Google Scholar]
  • 9. Goldsmith M, Tawfik DS (2013) Enzyme Engineering by Targeted Libraries. Methods in Enzymology 523: 257–283. [DOI] [PubMed] [Google Scholar]
  • 10. Prasad S, Bocola M, Reetz MT (2011) Revisiting the lipase from Pseudomonas aeruginosa: Directed evolution of substrate acceptance and enantioselectivity using iterative saturation mutagenesis. ChemPhysChem 12: 1550–1557. [DOI] [PubMed] [Google Scholar]
  • 11. Yeung YA, Leabman MK, Marvin JS, Qiu J, Adams CW, et al. (2009) Engineering human IgG1 affinity to human neonatal Fc receptor: Impact of affinity improvement on pharmacokinetics in primates. The Journal of Immunology 182: 7663–7671. [DOI] [PubMed] [Google Scholar]
  • 12. Williams GJ, Goff RD, Zhang C, Thorson JS (2008) Optimizing glycosyltransferase specificity via “hot spot” saturation mutagenesis presents a catalyst for novobiocin glycorandomization. Chemistry and biology 15: 393–401. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Feingersch R, Shainsky J, Wood TK, Fishman A (2008) Protein engineering of toluene monooxygenases for synthesis of chiral sulfoxides. Applied and Environmental Microbiology 74: 1555–1566. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Shaner NC, Lin MZ, McKeown MR, Steinbach PA, Hazelwood KL, et al. (2008) Improving the photostability of bright monomeric orange and red fluorescent proteins. Nature Methods 5: 545–551. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Reetz M, Carballeira J (2007) Iterative saturation mutagenesis (ISM) for rapid directed evolution of functional enzymes. Nature protocols 2: 891–903. [DOI] [PubMed] [Google Scholar]
  • 16. Patrick WM, Firth AE (2005) Strategies and computational tools for improving randomized protein libraries. Biomolecular engineering 22: 105–112. [DOI] [PubMed] [Google Scholar]
  • 17. Hughes MD, Nagel DA, Santos AF, Sutherland AJ, Hine AV (2003) Removing the redundancy from randomised gene libraries. Journal of Molecular Biology 331: 973–979. [DOI] [PubMed] [Google Scholar]
  • 18. Tang L, Gao H, Zhu X, Wang X, Zhou M, et al. (2012) Construction of “small-intelligent” focused mutagenesis libraries using well-designed combinatorial degenerate primers. BioTechniques 52: 149–157. [DOI] [PubMed] [Google Scholar]
  • 19.Kille S, Acevedo-Rocha CG, Parra LP, Zhang ZG, Opperman DJ, et al.. (2013) Reducing codon redundancy and screening effort of combinatorial protein libraries created by saturation mutagenesis. ACS Synthetic Biology 2(2) 83–92. [DOI] [PubMed]
  • 20. Patrick WM, Firth AE, Blackburn JM (2003) User-friendly algorithms for estimating completeness and diversity in randomized protein-encoding libraries. Protein Engineering 16: 451–457. [DOI] [PubMed] [Google Scholar]
  • 21. Bosley AD, Ostermeier M (2005) Mathematical expressions useful in the construction, description and evaluation of protein libraries. Biomolecular Engineering 22: 57–61. [DOI] [PubMed] [Google Scholar]
  • 22. Firth AE, Patrick WM (2008) GLUE-IT and PEDEL-AA: new programmes for analyzing protein diversity in randomized libraries. Nucleic Acids Research 36: W281–W285. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Nov Y (2012) When second best is good enough: Another probabilistic look at saturation mutagenesis. Applied and Environmental Microbiology 78: 258–262. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Kauffman SA, Weinberger ED (1989) The NK model of rugged fitness landscapes and its application to maturation of the immune response. Journal of Theoretical Biology 141: 211–245. [DOI] [PubMed] [Google Scholar]
  • 25. Aita T, Husimi Y (2000) Adaptive walks by the fittest among finite random mutants on a Mt. Fuji-type fitness landscape – II. Effects of small non-additivity. Journal of Mathematical Biology 41: 207–231. [DOI] [PubMed] [Google Scholar]
  • 26. Nov Y, Wein LM (2005) Modeling and analysis of protein design under resource constraints. Journal of Computational Biology 12: 247–282. [DOI] [PubMed] [Google Scholar]
  • 27. Fox RJ, Davis SC, Mundorff EC, Newman LM, Gavrilovic V, et al. (2008) Improving catalytic function by ProSAR-driven enzyme evolution. Nature Biotechnology 25: 338–344. [DOI] [PubMed] [Google Scholar]
  • 28. Barak Y, Nov Y, Ackerley DF, Matin A (2008) Enzyme improvement in the absence of structural knowledge: a novel statistical approach. The ISME Journal 2: 171–179. [DOI] [PubMed] [Google Scholar]
  • 29. Brouk M, Nov Y, Fishman A (2010) Improving biocatalyst performance by integrating statistical methods into protein engineering. Applied and Environmental Microbiology 76: 6397–6403. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Wu H, Beuerlein G, Nie Y, Smith H, Lee B, et al. (1998) Stepwise in vitro affinity maturation of Vitaxin, an αvβ3-specific humanized mAb. Proceedings of the National Academy of Science 95: 6037–6042. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Reetz M, Wu S (2008) Greatly reduced amino acid alphabets in directed evolution: making the right choice for saturation mutagenesis at homologous enzyme positions. Chemical Communications : 5499–5501. [DOI] [PubMed]
  • 32. Arnold FH (1996) Directed evolution: Creating biocatalysts for the future. Chemical Engineering Science 51: 5091–5102. [Google Scholar]
  • 33. Sniegowski PD, Gerrish PJ, Lenski RE (1997) Evolution of high mutation rates in experimental populations of E. coli. Nature 387: 703–705. [DOI] [PubMed] [Google Scholar]
  • 34. Voigt CA, Kauffman S, Wang ZG (2001) Rational evolutionary design: the theory of in vitro protein evolution. Advances in Protein Chemistry 55: 79–160. [DOI] [PubMed] [Google Scholar]
  • 35. Drummond DA, Iverson BL, Georgiou G, Arnold FH (2005) Why high-error-rate random mutagenesis libraries are enriched in functional and improved proteins. Journal of Molecular Biology 350: 806–816. [DOI] [PubMed] [Google Scholar]
  • 36. Pavelka A, Chovancova E, Damborsky J (2009) HotSpot wizard: a web server for identification of hot spots in protein engineering. Nucleic Acids Research 37: W376–W383. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Skinner MM, Terwilliger TC (1996) Potential use of additivity of mutational effects in simplifying protein engineering. Proceedings of the National Academy of Sciences 93: 10753–10757. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Benos PV, Bulyk ML, Stormo GD (2002) Additivity in proteindna interactions: how good an approximation is it? Nucleic Acids Research 30: 4442–4451. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Carneiro M, Hartl DL (2010) Adaptive landscapes and protein evolution. Proceedings of the National Academy of Sciences 107: 1747–1751. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Slepian D (1962) The one-sided barrier problem for Gaussian noise. Bell System Technical Journal 41: 463–501. [Google Scholar]
  • 41. Box GEP, Cox DR (1964) An analysis of transformations. Journal of the Royal Statistical Society Series B (Methodological) 26: 211–252. [Google Scholar]
  • 42. Boersma YL, Pijning T, Bosma MS, van der Sloot AM, Godinho LF, et al. (2008) Loop grafting of bacillus subtilis lipase a: Inversion of enantioselectivity. Chemistry & Biology 15: 782–789. [DOI] [PubMed] [Google Scholar]
  • 43. Maeda T, Sanchez-Torres V, Wood T (2008) Protein engineering of hydrogenase 3 to enhance hydrogen production. Applied Microbiology and Biotechnology 79: 77–86. [DOI] [PubMed] [Google Scholar]
  • 44. Champion E, Moulis C, Morel S, Mulard LA, Monsan P, et al. (2010) A pH-based high-throughput screening of sucrose-utilizing transglucosidases for the development of enzymatic glucosylation tools. ChemCatChem 2: 969–975. [Google Scholar]
  • 45. Lutz S (2010) Beyond directed evolution – semi-rational protein engineering and design. Current Opinion in Biotechnology 21: 734–743. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Koga N, Tatsumi-Koga R, Liu G, Xiao R, Acton TB, et al. (2012) Principles for designing ideal protein structures. Nature 491: 222–227. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from PLoS ONE are provided here courtesy of PLOS

RESOURCES