Abstract
Here we present a simple statistical method to determine the phenotypic contribution of a single mutation from libraries of mutants with diverse phenotypes in which each mutant contains a multitude of mutations. The central premise of this method is that, given M phenotypic classes, mutations that do not affect the phenotype should partition among the M classes according to a multinomial distribution. Deviations from this distribution are indicative of a link between specific mutations and phenotypes. We suggest that this method will aid the engineering of functional nucleic acids, proteins, and other biomolecules by uncovering target sites for rational mutagenesis. As a proof of the principle, we show how the method can be used to deduce the individual effects of mutations in a set of 69 PL-λ promoter variants. Each of these promoters was generated by error-prone PCR and incorporated numerous mutations. The activity of the promoters was assayed using flow cytometry to measure the fluorescence of a green fluorescent protein reporter gene. Our analysis of the sequences of these mutants revealed seven positions having a statistically significant correlation with promoter activity. Using site-directed mutagenesis, we constructed point mutations for several sites, both statistically significant and insignificant, and combinations of these sites. Our results show that the statistical method correctly elucidated the phenotypic manifestations of these mutations. We suggest that this method may be useful for expediting directed evolution experiments by allowing both desired and undesired mutations to be identified and incorporated between rounds of mutagenesis.
The engineering of functional nucleic acid sequences and other biomolecules is frequently hampered by a limited understanding of how specific mutations at the genotype level are manifested in the phenotype. For some well-studied, large protein families, these relationships can be inferred; however, such cases are rare. In the absence of these relationships, we resort to strategies that explore the genotype space in a random manner, such as directed evolution.
In many cases, directed evolution of genes and other functional DNA loci is an effective approach to sample the sequence space in search of biomolecules with desirable properties (7, 15). However, the most successful examples employ a selectable fitness criterion that allows for high-throughput screening of the mutational space: sampling a large enough space eliminates the need to make rational mutations. For many proteins or functional nucleic acids, it may not be possible to link a desired phenotype with a selectable criterion fit for high-throughput screening. In the absence of such a criterion, clonal populations of mutants must be assayed individually for the phenotype of interest. This scenario might be called “assay-based” directed evolution, a situation in which the upstream mutagenesis has a higher throughput than the downstream characterization does. In this scenario, there is a premium on information linking mutational changes to their phenotypic manifestations. Further, there is a strong incentive to “learn from” the (relatively small) mutational spectra of these mutants to determine sequence-phenotype interactions and to use this information rationally in subsequent rounds of mutagenesis.
Here, we present a simple statistical method for analyzing a mutational spectrum to parse out the phenotypic manifestation of individual mutations, even when they are masked by the presence of many other mutations. Because assay-based directed evolution does not employ any prescreening or selection of clones, as is the case when a selectable marker is available, mutants are expected to have a range of phenotypes, including both increased and decreased fitness. Here, we demonstrate our method by identifying mutations in a library of mutagenized PL-λ promoters (2) that result in either increased or decreased promoter activity, and we show how to quantify the statistical confidence in these mutation-phenotype linkages.
The central premise of our method is that mutations that have no effect on mutant phenotype should partition randomly, following a multinomial distribution, between phenotypic classes. For example, consider a hypothetical experiment in which we mutagenize a protein that can fluoresce in one of three colors: red, blue, or green. After generating a library of 1,000 mutants, each bearing many point mutations, our assay reveals that 600 have the red phenotype, 300 are blue, and 100 are green. If a particular point mutation has no effect on the color, then we expect that, by chance, mutants containing this modification will be distributed between the red, blue, and green classes in a ratio of 6:3:1. That is, the mutation should not be correlated to any particular phenotypic class. More rigorously, we say that the mutations are multinomially distributed between the three classes with background frequencies of 0.6, 0.3, and 0.1.
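The expected partition in this hypothetical example can be sketched in a few lines of Python. The sketch is illustrative only: the number of mutants carrying the mutation (50) and the function name `multinomial_pmf` are assumptions introduced here, not values or code from this study.

```python
from math import factorial, prod

def multinomial_pmf(counts, freqs):
    """Probability of observing `counts` given background class frequencies `freqs`."""
    coeff = factorial(sum(counts))
    for c in counts:
        coeff //= factorial(c)          # multinomial coefficient, kept as an exact integer
    return coeff * prod(f ** c for f, c in zip(freqs, counts))

# Class sizes from the hypothetical 1,000-mutant library: red, blue, green.
class_sizes = [600, 300, 100]
library_size = sum(class_sizes)
freqs = [n / library_size for n in class_sizes]   # 0.6, 0.3, 0.1

# Assumed for illustration: 50 mutants carry the mutation of interest.
carriers = 50
expected = [carriers * n / library_size for n in class_sizes]
print(expected)                                   # expected red, blue, green counts

# Probability of seeing exactly the expected split if the mutation is neutral.
print(multinomial_pmf([30, 15, 5], freqs))
```

Even the most likely split has a modest probability, which is why the analysis relies on tail probabilities rather than on the probability of a single outcome.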
Multinomial statistics and related combinatorial statistics commonly arise in the analysis of naturally occurring mutational diversity (1, 13). For example, similar statistical analyses have been used to find functional gene domains (9), important structural RNA sites (8), and genomic loci with an overabundance of single-nucleotide polymorphisms (16). Here we apply multinomial statistics to the analysis of an artificially generated mutational landscape to parse out critical residues controlling phenotypic behavior. We show that, based on this information, mutants with sets of individual mutations can be made, and we suggest that this can be used as a method for improving directed evolution experiments by incorporating sequence information.
In what follows, we detail the construction of numerous PL-λ promoter variants, which were generated by error-prone PCR such that each mutant incorporated many point mutations. The activity of these promoters was assayed using flow cytometry to measure the fluorescence of a green fluorescent protein (GFP) reporter gene. We show how our statistical analysis revealed the phenotypic manifestation of numerous mutations. Finally, we present a validation of our method by constructing point mutations for several of the identified mutations and combinations of sites using site-directed mutagenesis. These mutations, we show, have the predicted effect on the promoter phenotype, even when removed from the background of other mutations.
MATERIALS AND METHODS
Strains and media.
Escherichia coli DH5α (Invitrogen) was used for routine transformations as described in the protocol. Assay strains were grown at 37°C with 225 rpm orbital shaking in M9 minimal medium (11) containing 5 g/liter d-glucose (M9G) supplemented with 0.1% Casamino Acids. All other strains and propagations were cultured at 37°C in LB medium. The medium was supplemented with 68 μg/ml chloramphenicol. All PCR reagents and restriction enzymes were purchased from New England BioLabs, and PCRs used Taq polymerase. M9 minimal salts were purchased from U.S. Biological, and all remaining chemicals were from Sigma-Aldrich.
Library construction.
Nucleotide analogue mutagenesis was carried out in the presence of 20 μM 8-oxo-2′-deoxyguanosine (8-oxo-dGTP) and 6-(2-deoxy-β-d-ribofuranosyl)-3,4-dihydro-8H-pyrimido-[4,5-c][1,2]oxazin-7-one (dPTP) (TriLink Biotech), using plasmid pZE-gfp(ASV) kindly provided by M. Elowitz as the template (10) along with the primers PL_sense_AatII (TCCGACGTCTAAGAAACCATTATTATC) and PL_anti_EcoRI (CCGGAATTCGGTCAGTGCGTCCTGCTGAT). Ten and 30 amplification cycles with the primers mentioned above were performed. The 151-bp PCR products were purified using the GeneClean spin kit (Qbiogene). Following digestion with AatII and EcoRI, the product was ligated overnight at 16°C and transformed into E. coli DH5α (Invitrogen) to create the mutant library. About 30,000 colonies were screened by eye from agar plates containing minimal medium and Casamino Acids, and 200 colonies, spanning a wide range in fluorescent intensity, were picked from each plate. Selected mutants were sequenced using primers PL_Sense_Seq (AGATCCTTGGCGGCAAGAAA) and PL_Anti_Seq (GCCATGGAACAGGTAGTTTTCCAG).
Library characterization.
About 20 μl of overnight cultures of library clones growing in LB broth were used to inoculate 5 ml M9G medium supplemented with 0.1% (wt/vol) Casamino Acids. The cultures were grown at 37°C with orbital shaking. After 14 h, roughly the point of glucose depletion, a culture sample was centrifuged at 18,000 × g for 2 min, and the cells were resuspended in ice-cold water. Flow cytometry was performed on a Becton-Dickinson FACScan as described elsewhere (2), and the geometric mean of the fluorescence distribution of each clonal population was calculated.
The means and standard deviations were calculated from the FL1-H distribution resulting after gating the cells based on a forward scatter-side scatter plot. A total of 200,000 events were counted to gain statistical confidence in the results.
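As a minimal sketch of the summary statistics described above, the following Python computes a geometric mean and the mean and standard deviation of log-transformed readings. The FL1-H values are made up, the natural logarithm is assumed, and the use of the population (rather than sample) standard deviation is an assumption; the cytometer software used in the study may differ on each point.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive fluorescence readings."""
    logs = [math.log(v) for v in values]
    return math.exp(sum(logs) / len(logs))

def log_stats(values):
    """Mean and population standard deviation of the log-transformed readings."""
    logs = [math.log(v) for v in values]
    mean = sum(logs) / len(logs)
    variance = sum((x - mean) ** 2 for x in logs) / len(logs)
    return mean, math.sqrt(variance)

# Made-up FL1-H readings for the gated events of one clone (illustrative only).
fl1_h = [10.0, 100.0, 1000.0]
print(geometric_mean(fl1_h))
print(log_stats(fl1_h))
```

For a log-normal population, the geometric mean equals the exponential of the mean log fluorescence, so the two summaries carry the same information.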
Construction of designed promoters.
Promoters with specific nucleotide changes were created using overlap-extension PCR and primers specifically designed to incorporate these changes. Primers were designed to divide the promoter region into thirds, and the proper primers were assembled piecewise in a PCR consisting of 95°C for 4 min, 10 cycles with an annealing temperature of 44°C, followed by 30 cycles of PCR with an annealing temperature of 60°C, and a final extension for 3 min at 72°C. Fragments were gel extracted using 2.5% agarose gels and the QIAGEN MERmaid spin kit. The isolated fragment was then linked with the final primer using the same PCR and extraction procedures. These fragments were then digested using EcoRI and AatII and ligated into the digested plasmid backbone. Sequencing was performed to verify correct constructs.
RESULTS
Generation of mutant library.
Previously, we reported on the development of a promoter library generated through the random mutagenesis of the sequence space (2). In that work, library diversity was created through error-prone PCR of the PL-TET01 promoter, a variant of the PL-λ promoter (3), which was placed upstream of a gfp gene. The promoter region contains two tandem promoters, PL-1 and PL-2, each of which contains −10 and −35 sigma factor binding sites (4, 5, 6). Furthermore, the promoter contains, at approximately the same location, an UP element that binds the C-terminal domain of the alpha subunit and a binding site for integration host factor (IHF). In addition, the PL-TET01 promoter has two tetO2 operators from the Tn10 tetracycline resistance operon (10).
Mutants in the library were analyzed using flow cytometry to measure the single-cell level of expression of GFP as a proxy for the activity of the mutagenized promoters. (A detailed schematic of the experimental procedure is shown in Fig. 1.) Promoters that had roughly log-normal fluorescence distributions (no obvious tails in the distribution or bimodal distributions) were sequenced, and those mutants that contained deletions or insertions were removed from that set. The final set comprised 69 mutant promoters, with well-behaved fluorescence distributions (single distribution with a low standard deviation), that contained only transition and transversion mutations. Notably, our error-prone PCR method introduces predominantly transitions and not transversions, except in rare cases.
FIG. 1.
Schematic of the experimental procedure. A variant of the constitutive bacteriophage PL-λ promoter (PL-TET01) was mutated through error-prone PCR to create mutant promoters. Plasmid constructs containing these promoters were used to drive the expression of gfp in E. coli. Clonal populations of promoter mutant cells were then analyzed using flow cytometry to quantify the fluorescence of GFP and output capacity of the promoter. Kan, kanamycin resistance; FACS, fluorescence-activated cell sorting; FSC, forward scatter; SSC, side scatter; Freq., frequency; Fluor., fluorescence.
Identification of critical sites.
Returning to the red, blue, and green example introduced earlier, each of these N hypothetical mutants can be classified into one of M mutually exclusive and collectively exhaustive phenotypic classes (P1, P2,…, PM) such that there are n1, n2,…, nM mutants in each class and Σni = N. Consider a subset of mutants B of size X, where X < N, comprising mutants with a particular mutation. If the mutation does not influence the phenotype of the mutants, we would expect, by chance, that there would be xi = X·ni/N mutants of type Pi. In general, the probability (Pr) that the set {x1, x2,…, xM} will take on the particular set of values {y1, y2,…, yM} is
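$$\Pr(x_1 = y_1, x_2 = y_2, \ldots, x_M = y_M) = \binom{X}{y_1, y_2, \ldots, y_M} \prod_{i=1}^{M} \left(\frac{n_i}{N}\right)^{y_i} \tag{1}$$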
where Σyi = X. In this equation, the term
$$\binom{X}{y_1, y_2, \ldots, y_M} \tag{2}$$
is the so-called multinomial coefficient, which can be equivalently written
$$\frac{X!}{y_1!\, y_2! \cdots y_M!} \tag{3}$$
The coefficient is the number of ways sets of size {y1, y2,…, yM} could be chosen from a set of size X. (For example, in the case X = 6, M = 2, y1 = y2 = 3, the coefficient is 20 because there are 20 different ways to choose two subsets of size three from a set of six.)
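The parenthetical example can be checked with a short Python sketch. The function name `multinomial_coefficient` is introduced here for illustration.

```python
from math import factorial

def multinomial_coefficient(counts):
    """X! / (y1! y2! ... yM!), where X is the sum of the class counts."""
    result = factorial(sum(counts))
    for y in counts:
        result //= factorial(y)   # each partial quotient is an exact integer
    return result

# X = 6, M = 2, y1 = y2 = 3, as in the example in the text.
print(multinomial_coefficient([3, 3]))  # 20
```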
The probability that q or more (where q ≤ X) of the B mutants would be seen in a particular class, Pi, by chance is
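$$\Pr(x_i \ge q) = \sum_{k=q}^{X} \binom{X}{k} \left(\frac{n_i}{N}\right)^{k} \left(1 - \frac{n_i}{N}\right)^{X-k} \tag{4}$$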
Equivalently, this is the P value for seeing q or more of the B mutants in class Pi. The lower the P value, the more confident we are that the B mutation is correlated with the Pi phenotype.
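Because the count of carriers falling in a single class of a multinomial is binomially distributed, this P value can be computed as a binomial upper tail. The following is a minimal sketch under that standard identity; the function name `enrichment_p_value` and the example numbers are assumptions made for illustration.

```python
from math import comb

def enrichment_p_value(q, X, n_i, N):
    """Pr(at least q of X carriers fall in a class that holds n_i of N mutants)."""
    p = n_i / N   # background frequency of the class
    return sum(comb(X, k) * p ** k * (1 - p) ** (X - k) for k in range(q, X + 1))

# Illustrative case: two equally sized classes, 4 carriers, all 4 in one class.
print(enrichment_p_value(4, 4, 1, 2))      # 0.0625

# Illustrative case from the color example: 10 carriers, 4 of them green,
# where green makes up 100 of 1,000 mutants.
print(enrichment_p_value(4, 10, 100, 1000))
```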
For this study, we divided the mutants into two phenotypic classes on the basis of their fluorescence (i.e., M = 2): the top 50th percentile and the lower 50th percentile. Figure 2 shows a detailed schematic of the statistical analysis, which is greatly simplified in this case because there are only two phenotypes. As shown in the figure, applying our statistical method to the sequence data resulted in the identification of seven nucleotide positions that are correlated with one of the two phenotypic classes in a statistically significant manner. The figure should be read clockwise from the top left, progressively showing the fluorescence distribution, mutation distribution, statistical distribution of mutations, and finally, the identified important positions in Fig. 2D in the lower left (see the legend to Fig. 2 for more detail).
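The two-class analysis can be sketched as follows. This is not the code used in the study: the data layout (one set of mutated positions per mutant), the function name `site_p_values`, the handling of an odd number of mutants, and the toy data are all assumptions. Each class is given a background frequency of 0.5, and the tail is taken toward whichever class holds more carriers.

```python
from math import comb

def site_p_values(mutations, fluorescence):
    """Per-site carrier counts and one-sided P values for a median split.

    mutations: list of sets of mutated positions, one set per mutant.
    fluorescence: matching list of mean log fluorescence values.
    Returns {position: (n_top, n_bottom, p_value)}.
    """
    order = sorted(range(len(fluorescence)), key=lambda i: fluorescence[i], reverse=True)
    top = set(order[: len(order) // 2])   # assumption: any odd mutant goes to the bottom class
    results = {}
    for pos in sorted(set().union(*mutations)):
        n_total = sum(1 for m in mutations if pos in m)
        n_top = sum(1 for i in top if pos in mutations[i])
        n_bottom = n_total - n_top
        q = max(n_top, n_bottom)
        # Binomial upper tail with p = 0.5 for each class.
        p_value = sum(comb(n_total, k) for k in range(q, n_total + 1)) / 2 ** n_total
        results[pos] = (n_top, n_bottom, p_value)
    return results

# Toy library of eight mutants (made-up data, not from the study).
fluorescence = [2.0, 1.5, 1.0, 0.5, -0.5, -1.0, -1.5, -2.0]
mutations = [{-14, -82}, {-14}, {-14}, {-14}, {-8}, {-8}, {-8}, {-8, -82}]
for position, summary in site_p_values(mutations, fluorescence).items():
    print(position, summary)
```

In this toy library, position −14 appears only in the top class and position −8 only in the bottom class, whereas position −82 splits evenly and yields an uninformative P value.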
FIG. 2.
Statistical distribution of mutations and their effects on mutant fluorescence. In panel A, the vertical axis shows the mutant number, where the mutants are sorted in descending order by their relative fluorescence. In general, the single-cell fluorescence distribution for each mutant strain was log-normally distributed. The horizontal axis shows the mean of the log relative fluorescence for each mutant strain, where the error is the standard deviation of this distribution. Reading to the right from panel A into panel B reveals the point mutations present in each mutant. For each location in a mutant (where location is indicated on the horizontal axis) that was changed via the error-prone PCR, a black dot is indicated. With only two exceptions, all of these changes are base transitions rather than transversions, so the sequence of each of the 69 clones can be inferred from the wild-type sequence shown in panel D. (All of the mutations indicated in panel B are transitions with the exception of one A-C transversion at −125 bp in clone 53 and one T-G transversion at −8 in clone 68. These were treated as though they were transitions in our analysis.) Reading down from panel B into panel C shows how mutations at a particular location partition between the two classes of mutants: the top and bottom 50th percentiles. Sites that have no effect on the fluorescence phenotype should partition equally between the two classes, i.e., they should follow a binomial distribution with P = 0.5. Sites that deviate from this distribution are labeled with a dot and are colored either green or red, corresponding to the apparent effect of a mutation at the site. For these sites, P values are indicated, where this value is the probability of seeing a distribution at least as skewed to one side. Sites that were subsequently tested experimentally (see text) are indicated with an asterisk, where the color of the asterisk denotes the expected effect of a mutation at the site.
We chose a range of sites to test experimentally, from sites with high-confidence (low P value) positive effects to those with low-confidence (P value of 0.5) negative effects (Table 1). These sites are also shown in panel D, which contains the wild-type nucleotide sequence of the promoter region that was subjected to mutation.
Site-directed mutagenesis of predicted sites.
We selected eight sites in the promoter region to test whether their phenotypic effects, as predicted by the statistical method, agreed with their observed effects when the mutations were introduced individually, without the background of other mutations. These eight mutated positions are shown in Table 1 and labeled in Fig. 2C and D. The sites were chosen to span a range of characteristics. The −8 site was predicted to have a negative effect on promoter strength with high confidence, i.e., it was statistically significant (Table 1). The −10, −28, and −123 sites were predicted to have negative effects but had moderate P values and, thus, medium to low confidence. Sites −14 and −21 were predicted to have positive effects with high confidence. The sites −82 and −96 were chosen because they had P values of exactly 0.5. Notably, there are two ways that a position could have produced an insignificant P value (i.e., a P value close to 0.5): the mutation could partition equally between the two classes, or the mutation could have been observed very few times. Mutations at both the −82 and −96 sites were observed relatively few times and seemed to partition between the top 50th percentile and bottom 50th percentile classes with equal frequency. Thus, in the absence of a statistically significant correlation, we predicted they would have no effect on the phenotype. (These observations are summarized in Table 1.)
TABLE 1.
Summary of site-directed mutagenesis loci<sup>a</sup>
| Mutation site | Predicted activity | P value | No. of observations | Confidence | Relative fluorescence | Log relative fluorescence | Agreement<sup>b</sup> |
|---|---|---|---|---|---|---|---|
| −8 | Low | <0.0001 | 22 | High | 0.036 | −3.32 | Yes |
| −10 | Low | 0.1094 | 6 | Med.<sup>c</sup> | 0.011 | −4.52 | Yes |
| −14 | High | 0.0625 | 4 | High | 1.428 | 0.35 | Yes |
| −21 | High | 0.0625 | 4 | High | 1.585 | 0.46 | Yes |
| −28 | Low | 0.3770 | 10 | Low | 0.076 | −2.58 | Yes |
| −82 | No effect | 0.5000 | 2 | Low | 0.926 | −0.08 | Yes |
| −96 | No effect | 0.5000 | 5 | Med. | 0.046 | −3.08 | No |
| −123 | Low | 0.1938 | 12 | Med. | 0.087 | −2.45 | Yes |
<sup>a</sup> The selected sites, which span a range of P values and predicted activities, were each mutated and assayed for fluorescence levels individually (Fig. 2). As shown in the table, all sites but the −96 site were in the phenotypic class predicted by our statistical method.
<sup>b</sup> Agreement between the phenotypic effect predicted by the statistical method and the observed effect when the mutations were introduced individually, without the background of other mutations.
<sup>c</sup> Med., medium.
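Several P values in Table 1 can be reproduced from the number of observations with a binomial tail at a class frequency of 0.5. In the sketch below, the split of observations between the two classes is not taken from the table; each split is an assumption chosen because it yields the listed P value, and the function name `tail_p` is introduced here.

```python
from math import comb

def tail_p(q, X, p=0.5):
    """Pr(at least q of X observations fall in one class with frequency p)."""
    return sum(comb(X, k) * p ** k * (1 - p) ** (X - k) for k in range(q, X + 1))

# (site, observations from Table 1, ASSUMED count in the favored class)
assumed_splits = [
    (-14, 4, 4),
    (-21, 4, 4),
    (-10, 6, 5),
    (-28, 10, 6),
    (-123, 12, 8),
    (-96, 5, 3),
]

p_values = {site: round(tail_p(q, X), 4) for site, X, q in assumed_splits}
for site, p in p_values.items():
    print(site, p)
```

Under these assumed splits, the computed values match the tabulated P values of 0.0625, 0.0625, 0.1094, 0.3770, 0.1938, and 0.5000, respectively.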
For each of the sites listed in Table 1, we created mutant strains incorporating transition single-nucleotide polymorphisms at the specified location. Each of these mutants was analyzed using flow cytometry to test the single-cell level of expression of GFP using the same protocols used for the parent mutant library. The fluorescence results for each mutant are shown in Table 1. In addition, for certain combinations of sites in Table 1, we created double and triple mutants (see Table 2).
TABLE 2.
Summary of double and triple mutants constructed by site-directed mutagenesis
| Mutation sites | Predicted activity | Relative fluorescence | Log relative fluorescence | Agreement<sup>a</sup> |
|---|---|---|---|---|
| −14, −21 | High | 1.924 | 0.65 | Yes |
| −14, −82 | High | 0.954 | −0.04 | No |
| −21, −82 | High | 1.433 | 0.36 | Yes |
| −96, −123 | Low | 0.274 | −1.43 | Yes |
| −82, −14, −21 | High | 0.140 | −1.97 | No |
| −8, −10, −28 | Low | 0.018 | −4.03 | Yes |
<sup>a</sup> Agreement between the phenotypic effect predicted by the statistical method and the observed effect when the mutations were introduced individually, without the background of other mutations.
DISCUSSION
As shown in Table 1, the statistical method correctly predicts the phenotypic effects of seven of the eight individual mutations that were tested. Furthermore, the phenotypic effects of the mutations with statistically significant P values were correctly predicted. For these mutations, we showed that the effect of an individual mutation on the phenotype can be parsed out from a mutational spectrum, even when the effect is obscured by a background of other mutations.
Interestingly, although most of the statistically significant mutations are near the sigma factor binding sites, two are located farther upstream of this region. The −123 site, which was not statistically significant but was tested experimentally, showed that such distal sites can participate in the regulation of transcription.
There are a few caveats to the use of our statistical method. First, the method assumes independence between mutations. That is, we assume mutated sites cannot interact. As shown in Table 2, four of six of the combination mutations had the predicted effect. The two combination mutants that had unintuitive phenotypes could be a result of interaction between sites. (Notably, the −82, −14, −21 triple mutant appeared to have a high fluorescence by visual inspection in a rich medium preculture; however, quantification of GFP activity by flow cytometry revealed consistently low measurements in the minimal medium used.)
The second caveat is that the method can require a significant number of mutants for each position: for a position to be statistically significant in our particular experiment, at least four observations were required. (This would be true for any two-phenotype mutational spectrum in which each phenotype occurs with equal prior probability.) The number of observations required scales roughly with the number of mutation types. Our mutagenesis method introduced only transitions, not transversions, which allowed us to treat each site as “mutated” or “not mutated” without loss of information. The method can be applied to cases in which all four nucleotides are present; however, roughly four times as many observations would be required to make a statistically significant correlation between a particular nucleotide (at a single position) and a phenotype. Finally, the statistical method presented here is applicable only to situations in which the method used to introduce sequence diversity does not also introduce deletions or insertions. Ignoring relatively small insertions or deletions in the analysis would not significantly bias the results of identifying critical residues (data not shown). However, rigorously, alterations would be needed to differentiate between deletions and mutations in our statistical framework. In such cases, more-complex models could be adapted, such as those used to describe the distribution and effects of naturally occurring mutations over a fitness landscape for populations under positive and negative selective pressures (12, 14).
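The dependence on the number of observations can be illustrated by computing the smallest P value attainable when every observation of a mutation falls in the same class. The sketch assumes two classes with equal prior probability; the function name `min_p_value` is introduced here, and no particular significance cutoff is assumed.

```python
def min_p_value(observations, class_frequency=0.5):
    """Smallest attainable P value: all observations fall in a single class."""
    return class_frequency ** observations

for observations in range(1, 7):
    print(observations, min_p_value(observations))
```

With four observations, the smallest attainable value is 0.0625, which is the P value listed for the −14 and −21 sites in Table 1. A position observed only two or three times cannot fall below 0.25 or 0.125, respectively, however skewed its distribution.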
Despite its caveats, this method has a significant advantage compared to deducing critical mutations using sequence data from only the best-performing mutants. Intuitively, if we were to ignore the bottom 50th percentile in Fig. 2C, we may mistakenly identify sites as associated with high fluorescence that are, in fact, evenly distributed between the two classes. That is, having sequence data for multiple phenotypes allowed us to determine, with quantifiable confidence, the effect of each individual mutation in a way that discounts artifacts of the mutagenesis method, such as a bias for mutagenizing particular loci.
Acknowledgments
We acknowledge financial support from the DuPont-MIT Alliance and the National Science Foundation, grant number BES-0331364.
We also thank the MIT biopolymers laboratory for DNA sequencing.
REFERENCES
- 1. Adams, W. T., and T. R. Skopek. 1987. Statistical test for the comparison of samples from mutational spectra. J. Mol. Biol. 194:391-396.
- 2. Alper, H., C. Fischer, E. Nevoigt, and G. Stephanopoulos. 2005. Tuning genetic control through promoter engineering. Proc. Natl. Acad. Sci. USA 102:12678-12683.
- 3. Cirino, P. C., K. M. Mayer, and D. Umeno. 2003. Generating mutant libraries using error-prone PCR. Methods Mol. Biol. 231:3-9.
- 4. Giladi, H., D. Goldenberg, S. Koby, and A. B. Oppenheim. 1995. Enhanced activity of the bacteriophage lambda PL promoter at low temperature. Proc. Natl. Acad. Sci. USA 92:2184-2188.
- 5. Giladi, H., S. Koby, G. Prag, M. Engelhorn, J. Geiselmann, and A. B. Oppenheim. 1998. Participation of IHF and a distant UP element in the stimulation of the phage lambda PL promoter. Mol. Microbiol. 30:443-451.
- 6. Giladi, H., K. Murakami, A. Ishihama, and A. B. Oppenheim. 1996. Identification of an UP element within the IHF binding site at the PL1-PL2 tandem promoter of bacteriophage lambda. J. Mol. Biol. 260:484-491.
- 7. Glieder, A., E. T. Farinas, and F. H. Arnold. 2002. Laboratory evolution of a soluble, self-sufficient, highly active alkane hydroxylase. Nat. Biotechnol. 20:1135-1139.
- 8. Johnson, M., S. Morris, A. Chen, E. Stavnezer, and J. Leis. 2004. Selection of functional mutations in the U5-IR stem and loop regions of the Rous sarcoma virus genome. BMC Biol. 2:8.
- 9. Lossos, I., R. Tibshirani, B. Narasimhan, and R. Levy. 2000. The inference of antigen selection on Ig genes. J. Immunol. 165:5122-5126.
- 10. Lutz, R., and H. Bujard. 1997. Independent and tight regulation of transcriptional units in Escherichia coli via the LacR/O, the TetR/O and AraC/I1-I2 regulatory elements. Nucleic Acids Res. 25:1203-1210.
- 11. Maniatis, T., E. F. Fritsch, and J. Sambrook. 1982. Molecular cloning: a laboratory manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.
- 12. Orr, H. 2003. A minimum on the mean number of steps taken in adaptive walks. J. Theor. Biol. 220:241-247.
- 13. Piegorsch, W., and A. Bailer. 1994. Statistical approaches for analyzing mutational spectra: some recommendations for categorical data. Genetics 136:403-416.
- 14. Rokyta, D., P. Joyce, S. Caudle, and H. Wichman. 2005. An empirical test of the mutational landscape model of adaptation using a single-stranded DNA virus. Nat. Genet. 37:441-444.
- 15. Solem, C., and P. R. Jensen. 2002. Modulation of gene expression made easy. Appl. Environ. Microbiol. 68:2397-2403.
- 16. Walker, D., J. Bond, R. Tarone, C. Harris, W. Makalowski, M. Boguski, and M. Greenblatt. 1999. Evolutionary conservation and somatic mutation hotspot maps of p53: correlation with p53 protein structural and functional features. Oncogene 18:211-218.






