IPRO: An Iterative Computational Protein Library Redesign and Optimization Procedure

Manish C Saraf; Gregory L Moore; Nina M Goodey; Vania Y Cao; Stephen J Benkovic; Costas D Maranas

doi:10.1529/biophysj.105.079277

2006 Mar 2;90(11):4167–4180. doi: 10.1529/biophysj.105.079277

PMCID: PMC1459523 PMID: 16513775

Abstract

A number of computational approaches have been developed to reengineer promising chimeric proteins one at a time through targeted point mutations. In this article, we introduce the computational procedure IPRO (iterative protein redesign and optimization procedure) for the redesign of an entire combinatorial protein library in one step using energy-based scoring functions. IPRO relies on identifying mutations in the parental sequences, which when propagated downstream in the combinatorial library, improve the average quality of the library (e.g., stability, binding affinity, specific activity, etc.). Residue and rotamer design choices are driven by a globally convergent mixed-integer linear programming formulation. Unlike many of the available computational approaches, the procedure allows for backbone movement as well as redocking of the associated ligands after a prespecified number of design iterations. IPRO can also be used, as a limiting case, for the redesign of a single or handful of individual sequences. The application of IPRO is highlighted through the redesign of a 16-member library of Escherichia coli/Bacillus subtilis dihydrofolate reductase hybrids, both individually and through upstream parental sequence redesign, for improving the average binding energy. Computational results demonstrate that it is indeed feasible to improve the overall library quality as exemplified by binding energy scores through targeted mutations in the parental sequences.

BACKGROUND AND INTRODUCTION

The ability to proactively modify protein structure and function through a series of targeted mutations is an open challenge that is central in many different applications. These include, among others, enhanced catalytic activity (1–3) and stability (4,5), creation of gene switches for the control of gene expression for use in gene therapy and metabolic engineering (6,7), signal transduction (8,9), genetic recombination (10), motor protein function, and regulation of cellular processes (see Bishop et al. (11) for a review). This task is complicated by the fact that proteins rely on complex networks of subtle interactions to enable function (12–14). Therefore, the effect of a mutation is difficult to assess a priori requiring the capture of its direct or indirect effects on many neighboring amino acids. As a result, most protein engineering paradigms involve the synthesis and screening of multiple protein candidates (protein library) as a way to enhance the odds of identifying proteins with the desired functionality level. These directed evolution design paradigms (15–20) typically involve juxtaposition of repeated library generation and screening (Fig. 1). On the other hand, most computational approaches for guiding protein design are focused on the downstream redesign of single parental sequences or promising hybrids (Fig. 1). Notable exceptions include the work of Bogarad and Deem (21) and efforts by Saven (22) that describe computational methods for protein library design.

Figure 1. (a) Promising hybrid sequences from the library are selected for downstream redesign that involves either random or site-directed mutagenesis. (b) Illustration of the upstream parental sequence redesign. Note that the mutations in the parental sequences propagate downstream into the combinatorial library, effectively designing the combinatorial library at once and thereby improving the overall quality of the library.

A number of computational models and techniques have been developed (see Moore and Maranas (23) for review) to aid in the in silico evaluation of protein redesign candidates. Typically these techniques attempt to find single or multiple amino acid sequences that are compatible with a given three-dimensional structure specific to a targeted function (e.g., enzymatic activity). The protein fold is usually represented by the Cartesian coordinates of its backbone atoms, which are fixed in space so that the degrees of freedom associated with backbone movement are neglected. More recent approaches (24–29) allow for some backbone movement. Candidate protein designs are generated by selecting amino acid side chains (using atomistic detail) along the backbone design scaffold. For simplicity, side chains are usually only permitted to assume a discrete set of statistically preferred conformations referred to as rotamers (see Dunbrack (30) for a review of current rotamer libraries). Thus, a protein design consists of both a residue and a rotamer assignment for each amino acid position. To evaluate how well a possible design fits a given fold, rotamer/backbone and rotamer/rotamer interaction energies for all the rotamers in the rotamer library are tabulated. These energies are approximated using standard force fields (e.g., CHARMM (31), DREIDING (32), AMBER (33), and GROMOS (34)). Scoring functions customized for protein design (35–37) (see Gordon et al. (38) for a review) typically include van der Waals interactions, hydrogen bonding, electrostatics, and solvation, along with entropy-based penalty terms for flexible side chains (e.g., arginine) (39–42). Because activity level or other performance objectives are very difficult to compute directly, alternative surrogates of hybrid fitness, such as stability or binding affinity, are employed in most studies. The use of these indirect objectives further strengthens the case for designing a combinatorial library rather than a single hybrid to improve the chances of success.

Even for a small 50-residue protein, an enormous number (i.e., 153⁵⁰ ≈ 10¹⁰⁹ assuming a 153-rotamer library (43)) of designs is possible. Both stochastic and deterministic search strategies have been used to tackle the computational challenge of finding the globally optimum design within this vast search space. Despite these challenges, a number of success stories of combinatorial design for many different applications have been reported (42,44–50) in the last few years, demonstrating the feasibility of using computations to guide protein redesign. Briefly, successes include manyfold improvements in enzyme activity and thermostability (50–52), improved enantioselectivity (53–55), enhanced bioremediation (56–58), and even the design of genetic circuits (6,7,10) and vaccines (59–61). It is increasingly becoming apparent, however, that instead of computationally generating a set of distinct protein redesigns, it is more promising to use computations to shape the statistics of an entire combinatorial library. This allows one to assess and then “steer” diversity toward the most promising regions of sequence space (62). This paradigm is more likely to succeed than constructing protein designs one at a time. At the other extreme, construction of combinatorial libraries based on mutation and/or recombination without any guidance from models/computations is a daunting task because only an infinitesimally small fraction of the diversity afforded by DNA and protein sequences can be examined, regardless of the efficiency of the screening procedure.
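The size estimate quoted above can be checked with a few lines of arithmetic. This is only an illustrative check; the variable names are ours, and the figures (153 rotamers, 50 positions) are the ones given in the text.

```python
import math

rotamers_per_position = 153   # rotamer library size quoted in the text
positions = 50                # residues in the example protein

# Work in log10 so the result is read directly as a power of ten.
log10_designs = positions * math.log10(rotamers_per_position)

# Cross-check with exact integer arithmetic: count the decimal digits.
digits = len(str(rotamers_per_position ** positions))

print(f"153^50 is about 10^{log10_designs:.1f} ({digits} decimal digits)")
```

The exponent lands just above 109, which agrees with the order of magnitude stated in the text.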

In response to these challenges, in this article we introduce a new computational procedure IPRO (iterative protein redesign and optimization) that allows for the upstream redesign of parental sequences (Fig. 1). The key idea here is that the residue changes within the parental sequences will propagate in the combinatorial library, effectively introducing mutations within the hybrid sequences in the library (see Fig. 1). Judicious selection of these mutations in the parental sequences can simultaneously relieve unfavorable interactions or clashes (63–65) within the hybrid sequences and therefore enhance the overall quality of the library in one step, mirroring the experimental protocol design. Note that even though IPRO is geared toward parental sequence redesign, it can be used, as a limiting case, for the redesign of a single or handful of individual sequences.

The key feature of the IPRO protocol is the cycling between sequence design, ligand redocking, and backbone movement of a set of sequences representative of the combinatorial library. The goal of the sequence design here is to choose mutations within the parental sequences, and therefore in the hybrid sequences, that optimize the average binding energy/score (or alternative surrogates of design objectives) of the hybrid sequences in the library. The genetic algorithm of Desjarlais and Handel (66) and the Monte Carlo minimization protocol of Kuhlman and co-workers (41) involve similar sequence design and backbone perturbation moves. However, they only allow for the design of a single sequence at a time and involve full-scale optimization over rotamers for only a local backbone perturbation. On the other hand, IPRO allows for the design of the entire combinatorial library and involves optimization over the local perturbation region using a globally convergent mixed-integer linear programming (MILP) formulation. In addition, IPRO allows for the redocking of the associated ligands (e.g., substrates, cofactors, solvent, etc.) after a prespecified number of design iterations.

In the next section, we describe in detail the IPRO procedure and introduce the globally convergent mixed-integer linear program that drives residue redesign. We also discuss the methods used for generating and identifying hybrid Escherichia coli/Bacillus subtilis dihydrofolate reductase (DHFR) and B. subtilis/Lactobacillus casei DHFR enzymes containing single crossover positions and assays for DHFR activity. Next, we provide an example application of IPRO to highlight the features and type of output obtained with IPRO. The study involves the computational identification of parental redesigns that are likely to improve a single crossover E. coli/B. subtilis DHFR combinatorial library composed of 16 hybrids (64). We conclude by discussing the implications of our results and some of the modeling and algorithmic enhancements that we are currently incorporating to further improve the IPRO framework.

MATERIALS AND METHODS

The IPRO procedure

The IPRO procedure is composed of four parts (see Fig. 2):

1. A set of hybrid sequences matching the members of the combinatorial library is generated if the library has fewer than ∼100 members. For larger libraries, only a representative sample of the diversity of the combinatorial library is considered.
2. For each hybrid sequence, an initial structure is computationally generated. This is a critical step as the efficacy of the identified redesigns depends heavily on the accuracy of the modeled structures.
3. A set of positions, ranging from a single residue position to the entire sequence length, to be targeted for redesign is compiled. Note that the larger the number of design positions, the more expansive the search space becomes, leading to higher computational requirements. Typically we only consider between 3 and 20 design positions that include residue positions within or in the neighborhood of the active site. In addition, restrictions on the type of allowable residue redesigns (e.g., hydrophobic, charged, etc.) can be imposed for each redesign position.
4. Next, a set of residue changes is identified in the parental sequences, which, upon propagation among the combinatorial library members, leads to the optimization of the average library score (e.g., binding energy or stability (35–37)). This optimization step is carried out globally using a MILP model within a local perturbation window, whereas simulated annealing is used to accept or reject the residue redesigns associated with each backbone perturbation step.

Figure 2. Four key steps involved in the IPRO procedure. Details of each of these steps are described separately in the text.

Generating a set of sequences representative of the combinatorial library

A set of hybrid sequences is selected to exhaustively or statistically represent the combinatorial library. This step begins with the sequence/structural alignment (67) of the parental sequences. A statistical description of the combinatorial library is obtained by considering the specifics of the combinatorialization protocol. For example, in case of DNA shuffling, models such as eShuffle (68) or those developed by Maheshri and Schaffer (69) can be used to estimate the library diversity. Alternatively, for an oligonucleotide ligation-based protocol such as GeneReassembly (70), SISDC (71), and degenerate homoduplex recombination (72), a statistically unbiased sample of fragment concatenations is constructed that broadly captures the diversity of the resulting combinatorial library. In the limiting case when there is only a single starting sequence to be redesigned, IPRO reverts back to the traditional single protein sequence design procedure. Note, however, that the concept of designing for the optimum of the average of a library of sequences can also find utility in this case when not a unique but rather an ensemble of putative structures is available for the protein to be redesigned. The ensemble of modeled structures then plays the role of the combinatorial library when fed to IPRO. By optimizing with respect to the ensemble average of the putative structures, a more robust redesign strategy is likely to be obtained.
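As a rough illustration of this step, the sketch below builds a single-crossover library from two aligned parents and keeps either the whole library or a random sample of it. Everything here is an assumption made for illustration: the function names, the toy sequences, the gap-free alignment, and the use of uniform sampling in place of the protocol-specific statistical models cited above. The last lines show how one mutation in a parent reaches every hybrid that inherits that position.

```python
import random

def single_crossover_hybrids(parent_a, parent_b, crossovers):
    """Head of parent_a up to each crossover index, tail of parent_b after it."""
    return [parent_a[:c] + parent_b[c:] for c in crossovers]

def representative_set(hybrids, max_size=100, seed=0):
    """Keep the full library when it is small, otherwise draw a uniform sample."""
    if len(hybrids) < max_size:
        return list(hybrids)
    return random.Random(seed).sample(hybrids, max_size)

# Toy aligned parents (hypothetical sequences, equal length, no gaps).
parent_a = "MKTAYIAKQR"
parent_b = "MRSGFLVRNE"
crossovers = range(1, len(parent_a))

library = single_crossover_hybrids(parent_a, parent_b, crossovers)
chosen = representative_set(library)

# A point mutation at index 2 of parent_a propagates to every hybrid
# whose crossover lies after that index.
mutated_a = parent_a[:2] + "W" + parent_a[3:]
mutated_library = single_crossover_hybrids(mutated_a, parent_b, crossovers)
carriers = sum(h[2] == "W" for h in mutated_library)
```

With nine crossover points the library has nine hybrids, and the mutation appears in the seven whose first fragment extends past index 2.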

Generation of starting hybrid protein structures

The initial putative structures of the hybrid proteins forming the library are obtained by splicing fragments of the parental structures consistent with the sequence of each hybrid (see Fig. 3). The coordinates of the fragment structures are taken from the structural alignment of the parental sequences. The fold at the junction point(s) typically involves a “kink” as a result of the “ad hoc” concatenation of the parental structures, which becomes even more prominent in the case of insertions. This kink is smoothed by allowing the backbone around the junction point to move. The backbone φ and ψ angles of seven residues on either side of the crossover position(s) are allowed to vary and their new positions are determined through energy minimization. In the current implementation of IPRO, we use the CHARMM (73) energy function and molecular modeling environment. Note that during the energy minimization, the bond lengths (b), bond angles (χ₁, χ₂, etc.), and internal coordinates of the side chains are restrained to their original values (b_o, χ_o) by penalizing any deviations (see Eqs. 1 and 2). The bond stretching is penalized using Hooke's law formula (Eq. 1) and the distortions in the bond angles are penalized using the harmonic function (Eq. 2). In addition, distances between certain key atoms can also be restrained using Eq. 1. Note that because less energy is required to distort an angle than to stretch a bond, the force constant associated with bond angle distortion is accordingly smaller:

(1)

(2)
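Read against the description above, the two restraint terms have the familiar harmonic form. The following is a sketch under assumptions, not a transcription: the force-constant symbols K_b and K_χ are names chosen here, and any conventional factor of 1/2 is taken to be absorbed into them. The first line corresponds to the bond-length penalty of Eq. 1 and the second to the bond-angle penalty of Eq. 2.

```latex
E_{\mathrm{bond}}  = K_b \,\left(b - b_o\right)^{2}
\\
E_{\mathrm{angle}} = K_\chi \,\left(\chi - \chi_o\right)^{2}
```

Both terms vanish at the reference geometry and grow quadratically with the deviation, which is what keeps the restrained coordinates close to their original values during minimization.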

Figure 3. This figure highlights the key steps for constructing the initial structure of a hybrid protein from a set of parental structures with known crossover position(s). These involve i), backbone splicing, ii), backbone relaxation at the crossover positions, and iii), ligand redocking. These steps are repeated for different crossover positions to generate the combinatorial library.

Alternative methods to parental fragment splicing and relaxation for modeling the hybrid structures include techniques such as homology modeling (74,75) and ab initio structure prediction methods (75,76). After the structure of the hybrid protein is modeled, the missing hydrogen atoms are added to the hybrid protein in accordance with the standard procedure used in CHARMM (31). Finally, the positions of the associated ligands are identified using crystallographic data (whenever available) in conjunction with the ZDOCK docking software (77,78). Notably the ZDOCK software allows for the user-specified rough placement of the docked molecules, thus significantly reducing the computational expense of the docking calculations.

Selecting design positions

The selection of the set of positions that will be allowed to mutate (i.e., candidate redesign positions) for each of the parental sequences is largely dependent on the design objective and associated surrogate criterion. Typically, design objectives involve one or more of the following: i), protein stability, ii), binding affinity, iii), specific activity, and iv), substrate specificity. Protein stability is associated with the ability of the protein to fold correctly under a set of conditions. Generally, unfavorable interactions present within the proteins such as the electrostatic repulsion, hydrogen bond disruptions, steric clashes, or a combination of these tend to prevent these proteins from folding correctly (63). A number of structure or sequence data based (SCHEMA (79), SIRCH (65), and clashMaps (63)) and functionality based (FamClash (64)) scoring strategies can be used to quantify the extent of such unfavorable interactions in each hybrid. Residue positions that participate in a disproportionate number of such clashing interactions serve as design positions. On the other hand, when binding affinity, specificity, or specific activity is the design objective, residues within or in the neighborhood of the binding site are chosen as candidates for design. In general, the design positions are either the clashing residues, binding pocket residues, or a combination of both. In most cases, the set of candidate design positions is subsequently revised (either upward or downward) by using information, found in some cases in the literature, about the direct or indirect impact of different residues on the presence, absence, or extent of functionality.

Iterative protein optimization step

The optimization procedure of IPRO involves iterating between sequence design, backbone optimization, and ligand redocking (see Fig. 4). This iterative procedure involves six main steps as follows:

i. Backbone perturbation. Different backbone conformations are sampled by iteratively perturbing small regions of the backbone that are randomly chosen during each cycle along the length of the sequence (N). For this purpose, a segment (from one to five contiguous residues (k to k′) excluding prolines) of the protein sequence is randomly chosen for perturbation. Because the special structure of proline makes the polypeptide backbone more rigid, prolines, whenever present, are considered part of the backbone. The φ and ψ angles of the positions within the perturbation window are perturbed by up to ±5° from their current values. The perturbation (between −5° and +5°) is drawn from a Gaussian distribution with a mean of zero and a standard deviation of 1.65°. This ensures that smaller perturbations are chosen more often (64% chance that the perturbations are between −1.65° and +1.65°) compared to larger ones, which in most cases are found to result in steric clashes. Note that the backbone conformations of both parental and hybrid sequences are perturbed during each cycle. Although the perturbation positions are the same for every hybrid and parental sequence, the perturbation magnitude in the backbone angles may vary. This allows different parental and hybrid sequences to assume diverse backbone conformations to better accommodate the differing side chains.
ii. Rotamer-rotamer/rotamer-backbone energy tabulations. Given the backbone conformations determined in Step i and the rotamers and rotamer combinations permitted at each position, this step involves the calculation of the interaction energies of all rotamer-backbone and rotamer-rotamer combinations within an interaction-dependent cutoff distance (cutoff distance for van der Waals = 12 Å, hydrogen bond = 3 Å, and solvation = 9 Å). This energy tabulation must be performed separately for each hybrid and parental structure. The computational expense is reduced by only updating the parts of the tables that are affected by the current perturbation. These values are then fed as parameters to the side-chain/sequence optimization model.
iii. Side-chain/sequence optimization. This step optimizes the amino acid choices and conformations (rotamers) for the given backbone structure over a 10–15 residue window that includes the perturbation positions and five residue positions flanking it on either side (see Fig. 5). Specifically, the design positions within the perturbation region are permitted to change amino acid type, whereas the flanking residue positions (five residues on either side) can only change rotamers but not the residue type. This entails two discrete decisions: 1), identifying the choice of amino acid at any given position; and 2), selecting the rotamer of the chosen amino acid that minimizes the selected surrogate objective function. To model these discrete decisions, IPRO draws upon the MILP optimization model formulations that use binary variables to mathematically represent these discrete decisions.
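Step i above can be sketched in a few lines. The sketch makes several assumptions that the text does not settle: the bound of ±5° is enforced by redrawing out-of-range samples, "excluding prolines" is read as choosing a window that contains no proline, and all function names and the toy sequence are hypothetical.

```python
import random

MAX_SHIFT_DEG = 5.0    # hard bound on each torsion change, from the text
SIGMA_DEG = 1.65       # standard deviation of the Gaussian, from the text

def sample_shift(rng):
    """Draw one torsion shift; redraw until it lies inside the bound (assumed rule)."""
    while True:
        shift = rng.gauss(0.0, SIGMA_DEG)
        if abs(shift) <= MAX_SHIFT_DEG:
            return shift

def choose_window(residues, rng, max_len=5):
    """Pick 1 to max_len contiguous proline-free positions (0-based indices)."""
    while True:
        length = rng.randint(1, max_len)
        start = rng.randrange(0, len(residues) - length + 1)
        window = list(range(start, start + length))
        if all(residues[i] != "P" for i in window):
            return window

def perturb(torsions, window, rng):
    """Return a copy of the (phi, psi) list with the window positions shifted."""
    new = list(torsions)
    for i in window:
        phi, psi = new[i]
        new[i] = (phi + sample_shift(rng), psi + sample_shift(rng))
    return new

# Toy usage: one shared window, independent magnitudes per structure.
seq = "MKTPAYIAKQRPLG"
torsions = [(-60.0, -45.0)] * len(seq)
rng = random.Random(1)
window = choose_window(seq, rng)
parent_move = perturb(torsions, window, rng)
hybrid_move = perturb(torsions, window, rng)
```

Calling `perturb` twice with the same window mirrors the statement that the perturbed positions are shared across structures while the magnitudes differ.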

Figure 4. IPRO is an iterative protein redesign software that includes the following steps: i), A local region of the protein (1–5 consecutive residues as shown in *black circle*) is randomly selected for perturbation. The backbone torsion angles of these residues are perturbed by up to ±5°. ii), All amino acid rotamers consistent with these torsion angles are selected at each position from the Dunbrack and Cohen rotamer library (86). Rotamer-backbone and rotamer-rotamer energies are calculated for all the selected rotamers using a suitable energy function (87). iii), A mixed-integer linear programming formulation is used to select the optimal rotamer at each of these positions such that the binding energy is minimized. iv), The backbone of the protein is relaxed through energy minimization to allow it to adjust to these new side chains. v), The ligand position is readjusted with respect to the modified backbone and side chains using the ZDOCK (78) docking software. vi), The binding energy of the protein-ligand complex is evaluated and the move is accepted or rejected using the Metropolis criterion.
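The accept/reject decision in step vi follows the standard Metropolis rule, which can be sketched as below. The temperature value, the energy units (kcal/mol), and the function name are assumptions for illustration; the text does not give the annealing schedule.

```python
import math
import random

def metropolis_accept(e_old, e_new, temperature, rng, k_b=0.0019872):
    """Always accept a lower score; accept a higher one with Boltzmann probability.

    k_b is the Boltzmann constant in kcal/(mol K); energies are assumed
    to be in kcal/mol and temperature in kelvin.
    """
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / (k_b * temperature))

# Toy usage: an improving move and a strongly worsening move at 300 K.
rng = random.Random(0)
improving = metropolis_accept(-10.0, -12.0, 300.0, rng)
worsening = metropolis_accept(-10.0, 90.0, 300.0, rng)
```

In a simulated annealing run the temperature would be lowered over the course of the iterations, so that uphill moves become progressively rarer.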

Figure 5. Design positions within the perturbation region (shown in *orange*) are permitted to change amino acid type, whereas the flanking residue positions (five residues on either side shown in *green*) can only change rotamers but not the residue type. Positions outside this 10–15 residue window (*gray*) are fixed and cannot change either rotamer or residue type.
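The three position classes described above can be written down as a small labeling routine. This is a sketch with hypothetical names; it assumes 0-based indices and treats a position inside the perturbation region that is not a design candidate as rotamer-only.

```python
def classify_positions(n_residues, window, design_positions, flank=5):
    """Label each 0-based position as 'design', 'rotamer_only', or 'fixed'."""
    lo = max(0, min(window) - flank)
    hi = min(n_residues - 1, max(window) + flank)
    labels = []
    for i in range(n_residues):
        if i in window and i in design_positions:
            labels.append("design")          # residue type and rotamer may change
        elif lo <= i <= hi:
            labels.append("rotamer_only")    # residue type kept, rotamer free
        else:
            labels.append("fixed")           # neither residue type nor rotamer changes
    return labels

# Toy usage: a three-residue perturbation region in a 30-residue chain.
labels = classify_positions(30, window=[10, 11, 12], design_positions={11, 20})
```

Position 20 is a design candidate in this toy case but stays fixed in this cycle because it lies outside the current window.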

For clarity of presentation, we will first describe the MILP formulation for the special case, i.e., redesign of a single parental sequence. This description will then serve as the starting point for the more general combinatorial library design optimization formulation. In both cases, the set of allowed side-chain conformations and amino acid choices at any position is encoded within sets (R_i and R_ih, respectively), where i denotes the residue position and h denotes a hybrid sequence in the combinatorial library in case of parental sequence redesign. Positions within the perturbation window but outside the set of redesign candidates are restricted to the original amino acid type but can change their rotamer state. All other residue positions outside the perturbation window are fixed and cannot change either residue type or rotamer. As expected, the parental sequence redesign problem is much more complex than the single hybrid design. This is because a substituted residue need not assume the same rotamer conformation in each library member. In other words, the hybrids are “tied together” at the sequence level, but not necessarily at the rotamer level. Starting with the simpler MILP formulation for the design of a single hybrid sequence, we first outline the sets, parameters, and variables used in the model as described below:

Sets

Binary variables

Continuous variables

Parameters

Based on the above defined sets, variables, and parameters, the single sequence design problem (SSDP) is implemented as the following MILP formulation, which is a special case of the quadratic assignment problem (80):

(3)

(4)

(5)

(6)

(7)

The objective function (Eq. 3) here entails the minimization of the binding score between the substrate and the protein as an example. The objective function can be changed depending on the design requirements. In many cases (e.g., binding score), the objective function does not encode information about the interactions in the entire protein. Therefore, the minimization step may lead to mutations or rotamer changes that adversely affect the overall stability of the protein. Constraint Eq. 4 is included to safeguard against this by requiring that the total energy of the protein be below a prespecified cutoff value, E_cutoff. The versatility of the adopted MILP modeling description enables the incorporation of this explicit stability requirement that is absent in most other frameworks proposed for protein design/redesign. In the same spirit, additional energy-based requirements can be imposed to ensure, for instance, retention of important hydrogen bonds between a donor and an acceptor. Constraint Eq. 5 ensures that only one rotamer is selected at any given position i along the sequence. Note that the rotamers may be those of the original residue or of other residues, depending on whether or not position i is a design position. Constraint Eq. 6 prevents the selection of any rotamer at position i whose energy is high enough to preclude it from the optimal solution. This rotamer elimination procedure formalizes the “background optimization” concept proposed by Looger and Hellinga (81) and allows for eliminating rotamers that are guaranteed not to be part of the optimal solution (see Looger and Hellinga (81) for details). This concept allows us to trim down the search space a priori and therefore reduces the computational time. Constraint Eq. 7 determines which rotamers r and s are simultaneously selected at positions i and j, respectively. This is encoded with a continuous variable that is equal to one only if the binary selection variables for rotamer r at position i and for rotamer s at position j are both equal to one. This implies that the continuous variable is equal to the product of the two binary variables. These nonlinear terms are then recast into an equivalent linear form by summing over s and r, respectively, as shown below:
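The equivalence between the product form and its linear replacement can be checked exhaustively on a toy case. In the sketch below, the names `x_i`, `x_j`, and `z` are assumed stand-ins for the selection variables at positions i and j and for the pair variable; the linear constraints are taken to be that summing the pair variable over s recovers the selection at i, and summing over r recovers the selection at j. Only 0/1 values of the pair variable are enumerated, which is a simplification of the continuous case.

```python
from itertools import product

rotamers_i, rotamers_j = 3, 2   # toy rotamer counts at positions i and j

def one_hot(n):
    """All ways of selecting exactly one of n rotamers."""
    return [tuple(int(k == pick) for k in range(n)) for pick in range(n)]

def feasible(x_i, x_j, z):
    """Linear form: sums of z over s equal x_i[r], sums over r equal x_j[s]."""
    rows_ok = all(sum(z[r][s] for s in range(rotamers_j)) == x_i[r]
                  for r in range(rotamers_i))
    cols_ok = all(sum(z[r][s] for r in range(rotamers_i)) == x_j[s]
                  for s in range(rotamers_j))
    return rows_ok and cols_ok

checked = 0
for x_i, x_j in product(one_hot(rotamers_i), one_hot(rotamers_j)):
    for flat in product((0, 1), repeat=rotamers_i * rotamers_j):
        z = [flat[r * rotamers_j:(r + 1) * rotamers_j] for r in range(rotamers_i)]
        is_product = all(z[r][s] == x_i[r] * x_j[s]
                         for r in range(rotamers_i) for s in range(rotamers_j))
        # The linear constraints hold exactly when z is the product of the selections.
        assert feasible(x_i, x_j, z) == is_product
        checked += 1
```

The same conclusion extends to nonnegative continuous values: a sum of nonnegative terms that must equal zero forces every term to zero, which leaves a single entry equal to one.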

(8)

(9)

(10)

By replacing constraint Eq. 7 with constraints Eqs. 8–10, the linearity of the SSDP formulation is preserved. The complete MILP formulation for SSDP thus comprises the objective function Eq. 3 together with constraints Eqs. 4–6 and 8–10.
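To make the structure of the single sequence design problem concrete, the sketch below solves a toy instance by exhaustive enumeration, which stands in for the MILP solver and is only practical for very small cases. Several things are assumed for illustration: the binding score is taken to decompose into per-rotamer contributions, the total energy is the sum of rotamer-backbone and rotamer-rotamer terms, the rotamer elimination of Eq. 6 is omitted, and all names and numbers are hypothetical.

```python
from itertools import product

def solve_ssdp(binding, self_energy, pair_energy, e_cutoff):
    """Pick one rotamer per position, minimizing binding score with total energy <= e_cutoff.

    binding[i][r]             : assumed binding-score contribution of rotamer r at position i
    self_energy[i][r]         : rotamer-backbone energy
    pair_energy[(i, j)][r][s] : rotamer-rotamer energy for i < j
    Returns (score, choice, total_energy), or None if nothing satisfies the cutoff.
    """
    n = len(binding)
    best = None
    # One index per position enforces the one-rotamer-per-position rule (Eq. 5).
    for choice in product(*(range(len(binding[i])) for i in range(n))):
        total = sum(self_energy[i][r] for i, r in enumerate(choice))
        total += sum(pair_energy[(i, j)][choice[i]][choice[j]]
                     for i in range(n) for j in range(i + 1, n))
        if total > e_cutoff:
            continue    # stability requirement in the spirit of Eq. 4
        score = sum(binding[i][r] for i, r in enumerate(choice))
        if best is None or score < best[0]:
            best = (score, choice, total)
    return best

# Toy instance: two positions, two rotamers each, one clashing pair.
binding = [[-3.0, -1.0], [-2.0, -0.5]]
self_energy = [[1.0, 0.0], [0.5, 0.0]]
pair_energy = {(0, 1): [[10.0, 0.0], [0.0, 0.0]]}

constrained = solve_ssdp(binding, self_energy, pair_energy, e_cutoff=2.0)
unconstrained = solve_ssdp(binding, self_energy, pair_energy, e_cutoff=100.0)
```

In this toy case the best-binding pair of rotamers clashes, so the energy cutoff steers the solution to the next-best binding score. This is the role the text assigns to the stability constraint.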

Unlike the single sequence protein design formulation SSDP, the hybrid library design problem (HLDP) involves the simultaneous optimization of the hybrids (h) comprising the combinatorial library. Because the hybrid sequences in the combinatorial library are derived from the parental sequences, their amino acid composition must be restricted to the amino acid type present in the corresponding parental sequences after the targeted mutations. To this end, we introduce parameters that link the amino acid type a selected at a given position i′ in parental sequence p to those present in the hybrid sequences at the corresponding position i. In case of insertions and deletions, the positions i and i′ in the hybrid and parental sequences, respectively, may not be the same. Therefore, one needs to keep track of both the parental sequence p and what position i′ in that sequence corresponds to a given position i in a hybrid sequence h. Specifically, one parameter is equal to one if amino acid a occurs at position i′ in parental sequence p, whereas a second parameter stores the amino acid type of rotamer r at position i in hybrid h. In addition, a binary variable is introduced and set equal to one if amino acid a is selected at position i in hybrid sequence h. Unlike amino acid type changes, which are propagated throughout the entire library, rotamer choices can differ between hybrid and/or parental sequences. These new complexities give rise to the following additional sets, parameters, and variables definitions.

Sets