α-Helical Topology Prediction and Generation of Distance Restraints in Membrane Proteins

Scott R McAllister; Christodoulos A Floudas

2008 Sep 5;95(11):5281–5295. doi: 10.1529/biophysj.108.132241

PMCID: PMC2586566 PMID: 18775963

Abstract

The field of protein structure prediction has seen significant advances in recent years. Researchers have followed a multitude of approaches, including methods based on comparative modeling, fold recognition and threading, and first-principles techniques. The structure prediction of membrane proteins, however, has received comparatively little attention. A membrane protein is characterized by a structure that extends into or through the lipid bilayer of a cell. The structure is influenced by the combination of the hydrophobic bilayer region, the direct interaction with the bilayer, and the aqueous external environment. Due to the difficulty in obtaining reliable experimental structures, accurate computational prediction of membrane proteins is of paramount importance. An optimization model has been developed to predict the interhelical interactions in α-helical membrane proteins. A database of α-helical membrane proteins of known structure and limited sequence identity is constructed to develop interaction probabilities. By then maximizing the occurrence of highly probable pairwise or three-residue interactions, subject to a number of geometrical constraints, realistic contacts can be predicted. These low-distance contacts can provide additional distance restraints for first-principles-based approaches to the tertiary structure prediction problem. The proposed approach is shown to successfully predict interhelical contacts in several membrane protein systems, including bovine rhodopsin and the recently released human β2 adrenergic receptor protein structure.

INTRODUCTION

Despite the multitude of available methods for protein structure prediction, the advances in the study of membrane proteins have not been as quick to follow. Whereas there are more than 46,000 experimentally determined structures available through the Protein Data Bank (1), only 144 proteins are membrane proteins with experimentally validated transmembrane segments (2). Due to the difficulty in obtaining reliable experimental structures, accurate theoretical prediction of membrane proteins is of paramount importance. This significance becomes even more striking given the number of membrane proteins and their role in drug development. It has been estimated that integral membrane proteins make up ∼20–30% of the total proteins across a variety of organisms (3,4). As much as 30% of commercial drugs are known to target G-protein-coupled receptors (5), a family of membrane proteins characterized by an α-helical bundle of seven helices.

Although it is difficult to crystallize membrane proteins to determine their three-dimensional structure, the analysis of membrane topology through biochemical methods is much more feasible. There have been major advances in the prediction of transmembrane regions of proteins. Due to the distinctive patterns of hydrophobic regions within the membrane and polar loop regions beyond the membrane, hydrophobicity and polarity have been used to predict these regions. These methods can be evaluated based on their ability to correctly predict the membrane-spanning regions, as well as the sidedness of a protein. One popular method, TMHMM, uses a global implementation of a hidden Markov model to make its predictions (3). Different approaches, such as MEMSAT, are based on a combined form that accounts for local level effects and incorporates them into global heuristics (6). Independent studies of these types of prediction methods have identified MEMSAT and TMHMM as high-performing methods in this area, although prediction performance was less impressive for eukaryotic proteins (7,8). Recent contributions in this area have considered combining a hidden Markov model with evolutionary information (9), combining a hidden Markov model with a molecular mechanics energy-scoring function (10), applying a support vector machine algorithm (11), and combining a variety of algorithms through a consensus approach (12).

Since a large percentage of membrane proteins form α-helical bundles, many efforts have been made to compare and contrast these proteins with soluble α-helical proteins. Membrane proteins seem to satisfy the backbone hydrogen bonds in the low dielectric environment (13). Eilers et al. have used the technique of occluded surface to demonstrate that membrane proteins have higher packing values than soluble proteins (14). Part of the reasoning behind this effect is the tendency of membrane proteins to have a higher occurrence of small amino acids, as in GxxxG or AxxxA motifs, in the helical interface (15).

By applying an atom-based probability model, Adamian and Liang were able to analyze membrane helical pairwise propensity at the helix interface (16). A major conclusion of their analysis is that membrane proteins and soluble proteins do indeed pack differently and the same pairwise interaction can have dramatically different propensities in the soluble and membrane environments. Other research in the area has shown that it is almost a rule that consecutive transmembrane helices pack against each other and these α-helices have a strong preference for antiparallel interactions (17). The recent application of interhelical three-body interactions in membrane proteins has led to unique triplet propensity values that are important for membrane protein folding and assembly (18). Gimpelev et al. found that the majority of transmembrane helix pairs could be modeled by templates from soluble helix pairs, establishing a model to sample interhelical contacts that may form in membrane proteins (19). Knowledge-based pair potentials have been developed for transmembrane helix pair configurations and were shown to have predictive power in tests of rigid docking of transmembrane helix pairs (20).

Even though membrane-spanning regions can be predicted with a reasonable level of accuracy and transmembrane helix-helix interactions have been thoroughly studied, there have been few attempts to develop a method to predict the tertiary structure of transmembrane proteins. Waldispühl and Steyaert proposed a structure prediction algorithm that combines local and global constraints to model transmembrane protein secondary and super-secondary structures (21). One research group has explored the conformations of membrane protein folds for α-helical bundles. Using an input of α-helix ranges and a set of distances between pairs of atoms, they were able to describe a method to enumerate all the possible conformations that satisfy the distances (22). This approach is especially useful as an initial step to a local refinement method, such as a custom penalty function derived from a statistical analysis of membrane protein structures (23). It should be noted that the interhelical prediction models proposed in this article could be used to develop a set of input distances for this method.

Research on computer simulations using a coarse-grained lattice model has shown initial success in predicting membrane protein structure. By applying a composite energy function to differentiate between amino acids in the membrane and those in the water, a rough estimate of the helical structure (without loops) was assembled using Monte Carlo simulations (24). Incorporated into this effort was an extension of the two-stage folding model proposed for membrane proteins (25). This model divides the α-helical membrane protein folding into two steps: inserting the helices into the membrane and then subsequently assembling the helices into the final α-helical bundle structure. A more detailed model of transmembrane protein energetics has four stages: partitioning, folding, insertion, and association (26). Determining the ΔG values for each step along this path allows for a complete thermodynamic description of the system.

A hybrid method has been developed to predict the structure of G-protein-coupled receptors (27). The protocol for this approach has five main steps:

Step 1. The TM2NDS program is used to determine the transmembrane regions by a hydropathicity scale.
Step 2. Each individual helix is constructed and optimized using torsional molecular dynamics.
Step 3. The helical axes are oriented according to an electron density map as the initial step in the assembly of the α-helical bundle.
Step 4. A coarse-grain optimization program, COARSEROT, is applied to rotate the helical orientations through all possible angles about the helical axes.
Step 5. The loop regions are added and the entire protein is subject to a final optimization step.

This method was able to predict the transmembrane region of bovine rhodopsin to ∼3 Å RMSD with inputs of only the primary sequence and the data from the electron density map.

Traditional protein structure prediction approaches have also been applied to membrane protein systems. A recent review highlights the successes and limitations of comparative modeling efforts for rhodopsin-based homology techniques (28). A notable approach that does not fall within the rhodopsin-based homology category is the PREDICT methodology (29). By iterating through a series of decoy generation and subsequent selection steps, PREDICT relies only upon the primary sequence and the structural constraints imposed by the membrane environment. Other methods that do not require homology to rhodopsin include a prediction approach that utilizes ensemble generation followed by clustering analysis (30) and another that applies a scoring function obtained through qualitative insights to pairs of transmembrane helices (31). Zhang et al. have applied their TASSER structure prediction approach to >900 G-protein-coupled receptor proteins and validated their predictions using a benchmark set of known membrane proteins (32).

The role of interhelical contacts in the overall folding process for membrane proteins is uncertain. The proposed approach in this article operates under the hypothesis that specific residue types have a higher likelihood of forming an interhelical contact than others. The goal of this article is to identify these more probable interactions and subsequently maximize their occurrence, thereby yielding the most likely interhelical contacts that can be used as distance restraints for tertiary structure prediction approaches.

METHODS

The interhelical contact prediction models of this article aim at predicting interhelical contacts between the transmembrane α-helices of membrane proteins to derive lower and upper distance bounds on these contacts for tertiary structure prediction applications. A data set of membrane proteins was compiled using a database of known structures and homology considerations. This data set of membrane proteins was used to develop pairwise and three-body interhelical contact probabilities. These probabilities serve as input to two mixed-integer linear programming approaches. One approach attempts to maximize the sum of the pairwise interhelical residue contacts. The second approach builds on the concepts of the pairwise model to maximize the three-body interhelical residue contacts.

Construction of a data set

The proteins included in the data set were selected from the Membrane Protein Topology Database (MPTopo), assembled by researchers in the Stephen White laboratory (2). This database is frequently updated to include the latest experimentally determined membrane protein structures. The 80 proteins classified as 3D_helix in September 2007 were selected for further evaluation. These 80 proteins were submitted to the PISCES web server to create a nonredundant list of protein structures by chain (33). A maximum sequence identity of 35% was allowed to cull these membrane protein structures by their individual chains. A visual inspection of the resulting protein chains was employed to remove structures with no interhelical contacts or no clear formation of an α-helical bundle. The final data set contains a total of 26 unique proteins and a total of 42 protein chains. This data set is presented in Table 1.

TABLE 1.

A data set of α-helical membrane proteins; listed are the PDB identifier, the number of amino acids, and the number of helices

PDB name	AAs	Helices	PDB name	AAs	Helices
1e12-A	253	9	1occ-H	85	4
1ehk-A	562	18	1oed-C	260	4
1eys-C	382	18	1okc-A	297	17
1eys-L	280	17	1ots-A	465	22
1eys-M	324	17	1q16-C	225	13
1f88-A	348	15	1qle-B	252	5
1fx8-A	281	16	1rwt-A	232	10
1h2s-A	225	10	1u7g-A	385	21
1h2s-B	60	2	1xio-A	261	8
1j4n-A	271	12	1yew-A	382	8
1jb0-A	755	36	1yew-B	247	11
1jb0-F	164	9	1yew-C	289	10
1jb0-K	83	3	1zoy-A	622	17
1jb0-L	154	9	1zoy-B	252	10
1kqf-C	217	14	1zoy-C	140	5
1nek-C	129	5	1zoy-D	103	4
1nek-D	115	4	2ahy-A	110	4
1occ-A	514	22	2bbh-A	269	7
1occ-B	227	5	2ic8-A	182	11
1occ-C	261	8	2j7a-A	500	26
1occ-E	109	6	2j7a-C	159	10

A helix of at least 10 amino acids was classified as a transmembrane helix for the purposes of mining interhelical contact probabilities, as described in the sections Calculating Pairwise Probabilities and Calculating Triplet Probabilities. Helices shorter than 10 amino acids are just as likely to lie outside the lipid bilayer, whereas the proposed model is designed to predict contacts within the membrane. With this restriction in place, a protein was removed from the data set if it had fewer than two transmembrane helices, so that only proteins with possible helix-helix contacts were considered. The numbers of α-helices presented in Table 1 are the total numbers of α-helices in each protein, not just the transmembrane helices.

The development of a set of membrane protein structures for both training and testing purposes was unrealistic due to the limited number of structures available. Therefore, six proteins from the training data set presented here were also selected to be members of the test set. Any potential for bias was removed by developing a unique set of interhelical contact probabilities for each of the six test proteins that was calculated using all of the proteins in the data set except for the specific test protein being evaluated.

Calculating pairwise probabilities

For the development of probabilities, two amino acid residues from separate helices are considered a PRIMARY contact if they have a C^α-C^α distance between 4.0 and 10.0 Å. Although many such PRIMARY contacts can be present between two helices, only the minimum distance PRIMARY contact for each helix pair is counted. For every PRIMARY contact, the presence of a WHEEL contact is considered. If the PRIMARY contact is between residues in positions (i, j), then there are eight possible parallel WHEEL contacts and eight possible antiparallel WHEEL contacts. A PRIMARY contact and several possible WHEEL contacts are illustrated in Fig. 1. In both the parallel and antiparallel case, only the WHEEL contacts between 4.0 and 12.0 Å are included in the probability calculations.

FIGURE 1. Two α-helices interacting in an antiparallel manner, where residues i and j form a PRIMARY contact and residues (i + 3) and (i + 4) can each interact with (j − 3) and (j − 4) to form WHEEL contacts. This figure is adapted from McAllister et al. (37).
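The PRIMARY contact definition can be illustrated with a short sketch. It assumes that C^α coordinates are available as (x, y, z) tuples keyed by residue number and that the 4.0–10.0 Å bounds are inclusive; the function name `primary_contact` and the input layout are hypothetical and are not taken from the original implementation.

```python
import math

def primary_contact(helix_m, helix_n, lo=4.0, hi=10.0):
    """Return (i, j, distance) for the minimum-distance PRIMARY contact
    between two helices, or None if no residue pair lies within [lo, hi] Å.

    helix_m, helix_n: dicts mapping residue number -> (x, y, z) C-alpha
    coordinates (an assumed input format, for illustration only).
    """
    best = None
    for i, ca_i in helix_m.items():
        for j, ca_j in helix_n.items():
            d = math.dist(ca_i, ca_j)  # Euclidean C-alpha to C-alpha distance
            # Keep only pairs inside the distance window, and of those
            # retain the single shortest one for this helix pair.
            if lo <= d <= hi and (best is None or d < best[2]):
                best = (i, j, d)
    return best
```

Only one contact is returned per helix pair, mirroring the rule that only the minimum-distance PRIMARY contact is counted.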

After a detailed analysis of the initial data, it became apparent that certain types of residue-residue interactions dominated the minimum interhelical contacts. The most frequent of these were nonpolar-to-nonpolar contacts. However, there were also a significant number of nonpolar-to-polar interactions. The important role of polar interactions within helical contacts has been experimentally verified by the dependence of an engineered leucine zipper on an asparagine residue (34,35). For the construction of this model, the nonpolar set of residues is defined as

(1)

and the polar residues are

(2)

As expected, the charged residues participated in few interhelical contacts. The insertion of a charged residue into the membrane layer is too energetically unfavorable to allow for many charged types to participate in interhelical contacts. It is interesting to note the difference between membrane and soluble proteins. Instead of the polar interactions that form in membrane proteins, it is generally believed that the driving force for soluble protein folding is the hydrophobic effect (36). This hypothesis is supported by the success of an interhelical hydrophobic-to-hydrophobic residue contact prediction model applied to soluble α-helical proteins (37).

Both the PRIMARY and WHEEL pairwise probabilities are divided into antiparallel and parallel classifications. The distinction between parallel and antiparallel is straightforward for two helices in the same plane, but in three-dimensional space the question of how two helices interact is not as clear. Accordingly, the definitions used for parallel and antiparallel in three dimensions had to be established through additional metrics. A procedure for determining the orientation of a pair of helices has been described previously and is applied to the development of probabilities outlined here as well (37).

Once the number of minimum distance contacts has been counted, the probabilities can be developed. The probability of a given residue-residue pair is defined as the number of contacts of that pair divided by the total number of contacts. To reduce the complexity and size of the optimization problem, the residue-residue pairs that have only a single occurrence in the data set are removed from the probability table. The probability set, MIN-1, calculated for the pairwise model is provided as Supplementary Material in Data S1. A set of probabilities was also calculated based on an odds ratio given the frequency of an amino acid occurrence. However, this method was unable to match the performance of the simpler probabilities (data not shown). Further analysis is needed to assess the merits of the odds ratio-based approach for application in this optimization model.
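The counting procedure can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `contact_probabilities` is hypothetical, residue pairs are treated as unordered, and the total is taken over all counted contacts before single-occurrence pairs are dropped. Both of the latter choices are assumptions, since the text does not specify them.

```python
from collections import Counter

def contact_probabilities(contacts, min_count=2):
    """Turn a list of observed residue-type pairs into a probability table.

    contacts: iterable of (res_a, res_b) one-letter residue types, one entry
              per minimum-distance contact found in the data set.
    min_count: pairs seen fewer than this many times are removed
               (2 drops the single-occurrence pairs).
    """
    # Assumption: (A, L) and (L, A) are the same pair type.
    counts = Counter(tuple(sorted(pair)) for pair in contacts)
    # Assumption: the denominator includes pairs that are later removed.
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items() if n >= min_count}
```

For example, six observed contacts consisting of three L-L, two A-L, and one G-S would give probabilities of 3/6 and 2/6 for the first two pair types, with G-S removed from the table.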

A second set of pairwise probabilities, denoted as AL-P, was developed based on the work of Adamian and Liang (16). As part of their comparison between globular and membrane α-helical proteins, they analyzed the relative frequencies of pairwise interhelical contacts according to residue types. These contacts were selected based upon atomic interaction criteria, rather than C^α-C^α distances, and considered all interactions where an atomic interaction resulted. The probabilities derived from these pairwise contacts are available as Data S1. It should be noted that these probabilities are unable to predict WHEEL contacts because conditional probabilities could not be derived.

Calculating triplet probabilities

A three-body (or triplet) interaction consists of a contact between residues (i, j) and residues (i + 1, j), where i and i + 1 reside on helix m and j is from helix n. These triplet probabilities are calculated using a method similar to the approach for the pairwise helix probabilities. A set of three residues is considered a triplet if the average C^α-C^α distance of both residue pairs is between 4.0 and 10.0 Å.
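As a small illustration of the triplet criterion, the check below averages the two C^α-C^α distances (i, j) and (i + 1, j) and tests the result against the distance window. The function name `is_triplet`, the coordinate format, and the inclusive bounds are assumptions made for this sketch.

```python
import math

def is_triplet(ca_i, ca_i1, ca_j, lo=4.0, hi=10.0):
    """Return True if residues i and i+1 (helix m) and j (helix n) form a
    triplet: the average of the (i, j) and (i+1, j) C-alpha distances must
    lie within [lo, hi] Å. Inputs are (x, y, z) coordinate tuples."""
    avg = (math.dist(ca_i, ca_j) + math.dist(ca_i1, ca_j)) / 2.0
    return lo <= avg <= hi
```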

Two main sets of probabilities have been developed for use in this model. The first probability set, MIN-2, considers only the two shortest-distance triplet contacts for each helix-helix interaction in the data set. The motivation for using only the minimum distance triplets is the idea that they represent the “best” contacts. The initial generation of the probabilities for this set separated the values into parallel and antiparallel interactions.

Once the number of contacts has been calculated, the triplet probabilities can be calculated by dividing the number of contacts of a specific triplet by the number of triplet contacts across all proteins in the data set. At this point, any specific triplet contact that only occurs once in the data set is removed from the set of probabilities. This removal reduces the complexity of the problem by only considering the more frequent triplet occurrences. The established probabilities for the MIN-2 set are available as Data S1.

The second set of triplet probabilities tested for this model, denoted as AL-T, was developed by Liang and co-workers (18). Working from a smaller data set, they selected contacts based upon atomic interaction criteria, rather than C^α-C^α distances. Instead of considering only the minimum distance triplets, their set of contacts enumerated all the three-body interactions that met the interaction criteria. Then any triplet with at least 10 contacts was included in the published analysis. The probabilities derived from their set of triplet contacts are available as Data S1.

Pairwise contact prediction model

The first model developed for transmembrane helix contact prediction considers pairwise interactions. A pairwise interaction is characterized by two residues from separate helices that have a short C^α-to-C^α contact distance. The probabilities are developed using a distance range of 4.0–10.0 Å for PRIMARY contacts and the predicted interactions are expected to have distances <12.0 Å in most cases or possibly <14.0 Å for more difficult systems. Using the probabilities developed in the section Calculating Pairwise Probabilities, the model aims at maximizing the occurrence of the most probable residue pairs.

Indices and sets

The indices m, n are used to represent the helices in the protein being modeled. Each helix that is longer than 15 amino acids is included in the sets M,N. The indices i, j, k, l represent a residue in set I, where set I is composed of all the residues in the amino acid sequence of a protein.

Binary variables

This model requires the use of several binary variables that take the value of 1 if the variable is active, and 0 if it is inactive.

Due to the complexity of the transmembrane helix model, the allowable contacts for a specific residue are restricted to be from the helices immediately before and after a specific helix. For example, a residue in the first helix in a protein is allowed to contact a residue in the second helix in that protein or a residue in the last helix in that protein. These allowable contacts comprise the set of contacts that the models will try to predict. As a result of the considerable size and complexity of membrane proteins, this set is only a subset of the possible contacts in the protein.
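The restriction on allowable contacts can be expressed as a simple enumeration of helix pairs. The sketch below assumes helices are numbered 1 through N_hel in sequence order and uses the m < n convention adopted later in the formulation; the function name `allowed_helix_pairs` is hypothetical.

```python
def allowed_helix_pairs(n_hel):
    """Return the sorted list of helix pairs (m, n), m < n, between which
    contacts may be predicted: each helix with its sequence neighbours,
    plus the first helix with the last helix."""
    pairs = {(m, m + 1) for m in range(1, n_hel)}
    # For two helices, (1, 2) is already the only consecutive pair.
    if n_hel > 2:
        pairs.add((1, n_hel))
    return sorted(pairs)
```

For a seven-helix bundle this yields seven candidate pairs out of the 21 possible helix pairs, which shows how strongly the restriction reduces the size of the problem.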

Parameters

The following is a complete list of parameters used in the model. Of particular note are subtract and max_contact, which have their basis in the prediction of α-helical topology in globular proteins (37). The subtract parameter allows the user to consider a subset of the possible helix-helix pairs, with the goal of identifying the lowest-distance contacts. By allowing the model to select from a subset, stronger interhelical interactions may be identified and predicted. max_contact specifies the number of residue-residue contacts that may be predicted between a specific pair of helices. A value of 2 is appropriate for smaller systems, especially those with only a single pair of helices. However, a value of 1 often produces better results for larger proteins because the model focuses on the best possible interactions for each allowed helix pair.

—PRIMARY probability that a specific pair (i, j) forms an antiparallel residue contact.
—PRIMARY probability that a specific pair (i, j) forms a parallel residue contact.
—WHEEL probability that a specific pair (k, l) forms an antiparallel residue contact given a residue contact between (i, j).
—WHEEL probability that a specific pair (k, l) forms a parallel residue contact given a residue contact between (i, j).
—Maximum over all contact combinations.
—Maximum over all contact combinations.
—Twice the value of the largest probability over all nonpolar/polar and nonpolar/nonpolar combinations.
—Twice the value of the largest probability over all nonpolar/polar and nonpolar/nonpolar combinations.
max_contact—Maximum number of contacts allowed between helices m and n.
counth(m)—2 if helix m has at least two nonpolar or polar residues not WHEEL to each other.
subtract—Nonnegative integer that specifies how many m to n helical interactions to remove from the solution with maximal helical packing.
N_hel—Number of helices in the protein.

Level 1 formulation

The objective function of the Level 1 formulation attempts to maximize the probabilities of each residue-residue contact to result in the greatest sum. It can be formulated as shown in Eq. 3:

(3)

The product of the binary variables y and w results in a nonlinear objective function. The linearization of this objective function is performed using standard techniques (38) and is presented as Data S1.
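For readers unfamiliar with the technique, a product of two binary variables is commonly linearized by introducing an auxiliary variable that replaces the product in the objective. The sketch below uses generic symbols; the auxiliary variable z is introduced here for illustration only, and the exact form given in Data S1 may differ.

```latex
\begin{aligned}
z &\le y, \\
z &\le w, \\
z &\ge y + w - 1, \\
0 &\le z \le 1, \qquad y, w \in \{0, 1\}.
\end{aligned}
```

With y and w binary, these inequalities force z = 1 exactly when y = w = 1 and z = 0 otherwise, so z equals the product yw while every constraint remains linear.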

The constraints in the level 1 pairwise model formulation are separated into five categories relating to basic model relationships, geometric observations, model complexity considerations, membrane protein observations, and model features.

Basic model

The model is more tightly restrained by taking advantage of the relationships among the binary variables for residue-residue contacts and those for antiparallel and parallel helix-helix contacts. The first of these constraints, Eq. 4, requires that a residue-residue contact can only be specified if there is either a parallel or an antiparallel contact between the helices m, n:

(4)

Like Eq. 4, Eq. 5 connects the binary variables representing the (i, j) residue-residue contacts to the antiparallel and parallel binary variables for an interacting helix pair (m, n). When the sum over the residue-residue contact variables is equal to zero for a given helix pair (m, n), the helices cannot be in contact. The following constraint specifies this observation, and is especially useful when integer cuts are applied to generate a rank-ordered list of contact predictions,

(5)

Geometric observations

The same pair of transmembrane helices (m, n) cannot interact in both an antiparallel and a parallel fashion. By requiring the sum of the antiparallel and parallel helix-contact binary variables to be ≤1, the constraint expressed in Eq. 6 ensures that the interaction is parallel or antiparallel, but not both,

(6)

If parallel contacts between consecutive helices have been disallowed (see Eq. 15), the type of allowable contact between the first and last helix of a membrane protein is thereby determined. For the case of an even number of helices >2, the interaction between helices (1,N_hel) must be antiparallel. However, in the case of an odd number of helices, this contact must be parallel. For the case of a membrane protein with only two helices, their interaction has already been specified as antiparallel by Eq. 15. In this case, Eq. 7 is redundant, and it is removed from the model. The modulus operator, MOD, is used to determine whether the number of helices is odd or even:

(7)
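The parity argument behind this constraint can be written out directly. The sketch below is illustrative only; the function name `first_last_orientation` is hypothetical, and the return value of None for the two-helix case simply signals that the constraint is redundant there.

```python
def first_last_orientation(n_hel):
    """Orientation implied for the contact between helix 1 and helix n_hel
    when all consecutive helices are antiparallel.

    Each step along the sequence flips the helix direction, so helix n_hel
    points the same way as helix 1 when n_hel is odd and the opposite way
    when n_hel is even.
    """
    if n_hel <= 2:
        return None  # helices 1 and 2 are consecutive; no extra constraint
    return "antiparallel" if n_hel % 2 == 0 else "parallel"
```

Under this reasoning, a seven-helix bundle such as a G-protein-coupled receptor would have helices 1 and 7 restricted to a parallel contact.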

If more than one residue-residue contact is allowed between a given pair of helices (m, n), the positions of these two contacts must be constrained to prevent kinks in the helix. By requiring the number of residues on helix m between (i, k) to be within three residues of the number between (j, l) on helix n, the severity of any predicted kinks can be reduced to a reasonable level. In addition to implementing that requirement, Eq. 8 also prevents a second PRIMARY contact from being predicted in the WHEEL position of the first PRIMARY contact, as

(8)

where diff(i, i′) refers to the difference in sequence numbering between i and i′.
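The spacing requirement described for Eq. 8 can be checked with a few lines of code. This sketch covers only the condition that the separations on the two helices agree to within three residues; it does not implement the exclusion of WHEEL positions, and it is not the constraint as written in the model. The function name `spacing_compatible` and the use of absolute sequence separation for diff are assumptions.

```python
def spacing_compatible(i, k, j, l, tol=3):
    """Check two PRIMARY contacts (i, j) and (k, l) on helix pair (m, n).

    i, k are residue numbers on helix m; j, l are residue numbers on helix n.
    The contacts are compatible if the sequence separation on helix m is
    within `tol` residues of the separation on helix n, which limits the
    severity of any implied kink.
    """
    return abs(abs(k - i) - abs(l - j)) <= tol
```

For instance, contacts (10, 40) and (17, 33) are separated by seven residues on both helices and pass the check, whereas contacts (10, 40) and (20, 38) differ by eight residues in separation and fail.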

For a set of parallel helices, if residue k > i in helix m, then it must also be true that residue l > j. If this is not the case, then the two predicted PRIMARY contacts are not consistent with the parallel classification given by the parallel helix-contact binary variable. This constraint is shown below as Eq. 9. A similar constraint, Eq. 10, requires the proper numbering and classification scheme for the antiparallel case.

(9)

(10)

If there is a shorter helix in contact with a longer helix in the pair (m, n), the allowable set of contacts can be further tightened by considering the length of the loop between the two helices. If the loop region only contains a few residues (as is the case in many consecutive transmembrane helices), it cannot stretch far enough to allow contacts from the beginning of the first helix to the end of the second helix. To quantify this insight, Eqs. 11 and 12 have been implemented. The assumptions for these constraints consist of:

At least one residue is required for the turn.
The i, (i + 4) distance for residue i in any given helix is ∼6.0 Å.
The vertical distance a loop residue can span is 3.0 Å.

The third assumption may be restrictive, as the average distance between two consecutive C^α atoms is ∼3.8 Å. However, it is unlikely that the loop region will be able to stretch in a perfectly straight manner considering the large amount of flexibility in most loop regions. As the model is applied to additional transmembrane proteins, the values of 3.0 Å and 6.0 Å can be changed to more conservative values if necessary. In these equations, loop_length(m, n) is the length of the loop between the helix pair (m, n) and len is the length of a specific helix:

(11)

(12)

Model complexity

Equation 13 allows helix m to have at most counth(m) contacts. For almost all transmembrane helices, counth(m) is equal to 2. However, in the rare case where it is not possible for helix m to have two predicted contacts because of the structure of the probability set, counth(m) can be set to 1 to tighten the bounds on the helix-contact binary variables:

(13)

For a given helix, any specified amino acid is allowed to be in contact with at most one other amino acid on a specific helix. This simplification is introduced to predict only the most probable contacts and reinforces the focus of the model predictions on accuracy instead of coverage. Due to the structure of the modeling language, the index m is always assumed to be <n to reduce the number of variables in the formulation. Therefore, Eq. 14 is needed to implement this restriction:

(14)

Membrane protein observations

The majority of α-helical membrane proteins contain consecutive helices that interact in an antiparallel fashion. To have a parallel interaction between two consecutive helices, a nonhelical segment that stretched the length of the helix would need to exist. Since this model has been developed for transmembrane helices that span the membrane bilayer, the loop region between the two helices would need to span the bilayer to allow for a parallel helix-helix interaction between two consecutive helices. The energetics of inserting a loop segment across a membrane layer are unfavorable, so Eq. 15 is included to prevent parallel helical interactions between two consecutive helices. The use of methods to predict membrane-spanning regions could verify this constraint in future implementations,

(15)

Transmembrane helices are often of approximately the same length and they tend to line up in a similar fashion from top to bottom to form a bundlelike structure. It is highly unlikely that a PRIMARY contact prediction yielding little overlap between helices is an accurate representation of the protein. Equations 16 and 17 prevent the model from predicting contacts where the overlap between helices (m, n) is <90% of the shorter helix length. Although this value is a strict overlapping requirement, it is justified by the energetics of the helices in the membranes that lead to alignments of this type,

(16)

(17)

Model features

The next constraint, Eq. 18, allows the optimization model to predict at most max_contact number of contacts between a specified pair of helices (m, n). Using a parameter value of one is useful for the contact prediction of large proteins with many helices. However, a max_contact value of two provides more constraints for the subsequent tertiary fold prediction. Therefore, for most proteins, both values of the max_contact parameter are explored as

(18)

Sometimes it is desirable to predict fewer than the maximum possible number of helix-helix contacts (m, n). Equation 19 introduces the parameter subtract to limit the number of helical interactions. A subtract value of zero allows the maximum number of interhelical contacts to be equal to the number of helices, as specified by Eq. 13. Each additional increment of the subtract parameter effectively removes a helix-helix contact from the allowable prediction. A larger subtract value leads to looser helix packing, and it is postulated that the model will then be able to predict the most essential and most accurate helical contacts:

(19)
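The combined effect of the max_contact and subtract parameters can be sketched as a feasibility check on a candidate contact list. The contact tuples and the function name are assumptions for illustration, and the second condition reflects one reading of the text (the number of interacting helix pairs is limited to the number of helices minus subtract); the exact algebraic forms of Eqs. 13, 18, and 19 are not reproduced.

```python
from collections import Counter


def within_contact_limits(contacts, n_helices, max_contact=1, subtract=0):
    """Check the per-pair and overall contact limits for a candidate prediction.

    Each contact is a hypothetical tuple (m, i, n, j) with m < n.
    """
    # Number of residue-residue contacts predicted for each helix pair (m, n).
    per_pair = Counter((m, n) for m, _, n, _ in contacts)
    if any(count > max_contact for count in per_pair.values()):
        return False
    # Assumed reading: interacting helix pairs are capped at n_helices - subtract.
    return len(per_pair) <= n_helices - subtract
```

For a three-helix protein with two contacts between helices 1 and 2 and one between helices 2 and 3, the prediction is rejected with max_contact set to one, accepted with max_contact set to two, and rejected again once subtract is raised to two.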

In some cases, the best contact prediction (ranked by average distance or some other measure) does not correspond to the most probable solution and it is informative to look at several solutions ranked by probability. The true power of this model results from the ability to generate a rank-ordered list of contact predictions. Equation 20 implements the concept of an integer cut, restricting the model to a unique set of binary variables for each iteration. After each successive solve of the above model, the previous solution can be excluded from the feasible solution space using this equation. Here A is the set of active variables, which are all the variables that assume a value of 1. Also, I is the set of inactive variables and card(A) is the cardinality of set A, or in other words the number of members of set A:

(20)
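The iterative use of integer cuts can be sketched on a toy problem. The cut below uses the standard form, in which the sum of the active variables minus the sum of the inactive variables may not exceed card(A) - 1; this is assumed to correspond to Eq. 20 based on the definitions of A and I given above. The brute-force enumeration stands in for a mixed-integer solver, and the weights, the max_selected limit, and all function names are hypothetical.

```python
from itertools import product


def integer_cut(solution):
    """Split a 0/1 solution into the index sets A (active) and I (inactive)."""
    active = [k for k, value in enumerate(solution) if value == 1]
    inactive = [k for k, value in enumerate(solution) if value == 0]
    return active, inactive


def satisfies_cut(y, cut):
    """Standard integer cut: sum over A of y minus sum over I of y <= card(A) - 1."""
    active, inactive = cut
    lhs = sum(y[k] for k in active) - sum(y[k] for k in inactive)
    return lhs <= len(active) - 1


def ranked_solutions(weights, max_selected, n_solutions):
    """Re-solve a toy 0/1 maximization, excluding each previous optimum with a cut."""
    cuts, ranked = [], []
    for _ in range(n_solutions):
        best = None
        # Brute-force search over all binary vectors (stand-in for a MILP solver).
        for y in product((0, 1), repeat=len(weights)):
            if sum(y) > max_selected:
                continue
            if not all(satisfies_cut(y, cut) for cut in cuts):
                continue
            score = sum(w * v for w, v in zip(weights, y))
            if best is None or score > best[0]:
                best = (score, y)
        if best is None:
            break  # every feasible solution has already been excluded
        ranked.append(best)
        cuts.append(integer_cut(best[1]))
    return ranked
```

Each cut removes exactly one binary vector from the feasible space, so successive solves return distinct solutions in nonincreasing order of objective value. Calling `ranked_solutions([5, 3, 1], max_selected=2, n_solutions=3)` yields the selections (1, 1, 0), (1, 0, 1), and (1, 0, 0) with scores 8, 6, and 5.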

Level 2 formulation

The Level 2 formulation uses the PRIMARY contacts predicted in the Level 1 formulation to predict the most probable WHEEL contacts. By predicting the WHEEL contacts as well, the model provides a direct method to distinguish among rank-ordered PRIMARY contact predictions that have the same objective function value in Level 1. For a “blind” prediction problem, this second formulation can be especially useful. Although it is possible to solve the Level 1 and Level 2 formulations simultaneously, the current implementation solves them sequentially, which allows for faster predictions given the size and complexity of the problem for larger protein systems.

If the data set were large enough, it would be desirable to use probabilities that represent the odds of a specific (k, l) WHEEL contact given an (i, j) PRIMARY contact, but this is not feasible with the limited size of the current data set. Instead, the probabilities are calculated as the probability that position (k, l) will contain a WHEEL contact given that (i, j) form a PRIMARY contact. However, to distinguish among WHEEL contact probabilities, the model must also consider the probability of an (i, j) contact given a (k, l) contact. This second probability effectively treats the (k, l) interaction as a PRIMARY contact and gives the probability of a WHEEL contact at (i, j).

The objective function for the Level 2 formulation is presented in the form of

(21)

In this equation, each term is defined as the product of the wheel probability sum (as described above) and the binary variable representing the presence of a WHEEL contact in position (k, l) given a PRIMARY contact (i, j). The binary parameters are defined as the appearance of a PRIMARY contact (i, j) in the Level 1 model,

(22)

(23)

(24)

(25)

Equations 26 and 27 are then implemented to ensure at most one WHEEL contact (k, l) is specified for a given (i, j) PRIMARY contact.

(26)

(27)
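The "at most one WHEEL contact per PRIMARY contact" rule can be illustrated with a simplified selection step. This is a greedy stand-in rather than the Level 2 optimization itself: it ignores the coupling introduced by Eqs. 22 through 25, and the dictionary layout, the probability values in the usage note, and the function name are all hypothetical.

```python
def pick_wheel_contacts(primary_contacts, wheel_probability):
    """Select at most one WHEEL contact (k, l) for each PRIMARY contact (i, j).

    primary_contacts: list of (i, j) pairs taken from a Level 1 solution.
    wheel_probability: hypothetical mapping (i, j) -> {(k, l): probability sum}.
    """
    chosen = {}
    for primary in primary_contacts:
        candidates = wheel_probability.get(primary, {})
        if candidates:
            # Keep only the single most probable WHEEL position for this contact.
            chosen[primary] = max(candidates, key=candidates.get)
        # A PRIMARY contact with no candidates simply gets no WHEEL contact.
    return chosen
```

For example, with a PRIMARY contact (20, 50) whose candidate WHEEL positions (24, 46) and (16, 54) carry illustrative probability sums of 3 and 1, only (24, 46) is retained.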

Triplet contact prediction model

The use of pairwise interhelical residue-residue contacts was an obvious first choice to satisfy the objectives of this optimization model. However, there are other methods that may work just as well. A recent article suggests that higher-order interactions may be necessary to properly model the system (18). This optimization model considers the interaction between interhelical triplet contacts. To enable proper description of the constraints implemented as part of this model, two types of triplet residues are defined. The first is a MAIN residue, which represents the central residue of the triplet that appears on the helix that is opposite the helix containing the other two residues. These other two residues are defined as SECONDARY residues. For example, consider a triplet contact between Leucine and Valine on helix m and Glycine on helix n. The Leucine-Glycine-Valine triplet contains the MAIN residue Glycine and two SECONDARY residues Leucine and Valine.
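The MAIN and SECONDARY labels follow directly from which helix each residue sits on, as the short sketch below shows. The (helix, residue) tuple representation and the function name are assumptions introduced for illustration.

```python
def classify_triplet(residues):
    """Split an interhelical triplet into its MAIN and SECONDARY residues.

    residues: three hypothetical (helix, residue_name) tuples, two of which
    lie on one helix and one on the opposite helix.
    """
    if len(residues) != 3:
        raise ValueError("a triplet must contain exactly three residues")
    helices = [helix for helix, _ in residues]
    # The MAIN residue is the only one on its helix; the other two share a helix.
    main = [r for r in residues if helices.count(r[0]) == 1]
    secondary = [r for r in residues if helices.count(r[0]) == 2]
    if len(main) != 1 or len(secondary) != 2:
        raise ValueError("expected two residues on one helix and one on the other")
    return main[0], secondary
```

Applied to the example above, with Leucine and Valine on helix m and Glycine on helix n, the function returns Glycine as the MAIN residue and Leucine and Valine as the SECONDARY residues.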

By applying the probabilities developed in the section Calculating Triplet Probabilities, this model seeks to predict the most probable triplet contacts between transmembrane helices. Since this problem is formulated as an optimization model, it will be able to maximize the sum of the triplet probabilities to guarantee the highest probability allowed by the constraints.