PLOS ONE. 2011 Jul 26;6(7):e21821. doi: 10.1371/journal.pone.0021821

A Combination of Compositional Index and Genetic Algorithm for Predicting Transmembrane Helical Segments

Nazar Zaki 1,*, Salah Bouktif 2, Sanja Lazarova-Molnar 1
Editor: Pierandrea Temussi 3
PMCID: PMC3144211  PMID: 21814556

Abstract

Transmembrane helix (TMH) topology prediction is becoming a focal problem in bioinformatics because the structure of TM proteins is difficult to determine using experimental methods. Therefore, methods that can computationally predict the topology of helical membrane proteins are highly desirable. In this paper we introduce TMHindex, a method for detecting TMH segments using only the amino acid sequence information. Each amino acid in a protein sequence is represented by a Compositional Index, which is deduced from a combination of the difference in amino acid occurrences in TMH and non-TMH segments in training protein sequences and the amino acid composition information. Furthermore, a genetic algorithm was employed to find the optimal threshold value for the separation of TMH segments from non-TMH segments. The method successfully predicted 376 out of the 378 TMH segments in a dataset consisting of 70 test protein sequences. The sensitivity and specificity for classifying each amino acid in every protein sequence in the dataset were 0.901 and 0.865, respectively. To assess the generality of TMHindex, we also tested the approach on another standard 73-protein 3D helix dataset. TMHindex correctly predicted 91.8% of proteins based on TM segments. The level of accuracy achieved using TMHindex in comparison to other recent approaches for predicting the topology of TM proteins is a strong argument in favor of our proposed method. Availability: The datasets and software, together with supplementary materials, are available at http://faculty.uaeu.ac.ae/nzaki/TMHindex.htm.

Introduction

A biological membrane or biomembrane is an enclosing or separating membrane that acts as a selective barrier within or around a cell, allowing cells to maintain specific chemical or biochemical environments. Membrane proteins play key roles in biological systems as pores, ion channels and receptors. Being important in intracellular communication and coordination, membrane proteins may serve as good drug targets. A biological membrane is usually spanned by a TM protein, which makes it an important target of both basic science and pharmaceutical research [1]. The major category of TM proteins is the α-helical proteins. This protein category constitutes roughly 30% of a typical genome and is usually present in the inner membranes of bacterial cells, the plasma membrane of eukaryotes, the outer membrane of Gram-negative bacteria or mitochondrial membranes. α-helical transmembrane proteins are involved in a wide range of important biological processes such as cell signaling, transport of membrane-impermeable molecules, cell-cell communication, cell recognition and adhesion. Since many TMHs are also prime drug targets, it has been estimated that more than half of currently commercialized drugs target membrane proteins [2]. Therefore, the prediction of TMHs could play an important role in the study of membrane proteins. The importance of this role is emphasized by the lack of high-resolution structures for such proteins: the total number of transmembrane proteins in the Protein Data Bank (PDB) [3] is limited, comprising 1% of available structures [4], [5]. Knowledge of the TMH topology can help in identifying binding sites and inferring functions for membrane proteins. However, because membrane proteins are hard to solubilize and purify, only a very small number of membrane proteins have experimentally determined structure and topology. This has motivated various computational methods for predicting the topology of membrane proteins [6]. These methods have important applications in genome analysis, and can be used to understand the global trend in membrane protein evolution.

A computational method is usually considered successful if it not only predicts individual TMHs but also attempts to predict the full topology of the protein [7]. To this end, in the last two decades, researchers have developed a battery of successively more powerful methods for predicting TMHs. This development can be broken into three main categories. In the first category, early TMH prediction methods were based on experimentally determined hydropathy indices of hydrophobic properties for each residue in the protein sequence. Examples of this category include TOP-Pred [8], DAS-TMfilter [1] and SOSUI [9], which are among the most reliable methods in providing descriptive information about TMHs. These methods use hydrophobicity analysis alone and therefore cannot predict TMHs with length greater than 25 residues [10]. Recently determined high-resolution structures of helical membrane proteins have revealed that TMHs can have a wide length distribution, including helices of more than 25 residues.

In the second category, further accuracy was achieved by employing probabilistic approaches such as Hidden Markov Models (HMMs). In this case the actual biological structural knowledge was incorporated into the model's architecture in order to increase its prediction power. Methods such as HMM-TOP [11], TMHMM [12], THUMBU [13] and Phobius [14] allowed researchers to reliably predict integral membrane proteins in a large collection of genomes. However, HMM-based methods are considered computationally expensive since they involve multiple sequence alignments, calculation of the profile HMM topology and parameterization, and training via expectation maximization. Moreover, the HMM-based methods are unable to correctly predict TMHs shorter than 16 residues or longer than 35 residues [10]. As for distantly related protein sequences, a profile alignment may not be possible if, for example, the sequences contain shuffled domains.

In the third category, additional accuracy was gleaned by leveraging machine learning techniques such as neural networks, support vector machines and k-nearest neighbor. Examples of this category include PHD [15], MemBrain [10] and MEMSAT-SVM [2]. Despite their success, the feature extraction step in the machine learning based techniques is often computationally expensive since it also involves multiple sequence alignments. Therefore, a simple and general feature extraction algorithm that does not require sequence alignments is desirable.

Numerous methods have also been developed to study secondary structure assignment [16]–[18]. Pylouster et al. [19] have recently studied the influence of the assignment on the prediction of transmembrane helices in protein structures. Their study of the sequence-structure relationship shows very limited differences with regard to the structural disagreement. This is a very encouraging finding, which shows that accurate prediction of TMHs could lead to identifying the secondary structure in a protein sequence.

In this paper, we focus on the determination of TMH spanning segments and the amino-terminal orientations. We introduce TMHindex which predicts TMH segments solely from the amino acid sequence information. The prediction is done by using a TMH compositional index which is deduced from the dataset of TMH segments and the amino acid composition. A TMH preference profile is then generated by calculating the average TMH index values along the amino acid sequence using a sliding window of different sizes. Finally, a genetic algorithm was employed to refine the prediction by detecting the optimal set of threshold values that separate the TMH segments from non-TMH segments.

Materials and Methods

In this section we introduce our method for predicting the topology of TMH proteins, referred to as TMHindex. An overview of the TMHindex method is shown in Figure 1. TMHindex consists of the following two major steps, which are detailed in the subsequent sections:

Figure 1. TMHindex overview.


  1. Calculation of the TMH compositional index: In this step we extract the TMH segments and non-TMH segments from the training dataset, compute the difference in amino acid appearances in TMH segments and non-TMH segments, compute the amino acid composition of the test protein sequence and finally calculate the TMH compositional index.

  2. Employing a Genetic Algorithm (GA) to find the optimal set of threshold values: In this step we tailor a GA to find an optimal set of threshold values that will accurately segregate TMH and non-TMH segments.

TMH compositional index

We start by analyzing the amino acid composition in TMH segments and non-TMH segments. We denote by $S^{*}$ the enumerated set of sequences in the database of membrane protein sequences. From each protein sequence $s$ in $S^{*}$, we extract known TMH and non-TMH segments and store them in datasets $S_{1}$ and $S_{2}$, respectively. To represent the preference for amino acid residues in TMH segments, we define an index $m_i$. The index $m_i$ for the amino acid $a_i \in$ {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V} is calculated as follows:

$$m_i = -\ln\left(\frac{f_i^{\mathrm{TMH}}}{f_i^{\mathrm{non\text{-}TMH}}}\right) \qquad (1)$$

where $f_i^{\mathrm{TMH}}$ and $f_i^{\mathrm{non\text{-}TMH}}$ are respectively the frequencies of amino acid $a_i$ in the datasets $S_{1}$ and $S_{2}$. A negative value of $m_i$ (threshold value of 0) indicates that the amino acid $a_i$ preferably exists in a TMH segment. This is rather analogous to the DomCut method [20], which was developed to predict the inter-domain linker regions in amino acid sequences. However, the information contained in the index values $m_i$ alone is insufficient to accurately predict the TMH segments, thus we incorporated the amino acid composition knowledge into the index $m_i$. The conventional amino acid composition (AAC) values contain 20 components, each of which reflects the normalized occurrence frequency for one of the 20 native amino acids in a sequence. Owing to its simplicity, the AAC model was widely used in many earlier statistical methods for predicting protein attributes. It has also been used in many bioinformatics applications such as inferring the lifestyle of an organism from the characteristic properties of its genome [21] and compensating for the lack of domain information in predicting protein-protein interaction [22].
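The index calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the negative log-ratio form follows the DomCut analogy mentioned above, and the function names, the pseudocount (added so that unseen residues do not produce a division by zero) and the toy segments are hypothetical, not taken from the TMHindex software.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def residue_frequencies(segments, pseudocount=1.0):
    """Relative frequency of each amino acid over a list of segments.

    The pseudocount is an assumption of this sketch, not part of the paper.
    """
    counts = Counter("".join(segments))
    total = sum(counts[aa] for aa in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

def tmh_index(tmh_segments, non_tmh_segments):
    """Negative log-ratio of TMH to non-TMH frequencies (assumed DomCut-style form)."""
    f_tmh = residue_frequencies(tmh_segments)
    f_non = residue_frequencies(non_tmh_segments)
    return {aa: -math.log(f_tmh[aa] / f_non[aa]) for aa in AMINO_ACIDS}

# Toy segments, invented for illustration only.
tmh = ["LLIVAFLLGAVILLV", "AILFVVGLLAIAFLL"]
non_tmh = ["MSDKEERRGNPDQST", "KRDEEGSPTNQKHDE"]
index = tmh_index(tmh, non_tmh)
```

With these toy segments, leucine receives a negative index (preferred in TMH segments) and lysine a positive one, which matches the sign convention described above.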

To this end, we recalculate the compositional index $c_i$ as follows:

$$c_i = m_i \times \mathrm{AAC}_i \qquad (2)$$

where $m_i$ is the index defined in Equation 1 and $\mathrm{AAC}_i$ is the AAC of amino acid $a_i$. We then represent each residue in all the testing protein sequences by its corresponding compositional index $c_i$. Subsequently, the index values are averaged over a window that slides along the length of each protein sequence. To calculate the averaged compositional index values $\bar{c}_j^{\,w}$ for a protein sequence $s$, given a single window size $w$, we apply the following formula:

$$\bar{c}_j^{\,w} = \frac{1}{\min(L,\, j+h) - \max(1,\, j-h) + 1} \sum_{k=\max(1,\, j-h)}^{\min(L,\, j+h)} c_{a_k^{s}}, \qquad h = \frac{w-1}{2} \qquad (3)$$

where $L$ is the length of the protein and $a_k^{s}$ is the amino acid at position $k$ in protein sequence $s$.

To illustrate the calculation of the averaged compositional index values, we use the 1LGH:B protein sequence (AERSLSGLTEEEAIAVHDQFKTTFSAFIILAAVAHVLVWVWKPWF). In Table 1, we show the calculation of these values for the first 5 amino acids with a window size equal to 5.

Table 1. Illustration of the calculation of the averaged compositional index values (window size 5).

| Position | Amino acid | AAC (%) | Index | Averaged compositional index |
|---|---|---|---|---|
| 1 | A | 15.556 | −0.30841 | (15.556*(−0.30841)+8.889*(1.472438)+2.222*(1.473881))/3 = 3.855279668 |
| 2 | E | 8.889 | 1.472438 | (15.556*(−0.30841)+8.889*(1.472438)+2.222*(1.473881)+6.667*(0.137164))/4 = 3.120077848 |
| 3 | R | 2.222 | 1.473881 | (15.556*(−0.30841)+8.889*(1.472438)+2.222*(1.473881)+6.667*(0.137164)+8.889*(−0.53791))/5 = 1.53976588 |
| 4 | S | 6.667 | 0.137164 | (8.889*(1.472438)+2.222*(1.473881)+6.667*(0.137164)+8.889*(−0.53791)+6.667*(0.137164))/5 = 2.68218555 |
| 5 | L | 8.889 | −0.53791 | (2.222*(1.473881)+6.667*(0.137164)+8.889*(−0.53791)+6.667*(0.137164)+2.222*(−0.07568))/5 = 0.030853082 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
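The arithmetic in Table 1 can be checked with a short Python sketch. The index values are copied from the table and cover only the residues needed for the first five positions. The function names are hypothetical, and the rounding of AAC to three decimals is an assumption made to match the printed table values.

```python
from collections import Counter

SEQ = "AERSLSGLTEEEAIAVHDQFKTTFSAFIILAAVAHVLVWVWKPWF"  # 1LGH:B

# Index values copied from Table 1 (only residues occurring in positions 1-7).
INDEX = {"A": -0.30841, "E": 1.472438, "R": 1.473881,
         "S": 0.137164, "L": -0.53791, "G": -0.07568}

def aac_percent(seq):
    """Amino acid composition as a percentage, rounded as in Table 1."""
    counts = Counter(seq)
    return {aa: round(100.0 * n / len(seq), 3) for aa, n in counts.items()}

def averaged_index(seq, index, window, positions):
    """Average AAC * index over a centered window, truncated at the sequence ends."""
    half = (window - 1) // 2
    aac = aac_percent(seq)
    averaged = []
    for j in positions:  # 1-based positions
        lo = max(1, j - half)
        hi = min(len(seq), j + half)
        terms = [aac[seq[k - 1]] * index[seq[k - 1]] for k in range(lo, hi + 1)]
        averaged.append(sum(terms) / len(terms))
    return averaged

values = averaged_index(SEQ, INDEX, window=5, positions=range(1, 6))
for position, value in enumerate(values, start=1):
    print(position, SEQ[position - 1], round(value, 9))
```

The values computed for positions 2 to 5 agree with the corresponding rows of Table 1.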

As revealed in the MemBrain method [10], the fusion of various window sizes provides more flexibility in accounting for the length variation of TMHs. This reduces the bias towards a fixed TMH length, introduced by using only one window size (as treated in most of the previous TMH topology predictors). Therefore, averaging is carried out across a sequence of $n$ odd window sizes $w_1 < w_2 < \dots < w_n$, generating a series of features for each protein sequence. This yields the set of values $M_j$ for each sequence:

$$M_j = \frac{1}{n} \sum_{k=1}^{n} \bar{c}_j^{\,w_k} \qquad (4)$$

where $\bar{c}_j^{\,w_k}$ is the averaged compositional index of Equation 3 at position $j$ for window size $w_k$, and $k$ is the summation index that ranges across the $n$ odd window sizes. The values $M_j$ are further used in conjunction with a Genetic Algorithm (GA) to refine the prediction by detecting short loops and turns that separate the TMH segments.
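A minimal sketch of the multi-window fusion follows. It assumes that the per-window profiles are combined by an arithmetic mean; the function names, the input scores and the window sizes in the usage example are hypothetical.

```python
def window_average(values, window):
    """Centered moving average with the window truncated at both ends."""
    half = (window - 1) // 2
    n = len(values)
    averaged = []
    for j in range(n):
        lo, hi = max(0, j - half), min(n, j + half + 1)
        averaged.append(sum(values[lo:hi]) / (hi - lo))
    return averaged

def multi_window_profile(values, window_sizes):
    """Mean of the per-window profiles (assumed fusion rule for this sketch)."""
    profiles = [window_average(values, w) for w in window_sizes]
    return [sum(column) / len(column) for column in zip(*profiles)]

# Toy per-residue compositional index values and two odd window sizes (3 and 5).
scores = [1.0, -2.0, 3.0, -4.0, 5.0]
profile = multi_window_profile(scores, window_sizes=range(3, 6, 2))
```

Each entry of `profile` is the fused score for one residue, which is the quantity the thresholds are later applied to.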

Dynamic threshold using GA

Finding an optimal threshold value which separates TMH segments from non-TMH segments is crucial to the accuracy of the topology prediction. It is a challenging matter that remains unsolved by many existing predictors, most of which use fixed threshold values to separate TMH segments from non-TMH segments (e.g., residues with scores higher than a defined threshold value are assigned to a helical segment). This is a weakness because an optimal threshold for defining two TMH segments separated by long loops is different from a threshold required for identifying TMH segments separated by short loops or tight turns. High-resolution structures show that two consecutive TMH segments are often connected by very short loops or turns. For this reason, the authors of MemBrain [10], for instance, utilized a dynamic threshold value in which a base threshold propensity of 0.4 was used to initially define TMH fragments. Then, the threshold was raised according to the shape of the local propensity profile for identifying short loops or helical breaks in fragments. Despite the success shown by utilizing a dynamic threshold, raising the threshold could improve the prediction of the TMH segments in one part of the sequence while reducing the prediction accuracy in another part.

The prediction problem turns into a search for a set of dynamic threshold values that will better reflect the structure of the amino acid sequence and accurately predict the TMH and non-TMH segments. Such a search problem can be viewed as a partition problem [23], which is NP-hard, so no polynomial-time algorithm is known for it. The application of metaheuristic search techniques to this class of problems is a promising solution [23]–[25]. Metaheuristics are high-level frameworks that employ heuristics to find solutions for combinatorial problems at a reasonable computational cost, with strategies ready for adaptation to specific problems. In particular, GA is one of the most commonly used techniques and has proven its effectiveness in combinatorial optimization [23]. Besides, GA is easily customizable for our problem. In the following section we focus on the adaptation of GA to our TMH segment prediction method.

Customized Genetic Algorithm

The basic idea of a GA is to start from a set of initial solutions and use biologically inspired evolutionary mechanisms to derive new and possibly better solutions [24]. The derivation starts from an initial solution set $P_0$ (called the initial population) and generates a sequence of populations $P_1, P_2, \dots$ of new solutions by applying the genetic operators, crossover and mutation, with probability values $p_c$ and $p_m$, respectively. The fittest chromosomes of each population are automatically added to the next generation. The algorithm stops if a convergence criterion is satisfied or if a fixed number of generations is reached.

To apply GA to a specific problem, all elements of the generic algorithm must be customized and adapted to the problem. In particular, the solutions must be encoded into chromosomes and the two operators (crossover and mutation) and the fitness function must be defined.
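The generic loop described above can be sketched as follows. This is not the authors' implementation: the selection scheme (sampling parents from the fitter half), the default parameter values and all names are assumptions, and the problem-specific parts (fitness, crossover, mutation) are passed in as callables. The toy usage maximizes a simple one-dimensional function only to show the loop running.

```python
import random

def run_ga(initial_population, fitness, crossover, mutate,
           generations=50, p_crossover=0.8, p_mutation=0.1, n_elite=2, seed=0):
    """Generic GA with elitism. All default values are illustrative assumptions."""
    rng = random.Random(seed)
    population = list(initial_population)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_generation = ranked[:n_elite]  # elitism: copy the fittest unchanged
        while len(next_generation) < len(population):
            # Parent selection from the fitter half (assumed scheme).
            a, b = rng.sample(ranked[:max(2, len(ranked) // 2)], 2)
            children = crossover(a, b, rng) if rng.random() < p_crossover else (a, b)
            for child in children:
                if rng.random() < p_mutation:
                    child = mutate(child, rng)
                next_generation.append(child)
        population = next_generation[:len(population)]
    return max(population, key=fitness)

# Toy problem: find x close to 3 by maximizing -(x - 3)^2.
initial = [float(x) for x in range(-10, 11, 2)]

def toy_fitness(x):
    return -(x - 3.0) ** 2

best = run_ga(
    initial,
    toy_fitness,
    crossover=lambda a, b, rng: ((a + b) / 2.0, (2.0 * a + b) / 3.0),
    mutate=lambda x, rng: x + rng.gauss(0.0, 0.5),
)
```

Because the elite chromosomes are copied unchanged, the best fitness in the population never decreases from one generation to the next.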

Encoding a protein sequence as a chromosome

To properly apply GA to our problem, we define a chromosome encoding for the protein sequence, which is represented by a vector of the averaged values calculated using Equation 4. As each chromosome is a set of genes, we encode a gene as a pair $(t, u)$, where $t$ is a threshold value and $u$ is the upper rank in the protein sequence before which $t$ is used as threshold. To illustrate this, let $(t_{i-1}, u_{i-1})$, $(t_i, u_i)$ and $(t_{i+1}, u_{i+1})$ be three consecutive genes in the chromosome representing the sequence of a given protein. The value $t_i$ is the threshold applied from the position $u_{i-1}$ to the position $u_i$ in the protein sequence and $t_{i+1}$ is the threshold applied from the position $u_i$ to the position $u_{i+1}$ in the sequence. In particular, the threshold $t_1$ would be applied from the beginning of the sequence to the position $u_1$, as illustrated in Figure 2.
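To make the encoding concrete, the sketch below decodes a chromosome into per-residue labels. Two points are assumptions of this sketch rather than statements from the paper: a residue is labeled TMH when its score falls below the local threshold (following the convention that negative index values indicate TMH preference), and each upper position is treated as 1-based and inclusive. The function name and the toy values are hypothetical.

```python
def apply_chromosome(scores, chromosome):
    """Label each residue using piecewise thresholds.

    chromosome: list of (threshold, upper_position) genes sorted by position;
    the last upper_position is expected to equal len(scores).
    Returns a list of booleans, True meaning 'predicted TMH'.
    """
    labels = []
    start = 0
    for threshold, upper in chromosome:
        for score in scores[start:upper]:
            labels.append(score < threshold)  # assumed direction of the comparison
        start = upper
    return labels

# Toy profile of six residues and a chromosome with two genes.
scores = [0.5, -0.2, -0.8, -0.1, 0.3, -0.4]
chromosome = [(0.0, 3), (-0.3, 6)]
labels = apply_chromosome(scores, chromosome)
```

Here the first three residues are compared against 0.0 and the last three against −0.3, so the residue with score −0.1 is labeled non-TMH even though it would pass a global threshold of 0.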

Figure 2. Encoding protein sequence as a chromosome.


Customized Crossover and Mutation

Based on the chromosome representation and the arithmetic nature of our solution, we define a one-cut-point crossover. This is a standard way to perform crossover between the chromosomes. It consists of cutting one of the two parent chromosomes at a position $k$ into two subsets of genes (two vectors of $(t, u)$ pairs, each pairing a threshold with an upper position). Then the second chromosome is cut at the position $k'$ into two other subsets. The cutting point $k'$ is determined as the rank of the pair $(t_{k'}, u_{k'})$ where the position $u_{k'}$ is the smallest position in the second parent chromosome greater than $u_k$. Two new chromosomes are then created by interleaving the subsets.

Mutation is the second reproduction operator; it occurs with a small probability $p_m$. When a chromosome is selected for mutation, a small number of its genes are randomly chosen to be modified. With our chromosome encoding, two ways of modifying a gene $(t_i, u_i)$ are used. In the first, the threshold $t_i$ is modified by making a positive or negative variation of its value, while in the second, the upper bound $u_i$ is moved towards either $u_{i-1}$ or $u_{i+1}$.
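The two operators can be sketched on chromosomes stored as lists of (threshold, upper position) pairs. This is one possible reading of the description above, not the authors' code: the cut in the second parent is placed at the first gene whose position exceeds the position at the cut in the first parent, and the mutation keeps positions strictly increasing and leaves the final position (the sequence length) fixed. The names and the `max_shift` parameter are hypothetical.

```python
import random

def one_cut_crossover(parent1, parent2, rng):
    """Swap gene tails so that upper positions stay strictly increasing."""
    if len(parent1) < 2:
        return list(parent1), list(parent2)
    k = rng.randrange(1, len(parent1))           # cut rank in the first parent
    cut_position = parent1[k - 1][1]
    # Rank of the first gene in parent2 whose position exceeds the cut position.
    k2 = next(i for i, (_, u) in enumerate(parent2) if u > cut_position)
    child1 = parent1[:k] + parent2[k2:]
    child2 = parent2[:k2] + parent1[k:]
    return child1, child2

def mutate(chromosome, rng, max_shift=0.1):
    """Perturb one threshold, or move one upper bound between its neighbors."""
    genes = list(chromosome)
    i = rng.randrange(len(genes))
    threshold, upper = genes[i]
    is_last = i == len(genes) - 1
    lowest = genes[i - 1][1] + 1 if i > 0 else 1
    highest = upper if is_last else genes[i + 1][1] - 1
    if is_last or lowest >= highest or rng.random() < 0.5:
        threshold += rng.uniform(-max_shift, max_shift)
    else:
        upper = rng.randint(lowest, highest)
    genes[i] = (threshold, upper)
    return genes

def is_valid(chromosome, length):
    """Positions must increase strictly and end at the sequence length."""
    positions = [u for _, u in chromosome]
    return all(a < b for a, b in zip(positions, positions[1:])) and positions[-1] == length

rng = random.Random(1)
p1 = [(0.0, 10), (-0.2, 25), (0.1, 40)]
p2 = [(-0.1, 15), (0.05, 30), (0.0, 40)]
c1, c2 = one_cut_crossover(p1, p2, rng)
mutated = mutate(c1, rng)
```

Both children and the mutated chromosome remain valid encodings of a 40-residue sequence, so they can be decoded and scored like any other chromosome.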

Evaluation measures

To test the TMHindex method and compare its performance to the existing state-of-the-art predictors, we used the following evaluation measures:

  1. TMH segment prediction success rate (Inline graphic),
    graphic file with name pone.0021821.e091.jpg (5)
    where Inline graphic, Inline graphic and Inline graphic are the number of TMH segments correctly predicted, the total number of TMH segments in the test dataset and the total number of TMH segments, respectively. A prediction is considered correct if there is an overlap of at least nine amino acids between the predicted and the experimentally known TMH segment. This threshold length is quite reasonable compared to the typical TMH which are on average Inline graphic residues long. In the past, various length of residues overlap was used such as Inline graphic [12], Inline graphic [26] and Inline graphic [10].
  2. Protein prediction success rate (Inline graphic),
    graphic file with name pone.0021821.e100.jpg (6)
    where Inline graphic, Inline graphic and Inline graphic are the number of correctly predicted proteins, the total number of proteins in the test dataset and the total number of testing protein sequences, respectively. A protein is considered correctly predicted if all of its TMH segments are correctly predicted.
  3. Amino acid prediction success rate (Inline graphic),
    graphic file with name pone.0021821.e105.jpg (7)
    where Inline graphic and Inline graphic are the number of correctly predicted amino acids and the total number of amino acids in a protein sequence, respectively. This evaluation measure is also used as the fitness function in the proposed GA.
  4. The N-score and C-score. These two scores (illustrated in Figure 3) evaluate the accuracy of predicting the in and out ends of TMHs [27]. N- and C-scores are the number of N- and C-terminal residues that do not match when comparing the predicted TMH segment and the known TMH segment. A lower score in this case means a more accurate prediction. If the prediction of a TMH segment is an exact match, then its N- and C-scores are both equal to 0.

  5. Sensitivity ($S_n$) and specificity ($S_p$),
    $$S_n = \frac{TP}{TP + FN} \qquad (8)$$
    $$S_p = \frac{TN}{TN + FP} \qquad (9)$$
    where $TP$ is the number of amino acids within the known TMH segment predicted as ‘TMH’, $TN$ is the number of amino acids outside the known TMH segment predicted as ‘non-TMH’, $FP$ is the number of amino acids outside the known TMH segment predicted as ‘TMH’ and $FN$ is the number of amino acids within the known TMH segment predicted as ‘non-TMH’.
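The measures listed above can be illustrated with a small Python sketch that works on per-residue boolean labels (True for TMH). The function names are hypothetical, the N- and C-scores are interpreted here as the absolute offsets of the predicted start and end from the known ones, and the guards against empty classes are an addition of this sketch.

```python
def residue_metrics(known, predicted):
    """Per-residue accuracy, sensitivity and specificity from boolean labels."""
    tp = sum(k and p for k, p in zip(known, predicted))
    tn = sum((not k) and (not p) for k, p in zip(known, predicted))
    fp = sum((not k) and p for k, p in zip(known, predicted))
    fn = sum(k and (not p) for k, p in zip(known, predicted))
    return {
        "accuracy": (tp + tn) / len(known),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

def segments(labels):
    """(start, end) pairs, 0-based and inclusive, of consecutive True runs."""
    runs, start = [], None
    for i, flag in enumerate(labels):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(labels) - 1))
    return runs

def overlap(a, b):
    """Number of residues shared by two segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def segment_correct(known_segment, predicted_segments, min_overlap=9):
    """A known segment counts as predicted if some prediction overlaps it enough."""
    return any(overlap(known_segment, p) >= min_overlap for p in predicted_segments)

def n_c_scores(known_segment, predicted_segment):
    """Mismatch at the N-terminal and C-terminal ends (assumed interpretation)."""
    return (abs(known_segment[0] - predicted_segment[0]),
            abs(known_segment[1] - predicted_segment[1]))

# Toy example: one 20-residue helix predicted with a shift of two residues.
known = [False] * 5 + [True] * 20 + [False] * 5
predicted = [False] * 7 + [True] * 20 + [False] * 3
metrics = residue_metrics(known, predicted)
known_segs, predicted_segs = segments(known), segments(predicted)
```

In this toy case the helix is counted as correctly predicted under the nine-residue overlap rule, with N- and C-scores of 2 each.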

Figure 3. The N and C scores.


Results and Discussion

Illustration

To illustrate the experimental work, in Figure 4 and Figure 5 we show the way the TMH segment is detected in a sample protein 1OCC using the index Inline graphic with a threshold value of 0. We used odd window sizes, from Inline graphic to 19, to calculate Inline graphic values which represent each amino acid in the sequence. The maximum window size was chosen to be 19 because a 19-residue segment is close to the thickness of the hydrocarbon core of a lipid bilayer [28]. In the sample sequence, the known TMH segment (in bold) starts at residue Inline graphic and ends at residue Inline graphic. The length of the protein sequence Inline graphic and therefore Inline graphic, C-score = Inline graphic and N-score = Inline graphic.

Figure 4. Sample protein 1OCC.


Figure 5. TMH segment detection in protein 1OCC using the index Inline graphic.


To improve the prediction accuracy we incorporated the compositional index Inline graphic and the results are shown in Figure 6, where we can easily spot the improved accuracy, i.e., Inline graphic, C-score = Inline graphic and N-score = Inline graphic.

Figure 6. TMH segment detection in protein 1OCC using the compositional index Inline graphic.


As a second enhancement of our approach, GA was applied to find the optimal threshold set separating TMH segments from the non-TMH segments, as illustrated in Figure 7. Prior to the application of GA, several runs were performed to tune the different parameters. As a result of parameter tuning, the number of generations Inline graphic was set to Inline graphic and the population size to Inline graphic. During the reproduction process, crossover and mutation occur with probabilities Inline graphic equal to Inline graphic and Inline graphic equal to Inline graphic, respectively. The elitism strategy was used by which the Inline graphic fittest chromosomes of one generation are cloned and copied to the next generation. After applying GA to the sequence of the protein 1OCC, the latter is divided into Inline graphic equal parts. Each part consists of Inline graphic residues and the two upper boundary positions, Inline graphic and Inline graphic, are respectively found by GA to be Inline graphic and Inline graphic. The threshold values Inline graphic and Inline graphic are computed to be Inline graphic and Inline graphic, respectively. The obtained structure of the protein 1OCC, as computed by GA, achieved high accuracy, i.e., Inline graphic, C-score = Inline graphic and N-score = Inline graphic.

Figure 7. TMH segment detection in protein 1OCC using GA.


Comparison with existing methods

The aim of the TMH segment prediction method is to obtain high accuracy when applied to unknown proteins. For predicting the TMH segment within a protein, we first computed the index Inline graphic. We collected the TMH and non-TMH segments from a training dataset. The training dataset contains Inline graphic protein sequences which consist of Inline graphic known TMH segments. The testing dataset contains 70 protein sequences which consist of 378 known TMH segments. The training and testing datasets have experimentally determined TMH topology and were used by most of the available TMH predictors such as MemBrain [10], Phobius [14], THUMBU [13] and TMHMM [12]. The datasets are available at http://faculty.uaeu.ac.ae/nzaki/TMHindex.htm.

The performance of TMHindex was measured by Inline graphic, Inline graphic, N-score, C-score and the number of TMH segments which were correctly predicted. The performance of TMHindex compared with that of THUMBU, SOSUI, DAS-TMfilter, TOP-PRED, TMHMM, Phobius and MemBrain is reported in Table 2. The results show that TMHindex makes fewer misclassifications of TM helices. It outperforms the compared methods according to all of the measures used for performance evaluation. To analyze the performance of TMHindex based on approximately one helical turn, we calculated Inline graphic based on an overlap of five amino acids between the predicted and the experimentally known TMH segment. The accuracy of Inline graphic in this case was found to be 100%.

Table 2. Performance comparison of various TMH predictors.

| Predictor | TMH segment success rate (%) | Protein success rate (%) | N-score | C-score | Correct TMHs |
|---|---|---|---|---|---|
| THUMBU | 85.5 | 47.1 | Inline graphic | Inline graphic | 316 |
| SOSUI | 89.1 | 57.1 | Inline graphic | Inline graphic | 334 |
| DAS-TMfilter | 90.7 | 64.3 | Inline graphic | Inline graphic | 341 |
| TOP-PRED | 92.6 | 60 | Inline graphic | Inline graphic | 352 |
| TMHMM | 91 | 65.7 | Inline graphic | Inline graphic | 343 |
| Phobius | 91.8 | 71.4 | Inline graphic | Inline graphic | 345 |
| MemBrain | 97.9 | 87.1 | Inline graphic | Inline graphic | 371 |
| TMHindex | 99.46 | 91.1 | Inline graphic | Inline graphic | 376 |

TMHindex was able to predict 376 of the total 378 TMH segments in the testing dataset. The two TMH segments that were not predicted were from proteins 2IUB:A and 2B5F:A. Furthermore, the amino acid prediction success rate in terms of Inline graphic, sensitivity and specificity was Inline graphic, 0.901 and 0.865, respectively.

The distributions of helix lengths in the testing datasets were also examined (Figure 8). This is an essential feature because there is a wide distribution of TMH length amongst the 70 helical polytopic membrane proteins in the testing dataset. Our method in this case demonstrated significant ability in correctly identifying the ends of TMHs. The investigation shows that the prediction methods typically search for TM helices with length ranging between Inline graphic and Inline graphic residues. In fact, out of the 378 TM helices in the dataset, only Inline graphic (Inline graphic) of the helices fall within this range, Inline graphic (Inline graphic) have length less than Inline graphic and Inline graphic (Inline graphic) of the helices have over Inline graphic residues. Several membrane proteins contain TM helices that do not span the bilayer, for example, the pore (P) helix of the potassium channel KcsA (1K4C) and the asparagine-proline-alanine (NPA) motif-containing loops of the aquaporins. These ‘half-TMs’ are shorter in length than conventional TM helices and are expected to be more difficult to predict [27]. The distributions of TM helices given in Figure 8 reveal a small but significant population of half-TMs in the testing dataset. Similarly, there are many TMH segments longer than Inline graphic residues that often end up unpredicted or only partially predicted by most of the available methods. Figure 8 clearly shows that Phobius is unable to detect TMH segments shorter than Inline graphic and longer than Inline graphic residues. DAS-TMfilter and THUMBU are unable to detect many TMH segments longer than Inline graphic residues. MemBrain is unable to detect many TMH segments longer than Inline graphic residues. The one aspect of the TMHindex method that needs more investigation is the prediction of some TMH segments of length Inline graphic, whose predictions show more errors than those of any other segments.

Figure 8. Length distribution of the 378 known TMHs in the testing dataset compared to predicted TMHs using (a) TMHindex, (b) MemBrain, (c) THUMBU, (d) DAS-TMfilter and (e) Phobius methods.


For further validation, TMHindex was also tested on the 73-protein 3D helix database created by Zhou et al. [13]. The dataset was used to assess the predictions of the THUMBU method [13]. Pylouster et al. [19] have also used 56 proteins with correct resolutions out of the 73 proteins to study the influence of assignment on the prediction of TMHs in protein structures. The percentage of proteins with correct TMH segments (Inline graphic) predicted using TMHindex was 91.8%. The prediction accuracy in this case is superior to the accuracy achieved by other methods such as THUMBU (87.7%), TOP-PRED II (68.5%), TMHMM 2.0 (68.5%) and MEMSAT 1.8-3D (84.9%) reported by Zhou et al. [13]. Furthermore, Inline graphic, Inline graphic, N-score, C-score, Inline graphic and Inline graphic were 0.987, 0.922, 2.007, 1.517, 0.905 and 0.901, respectively.

The accuracy achieved using TMHindex in comparison to the known methods for predicting the topology of TM proteins is a strong indication of its capability. The performance of the proposed method is due to two main reasons. The first one is the employment of the TMH compositional index, which was deduced from a dataset of prior known TMH segments and the incorporation of the amino acid composition knowledge. The second one is tailoring GA, which offered a flexible way to model an intelligent predictor of TM proteins topology based on more dynamic thresholds.

The current version of TMHindex needs approximately 20 minutes to predict and converge towards accurate structures of the available 70 protein sequences using a computer equipped with an Intel Core 2 Duo CPU T7250 @ 2.00 GHz and 2.99 GB of RAM.

In the future, we will extend the TMHindex method to predict signal peptides. Predicting TMHs and signal peptides is challenging because of the high similarity between the hydrophobic regions of a TMH and those of a signal peptide [14]. Although the GA customization has significantly improved the prediction, further tuning and other strategy choices within the metaheuristic framework could achieve more capable and flexible prediction.

Acknowledgments

The authors would like to acknowledge the assistance provided by the Research Support and Sponsored Projects Office and the Faculty of Information Technology at the United Arab Emirates University (UAEU).

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: This work was partially supported by Emirates Foundation for science and engineering programs (grant no. 2010/116) and Faculty of Information Technology, UAEU. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Cserzo M, Eisenhaber F, Eisenhaber B, Simon I. TM or not TM: transmembrane protein prediction with low false positive rate using DAS-TMfilter. Bioinformatics. 2004;20:136–137. doi: 10.1093/bioinformatics/btg394.
  • 2. Nugent T, Jones D. Transmembrane protein topology prediction using support vector machines. BMC Bioinformatics. 2009;26:159. doi: 10.1186/1471-2105-10-159.
  • 3. Berman H, Westbrook J, Feng Z, Gilliland G, Bhat T, et al. The protein data bank. Nucleic Acids Res. 2000:235–242. doi: 10.1093/nar/28.1.235.
  • 4. Tusnady G, Dosztanyi Z, Simon I. Pdb-tm: selection and membrane localization of transmembrane proteins in the protein data bank. Nucleic Acids Res. 2005:275–278. doi: 10.1093/nar/gki002.
  • 5. von Heijne G, Gavel Y. Topogenic signals in integral membrane proteins. Eur J Biochem. 1988:671–678. doi: 10.1111/j.1432-1033.1988.tb14150.x.
  • 6. Kahsay RY, Gao GR, Liao L. An improved hidden markov model for transmembrane protein detection and topology prediction and its applications to complete genomes. Bioinformatics. 2005;21:1853–1858. doi: 10.1093/bioinformatics/bti303.
  • 7. Wallin E, von Heijne G. Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci. 1998;7:1029–1038. doi: 10.1002/pro.5560070420.
  • 8. Claros MG, von Heijne G. Toppred II: an improved software for membrane protein structure predictions. Computer Applications in the Biosciences. 1994;10:685–686. doi: 10.1093/bioinformatics/10.6.685.
  • 9. Hirokawa T, Boon-Chieng S, Mitaku S. SOSUI: classification and secondary structure prediction system for membrane proteins. Bioinformatics. 1998;14:378–379. doi: 10.1093/bioinformatics/14.4.378.
  • 10. Shen H, Chou JJ. Membrain: Improving the accuracy of predicting transmembrane helices. Plos One. 2008;6 doi: 10.1371/journal.pone.0002399.
  • 11. Tusnady GE, Simon I. Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol. 1998;289:489–506. doi: 10.1006/jmbi.1998.2107.
  • 12. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden markov model: Application to complete genomes. J Mol Biol. 2001;305:567–580. doi: 10.1006/jmbi.2000.4315.
  • 13. Zhou H, Zhou Y. Predicting the topology of transmembrane helical proteins using mean burial propensity and a hidden-markov-model-based method. Protein Sci. 2003;12:1547–1555. doi: 10.1110/ps.0305103.
  • 14. Kall L, Krogh A, Sonnhammer EL. A combined transmembrane topology and signal peptide prediction method. J Mol Biol. 2004:1027–1036. doi: 10.1016/j.jmb.2004.03.016.
  • 15. Rost B, Casadio R, Fariselli P. Refining neural network predictions for helical transmembrane proteins by dynamic programming. In: States DJ, Agarwal P, Gaasterland T, Hunter L, Smith R, editors. Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology. Menlo Park: AAAI Press; 1996. pp. 192–200.
  • 16. Thomas A, Bouffioux O, Geeurickx D, Brasseur R. Pex, analytical tools for pdb files i. gf-pex: basic file to describe a protein. Protein. 2001:28–36.
  • 17. Taylor T, Rivera M, Wilson G, Vaisman I. New method for protein secondary structure assignment based on a simple topological descriptor. Protein. 2005:513–524. doi: 10.1002/prot.20471.
  • 18. Hosseini S, Sadeghi M, Pezeshk H, Eslahchi C, Habibi M. Prosign: a method for protein secondary structure assignment based on three-dimensional coordinates of consecutive c(alpha) atoms. Comput Biol Chem. 2008:406–411. doi: 10.1016/j.compbiolchem.2008.07.027.
  • 19. Pylouster J, Bornot A, Etchebest C, de Brevern A. Influence of assignment on the prediction of transmembrane helices in protein structures. Amino Acids. 2010:1241–1254. doi: 10.1007/s00726-010-0559-6.
  • 20. Suyama M, Ohara O. Domcut: prediction of inter-domain linker regions in amino acid sequences. Bioinformatics. 2003;19:673–674. doi: 10.1093/bioinformatics/btg031.
  • 21. Tekaia F, Yeramian E, Dujon B. Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene. 2002;297:51–60. doi: 10.1016/s0378-1119(02)00871-5.
  • 22. Roy S, Martinez D, Platero O, Lane T, Werner-Washburne M. Exploiting amino acid composition for predicting protein-protein interactions. PLoS One. 2009;11 doi: 10.1371/journal.pone.0007813.
  • 23. Falkenauer E. Genetic algorithms and grouping problems. England: John Wiley and Sons; 1997.
  • 24. Holland J. Adaptation in Natural and Artificial Systems. Ann Arbor, Michigan: University of Michigan Press; 1975.
  • 25. Garey R. Computers and Intractability: A guide to the theory of NP-completeness. New York: W. H. Freeman and Company; 1979.
  • 26. Sonnhammer EL, von Heijne G, Krogh A. A hidden markov model for predicting transmembrane helices in protein sequences. 1998. pp. 175–182. In: Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology (ISMB-98), Montréal, Québec, Canada, June 28 - July 1, 1998. AAAI.
  • 27. Cuthbertson JM, Doyle DA, Sansom MS. Transmembrane helix prediction a comparative evaluation and analysis. Protein Eng Des Sel. 2005;18:295–308. doi: 10.1093/protein/gzi032.
  • 28. Jayasinghe S, Hristova K, White S. Energetics, stability, and prediction of transmembrane helices. J Mol Biol. 2001;312:927–934. doi: 10.1006/jmbi.2001.5008.

Articles from PLoS ONE are provided here courtesy of PLOS
