Abstract
Aim: This study was conducted to find the best suited freely available software for modelling of proteins by taking a few sample proteins. The proteins used were small to big in size with available crystal structures for the purpose of benchmarking. Key players like Phyre2, Swiss-Model, CPHmodels-3.0, Homer, (PS)2, (PS)2-V2, Modweb were used for the comparison and model generation. Results: Benchmarking process was done for four proteins, Icl, InhA, and KatG of Mycobacterium tuberculosis and RpoB of Thermus Thermophilus to get the most suited software. Parameters compared during analysis gave relatively better values for Phyre2 and Swiss-Model. Conclusion: This comparative study gave the information that Phyre2 and Swiss-Model make good models of small and large proteins as compared to other screened software. Other software was also good but is often not very efficient in providing full-length and properly folded structure.
Keywords: Mycobacterium tuberculosis, Proteins, Homology modelling, Freeware, Benchmarking
Background
To rationally develop new anti-infective agents, it is essential to study the genetics and physiology of microbes. The main cause of hindrance in the designing of new drugs rationally, is the lack of structural information of some very well known targets of drugs for example RpoB protein in M. tuberculosis.
The literature has good amount of data available about the in vitro identification of mutations in various microbial genes and the data have been precisely correlated with the phenotypic expression of resistance. However, unavailability of crystal structures of some proteins makes it difficult to predict and understand the exact impact of the mutation on structural changes and binding of drugs or other inhibitors. The only way to understand its three dimensional structure and the related properties to be used in understanding protein-ligand binding is to model the protein with in-silico approaches using amino acid sequences as starting point. There are numerous tools available for this purpose. Some very good tools are too costly and for most of the scholars affordability is always a concern. Online or freely available tools generated by academia have been a great help for such researchers. This study aims to explore and compare a few freely available on-line tools. As per the reviewed literature, software chosen for the comparison are top in the list of available free software for homology modelling of protein structure. The information about the best software would allow resource limited settings to still move forward with their studies while saving cost, time and effort. In order to fulfil the aim of this study, structure generation of proteins with known crystal structures was attempted and it was tried to find the best online homology modelling server.
In this study, the proteins chosen for benchmarking experiment were different in nature. One was Icl protein of M. tuberculosis and other was the RpoB protein of Thermus Thermophilus. Since Icl is a small protein of M. tuberculosis and RpoB is a long protein of different microorganism, a range of proteins was covered in this experiment. The other two proteins InhA and KatG from M. tuberculosis with sizes smaller and larger than the tested proteins were taken in order to confirm and validate the results obtained.
Software
Phyre2:
Protein Homology/analogy Recognition engine 2 (PHYRE2) is a free online homology modelling server [1, 2]. Phyre2 uses the alignment of hidden Markov models via HHsearch to significantly improve accuracy of alignment and detection rate. Phyre2 also incorporates a new ab-initio folding simulation called Poing to model those regions of proteins in question which have no detectable homology to known structures [3]. Poing is also used to combine multiple templates. Distance constraints from individual models are treated as linear elastic springs. Poing then synthesises entire protein in the presence of these springs and at the same time models unconstrained regions using its physics simulation [1].
Swiss-Model:
It is a web-based integrated service dedicated to protein structure homology modelling. A personal working environment is provided for each user where several modelling projects can be carried out in parallel. Protein sequence and structure databases necessary for modelling are accessible from the workspace and are updated in regular intervals. Tools for template selection, model building and structure quality evaluation can be invoked from within the workspace [4].
CPHmodels-3.0:
CPHmodels-3.0 is a web-server predicting 3D-structure of protein by use of single template homology modelling. The server employs a hybrid of the scoring functions of CPHmodels-2.0 and a novel remote homology-modelling algorithm. The web server is available at http://www.cbs.dtu.dk/services /CPHmodels/ [5].
ModWeb:
ModWeb is a server for automated comparative protein structure modelling (http://salilab.org/modweb). It accepts one or many sequences in the FASTA format and calculates models for them based on the best available template structures from the Protein Data Bank. The structural templates used to build models in ModPipe consist of a set of non-redundant chains extracted from structures in the PDB [6]. Sequencestructure matches are established using multiple variations of sequence-sequence, profile-sequence, sequence-profile and profile-profile alignment methods. Significant alignments (Evalue better than 1.0) covering at least 30 amino acid residues are selected for modelling. Models are built for each one of the sequence-structure matches using comparative modelling by satisfaction of spatial restraints as implemented in Modeller [7]. The resulting models are evaluated using several model assessment schemes like MTALL (Training set is based on the template structure), MSALL (Training set is based on similar secondary structure), RMSD (Predicted RMSD), etc and the best scoring models are returned to the user [8].
(PS)2:
PS)2 is an automated homology modelling server [9]. The method uses an effective consensus strategy by combining PSIBLAST [10], IMPALA [11], and T-Coffee [12] in both template selection and target-template alignment. MODELLER [13], the modelling package, is used to built the final three dimensional structure. The PROCHECK program, after generating a predicted model with no other refinements, was used to evaluate the quality of this model based on the G-factor. Finally, the predicted model is displayed by Chime and automatically sent to users [9]. The main drawback of PS2 server is that it cannot process the sequence more than 800 amino acid.
(PS)2-V2:
(PS)2-v2 is the advanced version of (PS)2, an automatic homology modelling server [9]. The method uses a new substitution matrix called S2A2. This is a 60×60 substitution matrix based on secondary structure propensities of 20 amino acids (aa). It is a handy tool for the detection of remote homologous and target-template alignment. The final 3-D structure is built using the modelling package MODELLER. After generated a model, the programs ProQ and ProQres are used to evaluate the quality of this model based on a number of structural features predicts the quality of a protein model and to find correct models in contrast to other methods which are optimized to find native structures [14]. Finally, the predicted model was displayed using software for molecular visualization (AstexViewer) and automatically sent to users.
Homer:
Homer (HOmology ModellER) is a comparative modelling server for protein structure prediction. It builds a model structure from an alignment (in FASTA format) and a single template structure (PDB format). Homer performs loop modelling and side-chain optimization on request. The program utilizes FRST to generate the model and a per-residue energy profile. Scoring functions are widely used in the final step of model selection in protein structure prediction. A novel combination of four knowledge-based potentials recognizing different features of native protein structures is introduced and tested. The pairwise, solvation, hydrogen bond, and torsion angle potentials contain largely orthogonal information. Of these, the torsion angle potential is found to show the strongest correlation with model quality. Combining these features with a linear weighting function, it was possible to construct a robust energy function capable of discriminating native-like structures on several benchmarking sets.
Methodology
Common parameters for comparison:
The Expectation value or Expect value (E value) represents the number of different alignments with low or better scores that is expected to occur in a database search by chance. The lower the Expectation value, the more significant the score and the alignment are. Phyre2 does not provide E-value.
Sequence identity and template selection are other important parameters to begin with. Sequence identity is the extent to which two sequences have the same residues at the same positions when aligned together. Sequence coverage shows how much sequence is covered for generating the model. Template selection and the resolution of template are very important.
Templates having finer resolution (generally <2.0Å) are treated to be good templates. Better template results in better alignment and finally better three dimensional structures.
Other important parameters which were included for comparison were RMS deviation of the modelled structure with the template used and second one was the energy of the structure obtained calculated by Swiss PDB Viewer [4].
Icl modelling and benchmarking:
The Icl protein of M. tuberculosis [Uniprot-Id P0A5H3] was selected for testing the software followed by the benchmarking process. Isocitrate lyase (Icl) has 429aa and a good quality crystal structure [PDB-ID 1F8MA] having a resolution of 1.8Å. Models of Icl were generated using all the above mentioned software. No further processing of modelled structures was done. The models generated by different software were compared among themselves with respect to certain parameters as provided in the software (default parameters). For benchmarking, the original structure with PDB code 1F8M was taken and all the structures obtained through modelling were superimposed on it.
RpoB benchmarking:
in-silico structures of RpoB protein of T. Thermophilus [(PDB-ID 1IW7C, 2.60Å) & 1119aa] was modelled through the same procedure as like Icl.
Validation of the method by applying them on other two proteins:
To validate the results obtained from Icl and RpoB benchmarking, two new proteins of M. tuberculosis were chosen whose crystal structures are available. First is Enoyl-[acylcarrier- protein] reductase [NADH] InhA protein [Uniprot-Id P0A5Y6] [PDB-ID 3OEW, 2.20 Å] and second is Catalaseperoxidase KatG protein [Uniprot-Id Q08129] [PDB-ID 2CCA, 2.00 Å]. Crystal structures of these proteins are available with RCSB-IDs as of 3OEW and 1SJ2 respectively. InhA protein is 269aa long and KatG is 740aa long.
Results
Modelling and benchmarking:
Icl: In Icl modelling 1F8M, 1F8I, 1F61 templates were selected for model generation. All these were structures of Icl protein of M. tuberculosis with or without ligand. This showed that all the software work well in opting the right and best template for model generation. The structures were modelled by Phyre2, Swiss-Model, CPHmodels-3.0, Homer, ModWeb, (PS)2, and (PS)2-V2. Different software picked different templates based on various selection methods. Some software like Swiss-Model take one letter amino acid code as the query sequence and other software like CPHmodels-3.0 take protein sequence in FASTA format. The detailed results are provided in Table 1 (see supplementary material) and the models generated with different software were superimposed to see the structural similarity (Figure 1). Structures generated by (PS)2 and (PS)2-V2 are absolutely same with an RMS deviation of 0.0Å. All the figures of generated models by individual software are presented in the supplementary files. The sequence of protein of Icl is 429aa long. The model generated by Phyre2 and Swiss- Model is of 429aa and the rest software make 427aa long structure. Homer generated 425aa long model. The templates picked by all software were same as like the crystal structure and hence the results of modelling obviously explain the benchmarking success.
Figure 1.
Superimposed figure of all the models of Icl protein of M. tuberculosis generated by different software phyre2 in orange (brick colour), HOMER in blue, ModWeb in red, CPHmodels in green, PS2 in grey, ps2-v2 in pink, swiss-model in cyan. Crystal structure is in yellow.
RpoB: RpoB protein of T. Thermophilus is modelled using the same software as used for Icl. Table 1 shows the results obtained. For RpoB benchmarking process, different software selected different templates for homology modelling. 1YNJ (Taq RNA polymerase) and 1SMY (Thermus thermophilus RNA polymerase) were selected by Phyre2, 2CW0M (Crystal structure of T. Thermophilus RNA polymerase holoenzyme at 3.3Å resolution) used by Swiss-Model, 1IW7C (Crystal structure of the RNA polymerase holoenzyme from T. Thermophilus at 2.6Å resolution) by CPHmodels-3.0, and 2A6HC (Crystal structure of the T. thermophilus RNA polymerase holoenzyme in complex with antibiotic sterptolydigin) by ModWeb. Homer software covered 1117aa out of 1119 for model generation. All other software covered the whole sequence in generating the model. All the generated structures were superimposed to the crystal structures of the protein and also to each other (Figure 2) and RMS values were recorded (Table 1). Structure generated by Swiss-Model using template 2CW0M showed RMS deviation of 0Å.
Figure 2.
Superimposed figure of all the models of RpoB protein of Thermus thermophilus generated by different software phyre2 in grey, HOMER in blue, ModWeb in pink, CPHmodels in red, ps2-v2 in green. Crystal structure is in yellow.
InhA: Results of InhA benchmarking were as like the results of Icl and RpoB benchmarking (Table 1). In the model generation of inhA (269aa), every software used different template to generate the final three dimensional model. Phyre2 covered the whole sequence. Other software left some residues while generating the model. The E-value obtained by Swiss-Model, CPHmodels-3.0, PS2, and PS2-V2 is a non zero value. Superimposition of models generated and crystal structure provided the information regarding the difference between the obtained structures from different software (Figure 3). The figure revealed that the model generated by ModWeb had a loop protruding outside the superimposed core structure whereas all other software made similar structure alike crystal structure.
Figure 3.
Superimposed figure of all the models of InhA protein of M. tuberculosis generated by different software phyre2 in blue, Homer in yellow, ModWeb in green, CPHmodels in grey, (PS)2 in pink, (PS)2-v2 in red, swiss-model in cyan. Crystal structure is in brick red.
KatG: Modelling of KatG protein of M. tuberculosis having sequence length of 740aa, showed varied results (Table 1). Except Phyre2 and Swiss –model none other software covered the whole sequence for model generation. E-value of this model given by Swiss-Model is 0.00 e-1. Other software except Phyre2 gave 0 as the E-value. . All the generated structures were superimposed to each other and differences are recorded (Figure 4). The figure revealed that the model generated by (PS)2 had a long loop protruding outside the superimposed core structure whereas all other software made similar structure as the crystal structure.
Figure 4.
Superimposed figure of all the models of KatG protein of M. tuberculosis generated by different software phyre2 in blue, Homer in yellow, ModWeb in green, CPHmodels in cyan, (PS)2 in pink, (PS)2-v2 in red, swiss-model in grey. Crystal structure is in brick red.
Discussion
Proteins with available crystal structures helped finding the best software to be used for modelling of proteins with crystal structure. The default parameters of all the software were chosen so as to compare them without user-biasness. However, the option for using advanced parameters as per requirement and skills is always available. The main idea behind the benchmarking process with the help of four different proteins was to judge the software quality to a next level. This step was to decipher whether these software work well with short as well as long sequences.
From Icl benchmarking process it was deciphered that all the software work well for shorter sequences and provide enough fair results. In Icl benchmarking, Phyre2 and Swiss-Model covered the whole sequence whereas other software left some residues either because of the unavailability of the correct residue profile or to make a smaller but better conformation having good resolution. Although Phyre2 picked up the right template and covered the whole sequence, there is a reportable amount of deviation in the RMS value between in-silico structure and the template used. It was observed that when crystal structure of a protein sequence is present, Swiss-Model could pick it up and made the structure accordingly with low RMS deviation. In case of non-availability of crystal structure and hence no proper template, software (Swiss-Model) made model of higher RMS deviation. This is because the template selected is of different kind and the alignment is not expected to be the same.
To confirm if this software also provide good result for long chain protein sequences, same methodology of model generation was applied on the RpoB protein of T. Thermophilus with Uniprot-ID 1IW7 using chain C. In case of Icl modelling templates selected were more or less same. However, in case of RpoB modelling there was a variety in template selection. The reason of this was that sequence-structure alignment might not be good. On performing a protein-protein blast on the basis of non redundant or ref-sequence database, 4GZZ (Crystal structures of bacterial RNA Polymerase paused elongation complexes) at 4.29Å resolution was found as the best hit.
The decision of choosing the best online software is also validated with the use of two more proteins one is small protein InhA (269aa) and second is a relatively long chain protein, KatG (740aa). Both the proteins are of M. tuberculosis having crystal structure information in PDB database. Table 1 also shows the results of InhA and KatG protein modelling. For InhA, all software gave better results as we got in the case of Icl but in the case of KatG same problem of incomplete sequence coverage is encountered by almost all software as we got in RpoB protein. As the length of protein increases, it becomes difficult for automated software to find the right and best suited template. The alignment also becomes tedious to perform. This problem was taken care by Phyre2 that uses multiple templates to cover the whole sequence and search better folding patterns to make the complete three dimensional structure of the protein. This results in increased RMS deviation but overall quality of structure is highly reliable. Modelled structure can be used for further analysis after post modelling modifications.
Disclosure statement
No competing financial interests exist.
Supplementary material
Footnotes
Citation:Nema & Pal, Bioinformation 9(15): 796-801 (2013)
References
- 1.Kelley LA, Sternberg MJ. Nat Protoc. 2009;4:363. doi: 10.1038/nprot.2009.2. [DOI] [PubMed] [Google Scholar]
- 2.Soding J. Bioinformatics. 2005;21:951. doi: 10.1093/bioinformatics/bti125. [DOI] [PubMed] [Google Scholar]
- 3.Malik IA, et al. J Pak Med Assoc. 1993;43:118. [PubMed] [Google Scholar]
- 4.Guex N, Peitsch MC. Electrophoresis. 1997;18:2714. doi: 10.1002/elps.1150181505. [DOI] [PubMed] [Google Scholar]
- 5.Nielsen M, et al. Nucleic Acids Res. 2010;38:W576. doi: 10.1093/nar/gkq535. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Eswar N, et al. Nucleic Acids Res. 2003;31:3375. doi: 10.1093/nar/gkg543. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Sali A, Blundell TL. J Mol Biol. 1993;234:779. doi: 10.1006/jmbi.1993.1626. [DOI] [PubMed] [Google Scholar]
- 8.Eramian D, et al. Protein Sci. 2008;17:1881. doi: 10.1110/ps.036061.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Chen CC, et al. Nucleic Acids Res. 2006;34:W152. doi: 10.1093/nar/gkl187. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Schaffer AA, et al. Nucleic Acids Res. 2001;29:2994. doi: 10.1093/nar/29.14.2994. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Schaffer AA, et al. Bioinformatics. 1999;15:1000. doi: 10.1093/bioinformatics/15.12.1000. [DOI] [PubMed] [Google Scholar]
- 12.Notredame C, et al. J Mol Biol. 2000;302:205. doi: 10.1006/jmbi.2000.4042. [DOI] [PubMed] [Google Scholar]
- 13.Marti-Renom MA, et al. Annu Rev Biophys Biomol Struct. 2000;29:291. doi: 10.1146/annurev.biophys.29.1.291. [DOI] [PubMed] [Google Scholar]
- 14.Wallner B, Elofsson A. Protein Sci. 2006;15:900. doi: 10.1110/ps.051799606. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.