Comparative Protein Structure Modeling Using MODELLER

Benjamin Webb; Andrej Sali

doi:10.1002/cpbi.3

Author manuscript; available in PMC: 2017 Jun 20.

Published in final edited form as: Curr Protoc Bioinformatics. 2016 Jun 20;54:5.6.1–5.6.37. doi: 10.1002/cpbi.3

PMCID: PMC5031415 NIHMSID: NIHMS815731 PMID: 27322406

Abstract

Comparative protein structure modeling predicts the three-dimensional structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates). The prediction process consists of fold assignment, target-template alignment, model building, and model evaluation. This unit describes how to calculate comparative models using the program MODELLER and how to use the ModBase database of such models, and discusses all four steps of comparative modeling, frequently observed errors, and some applications. Modeling lactate dehydrogenase from Trichomonas vaginalis (TvLDH) is described as an example. The download and installation of the MODELLER software is also described.

Keywords: comparative modeling, ModBase, MODELLER, protein fold, protein structure, structure prediction

INTRODUCTION

Functional characterization of a protein sequence is one of the most frequent problems in biology. This task is usually facilitated by an accurate three-dimensional (3-D) structure of the studied protein. In the absence of an experimentally determined structure, comparative or homology modeling often provides a useful 3-D model for a protein that is related to at least one known protein structure (Marti-Renom et al., 2000; Fiser, 2004; Misura and Baker, 2005; Petrey and Honig, 2005; Misura et al., 2006). Comparative modeling predicts the 3-D structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates).

Comparative modeling consists of four main steps (Marti-Renom et al., 2000; Fig. 5.6.1): (i) fold assignment, which identifies similarity between the target and at least one known template structure; (ii) alignment of the target sequence and the template(s); (iii) building a model based on the alignment with the chosen template(s); and (iv) predicting model errors.

Figure 5.6.1 — Steps in comparative protein structure modeling. See text for details.

There are several computer programs and Web servers that automate the comparative modeling process (Table 5.6.1). The accuracy of the models calculated by many of these servers is evaluated by CAMEO (Haas et al., 2013) and the biennial CASP (Critical Assessment of Techniques for Protein Structure Prediction; Moult, 2005; Moult et al., 2009) experiment.

Table 5.6.1.

Programs and Web Servers Useful in Comparative Protein Structure Modeling

Name	URL
*Databases*
Protein Sequence Databases
Ensembl (Flicek et al., 2013)	http://www.ensembl.org
GENBANK (Benson et al., 2013)	http://www.ncbi.nlm.nih.gov/Genbank/
Protein Information Resource (Huang et al., 2007)	http://pir.georgetown.edu/
UniprotKB (Bairoch et al., 2005)	http://www.uniprot.org
Domains and Superfamilies
CATH/Gene3D (Pearl et al., 2005)	http://www.cathdb.info
InterPro (Hunter et al., 2012)	http://www.ebi.ac.uk/interpro/
MEME (Bailey and Elkan, 1994)	http://meme.nbcr.net/meme/
PFAM (Bateman et al., 2004)	http://pfam.sanger.ac.uk/
PRINTS (Attwood et al., 2012)	http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
ProDom (Bru et al., 2005)	http://prodom.prabi.fr
ProSite (Hulo et al., 2006)	http://prosite.expasy.org/
SCOP (Andreeva et al., 2004)	http://scop.mrc-lmb.cam.ac.uk/scop/
SFLD (Brown and Babbitt, 2012)	http://sfld.rbvi.ucsf.edu/
SMART (Letunic et al., 2012)	http://smart.embl-heidelberg.de/
SUPERFAMILY (Gough et al., 2001)	http://supfam.cs.bris.ac.uk/SUPERFAMILY/
Protein Structures and Models
ModBase (Pieper et al., 2011)	http://salilab.org/modbase/
PDB (Berman et al., 2000)	http://www.pdb.org/
Protein Model Portal (Arnold et al., 2009; Haas et al., 2013)	http://www.proteinmodelportal.org/
SwissModel Repository (Kiefer et al., 2009)	http://swissmodel.expasy.org/repository/
Miscellaneous
DBALI (Marti-Renom et al., 2001)	http://salilab.org/dbali
GENECENSUS (Lin et al., 2002)	http://bioinfo.mbb.yale.edu/genome/
*Alignment*
Sequence and structure based sequence alignment
AlignMe (Khafizov et al., 2010)	http://www.bioinfo.mpg.de/AlignMe/
CLUSTALW (Thompson et al., 1994)	http://www2.ebi.ac.uk/clustalw/
COMPASS (Sadreyev and Grishin, 2003)	ftp://iole.swmed.edu/pub/compass/
EXPRESSO (Armougom et al., 2006)	http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi
FastA (Pearson, 2000)	http://www.ebi.ac.uk/Tools/sss/fasta/
FFAS03 (Jaroszewski et al., 2005)	http://ffas.burnham.org/
FUGUE (Shi et al., 2001)	http://www-cryst.bioc.cam.ac.uk/fugue
GENTHREADER (Jones, 1999; McGuffin and Jones, 2003)	http://bioinf.cs.ucl.ac.uk/psipred/
HHBlits/HHsearch (Remmert et al., 2012)	http://toolkit.lmb.uni-muenchen.de/hhsuite
MAFFT (Katoh and Standley, 2013)	http://mafft.cbrc.jp/alignment/software/
MUSCLE (Edgar, 2004)	http://www.drive5.com/muscle
MUSTER (Wu and Zhang, 2008)	http://zhanglab.ccmb.med.umich.edu/MUSTER
PROMALS3D (Pei et al., 2008)	http://prodata.swmed.edu/promals3d/promals3d.php
PSI-BLAST (Altschul et al., 1997)	http://blast.ncbi.nlm.nih.gov/Blast.cgi
PSIPRED (McGuffin et al., 2000)	http://bioinf.cs.ucl.ac.uk/psipred/
SALIGN (Eswar et al., 2003)	http://salilab.org/salign/
SAM-T08 (Karplus et al., 2003; Karplus, 2009)	http://compbio.soe.ucsc.edu/HMM-apps/
Staccato (Shatsky et al., 2006)	http://bioinfo3d.cs.tau.ac.il/staccato/
T-Coffee (Notredame et al., 2000; Notredame, 2010)	http://www.tcoffee.org/
Structure
CE (Prlic et al., 2010)	http://source.rcsb.org/jfatcatserver/ceHome.jsp
GANGSTA+ (Guerler and Knapp, 2008)	http://agknapp.chemie.fu-berlin.de/gplus/index.php
HHsearch (Soding, 2005)	ftp://toolkit.lmb.uni-muenchen.de/hhsearch/
Mammoth (Ortiz et al., 2002)	http://ub.cbm.uam.es/software/mammoth.php
Mammoth-mult (Lupyan et al., 2005)	http://ub.cbm.uam.es/software/mammothm.php
MASS (Dror et al., 2003)	http://bioinfo3d.cs.tau.ac.il/MASS/
MultiProt (Shatsky et al., 2004)	http://bioinfo3d.cs.tau.ac.il/MultiProt
MUSTANG (Konagurthu et al., 2006)	http://www.csse.monash.edu.au/~karun/Site/mustang.html
PDBeFold (Dietmann et al., 2001)	http://www.ebi.ac.uk/msd-srv/ssm/
SALIGN (Eswar et al., 2003)	http://salilab.org/salign/
TM-align (Zhang and Skolnick, 2005)	http://zhanglab.ccmb.med.umich.edu/TM-align/
Alignment modules in molecular graphics programs
Discovery Studio	http://www.accelrys.com
PyMol	http://www.pymol.org/
Swiss-PDB Viewer (Kaplan and Littlejohn, 2001)	http://spdbv.vital-it.ch/
UCSF Chimera (Huang et al., 2000)	http://www.cgl.ucsf.edu/chimera
*Comparative Modeling, Threading, and Refinement*
Web servers
3d-jigsaw (Bates et al., 2001)	http://www.bmm.icnet.uk/servers/3djigsaw/
HHPred (Soding et al., 2005)	http://toolkit.genzentrum.lmu.de/hhpred
IntFold (Roche et al., 2011)	http://www.reading.ac.uk/bioinf/IntFOLD/
i-TASSER (Roy et al., 2010)	http://zhanglab.ccmb.med.umich.edu/I-TASSER/
M4T (Fernandez-Fuentes et al., 2007)	http://manaslu.aecom.yu.edu/M4T/
ModWeb (Eswar et al., 2003)	http://salilab.org/modweb/
Phyre2 (Kelley and Sternberg, 2009)	http://www.sbg.bio.ic.ac.uk/phyre2
RaptorX (Kallberg et al., 2012)	http://raptorx.uchicago.edu/
Robetta (Song et al., 2013)	http://robetta.bakerlab.org/
SWISS-MODEL (Schwede et al., 2003)	http://www.expasy.org/swissmod
Programs
HHsuite (Soding, 2005)	ftp://toolkit.genzentrum.lmu.de/pub/HH-suite/
Modeller (Sali and Blundell, 1993)	http://salilab.org/modeller/
MolIDE (Wang et al., 2008)	http://dunbrack.fccc.edu/molide/
Rosetta@home	http://boinc.bakerlab.org/rosetta/
RosettaCM (Song et al., 2013)	https://www.rosettacommons.org/home
SCWRL (Krivov et al., 2009)	http://dunbrack.fccc.edu/scwrl4/SCWRL4.php
Quality estimation
ANOLEA (Melo and Feytmans, 1998)	http://melolab.org/anolea/index.html
ERRAT (Colovos and Yeates, 1993)	http://nihserver.mbi.ucla.edu/ERRAT/
ModEval	http://salilab.org/modeval/
ProQ2 (Ray et al., 2012)	http://proq2.theophys.kth.se/
PROCHECK (Laskowski et al., 1993)	http://www.ebi.ac.uk/thornton-srv/software/PROCHECK/
Prosa2003 (Sippl, 1993; Wiederstein and Sippl, 2007)	http://www.came.sbg.ac.at
QMEAN local (Benkert et al., 2011)	http://www.openstructure.org/download/
SwissModel Workspace (Arnold et al., 2006)	http://swissmodel.expasy.org/workspace/index.php?func=tools_structureassessment1
VERIFY3D (Luthy et al., 1992)	http://www.doe-mbi.ucla.edu/Services/Verify_3D/
WHATCHECK (Hooft et al., 1996)	http://www.cmbi.kun.nl/gv/whatcheck/
Methods evaluation
CAMEO (Haas et al., 2013)	http://cameo3d.org/
CASP (Moult et al., 2003)	http://predictioncenter.llnl.gov

While automation makes comparative modeling accessible to both experts and nonspecialists, manual intervention is generally still needed to maximize the accuracy of the models in difficult cases. A number of resources useful in comparative modeling are listed in Table 5.6.1.

This unit describes how to calculate comparative models using the program MODELLER (Basic Protocol). The Basic Protocol goes on to discuss all four steps of comparative modeling (Fig. 5.6.1), frequently observed errors, and the ModBase database and associated Web services. The Support Protocol describes how to download and install MODELLER.

BASIC PROTOCOL: MODELING LACTATE DEHYDROGENASE FROM TRICHOMONAS VAGINALIS (TvLDH) BASED ON A SINGLE TEMPLATE USING MODELLER

MODELLER is a computer program for comparative protein structure modeling (Sali and Blundell, 1993; Fiser et al., 2000). In the simplest case, the input is an alignment of a sequence to be modeled with the template structures, the atomic coordinates of the templates, and a simple script file. MODELLER then automatically calculates a model containing all non-hydrogen atoms, within minutes on a modern PC and with no user intervention. Apart from model building, MODELLER can perform additional auxiliary tasks, including fold assignment, alignment of two protein sequences or their profiles (Marti-Renom et al., 2004), multiple alignment of protein sequences and/or structures (Madhusudhan et al., 2006; Madhusudhan et al., 2009), calculation of phylogenetic trees, and de novo modeling of loops in protein structures (Fiser et al., 2000).

NOTE: Further help for all the described commands and parameters may be obtained from the MODELLER Web site (see Internet Resources).

Necessary Resources

Hardware

A computer running RedHat Linux (PC, Opteron, or EM64T/Xeon64 systems) or other version of Linux/Unix (x86/x86_64 Linux), Apple Mac OS X (10.6 or later), or Microsoft Windows (XP or later)

Software

The MODELLER 9.15 program, downloaded and installed from http://salilab.org/modeller/download_installation.html (see Support Protocol)

Files

All files required to complete this protocol can be downloaded from http://salilab.org/modeller/tutorial/basic-example.tar.gz (Unix/Linux) or http://salilab.org/modeller/tutorial/basic-example.zip (Windows)

Background to TvLDH

A novel gene for lactate dehydrogenase (LDH) was identified from the genomic sequence of Trichomonas vaginalis (TvLDH). The corresponding protein had higher sequence similarity to the malate dehydrogenase of the same species (TvMDH) than to any other LDH. The authors hypothesized that TvLDH arose from TvMDH by convergent evolution relatively recently (Wu et al., 1999). Comparative models were constructed for TvLDH and TvMDH to study the sequences in a structural context and to suggest site-directed mutagenesis experiments to elucidate changes in enzymatic specificity in this apparent case of convergent evolution. The native and mutated enzymes were subsequently expressed and their activities compared (Wu et al., 1999).

Searching structures related to TvLDH

Conversion of sequence to PIR file format

It is first necessary to convert the target TvLDH sequence into a format that is readable by MODELLER (file TvLDH.ali; Fig. 5.6.2). MODELLER uses the PIR format to read and write sequences and alignments. The first line of the PIR-formatted sequence consists of >P1; followed by the identifier of the sequence. In this example, the sequence is identified by the code TvLDH. The second line, consisting of ten fields separated by colons, usually contains details about the structure, if any. In the case of sequences with no structural information, only two of these fields are used: the first field should be sequence (indicating that the file contains a sequence without a known structure) and the second should contain the model file name (TvLDH in this case). The rest of the file contains the sequence of TvLDH, with an asterisk (*) marking its end. The standard uppercase single-letter amino acid codes are used to represent the sequence.

Figure 5.6.2 — File `TvLDH.ali`. Sequence file in PIR format.
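The PIR layout described above can be sketched with a few lines of plain Python. This is only an illustration of the format, not part of MODELLER: the helper name format_pir_sequence is hypothetical, leaving the eight unused header fields empty is an assumption, and the residues are placeholders rather than the real TvLDH sequence.

```python
def format_pir_sequence(code, sequence, width=75):
    """Return a PIR entry for a sequence that has no known structure."""
    # Second line: ten colon-separated fields; only the first two are filled
    # (leaving the other eight empty is an assumption of this sketch).
    fields = ["sequence", code] + [""] * 8
    header = ":".join(fields)
    # Uppercase one-letter codes, with an asterisk marking the end.
    residues = sequence.upper() + "*"
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([">P1;" + code, header] + lines) + "\n"


# Placeholder residues for illustration only; this is NOT the TvLDH sequence.
entry = format_pir_sequence("TvLDH", "acdefghiklmnpqrstvwy")
print(entry)
```

Writing the returned string to TvLDH.ali would give a file with the same overall shape as the one shown in Figure 5.6.2.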

Searching for suitable template structures

A search for potentially related sequences of known structure can be performed using the profile.build() command of MODELLER (file build_profile.py). The command uses the local dynamic programming algorithm to identify related sequences (Smith and Waterman, 1981). In the simplest case, the command takes as input the target sequence and a database of sequences of known structure (file pdb_95.pir) and returns a set of statistically significant alignments. The input script file for the command is shown in Figure 5.6.3.

Figure 5.6.3 — File `build_profile.py`. Input script file that searches for templates against a database of nonredundant PDB sequences.

The script, build_profile.py, does the following:

1. Initializes the “environment” for this modeling run by creating a new environ object (called env here). Almost all MODELLER scripts require this step, as the new object is needed to build most other useful objects.
2. Creates a new sequence_db object, calling it sdb, which is used to contain large databases of protein sequences.
3. Reads a file, in text format, containing nonredundant PDB sequences, into the sdb database. The sequences can be found in the file pdb_95.pir. This file is also in the PIR format. Each sequence in this file is representative of a group of PDB sequences that share 95% or more sequence identity to each other and have less than 30 residues or 30% sequence length difference.
4. Writes a binary machine-independent file containing all sequences read in the previous step.
5. Reads the binary format file back in for faster execution.
6. Creates a new “alignment” object (aln), reads the target sequence TvLDH from the file TvLDH.ali, and converts it to a profile object (prf). Profiles contain similar information to alignments, but are more compact and better for sequence database searching.
7. prf.build() searches the sequence database (sdb) with the target profile (prf). Matches from the sequence database are added to the profile.
8. prf.write() writes a new profile containing the target sequence and its homologs into the specified output file (file build_profile.prf; Fig. 5.6.4). The equivalent information is also written out in standard alignment format.

Figure 5.6.4 — An excerpt from the file `build_profile.prf`. The aligned sequences have been removed for clarity.

The profile.build() command has many options (see Internet Resources for MODELLER Web site). In this example, rr_file is set to use the BLOSUM62 similarity matrix (file blosum62.sim.mat provided in the MODELLER distribution). Accordingly, the parameters matrix_offset and gap_penalties_1d are set to the appropriate values for the BLOSUM62 matrix. For this example, only one search iteration is run, by setting the parameter n_prof_iterations equal to 1. Thus, there is no need to check the profile for deviation (check_profile set to False). Finally, the parameter max_aln_evalue is set to 0.01, indicating that only sequences with E-values smaller than or equal to 0.01 will be included in the output.
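The template search described in this section might be written roughly as follows. This is a sketch in the style of the MODELLER 9 Python interface, not a verbatim copy of Figure 5.6.3. The numeric values of matrix_offset and gap_penalties_1d, the minmax_db_seq_len and clean_sequences arguments, and the file names pdb_95.bin and build_profile.ali are assumptions that should be checked against the MODELLER documentation; the function name build_profile is hypothetical. MODELLER is imported inside the function so that the option table can be inspected on a machine without it.

```python
# Options passed to profile.build(). The matrix_offset and gap_penalties_1d
# values are assumptions; check them against the MODELLER documentation.
PROFILE_BUILD_OPTIONS = {
    "rr_file": "${LIB}/blosum62.sim.mat",  # BLOSUM62 similarity matrix
    "matrix_offset": -450,                 # assumed value for BLOSUM62
    "gap_penalties_1d": (-500, -50),       # assumed values for BLOSUM62
    "n_prof_iterations": 1,                # a single search iteration
    "check_profile": False,                # no deviation check needed
    "max_aln_evalue": 0.01,                # keep hits with E-value <= 0.01
}


def build_profile(target_file="TvLDH.ali", database="pdb_95.pir"):
    """Search a sequence database for homologs of the target sequence."""
    import modeller  # requires a MODELLER installation

    modeller.log.verbose()
    env = modeller.environ()

    # Read the text database, write it in binary form, and read that back.
    sdb = modeller.sequence_db(env)
    sdb.read(seq_database_file=database, seq_database_format="PIR",
             chains_list="ALL", minmax_db_seq_len=(30, 4000),
             clean_sequences=True)
    sdb.write(seq_database_file="pdb_95.bin", seq_database_format="BINARY",
              chains_list="ALL")
    sdb.read(seq_database_file="pdb_95.bin", seq_database_format="BINARY",
             chains_list="ALL")

    # Read the target sequence and convert it to a profile.
    aln = modeller.alignment(env)
    aln.append(file=target_file, alignment_format="PIR", align_codes="ALL")
    prf = aln.to_profile()

    # Scan the database, then write the profile and the equivalent alignment.
    prf.build(sdb, **PROFILE_BUILD_OPTIONS)
    prf.write(file="build_profile.prf", profile_format="TEXT")
    prf.to_alignment().write(file="build_profile.ali",
                             alignment_format="PIR")
```

Calling build_profile() in a directory that contains TvLDH.ali and pdb_95.pir would be expected to leave build_profile.prf behind for inspection.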

Execute the script using the command

python build_profile.py > build_profile.log

(or, if Python is not installed on the machine, with mod9.15 build_profile.py). At the end of the execution, a log file is created (build_profile.log). MODELLER always produces a log file. Errors and warnings in log files can be found by searching for the _E> and _W> strings, respectively.
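A log file can be scanned for these two markers with a short helper such as the one below. The function name scan_log is hypothetical, and the sample lines are invented for illustration; they are not real MODELLER output.

```python
def scan_log(lines):
    """Split log lines into errors (_E>) and warnings (_W>)."""
    errors, warnings = [], []
    for line in lines:
        if "_E>" in line:
            errors.append(line.rstrip())
        elif "_W>" in line:
            warnings.append(line.rstrip())
    return errors, warnings


# Invented sample lines; real messages will differ.
sample_log = [
    "example_____> an informational note\n",
    "example____W> an invented warning\n",
    "example____E> an invented error\n",
]
errors, warnings = scan_log(sample_log)
print(len(errors), "error(s);", len(warnings), "warning(s)")
```

For a real run, the lines would come from the log file itself, for example via open("build_profile.log").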

Selecting a template

An extract (omitting the aligned sequences) from the file build_profile.prf is shown in Figure 5.6.4. The first six commented lines indicate the input parameters used in MODELLER to create the alignments. Subsequent lines correspond to the similarities detected by profile.build(). The most important columns in the output are the second, tenth, eleventh, and twelfth columns. The second column reports the code of the PDB sequence that was aligned to the target sequence. The eleventh column reports the percentage sequence identities between TvLDH and the PDB sequence normalized by the length of the alignment (indicated in the tenth column). In general, a sequence identity value above ~25% indicates a potential template, unless the alignment is too short (i.e., <100 residues). A better measure of the significance of the alignment is given in the twelfth column by the E-value of the alignment (the lower the E-value, the better).
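The column-based screening described above can be sketched as a small parser. It assumes that each non-comment line of the profile file starts with twelve whitespace-separated columns, which matches the description in the text but should be verified on a real file. The function names and the sample rows (codes hypA, hypB, hypC) are invented for illustration.

```python
def parse_profile_hits(lines, target_code="TvLDH"):
    """Extract code, alignment length, % identity, and E-value per hit."""
    hits = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the commented parameter lines and blanks
        cols = line.split()
        if len(cols) < 12 or cols[1] == target_code:
            continue  # skip malformed rows and the target itself
        hits.append({
            "code": cols[1],              # second column
            "aln_length": int(cols[9]),   # tenth column
            "identity": float(cols[10]),  # eleventh column
            "evalue": float(cols[11]),    # twelfth column
        })
    return hits


def candidate_templates(hits, min_identity=25.0, min_length=100,
                        max_evalue=0.01):
    """Keep hits that pass the rule-of-thumb thresholds, best E-value first."""
    keep = [h for h in hits
            if h["identity"] > min_identity
            and h["aln_length"] >= min_length
            and h["evalue"] <= max_evalue]
    return sorted(keep, key=lambda h: (h["evalue"], -h["identity"]))


# Invented rows for illustration; these are not real search results.
sample = [
    "# Number of sequences:      4",
    "    1 TvLDH   S  0  335   1  335   0   0    0    0.   0.0",
    "    2 hypA    X  1  320   5  330   3  318  310   45.  0.0",
    "    3 hypB    X  1  150  10   90   8   88   80   40.  0.10E-05",
    "    4 hypC    X  1  300   1  300   1  300  295   20.  0.50E-02",
]
hits = parse_profile_hits(sample)
print([h["code"] for h in candidate_templates(hits)])
```

In the invented data, hypB is rejected because its alignment is shorter than 100 residues and hypC because its identity is below 25%.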

In this example, six PDB sequences show very significant similarities to the query sequence, with E-values equal to 0. As expected, all the hits correspond to malate dehydrogenases (1bdm:A, 5mdh:A, 1b8p:A, 1civ:A, 7mdh:A, and 1smk:A). To select the appropriate template for the target sequence, the alignment.compare_structures() command will first be used to assess the sequence and structure similarity between the six possible templates (file compare.py; Fig. 5.6.5).

Figure 5.6.5 — Script file `compare.py`.

In compare.py, the alignment object aln is created and MODELLER is instructed to read into it the protein sequences and information about their PDB files. The command malign() calculates their multiple sequence alignment, which is subsequently used as a starting point for creating a multiple structure alignment by malign3d(). Based on this structural alignment, the compare_structures() command calculates the RMS and DRMS deviations between atomic positions and distances, differences between the main-chain and side-chain dihedral angles, percentage sequence identities, and several other measures. Finally, the id_table() command writes a file (family.mat) with pairwise sequence distances that can be used as input to the dendrogram() command (or the clustering programs in the PHYLIP package; Felsenstein, 1989). dendrogram() calculates a clustering tree from the input matrix of pairwise distances, which helps to visualize differences among the template candidates. Excerpts from the log file (compare.log) are shown in Figure 5.6.6.

Figure 5.6.6 — Excerpts from the log file `compare.log`.
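The sequence of calls described for compare.py might look roughly like the sketch below, written against the MODELLER 9 Python interface. It is not a copy of Figure 5.6.5: the function names, the model_segment selection of chain A, and the cluster_cut value are assumptions, and the PDB files are assumed to be in the working directory. MODELLER is imported inside the function so the template list can be examined without it.

```python
# The six candidate templates named in the text, as (PDB code, chain) pairs.
TEMPLATES = [("1b8p", "A"), ("1bdm", "A"), ("1civ", "A"),
             ("5mdh", "A"), ("7mdh", "A"), ("1smk", "A")]


def align_code(pdb, chain):
    """Alignment code used for a template chain, e.g. '1bdmA'."""
    return pdb + chain


def compare_templates(templates=TEMPLATES):
    """Align the candidate templates and cluster them by similarity."""
    import modeller  # requires a MODELLER installation

    env = modeller.environ()
    aln = modeller.alignment(env)
    for pdb, chain in templates:
        # Read only the requested chain of each PDB file.
        mdl = modeller.model(
            env, file=pdb,
            model_segment=("FIRST:" + chain, "LAST:" + chain))
        aln.append_model(mdl, atom_files=pdb,
                         align_codes=align_code(pdb, chain))
    aln.malign()              # multiple sequence alignment
    aln.malign3d()            # multiple structure alignment
    aln.compare_structures()  # RMS, DRMS, dihedral differences, identities
    aln.id_table(matrix_file="family.mat")  # pairwise distance matrix
    # cluster_cut=-1.0 is an assumed setting for this sketch.
    env.dendrogram(matrix_file="family.mat", cluster_cut=-1.0)
```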

The objective of this step is to select the most appropriate single template structure from all the possible templates. The dendrogram in Figure 5.6.6 shows that 1civ:A and 7mdh:A are almost identical, both in terms of sequence and structure. However, 7mdh:A has a better crystallographic resolution than 1civ:A (2.4 Å versus 2.8 Å). From the second group of similar structures (5mdh:A, 1bdm:A, and 1b8p:A), 1bdm:A has the best resolution (1.8 Å). 1smk:A is the most structurally divergent among the possible templates. However, it is also the one with the lowest sequence identity (34%) to the target sequence (build_profile.prf). 1bdm:A is picked over 7mdh:A as the final template because of its higher overall sequence identity to the target sequence (45%).

Aligning TvLDH with the template

One way to align the sequence of TvLDH with the structure of 1bdm:A is to use the align2d() command in MODELLER (Madhusudhan et al., 2006). Although align2d() is based on a dynamic programming algorithm (Needleman and Wunsch, 1970), it is different from standard sequence-sequence alignment methods because it takes into account structural information from the template when constructing an alignment. This task is achieved through a variable gap penalty function that tends to place gaps in solvent-exposed and curved regions, outside secondary structure segments, and between two positions that are close in space. In the current example, the target-template similarity is so high that almost any alignment method with reasonable parameters will result in the same alignment.

The MODELLER script shown in Figure 5.6.7 aligns the TvLDH sequence in file TvLDH.ali with the 1bdm:A structure in the PDB file 1bdm.pdb (file align2d.py). The script first creates an empty alignment object, aln, and a new model object, mdl, into which chain A of the 1bdm structure is read. append_model() transfers the PDB sequence of this model to aln and assigns it the name 1bdmA (align_codes). The TvLDH sequence, from file TvLDH.ali, is then added to aln using append(). The align2d() command aligns the two sequences and the alignment is written out in two formats, PIR (TvLDH-1bdmA.ali) and PAP (TvLDH-1bdmA.pap). The PIR format is used by MODELLER in the subsequent model-building stage, while the PAP alignment format is easier to inspect visually. In the PAP format, all identical positions are marked with a * (file TvLDH-1bdmA.pap; Fig. 5.6.8). Due to the high target-template similarity, there are only a few gaps in the alignment.

Figure 5.6.8 — The alignment between sequences `TvLDH` and `1bdmA`, in the MODELLER PAP format. File `TvLDH-1bdmA.pap`.
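The alignment step described for align2d.py could be sketched as follows, using the MODELLER 9 Python interface. The function names are hypothetical and the model_segment argument used to select chain A is an assumption; the file names follow the text. MODELLER is imported inside the function so the file-naming helper can be used on its own.

```python
def output_files(target="TvLDH", template="1bdmA"):
    """Names of the PIR and PAP alignment files written by the sketch."""
    stem = target + "-" + template
    return stem + ".ali", stem + ".pap"


def align_target_to_template():
    """Align the TvLDH sequence with the 1bdm:A structure using align2d()."""
    import modeller  # requires a MODELLER installation

    env = modeller.environ()
    aln = modeller.alignment(env)
    # Read chain A of the template structure.
    mdl = modeller.model(env, file="1bdm",
                         model_segment=("FIRST:A", "LAST:A"))
    # Add the template sequence, then the target sequence.
    aln.append_model(mdl, align_codes="1bdmA", atom_files="1bdm.pdb")
    aln.append(file="TvLDH.ali", align_codes="TvLDH")
    aln.align2d()  # structure-aware pairwise alignment
    pir_file, pap_file = output_files()
    aln.write(file=pir_file, alignment_format="PIR")  # used for modeling
    aln.write(file=pap_file, alignment_format="PAP")  # easier to read
```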

Model building

Once a target-template alignment is constructed, MODELLER calculates a 3-D model of the target completely automatically, using its automodel class. The script in Figure 5.6.9 will generate five different models of TvLDH based on the 1bdm:A template structure and the alignment in file TvLDH-1bdmA.ali (file model-single.py).

Figure 5.6.9 — Script file, `model-single.py`, that generates five models.

The first line (Fig. 5.6.9) loads the automodel class and prepares it for use. An automodel object is then created and called “a” and parameters are set to guide the model-building procedure. alnfile names the file that contains the target-template alignment in the PIR format. knowns defines the known template structure(s) in alnfile (TvLDH-1bdmA.ali) and sequence defines the code of the target sequence. starting_model and ending_model define the number of models that are calculated (their indices will run from 1 to 5). The last line in the file calls the make method that actually calculates the models. The most important output files are model-single.log, which reports warnings, errors, and other useful information including the input restraints used for modeling that remain violated in the final model, and TvLDH.B9999000[1–5].pdb, which contain the coordinates of the five produced models, in the PDB format. The models can be viewed by any program that reads the PDB format, such as Chimera (http://www.cgl.ucsf.edu/chimera/) or RasMol (http://www.rasmol.org).
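A model-building script along the lines described above might look like the sketch below. It follows the MODELLER 9 automodel interface but is not a copy of Figure 5.6.9; the function names are hypothetical, and the assess_methods argument (requesting DOPE and GA341 scores) is an assumption that is not taken from the text. The small helper reproduces the output file naming pattern quoted above.

```python
def model_filenames(sequence="TvLDH", first=1, last=5):
    """Expected model file names, e.g. TvLDH.B99990001.pdb."""
    return ["%s.B9999%04d.pdb" % (sequence, i)
            for i in range(first, last + 1)]


def build_models(first=1, last=5):
    """Build models of TvLDH from the 1bdmA template."""
    import modeller  # requires a MODELLER installation
    from modeller.automodel import automodel, assess

    env = modeller.environ()
    a = automodel(env,
                  alnfile="TvLDH-1bdmA.ali",  # target-template alignment
                  knowns="1bdmA",             # code of the template
                  sequence="TvLDH",           # code of the target
                  # Requesting these scores is an assumption of this sketch.
                  assess_methods=(assess.DOPE, assess.GA341))
    a.starting_model = first  # index of the first model
    a.ending_model = last     # index of the last model
    a.make()                  # calculate the models
    return model_filenames("TvLDH", first, last)
```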

Evaluating a model

If several models are calculated for the same target, the best model can be selected by picking the model with the lowest value of the MODELLER objective function or the DOPE (Shen and Sali, 2006) or SOAP (Dong et al., 2013) assessment scores, which are reported at the end of the log file. (To calculate the SOAP score, download the SOAP-Protein library file from http://salilab.org/SOAP/ and uncomment the two SOAP-related lines in model-single.py by removing the ‘#’ characters.) In this example, the second model (TvLDH.B99990002.pdb) has the lowest objective function and is selected. None of these scores are absolute measures, in the sense that they can only be used to rank models calculated from the same alignment.
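Once the scores have been copied out of the log file, the selection rule reduces to taking a minimum. The sketch below uses invented model names and invented numbers purely for illustration; the key names molpdf and DOPE are assumed labels for the objective function and the DOPE score.

```python
def best_model(scores, key="molpdf"):
    """Return the name of the model with the lowest value of one score."""
    return min(scores, key=lambda name: scores[name][key])


# Invented values for illustration; these are not results from the TvLDH run.
scores = {
    "model_1.pdb": {"molpdf": 1900.0, "DOPE": -38000.0},
    "model_2.pdb": {"molpdf": 1750.0, "DOPE": -38400.0},
    "model_3.pdb": {"molpdf": 1820.0, "DOPE": -38600.0},
}
print(best_model(scores))          # ranked by objective function
print(best_model(scores, "DOPE"))  # ranked by DOPE
```

In the invented data the two scores favor different models, which is a reminder that the choice of ranking score is a judgment call and that the scores only rank models built from the same alignment.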

Once a final model is selected, there are many ways to further assess it. In this example, the DOPE potential in MODELLER is used to evaluate the fold of the selected model. Links to other programs for model assessment can be found in Table 5.6.1. However, before any external evaluation of the model, one should check the log file from the modeling run (model-single.log) for runtime errors and restraint violations (see the MODELLER manual for details).

The script, evaluate_model.py (Fig. 5.6.10), evaluates the model with the DOPE potential. In this script, the atomic coordinates of the PDB file are read in (using complete_pdb()) to a model object, mdl. This is necessary for MODELLER to correctly calculate the energy, and additionally allows for the possibility of the PDB file having atoms in a nonstandard order, or having different subsets of atoms (e.g., all atoms including hydrogens, while MODELLER uses only heavy atoms, or vice versa). The DOPE energy is then calculated using assess_dope(). An energy profile is additionally requested, smoothed over a 15-residue window, and normalized by the number of restraints acting on each residue. This profile is written to a file TvLDH.profile, which can be used as input to a graphing program such as GNUPLOT.

Figure 5.6.10 — File `evaluate_model.py`, used to generate a pseudo-energy profile for a single model.
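The evaluation step could be sketched as below, again in the style of the MODELLER 9 Python interface rather than as a copy of Figure 5.6.10. The topology and parameter library names and the output keyword string are assumptions to be checked against the MODELLER manual. The read_profile helper additionally assumes that comment lines in the profile file start with '#' and that the per-residue value is the last column of each data line; both function names are hypothetical.

```python
def evaluate_model(pdb_file="TvLDH.B99990002.pdb",
                   profile_file="TvLDH.profile"):
    """Score a model with DOPE and write a smoothed per-residue profile."""
    import modeller  # requires a MODELLER installation
    from modeller.scripts import complete_pdb

    env = modeller.environ()
    # Library file names are assumptions of this sketch.
    env.libs.topology.read(file="$(LIB)/top_heav.lib")
    env.libs.parameters.read(file="$(LIB)/par.lib")

    # Read the model, fixing atom order and missing atoms as needed.
    mdl = complete_pdb(env, pdb_file)
    sel = modeller.selection(mdl)  # select all atoms
    return sel.assess_dope(output="ENERGY_PROFILE NO_REPORT",
                           file=profile_file,
                           normalize_profile=True,
                           smoothing_window=15)


def read_profile(path):
    """Read per-residue values, assumed to be the last column of each line."""
    values = []
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or len(line) <= 10:
                continue  # skip comments and near-empty lines
            values.append(float(line.split()[-1]))
    return values
```

The list returned by read_profile can be passed to any plotting tool in place of GNUPLOT.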

Similarly, the profile can be calculated for the template structure (see the scripts evaluate_template.py and plot_profiles.py in the zipfile). A comparison of the two profiles is shown in Figure 5.6.11. The two DOPE profiles differ clearly in the long active-site loop between residues 90 and 100 and in the long helices at the C-terminal end of the target sequence. This long loop interacts with region 220 to 250, which forms the other half of the active site. This latter region is well resolved in both the template and the target structure. However, probably due to the unfavorable nonbonded interactions with the 90 to 100 region, it is reported to be of high energy by DOPE. Note that a region of high energy indicated by DOPE does not necessarily indicate an actual error, especially when it highlights an active site or a protein-protein interface. However, in this case, the same active-site loops have a better profile in the template structure, which strengthens the argument that the model is probably incorrect in the active-site region. Resolution of such problems is beyond the scope of this unit, but is described in a more advanced modeling tutorial available at http://salilab.org/modeller/tutorial/advanced.html.

Figure 5.6.11 — A comparison of the pseudo-energy profiles of the model (red) and the template (green) structures.

Searching for existing models in the ModBase database

ModBase (http://salilab.org/modbase/; Pieper et al., 2014) is our database of annotated comparative protein structure models. These models are constructed using ModPipe (Eswar et al., 2003), a pipeline that automates the entire process of template selection, alignment, model building, and evaluation described earlier. In addition to the basic sequence-sequence template search employed above, it conducts a more thorough sequence-profile and profile-profile search, leveraging PSI-BLAST (Altschul et al., 1997), HHBlits (Remmert et al., 2012), HHSearch (Soding, 2005), and Modeller’s own functionality. Alignments created by any of these methods can cover the complete target sequence, or only a segment of it, depending on the availability of suitable PDB templates.

Models in ModBase are organized in datasets. Because of the rapid growth of public sequence databases, efforts are concentrated on adding datasets that are useful for specific projects, rather than attempting to model all known protein sequences based on all detectably related known structures. Currently, ModBase includes a model dataset for each of 65 complete genomes, as well as datasets for all sequences in the Structure Function Linkage Database (SFLD; Pegg et al., 2006), and for the complete SwissProt/TrEMBL database as of 2005. As of 2015, ModBase contains almost 35 million reliable models for domains in 5.8 million unique protein sequences. Thus, for a sequence of interest, it is possible that models already exist in this database.

The ModBase database can be searched in many ways, e.g., by amino acid sequence, annotation keywords, the template used for modeling, accession number (such as from UniProt; Bairoch et al., 2005; Benson et al., 2013), gene name, or organism. It is also accessible from the Protein Model Portal (http://proteinmodelportal.org; Arnold et al., 2009; Haas et al., 2013) and is cross-linked to many other databases, such as UniProt.

ModBase can be searched for the TvLDH sequence, which was modeled above (Fig. 5.6.2), from the main ModBase search page (http://salilab.org/modbase/), by selecting “Sequence Similarity (Blast)” from the “Search type” drop-down menu, selecting the “100 % Sequence identity” button, and then pasting the raw TvLDH sequence (without the FASTA header) into the search box. On pressing the Search button, the ModBase Sequence Overview page is obtained (Fig. 5.6.12). On clicking the coverage sketch (the blue bar on the left side of that page), the Model Details page is displayed (Fig. 5.6.13).
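Because the search box expects the raw sequence, a header and the trailing asterisk have to be stripped from a PIR or FASTA entry first. The helper below is a hypothetical convenience for that step, not part of ModBase or MODELLER, and the residues in the example are placeholders rather than the TvLDH sequence.

```python
def raw_sequence(text):
    """Return only the residues from a PIR- or FASTA-formatted entry."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and lines[0].startswith(">"):
        # PIR entries have two header lines, FASTA entries have one.
        skip = 2 if lines[0].startswith(">P1;") else 1
        lines = lines[skip:]
    # Join the residue lines and drop the PIR end-of-sequence asterisk.
    return "".join(lines).rstrip("*")


# Placeholder residues for illustration only.
pir_entry = ">P1;TvLDH\nsequence:TvLDH::::::::\nACDEFG\nHIKLM*\n"
print(raw_sequence(pir_entry))
```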

Figure 5.6.12 — Excerpt of ModBase Sequence Overview page for TvLDH. For this sequence, coverage is shown (the fraction of the sequence for which a model is available, and its quality) together with any annotations available. In this case, the entire sequence was modeled with a good quality (>=30% sequence identity) template.

Figure 5.6.13 — Excerpt of ModBase Model Details page for TvLDH. Metadata about the model are shown on the left side of the page; these data include the part of the sequence that was modeled, the template used, the date when the modeling was performed, and a set of assessment scores. The actual 3-D models are shown on the right side of the page. The “Perform action on this model” menu allows for the models themselves or modeling alignments to be downloaded.

At the time of this writing, ModBase contains two models for the exact TvLDH sequence used here. These models can be selected by clicking on the small protein images on the right side of the Model Details page. The models are similar, differing only in the template used; one model uses the same template (1bdmA) that was chosen for the Modeller run, and the other uses 5mdhA. A key feature of ModPipe is that the validity of sequence–structure relationships is not pre-judged at the fold-assignment stage; instead, sequence-structure matches are assessed after the construction of the models and their evaluation. This approach enables a thorough exploration of fold assignments, sequence–structure alignments, and conformations, with the aim of finding the model with the best evaluation score at the expense of increasing the computational time significantly; for some sequences, a few thousand models can be calculated. In this case, the model built using 5mdhA actually has slightly better assessment scores than that using 1bdmA, even though 1bdmA appeared earlier to be a better-quality template.

The Model Details page also displays basic information about each model, such as the template used, the portion of the sequence that was aligned, the date it was created, and a variety of assessment scores. These scores include a normalized (z-score) version of DOPE as above; the GA341 score (Melo and Sali, 2007); ModPipe’s own quality score (MPQS; Pieper et al., 2011), which is a linear combination of DOPE, GA341, and other scores; and a prediction from TSVMod (Eramian et al., 2008) of the Cα root-mean-squared deviation (RMSD) and native overlap (the fraction of Cα atoms within 3.5 Å of their native positions). Finally, the “Perform action on this model” drop-down menu allows the alignment used in modeling and the models to be downloaded.
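To make the two quantities predicted by TSVMod concrete, the sketch below computes them directly for a case where the native structure is known. This is not how TSVMod works (it predicts these values without the native structure); the function name is hypothetical, the two coordinate lists are assumed to be already superposed and in the same residue order, and the coordinates are invented.

```python
import math


def rmsd_and_native_overlap(model_ca, native_ca, cutoff=3.5):
    """Return (RMSD, native overlap) for matched, superposed C-alpha atoms."""
    if not model_ca or len(model_ca) != len(native_ca):
        raise ValueError("coordinate lists must be non-empty and equal length")
    dists = [math.dist(a, b) for a, b in zip(model_ca, native_ca)]
    # Root-mean-squared deviation over all matched atoms.
    rmsd = math.sqrt(sum(d * d for d in dists) / len(dists))
    # Fraction of atoms within the cutoff of their native positions.
    overlap = sum(d <= cutoff for d in dists) / len(dists)
    return rmsd, overlap


# Invented coordinates (in Angstroms) for four C-alpha atoms.
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 4.0)]
rmsd, overlap = rmsd_and_native_overlap(model, native)
print(rmsd, overlap)
```

In the invented example a single atom displaced by 4 Å lowers the native overlap to three quarters, while the RMSD averages that displacement over all four atoms.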

Adding new models to ModBase

If a sequence does not yet have a model in ModBase, the ModWeb (http://salilab.org/modweb/; Eswar et al., 2003) Web server can be used to model it. ModWeb is a front end to ModPipe and is simple to use; a user needs to provide only an amino acid sequence to model. The entire ModPipe pipeline then runs automatically, and any models generated are uploaded into ModBase where they can be viewed or downloaded in the same way as any other ModBase model. By default, such models are added to the public datasets so that other users of ModBase can see them too; alternatively, the model dataset can be made private, or the models can be e-mailed to the user rather than uploaded into ModBase.

If a sequence already has models in ModBase, but they were generated some time ago, the Model Details page allows the user to request an update. This action rebuilds the models, potentially using any newer templates that have been deposited in the PDB since the last calculation. For example, for TvLDH, new structures (4UUM and 4UUN) that are almost 100% identical in sequence were recently deposited (in August 2015), and would almost certainly yield better models.

Other Web tools for model evaluation, validation, refinement, and analysis

ModWeb is one of the Web services associated with the ModBase database. A number of other such Web services exist. These services generally take as input one or more PDB files, so they can be used with models extracted from ModBase, atomic structures from the PDB itself, or models manually generated with MODELLER or another Web service. A selection of these servers is outlined here.

The ModEval server (http://salilab.org/modeval/; Pieper et al., 2011) takes as input a protein structure, an alignment in the PIR format, and the target–template sequence identity. The modeling alignment and sequence identity are optional, but should be provided if available, as they result in more accurate assessment scores. In the TvLDH case above, the modeling alignment is available in Fig. 5.6.8 and the sequence identity can be read from the header of each model PDB file. The server then computes the TSVMod scores, the DOPE score and profile, and the GA341 score.
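As an illustration of reading the sequence identity from a model header, the sketch below scans the REMARK records of a PDB file for a sequence identity value. The record label is an assumption: MODELLER-built models are assumed here to carry a line such as `REMARK   6 MODELLER BEST TEMPLATE % SEQ ID:`, and the pattern is written loosely so that similar labels also match. The function name is hypothetical; inspect the header of your own file before relying on it.

```python
import re

# Assumed label formats, e.g. "... % SEQ ID: <value>" or "... SEQUENCE IDENTITY: <value>".
_SEQ_ID_PATTERN = re.compile(
    r"^REMARK.*\bSEQ(?:UENCE)?\s+ID(?:ENTITY)?\s*:\s*([0-9]+(?:\.[0-9]+)?)",
    re.IGNORECASE,
)


def read_sequence_identity(pdb_lines):
    """Return the first sequence identity (percent) found in REMARK lines, or None."""
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            break  # header records precede the coordinates
        match = _SEQ_ID_PATTERN.match(line)
        if match:
            return float(match.group(1))
    return None
```

Passing an open file handle for a model such as TvLDH.B99990002.pdb (or any iterable of lines) returns the value as a float, or None if no matching record is found before the coordinates begin.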

ModLoop (http://salilab.org/modloop/; Fiser and Sali, 2003b) takes as input a protein structure and one or more residue ranges. It then applies MODELLER’s loop modeling protocol to the selected residues to generate a set of candidate “loop” models, and returns the single model with the best-scoring loop conformation. The server can be particularly helpful for regions of the structure that have no templates. For example, in the TvLDH model, residues 94 to 102 do not align with the template (Fig. 5.6.8), and while MODELLER generates a stereochemically reasonable structure in this region, its conformation is unlikely to be close to native. The MODELLER model generated earlier (TvLDH.B99990002.pdb) can thus be uploaded to ModLoop and 94::102:: given as the loop segments, to generate loop models. The resulting models can then be evaluated with the same Python scripts that were used to evaluate the MODELLER models, or with the ModEval server.
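The loop segment string can be assembled programmatically when several regions need remodeling. The sketch below is a hypothetical helper, not part of ModLoop or MODELLER. The field order (start residue, chain, end residue, chain, each followed by a colon) is inferred from the 94::102:: example above, in which the chain identifier is empty; placing one segment per line is an assumption.

```python
def format_loop_segments(segments):
    """Format (start_residue, end_residue, chain_id) tuples as loop segment lines.

    Field order is inferred from the "94::102::" example; an empty chain_id
    leaves the chain fields blank.
    """
    lines = []
    for start, end, chain in segments:
        if start > end:
            raise ValueError(f"segment start {start} is after end {end}")
        lines.append(f"{start}:{chain}:{end}:{chain}:")
    return "\n".join(lines)


print(format_loop_segments([(94, 102, "")]))  # prints 94::102::
```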

The AllosMod Web server (http://salilab.org/allosmod/; Weinkam et al., 2012) predicts conformational differences that may occur in the native ensemble in solution, such as those representing allosteric conformational transitions. The input is one or more macromolecular coordinate files (including DNA, RNA, and sugar molecules) and the corresponding sequence(s). The output is a set of molecular dynamics trajectories based on a simplified energy landscape. Biased energy landscapes result in efficient molecular dynamics sampling at constant temperatures, thereby providing a more ergodic sampling of the conformational space than standard molecular dynamics simulations.

FoXS (http://salilab.org/foxs/; Schneidman-Duhovny et al., 2010) calculates a small-angle X-ray scattering (SAXS) profile for an uploaded protein structure and compares it with an experimental profile. SAXS is a common structural characterization technique that is performed with the protein sample in solution, and data collection usually takes only a few seconds on a well-equipped synchrotron beamline (Hura et al., 2009). Models generated with MODELLER can thus be evaluated with FoXS if a SAXS profile is available, or even used in modeling a flexible or multi-modular protein or assembling a macromolecular complex from its subunits.
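The text does not specify how the computed and experimental profiles are compared. One widely used goodness-of-fit measure is a χ value with a fitted scale factor, and the sketch below assumes that measure purely for illustration. The function names are hypothetical, the inputs are intensities sampled at the same scattering vector values, and no claim is made that this reproduces the FoXS implementation.

```python
import math


def fit_scale(i_exp, i_calc, sigma):
    """Least-squares scale factor c minimizing sum(((i_exp - c * i_calc) / sigma) ** 2)."""
    numerator = sum(e * k / s ** 2 for e, k, s in zip(i_exp, i_calc, sigma))
    denominator = sum(k * k / s ** 2 for k, s in zip(i_calc, sigma))
    return numerator / denominator


def chi(i_exp, i_calc, sigma):
    """Chi between experimental and scaled computed profiles (lower is better).

    i_exp, i_calc: intensities at the same q values; sigma: experimental errors.
    """
    if not (len(i_exp) == len(i_calc) == len(sigma)) or not i_exp:
        raise ValueError("profiles must be non-empty and equal in length")
    c = fit_scale(i_exp, i_calc, sigma)
    total = sum(((e - c * k) / s) ** 2 for e, k, s in zip(i_exp, i_calc, sigma))
    return math.sqrt(total / len(i_exp))
```

Under this measure, a value near 1 indicates that the model fits the data to within the experimental error, so competing models can be ranked against the same experimental profile.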

For a full list of other Web services, see http://salilab.org.

SUPPORT PROTOCOL: OBTAINING AND INSTALLING MODELLER

MODELLER is written in Fortran 90 and uses Python for its control language. All input scripts to MODELLER are, hence, Python scripts. While knowledge of Python is not necessary to run MODELLER, it can be useful in performing more advanced tasks. Pre-compiled binaries for MODELLER can be downloaded from http://salilab.org/modeller.