In this special issue of the Journal of Bacteriology, bacteriologists look even deeper into the smallest organisms than before, down to the molecular level. The focus is on experimentally determined molecular structures. However, structure prediction from amino acid sequence data is also becoming a usable source of protein structure information.
Interest in protein structure prediction is old, but success is new. About 9 years ago, John Moult and others organized the first effort known as Critical Assessment of Protein Structure Prediction (CASP). They arranged with experimentalists to provide amino acid sequence information for soon-to-be-determined protein structures and invited the protein prediction community to try their methods on these target unknowns. Predictors submitted their results to the organizers for evaluation against the true structures when they became available. The format of using a community-wide experiment and a meeting to present the evaluations to the predictors propelled the improvement of methods. Last December the fifth evaluation meeting of the biennial CASP effort (CASP5) was held at Asilomar Conference Grounds in Pacific Grove, Calif. (7).
The success of the best of the predictors in the last two CASP evaluations (7, 8) warrants mention of the methods and results here. Methods for prediction are different for easy and hard cases. The choice of method depends on the degree of similarity between the amino acid sequence of the unknown and the sequences of known structures.
THE HARDEST TEST
Even though they have the worst agreement with the experimental results, the most exciting predictions are the successes in the “new fold” category, where the sequence of the unknown has no significant similarity to the sequence of any known structure. Five of the eighty-odd domains available for prediction fell into this category in CASP5. In this most difficult category, the evaluator considered that at least one “excellent” prediction was made for each target. Of 165 predictors who attempted these difficult targets, nine had a prediction among the best ten (out of hundreds) for three or more of the targets. So some techniques consistently perform better than the rest.
In the new fold category, a respectable result means that the predicted chain has the same kinds of pieces in the same relative orientations, not that the pieces superimpose on each other. The degree of agreement might be similar to that between photographs of the same person at age 20 and at age 80. In fact, predicting a new fold is like drawing a face that the artist has never seen, and structure prediction methods resemble, in an important sense, the methods used by police artists. A witness is shown a gallery of faces and asked to pick out parts from them that individually resemble parts of the suspect's face. The police artist then combines the parts into a whole that resembles the witness's memory of the face. The most successful methods of structure prediction for new folds similarly rely on the assembly of a unique whole from fragments selected from a gallery of protein structures.
ONE OF THE GOOD METHODS
In a coarse description of the most successful method of new fold prediction, the first step is to obtain secondary structure (helix, beta strand, etc.) predictions for the unknown and to divide the sequence of the unknown into short fragments (nine amino acids). Then known structures (the equivalent of the gallery of faces) are searched for fragments that are similar in secondary structure and/or sequence profile to the unknown's fragments. A library of these fragments from known structures is constructed (the equivalent of the collection of witness-selected individual features). The starting guess for the unknown structure is a completely extended chain (equivalent to the blank paper), but randomly selected suitable fragments repeatedly replace sections of the extended chain. After each fragment placement (“move”), the chain is checked for collisions and other bad and good features, and the move is rejected or accepted. After a large number (thousands) of fragment placements, a folded chain has been created (the equivalent of a single face). In contrast to the limited number of faces an artist could produce, however, tens of thousands of candidate structures are produced. The candidate structures are clustered according to their structural similarity to each other, and the centers of the few largest clusters are selected as the best candidate structures. Final adjustments to the candidates are made to make the models more physically realistic. The method's increasing power lies in the improving selection of the contents of the fragment library and in the improving rules for accepting or rejecting a fragment placement. (For further detail and other methods, see reference 7.)
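The fragment-placement loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method used by any CASP5 group: the chain is reduced to a list of backbone (phi, psi) angles, the scoring function is a caller-supplied stand-in for the collision and quality checks, and the accept/reject step uses a Metropolis rule, which is our assumption because the text says only that a move is "rejected or accepted." The names `assemble`, `toy_library`, and `toy_score` are hypothetical.

```python
import math
import random

FRAGMENT_LENGTH = 9  # fragment size quoted in the text


def assemble(n_residues, library, score, n_moves=2000, temperature=1.0, seed=0):
    """Fragment-insertion Monte Carlo over backbone (phi, psi) angles.

    library maps a start position to a list of candidate fragments, each a
    list of (phi, psi) pairs. score is a caller-supplied function (lower is
    better) standing in for the collision and quality checks in the text.
    """
    rng = random.Random(seed)
    starts = sorted(library)
    chain = [(180.0, 180.0)] * n_residues  # fully extended starting chain
    current = score(chain)
    for _ in range(n_moves):
        start = rng.choice(starts)
        fragment = rng.choice(library[start])
        # Replace one stretch of the chain with the chosen fragment.
        trial = chain[:start] + list(fragment) + chain[start + len(fragment):]
        trial_score = score(trial)
        delta = trial_score - current
        # Metropolis rule (an assumption): always keep improvements,
        # occasionally keep a worse chain so the search can escape traps.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            chain, current = trial, trial_score
    return chain, current


# Toy data: a 12-residue chain whose "library" offers a helical and a
# strand-like fragment at each start position; the score counts residues
# that are not helical. Real libraries and scores are far richer.
HELIX, STRAND = (-60.0, -45.0), (-120.0, 130.0)
toy_library = {
    start: [[HELIX] * FRAGMENT_LENGTH, [STRAND] * FRAGMENT_LENGTH]
    for start in range(4)
}


def toy_score(chain):
    return sum(1 for angles in chain if angles != HELIX)


final_chain, final_score = assemble(
    12, toy_library, toy_score, n_moves=500, temperature=0.1
)
```

In the full procedure, such a run would be repeated tens of thousands of times with different random seeds, and the resulting chains would then be clustered; the clustering and final adjustment steps are omitted from this sketch.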
In CASP5, the method just described was used effectively not only for new folds but also for loop regions in unknowns where a structure for a related sequence was available. The loops were modeled by the new fold method, but otherwise the prediction was closely guided by the template (“comparative modeling”). Why use a template?
OLD FOLDS—EASIER WORK, BETTER ANSWERS
Predictors do get a more accurate answer (the same face at ages 20 and 35) in cases where a template exists—where a structure for a protein with a similar fold has already been determined by experiment. About four-fifths of the sequences provided to the CASP5 predictors turned out to have templates. Identifying such a template shades from easy to difficult. At the easy end (comparative modeling—about half the unknowns), the template can be identified by similarity between the sequence of the unknown and the sequence of the template. Sophisticated methods may be invoked to detect that similarity. In difficult cases (“fold recognition”), sequence similarity is too low to provide an unambiguous choice of template, and a different method has to be used.
THE “GLASS SLIPPER” APPROACH
When sequence-based comparisons do not unambiguously identify a unique template in the Protein Data Bank (PDB), predictors can nevertheless proceed by the method pioneered by Cinderella's prince: search the PDB for a structure that fits the sequence. Predictors using this approach do not search the whole PDB but a representative subset of structures. They may not find a fit, in which case they may be dealing with a new fold, which requires the methods described above.
But how does one determine whether a sequence fits a given structure? Different amino acids prefer different surroundings. Statistically the most obvious distinction is that hydrophobic amino acids prefer to have other hydrophobic amino acids as neighbors, and charged or polar amino acids prefer to be on the surface of a protein, in contact with water. Threading a sequence through a structural template positions amino acids relative to each other. Predictors evaluate whether the resulting neighbor clusters are consistent with what is known about amino acid neighborhoods in experimentally determined structures and decide whether the structure is a viable template or not. An important consideration is how to align the unknown's sequence with the structural template. A very effective method for generating and testing different alignments using a “genetic algorithm” approach was presented at CASP5.
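A heavily simplified sketch of the sequence-to-structure fit test follows. It assumes each template position has already been labeled as buried ("B") or exposed ("E") and that the sequence is already aligned to the template position by position; real threading methods score neighbor contacts and search over alignments, which this sketch does not do. The hydrophobic grouping and the names `threading_score` and `best_template` are illustrative assumptions.

```python
HYDROPHOBIC = set("AVLIMFWC")  # illustrative grouping, an assumption


def threading_score(sequence, burial):
    """Score one ungapped placement of a sequence on a template.

    burial holds one character per template position: 'B' for buried,
    'E' for exposed. Matches to the expected environment add 1 and
    mismatches subtract 1.
    """
    score = 0
    for amino_acid, environment in zip(sequence, burial):
        hydrophobic = amino_acid in HYDROPHOBIC
        fits = (hydrophobic and environment == "B") or (
            not hydrophobic and environment == "E"
        )
        score += 1 if fits else -1
    return score


def best_template(sequence, templates):
    """Return the best-scoring template name and all scores."""
    scores = {
        name: threading_score(sequence, burial)
        for name, burial in templates.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical templates described only by their burial patterns.
templates = {"alternating": "BEBEBEBE", "half_buried": "EEEEBBBB"}
choice, scores = best_template("LKVEIKLD", templates)
```

A predictor would also need a threshold below which no template is accepted, since a sequence with no good fit may represent a new fold.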
The bad news is that all the procedures described here are computationally complex. The good news is that web servers provide public access to some of the expertly implemented procedures.
AUTOMATION
Very welcome at the meeting were the good results of CAFASP3 (Critical Assessment of Fully Automated Protein Structure Prediction, third evaluation). Individual fully automated servers started from a submitted amino acid sequence and without human intervention carried out a series of tasks that ended with a set of alpha-carbon coordinates for the sequence. (Of course, a server's procedures are based on an automation of the successful procedures of its human creators.) Meta-servers, the next level up, did not themselves create predictions but operated on the results of individual prediction servers. They outperformed individual servers because different methods implemented in individual servers have different strengths and do well on some but not all targets. As one evaluator said, there are many different ways to be wrong but only one way to be right. By a consensus approach, a meta-server pulls out the best answers from a collection. The success of meta-servers depends on having a variety of independent methods available from individual servers, so that even though each individual server has weaknesses its contribution still improves the overall results. Meta-servers were up to 60% more likely than individual servers to choose the correct structural class (5, 6) of a sequence and up to 30% more likely to score correct answers higher than incorrect answers.
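The consensus idea can be illustrated with a minimal sketch: score each submitted model by how well it agrees with all the others, and pick the one with the most support. This is not the published scoring of any meta-server. It assumes the models are alpha-carbon coordinate lists of equal length that have already been superimposed, and the 3.5-Å cutoff and the names `similarity` and `consensus_pick` are illustrative assumptions.

```python
import math


def similarity(model_a, model_b, cutoff=3.5):
    """Count residues whose alpha-carbons lie within cutoff (in Å).

    Assumes both models are already superimposed in one frame.
    """
    return sum(
        1 for p, q in zip(model_a, model_b) if math.dist(p, q) <= cutoff
    )


def consensus_pick(models):
    """Return the model with the highest total agreement, plus all totals."""
    totals = {}
    for name, coords in models.items():
        totals[name] = sum(
            similarity(coords, other)
            for other_name, other in models.items()
            if other_name != name
        )
    return max(totals, key=totals.get), totals


def straight_chain(y_offset):
    """Hypothetical three-residue model displaced along the y axis."""
    return [(0.0, y_offset, 0.0), (3.8, y_offset, 0.0), (7.6, y_offset, 0.0)]


models = {
    "server1": straight_chain(0.0),
    "server2": straight_chain(1.0),
    "server3": straight_chain(4.0),
    "outlier": straight_chain(20.0),
}
winner, totals = consensus_pick(models)
```

Here the middle model wins because it agrees with both of its neighbors, while the outlier gathers no support, which mirrors the remark that there are many ways to be wrong but only one way to be right.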
Could meta-meta servers do even better? Yes. 3D-Jury is a meta-meta server that collected and analyzed results from all the other servers in CAFASP3. Although it was not entered in CAFASP3, 3D-Jury itself would have scored highest among all servers. The best servers did better than two-thirds of the human prediction groups. A few predictors compared the results of automatic predictions with human-aided predictions that were made either by an expert human or by trained but inexperienced humans. The interesting result was that sometimes human intervention helped and sometimes it harmed. In the experience of one extremely good team, the easier the unknown was, the less likely human intervention was to improve a prediction.
HOW GOOD AN ANSWER?
“Unknowns” have been described as easy or difficult. The easiest unknowns (the comparative modeling category) are those that have a significant amount (∼1/3 or more) of sequence identity with a protein for which a three-dimensional (3D) structure is already known. In the best of these cases, alpha-carbon positions were predicted with better than a 0.9-Å root mean square difference from the experimental answer. The median was around 2 Å. In other measuring terms, at 50% sequence identity, 95% of backbone rotation angles (phi and psi angles) were correct within 30°. Of course, predictors, evaluators, and users of predictions set the bar higher in cases like this and want to know the positions of side chain atoms as well as main chain atoms (“homology modeling” subcategory: 27 unknowns, nine servers evaluated). Side chain predictions are more difficult and less accurate than high-identity main chain predictions—the level of accuracy is more like 50% of side chains having their first bond rotation angle correct within 40°.
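For readers unfamiliar with the two accuracy measures quoted above, the following sketch shows how each is computed. It assumes the predicted and experimental coordinates have already been optimally superimposed (the superposition step is omitted), and the function names are our own.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square difference between two equal-length coordinate
    lists, assumed to be superimposed already."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def angle_difference(a, b):
    """Smallest difference between two angles in degrees, with wrap-around."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def fraction_within(predicted, observed, tolerance):
    """Fraction of angles predicted within tolerance degrees of observed."""
    hits = sum(
        1
        for p, o in zip(predicted, observed)
        if angle_difference(p, o) <= tolerance
    )
    return hits / len(predicted)
```

The wrap-around in `angle_difference` matters because rotation angles are periodic: a prediction of 170° against an observed value of -170° is off by 20°, not 340°.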
PROBLEM AREAS—OLD ENEMIES AND NEW FRIENDS
Predicting the details of side chain conformations is still a challenging problem, as mentioned above. Another side chain-related area of importance and difficulty is developing methods to predict long-range contacts (side chains that are far apart in sequence but near in space). Improving a comparative modeling main chain prediction beyond agreement with a homologous structure (“refinement”) is also difficult. But people new to the field might be surprised to learn that a fourth significant difficulty is recognizing which predicted model is the best. In CASP5, prediction groups were permitted to submit up to five models for each unknown, ranked according to believed correctness. However, the truly best model was sometimes not the top-ranked model of the five. How to reliably identify the best from a group of likely models is an area of active investigation. A fifth area of surprising difficulty is correct sequence-to-structure alignment. As one of the judges said, alignment is still a hard problem. Why? Probably for the same reason that structure prediction is a hard problem. The fold of a protein is the result of many weak and not very specific interactions. The general message of a stretch of amino acids may be “form a helix,” but the surrounding contacts may alter where the helix begins or ends. The same amino acid may play a somewhat different structural role in different members of a family.
New directions of effort include predicting sequences and predicting disorder.
PREDICTING AMINO ACID SEQUENCES
The availability of sequence variants enables aligning a set of related sequences to obtain a sequence profile. A profile-based search is better able to identify a structural template in the PDB than a single-sequence search is. Some CASP attendees implemented an interesting twist on this idea: design amino acid sequences for a structure! This is related to the “inverse folding problem,” first discussed in 1991 (2): what sequences are consistent with a given structure? (Think “Imelda Marcos”—design shoes to fit a particular foot.) For structures in the PDB that do not have a good number of natural homolog sequences available, predict a large set of amino acid sequences that are consistent with a given structure, form a profile from these, and use this profile to search genomes for sequences which are compatible with that structure. In each of 40+ genomes, this reverse procedure identified a previously unrecognized structure template for one or more sequences (4).
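To make the notion of a sequence profile concrete, the sketch below builds a position-specific frequency table from a set of aligned sequences and scores a candidate sequence against it. It is a minimal illustration rather than the procedure of reference 4: the pseudocount, the uniform background frequency, and the names `build_profile` and `profile_score` are assumptions, and gaps in the candidate are not handled.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(aligned, pseudocount=1.0):
    """Per-column amino acid frequencies from equal-length aligned sequences.

    Characters outside ALPHABET (for example gaps) are ignored; the
    pseudocount keeps unseen residues from getting zero probability.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned if seq[i] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in ALPHABET}
        )
    return profile


def profile_score(profile, sequence, background=1 / 20):
    """Sum of log-odds of each residue against a uniform background."""
    return sum(
        math.log(column[aa] / background)
        for column, aa in zip(profile, sequence)
    )


# Hypothetical family of three aligned sequences (natural or designed).
profile = build_profile(["MKVLA", "MKILA", "MRVLG"])
```

Whether the aligned sequences come from natural homologs or from sequences designed to fit a structure, the profile is built the same way; only the source of the input alignment differs.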
DISORDER AS STRUCTURE
Another new facet to the concept of structure prediction at CASP5 was that naturally or conditionally disordered regions are a predictable and functional structural category also encoded by amino acid sequences (3). A survey of the literature had suggested that functionally important regions of disorder exist in many proteins. Nineteen unknowns for CASP5 had at least 5% disorder. One (intentionally chosen) was completely disordered, and three others had disordered regions of between 15 and 40% of their total length. Six groups attempted disorder predictions, with success rates in identifying disordered regions up to 100% in favorable cases. Disordered regions, by organizing only in the presence of a ligand, could provide high binding specificity without tight (and therefore hard to disengage) binding, because the binding energy would be taken up in organizing the disordered region. Such behavior could be important in regulatory settings (3).
OBTAINING A PREDICTION
As a public service (1), members of the CASP community have started a “ten-most-wanted” list (TMW), important sequences whose structures are desired by members of the biological community, to be worked on by a number of predictors. The first round is in progress (January 2003). A second round will start when the first is finished (http://www.doe-mbi.ucla.edu/TMW [excellent one-page introduction to TMW]; http://tmw.llnl.gov [gives more detail, including where to go to find out how to suggest a sequence of interest]).
Automated sites.
The BioInfo site (overview at http://bioinfo.pl; 3D-Jury at http://BioInfo.PL/Meta) is the most successful meta-server. A sequence can be submitted there, and the submitter will be able to download automatically generated results to examine and interpret. Turnaround time is 7 days or less.
A successful individual server site in CASP5 (with a particularly clear user interface) was ORNL-Prospect (http://compbio.ornl.gov/PROSPECT/), offering secondary structure predictions, 3D predictions by threading a sequence through candidate structures, and a 3D prediction pipeline that uses a comprehensive set of tools and more submitter-provided information than sequence alone. The pipeline offers all-atom models and evaluates their quality. The output is well presented.
The 3D Jigsaw server (http://www.bmm.icnet.uk/servers/3djigsaw/) returns side chain-containing models, not just alpha-carbon models.
The 3dpssm server (http://www.sbg.bio.ic.ac.uk/~3dpssm/) was a high-scoring server in CAFASP2.
Other servers.
Links to 18 meta-servers and individual servers are listed on the BioInfo page mentioned above.
One of the first efforts to offer structure prediction as a service (originally at Heidelberg) has evolved into the site at http://cubic.bioc.columbia.edu/pp/.
The PDB site is http://www.rcsb.org/pdb/.
TRUTH IN ADVERTISING
Some quantitative information in this article is preliminary because it comes from the meeting reports, not from the refereed reports that are to follow from the meeting (7). Further, in describing methodology we showcased the hardest area of structure prediction because we consider its recent success a triumph. This demanding area still requires in-house expertise for good results, but we anticipate that in a year or two public servers will be available for such prediction methods. In the hardest area of prediction there are other promising and different techniques (7) besides the prominent one that we describe. Also, because automation is still under development, many of the excellent CAFASP3 servers are not publicly accessible yet—another reason to try the BioInfo site, which does have access to them. An area that we gave less coverage to is comparative modeling, the prediction of structures from existing information about other structures that are expected to be quite similar. Good public servers are already available for comparative modeling predictions (URLs above). A very good review (9) of the meeting by the comparative modeling assessor was published after we had submitted this commentary. Finally, and most importantly, predictors do not want users to blindly accept predictions. Getting an automated prediction is easy; understanding and evaluating it take experience and effort. But we are saying that it can now be worth that effort.
The views expressed in this Commentary do not necessarily reflect the views of the journal or of ASM.
REFERENCES
1. Abbott, A. 2001. Computer modelers seek out 'Ten Most Wanted' proteins. Nature 409:6816.
2. Bowie, J. U., R. Luthy, and D. Eisenberg. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253:164-170.
3. Dunker, A. K., C. J. Brown, and Z. Obradovic. 2002. Identification and functions of usefully disordered proteins. Adv. Protein Chem. 62:25-49.
4. Larson, S. M., A. Garg, J. R. Desjarlais, and V. S. Pande. 2003. Increased detection of structural templates using alignments of designed sequences. Proteins 51:390-396.
5. Murzin, A. G., S. E. Brenner, T. Hubbard, and C. Chothia. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247:536-540.
6. Orengo, C. A., F. M. Pearl, and J. M. Thornton. 2003. The CATH domain structure database. Methods Biochem. Anal. 44:249-271.
7. Proteins. Fifth Meeting on the Critical Assessment of Techniques for Protein Structure Prediction. Proteins, in press.
8. Proteins. Fourth Meeting on the Critical Assessment of Techniques for Protein Structure Prediction. 2001. Proteins 45, Supplement 5.
9. Tramontano, A. 2003. Of men and machines. Nat. Struct. Biol. 10:87-90.
