Faculty Reviews. 2022 Dec 14;11:38. doi: 10.12703/r-01-0000020

Solution of the protein structure prediction problem at last: crucial innovations and next frontiers

David A Agard, Gregory R Bowman, William DeGrado, Nikolay V Dokholyan, Huan-Xiang Zhou
PMCID: PMC9815721  PMID: 36644294

Abstract

The protein structure prediction problem is solved, at last, thanks in large part to the use of artificial intelligence. The structures predicted by AlphaFold and RoseTTAFold are becoming the requisite starting point for many protein scientists. New frontiers, such as the conformational sampling of intrinsically disordered proteins, are emerging.

Keywords: AlphaFold, RoseTTAFold, protein structure, three-dimensional structure

Background

Ever since the first protein structures were solved by X-ray crystallography in the early 1960s [1,2] and the idea that proteins fold to a stable structure was established in the early 1970s [3], numerous attempts have been made to predict the three-dimensional structures of proteins from their amino-acid sequences. At last, this problem was solved with the publication of two related methods, AlphaFold [4] and RoseTTAFold [5], in 2021. The impact of this breakthrough has been immediate and will continue to grow.

The way for AlphaFold and RoseTTAFold was paved by the growth in known protein structures and sequences. Protein structures in the Protein Data Bank (170,000 entries at present) provide information about distances between pairs of amino acids and possibly a three-dimensional template. In addition, billions of protein sequences enable the construction of a multiple-sequence alignment, which contains information about correlated mutations between two positions along the sequence that could signal proximity in the three-dimensional structure. Crucial innovations of AlphaFold took advantage of recent advances in deep learning, resulting in a powerful neural network for mining the sequence and structure databases for geometrical restraints. Guided by their deep expert knowledge in structure prediction and protein design, Baek et al. adapted such features into RoseTTAFold and combined them with their own unique components. One such component is the use of two separate protein sequences as input for predicting the structures of binary complexes.
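The idea that correlated mutations between two alignment columns can signal spatial proximity can be illustrated with a toy calculation. The sketch below scores column pairs of a small, made-up alignment by mutual information, a classical covariation statistic. It is not the approach used by AlphaFold or RoseTTAFold, which learn such signals with neural networks. The alignment `toy_msa` and the function names are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    counts_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def top_coupled_pairs(msa, k=3):
    """Return the k column pairs with the highest mutual information."""
    length = len(msa[0])
    scores = [
        ((i, j), mutual_information(msa, i, j))
        for i, j in combinations(range(length), 2)
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]


# Hypothetical toy alignment (one row per homolog, equal lengths assumed).
# Columns 1 and 3 co-vary: K pairs with E, and R pairs with D.
toy_msa = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
    "AKVE",
    "ARVD",
]

if __name__ == "__main__":
    for pair, score in top_coupled_pairs(toy_msa, k=3):
        print(pair, round(score, 3))
```

In this toy alignment, columns 1 and 3 vary together and so receive the highest score, while the invariant column 0 scores zero against every other column. Real alignments would also need corrections for phylogenetic bias and sampling noise, which this sketch omits.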

Main Contributions and Importance

Similar to the central dogma of molecular biology, which states that genetic information flows from DNA to RNA to protein [6], a basic tenet of biophysics has been that the amino-acid sequence of a protein determines its three-dimensional structure, which in turn determines its biological function. In essence, what AlphaFold and RoseTTAFold have accomplished is to provide the central link between sequence and function for the many proteins that fold into stable structures.

Structural biologists, biophysicists, and others are adapting to the new reality presented by AlphaFold and RoseTTAFold. While some were initially concerned about their work being rendered obsolete, most are eagerly embracing these new tools as they make it easier to solve difficult problems or even provide solutions to previously intractable problems. In many cases, the structures predicted by AlphaFold or RoseTTAFold provide direct insight into functional or disease mechanisms [5], while in other cases, the predicted structures are most useful as initial models for molecular replacement (crystallography) or docking into electron density maps obtained by cryo-EM.

Open Questions

As with all machine-learning methods, the devil is in the database. The developers of AlphaFold have acknowledged that prediction accuracy deteriorates for proteins with inadequate multiple-sequence alignments (fewer than 30 homologs). Since there are far fewer membrane proteins than water-soluble proteins in the Protein Data Bank, one wonders whether transmembrane domains can be predicted as accurately as their water-soluble counterparts. Additionally, AlphaFold accuracy becomes poor for domains whose structures are dictated not by interactions within the domain but by interactions with other domains. The latter observation has far-reaching consequences.

While the importance of protein structure cannot be overstated for understanding biological function, one should not lose sight of the fact that proteins are dynamic, not rigid, molecules. In the course of its function or of a disease process, a typical protein adopts multiple stable conformations. A particular conformation may be stabilized by the binding of a small molecule, as exemplified by the R state of hemoglobin upon binding oxygen. If excessive interactions with other domains pose a challenge for AlphaFold (and presumably also for RoseTTAFold), can these methods deal with the outsized effects of small ligands, and if so, how? This question is of crucial importance to drug discovery.

In addition to the particular overall conformation of a protein that is appropriate for a given ligand, another important issue is the geometry of the binding pocket, which must be determined with high accuracy for drug discovery purposes. Given these twin issues, the jury is still out on how useful AlphaFold and RoseTTAFold predictions are for drug discovery [7]. A related question is whether these predicted structures are accurate enough for running long molecular dynamics simulations or, conversely, whether molecular dynamics simulations are capable of refining the predicted structures. Finally, the predictions do not include organic cofactors, which are essential for the function of many enzymes.

RoseTTAFold has found success in predicting the structures of binary or even ternary complexes. However, higher-order oligomers remain a challenge. The difficulty arises not only because there are very few such structures in the Protein Data Bank for training purposes, but also because the number of possible ways to arrange the subunits grows rapidly with the number of subunits. A related problem is the structures of high-order self-assemblies such as amyloid fibrils or complexes that make extended lattices. Small local inaccuracies in intersubunit contacts become amplified in the extended lattice structures.

It is now well recognized that not all proteins fold into stable structures. Rather, up to 50% of proteins possess intrinsic disorder to different degrees. Some may transiently form α-helices that become stabilized upon binding a target surface (e.g., on a structured protein or a lipid membrane), while others may fold into a structure as part of a complex with a structured target protein. Either way, the structural stability is largely derived from interactions with the target surface. This situation, as noted above, is precisely one that AlphaFold finds challenging. Also, some proteins never fold into any structure, even upon binding with another protein, and fall outside the purview of structure prediction. Ironically, one use of AlphaFold may be for predicting disordered regions: long stretches of amino acids where coils are predicted or predictions are made with low confidence may be assumed to be intrinsically disordered.
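The suggestion that long low-confidence stretches may be assumed to be disordered can be sketched as a simple scan over per-residue confidence scores. The example assumes scores on a 0 to 100 scale (as with the pLDDT values that AlphaFold reports). The cutoff of 50 and the minimum stretch length of 30 residues are illustrative assumptions, not values given in this article, and the function name and `toy_scores` are hypothetical.

```python
def low_confidence_stretches(scores, threshold=50.0, min_length=30):
    """Return (start, end) index pairs, inclusive and 0-based, for runs of
    at least min_length consecutive residues scoring below threshold.

    The threshold and min_length defaults are illustrative assumptions.
    """
    stretches = []
    start = None  # index where the current low-confidence run began
    for index, score in enumerate(scores):
        if score < threshold:
            if start is None:
                start = index
        else:
            if start is not None and index - start >= min_length:
                stretches.append((start, index - 1))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start >= min_length:
        stretches.append((start, len(scores) - 1))
    return stretches


# Hypothetical per-residue confidence profile: a confident domain, a short
# low-confidence loop, another confident segment, then a long uncertain tail.
toy_scores = [90.0] * 10 + [35.0] * 5 + [88.0] * 10 + [30.0] * 40

if __name__ == "__main__":
    print(low_confidence_stretches(toy_scores))
```

With these settings only the 40-residue tail is flagged as a candidate disordered region. The 5-residue loop is too short to qualify, which reflects the emphasis on long stretches in the text above.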

With the remarkable success of AlphaFold and RoseTTAFold, all the above issues have come into sharp focus and indeed become the next frontiers. Even if these tools do not provide direct solutions, the methodological advances may still be instructive for solving these outstanding problems. In particular, machine learning has recently been used to sample the conformational space of intrinsically disordered proteins [8]. The neural network architecture and training procedure of AlphaFold may well offer useful lessons for such efforts. Lastly, AlphaFold and RoseTTAFold may also be adapted for RNA structure prediction.

Conclusion

AlphaFold and RoseTTAFold, at last, have solved the protein structure prediction problem. The predicted structures are good enough for understanding functional or disease mechanisms in many cases, but whether they are sufficiently accurate for drug discovery or for running long molecular dynamics simulations is still open. With the structure prediction problem solved, other areas have come to the forefront, including the conformational sampling of intrinsically disordered proteins. The methodological developments in AlphaFold and RoseTTAFold may even offer lessons for these new problems.

References

  1. Kendrew JC, Bodo G, Dintzis HM, Parrish RG, Wyckoff H, Phillips DC. 1958. A three-dimensional model of the myoglobin molecule obtained by x-ray analysis. Nature 181:662–6. doi: 10.1038/181662a0
  2. Perutz MF, Rossmann MG, Cullis AF, Muirhead H, Will G, North AC. 1960. Structure of hæmoglobin: a three-dimensional Fourier synthesis at 5.5-Å. resolution, obtained by X-ray analysis. Nature 185:416–22. doi: 10.1038/185416a0
  3. Anfinsen CB. 1973. Principles that govern the folding of protein chains. Science 181:223–30. doi: 10.1126/science.181.4096.223
  4. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596:583–589. doi: 10.1038/s41586-021-03819-2
  5. Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, Wang J, Cong Q, Kinch LN, Schaeffer RD, Millán C, Park H, Adams C, Glassman CR, DeGiovanni A, Pereira JH, Rodrigues AV, van Dijk AA, Ebrecht AC, Opperman DJ, Sagmeister T, Buhlheller C, Pavkov-Keller T, Rathinaswamy MK, Dalwadi U, Yip CK, Burke JE, Garcia KC, Grishin NV, Adams PD, Read RJ, Baker D. 2021. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373:871–876. doi: 10.1126/science.abj8754
  6. Crick FH. 1958. On protein synthesis. Symp Soc Exp Biol 12:138–63.
  7. Mullard A. 2021. What does AlphaFold mean for drug discovery? Nat Rev Drug Discov 20:725–727. doi: 10.1038/d41573-021-00161-0
  8. Gupta A, Dey S, Hicks A, Zhou HX. 2022. Artificial intelligence guided conformational mining of intrinsically disordered proteins. Commun Biol 5:610. doi: 10.1038/s42003-022-03562-y

Articles from Faculty Reviews are provided here courtesy of Faculty Opinions Ltd.
