HIGH-ACCURACY PROTEIN STRUCTURES BY COMBINING MACHINE-LEARNING WITH PHYSICS-BASED REFINEMENT

Lim Heo; Michael Feig

doi:10.1002/prot.25847

Author manuscript; available in PMC: 2021 May 1.

Published in final edited form as: Proteins. 2019 Nov 15;88(5):637–642. doi: 10.1002/prot.25847

PMCID: PMC7125014 NIHMSID: NIHMS1541842 PMID: 31693199

Abstract

Protein structure prediction has long been available as an alternative to experimental structure determination, especially via homology modeling based on templates from related sequences. Recently, models based on distance restraints from co-evolutionary analysis via machine learning have significantly expanded the ability to predict structures for sequences without templates. One such method, AlphaFold, also performs well on sequences where templates are available but without using such information directly. Here we show that combining machine-learning based models from AlphaFold with state-of-the-art physics-based refinement via molecular dynamics simulations further improves predictions to outperform any other prediction method tested during the latest round of CASP. The resulting models have highly accurate global and local structure, including high accuracy at functionally important interface residues, and they are highly suitable as initial models for crystal structure determination via molecular replacement.

Keywords: Protein structure prediction, structure refinement, CASP, AlphaFold, deep learning

INTRODUCTION

High-resolution protein structures are key for understanding mechanistic details of biological function and for guiding rational drug design. The main experimental sources of such structures are X-ray crystallography, nuclear magnetic resonance spectroscopy, and cryo-electron microscopy. By now, there is fairly comprehensive coverage of the manifold of protein structures encountered in biology, but a complete set of structures for any specific organism other than human immunodeficiency virus remains a very distant goal due to experimental limitations^1,2.

The computational modeling of protein structures has long been an alternative^3,4. Structures can be predicted with reasonable accuracy by using templates based on sequence homology⁵. More recently, high-accuracy models have also been derived based on intramolecular distance restraints according to inferred co-evolutionary relationships^6–8, especially for sequences where structural templates cannot be identified. Co-evolution analysis is highly suitable for the application of deep learning methods that are trained to predict residue interactions from multiple sequence alignments^9–18. This has been demonstrated most convincingly by DeepMind’s AlphaFold program during the recent 13^th Critical Assessment of Techniques for Protein Structure Prediction (CASP13)¹⁹. AlphaFold outperformed all other methods for ‘free-modeling’ (FM) targets where templates were not available but was also competitive in the ‘template-based modeling’ (TBM) categories, although, presumably, without using templates directly²⁰.

Predicted protein structures often capture secondary structures and fold topology correctly, but retain modeling errors compared to experimental structures²¹. Reaching experimental accuracy is the goal of structure refinement techniques. Physics-based methods based on molecular dynamics (MD) simulations have been most successful in improving both global and local structure^22–28. Current methods can provide consistent refinement for most structures while it is becoming increasingly possible to approach experimental accuracy based on root mean square deviations (RMSDs) of 1 Å or better as demonstrated during CASP13²⁹. Interestingly, refinement was especially successful for two targets that came initially from AlphaFold predictions. This suggested that MD-based refinement could allow significant improvements of machine-learning based models due to complementarity between the data-driven and physics-based approaches. Indeed, we find that the application of MD-based structure refinement to AlphaFold’s machine learning models submitted during CASP13 leads to models that overall surpass the accuracy of other approaches, remarkably even for targets where traditional template-based modeling is possible. The combination of machine learning and physics-based refinement therefore results in the highest-resolution protein structure models to date.

METHODS

Refinement of AlphaFold Models

The latest version of the PREFMD protocol²⁹ was applied to refine the “model 1” predictions that AlphaFold (group A7D) generated during CASP13 for protein tertiary structure prediction. The AlphaFold models were downloaded from the CASP web page (http://predictioncenter.org/download_area/CASP13/). The overall refinement method consisted of pre-sampling, sampling, and post-sampling stages. Details are described elsewhere²⁹ and only the key points are briefly outlined here. In the pre-sampling stage, local stereochemical errors such as atomic clashes and poor backbone dihedral angles were resolved by locPREFMD²⁴. At the sampling stage, molecular dynamics (MD) simulations were conducted to explore conformational space. One of two variants was chosen based on target size. In the full iterative protocol, three iterations of MD simulations were carried out, while a more conservative protocol consisted of only a single iteration. For every iteration in the iterative protocol, MD simulations were started from new initial structures chosen from the previous iteration. In the iterative protocol, flat-bottom harmonic restraints were used to allow significant structure changes from the initial model within a certain radius. This protocol was used for smaller targets with radii of gyration smaller than 17 Å. In the conservative protocol, a harmonic restraint function was used to achieve more consistent, but moderate refinement for larger targets where extensive sampling would have been a challenge. The latest version of the CHARMM force field, c36m³⁰, and an in-house modification of it²⁹ were used for both sampling protocols. After the sampling stage, generated snapshots were evaluated by using the Rosetta score³¹. Conformations with low Rosetta scores were then subjected to ensemble averaging and locPREFMD was applied again to the ensemble-averaged structures. Finally, residue-wise errors were estimated by using the root-mean-square fluctuation (RMSF) from short MD simulations. We note that in the original method used during CASP13 we attempted to identify putative ligands, but this was not done here in order to implement a fully automatic protocol.
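The size-based choice between the two sampling variants and the difference between the two restraint types can be illustrated with a minimal Python sketch. It is not the PREFMD implementation: only the 17 Å cutoff is taken from the text, while the unweighted radius of gyration, the function names, and the functional form of the flat-bottom restraint (zero inside a given width, harmonic beyond it) are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]

RG_CUTOFF = 17.0  # Å; size threshold quoted in the text


def radius_of_gyration(coords: Sequence[Coord]) -> float:
    """Radius of gyration of a set of atoms (assumption: no mass weighting)."""
    n = len(coords)
    cx = sum(c[0] for c in coords) / n
    cy = sum(c[1] for c in coords) / n
    cz = sum(c[2] for c in coords) / n
    mean_sq = sum(
        (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords
    ) / n
    return math.sqrt(mean_sq)


def choose_protocol(coords: Sequence[Coord]) -> str:
    """Smaller targets get the iterative variant, larger ones the conservative one."""
    return "iterative" if radius_of_gyration(coords) < RG_CUTOFF else "conservative"


def harmonic_restraint(deviation: float, k: float) -> float:
    """Plain harmonic penalty: any deviation from the reference is penalized."""
    return k * deviation ** 2


def flat_bottom_harmonic_restraint(deviation: float, k: float, width: float) -> float:
    """Hypothetical flat-bottom form: free within `width`, harmonic beyond it."""
    excess = max(0.0, deviation - width)
    return k * excess ** 2
```

In this form, a flat-bottom restraint leaves the structure free to move within `width` of the reference, which matches the description of allowing significant changes within a certain radius, whereas the plain harmonic term penalizes any departure from the initial model.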

CASP13 assessment scores

For template-based modeling (TBM) category targets, the weighted Z-score sum of Global Distance Test-High Accuracy (GDT-HA)³², Local Distance Difference Test (lDDT)³³, Contact Area Difference-all atom (CAD-aa)³⁴, SphereGrinder³⁵, and Accuracy Self Estimate (ASE)³⁶ scores was used.³⁷ This score considers global structure similarity (GDT-HA), local structure similarity (lDDT, CAD-aa, and SphereGrinder), as well as estimates of residue-wise model errors (ASE). Free modeling (FM) category targets were assessed separately in CASP13 using a different score that consists of the weighted Z-score sum of Global Distance Test-Tertiary Structure (GDT-TS)³² and Quality Control Score (QCS)³⁸, both of which focus only on global structure similarity.²⁰
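The general idea of a weighted Z-score sum can be sketched as follows. This is a generic illustration, not the assessors' code: the weights are left as an input because they are not stated here, and the use of the population standard deviation and the absence of any outlier handling are assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List


def z_scores(values: List[float]) -> List[float]:
    """Z-scores of one metric across all groups for a single target."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0.0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def weighted_z_sum(
    raw: Dict[str, List[float]], weights: Dict[str, float]
) -> List[float]:
    """Combine per-metric Z-scores into one score per group for a single target.

    raw[metric][i] is the raw score of group i; weights[metric] is a
    caller-supplied weight (hypothetical values, not the CASP13 ones).
    """
    n_groups = len(next(iter(raw.values())))
    totals = [0.0] * n_groups
    for metric, weight in weights.items():
        for i, z in enumerate(z_scores(raw[metric])):
            totals[i] += weight * z
    return totals
```

Summing the per-target values returned by `weighted_z_sum` over all targets in a category would then give a group's overall score for that category.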

Homologous Sequence Search

To investigate correlations between the number of homologous sequences for a target and the quality of the AlphaFold model and its refined counterpart, we searched for homologous sequences based on the Uniclust30 database (clustered sequences of UniProtKB at 30% pairwise sequence identity)³⁹. HHblits⁴⁰ was used for the sequence search with default options except for the number of iterations, which was set to 3. Sequences with very high pairwise sequence identity (above 90%) were filtered out using hhfilter, a program from HHsuite^41,42. Results were compared based on the Uniclust30 databases released in August 2018 and September 2016, but no significant differences were observed. The results reported here are based on the August 2018 release.
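The search and filtering steps described above might be set up along the following lines. The sketch only assembles command lines and does not run anything; the file names and database prefix are hypothetical, and the option names (`-i`, `-d`, `-n`, `-oa3m` for hhblits; `-i`, `-o`, `-id` for hhfilter) reflect common HH-suite usage and should be checked against the installed version.

```python
from typing import List


def hhblits_command(
    query_fasta: str, database_prefix: str, out_a3m: str, iterations: int = 3
) -> List[str]:
    """Command line for an HHblits search with a non-default iteration count."""
    return [
        "hhblits",
        "-i", query_fasta,      # query sequence (hypothetical path)
        "-d", database_prefix,  # Uniclust30 database prefix (hypothetical path)
        "-n", str(iterations),  # number of search iterations (3 in the text)
        "-oa3m", out_a3m,       # write the resulting alignment in A3M format
    ]


def hhfilter_command(in_a3m: str, out_a3m: str, max_identity: int = 90) -> List[str]:
    """Command line that removes sequences above a pairwise identity cutoff."""
    return [
        "hhfilter",
        "-i", in_a3m,
        "-o", out_a3m,
        "-id", str(max_identity),  # maximum pairwise sequence identity in percent
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a system where HH-suite and the database are installed.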

Protein Models as Molecular Replacement Search Models

Crystallographic data sets were downloaded from the Protein Data Bank (PDB)⁴³ and prepared by using Phenix.⁴⁴ Molecular replacement (MR) was conducted with Phaser⁴⁵ to evaluate log-likelihood gains (LLG) of protein models. To limit the computational cost for the large number of models considered, we did not perform full MR searches. Instead, we carried out rigid-body refinement after superimposition of a model on its corresponding component in the crystal structure. If there were multiple different components in the crystal, the model was evaluated after the other components were placed as background structures, and the LLG was calculated as the difference made by the model. Predicted residue-wise model errors e were converted to B-factors as B = (8π²/3)e². We report the best LLG between the MR results with and without the predicted errors for a given model.
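The error-to-B-factor conversion and the choice of the reported LLG can be written out as a short Python sketch. The function names are illustrative only, and the LLG values themselves would come from Phaser runs that are not reproduced here.

```python
import math


def error_to_bfactor(error_angstrom: float) -> float:
    """Convert a predicted residue-wise error e (Å) to a B-factor (Å^2).

    Implements B = (8 * pi^2 / 3) * e^2 as stated in the text.
    """
    return 8.0 * math.pi ** 2 / 3.0 * error_angstrom ** 2


def reported_llg(llg_with_errors: float, llg_without_errors: float) -> float:
    """The better of the two MR results, with and without predicted errors."""
    return max(llg_with_errors, llg_without_errors)
```

Because the relation is quadratic, doubling the predicted error quadruples the B-factor; a predicted error of 1 Å corresponds to a B-factor of roughly 26 Å².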

RESULTS AND DISCUSSION

We applied the physics-based model refinement protocol established during CASP13 to “model 1” predictions from AlphaFold (group A7D during CASP13) for all targets. The refinement protocol is based on MD simulations and does not apply any knowledge of the native structure or other data (see Methods).

Model accuracy

The model qualities for the generated models were evaluated according to the CASP13 assessment procedures³⁷, i.e., TBM-score and FM-score for targets in the TBM and FM categories, respectively (see Methods). Using the same scores as during CASP13 allowed us to directly compare with results from other groups without having to reevaluate all of the predictions. Figure 1 summarizes the scores for the AlphaFold models before and after refinement in comparison with models from other groups generated with different methods. As reported during CASP13, AlphaFold models were very competitive, ranking first in the FM category and 4^th in the TBM category^20,37. MD refinement further improved AlphaFold predictions. TBM targets were refined most, resulting in an increased TBM-score of 61.7 (from 44.6) that surpassed the performance of the best template-based modeling protocols. The improvements after refinement were greater for TBM-easy targets (36.0 vs. 24.2) than for TBM-hard targets (25.6 vs. 20.3), where ‘easy’ and ‘hard’ relate to the availability and quality of suitable templates. For FM targets, where templates are not available, refinement improved AlphaFold models moderately with an increased FM-score from 67.2 to 69.0. Improvements due to refinement were consistent across different targets: 78 out of 104 targets (75%) were improved by refinement, including 34 out of 40 (85%) TBM-easy targets.

Figure 1. Assessment of refined AlphaFold models in comparison with CASP13 results. (A) Scores (sum of assessor’s formula) for TBM (top) and FM (bottom) targets. Results for AlphaFold (A7D) before and after refinement are shown in red and blue, respectively. Other top-performing groups are shown in grey. TBM-easy and FM-hard targets are shown in darker colors, while TBM-hard and FM/TBM targets are shown in lighter colors. (B) Head-to-head comparison between AlphaFold models before and after refinement based on assessor’s formula. Each point corresponds to a target. Targets for TBM-easy, TBM-hard, FM/TBM, and FM-hard are depicted in red circles, orange squares, green triangles, and blue Xs, respectively. The number of better targets in each subcategory is shown at the top (refined model is better) and bottom (AlphaFold model is better) in the same colors.

Changes in individual quality measures after refinement (Figure S1 and Table S1) indicate that improvements were not limited to specific structure features, but that there was an overall enhancement in global and local structural quality as well as better residue-wise error estimation. During CASP13, AlphaFold generated the best overall models of any group for 25 out of 104 (24%) targets. AlphaFold+refinement resulted in the best models for 41 targets (39%) (see Figure S2). Interestingly, even though neither AlphaFold nor our refinement method used templates, it was possible to generate the best models for 14 TBM-easy and 8 TBM-hard targets, significantly improving upon the accuracy of the unrefined AlphaFold models and surpassing the average performance of any other group in all categories based on Z-scores (see Figures S3 and S4, Tables S2 and S3). Only in the TBM-easy category and only with respect to GDT-HA scores were the refined AlphaFold models exceeded by models from the Zhang group.

AlphaFold’s deep-learning method focuses primarily on predicted inter-residue distance distributions and backbone dihedral angles, and the accuracy of the resulting models is essentially limited to the residue level. Moreover, model errors likely reflect structural variations within a given protein family since contacts are obtained from deep multiple sequence alignments. This leaves room for physics-based approaches to provide significant refinement at the atomistic level. As may be expected, the accuracy of AlphaFold models depends on how many homologous sequences were available (see Figure S5A) while MD-based refinement success is independent of available homologous sequences (Figure S5B), further highlighting the complementarity of both approaches.

Figure 2 illustrates two typical examples of successful refinement: The initial AlphaFold models have correct folds, but they slightly mis-predict some regions such as loops and display suboptimal packing between side-chains. These issues were resolved by the refinement. Loop structures are presumably more difficult to predict by AlphaFold because there are fewer inter-residue contacts compared to secondary structure elements. On the other hand, the exact packing of side chains due to physical laws may be difficult to obtain simply based on sequence-derived contacts.

Accuracy of protein-protein interfaces

Protein-protein interactions are important in the function of many proteins^46–48. There were 27 hetero- and 57 homo-oligomeric targets allowing the separate evaluation of the accuracy of interface residues (defined as residues with a heavy atom within 10 Å of another protein). The interface-RMSD (iRMSD) based on the backbone atoms of interface residues was lower than 4 Å in either initial or refined AlphaFold models for 14 hetero- and 22 homo-oligomeric targets. Among these targets, refinement improved not just the global structure (in terms of GDT-HA), but also the interface (in terms of iRMSD) (Figure 3A). On average, iRMSD values decreased by 0.15 Å (p=4.1%, n=36) with improvements for 65% of the targets. Figure 3B illustrates a successful example of a significantly refined interface even though structures were refined as monomers. Two targets (T1015s1-D1 and T1019s1-D1) became significantly worse after refinement in terms of GDT-HA and iRMSD. These targets have a typical zinc finger moiety but a zinc ion was not included in the refinement. If these targets are excluded, the improvement in iRMSD increases to 0.22 Å (p=0.14%, n=34).
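The interface definition used above can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions rather than the analysis code used here: the data layout (a dictionary from residue number to heavy-atom coordinates) and the function names are hypothetical, and the RMSD helper assumes that model and reference coordinates are already paired and superimposed.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Atom = Tuple[float, float, float]
Chain = Dict[int, List[Atom]]  # residue number -> heavy-atom coordinates


def interface_residues(chain: Chain, partner: Chain, cutoff: float = 10.0) -> Set[int]:
    """Residues of `chain` with any heavy atom within `cutoff` Å of `partner`."""
    partner_atoms = [atom for atoms in partner.values() for atom in atoms]
    selected: Set[int] = set()
    for resid, atoms in chain.items():
        if any(math.dist(a, b) < cutoff for a in atoms for b in partner_atoms):
            selected.add(resid)
    return selected


def rmsd(model: Sequence[Atom], reference: Sequence[Atom]) -> float:
    """RMSD over paired atoms; assumes the structures are already superimposed."""
    squared = sum(math.dist(m, r) ** 2 for m, r in zip(model, reference))
    return math.sqrt(squared / len(model))
```

An iRMSD in the sense used here would restrict the RMSD to the backbone atoms of the selected interface residues after superimposing the model on the experimental structure.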

Figure 3. Refinement at protein-protein interfaces. (A) Structure quality comparisons between AlphaFold and refined models in terms of interface RMSD (top) and GDT-HA (bottom). Hetero- and homo-oligomers are shown as red Xs and blue circles, respectively. (B) An example of refinement at the interface of T0997-D1. AlphaFold (top) and refined (bottom) models are shown in red and blue, respectively, while the experimental structure is shown in yellow and pink. Structures are depicted in cartoon representations, and interface residues are also shown as sticks. Models were superimposed onto the experimental structures by aligning the interface residues. Significantly improved interface regions are indicated by green arrows. Model quality measures are given below each model.

Use of models for molecular replacement

A practical value of predicted structures is their use as starting models for solving crystal structures via molecular replacement (MR).⁴⁹ We followed the procedure that was used in the CASP13 assessment³⁷ where MR success in Phaser⁴⁵ was evaluated in terms of log-likelihood gains (LLG). There are 19 crystallographic data sets available so far for a total of 27 CASP13 target domains: 12 single-component data sets for 12 target domains and 7 multi-component data sets consisting of multiple domains or hetero-oligomers for 15 target domains (Table S4). LLG scores increased for most targets after refinement. LLG scores of 120 or higher often indicate the possibility for successful MR.⁴⁵ This threshold was reached by either the initial or the refined AlphaFold model for 14 targets; for 10 of those, the refined model had a higher LLG than the initial AlphaFold model, making success with MR more likely. Typically, MR relies on homology models with high sequence identity templates or MR-Rosetta⁵⁰ for sequences with lower sequence identity templates. The results presented here suggest that refined machine learning-based models are becoming a good alternative for solving crystallographic data via MR.

CONCLUSIONS

Co-evolutionary analysis and machine learning have significantly improved prediction accuracies for sequences where templates are not available and have also been demonstrated to perform well for sequences where templates could have been used instead. Here we show that physics-based refinement can further improve the accuracy of machine-learning models to exceed the accuracy of any other available method based on the targets that were assessed in CASP13. This suggests that a protocol that combines machine learning to obtain residue-residue distance information and torsional preferences with distance-based model generation (such as in AlphaFold), followed by MD-based refinement to improve structural details at the atomistic level, is emerging as a universal protocol for obtaining the most accurate predictions irrespective of whether template structures are available for a specific sequence or not. Additional analysis shows that functionally important residues at protein-protein interfaces can be predicted at high accuracy and that refined AlphaFold models are highly useful as search models for molecular replacement during crystallographic structure determination.

Supplementary Material

NIHMS1541842-supplement-1.pdf (1.8 MB, pdf)

ACKNOWLEDGEMENTS

We are grateful to the CASP organizers and assessors for organizing CASP13 and we acknowledge DeepMind for the valuable AlphaFold model predictions. We thank Prof. Read for providing scripts for molecular replacement with Phaser. This research was supported by National Institutes of Health Grants R01 GM084953 and R35 GM126948. Computational resources were used at the National Science Foundation’s Extreme Science and Engineering Discovery Environment (XSEDE) facilities under Grant TG-MCB090003.

Footnotes

COMPETING INTERESTS

The authors have no competing interest to declare.

REFERENCES

1. Kolodny R, Pereyaslavets L, Samson AO, Levitt M. On the universe of protein folds. Annu Rev Biophys. 2013;42:559–582.
2. Zhang Y, Hubner IA, Arakaki AK, Shakhnovich E, Skolnick J. On the origin and highly likely completeness of single-domain protein structures. Proc Natl Acad Sci U S A. 2006;103:2605–2610.
3. Zhang Y. Protein structure prediction: when is it useful? Curr Opin Struct Biol. 2009;19:145–155.
4. Baker D, Sali A. Protein structure prediction and structural genomics. Science. 2001;294:93–96.
5. Kryshtafovych A, Monastyrskyy B, Fidelis K, Moult J, Schwede T, Tramontano A. Evaluation of the template-based modeling in CASP12. Proteins. 2018;86 Suppl 1:321–334.
6. Kim DE, Dimaio F, Yu-Ruei Wang R, Song Y, Baker D. One contact for every twelve residues allows robust and accurate topology-level protein structure modeling. Proteins. 2014;82 Suppl 2:208–218.
7. Ovchinnikov S, Park H, Varghese N, et al. Protein structure determination using metagenome sequence data. Science. 2017;355:294–298.
8. Schaarschmidt J, Monastyrskyy B, Kryshtafovych A, Bonvin A. Assessment of contact predictions in CASP12: Co-evolution and deep learning coming of age. Proteins. 2018;86 Suppl 1:51–66.
9. Wang S, Sun S, Li Z, Zhang R, Xu J. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLoS Comput Biol. 2017;13:e1005324.
10. Adhikari B, Hou J, Cheng J. Protein contact prediction by integrating deep multiple sequence alignments, coevolution and machine learning. Proteins. 2018;86 Suppl 1:84–96.
11. Buchan DWA, Jones DT. Improved protein contact predictions with the MetaPSICOV2 server in CASP12. Proteins. 2018;86 Suppl 1:78–83.
12. Jones DT, Kandathil SM. High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics. 2018;34:3308–3315.
13. Wang S, Sun S, Xu J. Analysis of deep learning methods for blind protein contact prediction in CASP12. Proteins. 2018;86 Suppl 1:67–77.
14. Hou J, Wu T, Cao R, Cheng J. Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13. Proteins. 2019.
15. Kandathil SM, Greener JG, Jones DT. Prediction of interresidue contacts with DeepMetaPSICOV in CASP13. Proteins. 2019.
16. Xu J. Distance-based protein folding powered by deep learning. Proc Natl Acad Sci U S A. 2019;116:16856–16865.
17. Xu J, Wang S. Analysis of distance-based protein structure prediction by deep learning in CASP13. Proteins. 2019.
18. Greener JG, Kandathil SM, Jones DT. Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints. Nat Commun. 2019;10:3977.
19. AlQuraishi M. AlphaFold at CASP13. Bioinformatics. 2019.
20. Abriata LA, Tamo GE, Dal Peraro M. A further leap of improvement in tertiary structure prediction in CASP13 prompts new routes for future assessments. Proteins. 2019.
21. Feig M. Computational protein structure refinement: Almost there, yet still so far to go. Wiley Interdiscip Rev Comput Mol Sci. 2017;7.
22. Heo L, Feig M. Experimental accuracy in protein structure refinement via molecular dynamics simulations. Proc Natl Acad Sci U S A. 2018;115:13276–13281.
23. Heo L, Feig M. PREFMD: a web server for protein structure refinement via molecular dynamics simulations. Bioinformatics. 2018;34:1063–1065.
24. Feig M. Local Protein Structure Refinement via Molecular Dynamics Simulations with locPREFMD. J Chem Inf Model. 2016;56:1304–1312.
25. Feig M, Mirjalili V. Protein structure refinement via molecular-dynamics simulations: What works and what does not? Proteins. 2016;84 Suppl 1:282–292.
26. Mirjalili V, Noyes K, Feig M. Physics-based protein structure refinement through multiple molecular dynamics trajectories and structure averaging. Proteins. 2014;82 Suppl 2:196–207.
27. Mirjalili V, Feig M. Protein Structure Refinement through Structure Selection and Averaging from Molecular Dynamics Ensembles. J Chem Theory Comput. 2013;9:1294–1303.
28. Read RJ, Sammito MD, Kryshtafovych A, Croll TI. Evaluation of model refinement in CASP13. Proteins. 2019.
29. Heo L, Arbour CF, Feig M. Driven to near-experimental accuracy by refinement via molecular dynamics simulations. Proteins. 2019.
30. Huang J, Rauscher S, Nawrocki G, et al. CHARMM36m: an improved force field for folded and intrinsically disordered proteins. Nat Methods. 2017;14:71–73.
31. Park H, Bradley P, Greisen P Jr., et al. Simultaneous Optimization of Biomolecular Energy Functions on Features from Small Molecules and Macromolecules. J Chem Theory Comput. 2016;12:6201–6212.
32. Zemla A. LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res. 2003;31:3370–3374.
33. Mariani V, Biasini M, Barbato A, Schwede T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics. 2013;29:2722–2728.
34. Olechnovic K, Kulberkyte E, Venclovas C. CAD-score: a new contact area difference-based function for evaluation of protein structural models. Proteins. 2013;81:149–162.
35. Kryshtafovych A, Monastyrskyy B, Fidelis K. CASP prediction center infrastructure and evaluation measures in CASP10 and CASP ROLL. Proteins. 2014;82 Suppl 2:7–13.
36. Kryshtafovych A, Barbato A, Monastyrskyy B, Fidelis K, Schwede T, Tramontano A. Methods of model accuracy estimation can help selecting the best models from decoy sets: Assessment of model accuracy estimations in CASP11. Proteins. 2016;84 Suppl 1:349–369.
37. Croll TI, Sammito MD, Kryshtafovych A, Read RJ. Evaluation of template-based modeling in CASP13. Proteins: Structure, Function, and Bioinformatics. 2019.
38. Cong Q, Kinch LN, Pei J, et al. An automatic method for CASP9 free modeling structure prediction assessment. Bioinformatics. 2011;27:3371–3378.
39. Mirdita M, von den Driesch L, Galiez C, Martin MJ, Soding J, Steinegger M. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017;45:D170–D176.
40. Remmert M, Biegert A, Hauser A, Soding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2011;9:173–175.
41. Soding J. Protein homology detection by HMM-HMM comparison. Bioinformatics. 2005;21:951–960.
42. Steinegger M, Meier M, Mirdita M, Voehringer H, Haunsberger SJ, Soeding J. HH-suite3 for fast remote homology detection and deep protein annotation. bioRxiv. 2019:560029.
43. Westbrook J, Feng Z, Chen L, Yang H, Berman HM. The Protein Data Bank and structural genomics. Nucleic Acids Res. 2003;31:489–491.
44. Adams PD, Afonine PV, Bunkoczi G, et al. PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr D Biol Crystallogr. 2010;66:213–221.
45. McCoy AJ, Grosse-Kunstleve RW, Adams PD, Winn MD, Storoni LC, Read RJ. Phaser crystallographic software. J Appl Crystallogr. 2007;40:658–674.
46. Negri A, Rodriguez-Larrea D, Marco E, Jimenez-Ruiz A, Sanchez-Ruiz JM, Gago F. Protein-protein interactions at an enzyme-substrate interface: characterization of transient reaction intermediates throughout a full catalytic cycle of Escherichia coli thioredoxin reductase. Proteins. 2010;78:36–51.
47. Pawson T, Nash P. Protein-protein interactions define specificity in signal transduction. Genes Dev. 2000;14:1027–1047.
48. Russell RB, Alber F, Aloy P, et al. A structural perspective on protein-protein interactions. Curr Opin Struct Biol. 2004;14:313–324.
49. Scapin G. Molecular replacement then and now. Acta Crystallogr D Biol Crystallogr. 2013;69:2266–2275.
50. DiMaio F. Advances in Rosetta structure prediction for difficult molecular-replacement problems. Acta Crystallogr D Biol Crystallogr. 2013;69:2202–2208.


[R24] 24.Feig M Local Protein Structure Refinement via Molecular Dynamics Simulations with locPREFMD. J Chem Inf Model. 2016;56:1304–1312. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] 25.Feig M, Mirjalili V. Protein structure refinement via molecular-dynamics simulations: What works and what does not? Proteins. 2016;84 Suppl 1:282–292. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] 26.Mirjalili V, Noyes K, Feig M. Physics-based protein structure refinement through multiple molecular dynamics trajectories and structure averaging. Proteins. 2014;82 Suppl 2:196–207. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] 27.Mirjalili V, Feig M. Protein Structure Refinement through Structure Selection and Averaging from Molecular Dynamics Ensembles. J Chem Theory Comput. 2013;9:1294–1303. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] 28.Read RJ, Sammito MD, Kryshtafovych A, Croll TI. Evaluation of model refinement in CASP13. Proteins. 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] 29.Heo L, Arbour CF, Feig M. Driven to near-experimental accuracy by refinement via molecular dynamics simulations. Proteins. 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R30] 30.Huang J, Rauscher S, Nawrocki G, et al. CHARMM36m: an improved force field for folded and intrinsically disordered proteins. Nat Methods. 2017;14:71–73. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R31] 31.Park H, Bradley P, Greisen P Jr., et al. Simultaneous Optimization of Biomolecular Energy Functions on Features from Small Molecules and Macromolecules. J Chem Theory Comput. 2016;12:6201–6212. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R32] 32.Zemla A LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res. 2003;31:3370–3374. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R33] 33.Mariani V, Biasini M, Barbato A, Schwede T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics. 2013;29:2722–2728. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] 34.Olechnovic K, Kulberkyte E, Venclovas C. CAD-score: a new contact area difference-based function for evaluation of protein structural models. Proteins. 2013;81:149–162. [DOI] [PubMed] [Google Scholar]

[R35] 35.Kryshtafovych A, Monastyrskyy B, Fidelis K. CASP prediction center infrastructure and evaluation measures in CASP10 and CASP ROLL. Proteins. 2014;82 Suppl 2:7–13. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R36] 36.Kryshtafovych A, Barbato A, Monastyrskyy B, Fidelis K, Schwede T, Tramontano A. Methods of model accuracy estimation can help selecting the best models from decoy sets: Assessment of model accuracy estimations in CASP11. Proteins. 2016;84 Suppl 1:349–369. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R37] 37.Croll TI, Sammito MD, Kryshtafovych A, Read RJ. Evaluation of template-based modeling in CASP13. Proteins: Structure, Function, and Bioinformatics. 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R38] 38.Cong Q, Kinch LN, Pei J, et al. An automatic method for CASP9 free modeling structure prediction assessment. Bioinformatics. 2011;27:3371–3378. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R39] 39.Mirdita M, von den Driesch L, Galiez C, Martin MJ, Soding J, Steinegger M. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017;45:D170–D176. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R40] 40.Remmert M, Biegert A, Hauser A, Soding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2011;9:173–175. [DOI] [PubMed] [Google Scholar]

[R41] 41.Soding J Protein homology detection by HMM-HMM comparison. Bioinformatics. 2005;21:951–960. [DOI] [PubMed] [Google Scholar]

[R42] 42.Steinegger M, Meier M, Mirdita M, Voehringer H, Haunsberger SJ, Soeding J. HH-suite3 for fast remote homology detection and deep protein annotation. bioRxiv. 2019:560029. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R43] 43.Westbrook J, Feng Z, Chen L, Yang H, Berman HM. The Protein Data Bank and structural genomics. Nucleic Acids Res. 2003;31:489–491. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R44] 44.Adams PD, Afonine PV, Bunkoczi G, et al. PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr D Biol Crystallogr. 2010;66:213–221. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R45] 45.McCoy AJ, Grosse-Kunstleve RW, Adams PD, Winn MD, Storoni LC, Read RJ. Phaser crystallographic software. J Appl Crystallogr. 2007;40:658–674. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R46] 46.Negri A, Rodriguez-Larrea D, Marco E, Jimenez-Ruiz A, Sanchez-Ruiz JM, Gago F. Protein-protein interactions at an enzyme-substrate interface: characterization of transient reaction intermediates throughout a full catalytic cycle of Escherichia coli thioredoxin reductase. Proteins. 2010;78:36–51. [DOI] [PubMed] [Google Scholar]

[R47] 47.Pawson T, Nash P. Protein-protein interactions define specificity in signal transduction. Genes Dev. 2000;14:1027–1047. [PubMed] [Google Scholar]

[R48] 48.Russell RB, Alber F, Aloy P, et al. A structural perspective on protein-protein interactions. Curr Opin Struct Biol. 2004;14:313–324. [DOI] [PubMed] [Google Scholar]

[R49] 49.Scapin G Molecular replacement then and now. Acta Crystallogr D Biol Crystallogr. 2013;69:2266–2275. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R50] 50.DiMaio F Advances in Rosetta structure prediction for difficult molecular-replacement problems. Acta Crystallogr D Biol Crystallogr. 2013;69:2202–2208. [DOI] [PMC free article] [PubMed] [Google Scholar]
