Skip to main content
Elsevier Sponsored Documents logoLink to Elsevier Sponsored Documents
. 2022 Jan 30;434(2):167336. doi: 10.1016/j.jmb.2021.167336

The AlphaFold Database of Protein Structures: A Biologist’s Guide

Alessia David 1,, Suhail Islam 1, Evgeny Tankhilevich 1, Michael JE Sternberg 1
PMCID: PMC8783046  PMID: 34757056

Graphical abstract

graphic file with name ga1.jpg

Keywords: AlphaFold, human proteome, three-dimensional model, inter-domain accuracy

Highlights

  • AlphaFold, released the three-dimensional models (3D) of the human proteome.

  • Accurately and poorly modelled regions may be present within a single AlphaFold model.

  • The advantages, limitations and unsolved challenges of AlphaFold models are discussed.

Abstract

AlphaFold, the deep learning algorithm developed by DeepMind, recently released the three-dimensional models of the whole human proteome to the scientific community. Here we discuss the advantages, limitations and the still unsolved challenges of the AlphaFold models from the perspective of a biologist, who may not be an expert in structural biology.


In July 2021, the predicted three-dimensional models for the whole human proteome generated using AlphaFold, the deep learning algorithm developed by DeepMind, were made available to the public, as recently reported in Nature.1 In the absence of an experimental structure, computational methods have been used for decades to predict three-dimensional protein models. Before the advent of the AlphaFold algorithm, the main approaches were homology modelling and ab-initio. In homology modelling (or template-based approach), which was the most successful and widely used approach, a model is built based on the experimental structure of a homologue, which serves as a structural template. In the ab-initio method (or template-free approach), the model is built by using physics-based and/or knowledge-based energy functions, combined with evolutionary information, which are used to generate distance (or contact) maps.2

The deep neural network of the AlphaFold algorithm, which combines features derived from homologous templates and from multiple sequence alignment to generate the predicted structure, has shown an outstanding accuracy in predicting the three-dimensional structure of proteins with otherwise unknown fold. In CASP14, which is a blind trial that critically assesses techniques for protein structure prediction, AlphaFold (which entered the blind trial under the name AlphaFold2, to distinguish this from an earlier version), markedly outperformed other protein structure modelling methods. When using the root mean square deviation (rmsd), a commonly used method to measure the similarity between two structures (the lower the score the more similar the structures), AlphaFold models had a median backbone accuracy of 0.96 Å rmsd compared to 2.80 Å rmsd of the next best performing method. AlphaFold models also had a high level of accuracy in predicting the position of residue side chains when the protein backbone prediction was accurate.3 The leading edge performance of AlphaFold is confirmed by the on-going Continuous Automated Model EvaluatiOn (CAMEO).4 In light of this remarkable achievement, DeepMind made the entire set of models for the human proteome freely available to the scientific community, available at https://alphafold.ebi.ac.uk/ and hosted by the European Bioinformatics Institute.

From the perspective of the biologist and the non-expert in the structural biology field, what are the advantages, the limitations and the still unsolved challenges of the models generated by AlphaFold? Currently <10% of the proteins in the human proteome have at least some experimentally-obtained coordinates (protein-level coverage) and ∼17% of the residues in the human proteome can be mapped to an experimental structure (residue-level coverage) (4). In the AlphaFold database, the protein-level coverage for the human proteome is 98.5%. However, only 58% of residues are modelled with high confidence, defined as a predicted local distance difference test score [pLDDT] > 70.1 This 58% high confidence residue-level coverage is an overall improvement of <10% compared to the combined coverage of experimental structures and models generated using templates with sequence identity >30% and standard template modelling predictors (∼50% residue-level coverage).,5, 6 However, this increment of coverage will be transformative by providing models which would not be otherwise available to the community. Moreover, the improved accuracy of AlphaFold models compared to template-based ones will be important in several applications, including structure-based drug discovery,7 variant prediction and to assist experimental structure determination (e.g. molecular8)9, 10, 11, 12 (and extensively discussed in the JMB AlphaFold Special Issue, Volume 433, Issue 20, 1st October 2021). However, in cases where the predicted model of the holo form with its cognate ligand is important, a less accurate model which inherits the ligand coordinates from the template may provide more biological insights compared to a more accurate AlphaFold model of the apo form. At present, the models released by AlphaFold do not allow user selection of the appropriate ligand-bound template, which is facilitated by many of the traditional template-based methods.13, 14

This relatively small improvement in coverage is not surprising given that 37–50% of the human proteome is predicted to be structurally disordered.15 Disordered protein domains are often important for intracellular signalling and can transition from a disordered to an ordered state, e.g. upon binding to other proteins. Predicting how these amino acid sequences fold remains a challenge.16 In the AlphaFold models, these disordered regions are identified by a pLDDT < 50 and are often graphically presented as long filaments.

Another major challenge in the field of structural biology and protein modelling is the identification of the correct placement of domains in a multi-domain protein, also known as inter-domain accuracy. AlphaFold provides full chain models for >98% of human proteins, many of which are multidomain. In CASP14, the AlphaFold inter-domain accuracy was good (formally 70% of models having a template modelling (TM) score > 0.7). Domains are often connected by short and flexible stretches of amino acids, known as linkers, which allow domains to undergo conformational changes in response to biological stimuli. In the AlphaFold models, these linkers are not always predicted at high confidence (pLDDT > 70). The implication of this is that the spatial placement, and in some cases, proximity of two ordered domains should be interpreted with caution. Here, we wish to highlight the need to inspect the heat map or “predicted aligned error” provided by AlphaFold that displays the model’s inter-domain accuracy, which should always be considered alongside the per-residue pLDDT score when interpreting model accuracy. Additionally, the relative position of domains should be explored using biological data, For example, by using experimental structures with lower resolution, structures of homogous proteins or of complexes with partial coverage of the protein sequence.

We illustrate the challenge of positioning domains with two examples. Figure 1(A) shows the predicted structure for the growth hormone receptor, where the long disordered intracellular tail is placed next to the ordered extracellular domain. Figure 1(B) shows that the relative location of the domains in PIK3R1 is inconsistent with the experimental structure of the PIK3R1 / PIK3CD complex with major clashes between chains. In this example, the relative positions of the PIK3R1 domains may alter between the single chain and the complex. Hence, if the links between the domain are flexible, AlphaFold could be generating a correct model for the single chain or be generating one of an ensemble of domain conformations.

Figure 1.

Figure 1

The challenges of protein structure prediction. A) AlphaFold model of the growth hormone receptor (GHR, UniProt P10912). The long, unstructured intracellular tail of the growth hormone receptor (residues 289–638) is presented in magenta as a long filament and is wrongly placed next to the extracellular domain. The extracellular domain (residues 19–264) is presented in blue and the transmembrane domain (residues 265–288) in cyan. B) On the left, AlphaFold model of the PIK3R1 protein (in magenta, UniProt P27986). The main domains of PIK3R1 are highlighted with dotted lines. On the right, the AlphaFold model of PIK3R1 (in magenta) is superposed to the experimental structure of PIK3R1 (in cyan) in complex with PIK3CD (in green; PDB 5M6U). The PIK3R1 interdomain placement would results in a steric clash with PIK3CD. PI3K-P85-iSH2, Phosphatidylinositol 3-kinase regulatory subunit P85 inter-SH2 domain.

Another challenge for protein structure predictions is that several proteins are very long. Currently the AlphaFold database on the EBI website does not include models for proteins longer than 2700 residues. Thus, no models are available for 207 large (residue range 2701–34350), biologically important human proteins, such as those encoded by Titin and Dystrophin, the main genes responsible for congenital cardiomyopathy17 and muscular dystrophy.18 However, AlphaFold has generated several overlapping model fragments for these proteins (available for download at https://alphafold.ebi.ac.uk/download). Inevitably, interpreting models for very long proteins will be difficult.

The structural coverage of the human proteome is not uniform. A recent study showed that some classes of proteins, such as drug targets, have been studied better than others and their structural coverage at protein level is already very high.19 We explored the additional value of AlphaFold models compared to the coverage that can be obtained using standard homology modelling algorithms, such as Phyre2,13 on two sets of proteins that make a fundamental contribution to morbidity and mortality: the top 25 cancer proteins from the PanCan TumorPortal database20 and the top 5 proteins causing familial hypercholesterolemia, one of the main inherited causes of premature cardiovascular disease (https://panelapp.genomicsengland.co.uk/panels/772/).21 Of these 30 proteins, 8 are longer than 2700 residues and models are not provided for these on the EBI website. For the remaining 22 proteins, the additional coverage at residue level provided by AlphaFold models (pLDDT > 70) over standard homology methods, exemplified by Phyre2, was not substantial: 13,059 versus 13,214 (Table 1).

Table 1.

AlphaFold database coverage compared to the experimental coverage and the coverage obtained using standard homology-based methods exemplified by our in house program Phyre2.

The three-dimensional coordinate files were extracted from the ProteinDataBank (PDB). Phyre2 was used as a representative of homology-based methods. Only Phyre2 models with a confidence score >98% and sequence identity >30% were selected. For AlphaFold models, the residue coverage is presented according to the per-residue pLDDT score.

Experimental coverage
AlphaFold (pLDDT ≥ 70)
Phyre2 (Confidence > 98%; Seq ID > 30%)
Gene UniProt Id Protein length residues, n. residues, % residues, n. residues, % residues, n. residues, %
LDLRAP1 Q5SW96 308 16 5.2 159 51.6 144 46.8
SETD2 Q9BYW2 2564 424 16.5 513 20.0 345 13.5
CREBBP Q92793 2442 556 22.8 823 33.7 829 33.9
ARID1A O14497 2285 586 25.6 554 24.2 647 28.3
NOTCH1 P46531 2555 797 31.2 602 23.6 551 21.6
SMARCA4 P51532 1647 682 41.4 831 50.5 945 57.4
PBRM1 Q86U86 1689 879 52.0 1126 66.7 373 22.1
BRAF P15056 766 447 58.4 421 55.0 295 38.5
FBXW7 Q969H0 707 444 62.8 471 66.6 443 62.7
VHL P40337 213 160 75.1 155 72.8 150 70.4
RB1 P06400 928 698 75.2 592 63.8 763 82.2
LDLR P01130 860 705 82.0 643 74.8 650 75.6
PTEN P60484 403 334 82.9 315 78.2 353 87.6
EGFR P00533 1210 1010 83.5 860 71.1 914 75.5
TP53 P04637 393 340 86.5 227 57.8 357 90.8
KRAS P01116 189 171 90.5 175 92.6 189 100.0
PCSK9 Q8NBP7 692 642 92.8 563 81.4 622 89.9
MTOR P42345 2549 2370 93.0 2074 81.4 2533 99.4
APOE P02649 317 298 94.0 218 68.8 299 94.3
PIK3R1 P27986 724 683 94.3 621 85.8 596 82.3
PIK3CA P42336 1068 1061 99.3 1002 93.8 1060 99.3
CDKN2A P42771 156 156 100.0 114 73.1 156 100.0
TOTAL 13,059 13,214
NF1 P21359 2839 NA NA 595 21.0
APC P25054 2843 NA NA 571 20.1
ATM Q13315 3056 NA NA 3053 99.9
SPEN Q96T58 3664 NA NA 456 12.4
APOB P04114 4563 NA NA 0 0.0
FAT1 Q14517 4588 NA NA 518 11.3
MLL3 Q8NEZ4 4911 NA NA 156 3.2
MLL2 O14686 5537 NA NA 309 5.6

NA, AlphaFold model not available from the EBI website. However, the predicted overlapping segments for these long proteins can be downloaded from https://alphafold.ebi.ac.uk/download.

LDLR, APOB, APOE, PCSK9 and LDLRAP1 cause Familial Hypercholesterolemia. The remaining 25 genes are the top 25 genes from PanCan (4742 patients) in the TumorPortal.

Seq ID, sequence identity between query and template.

In conclusion, the AlphaFold algorithm has rightly been called a “game changer” in the field of structural biology and has demonstrated one of the many applications of deep learning algorithms in biomedicine.22, 23 However, AlphaFold has not completely solved the “protein folding problem” and many challenges remain, such as predicting the relative position of domains within a chain, how domains shift their relative conformation in response to stimuli, and how domains transition from disorder to order.

Author contribution

AD and SI performed model analysis. AD wrote the first draft of the manuscript. All authors contributed to the interpretation of findings and manuscript preparation. All authors approved the final version of the manuscript.

Disclosures

AD is supported by Wellcome Trust (grant 218242/Z/19/Z).

ET is supported by a BBSRC grant to Imperial College London (BB/M011178/1).

These Funders and DeepMind had no role in the conceptualization, design, data collection, analysis, decision to publish or preparation of the manuscript.

This research was funded in whole, or in part, by the Wellcome Trust grant number 218242/Z/19/Z. For the purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.

Conflict of interest

DeepMind are proving funding for Master studentships at Imperial College London including potentially for a course of which MJES is the Director. The Authors declare no other competing interests.

Edited by Sheena E. Radford

References

  • 1.Tunyasuvunakool K., Adler J., Wu Z., Green T., Zielinski M., Žídek A., Bridgland A., Cowie A., et al. Highly accurate protein structure prediction for the human proteome. Nature. 2021;596:590–596. doi: 10.1038/s41586-021-03828-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Kuhlman B., Bradley P. Advances in protein structure prediction and design. Nature Rev. Mol. Cell Biol. 2019;20:681–697. doi: 10.1038/s41580-019-0163-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Haas J., Roth S., Arnold K., Kiefer F., Schmidt T., Bordoli L., Schwede T. The Protein Model Portal–a comprehensive resource for protein structure and model information. Database (Oxford) 2013;2013:bat031. doi: 10.1093/database/bat031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Khanna T., Hanna G., Sternberg M.J.E., David A. Missense3D-DB web catalogue: an atom-based analysis and repository of 4M human protein-coding genetic variants. Hum Genet. 2021;140:805–812. doi: 10.1007/s00439-020-02246-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Swissmodel.expasy.org/repository, n.d. https://swissmodel.expasy.org/repository/species/9606.
  • 7.Mullard A. What does AlphaFold mean for drug discovery? Nature Rev. Drug Discov. 2021;20:725–727. doi: 10.1038/d41573-021-00161-0. [DOI] [PubMed] [Google Scholar]
  • 8.Millán C., Keegan R.M., Pereira J., Sammito M.D., Simpkin A.J., McCoy A.J., Lupas A.N., Hartmann M.D., et al. Assessing the utility of CASP14 models for molecular replacement. Proteins. 2021 doi: 10.1002/prot.26214. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Del Alamo D., Govaerts C., Mchaourab H.S. AlphaFold2 predicts the inward-facing conformation of the multidrug transporter LmrP. Proteins. 2021;89:1226–1228. doi: 10.1002/prot.26138. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Cramer P. AlphaFold2 and the future of structural biology. Nature Struct. Mol. Biol. 2021;28:704–705. doi: 10.1038/s41594-021-00650-1. [DOI] [PubMed] [Google Scholar]
  • 11.Zweckstetter M. NMR hawk-eyed view of AlphaFold2 structures. Protein Sci. 2021 doi: 10.1002/pro.4175. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Bouatta N., Sorger P., AlQuraishi M. Protein structure prediction by AlphaFold2: are attention and symmetries all you need? Acta Crystallogr. D Struct. Biol. 2021;77:982–991. doi: 10.1107/S2059798321007531. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Kelley L.A., Mezulis S., Yates C.M., Wass M.N., Sternberg M.J.E. The Phyre2 web portal for protein modeling, prediction and analysis. Nature Protoc. 2015;10:845–858. doi: 10.1038/nprot.2015.053. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Bienert S., Waterhouse A., de Beer T.A.P., Tauriello G., Studer G., Bordoli L., Schwede T. The SWISS-MODEL Repository-new features and functionality. Nucleic Acids Res. 2017;45:D313–D319. doi: 10.1093/nar/gkw1132. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Oates M.E., Romero P., Ishida T., Ghalwash M., Mizianty M.J., Xue B., Dosztányi Z., Uversky V.N., et al. D2P2: database of disordered protein predictions. Nucleic Acids Res. 2013;41:D508–D516. doi: 10.1093/nar/gks1226. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Ruff K.M., Pappu R.V. AlphaFold and implications for intrinsically disordered proteins. J. Mol. Biol. 2021;433 doi: 10.1016/j.jmb.2021.167208. [DOI] [PubMed] [Google Scholar]
  • 17.Jungbluth H., Treves S., Zorzato F., Sarkozy A., Ochala J., Sewry C., Phadke R., Gautel M., et al. Congenital myopathies: disorders of excitation-contraction coupling and muscle contraction. Nature Rev. Neurol. 2018;14:151–167. doi: 10.1038/nrneurol.2017.191. [DOI] [PubMed] [Google Scholar]
  • 18.Nowak K.J., Davies K.E. Duchenne muscular dystrophy and dystrophin: pathogenesis and opportunities for treatment. EMBO Rep. 2004;5:872–876. doi: 10.1038/sj.embor.7400221. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Somody J.C., MacKinnon S.S., Windemuth A. Structural coverage of the proteome for pharmaceutical applications. Drug Discov. Today. 2017;22:1792–1799. doi: 10.1016/j.drudis.2017.08.004. [DOI] [PubMed] [Google Scholar]
  • 20.Lawrence M.S., Stojanov P., Mermel C.H., Robinson J.T., Garraway L.A., Golub T.R., Meyerson M., Gabriel S.B., et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature. 2014;505:495–501. doi: 10.1038/nature12912. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Defesche J.C., Gidding S.S., Harada-Shiba M., Hegele R.A., Santos R.D., Wierzbicki A.S. Familial hypercholesterolaemia. Nature Rev. Dis. Primers. 2017;3:17093. doi: 10.1038/nrdp.2017.93. [DOI] [PubMed] [Google Scholar]
  • 22.Fersht A.R. AlphaFold - A personal perspective on the impact of machine learning. J. Mol. Biol. 2021 doi: 10.1016/j.jmb.2021.167088. [DOI] [PubMed] [Google Scholar]
  • 23.Thornton J.M., Laskowski R.A., Borkakoti N. AlphaFold heralds a data-driven revolution in biology and medicine. Nature Med. 2021;27:1666–1669. doi: 10.1038/s41591-021-01533-0. [DOI] [PubMed] [Google Scholar]

RESOURCES