Author manuscript; available in PMC: 2022 Jun 3.
Published in final edited form as: Structure. 2021 May 12;29(6):515–520. doi: 10.1016/j.str.2021.04.010

Open Access Data: A Cornerstone for Artificial Intelligence Approaches to Protein Structure Prediction

Stephen K Burley 1,2,3,4,5,*, Helen M Berman 1,2,6,*
PMCID: PMC8178243  NIHMSID: NIHMS1703483  PMID: 33984281

Summary

The Protein Data Bank (PDB) was established in 1971 to archive three-dimensional (3D) structures of biological macromolecules as a public good. Fifty years later, the PDB is providing millions of data consumers around the world with open access to more than 175,000 experimentally determined structures of proteins and nucleic acids (DNA, RNA) and their complexes with one another and small-molecule ligands. PDB data users are working, teaching, and learning in fundamental biology, biomedicine, bioengineering, biotechnology, and energy sciences. They also represent the fields of agriculture, chemistry, physics and materials science, mathematics, statistics, computer science, and zoology, and even the social sciences. The enormous wealth of 3D structure data stored in the PDB has underpinned significant advances in our understanding of protein architecture, culminating in recent breakthroughs in protein structure prediction accelerated by artificial intelligence approaches and machine learning methods.

Keywords: Structural biology, Protein structure, Open-access biodata resource, FAIR principles, Protein Data Bank, PDB50, Structure-guided drug discovery, Artificial Intelligence, Machine Learning, De Novo Protein Structure Prediction, CASP, CAMEO, Drug Design Data Resource, CAPRI

Introduction

This perspective is inspired by the remarkable achievements of Google DeepMind in the 14th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP14) (CASP Organizers, 2020). We describe how the Protein Data Bank (PDB) evolved to become the cornerstone of a global biostructure data ecosystem that impacts science broadly and enabled successful application of machine learning (ML) tools to de novo protein structure prediction. Public availability of scientific data drives research and development. We posit that artificial intelligence (AI) will continue to benefit from open access to structural, biological, chemical, and biochemical data as new algorithms are applied to predicting small-molecule ligand binding and protein-protein interactions.

Multiple Communities Converged to Create Today’s PDB: E Pluribus Unum

In the late 1960s, long before open access was recognized as the preferred mechanism for disseminating scientific information, a small group of like-minded individuals realized that three-dimensional (3D) structure data for proteins should be centrally archived and made freely available to enable further research. Thus was born the concept of the Protein Data Bank as a public good.

Decades of innovation by physicists, chemists, and engineers spawned the scientific discipline we know today as macromolecular crystallography (MX). Nobel Laureate W.L. Bragg noted in his landmark 1968 Scientific American discourse that “crystallography … revealed the way atoms are arranged in many diverse forms of matter” (Bragg, 1968), leading to fundamental revision of ideas in many sciences. In the 1930s, only twenty years after the discovery of Bragg’s law (Bragg and Bragg, 1913), the first X-ray diffraction patterns of crystalline proteins were recorded on photographic film by Dorothy Crowfoot Hodgkin and J.D. Bernal (Bernal and Crowfoot, 1934). The first structure determination of a protein, sperm whale myoglobin, was announced more than two decades later by Sir John Kendrew and colleagues (Kendrew and Parrish, 1957). Following elucidation of several more protein structures, the PDB archive was launched under the leadership of Walter Hamilton at Brookhaven National Laboratory in 1971 (Berman, 2008; Bernstein et al., 1977; Meyer, 1997; Protein Data Bank, 1971).

PDB’s growth and success required community-wide efforts at many levels. Generations of structural biologists were trained in academe and industry. These researchers in turn developed and refined the many methods that have made 3D structure determination possible, initially using MX and now via other biophysical methods. New technologies tackled increasingly complex structures. Sustainable mechanisms for archiving and distributing 3D structure data and the associated metadata had to be created. The new structures proved to be of tremendous interest to researchers and educators. Policies governing deposition of and access to the data had to be formulated. The PDB would not have stood the test of time, let alone flourished, but for this diverse and dedicated collection of stakeholders.

Fifty years on, the PDB is thriving. The archive is jointly managed by the Worldwide Protein Data Bank (wwPDB), an international partnership (Berman et al., 2003) made up of regional PDB data centers committed to maintaining the single global archive (Berman et al., 2000; Boutselakis et al., 2003; Nakamura et al., 2002; Ulrich et al., 2008). As of early 2021, the PDB contained more than 175,000 well-validated, expertly-biocurated 3D structures of proteins and nucleic acids and more than 30,000 small-molecule ligands. Today, the PDB is used intensively by many millions of data consumers from every sovereign country and territory recognized by the United Nations, impacting research and education across the sciences from agriculture to zoology.

The PDB began as a community resource founded on bedrock values of open access and facile reuse. They were recently codified within the FAIR Principles of Findability, Accessibility, Interoperability, and Reusability (Wilkinson et al., 2016) and the FACT Principles of Fairness, Accuracy, Confidentiality, and Transparency (van der Aalst et al., 2017). PDB’s centralized, strictly-enforced data ontology exemplifies the FAIR and FACT principles, and was the basis for its certification by CoreTrustSeal (https://www.coretrustseal.org). wwPDB uses the macromolecular Crystallographic Information Framework (mmCIF) (Fitzgerald et al., 2005; Westbrook et al., 2005) as a metadata standard. The fully-extensible, human- and machine-readable PDBx/mmCIF data dictionary was part of an International Union of Crystallography initiative. wwPDB partners and the PDBx/mmCIF Working Group now coordinate PDBx/mmCIF development (wwPDB, 2017).
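
To illustrate what "human- and machine-readable" means in practice, the following minimal sketch reads a few simple key-value items written in PDBx/mmCIF style. The sample text and the helper name `parse_simple_items` are hypothetical and exist only for illustration. The sketch handles single-line items only; real PDBx/mmCIF files also contain `loop_` tables and multi-line values, which call for a full mmCIF parser.

```python
import shlex

# Hypothetical sample written in PDBx/mmCIF key-value style (not a real PDB entry).
SAMPLE = """\
data_DEMO
_entry.id DEMO
_exptl.method 'X-RAY DIFFRACTION'
_cell.length_a 50.0
"""


def parse_simple_items(text):
    """Return {category: {item: value}} for single-line key-value items only."""
    records = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue  # skip the data_ block header and blank lines
        tokens = shlex.split(line)  # keeps quoted values such as 'X-RAY DIFFRACTION' together
        if len(tokens) != 2:
            continue  # not a simple "name value" pair; ignored in this sketch
        name, value = tokens
        category, _, item = name[1:].partition(".")
        records.setdefault(category, {})[item] = value
    return records


records = parse_simple_items(SAMPLE)
print(records["exptl"]["method"])
```

Because every item name carries its category (`_exptl.method` belongs to the `exptl` category), the same text that a person can read line by line maps directly onto a nested dictionary for software.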

Structures archived in the PDB resulted from innovation and hard labor by tens of thousands of scientists from diverse disciplines working on every inhabited continent. Determining the earliest protein structures typically required years of effort by multiple individuals (Johnson and Petsko, 1999). Proteins were laboriously extracted and purified from high abundance natural sources (e.g., whale meat, red blood cells, hen’s eggs, kiwi fruit). Crystal growth conditions were determined by trial and error, and diffraction data were collected on photographic film or with diffractometers one reflection at a time using weak sealed X-ray tubes. Phasing of diffraction patterns depended on Max Perutz’s multiple isomorphous replacement method (Green et al., 1954), requiring tens of crystals for each new structure. Wire models of amino acid residues were manually fit into hand-drawn electron density maps. As more and more structures were archived in the PDB, they could be used to determine related structures via molecular replacement (Rossmann and Blow, 1963). (More than 80% of X-ray structures deposited to the PDB in 2020 came from molecular replacement.) In time, more powerful methods for data collection emerged. Partnerships with high energy physicists allowed utilization of intense X-rays produced at national and regional synchrotron sources (Dauter et al., 2010). By combining tunable synchrotron X-rays with protein crystals in which sulfur was substituted by selenium, Wayne Hendrickson showed that diffraction pattern phases could be measured in a novel way that further accelerated progress (Hendrickson, 1991). Structural genomics initiatives made high throughput crystallography a reality in a successful quest for structures covering more of protein fold space (Burley et al., 2008). In parallel, stewards of the PDB continuously improved and refined methods for archiving and processing data by leveraging advances in computational infrastructure and information science.

New methodologies allow structural biologists to explore increasingly complex structures, accelerating growth of the PDB. Micro-electron diffraction can be used to determine 3D structures of purified macromolecules that yield small well-ordered crystals (Nannenga and Gonen, 2019). Serial femtosecond X-ray crystallography using free electron lasers can capture dynamical chemical reaction processes in 4D (Chapman, 2019). Cryo-electron tomography combined with sub-tomogram averaging can be used to study macromolecular machines in situ in flash-frozen cells (Turk and Baumeister, 2020). Solid state NMR spectroscopy is proving remarkably effective for integral membrane proteins (van der Wel, 2018), and solution NMR spectroscopy can now probe structures of macromolecules at work inside living cells (Luchinat and Banci, 2016). Integrative approaches are being used to determine 3D structures of molecular machines that cannot be elucidated using one technique alone (Rout and Sali, 2019).

Founding of the PDB was motivated by a passionately held belief that accumulation of scientific knowledge would be accelerated by open sharing of research data (Berman, 2008). Mandatory PDB deposition of structure data as a condition of publication was advocated by key opinion leaders, and recommended guidelines were published in 1989 (International Union of Crystallography, 1989). Over time, funding agencies and scientific journals adopted and enforced these guidelines. Later, when deposition of primary experimental data (e.g., crystallographic structure factors) became mandatory, more rigorous validation of structures became possible. Task forces made up of structural biology thought leaders came together to review best practices for structure validation. Their recommendations were published in a series of influential white papers (Adams et al., 2016; Berman et al., 2019; Henderson et al., 2012; Montelione et al., 2013; Read et al., 2011; Trewhella et al., 2013), and then implemented and further refined by the wwPDB.

Publicly-available and machine-readable wwPDB validation reports (Feng et al., 2021; Gore et al., 2017) detail how well each structure in the PDB agrees with known chemical geometry and fits with its experimental data. Of particular importance is the fact that outliers can now be rigorously identified (Shao et al., 2017). Doing so makes it possible to estimate uncertainties for data, which is critical for success of ML. Every incoming PDB structure processed by the wwPDB OneDep system (Young et al., 2017) is validated and biocurated (or annotated) by expert structural biologists at one of the wwPDB regional data centers (Young et al., 2018). This process ensures standardization and accuracy of sample taxonomy, polymer sequence information, small-molecule ligand chemical description, and macromolecular assembly description. It is no exaggeration to say that the PDB is one of the most highly curated biological data archives. The PDB and individual structures therein are trusted by data depositors and data consumers alike. For this reason, and others discussed below, PDB usage is among the highest of any data repository in biology (Read et al., 2015).

Science Enabled by Open Access to 3D Structure Data

Form (meaning shape/3D structure) dictating function in biology was first revealed with the discovery of the DNA double helix structure (Watson and Crick, 1953). The launch of the PDB 18 years later was a critical step to ensure that the impact of 3D biostructure data would be felt across the medical, natural, physical, engineering, and even social sciences. Fortunately for structural biologists, the repertoire of globular protein shapes represented in nature is decidedly limited (Schaeffer and Daggett, 2011). Stable domain structures capable of supporting diverse biochemical or biological functions emerged during the course of evolution. They have been repeatedly reused with minor structural modifications, and are often found in archaea, prokaryotes, and eukaryotes. This is why molecular replacement phasing works at all and why it is working better and better as the PDB doubles in size every six to eight years.
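
As a rough worked example of the growth figure quoted above, a doubling time can be converted into an equivalent annual growth rate. This sketch assumes steady exponential growth, which is a simplification introduced here and not a claim made in the text; the function name `annual_growth_rate` is illustrative.

```python
def annual_growth_rate(doubling_years):
    """Annual fractional growth implied by doubling once every `doubling_years` years.

    Assumes steady exponential growth: size(t) = size(0) * 2 ** (t / doubling_years).
    """
    return 2 ** (1 / doubling_years) - 1


# The text quotes a doubling time of six to eight years.
for years in (6, 8):
    rate = annual_growth_rate(years)
    print(f"doubling every {years} years ~ {rate:.1%} growth per year")
```

Under that assumption, doubling every six to eight years corresponds to the archive growing by roughly 9% to 12% per year.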

Analyzing a well curated 3D structure often provides immediate insights into biological and biochemical function without the need for further experimental study (e.g., identifying a catalytic triad characteristic of serine proteases). The power of direct visualization is even more evident when comparing structurally similar proteins of differing amino acid sequence, thereby revealing evolutionary relationships that are not apparent from sequence information alone. Local structural similarity of enzyme active sites revealed evidence for convergent evolution in which structurally distinct proteins use the same catalytic residues to accelerate similar, if not identical, chemical reactions. The wealth of structures now freely available in the PDB can also support hypothesis generation when planning additional functional studies. Structures of human proteins are of particular significance to many PDB data consumers. In the early days of the archive, there were very few examples (i.e., only 34 as of the end of 1990). Within a decade, however, the number exceeded 2,600. Now in its 50th year of operations, PDB holdings of human protein structures number ~50,000 (~29% of the archive).

Insights from PDB data broadly impact research in fundamental biology. In some cases, they also translate directly into knowledge concerning human health and disease and facilitate discovery of diagnostic and therapeutic agents. High-throughput genome sequencing combined with 3D structure information frequently enables the identification of drug targets for life-threatening diseases. In late-stage melanoma, for example, ~50% of tumors exhibit a characteristic Valine to Glutamic Acid substitution at amino acid 600 of the BRAF protein kinase, turning the enzyme into a potent driver of uncontrolled cell proliferation (Gray-Schopfer et al., 2007). This tumor-specific variant of the normal BRAF protein is the target of several approved small-molecule drugs, all of which resulted from structure-guided drug discovery campaigns that built on earlier work by academic researchers (e.g., vemurafenib) (Westbrook et al., 2020). Initial structures of both normal and tumor-variant forms of BRAF were contributed to the PDB well before vemurafenib received regulatory approval.

Fifty years of PDB operations have seen entire subdisciplines of biology transformed by open access to 3D structures of key macromolecules. Much of our understanding of T-cell immunology, for example, can be traced back to the first major histocompatibility complex or MHC structure (Bjorkman et al., 1987, 1988). The MHC structure revealed the molecular mechanism of antigen presentation underpinning immune surveillance of proteins in our bodies. More than 750 related PDB structures deposited since 1987 laid the groundwork for understanding regulation of T-cell responses to foreign or non-self antigens and the molecular and cell biological bases of immune checkpoints.

Cancer treatment has undergone a revolution with regulatory approvals of various anti-PD1 and anti-PDL1 monoclonal antibodies. These biologic drugs block protein-protein interactions at immune checkpoints that would otherwise downregulate T-cell killing of tumor cells. President Jimmy Carter’s life, for example, was saved by the anti-PD1 antibody pembrolizumab. Clinically-successful monoclonal antibodies, antibody-drug conjugates, and other classes of biologics (e.g., cytokines) have depended critically on open access to PDB structures (Gilliland et al., 2012). Designer bispecific antibodies can also promote T-cell killing of malignant cells. Blinatumomab does so by non-covalently tethering CD3 receptors on T-cells to CD19 receptors on malignant B-cells (Burt et al., 2019). Absent knowledge of the 3D structures of antibodies and their targets (the vast majority of which have been contributed by academic researchers), patients and their families would not be benefiting from the many safe and effective biologic drugs now available to treat malignancies and autoimmune disorders.

The COVID-19 pandemic has witnessed heroic efforts by researchers worldwide. Structural biologists are prolific generators of new knowledge concerning SARS-CoV-2. Arguably, the most important of the >1,000 COVID-19 related PDB depositions since late January 2020 are >200 structures of the homotrimeric spike protein required for cell entry. Open access to this wealth of structural information has influenced design of both vaccines and monoclonal antibodies for passive immunization. Nearly 300 PDB structures of essential SARS-CoV-2 proteases are facilitating structure-guided drug discovery efforts within multiple biopharmaceutical companies. PDB structures have also become the face of the coronavirus for the general public (Goodsell et al., 2020). As of April 2021, two small-molecule inhibitors of the SARS-CoV-2 main (or 3CL) protease had entered phase 1 clinical trials (clinicaltrials.gov), including PF-07304814 (intravenous dosing (Boras et al., 2020)) and PF-07321332 (oral dosing (Halford, 2021)).

Going beyond use of individual experimental structures, PDB data have profoundly influenced the computational sciences. Indeed, the field of structural bioinformatics owes its very existence to the PDB. Without an open access repository of validated, expertly biocurated, and trusted 3D structures of biological macromolecules there would be no homology modeling, no computational docking of small molecule ligands, and certainly no de novo protein structure prediction.

Synergy of Structural Biology and Machine Learning

The history of protein structure prediction recalls Newton’s famous quote “If I have seen further it is by standing on the shoulders of Giants.” Developers of homology modeling and de novo structure prediction have depended on the convergence of knowledge from the PDB, physics and chemistry (e.g., empirical potential functions, thermodynamics), molecular evolution, deep genome sequencing, mathematics, statistics, computer engineering, and computer science to make critical breakthroughs. Their advances were fostered by two community-led blind challenges (i.e., CASP (CASP Organizers, 2020) and the weekly Continuous Automated Model EvaluatiOn or CAMEO online challenge (Haas et al., 2018)). Both have relied on coordination with structural biologists and the wwPDB to ensure relevant structure data are not publicly released before challenges conclude. Every Friday evening, in support of CAMEO challenges, the wwPDB pre-releases amino acid sequences pertaining to new protein structures that will be publicly released on the following Wednesday. Individual commitments to collaboration and openness (e.g., sharing of benchmark datasets and computer code) and healthy competition among researchers and software developers have also contributed to advances in de novo protein structure prediction.
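
The weekly timing described above (sequences pre-released on a Friday, structures publicly released the following Wednesday) can be sketched with standard date arithmetic. The function name `public_release_date` and the example date are illustrative choices made here; exact release times and time zones are not given in the text and are ignored.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() counts Monday as 0, so Friday is 4


def public_release_date(pre_release_day):
    """Return the Wednesday that follows a Friday sequence pre-release."""
    if pre_release_day.weekday() != FRIDAY:
        raise ValueError("pre-release is assumed to happen on a Friday")
    return pre_release_day + timedelta(days=5)  # Friday + 5 days = Wednesday


pre_release = date(2021, 5, 7)  # an arbitrary Friday chosen for illustration
release = public_release_date(pre_release)
print(pre_release.isoformat(), "->", release.isoformat())
```

The five-day gap is the window during which predictors see only the sequence, which is what keeps the challenge blind.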

In late 2020 during CASP14 (CASP Organizers, 2020), Google DeepMind revealed that its AlphaFold2 system can predict 3D structures of small globular proteins with accuracies comparable to those of low-resolution experimental methods (https://deepmind.com). While not entirely unexpected given the success of AlphaFold (Senior et al., 2020), the performance of AlphaFold2 was justifiably heralded as a major breakthrough in de novo protein structure prediction. At this early stage, we do not know the degree to which ML methods will be able to reduce prediction errors and expand their purview to larger, multidomain proteins. We can, however, confidently assert that sharing of methodologies and possibly code by DeepMind and others using ML approaches will be vital to the longer-term success of these endeavors. Even if AI methods themselves do not improve, year-upon-year growth of the PDB is likely to improve prediction accuracy for small globular proteins. Continuation of the CASP and CAMEO challenges will also be important for future progress.

Looking further ahead, we anticipate that ML approaches will contribute substantially to improved small-molecule docking and affinity scoring methods. Early efforts in this arena have shown promise (Parks et al., 2020). Arguably, there is much to be gained if ML methods can be deployed during medicinal chemistry campaigns to improve the potency and selectivity of small molecules before they are subject to preclinical testing (in vitro and in animals) prior to initiation of even more costly human clinical trials. Industry-wide estimates attribute ~30% of small-molecule drug discovery and development failures to toxicity, which results primarily from non-specific binding to so-called off targets (Kola and Landis, 2004). Progress on this front will again be accelerated by community-led blind challenges, such as the Drug Design Data Resource or D3R (Parks et al., 2020) and the weekly Continuous Evaluation of Ligand Protein Prediction or CELPP (Wagner et al., 2019). As for CAMEO, wwPDB supports CELPP challenges by pre-releasing amino acid sequences and related chemical descriptors every Friday for new protein-ligand complex structures that will be publicly released on the following Wednesday. Successful prediction of protein-protein interactions is also likely to benefit from application of ML tools, again fostered by frequent challenges (e.g., Critical Assessment of PRediction of Interactions or CAPRI (Janin, 2005)).

Conclusion

With the benefit of hindsight, we can appreciate the importance of having all public domain structural biology data expertly validated and biocurated and made freely available from a single repository in a standardized format. In a world without open access data, progress in structural biology would have been agonizingly slow. We would likely not have seen as many 3D structures becoming key drivers of research progress across the sciences. The power of molecular replacement phasing has shown that more data equals better technology, which equals richer research outcomes. We should expect the same for de novo protein structure prediction. Accuracies will improve and the impact of AI advances will broaden as the volume of protein structure data continues to increase. In contrast, PDB data pertaining to small-molecule interactions with proteins are dwarfed by holdings inside biopharmaceutical company firewalls. Contributions of significantly more data from industry would almost certainly fuel advances in prediction of small molecule binding to proteins and nucleic acids. With sufficient data in the public domain we can expect acceleration of drug discovery and development efforts in both academe and industry. The true test of AI methods, however, will be accurate prediction of intermolecular interactions that underpin complex regulatory processes in biology. Electron microscopy and integrative or hybrid methods are yielding increasing numbers of structures of large molecular machines, which will provide the necessary starting points.

Looking ahead to the next 50 years of the PDB, the Worldwide Protein Data Bank partnership will continue to foster growth of the single archive and support open access to 3D structure data, with no charge to data depositors or data consumers, and with no limitations on usage. The most immediate challenge facing the structural biology community is effective management and long-term preservation of data generated by integrative or hybrid methods in structural studies of ever larger systems, ranging from molecular machines to organelles to entire cells. Success in this endeavor will depend on the widest possible adoption of the open access model. Without the pioneering efforts of the PDB, we would not have the burgeoning biodata ecosystem of today that underpins much of the biological sciences (Durinx et al., 2017). Interoperation of even more data resources will be required for broader application of AI methods in modeling whole organs or populations of organisms (e.g., bacterial colonies). Realizing this bold vision will ultimately depend on basic and applied researchers, data stewards, national/regional science funders, and biopharmaceutical/technology industry leaders coming together to allocate sufficient resources to sustain open sharing and preservation of scientific data in the coming decades (Anderson et al., 2017).

Acknowledgements

Above all, the authors wish to thank the tens of thousands of structural biologists who deposited structures to the PDB since 1971 and the many millions of individuals around the world who consume PDB data. We also thank Drs. Jose Duarte, Cathy Lawson, Brinda Vallat, and John D. Westbrook, and Ms. Christine Zardecki for insightful comments and help with manuscript preparation. Finally, the authors gratefully acknowledge contributions to the success of the PDB archive made by all members (past and present) of the RCSB Protein Data Bank, our wwPDB partners (Protein Data Bank in Europe, Protein Data Bank Japan, Biological Magnetic Resonance Bank, and Electron Microscopy Data Bank), and the PDB team at Brookhaven National Laboratory. RCSB PDB is jointly funded by the National Science Foundation (DBI-1832184), the US Department of Energy (DE-SC0019749), and the National Cancer Institute, National Institute of Allergy and Infectious Diseases, and National Institute of General Medical Sciences of the National Institutes of Health under grant R01GM133198.

Declaration of Interests

The authors declare no competing interests.

References

  1. Adams PD, Aertgeerts K, Bauer C, Bell JA, Berman HM, Bhat TN, Blaney JM, Bolton E, Bricogne G, Brown D, et al. (2016). Outcome of the First wwPDB/CCDC/D3R Ligand Validation Workshop. Structure 24, 502–508.
  2. Anderson W, Apweiler R, Bateman A, Bauer GA, Berman H, Blake JA, Blomberg N, Burley SK, Cochrane G, Di Francesco V, et al. (2017). Towards coordinated international support of core data resources for the life sciences. bioRxiv, doi: 10.1101/110825.
  3. Berman HM (2008). The Protein Data Bank: a historical perspective. Acta Crystallogr A 64, 88–95.
  4. Berman HM, Adams PD, Bonvin AA, Burley SK, Carragher B, Chiu W, DiMaio F, Ferrin TE, Gabanyi MJ, Goddard TD, et al. (2019). Federating Structural Models and Data: Outcomes from A Workshop on Archiving Integrative Structures. Structure 27, 1745–1759.
  5. Berman HM, Henrick K, and Nakamura H (2003). Announcing the worldwide Protein Data Bank. Nature Structural Biology 10, 980.
  6. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, and Bourne PE (2000). The Protein Data Bank. Nucleic Acids Res 28, 235–242.
  7. Bernal JD, and Crowfoot DM (1934). X-ray photographs of crystalline pepsin. Nature 133, 794–795.
  8. Bernstein FC, Koetzle TF, Williams GJB, Meyer EF Jr., Brice MD, Rodgers JR, Kennard O, Shimanouchi T, and Tasumi M (1977). Protein Data Bank: a computer-based archival file for macromolecular structures. Journal of Molecular Biology 112, 535–542.
  9. Bjorkman PJ, Saper MA, Samraoui B, Bennett WS, Strominger JL, and Wiley DC (1987). Structure of the human class I histocompatibility antigen, HLA-A2. Nature 329, 506–512.
  10. Bjorkman PJ, Saper MA, Samraoui B, Bennett WS, Strominger JL, and Wiley DC (1988). Structure of the human class I histocompatibility antigen, HLA-A2.
  11. Boras B, Jones RM, Anson BJ, Arenson D, Aschenbrenner L, Bakowski MA, Beutler N, Binder J, Chen E, Eng H, et al. (2020). Discovery of a Novel Inhibitor of Coronavirus 3CL Protease as a Clinical Candidate for the Potential Treatment of COVID-19. bioRxiv.
  12. Boutselakis H, Dimitropoulos D, Fillon J, Golovin A, Henrick K, Hussain A, Ionides J, John M, Keller PA, Krissinel E, et al. (2003). E-MSD: the European Bioinformatics Institute Macromolecular Structure Database. Nucleic Acids Res 31, 458–462.
  13. Bragg L (1968). X-ray crystallography. Scientific American 219, 58–74.
  14. Bragg W, and Bragg W (1913). The reflection of X-rays by crystals. Proceedings of the Royal Society of London 88, 428–438.
  15. Burley SK, Joachimiak A, Montelione GT, and Wilson IA (2008). Contributions to the NIH-NIGMS Protein Structure Initiative from the PSI Production Centers. Structure 16, 5–11.
  16. Burt R, Warcel D, and Fielding AK (2019). Blinatumomab, a bispecific B-cell and T-cell engaging antibody, in the treatment of B-cell malignancies. Hum Vaccin Immunother 15, 594–602.
  17. CASP Organizers (2020). 14th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction.
  18. Chapman HN (2019). X-Ray Free-Electron Lasers for the Structure and Dynamics of Macromolecules. Annu Rev Biochem 88, 35–58.
  19. Dauter Z, Jaskolski M, and Wlodawer A (2010). Impact of synchrotron radiation on macromolecular crystallography: a personal view. J Synchrotron Radiat 17, 433–444.
  20. Durinx C, McEntyre J, Appel R, Apweiler R, Barlow M, Blomberg N, Cook C, Gasteiger E, Kim J, Lopez R, et al. (2017). Identifying ELIXIR Core Data Resources. F1000Research 5, doi: 10.12688/f11000research.19656.12682.
  21. Feng Z, Westbrook JD, Sala R, Smart OS, Bricogne G, Matsubara M, Yamada I, Tsuchiya S, Aoki-Kinoshita KF, Hoch JC, et al. (2021). Enhanced validation of small-molecule ligands and carbohydrates in the protein databank. Structure.
  22. Fitzgerald PMD, Westbrook JD, Bourne PE, McMahon B, Watenpaugh KD, and Berman HM (2005). 4.5 Macromolecular dictionary (mmCIF). In International Tables for Crystallography G Definition and exchange of crystallographic data, Hall SR, and McMahon B, eds. (Dordrecht, The Netherlands: Springer), pp. 295–443.
  23. Gilliland GL, Luo J, Vafa O, and Almagro JC (2012). Leveraging SBDD in protein therapeutic development: antibody engineering. Methods Mol Biol 841, 321–349.
  24. Goodsell DS, Voigt M, Zardecki C, and Burley SK (2020). Integrative illustration for coronavirus outreach. PLoS Biol 18, e3000815. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Gore S, Sanz Garcia E, Hendrickx PMS, Gutmanas A, Westbrook JD, Yang H, Feng Z, Baskaran K, Berrisford JM, Hudson BP, et al. (2017). Validation of Structures in the Protein Data Bank. Structure 25, 1916–1927. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Gray-Schopfer V, Wellbrock C, and Marais R (2007). Melanoma biology and new targeted therapy. Nature 445, 851–857. [DOI] [PubMed] [Google Scholar]
  27. Green DW, Ingram VM, and Perutz MF (1954). The structure of haemoglobin - IV. Sign determination by the isomorphous replacement method. Proceedings of the Royal Society of London 225, 287–307. [Google Scholar]
  28. Haas J, Barbato A, Behringer D, Studer G, Roth S, Bertoni M, Mostaguir K, Gumienny R, and Schwede T (2018). Continuous Automated Model EvaluatiOn (CAMEO) complementing the critical assessment of structure prediction in CASP12. Proteins: Structure, Function, and Genetics 86 Suppl 1, 387–398. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Halford B (2021). Pfizer unveils its oral SARS-CoV-2 inhibitor. Chemical & Engineering News 99, in press. [Google Scholar]
  30. Henderson R, Sali A, Baker ML, Carragher B, Devkota B, Downing KH, Egelman EH, Feng Z, Frank J, Grigorieff N, et al. (2012). Outcome of the first electron microscopy validation task force meeting. Structure 20, 205–214. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Hendrickson WA (1991). Determination of macromolecular structures from anomalous diffraction of synchrotron radiation. Science 254, 51–58. [DOI] [PubMed] [Google Scholar]
  32. International Union of Crystallography (1989). Policy on publication and the deposition of data from crystallographic studies of biological macromolecules. Acta Cryst A45, 658. [Google Scholar]
  33. Janin J (2005). Assessing predictions of protein-protein interaction: the CAPRI experiment. Protein Sci 14, 278–283. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Johnson LN, and Petsko GA (1999). David Phillips and the origin of structural enzymology. Trends Biochem Sci 24, 287–289. [DOI] [PubMed] [Google Scholar]
  35. Kendrew JC, and Parrish RG (1957). The crystal structure of myoglobin III. Sperm-whale myoglobin. Proceedings of the Royal Society of London 238, 305–324. [Google Scholar]
  36. Kola I, and Landis J (2004). Can the pharmaceutical industry reduce attrition rates? Nat Rev Drug Discov 3, 711–715. [DOI] [PubMed] [Google Scholar]
  37. Luchinat E, and Banci L (2016). A Unique Tool for Cellular Structural Biology: In-cell NMR. J Biol Chem 291, 3776–3784. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Meyer EF (1997). The first years of the Protein Data Bank. Protein Sci 6, 1591–1597. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Montelione GT, Nilges M, Bax A, Guntert P, Herrmann T, Richardson JS, Schwieters CD, Vranken WF, Vuister GW, Wishart DS, et al. (2013). Recommendations of the wwPDB NMR Validation Task Force. Structure 21, 1563–1570. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Nakamura H, Ito N, and Kusunoki M (2002). [Development of PDBj: Advanced database for protein structures]. Tanpakushitsu Kakusan Koso 47, 1097–1101. [PubMed] [Google Scholar]
  41. Nannenga BL, and Gonen T (2019). The cryo-EM method microcrystal electron diffraction (MicroED). Nat Methods 16, 369–379. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Parks CD, Gaieb Z, Chiu M, Yang H, Shao C, Walters WP, Jansen JM, McGaughey G, Lewis RA, Bembenek SD, et al. (2020). D3R grand challenge 4: blind prediction of protein-ligand poses, affinity rankings, and relative binding free energies. J Comput Aided Mol Des 34, 99–119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Protein Data Bank (1971). Crystallography: Protein Data Bank. Nature (London), New Biol 233, 223. [Google Scholar]
  44. Read KB, Sheehan JR, Huerta MF, Knecht LS, Mork JG, Humphreys BL, and N.I.H. Big Data Annotator Group (2015). Sizing the Problem of Improving Discovery and Access to NIH-Funded Data: A Preliminary Study. PLoS One 10, e0132735. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Read RJ, Adams PD, Arendall WB 3rd, Brunger AT, Emsley P, Joosten RP, Kleywegt GJ, Krissinel EB, Lutteke T, Otwinowski Z, et al. (2011). A new generation of crystallographic validation tools for the Protein Data Bank. Structure 19, 1395–1412. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Rossmann MG, and Blow DM (1963). Determination of phases by the conditions of non‐crystallographic symmetry. Acta Cryst 16, 39–45. [Google Scholar]
  47. Rout MP, and Sali A (2019). Principles for Integrative Structural Biology Studies. Cell 177, 1384–1403. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Schaeffer RD, and Daggett V (2011). Protein folds and protein folding. Protein Engineering, Design & Selection: PEDS 24, 11–19. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, Qin C, Zidek A, Nelson AWR, Bridgland A, et al. (2020). Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710. [DOI] [PubMed] [Google Scholar]
  50. Shao C, Yang H, Westbrook JD, Young JY, Zardecki C, and Burley SK (2017). Multivariate Analyses of Quality Metrics for Crystal Structures in the Protein Data Bank Archive. Structure 25, 458–468. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Trewhella J, Hendrickson WA, Kleywegt GJ, Sali A, Sato M, Schwede T, Svergun DI, Tainer JA, Westbrook J, and Berman HM (2013). Report of the wwPDB Small-Angle Scattering Task Force: data requirements for biomolecular modeling and the PDB. Structure 21, 875–881. [DOI] [PubMed] [Google Scholar]
  52. Turk M, and Baumeister W (2020). The promise and the challenges of cryo-electron tomography. FEBS Lett 594, 3243–3261. [DOI] [PubMed] [Google Scholar]
  53. Ulrich EL, Akutsu H, Doreleijers JF, Harano Y, Ioannidis YE, Lin J, Livny M, Mading S, Maziuk D, Miller Z, et al. (2008). BioMagResBank. Nucleic Acids Res 36, D402–408. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. van der Aalst WMP, Bichler M, and Heinzl A (2017). Responsible Data Science. Business & Information Systems Engineering 59, 311–313. [Google Scholar]
  55. van der Wel PCA (2018). New applications of solid-state NMR in structural biology. Emerg Top Life Sci 2, 57–67. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Wagner JR, Churas CP, Liu S, Swift RV, Chiu M, Shao C, Feher VA, Burley SK, Gilson MK, and Amaro RE (2019). Continuous Evaluation of Ligand Protein Predictions: A Weekly Community Challenge for Drug Docking. Structure 27, 1326–1335. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Watson JD, and Crick FH (1953). Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171, 737–738. [DOI] [PubMed] [Google Scholar]
  58. Westbrook J, Yang H, Feng Z, and Berman HM (2005). 5.5 The use of mmCIF architecture for PDB data management. In International Tables for Crystallography, Hall SR, and McMahon B, eds. (Dordrecht, The Netherlands: Springer), pp. 539–543. [Google Scholar]
  59. Westbrook JD, Soskind R, Hudson BP, and Burley SK (2020). Impact of Protein Data Bank on Anti-neoplastic Approvals. Drug Discov Today 25, 837–850. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten JW, da Silva Santos LB, Bourne PE, et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. wwPDB (2017). PDBx/mmCIF Resource Site.
  62. Young JY, Westbrook JD, Feng Z, Peisach E, Persikova I, Sala R, Sen S, Berrisford JM, Swaminathan GJ, Oldfield TJ, et al. (2018). Worldwide Protein Data Bank biocuration supporting open access to high-quality 3D structural biology data. Database 2018, bay002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Young JY, Westbrook JD, Feng Z, Sala R, Peisach E, Oldfield TJ, Sen S, Gutmanas A, Armstrong DR, Berrisford JM, et al. (2017). OneDep: Unified wwPDB System for Deposition, Biocuration, and Validation of Macromolecular Structures in the PDB Archive. Structure 25, 536–545. [DOI] [PMC free article] [PubMed] [Google Scholar]
