letter
Biophysical Reviews. 2020 Feb 6;12(2):221–224. doi: 10.1007/s12551-020-00628-1

Big data science at AMED-BINDS

Haruki Nakamura
PMCID: PMC7242525  PMID: 32030637

Science is considered to rest upon four basic pillars. Traditionally, experiment and theory are regarded as the first and second of these. Advances in computing power have made simulation the third pillar. The development of large repositories of data (so-called Big Data), along with the ability to analyze these data libraries using inferential and interpolative statistical methods, has led big data analysis, in other words data science, to constitute the fourth scientific pillar.

Big data analysis has become possible in fields such as genome science and astronomy because rapid developments in information technology (IT) have allowed the observation and storage of very large data sets. Analysis of such huge and precise data sets, including image data, has made it possible to obtain new knowledge that was never before available. The famous “DIKW pyramid” described by Rowley (2007) indicates the basic concept of such big data-driven science: the lowest level is raw big data (Data), which are validated and assembled into Information. Analysis of Information then provides Knowledge, which finally produces Wisdom. “Wisdom” in this context should be defined as the ability to solve scientific problems by making reliable predictions. Science-based predictions can be made using either data-driven (empirical) or theory-driven chains of argument. Wisdom is frequently sought in the field of life science, where many phenomena are observed without necessarily yielding any theoretical principles. Because they can establish predictive confidence through interpolative assessment of big data sets, big data-driven approaches are particularly useful in life science.

Let us consider what is necessary for data science, once we define “wisdom” as above. The answer should be very simple: “Any problem lacking a firm theoretical underpinning cannot be solved without data”, or more correctly, “Any such problem cannot be solved without correct data”. Such a statement is easy to formulate but it is very difficult in practice to confirm whether the data used by data science are actually correct or not. There are two approaches towards this issue based on (i) post-validation of data sets and (ii) pre-validation of data sets using standardized data production methods.

(i) Post-validation of data sets: The first approach is to validate the data thoroughly. Namely, any observed or simulated data should be objectively scrutinized and described by users with enough metadata that the observation or simulation can be reproduced. Several efforts at the international databank for protein structures, wwPDB (worldwide Protein Data Bank) (wwPDB consortium 2019), have addressed this issue of data validation (Gore et al. 2017; Young et al. 2018). In addition, the data should be updated under a versioning system, so that users can work with the newest data while understanding the history of data production and revision.

(ii) Pre-validation of data sets using standardized data production methods: The second approach is to create new data by experiments or simulations in such a way that the data produced are automatically archived with enough metadata to allow them to be validated and reproduced, keeping data quality high. If data are produced using a standardized high-throughput (HTP) procedure, new and correct big data are created. Such approaches are becoming popular in many fields of science, and governmental scientific funding agencies, along with other funding bodies, are increasingly requesting that grant recipients make their data open to society during or following the publication process.
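The two validation routes described above can be made concrete with a small sketch. The Python below is illustrative only: the `DataRecord` class, its field names, and the `REQUIRED_METADATA` list are hypothetical and do not correspond to any wwPDB or BINDS schema. It shows a record that is rejected at creation when the metadata needed for reproduction are missing, and that keeps a digest of every earlier version when it is revised.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical minimum metadata; a real archive defines its own schema.
REQUIRED_METADATA = ("method", "instrument", "software_version")


def checksum(payload: bytes) -> str:
    """Return a SHA-256 digest so later users can verify the payload."""
    return hashlib.sha256(payload).hexdigest()


@dataclass
class DataRecord:
    payload: bytes
    metadata: dict
    history: list = field(default_factory=list)  # (version, digest) pairs

    def __post_init__(self):
        # Refuse records that lack the metadata needed to reproduce them.
        missing = [k for k in REQUIRED_METADATA if k not in self.metadata]
        if missing:
            raise ValueError(f"missing metadata: {missing}")
        self.history.append((1, checksum(self.payload)))

    @property
    def version(self) -> int:
        return self.history[-1][0]

    def revise(self, new_payload: bytes) -> None:
        """Store a new version while keeping every earlier digest."""
        self.payload = new_payload
        self.history.append((self.version + 1, checksum(new_payload)))


# Example values are placeholders, not real instrument or software names.
record = DataRecord(
    b"raw image bytes",
    {"method": "cryo-EM", "instrument": "example-scope", "software_version": "1.0"},
)
record.revise(b"corrected image bytes")
print(record.version, len(record.history))
```

Storing digests rather than full copies of old payloads is one possible design choice; it lets a user confirm which version they hold without the archive having to serve every historical file.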

Creating innovative drugs requires state-of-the-art protein science technology. Such technological approaches utilize the big data associated with the genome, proteome, medical and clinical data, and protein structural data deposited in the PDB. In addition to utilizing big data, drug discovery research itself produces big data. The BINDS (Basis for supporting INnovative Drug discovery and life Science research) program started in April 2017 as one of the AMED (Japan Agency for Medical Research and Development) programs to promote drug discovery in academia at the pre-clinical and life science research stages. AMED is a relatively new funding agency for medical and life science research, operating in Japan since 2015, that integrates funding from the Ministry of Health, Labor and Welfare; the Ministry of Education, Culture, Sports, Science and Technology; and the Ministry of Economy, Trade and Industry. A characteristic feature of the BINDS program is that it is composed of 59 research groups in diverse fields, including pharmaceutical science, medicine, chemistry, genomics, structural biology, informatics, and computer science. In addition, BINDS members support researchers outside of the program by sharing key technologies such as synchrotron beams, free-electron lasers, cryo-EM (electron microscopy) instrumentation, NMR devices, supercomputer resources, chemical libraries equipped with HTP assay systems, and next-generation DNA sequencers.

In this letter, I will not directly address the drug discovery results generated within the BINDS program, which will be described in other papers in the near future. Instead, I focus here on the scientific activities of the BINDS program from the viewpoint of data science. In this regard, several BINDS researchers have engaged in structural bioinformatics studies to predict and analyze protein complex structures. Kentaro Tomii at AIST has continuously developed original algorithms to predict protein tertiary structures and their complex structures (Shiota et al. 2015) and has achieved considerable success in the worldwide blind contest, CASP (Nakamura et al. 2017). Most recently, his group has developed a new algorithm with deep neural networks for protein contact prediction, which makes it possible to build tertiary structural models (Fukuda & Tomii 2020). Their core technology is based on multiple alignments of protein sequence big data. Hidetoshi Kono at QST has built many complex structural models using molecular simulation with constraints provided by experimental data from SAXS (small angle X-ray scattering) and cryo-EM, integrating local structures provided by the PDB structural database. In particular, thanks to their generalized sampling approach, their simulations could sample many possible conformations having low free energies (Kono et al. 2018).

Cryo-EM has recently shown its power to reveal near-atomic structures of large complexes of proteins and nucleic acids. However, technical difficulties remain, and attempts are being made to overcome them using data science approaches. To prepare good cryo-EM grids, which is a key technology for facilitating data collection, Keiichi Namba's group at Osaka University has developed software programs, Gwatch and Rwatch, to select appropriate particle images by automatically averaging millions of two-dimensional (2D) images. Toru Terada at the University of Tokyo and Kazutoshi Tani at Mie University have developed a deep-learning-based method to identify good regions of a cryo-EM grid without the aid of specialist knowledge, using only several hundred 2D images. Takeshi Kawabata at Osaka University has developed his own software program, gmfit, for fitting subunits of proteins into density maps of protein complexes using a Gaussian mixture model (GMM) (Kawabata 2008; Kawabata 2018a). His approach should be useful for model building not only with low-resolution cryo-EM data (Kawabata 2018b) but also with data from other methods such as atomic force microscopy (AFM) (Dasgupta et al. 2020).
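To illustrate why a Gaussian mixture representation is convenient for fitting, the sketch below scores how well two mixtures overlap. It is not gmfit's implementation: the isotropic (spherical) Gaussians, the function names, the toy coordinates, and the use of the overlap integral as the score are all simplifying assumptions made for illustration. It relies only on the standard identity that the integral of the product of two normalized Gaussians is a Gaussian function of the distance between their centers, with the two variances added.

```python
import math

# A component is (weight, center, sigma): an isotropic 3D Gaussian.
# Isotropy is a simplification; general GMMs use full covariance matrices.


def gaussian_overlap(center_a, sigma_a, center_b, sigma_b):
    """Integral over 3D space of the product of two normalized isotropic Gaussians."""
    var = sigma_a ** 2 + sigma_b ** 2
    dist2 = sum((a - b) ** 2 for a, b in zip(center_a, center_b))
    return (2.0 * math.pi * var) ** -1.5 * math.exp(-dist2 / (2.0 * var))


def gmm_overlap(gmm_a, gmm_b):
    """Weighted sum of pairwise overlaps between the components of two mixtures."""
    return sum(
        wa * wb * gaussian_overlap(ca, sa, cb, sb)
        for wa, ca, sa in gmm_a
        for wb, cb, sb in gmm_b
    )


def translate(gmm, shift):
    """Return a copy of the mixture with every center moved by `shift`."""
    return [
        (w, tuple(c + s for c, s in zip(center, shift)), sigma)
        for w, center, sigma in gmm
    ]


# Toy data: a two-blob "density map" and a one-blob "subunit".
density_map = [(0.5, (0.0, 0.0, 0.0), 2.0), (0.5, (6.0, 0.0, 0.0), 2.0)]
subunit = [(1.0, (0.0, 0.0, 0.0), 2.0)]

well_placed = gmm_overlap(density_map, subunit)
misplaced = gmm_overlap(density_map, translate(subunit, (3.0, 20.0, 0.0)))
print(well_placed > misplaced)
```

Because the score has a closed form, a fitting procedure could compare many candidate placements without integrating over a voxel grid; a real program would also search over rotations and handle several subunits, which this sketch omits.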

Almost all the members of the BINDS program produce many different kinds of data through their own scientific activities. The members involved in the “Platform function optimization unit” have made efforts to construct archives for those data. One recent archive is “Antibody Square” (http://antibodysq.info/), which has been developed by Yukinari Kato at Tohoku University and Hirofumi Suzuki at Waseda University as a repository for antibody developers, users, and suppliers (Fig. 1). Genji Kurisu at PDBj, Osaka University, manages the EMPIAR-PDBj site (https://empiar.pdbj.org/), a mirror of EMPIAR, the cryo-EM raw image database at EMBL-EBI, to support data-in and data-out applications involving the huge raw image data sets produced by the BINDS members. In the near future, the EMPIAR-PDBj site will directly accept depositions of raw image data. Gert-Jan Bekker at Osaka University has developed BSM-Arc (biological structure model archive, https://bsma.pdbj.org/), an archive of the static and dynamic structural models produced during the BINDS modeling and molecular simulation activities (Bekker et al. 2020) (Fig. 2). Several other data archives, such as a chemical compound library for screening and gene expression and epigenetics data obtained with next-generation sequencers (NGS), are also planned and will be released in the near future.

Fig. 1. The web page of “Antibody Square,” the repository for antibody developers, users, and suppliers. http://antibodysq.info/. Accessed 15 January 2020

Fig. 2. The web page of “BSM-Arc (biological structure model archive),” the repository for the static and dynamic structural models. https://bsma.pdbj.org/ (Bekker et al. 2020). Accessed 15 January 2020

In conclusion, BINDS members have advanced their activities by applying data science approaches to big data. In addition, the BINDS program has been producing many excellent results, both from the individual research contributions of BINDS members and from the support and assistance provided to other researchers who submit their proposals through the BINDS platform. Many BINDS results are now themselves archived as big data in addition to being published in original scientific articles. Such a dual-track system of publication and data deposition in big data repositories helps to spread scientific knowledge throughout the world, further contributing to drug discovery and life science.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Bekker G, Kawabata T, Kurisu G. The biological structure model archive: an archive for in-silico models and simulations. Biophys Rev. 2020 (in the current issue).
  2. Dasgupta B, Miyashita O, Tama F. Reconstruction of low-resolution molecular structures from simulated atomic force microscopy images. BBA. 2020;1864:129420. doi: 10.1016/j.bbagen.2019.129420.
  3. Fukuda H, Tomii K. DeepECA: an end-to-end learning framework for protein contact prediction from a multiple sequence alignment. BMC Bioinformatics. 2020;21:10. doi: 10.1186/s12859-019-3190-x.
  4. Gore S, García ES, Hendrickx PMS, Gutmanas A, Westbrook JD, Yang H, Feng Z, Baskaran K, Berrisford JM, Conroy M, Hudson BP, Ikegawa Y, Kobayashi N, Lawson CL, Mading S, Mak L, Mukhopadhyay A, Oldfield TJ, Patwardhan A, Peisach E, Sahni G, Sekharan MR, Sen S, Shao C, Smart OS, Ulrich EL, Yamashita R, Quesada M, Young JY, Nakamura H, Markley JL, Berman HM, Burley SK, Velankar S, Kleywegt GJ. Validation of the structures in the Protein Data Bank. Structure. 2017;25:1916–1927. doi: 10.1016/j.str.2017.10.009.
  5. Kawabata T. Multiple subunit fitting into a low-resolution density map of a macromolecular complex using a Gaussian mixture model. Biophys J. 2008;95:4643–4658. doi: 10.1529/biophysj.108.137125.
  6. Kawabata T. Gaussian-input Gaussian mixture model for representing density maps and atomic models. J Struct Biol. 2018;203:1–16. doi: 10.1016/j.jsb.2018.03.002.
  7. Kawabata T. Rigid-body fitting of atomic models on 3D density maps of electron microscopy. Adv Exp Med Biol. 2018;1105:219–235. doi: 10.1007/978-981-13-2200-6_14.
  8. Kono H, Sakuraba S, Ishida H. Free energy profiles for unwrapping the outer superhelical turn of nucleosomal DNA. PLoS Comput Biol. 2018;14:e1006024. doi: 10.1371/journal.pcbi.1006024.
  9. Nakamura T, Oda T, Fukasawa Y, Tomii K. Template-based quaternary structure prediction of proteins using enhanced profile–profile alignments. Proteins. 2017;86:274–282. doi: 10.1002/prot.25432.
  10. Rowley J. The wisdom hierarchy: representations of the DIKW hierarchy. J Inform Sci. 2007;33:163–180. doi: 10.1177/0165551506070706.
  11. Shiota T, Imai K, Qiu J, Hewitt VL, Tan K, Shen H-H, Sakiyama N, Fukasawa Y, Hayat S, Kamiya M, Elofsson A, Tomii K, Horton P, Wiedemann N, Pfanner N, Lithgow T, Endo T. Molecular architecture of the active mitochondrial protein gate. Science. 2015;349:1544–1548. doi: 10.1126/science.aac6428.
  12. wwPDB consortium. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucl Acids Res. 2019;47:D520–D528. doi: 10.1093/nar/gky949.
  13. Young JY, Westbrook JD, Feng Z, Peisach E, Persikova I, Sala R, Sen S, Berrisford JM, Swaminathan GJ, Oldfield TJ, Gutmanas A, Igarashi R, Armstrong DR, Baskaran K, Chen L, Chen M, Clark AR, Costanzo LD, Dimitropoulos D, Gao G, Ghosh S, Gore S, Guranovic V, Hendrickx PMS, Hudson BP, Ikegawa Y, Kengaku Y, Lawson CL, Liang Y, Mak L, Mukhopadhyay A, Narayanan B, Nishiyama K, Patwardhan A, Sahni G, Sanz-Garcia E, Sato J, Sekharan MR, Shao C, Smart OS, Tan L, van Ginkel G, Yang H, Zhuravleva MA, Markley JL, Nakamura H, Kurisu G, Kleywegt GJ, Velankar S, Berman HM, Burley SK. Worldwide Protein Data Bank biocuration supporting open access to high-quality 3D structural biology data. Database. 2018;2018:1–17. doi: 10.1093/database/bay002.
