Abstract
The DNA Data Bank of Japan (DDBJ) (http://www.ddbj.nig.ac.jp) has provided public data services for thirty years, since 1987. We collect nucleotide sequence data from researchers as a member of the International Nucleotide Sequence Database Collaboration (INSDC, http://www.insdc.org), in collaboration with the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI). The DDBJ Center also operates the Japanese Genotype-phenotype Archive (JGA) with the National Bioscience Database Center to collect human-subject data from Japanese researchers. Here, we report our database activities for INSDC and JGA over the past year, and introduce retrieval and analytical services running on our supercomputer system and their recent modifications. Furthermore, with the Database Center for Life Science, the DDBJ Center is improving semantic web technologies to integrate and share biological data, and provides an RDF version of the sequence data.
INTRODUCTION
The DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp) (1) is a public database of nucleotide sequences established at the National Institute of Genetics (NIG). Since 1987, the DDBJ has been collecting annotated nucleotide sequences as its traditional database service. This endeavor has been conducted in collaboration with GenBank (2) at the National Center for Biotechnology Information (NCBI) and with European Nucleotide Archive (ENA) (3) at the European Bioinformatics Institute (EBI). The collaborative framework is called the International Nucleotide Sequence Database Collaboration (INSDC, http://www.insdc.org/) (4) and the product database from this framework is called the International Nucleotide Sequence Database (INSD).
Within the INSDC framework, the DDBJ Center also services the DDBJ Sequence Read Archive (DRA), BioProject for sequencing project metadata and BioSample for sample information to facilitate the acceptance of large-scale data generated from next-generation sequencing platforms (5–7). The comprehensive resource of nucleotide sequences and associated information complies with the INSDC policy that guarantees free and unrestricted access to data archives (8). In 2016, the advisors of INSDC published an open letter to remind scientists to submit their sequence data to the INSDC (9,10).
In addition, the DDBJ Center services the Japanese Genotype-phenotype Archive (JGA, http://trace.ddbj.nig.ac.jp/jga) in collaboration with the National Bioscience Database Center (NBDC, http://biosciencedbc.jp/en/) of the Japan Science and Technology Agency (5,11). This database stores personal genotype and phenotype data from individuals who have signed consent agreements authorizing data release only for specific research use. The data access is strictly controlled, similar to the data access policy of the database of Genotypes and Phenotypes at the NCBI (12) and the European Genome-phenome Archive at the EBI (13). NBDC provides the guideline and policies for sharing human-derived data (http://humandbs.biosciencedbc.jp/en/guidelines) and also reviews data submission and usage requests.
The DDBJ Center, a part of NIG, is funded as a supercomputing center. Our web services, including submission systems, data retrieval systems, Web API, DDBJ Read Annotation Pipeline and backend databases, run on the NIG supercomputer system. The current commodity-based cluster was implemented in 2012 (14).
The present article reports updates to the above services at the DDBJ Center. Highlights are the semantic web services developed in collaboration with the Database Center for Life Science (DBCLS, http://dbcls.rois.ac.jp/en) and the virtualization of the annotation pipeline. All resources described here are available from http://www.ddbj.nig.ac.jp and most of the archival data can be downloaded at ftp://ftp.ddbj.nig.ac.jp.
THE DDBJ ARCHIVAL DATABASES
Data contents: traditional DDBJ and the DRA
In 2015, most of the nucleotide data submissions made directly to the DDBJ came from Japanese research groups (3826 submissions; 75.3%), with the remainder originating from Iran (238; 4.7%), India (188; 3.7%), Thailand (154; 3.0%), China (111; 2.2%) and other countries and regions (563; 11.1%).
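The percentages quoted for 2015 can be reproduced from the submission counts alone. The following is a minimal sketch that recomputes each share as a fraction of the summed counts; the dictionary layout and the function name `percentage_shares` are illustrative choices, not part of any DDBJ tool.

```python
# Submission counts by country/region for 2015, copied from the text above.
submissions = {
    "Japan": 3826,
    "Iran": 238,
    "India": 188,
    "Thailand": 154,
    "China": 111,
    "Other countries and regions": 563,
}


def percentage_shares(counts):
    """Return each count as a percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_submissions = sum(submissions.values())
shares = percentage_shares(submissions)

if __name__ == "__main__":
    print(f"Total direct submissions: {total_submissions}")
    for name, share in shares.items():
        print(f"{name}: {share}%")
```

Under these counts the total is 5080 direct submissions, and the rounded shares agree with the percentages given in the text.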
Between June 2015 and May 2016, the DDBJ periodical release increased by 10 317 427 entries and 20 978 161 726 base pairs. The periodical release does not include whole-genome shotgun (WGS), large parts of transcriptome shotgun assembly (TSA) or third party data (TPA) files (15). The DDBJ has continuously distributed sequence data in published patent applications from the Japan Patent Office (JPO, http://www.jpo.go.jp) and the Korean Intellectual Property Office (KIPO, http://www.kipo.go.kr/en). The JPO directly transferred its data to the DDBJ, whereas the KIPO transferred its data via an arrangement with the Korean Bioinformation Center. The DDBJ contributed 19.20% of the entries and 12.84% of the total base pairs added to the core nucleotide data of the INSD. A detailed statistical breakdown of the number of records is shown on the DDBJ homepage (http://www.ddbj.nig.ac.jp/breakdown_stats/prop_ent-e.html). In addition to the above data, the DDBJ has released a total of 11 909 516 WGS entries (1694 genomes), 1 505 087 contig/constructed (CON) entries, 1 313 171 TSA entries (18 projects), 786 TPA entries, 6374 TPA-WGS entries (one genome) and 1272 TPA-CON entries as of 27 May 2016.
Between June 2015 and May 2016, 23 974 runs of next-generation sequencing data were registered to the DRA.
Notable datasets released from the DDBJ sequence databases are listed in Table 1. In particular, we accepted and released the latest sequence data of the rice reference genome (16), together with the annotation performed by the Rice Annotation Project (17), a release that many researchers had anticipated.
Table 1. List of notable data sets released from the DNA Data Bank of Japan (DDBJ) sequence databases from June 2015 to May 2016.
Data type | Organism | Accession numbers for annotated sequences (number of entries) | Accession numbers for raw reads |
---|---|---|---|
Genome | Radish (Raphanus sativus cv. Aokubi S-h) | WGS: BAOO01000001-BAOO01072909 (72 909 entries); scaffold CON: DF196826-DF236948 (40 123 entries) | DRR012610-DRR012624 |
Genome | Soybean (Glycine max cv. Enrei) | BBNX02000001-BBNX02108601 (108 601 entries) | DRR021740-DRR021744 |
Genome | Common marmoset (Callithrix jacchus) | WGS: BBXK01000001-BBXK01109198 (109 198 entries); scaffold CON: DG000097-DG000120 (24 entries); GSS: LB274659-LB427105 (152 447 entries) | DRR036754-DRR036764 |
Genome | Rice (Oryza sativa Japonica Group cv. Nipponbare) | chromosome: AP014957-AP014968 (12 entries); unanchored: AP014969-AP015011 (43 entries) | n/a |
Genome | Hawaiian acornworm (Ptychodera flava) | WGS: BCFJ01000001-BCFJ01317432 (317 432 entries); scaffold CON: LD342582-LD560836 (218 255 entries) | DRR027930-DRR027956 |
Genome | Azuki bean (Vigna angularis var. angularis) | chromosome: AP015034-AP015044 (11 entries); scaffold: AP015045-AP017294 (2250 entries) | DRR031705, DRR031878-DRR031883, DRR032984-DRR033067 |
Genome | Taiwan habu (Protobothrops mucrosquamatus) | WGS: BCNE010000001-BCNE011421934 (1 421 934 entries) | DRR049668, DRR049669 |
Genome | Taiwan habu (Protobothrops mucrosquamatus) | WGS: BCNE02000001-BCNE02167851 (167 851 entries); scaffold CON: LD636650-LD688929 (52 280 entries) | DRR049668, DRR049669 |
Genome | Acropora digitifera | WGS: BACK02000001-BACK02054400 (54 400 entries); scaffold CON: DF970692-DF973111 (2420 entries) | DRR001380-DRR001433 |
Genome | Zoysia japonica cv. Nagirizaki | WGS: BCLF01000001-BCLF01011786 (11 786 entries) | DRR047281-DRR047283, DRR047291 |
Genome | Zoysia matrella cv. Wakaba | WGS: BCLG01000001-BCLG01013609 (13 609 entries) | DRR047287, DRR047289 |
Genome | Zoysia pacifica cv. Zanpa | WGS: BCLH01000001-BCLH01011428 (11 428 entries) | DRR047288, DRR047290 |
Genome | A bacterium that degrades and assimilates PET, Ideonella sakaiensis | WGS: BBYR01000001-BBYR01000227 (227 entries) | n/a |
Genome | Luminous mushroom (Mycena chlorophos) | WGS: BAYG01000001-BAYG01025660 (25 660 entries); scaffold CON: DF837679-DF850034 (12 356 entries) | DRR018497-DRR018504 |
Genome | Ohi'a lehua (Metrosideros polymorpha var. glaberrima) | WGS: BCNH01000001-BCNH01036376 (36 376 entries) | n/a |
Genome | Matsutake (Tricholoma matsutake) | WGS: BDDP01000001-BDDP01088884 (88 884 entries) | n/a |
Genome | Common buckwheat (Fagopyrum esculentum) | WGS: BCYN01000001-BCYN01387594 (387 594 entries) | DRR046985-DRR046993 |
Genome | Mushroom (Hypsizygus marmoreus) | WGS: BDDV01000001-BDDV01010694 (10 694 entries) | n/a |
Transcriptome | Radish (Raphanus sativus cv. Aokubi S-h) | n/a | DRR010353-DRR010355, DRR014743-DRR014781 |
Transcriptome | Soybean (Glycine max cv. Enrei) | n/a | DRR031435 |
Transcriptome | Common house spider (Parasteatoda tepidariorum) | IAAA01000001-IAAA01132843 (132 843 entries) | DRR047015-DRR047017 |
Transcriptome | Ayu (Plecoglossus altivelis altivelis) | thrombocyte: LA715952-LA738445 (22 494 entries); neutrophil: LA738446-LA761178 (22 733 entries); B lymphocyte: LA761179-LA777683 (16 505 entries) | DRR024801, DRR025094, DRR024802 |
Transcriptome | Taiwan habu (Protobothrops mucrosquamatus) | IAAC01000001-IAAC01112307 (112 307 entries) | DRR049635-DRR049665 |
Transcriptome | California harvester ant (Pogonomyrmex californicus) | IAAD01000001-IAAD01311730 (311 730 entries) | DRR048539-DRR048582 |
Transcriptome | Ant (Formica aquilonia) | LH381539-LH513652 (132 114 entries) | DRR042077-DRR042082 (DRA003820) |
Transcriptome | Ant (Formica cinerea) | LH513653-LH652103 (138 451 entries) | DRR042083-DRR042088 (DRA003820) |
Transcriptome | Ant (Formica exsecta) | LH652104-LH973351 (321 248 entries) | DRR042089-DRR042092 (DRA003820) |
Transcriptome | Ant (Formica fusca) | LI000001-LI121692 (121 692 entries) | DRR042093-DRR042098 (DRA003820) |
Transcriptome | Ant (Formica pratensis) | LI121693-LI219804 (98 112 entries) | DRR042099-DRR042104 (DRA003820) |
Transcriptome | Ant (Formica pressilabris) | LI219805-LI349988 (130 184 entries) | DRR042105-DRR042110 (DRA003820) |
Transcriptome | Ant (Formica truncorum) | LI349989-LI476587 (126 599 entries) | DRR042111-DRR042116 (DRA003820) |
Transcriptome | Ant (Lasius neglectus) | LI476588-LI563515 (86 928 entries) | DRR042123-DRR042128 (DRA003820) |
Transcriptome | Ant (Lasius turcicus) | LI563516-LI670604 (107 089 entries) | DRR042129-DRR042134 (DRA003820) |
Transcriptome | Ant (Linepithema humile) | LI670605-LI795928 (125 324 entries) | DRR042117-DRR042122 (DRA003820) |
Transcriptome | Ant (Monomorium chinense) | LI795929-LI926639 (130 711 entries) | DRR042135-DRR042140 (DRA003820) |
Transcriptome | Ant (Monomorium pharaonis) | LJ000001-LJ120855 (120 855 entries) | DRR042141-DRR042146 (DRA003820) |
Transcriptome | Ant (Myrmica rubra) | LJ120856-LJ206166 (85 311 entries) | DRR042147-DRR042152 (DRA003820) |
Transcriptome | Ant (Myrmica ruginodis) | LJ206167-LJ284088 (77 922 entries) | DRR042153-DRR042158 (DRA003820) |
Transcriptome | Ant (Myrmica sulcinodis) | LJ284089-LJ356044 (71 956 entries) | DRR042159-DRR042164 (DRA003820) |
Transcriptome | Red fire ant (Solenopsis invicta) monogynous | LJ356045-LJ530869 (174 825 entries) | DRR042165-DRR042170 (DRA003820) |
Transcriptome | Red fire ant (Solenopsis invicta) polygynous | LJ530870-LJ707314 (176 445 entries) | DRR042171-DRR042176 (DRA003820) |
Transcriptome | Tausch's goatgrass (Aegilops tauschii) | Strain AT76: IAAN01000001-IAAN01045723 (45 723 entries) | DRR058959 |
Transcriptome | Tausch's goatgrass (Aegilops tauschii) | Strain KU-2003: IAAO01000001-IAAO01055813 (55 813 entries) | DRR058960 |
Transcriptome | Tausch's goatgrass (Aegilops tauschii) | Strain KU-2025: IAAP01000001-IAAP01033680 (33 680 entries) | DRR058961 |
Transcriptome | Tausch's goatgrass (Aegilops tauschii) | Strain KU-2075: IAAQ01000001-IAAQ01065447 (65 447 entries) | DRR058962 |
Transcriptome | Tausch's goatgrass (Aegilops tauschii) | Strain KU-2078: IAAR01000001-IAAR01060884 (60 884 entries) | DRR058963 |
Transcriptome | Tausch's goatgrass (Aegilops tauschii) | Strain KU-2087: IAAS01000001-IAAS01065827 (65 827 entries) | DRR058964 |
Transcriptome | Tausch's goatgrass (Aegilops tauschii) | Strain KU-2093: IAAT01000001-IAAT01053474 (53 474 entries) | DRR058965 |
Transcriptome | Tausch's goatgrass (Aegilops tauschii) | Strain KU-2124: IAAU01000001-IAAU01060479 (60 479 entries) | DRR058966 |
Transcriptome | Tausch's goatgrass (Aegilops tauschii) | Strain KU-2627: IAAV01000001-IAAV01060547 (60 547 entries) | DRR058967 |
Transcriptome | Tausch's goatgrass (Aegilops tauschii) | Strain PI499262: IAAW01000001-IAAW01055848 (55 848 entries) | DRR058968 |
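
The entry counts in Table 1 follow directly from the accession ranges, because accession numbers within a range are consecutive. The sketch below is one way to check a count, assuming an accession is an uppercase-letter prefix followed by digits and that both ends of a range share the same prefix and digit width. The helper names `split_accession` and `count_entries` are hypothetical and do not correspond to any DDBJ software.

```python
import re

# Assumed shape of an accession: uppercase letters followed by digits.
# For WGS/TSA sets the two-digit version is part of the digit block, so both
# ends of a range must carry the same version for the count to be meaningful.
_ACCESSION = re.compile(r"^([A-Z]+)(\d+)$")


def split_accession(accession):
    """Split an accession into its letter prefix and its digit string."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        raise ValueError(f"unrecognized accession: {accession!r}")
    return match.group(1), match.group(2)


def count_entries(accession_range):
    """Count the entries in a 'FIRST-LAST' accession range, inclusive."""
    first, last = accession_range.split("-")
    first_prefix, first_digits = split_accession(first)
    last_prefix, last_digits = split_accession(last)
    if first_prefix != last_prefix or len(first_digits) != len(last_digits):
        raise ValueError("both ends of a range must share prefix and width")
    count = int(last_digits) - int(first_digits) + 1
    if count < 1:
        raise ValueError("range end precedes range start")
    return count


if __name__ == "__main__":
    for text in ("BAOO01000001-BAOO01072909", "DF196826-DF236948"):
        print(text, count_entries(text))
```

For the radish genome rows this yields 72 909 WGS entries and 40 123 scaffold CON entries, matching the counts listed in the table.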
Japanese genotype-phenotype archive
The JGA is a permanent archiving service for genotype and phenotype data of human individuals (11). The JGA accepts data that have been de-identified by submitters. Upon submission, the JGA team archives the original data files in encrypted form in a secure database. As of 1 September 2016, the JGA has archived 57 studies (23.5 TB) of individual-level human datasets submitted by Japanese researchers. Archived studies include ‘development of molecular targeted therapy for small cell lung cancer by comprehensive genome analysis’ (18), ‘transcriptome analysis of adolescents and young adults with Acute Lymphoblastic Leukemia’ (19) and ‘Japanese Alzheimer's disease neuroimaging initiative’ (20). Submission of these studies has been reviewed and approved by the Data Access Committee (DAC) at the NBDC. The summaries of 37 studies are available to the public on both the JGA (https://ddbj.nig.ac.jp/jga/viewer/view/studies) and NBDC (http://humandbs.biosciencedbc.jp/en/data-use/all-researches) websites. To access individual-level data of these public studies, users must submit data access requests to the NBDC (http://humandbs.biosciencedbc.jp/en/data-use). The DAC ensures that the stated research purposes are compatible with participant consent and that the principal investigator and institution will abide by the NBDC guideline and the specific terms and conditions imposed on a given dataset. Once the DAC has granted access, the permitted datasets can be downloaded with a secure software tool. Users are required to establish a secure computing facility for local use of the downloaded data in accordance with the NBDC security guideline.
DDBJ SYSTEM UPDATE
Update registration systems for the DDBJ traditional assembled sequence archives
We provide two systems for data submission to the traditional DDBJ database: (i) the Nucleotide Sequence Submission System (NSSS; 5) and (ii) the Mass Submission System (MSS; 21). The NSSS is an interactive application in which all items are entered via a web-based form (http://www.ddbj.nig.ac.jp/sub/websub-e.html). The MSS is a procedure for sending large data files directly (http://www.ddbj.nig.ac.jp/sub/mss_flow-e.html). Both systems were enhanced to apply the new rules for feature and qualifier usage (see http://www.ddbj.nig.ac.jp/insdc/icm2015-e.html#ft). As Table 1 illustrates, the volume of TSA submissions to the DDBJ has increased dramatically, with individual submissions on the order of 100 000 sequences. We therefore improved the DDBJ accession number assignment system to accept such bulk TSA submissions. Since October 2015, the DDBJ has assigned accession numbers with four-letter prefixes to TSA data submitted to the DDBJ, as is done for WGS data. In November 2015, the DDBJ released TSA data with a four-letter prefix, IAAA (IAAA01000001–IAAA01132843), for the first time (Table 1). See also the anonymous FTP site for TSA data, ftp://ftp.ddbj.nig.ac.jp/ddbj_database/tsa/.
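To make the four-letter-prefix layout concrete, the sketch below splits such an accession into its parts. The pattern used here (four uppercase letters, a two-digit version and a serial number of six or more digits) is an assumption inferred from the examples in Table 1, such as IAAA01132843 and BCNE011421934, not an official specification. The names `SetAccession` and `parse_set_accession` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class SetAccession(NamedTuple):
    """Parts of a four-letter-prefix accession (hypothetical helper type)."""
    prefix: str   # four-letter project prefix, e.g. "IAAA"
    version: int  # two-digit set version, e.g. 1 for "01"
    serial: int   # running number of the sequence within the set


# Assumed layout, inferred from the examples in Table 1.
_SET_PATTERN = re.compile(r"^([A-Z]{4})(\d{2})(\d{6,})$")


def parse_set_accession(accession: str) -> Optional[SetAccession]:
    """Return the parsed parts, or None if the accession has another layout."""
    match = _SET_PATTERN.match(accession)
    if match is None:
        return None
    return SetAccession(match.group(1), int(match.group(2)), int(match.group(3)))


if __name__ == "__main__":
    print(parse_set_accession("IAAA01132843"))
    print(parse_set_accession("LH381539"))  # two-letter prefix: not matched
```

Earlier TSA releases with two-letter prefixes, such as LH381539 in Table 1, do not match this pattern, so the function returns `None` for them.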
Sequence analytical services
The NIG supercomputer as a sequence analytical platform
The DDBJ Center operates the NIG supercomputer, which specializes in the analysis of large-scale sequence data. The NIG supercomputer offers computational infrastructure for the construction of DDBJ databases and analysis services, and provides researchers with a large-scale data analysis and supercomputing environment. It is currently composed of two computer systems: (i) the Phase 1 system, which was introduced in 2012, and (ii) the Phase 2 system, which went into production in 2014. The Phase 1 system consists of calculation nodes for general-purpose tasks (352 thin nodes, each with 64 GB of memory; Intel Xeon E5-2670, 5632 cores, 117.14 Tflops total peak CPU performance) and for memory-intensive tasks such as de novo assembly of sequencing reads: two medium nodes, each with 2 TB of memory (HP DL980G7; Intel Xeon E7-4870, 160 cores, 1.22 Tflops in total), and one fat node with 10 TB of memory (SGI UV1000; Intel Xeon E7-8837, 768 cores, 8.17 Tflops). Of the general-purpose thin nodes, 64 are equipped with an NVIDIA Tesla M2090 GPGPU. The Phase 2 system incorporates 202 thin nodes, each with 64 GB of memory (Intel Xeon E5-2680v2, 4040 cores, 90 Tflops in total), and eight medium nodes (identical to those of Phase 1).
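The core counts and peak-performance figures for the thin nodes can be related by simple arithmetic: theoretical peak is the number of cores times the clock frequency times the floating-point operations per cycle. The sketch below illustrates this under stated assumptions that do not come from the article: 16 cores per Phase 1 thin node and 20 per Phase 2 thin node (derived by dividing the quoted core totals by the node counts), base clocks of 2.6 GHz (E5-2670) and 2.8 GHz (E5-2680v2), and 8 double-precision operations per cycle per core with AVX.

```python
def total_cores(nodes, cores_per_node):
    """Total number of CPU cores across identical nodes."""
    return nodes * cores_per_node


def peak_tflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in Tflops: cores x GHz x flops/cycle gives Gflops."""
    return cores * clock_ghz * flops_per_cycle / 1000.0


# Node counts are taken from the text; cores per node, clock frequencies and
# flops per cycle are assumptions made for this illustration.
phase1_cores = total_cores(nodes=352, cores_per_node=16)
phase1_peak = peak_tflops(phase1_cores, clock_ghz=2.6, flops_per_cycle=8)

phase2_cores = total_cores(nodes=202, cores_per_node=20)
phase2_peak = peak_tflops(phase2_cores, clock_ghz=2.8, flops_per_cycle=8)

if __name__ == "__main__":
    print(f"Phase 1 thin nodes: {phase1_cores} cores, {phase1_peak:.2f} Tflops")
    print(f"Phase 2 thin nodes: {phase2_cores} cores, {phase2_peak:.2f} Tflops")
```

With these assumed values the sketch gives 5632 and 4040 cores, and peaks of roughly 117.1 and 90.5 Tflops, in line with the figures quoted for the thin nodes.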
The calculation nodes in each system are interconnected with InfiniBand (QDR in Phase 1 and FDR in Phase 2) in a full-bisection fat-tree topology. To support the massive I/O of big-data analysis, the NIG supercomputer is equipped with a 7 PB Lustre parallel distributed file system (http://www.lustre.org). A 5.5 PB MAID (massive array of idle disks) system is used for archiving Sequence Read Archive data, including the DRA and JGA (11). The number of NIG supercomputer users increased from 2016 on 1 June 2015 to 2532 on 31 May 2016. The criteria for issuing a user login account are shown on the web page (https://sc.ddbj.nig.ac.jp/index.php/en/criteria-for-issuing-user-login-accounts).
Supported analytical tools and public datasets in the NIG Supercomputer
Many popular bioinformatics tools and libraries are installed on the system for the convenience of NIG supercomputer users, as listed on the NIG supercomputer home page (http://sc.ddbj.nig.ac.jp/index.php/ja-avail-oss). To help users reproduce previously executed analysis workflows, different versions of the analytical tools are installed in separate directory paths. Datasets pre-installed on the NIG supercomputer for those tools are listed on the web page (http://sc.ddbj.nig.ac.jp/index.php/ja-availavle-dbs).
Web BLAST, ClustalW, VecScreen, ARSA and Web API for Bioinformatics (WABI)
The DDBJ Center provides the Web BLAST (22), ClustalW (23,24) and VecScreen (http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec) services, which receive requests from web interfaces. The DDBJ Center also provides the Web API for Bioinformatics (WABI) (25–27), a RESTful Web API service for large-scale data analysis that processes requests from computer programs. The WABI service includes BLAST, VecScreen, ClustalW, MAFFT (28,29), the getentry data retrieval system (retrieval by accession number) and the ARSA keyword search system for the DDBJ flat files (14). The WABI service recently incorporated new features of MAFFT version 7 (the --add, --addfragments, --addprofile and --addfull options), which allow unaligned sequences to be added to an existing alignment (29).
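For readers unfamiliar with the MAFFT options named above, the sketch below assembles the equivalent command line for a locally installed `mafft`, in which the option takes the file of new sequences and the existing alignment is given as the final argument. It only builds the argument list and does not run anything. It also does not describe the request format of the WABI service, which is not specified here; the file names and the function `build_mafft_add_command` are hypothetical.

```python
import shlex

# The four MAFFT version 7 options named in the text, keyed by a short mode name.
ADD_MODES = {
    "add": "--add",
    "addfragments": "--addfragments",
    "addprofile": "--addprofile",
    "addfull": "--addfull",
}


def build_mafft_add_command(new_sequences, existing_alignment, mode="add"):
    """Build a local mafft command that adds sequences to an alignment.

    MAFFT writes the resulting alignment to standard output, so a caller
    would normally redirect or capture it.
    """
    if mode not in ADD_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["mafft", ADD_MODES[mode], new_sequences, existing_alignment]


# Hypothetical file names, used only for illustration.
command = build_mafft_add_command(
    "new_reads.fasta", "existing_alignment.fasta", mode="addfragments"
)

if __name__ == "__main__":
    print(shlex.join(command))
```

Keeping the command as a list, rather than a single string, means it could be passed to `subprocess.run` without shell quoting concerns.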
TXSearch to retrieve NCBI taxonomy index
TXSearch (http://ddbj.nig.ac.jp/tx_search/) is the NCBI Taxonomy browsing system at the DDBJ. It allows data submitters to find the authentic scientific names used in the INSDC for the purpose of vocabulary control. Because the NIG supercomputer was replaced in 2012, we re-implemented most of our services on open source middleware to accommodate the new system. The TXSearch system is built on the Apache Solr full-text search system and MySQL, and a RESTful Web API service is also provided. The data in TXSearch are updated daily by downloading the NCBI Taxonomy database (30) from the NCBI FTP site (ftp://ftp.ncbi.nih.gov/pub/taxonomy). Currently, viral records in TXSearch contain links to records of Virus Taxonomy: 2015 Release (31), released by the International Committee on Taxonomy of Viruses (ICTV, http://www.ictvonline.org/), as shown in Figure 1.
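As an illustration of the kind of data such a daily update works from, the sketch below extracts scientific names from lines in the `names.dmp` format of the NCBI taxonomy dump, where fields are separated by a tab, a vertical bar and a tab, and each line ends with a tab and a vertical bar. This is not the TXSearch loader, whose implementation is not described here. The sample lines are illustrative and the function names are hypothetical.

```python
def parse_names_line(line):
    """Parse one names.dmp line into (tax_id, name, unique_name, name_class)."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-row marker
    tax_id, name, unique_name, name_class = line.split("\t|\t")
    return int(tax_id), name, unique_name, name_class


def scientific_names(lines):
    """Map each taxonomy ID to its scientific name, ignoring other name classes."""
    result = {}
    for line in lines:
        if not line.strip():
            continue
        tax_id, name, _unique_name, name_class = parse_names_line(line)
        if name_class == "scientific name":
            result[tax_id] = name
    return result


# Illustrative sample lines in names.dmp format (not a complete dump).
SAMPLE_LINES = [
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
    "4530\t|\tOryza sativa\t|\t\t|\tscientific name\t|\n",
]

names = scientific_names(SAMPLE_LINES)

if __name__ == "__main__":
    for tax_id, name in names.items():
        print(tax_id, name)
```

Restricting the index to the "scientific name" class is one simple way to support vocabulary control, since common names and synonyms are excluded.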
A virtual machine image for the DDBJ Pipeline
The DDBJ Read Annotation Pipeline (DDBJ Pipeline, http://p.ddbj.nig.ac.jp) is a high-throughput web annotation system for next-generation sequencing reads that runs on the NIG supercomputer (32). The pipeline's basic component performs reference genome mapping and de novo assembly, and the subsequent component performs downstream analyses, such as structural and functional annotation, through a Galaxy interface (33). During 2016, the subsequent component of the DDBJ Pipeline was moved from a web service on the NIG supercomputer to a software distribution service for both a local Oracle VirtualBox environment and the Pitagora-Galaxy community web server (http://www.pitagora-galaxy.org/) organized by Dr Ryota Yamanaka. Users are required to operate the virtual machine in their own local environment or in a flexible cloud computing environment. As a result, the computational resources that the NIG supercomputer devotes to the DDBJ Pipeline service are now concentrated on the basic component alone, which often requires heavy memory usage and comprises time-intensive tasks.
Semantic Representation of DDBJ Annotated Sequence Records
To improve the reusability of the sequence annotation data, we have developed a system that converts DDBJ records into Resource Description Framework (RDF) in collaboration with DBCLS (11,34,35). To represent the DDBJ nucleotide sequence annotation semantically, we have developed a DDBJ taxonomy ontology for describing the taxonomic information of the source organism, and a DDBJ annotated nucleotide sequence ontology for describing metadata such as submitters and references as well as biological feature annotations (http://ddbj.nig.ac.jp/ontologies/). Besides semantic information based on those ontologies, the RDF dataset contains semantic relations expressed using the FALDO ontology (36), the Semanticscience Integrated Ontology (37), the Sequence Ontology (38) and the Relation Ontology (39) to represent all the information in the existing DDBJ entries and INSDC resources. The RDF version of the DDBJ annotated sequence records is available at the DDBJ FTP site (ftp://ftp.ddbj.nig.ac.jp/rdf/).
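To give a feel for how a feature location can be expressed with FALDO, the sketch below emits a small Turtle fragment for a single feature. It is an illustration only and does not reproduce the DDBJ RDF schema: the `ex:` namespace, the feature identifier and the coordinates are hypothetical, and only the FALDO terms (`faldo:location`, `faldo:Region`, `faldo:begin`, `faldo:end`, `faldo:ExactPosition`, `faldo:position`, `faldo:reference`) are taken from that ontology.

```python
FALDO = "http://biohackathon.org/resource/faldo#"
EX = "http://example.org/ddbj-sketch/"  # hypothetical namespace, not a DDBJ URI


def feature_location_turtle(entry_id, feature_id, begin, end):
    """Return a Turtle fragment locating one feature on one sequence entry."""
    feature = f"ex:{entry_id}_{feature_id}"
    region = f"{feature}_region"
    begin_node = f"{feature}_begin"
    end_node = f"{feature}_end"
    lines = [
        f"@prefix faldo: <{FALDO}> .",
        f"@prefix ex: <{EX}> .",
        "",
        f"{feature} faldo:location {region} .",
        f"{region} a faldo:Region ;",
        f"    faldo:begin {begin_node} ;",
        f"    faldo:end {end_node} .",
        f"{begin_node} a faldo:ExactPosition ;",
        f"    faldo:position {begin} ;",
        f"    faldo:reference ex:{entry_id} .",
        f"{end_node} a faldo:ExactPosition ;",
        f"    faldo:position {end} ;",
        f"    faldo:reference ex:{entry_id} .",
    ]
    return "\n".join(lines)


# AP014957 is an entry listed in Table 1; the feature name and the
# coordinates below are invented for this example.
turtle = feature_location_turtle("AP014957", "cds1", begin=1000, end=2500)

if __name__ == "__main__":
    print(turtle)
```

Modeling the begin and end as separate position resources, each pointing back to the reference sequence, is the FALDO convention that lets a location be queried independently of the feature type.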
FUTURE DIRECTION
In the present report, we introduced updates of the DDBJ datasets, data submissions, and analytical systems during the past year. We plan to develop a unified submission portal for all database systems, along with a semi-automatic annotation and curation system. The key technology is RDF, and the effort to translate DDBJ sequence records into RDF is under way.
The current focuses at DDBJ Center are as follows: (i) improved network security and data management for JGA; (ii) virtualization of computing infrastructure for better development and analysis on the HPC environment and (iii) restructuring of data processes for updating INSDC databases. In addition, to enhance research productivity on the NIG supercomputer, we are constructing an experimental system to enable not only the operation of HPC oriented software systems (MPI, grid engine) and big-data oriented systems (Spark, YARN) but also the operation of Linux containers (Docker etc.) which allow users to build, re-distribute, and run a set of analysis programs in various kinds of calculation environments.
Acknowledgments
We gratefully acknowledge the support of Koji Watanabe, Chiharu Kawagoe and all members of the DDBJ Center for their assistance in data collection, annotation, release and software development. We are thankful to Mari T. Minowa, Minae Kawashima, Kazunori Miyazaki and Nobutaka Mitsuhashi of NBDC as collaborators on the JGA project; Yasuhiro Tanizawa, Takako Mochizuki and Shota Morizaki for the DDBJ Pipeline updates; Tazro Ohta of DBCLS and Ryota Yamanaka of Oracle Corporation Japan for the virtual machine collaboration; Jerven Bolleman of UniProt; MicrobeDB.jp project members; Norio Kobayashi and Hiroshi Masuya of RIKEN BRC Institute; and domestic BioHackathon organizers of DBCLS for development of RDF applications. We would also like to thank Kento Aida, Shigetoshi Yokoyama and Nobuyoshi Masatani of the National Institute of Informatics, and Shinichi Miura and Satoshi Matsuoka of Tokyo Institute of Technology, for the computational infrastructure of the NIG supercomputer.
FUNDING
Ministry of Education, Culture, Sports, Science and Technology of Japan (MEXT) via a management expense grant for Inter-University Research Institute Corporation (to DDBJ); Grant-in-Aid for Scientific Research on Innovative Areas (Genome Science) (to DDBJ, DRA); JGA has been supported by NBDC of Japan Science and Technology Agency (JST). Funding for open access charge: MEXT management expense grant to DDBJ.
Conflict of interest statement. None declared.
REFERENCES
- 1. Mashima J., Kodama Y., Kosuge T., Fujisawa T., Katayama T., Nagasaki H., Okuda Y., Kaminuma E., Ogasawara O., Okubo K., et al. DNA data bank of Japan (DDBJ) progress report. Nucleic Acids Res. 2016;44:D51–D57. doi: 10.1093/nar/gkv1105.
- 2. Clark K., Karsch-Mizrachi I., Lipman D.J., Ostell J., Sayers E.W. GenBank. Nucleic Acids Res. 2016;44:D67–D72. doi: 10.1093/nar/gkv1276.
- 3. Gibson R., Alako B., Amid C., Cerdeño-Tárraga A., Cleland I., Goodgame N., Hoopen P.T., Jayathilaka S., Kay S., Leinonen R., et al. Biocuration of functional annotation at the European nucleotide archive. Nucleic Acids Res. 2016;44:D58–D66. doi: 10.1093/nar/gkv1311.
- 4. Cochrane G., Karsch-Mizrachi I., Takagi T. The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res. 2016;44:D48–D50. doi: 10.1093/nar/gkv1323.
- 5. Kosuge T., Mashima J., Kodama Y., Fujisawa T., Kaminuma E., Ogasawara O., Okubo K., Takagi T., Nakamura Y. DDBJ progress report: a new submission system for leading to a correct annotation. Nucleic Acids Res. 2014;42:D44–D49. doi: 10.1093/nar/gkt1066.
- 6. Kodama Y., Shumway M., Leinonen R. The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 2012;40:D54–D56. doi: 10.1093/nar/gkr854.
- 7. Federhen S., Clark K., Barrett T., Parkinson H., Ostell J., Kodama Y., Mashima J., Nakamura Y., Cochrane G., Karsch-Mizrachi I. Toward richer metadata for microbial sequences: replacing strain-level NCBI taxonomy taxids with BioProject, BioSample and Assembly records. Stand. Genomic Sci. 2014;9:1275–1277. doi: 10.4056/sigs.4851102.
- 8. Brunak S., Danchin A., Hattori M., Nakamura H., Shinozaki K., Matise T., Preuss D. Nucleotide sequence database policies. Science. 2002;298:1333. doi: 10.1126/science.298.5597.1333b.
- 9. Salzberg S.L. Reminder to deposit DNA sequences. Nature. 2016;533:179. doi: 10.1038/533179a.
- 10. Blaxter M., Danchin A., Savakis B., Fukami-kobayashi K., Kurokawa K., Sugano S., Roberts R.J., Salzberg S.L., Wu C. Reminder to deposit DNA sequences. Science. 2016;352:780. doi: 10.1126/science.aaf7672.
- 11. Kodama Y., Mashima J., Kosuge T., Katayama T., Fujisawa T., Kaminuma E., Ogasawara O., Okubo K., Takagi T., Nakamura Y. The DDBJ Japanese Genotype-phenotype Archive for genetic and phenotypic human data. Nucleic Acids Res. 2015;43:D18–D22. doi: 10.1093/nar/gku1120.
- 12. Tryka K.A., Hao L., Sturcke A., Jin Y., Wang Z.Y., Ziyabari L., Lee M., Popova N., Sharopova N., Kimura M., et al. NCBI's database of genotypes and phenotypes: dbGaP. Nucleic Acids Res. 2014;42:D975–D979. doi: 10.1093/nar/gkt1211.
- 13. Lappalainen I., Almeida-King J., Kumanduri V., Senf A., Spalding J.D., ur-Rehman S., Saunders G., Kandasamy J., Caccamo M., Leinonen R., et al. The European Genome-phenome Archive of human data consented for biomedical research. Nat. Genet. 2015;47:692–695. doi: 10.1038/ng.3312.
- 14. Ogasawara O., Mashima J., Kodama Y., Kaminuma E., Nakamura Y., Okubo K., Takagi T. DDBJ new system and service refactoring. Nucleic Acids Res. 2013;41:D25–D29. doi: 10.1093/nar/gks1152.
- 15. Cochrane G., Bates K., Apweiler R., Tateno Y., Mashima J., Kosuge T., Karsch-Mizrachi I., Schafer S., Fetchko M. Evidence standards in experimental and inferential INSDC Third Party Annotation data. OMICS. 2006;10:105–113. doi: 10.1089/omi.2006.10.105.
- 16. Kawahara Y., de la Bastide M., Hamilton J.P., Kanamori H., McCombie W.R., Ouyang S., Schwartz D.C., Tanaka T., Wu J., Zhou S., et al. Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice. 2013;6:4. doi: 10.1186/1939-8433-6-4.
- 17. Sakai H., Lee S.S., Tanaka T., Numa H., Kim J., Kawahara Y., Wakimoto H., Yang C.C., Iwamoto M., Abe T., et al. Rice Annotation Project Database (RAP-DB): an integrative and interactive database for rice genomics. Plant Cell Physiol. 2013;54:e6. doi: 10.1093/pcp/pcs183.
- 18. Umemura S., Mimaki S., Makinoshima H., Tada S., Ishii G., Ohmatsu H., Niho S., Yoh K., Matsumoto S., Takahashi A., et al. Therapeutic priority of the PI3K/AKT/mTOR pathway in small cell lung cancers as revealed by a comprehensive genomic analysis. J. Thorac. Oncol. 2014;9:1324–1331. doi: 10.1097/JTO.0000000000000250.
- 19. Yasuda T., Tsuzuki S., Kawazu M., Hayakawa F., Kojima S., Ueno T., Imoto N., Kohsaka S., Kunita A., Doi K., et al. Recurrent DUX4 fusions in B cell acute lymphoblastic leukemia of adolescents and young adults. Nat. Genet. 2016;48:569–574. doi: 10.1038/ng.3535.
- 20. Iwatsubo T. Japanese Alzheimer's Disease Neuroimaging Initiative: present status and future. Alzheimers. Dement. 2010;6:297–299. doi: 10.1016/j.jalz.2010.03.011.
- 21. Sugawara H., Miyazaki S., Gojobori T., Tateno Y. DNA Data Bank of Japan dealing with large-scale data submission. Nucleic Acids Res. 1999;27:25–28. doi: 10.1093/nar/27.1.25.
- 22. Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
- 23. Thompson J.D., Higgins D.G., Gibson T.J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680. doi: 10.1093/nar/22.22.4673.
- 24. Larkin M.A., Blackshields G., Brown N.P., Chenna R., McGettigan P.A., McWilliam H., Valentin F., Wallace I.M., Wilm A., Lopez R., et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23:2947–2948. doi: 10.1093/bioinformatics/btm404.
- 25. Sugawara H., Miyazaki S. Biological SOAP servers and web services provided by the public sequence data bank. Nucleic Acids Res. 2003;31:3836–3839. doi: 10.1093/nar/gkg558.
- 26. Miyazaki S., Sugawara H., Ikeo K., Gojobori T., Tateno Y. DDBJ in the stream of various biological data. Nucleic Acids Res. 2004;32:D31–D34. doi: 10.1093/nar/gkh127.
- 27. Kwon Y., Shigemoto Y., Kuwana Y., Sugawara H. Web API for biology with a workflow navigation system. Nucleic Acids Res. 2009;37:W11–W16. doi: 10.1093/nar/gkp300.
- 28. Katoh K., Misawa K., Kuma K., Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002;30:3059–3066. doi: 10.1093/nar/gkf436.
- 29. Katoh K., Standley D.M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 2013;30:772–780. doi: 10.1093/molbev/mst010.
- 30. Federhen S. The NCBI Taxonomy database. Nucleic Acids Res. 2012;40:D136–D143. doi: 10.1093/nar/gkr1178.
- 31. Adams M.J., Lefkowitz E.J., King A.M., Harrach B., Harrison R.L., Knowles N.J., Kropinski A.M., Krupovic M., Kuhn J.H., Mushegian A.R., et al. Ratification vote on taxonomic proposals to the International Committee on Taxonomy of Viruses. Arch. Virol. 2016;161:2921–2949. doi: 10.1007/s00705-016-2977-6.
- 32. Nagasaki H., Mochizuki T., Kodama Y., Saruhashi S., Morizaki S., Sugawara H., Ohyanagi H., Kurata N., Okubo K., Takagi T., et al. DDBJ read annotation pipeline: a cloud computing-based pipeline for high-throughput analysis of next-generation sequencing data. DNA Res. 2013;20:383–390. doi: 10.1093/dnares/dst017.
- 33. Goecks J., Nekrutenko A., Taylor J. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010;11:R86. doi: 10.1186/gb-2010-11-8-r86.
- 34. Katayama T., Wilkinson M.D., Aoki-Kinoshita K.F., Kawashima S., Yamamoto Y., Yamaguchi A., Okamoto S., Kawano S., Kim J.-D., Wang Y., et al. BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains. J. Biomed. Semantics. 2014;5:5. doi: 10.1186/2041-1480-5-5.
- 35. Aoki-Kinoshita K.F., Kinjo A.R., Morita M., Igarashi Y., Chen Y., Shigemoto Y., Fujisawa T., Akune Y., Katoda T., Kokubu A., et al. Implementation of linked data in the life sciences at BioHackathon 2011. J. Biomed. Semantics. 2015;6:3. doi: 10.1186/2041-1480-6-3.
- 36. Bolleman J.T., Mungall C.J., Strozzi F., Baran J., Dumontier M., Bonnal R.P.J., Buels R., Hoehndorf R., Fujisawa T., Katayama T., et al. FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation. J. Biomed. Semantics. 2016;7:39. doi: 10.1186/s13326-016-0067-z.
- 37. Dumontier M., Baker C.J.O., Baran J., Callahan A., Chepelev L., Cruz-Toledo J., Del Rio N.R., Duck G., Furlong L.I., Keath N., et al. The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery. J. Biomed. Semantics. 2014;5:14. doi: 10.1186/2041-1480-5-14.
- 38. Eilbeck K., Lewis S.E., Mungall C.J., Yandell M., Stein L., Durbin R., Ashburner M. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 2005;6:R44. doi: 10.1186/gb-2005-6-5-r44.
- 39. Smith B., Ceusters W., Klagges B., Köhler J., Kumar A., Lomax J., Mungall C., Neuhaus F., Rector A.L., Rosse C. Relations in biomedical ontologies. Genome Biol. 2005;6:R46. doi: 10.1186/gb-2005-6-5-r46.