The 20th annual Database Issue of Nucleic Acids Research includes 176 articles, half of which describe new online molecular biology databases, while the other half provide updates on databases previously featured in NAR and other journals. This year’s highlights include two databases of DNA repeat elements; several databases of transcription factors and transcription factor-binding sites; databases on various aspects of protein structure and protein–protein interactions; databases for metagenomic and rRNA sequence analysis; and four databases specifically dedicated to Escherichia coli. The increased emphasis on using genome data to improve human health is reflected in the development of databases of genomic structural variation (NCBI’s dbVar and EBI’s DGVa), the NIH Genetic Testing Registry and several other databases centered on the genetic basis of human disease, potential drugs, their targets and the mechanisms of protein–ligand binding. Two new databases present genomic and RNAseq data for monkeys, providing a wealth of data on our closest relatives for comparative genomics purposes. The NAR online Molecular Biology Database Collection has been updated and currently lists 1512 online databases. The full content of the Database Issue is freely available online on the Nucleic Acids Research website.


This 1300-page virtual volume represents the 20th annual Database Issue of Nucleic Acids Research (NAR). It includes descriptions of 88 new online databases, 77 update articles on databases that have been previously featured in the NAR Database Issue (Table 1) and 11 articles with updates on database resources whose descriptions have been previously published in other journals (Table 2).

Table 1.

New online databases featured in the 2013 NAR Database issue

Database name URL Brief description
APPRIS A system for annotating alternative splice isoforms
BioLiP Biologically relevant ligand–protein interactions
BSRD A repository of bacterial small regulatory RNA
CellLineNavigator Cell line expression profiles by microarray analysis
ChIPBase Transcriptional regulation of lncRNA and microRNA genes from ChIP-Seq data
ChiTaRS Chimeric RNAs of two or more different transcripts
CIL-CCDB Images, videos and animations of various cell types from diverse organisms
CircaDB Circadian gene expression profiles in human and mouse
CloneDB Clones and libraries: sequence data, map positions and distributor information
ClusterMine360 Microbial PKS/NRPS Biosynthesis
Cyanolyase Sequences and motifs of the phycobilin lyase protein family
D2P2 Database of Disordered Protein Predictions
dbVar Structural variation in chromosomes: inversions, translocations, insertions and deletions
dcGO domain-centric Gene Ontology
Dfam Human DNA repeat families
DGA Disease and Gene Annotations database
DIANA-LncBase microRNA targets on long noncoding RNAs
DoBISCUIT Database Of BIoSynthesis clusters CUrated and InTegrated
EBI Enzyme Portal Various kinds of information about enzymes: small-molecule chemistry, biochemical pathways and drug compounds
ECMDB Escherichia coli Metabolome Database
EENdb Engineered endonucleases: zinc finger nucleases and transcription activator-like effector nucleases
eProS Energy profiles of protein structures
Factorbook Human transcription factor-binding data from ChIP-seq
G4LDB G-quadruplex Ligands Database
GDSC Genomics of Drug Sensitivity in Cancer: Sensitivity for anti-cancer drugs in various cell lines
GeneTack Genes with frameshifts in prokaryotic genomes and eukaryotic mRNA sequences
Genome3D Domain structure predictions and 3D models for proteins from model genomes
Glycan Fragment DB Database of glycan 3D structures
H2DB Heritability data with trait-associated genomic loci
HBVdb A knowledge database for the Hepatitis B Virus
HemaExplorer Gene expression profiles in haematopoiesis
HEXEvent Human Exon Splicing Events
HOCOMOCO HOmo sapiens COmprehensive MOdel COllection of hand-curated transcription factor-binding site models
KIDFamMap Kinase-Inhibitor-Disease Family Map
LAMP Library of Apicomplexan Metabolic Pathways
Lncipedia Human lncRNA gene sequences and structures
LncRNADisease Long non-coding RNA-associated diseases
LUCApedia Predicted genome of Last Universal Common Ancestor
meta.MicrobesOnline Comparative genomic tools for metagenome analysis
MetaboLights Metabolomics experiments and associated metadata
MetalPDB Metal-binding sites in macromolecular structures
METscout Spatial organization of metabolic reactions in the mouse
MonarchBase Genome biology of the monarch butterfly Danaus plexippus
NetwoRx Chemogenomic experiments in yeast: connection of drug response to biological pathways, phenotypes, and networks
NCBI Bookshelf Free online books on the NCBI website
NHPRTR Non-human Primate Reference Transcriptome Resource
NIH Genetic Testing Registry Genetic tests and laboratories that perform them
NPACT Naturally occurring Plant-based Anticancer Compound Targets
OikoBase Genome expression database of Oikopleura dioica
OrtholugeDB Microbial orthology resource
OrysPSSP Small secreted proteins from rice
Papillomavirus Episteme A database of Papillomaviridae family of viruses
PGDD Plant Genome Duplication Database
PIECE Plant Intron Exon Comparison and Evolution
PlantRNA tRNAs of plants and algae
PR2 Protist Ribosomal reference database
prePPI Predicted and experimentally determined protein–protein interactions for yeast and human
PTMcode Functional associations between posttranslational modifications within proteins
Quorumpeps A database of quorum-sensing peptides
RhesusBase A Knowledgebase for the Monkey Research Community
RiceFREND Rice Functionally Related gene Expression Network Database
RNApathwaysDB A database of RNA processing pathways
SecReT4 Type IV Secretion system Resource
SEVA Standard European Vector Architecture: a collection of plasmids to analyse complex prokaryotic phenotypes
SIFTS Structure Integration with Function, Taxonomy and Sequences
SINEBase A database of short interspersed elements (SINEs)
SomamiR Somatic mutations that impact microRNA targeting in cancer
Spermatogenesis Online Spermatogenesis-related genes
SpliceAid-F Human splicing factors and their RNA-binding sites
Spliceosome Database Spliceosome genes and proteins, splicing complexes
StreptomeDB Antibiotic, anti-tumour and immunosuppressant drugs produced by Streptomyces spp.
SwissBioisostere Molecular replacements for ligand design
SwissSidechain Non-natural amino acid sidechains for protein engineering
SynSysNet Synapse proteins, their structures and interactions
TCMID Traditional Chinese Medicine Integrated Database
TFClass Human transcription factors classified according to their DNA-binding domains
TissueNet Tissue distribution of protein–protein interactions
TOPPR The Online Protein Processing Resource
TSGene Tumor Suppressor Gene database
UCNEbase Ultraconserved non-coding elements and gene regulatory blocks
UUCD Ubiquitin and ubiquitin-like conjugation database
ValidNESs Validated nuclear export signals-containing proteins
Voronoia4RNA Packing of RNA molecules and complexes
WDDD Worm Developmental Dynamics Database
WholeCellKB Pathway and genome database of Mycoplasma genitalium for whole-cell modelling
WormQTL Natural variation data in Caenorhabditis spp.
YM500 smRNA-seq database for miRNA research
ZInC Zebrafish Insertions Collection

Table 2.

Database updates new for the NAR Database issue

Database name URL Previous article Brief description
2P2Idb 2010 Structural data on protein–protein interactions and their inhibitors
Allen Brain Atlas 2009 Gene expression and neuroanatomical data on human and mouse brain
BioGPS 2009 Gene annotation portal and a resource on gene and protein function
DARNED 2010 Database of RNA Editing
DoriC 2007 Replication origin (oriC) regions in bacterial and archaeal genomes
FlyAtlas 2007 Drosophila gene expression atlas
GenColors 2005 Genome annotation and comparison database for small genomes
Genomicus 2010 Syntenic relationships between eukaryote genomes
InnateDB 2008 A database of mammalian innate immune response
MicroScope 2009 Microbial genome annotation and analysis platform
NPIDB 2007 Nucleic acids–protein interaction database

At this point it might be instructive to look back at the origin and evolution of the NAR Database Issue. Its history began with two supplementary issues that were published in NAR in April 1991 and May 1992 and consisted of 18 and 19 articles, respectively. These articles offered descriptions of several nucleotide sequence databases, such as GenBank, the EMBL Data Library, compilations of small RNA, tRNA, and 5S, 16S, and 23S rRNA sequences (including the Ribosomal Database Project), DNA sequences from Escherichia coli and a human genome database (GDB). Those first issues also included descriptions of several protein databases, such as Swiss-Prot, PIR, Prosite, Restriction Enzyme Database (REBASE), Transcription Factors Database (TFD) and Histone database. There was also a medical genetics database, Haemophilia B, listing point mutations and indels in the coagulation factor IX (F9) gene that cause this blood clotting disorder, which has affected the royal families of several European countries.

The next issue, published on July 1, 1993, was the first one formally labelled as the Database Issue. It consisted of 24 articles, which added databases of RNA and protein structure and the Enzyme database. It was followed by NAR Database Issues in September 1994, then in January 1996, and each January after that.

In the past 20 years, the Database Issue has gradually grown in size before stabilizing at the level of ∼180 articles. However, despite the almost 10-fold increase in the number of published articles, the key topics of the current issue remain largely the same as 20 years ago. This issue again features articles from GenBank and the European Nucleotide Archive (formerly the EMBL Data Library), which, together with the DNA Data Bank of Japan, form the International Nucleotide Sequence Database collaboration, INSDC (1–4). Just as 20 years ago, there are updates from Swiss-Prot and PIR (now combined into UniProt) and Prosite (5,6).

Continuing the tradition of featuring well-curated databases of RNA sequences, this issue includes an update on SILVA, a widely used comprehensive database of bacterial, archaeal and eukaryotic 16S/18S and 23S/28S rRNA sequences (7), and a description of Protist Ribosomal Reference database (PR2), a new database that catalogs small subunit rRNA sequences from unicellular eukaryotes (8). An update on the Ribosomal Database Project, a constant feature of the NAR Database Issue since 1991 (9), was last published in 2009 (10). Other RNA databases in this issue include an update on Rfam (11), the universally acclaimed database of RNA families, as well as several databases on long non-coding RNA, microRNA and their targets. An update of Modomics, a database on RNA modification, is now supplemented by RNApathwaysDB, a database of RNA maturation and decay pathways developed by the same group (12,13).

As before, this issue presents several transcription factor (TF) databases. Two of them cover TFs themselves: TFClass offers a classification of human TFs, while NPIDB presents structural information on DNA–protein and RNA–protein complexes (14,15). Several other databases collect information on the TF-binding sites. These include Factorbook, a database of TF-binding data from the ENCODE project; HOCOMOCO, a collection of human TF-binding sites; CTCFBSDB, a database of CCCTC-binding factor (CTCF)-binding sites; RegulonDB, a database of transcriptional regulation in E. coli; and SwissRegulon, a database of regulatory sites in human, mouse and yeast genomes and in model bacteria (16–20).

The structural databases featured in this issue all show a trend towards better integration and cross-referencing tools. This applies both to the updates of well-known databases, such as the RCSB Protein Data Bank (PDB), CATH and PDBTM, and to such databases as EBI’s SIFTS, a joint effort of UniProt and PDBe to provide a residue-level mapping of their entries and supplement it with annotation from other public databases; Genome3D, a recent collaborative project aiming to provide structural annotation from CATH and SCOP to the genomic sequences; and dcGO, which develops domain-centric ontologies to link protein domains with functions, phenotypes and diseases (21–23).

Likewise, with E. coli remaining the workhorse of molecular biology, this issue includes update articles on the EcoGene (the first one since 2000), EcoCyc and RegulonDB databases, as well as a description of the newly developed E. coli Metabolome Database (20,24–26).


As discussed earlier (27), the original GDB did not survive the influx of new data and multiple changes of ownership. Nevertheless, we now have a wide variety of databases that cover different aspects of the human genome and the genomes of model organisms. This issue features annual updates from the Ensembl and ENCODE projects and from the UCSC Genome Browser and the Japanese H-InvDB database (28–31). The model organism databases are represented by the updates to FlyBase, Mouse Genome database, Xenbase and ZFIN (32–35).

Two new databases, RhesusBase and NHPRTR, present extensive genome and RNAseq data for non-human primates, including great apes, old world monkeys, new world monkeys and prosimians (36,37). These data could go a long way towards establishing monkeys as model organisms for comparative genomics studies. One more database is dedicated to a more distant relative of humans, the urochordate Oikopleura dioica (38).

A potentially important development is the construction of two new databases of repetitive DNA elements, Dfam and SINEBase (39,40). Along with the industry standard Repbase Update (41,42) and the monthly RepBase Reports, these databases promise to contribute to a better understanding of eukaryotic repeat elements.

With the abundance of databases providing valuable tools for genome analysis, there is a clear trend towards bringing genomics ‘from the bench to the bedside’, i.e. using genomic data for a better understanding and, hopefully, better treatment of human disease. A number of projects, including ClinSeq, DDD and UK10K, are working towards these goals, and several databases featured in this issue represent important steps in this direction. Last year’s issue introduced the GWASdb database of human genetic variants identified by genome-wide association studies (43). GWAS Central, established in 2007 as HGVbaseG2P (44), has been revamped and now includes data from over 1000 studies. Now, a joint article from NCBI and EBI describes their databases of genomic structural variation, dbVar and DGVa (45). These databases cover diverse variation data including inversions, insertions and translocations that are >50 bp in length. NCBI is also developing ClinVar, a database of relationships between human gene variation and the observed health status (46). The task of streamlining the genetic tests that provide such information is taken up by the recently created NIH Genetic Testing Registry, a database of genetic tests and laboratories that perform them, with detailed information about what exactly is measured in each test and its analytic and clinical validity (47).

The impact of genomic data on developing targeted approaches for fighting disease is particularly evident in the case of cancer. This issue features updates from three great databases, the UCSC Cancer Genome Browser (48), the Atlas of Genetics and Cytogenetics in Oncology and Haematology (49) and the TP53 website [(50), the first update of the database on tumor factor p53 mutations since 1997]. In addition, there are two new databases dedicated to studying cancer at the level of specific cell lines. The CellLineNavigator database provides gene expression profiles of different cancer cell lines in different pathological states (51), whereas the Genomics of Drug Sensitivity in Cancer (GDSC) collects the results of high-throughput studies examining the sensitivity to anti-cancer drugs in various cell lines (52).


During the past 20 years, all databases featured in the NAR Database Issues were added to the NAR online Molecular Biology Database Collection. With an annual attrition rate of <5%, this Collection has been steadily growing and, in 2012, exceeded 1400 database entries (53). It was clear that the list was due for a serious clean-up, and one of the authors (XMFS) devised and set in motion a semi-automated procedure to identify obsolete and non-responsive websites. Remarkably, >90% of the databases listed in last year’s release of the online Collection were found to be functional. Corresponding authors of close to a hundred non-responsive resources were contacted, and 44 websites (∼3.2% of the total) were approved for deletion. About 100 entries in the Collection were updated with corrected URLs, summaries highlighting recent developments or other changes in the deposited data.
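
The article does not describe how the semi-automated clean-up was implemented. As an illustration only, a first automated pass of this kind might look like the Python sketch below, which probes each listed URL and sorts the entries into those that respond normally, those that return an error and those that do not respond at all. The function names (fetch_status, triage), the category labels and the 10-second timeout are assumptions made for this example and are not taken from the procedure used for the Collection; the manual step (contacting corresponding authors before any deletion) is deliberately left outside the code.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if no response was obtained.

    Hypothetical helper: sends a single HEAD request using the standard library.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 500.
        return error.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout or malformed URL.
        return None


def triage(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[int]] = fetch_status,
) -> Dict[str, str]:
    """Label each URL as functional, needing manual review, or non-responsive.

    The labels are illustrative; flagged entries are meant for human follow-up,
    not for automatic deletion.
    """
    report: Dict[str, str] = {}
    for url in urls:
        status = fetch(url)
        if status is None:
            report[url] = "non-responsive"
        elif 200 <= status < 400:
            report[url] = "functional"
        else:
            report[url] = "needs manual review"
    return report
```

Passing the fetch function as a parameter keeps the classification logic testable without network access: a dictionary lookup such as `{"http://a.example": 200}.get` can stand in for fetch_status. Some servers reject HEAD requests, so in practice a fallback GET request and repeated checks over several days would probably be needed before an entry is flagged.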

Although deletion of 40 databases was well within the average drop-off rate and was hardly surprising, further analysis revealed that most of these resources were not lost. Instead, in the normal course of database evolution, they have been integrated into larger projects. For example, a couple of segmental duplications databases were merged into the Database of Genomic Variants (54), NAR Database Collection entry no. 655, while the NCBI’s Cancer Chromosomes database has been merged into dbVar [described in detail in this issue, (45)]. Further, improved annotation of the human genome made redundant a number of resources that covered specific areas of the genome (e.g. the IXDB with its physical maps of human chromosome X).

In one instance, the ExDom database of exon–intron structures of genes in seven eukaryotic genomes (55) had to be removed from the Collection, as it has taken the commercial route and does not provide a free version anymore, although the author’s company offered a discounted version for academic users. Unfortunately, the tightening budgets (56) might force other databases to follow the same path.

In total, the NAR online Molecular Biology Database Collection now includes 1512 databases sorted into 14 categories and 41 subcategories. Authors who wish to have databases that were published elsewhere included in the Collection are welcome to contact XMFS directly.


