Table 1.
Resource and URL | Description | Primary institutions |
---|---|---|
RefSeq www.ncbi.nlm.nih.gov/refseq | Enormous integrated database of genome sequences, transcripts and proteins, covering all domains of life. Gene annotation is primarily based on the in-house computational Gnomon pipeline, while models for key species such as human have been subjected to extensive manual curation. | National Center for Biotechnology Information (NCBI) |
GENCODE www.gencodegenes.org | A multi-institute project providing gene annotation for human and mouse, initially as part of the larger ENCODE project. The genebuilds are a merge of manually annotated models produced by the HAVANA group with computational models generated by Ensembl. Further experimental and in silico validation for models is provided by other groups. | Wellcome Trust Sanger Institute; European Bioinformatics Institute; University of Lausanne; Centre de Regulacio Genomica; University of California, Santa Cruz; Massachusetts Institute of Technology; Yale University; Spanish National Cancer Research Centre. |
Ensembl www.ensembl.org | Multifaceted genome annotation resource, providing genebuilds alongside other annotations, such as regulatory and disease data. It also provides the Ensembl genome browser for integrated visualization. Gene annotation is based on the in-house Ensembl analysis pipeline. | European Bioinformatics Institute (EBI) |
UCSC Genome Browser https://genome.ucsc.edu/ | Online tool supporting the visualization of genome annotations for numerous vertebrate and invertebrate species. Includes genebuilds from RefSeq, GENCODE and Ensembl alongside other gene annotations such as AUGUSTUS, CCDS and LRG. Certain groups have provided access to their own RNA-seq model collections as ‘Track Data hubs’. | University of California, Santa Cruz (UCSC) |
WormBase www.wormbase.org | Database providing biological information, including genes and genome sequence, for the nematode Caenorhabditis elegans alongside other nematode species. While all C. elegans gene models were initially created computationally, each has now been subject to manual curation. Gene annotations for most other nematodes are generated computationally by the MAKER2 pipeline. | European Bioinformatics Institute; Wellcome Trust Sanger Institute; Ontario Institute for Cancer Research; Washington University, St. Louis; California Institute of Technology. |
FlyBase www.flybase.org | Central repository for genetics information relating to the insect family Drosophilidae, including a browser for gene annotations. Effectively all gene annotations have now been manually curated. | Harvard University; Indiana University; University of Cambridge. |
The Arabidopsis Information Resource (TAIR) www.arabidopsis.org | Database of genetic and molecular data for the model plant Arabidopsis thaliana, including gene annotation. Models were initially produced by the Arabidopsis Genome Initiative, then improved by The Institute for Genomic Research, before being further improved and maintained by TAIR. The models have been subject to extensive manual curation, and community annotation is now facilitated via Web Apollo. | Phoenix Bioinformatics |
UniProtKB www.uniprot.org | A unified protein repository incorporating the Swiss-Prot and TrEMBL databases of protein sequences. Swiss-Prot is manually annotated by expert curators (based on literature and manual gene curation), whereas TrEMBL contains computationally analyzed entries largely extracted from computationally derived transcript models. | European Bioinformatics Institute; Swiss Institute of Bioinformatics; the Protein Information Resource. |
Roadmap Epigenomics Project www.roadmapepigenomics.org | Multi-institute collaboration developing a resource for the presentation and processing of experimentally derived human epigenomics data. It aims to generate reference epigenomes across a large variety of cell types. It includes data on gene expression, histone modification, DNA methylation and chromatin accessibility. | The National Institutes of Health Epigenomics Mapping Consortium |
The ENCODE encyclopedia https://encodeproject.org/data/annotations/ | Computational analysis pipeline being developed by the multi-institute ENCODE project to summarize the findings of experimental datasets across the genome sequence, including RNA-seq, Hi-C, ChIP-seq and histone marks (and incorporating data from the Roadmap Epigenomics Project). For example, it can help users extrapolate whether a given region looks like an enhancer. | The ENCODE consortium |
Functional ANnoTation Of The Mammalian genome (FANTOM) http://fantom.gsc.riken.jp/ | International research consortium seeking to obtain further knowledge of the human and mouse genomes and transcriptomes. Since 2000, the project has shifted its focus from cDNA annotation to transcription start and promoter analysis, and on to the description of lncRNAs. | Coordinated by RIKEN Yokohama. |
This table is an entry point for exploring eukaryotic annotation resources in more detail. It focuses on resources discussed in the main text, and is not intended to be comprehensive; the complete list of projects and groups that have contributed to gene annotation in the genome-sequencing era would be exceptionally large. Furthermore, it has not been possible to list individual groups contributing to the FANTOM and ENCODE projects due to space limitations.
Abbreviations: CCDS, consensus coding sequence; ChIP-seq, chromatin immunoprecipitation followed by sequencing; ENCODE, Encyclopedia of DNA Elements; HAVANA, Human and Vertebrate Analysis and Annotation; LRG, Locus Reference Genomic; lncRNAs, long non-coding RNAs; RNA-seq, RNA sequencing.