Author manuscript; available in PMC: 2010 Jul 19.
Published in final edited form as: Curr Protoc Mol Biol. 2009 Oct;CHAPTER:Unit10.25. doi: 10.1002/0471142727.mb1025s88

Table 10.25.1.

Features of Commonly Used Protein Sequence Databases

| Database | Source | Download site | Number of species/taxonomies (a) | Number of entries (a) | Description |
|---|---|---|---|---|---|
| Swiss-Prot | Swiss Institute of Bioinformatics, European Bioinformatics Institute | ftp://ftp.ebi.ac.uk/pub/databases/ | 11,661 | 412,525 | A curated protein sequence database that strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases. |
| TrEMBL | Swiss Institute of Bioinformatics, European Bioinformatics Institute | ftp://ftp.ebi.ac.uk/pub/databases | 191,318 | 7,341,751 | A computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot. |
| NCBInr | National Center for Biotechnology Information | ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz | 64 | 7,031,513 | Nonredundant protein sequence database with entries from GenPept, Swiss-Prot, PIR, PRF, PDB, and RefSeq. |
| IPI | European Bioinformatics Institute | ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ | 7 | 309,242 | Protein sets are made for a limited number of higher eukaryotic species whose genomic sequence has been completely determined but for which a large number of predicted protein sequences are not yet in UniProt. Nonredundant. Annotated splice variants are listed as separate entries. Assembled from Swiss-Prot, TrEMBL, RefSeq, Ensembl, TAIR, H-InvDB, and Vega. |
| RefSeq | National Center for Biotechnology Information | ftp://ftp.ncbi.nih.gov/refseq/release/ | 7,773 | 6,413,124 | A nonredundant collection of richly annotated DNA, RNA, and protein sequences from diverse taxa. The collection includes sequences from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes. Each RefSeq represents a single, naturally occurring molecule from one organism. |
| dbEST | National Center for Biotechnology Information | ftp://ftp.ncbi.nih.gov/blast/db/FASTA/estothers.gz | 1,704 | 60,497,687 | Contains "single-pass" cDNA sequences, or Expressed Sequence Tags, from the EST divisions of GenBank. |

(a) Statistics valid as of 3/10/09.
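Because the entry counts above are a snapshot, a reader may want to recount entries in a freshly downloaded copy. The following is a minimal Python sketch, assuming the downloaded file is a gzip-compressed FASTA file (as the `nr.gz` and `estothers.gz` paths suggest) in which each record begins with a `>` header line. The function names and the toy records are hypothetical and are not part of the original protocol.

```python
import gzip
import io


def count_fasta_entries(handle):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


def count_entries_in_gzip(path):
    """Count FASTA records in a gzip-compressed file at a local path."""
    # Stream the file as text so the whole database is never held in memory.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_fasta_entries(handle)


# Small in-memory example with made-up records (not real database entries).
example = io.StringIO(
    ">seq1 hypothetical protein A\nMKT\nLLV\n"
    ">seq2 hypothetical protein B\nMAA\n"
)
example_count = count_fasta_entries(example)
print(example_count)  # prints 2
```

For a real download, `count_entries_in_gzip("nr.gz")` would be called on the local file. The result will generally differ from the figures in the table, since the databases have changed since the statistics were collected.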
