Table 10.25.1.
Features of Commonly Used Protein Sequence Databases
Database | Source | Download site | Number of species/taxonomies<sup>a</sup> | Number of entries<sup>a</sup> | Description |
---|---|---|---|---|---|
Swiss-Prot | Swiss Institute of Bioinformatics, European Bioinformatics Institute | ftp://ftp.ebi.ac.uk/pub/databases/ | 11,661 | 412,525 | A curated protein sequence database that strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases |
TrEMBL | Swiss Institute of Bioinformatics, European Bioinformatics Institute | ftp://ftp.ebi.ac.uk/pub/databases | 191,318 | 7,341,751 | A computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot |
NCBInr | National Center for Biotechnology Information | ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz | 64 | 7,031,513 | Nonredundant protein sequence database with entries from GenPept, Swiss-Prot, PIR, PRF, PDB, and RefSeq |
IPI | European Bioinformatics Institute | ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ | 7 | 309,242 | Protein sets are made for a limited number of higher eukaryotic species whose genomic sequence has been completely determined but where there are a large number of predicted protein sequences that are not yet in UniProt. Nonredundant. Annotated splice variants are listed as separate entries. Assembled from Swiss-Prot, TrEMBL, RefSeq, Ensembl, TAIR, H-InvDB, and Vega. |
RefSeq | National Center for Biotechnology Information | ftp://ftp.ncbi.nih.gov/refseq/release/ | 7,773 | 6,413,124 | A nonredundant collection of richly annotated DNA, RNA, and protein sequences from diverse taxa. The collection includes sequences from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes. Each RefSeq represents a single, naturally occurring molecule from one organism. |
dbEST | National Center for Biotechnology Information | ftp://ftp.ncbi.nih.gov/blast/db/FASTA/estothers.gz | 1,704 | 60,497,687 | Contains “single-pass” cDNA sequences, or Expressed Sequence Tags, from the EST divisions of GenBank |
<sup>a</sup> Statistics valid as of 3/10/09.
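
Because entry counts change with every database release, it can be useful to recount the entries in a local copy before relying on the figures in the table. The following is a minimal sketch in Python, assuming the database was downloaded as a gzip-compressed FASTA file (as for the nr.gz and estothers.gz files listed in the table) and that each record begins with a header line starting with ">". The function names and the example headers are hypothetical and used only for illustration.

```python
import gzip
import io


def count_fasta_entries(handle):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


def count_entries_in_gzip(path):
    """Open a gzip-compressed FASTA file in text mode and count its records."""
    # errors="replace" keeps the count going if a header holds unexpected bytes.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_fasta_entries(handle)


# Small in-memory example with two made-up records.
example = io.StringIO(
    ">example_1 hypothetical protein\n"
    "MKTAYIAKQR\n"
    ">example_2 hypothetical protein\n"
    "GSHMLEDPVA\n"
)
print(count_fasta_entries(example))  # prints 2
```

For a downloaded file, `count_entries_in_gzip("nr.gz")` would stream the file line by line, so the whole database never has to fit in memory. A count obtained this way reflects the release that was downloaded and need not match the statistics in the table.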