Nucleic Acids Research. 1996 Jan 15;24(2):316–320. doi: 10.1093/nar/24.2.316

Cleaning the GenBank Arabidopsis thaliana data set.

P G Korning, S M Hebsgaard, P Rouze, S Brunak
PMCID: PMC145627  PMID: 8628656

Abstract

Data-driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, its possibilities are drastically impaired if the stored data is unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A.thaliana entries in GenBank. A number of simple 'sanity' checks, based on the nature of the data, revealed an alarmingly high error rate. More than 15% of the most important entries extracted contained erroneous information. In addition, a number of entries had directly conflicting assignments of exons and introns, not stemming from alternative splicing. In a few cases the errors are due to typographical misprints, which may be corrected by comparison with the original papers, but errors caused by wrong assignments of splice sites from experimental data are the most common. It is proposed that the level of error correction should be increased and that gene structure sanity checks should be incorporated, also at the submitter level, to avoid or reduce the problem in the future. A non-redundant and error-corrected subset of the data for A.thaliana is made available through anonymous FTP.
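
The abstract does not list the individual sanity checks, so the following is only a minimal sketch of what a gene structure check of this kind might look like, not the authors' procedure. It assumes three illustrative rules: annotated exons must be ordered and non-overlapping, each intron should begin with GT and end with AG, and each intron should be at least a minimum length. The function name check_gene_structure, the coordinate convention (1-based, inclusive, forward strand), and the default min_intron_len are hypothetical choices made for this example.

```python
from typing import List, Tuple


def check_gene_structure(
    seq: str,
    exons: List[Tuple[int, int]],
    min_intron_len: int = 60,  # illustrative placeholder, not a value from the article
) -> List[str]:
    """Return human-readable problems found in an exon annotation.

    exons holds (start, end) pairs, 1-based and inclusive, on the forward
    strand, in the order given by the annotation (assumed convention).
    """
    seq = seq.upper()
    problems = []

    # Every exon must lie inside the sequence and have start <= end.
    for i, (start, end) in enumerate(exons, start=1):
        if not (1 <= start <= end <= len(seq)):
            problems.append(f"exon {i}: coordinates {start}..{end} are invalid")
    if problems:
        return problems

    # Each pair of consecutive exons implies one intron between them.
    for i in range(len(exons) - 1):
        intron_start = exons[i][1] + 1
        intron_end = exons[i + 1][0] - 1
        length = intron_end - intron_start + 1
        if length <= 0:
            problems.append(f"exons {i + 1} and {i + 2}: overlapping, adjacent, or out of order")
            continue
        if length < min_intron_len:
            problems.append(f"intron {i + 1}: length {length} below minimum {min_intron_len}")
        donor = seq[intron_start - 1:intron_start + 1]   # first two intron bases
        acceptor = seq[intron_end - 2:intron_end]        # last two intron bases
        if donor != "GT":
            problems.append(f"intron {i + 1}: donor site is {donor}, expected GT")
        if acceptor != "AG":
            problems.append(f"intron {i + 1}: acceptor site is {acceptor}, expected AG")
    return problems


# Toy sequence: 6-base exon, 14-base intron (GT...AG), 6-base exon.
toy_seq = "ATGAAA" + "GT" + "A" * 10 + "AG" + "CCCTAA"
good = check_gene_structure(toy_seq, [(1, 6), (21, 26)], min_intron_len=10)
shifted = check_gene_structure(toy_seq, [(1, 7), (21, 26)], min_intron_len=10)
overlap = check_gene_structure(toy_seq, [(1, 10), (8, 26)], min_intron_len=10)
```

A flag from a check like this is a prompt for manual review against the original paper rather than proof of an error, since a small number of genuine introns use non-canonical borders.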



Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
