Molecular sequence accuracy and the analysis of protein coding regions

D J States; D Botstein

doi:10.1073/pnas.88.13.5518

. 1991 Jul 1;88(13):5518–5522. doi: 10.1073/pnas.88.13.5518

Molecular sequence accuracy and the analysis of protein coding regions.

D J States ¹, D Botstein ¹

PMCID: PMC51908 PMID: 2062834

Abstract

Molecular sequences, like all experimental data, have finite error rates. The impact of errors on the information content of molecular sequence data is dependent on the analytic paradigm used to interpret the data. We studied the impact of nucleic acid sequence errors on the ability to align predicted amino acid sequences with the sequences of related proteins. We found that with a simultaneous translation and alignment algorithm, identification of sequence homologies is resilient to the introduction of random errors. Proteins with greater than 30% sequence identity can be reliably recognized even in the presence of 1% frameshifting (insertion or deletion) error rates and 5% base substitution rates. Incorporation of prior knowledge about the location and characteristics of errors improves tolerance to error of amino acid sequence alignments. Similarly, inclusion of prior knowledge of biased codon utilization by yeast (Saccharomyces cerevisiae) allows reliable detection of correct reading frames in yeast sequences even in the presence of 5% substitution and 1% frameshift errors.

Selected References

These references are in PubMed. This may not be the complete list of references from this article.

Church G. M., Kieffer-Higgins S. Multiplex DNA sequencing. Science. 1988 Apr 8;240(4849):185–188. doi: 10.1126/science.3353714. [DOI] [PubMed] [Google Scholar]
Craik C. S., Choo Q. L., Swift G. H., Quinto C., MacDonald R. J., Rutter W. J. Structure of two related rat pancreatic trypsin genes. J Biol Chem. 1984 Nov 25;259(22):14255–14264. [PubMed] [Google Scholar]
Dayhoff M. O., Barker W. C., Hunt L. T. Establishing homologies in protein sequences. Methods Enzymol. 1983;91:524–545. doi: 10.1016/s0076-6879(83)91049-2. [DOI] [PubMed] [Google Scholar]
Drmanac R., Labat I., Brukner I., Crkvenjakov R. Sequencing of megabase plus DNA by hybridization: theory of the method. Genomics. 1989 Feb;4(2):114–128. doi: 10.1016/0888-7543(89)90290-5. [DOI] [PubMed] [Google Scholar]
Fickett J. W. Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 1982 Sep 11;10(17):5303–5318. doi: 10.1093/nar/10.17.5303. [DOI] [PMC free article] [PubMed] [Google Scholar]
Grantham R., Gautier C., Gouy M., Jacobzone M., Mercier R. Codon catalog usage is a genome strategy modulated for gene expressivity. Nucleic Acids Res. 1981 Jan 10;9(1):r43–r74. doi: 10.1093/nar/9.1.213-b. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ikemura T. Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in protein genes. Differences in synonymous codon choice patterns of yeast and Escherichia coli with reference to the abundance of isoaccepting transfer RNAs. J Mol Biol. 1982 Jul 15;158(4):573–597. doi: 10.1016/0022-2836(82)90250-9. [DOI] [PubMed] [Google Scholar]
Maxam A. M., Gilbert W. Sequencing end-labeled DNA with base-specific chemical cleavages. Methods Enzymol. 1980;65(1):499–560. doi: 10.1016/s0076-6879(80)65059-9. [DOI] [PubMed] [Google Scholar]
Newport G. R., McKerrow J. H., Hedstrom R., Petitt M., McGarrigle L., Barr P. J., Agabian N. Cloning of the proteinase that facilitates infection by schistosome parasites. J Biol Chem. 1988 Sep 15;263(26):13179–13184. [PubMed] [Google Scholar]
Pearson W. R., Lipman D. J. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A. 1988 Apr;85(8):2444–2448. doi: 10.1073/pnas.85.8.2444. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sinha S., Watorek W., Karr S., Giles J., Bode W., Travis J. Primary structure of human neutrophil elastase. Proc Natl Acad Sci U S A. 1987 Apr;84(8):2228–2232. doi: 10.1073/pnas.84.8.2228. [DOI] [PMC free article] [PubMed] [Google Scholar]
Smith T. F., Waterman M. S., Fitch W. M. Comparative biosequence metrics. J Mol Evol. 1981;18(1):38–46. doi: 10.1007/BF01733210. [DOI] [PubMed] [Google Scholar]
Smith T. F., Waterman M. S. Identification of common molecular subsequences. J Mol Biol. 1981 Mar 25;147(1):195–197. doi: 10.1016/0022-2836(81)90087-5. [DOI] [PubMed] [Google Scholar]
Tabor S., Richardson C. C. DNA sequence analysis with a modified bacteriophage T7 DNA polymerase. Effect of pyrophosphorolysis and metal ions. J Biol Chem. 1990 May 15;265(14):8322–8328. [PubMed] [Google Scholar]

[OCR_00756] Church G. M., Kieffer-Higgins S. Multiplex DNA sequencing. Science. 1988 Apr 8;240(4849):185–188. doi: 10.1126/science.3353714. [DOI] [PubMed] [Google Scholar]

[OCR_00784] Craik C. S., Choo Q. L., Swift G. H., Quinto C., MacDonald R. J., Rutter W. J. Structure of two related rat pancreatic trypsin genes. J Biol Chem. 1984 Nov 25;259(22):14255–14264. [PubMed] [Google Scholar]

[OCR_00768] Dayhoff M. O., Barker W. C., Hunt L. T. Establishing homologies in protein sequences. Methods Enzymol. 1983;91:524–545. doi: 10.1016/s0076-6879(83)91049-2. [DOI] [PubMed] [Google Scholar]

[OCR_00764] Drmanac R., Labat I., Brukner I., Crkvenjakov R. Sequencing of megabase plus DNA by hybridization: theory of the method. Genomics. 1989 Feb;4(2):114–128. doi: 10.1016/0888-7543(89)90290-5. [DOI] [PubMed] [Google Scholar]

[OCR_00802] Fickett J. W. Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 1982 Sep 11;10(17):5303–5318. doi: 10.1093/nar/10.17.5303. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00812] Grantham R., Gautier C., Gouy M., Jacobzone M., Mercier R. Codon catalog usage is a genome strategy modulated for gene expressivity. Nucleic Acids Res. 1981 Jan 10;9(1):r43–r74. doi: 10.1093/nar/9.1.213-b. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00810] Ikemura T. Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in protein genes. Differences in synonymous codon choice patterns of yeast and Escherichia coli with reference to the abundance of isoaccepting transfer RNAs. J Mol Biol. 1982 Jul 15;158(4):573–597. doi: 10.1016/0022-2836(82)90250-9. [DOI] [PubMed] [Google Scholar]

[OCR_00752] Maxam A. M., Gilbert W. Sequencing end-labeled DNA with base-specific chemical cleavages. Methods Enzymol. 1980;65(1):499–560. doi: 10.1016/s0076-6879(80)65059-9. [DOI] [PubMed] [Google Scholar]

[OCR_00793] Newport G. R., McKerrow J. H., Hedstrom R., Petitt M., McGarrigle L., Barr P. J., Agabian N. Cloning of the proteinase that facilitates infection by schistosome parasites. J Biol Chem. 1988 Sep 15;263(26):13179–13184. [PubMed] [Google Scholar]

[OCR_00798] Pearson W. R., Lipman D. J. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A. 1988 Apr;85(8):2444–2448. doi: 10.1073/pnas.85.8.2444. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00789] Sinha S., Watorek W., Karr S., Giles J., Bode W., Travis J. Primary structure of human neutrophil elastase. Proc Natl Acad Sci U S A. 1987 Apr;84(8):2228–2232. doi: 10.1073/pnas.84.8.2228. [DOI] [PMC free article] [PubMed] [Google Scholar]

[OCR_00772] Smith T. F., Waterman M. S., Fitch W. M. Comparative biosequence metrics. J Mol Evol. 1981;18(1):38–46. doi: 10.1007/BF01733210. [DOI] [PubMed] [Google Scholar]

[OCR_00776] Smith T. F., Waterman M. S. Identification of common molecular subsequences. J Mol Biol. 1981 Mar 25;147(1):195–197. doi: 10.1016/0022-2836(81)90087-5. [DOI] [PubMed] [Google Scholar]

[OCR_00760] Tabor S., Richardson C. C. DNA sequence analysis with a modified bacteriophage T7 DNA polymerase. Effect of pyrophosphorolysis and metal ions. J Biol Chem. 1990 May 15;265(14):8322–8328. [PubMed] [Google Scholar]

PERMALINK

Molecular sequence accuracy and the analysis of protein coding regions.

D J States

D Botstein

Abstract

Full text

Selected References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Molecular sequence accuracy and the analysis of protein coding regions.

D J States

D Botstein

Abstract

Full text

Selected References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases