Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks

E E Snyder; G D Stormo

Nucleic Acids Res. 1993 Feb 11;21(3):607–613. doi: 10.1093/nar/21.3.607

PMCID: PMC309159 PMID: 8441672

Abstract

Dynamic programming (DP) is applied to the problem of precisely identifying internal exons and introns in genomic DNA sequences. The program GeneParser first scores the sequence of interest for splice sites and for these intron- and exon-specific content measures: codon usage, local compositional complexity, 6-tuple frequency, length distribution and periodic asymmetry. This information is then organized for interpretation by DP. GeneParser employs the DP algorithm to enforce the constraints that introns and exons must be adjacent and non-overlapping and finds the highest scoring combination of introns and exons subject to these constraints. Weights for the various classification procedures are determined by training a simple feed-forward neural network to maximize the number of correct predictions. In a pilot study, the system has been trained on a set of 56 human gene fragments containing 150 internal exons in a total of 158,691 bps of genomic sequence. When tested against the training data, GeneParser precisely identifies 75% of the exons and correctly predicts 86% of coding nucleotides as coding while only 13% of non-exon bps were predicted to be coding. This corresponds to a correlation coefficient for exon prediction of 0.85. Because of the simplicity of the network weighting scheme, generalization performance is nearly as good as with the training set.
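The adjacency and non-overlap constraints described in the abstract can be illustrated with a small segment-based DP. This is only a minimal sketch under stated assumptions, not GeneParser's actual implementation: the names `parse_segments`, `toy_score`, and `evidence`, and the per-segment penalty, are hypothetical. The toy scoring function stands in for the weighted combination of splice-site and content measures that the abstract says is tuned by the neural network.

```python
from typing import Callable, List, Tuple

Segment = Tuple[str, int, int]  # (label, start, end), half-open interval [start, end)


def parse_segments(
    n: int, seg_score: Callable[[str, int, int], float]
) -> Tuple[float, List[Segment]]:
    """Highest-scoring partition of positions [0, n) into alternating
    intron ('I') and exon ('E') segments. Assumes n >= 1."""
    labels = ("I", "E")
    other = {"I": "E", "E": "I"}
    neg_inf = float("-inf")
    # best[s][j]: best score of a parse of [0, j) whose last segment has label s
    best = {s: [neg_inf] * (n + 1) for s in labels}
    back = {s: [-1] * (n + 1) for s in labels}
    for j in range(1, n + 1):
        for s in labels:
            for i in range(j):
                # A segment starting at 0 has no predecessor; any other segment
                # must directly follow a segment with the opposite label, which
                # makes the parse adjacent and non-overlapping by construction.
                prev = 0.0 if i == 0 else best[other[s]][i]
                if prev == neg_inf:
                    continue
                cand = prev + seg_score(s, i, j)
                if cand > best[s][j]:
                    best[s][j] = cand
                    back[s][j] = i
    s = max(labels, key=lambda lab: best[lab][n])
    total = best[s][n]
    if total == neg_inf:
        raise ValueError("no admissible parse")
    # Trace back from the end of the sequence to recover the segments.
    segments: List[Segment] = []
    j = n
    while j > 0:
        i = back[s][j]
        segments.append((s, i, j))
        s, j = other[s], i
    segments.reverse()
    return total, segments


# Hypothetical per-base evidence: positive values favour exon, negative intron.
evidence = [-1, -1, 2, 2, 2, -1, -1, -1, 3, 3, -2]


def toy_score(label: str, i: int, j: int) -> float:
    """Toy segment score: summed evidence, minus a fixed per-segment penalty."""
    s = sum(evidence[i:j])
    return (s if label == "E" else -s) - 0.5


total, segments = parse_segments(len(evidence), toy_score)
```

The double loop over segment end points and start points makes this sketch quadratic in sequence length. A scoring function can return `float("-inf")` to forbid a segment, for example one whose boundaries lack a candidate splice site.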


