Nucleic Acids Research. 1999 Oct 1;27(19):3911–3920. doi: 10.1093/nar/27.19.3911

Heuristic approach to deriving models for gene finding.

J Besemer, M Borodovsky
PMCID: PMC148655  PMID: 10481031

Abstract

Computer methods of accurate gene finding in DNA sequences require models of protein coding and non-coding regions derived either from experimentally validated training sets or from large amounts of anonymous DNA sequence. Here we propose a new, heuristic method producing fairly accurate inhomogeneous Markov models of protein coding regions. The new method needs such a small amount of DNA sequence data that the model can be built 'on the fly' by a web server for any DNA sequence >400 nt. Tests on 10 complete bacterial genomes performed with the GeneMark.hmm program demonstrated the ability of the new models to detect 93.1% of annotated genes on average, while models built by traditional training predict an average of 93.9% of genes. Models built by the heuristic approach could be used to find genes in small fragments of anonymous prokaryotic genomes and in genomes of organelles, viruses, phages and plasmids, as well as in highly inhomogeneous genomes where adjustment of models to local DNA composition is needed. The heuristic method also gives an insight into the mechanism of codon usage pattern evolution.
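
For readers unfamiliar with the term, an inhomogeneous Markov model of a coding region keeps separate nucleotide statistics for each of the three codon positions. The sketch below is illustrative only and rests on assumptions made here, not on the article: it estimates a zero-order, three-periodic model from example coding sequences (conventional training, not the heuristic derivation the abstract proposes), and the function names, the pseudocount and the uniform background of 0.25 are hypothetical choices.

```python
import math
from collections import Counter

NUCS = "ACGT"

def train_periodic_model(coding_seqs, pseudocount=1.0):
    """Estimate P(nucleotide | codon position) for codon positions 0, 1, 2.

    Assumes every sequence in coding_seqs starts in frame (index 0 is the
    first base of a codon). Characters other than A, C, G, T are skipped.
    """
    counts = [Counter() for _ in range(3)]
    for seq in coding_seqs:
        for i, nuc in enumerate(seq.upper()):
            if nuc in NUCS:
                counts[i % 3][nuc] += 1
    model = []
    for c in counts:
        # Pseudocounts keep every probability strictly positive.
        total = sum(c[n] for n in NUCS) + 4 * pseudocount
        model.append({n: (c[n] + pseudocount) / total for n in NUCS})
    return model

def log_odds(seq, model, frame=0, background=0.25):
    """Log-likelihood ratio of seq under the periodic model versus a
    uniform background, assuming a codon starts at index `frame`."""
    score = 0.0
    for i, nuc in enumerate(seq.upper()):
        if nuc in NUCS:
            score += math.log(model[(i - frame) % 3][nuc] / background)
    return score

# Toy usage: train on one short in-frame sequence, then compare two frames.
model = train_periodic_model(["ATGGCCAAAGGC"])
in_frame = log_odds("ATGGCC", model, frame=0)
shifted = log_odds("ATGGCC", model, frame=1)
```

A positive score in one frame and a lower score in the shifted frames is the kind of periodic signal such models exploit. Practical gene finders use higher-order models and a trained non-coding model instead of a uniform background.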



Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
