Dragon Gene Start Finder: An Advanced System for Finding Approximate Locations of the Start of Gene Transcriptional Units

Vladimir B Bajic; Seng Hong Seah

doi:10.1101/gr.869803

2003 Aug;13(8):1923–1929.

PMCID: PMC403784 PMID: 12869582

Abstract

We present an advanced system for recognition of gene starts in mammalian genomes. The system predicts gene start locations by combining information about CpG islands, transcription start sites (TSSs), and signals downstream of the predicted TSSs. It aims to predict a region that contains the gene start or lies in its proximity. Evaluation on human chromosomes 4, 21, and 22 resulted in a sensitivity (Se) of over 65% and a positive predictive value (ppv) of ∼78%. The system makes on average one prediction per 177,000 nucleotides on the human genome, as judged by the results on chromosome 21. A comparison of TSS prediction ability with two other systems on human chromosomes 4, 21, and 22 shows that our system has superior accuracy and overall provides the most confident predictions.

As indicated by Fickett and Hatzigeorgiou (1997) and Pedersen et al. (1999), recognition of eukaryotic promoters remains a difficult problem. Numerous systems for promoter prediction have been developed (for reviews, see Fickett and Hatzigeorgiou 1997; Prestridge 2000), but the general conclusion is that the level of false positive (FP) predictions is unacceptably high. The first breakthrough from such inferior performance was achieved by the PromoterInspector program (Scherf et al. 2000), which reduced FP predictions to an acceptable level while maintaining relatively high sensitivity (Se). The initially reported performance of PromoterInspector (Scherf et al. 2000, 2001) implied Se = ∼0.43 and a positive predictive value (ppv) of ∼0.43. However, later research (Down and Hubbard 2002) showed that PromoterInspector in fact had better overall performance, at least as measured on human chromosome 22. Also, Werner (2002) suggests that PromoterInspector has Se > 0.5 and ppv > 0.85. These last claims require proper validation, but we can conclude that PromoterInspector represented a breakthrough in promoter prediction.

After the appearance of PromoterInspector, several systems for promoter prediction were developed (Ioshikhes and Zhang 2000; Davuluri et al. 2001; Hannenhalli and Levy 2001; Bajic et al. 2002a,b, 2003; Down and Hubbard 2002; Ponger and Mouchiroud 2002) that achieved an acceptably low level of FP predictions. These systems are based on different principles and do not share the same design goals. Some aim to recognize the actual transcription start site (TSS), such as Dragon Promoter Finder (Dragon PF; Bajic et al. 2002a,b, 2003) and Eponine (Down and Hubbard 2002). Others predict a region that should be in proximity to the TSS, such as CpG-Promoter (Ioshikhes and Zhang 2000), the system of Hannenhalli and Levy (2001), and CpGProD (Ponger and Mouchiroud 2002). A third group of systems, such as FirstEF (Davuluri et al. 2001), provides more comprehensive information about promoters and first exons. All of these systems, with the exception of CpG-Promoter, have been compared with PromoterInspector in one way or another, and all reported better overall performance on the data sets that they used.

In the human genome, many genes were recognized and validated successfully (Lander et al. 2001; Venter et al. 2001) by using the so-called CpG islands as gene markers. CpG islands are unmethylated segments of DNA longer than 200 bp, with a G + C content of at least 50% and a number of CpG dinucleotides that is at least 60% of what would be expected from the G + C content of the segment (Bird et al. 1986; Gardiner-Garden and Frommer 1987; Larsen et al. 1992; Cross and Bird 1995). CpG islands are found around gene starts in approximately half of mammalian promoters (Larsen et al. 1992; Cross and Bird 1995) and are estimated to be associated with ∼60% of human promoters (Cross et al. 1999). For this reason, Pedersen et al. (1999) suggested that CpG islands could represent a good global signal to locate promoters across genomes. At least in mammalian genomes, CpG islands are a good indicator of gene presence. Programs such as CpG-Promoter, the system of Hannenhalli and Levy (2001), CpGProD, and FirstEF explicitly use information on CpG islands in their promoter-finding algorithms, although the type of information varies from program to program.
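The sequence-based part of the CpG island definition above can be illustrated with a minimal Python sketch. It is not the procedure used by Dragon GSF or any of the cited programs. The function names are hypothetical, and the expected CpG count is assumed to follow the commonly used observed/expected ratio of Gardiner-Garden and Frommer (number of C × number of G / segment length), which the text does not spell out. Methylation status cannot be read from sequence alone and is ignored here.

```python
def cpg_island_stats(seq):
    """Return (length, G+C fraction, observed/expected CpG ratio) for a DNA segment."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0, 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")          # observed CpG dinucleotides
    gc_fraction = (c + g) / n
    # Assumption: expected CpG count = (#C * #G) / length
    expected = c * g / n
    obs_exp = cpg / expected if expected > 0 else 0.0
    return n, gc_fraction, obs_exp


def meets_cpg_island_criteria(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Check the three sequence criteria quoted in the text (hypothetical helper)."""
    n, gc_fraction, obs_exp = cpg_island_stats(seq)
    return n > min_len and gc_fraction >= min_gc and obs_exp >= min_obs_exp


if __name__ == "__main__":
    print(meets_cpg_island_criteria("CG" * 150))   # CpG-rich, 300 bp
    print(meets_cpg_island_criteria("AT" * 150))   # no G or C at all
```

The thresholds are exposed as keyword arguments because, as noted above, the programs that use CpG islands differ in the type of information they extract from them.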

Here we introduce a new system, Dragon Gene Start Finder (Dragon GSF), for prediction of promoters in mammalian genomes. This system uses information about CpG islands, predicted TSS locations, and a region downstream of the predicted TSSs. This information is processed to infer promoter presence, to estimate the region expected to contain the TSS and to overlap with the first exon, and to estimate the gene start. The system was rigorously tested on genomic sequences of human chromosomes 4, 21, and 22. Its ability to predict TSS locations was compared with that of other systems that provide strand-specific prediction of TSSs, namely Eponine and FirstEF. In these tests, our system exhibited superior accuracy. Its overall performance appears to be, at the time of writing, the best of the currently available systems for gene start prediction. We estimate that the Se with respect to all promoters in the human genome is ∼0.65, with a ppv of ∼0.78, and that our system makes approximately one strand-specific prediction per 177,000 nt.

RESULTS

We analyzed the performance of Dragon GSF on three human chromosomes. No sequences from these chromosomes were used in the training and tuning of our system. We selected chromosomes 4, 21, and 22 because they differ in G + C content, which lets us better understand how our system and the other systems behave as the G + C content varies. To assess the relative performance of Dragon GSF, we compared it with two other systems, FirstEF and Eponine.

The main results are summarized in Tables 1–7. Results are given with respect to several criteria related to the maximum allowed distance between the predicted TSS and the real TSS. In these experiments, Dragon GSF, FirstEF, and Eponine were used with their default parameter settings (see Methods).

Table 1.

Results on Chromosome 4 With TSSs Determined Based on Mapped Full-Length cDNA Sequences From DBTSS (Sugano Laboratory)

	TP	FP	Total # of TSSs	Total # of predictions	Se	ppv	ASM	CC
Dragon GSF	179	55	304	1349	0.5888	0.7650	1.6364	0.6711
FirstEF	220	169	304	3620	0.7237	0.5656	3.4545	0.6398
FirstEF (CpG+)	217	110	304	2509	0.7138	0.6636	1.9091	0.6883
Eponine	120	36	304	2296	0.3947	0.7692	3.0000	0.5510
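As a reading aid for Table 1, the sketch below recomputes the Se and ppv columns from the TP, FP, and total TSS counts, assuming the standard definitions Se = TP / (total number of TSSs) and ppv = TP / (TP + FP). The CC column is recomputed as the square root of Se × ppv. That formula is an observation that matches the tabulated values to four decimals, not a definition given in this section. The ASM column is not reproduced, and the function name `se_ppv_cc` is hypothetical.

```python
import math


def se_ppv_cc(tp, fp, total_tss):
    """Return (Se, ppv, CC) from raw counts.

    Assumptions: Se = TP / total TSSs, ppv = TP / (TP + FP),
    and CC = sqrt(Se * ppv), which matches the values in Table 1.
    """
    se = tp / total_tss
    ppv = tp / (tp + fp)
    cc = math.sqrt(se * ppv)
    return se, ppv, cc


# (TP, FP, total number of TSSs) copied from Table 1
table1 = {
    "Dragon GSF": (179, 55, 304),
    "FirstEF": (220, 169, 304),
    "FirstEF (CpG+)": (217, 110, 304),
    "Eponine": (120, 36, 304),
}

if __name__ == "__main__":
    for name, (tp, fp, total) in table1.items():
        se, ppv, cc = se_ppv_cc(tp, fp, total)
        print(f"{name}\tSe={se:.4f}\tppv={ppv:.4f}\tCC={cc:.4f}")
```

Note that TP + FP is much smaller than the total number of predictions in every row, so the FP counts evidently do not cover all predictions; the counting rules are those described in the paper's Methods.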
