Abstract
Summary
FindGDPs is a program that uses a greedy algorithm to quickly identify a set of genome-directed primers that specifically anneal to all of the open reading frames in a genome and that do not exhibit full-length complementarity to the members of another user-supplied set of nucleotide sequences.
Availability
The program code is distributed under the GNU General Public License at http://www8.utsouthwestern.edu/utsw/cda/dept131456/files/159331.html
Contact
eric.hansen@utsouthwestern.edu
DNA microarrays have been used to investigate the differential expression of entire bacterial transcriptomes in response to environmental stimuli (Wilson et al., 1999). Central to the application of microarray technology in this context is the need to label bacterial mRNA with a detectable molecule such as a fluorophore. Frequently, reverse transcription of total RNA in the presence of a fluorescent nucleotide analog is employed to meet this need, but the lack of a conserved nucleotide motif at the 3′ end of bacterial mRNAs prevents the use of a single primer (e.g., oligo-dT) in bacterial labeling reactions (Lakey et al., 2002).
Talaat et al. recently described an algorithm for identifying a set of oligonucleotide primers that anneal to all of the ORFs in a microbial genome (Talaat et al., 2000). These authors demonstrated that the use of these genome-directed primers (GDPs) resulted in an improved signal-to-noise ratio over that observed when random hexamers were used. However, their program, GDPFinder, is available only for Macintosh and Windows NT/2000; it is not available for other versions of Windows or for other operating systems.
This paper describes the development of FindGDPs, a program that quickly identifies a set of GDPs that fulfills two criteria. First, the members of this set anneal to all of the ORFs in a genome, and second, they do not exhibit full-length complementarity to members of another set of user-supplied nucleotide sequences. FindGDPs also offers advantages in speed and portability. It requires only seconds to identify a set of GDPs for common microbial genomes (Table 1), and since it is written in C++, FindGDPs will run on any platform for which a C++ compiler is available.
Table 1. Comparison of FindGDPs and GDPFinder (Talaat et al., 2000) run on four annotated microbial genomes.
| Organism and Reference | Number of ORFs (a) | FindGDPs: Runtime (sec) (b) | FindGDPs: Number of GDPs (c) | GDPFinder: Runtime (sec) (b) | GDPFinder: Number of GDPs (d) |
|---|---|---|---|---|---|
| Escherichia coli K12 (www.genome.wisc.edu) | 4290 | 30 | 76 | 1248 | 136 |
| Streptococcus pneumoniae TIGR4 (www.tigr.org) | 2236 | 15 | 67 | 492 | 117 |
| Haemophilus influenzae KW20 (www.tigr.org) | 1738 | 11 | 40 | 394 | 66 |
| Mycoplasma genitalium G-37 (www.tigr.org) | 483 | 3 | 17 | 123 | 20 |
(a) The number of annotated genes in the given genome that were predicted to encode proteins.
(b) Test system was an 866 MHz Pentium III with 256 MB of memory running Microsoft Windows 2000.
(c) The number of 6-nucleotide GDPs, as identified by FindGDPs, that bind to the 3′ 30% of all annotated protein-encoding ORFs and that do not exhibit full-length complementarity to the 5S, 16S, or 23S rRNA sequences annotated in the genome of the corresponding organism.
(d) The number of 6-nucleotide GDPs, as identified by the Fast Find GDPs algorithm of GDPFinder (Talaat et al., 2000), that bind to the 3′ 30% of all annotated protein-encoding ORFs.
FindGDPs is run from the command line and prompts the user for the required runtime parameters. Two input files, both in FastA format, must be prepared before running the program. The first contains the nucleotide sequences of all the ORFs for which primers are to be designed. The second contains any nucleotide sequences to which the GDPs should not exhibit full-length complementarity. The user must also specify the length of the desired GDPs (6, 7, or 8 nucleotides; referred to as n), as well as the percentage of the 3′ end of each ORF (contained in the first file) to scan for potential GDPs.
The program begins by processing the file containing sequences to which GDPs should not exhibit full-length complementarity. Each sequence in the file is read and converted to its reverse-complement to obtain the non-coding strand. The non-coding sequence is scanned using a moving window of length n, and each n-mer contained therein is noted in a table. After reading all sequences in this file, the table contains all of the n-mers that cannot be used as GDPs.
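The exclusion step described above can be sketched in C++ as follows. This is a minimal illustration and not the FindGDPs source: the function names `reverseComplement` and `buildExcludedTable`, the use of `std::unordered_set` as the table, and the mapping of non-ACGT characters to `N` are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Return the reverse complement of a DNA sequence.
// Assumption: any character other than A, C, G, or T becomes 'N'.
std::string reverseComplement(const std::string& seq) {
    std::string rc(seq.rbegin(), seq.rend());
    for (char& c : rc) {
        switch (c) {
            case 'A': case 'a': c = 'T'; break;
            case 'T': case 't': c = 'A'; break;
            case 'C': case 'c': c = 'G'; break;
            case 'G': case 'g': c = 'C'; break;
            default:            c = 'N'; break;
        }
    }
    return rc;
}

// Slide a window of length n over the reverse complement of every
// prohibited sequence and record each n-mer encountered.
// The returned set plays the role of the "table" of invalid GDPs.
std::unordered_set<std::string> buildExcludedTable(
        const std::vector<std::string>& prohibited, std::size_t n) {
    assert(n > 0);
    std::unordered_set<std::string> table;
    for (const std::string& seq : prohibited) {
        const std::string rc = reverseComplement(seq);
        for (std::size_t i = 0; i + n <= rc.size(); ++i) {
            table.insert(rc.substr(i, n));
        }
    }
    return table;
}
```

For example, with the prohibited sequence `AAGC` and n = 2, the reverse complement is `GCTT`, so the table holds the three 2-mers `GC`, `CT`, and `TT`.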
The program then performs a similar operation on each of the sequences in the file containing the ORFs for which GDPs are to be designed. A specified percentage of the 3′ end (given as a runtime parameter) of each ORF is converted to its reverse complement and scanned to identify potential GDPs in the 3′ end of the ORF. As each potential GDP is identified, the table of invalid GDPs is checked to see if the potential GDP shows full-length complementarity to any of the prohibited sequences (i.e., those in the second file). If the potential GDP is valid, it is noted in a table of potential GDPs for the current ORF. After scanning the specified region at the 3′ end of the current ORF, the table is written to a temporary file. This process is repeated for each ORF in the input file.
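The per-ORF scan can be illustrated with a short C++ sketch. It is not taken from the FindGDPs source: the names `complementBase` and `candidatesForOrf` are hypothetical, the 3′ region length is rounded up (the paper does not state the rounding rule), windows containing ambiguous bases are skipped by assumption, and the result is returned in memory rather than written to a temporary file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Complement of a single base; anything other than A/C/G/T becomes 'N'.
char complementBase(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:  return 'N';
    }
}

// Collect the valid candidate n-mers for one ORF.
// `percent` is the share of the 3' end to scan; `excluded` is the table of
// n-mers that are fully complementary to a prohibited sequence.
std::unordered_set<std::string> candidatesForOrf(
        const std::string& orf, std::size_t n, unsigned percent,
        const std::unordered_set<std::string>& excluded) {
    assert(n > 0 && percent > 0 && percent <= 100);
    // Length of the 3' region, rounded up (assumed rounding rule).
    const std::size_t regionLen = (orf.size() * percent + 99) / 100;
    const std::string region = orf.substr(orf.size() - regionLen);

    // Reverse-complement the 3' region.
    std::string rc(region.rbegin(), region.rend());
    for (char& c : rc) c = complementBase(c);

    std::unordered_set<std::string> candidates;
    for (std::size_t i = 0; i + n <= rc.size(); ++i) {
        const std::string kmer = rc.substr(i, n);
        if (kmer.find('N') != std::string::npos) continue;  // ambiguous base
        if (excluded.count(kmer) == 0) candidates.insert(kmer);
    }
    return candidates;
}
```

With the 12-nucleotide ORF `ATGAAACCCGGG`, n = 3, and 50% of the 3′ end, the scanned region is `CCCGGG`, whose reverse complement is also `CCCGGG`. If `CCG` is in the excluded table, the candidates are `CCC`, `CGG`, and `GGG`.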
After all ORFs have been processed, a greedy algorithm is employed to identify a set of primers (selected from the set of valid primers) that anneal to the 3′ end of all of the ORFs. The algorithm operates by choosing an ORF for which a GDP has not yet been found, and then selecting the GDP that binds to this ORF as well as to the largest number of other ORFs that still need a GDP. The algorithm repeats until a set of GDPs has been identified such that every ORF in the first input file can be primed by at least one GDP. Furthermore, the members of this set are not complementary across their entire length to any sequence in the second input file (although partial complementarity may be exhibited).
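The greedy selection described above might look like the following C++ sketch. It follows the prose description only and is not the authors' code: the name `selectGdps`, the in-memory representation of the per-ORF candidate tables, the lexicographic tie-breaking that results from `std::set`, and the decision to leave ORFs with no valid candidate uncovered are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// candidates[i] holds the valid n-mers found in the 3' region of ORF i.
// Returns the chosen GDPs in the order in which they were selected.
std::vector<std::string> selectGdps(
        const std::vector<std::set<std::string>>& candidates) {
    std::vector<bool> covered(candidates.size(), false);
    std::vector<std::string> chosen;

    for (std::size_t i = 0; i < candidates.size(); ++i) {
        // Skip ORFs that already have a GDP; ORFs without any valid
        // candidate cannot be primed and are left uncovered (assumption).
        if (covered[i] || candidates[i].empty()) continue;

        // Among the GDPs that bind ORF i, pick the one that binds the
        // largest number of ORFs still lacking a GDP (ORF i included).
        std::string best;
        std::size_t bestCount = 0;
        for (const std::string& gdp : candidates[i]) {
            std::size_t count = 0;
            for (std::size_t j = 0; j < candidates.size(); ++j) {
                if (!covered[j] && candidates[j].count(gdp) > 0) ++count;
            }
            if (count > bestCount) {
                bestCount = count;
                best = gdp;
            }
        }
        assert(bestCount > 0);  // ORF i itself is always counted

        chosen.push_back(best);
        for (std::size_t j = 0; j < candidates.size(); ++j) {
            if (candidates[j].count(best) > 0) covered[j] = true;
        }
    }
    return chosen;
}
```

For four ORFs with candidate sets {AAA, CCC}, {CCC, GGG}, {GGG}, and {CCC}, the sketch first selects CCC, which covers the first, second, and fourth ORFs, and then GGG for the third.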
This algorithm runs in O(pq²) time, where p is the number of potential n-mers and q is the number of ORFs. As is typical of greedy algorithms, run times are short: 3 to 30 seconds for the genomes listed in Table 1. These short run times, combined with the program's ability to run on multiple platforms, should facilitate its use in prokaryotic DNA microarray systems.
Acknowledgments
This study was supported by U.S. Public Health Service grants AI17621, AI36344, and AI32011 to E.J.H. The authors thank Adel Talaat, Stephen Johnston, Harold “Skip” Garner, Michael Norgard, Jonathan Lawson, and Nikki Wagner for many helpful discussions.
REFERENCES
- Lakey DL, Zhang Y, Talaat AM, Samten B, DesJardin LE, Eisenach KD, Johnston SA, Barnes PF. Priming reverse transcription with oligo(dT) does not yield representative samples of Mycobacterium tuberculosis cDNA. Microbiology. 2002;148:2567–2572. doi: 10.1099/00221287-148-8-2567.
- Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995;270:467–470. doi: 10.1126/science.270.5235.467.
- Talaat AM, Hunter P, Johnston SA. Genome-directed primers for selective labeling of bacterial transcripts for DNA microarray analysis. Nature Biotechnology. 2000;18:679–682. doi: 10.1038/76543.
- Wilson M, DeRisi J, Kristensen H, Imboden P, Rane S, Brown PO, Schoolnik GK. Exploring drug-induced alterations in gene expression in Mycobacterium tuberculosis by microarray hybridization. Proc. Natl. Acad. Sci. U.S.A. 1999;96:12833–12838. doi: 10.1073/pnas.96.22.12833.
