Theatre: a software tool for detailed comparative analysis and visualization of genomic sequence

Yvonne J K Edwards; Tim J Carver; Tanya Vavouri; Martin Frith; Martin J Bishop; Greg Elgar

Nucleic Acids Res. 2003 Jul 1;31(13):3510–3517. doi: 10.1093/nar/gkg501

PMCID: PMC168908 PMID: 12824356

Abstract

Theatre is a web-based computing system designed for the comparative analysis of genomic sequences, especially with respect to motifs likely to be involved in the regulation of gene expression. Theatre is an interface to commonly used sequence analysis tools and biological sequence databases to determine or predict the positions of coding regions, repetitive sequences and transcription factor binding sites in families of DNA sequences. The information is displayed in a manner that can be easily understood and can reveal patterns that might not otherwise have been noticed. In addition to web-based output, Theatre can produce publication quality colour hardcopies showing predicted features in aligned genomic sequences. A case study using the p53 promoter region of four mammalian species and two fish species is described. Unlike the mammalian promoters, the p53 promoter regions in fish have not previously been predicted or characterized, and we report the differences between the p53 promoter region of the four mammals and that predicted for the two fish species. Theatre can be accessed at http://www.hgmp.mrc.ac.uk/Registered/Webapp/theatre/.

INTRODUCTION

Databases comprising information on transcriptional regulation, such as TRANSFAC (1), the Eukaryotic Promoter Database (2), COMPEL (3) and TRRD (4), are valuable in classifying transcription factors and their DNA binding sites. Software developed to predict protein binding sites in sequences, such as MatInspector (5) and Tfscan from the EMBOSS package (6), relies on databases such as TRANSFAC for the protein binding matrices and strings used to search for motifs in new sequences. It is helpful to identify putative regulatory elements in the context of a defined structure or a promoter region of a gene whose expression is being affected (1–4). Comparative analysis of orthologous promoters and other types of non-coding regions is proving to be a reliable guide to finding conserved features to be tested experimentally for function (7–11). Identifying conserved non-coding sequences in genomes is an extremely useful starting point towards studying the control of expression of orthologous and paralogous genes. Comparative analysis of orthologous genomic systems can help to identify motifs or other features, shared by organisms, that may affect expression, development and differentiation. These are a few of the considerations that led to the development of Theatre.

Theatre has been designed to compare and display features in equivalent genome sequences. The features considered are the coding and non-coding regions, repetitive sequences, transcription factor binding sites, intron and exon sizes and nucleotide biases. The first section of this article describes the Theatre analysis tool to compare features in genomic sequences. The second part describes an application using the p53 promoter.

BIOLOGICAL BACKGROUND RELEVANT TO THEATRE

Theatre has been designed to compare equivalent gene structures and analyse likely genomic regions responsible for regulation of gene expression. Theatre integrates commonly used tools to characterize or predict the position and orientation of the protein coding regions, repetitive DNA sequences, and transcription factor binding sites. This section summarizes the sequence analysis tools and databases used by Theatre.

Transcription factor binding sites and other regulatory regions

Known transcription factor binding motifs can be searched for in DNA sequences using MatInspector (5) or Tfscan [a component of EMBOSS (6)]. MatInspector version 2.0 performs matrix searches in the nucleotide sequences and can analyse 246 protein binding sites. Tfscan finds transcription factor binding sites by string searches against databases of compiled binding sites. Cpgplot, also part of EMBOSS, identifies CpG islands. CpG islands are patches of non-methylated DNA coinciding with most gene promoters in the genomes of vertebrates; methylation of these motifs correlates with repression of transcription (12). The CpG islands are defined as regions longer than 200 bases with a moving average of %(G+C) in excess of 50% and a moving average of observed/expected CpG dinucleotide content >0.6.
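
The CpG island definition above can be illustrated with a minimal Python sketch. It is not the Cpgplot implementation: the window size of 100 bases, the expected CpG count of (C count × G count) / window length, and the function names are assumptions made for illustration. Only the three thresholds (length >200 bases, %(G+C) >50, observed/expected >0.6) come from the text.

```python
def cpg_window_stats(seq, start, window):
    """Return (%G+C, observed/expected CpG) for seq[start:start + window]."""
    w = seq[start:start + window].upper()
    c, g = w.count("C"), w.count("G")
    cpg = w.count("CG")
    gc_percent = 100.0 * (c + g) / len(w)
    # Assumed definition of the expected CpG count for a window.
    expected = c * g / len(w)
    obs_exp = cpg / expected if expected > 0 else 0.0
    return gc_percent, obs_exp


def find_cpg_islands(seq, window=100, min_len=200, min_gc=50.0, min_oe=0.6):
    """Return half-open (start, end) intervals of candidate CpG islands.

    Consecutive windows that pass both thresholds are merged into one run;
    a run is reported only if it spans more than min_len bases.
    """
    islands = []
    run_start = None
    n_windows = len(seq) - window + 1
    for i in range(max(n_windows, 0)):
        gc, oe = cpg_window_stats(seq, i, window)
        passing = gc > min_gc and oe > min_oe
        if passing and run_start is None:
            run_start = i
        elif not passing and run_start is not None:
            end = i - 1 + window  # end of the last passing window
            if end - run_start > min_len:
                islands.append((run_start, end))
            run_start = None
    if run_start is not None and len(seq) - run_start > min_len:
        islands.append((run_start, len(seq)))
    return islands


if __name__ == "__main__":
    print(find_cpg_islands("CG" * 150))
    print(find_cpg_islands("AT" * 150))
```

Sequences shorter than the window yield no islands in this sketch, so a real analysis would need to choose the window size with the input lengths in mind.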

Protein coding regions

DNA sequence BLAST searches (13) against the SPTR (SWISS-PROT and TrEMBL) databases (14) and the GeneMark ORF predictions (15) can be used to locate potential protein coding regions. These assignments, when available, are useful for determining the positions of protein coding regions.

Repetitive DNA sequences

RepeatMasker performs a fast sequence search against databases of repeat sequences such as SINEs, LINEs, LTR elements and other retrotransposable elements commonly found in genomic sequences (A.F.A. Smit and P. Green, unpublished material; http://ftp.genome.washington.edu/RM/RepeatMasker.html). Repetitive elements in the sequences are masked using RepeatMasker. The masked sequences are then searched against the SPTR database using BLAST. Any repeats identified are highlighted.

DESCRIPTION OF THEATRE AND THE WEB INTERFACE

There are three stages to running Theatre: creating a multiple sequence alignment, populating hierarchical directory structures of genomic features predicted for each sequence, and generating a graphical display that superimposes features on the sequence alignment. The first stage (PreMSD) generates a multiple DNA sequence alignment using ClustalW (16). The second stage consists of a Theatre program developed in C, named MSD. The role of MSD is to submit UNIX shell scripts to a job queue. The UNIX shell scripts are generated from a template; each script controls the input and output details and parameters needed to run one of the programs of the second stage (Table 1). Using these scripts, MSD generates a flat file databank of genomic features for the sequences. This stage creates a project directory containing a named subdirectory for each program. For example, a subdirectory named MatInspector holds the MatInspector results for the sequences included in the analysis. Each subdirectory contains the output from one of the programs and is searched by the program of the third stage. The third stage consists of a program that formats the features and highlights the relationships among features displayed in equivalent regions. This display program generates the colour PostScript™ output files. Tables 2–6 and Figures 1–4 are sample output from the Theatre display program. The programs of the Theatre package (PreMSD, MSD and the display program) can be run from the UNIX command line. They are integrated into a web-based interface using PERL (17) and the CGI.pm library (18).
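
The second-stage layout described above (one project directory, one named subdirectory per program, one templated shell script per job) can be sketched as follows. MSD itself is written in C and its script template is not published here, so this Python sketch is purely illustrative: the function name `build_project`, the script file naming and the template body are hypothetical, and the scripts contain only comments rather than real command lines for the analysis programs.

```python
from pathlib import Path
from string import Template
import tempfile

# Program names follow Table 1; everything else here is a hypothetical layout.
PROGRAMS = ["BLAST", "GeneMark", "MatInspector", "Tfscan", "CpGplot", "RepeatMasker"]

# Placeholder template: a real template would hold the actual command line.
SCRIPT_TEMPLATE = Template(
    "#!/bin/sh\n"
    "# Hypothetical job script for $program\n"
    "# input:  $sequence\n"
    "# output: $outdir/$name.out\n"
)


def build_project(root, project, sequences):
    """Create <root>/<project>/<program>/ and one job script per sequence."""
    project_dir = Path(root) / project
    scripts = []
    for program in PROGRAMS:
        outdir = project_dir / program
        outdir.mkdir(parents=True, exist_ok=True)
        for seq in sequences:
            name = Path(seq).stem
            script = outdir / f"{name}.sh"
            script.write_text(
                SCRIPT_TEMPLATE.substitute(
                    program=program, sequence=seq, outdir=outdir, name=name
                )
            )
            scripts.append(script)
    return scripts


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in build_project(tmp, "p53", ["hsantp53.fasta", "rsantp53.fasta"]):
            print(path.relative_to(tmp))
```

Keeping one subdirectory per program means a display stage can locate every result for a given program by directory name alone, which matches the description of the third stage searching these subdirectories.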

Table 1. Theatre: incorporated programs, references and operational notes.

Program	Version	Program control options (the first and second stage)	Features a user can select to include in the display (the third stage)	References
ClustalW	1.74	Select to run ClustalW or provide an alignment from an alternate source	An alignment is always required	(16)
BLAST	2.2.3	None	Protein coding regions	(13)
GeneMark	2.3	Select species specific dicodon matrices	Open reading frames	(15)
MatInspector	2.2	None	User selection from predicted transcription factor binding sites	(5)
Tfscan (EMBOSS)	2.5.1	Select taxonomical class to consider	User selection from predicted transcription factor binding sites	(6)
CpGplot (EMBOSS)	2.5.1	None	CpG islands	(6)
RepeatMasker	(release: 07/07/2001)	None	Repeats	Smit and Green, unpublished

The options that can be set by the user are summarized.

Table 2. Nucleotide composition for four mammalian p53 promoters.

Sequences	C+G %	A+T %	A %	C %	G %	T %	Length (bases)
hsantp53	54.9	45.1	18.5	30.8	24.2	26.5	426
rsantp53	50.7	49.3	22.8	28.5	22.2	26.6	527
mmantp53	51.5	48.5	22.6	29.1	22.4	25.9	536
ma08134	50.7	49.3	27.6	27.6	23.1	21.7	700
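
The quantities reported in Table 2 are simple base counts expressed as percentages of sequence length. A minimal Python sketch of such a calculation is given below; the function name and the returned dictionary keys are illustrative assumptions, not part of Theatre.

```python
from collections import Counter


def nucleotide_composition(seq):
    """Return percentage composition and length for a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    n = len(seq)
    # Percentages are relative to the full length, so ambiguous bases (N)
    # reduce the A, C, G and T totals accordingly.
    result = {base: 100.0 * counts.get(base, 0) / n for base in "ACGT"}
    result["C+G"] = result["C"] + result["G"]
    result["A+T"] = result["A"] + result["T"]
    result["length"] = n
    return result


if __name__ == "__main__":
    for key, value in nucleotide_composition("AACCGGTTAT").items():
        print(key, value)
```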

Table 6. Intron sizes for two puffer fish sequences.

Sequences	Intron 1
frp53	738
tnp53	638

Figure 1. (A) The sequence alignment of the upstream transcription regulatory region of the p53 gene from human (hsantp53, M26864), rat (rsantp53, M26863), mouse (mmantp53, M26862) and golden hamster (ma08134, U08134). The EMBL entry identifier and the accession number (21) are given in parentheses, in that order. The line breaks indicate gaps in the alignment. The features are shown above or below the line depending on whether they are found in the forward or complementary sequence. The transcription factor binding sites selected are known to regulate mammalian p53 transcription. (B) The key for (A). The Tfscan sites displayed are the PF1 binding site (MOUSE$P53_03) and the ETF motifs (MOUSE$P53_07).

Figure 4. (A) A detailed Theatre display showing two puffer fish p53 genes aligned at the level of the base and the genomic features predicted in intron 1 and exon 2. The translation start site is at position 1393. (B) The key for (A).

Online documentation is provided on the Theatre website. The user interacts with Theatre through a set of HTML forms. The forms contain fields for the user's email address, the GeneMark matrices, a project name and the individual sequence filenames. The input is a set of genomic sequences, which can be supplied in a variety of formats. Although Theatre can use ClustalW to align sequences, it is also possible to enter an alignment in MSF format created from another source (19,20). The user is sent an email message when the analysis is complete, and PostScript™ generation is completed interactively in the last stage. Table 1 lists more details regarding the use of Theatre.

THEATRE IMPLEMENTATION

The Theatre and MSD programs are written in ANSI C. The programs compile and run on Solaris platforms and are controlled through UNIX command line arguments. Six programs are used for identifying regions of interest in genomic sequences (Table 1). The programs that make up stage 2 of Theatre have some limitations on the amount of data that they can process, so Theatre requires that no more than 10 kb of sequence in total is provided. Within this overall limit any number of sequences can be input. The maximum number of sequences and the sequence lengths that each program can handle depend on the available memory and on known caveats set by the individual programs. The programs ClustalW, RepeatMasker, GeneMark, BLAST, MatInspector, Tfscan and Cpgplot are available from their respective authors. Executables for Theatre can be made available on request, but not the dependent modules or the interface. Theatre can be used by registered members of the Bioinformatics facilities of the Human Genome Mapping Project Resource Centre (HGMP-RC) at the following URL: http://www.hgmp.mrc.ac.uk/Registered/Webapp/theatre/.

The Common Gateway Interface (CGI) scripts developed for the Theatre server are written in PERL (17). The server runs on the Solaris operating system. UNIX shell scripts drive the selected analysis programs (Table 1); jobs are created and submitted to an in-house batch queuing system. This system distributes the processing over a server farm running the Solaris 8 operating system.

THEATRE: AN ILLUSTRATED CASE STUDY

Theatre produces two types of graphical output: a concise graphical display (Figs 1 and 3) and a detailed display (Figs 2 and 4). The concise display covers a single A4 page, while the detailed display shows the individual nucleotides and may run over many pages, depending on the sequence alignment lengths. A case study is presented describing the promoter region of the p53 gene in mammalian and puffer fish species. The p53 gene is a tumour suppressor that has a fundamental role in cell cycle control and division. The functions of transcription factors that bind to motifs in the p53 promoter and regulate the expression of the p53 gene have been experimentally identified and characterized in mouse (9), human colon cancer cells (22), rat (23) and the golden hamster (24). The p53 gene was identified in the genomic sequence from Fugu rubripes (Fugu) (25,26) and Tetraodon nigroviridis (Tetraodon) (27).

Figure 3. (A) The sequence alignment of p53 promoter regions predicted from Fugu and Tetraodon. The predicted binding sites searched for are identical to those discussed in the text (Fig. 1). The matrices from MatInspector typically incorporate the name of the transcription factor that binds to the site. The Tfscan binding sites incorporate the name of the gene whose expression is being affected (1). The SPTR BLAST matches are included in the display because the sequences include two protein coding exons plus one non-coding exon. The predicted transcription start site of the p53 gene is estimated to be at position 500 in the concise sequence alignment. (B) The key for (A).

Figure 2. (A) A detailed Theatre display showing details at the level of the base and the genomic features identified. (B) The key for (A).

Figure 1 shows the Theatre concise alignment of the p53 promoters from human, rat, mouse and the golden hamster. The regions comprise the first exon (non-coding) and the immediate 5′ flanking DNA. Each sequence is between 426 and 700 bases long and contains an upstream non-transcribed sequence ranging from 207 to 478 bases. The transcription start site of the p53 gene for rat, mouse and human (23) is at position 508 of the multiple sequence alignment (Fig. 2A). The EMBL accession numbers provided to Theatre as input can be found in the legend to Figure 1. The alignment was performed using ClustalW with the default parameters. The selected MatInspector binding sites are those that we expected to see in the p53 promoter (9,22–24). The GeneMark predictions of the open reading frames (ORFs) and SPTR BLAST matches were switched off, as the region examined is known not to include protein coding sequences. In the golden hamster promoter, a repetitive element was identified by the RepeatMasker program and highlighted in the Theatre graphical display. No CpG islands were predicted in any of the four sequences. Additionally, no conserved CAAT and TATA box motifs were predicted upstream of the transcription initiation site (Figs 1 and 2), as expected from previous studies (9,22–24). In this respect, MatInspector performed well. The threshold for consensus display is set to 75% to establish highly conserved and invariant features. The four sequences have highly conserved features illustrated in the consensus (Fig. 1A) and the detailed alignment (Fig. 2A). The transcriptional regulation of the murine p53 has been well studied and localized in the region surrounding the transcription initiation site (9,22–24). The location and order of seven invariant transcription factor binding sites are listed. Many of these correspond to the experimentally verified downstream sites.
The sites are listed together with their frequency and the site name in parentheses: NF1 (two, V$NF1_Q6); NFkB (two, V$NFKAPPAB_01); Sp1 (one, V$SP1_Q6); ETF (one, MOUSE$P53_07); c-myc/max (one, V$MYCMAX_02); USF (two, V$USF_C) and USF (two, V$USF_02). The c-myc/max proteins belong to the basic helix–loop–helix family of transcription factors and bind to a class of E-box that shares a signature motif consisting of the core hexanucleotide sequence CANNTG. The USF and NFkB sites are, to a large degree, palindromes, identified as pairs of binding sites in the same location that read in the forward and reverse orientations. The upstream ETF binding site is present and conserved in all sequences; however, the binding sites do not exactly align and the site does not appear in the consensus. The human, mouse and rat sequences have a conserved upstream PF1 binding site (MOUSE$P53_03) not shared by the hamster sequence. The hamster, mouse and rat sequences have a conserved downstream P300_01 site (P300_01) not found in the human sequence (Fig. 1A). Similarly, the upstream Sp1 motif is present in all sequences except rat. Such patterns of predicted transcription factor binding sites are highlighted and, in this case, they correspond well with experimentally defined protein binding sites (9,22–24).
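
The E-box core CANNTG mentioned above lends itself to a short illustration of why some sites are reported as a pair in the same location on both strands. The Python sketch below is a plain string search written for this purpose only; it does not reproduce MatInspector or Tfscan, and the function names and the 0-based coordinate convention are assumptions.

```python
import re

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

# Lookahead so that overlapping matches are all reported.
EBOX = re.compile(r"(?=(CA[ACGT]{2}TG))")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_ebox(seq):
    """Return sorted (start, strand, site) tuples for CANNTG matches.

    Start positions are 0-based and always refer to the forward strand.
    """
    seq = seq.upper()
    hits = [(m.start(), "+", m.group(1)) for m in EBOX.finditer(seq)]
    for m in EBOX.finditer(reverse_complement(seq)):
        start = len(seq) - (m.start() + 6)  # map back to forward coordinates
        hits.append((start, "-", m.group(1)))
    return sorted(hits)


if __name__ == "__main__":
    print(find_ebox("TTCACGTGAA"))
    print(find_ebox("CAAATG"))
```

Because the reverse complement of CANNTG again has the form CANNTG, every match on the forward strand is accompanied by a match at the same position on the reverse strand; the two site strings are identical only when the central dinucleotide makes the hexamer a true palindrome, as in CACGTG.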

Theatre is then used to display, in sequences extracted from the Fugu and Tetraodon genomes, the binding sites found in the mammalian sequences (Fig. 3). Unlike in the mammalian sequences, the position of the p53 promoter in these fish has not previously been predicted, identified or characterized. The draft genome sequences of Fugu and Tetraodon (http://fugu.hgmp.mrc.ac.uk/ and http://www.genoscope.cns.fr/externe/tetraodon/, respectively) were searched for contigs that contained the p53 gene. Two puffer fish p53 cDNA sequences, retrieved from public databases (SPTR: Q9W679 and EMBL: BU806111), were used to predict the transcription start site of the Fugu and Tetraodon p53 genes and the position of the coding sequence (Figs 3 and 4). The p53 cDNA sequences were searched using BLASTN and TBLASTX against the publicly available Tetraodon genome assemblies. The p53 promoter regions were predicted from Fugu in Scaffold_126 (release 2: EMBL: CAAB01000126) (26) and Scaffold_18 (release 3), and from Tetraodon in FS_CONTIG_1412_2 (release 6). Regions likely to be involved in regulating expression of the p53 gene were extracted from the genomic sequence using the extractseq program from EMBOSS (6), and the predicted promoter regions were used as input for analysis by the programs in Theatre (see legend to Fig. 3). We compare patterns in the binding sites predicted in the mammalian p53 promoters (Figs 1 and 2) with those in the two fish p53 promoters (Figs 3 and 4). All the genes possess a non-coding exon comprising exclusively 5′ untranslated sequence; in this respect the fish gene structures considered here are similar to the mammalian ones. There is a conserved CpG island predicted in the Fugu and Tetraodon promoters but absent from the mammalian promoters. Conserved within the puffer fish p53 promoters are four ETF signals (MOUSE$P53_07) predicted by Tfscan.
The MatInspector sites conserved in the promoter region are as follows, with the number of binding sites in parentheses: NF1 (two), CAAT (one) and NFkB (one). The sequences from the two taxonomic classes are displayed separately, mainly because the order and frequency of occurrence of sites differ significantly between the mammalian and puffer fish p53 promoters. The puffer fish promoter regions possess a conserved CAAT box and a CpG island, suggesting differences in regulation of the p53 gene compared with the mammalian species. The puffer fish p53 promoters share some similarity with the mammalian promoters in having conserved NF1, NFkB and ETF sites; however, the conserved NF1 and NFkB sites occur in a different order. NF1 is critical for basal expression of the mouse p53 gene, whilst the NFkB recognition site has been shown to be required for trans-activation of the murine p53 promoter in the presence of NFkB (1,4,9). The USF and c-Myc/Max factor binding sites are conserved in mammals but not in the fish sequences. Both these factors bind E-box sites and are implicated in increasing transcription of p53. The differences suggest that puffer fish have a different mechanism for control of p53 gene expression compared with mammals. There is also an island of predicted and conserved motifs in intron 2. The conserved motifs are a useful guide to targeting regions of interest that can be tested for function using experimental techniques. The nucleotide composition, nucleotide sequence biases and observed/expected ratios for the 16 dinucleotides for the six species are also given (Tables 2–6).

DISCUSSION

The preparation of quality figures to highlight annotation in aligned sequences can prove time-consuming and can involve using graphical drawing packages and searching through the output of analysis programs (28). Programs offering related functions such as comparative genomic sequence analysis and visualization include CINEMA (29), Alfresco (30), VISTA (31), PipMaker (32) and SynPlot (33). CINEMA (Colour INteractive Editor for Multiple Alignment) is a Java based tool for manipulating and generating aligned nucleotides and amino acids (29). Alfresco is a visualization tool developed in Java to allow comparative genome sequence analysis between two sequences (30). VISTA is software for visualizing global DNA sequence alignments of arbitrary length (31). PipMaker compares two long DNA sequences to identify conserved segments (32). SynPlot (33) is a tool similar in philosophy to PipMaker and VISTA that automates the graphical display of large-scale sequence alignments. These tools are widely used by the scientific community. They hold much promise for comparative genomics and in the search for conserved non-coding sequences to test for putative enhancer and/or promoter activity (7–11,33). These tools are, however, best suited to cases where there is a large degree of conservation of synteny between similar vertebrate genomes (e.g. mouse and human). The gene order (26,33–35) and the features in the promoter regions (36,37) are not always well conserved in the genomes of species from highly diverged taxonomical classes. This is notable between the Fugu and human genomes, where short sections of two to three adjacent genes are conserved in both species. If the gene orders differ in equivalent regions, the usefulness of these tools for comparing large genomic regions between species that are evolutionarily divergent will be reduced.

Theatre is intended to study genomic regions on a small to medium scale (i.e. 500 bp–10 kb) in detail and is well-suited for studying transcription factor binding sites in equivalent promoters, introns and other gene structures as illustrated in the case study. Using default parameters, Theatre is ideally suited for two or more genomic sequences that are fairly similar and approximately the same length. Theatre offers the user the choice of supplying an externally generated sequence alignment as an alternative to using ClustalW. The main aim of Theatre is to allow the user to look at the results of different programs and varying selections of features for display so that the results can be taken in at a glance. Further developments of this tool are being planned.

AVAILABILITY

Theatre is accessible for use by registered members of the Bioinformatics facilities of the UK HGMP-RC at http://www.hgmp.mrc.ac.uk/Registered/Webapp/theatre/. Registration is free to the academic community. Information regarding the registration process is available at the HGMP-RC's website at the following URL http://www.hgmp.mrc.ac.uk/. For more details contact: support@hgmp.mrc.ac.uk or yjedward@hgmp.mrc.ac.uk.

Table 3. Observed/expected dinucleotide ratio for four mammalian p53 promoters.

	ApA	ApC	ApG	ApT	CpA	CpC	CpG	CpT	GpA	GpC	GpG	GpT	TpA	TpC	TpG	TpT
hsantp53	1.23	0.91	1.10	0.86	1.28	1.09	0.54	1.12	1.10	0.98	1.37	0.59	0.43	0.95	1.13	1.33
rsantp53	1.24	0.97	1.28	0.60	1.05	1.01	0.57	1.30	0.94	1.08	1.23	0.74	0.78	0.93	1.03	1.24
mmantp53	1.06	0.94	1.40	0.67	1.14	0.99	0.43	1.38	1.07	0.94	1.41	0.61	0.73	1.09	0.93	1.19
ma08134	1.45	0.75	1.23	0.50	1.03	1.13	0.60	1.22	0.87	1.05	1.09	0.99	0.52	1.07	1.11	1.36
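
The text does not state the formula used for the observed/expected dinucleotide ratios in Table 3. The Python sketch below uses one common definition, in which the expected count of a dinucleotide XpY is the product of the two base frequencies multiplied by the number of dinucleotide positions; this choice, the function name and the key format are assumptions, so the values it produces need not match those reported by Theatre.

```python
from collections import Counter


def dinucleotide_obs_exp(seq):
    """Return observed/expected ratios for the 16 dinucleotides.

    Assumed definition: expected(XpY) = f(X) * f(Y) * (N - 1), where f is the
    single-base frequency and N the sequence length.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    base_freq = {b: seq.count(b) / n for b in "ACGT"}
    # Count every overlapping pair of adjacent bases.
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    ratios = {}
    for x in "ACGT":
        for y in "ACGT":
            expected = base_freq[x] * base_freq[y] * (n - 1)
            ratios[f"{x}p{y}"] = pairs[x + y] / expected if expected > 0 else 0.0
    return ratios


if __name__ == "__main__":
    ratios = dinucleotide_obs_exp("ACGT" * 25)
    print(round(ratios["CpG"], 2), ratios["ApA"])
```

Under this definition a ratio near 1 means the dinucleotide occurs about as often as the base composition alone would predict, which is the sense in which the low CpG values in the table indicate depletion.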

Table 4. Nucleotide composition for two puffer fish sequences.

Sequences	C+G %	A+T %	A %	C %	G %	T %	N %	Length (bases)
frp53	45.4	54.6	26.2	22.5	22.9	28.4	0.0	2672
tnp53	45.2	53.5	26.6	23.1	22.1	26.9	1.3	2464

Table 5. Observed/expected dinucleotide ratio for two puffer fish sequences.

	ApA	ApC	ApG	ApT	CpA	CpC	CpG	CpT	GpA	GpC	GpG	GpT	TpA	TpC	TpG	TpT
frp53	1.27	0.90	1.06	0.79	1.10	1.04	0.78	1.04	1.04	1.04	1.03	0.90	0.63	1.03	1.09	1.25
tnp53	1.30	0.88	1.03	0.83	1.03	1.10	0.88	1.03	1.10	1.13	0.99	0.85	0.64	0.98	1.13	1.31

ACKNOWLEDGEMENTS

We are grateful to our colleagues and collaborators (past and present) for helpful considerations and the authors of EMBOSS (CpGplot and Tfscan), ClustalW, GeneMark, RepeatMasker, BLAST and MatInspector for their software.

REFERENCES

1. Wingender,E., Chen,X., Fricke,E., Geffers,R., Hehl,R., Liebich,I., Krull,M., Matys,V., Michael,H., Ohnhäuser,R. et al. (2001) The TRANSFAC system on gene expression regulation. Nucleic Acids Res., 29, 281–283.
2. Praz,V., Perier,R., Bonnard,C. and Bucher,P. (2002) The Eukaryotic Promoter Database, EPD: new entry types and links to gene expression data. Nucleic Acids Res., 30, 322–324.
3. Kel-Margoulis,O.V., Kel,A.E., Reuter,I., Deineko,I.V. and Wingender,E. (2002) TRANSCompel: a database on composite regulatory elements in eukaryotic genes. Nucleic Acids Res., 30, 332–334.
4. Kolchanov,N.A., Ignatieva,E.V., Ananko,E.A., Podkolodnaya,O.A., Stepanenko,I.L., Merkulova,T.I., Pozdnyakov,M.A., Podkolodny,N.L., Naumochkin,A.N. and Romashchenko,A.G. (2002) Transcription Regulatory Regions Database (TRRD): its status in 2002. Nucleic Acids Res., 30, 312–317.
5. Quandt,K., Frech,K., Karas,H., Wingender,E. and Werner,T. (1995) Matind and MatInspector—new fast and versatile tools for detection of consensus matches in nucleotide-sequence data. Nucleic Acids Res., 23, 4878–4884.
6. Rice,P., Longden,I. and Bleasby,A.J. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet., 16, 276–277.
7. Wentworth,J.M., Schoenfeld,V., Meek,S., Elgar,G., Brenner,S. and Chatterjee,V.K. (1999) Isolation and characterisation of the retinoic acid receptor-alpha gene in the Japanese pufferfish, F.rubripes. Gene, 236, 315–323.
8. Hardison,R.C. (2000) Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet., 16, 369–372.
9. Reisman,D., Eaton,E., McMillin,D., Doudican,N.A. and Boggs,K. (2001) Cloning and characterization of murine p53 upstream sequences reveals additional positive transcriptional regulatory elements. Gene, 274, 129–137.
10. Tompa,M. (2001) Identifying functional elements by comparative DNA sequence analysis. Genome Res., 11, 1143–1144.
11. Cliften,P.F., Hillier,L.W., Fulton,L., Graves,T., Miner,T., Gish,W.R., Waterston,R.H. and Johnston,M. (2001) Surveying Saccharomyces genomes to identify functional elements by comparative DNA sequence analysis. Genome Res., 11, 1175–1186.
12. Bird,A., Tate,P., Xinsheng,N., Campoy,J., Meehan,R., Cross,S., Tweedie,S., Charlton,J. and Macleod,D. (1995) Studies of DNA methylation in animals. J. Cell Sci., Suppl., 19, 37–39.
13. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST, a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
14. Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
15. Borodovsky,M. and McIninch,J. (1993) GENEMARK—parallel gene recognition for both DNA strands. Comput. Chem., 17, 123–133.
16. Higgins,D.G., Thompson,J.D. and Gibson,T.J. (1996) Using Clustalw for multiple sequence alignments. Meth. Enzymol., 266, 383–402.
17. Wall,L., Christiansen,T. and Schwartz,R. (1996) Programming Perl, 2nd Edn. O'Reilly and Associates, Inc., Sebastopol, CA.
18. Stein,L. (1998) Official Guide to Programming with CGI.pm. Wiley & Sons, Inc., New York, NY.
19. Brudno,M., Kim,M.F., Do,C. and Batzoglou,S. (2002) The LAGAN Server, http://lagan.stanford.edu.
20. Brudno,M. and Morgenstern,B. (2002) Fast and sensitive alignment of large genomic sequences. Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB). IEEE Computer Society Press Inc., Los Alamitos, CA.
21. Stoesser,G., Baker,W., van den Broek,A., Camon,E., Garcia-Pastor,M., Kanz,C., Kulikova,T., Leinonen,R., Lin,Q., Lombard,V. et al. (2002) The EMBL Nucleotide Sequence Database. Nucleic Acids Res., 30, 21–26.
22. Benoit,V., Hellin,A.C., Huygen,S., Gielen,J., Bours,V. and Merville,M.P. (2000) Additive effect between NF-kappaB subunits and p53 protein for transcriptional activation of human p53 promoter. Oncogene, 19, 4787–4794.
23. Bienz-Tadmor,B., Zakut-Houri,R., Libresco,S., Givol,D. and Oren,M. (1985) The 5′ region of the p53 gene: evolutionary conservation and evidence for a negative regulatory element. EMBO J., 4, 3209–3213.
24. Albor,A., Laborda,J. and Notario,V. (1994) Cloning of the Syrian hamster p53 gene: structural and functional characterization of the upstream promoter region. Mol. Carcin., 11, 176–183.
25. Elgar,G., Clark,M.S., Meek,S., Smith,S., Warner,S., Edwards,Y.J.K., Bouchireb,N., Cottage,A., Yeo,G.S., Umrania,Y. et al. (1999) Generation and analysis of 25 Mb of genomic DNA from the puffer fish Fugu rubripes by sequence scanning. Genome Res., 9, 960–971.
26. Aparicio,S., Chapman,J., Stupka,E., Putnam,N., Chia,J.M., Dehal,P., Christoffels,A., Rash,S., Hoon,S. et al. (2002) Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science, 297, 1301–1310.
27. Crollius,H.R., Jaillon,O., Dasilva,C., Ozouf-Costaz,C., Fizames,C., Fischer,C., Bouneau,L., Billault,A., Quetier,F., Saurin,W. et al. (2000) Characterization and repeat analysis of the compact genome of the freshwater puffer fish Tetraodon nigroviridis. Genome Res., 10, 939–949.
28. Barton,G.J. (1993) ALSCRIPT—A tool to format multiple sequence alignments. Protein Eng., 6, 37–40.
29. Parry-Smith,D.J., Payne,A.W.R., Michie,A.D. and Attwood,T.K. (1998) Cinema—a novel colour interactive editor for multiple alignments. Gene, 221, GC57–GC63.
30. Jareborg,N. and Durbin,R. (2000) Alfresco—a workbench for comparative genomic sequence analysis. Genome Res., 10, 1148–1157.
31. Mayor,C., Brudno,M., Schwartz,J.R., Poliakov,A., Rubin,E.M., Frazer,K.A., Pachter,L.S. and Dubchak,I. (2000) VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics, 16, 1046–1047.
32. Schwartz,S., Zhang,Z., Frazer,K.A., Smit,A., Riemer,C., Bouck,J., Gibbs,R., Hardison,R. and Miller,W. (2000) PipMaker—a web server for aligning two genomic DNA sequences. Genome Res., 10, 577–586.
33. Gottgens,B., Gilbert,J.G., Barton,L.M., Grafham,D., Rogers,J., Bentley,D.R. and Green,A.R. (2001) Long-range comparison of human and mouse SCL loci: localized regions of sensitivity to restriction endonucleases correspond precisely with peaks of conserved noncoding sequences. Genome Res., 11, 87–97.
34. Sambrook,J.G., Russell,R., Umrania,Y., Edwards,Y.J.K., Campbell,R.D., Elgar,G. and Clark,M.S. (2002) Fugu orthologues of human major histocompatibility complex genes: a genome survey. Immunogenetics, 54, 367–380.
35. Smith,S.F., Snell,P., Gruetzner,F., Bench,A.J., Haaf,T., Metcalfe,J.A., Green,A.R. and Elgar,G. (2002) Analyses of the extent of shared synteny and conserved gene orders between the genome of Fugu rubripes and human 20q. Genome Res., 12, 776–784.
36. Miles,C., Elgar,G., Coles,E., Kleinjan,D.J., van Heyningen,V. and Hastie,N. (1998) Complete sequencing of the Fugu WAGR region from WT1 to PAX6: dramatic compaction and conservation of synteny with human chromosome 11p13. Proc. Natl Acad. Sci. USA, 95, 13068–13072.
37. Davidson,H., Taylor,M.S., Doherty,A., Boyd,A.C. and Porteous,D.J. (2000) Genomic sequence analysis of Fugu rubripes CFTR and flanking genes in a 60 kb region conserving synteny with 800 kb of human chromosome 7. Genome Res., 10, 1194–1203.
