Abstract
The analysis of gene regulatory networks has become one of the most challenging problems of the postgenomic era. Earlier we developed rSNP_Guide (http://util.bionet.nsc.ru/databases/rsnp.html), a computer system and database devoted to prediction of transcription factor (TF) binding sites (TF sites), which can be responsible for disease phenotypes. The prediction results were confirmed by 70 known relationships between TF sites and diseases, as well as by site-directed mutagenesis data. The rSNP_Guide is being investigated as a tool for TF site annotation. Previously analyzed and characterized cases of altered TF sites were used to annotate potential sites of the same type and at the same location in homologous genes. Based on 20 TF sites with known alterations in TF binding to DNA, we localized 245 potential TF sites in homologous genes. For these potential TF sites, rSNP_Guide estimates TF–DNA interaction according to three categories: ‘present’, ‘weak’, and ‘absent’. The significance of each assignment is statistically measured.
INTRODUCTION
Genome annotation, the computational prediction of unknown genes, regulatory regions and gene function at levels ranging from biochemical to phenotypic, has emerged as a central challenge of bioinformatics. Currently, the amount of sequence information doubles every seven months. Although a variety of tools for genome annotation have been developed to analyze this vast pool of data, there is still a need for new tools that provide detailed annotation and characterization of gene regulatory regions, particularly transcription factor (TF) binding sites (TF sites). An ample amount of experimental information about single nucleotide polymorphisms (SNPs) in regulatory gene regions has accumulated in GenBank (1), EMBL (2), HGMD (3), dbSNP (4), HGBASE (5), ALFRED (6) and OMIM (7). Genome variations in regulatory gene regions may manifest as the complete elimination of an existing natural TF site (8), the appearance of a new, spurious TF site (9) or, often, quantitative alterations of TF-binding efficiency (10,11). Since the process of transcription in eukaryotes involves a number of transcription factors, even a few SNPs or genome variations in regulatory gene regions may result in variations of the expression patterns of many genes. Therefore, mutations in regulatory gene regions may cause a wide spectrum of phenotypes, ranging from an increased risk of developing a disease to changes in the severity of a disease.
To search for TFs that bind to specific DNA sequences, researchers use software packages such as MatInspector (12), which recognize TF sites by the textual similarity of the analyzed DNA to known, experimentally identified TF sites. However, many experiments (8,13) have demonstrated that a mutation-altered TF site cannot be reliably recognized by its textual similarity to known TF sites alone. Recently, several new approaches to TF site annotation have been investigated. Some of them take into account additional knowledge of how a particular TF binds to DNA (13,14) or how tissue-specific transcription regulation is performed (15). In addition, methods for cross-species sequence comparison (phylogenetic footprinting) have been adopted to identify TF sites (16).
Based on new trends in TF site analysis, we have developed a computational system, rSNP_Guide (17–19), and adapted it to genome annotation. The main idea of our approach to TF site annotation is to include both sequences of known TF sites of different types (not only of the type investigated) and available experimental data on alterations in binding of mutated DNA to unknown TFs. This work focuses on the annotation of potential TF sites using models of experimentally characterized altered TF sites, derived by the rSNP_Guide system for recognizing TFs relevant to the known TF site. Based on 20 TF sites with known alterations in TF binding to DNA, we localized 245 potential TF sites in homologous genes. For these potential TF sites, rSNP_Guide estimates TF–DNA interaction according to three categories: ‘present’, ‘weak’, and ‘absent’. The significance of each assignment is statistically measured. The rSNP_Guide is available on the web at http://util.bionet.nsc.ru/databases/rsnp.html.
RECENT DEVELOPMENTS AND DATABASE CONTENT
In this work, the previously described rSNP_Guide method (17) has been applied to the annotation of potential TF sites of the same type and at the same location in homologous genes.
The rSNP_Guide method takes a set of known transcription factors and theoretically evaluates them for their ability to bind altered DNA, resulting in vectors of scores. Experimental data describing mutation-caused alterations in DNA binding to an unknown TF are formalized, resulting in vectors of values. Theoretical and experimental vectors of values are compared using a Euclidean distance measure. A TF from the theoretical set is then assigned to the binding site of interest, which has been altered by mutation. For a given TF site with a known SNP-disease association, when it is correctly predicted by rSNP_Guide, all the theoretical and experimental vectors are documented in the database, rSNP_Report (17). From this report entry, the corresponding Java applet for examining the phylogenetic footprints of this known TF site is automatically generated and stored in the knowledge base, rSNP_Tuning.
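The vector comparison described above can be illustrated with a minimal sketch. It assumes that each candidate TF is represented by a vector of theoretical binding scores (one value per DNA variant) and that the experimental observations are encoded as a numeric vector of the same length. The TF names, score values and numeric encoding below are hypothetical and are not taken from rSNP_Guide; the sketch shows only the generic idea of ranking candidates by Euclidean distance.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_candidate_tfs(theoretical, experimental):
    """Return (TF name, distance) pairs sorted from closest to farthest."""
    distances = {
        tf: euclidean_distance(scores, experimental)
        for tf, scores in theoretical.items()
    }
    return sorted(distances.items(), key=lambda item: item[1])


# Hypothetical theoretical scores for two DNA variants: [wild type, mutant].
theoretical = {
    "ER": [0.2, 0.9],
    "YY1": [0.8, 0.1],
    "OCT1": [0.5, 0.5],
}

# Hypothetical encoding of an experiment: no binding to the wild type (0.0),
# binding to the mutant (1.0).
experimental = [0.0, 1.0]

ranking = rank_candidate_tfs(theoretical, experimental)
best_tf, best_distance = ranking[0]
print(best_tf, round(best_distance, 3))
```

In this toy setting the candidate whose theoretical vector lies closest to the experimental vector is proposed as the TF most likely to bind the altered site.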
To apply rSNP_Guide to the annotation of potential sites in homologous genes, the following protocol was implemented. The DNA sequence from a homologous gene is treated as an altered TF site relative to the set of sites used to train rSNP_Guide. For these potential TF sites, rSNP_Guide estimates TF–DNA interaction according to three categories: ‘present’, ‘weak’ and ‘absent’. The significance of each assignment is then statistically measured. The results of annotation of potential TF sites are documented in the TFsite_Annotations database.
Based on 20 TF sites with known alterations in TF binding to DNA, we have localized 245 potential TF sites in homologous genes. For these potential TF sites rSNP_Guide estimates TF–DNA interaction according to three categories: ‘present’, ‘weak’ and ‘absent’. The significance of each assignment is statistically measured.
Figure 1 shows an example entry of the TFsite_Annotations database. This entry documents an annotation result obtained for the SNP −20C/A, found in the promoter of the human angiotensinogen gene (AGT), which is associated with an increased risk of primary hypertension and may lead to progressive renal disease (20). It was shown that the transcriptional activity of reporter constructs containing the human AGT promoter with nucleotide A at −20 is increased on co-transfection of an expression vector containing the human estrogen receptor-alpha coding sequence in HepG2 cells followed by estrogen treatment (21). Based on these data, rSNP_Guide predicts that the estrogen receptor (ER) preferably binds to this region of the promoter when nucleotide A is present at −20. This prediction is consistent with experimental results (21). The complete set of the initializing experimental data, intermediate computation estimates, and final prediction is documented in a previously introduced database, rSNP_Reports (17). We have now generated a Java script applet for the precise examination of the phylogenetic footprints of these sites, http://util.bionet.nsc.ru/databases/rsmp_tuning_m001.html.
Figure 1.
An example entry (fragment) of the sub-database TFsite_Annotations of the present rSNP_Guide release. Top-line has the hyperlinks to both main rSNP_Guide Home and Help pages as well as to all the related sub-database entries. Left column contains the field names hyper-linked to their Help pages. All the hyperlinks are bold-faced and underlined.
This case demonstrates the high sensitivity of rSNP_Guide in detecting potential TF sites. For example, the mutation −20C>A in the human AGT gene affects only one nucleotide of the palindrome GGTCAnnnTGACC, which is known as the consensus sequence of the estrogen responsive element (ERE). However, most estrogen-regulated genes contain imperfect, non-palindromic EREs (22). This could be why MatInspector (12), with its default thresholds, does not predict this spurious ERE in either the normal human AGT promoter or the hypertension-risk variant.
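Why a purely textual search can miss an imperfect ERE can be shown with a small sketch. It counts mismatches between a sequence window and the consensus GGTCAnnnTGACC quoted above, treating `n` as a wildcard. This is a generic consensus scan, not the algorithm used by MatInspector or rSNP_Guide, and both test sequences are hypothetical examples rather than AGT promoter sequences.

```python
ERE_CONSENSUS = "GGTCAnnnTGACC"  # 'n' matches any nucleotide


def count_mismatches(window, consensus=ERE_CONSENSUS):
    """Count positions where the window differs from the consensus."""
    if len(window) != len(consensus):
        raise ValueError("window and consensus must have the same length")
    return sum(
        1
        for base, ref in zip(window.upper(), consensus)
        if ref != "n" and base != ref
    )


def scan_for_consensus(sequence, consensus=ERE_CONSENSUS, max_mismatches=0):
    """Return (start, window, mismatches) for windows within the threshold."""
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = count_mismatches(window, consensus)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


# Hypothetical sequences: a perfect palindromic ERE and a copy that
# differs from the consensus at a single position.
perfect = "AAGGTCACTGTGACCTT"
imperfect = "AAGGTCACTGTGGCCTT"

print(scan_for_consensus(perfect))
print(scan_for_consensus(imperfect))
print(scan_for_consensus(imperfect, max_mismatches=1))
```

With a strict threshold the imperfect site is not reported at all, whereas tolerating one mismatch recovers it. Relaxing the threshold also admits more chance matches, which is one reason similarity-only recognition is unreliable for mutation-altered sites.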
Two genes orthologous to the human AGT gene were found in the EMBL databank, both in the rat and the mouse. Figure 1 shows the result of a comparison of the rat promoters with the human AGT promoter in the region between the TATA box and the transcription start. Binding of ER to this region of AGT genes in the rat is assigned by rSNP_Guide at the same level (‘weak’ significance level −0.0025) as binding of the ER to the human AGT promoter with nucleotide C at −20 (21). At the same time, the other assignments of ER binding can be considered at lower significance levels: ‘present’ at 0.005, ‘absent’ at 0.05. In the case of the mouse, the annotation result was similar, http://util.bionet.nsc.ru/databases/TFsite_Annotation_m001m.html. The coexistence of the three contradictory predictions is consistent with the approach developed for rSNP_Guide. It mirrors experimental studies of TF–DNA binding, where contradictory observations are commonplace owing to the variety of experimental conditions and methodology, particularly the differing sensitivity of experimental procedures. Our prediction of ‘weak’ binding of ER to the rat AGT promoter is consistent with experimental results on the effect of co-transfection of the orphan receptor Arp-1 on estrogen-induced activity of reporter constructs containing human and rat AGT promoters (23).
We intend to extend the variety of TF site types covered by the rSNP_Guide analysis. We are also planning to improve the automation of the system and to upgrade the user interface to make our approach to TF site annotation more convenient.
AVAILABILITY
The new rSNP_Guide version can be accessed at http://util.bionet.nsc.ru/databases/rsnp.html. Please email all rSNP_Guide comments, corrections and requests to Dr Julia Ponomarenko (jpon@bionet.nsc.ru; Tel: +7 3832333119; Fax: +7 3832331278). rSNP_Guide may not be included in other databases without the explicit permission of the authors. We kindly ask that this article be cited when reporting results based on rSNP_Guide usage.
ACKNOWLEDGEMENTS
This work was supported by the Russian Foundation for Basic Research (grants 01-04-49860 and 02-04-49485). The authors would like to thank Ms Cassie Ferguson (Sr Science Writer, San Diego Supercomputer Center, University of California at San Diego, La Jolla, California, USA) for contributing to the preparation of the manuscript in English.
REFERENCES
- 1. Benson,D., Karsch-Mizrachi,I., Lipman,D., Ostell,J., Rapp,B. and Wheeler,D. (2002) GenBank. Nucleic Acids Res., 30, 17–20.
- 2. Zdobnov,E., Lopez,R., Apweiler,R. and Etzold,T. (2002) The EBI SRS server-new features. Bioinformatics, 18, 1149–1150.
- 3. Krawczak,M., Ball,E., Fenton,I., Stenson,P., Abeysinghe,S., Thomas,N. and Cooper,D. (2000) Human gene mutation database-a biomedical information and research resource. Hum. Mutat., 15, 45–51.
- 4. Smigielski,E., Sirotkin,K., Ward,M. and Sherry,S.T. (2000) dbSNP: a database of single nucleotide polymorphisms. Nucleic Acids Res., 28, 352–355.
- 5. Brookes,A., Lehvaslaiho,H., Siegfried,M., Boehm,J., Yuan,Y., Sarkar,C., Bork,P. and Ortiga,F. (2000) HGBASE: a database of SNPs and other variations in and around human genes. Nucleic Acids Res., 28, 356–360.
- 6. Cheung,K., Osier,M., Kidd,J., Pakstis,A., Miller,P. and Kidd,K. (2000) ALFRED: an allele frequency database for diverse populations and DNA polymorphisms. Nucleic Acids Res., 28, 361–363.
- 7. McKusick,V. (1998) Mendelian Inheritance in Man. Catalogs of human genes and genetic disorders, The Johns Hopkins University Press, Baltimore, MD.
- 8. Vasiliev,G., Merkulov,V., Kobzev,V., Merkulova,T., Ponomarenko,M. and Kolchanov,N. (1999) Point mutations within 663–666 bp of intron 6 of the human TDO2 gene, associated with a number of psychiatric disorders, damage the YY-1 transcription factor binding site. FEBS Lett., 462, 85–88.
- 9. Knight,J., Udalova,I., Hill,A., Greenwood,B., Peshu,N., Marsh,K. and Kwiatkowski,D. (1999) A polymorphism that affects OCT-1 binding to the TNF promoter region is associated with severe malaria. Nature Genet., 22, 145–150.
- 10. Tsutsumi-Ishii,Y., Tadokoro,K., Hanaoka,F. and Tsuchida,N. (1995) Response of heat shock element within the human HSP70 promoter to mutated p53 genes. Cell Growth Differ., 6, 1–8.
- 11. Langdon,S. and Kaufman,R. (1998) Gamma-globin gene promoter elements required for interaction with globin enhancers. Blood, 91, 309–318.
- 12. Quandt,K., Frech,K., Karas,H., Wingender,E. and Werner,T. (1995) MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res., 23, 4878–4884.
- 13. Roulet,E., Bucher,P., Schneider,R., Wingender,E., Dusserre,Y., Werner,T. and Mermod,N. (2000) Experimental analysis and computer prediction of CTF/NFI transcription factor DNA binding sites. J. Mol. Biol., 297, 833–848.
- 14. Mandel-Gutfreund,Y., Baron,A. and Margalit,H. (2001) A structure-based approach for prediction of protein binding sites in gene upstream regions. Pac. Symp. Biocomput., 139–150.
- 15. Wasserman,W. and Fickett,J. (1998) Identification of regulatory regions which confer muscle-specific gene expression. J. Mol. Biol., 278, 167–181.
- 16. Wasserman,W., Palumbo,M., Thompson,W., Fickett,J. and Lawrence,C. (2000) Human-mouse genome comparisons to locate regulatory sites. Nature Genet., 26, 225–228.
- 17. Ponomarenko,J., Merkulova,T., Vasiliev,G., Levashova,Z., Orlova,G., Lavryushev,S., Fokin,O., Ponomarenko,M., Frolov,A. and Sarai,A. (2001) rSNP_Guide, a database system for analysis of transcription factor binding to target sequences: application to SNPs and site-directed mutations. Nucleic Acids Res., 29, 312–316.
- 18. Ponomarenko,J., Merkulova,T., Orlova,G., Fokin,O., Gorshkova,E. and Ponomarenko,M. (2002) Mining DNA sequences to predict sites which mutations cause genetic diseases. Knowledge-Based Systems, 15, 225–233.
- 19. Ponomarenko,J., Orlova,G., Merkulova,T., Gorshkova,E., Fokin,O., Vasiliev,G., Frolov,A. and Ponomarenko,M. (2002) rSNP_Guide: An integrated database-tools system for studying SNPs and site-directed mutations in transcription factor binding sites. Hum. Mutat., 20, 239–248.
- 20. Jeunemaitre,X., Soubrier,F., Kotelevtsev,Y., Lifton,R., Williams,C., Charru,A., Hunt,S., Hopkins,P., Williams,R., Lalouel,J. et al. (1992) Molecular basis of human hypertension: role of angiotensinogen. Cell, 71, 169–180.
- 21. Zhao,Y., Zhou,J., Narayanan,C., Cui,Y. and Kumar,A. (1999) Role of C/A polymorphism at −20 on the expression of human angiotensinogen gene. Hypertension, 33, 108–115.
- 22. Klinge,C. (2001) Estrogen receptor interaction with estrogen response elements. Nucleic Acids Res., 29, 2905–2919.
- 23. Narayanan,C., Cui,Y., Zhao,Y., Zhou,J. and Kumar,A. (1999) Orphan receptor Arp-1 binds to the nucleotide sequence located between TATA box and transcriptional initiation site of the human angiotensinogen gene and reduces estrogen induced promoter activity. Mol. Cell Endocrinol., 148, 79–86.