Nucleic Acids Research. 2002 Jan 1;30(1):195–199. doi: 10.1093/nar/30.1.195

SELEX_DB: a database on in vitro selected oligomers adapted for recognizing natural sites and for analyzing both SNPs and site-directed mutagenesis data

Julia V Ponomarenko, Galina V Orlova, Anatoly S Frolov, Mikhail S Gelfand, Mikhail P Ponomarenko
PMCID: PMC99084  PMID: 11752291

Abstract

SELEX_DB is an online resource containing both the experimental data on in vitro selected DNA/RNA oligomers (aptamers) and the applets for recognition of these oligomers. Since in vitro experimental data are evidently system-dependent, the new release of the SELEX_DB has been supplemented by the database SYSTEM storing the experimental design. In addition, the recognition applet package, SELEX_TOOLS, applying in vitro selected data to annotation of the genome DNA, is accompanied by the cross-validation test database CROSS_TEST discriminating the sites (natural or other) related to in vitro selected sites out of random DNA. By cross-validation testing, we have unexpectedly observed that the recognition accuracy increases with the growth of homology between the training and test sets of protein binding sequences. For natural sites, the recognition accuracy was lower than that for the nearest protein homologs and higher than that for distant homologs and non-homologous proteins binding the common site. The current SELEX_DB release is available at http://wwwmgs.bionet.nsc.ru/mgs/systems/selex/.

INTRODUCTION

In vitro selection of oligomers binding target proteins is a novel technology, intensively developed during the last decade, for sieving a pool of synthetic oligomers through repeated cycles of PCR-amplification and protein-binding selection (1).

Earlier, following the Human Genome Annotation initiative, we developed and introduced the first version of the SELEX_DB database on oligomers selected in vitro, supplied with web-available applets for site recognition (2). Now, in the post-genome era, single nucleotide polymorphism (SNP) analysis has become a key field of intense research: over 2.84 million SNPs are currently documented (3) and 1.42 million of them are already mapped (4). As is now known, a disease may be caused not only by a mutation disrupting a regulatory site, but also by the appearance of a spurious protein-binding site that alters a gene network [e.g. the substitution –376G→A in the human TNF gene promoter creates a novel OCT site associated with severe malaria (5)]. With this in mind, in vitro selected aptamers could be very informative for SNP analysis in cases where novel protein-binding sites appear due to genome mutations and cause disease.

At the same time, in prokaryotes, the differences in nucleotide-position frequency matrices generated from in vitro selected and natural sites have been comprehensively demonstrated (6). In addition, the positional Information Content matrices of in vitro selected aptamers were found to be correlated with the degree of protein binding, whereas no correlation was observed for the corresponding natural sites (7). This means that, in prokaryotes, natural sites were selected in vivo according to their biological activity, but not by their affinity to the protein. In eukaryotes, the relationship between in vivo and in vitro selections seems to be very complex. On the one hand, the in vitro selected TBP-binding DNAs provide the natural TATA-box activity (8). Moreover, the homologous c-Myb and v-Myb proteins, the minimal Myb/DNA-binding domain and even the Myb-enriched cell nuclear extract have selected, in vitro, significantly similar aptamers that were also similar to the natural c-Myb sites (9). On the other hand, in vitro selected YY1-binding DNAs, inserted into plasmids and transfected into various cells (‘plasmid+cell’ system), repress the reporter gene (10) in a way that supports the conclusion that the YY1-binding strength and the degree of repression do not correlate. Moreover, for these in vitro selected YY1-binding aptamers, the YY1-caused repression measured in vivo in one ‘plasmid+cell’ system does not correlate with the corresponding value in other ‘plasmid+cell’ systems (10). Since in vitro selected aptamers are evidently dependent on the design of the selection experiment (6,7,10), a fundamental question arises about the applicability of in vitro selected aptamers to the analysis of natural genes (1).

That is why the new release of the SELEX_DB has been supplemented by two databases, SYSTEM (11,12) and CROSS_TEST, storing both the experimental systems and their cross-validation tests. By the cross-validation testing, we have unexpectedly observed that for a fixed protein-binding site the recognition accuracy increases with the growth of homology between the binding sequences of target and test proteins. For natural sites, the recognition accuracy was less than that for the closest protein homologs and higher than that for the distant homologs and non-homologous proteins binding the common site. The current SELEX_DB release is available at http://wwwmgs.bionet.nsc.ru/mgs/systems/selex/.

RECENT DEVELOPMENTS

Since many recent experiments have comprehensively demonstrated that in vitro selected aptamers depend on the design of the selection system (6,7,10), the novel SELEX_DB release has been, for the first time, supplemented by the database SYSTEM describing the experiments.

As shown in Figure 1, the same reasoning led us to introduce the database SYSTEM into two other database systems: ACTIVITY, compiling the sequence-activity relationships (11), and rSNP_Guide, storing the site-directed mutagenesis and SNP-referred data (12). Since in vitro experiments are a simplification of in vivo phenomena, SYSTEM informs the user about the limitations of the in vitro data, as well as about their interpretations and regularities. To make these limitations more readable, a SYSTEM entry has a full-text format. Figure 2 shows how each SYSTEM entry cites the goals of the study and the conclusions made by the authors, as well as the experimental details and selection techniques to be taken into account during analysis of gene structure, function, regulation, expression and variation.

Figure 1.


Scheme of SELEX_DB components (blocked in a solid triangle). Two novel databases SYSTEM and CROSS_TEST (bold triangle) consolidate the novel SELEX_DB release on in vitro selected DNA/RNA together with our earlier presented databases ACTIVITY, on sequence–activity relationships (dotted triangle) (11), and rSNP_Guide, on disease-associated SNPs and site-directed mutagenesis (broken triangle) (12).

Figure 2.


Example of a SYSTEM entry. The full-text field names are in bold. The hyperlinks to the User’s Help and to the other related web-resources are underlined.

While in vitro selected data obtained for a particular site should be viewed in light of their dependence on the experimental design, the regularities gleaned from these data can be consolidated with one another, and with those obtained for natural sites, by cross-validation testing. This is the main idea of CROSS_TEST, the second database supplementing the new version of SELEX_DB (Fig. 1). Figure 3 exemplifies a CROSS_TEST entry. This entry contains hyperlinks to the initial in vitro selected data (DR SELEX_DB, S00J0031) and to the data regularities represented in nucleotide-position frequency matrix form (DR SELEX_TOOLS, S00J0031), as we have described in more detail previously (2). Next, the entry contains information about a relevant compilation of natural sites (DR SAMPLES; YY1) used to cross-validate the regularities found in vitro. These data can be used to identify natural sites in genomic DNA sequences. To this end, the field ‘C-CODE’ provides a hyperlink to a C-applet performing this cross-validation test. In addition, the entry contains the test results, documented as follows. For true sites, the mean and standard deviation of the site recognition score and the ‘false negative’ error rate are given in the field ‘ST’. The analogous statistics for the control data, represented by 8000 random DNA sequences, are stored in the field ‘RT’. The χ² estimate (field ‘XI’) and its significance level α (field ‘AL’) are also recorded. In this manner, each CROSS_TEST entry statistically characterizes how well the regularity revealed by in vitro selected data corresponds to various related natural sites and/or other in vitro selected data.
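The statistics stored in a CROSS_TEST entry can be illustrated with a minimal Python sketch. Only the field names ‘ST’, ‘RT’, ‘XI’ and ‘AL’ come from the entry format described above; the function names, the score threshold, the reading of the control error as a ‘false positive’ rate and the use of a 2×2 Pearson χ² test with one degree of freedom are assumptions made for illustration, and the actual C-applet may compute these values differently.

```python
import math
from statistics import mean, pstdev


def summarize(scores, threshold, expect_sites):
    """Mean, SD and error rate of recognition scores at a given threshold."""
    recognized = sum(score >= threshold for score in scores)
    # Missed true sites are false negatives; recognized random
    # sequences are false positives (assumed definition).
    errors = len(scores) - recognized if expect_sites else recognized
    return {
        "mean": mean(scores),
        "sd": pstdev(scores),
        "error_rate": errors / len(scores),
        "recognized": recognized,
    }


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns the statistic and its significance level for 1 degree of freedom.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0, 1.0  # degenerate table: no evidence of association
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def cross_test(site_scores, random_scores, threshold):
    """Summarize true-site and random-control scores in CROSS_TEST-like fields."""
    st = summarize(site_scores, threshold, expect_sites=True)
    rt = summarize(random_scores, threshold, expect_sites=False)
    xi, al = chi_square_2x2(
        st["recognized"], len(site_scores) - st["recognized"],
        rt["recognized"], len(random_scores) - rt["recognized"],
    )
    return {"ST": st, "RT": rt, "XI": xi, "AL": al}
```

In such a sketch, `site_scores` would hold the recognition scores of the natural sites and `random_scores` those of the random control sequences, both produced by the same recognition tool.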

Figure 3.


Example of a CROSS_TEST entry (a fragment).

DATABASE CONTENT

Because previous studies (6,7,10) left it unclear whether in vitro selected data are applicable to site recognition, the previously developed databases SELEX_DB (116 entries), SELEX_BIB (70 entries) and SELEX_TOOLS (32 entries) were only slightly extended. To address the above problem, we have chosen only characteristic examples of in vitro selected aptamers and additionally documented their experimental design (the first new database SYSTEM, 29 entries; 13). With these chosen aptamer sets and the corresponding natural site sets (the database SAMPLES, 77 entries), the cross-validation tests were performed and documented (the second novel database CROSS_TEST, 27 entries). These 27 entries contain over 100 individual cross-validation tests, the results of which have clearly demonstrated that, for a given natural site, the recognition accuracy was lower than that for the nearest protein homologs and higher than that for the distant homologs and non-homologous proteins binding this site. These results, obtained in a eukaryotic system, contrast with the conclusion of Shultzaberger and Schneider (7) that in vitro selection in prokaryotes does not always mimic natural evolution. This difference may reflect different evolutionary strategies of in vivo selection. In prokaryotes, for example, selection optimizes the biological activity of individual DNA–protein complexes (6,7), whereas in eukaryotes, regulation involves multi-component transcription machinery, which is selected by both the individual DNA–protein affinity and the integral characteristics of the protein–protein complexes. Thus, in eukaryotes, in vitro selected data appear to be applicable to recognition of the natural sites. This observation can be useful for applying in vitro selected data to genome annotation and to the analysis of SNPs and site-directed mutagenesis.

HOW TO USE

The user must first select a sequence to be analyzed. For example, to use CROSS_TEST for verifying that the site-directed 4 bp substitution ‘–302CCAT→GGTA’ within the human topoisomerase III gene promoter damages its natural YY1 site (14), the sequence ‘cctaggacgc cacaacccgg atcctgctac cgcggcgccg CCAT cttgacatca catgactcct ggttgtccgc gccgcgtgac’ centered at this YY1 site (bold) should be prepared (EMBL accession no. AF026813, between positions 921 and 1004, which corresponds to the promoter fragment from –343 to –259 bp), as well as its mutant (underlined) variant ‘cctaggacgc cacaacccgg atcctgctac cgcggcgccg GGTA cttgacatca catgactcct ggttgtccgc gccgcgtgac’. With SELEX_DB (http://wwwmgs.bionet.nsc.ru/mgs/systems/selex/) loaded, the user clicks its textual or icon hyperlink to CROSS_TEST and, using the widely accepted SRS-query search, selects the entry ‘TEST-S00J0031’ (Fig. 3). When the hyperlink ‘WW TOOLS’ is selected, the tools optimized upon in vitro selected aptamers binding to the GST–YY1 fusion protein target appear. After that, the user selects the option ‘Natural site YY1’ and enters the sequence variants to be analyzed into the input box. Then the button ‘Execute’ is clicked, as exemplified and illustrated in our previous paper (2) describing the first release of SELEX_DB. Clicking brings up a screen with a graphical representation of the YY1 site recognition score, illustrated in Figure 4A (broken line, the natural variant; solid line, the mutant variant). As shown, the YY1 site (open circle) is successfully recognized (solid arrow) in the natural sequence and lost in the mutant variant. One ‘false positive’ YY1 site (broken arrow) is also recognized by this individual tool.
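The comparison of the natural and mutant variants can also be sketched offline in Python. The sequence, the position of the CCAT core and the CCAT→GGTA substitution are taken from the example above; everything else is an assumption for illustration. In particular, `TOY_SITES` is an invented placeholder alignment (not SELEX_DB data), and the mononucleotide log-odds matrix with a uniform background, scanned on the forward strand only, is a generic stand-in rather than the method implemented by the SELEX_TOOLS applets.

```python
import math

# Natural promoter fragment from the example (84 bp, CCAT core at index 40).
NATURAL = ("cctaggacgc cacaacccgg atcctgctac cgcggcgccg CCAT "
           "cttgacatca catgactcct ggttgtccgc gccgcgtgac").replace(" ", "").upper()


def substitute(seq, start, ref, alt):
    """Replace `ref` by `alt` at `start`, checking the reference bases first."""
    if seq[start:start + len(ref)] != ref:
        raise ValueError("reference bases do not match the sequence")
    return seq[:start] + alt + seq[start + len(ref):]


MUTANT = substitute(NATURAL, 40, "CCAT", "GGTA")

# Invented toy alignment used only to make the sketch runnable.
TOY_SITES = ["GCCATCTT", "GCCATATT", "ACCATCTT", "GCCATTTT"]


def log_odds_matrix(sites, pseudocount=0.5):
    """Position-specific log2-odds scores against a uniform background."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return matrix


def score_profile(seq, matrix):
    """Score every window of the sequence (forward strand only)."""
    width = len(matrix)
    return [
        sum(matrix[i][seq[pos + i]] for i in range(width))
        for pos in range(len(seq) - width + 1)
    ]


matrix = log_odds_matrix(TOY_SITES)
natural_profile = score_profile(NATURAL, matrix)
mutant_profile = score_profile(MUTANT, matrix)
best = natural_profile.index(max(natural_profile))
print(best, natural_profile[best], mutant_profile[best])
```

Comparing the two profiles at the window overlapping the substitution shows whether the score at the site drops in the mutant, which is the same question that the graphical output of the web tool answers.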

Figure 4.


Example of how to use CROSS_TEST. The natural YY1 site recognition Score profiles calculated by many tools optimized upon the following training data sets: (A) in vitro selected aptamers binding to GST–YY1 fusion protein (CROSS_TEST entry ‘TEST-S00J0031’); (B) in vitro selected aptamers of the fixed core CCAT binding to YY1 protein (CROSS_TEST entry ‘TEST-S00J008a’); (C) in vitro selected aptamers of the fixed core ACAT binding to YY1 protein (CROSS_TEST entry ‘TEST-S00J008b’); (D) natural YY1 sites (SAMPLES entry ‘YY1’); (E) averaged YY1 recognition Score profiles. Broken line, the natural sequence ‘cctaggacgc cacaacccgg atcctgctac cgcggcgccg CCAT cttgacatca catgactcct ggttgtccgc gccgcgtgac’ of the human topoisomerase III gene promoter (EMBL accession no. AF026813, in-between positions 921 and 1004) with the center of the known YY1 site (bold) (14) marked by open circle. Solid line, the mutant sequence with the site-directed 4 bp substitution ‘CCAT→GGTA’ damaging the YY1 site (14). As shown, the true recognitions (solid arrows) confirm each other, whereas all ‘false positives’ (broken arrows) do not. As the result of averaging, the true YY1 recognition stays stable, while ‘false positive’ errors are discarded.

For our sequence variants, the recognition made by the tools optimized upon in vitro selected aptamers (fixed core ‘CCAT’ binding the YY1 protein target) is demonstrated in Figure 4B. The results of recognition of the true YY1 site are in agreement, whereas the ‘false positive’ results are not. Analogously, in the case of the tools optimized upon the YY1-binding aptamers of the fixed core ‘ACAT’ (Fig. 4C), the same regularity appears (i.e. the ‘true’ predictions support each other, the ‘false positives’ do not). In the control (Fig. 4D), with the tools optimized upon the natural YY1 sites, the same regularity appears. As follows from these figures, the true recognition result is the same for all tools referring to various in vitro selected data, but the ‘false positive’ result is specific to the tool used.

The above regularity points to the fact that, when a given site is recognized by multiple tools, the true results support each other. In contrast, the ‘false positive’ results made by different tools contradict each other. Exploiting this observation in our example, we have averaged the corresponding recognition scores; the mean profiles are shown in Figure 4E. It is clear that the ‘false positive’ results cancel each other out, whereas the true YY1 site recognition stays stable.
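The averaging step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used to produce Figure 4E: the profiles are assumed to be of equal length and on a comparable scale, the threshold is arbitrary, and the three toy profiles are invented numbers in which the true site sits at the same position for every tool while each ‘false positive’ sits at a different one.

```python
def average_profiles(profiles):
    """Position-wise mean of several equal-length score profiles."""
    if not profiles:
        raise ValueError("at least one profile is required")
    length = len(profiles[0])
    if any(len(profile) != length for profile in profiles):
        raise ValueError("all profiles must have the same length")
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def peaks(profile, threshold):
    """Positions whose score reaches the threshold."""
    return [pos for pos, score in enumerate(profile) if score >= threshold]


# Invented toy profiles: the true site is at position 3 in all of them.
tool_a = [0.1, 0.9, 0.1, 0.8, 0.1, 0.1]  # false positive at position 1
tool_b = [0.1, 0.1, 0.1, 0.9, 0.8, 0.1]  # false positive at position 4
tool_c = [0.8, 0.1, 0.1, 0.9, 0.1, 0.1]  # false positive at position 0

mean_profile = average_profiles([tool_a, tool_b, tool_c])
print(peaks(tool_a, 0.7), peaks(mean_profile, 0.7))
```

Each individual toy profile passes the threshold at two positions, whereas the mean profile passes it only at the shared position, which mirrors the behaviour described for the averaged YY1 profiles.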

Since the ‘false positive’ over-recognition is the main problem of almost any recognition tool (1), the new release of SELEX_DB is aimed at reducing the ‘false positive’ rate for a given natural site by applying many recognition tools optimized on independent in vitro selected data. This novelty will be useful for both Genome Annotation and analysis of SNP and site-directed mutagenesis.

AVAILABILITY

The SELEX_DB is available through the web: http://wwwmgs.bionet.nsc.ru/mgs/systems/selex/. Please send all SELEX_DB applications, comments, corrections and requests to Mrs J. Ponomarenko (Tel: +7 383 2 333 119; Fax: +7 383 2 331 278; Email: jpon@bionet.nsc.ru). No inclusion of the SELEX_DB into other databases is permitted without explicit permission of the authors. We ask that this article be cited when reporting results based on SELEX_DB usage.


ACKNOWLEDGEMENTS

The work is supported by the grant RFBR-01-04-49860. M.S.G. was partially supported by grants from HHMI (55000309), INTAS (99-1476) and the Ludwig Cancer Research Institute.

REFERENCES

1. Roberts,R.W. and Ja,W.W. (1999) In vitro selection of nucleic acids and proteins: what are we learning? Curr. Opin. Struct. Biol., 9, 521–529.
2. Ponomarenko,J.V., Orlova,G.V., Ponomarenko,M.P., Lavryushev,S.V., Frolov,A.S., Zybova,S.V. and Kolchanov,N.A. (2000) SELEX_DB: an activated database on selected randomized DNA/RNA sequences addressed to genomic sequence annotation. Nucleic Acids Res., 28, 205–208.
3. Marth,G., Yeh,R., Minton,M., Donaldson,R., Li,Q., Duan,S., Davenport,R., Miller,R.D. and Kwok,P.-Y. (2001) Single-nucleotide polymorphisms in the public domain: how useful are they? Nature Genet., 27, 371–372.
4. Sachidanandam,R., Weissman,D., Schmidt,S.C., Kakol,J.M., Stein,L.D., Marth,G., Sherry,S., Mullikin,J.C., Mortimore,B.J., Willey,D.L. et al. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409, 928–933.
5. Knight,J., Udalova,I., Hill,A., Greenwood,B., Peshu,N., Marsh,K. and Kwiatkowski,D. (1999) A polymorphism that affects OCT-1 binding to the TNF promoter region is associated with severe malaria. Nature Genet., 22, 145–150.
6. Robison,K., McGuire,A.M. and Church,G.M. (1998) A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome. J. Mol. Biol., 284, 241–254.
7. Shultzaberger,R.K. and Schneider,T.D. (1999) Using sequence logos and informational analysis of Lrp DNA binding sites to investigate discrepancies between natural selection and SELEX. Nucleic Acids Res., 27, 882–887.
8. Hardenbol,P., Wang,J.C. and van Dyke,M.W. (1997) Identification of preferred hTBP DNA binding sites by the combinatorial method REPSA. Nucleic Acids Res., 25, 3339–3344.
9. Weston,K. (1992) Extension of the DNA binding consensus of the chicken c-Myb and v-Myb proteins. Nucleic Acids Res., 20, 3043–3049.
10. Hyde-DeRuyscher,R., Jennings,E. and Shenk,T. (1995) DNA binding sites for the transcriptional activator/repressor YY1. Nucleic Acids Res., 23, 4457–4465.
11. Ponomarenko,J., Furman,D., Frolov,A., Podkolodny,N., Orlova,G., Ponomarenko,M., Kolchanov,N. and Sarai,A. (2001) ACTIVITY: a database on DNA/RNA sites activity adapted to apply sequence-activity relationships from one system to another. Nucleic Acids Res., 29, 284–287.
12. Ponomarenko,J., Merkulova,T., Vasiliev,G., Levashova,Z., Orlova,G., Lavryushev,S., Fokin,O., Ponomarenko,M., Frolov,A. and Sarai,A. (2001) rSNP_Guide, a database system for analysis of transcription factor binding to target sequences: application to SNPs and site-directed mutations. Nucleic Acids Res., 29, 312–316.
13. Ponomarenko,M.P., Ponomarenko,J.V., Frolov,A.S., Podkolodnaya,O.A., Vorobyev,D.G., Kolchanov,N.A. and Overton,G.C. (1999) Oligonucleotide frequency matrices addressed to recognizing functional DNA sites. Bioinformatics, 15, 631–643.
14. Kim,J.C., Yoon,J.B., Koo,H.S. and Chung,I.K. (1998) Cloning and characterization of the 5′-flanking region for the human topoisomerase III gene. J. Biol. Chem., 273, 26130–26137.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
