Abstract
The 5′ and 3′ untranslated regions of eukaryotic mRNAs may play a crucial role in the regulation of gene expression by controlling mRNA localization, stability and translational efficiency. For this reason we developed UTRdb, a specialized, non-redundant database of 5′ and 3′ untranslated sequences of eukaryotic mRNAs. UTRdb entries are enriched with specialized information not present in the primary databases, including the presence of nucleotide sequence patterns already demonstrated by experimental analysis to have some functional role. All these patterns have been collected in the UTRsite database so that it is possible to search any input sequence for the presence of annotated functional motifs. Furthermore, UTRdb entries have been annotated for the presence of repetitive elements. All internet resources implemented for retrieval and functional analysis of 5′ and 3′ untranslated regions of eukaryotic mRNAs are accessible at http://bigarea.area.ba.cnr.it:8000/EmbIT/UTRHome/
INTRODUCTION
Understanding the basic mechanisms of cell growth, differentiation and response to environmental stimuli, i.e., the program controlling the temporal and spatial order of molecular events, is becoming a real challenge in Molecular Biology. Indeed, although most of the regulatory elements are thought to be embedded in the non-coding part of the genomes, nucleotide databases are biased by the presence of expressed sequences mostly corresponding to the protein coding portion of the genes. Among non-coding regions, the 5′ and 3′ untranslated regions (5′-UTR and 3′-UTR) of eukaryotic mRNAs have often been experimentally demonstrated to contain sequence elements crucial for many aspects of gene regulation and expression (1–7).
The main functional roles so far demonstrated for 5′- and 3′-UTR sequences are: (i) control of mRNA cellular and subcellular localization (4,7,8); (ii) control of mRNA stability (1,9); and (iii) control of mRNA translation efficiency (10,11).
Several regulatory signals have already been identified in 5′- and 3′-UTR sequences. They usually correspond to short oligonucleotide tracts, in some cases able to fold into specific secondary structures, which act as binding sites for various regulatory proteins.
The analysis of large collections of functionally equivalent sequences (12,13), such as 5′- and 3′-UTR sequences, could indeed be very useful for defining their structural and compositional features as well as for searching for putative function-associated sequence patterns (14–16). For this reason we constructed UTRdb, a specialized, non-redundant collection of 5′- and 3′-UTR sequences from eukaryotic mRNAs.
UTRdb entries have been enriched with specialized information not present in the primary databases, including the presence of sequence patterns demonstrated by experimental evidence to play some functional role. Additionally, because ~10% of mammalian mRNAs contain repetitive elements in their UTRs (17) which are not usually annotated in the original records, we decided to include this information in our database.
We also created UTRsite, a collection of functional sequence patterns located in the 5′- or 3′-UTR sequences which could prove very useful for automatic annotation of anonymous sequences generated by sequencing projects, as well as for finding previously undetected signals in known gene sequences.
ASSEMBLING UTRdb COLLECTIONS
The specialized database of UTR sequences was generated by UTRdb_gen, a computer program we devised for this task. Eight sequence collections were generated for both 5′- and 3′-UTR sequences, one for each of the eukaryotic divisions of the EMBL/GenBank nucleotide database, namely: (i) Human; (ii) Rodent; (iii) Other mammal; (iv) Other vertebrate; (v) Invertebrate; (vi) Plant; (vii) Fungi; and (viii) Patent.
By accurately parsing the Feature Table of the relevant EMBL entries, UTRdb_gen automatically generates the various UTRdb collections. Although the feature keys ‘5′UTR’ and ‘3′UTR’ are valid features for EMBL/GenBank entries, only a small percentage of the entries are adequately annotated. Indeed, of the 120 767 primary entries from which UTRdb_gen was able to extract 5′- or 3′-UTR sequences, only 15.8% contained the 5′UTR or 3′UTR feature key in the corresponding EMBL entry. UTRdb_gen is able to define UTR regions even when these keys are not reported in the primary entry by applying predefined syntactic parsing rules to other relevant feature keys, such as mRNA, CDS, exon, intron, etc.
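As a rough illustration of how UTR boundaries can be inferred when the 5′UTR and 3′UTR keys are missing, the following minimal Python sketch derives them from the mRNA and CDS features alone. It is not the UTRdb_gen implementation: the feature lines are hypothetical, only plain `start..end` locations on the forward strand are handled, and the CDS is assumed to include the stop codon.

```python
import re

# Hypothetical, simplified feature-table lines in EMBL flat-file style.
# Real entries also contain joins, complements and partial locations,
# none of which are handled in this sketch.
FEATURE_LINES = [
    "FT   mRNA            1..1200",
    "FT   CDS             151..900",
    'FT                   /product="hypothetical protein"',
]

SIMPLE_LOCATION = re.compile(r"^(\d+)\.\.(\d+)$")


def parse_features(lines):
    """Return {feature_key: (start, end)} for simple 1-based inclusive locations."""
    features = {}
    for line in lines:
        if not line.startswith("FT"):
            continue
        fields = line.split()
        if len(fields) != 3:
            continue  # qualifier lines and unsupported layouts are skipped
        _, key, location = fields
        match = SIMPLE_LOCATION.match(location)
        if match:
            features[key] = (int(match.group(1)), int(match.group(2)))
    return features


def infer_utrs(features):
    """Infer UTR spans from mRNA and CDS spans; None means no UTR on that side."""
    mrna_start, mrna_end = features["mRNA"]
    cds_start, cds_end = features["CDS"]
    five = (mrna_start, cds_start - 1) if cds_start > mrna_start else None
    three = (cds_end + 1, mrna_end) if cds_end < mrna_end else None
    return {"5'UTR": five, "3'UTR": three}


utrs = infer_utrs(parse_features(FEATURE_LINES))
print(utrs)
```

In this toy entry the 5′-UTR ends one base before the first CDS position and the 3′-UTR starts one base after the last CDS position. A production parser would additionally need to handle spliced (`join`) locations and incomplete ends.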
UTRdb_gen automatically annotates generated UTR entries by adding some specialized information such as completeness (or not) of the UTR region, number of spanned exons and cross-referencing to the primary database entry. A cross reference between 5′- and 3′-UTR sequences from the same mRNA has also been established.
Redundancy was removed from the UTR entries with the CLEANUP program (18), which rapidly and automatically generates cleaned collections by removing entries whose similarity to, and degree of overlap with, longer entries present in the database exceed user-fixed thresholds. In this case, the cut-off parameters used for CLEANUP were 95% for similarity and 90% for overlap.
The UTR entries have been further enriched by using the program UTRnote (kindly provided by G. Grillo, Area di Ricerca di Bari del Consiglio Nazionale delle Ricerche), which adds information about the location of experimentally defined patterns collected in UTRsite and of repetitive elements present in the Repbase database (19). The UTRsite entries describe the various regulatory elements present in UTR regions whose functional role has been established experimentally. Each UTRsite entry is constructed on the basis of information reported in the literature and is reviewed by scientists working experimentally on the functional characterization of the relevant UTR regulatory element.
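The idea of threshold-based redundancy removal can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the CLEANUP algorithm: similarity is measured here as ungapped identity over the overlapping region, overlap as the fraction of the shorter sequence covered, and the function names and test sequences are hypothetical.

```python
def is_redundant(short, long_, min_identity=0.95, min_overlap=0.90):
    """True if some ungapped placement of `short` on `long_` covers at least
    `min_overlap` of `short` with at least `min_identity` matching positions."""
    if not short:
        return True  # an empty sequence carries no information
    for offset in range(-len(short) + 1, len(long_)):
        start = max(offset, 0)
        end = min(offset + len(short), len(long_))
        overlap = end - start
        if overlap / len(short) < min_overlap:
            continue
        matches = sum(long_[i] == short[i - offset] for i in range(start, end))
        if matches / overlap >= min_identity:
            return True
    return False


def remove_redundancy(sequences, min_identity=0.95, min_overlap=0.90):
    """Keep the longest sequences first and drop any shorter sequence that is
    redundant with a sequence already kept."""
    kept = []
    for seq in sorted(sequences, key=len, reverse=True):
        if not any(is_redundant(seq, longer, min_identity, min_overlap)
                   for longer in kept):
            kept.append(seq)
    return kept


full = "GATTACAGGCTTAACCGGTTAGCATCGATCGGATCCA"
fragment = full[5:25]                # fully contained in `full`
unrelated = "TTTTTCCCCCAAAAAGGGGG"   # shares no long stretch with `full`

kept = remove_redundancy([fragment, unrelated, full])
print(kept)
```

With the 95% and 90% thresholds, the contained fragment is discarded while the unrelated sequence is retained. The quadratic all-against-all comparison is acceptable only for small examples; a tool applied to tens of thousands of entries needs a faster search strategy.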
CONTENT OF UTRdb
Table 1 reports a summary description of UTRdb (release 12.0) which in total contains 120 767 entries and 37 353 172 nucleotides. On average, >29.3% of entries proved to be redundant and were removed from the database.
Table 1. Number of entries (N) and nucleotide length (L) of UTRdb collections (release 12.0) after redundancy cleaning.
Collection | N | L | Redundancy (%N) | Redundancy (%L) |
---|---|---|---|---|
5′-UTR | | | | |
Fungi | 1136 | 195 215 | 23.91 | 13.04 |
Human | 8785 | 1 887 755 | 38.61 | 28.15 |
Invertebrate | 5376 | 1 033 413 | 27.63 | 15.52 |
Other_mammal | 2429 | 339 321 | 36.06 | 27.62 |
Other_vertebrate | 3564 | 519 656 | 25.63 | 18.19 |
Plant | 8499 | 924 695 | 24.91 | 13.98 |
Rodent | 8496 | 1 629 025 | 34.98 | 24.92 |
Patent | 213 | 55 918 | 29.00 | 41.86 |
TOTAL | 38 498 | 6 584 998 | | |
3′-UTR | | | | |
Fungi | 1415 | 338 564 | 13.61 | 9.47 |
Human | 10 207 | 8 367 057 | 36.91 | 30.95 |
Invertebrate | 6677 | 2 607 959 | 19.89 | 17.06 |
Other_mammal | 3202 | 1 457 422 | 29.14 | 24.27 |
Other_vertebrate | 4419 | 2 195 694 | 21.22 | 14.36 |
Plant | 11 548 | 2 777 812 | 15.16 | 14.15 |
Rodent | 9181 | 5 737 426 | 34.66 | 27.41 |
Patent | 232 | 91 287 | 27.04 | 43.03 |
TOTAL | 46 881 | 23 573 221 | | |
UTRdb 12.0 was generated from EMBL release 59. Relevant redundancy percentages calculated with respect to the number of entries (%N) and to the nucleotide length (%L) are also indicated.
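The figures in Table 1 can be cross-checked with a short calculation. The Python sketch below sums the per-collection counts and relates the cleaned total to the 120 767 entries quoted in the text, under the assumption (an interpretation, not stated explicitly) that this figure counts entries before redundancy cleaning.

```python
# (entries N, nucleotides L) per collection, copied from Table 1.
FIVE_PRIME = {
    "Fungi": (1136, 195215), "Human": (8785, 1887755),
    "Invertebrate": (5376, 1033413), "Other_mammal": (2429, 339321),
    "Other_vertebrate": (3564, 519656), "Plant": (8499, 924695),
    "Rodent": (8496, 1629025), "Patent": (213, 55918),
}
THREE_PRIME = {
    "Fungi": (1415, 338564), "Human": (10207, 8367057),
    "Invertebrate": (6677, 2607959), "Other_mammal": (3202, 1457422),
    "Other_vertebrate": (4419, 2195694), "Plant": (11548, 2777812),
    "Rodent": (9181, 5737426), "Patent": (232, 91287),
}

# Assumption: the 120 767 entries quoted in the text are the pre-cleaning count.
ENTRIES_BEFORE_CLEANING = 120767


def totals(collection):
    """Sum entries and nucleotides over all divisions of one collection."""
    n = sum(entries for entries, _ in collection.values())
    length = sum(nucleotides for _, nucleotides in collection.values())
    return n, length


n5, l5 = totals(FIVE_PRIME)
n3, l3 = totals(THREE_PRIME)
cleaned_entries = n5 + n3
redundancy_pct = 100 * (1 - cleaned_entries / ENTRIES_BEFORE_CLEANING)

print(f"5'-UTR: {n5} entries, {l5} nt, mean length {l5 / n5:.0f} nt")
print(f"3'-UTR: {n3} entries, {l3} nt, mean length {l3 / n3:.0f} nt")
print(f"Cleaned entries: {cleaned_entries}; redundancy {redundancy_pct:.1f}%")
```

The column sums reproduce the TOTAL rows of the table, and under the stated assumption the overall redundancy comes out at about 29.3% of entries. The calculation also shows that 3′-UTR entries are on average roughly three times longer than 5′-UTR entries.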
5′-UTR sequences were defined as the mRNA region spanning from the cap site to the start codon (excluded), whereas 3′-UTR sequences were defined as the mRNA region spanning from the stop codon (excluded) to the poly-A start site.
A sample UTRdb entry is shown in Figure 1. The UTRdb entries have been formatted according to the EMBL database format.
Table 2 reports functional patterns and repetitive elements included in UTRsite (release 3.0). More entries will be included in further releases. A sample UTRsite entry is reported in Figure 2. Functional patterns, defined on the basis of information reported in the literature and/or advice from experts in the field, are described using the pattern description syntax of the PATSCAN program (20).
Table 2. Functional patterns included so far in UTRsite (v3.0).
Functional patterns | Reference | Hits found in UTRdb 12.0 |
---|---|---|
Iron-responsive element (IRE) | 23 | 65 |
Histone 3′UTR stem–loop structure | 24 | 27 |
AU-rich class II destabilising element | 25 | 175 |
TGE translational regulation element | 26 | 45 |
Selenocysteine insertion sequence (SECIS) | 27,28 | 189 |
APP 3′-UTR stability control element | 29 | 7 |
Cytoplasmatic polyadenylation element (CPE) | 30 | 4614 |
Nanos | 31 | 397 |
ribosomal protein mRNA 5′ TOP | 32–34 | 298 |
TNF mRNA translation repression element | 35 | 14 |
Vimentin 3′UTR mRNA element | 36 | 12 |
GLUT1 mRNA stabilising element | 37 | 48 |
15-LOX-DICE | 38 | 83 |
Repetitive elements | | 44 806 |
For each pattern the number of hits with UTRdb entries is also reported.
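Counting pattern hits of this kind amounts to scanning each sequence for every annotated motif. The Python sketch below illustrates the principle with an ordinary regular expression. The motif is a hypothetical U-rich hexamer chosen for illustration only; it is not a UTRsite pattern definition, and regular expressions are used here as a stand-in for the PATSCAN syntax, which they cannot fully replace (for example, they cannot express base-paired stems).

```python
import re

# Hypothetical illustrative motif over the RNA alphabet, NOT a UTRsite entry.
# The lookahead lets overlapping occurrences be reported as separate hits.
MOTIFS = {"example_U_rich": re.compile(r"(?=(UUUUAU))")}


def scan(sequence, motifs=MOTIFS):
    """Return (motif name, 1-based start, matched text) for every hit."""
    rna = sequence.upper().replace("T", "U")  # accept DNA or RNA input
    hits = []
    for name, pattern in motifs.items():
        for match in pattern.finditer(rna):
            hits.append((name, match.start(1) + 1, match.group(1)))
    return hits


hits = scan("ggcTTTTATaaccuuuuaugg")
print(hits)
```

For this toy sequence the scan reports two hits, at positions 4 and 14. Applying such a function to every entry of a sequence collection and summing the results per motif gives hit counts analogous to those listed in Table 2.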
AVAILABILITY OF UTRdb
UTRdb and UTRsite are publicly available by anonymous FTP (ftp://area.ba.cnr.it/pub/embnet/database/utr/). All internet resources we implemented for retrieval and functional analysis of 5′- and 3′-UTR sequences are accessible at http://bigarea.area.ba.cnr.it:8000/EmbIT/UTRHome/ (21). These include SRS retrieval (22) of UTRdb and UTRsite, also available at the EBI WWW server (http://srs.ebi.ac.uk:80/), UTRscan and UTRfasta. The UTRscan utility allows users to search submitted sequences for any of the patterns collected in UTRsite. The UTRfasta utility allows database searches against fully annotated UTRdb entries.
CONCLUSIONS AND PERSPECTIVES
The important role that untranslated regions of eukaryotic mRNAs may play in gene regulation and expression is now widely recognized. Indeed, experimental studies have demonstrated that sequence motifs located in the untranslated regions are involved in crucial biological functions.
The large number of functionally equivalent sequences stored in UTRdb now makes it possible to study their structural and compositional features and to apply statistical methods for the identification of significant signals. Prior removal of redundancy from the databases is necessary, however, to avoid artefacts caused by redundant sequences. Even if statistical significance does not necessarily mean biological significance, it may provide a useful indication for further experimental work, such as site-directed mutagenesis.
UTRdb will be updated with the new EMBL database releases and UTRsite will be continuously updated by adding new entries describing functional patterns whose biological role has been experimentally demonstrated.
ACKNOWLEDGEMENTS
For revision of UTRsite entries we would like to thank Jim Malter (APP 3′-UTR stability control element), Alain Krol (SECIS), Matthias Hentze (IRE and 15-LOX DICE), Bill Marzluff (histone stem–loop structure), Ann-Bin Shyu (ARE), Arturo Verrotti (CPE), Robin Wharton (nanos), Elizabeth Goodwin (TGE), Roger Kaspar (ribosomal protein mRNA TOP), Danuta Radzioch (TNF mRNA translation repression element), Ruben Boado (GLUT1 mRNA stabilising element) and Zendra E. Zehner (Vimentin 3′UTR mRNA element). This work was supported by EU grant ERB-BIO4-CT96-0030 and by Programma Biotecnologie legge 95/95 (MURST 5%).
REFERENCES
1. Decker,C.J. and Parker,R. (1994) Trends Biochem. Sci., 19, 336–340.
2. Kaufman,R.J. (1994) Curr. Opin. Biotech., 5, 550–557.
3. Klausner,R.D., Rouault,T.A. and Harford,J.B. (1993) Cell, 72, 19–28.
4. Singer,R.H. (1992) Curr. Opin. Cell Biol., 4, 15–19.
5. Wilhelm,J.E. and Vale,R.D. (1993) J. Cell Biol., 123, 269–274.
6. McCarthy,J.E.G. and Kollmus,H. (1995) Trends Biochem. Sci., 20, 191–197.
7. Bashirullah,A., Cooperstock,R.L. and Lipshitz,H.D. (1998) Annu. Rev. Biochem., 67, 335–394.
8. Johnston,D. (1995) Cell, 81, 161–170.
9. Beelman,C.A. and Parker,R. (1995) Cell, 81, 179–183.
10. Curtis,D., Lehman,R. and Zamore,P.D. (1995) Cell, 81, 171–178.
11. Sonenberg,N. (1994) Curr. Opin. Genet. Dev., 4, 310–315.
12. Mengeritsky,G. and Smith,T.F. (1987) Comput. Appl. Biosci., 3, 223–227.
13. Konopka,A.K. (1994) In Smith,D.W. (ed.), Informatics and Genome Projects. Academic Press, San Diego, CA.
14. Pesole,G., Liuni,S., Grillo,G. and Saccone,C. (1997) Gene, 205, 95–102.
15. Pesole,G., Grillo,G. and Liuni,S. (1996) Comp. Chem., 20, 141–144.
16. Pesole,G., Fiormarino,G. and Saccone,C. (1994) Gene, 140, 219–225.
17. Makalowski,W., Zhang,J. and Boguski,M. (1996) Genome Res., 6, 846–857.
18. Grillo,G., Attimonelli,M., Liuni,S. and Pesole,G. (1996) Comput. Appl. Biosci., 12, 1–8.
19. Jurka,J. (1998) Curr. Opin. Struct. Biol., 8, 333–337.
20. Dsouza,M., Larsen,N. and Overbeek,R. (1997) Trends Genet., 13, 497–498.
21. Pesole,G. and Liuni,S. (1999) Trends Genet., 15, 379–380.
22. Etzold,T., Ulyanov,A. and Argos,P. (1996) Methods Enzymol., 266, 114–128.
23. Hentze,M.W. and Kuhn,L.C. (1996) Proc. Natl Acad. Sci. USA, 93, 8175–8182.
24. Williams,A.S. and Marzluff,W.F. (1995) Nucleic Acids Res., 23, 654–662.
25. Chen,C. and Shyu,A. (1995) Trends Biochem. Sci., 20, 465–470.
26. Goodwin,E.B., Okkema,P.G., Evans,T.C. and Kimble,J. (1993) Cell, 75, 329–339.
27. Hubert,N., Walczak,R., Sturchler,C., Schuster,C., Westhof,E., Carbon,P. and Krol,A. (1996) Biochimie, 78, 590–596.
28. Walczak,R., Westhof,E., Carbon,P. and Krol,A. (1996) RNA, 2, 367–379.
29. Zaidi,S.H.E. and Malter,J.S. (1994) J. Biol. Chem., 269, 24007–24013.
30. Verrotti,A., Thompson,S., Wreden,C., Strickland,S. and Wickens,M. (1996) Proc. Natl Acad. Sci. USA, 93, 9027–9032.
31. Dahanukar,A. and Wharton,R. (1996) Genes Dev., 10, 2610–2620.
32. Amaldi,F. and Pierandrei-Amaldi,P. (1997) Prog. Mol. Subcell. Biol., 18, 1–17.
33. Kaspar,R.L., Kakegawa,T., Cranston,H., Morris,D.R. and White,M.W. (1992) J. Biol. Chem., 267, 508–514.
34. Morris,D.R., Kakegawa,T., Kaspar,R.L. and White,M.W. (1993) Biochemistry, 32, 2931–2937.
35. Hel,Z., Di Marco,S. and Radzioch,D. (1998) Nucleic Acids Res., 26, 2803–2812.
36. Zehner,Z.E., Shepherd,R.K., Gabryszuk,J., Fu,T.F., Al-Ali,M. and Holmes,W.M. (1997) Nucleic Acids Res., 25, 3362–3370.
37. Boado,R.J. and Pardridge,W.M. (1998) Brain Res. Mol. Brain Res., 59, 109–113.
38. Ostareck-Lederer,A., Ostareck,D., Standart,N. and Thiele,B. (1994) EMBO J., 13, 1476–1481.