Plant Physiology. 2003 May;132(1):75–83. doi: 10.1104/pp.102.014894

In Silico Identification of Putative Regulatory Sequence Elements in the 5′-Untranslated Region of Genes That Are Expressed during Male Gametogenesis

Raymond Jozef Maurinus Hulzink 1, Han Weerdesteyn 1, Anton Felix Croes 1, Tom Gerats 1, Marinus Maria Antonius van Herpen 1, Jacques van Helden 1,*
PMCID: PMC166953  PMID: 12746513

Abstract

During pollen development, transcription of a large number of genes results in the appearance of distinct sets of transcripts. Similar mRNA sets are present in pollen of both mono- and dicotyledonous plant species, which indicates an evolutionary conservation of genetic programs that determine pollen gene expression. In pollen, regulation of gene expression occurs at the transcriptional and posttranscriptional level. The 5′-untranslated region (UTR) of several pollen transcripts has been shown to be important for regulation of pollen gene expression. The important regulatory role of 5′-UTR sequences and the evolutionary conservation of genetic programs in pollen led to the hypothesis that the 5′-UTRs of pollen-expressed genes share regulatory sequence elements. In an attempt to identify these pollen 5′-UTR elements, a statistical analysis was performed using 5′-UTR sequences of pollen- and sporophytic-expressed genes. The analysis revealed the presence of several pollen-specific 5′-UTR sequence elements. Assembly of the pollen 5′-UTR elements led to the identification of various consensus sequences, including those that previously have been demonstrated to play a role in the regulation of pollen gene expression. Several pollen 5′-UTR elements were found to be preferentially associated to genes from dicots, wet-type stigma plants, or plants containing bicellular pollen. Moreover, three sequence elements exhibited a preferential association to the 5′-UTR of pollen-expressed genes from Arabidopsis and Brassica napus. Functional implications of these observations are discussed.


Gene expression covers a complex series of distinctive processes. So far, studies on the regulation of gene expression in plants mainly have been focused on mechanisms that underlie the process of transcription. As a consequence, the architecture and mode of action of promoter sequences of various genes from different plant systems have been investigated extensively (for review, see Novina and Roy, 1996).

Despite the importance of transcription, it is becoming increasingly evident that posttranscriptional processes also perform a key function in the regulation of plant gene expression (for review, see Gallie, 1993; Fütterer and Hohn, 1996; Bailey-Serres, 1999). Strictly speaking, posttranscriptional processes comprise all steps downstream of transcription, i.e. from pre-mRNA modification to protein turnover. In many cases, the main determinant for posttranscriptional regulation is the control of translation efficiency. In eukaryotes, control of translation efficiency often occurs at the translation initiation level by either posttranslational modification of translation initiation factors or by posttranscriptional modification of individual or sets of transcripts (for review, see Pain, 1996; Bailey-Serres, 1999; Kozak, 1999). In the latter case, structural properties of the 5′-untranslated region (UTR) of mRNA molecules often play an important role. Examples of these properties are length (Gallie et al., 2000), the presence of secondary structures (Klaff et al., 1996; Gallie et al., 2000) or upstream open reading frames (Lukaszewicz et al., 1998; Wang and Wessler, 1998), and the composition of the sequence that surrounds the translation initiation codon (Geballe and Morris, 1994; Joshi et al., 1997). In addition, the presence of specific sequence elements that serve as interaction sites for antisense RNAs (Shayig, 1997; Hu et al., 1999) or RNA-binding proteins (for review, see Burd and Dreyfuss, 1994; Albà and Pagès, 1998) can also contribute to the regulatory capacity of 5′-UTRs.

To identify putative regulatory sequence elements in the 5′-UTR of coregulated genes, we focused on genes that are highly expressed during the development and germination of the male gametophyte (pollen). During pollen development, a large number of genes are transcribed (Willing and Mascarenhas, 1984; Willing et al., 1988; Guyon et al., 2000; F. Cnudde, unpublished data), which leads to the appearance of distinctive sets of transcripts (Stinson et al., 1987; Schrauwen et al., 1990; Hulzink, 2002). These mRNA sets can be found in pollen from both mono- and dicotyledonous plant species, which argues for a conservation of genetic programs that underlie pollen gene expression. The 5′-UTRs of several pollen transcripts have been shown to alter gene expression at the transcriptional (Curie and McCormick, 1997) or posttranscriptional level (Bate et al., 1996; Hulzink, 2002; Hulzink et al., 2002). With regard to the evolutionary conservation of genetic programs in pollen and the important role of the 5′-UTR in pollen gene expression, we hypothesize that the 5′-UTRs of pollen-expressed genes share regulatory sequence elements.

To identify these shared (overrepresented) regulatory sequence elements in the 5′-UTRs of pollen-expressed genes, a statistical analysis has been carried out. Two different sequence sets were collected: a test set containing 5′-UTR sequences of pollen-expressed genes (pollen sequences) and a reference set containing 5′-UTR sequences of genes that have been isolated from sporophytic tissues (reference sequences). Both sequence sets were used to identify overrepresented sequence elements (oligonucleotides) in the pollen sequences (oligo-analysis; Van Helden et al., 1998, 2000b). Although genetic programs in pollen are conserved in different plant species, it may well be that the presence of several sequence elements is associated with genes that originate from specific plant species or from subsets of plants that share similar taxonomic classifications or morphological features. Hyper-geometric statistics were applied to investigate whether the presence of the pollen elements was associated with genes from specific plant species or from sets of plants that differ in the number of cotyledons (monocots or dicots), stigma type (wet or dry), or pollen type (bicellular or tricellular).

RESULTS

The 5′-UTR of Pollen-Expressed Genes Shares Several Sequence Elements

To investigate whether the 5′-UTRs of pollen-expressed genes share sequence elements (overrepresented oligonucleotides), two datasets were collected containing 5′-UTR sequences of either pollen-expressed genes (Table I, pollen sequences) or genes that have been isolated from sporophytic tissues (reference sequences). The background oligonucleotide frequencies were estimated by calculation of the relative frequencies of all oligonucleotides within the reference set. Oligonucleotide occurrences were counted in the pollen sequences, and their statistical significance was estimated on the basis of the background frequencies (for a description of the followed methodology, see “Materials and Methods”).

Table I.

List of pollen-expressed genes that were used for the “oligo-analysis”

Species Clone Accession No.
A.t. aj002280
A.t. atbnh (t5a14.2) aj249211
A.t. atprf4 u43594
A.t. at59 (a2) u83619
A.t. gmd1 af195140
A.t. lpd2 af228638
A.t. pab5 m97657
A.t. pfn4 u43324
A.t. prf1 u43590
A.t. prf2 u43591
A.t. profilin1 u43325
A.t. profilin2 u43326
A.t. rab2 y09314
A.t. rac2 af107663
A.t. rkf1 af024648
A.t. rop1at u49971
A.t. rop4 af031428
A.t. rop6 af031427
A.t. suc1 x75365
B.n. bp4a x52874
B.n. bp4c x52874
B.n. bp19 x56195
B.n. hmga af127919
B.n. rbp1 af094825
B.n. sta39-3 l47351
B.n. sta39-4 l47352
B.n. sta44-4 l19879
B.r. bcp1 x68209
B.r. bgp1 x68210
B.r. brar1 ab032260
B.r. brar2 d63154
H.a. plim1a (sf3) af187104
H.a. plim2 af047353
H.a. sf16 x74772
H.a. sf17 x81997
H.b. trx af159385
I.t. isp11 u29432
L.l. a23 af077629
L.l. ltp af171094
L.l. p35 ab012694
L.l. y5-7 af088901
L.l. 14-3-3 af191746
L.l. l18909
L.l. l18911
L.l. z17328
L.p. trx af159387
L.e. lat51 af394216
L.e. lat52 x15855
L.e. lat56 x56487
L.e. lat59 x56488
L.e. leprk1 u58474
L.e. leprk2 u58473
L.e. leprot1 af014808
L.e. leprot2 af014809
L.e. tpex (pex1) af159296
M.s. po2 u28149
M.s. po22 u40387
M.s. p65 u28148
M.s. p73 u20431
N.a. nacsld1 af304375
N.a. nagsl1 af304372
N.s. nsaap1 u31932
N.t. eif4a8 x79004
N.t. jd1 af316320
N.t. npg1 (tp27) x71017
N.t. nsk6 y08607
N.t. nsk59 aj002315
N.t. nsk91 aj224163
N.t. nsk111 aj002314
N.t. ntf4 x83880
N.t. nthsp18p x70688
N.t. ntplim1a af197567
N.t. ntplim1b af197568
N.t. ntpro2 x93465
N.t. ntpro3 x93466
N.t. ntp303 x69440
N.t. ntsut3 af149981
N.t. nt59 u85646
N.t. plim2 af116851
N.t. pronp1 aj130969
N.t. p18 aj004957
N.t. rop1 aj222545
N.t. tac25 x63603
N.t. tobaldh2a y09876
N.t. tobpdc2 x81855
N.t. tp5 aj250431
N.t. tp10 (g10) x67159
N.t. 136.1 u20490
O.s. ps1 z16402
O.s. udpgase af249880
P.h. pgps/d1 af049917
P.h. pgps/d2 af049918
P.h. pgps/d3 af049919
P.h. pgps/d4 af049920
P.h. pgps/d6 af049922
P.h. pgps/d8 af049924
P.h. pgps/d10 af049926
P.h. pgps/d11 af049927
P.h. pgps/d12 af049928
P.h. pgps/d14 af049930
P.h. pgps/nh21 af049937
P.h. pmt1 af061106
P.h. php303 (p303) af479568
P.i. ppe1 (pcpe22) l27101
P.i. prk1 l2731
P.p. ab013353
P.s. rop1ps l19093
S.b. sbpk x97980
S.b. sb401
S.c. af161330
S.t. invge aj133765
S.t. invgf aj133765
T.p. tpc70 u04298
Z.m. hsfb x82943
Z.m. hsp70 x03658
Z.m. mpex2 af159297
Z.m. mpex1 z34465
Z.m. pg1 x57627
Z.m. tua1 (mg19/6) x15704
Z.m. tua4 x63179
Z.m. tua6 x63179
Z.m. tub3 x74654
Z.m. tub4 x74655
Z.m. tub5 x74656
Z.m. zmabp1 x80820
Z.m. zmc5 y13285
Z.m. zmmads1 af112148
Z.m. zmmads2 af112149
Z.m. zmpro1 x73279
Z.m. zmpro2 x73280
Z.m. zmpro3 x73281
Z.m. zm13 s44171

The first column shows the plant species: A.t., Arabidopsis; B.c., Brassica campestris; B.n., Brassica napus; B.r., Brassica rapa; H.a., Helianthus annuus; H.b., Hordeum bulbosum; I.t., Ipomoea trifida; L.l., Lilium longiflorum; L.p., Lolium perenne; L.e., Lycopersicon esculentum; M.s., Medicago sativa; N.a., Nicotiana alata; N.s., Nicotiana sylvestris; N.t., Nicotiana tabacum; O.s., Oryza sativa; P.h., Petunia hybrida; P.i., Petunia inflata; P.p., Pyrus pyrifolia; P.s., Pisum sativum; S.b., Solanum berthaultii; S.c., Solanum chacoense; S.t., Solanum tuberosum; T.p., Tradescantia paludosa; and Z.m., Zea mays. The second column shows the gene/clone names. The third column shows the GenBank accession nos.

Table II shows the hexanucleotides that are significantly overrepresented in the pollen sequences compared with the reference sequences. Of the 4,096 possible hexanucleotides, 31 sequence elements are preferentially present in the 5′-UTRs of pollen-expressed genes (Table II). Similar results were obtained for penta-, hepta-, and octanucleotides (for these data, see http://rsat.ulb.ac.be/rsat/). The majority of the overrepresented oligonucleotides (pollen elements) are A rich, i.e. more than 80% of the oligonucleotides contain four or more A residues. The most significantly overrepresented oligonucleotide is AAAAAA, which occurs 74 times (Oocc), whereas 20.4 occurrences are expected on the basis of the background model (Eocc). When the total number of possible oligonucleotides in each dataset is taken into account (4,096), the occurrence probability value converts into the occurrence significance index (sigocc). For the oligonucleotide AAAAAA, the sigocc is 15.9, which means that a hexanucleotide with a similarly high significance value is expected to occur by chance only once in 10^15.9 datasets of random sequences. Highly significant values are also found for several AAAAAA single-substitution variants. To examine whether the overrepresented oligonucleotides are positionally biased in the pollen sequences, the distribution of each hexanucleotide was determined, and a test of homogeneity was applied (Table III). None of the hexanucleotides passed the significance threshold of 64.64, which indicates that the overrepresented sequence elements are not positionally biased in the pollen 5′-UTRs.

Table II.

Overrepresented sequence elements in the 5′-UTRs of pollen-expressed genes

Oligo Oocc Eocc Pocc sigocc Oms Ems Pms sigms
aaaaaa 74 20.4 2.8e−20 15.9 44 19.4 1.8e−06 2.1
aaataa 38 8.3 4.4e−14 9.8 28 8.1 8.7e−09 4.4
aaaaat 41 11.5 1.1e−11 7.3 32 11.2 1.7e−06 2.1
aaaata 30 8.0 2.3e−09 5.0 25 7.9 3.9e−06 1.8
aataaa 31 8.5 2.4e−09 5.0 24 8.4 3.1e−05 0.9
gaaaaa 49 19.1 7.6e−09 4.5 39 18.1 3.5e−05 0.8
caaaaa 42 15.4 1.7e−08 4.2 31 14.8 1.3e−04 0.3
aaagga 22 5.1 2.3e−08 4.0 22 5.0 6.7e−09 4.6
aaggaa 23 5.8 5.9e−08 3.6 20 5.7 1.3e−06 2.3
ataaaa 27 8.3 2.1e−07 3.1 22 8.1 1.8e−04 0.1
taaaaa 28 9.0 3.3e−07 2.9 21 8.8 2.0e−04 0.1
aaaaag 40 16.2 4.3e−07 2.8 30 15.5 2.9e−04 −0.1
ggaaaa 22 6.1 4.7e−07 2.7 19 6.0 9.1e−06 1.4
caaata 15 3.3 2.4e−06 2.0 13 3.3 3.2e−05 0.9
aataag 11 1.9 4.0e−06 1.8 10 1.8 1.9e−05 1.1
aaaaac 34 14.2 6.0e−06 1.6 24 13.7 4.7e−03 −1.3
atcaaa 23 7.8 7.9e−06 1.5 18 7.7 6.6e−04 −0.4
aagaca 17 4.8 1.3e−05 1.3 14 4.8 3.3e−04 −0.1
aaaaga 37 17.2 2.2e−05 1.0 32 16.4 7.3e−04 −0.5
ctttga 15 4.1 2.5e−05 1.0 13 4.1 2.3e−04 0.0
aagaag 37 17.7 3.9e−05 0.8 24 16.8 1.7e−01 −2.8
taaagg 9 1.6 4.5e−05 0.7 9 1.6 3.7e−05 0.8
caataa 15 4.3 4.9e−05 0.7 14 4.3 1.1e−04 0.3
ccaaaa 25 10.1 5.0e−05 0.7 22 9.8 2.9e−04 −0.1
ttttaa 20 7.3 8.1e−05 0.5 16 7.2 2.3e−03 −1.0
aaggag 16 5.1 8.4e−05 0.5 14 5.0 5.4e−04 −0.3
tatcaa 13 3.6 9.8e−05 0.4 11 3.6 9.7e−04 −0.6
ttggaa 9 1.9 1.3e−04 0.3 9 1.8 1.1e−04 0.3
ggaatt 13 3.8 1.9e−04 0.1 12 3.8 4.7e−04 −0.3
aggaaa 20 7.8 1.9e−04 0.1 18 7.7 6.6e−04 −0.4
aaacaa 31 15.0 1.9e−04 0.1 22 14.4 2.8e−02 −2.1

The first column (oligo) shows the overrepresented pollen elements. Oocc, observed occurrences; Eocc, expected occurrences; Pocc, probability occurrences; sigocc, significance index occurrences; Oms, observed matching sequences; Ems, expected matching sequences; Pms, probability matching sequences; and sigms, significance index matching sequences. All sequence elements with sigocc > 0 were selected. See “Materials and Methods” for a description of the parameters.

Table III.

Position analysis of the pollen 5′-UTR sequence elements

Oligo χ2
aaaaaa 45.69
aaataa 14.74
aaaaat 12.60
aaaata 21.60
aataaa 11.56
gaaaaa 21.50
caaaaa 27.37
aaagga 15.08
aaggaa 32.21
ataaaa 10.05
taaaaa 23.14
aaaaag 27.75
ggaaaa 15.36
caaata 22.28
aataag 18.03
aaaaac 26.90
atcaaa 15.91
aagaca 9.84
aaaaga 12.28
ctttga 11.19
aagaag 24.65
taaagg 19.42
caataa 10.76
ccaaaa 39.24
ttttaa 19.42
aaggag 16.95
tatcaa 36.75
ttggaa 15.75
ggaatt 12.22
aggaaa 18.49
aaacaa 37.20

Analysis of the distribution of each overrepresented sequence element was performed with the program “position-analysis” (Van Helden et al., 2000b). Chi-square values (χ2) were calculated using the parameters df = 30 and α = 0.000244. Sequence elements with χ2 ≤ 64.64 do not exhibit a significantly biased position in the pollen sequences. For details, see “Materials and Methods.”

In summary, these data clearly show that several sequence elements are preferentially present in the 5′-UTR of pollen-expressed genes. Within the pollen 5′-UTRs, the overrepresented sequence elements are not positionally biased.

Assembly of Sequence-Related Pollen Elements Gives Rise to Several Consensus 5′-UTR Elements

Several of the identified pollen elements share sequence similarity, and their mutual overlap might reveal consensus sequence elements. To identify these consensus elements, we assembled the sequence-related pollen elements using the program “pattern-assembly” (Van Helden et al., 2000a). As shown in Table IV, assembly of sequence-related pollen elements gave rise to several consensus 5′-UTR sequence elements. Highly significant values are found for the consensus elements CAAATAAAAAT and AAAAAA.

Table IV.

Pattern assembly of the pollen 5′-UTR sequence elements

Oligo sigocc
aaaata..... 5.0
caaaaa..... 4.2
caaata..... 2.0
.aaaaaa.... 15.9
.aaataa.... 9.8
.caataa.... 0.7
.aaacaa.... 0.1
..aataaa... 5.0
..aataag... 1.8
..ataaaa... 3.1
...atcaaa.. 1.5
....gaaaaa. 4.5
....taaaaa. 2.9
.....aaaaat 7.3
.....aaaaag 2.8
.....aaaaac 1.6
caaataaaaat 15.9 C
caaaaa 4.2
ccaaaa 0.7
ccaaaa 4.2 C
tatcaa. 0.4
.ataaaa 3.1
.atcaaa 1.5
tatcaaa 3.1 C
aaaaaa 15.9
aaaaat 7.3
aaaata 5.0
gaaaaa 4.5
caaaaa 4.2
ataaaa 3.1
taaaaa 2.9
aaaaag 2.8
aaaaac 1.6
aaaaga 1.0
aaaaaa 15.9 C
taaagg.... 0.7
.aaagga... 4.0
.aaaaga... 1.0
..aaggaa.. 3.6
..aaggag.. 0.5
...aggaaa. 0.1
....gaaaaa 4.5
....ggaaaa 2.7
taaaggaaaa 4.5 C
aaaaag 2.8
aataag 1.8
aagaag 0.8
aaggag 0.5
aagaag 2.8 C

The “oligo” column represents the pollen 5′-UTR sequence elements with their respective significance indexes (column 2: sigocc). C, Consensus sequence element.

Several Pollen Elements Are Associated with Genes from Arabidopsis, Brassica napus, Dicotyledonous Plants, or Plants Containing a Wet Stigma or Bicellular Pollen

The pollen-expressed genes that were used for the 5′-UTR analysis are derived from 25 different plant species (Table I). With regard to their taxonomic classification, the plants were separated into subsets containing mono- or dicotyledonous species. On the basis of their stigma type, the plant species were grouped further into the subsets wet and dry stigma. A wet stigma is covered with a liquid secretion layer, whereas a dry stigma is covered with less or no secretion material (Heslop-Harrison and Shivanna, 1977). Furthermore, the plant species were grouped into the subsets bi- and tricellular pollen. Sperm cells in tricellular pollen are formed during pollen maturation, whereas sperm cells in bicellular pollen are formed during pollen tube growth (Brewbaker, 1967). On the basis of the hyper-geometric probability, we tested to what extent the pollen elements were preferentially associated with genes from each subset of plant species. Furthermore, we analyzed to what extent the pollen elements were associated with genes from specific plant species. For each sequence element, the number of matching pollen sequences in a given subset (B) was compared with the number of matching sequences in a given set (M), and the corresponding E value was calculated (for a detailed description of the methodology, see “Materials and Methods”). The analysis revealed the presence of 12 associations with an E value < 0.1 (Table V). Of the 31 overrepresented sequence elements, seven are significantly associated with dicotyledonous plant species. The most significant example is the sequence element AAAAAT, which is present in 32 dicotyledonous pollen sequences. The pollen element ATCAAA is significantly associated with genes from both wet stigma and bicellular pollen plants. Interestingly, the sequence elements AAAAAT and AAAATA are significantly associated with B. napus, whereas AAGAAG is associated with Arabidopsis.

Table V.

Statistical analysis of the extent of association of pollen elements with genes from different plant species or from plants of the subsets: one or two cotyledons (monocot or dicot), wet- or dry-type stigma (wet or dry), or bi- or tricellular pollen (bi or tri)

Subset Oligo S M B P E sig
Dicot aaaaaa 99 44 41 3.3e−04 1.0e−02 2.0
Dicot aaaaat 99 32 32 2.1e−05 7.0e−04 3.2
Dicot aataaa 99 24 24 4.5e−04 1.4e−02 1.9
Dicot gaaaaa 99 39 37 2.7e−04 8.2e−03 2.1
Dicot caaaaa 99 31 30 5.0e−04 1.6e−02 1.8
Dicot ataaaa 99 22 22 9.1e−04 2.8e−02 1.6
Dicot aaaaga 99 32 31 3.5e−04 1.1e−02 2.0
Wet atcaaa 68 18 15 3.2e−03 9.8e−02 1.0
Bi atcaaa 76 18 17 3.3e−04 1.0e−02 2.0
Brassica napus aaaaat 8 32 7 1.9e−04 5.8e−03 2.2
Brassica napus aaaata 8 25 7 2.9e−05 9.0e−04 3.1
Arabidopsis aagaag 19 24 11 2.3e−05 7.0e−04 3.1

The first column shows the subsets that give a significant association to the pollen elements of column 2. S, No. of pollen sequences from a given subset; M, no. of pollen sequences from a given set containing a given pollen element; B, no. of pollen sequences from a given subset containing a given pollen element; P, probability value; E, expected value; and sig, significance index. See “Materials and Methods” for a description of the methodology.
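As a rough illustration of the association test, the following Python sketch computes a hyper-geometric probability for one row of Table V. The function names and the use of the right tail P(X ≥ B) as the reported P value are assumptions made for illustration and are not taken from the authors' software. The input numbers (132 pollen sequences, 99 of them from dicots, 32 sequences containing AAAAAT, all 32 from dicots) come from the text and Table V.

```python
from math import comb

def hypergeom_pmf(n, s, m, b):
    """P(exactly b of the m matching sequences belong to a subset of size s,
    given n sequences in total)."""
    return comb(s, b) * comb(n - s, m - b) / comb(n, m)

def hypergeom_tail(n, s, m, b):
    """Right tail P(X >= b). Assumption: this tail is the reported P value."""
    upper = min(s, m)
    return sum(hypergeom_pmf(n, s, m, k) for k in range(b, upper + 1))

# Dicot / AAAAAT row of Table V: n = 132, S = 99, M = 32, B = 32
p_aaaaat_dicot = hypergeom_tail(132, 99, 32, 32)
print(f"{p_aaaaat_dicot:.1e}")  # prints 2.1e-05, in line with the P column
```

Because B equals M in this row, the exact probability and the right tail coincide, so the example does not depend on which of the two is used.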

DISCUSSION

A systematic approach based on the statistical analysis of oligonucleotide occurrences has led to the identification of several oligonucleotides (sequence elements) that are significantly overrepresented in the 5′-UTR of pollen-expressed genes (Table II). It is obvious that the choice of appropriate reference and test datasets is an important determinant for a reliable outcome of the analysis. Genes from both datasets were selected on the basis of the origin of their respective cDNAs (male gametophytic or sporophytic tissues) without a priori consideration of the composition of the 5′-UTRs. The extent of expression of pollen genes during pollen development and tube growth was ascertained by data from the available literature. From the analysis, we conclude that the 5′-UTRs of genes that are expressed during male gametogenesis share pollen-specific sequence elements (pollen elements). Statistical analyses of the presence of overrepresented sequence elements have led to a successful identification of regulatory elements in promoter (Van Helden et al., 1998, 2000c; Sinha and Tompa, 2002) and 3′-UTR (Jacobs-Anderson and Parker, 2000; Van Helden et al., 2000b) sequences of coregulated yeast (Saccharomyces cerevisiae) genes. To our knowledge, the present study is the first report that describes the in silico identification of sequence elements that are shared in the 5′-UTRs of coregulated plant genes.

Although many sequence elements are significantly overrepresented in the 5′-UTR of pollen-expressed genes, their statistical significance does not necessarily imply that they are functional in the regulation of pollen gene expression. However, several observations indicate that some of the pollen elements exhibit a pollen-related regulatory function. Figure 1 shows a schematic representation of the 5′-UTR of the pollen-expressed gene ntp303. A sequence region at the 5′ end of the ntp303 5′-UTR has been shown previously to affect translation efficiency, whereas a sequence region at the 3′ end was found to modulate mRNA stability (Hulzink, 2002; Hulzink et al., 2002). As shown in Figure 1, assembly of several of the pollen elements in the ntp303 5′-UTR leads to several extended sequence regions. We assume that some of these extended pollen sequence regions comprise regulatory elements because several of these regions are also present in the functional ntp303 5′-UTR regions. In addition, pattern assembly analysis (Table IV) highlighted the presence of consensus sequence elements that are conserved in various pollen-expressed genes. Because some of these consensus pollen elements are also localized in the functional ntp303 5′-UTR regions, it is reasonable to assume that at least the consensus sequence elements in the functional ntp303 5′-UTR regions represent regulatory sequences. Regulatory sequence elements are often concentrated as short and highly conserved core elements (for review, see Novina and Roy, 1996). A representative example of such a regulatory sequence element is the consensus sequence AAGAAG, which is present repeatedly in the 5′-functional region of the ntp303 5′-UTR. Deletion analysis strongly suggests that the AAGAAG repeat is involved in directing pollen gene expression (Hulzink et al., 2002). In this respect, the identification of the AAGAAG sequence as a pollen-specific 5′-UTR element provides additional clues for its regulatory function.

Figure 1.

Schematic representation of the distribution of overrepresented sequence elements (pollen elements) in the 5′-UTR of the pollen-expressed gene ntp303. Distribution of the pollen elements is presented above the graphic representation of the ntp303 5′-UTR. The small arrows indicate the start of a pollen element. The numbers correspond to sequence elements as presented in Table II (counted from up to down). Sequences below the ntp303 5′-UTR illustration indicate the position of consensus pollen sequence elements as presented in Table IV. The gray regions represent the 5′ (left) and 3′ (right) regulatory regions of the ntp303 5′-UTR (see text for description). The arrow at the 3′-end of the ntp303 5′-UTR indicates the position of the translation initiation site.

The pollen 5′-UTR sequences that were used in the present study originate from different plant species. Several pollen elements are preferentially associated with genes from dicotyledonous plant species, whereas none of the elements exhibit a significant association with genes from monocots (Table V). These results indicate co-evolution of sequence elements in the 5′-UTRs of pollen-expressed genes. It is plausible that the significant association of pollen 5′-UTR elements reflects specific properties of dicots that are absent in monocots. In addition, the pollen element ATCAAA is found to be preferentially associated with genes from plants containing a wet stigma and bicellular pollen. With regard to self-incompatibility systems in angiosperms, a clear relationship exists between pollen and stigma characteristics, i.e. plant species with a wet-type stigma often have bicellular pollen (Brewbaker, 1967; Heslop-Harrison and Shivanna, 1977). Such a relationship might explain the preferential association of the ATCAAA element with genes from wet-type stigma and bicellular pollen plants. Besides the pollen elements that are preferentially associated with subsets of plant species, the presence of several other 5′-UTR sequence elements is not related to the number of cotyledons, stigma type, or pollen type. It is reasonable to assume that these elements play a more general role in the regulation of pollen gene expression, independent of the phylogenetic background. In contrast, the preferential association of three other sequence elements with pollen-expressed genes from Arabidopsis and B. napus indicates a strong phylogenetic dependency of these elements for these Brassicaceae spp.

In silico identification of conserved sequence elements in the 5′-UTRs of coregulated genes clearly has to be validated experimentally. Nevertheless, computational identification of shared 5′-UTR sequence elements provides useful indications for new functional studies. Although gene expression studies in pollen have been successful in several ways, a systematic analysis of the 5′-UTRs of the growing number of isolated pollen-expressed genes was still lacking. With the increasing number of publicly available genomic and EST sequences, we will be able to gain a better understanding of the new intriguing functional and evolutionary clues provided by our analysis.

MATERIALS AND METHODS

5′-UTR Sequence Datasets

The 5′-UTR sequences of pollen-expressed genes (pollen sequences) were collected from the GenBank database. The pollen 5′-UTR dataset consists of 5′-UTR sequences from 132 different genes that are highly expressed in mature pollen or pollen tubes. Expression of the genes in pollen or pollen tubes was ascertained by data from the available literature. Because of their expression in anther tissues, genes related to pollen coat proteins have been excluded from the analysis. The total number of nucleotides in the pollen 5′-UTR dataset is 16,645. The average length of the full-length or partial pollen 5′-UTRs is 126 nucleotides; the smallest UTR sequence consists of six nucleotides, whereas the longest UTR is 620 nucleotides in length.

The detection of overrepresented oligonucleotides (sequence elements) in the pollen sequences relies on the prior definition of a background model, which is used to estimate the random expectation for each oligonucleotide. As a background model, we used a set of non-pollen 5′-UTR sequences (named reference sequences). The reference sequences were obtained from the GenBank database (March 2001) by selecting the first 1,076 cDNA entries that did not contain any of the key words “pollen,” “gametophyte,” “flower,” and “bud.” With regard to the extraction of the reference sequences, the complete sequence upstream of the translation initiation site was selected. This procedure resulted in the collection of 113,481 nucleotides. The average length of the full-length or partial reference 5′-UTRs is 105 nucleotides; the smallest UTR sequence is 14 nucleotides in length, whereas the longest UTR consists of 1,396 nucleotides.

Sequence Purging

The presence of homologous 5′-UTR sequences in the datasets might lead to the inclusion of large conserved sequence regions. These conserved sequence regions can bias the analysis by duplicating all the oligonucleotides that are found in the redundant sequences. To avoid this phenomenon, the pollen and reference sequences were purged to remove sequence repeats larger than 50 nucleotides (containing a maximum of three mismatches) using the programs “mkvtree” and “vmatch” (Kurtz and Schleiermacher, 1999).

Oligonucleotide Analysis

A pattern discovery approach (oligo-analysis; Van Helden et al., 1998, 2000b) was used to detect overrepresented oligonucleotides in the pollen sequences. All statistics were performed on nondegenerate oligonucleotides, without allowing mismatches. To calculate the expected frequency of each oligonucleotide, oligo-analysis was first applied to the reference sequence set. The obtained expected frequency values were used to estimate the expected number of occurrences for each oligonucleotide in the set of pollen sequences. The detection of overrepresented oligonucleotides is based on an estimation of the significance of the observed occurrences (Oocc). For each oligonucleotide, the P value Pocc = P(X ≥ x), i.e. the probability of observing at least x occurrences, was calculated on the basis of the binomial distribution. Because the analysis comprised multiple tests (4,096 in the case of hexanucleotides), the possibility existed that even low P values appeared by chance. To correct for such a multitesting effect, the P values were multiplied by the number of oligonucleotides. This correction resulted in an expected value, the E value (Eocc). The significance index sigocc = −log10(Eocc) reflects the degree of overrepresentation of each oligonucleotide on a logarithmic scale. All oligonucleotides with a positive significance index (Eocc < 1) were selected. To prevent a bias due to self-overlapping occurrences (Kleffe and Borodovsky, 1992), a nonoverlapping counting mode was adopted. This means that when a self-overlapping hexanucleotide was found at a given position, its occurrence at the next five positions was ignored. The number of possible positions was corrected according to the calculation of the binomial.
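The counting and scoring steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the oligo-analysis program itself: the function names are hypothetical, the binomial tail is summed directly in log space, and the logarithm is taken to base 10, which is consistent with the values in Table II.

```python
from math import exp, lgamma, log, log10

def count_nonoverlapping(seq, oligo):
    """Count occurrences of oligo in seq; after a match, the next
    len(oligo) - 1 positions are skipped (nonoverlapping counting mode)."""
    count, i, w = 0, 0, len(oligo)
    while i <= len(seq) - w:
        if seq[i:i + w] == oligo:
            count += 1
            i += w
        else:
            i += 1
    return count

def binomial_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p), summing the terms in log space."""
    if x <= 0 or p >= 1.0:
        return 1.0
    if p <= 0.0:
        return 0.0
    total = 0.0
    for k in range(x, n + 1):
        log_term = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                    + k * log(p) + (n - k) * log(1.0 - p))
        total += exp(log_term)
    return min(total, 1.0)

def significance(p_value, n_tests=4096):
    """sig = -log10(E), with E = P multiplied by the number of tests."""
    return -log10(p_value * n_tests)

# Twelve consecutive A residues hold two nonoverlapping AAAAAA occurrences.
print(count_nonoverlapping("aaaaaaaaaaaa", "aaaaaa"))  # prints 2
# Pocc of AAAAAA from Table II reproduces the listed sigocc.
print(round(significance(2.8e-20), 1))  # prints 15.9
```

The second printed value shows how the Pocc and sigocc columns of Table II relate for AAAAAA, given 4,096 possible hexanucleotides.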

Overrepresentation of an oligonucleotide can be due to either its frequent presence in most of the pollen sequences or in a subset of these sequences. The probability of the first occurrence was taken into account by calculating an additional index of overrepresentation. For each oligonucleotide, the number of sequences that contained at least one occurrence (matching sequence) was determined, and the statistical significance was calculated with the binomial probability. For a single sequence, the first occurrence probability was calculated as:

$$P(X_S \geq 1) = 1 - (1 - p_W)^{T_S}$$

where p_W represents the probability to observe the oligonucleotide W at any position of the sequence, X_S represents the number of matches on sequence S, and T_S represents the number of positions that was calculated from the sequence length L_S and the oligonucleotide size w (T_S = L_S − w + 1). The matching sequence probability Pms was calculated by taking the right tail of the binomial probability:

$$P_{ms} = P(X \geq m) = \sum_{i=m}^{S} \binom{S}{i} \left[P(X_S \geq 1)\right]^{i} \left[1 - P(X_S \geq 1)\right]^{S-i}$$

where m represents the number of sequences that contain at least one copy of the oligonucleotide, whereas S represents the total number of sequences. The E value (Ems) and the significance index (sigms) were calculated from the P value in the same way as for the occurrences. A positive sigms value indicates that the oligonucleotide is present in more sequences than what would be expected by chance alone.
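A minimal Python sketch of the matching-sequence statistic is given here for illustration. It assumes that the first occurrence probability for a sequence of length L is 1 − (1 − p_W)^(L − w + 1) and, because the sequences differ in length, that the binomial uses the mean of the per-sequence probabilities; both the averaging step and the function names are assumptions, not details confirmed by the text.

```python
from math import comb

def first_occurrence_prob(p_w, seq_len, w=6):
    """Probability of at least one occurrence of an oligonucleotide with
    per-position probability p_w in a sequence of length seq_len."""
    t_s = max(seq_len - w + 1, 0)  # number of possible positions
    return 1.0 - (1.0 - p_w) ** t_s

def matching_sequences(m, seq_lengths, p_w, w=6):
    """Return (expected matching sequences, P(at least m matching sequences)).
    Assumption: one binomial with the mean per-sequence probability.
    Suitable for a few hundred sequences; comb() grows quickly beyond that."""
    probs = [first_occurrence_prob(p_w, length, w) for length in seq_lengths]
    s = len(probs)
    p_mean = sum(probs) / s
    p_ms = sum(comb(s, i) * p_mean ** i * (1.0 - p_mean) ** (s - i)
               for i in range(m, s + 1))
    return sum(probs), min(p_ms, 1.0)

# Hypothetical example: 132 UTRs of 126 nt each, per-position probability 0.001
expected, p_ms = matching_sequences(30, [126] * 132, 0.001)
print(round(expected, 1), p_ms)
```

The E value and significance index then follow from the P value in the same way as for the occurrences.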

When the 1,076 non-pollen gene sequences were used as reference, a problem of statistical sampling was observed. Some oligonucleotides were present in the pollen sequences but absent from the reference set. As a consequence, their probability was estimated to be zero and their significance was infinite, even if they were found in a low copy number in the pollen sequences. The problem occurred because the reference sequence set was too small to contain all possible oligonucleotides. It is likely that these oligonucleotides would appear in much larger reference datasets. To circumvent this problem for the current reference sequence set, pseudo-weights were used. This means that the oligonucleotide frequencies calculated from the reference sequence set contributed 90% to the estimation of prior oligonucleotide probabilities, whereas the remaining 10% was reserved for the potential presence of additional oligonucleotides in the pollen sequences.
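The effect of the pseudo-weight can be sketched as follows. The text states only the 90%/10% split; how the remaining 10% is spread over oligonucleotides is not specified, so this sketch assumes a uniform distribution over all 4^w possible words. The function name and this uniform choice are assumptions made for illustration.

```python
from itertools import product


def prior_with_pseudo_weight(ref_counts, w=6, pseudo=0.1):
    """Mix reference frequencies (weight 1 - pseudo) with a uniform term.

    ref_counts : dict mapping oligonucleotide -> count in the reference set
    pseudo     : share reserved for words absent from the reference set
    The uniform pseudo-distribution is an assumption of this sketch.
    """
    words = ["".join(letters) for letters in product("ACGT", repeat=w)]
    total = sum(ref_counts.get(word, 0) for word in words)
    if total == 0:
        raise ValueError("reference set contains no occurrences")
    uniform = 1.0 / len(words)
    return {
        word: (1.0 - pseudo) * ref_counts.get(word, 0) / total + pseudo * uniform
        for word in words
    }
```

With this mixing, an oligonucleotide that never occurs in the reference set receives a small nonzero prior, so its significance in the pollen set stays finite.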

Analysis of oligonucleotide distributions in the pollen sequences was performed with the program “position-analysis” (Van Helden et al., 2000b). Occurrences were grouped into intervals of 20 nucleotides. Given the sequence sizes, the positional distributions contained 31 classes with a class interval of 20 nucleotides. The expected distribution was calculated according to a homogeneous model, i.e. a position-independent distribution for each oligonucleotide. Observed and expected positional distributions were compared using the chi-square statistic. To take the number of oligonucleotides (n = 4,096) into account, the threshold was adapted according to the Bonferroni rule, which recommends a type I error risk α < 1/n. The degrees of freedom (df) of the chi-square test depended on the number of position classes (c), which in turn depended on the class interval. For the class interval of 20 nucleotides, the resulting probability value (α = 0.000244) corresponded to a theoretical value of χ²theor = 64.64.
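A minimal Python sketch of the positional test is given below. It assumes that every position class is equally likely under the homogeneous model, which holds only if all sequences span all 31 classes; the function name is hypothetical, and the threshold of 64.64 is taken from the text rather than recomputed.

```python
CHI2_THRESHOLD = 64.64  # theoretical value quoted in the text for alpha = 1/4096


def positional_chi2(positions, n_classes=31, interval=20):
    """Chi-square statistic of observed vs. homogeneous positional counts.

    positions : start positions (0-based) of one oligonucleotide's occurrences
    Returns (chi2, observed_counts). Equal expected counts per class are an
    assumption of this sketch.
    """
    observed = [0] * n_classes
    for pos in positions:
        idx = pos // interval
        if 0 <= idx < n_classes:
            observed[idx] += 1
    total = sum(observed)
    if total == 0:
        raise ValueError("no occurrences inside the analysed window")
    expected = total / n_classes
    chi2 = sum((obs - expected) ** 2 / expected for obs in observed)
    return chi2, observed
```

An oligonucleotide whose statistic exceeds `CHI2_THRESHOLD` would be considered to have a position-dependent distribution.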

Association Analysis

To examine to what extent the oligonucleotides were associated with specific plant species or with subsets of plants containing one or two cotyledons, a wet- or dry-type stigma, or bi- or tricellular pollen, the hypergeometric probability test was applied. The hypergeometric probability estimates the significance of association between the overrepresented oligonucleotides and different subsets of pollen sequences. Considering a set of n sequences (e.g. the 132 pollen sequences) that contains a subset of size S (e.g. the 99 sequences from dicotyledonous plants) and the observation that a given oligonucleotide is present in M pollen sequences, the probability that exactly B of these pollen sequences belong to the given subset is:

$$P(X = B) = \frac{\binom{S}{B}\binom{n-S}{M-B}}{\binom{n}{M}}$$

The P value is the probability of observing at least B matching sequences in the given subset:

$$P\text{ value} = P(X \ge B) = \sum_{i=B}^{\min(S,M)} \frac{\binom{S}{i}\binom{n-S}{M-i}}{\binom{n}{M}}$$

In our analysis, each oligonucleotide was compared with T = 31 different subsets of pollen sequences (dicots/monocots, wet stigma/dry stigma, bicellular pollen/tricellular pollen, and 25 different plant species). To correct for this multitesting, the P value was converted to an E value by:

$$E\text{ value} = P\text{ value} \times T$$

Finally, the E value was converted to a logarithmic significance index:

$$\text{sig} = -\log_{10}(E\text{ value})$$
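The association test can be sketched in Python using exact binomial coefficients. The symbols follow the text (n, S, M, B, T); the function names are hypothetical, and the base-10 logarithm is an assumption of this sketch.

```python
import math


def hypergeom_pmf(b: int, n: int, s: int, m: int) -> float:
    """Probability that exactly b of the m matching sequences fall in the subset.

    n : total number of sequences, s : subset size, m : matching sequences.
    """
    return math.comb(s, b) * math.comb(n - s, m - b) / math.comb(n, m)


def association_significance(b, n, s, m, n_subsets=31):
    """Return (P value, E value, sig) for observing at least b matches in the subset."""
    p_value = sum(hypergeom_pmf(i, n, s, m) for i in range(b, min(s, m) + 1))
    p_value = min(p_value, 1.0)  # guard against rounding slightly above 1
    e_value = p_value * n_subsets  # correction for the T subsets tested
    sig = -math.log10(e_value) if e_value > 0 else math.inf
    return p_value, e_value, sig
```

For the dicot example in the text, a call would look like `association_significance(b, 132, 99, m)`, with `m` the number of pollen sequences containing the oligonucleotide and `b` the number of those that come from dicots.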

Availability

The “Regulatory Sequence Analysis Tools” are available at http://rsat.ulb.ac.be/rsat/. The complete sets of data and calculation procedures are available on the same site.

Distribution of Materials

Upon request, all novel materials described in this publication will be made available in a timely manner for noncommercial research purposes, subject to the requisite permission from any third party owners of all or parts of the material. Obtaining any permission will be the responsibility of the requestor.

Footnotes

Article, publication date, and citation information can be found at www.plantphysiol.org/cgi/doi/10.1104/pp.102.014894.

LITERATURE CITED

  1. Albà MM, Pagès M. Plant proteins containing the RNA-recognition motif. Trends Plant Sci. 1998;3:15–21.
  2. Bailey-Serres J. Selective translation of cytoplasmic mRNAs in plants. Trends Plant Sci. 1999;4:142–148. doi: 10.1016/s1360-1385(99)01386-2.
  3. Bate N, Spurr C, Foster GD, Twell D. Maturation-specific translational enhancement mediated by the 5′-UTR of a late pollen transcript. Plant J. 1996;10:613–623. doi: 10.1046/j.1365-313x.1996.10040613.x.
  4. Brewbaker JL. The distribution and phylogenetic significance of binucleate and trinucleate pollen grains in the angiosperms. Am J Bot. 1967;54:1069–1083.
  5. Burd CG, Dreyfuss G. Conserved structures and diversity of functions of RNA-binding proteins. Science. 1994;265:615–621. doi: 10.1126/science.8036511.
  6. Curie C, McCormick S. A strong inhibitor of gene expression in the 5′-untranslated region of the pollen-specific lat59 gene of tomato. Plant Cell. 1997;9:2025–2036. doi: 10.1105/tpc.9.11.2025.
  7. Fütterer J, Hohn T. Translation in plants: rules and exceptions. Plant Mol Biol. 1996;32:159–189. doi: 10.1007/BF00039382.
  8. Gallie DR. Posttranscriptional regulation of gene expression in plants. Annu Rev Plant Physiol Plant Mol Biol. 1993;44:77–105.
  9. Gallie DR, Ling J, Niepel M, Morley SJ, Pain VM. The role of 5′-leader length, secondary structure, and PABP concentration on cap and poly(A) tail function during translation in Xenopus oocytes. Nucleic Acids Res. 2000;28:2943–2953. doi: 10.1093/nar/28.15.2943.
  10. Geballe AP, Morris DR. Initiation codons within 5′-leaders of mRNA as regulators of translation. Trends Biochem Sci. 1994;19:159–164. doi: 10.1016/0968-0004(94)90277-1.
  11. Guyon VN, Astwood JD, Garner EC, Dunker AK, Taylor LP. Isolation and characterization of cDNAs expressed in the early stages of flavonol-induced pollen germination in petunia. Plant Physiol. 2000;123:699–710. doi: 10.1104/pp.123.2.699.
  12. Heslop-Harrison Y, Shivanna KR. The receptive surface of the angiosperm stigma. Ann Bot. 1977;41:1233–1258.
  13. Hu MC-Y, Tranque P, Edelman GM, Mauro VP. rRNA-complementarity in the 5′-untranslated region of mRNA specifying the Gtx homeodomain protein: evidence that base-pairing to 18S rRNA affects translational efficiency. Proc Natl Acad Sci USA. 1999;96:1339–1344. doi: 10.1073/pnas.96.4.1339.
  14. Hulzink RJM. Post-transcriptional regulation of gene expression during male gametogenesis: regulatory and structural properties of the 5′-untranslated region of pollen-expressed genes. PhD thesis. The Netherlands: Catholic University Nijmegen; 2002.
  15. Hulzink RJM, de Groot PFM, Croes AF, Quaedvlieg W, Twell D, Wullems GJ, van Herpen MMA. The 5′-untranslated region of the ntp303 gene strongly enhances translation during pollen tube growth, but not during pollen maturation. Plant Physiol. 2002;129:342–353. doi: 10.1104/pp.001701.
  16. Jacobs-Anderson JS, Parker R. Computational identification of cis-acting elements affecting post-transcriptional control of gene expression in Saccharomyces cerevisiae. Nucleic Acids Res. 2000;28:1604–1617. doi: 10.1093/nar/28.7.1604.
  17. Joshi CP, Zhou H, Huang X, Chiang VL. Context sequences of translation initiation codons in plants. Plant Mol Biol. 1997;35:993–1001. doi: 10.1023/a:1005816823636.
  18. Klaff P, Riesner D, Steger G. RNA structure and regulation of gene expression. Plant Mol Biol. 1996;32:89–106. doi: 10.1007/BF00039379.
  19. Kleffe J, Borodovsky M. First and second moment of count of words in random texts generated by Markov chains. Comput Appl Biosci. 1992;8:433–441. doi: 10.1093/bioinformatics/8.5.433.
  20. Kozak M. Initiation of translation in prokaryotes and eukaryotes. Gene. 1999;234:187–208. doi: 10.1016/s0378-1119(99)00210-3.
  21. Kurtz S, Schleiermacher C. REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics. 1999;15:426–427. doi: 10.1093/bioinformatics/15.5.426.
  22. Lukaszewicz M, Jérouville B, Boutry M. Signs of translational regulation within the transcript leader of a plant plasma membrane H+-ATPase gene. Plant J. 1998;14:413–423. doi: 10.1046/j.1365-313x.1998.00139.x.
  23. Novina CD, Roy AL. Core promoters and transcriptional control. Trends Genet. 1996;12:351–355.
  24. Pain VM. Initiation of protein synthesis in eukaryotic cells. Eur J Biochem. 1996;236:747–771. doi: 10.1111/j.1432-1033.1996.00747.x.
  25. Schrauwen JAM, de Groot PFM, van Herpen MMA, van der Lee T, Reijnen WH, Weterings KAP, Wullems GJ. Stage-related expression of mRNAs during pollen development in lily and tobacco. Planta. 1990;182:298–304. doi: 10.1007/BF00197125.
  26. Shayig RM. Role of gene overlap in the regulation of mRNA translation for mitochondrial cytochrome p-450c27/25 in the rat. J Biol Chem. 1997;272:4050–4057.
  27. Sinha S, Tompa M. Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 2002;30:5549–5560. doi: 10.1093/nar/gkf669.
  28. Stinson JR, Eisenberg AJ, Willing RP, Pe ME, Hanson DD, Mascarenhas JP. Genes expressed in the male gametophyte of flowering plants and their isolation. Plant Physiol. 1987;83:442–447. doi: 10.1104/pp.83.2.442.
  29. Van Helden J, André B, Collado-Vides J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol. 1998;281:827–842. doi: 10.1006/jmbi.1998.1947.
  30. Van Helden J, André B, Collado-Vides J. A web site for the computational analysis of yeast regulatory sequences. Yeast. 2000a;16:177–187. doi: 10.1002/(SICI)1097-0061(20000130)16:2<177::AID-YEA516>3.0.CO;2-9.
  31. Van Helden J, del Olmo M, Pérez-Ortín J. Statistical analysis of yeast genomic downstream sequences reveals putative polyadenylation signals. Nucleic Acids Res. 2000b;28:1000–1010. doi: 10.1093/nar/28.4.1000.
  32. Van Helden J, Rios AF, Collado-Vides J. Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res. 2000c;28:1808–1818. doi: 10.1093/nar/28.8.1808.
  33. Wang L, Wessler SR. Inefficient reinitiation is responsible for upstream open reading frame-mediated translational repression of the maize r gene. Plant Cell. 1998;10:1733–1745. doi: 10.1105/tpc.10.10.1733.
  34. Willing RP, Bashe D, Mascarenhas JP. An analysis of the quantity and diversity of messenger RNAs from pollen and shoots of Zea mays. Theor Appl Genet. 1988;75:751–753.
  35. Willing RP, Mascarenhas JP. Analysis of the complexity and diversity of mRNAs from pollen and shoots of Tradescantia. Plant Physiol. 1984;75:865–868. doi: 10.1104/pp.75.3.865.

Articles from Plant Physiology are provided here courtesy of Oxford University Press
