Abstract
Molecular mechanisms that regulate gene expression can occur either before or after transcription. The information for post-transcriptional regulation can lie within the sequence or structure of the RNA transcript and it has been proposed that G-quadruplex nucleic acid sequence motifs may regulate translation as well as transcription. Here, we have explored the incidence of G-quadruplex motifs in and around the untranslated regions (UTRs) of mRNA. We observed a significant strand asymmetry, consistent with a general depletion of G-quadruplex-forming RNA. We also observed a positional bias in two distinct regions, each suggestive of a specific function. We observed an excess of G-quadruplex motifs towards the 5′-ends of 5′-UTRs, supportive of a hypothesis linking 5′-UTR RNA G-quadruplexes to translational control. We then analysed the vicinity of 3′-UTRs and observed an over-representation of G-quadruplex motifs immediately after the 3′-end of genes, especially in those cases where another gene is in close proximity, suggesting that G-quadruplexes may be involved in the termination of gene transcription.
INTRODUCTION
There are numerous mechanisms that regulate gene expression either at the DNA level or at the RNA level. Mechanisms for post-transcriptional regulation include the control of mRNA processing, nucleocytoplasmic transport, cellular and subcellular localization, translation efficiency and stability. Several studies have demonstrated that the genetic information required for post-transcriptional control is located mainly in the 5′- and 3′-untranslated regions (UTRs) of mRNA, and may involve both the primary sequence and secondary structure of non-protein-coding elements (1). Control via primary sequence recognition is exemplified by the action of ∼22 nt single-stranded trans-acting regulatory elements, called microRNAs, that target mRNA sites, generally in their 3′-UTR, leading to translation repression (2). Secondary structures formed in 5′- and 3′-UTRs can also serve as regulatory elements by acting as target sites for RNA-binding factors such as proteins (3) or small molecule metabolites (4), or by interacting directly with the translation machinery (5–7).
There is scope for non-canonical nucleic acid structures to form in RNA transcripts. Certain guanine-rich nucleic acid sequences are predisposed to adopting four-stranded structures known as G-quadruplexes (8) that comprise stacks of hydrogen-bonded G-tetrads, each containing four guanines. There is evidence that G-quadruplexes can form in the DNA at telomeres (10) under the control of telomere-binding proteins (9,10). Owing to the functional relationship between telomere maintenance, cell proliferation, and cancer, the telomeric DNA G-quadruplex is under consideration as a potential molecular target for anticancer therapeutics (11). It has also been proposed that DNA G-quadruplex motifs found within gene promoters may be involved in controlling gene expression at the transcriptional level (12). There is some in vitro experimental evidence for the promoter-quadruplex hypothesis from chemical biology studies on proto-oncogenes including c-myc (13), k-ras (14) and c-kit (15). Furthermore, genome-wide computational analysis has revealed that putative quadruplex-forming sequences (putative quadruplex sequences, PQS) are prevalent in the human genome (16,17), and that there is a significant enrichment in gene promoter regions relative to the rest of the genome, with almost half of all protein-coding genes found to have putative quadruplex-forming motifs in their promoters (18). This has also been found to be the case for a variety of warm-blooded animals, and the presence of putative G-quadruplex motifs in the first intron of genes has recently been studied (19).
While significant attention has hitherto been focused on DNA G-quadruplexes and their potential role in biology, certain G-rich RNA sequences can also fold into quadruplexes (20,21). Indeed, a few cases have been reported where intramolecular G-quadruplex formation within mRNAs has been proposed to be associated with function. For example, G-quadruplex formation in the 3′-UTR of insulin-like growth factor IGF-II mRNA was shown to occur downstream of an endonucleolytic cleavage site (22), the fragile X mental retardation protein (FMRP) has been shown to bind a G-quadruplex within the coding region of the corresponding mRNA (23) and an intramolecular RNA G-quadruplex motif has been found within the fibroblast growth factor (FGF-2) internal ribosome entry site (24). A cytoplasmic exoribonuclease, mXRN1p, has been shown to exhibit a substrate preference for G-quadruplex RNA (25). We recently discovered a conserved, intramolecular G-quadruplex motif within the 5′-UTR of the gene transcript of the human NRAS proto-oncogene, and we have demonstrated that this RNA G-quadruplex inhibits translation (26). Computational analysis revealed that there are 2922 genes containing 5′-UTR RNA G-quadruplex elements in the human genome. Herein, we report on a detailed computational study of putative quadruplex-forming sequences associated with the 5′- and 3′-UTRs of human mRNAs. The outcomes of this study have provided the basis for proposals as to how such motifs may be involved in the regulation of gene expression.
METHODS
Sequence data and gene descriptions were extracted from Ensembl with BioMart, using build 36 of the human genome sequence throughout. Sequences analysed were known transcripts of known protein-coding genes, using the Ensembl definitions and flags throughout. We considered every transcript individually, unless transcripts had identical UTR sequences; thus in some cases there is more than one UTR per gene. We investigated the key conclusions of this work using only genes with precisely one transcript; the results were broadly the same, and are shown in Supplementary material. Where gene-level analyses are reported, a gene was considered to have a UTR PQS if any of its transcripts did so. PQS were identified using the program quadparser (16), which is available online at http://www.quadruplex.org/?view=quadparser. Briefly, we used the default parameters for this program, which searches for sequences of the form G₃₊N₁₋₇G₃₊N₁₋₇G₃₊N₁₋₇G₃₊ (four runs of at least three guanines separated by loops of one to seven bases) on either strand of the sequence given. Other analyses were performed using custom-written Perl scripts. Statistical analyses were performed using chi-squared tests or as otherwise appropriate. Full details are available in Supplementary material.
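The motif definition above can be illustrated with a regular expression. The following is a minimal sketch, not the quadparser source: the regular expressions, the function name `find_pqs` and the use of non-overlapping matching are assumptions made for illustration, and quadparser may treat overlapping motifs differently. A C-rich match on the given strand corresponds to a G-rich motif on the complementary strand.

```python
import re

# Assumed regex rendering of G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+,
# where N is any base (including G). Illustrative only.
LOOP = r"[ACGTUN]{1,7}"
G_PQS = re.compile(r"G{3,}" + LOOP + r"G{3,}" + LOOP + r"G{3,}" + LOOP + r"G{3,}")
C_PQS = re.compile(r"C{3,}" + LOOP + r"C{3,}" + LOOP + r"C{3,}" + LOOP + r"C{3,}")


def find_pqs(seq):
    """Return sorted (start, end, label) tuples for non-overlapping matches.

    "G-PQS" marks a G-rich motif on the strand supplied; "C-PQS" marks a
    C-rich stretch, i.e. a G-rich motif on the complementary strand.
    """
    seq = seq.upper()
    hits = [(m.start(), m.end(), "G-PQS") for m in G_PQS.finditer(seq)]
    hits += [(m.start(), m.end(), "C-PQS") for m in C_PQS.finditer(seq)]
    return sorted(hits)


if __name__ == "__main__":
    # Human telomeric repeat: four GGG runs separated by TTA loops.
    print(find_pqs("GGGTTAGGGTTAGGGTTAGGG"))
    # The complementary C-rich sequence is reported as a C-PQS.
    print(find_pqs("CCCTAACCCTAACCCTAACCC"))
```

Counting G-PQS and C-PQS separately in this way is what allows the strand asymmetry discussed in the Results to be measured.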
RESULTS AND DISCUSSION
Computational approaches can provide insights into the potential roles of sequence motifs that cluster in specific structural regions of the genome. We have previously used computational analysis of genomic data to suggest a functional role for G-quadruplex sequence motifs within gene promoters (18). Here, we have mapped the location of G-quadruplex motifs with high resolution in and around the 5′- and 3′-UTRs of human protein-coding genes.
Incidence of G-quadruplex motifs in 5′-UTRs and in 3′-UTRs
Using build 36 of the human genome sequence, we extracted unique 5′- and 3′-UTRs corresponding to all known protein-coding transcripts post-splicing. This yielded 21 658 unique genes, with a total of 32 985 annotated 5′-UTRs, and 32 818 3′-UTRs. The 5′-UTRs were in general significantly shorter than the 3′-UTRs, with a mean length of 243 bases, compared with 899 bases for the 3′-UTRs. The median lengths were 120 and 494 bases, respectively, showing that the distribution of UTR lengths comprised many relatively short UTRs and a tail of long UTRs, with the longest 5′-UTR being 24 kb, and the longest 3′-UTR 14 kb. From these data, using our search program quadparser (16), we investigated whether these UTRs contain PQS, on either of the two strands present in the cDNA sequence (Table 1). To distinguish between the two strands, we have referred to motifs identified in the coding strand as G-PQS (as the transcribed RNA sequence would contain a G-rich sequence capable of forming a G-quadruplex), and motifs identified in the template strand as C-PQS (as the transcribed RNA sequence would be C-rich and not form a G-quadruplex) (Figure 1).
Table 1.
Measure | 5′-UTRs | 3′-UTRs |
---|---|---|
No. UTRs | 32 985 | 32 818 |
Average length | 243 bases | 899 bases |
No. UTRs with PQS (%) | 4141 (12.6%) | 5041 (15.3%) |
With G-PQS | 2034 (6.2%) | 2740 (8.3%) |
With C-PQS | 2525 (7.7%) | 3252 (9.9%) |
Ratio C/G (UTRs with PQS) | 1.24 | 1.19 |
No. G-PQS | 2334 | 3530 |
No. C-PQS | 3070 | 4526 |
G-PQS density | 0.291/kb | 0.120/kb |
C-PQS density | 0.382/kb | 0.153/kb |
Ratio C/G (PQS density) | 1.32 | 1.28 |
Transcriptome G-PQS density | 0.077/kb | |
Transcriptome C-PQS density | 0.077/kb | |
Whole genome G-PQS density | 0.057/kb | |
Whole genome C-PQS density | 0.057/kb | |
The number of UTRs with G-PQS and the number with C-PQS do not sum to the total number with PQS, as some UTRs contain both a G-PQS and a C-PQS.
Of the 32 985 5′-UTRs, 4141 (12.6%) exhibited one or more PQS on at least one of the two strands. However, the two strands were not equivalent, with only 2034 (6.2%) having a G-PQS, whereas 2525 (7.7%) were associated with a C-PQS. We also calculated the overall densities of PQS in 5′-UTRs to be 0.291 G-PQS/kb, and 0.382 C-PQS/kb. This shows a significantly greater density of C-PQS than G-PQS by a factor of 1.31 (P = 3 × 10⁻²⁹). Of the 32 818 3′-UTRs, 5041 (15.3%) exhibited one or more PQS on at least one of the two strands. The proportion of 3′-UTRs with PQS is therefore higher, but this could be easily explained (indeed, overcompensated for) by the observation that 3′-UTRs are in general much longer than 5′-UTRs, by a factor of about 4. On considering the two strands separately, we found that in the 3′-UTR, 2740 (8.3%) were associated with G-PQS, and 3252 (9.9%) with C-PQS. However, the densities of the two motifs are significantly lower (P = 1 × 10⁻⁵³) in the 3′-UTR as compared with the 5′-UTR, at 0.120 G-PQS/kb and 0.153 C-PQS/kb. This gives a strand asymmetry ratio of 1.28 (P = 2 × 10⁻²⁴) for 3′-UTR PQS, broadly the same as for PQS in the 5′-UTRs. These results are summarized in Table 1.
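As a rough consistency check, the densities and strand ratios quoted above can be approximately recovered from the counts in Table 1. The sketch below assumes that the total UTR length is well approximated by the number of UTRs multiplied by the rounded mean length; the function name `density_per_kb` is hypothetical, and small discrepancies in the third decimal place are expected from that rounding.

```python
def density_per_kb(n_motifs, n_utrs, mean_len_bases):
    """Approximate motif density per kb, assuming total length = n_utrs * mean length."""
    total_kb = n_utrs * mean_len_bases / 1000.0
    return n_motifs / total_kb


# Counts taken from Table 1.
utr5 = {"n_utrs": 32985, "mean_len": 243, "g_pqs": 2334, "c_pqs": 3070}
utr3 = {"n_utrs": 32818, "mean_len": 899, "g_pqs": 3530, "c_pqs": 4526}

for name, d in (("5'-UTR", utr5), ("3'-UTR", utr3)):
    g = density_per_kb(d["g_pqs"], d["n_utrs"], d["mean_len"])
    c = density_per_kb(d["c_pqs"], d["n_utrs"], d["mean_len"])
    ratio = d["c_pqs"] / d["g_pqs"]  # strand asymmetry, independent of length
    print(f"{name}: G-PQS {g:.3f}/kb, C-PQS {c:.3f}/kb, C/G ratio {ratio:.2f}")
```

The C/G ratio does not depend on the length approximation, because both strands share the same total length.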
The occurrence of G-quadruplex motifs is strongly affected by base composition. Therefore, we investigated whether the asymmetry observed could be accounted for by this factor. In the 5′-UTR, there are more G bases than C bases (29.8% against 28.8%), which would suggest that more G-PQS would be expected to be present than C-PQS, in contradiction to the observed result. In the 3′-UTR, both bases are present in almost exactly equal proportions, at 21.7 and 21.6%, respectively. This could account for the difference in PQS density between the 5′- and 3′-UTRs, but not the excess presence of C-PQS.
To further study this effect, we generated simulated UTRs. To construct these, we studied every UTR, and counted the frequency with which each base was found at every position from the 5′-end. For each position, we also counted the number of times the real UTRs terminated at that position. We then generated simulated UTRs in which the base frequencies at every position reflected the natural base frequencies at that position, and with UTR termination occurring at each position depending on the natural termination probability. One hundred replicates, each consisting of 100 000 5′- and 3′-UTRs, were generated and searched for G-quadruplex motifs as above. As expected from the base frequencies, G-PQS were more frequently observed than C-PQS. For the 5′-UTRs, the simulations gave a ratio of C-PQS/G-PQS of 0.67 ± 0.09, compared to an observed ratio of 1.31 (significant, P = 5 × 10⁻¹⁴ based on the normal distribution curve). For the 3′-UTRs the simulated asymmetry was weaker, with a ratio of 0.96 ± 0.10, compared to the observed ratio of 1.28 (significant, P = 4 × 10⁻⁴ based on the normal distribution curve). Thus, base composition alone cannot account for the observed effects.
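The simulation procedure described above can be sketched as follows. This is an illustrative reading rather than the original Perl implementation: the function names are hypothetical, and the 'natural termination probability' is interpreted here as the conditional probability that a UTR ends at a position given that it has reached it.

```python
import random
from collections import Counter


def fit_profile(utrs):
    """Tabulate base counts at each position and the number of UTRs of each length."""
    max_len = max(len(u) for u in utrs)
    base_counts = [Counter() for _ in range(max_len)]
    end_counts = Counter(len(u) for u in utrs)
    for u in utrs:
        for i, base in enumerate(u):
            base_counts[i][base] += 1
    return base_counts, end_counts


def simulate_utr(base_counts, end_counts, rng):
    """Draw one simulated UTR from the position-specific profile."""
    seq = []
    for i, counts in enumerate(base_counts):
        bases, weights = zip(*sorted(counts.items()))
        seq.append(rng.choices(bases, weights=weights)[0])
        reaching = sum(counts.values())    # real UTRs with a base at position i
        ending = end_counts.get(i + 1, 0)  # real UTRs whose last base is position i
        if rng.random() < ending / reaching:
            break                          # assumed conditional termination
    return "".join(seq)


if __name__ == "__main__":
    rng = random.Random(0)
    toy_utrs = ["GGGA", "GCC", "GGGAT"]  # toy input, not real data
    profile = fit_profile(toy_utrs)
    print([simulate_utr(*profile, rng) for _ in range(5)])
```

Because each position is sampled independently, such simulated UTRs preserve positional base composition and the length distribution but no higher-order sequence structure, which is what makes them a suitable null model for the effect of base composition alone.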
For comparison, we also examined the G-quadruplex densities of the entire transcriptome (defined as the total transcript of known human genes, including introns, according to Ensembl) as well as the whole genome [as described previously (16)]. Both of these showed virtually no strand asymmetry. In the transcriptome as a whole (1.1 Gb) we found 86 472 G-PQS and 86 038 C-PQS. This gives a transcriptome PQS density of 0.077 PQS/kb for each strand, for a total of 0.153 PQS/kb overall, slightly higher than that for the whole genome, which had an overall density of 0.115 PQS/kb, also split equally between the two strands.
Next, we analysed the Gene Ontology (GO) codes associated with all of these genes, to see if they corresponded to any over- or under-represented categories of genes (Tables 2 and 3). We used the same methodology reported in our previous analysis of gene promoters (18), which compared the number of genes in a particular GO category with G-PQS in a region to the total number of genes in that GO category, which varies significantly. We found that for the 5′-UTR G-PQS, 22 GO categories were significant at our required threshold of P < 3.9 × 10⁻⁶ (using the conservative Bonferroni correction to a P-value of 0.05, considering two tests on each of 6447 GO codes). For the 3′-UTR G-PQS, 28 GO categories were significant at the same level. All of these categories, together with the associated P-values, are listed in Tables 2 and 3. Twelve of these GO categories were the same as for the 5′-end, again showing a relationship between the two UTRs. Interestingly, some of the GO categories shown to be unlikely to have promoter PQS (18) were also found to be very unlikely to have G-PQS in either their 5′- or 3′-UTRs, such as the genes involved in olfaction and immune response.
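The significance threshold quoted above follows directly from the Bonferroni correction. The short sketch below reproduces the arithmetic; the variable names are illustrative, and the use of a base-10 logarithm for comparison with Tables 2 and 3 is an assumption, since the tables state only that a negative logarithm is shown.

```python
import math

alpha = 0.05          # family-wise significance level
n_go_codes = 6447     # GO codes considered
tests_per_code = 2    # two tests on each GO code

# Bonferroni: divide the family-wise level by the total number of tests.
threshold = alpha / (n_go_codes * tests_per_code)
min_neg_log = -math.log10(threshold)  # assumed base-10 scale

print(f"per-test threshold: {threshold:.2e}")
print(f"smallest qualifying -log10(P): {min_neg_log:.2f}")
```

Under this reading, the threshold rounds to the quoted 3.9 × 10⁻⁶, and only categories with a negative log P of roughly 5.4 or more qualify for listing.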
Table 2.
GO code | Description | 5′-UTR −log p | 3′-UTR −log p |
---|---|---|---|
GO:0001505 | Regulation of neurotransmitter levels | 5.9 | – |
GO:0001996 | Positive regulation of heart contraction rate by epinephrine-norepinephrine | – | 5.5 |
GO:0001997 | Increased strength of heart contraction by epinephrine-norepinephrine | – | 5.5 |
GO:0003700 | Transcription factor activity | 5.8 | 10.5 |
GO:0003707 | Steroid hormone receptor activity | – | 7.4 |
GO:0004250 | Aminopeptidase I activity | 6.9 | – |
GO:0004345 | Glucose-6-phosphate 1-dehydrogenase activity | – | 5.5 |
GO:0004385 | Guanylate kinase activity | 5.9 | – |
GO:0004409 | Homoaconitate hydratase activity | – | 5.5 |
GO:0004965 | GABA-B receptor activity | – | 6.4 |
GO:0005083 | Small GTPase-regulator activity | 5.4 | – |
GO:0005085 | Guanyl-nucleotide exchange factor activity | 13.9 | 12.8 |
GO:0005089 | Rho guanyl-nucleotide exchange factor activity | 10.9 | 9.2 |
GO:0006003 | Fructose 2,6-bisphosphate metabolic process | – | 6.6 |
GO:0006308 | DNA catabolic process | – | 5.8 |
GO:0007264 | Small GTPase mediated signal transduction | 7.1 | – |
GO:0007275 | Multicellular organismal development | – | 9.2 |
GO:0007399 | Nervous system development | – | 7.0 |
GO:0016600 | Flotillin complex | 6.1 | – |
GO:0035023 | Regulation of Rho protein signal transduction | 10.3 | 9.2 |
GO:0042825 | TAP complex | – | 5.5 |
GO:0043565 | Sequence-specific DNA binding | 6.4 | 8.5 |
GO:0045944 | Positive regulation of transcription from RNA polymerase II promoter | – | 7.1 |
Where significant over-representation was found, the P-value associated with the test is presented as a negative logarithm. Where the test was not significant for one of the UTRs, this is marked as ‘–’.
Table 3.
GO code | Description | 5′-UTR −log p | 3′-UTR −log p |
---|---|---|---|
GO:0000786 | Nucleosome | 6.5 | – |
GO:0003735 | Structural constituent of ribosome | – | 10.2 |
GO:0004872 | Receptor activity | 9.0 | – |
GO:0004984 | Olfactory receptor activity | 22.3 | 26.6 |
GO:0005576 | Extracellular region | 10.2 | – |
GO:0005739 | Mitochondrion | – | 10.0 |
GO:0005840 | Ribosome | – | 8.3 |
GO:0006334 | Nucleosome assembly | 7.5 | 6.2 |
GO:0006412 | Translation | 6.3 | 8.2 |
GO:0006955 | Immune response | 9.7 | 7.2 |
GO:0007186 | G-protein coupled receptor protein signaling pathway | 18.1 | 12.0 |
GO:0007608 | Sensory perception of smell | 20.0 | 27.7 |
GO:0008152 | Metabolic process | – | 8.5 |
GO:0042612 | MHC class I protein complex | 5.9 | – |
GO:0050896 | Response to stimulus | 18.3 | 14.8 |
Where significant under-representation was found, the P-value associated with the test is presented as a negative logarithm. Where the test was not significant for one of the UTRs, this is marked as ‘–’.
Association of G-quadruplex motifs in both 3′- and 5′-UTRs
We considered whether genes with G-quadruplexes in their 5′-UTRs were also more likely to have them in their 3′-UTRs, or whether these were independent properties. Here, we have restricted our analysis to only include G-PQS sequences, as there is some evidence to support the hypothesis that these motifs may be associated with function in RNA, although the C-PQS could potentially form DNA G-quadruplexes on the template strand or a C-rich RNA secondary structure called the i-motif (27,28).
We found that 314 genes had G-PQS motifs in both the 5′- and 3′-UTRs, out of a total of 1665 genes with G-PQS in the 5′-UTR and 2154 with G-PQS in the 3′-UTR, from an overall total of 21 658 genes (Figure 2). The proportion of genes with 5′-UTR G-PQS that also have 3′-UTR G-PQS was 18.9% (314/1665), as compared to the proportion of all genes with 3′-UTR G-PQS, which was 9.9% (2154/21 658). If these two sets of motifs were entirely independent, the two proportions would be expected to be the same. Thus, there is clearly a significant association (P = 5 × 10⁻³⁴) between the 5′- and 3′-UTRs, and the presence of G-PQS at the two ends is not independent. This may be suggestive of a functional long-range interaction. Indeed, such long-range interactions between sequences in the 5′- and 3′-UTRs have been proposed to enhance translation efficiency (6,29). We tested whether this correlation was simply due to a correlation in UTR lengths, and found no significant correlation between the length of a 5′-UTR and the corresponding 3′-UTR (Pearson r = 0.02 ± 0.04). We also confirmed that the same result was found using only non-redundant UTRs (see Table S3).
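The association between 5′- and 3′-UTR G-PQS can be examined with a 2 × 2 contingency table built from the gene counts above. The sketch below uses a Pearson chi-squared test without continuity correction, which is an assumption: the exact test used for this comparison is not specified here, so the resulting P-value need not match the reported one. The function name `chi2_2x2` is hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and P-value (1 d.f.) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For one degree of freedom the upper tail probability is erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p


total, with_5, with_3, both = 21658, 1665, 2154, 314

only_5 = with_5 - both                    # G-PQS in 5'-UTR only
only_3 = with_3 - both                    # G-PQS in 3'-UTR only
neither = total - with_5 - with_3 + both  # G-PQS in neither UTR

stat, p = chi2_2x2(both, only_5, only_3, neither)
expected_both = with_5 * with_3 / total   # expectation under independence

print(f"observed in both: {both}, expected if independent: {expected_both:.1f}")
print(f"chi-squared = {stat:.1f}, P = {p:.1e}")
```

Comparing the observed overlap with the overlap expected under independence makes the size of the effect explicit, separately from its statistical significance.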
Positional bias and strand asymmetry of PQS in the 5′-UTR
We then considered whether PQS within the 5′-UTR showed any positional bias. The data presented in Table 4 show strong clustering of G-PQS in the first ∼50 bases, with a gradually declining frequency upon moving further away from the 5′-end. It was noteworthy that there was a strand bias to this positional effect, where C-PQS showed significantly less clustering than G-PQS (significant, P = 2 × 10⁻⁹¹ within 50 bases). We then performed a high-resolution mapping of all PQS in the UTRs, together with the upstream (promoter) region for comparison, studying the frequency with which every base position was occupied by a PQS (Figure 3a). Examination of the 5′-UTR region at this higher resolution (Figure 3a) confirms the results described earlier (Table 4), with clear strand asymmetry and the highest density of G-PQS at the 5′-end of the 5′-UTR, decreasing approximately linearly along the 5′-UTR. In contrast, the C-PQS on the complementary strand are relatively depleted in the first 50 bases at the 5′-end of the 5′-UTR, and then become increasingly common over the next 50 bases, before decreasing linearly. Over most of the 5′-UTR, and with the striking exception of the first 50 bases at the 5′-end, C-PQS are 30% more common than G-PQS, despite the fact that there are more G than C bases. This strand asymmetry, which is not shared by the PQS in the promoter region, clearly suggests that the functional significance of this sequence may be at the RNA level.
Table 4.
Start | G-PQS (%), from 5′-end of 5′-UTR | C-PQS (%), from 5′-end of 5′-UTR | G-PQS (%), from 3′-end of 3′-UTR | C-PQS (%), from 3′-end of 3′-UTR |
---|---|---|---|---|
Within first 10 bases | 193 (8.3) | 135 (4.4) | 13 (0.4) | 0 (0) |
Within first 20 bases | 316 (13.5) | 241 (7.9) | 18 (0.5) | 0 (0) |
Within first 50 bases | 653 (28.0) | 569 (18.5) | 106 (3.0) | 34 (0.8) |
Within first 100 bases | 1091 (46.7) | 1113 (36.3) | 356 (10.1) | 266 (5.9) |
Within first 1/20th | 253 (10.8) | 201 (6.5) | 102 (2.9) | 57 (1.3) |
Within first 1/10th | 437 (18.7) | 356 (11.6) | 287 (8.1) | 210 (4.7) |
The strand difference can be simply represented at each position along the UTR by calculating the difference between the proportions of G-PQS and C-PQS divided by the sum of the proportions, yielding a dimensionless and normalized measure of excess. The result of this analysis is shown in Figure 3b. Figure 3c shows a model to describe this strand asymmetry, assuming two factors are affecting the distributions: generalized under-representation of G-PQS resulting from generic deselection in mRNA, and localized over-representation of G-PQS due to a functional role at the 5′-end of the 5′-UTR. This model is in good agreement with the observed asymmetries (Figure 3b). The observed data also contain an interesting periodic variation, with periodicity 30 ± 5 bases; it is unclear what this pattern relates to.
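The normalized excess described above can be written as (p_G − p_C)/(p_G + p_C), where p_G and p_C are the proportions of G-PQS and C-PQS at a given position. A minimal sketch follows; the function names and the convention of returning zero where neither strand has a motif are assumptions made for illustration.

```python
def normalized_excess(p_g, p_c):
    """Return (p_g - p_c) / (p_g + p_c), a value between -1 and +1.

    Positive values indicate an excess of G-PQS, negative values an excess
    of C-PQS. Positions with no motif on either strand return 0.0 (assumed).
    """
    total = p_g + p_c
    if total == 0:
        return 0.0
    return (p_g - p_c) / total


def excess_profile(g_props, c_props):
    """Apply the measure position by position along a UTR."""
    return [normalized_excess(g, c) for g, c in zip(g_props, c_props)]


if __name__ == "__main__":
    # Toy per-position proportions, not values from the study.
    g = [0.030, 0.020, 0.010, 0.000]
    c = [0.010, 0.020, 0.015, 0.000]
    print(excess_profile(g, c))
```

Dividing by the sum makes the measure insensitive to the overall decline in PQS frequency along the UTR, so that only the balance between the two strands is reported.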
It has been previously shown that structural motifs such as hairpins in the 5′-UTRs can modulate mRNA translation efficiency when located in close proximity to the 5′-end of the 5′-UTR, either by interacting directly with the translation machinery (7,30,31) or by acting as target sites for proteins (3,29). We previously showed that a G-quadruplex in the 5′-UTR of the NRAS gene significantly reduces translational efficiency (26). Our previous experimental observations (26) coupled with the present computational study lead us to postulate that G-quadruplexes near the 5′-end of 5′-UTRs may be involved in translation regulation (Figure 4a), and hence that there may be an evolutionary pressure in favour of the positional bias of G-quadruplex motifs towards this initial region.
Positional bias and strand asymmetry of PQS in and around the 3′-UTR
We have also carried out a high-resolution mapping of the 3′-UTR and the region immediately downstream of the gene (Figure 5), in analogous fashion to that performed for the 5′-UTRs. The 3′-UTRs of genes have a lower density of PQS than the 5′-UTRs, but there is still a strand bias within the 3′-UTR, with an excess of C-PQS. Two noteworthy features were observed in the immediate vicinity of the transcription end site (TES) junction. First, PQS are extremely rare, compared to all other areas studied, in a region stretching from 20 bases within the 3′-end of the 3′-UTR to 10 bases downstream from the TES. Second, just 3′ of this, there is a very sharp peak in G-PQS density that is not accompanied by an equivalent C-PQS effect (i.e. the strand bias is strong), and is not reflected by an increase in the proportion of G bases (see Supplementary material). These results are suggestive of RNA PQS function localized proximal to the junction between the 3′-UTR and the 3′-downstream region.
In considering the possible functional implications of a G-quadruplex immediately downstream of the TES, we were drawn to observations and suggestions made by Proudfoot and co-workers (32,33) that structures formed by G-rich sequences at the 3′-end of a gene may help to demarcate the end of transcription, especially in cases where there is another gene shortly after the TES. Therefore, given our observation of a peak in G-PQS density in the 100 bases immediately 3′ of the TES, we were prompted to investigate whether known protein-coding genes with G-PQS in this 100-base region were particularly likely also to have nearby genes. We identified 562 genes with G-PQS in this region, giving a total of 859 such G-PQS. In each case, we measured the distance from these genes to the next known gene in the 3′ direction relative to the gene. In one case, there was no next gene before the end of the chromosome, and that gene was not considered further. The results are shown in Table 5. Of the remaining 561 genes, 50 (8.9%) overlapped with other known genes. Among the non-overlapping genes, the mean distance to the next gene was 47.3 kb. Of the 561 genes, 97 (17.3%) had another gene within a kilobase. These results were compared to a control set where all known protein-coding genes were considered. Among this dataset of 21 679 genes, 36 did not have a next gene, and 1589 (7.3%) were overlapping. The mean distance to the next gene among the remainder was 93.4 kb, considerably larger than that found for the set with TES-associated G-PQS. Of the entire set of genes, 1633 (7.5%) had another within a kilobase, significantly less than the proportion found for genes with TES-associated G-PQS (17.3%), by a factor of 2.3 (significant, P = 2 × 10⁻¹⁸).
Table 5.
All known protein-coding genes, No. (%) | Class | Genes with G-PQS in 100 bases 3′ of TES, No. (%) | Enrichment |
---|---|---|---|
21 643 (100) | All with next gene | 561 (100) | |
1589 (7.3) | Overlapping (d ≤ 0) | 50 (8.9) | 1.21 |
20 054 (92.7) | Non-overlapping (d > 0) | 511 (91.1) | 0.98 |
1633 (7.5) | Near (0 < d ≤ 1000) | 97 (17.3) | 2.29 |
3376 (15.6) | Medium (1000 < d ≤ 5000) | 114 (20.3) | 1.30 |
15 045 (69.5) | Far (5000 < d) | 300 (53.5) | 0.77 |
3317 (15.3) | Very far (100 000 < d) | 51 (9.1) | 0.59 |
The data are binned into categories where the next gene overlaps the previous one, is within 1 kb (‘near’, shown in bold), from 1 to 5 kb away (‘medium’) or >5 kb away (‘far’). The subset where the next gene is >100 kb away (‘very far’) is also shown. Genes where there was no subsequent gene (because they were near a chromosome end) are not shown. Enrichment is calculated for each class as the percentage of genes with G-PQS divided by the percentage of all genes.
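The enrichment values in Table 5 can be recomputed from the tabulated counts. The sketch below uses the class counts and totals from the table; the dictionary layout and the function name `enrichment` are illustrative choices.

```python
ALL_TOTAL = 21643  # all known protein-coding genes with a next gene
PQS_TOTAL = 561    # genes with G-PQS in the 100 bases 3' of the TES

# class name -> (count among all genes, count among genes with G-PQS)
CLASSES = {
    "Overlapping (d <= 0)": (1589, 50),
    "Non-overlapping (d > 0)": (20054, 511),
    "Near (0 < d <= 1000)": (1633, 97),
    "Medium (1000 < d <= 5000)": (3376, 114),
    "Far (5000 < d)": (15045, 300),
    "Very far (100 000 < d)": (3317, 51),
}


def enrichment(n_all, n_pqs):
    """Fraction of G-PQS genes in a class divided by the fraction of all genes in it."""
    return (n_pqs / PQS_TOTAL) / (n_all / ALL_TOTAL)


for name, (n_all, n_pqs) in CLASSES.items():
    print(f"{name}: {enrichment(n_all, n_pqs):.2f}")
```

Values above 1 indicate classes in which genes with TES-associated G-PQS are over-represented relative to all genes, as is the case for the 'near' class.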
A requirement for 3′-end processing of mRNAs is efficient transcription termination (36,37). This appears to be especially important in the cases of closely spaced genes (33). It has been shown that downstream G-rich sequences promote efficient transcription termination. Proudfoot and co-workers (34,35) have previously shown that the sequence (GGGGGAGGGGG)4, a tetramer of four MAZ-binding sites, strongly activates transcription termination in vitro when positioned downstream of a synthetic poly(A) site, in a manner that does not require the expression of the MAZ protein. We have noted with interest that this particular G-rich sequence is predicted by quadparser to form a stable G-quadruplex (16). Proudfoot and co-workers have further shown that mutation of the sequence to (GGTGAAAGGTG)4, which quadparser (16) does not predict to form a G-quadruplex, does not efficiently terminate transcription. Specifically, it was shown in vivo that the parent, but not the mutated sequence, promotes efficient transcription termination of a heterologous β-globin construct, and that a naturally occurring G-rich sequence, located 100 nt downstream of the poly(A) site in the human β-actin gene, is essential for transcription termination. Equally, it has been previously noted that some G-rich RNA sequences are involved in regulating 3′-end processing of alternatively processed mammalian pre-mRNAs by interaction with hnRNP H protein subfamily members (36,37). Our observation that G-PQS cluster immediately after the end of the 3′-UTR supports the proposal that G-quadruplex structure formation could play a role in 3′-end processing of mRNAs by promoting transcription termination and preventing deleterious run-through (Figure 4b).
CONCLUSIONS
From a comprehensive survey of G-quadruplex motifs in and around human genomic UTRs, we have shown that their incidence shows significant strand asymmetry and positional bias, suggestive of functional roles in RNA. In 5′-UTRs, G-quadruplex motifs tend to exist towards the 5′-end of the 5′-UTR supportive of function relating to translation initiation, as depicted in Figure 4a, and for which there is now some experimental support (26). With respect to 3′-UTRs, G-quadruplex motifs tend to cluster immediately after the 3′-end of the mRNA, particularly in cases of genes that have a proximal gene in the 3′ direction. In such cases, failure to terminate transcription could lead to problems associated with the additional transcription of the adjacent gene. We propose that G-quadruplex motifs may serve as pause elements that promote transcriptional termination, cleavage and polyadenylation (Figure 4b).
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
Funding for Open Access publication charge: Trinity College, Cambridge.
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
J.L.H. is a Research Councils UK Academic Fellow. We thank the BBSRC for project funding, Cancer Research UK for programme funding and the Cambridge Commonwealth Trust and Trinity College, Cambridge for studentship funding.
REFERENCES
- 1.Pesole G, Mignone F, Gissi C, Grillo G, Licciulli F, Liuni S. Structural and functional features of eukaryotic mRNA untranslated regions. Gene. 2001;276:73–81. doi: 10.1016/s0378-1119(01)00674-6. [DOI] [PubMed] [Google Scholar]
- 2.Bartel DP. MicroRNAs: genomics, biogenesis, mechanism and function. Cell. 2004;116:281–297. doi: 10.1016/s0092-8674(04)00045-5. [DOI] [PubMed] [Google Scholar]
- 3.Wilkie GS, Dickson KS, Gray NK. Regulation of mRNA translation by 5′ and 3′-UTR-binding factors. Trends Bioc. Sci. 2003;28:182–188. doi: 10.1016/S0968-0004(03)00051-3. [DOI] [PubMed] [Google Scholar]
- 4.Mandal M, Breaker RR. Gene regulation by riboswitches. Nat. Rev. Mol. Cell Biol. 2004;5:451–463. doi: 10.1038/nrm1403. [DOI] [PubMed] [Google Scholar]
- 5.Kozak M. Structural features in eukaryotic mRNAs that modulate the initiation of translation. J. Biol. Chem. 1991;266:19867–19870. [PubMed] [Google Scholar]
- 6.Sonenberg N. mRNA translation: influence of the 5′ and 3′ untranslated regions. Curr. Opin. Gen. Dev. 1994;4:310–315. doi: 10.1016/s0959-437x(05)80059-0. [DOI] [PubMed] [Google Scholar]
- 7.Babendure JR, Babendure JL, Ding J.-H, Tsien RY. Control of mammalian translation by mRNA structures near caps. RNA. 2006;12:851–861. doi: 10.1261/rna.2309906. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Neidle S, Balasubramanian S. Quadruplex Nucleic Acids. Cambridge: RSC; 2006.
- 9. Schaffitzel C, Berer I, Postberg J, Hanes J, Lipps HJ, Plückthun A. In vitro generated antibodies specific for telomeric guanine-quadruplex DNA react with Stylonychia lemnae macronuclei. Proc. Natl Acad. Sci. USA. 2001;98:8572–8577. doi: 10.1073/pnas.141229498.
- 10. Paeschke K, Simonsson T, Postberg J, Rhodes D, Lipps HJ. Telomere end-binding proteins control the formation of G-quadruplex DNA structures in vivo. Nat. Struct. Mol. Biol. 2005;12:847–854. doi: 10.1038/nsmb982.
- 11. Neidle S, Parkinson GH. Telomere maintenance as a target for anticancer drug discovery. Nat. Rev. Drug Disc. 2002;1:383–393. doi: 10.1038/nrd793.
- 12. Dexheimer TS, Fry M, Hurley LH. DNA quadruplexes and gene regulation. In: Neidle S, Balasubramanian S, editors. Quadruplex Nucleic Acids. Cambridge: RSC Publishing; 2006. pp. 180–207.
- 13. Siddiqui-Jain A, Grand CL, Bearss DJ, Hurley LH. Direct evidence for a G-quadruplex in a promoter region and its targeting with a small molecule to repress c-MYC transcription. Proc. Natl Acad. Sci. USA. 2002;99:11593–11598. doi: 10.1073/pnas.182256799.
- 14. Cogoi S, Quadrifoglio F, Xodo LE. G-rich oligonucleotide inhibits the binding of a nuclear protein to the Ki-ras promoter and strongly reduces cell growth in human carcinoma pancreatic cells. Biochemistry. 2004;43:2512–2523. doi: 10.1021/bi035754f.
- 15. Bejugam M, Sewitz S, Shirude PS, Rodriguez R, Shahid R, Balasubramanian S. Trisubstituted isoalloxazines as a new class of G-quadruplex binding ligands: small molecule regulation of c-kit oncogene expression. J. Am. Chem. Soc. 2007;129:12926–12927. doi: 10.1021/ja075881p.
- 16. Huppert JL, Balasubramanian S. Prevalence of quadruplexes in the human genome. Nucleic Acids Res. 2005;33:2908–2916. doi: 10.1093/nar/gki609.
- 17. Todd AK, Johnstone M, Neidle S. Highly prevalent putative quadruplex sequence motifs in human DNA. Nucleic Acids Res. 2005;33:2901–2907. doi: 10.1093/nar/gki553.
- 18. Huppert JL, Balasubramanian S. G-quadruplexes in promoters throughout the human genome. Nucleic Acids Res. 2007;35:406–413. doi: 10.1093/nar/gkl1057.
- 19. Eddy J, Maizels N. Conserved elements with potential to form polymorphic G-quadruplex structures in the first intron of human genes. Nucleic Acids Res. 2008;36:1321–1333. doi: 10.1093/nar/gkm1138.
- 20. Pan B, Xiong Y, Shi K, Sundaralingam M. Crystal structure of a bulged RNA tetraplex at 1.1 Å resolution: implications for a novel binding site in RNA tetraplex. Structure. 2003;11:1423–1430. doi: 10.1016/j.str.2003.09.017.
- 21. Liu H, Matsugami A, Katahira M, Uesugi S. A dimeric RNA quadruplex architecture comprised of two G:G(:A):G:G(:A) hexads, G:G:G:G tetrads and UUUU loops. J. Mol. Biol. 2002;322:955–970. doi: 10.1016/s0022-2836(02)00876-8.
- 22. Christiansen J, Kofod M, Nielsen FC. A guanosine quadruplex and two stable hairpins flank a major cleavage site in insulin-like growth factor II mRNA. Nucleic Acids Res. 1994;22:5709–5716. doi: 10.1093/nar/22.25.5709.
- 23. Darnell JC, Jensen KB, Jin P, Brown V, Warren ST, Darnell RB. Fragile X mental retardation protein targets G Quartet mRNAs important for neuronal function. Cell. 2001;107:489–499. doi: 10.1016/s0092-8674(01)00566-9.
- 24. Bonnal S, Schaeffer C, Creancier L, Clamens S, Moine H, Prats A.-C, Vagner S. A single internal ribosome entry site containing a G quartet RNA structure drives fibroblast growth factor 2 gene expression at four alternative translation initiation codons. J. Biol. Chem. 2003;278:39330–39336. doi: 10.1074/jbc.M305580200.
- 25. Bashkirov VI, Scherthan H, Solinger JA, Buerstedde JM, Heyer WD. A mouse cytoplasmic exoribonuclease (mXRN1p) with preference for G4 tetraplex substrates. J. Cell Biol. 1997;136:761–773. doi: 10.1083/jcb.136.4.761.
- 26. Kumari S, Bugaut A, Huppert JL, Balasubramanian S. An RNA G-quadruplex in the 5′ UTR of the NRAS proto-oncogene modulates translation. Nat. Chem. Biol. 2007;3:218–221. doi: 10.1038/nchembio864.
- 27. Gehring K, Leroy J, Guéron M. A tetrameric DNA structure with protonated cytosine·cytosine base pairs. Nature. 1993;363:561–565. doi: 10.1038/363561a0.
- 28. Snoussi K, Nonin-Lecomte S, Leroy J.-L. The RNA i-motif. J. Mol. Biol. 2001;309:139–153. doi: 10.1006/jmbi.2001.4618.
- 29. Preiss T, Hentze MW. From factors to mechanisms: translation and translational control in eukaryotes. Curr. Opin. Gen. Dev. 1999;9:515–521. doi: 10.1016/s0959-437x(99)00005-2.
- 30. Pelletier J, Sonenberg N. Photochemical cross-linking of cap binding proteins to eucaryotic mRNAs: effect of mRNA 5′ secondary structure. Mol. Cell Biol. 1985;5:3222–3230. doi: 10.1128/mcb.5.11.3222.
- 31. Kozak M. Circumstances and mechanisms of inhibition of translation by secondary structure in eucaryotic mRNAs. Mol. Cell Biol. 1989;9:5134–5142. doi: 10.1128/mcb.9.11.5134.
- 32. Gromak N, West S, Proudfoot NJ. Pause sites promote transcriptional termination of mammalian RNA polymerase II. Mol. Cell Biol. 2006;26:3986–3996. doi: 10.1128/MCB.26.10.3986-3996.2006.
- 33. Ashfield R, Patel AJ, Bossone SA, Brown H, Campbell RD, Marcu KB, Proudfoot NJ. MAZ-dependent termination between closely spaced human complement genes. EMBO J. 1994;13:5656–5667. doi: 10.1002/j.1460-2075.1994.tb06904.x.
- 34. Yonaha M, Proudfoot NJ. Specific transcriptional pausing activates polyadenylation in a coupled in vitro system. Mol. Cell. 1999;3:593–600. doi: 10.1016/s1097-2765(00)80352-4.
- 35. Yonaha M, Proudfoot NJ. Transcriptional termination and coupled polyadenylation in vitro. EMBO J. 2000;19:3770–3777. doi: 10.1093/emboj/19.14.3770.
- 36. Kostadinov R, Malhotra N, Viotti M, Shine R, D’Antonio L, Bagga P. GRSDB: a database of quadruplex forming G-rich sequences in alternatively processed mammalian pre-mRNA sequences. Nucleic Acids Res. 2006;34:D119–D124. doi: 10.1093/nar/gkj073.
- 37. Bagga P, Arhin GK, Wilusz J. DSEF-1 is a member of the hnRNP H family of RNA-binding proteins and stimulates pre-mRNA cleavage and polyadenylation in vitro. Nucleic Acids Res. 1998;26:5343–5350. doi: 10.1093/nar/26.23.5343.