Author manuscript; available in PMC: 2009 Oct 1.
Published in final edited form as: RNA Biol. 2008 Oct 3;5(4):255–262. doi: 10.4161/rna.7116

Over-represented sequences located on 3' UTRs are potentially involved in regulatory functions

Kihoon Yoon 1, Daijin Ko 2, Mark Doderer 3, Carolina B Livi 1,4, Luiz OF Penalva 5,*
PMCID: PMC2732352  NIHMSID: NIHMS135826  PMID: 18971640

Abstract

Eukaryotic gene expression must be coordinated for the proper functioning of biological processes. This coordination can be achieved at both the transcriptional and post-transcriptional levels. In both cases, regulatory sequences placed either in promoter regions or on UTRs function as markers recognized by regulators that can then activate or repress different groups of genes as needed. While regulatory sequences involved in transcription are quite well documented, there is a lack of information on sequence elements involved in post-transcriptional regulation. We used a statistical over-representation method to identify novel regulatory elements located on UTRs. An exhaustive search approach was used to calculate the frequency of all possible n-mers (short nucleotide sequences) in NCBI RefSeq sequences for 16,160 human genes and to identify any peculiar usage of n-mers on UTRs. After a stringent filtering process, we identified 2,772 highly over-represented n-mers on 3' UTRs. We provide evidence that these n-mers are potentially involved in regulatory functions. The identified n-mers overlap with previously identified binding sites for HuR and TIA-1 and with ARE and GRE sequences. We also determined that n-mers overlap with predicted miRNA target sites. Finally, a method to cluster n-mer groups allowed the identification of putative gene networks.

Keywords: 3' UTR, post-transcriptional regulation, regulatory sequences, RNA binding proteins, gene networks, translation, mRNA stability

Introduction

Post-transcriptional regulation plays a very important role in many biological processes such as embryogenesis, stem cell proliferation, spermatogenesis, sex determination, neurogenesis and erythropoiesis (reviewed in ref. 1). The impact it has on the final protein output of a cell can be appreciated through studies that compare the steady state levels of mRNAs (transcriptome) and proteins (proteome) in the same cell population.2–6 For some genes, substantial differences were found, with accumulated levels of the protein and its corresponding message varying by as much as 30-fold.2 Unfortunately, despite its importance, post-transcriptional regulation continues to be a poorly understood subject.

There are essentially three cytoplasmic processes that can be modulated in eukaryotic cells, ultimately leading to changes in protein production: RNA transport/localization, degradation/stability and RNA translation. Most of the elements necessary for proper regulation of these three processes are located in the 5' and 3' untranslated regions (UTRs) of mRNAs. UTR sequences involved in regulation can be grouped into different categories. The most common are short sequence motifs that function as binding sites for RNA binding proteins and/or non-coding RNAs. Repetitive sequence elements, such as CUG repeats, have also been documented to function as targets of RNA binding proteins. Finally, some UTR sequences interfere with gene expression independently of the action of a regulator; their structural features pose a barrier, potentially influencing the translation of mRNAs. This is the case for moderately stable secondary structures that are typically located in the 5' UTR relatively close to the AUG start codon (reviewed in ref. 7).

There are several examples of UTR mediated regulation in connection to health related issues. For instance, Iron Regulatory Protein (IRP) controls the expression of several mRNAs (ferritin, transferrin, mitochondrial aconitase, etc.) that have a regulatory element named iron responsive element (IRE). IRPs bind to IREs in situations of iron deprivation and inhibit mRNA translation. Mutations that affect the IRE can lead to human disease such as hereditary hyperferritinemia-cataract (reviewed in Rouault, 2006).8 Another good example is the amyloid-β precursor protein (APP) implicated in Alzheimer's and Down syndrome. Translation of APP mRNA is upregulated by interleukin-1 through 5' UTR sequences (reviewed in ref. 9). UTR-mediated regulation is also associated with cancer. For instance, approximately 10% of all mRNAs have atypically long 5' UTRs, in many cases containing a variety of regulatory elements. 75% of genes with long 5' UTRs encode oncogenes and genes implicated in cell growth, death and proliferation.9

Only a small fraction of the regulatory elements located on human UTRs are currently known. In most cases, the described elements were derived from studies of individual genes and their specific regulators. Unfortunately, current engines that predict putative UTR regulatory elements do not produce the expected results in high throughput searches; there are not sufficient labeled instances to allow the employment of machine learning techniques (for example, neural network or Markov models) to construct predictive models. Current UTR search/prediction tools are very rudimentary and cannot be compared to the sophisticated ones that predict transcription factor binding sites (e.g., TRANSFAC).10 Therefore, novel alternative computational approaches that do not rely exclusively on previously described elements are needed. We describe here a method based on over-representation to identify putative regulatory sequences in 3' UTRs.

Results

Identification of over-represented sequences on UTRs of human genes

We used over-representation as a strategy to map putative regulatory sequences on 3' UTRs. Over-represented sequences are frequently used as a point of reference to identify putative regulatory elements.11 The approach assumes that an observed sequence bias can indicate the presence of a regulatory element. A given sequence motif (n-mer) is considered over-represented if it appears more frequently than its statistically expected frequency. In contrast to previous studies, we employed a more elaborate method specifically designed for a UTR study. First, because transcribed and non-transcribed sequences are under different selective pressure, and because repetitive sequences present in intergenic regions can create a bias and alter the final results, we used mRNA sequences instead of genomic sequences as the sample for the counting process. Second, since regulatory elements can vary in size, we opted not to restrict the counting to a single n-mer size.

Our strategy to identify highly significant over-represented n-mers located on 3' UTRs takes into consideration statistical expected frequencies, size of the n-mer and average length of the different portions of the mRNA (5' UTR, coding region and 3' UTR) in our sample pool. The established selection parameters are the number of genes in which an individual n-mer appears and the minimum/maximum appearance values of an individual n-mer for 5' UTR, coding regions and 3' UTR. These criteria were selected based on our statistical analysis in order to keep the significance of over-represented n-mers high and maintain the number of selected n-mers manageable. In this scenario, n-mers are selected only if they are over-represented in 3' UTR and at the same time have low counts in the other two sections of the mRNA. By using this approach, we identified 2,772 3' UTR over-represented motifs. A graph showing the distribution of 3' UTR over-represented elements and the cut off point used in our analysis is represented in Figure 1.

Figure 1.

Distribution of 3' UTR n-mers according to the number of genes in which they appear. 2,773 3' UTR over-represented n-mers representing 13% of the n-mers with adjusted p-value less than or equal to 0.01 were selected for further analysis. The red line indicates the cut-off point.

Table 1 shows examples of identified over-represented n-mers located on 3' UTRs. In order to facilitate future analyses, we ranked the n-mers according to the adjusted p-value, which indicates the statistical significance of the “fold increase” in relation to the adjusted expected frequency. p-values are reported in log10 units, and any p-value below 10^−200 was set to 10^−200. The entire list of over-represented n-mers is available in Supplementary data file 1.

Table 1.

Examples of over-represented 3' UTR n-mers identified using very stringent criteria

3' UTR n-mer | Adjusted p-value (log10) | Gene count
CTGGCCAACATGGTGAAACCC | −200 | 169
AGCCTGGCCAACATGGTGAAA | −200 | 144
AACTCCTGACCTCAGGTGATC | −200 | 125
AACCCCGTCTCTACTAAAAAT | −200 | 122
GATCACCTGAGGTCAGGAGTT | −200 | 116
CTGGCCAACATGGTGAAACCCC | −200 | 115
GTGGCTCACACCTGTAATCCC | −200 | 113
TCCCAGCTACTCAGGAGGCTG | −200 | 102
TGGCTCACACCTGTAATCCCAG | −200 | 101
ACTGCACTCCAGCCTGGGTGA | −200 | 100
ACCTGTAATCCCAGCACTTTG | −200 | 98
CACTGCACTCCAGCCTGGGTG | −200 | 93
TTTTTTTTTTTTTTGAGACAG | −200 | 88

n-mer sequences overlap with previously identified regulatory motifs

If over-represented n-mers are indicative of the presence of regulatory sequences, one would expect to see an overlap between them and already mapped RNA binding protein recognition sites. In order to test this hypothesis, we compared our n-mer list to binding sites of the RNA binding proteins HuR and TIA-1 obtained via RIP-Chip analysis. These binding sites were deduced with computational methods based on commonalities at the level of RNA sequence and structure and on information from previously characterized HuR and TIA-1 sites.12,13 Detailed information was kindly provided by Dr. Isabel Lopez de Silanes. To determine whether our results are statistically significant, we generated a total of 1,000 random sequence sets from actual human UTR sequences; the lengths of the individual sequences in the over-represented n-mer list were taken into account when preparing those sets. We then compared the lists of TIA-1 and HuR binding sites to the random sequence sets to determine the number of overlaps. The results are summarized in Table 2. In agreement with the idea that there is a correlation between over-representation and biological function, the number of over-represented sequences (n-mers) in 3' UTRs matching either HuR or TIA-1 binding sites is significantly higher than the numbers obtained from the comparison with the random sets (p-values < 0.001). The n-mers that match HuR and TIA-1 binding sites are listed in Tables S1 and S2, respectively.

Table 2.

n-mer comparison to HuR and TIA-1 binding sites

In 3' UTR | HuR: over-represented n-mers | HuR: mean (SD) of random n-mer samples | TIA-1: over-represented n-mers | TIA-1: mean (SD) of random n-mer samples
Number of RNAs with at least one mapped HuR/TIA-1 binding site matching an n-mer | 839 | 101.6 (27.7) | 314 | 41.8 (24.5)
Total number of mapped HuR/TIA-1 binding sites matching an n-mer | 1078 | 108.1 (47.5) | 377 | 57.5 (83.2)

Each comparison is represented in two columns. In the first column, the numbers reflect perfect overlaps between described HuR or TIA-1 binding sites and over-represented n-mers. In the second column, the numbers reflect average values obtained from 1,000 comparisons between described HuR or TIA-1 binding sites and random n-mer sets generated from sequences present in our mRNA set. (SD, standard deviation).

We employed another approach to determine whether over-represented n-mers coincide with previously described UTR regulatory elements. We compared our dataset to ARE sequences (described in the previous section) and to the recently identified GU-rich elements (GRE).14 The AUUUA and the UAUUUAU motifs have been described as the basic core of ARE sequences. We expected a large portion of the 3' UTR n-mers to contain the core sequence, and we expected a bias towards the 3' UTR, since ARE sequences have not been assigned to 5' UTRs. Indeed, the number of over-represented sequences (n-mers) that contain core ARE sequences is significantly higher than that obtained from random sets. Moreover, a 3' UTR bias was observed (p values < 0.001; Table 3). The GU-rich element (GRE), whose consensus is UGUUUGUUUGU, was identified via computational methods designed to find conserved sequences in the 3' UTRs of genes that exhibited rapid decay in primary human T-cells. These sequences were determined to be involved in mRNA stability and to be regulated by the CUG-binding protein 1.14 As observed for the ARE sequences, the number of over-represented n-mers containing a GRE is significantly higher than that obtained from random sets; a 3' UTR bias (p values < 0.001) was observed as well (Table 4). In conclusion, the results of this section indicate that n-mers do overlap with regulatory elements. The lists of n-mers containing ARE and GRE sequences are in Supplemental tables S3–S5. Detailed results of both analyses in this section are provided as supporting materials.

Table 3.

n-mer comparison to ARE sequences

In 3' UTR | Over-represented n-mers | Mean (SD) of random samples
Number of appearances of the AUUUA motif in an n-mer set | 122 | 70.5 (8.3)
Number of appearances of the UAUUUAU motif in an n-mer set | 35 | 9.0 (2.9)
Number of n-mers containing one or more AUUUA motifs | 116 | 67.1 (7.8)
Number of n-mers containing one or more UAUUUAU motifs | 35 | 8.7 (3.0)

The table contains the number of over-represented n-mers containing AUUUA and UAUUUAU sequences as well as average values obtained for 1,000 comparisons with random n-mer sets generated from sequences present in our mRNA set. (SD, standard deviation).

Table 4.

n-mer comparison to GRE sequences

In 3' UTR | Over-represented n-mers | Mean (SD) of random samples
Number of appearances of the UGUUUGUUUGU motif in an n-mer set | 12 | 0.29 (0.79)
Number of n-mers containing one or more UGUUUGUUUGU motifs | 5 | 0.15 (0.39)

The table contains the number of over-represented n-mers containing GRE sequences as well as average values obtained for 1,000 comparisons with random n-mer sets generated from sequences present in our mRNA set. (SD, standard deviation).

n-mer sequences overlap with predicted miRNA target sites

We tested over-represented 3' UTR n-mers against predicted miRNA target sites to see whether the results also fit our assumption that over-represented n-mers function as indicators of the presence of regulatory sequences. In order to test whether predicted miRNA target sites overlap with over-represented 3' UTR n-mers and vice versa, we obtained three different measures: the total number of overlaps, the number of unique n-mers in overlaps and the number of unique miRNA target sites in overlaps. The unique n-mer count and the unique miRNA target site count are non-redundant overlap counts; they indicate whether the total number of overlaps results from repeated counting of a small subset of n-mers or of predicted miRNA targets. Random samples were used as a comparison to test the statistical significance of the overlap counts obtained for our selected n-mers.

As shown in Table 5, the overlap counts for over-represented 3' UTR n-mers are higher than the counts obtained for random samples (p value < 0.001). The log-transformed counts from the random samples are approximately normally distributed, so we used a normal distribution to assess how extreme the n-mer counts are in comparison to random samples. We concluded that the observed counts are significantly higher than the random set counts, with p-values of at most 1.47E-06 (Table 5).

Table 5.

Comparison between over-represented n-mers and predicted microRNA target sites

Measure | Total number of overlaps | Number of unique n-mers in overlaps | Number of unique miRNA targets in overlaps
Random sets, mean | 108.83 | 39.75 | 107.76
Random sets, stdev | 29.81 | 6.19 | 29.39
Over-represented 3' UTR n-mers | 382 | 140 | 371
p-value | 1.04E-06 | 1.63E-15 | 1.47E-06

We tested whether over-represented n-mers overlap with predicted miRNA sites and vice versa. In order to determine whether the numbers obtained are statistically significant, we performed the same analysis with 1,000 sets of random sequences derived from 3' UTRs. In the table, the total number of overlaps reflects the number of cases in which a given n-mer or random sequence is entirely contained in the sequence of a predicted miRNA target or the other way around. The number of unique n-mers in overlaps reflects how many n-mers (out of 2,772) are counted at least once. Similarly, the number of unique miRNA targets in overlaps represents the number of targets counted at least once.

Cluster analysis and identification of putative gene networks regulated at the post-transcriptional level

All biological processes depend on the coordinated activity of a selected group of proteins. Before a given biological process like cell division takes place, it is necessary to synchronize the expression of genes that code for the set of implicated proteins. This synchronization can be achieved at the post-transcriptional level through the action of specific RNA binding proteins and non-coding RNAs that recognize UTR sequences shared by the gene group. Regulators and their corresponding target mRNAs form the so called post-transcriptional operons.15,16

In order to identify genes that could potentially be co-regulated, forming a functional post-transcriptional operon, we employed a method to identify gene clusters that share sets of n-mers. Briefly, we considered two n-mers to be 'similar' if they frequently appear in the same genes. Unlike in a clustering method based on sequence similarity, a good cluster is defined here as a group of n-mers sharing nearly identical gene lists. Figure 2 illustrates the clustering analysis procedure. A more detailed explanation of the clustering is given in the Methods section. We generated 100 3' UTR clusters. We then performed multiple sequence alignments for each set of similar n-mers present in a given cluster to identify a core element. If our cluster analysis functions as a method to identify gene networks, we should be able to identify strong biological associations among genes in the same cluster in at least some of the cases. To identify these possible associations, we analyzed the gene clusters using 'Pathway Studio 5'. This analysis indicated that several sets of gene clusters share commonalities in terms of pathway and function. Moreover, interacting proteins turned out to be present in numerous clusters. Figure 3 shows examples of strong biological associations as well as core sequence elements identified in two different gene clusters. The sequence elements and corresponding genes in the clusters are listed in Tables S6 and S7. The remaining clusters that showed positive results, together with the n-mer comparisons, are described on the supporting material website. To check that the relations identified in Figure 3 are not a chance occurrence, we performed 100 cluster analyses with sets of random genes. We concluded that the same type of direct correlations as exemplified in Figure 3 cannot be obtained by chance alone (p < 0.01) (see the supporting materials website for details).

Figure 2.

Schematic representation of the cluster analysis used to identify putative post-transcriptional operons. First, a dissimilarity score matrix was constructed by comparing the gene list associated with each over-represented n-mer to all the others. Next, the Partitioning Around Medoids (PAM) clustering algorithm starts with a chosen number of randomly selected n-mers that serve as medoids (centers of clusters). The remaining non-medoid n-mers are assigned to the nearest medoids according to their dissimilarity scores. After the initial partitioning, the algorithm swaps the current medoids with non-medoid n-mers and updates cluster memberships for non-medoid n-mers to check whether the new medoids lead to a better partition in terms of the average dissimilarities within clusters. These steps are repeated until the average dissimilarities of the clusters cannot be reduced further.

Figure 3.

Examples of gene clusters that show strong biological associations. The most relevant gene clusters identified in our study were analyzed with the Pathway Studio software to identify possible biological interactions among the genes present in them. Only direct associations/interactions are illustrated in the figure. The results from multiple sequence alignments represent possible 'core n-mers' that were built with n-mers present in the cluster. Multiple sequence alignments were performed using Clustal X.

Discussion

We designed a specific method based on over-representation to map putative regulatory sequences present on 3' UTRs. A very strict filter consisting of minimum and maximum values of appearances for each region of the mRNA (5' UTR, coding region and 3' UTR) was used to select a group of 2,772 highly relevant over-represented sequences (n-mers) located in 3' UTRs. The evidence strongly indicates a correlation between over-representation and function. The identified n-mers overlap with previously identified binding sites for HuR and TIA-1 and AU-rich and GU-rich sequences. Finally, a method to cluster n-mer groups allowed the identification of putative post-transcriptional gene networks.

Recently, a computational method was used to identify short sequence motifs (named pyknons) that are over-represented in the genome. Analysis of the distribution of pyknons revealed a bias towards UTRs.17 Pyknons can constitute a valuable resource in terms of providing new lists of putative regulatory elements. Another recent study used the power of evolutionary biology to map novel putative regulatory sequences via sequence alignment on promoters and 3' UTRs. This study successfully predicted new miRNA target sequences18 and constitutes another useful source for the identification of UTR regulatory elements. Finally, an over-representation method was recently used to predict target sites of miRNAs on the 3' UTRs of human genes.19 Our work goes a step beyond these analyses by using a method that specifically calculates over-represented sequences on human 3' UTRs. The method we employed differs from previous analyses of UTR motifs in several ways. First, the choice of sequence data for UTR analysis differs from other published efforts. Most previous UTR analyses used entire chromosome sequences to construct their models, while we used only transcript sequences in order to build a more accurate background model. It is also notable that we explicitly handled the number of n-mer appearances in non-target regions, taking length effects into consideration. Moreover, our clustering approach was based on functional relations rather than sequence similarities, which makes more biological sense. When all the evidence is combined, we believe this dataset contains information that will guide the discovery of novel functional elements. All data are available online to the scientific community.

Materials and Methods

Preparation of mRNA sequence lists

In order to prepare a reliable list of human mRNA sequences, we started by conducting a feasibility study on 40,874 sequences obtained from the NCBI Human Genome FTP site (ftp://ftp.ncbi.nih.gov/genomes/). To ensure the quality of the data used, all mRNAs were constructed from chromosome sequences (Build 36.2) based on gene information from RefSeq. Only 'validated' or 'reviewed' gene information was used for mRNA construction. Subsequently, coding regions were extracted from the constructed mRNAs and BLASTed against the entire nucleotide database to confirm that the gene information from RefSeq was correct. If BLAST returned the identical gene ID with a perfect sequence match for a queried coding region, we retained the queried gene in the valid set of mRNAs. After filtering the data, we were left with 20,840 human mRNA sequences corresponding to 16,160 genes. This subset of sequences was used in our analysis.

n-mer counting

An exhaustive search approach measured the appearance of all possible n-mers (2 ≤ n ≤ 21) in the mRNA data set. Appearances were counted on the 5' UTR, coding region and 3' UTR individually. In order to handle the large number of possible n-mers within reasonable time and space, we used a suffix tree counter, a type of data structure that allows efficient string matching and searching. Although there are many different flavors of suffix tree implementations, we used a straightforward implementation without data compression functions, since our sole purpose was simple counting rather than string searching or matching. The counting procedure collected n-mer information such as the list of associated mRNAs and the locations of n-mers on the mRNA sequences.
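The counting step can be illustrated with a minimal Python sketch. It is not the suffix tree counter described above: it swaps in a plain dictionary of sliding windows, which gives the same counts on small inputs but does not scale in the same way. The function name, the region keys and the toy transcripts are hypothetical, and n-mer positions are not recorded here.

```python
from collections import defaultdict

def count_nmers(transcripts, n_min=2, n_max=21):
    """Count n-mer occurrences per mRNA region with a sliding window.

    `transcripts` maps a transcript ID to {region name: sequence}.
    Returns {region: {n-mer: {"count": int, "transcripts": set of IDs}}}.
    """
    table = defaultdict(
        lambda: defaultdict(lambda: {"count": 0, "transcripts": set()})
    )
    for tx_id, regions in transcripts.items():
        for region, seq in regions.items():
            for n in range(n_min, n_max + 1):
                # Every start position yields one (possibly overlapping) n-mer.
                for start in range(len(seq) - n + 1):
                    entry = table[region][seq[start:start + n]]
                    entry["count"] += 1
                    entry["transcripts"].add(tx_id)
    return table

# Hypothetical toy transcripts, split into the three regions used in the text.
toy = {
    "tx1": {"5utr": "GCGC", "cds": "ATGAAATAA", "3utr": "ATTTATTTA"},
    "tx2": {"5utr": "GGCC", "cds": "ATGCCCTAA", "3utr": "CCATTTAGG"},
}
counts = count_nmers(toy, n_min=5, n_max=5)
print(counts["3utr"]["ATTTA"]["count"])                # occurrences in 3' UTRs
print(sorted(counts["3utr"]["ATTTA"]["transcripts"]))  # transcripts containing it
```

Keeping the set of transcript IDs alongside the raw count mirrors the distinction the analysis later draws between the number of appearances and the number of RNAs (or genes) in which an n-mer appears.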

To determine if a given n-mer is over-represented in 3' UTRs, we first calculated the lengths of different regions in mRNAs (5' UTRs, Coding and 3' UTRs) from the 20,840 transcripts present in our mRNA set. The total lengths of each region were 4,626,913 nucleotides for 5' UTR, 37,499,577 nucleotides for coding region and 22,871,121 nucleotides for 3' UTR. We then constructed a conversion table in order to perform a balanced analysis that allows the comparison among n-mers of different sizes. Table S8 in Supplementary data shows the adjusted mathematical expected appearance value for each n-mer based on its length. These values were then used to determine if a particular n-mer is over-represented in 3' UTRs.

Statistical analysis to identify over-represented sequences

After determining that all four nucleotides are equally distributed in our mRNA dataset, we estimated the probability P of finding a specific L-mer pattern to be $P = 4^{-L}$. Hence, in a database of UTRs or coding regions of length at least L, with total length D, the expected number of occurrences of a given L-mer is $\lambda_L = (D - n(L - 1)) \cdot 4^{-L}$, where n is the number of RNAs whose UTR or coding regions are of length at least L. If the motifs were randomly distributed over the different sections of the mRNA, the number of RNAs with a specific motif would follow a Poisson distribution with mean rate $\lambda_L$. We used the Poisson distribution to calculate the probability of observing k RNAs with the specific motif pattern, which is $(\lambda_L)^k e^{-\lambda_L}/k!$. Thus, given the expected number $\lambda_L$ for an L-mer, the probability of observing k or more RNAs with the L-mer is $1 - P(k - 1, \lambda_L)$ and the probability of observing k or fewer is $P(k, \lambda_L)$, where

$$P(k, \lambda_L) = \sum_{x=0}^{k} \frac{1}{x!} (\lambda_L)^x e^{-\lambda_L}$$

For example, the probability of observing k or more instances in the 3' UTR and, at the same time, 0 instances in both the 5' UTR and the coding region is $P = P(0, \lambda_{L5}) \cdot P(0, \lambda_{LC}) \cdot (1 - P(k - 1, \lambda_{L3}))$, where $\lambda_{L5}$, $\lambda_{L3}$ and $\lambda_{LC}$ are the expected numbers in the 5' UTR, 3' UTR and coding region, respectively. When several RNAs come from a single gene, dependence among the RNAs is expected. To obtain conservative p-values, we used the number of gene-instances instead of the number of RNA-instances for over-counting (k or more instances). For under-counting (k or fewer RNA instances), we used the number of RNA-instances to make the p-values more conservative. Since we were testing the significance of each specific pattern among all possible patterns, we used the Bonferroni correction to adjust p-values for multiple testing. The total number (TN) of patterns from 2-mers through 21-mers is

$$TN = \sum_{n=2}^{21} 4^{n} = 5.864062 \times 10^{12}$$

and the adjusted p-value is $\min(TN \cdot P, 1)$.
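The calculation can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the work is done in log space so that very small tails do not round to zero, and the example assumes that all 20,840 transcripts have each region at least 21 nucleotides long, which the text does not state. The region lengths are the totals reported above, and the floor of −200 follows the convention used for Table 1.

```python
import math

# Total number of patterns from 2-mers through 21-mers (Bonferroni factor TN).
TOTAL_PATTERNS = sum(4 ** n for n in range(2, 22))

def expected_count(total_length, n_rnas, L):
    """Expected count lambda_L = (D - n * (L - 1)) * 4^-L."""
    return (total_length - n_rnas * (L - 1)) * 4.0 ** (-L)

def poisson_log_pmf(x, lam):
    """Natural log of lam^x * e^-lam / x! (requires lam > 0)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def log_upper_tail(k, lam, max_terms=10000):
    """Natural log of 1 - P(k - 1, lam), summed term by term in log space.

    Intended for small lam; the loop stops once terms become negligible.
    """
    if k <= 0:
        return 0.0
    log_total = poisson_log_pmf(k, lam)
    for x in range(k + 1, k + max_terms):
        lp = poisson_log_pmf(x, lam)
        if x > lam and lp < log_total - 40:
            break
        hi, lo = max(log_total, lp), min(log_total, lp)
        log_total = hi + math.log1p(math.exp(lo - hi))
    return min(0.0, log_total)

def adjusted_log10_p(k3, lam5, lamc, lam3, floor=-200.0):
    """log10 of min(TN * P, 1) for k3 or more 3' UTR instances and none elsewhere."""
    # P(0, lam) = e^-lam, so the two zero-count factors contribute -lam5 - lamc.
    log_p = -lam5 - lamc + log_upper_tail(k3, lam3)
    log10_adj = (math.log(TOTAL_PATTERNS) + log_p) / math.log(10)
    return max(floor, min(0.0, log10_adj))

# Example for a 21-mer; n_rnas = 20840 for every region is an assumption.
lam5 = expected_count(4_626_913, 20_840, 21)
lamc = expected_count(37_499_577, 20_840, 21)
lam3 = expected_count(22_871_121, 20_840, 21)
print(adjusted_log10_p(169, lam5, lamc, lam3))
```

Under these assumptions a 21-mer found in 169 genes falls far below the reporting floor, which agrees with the −200 entries in Table 1.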

Parameters to identify highly significant over-represented n-mers

The number of appearances of each individual n-mer in 3' UTRs was counted and its statistical significance was calculated. A total of 21,537 n-mers with an adjusted p-value less than or equal to 0.01 were identified. We established further parameters to identify highly over-represented n-mers. We selected only n-mers that appeared in the 3' UTRs of more than 20 genes and, at the same time, never appeared in 5' UTRs (fewer than 1 appearance) and appeared fewer than 4 times in coding regions. A total of 2,773 3' UTR over-represented n-mers were identified, representing 13% of the n-mers with an adjusted p-value less than or equal to 0.01 (Fig. 1).
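The selection rule amounts to a simple filter, sketched below in Python. The record layout, field names and toy entries are hypothetical; only the four thresholds come from the text.

```python
def select_overrepresented(stats, max_adj_p=0.01, min_genes_3utr=21,
                           max_count_5utr=0, max_count_cds=3):
    """Return the n-mers that pass all four thresholds described in the text.

    `stats` maps an n-mer to a dict with the hypothetical keys
    'adj_p', 'genes_3utr', 'count_5utr' and 'count_cds'.
    """
    return [
        nmer for nmer, s in stats.items()
        if s["adj_p"] <= max_adj_p                 # adjusted p-value <= 0.01
        and s["genes_3utr"] >= min_genes_3utr      # more than 20 genes
        and s["count_5utr"] <= max_count_5utr      # fewer than 1 in 5' UTRs
        and s["count_cds"] <= max_count_cds        # fewer than 4 in coding regions
    ]

# Hypothetical records illustrating each way an n-mer can fail the filter.
toy_stats = {
    "passes":       {"adj_p": 1e-50, "genes_3utr": 88, "count_5utr": 0, "count_cds": 2},
    "too_few":      {"adj_p": 1e-50, "genes_3utr": 20, "count_5utr": 0, "count_cds": 0},
    "in_5utr":      {"adj_p": 1e-50, "genes_3utr": 90, "count_5utr": 1, "count_cds": 0},
    "in_cds":       {"adj_p": 1e-50, "genes_3utr": 90, "count_5utr": 0, "count_cds": 4},
    "not_signif":   {"adj_p": 0.05,  "genes_3utr": 90, "count_5utr": 0, "count_cds": 0},
}
selected = select_overrepresented(toy_stats)
print(selected)
```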

Preparation of random samples

One thousand sets of random n-mer sequences were generated from the sequences present in our list of mRNAs. To construct the random sets, we took into consideration the size and the number of over-represented n-mers present in our final data set.
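One way to build such sets is sketched below in Python. The text does not describe the sampling procedure in detail, so the choices here are assumptions: each random n-mer is cut from a uniformly chosen source sequence at a uniformly chosen position, and each random set reproduces the length of every selected n-mer. Function names and toy inputs are hypothetical.

```python
import random

def random_nmer_set(sequences, lengths, rng):
    """Draw one random n-mer per requested length from the source sequences."""
    eligible = {}
    sample = []
    for n in lengths:
        if n not in eligible:
            # Only sequences long enough to contain an n-mer of this size.
            eligible[n] = [s for s in sequences if len(s) >= n]
        seq = rng.choice(eligible[n])  # raises IndexError if none is long enough
        start = rng.randrange(len(seq) - n + 1)
        sample.append(seq[start:start + n])
    return sample

def random_nmer_sets(sequences, selected_nmers, n_sets=1000, seed=0):
    """Build `n_sets` random sets matching the size and lengths of the selection."""
    rng = random.Random(seed)
    lengths = [len(m) for m in selected_nmers]
    return [random_nmer_set(sequences, lengths, rng) for _ in range(n_sets)]

# Hypothetical toy inputs.
toy_sequences = ["ACGTACGTTTGACCA", "GGGATTTATTTAGGGCCC", "TTTT"]
toy_selected = ["ATTTA", "GGGCCCA", "ACGT"]
sets = random_nmer_sets(toy_sequences, toy_selected, n_sets=10, seed=1)
print(sets[0])
```

Fixing the seed keeps the random sets reproducible, which is useful when the same sets are reused across several comparisons.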

Comparison between n-mers and HuR and TIA-1 binding sites

The data provided by Dr. Isabel Lopez de Silanes contain binding sites for the RNA binding proteins HuR and TIA-1 obtained via RIP-Chip and computational methods.12,13,20 We located all these binding sites on the mRNA sequences present in our list. These locations were then compared to the positions of the over-represented and random n-mers. Since the binding sites obtained for HuR and TIA-1 are longer than 21 nucleotides, we considered two sequences to be a 'match' only when an over-represented n-mer or a random n-mer appears within a HuR or TIA-1 binding site. The numbers of matches obtained for the over-represented n-mer list were compared to the numbers obtained for 1,000 sets of random n-mers.
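The match rule can be sketched in Python as a containment test. The data layout is an assumption: binding sites are given as half-open (start, end) intervals on each mRNA, and an n-mer matches a site when it occurs entirely within that interval. The function name and toy data are hypothetical. The two returned numbers correspond to the two rows of Table 2.

```python
def count_site_matches(mrnas, binding_sites, nmers):
    """Count binding sites that fully contain at least one n-mer.

    mrnas:         {mRNA ID: sequence}
    binding_sites: {mRNA ID: [(start, end), ...]} as half-open intervals
    Returns (number of mRNAs with >= 1 matched site, total matched sites).
    """
    matched_rnas = 0
    matched_sites = 0
    for mrna_id, sites in binding_sites.items():
        seq = mrnas[mrna_id]
        # An n-mer lies within a site exactly when it is a substring of the
        # site's own sequence.
        n_hits = sum(
            1 for start, end in sites
            if any(nmer in seq[start:end] for nmer in nmers)
        )
        matched_sites += n_hits
        if n_hits > 0:
            matched_rnas += 1
    return matched_rnas, matched_sites

# Hypothetical toy data: two sites on m1, one on m2.
toy_mrnas = {"m1": "GGGATTTATTTAGGGCCC", "m2": "CCCCCCCCCC"}
toy_sites = {"m1": [(3, 12), (12, 18)], "m2": [(0, 10)]}
result = count_site_matches(toy_mrnas, toy_sites, ["ATTTA", "GGGCC"])
print(result)
```

Running the same function on each random n-mer set gives the distribution against which the observed counts are compared.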

Search for n-mers containing the AUUUA (UAUUUAU) motif

We searched for AUUUA and UAUUUAU motifs (ARE core sequences) in the 3' UTR over-represented n-mer sets and in random n-mer sets. Initially, the total number of AUUUA (or UAUUUAU) appearances in a data set was simply counted. However, it is possible that a small number of AU-rich n-mers in a given data set contribute to two or more motif counts. We counted then in each data set the total number of n-mers containing one or more AUUUA (or UAUUUAU) motif. All the counting results were compared between the over-represented n-mer set and random n-mer sets.
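The two counts described here, total motif appearances and number of n-mers containing the motif, can be sketched in Python as follows. Two choices are assumptions rather than details from the text: overlapping occurrences are counted separately, and n-mers written in the DNA alphabet are converted to RNA (T to U) before matching. The function names and toy n-mers are hypothetical; the same function applies unchanged to the GRE consensus.

```python
def count_overlapping(seq, motif):
    """Count occurrences of `motif` in `seq`, including overlapping ones."""
    count = 0
    pos = seq.find(motif)
    while pos != -1:
        count += 1
        pos = seq.find(motif, pos + 1)
    return count

def motif_summary(nmers, motif):
    """Return (total motif appearances, number of n-mers with >= 1 motif)."""
    rna = [m.upper().replace("T", "U") for m in nmers]
    per_nmer = [count_overlapping(s, motif) for s in rna]
    return sum(per_nmer), sum(1 for c in per_nmer if c > 0)

# Hypothetical toy n-mer set.
toy_nmers = ["ATTTATTTA", "TATTTAT", "GGGCCC"]
are5 = motif_summary(toy_nmers, "AUUUA")
are7 = motif_summary(toy_nmers, "UAUUUAU")
print(are5, are7)
```

The first n-mer contributes two AUUUA appearances but counts once as a motif-containing n-mer, which is the kind of double counting the second measure is meant to expose.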

Search for n-mers containing GU-Rich elements GRE

We searched for the UGUUUGUUUGU motif (GRE consensus sequence) in the 3' UTR over-represented n-mer sets and in random n-mer sets. The total number of GRE appearances in a data set, as well as the number of n-mers containing one or more GRE motifs, was counted. All the counting results were compared between the over-represented n-mer set and random n-mer sets.

Comparison between n-mers and predicted miRNA target sites

The miRNA target site predictions for human genes were obtained from 'microRNA.org' (www.microrna.org/microrna/getDownloads.do). A subset of 20,615 predicted mRNA sequence segments (8–35 nt) with an alignment score greater than 170 was selected from a total of 1,791,960 predicted sites and used in further analysis.

We defined a perfect overlap between two given sequences as follows: if one sequence is a subsequence of the other, it is counted as an overlap. For instance, if we compare an n-mer that is 12 nucleotides long with a predicted miRNA site that is 21 nucleotides long, the sequence of the n-mer has to be entirely contained in the predicted miRNA target site; if the n-mer sequence is longer than the predicted miRNA site, the opposite has to occur. Using this definition, we counted the total numbers of overlaps for microRNA targets versus over-represented 3' UTR n-mers and for microRNA targets versus random samples (1,000 random 3' UTR n-mer sets). We also determined how many distinct targets and 3' UTR n-mers contribute to the total overlap counts. These unique numbers of targets and n-mers confirm that the totals are not derived from a small subset counted several times. In order to compare the overlap counts of 3' UTR over-represented n-mers to the ones obtained for random samples, we used a normal approximation of the log(counts) distributions.
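Both steps can be sketched in Python. The overlap rule follows the definition above; counting each (n-mer, target) pair once is an assumption about how the total was tallied. The significance step fits a normal distribution to the log-transformed random counts and reports the upper-tail probability of the observed log count, which requires every random count to be positive. Function names and toy values are hypothetical.

```python
import math
from statistics import mean, stdev

def overlap_counts(nmers, targets):
    """Return (total overlaps, unique n-mers in overlaps, unique targets in overlaps).

    Two sequences overlap when either one is entirely contained in the other.
    """
    pairs = [(m, t) for m in nmers for t in targets if m in t or t in m]
    return len(pairs), len({m for m, _ in pairs}), len({t for _, t in pairs})

def log_normal_p(observed, random_counts):
    """Upper-tail p-value of log(observed) under a normal fit to log(random counts)."""
    logs = [math.log(c) for c in random_counts]
    z = (math.log(observed) - mean(logs)) / stdev(logs)
    # erfc keeps precision for very small tail probabilities.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical toy data.
toy_counts = overlap_counts(["ACGT", "GGGG"], ["TTACGTTT", "CG", "AAAA"])
p_typical = log_normal_p(100, [50, 100, 200])
p_extreme = log_normal_p(1000, [50, 100, 200])
print(toy_counts, p_typical, p_extreme)
```

An observed count equal to the geometric mean of the random counts gives a p-value of 0.5, while a count ten times larger falls far into the upper tail.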

Cluster analysis on over-represented n-mers

We performed cluster analysis based on functional similarity. The functional similarity between two n-mers, say n-mer 1 and n-mer 2, is defined as the number of genes that have both n-mers divided by the number of genes that have either one or both. The dissimilarity (distance) between two n-mers is defined as 1 minus the similarity. Using this dissimilarity measure and Kaufman and Rousseeuw's Partitioning Around Medoids (PAM) algorithm,21 we organized 3' UTR n-mers into clusters (Table S9). Average silhouette lengths were used to order the clusters. The top 100 3' UTR clusters are presented on our supporting material website (http://gccri.uthscsa.edu/data.asp) (Table S9).
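The dissimilarity measure and the clustering step can be sketched in Python. The dissimilarity is the one defined above (one minus the ratio of shared genes to genes in either list). The clustering function is a simplified greedy-swap variant of PAM with random initial medoids, written for illustration; it is not Kaufman and Rousseeuw's reference implementation, and the silhouette-based ordering is omitted. All names and the toy gene lists are hypothetical.

```python
import random

def dissimilarity(genes_a, genes_b):
    """1 - |A and B| / |A or B| for two sets of gene IDs."""
    union = genes_a | genes_b
    if not union:
        return 0.0
    return 1.0 - len(genes_a & genes_b) / len(union)

def pam(dist, k, seed=0, max_iter=100):
    """Simplified PAM: random medoids, then swaps that lower the total cost."""
    n = len(dist)
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)

    def cost(meds):
        # Sum of each item's distance to its nearest medoid.
        return sum(min(dist[i][m] for m in meds) for i in range(n))

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = cost(trial)
                if trial_cost < best:
                    medoids, best, improved = trial, trial_cost, True
        if not improved:
            break
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, labels

# Hypothetical n-mers with the genes in which each one appears.
gene_lists = [
    {"g1", "g2", "g3"}, {"g1", "g2", "g3"}, {"g1", "g2"},
    {"g7", "g8"}, {"g7", "g8", "g9"},
]
dist = [[dissimilarity(a, b) for b in gene_lists] for a in gene_lists]
medoids, labels = pam(dist, k=2)
print(medoids, labels)
```

Each label is the index of the medoid to which an n-mer is assigned, so n-mers with nearly identical gene lists end up sharing a label regardless of their sequence similarity.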

After grouping the over-represented n-mers into clusters, the gene members of each cluster were analyzed with Pathway Studio 5 (www.ariadnegenomics.com/) to identify known functional relationships among them.

Supplementary Material

Over-represented nmers

Acknowledgements

This work was supported by the Computational Biology Initiative (UTSA/UTHSCSA) and a UTSA College of Business Summer Grant.

Abbreviations

UTR: untranslated region

IRP: iron regulatory protein

IRE: iron responsive element

APP: amyloid-β precursor protein

ARE: AU rich elements

GRE: GU rich elements

RIP-Chip: RNA immuno-precipitation-chip

References

  • 1. Kuersten S, Goodwin EB. The power of the 3' UTR: translational control and development. Nat Rev Genet. 2003;4:626–37. doi: 10.1038/nrg1125.
  • 2. Gygi SP, Rochon Y, Franza BR, Aebersold R. Correlation between protein and mRNA abundance in yeast. Mol Cell Biol. 1999;19:1720–30. doi: 10.1128/mcb.19.3.1720.
  • 3. Knoll-Gellida A, Andre M, Gattegno T, Forgue J, Admon A, Babin PJ. Molecular phenotype of zebrafish ovarian follicle by serial analysis of gene expression and proteomic profiling, and comparison with the transcriptomes of other animals. BMC Genomics. 2006;7:46. doi: 10.1186/1471-2164-7-46.
  • 4. Unwin RD, Whetton AD. Systematic proteome and transcriptome analysis of stem cell populations. Cell Cycle. 2006;5:1587–91. doi: 10.4161/cc.5.15.3101.
  • 5. Habermann JK, Paulsen U, Roblick UJ, Upender MB, McShane LM, Korn EL, Wangsa D, Kruger S, Duchrow M, Bruch H-P, Auer G, Ried T. Stage-specific alterations of the genome, transcriptome, and proteome during colorectal carcinogenesis. Genes Chromosomes Cancer. 2007;46:10–26. doi: 10.1002/gcc.20382.
  • 6. Lu P, Vogel C, Wang R, Yao X, Marcotte EM. Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat Biotechnol. 2007;25:117–24. doi: 10.1038/nbt1270.
  • 7. Mignone F, Gissi C, Liuni S, Pesole G. Untranslated regions of mRNAs. Genome Biol. 2002;3. doi: 10.1186/gb-2002-3-3-reviews0004.
  • 8. Rouault TA. The role of iron regulatory proteins in mammalian iron homeostasis and disease. Nat Chem Biol. 2006;2:406–14. doi: 10.1038/nchembio807.
  • 9. Pickering BM, Willis AE. The implications of structured 5' untranslated regions on translation and disease. Semin Cell Dev Biol. 2005;16:39–47. doi: 10.1016/j.semcdb.2004.11.006.
  • 10. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, Lewicki-Potapov B, Saxel H, Kel AE, Wingender E. TRANSFAC (R) and its module TRANSCompel (R): transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006;34:108–10. doi: 10.1093/nar/gkj143.
  • 11. Defrance M, Touzet H. Predicting transcription factor binding sites using local over-representation and comparative genomics. BMC Bioinformatics. 2006;7:396. doi: 10.1186/1471-2105-7-396.
  • 12. Lopez de Silanes I, Fan J, Galban CJ, Spencer RG, Becker KG, Gorospe M. Global analysis of HuR-regulated gene expression in colon cancer systems of reducing complexity. Gene Expr. 2004;12:49–59. doi: 10.3727/000000004783992215.
  • 13. Lopez de Silanes I, Galban S, Martindale JL, Yang X, Mazan-Mamczarz K, Indig FE, Falco G, Zhan M, Gorospe M. Identification and functional outcome of mRNAs associated with RNA-binding protein TIA-1. Mol Cell Biol. 2005;25:9520–31. doi: 10.1128/MCB.25.21.9520-9531.2005.
  • 14. Vlasova IA, Tahoe NM, Fan D, Larsson O, Rattenbacher B, John JRS, Vasdewani J, Karypis G, Reilly CS, Bitterman PB, Bohjanen PR. Conserved GU-rich elements mediate mRNA decay by binding to CUG-binding protein 1. Mol Cell. 2008;29:263–70. doi: 10.1016/j.molcel.2007.11.024.
  • 15. Keene JD, Tenenbaum SA. Eukaryotic mRNPs may represent posttranscriptional operons. Mol Cell. 2002;9:1161–7. doi: 10.1016/s1097-2765(02)00559-2.
  • 16. Keene JD. RNA regulons: coordination of post-transcriptional events. Nat Rev Genet. 2007;8:533–43. doi: 10.1038/nrg2111.
  • 17. Rigoutsos I, Huynh T, Miranda K, Tsirigos A, McHardy A, Platt D. Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological processes. Proc Natl Acad Sci USA. 2006;103:6605–10. doi: 10.1073/pnas.0601688103.
  • 18. Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M. Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature. 2005;434:338–45. doi: 10.1038/nature03441.
  • 19. Cora D, Di Cunto F, Caselle M, Provero P. Identification of candidate regulatory sequences in mammalian 3' UTRs by statistical analysis of oligonucleotide distributions. BMC Bioinformatics. 2007;8:174. doi: 10.1186/1471-2105-8-174.
  • 20. Lopez de Silanes I, Lal A, Gorospe M. HuR: post-transcriptional paths to malignancy. RNA Biol. 2005;2:11–3. doi: 10.4161/rna.2.1.1552.
  • 21. Kaufman L, Rousseeuw P. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons; 1990.
