Abstract
Regulatory elements in mRNA play an often pivotal role in post-transcriptional regulation of gene expression. However, a systematic approach to efficiently identify putative regulatory elements from sets of post-transcriptionally coregulated genes is lacking, hampering studies of coregulation mechanisms. Although there are several analytical methods that can be used to detect conserved mRNA regulatory elements in a set of transcripts, there has been no systematic study of how well any of these methods perform individually or as a group. We therefore compared how well three algorithms, each based on a different principle (enumeration, optimization, or structure/sequence profiles), can identify elements in unaligned untranslated sequence regions. Two algorithms were originally designed to detect transcription factor binding sites, Weeder and BioProspector; and one was designed to detect RNA elements conserved in structure, RNAProfile. Three types of elements were examined: (1) elements conserved in both primary sequence and secondary structure; (2) elements conserved only in primary sequence; and (3) microRNA targets. Our results indicate that all methods can uniquely identify certain known RNA elements, and therefore, integrating the output from all algorithms leads to the most complete identification of elements. We therefore developed an approach to integrate results and guide selection of candidate elements from several algorithms presented as a web service (https://dbw.msi.umn.edu:8443/recit). These findings together with the approach for integration can be used to identify candidate elements from genome-wide post-transcriptional profiling data sets.
Keywords: post-transcriptional regulation, RNA elements, algorithm comparison, algorithm integration, motif discovery, polyribosome microarray
INTRODUCTION
Post-transcriptional control is the step in gene expression regulation that occurs after the primary transcript has been generated but before synthesis of the polypeptide. It includes splicing, RNA editing, RNA transport, RNA stability, and ribosome recruitment. Post-transcriptional control is involved in regulation of many physiological and pathological processes, including normal development, homeostatic functions, and carcinogenesis. Information embedded in the messenger ribonucleic acid (mRNA) sequence is used by the cell to regulate groups of transcripts (designated post-transcriptional operons or regulons) (Keene 2007) and has been mechanistically characterized in transcription termination (Brambilla et al. 1997), mRNA localization (Macdonald et al. 1993; Kloc et al. 2000; Chabanon et al. 2004), stability (Brewer 1991; Theil 1993; Akgul and Tu 2007; Vlasova et al. 2008), alternative splicing (Massiello et al. 2004), and ribosome recruitment (Larsson et al. 2006).
Methodological advances have enabled genome-wide studies of different steps of post-transcriptional regulation, including ribosome recruitment (Rajasekhar et al. 2003; Blais et al. 2004; Kitamura et al. 2004; Qin and Sarnow 2004; Larsson et al. 2006, 2007; Lu et al. 2006; Bilanges et al. 2007; Mamane et al. 2007) and mRNA degradation (Grigull et al. 2004; Raghavan et al. 2004; Catts et al. 2005; Vlasova et al. 2008). Importantly, these approaches include a step designed to eliminate effects from differential transcription; either by pharmacologically inhibiting transcription during studies of RNA degradation or by measuring and correcting for transcription during analysis of ribosome recruitment data. This is likely to result in the high enrichment of transcripts that are subject to post-transcriptional control independent of transcription, which makes such data sets ideal for discovery of novel post-transcriptional regulatory RNA elements. Analysis of data from such studies reveals elements occurring at frequencies in the range of 10%–20% that are verified to be sufficient or necessary for the observed regulation (Larsson et al. 2006; Mamane et al. 2007; Vlasova et al. 2008). Therefore, algorithms that can identify conserved mRNA elements from subsets of coregulated transcripts at these frequencies are key tools to elucidate mechanisms of post-transcriptional control.
There are three strategies to identify conserved RNA structural elements. One strategy is to align sequences using standard multiple sequence alignment tools. The mutual-information measure is usually used for this approach (Chiu and Kolodziejczak 1991; Gutell et al. 1992; Gorodkin et al. 1997). The covariance model, a specific type of stochastic context-free grammar, was introduced in COVE to build structural multiple alignments and to scan a genome or database using this model (Eddy and Durbin 1994). Later, a combination of minimum free energy (MFE) and a covariation score (Hofacker et al. 2002; Ruan et al. 2004) or stochastic context-free grammars (Knudsen and Hein 1999) were used as well. Algorithms that recognize strongly correlated positions in a multiple alignment, e.g., RNAz (Washietl et al. 2005) and EvoFold (Pedersen et al. 2006), are often computationally efficient. However, the requirement for curated alignments, which are typically lacking in sets of coregulated mRNAs, renders all these methods nonapplicable for identification of elements from genome-wide studies of post-transcriptional regulation. Importantly, it was concluded that diverse data sets, which are expected from sets of coregulated genes, lead to poor sequence alignment that can destroy any covariation signal (Gardner and Giegerich 2004). The second strategy is to simultaneously align and fold RNA sequences as proposed by Sankoff (1985). Based on this algorithm, several software packages have been developed, such as Foldalign (Gorodkin et al. 2001), Dynalign (Mathews and Turner 2002), Stemloc (Holmes 2005), and CARNAC (Touzet and Perriquet 2004). These algorithms usually predict common structures of two sequences, and then use a progressive method like that of ClustalW to find conserved structures among a set of sequences. 
However, these algorithms are computationally intensive for larger data sets, which renders them nonapplicable for the hundreds of long coregulated sequences that populate the data set. The third strategy is to predict the structures of individual sequences separately and then find the conserved structure. Some of the software packages using this strategy achieve this by building multiple structural alignments, such as RNAProfile (Pavesi et al. 2004a), CMfinder (Yao et al. 2006), and RNAmine (Hamada et al. 2006). There are several algorithms that take folded RNA sequence as input and perform multiple sequence alignment (e.g., RNAforester [Hochsmann et al. 2003], MARNA [Siebert and Backofen 2005], MXSCARNA [Tabei et al. 2008], MASTR [Lindgreen et al. 2007]). Some software packages, such as comRNA (Ji et al. 2004), RNA Sampler (Xu et al. 2007), GPRM (Hu 2003), and GeRNAMo (Ji et al. 2004; Michal et al. 2007), do not use alignment. However, almost all of these software packages were designed to identify RNA elements in a relatively small number of sequences with limited length. Thus, even though they function well with small data sets, applying them to genome-wide coregulated data sets is not possible. For example, the GPRM server requires that each sequence has <1000 nucleotides (nt), and the total number of nucleotides in the data set cannot exceed 60,000. In addition, for the typical coregulated data set, CMfinder, comRNA, and RNA Sampler exceed the capacity of computers with 4 GB of RAM. Based on these restrictions, we selected all algorithms that were applicable and that could be applied using either a desktop computer (RNAProfile) or a Linux cluster (CMfinder) for the algorithm comparison.
In contrast, algorithms that predict transcription factor binding sites (TFBSs) in DNA are often less computationally intensive. Most can be divided into two categories, enumeration and optimization. In enumeration methods, such as Weeder (Pavesi et al. 2004b), the algorithm exhaustively searches the space of all possible elements. Optimization methods, such as BioProspector (Liu et al. 2001) and MEME (Bailey et al. 2006), use probabilistic optimization or deterministic optimization to identify the position of elements. MEMERIS (MEME in RNAs including secondary structures) is an extension of MEME. MEMERIS computes the probabilities of unpaired bases and then uses this measure to guide element finding in single-stranded regions (Hiller et al. 2006). As methods based only on sequence conservation could be used to search a set of coregulated transcripts, this raises the question of whether currently available TFBS detection methods could accurately and efficiently detect mRNA elements, which are often defined not only by sequence but also by combinations of sequence and structure.
Comparing algorithms for element discovery is generally difficult (Tompa et al. 2005; Sandve and Drablos 2006). The performance of each method depends on parameter tuning, computational complexity, type of genome sequences, and several other factors. Still it is essential for future understanding of post-transcriptional control to know which methods to use and if more than one is needed; i.e., how to combine their outputs. Therefore, we compared the RNA-element detection performance of two currently available TFBS algorithms based on sequence conservation, Weeder (enumeration) (Pavesi et al. 2004b) and BioProspector (optimization) (Liu et al. 2001), and one RNA element finding tool RNAProfile (Pavesi et al. 2004a) based on sequence/structure conservation. Here we show that all three algorithms can identify known mRNA elements and that each has its advantages depending on the element being detected. A range of complementary algorithms, based on similar principles, did not increase the number of detected elements. This suggests that a combination of methods based on different principles is necessary for the most complete element identification. Therefore, we devised an approach to combine the results from the three algorithms and/or any additional algorithm, and show its potential to guide selection of elements for mechanistic studies of mRNA coregulation using an experimentally derived data set. This study is the first to systematically compare and integrate outputs from current algorithms for post-transcriptional mRNA regulatory element discovery, providing an approach to evaluate and incorporate new algorithms in studies of post-transcriptional control.
RESULTS
Study design
To assess the utility of sequence conservation-based algorithms for finding mRNA elements, we selected two sequence conservation-based algorithms, one enumeration and one optimization (Weeder and BioProspector). We also selected one algorithm based on sequence/structure profile (RNAProfile). While we recognize that each of the RNA secondary structure methods has its unique properties and advantages, we initially chose RNAProfile to represent this category as it is reasonably fast and simple. All three methods were used with minimum parameter tuning to mimic the typical situation when nothing about the element is known. Ideally we would have used experimentally validated RNA elements as collected in, e.g., TarBase (Sethupathy et al. 2006). However, in reality, such experimentally validated databases do not contain sufficient numbers of elements and validated targets for a study such as this. Given these restrictions, we selected UTRdb as a curated database of RNA elements within the 5′ and 3′ untranslated regions (UTRs) (Mignone et al. 2005) with a reasonable number of element-containing transcripts and TargetScan as a well-populated and widely used microRNA (miRNA) target database (Lewis et al. 2003, 2005). To examine the relative algorithm performance, we collected all elements in UTRdb with a reasonable number of members (12 elements had at least 35 members) and six miRNA targets from TargetScan3.0 (Table 1). We divided non-miRNA elements into classes based on whether they are conserved in structure/sequence or sequence following the annotation from the UTRsite (Pesole et al. 2002). The average information content of the element was used to assess relative sequence conservation. The six miRNA targets (Table 1) were chosen as they have been indicated to show biological activity (Cheng et al. 2005; Ciafre et al. 2005; Akao et al. 2006; Larsson et al. 2007).
TABLE 1.
Element category, ID, and average information content
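The average information content used to rank elements by sequence conservation can be illustrated with a minimal sketch. It assumes a uniform background over the four RNA bases and no small-sample correction; the exact definition used for Table 1 is not given in this section, and the function names are hypothetical.

```python
import math

def pspm_from_sites(sites, alphabet="ACGU"):
    """Build a position-specific probability matrix from equal-length sites."""
    length = len(sites[0])
    pspm = []
    for i in range(length):
        column = [site[i] for site in sites]
        pspm.append({base: column.count(base) / len(column) for base in alphabet})
    return pspm

def information_content(column):
    """Bits of information in one column (assumption: uniform background)."""
    entropy_term = sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(len(column)) + entropy_term

def average_information_content(pspm):
    """Mean per-position information content of an element."""
    return sum(information_content(col) for col in pspm) / len(pspm)
```

Under these assumptions a perfectly conserved position contributes 2 bits and a position with equal base frequencies contributes 0 bits, so the average lies between 0 and 2.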
To compare how well each algorithm detected the different types of elements, we generated data sets of 35 UTR sequences containing different percentages of transcripts harboring the true elements. We recognize that a realistic sample size is more in the range of 150–300 transcripts, rather than 35 (Larsson et al. 2006, 2007; Vlasova et al. 2008), but we chose this relatively low number of transcripts per data set to allow us to keep as many elements as possible for our study (six elements did not have a sufficient number of defined members necessary to survey the complete range of true positives for larger sample sets, such as 200). To examine the validity of this decision, we selected those elements that could be used to build data sets of 200 sequences, and examined the stability of the results by comparing 66 searches using 35 sequences to those using 200 sequences, in which both the set of 35 and the set of 200 had the same percentage of true positives. Using the thresholds for determining whether the element was found or not (see below), the results showed that the sets of 200 and the sets of 35 gave similar predictions for 80% of the searches for non-miRNA target elements (Supplemental Fig. 1A). One element, the cytoplasmic polyadenylation element (CPE), behaved as an outlier. After removing CPE, 95% of the searches gave similar predictions (Supplemental Fig. 1B). Therefore, we concluded that using sets of 35 will enable us to test more elements and will still be representative of larger subsets for most elements. For miRNA targets, only 70% of the predictions were the same (Supplemental Fig. 1C). This relatively poor concordance, together with documentation that all miRNA targets we selected had more than 200 known target sequences, led us to analyze 200 member sets for miRNA targets.
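The construction of such fixed-size test sets can be sketched as follows. This is an illustration only: the function name, the random sampling scheme, and the pool of background sequences without the element are assumptions, since the sampling procedure itself is described in Materials and Methods rather than here.

```python
import random

def build_dataset(positives, negatives, size=35, fraction_positive=0.20, seed=0):
    """Sample `size` UTRs with a fixed share of element-carrying transcripts.

    positives: sequences known to carry the element
    negatives: background sequences (assumed pool, not specified here)
    """
    rng = random.Random(seed)
    n_pos = round(size * fraction_positive)
    n_neg = size - n_pos
    if n_pos > len(positives) or n_neg > len(negatives):
        raise ValueError("not enough sequences to build the requested data set")
    chosen = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(chosen)  # avoid any ordering effect in downstream tools
    return chosen, n_pos
```

With `size=35` and `fraction_positive=0.20`, seven of the 35 sequences carry the element; calling the function with `size=200` gives the larger sets used for miRNA targets.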
The percentages of true positives in the 35 (or 200 for miRNA targets) sequence sets were selected to resemble what could be expected to occur in data sets derived from current large genome-wide studies of post-transcriptional regulation (we used 20% as a realistic percentage), or to test whether the algorithms could ever identify the element (40%–100%). We selected the top predictions from each algorithm and calculated the sensitivities to detect each element. As each algorithm predicted multiple elements for each data set, the prediction with the highest sensitivity was used to compare the algorithms. We considered an element to be identified by the algorithm if the site sensitivity was ≥50% and the specificity ≥25% (see below in Materials and Methods).
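The decision rule above can be written as a short sketch. The formulas are assumptions: sensitivity is taken as the fraction of known sites recovered and specificity as the fraction of predicted sites that are correct, with a site counted as correct only on an exact match. The definitions actually used are given in Materials and Methods and may differ, for example in how overlap is scored.

```python
def site_metrics(predicted, known):
    """predicted, known: sets of (transcript_id, position) pairs."""
    tp = len(predicted & known)
    fn = len(known - predicted)
    fp = len(predicted - known)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0  # assumed definition
    return sensitivity, specificity

def best_prediction(predictions, known):
    """Keep the prediction with the highest sensitivity, as in the text."""
    return max((site_metrics(p, known) for p in predictions), key=lambda m: m[0])

def is_identified(sensitivity, specificity, min_sens=0.50, min_spec=0.25):
    """Thresholds from the text: sensitivity >= 50% and specificity >= 25%."""
    return sensitivity >= min_sens and specificity >= min_spec
```

Choosing the best prediction by sensitivity first and applying the specificity threshold afterward mirrors the order described in the paragraph.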
Identification of elements conserved in sequence and structure
For elements conserved in sequence and structure (Table 1), there were large differences in terms of both how well the algorithms performed as well as how well the test elements were detected (Fig. 1A). The histone 3′UTR stem–loop structure (HSL3) was identified efficiently by both BioProspector and Weeder. RNAProfile identified the internal ribosome entry site (IRES) at a realistic percentage of positive transcripts while the remaining 2 elements could not be identified (Fig. 1A). We were surprised that RNAProfile did not detect additional elements conserved in structure and sequence compared with Weeder and BioProspector (particularly the iron response element [IRE], which was shown previously) (Pavesi et al. 2004a). Given the limited set of examples, this could be due to the characteristics of the examples and not the performance of RNAProfile. To investigate this, we created a test set consisting of sequences harboring a validated IRE from mouse and human. As there were few examples of validated IREs, we created a modified data set where a fixed number of positive transcripts were diluted with transcripts not carrying the element to achieve the desired concentrations of true positives. This approach is not ideal as the surveyed sets carrying different concentrations of true positives will have different sizes, but we reasoned that despite the limitations, it would allow us to investigate if RNAProfile could provide any advantages for the detection of this type of element. In this data set, Weeder identified IRE at a realistic percentage of true positives, whereas RNAProfile needed 40% true positive transcripts for detection (Fig. 1B). Similarly we performed an analysis using selenocysteine insertion sequence (SECIS) elements from mouse and human collected in RFAM. There were 25 SECIS elements, and we used the original approach with a fixed size of the query set. Using this data set, Weeder identified SECIS at 60% true positive transcripts (Fig. 1C).
FIGURE 1.
Identification of elements conserved in sequence and structure. (A) A comparison of sensitivities for detecting elements conserved in sequence and secondary structure using BioProspector, Weeder, and RNAProfile. Each graph shows the sensitivities obtained in data sets with different percentages of transcripts carrying the indicated element. (B,C) A comparison of sensitivities of BioProspector, Weeder, and RNAProfile for detecting IRE and SECIS using sequences obtained from mouse and human.
Identification of elements conserved in sequence
For elements conserved in sequence, we detected large differences in performance depending on algorithm and element (Fig. 2). BioProspector and Weeder identified three and two out of eight elements (three unique elements) at a realistic percentage of positive transcripts (20%), respectively. CPE and the 15-lipoxygenase differentiation control element (15-LOX-DICE) were identified at higher percentages (40%–100%), while the TGE translational regulation element (TGE), terminal oligopyrimidine tract (TOP), and male-specific lethal 3′UTR cis-acting elements (MSL2-3UTR) could not be identified at all. As expected, judging from the number of identified elements, both Weeder and BioProspector performed better than RNAProfile for the elements conserved primarily in sequence (Fig. 2).
FIGURE 2.
Identification of elements conserved in sequence. Comparison of sensitivities for detecting elements conserved in primary sequence using BioProspector, Weeder, and RNAProfile. Each graph shows the sensitivities obtained in data sets with different percentages of transcripts carrying the indicated element.
Identification of miRNA targets
We did not run RNAProfile on miRNA targets, as RNAProfile is not designed to identify short elements conserved only in sequence. Four out of six miRNA targets were identified at a realistic percentage (20%) by BioProspector and/or Weeder. All six miRNA targets were identified by BioProspector and/or Weeder at higher percentages (40%–100%). The specificities varied between 33% and 79% (Table 2). BioProspector and Weeder were complementary in identifying miRNA targets as, at realistic percentages of true positives, BioProspector and Weeder each identified miRNA targets that the other method did not (see Fig. 3).
TABLE 2.
Percentage of positive transcripts for which a sensitivity ≥50% and specificity ≥25% is achieved
FIGURE 3.
Identification of miRNA targets. A comparison of sensitivities for detecting miRNA target sites using BioProspector and Weeder. Each graph shows the sensitivities obtained in data sets with different percentages of transcripts carrying the indicated element.
Performance of complementary methods
Our selection of Weeder and BioProspector was based on their good performance in a study comparing many TFBS algorithms (Tompa et al. 2005), while our selection of RNAProfile was based on its ability to be used on a desktop system. However, we were interested to compare these methods to methods that are based on similar principles or those that are beyond desktop computing (some requiring >8 GB of RAM memory to analyze even these relatively small data sets of 35 sequences). Our goal was to try all methods that could be applicable on sets of coregulated nonaligned transcripts (for discussion of limitations, see the Introduction). We considered GPRM (Hu 2003), GeRNAMo (Michal et al. 2007), CMfinder (Yao et al. 2006), comRNA (Ji et al. 2004), MEME (Bailey et al. 2006), MEMERIS (Hiller et al. 2006), and RNA Sampler (Xu et al. 2007). Among these, our computational resources (Linux cluster allowing up to 90 GB of RAM) permitted us to use CMfinder, MEME, and MEMERIS (although we could not use MEMERIS on all possible element lengths due to computation time restrictions) (see Materials and Methods). However, none of these methods could provide any clear advantage in terms of elements detected at 20% true positives when identifying elements conserved in sequence/structure, sequence, or miRNA targets (see Table 2; Supplemental Figs. 2–4) or the modified set using IRE or SECIS from both mouse and human (Supplemental Fig. 5). Thus it appears that adding additional methods that are overlapping in terms of detection principle does not provide any clear advantages.
Summary of algorithm performance at a realistic element abundance
To get an overview of expected algorithm performance in real data sets, we focused on performance at 20% true positives and the three methods originally selected (BioProspector, Weeder, and RNAProfile), as these provide a computationally efficient set of methods. As discussed above, our selection of 20% as a threshold for a realistic concentration was derived from prior studies indicating that genome-wide studies of post-transcriptional regulation, in which the transcriptional effects have been controlled, show that ∼10%–20% of the identified coregulated transcripts carry a particular element (Larsson et al. 2006; Vlasova et al. 2008). At a frequency of 20% true-positive transcripts, five out of 12 non-miRNA elements and four out of six miRNA target sequences could be detected. BioProspector detected four non-miRNA target elements and one miRNA target sequence; Weeder detected three non-miRNA target elements and three miRNA target sequences; and RNAProfile detected one non-miRNA target element. One element and one miRNA target were detected by BioProspector only; three miRNA targets were detected by Weeder only; and IRES was detected by RNAProfile only (Table 2). At higher percentages of true positives (60%), seven out of 12 non-miRNA target elements and all six miRNA targets were detected by any method. Five of the non-miRNA target elements could not be detected at any percentage of true positives, including 100%. Thus for the most complete element finding, a combination of all three methods performed best.
Determinants of algorithm performance
Our data indicated that each algorithm performs differently depending on element type, but there also seem to be element-specific aspects. There are several possible explanations. One factor that is likely to be important for the detection of an element using methods based on sequence conservation is the average information content. If the element is well conserved (high information content), it is expected that sequence conservation-based methods would show higher sensitivity. Information content is predicted to be less important for RNAProfile as this algorithm relies on a combination of sequence and structure conservation. To assess this relationship, we compared the sensitivities of the methods across all true positive transcript percentages and mRNA elements to the information content (Fig. 4A), percentage of true positives (Fig. 4B), and length of the element (Fig. 4C) (after excluding the element with the highest information content [HSL3] and all miRNA targets). This analysis indicated no strong relationships between the characteristics and the sensitivity except for length when using BioProspector and Weeder. As large variation was apparent for all algorithms, we also stratified the analysis based on element type:
For elements conserved in sequence and structure, BioProspector and Weeder both showed positive correlations between sensitivity and element length (Supplemental Fig. 6).
For elements conserved in sequence, there were no significant correlations (Supplemental Fig. 7).
For miRNA targets, there was a relationship between the sensitivity and the percentage of true-positive sequences for Weeder (Supplemental Fig. 8). miRNA targets have the same length and very similar information contents, and therefore, analyses of these two factors were excluded.
FIGURE 4.
An analysis of factors that could influence sensitivity to detect the predicted element. (A) Average information content. (B) Percentage of true positive transcripts. (C) Element length. Black, blue, and pink dots represent the sensitivities from BioProspector, Weeder, and RNAProfile, respectively. The black, blue, and pink lines present the linear regression model for BioProspector, Weeder, and RNAProfile, respectively. The P-values for the slopes are shown.
In summary, element length seems to be of some importance for algorithm performance. These trends may reflect the specific elements and settings used, but they nevertheless give some indication of which factors influence detection.
An approach to compare and integrate predicted elements between algorithms
With the current approach using the three algorithms and selecting the top scoring elements from each, a total of 65 predicted elements are generated. It is desirable to develop an approach that integrates the results by combining these outputs in order to prioritize detected elements for laboratory-based tests of functional significance. There are important differences in lengths of the predicted mRNA elements among the algorithms that require correction before an efficient comparison and consolidation is possible. For example, Weeder only identifies elements up to 12 nt, whereas BioProspector and RNAProfile can produce much longer candidate elements. To compare the elements predicted within and between algorithms, and to define as much of the true element as possible, we developed a general approach (for more details, see Materials and Methods):
The flanking sequences of the element are collected for each member sequence. Overlapping sequences are removed.
A sliding window starting at the center of the element is used to detect high information content, and the element is extended as long as one position within the sliding window shows larger information contents than a set threshold.
The resulting position-specific probability matrixes (PSPMs) are compared with each other using correlations, and a hierarchical clustering of the correlation matrix of all PSPMs to all PSPMs is performed (Fig. 5).
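The extension step in the list above can be sketched as follows, given per-position information content computed over the predicted element plus its flanking sequence. The window size and threshold values here are hypothetical placeholders, and the function name is illustrative; the settings actually used are described in Materials and Methods.

```python
def extend_element(ic, center, window=3, threshold=0.5):
    """Extend an element outward from `center` over informative positions.

    ic: information content per position (element plus flanks)
    Returns half-open bounds (start, end) of the extended element.
    Extension continues while at least one position in the next window
    exceeds `threshold` (window and threshold values are assumptions).
    """
    n = len(ic)

    def informative(lo, hi):
        return any(v > threshold for v in ic[max(lo, 0):min(hi, n)])

    end = center + 1
    while end < n and informative(end, end + window):
        end += 1
    start = center
    while start > 0 and informative(start - window, start):
        start -= 1
    return start, end
```

Because only one position in the window needs to exceed the threshold, a single poorly conserved position inside an otherwise conserved element does not stop the extension.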
FIGURE 5.
A proposed approach to compare and select mRNA regulatory elements. To identify conserved mRNA regulatory elements in a set of coregulated transcripts, algorithms representing three different principles—enumeration (Weeder), optimization (BioProspector), and sequence/structure (RNAProfile)—are used. The lengths of the predicted elements are optimized as described in the Results and Materials and Methods sections. To build PSPMs for each predicted element, the sequences predicted to constitute the element can be aligned or used without alignment. Then the maximum local correlations of all pairs of PSPMs are calculated, and the correlation matrix is clustered. The clusters with well-correlated PSPMs are the best candidates for biological testing.
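A minimal sketch of the pairwise comparison described in the caption follows. It takes the highest Pearson correlation over all ungapped offsets between two PSPMs; the minimum overlap of four positions and the restriction to ungapped offsets are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def max_local_correlation(p, q, min_overlap=4):
    """Best correlation over ungapped offsets; columns are [pA, pC, pG, pU]."""
    if len(p) < min_overlap or len(q) < min_overlap:
        raise ValueError("PSPM shorter than the minimum overlap")
    best = -1.0
    # offset = position in p at which the first column of q is placed
    for offset in range(-(len(q) - min_overlap), len(p) - min_overlap + 1):
        lo, hi = max(0, offset), min(len(p), offset + len(q))
        if hi - lo < min_overlap:
            continue
        x = [v for col in p[lo:hi] for v in col]
        y = [v for col in q[lo - offset:hi - offset] for v in col]
        best = max(best, pearson(x, y))
    return best

def correlation_matrix(pspms, min_overlap=4):
    """Symmetric matrix of maximum local correlations for all PSPM pairs."""
    n = len(pspms)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = max_local_correlation(pspms[i], pspms[j], min_overlap)
            matrix[i][j] = matrix[j][i] = r
    return matrix
```

The resulting matrix, for example converted to distances as one minus the correlation, can be passed to any standard hierarchical clustering routine to produce the kind of grouping described here.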
A representative result is depicted in Figure 6, where PSPMs with high sensitivities for CPE predicted by BioProspector, Weeder, and RNAProfile are clustered together. This analysis shows that in this case—as well as in several other cases not shown—the real element can be found in several of the outputs from the same and/or different methods. In the CPE analysis, a second cluster of elements predicted by BioProspector correlates with each other (labeled as “interesting cluster” in Fig. 6), suggesting that there are at least two different elements identified by the analysis.
FIGURE 6.
Integration of outputs. The figure shows a hierarchical clustering of extended PSPMs for CPE in a data set with 60% true positive transcripts by BioProspector, Weeder, and RNAProfile. The grayscale bar on the left of the plot indicates the sensitivities of the PSPMs. Black bars indicate high sensitivity. The column sidebars on the top of the plot indicate the method: blue represents BioProspector, yellow represents Weeder, and violet represents RNAProfile. The color of the cell represents the Pearson correlation of the two PSPMs: red represents 0, and white represents 1.
Application of the approach to a study of ribosome recruitment
Using a combined polyribosome preparation–microarray analysis, we previously identified 255 transcripts with increased ribosome recruitment when translation factor eIF4E mediates rescue of fibroblasts from apoptosis (Larsson et al. 2006). We used this data set to test whether the approach utilized in the current study would have helped us pick the element that we experimentally showed to target a reporter mRNA for translational activation under pro-apoptotic stress. We generated 65 candidate elements, and the clustered PSPM correlations of the elements are presented in Figure 7. In the original report, we identified a 55-nt mRNA consensus hairpin structure that was overrepresented in the 5′UTR of translationally activated transcripts using BioProspector and BioOptimizer (Jensen and Liu 2004). The present analysis shows that the element that led to the 55-nt structure correlated very well with many other elements identified and was one of only a few candidates that showed a distinct cluster. We therefore conclude that the approach presented here would have helped us identify the correct candidate element in this published study.
FIGURE 7.
RNA regulatory elements identified from eIF4E-mediated rescue of NIH 3T3 cells from apoptosis. The figure shows a hierarchical clustering of extended PSPMs from the eIF4E data set search (Larsson et al. 2006). The color of the cell represents the Pearson correlation of the two PSPMs: red represents 0, and white represents 1. The element that was validated as functional is indicated.
Web service
We have implemented a web service for the scientific community that is designed to assist investigators with appropriate data sets to integrate and prioritize elements for mechanistic laboratory-based experiments. The user can input the results from BioProspector, Weeder, and RNAProfile or PSPMs from other element discovery software, and the web tool will generate the type of visualization shown in Figures 6 and 7. The URL of the web service is https://dbw.msi.umn.edu:8443/recit.
DISCUSSION
Identification of mRNA elements is an essential step toward understanding the mechanisms of post-transcriptional regulation of gene expression. Using genome-wide tools, subsets of coregulated genes are currently being identified that will be investigated for the occurrence of mRNA elements as their mechanism for coregulation. For this purpose, several algorithms predicting conserved mRNA structure without multiple sequence alignments have been developed. However, algorithms designed for TFBS finding could be a helpful complement to these methods, since the TFBS procedures require less computation time. Therefore, here we sought to address two questions. The first was how well current methods perform for detecting mRNA regulatory elements; and the second was how to integrate the information generated both inter- and intra-algorithm in order to prioritize elements for laboratory-based tests of functional import.
To carry out our analysis, we selected the UTRdb as the source of elements for comparison and three methods based on different principles. Our results show that each of the three methods (BioProspector, Weeder, and RNAProfile) has advantages, as each identified elements that the other methods did not. This highlights the importance of using several methods and integrating the results when searching for an element of unknown characteristics. Using these three methods, five out of 12 tested elements and four out of six miRNA targets could be identified at a realistic concentration of true positives (20%), and seven out of the 12 tested elements and six out of six miRNA targets could be identified at ≤60% true positives. This not only indicates that a reasonable percentage of unknown elements can likely be identified using this approach but also highlights that additional methods will be necessary for a more complete understanding of post-transcriptional regulation.
We recognize that in the UTRdb, some elements are better defined than others. For this reason, we expect that our generated data sets included different numbers of false positive elements depending on the element. This has the potential to generate biases in the present study. However, as more false positives would be associated with a reduction in information content and we see no correlation between the information content and algorithm performance, we conclude that this did not lead to a significant bias in the present study.
In addition to the algorithms we selected to represent different detection principles (Weeder, BioProspector, and RNAProfile), we also tested a set of complementary methods (CMfinder, MEME, and MEMERIS) that, in our hands, did not provide any additional discoveries. While we cannot exclude the possibility that these additional methods may provide advantages for detection of elements not evaluated in this study, we found no evidence that they would substantially improve element detection. Since both MEMERIS and CMfinder require more extensive computational resources than the others, these may also be technically hard to apply for a laboratory without extensive computer resources. During the completion of this study, a novel enumeration algorithm (Amadeus) was presented that showed good performance for miRNA discovery (Linhart et al. 2008) and could provide advantages for miRNA identification in addition to Weeder and BioProspector.
Once candidate elements are detected, the next important task in a genome-wide element search is to integrate the results and select the best candidates for subsequent detailed biochemical and biological analyses. Methods for comparing PSPMs/PSSMs have been developed by several investigators (Pietrokovski 1996; Hughes et al. 2000; Wang and Stormo 2003; Sandelin and Wasserman 2004; Schones et al. 2005). Pietrokovski (1996) showed that the Pearson correlation coefficient is an effective measure for comparing protein multiple sequence alignments. We selected Pearson correlation in our study, as this approach is simple and straightforward. Our approach to cluster elements based on this measure gives strong visual guidelines about the number of true elements found in a search. We interpret our data to indicate that focusing only on those elements that emerge from all methods may lead to the rejection of too many true regulatory elements, because each algorithm is better at identifying some types of elements than others. Yet it is essential to combine the outputs to identify true, unique elements. Therefore, we suggest that if an element is found by the same algorithm several times in different versions (e.g., length) or by several algorithms, either outcome indicates that the element has promise as an authentic mediator of post-transcriptional control. Thus the suggested approach utilizes all elements found by multiple algorithms and provides a procedure for biological scientists to prioritize elements for further study of mechanism.
MATERIALS AND METHODS
Generation of data sets for UTRsite elements
The UTRs of human RefSeq transcripts were retrieved from UTRdb (06/20/2006) for the elements: U0001 (HSL3), U0002 (IRE), U0003 (SECIS type 1 [SECIS-1]), U0015 (IRES), U0006 (CPE), U0007 (TGE), U0009 (15-LOX-DICE), U0010 (AU-rich class-2 element or ARE2), U0011 (TOP), U0017 (MSL2-3UTR), U0019 (Bruno 3′UTR responsive element [BRE]), and U0035 (Mos polyadenylation response element [Mos-PRE]). Sequences with or without the element were separated into two groups. In each group, the sequences were given a random rank number. Five data sets were created from these two groups of sequences. Each data set contained 35 sequences from 35 unique genes and was composed of the top x (x = 7, 14, 21, 28, 35) sequences from the pool with the element and the top 35 − x sequences from the pool without the element. As an alternative approach for SECIS and IRE, we used sequences from both mouse and human. SECIS sequences were obtained from RFAM 9.0 after removing all redundant sequences and sequences from unknown genes. Twenty-five such sequences were found, and data sets with different percentages of true-positive transcripts were created as above but with a size of 25 instead of 35. For IRE, experimentally validated targets from UTRdb were used, and because there were only seven such examples, these were “diluted” with control transcripts to generate data sets containing different percentages of true positives. In contrast to all other data sets used, these IRE data sets therefore differ in size.
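The spiking procedure described above can be sketched as follows. This is a minimal illustration, not the scripts used in the study: the function name `build_spiked_datasets`, the fixed random seed, and the assumption that each identifier already represents one unique gene are ours.

```python
import random

def build_spiked_datasets(with_element, without_element, size=35,
                          positives=(7, 14, 21, 28, 35), seed=0):
    """Return {x: sequence ids}, mixing x positives with size - x negatives.

    Assumes every id in both pools already corresponds to a unique gene.
    """
    rng = random.Random(seed)
    # Assign a random rank to each pool once, so that the data sets are nested.
    pos_ranked = rng.sample(with_element, len(with_element))
    neg_ranked = rng.sample(without_element, len(without_element))
    datasets = {}
    for x in positives:
        if x > len(pos_ranked) or size - x > len(neg_ranked):
            raise ValueError("pool too small for the requested data set")
        # Top x sequences with the element plus top (size - x) without it.
        datasets[x] = pos_ranked[:x] + neg_ranked[:size - x]
    return datasets

if __name__ == "__main__":
    pos = [f"pos{i}" for i in range(40)]   # hypothetical ids
    neg = [f"neg{i}" for i in range(40)]
    for x, ids in build_spiked_datasets(pos, neg).items():
        print(x, len(ids))
```

Because both pools are ranked only once, the positives in the 20% data set are a subset of those in the 40% data set, and so on.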
Generation of data sets for miRNA targets
The UTR sequences for miRNA target detection and the miRNA target information were downloaded from the TargetScan database (TargetScan 3.0) (Lewis et al. 2003, 2005). The data sets were generated in the same way as for the UTRsite elements, except that we used sets of 200 UTRs because miRNA detection was dependent on the size of the set (see Results).
Element finding algorithms
Weeder (version 1.3) (Pavesi et al. 2004b) was downloaded from http://159.149.109.16:8080/weederWeb/ and installed on a SUN machine. Background frequency files for human RefSeq (1005_10_03) UTRs were calculated by counting the occurrences of all possible 6-mers and 8-mers. WeederTFBS.out was used to predict conserved elements with lengths of 6, 8, or 10 nt; allowing one, two, or three mismatches, respectively. Settings were as follows: R = 25 (at least 25% of the sequences must contain the element) and T = 100 (the top 100 scoring elements of the run were reported). The adviser program was used to identify elements that were redundant both within the same run and in runs of different lengths. The top 10 elements from adviser were considered.
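Counting background k-mers of the kind mentioned above can be illustrated with a short sketch. It only shows the counting step; it does not reproduce the file format that Weeder expects, and the function name `kmer_counts` and the DNA alphabet are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

def kmer_counts(sequences, k, alphabet="ACGT"):
    """Count occurrences of every possible k-mer over the given alphabet."""
    # Start from all possible k-mers so that unobserved ones are reported as 0.
    counts = Counter({"".join(p): 0 for p in product(alphabet, repeat=k)})
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Windows containing ambiguous symbols (e.g., N) are skipped.
            if kmer in counts:
                counts[kmer] += 1
    return counts

if __name__ == "__main__":
    utrs = ["ACGTACGTNACGT", "TTTTACGTAA"]  # hypothetical sequences
    six_mers = kmer_counts(utrs, 6)
    print(six_mers.most_common(3))
```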
BioProspector (Liu et al. 2001) for a SUN machine was downloaded on October 19, 2005. The background distribution of human RefSeq UTRs was calculated using the genome.bg algorithm. Settings were as follows: n = 100 (the number of element finding iterations) and r = 25 (N for the Monte Carlo simulations used to obtain the null distribution). For each data set, elements ranging from 6–50 nt were examined; the highest scoring element for each length was considered for subsequent analysis.
RNAProfile version 2.2 (Pavesi et al. 2004a) was downloaded from http://159.149.109.16:8080/weederWeb/ and installed on a Linux machine. Settings were as follows: P = 10 (10 best profiles were saved in each run) and p = 1 (one best profile originating from the same profile or region). Each data set was run five times with different random seeds (the r switch was on). The 10 profiles with the highest unique scores were analyzed. Elements with fitness larger than −1 were considered to be members of the element.
MEME (Bailey et al. 2006) was downloaded from http://meme.sdsc.edu/meme/ and installed on a Linux machine. A second-order Markov model of human RefSeq 3′UTRs or 5′UTRs was used as background. The ZOOPS model was used. The maximum length of the element was 50 nt, and the minimum length was 6 nt. Ten elements were reported for each data set.
MEMERIS (Hiller et al. 2006) 1.0 was downloaded from http://www.bioinf.uni-freiburg.de/∼hiller/MEMERIS/. It was installed on a Linux machine. The same second-order Markov model background was used as in MEME. ZOOPS model was used. Elements with widths of 7, 14, 21, 28, 35, 42, and 49 were considered because RNA secondary structure prediction for every width would be too time consuming. We used a pseudocount pi = 0.01, spfuzz = 2.
CMfinder (Yao et al. 2006) was downloaded from http://bio.cs.washington.edu/yzizhen/CMfinder/. The fraction of sequences containing the element was set to 0.2; the minimum length of a motif was 15. Following the procedure of Yao et al. (2006), we used an alignment score of 10 as a cutoff threshold for an element to be considered.
Evaluation of results
The start and end sites of predicted elements were compared with the authentic start and end sites of the element. The number of nucleotides overlapping with the known element was calculated as nucleotide-level true positives (nTP). If nTP was ≥50% of the algorithm-predicted element length, the predicted element was considered a true positive sequence (sTP). Sensitivity was calculated as sTP/(sTP + sFN), where sFN is the number of false negative sequences. Specificity was calculated as sTN/(sTN + sFP), where sTN is the number of true negative sequences and sFP is the number of false positive sequences. For data sets with 100% positive transcripts, specificity is not applicable (NA) because sTN + sFP = 0. For Weeder, only predictions with scores >90 were evaluated.
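The sequence-level evaluation can be sketched as below. The text does not spell out how predictions in sequences lacking the element are scored; this sketch assumes that any prediction in such a sequence counts as a false positive and no prediction counts as a true negative. Coordinates are assumed to be inclusive, and all function names are illustrative.

```python
def overlap_nt(pred, known):
    """Nucleotides shared by two inclusive (start, end) intervals (nTP)."""
    return max(0, min(pred[1], known[1]) - max(pred[0], known[0]) + 1)

def is_true_positive(pred, known, min_fraction=0.5):
    """True if nTP covers at least min_fraction of the predicted element."""
    pred_len = pred[1] - pred[0] + 1
    return overlap_nt(pred, known) >= min_fraction * pred_len

def sensitivity_specificity(predictions, known_sites):
    """Both arguments map seq_id -> (start, end) or None.

    Returns (sensitivity, specificity); a value is None when undefined.
    """
    tp = fn = tn = fp = 0
    for seq_id, known in known_sites.items():
        pred = predictions.get(seq_id)
        if known is not None:
            if pred is not None and is_true_positive(pred, known):
                tp += 1
            else:
                fn += 1
        elif pred is not None:
            fp += 1  # assumption: any call in a negative sequence is an sFP
        else:
            tn += 1
    sens = tp / (tp + fn) if tp + fn else None
    spec = tn / (tn + fp) if tn + fp else None  # None plays the role of NA
    return sens, spec

if __name__ == "__main__":
    known = {"a": (10, 20), "b": (5, 15), "c": None, "d": None}
    pred = {"a": (12, 22), "b": (30, 40), "c": (1, 8), "d": None}
    print(sensitivity_specificity(pred, known))
```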
Calculations of information content
To examine how well the element is conserved among its members, we calculated the information content of each element in each data set. A multiple sequence alignment of the predicted true-positive elements in each data set was built with ClustalW (version 1.83, using the BestFit scoring matrix). The information content, or conservation, of the element at a particular position was calculated as
$$I = \log_2 N + \sum_{n=1}^{N} p_n \log_2 p_n$$
where $N$ is the number of different symbols (four for RNA), and $p_n$ is the observed frequency of symbol $n$ at a particular position in the element (Crooks et al. 2004). The information content of an element is the average information content of all positions of the element.
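As an illustration, the calculation can be sketched in a few lines, assuming the standard sequence-logo definition (log2 N minus the Shannon entropy of the column). The sketch ignores gaps and ambiguous symbols and applies no small-sample correction; both choices, and the function names, are assumptions rather than details taken from the study.

```python
import math
from collections import Counter

def position_information(column, alphabet="ACGU"):
    """Information content (bits) of one alignment column."""
    # Gaps and symbols outside the alphabet are ignored (an assumption).
    counts = Counter(c for c in column if c in alphabet)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((k / total) * math.log2(k / total)
                   for k in counts.values())
    return math.log2(len(alphabet)) - entropy

def element_information(aligned_seqs, alphabet="ACGU"):
    """Average information content over all columns of an alignment."""
    values = [position_information(col, alphabet)
              for col in zip(*aligned_seqs)]
    return sum(values) / len(values)

if __name__ == "__main__":
    alignment = ["CAGUGC", "CAGUGU", "CAGUGA", "CAGAGC"]  # toy alignment
    print(round(element_information(alignment), 3))
```

A fully conserved column yields 2 bits for RNA, and a column in which all four nucleotides are equally frequent yields 0 bits.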
Integration of outputs for different algorithms using element extensions and clustering of PSPMs
The predicted element sequences from each output were extended to include 50 nt on both sides of the predicted element. When an element was close to the end of the UTR, the missing positions were filled in with the symbol “N” until the flanking sequence was 50 nt long. When two elements, including their extended sequences, overlapped (i.e., two predicted elements were close to each other), the element with more defined flanking sequence (i.e., fewer Ns) was kept. If both had an equal amount of defined sequence (equal numbers of Ns), one was kept at random. The resulting sequences could be aligned or used without alignment. Next, the information content was calculated at each position of the expanded element. A sliding window of 10, starting at the center of the element and allowing a minimal element length of 6 nt, was used in both directions. The element was extended if one position within the sliding window reached information content larger than 1.5. This extended element was used to generate the extended PSPM. The extended PSPMs were converted into arrays of the form $P_{A1}P_{C1}P_{G1}P_{T1}, P_{A2}P_{C2}P_{G2}P_{T2}, \ldots, P_{Ai}P_{Ci}P_{Gi}P_{Ti}$, where $i$ is the position in the element and $P$ represents the frequency of a particular nucleotide at position $i$. For any pair of PSPMs, the two arrays were first aligned so that the 5′ end of array A overlapped the 3′ end of array B. The minimal overlap was set to 6 nt. The nonoverlapping part was filled with the background nucleotide frequencies calculated from the UTRs of human RefSeqs. The Pearson correlation of the two arrays was calculated. Array A was then moved one position toward the 5′ end of array B, and the correlation was calculated again. Array A kept sliding until the 3′ end of array A and the 5′ end of array B had an overlap of 6 nt. The maximum correlation of the two arrays was recorded for array A and array B.
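The sliding comparison of two PSPMs can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code behind the web service: each PSPM is represented as a list of (A, C, G, T) frequency tuples, a uniform background of 0.25 stands in for the RefSeq UTR nucleotide frequencies, and the function names are ours.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant array
    return sxy / math.sqrt(sxx * syy)

def max_sliding_correlation(pspm_a, pspm_b,
                            background=(0.25, 0.25, 0.25, 0.25),
                            min_overlap=6):
    """Maximum Pearson correlation over all offsets with >= min_overlap."""
    la, lb = len(pspm_a), len(pspm_b)
    if min(la, lb) < min_overlap:
        raise ValueError("both PSPMs must span at least min_overlap positions")
    best = -1.0
    # offset is the coordinate, relative to B, at which A starts.
    for offset in range(min_overlap - la, lb - min_overlap + 1):
        start, end = min(0, offset), max(lb, offset + la)
        x, y = [], []
        for pos in range(start, end):
            ia = pos - offset
            # Positions not covered by a PSPM are padded with the background.
            x.extend(pspm_a[ia] if 0 <= ia < la else background)
            y.extend(pspm_b[pos] if 0 <= pos < lb else background)
        best = max(best, pearson(x, y))
    return best

if __name__ == "__main__":
    core = [(0.7, 0.1, 0.1, 0.1), (0.1, 0.7, 0.1, 0.1),
            (0.1, 0.1, 0.7, 0.1), (0.1, 0.1, 0.1, 0.7)] * 2
    longer = [(0.25, 0.25, 0.25, 0.25)] * 3 + core
    print(round(max_sliding_correlation(core, longer), 3))
```

Taking the maximum over all offsets makes the score insensitive to where the shared core sits within each extended element.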
A correlation matrix of all extended PSPMs against all extended PSPMs was generated, and hierarchical clustering was performed on the correlation matrix (average linkage and Euclidean distance). All scripts and data sets are available upon request. We also provide a web service for this approach at https://dbw.msi.umn.edu:8443/recit.
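The clustering step can be illustrated with a small pure-Python sketch of average-linkage agglomerative clustering, in which each row of the correlation matrix is treated as a profile and profiles are compared by Euclidean distance. The function names and the toy matrix are hypothetical; for real data sets a library routine (e.g., `scipy.cluster.hierarchy.linkage` with `method="average"`) would normally be preferred.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length rows."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def average_linkage(rows):
    """Average-linkage clustering of matrix rows.

    Returns the merges in order as (members_a, members_b, distance),
    where members are tuples of original row indices.
    """
    n = len(rows)
    dist = {(i, j): euclidean(rows[i], rows[j])
            for i in range(n) for j in range(i + 1, n)}

    def cluster_distance(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [(i,) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        d, a, b = min((cluster_distance(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

if __name__ == "__main__":
    corr = [[1.0, 0.9, 0.1, 0.2],   # toy PSPM correlation matrix
            [0.9, 1.0, 0.2, 0.1],
            [0.1, 0.2, 1.0, 0.8],
            [0.2, 0.1, 0.8, 1.0]]
    for left, right, d in average_linkage(corr):
        print(left, right, round(d, 3))
```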
SUPPLEMENTAL MATERIAL
Supplemental material can be found at http://www.rnajournal.org.
ACKNOWLEDGMENTS
O.L. was supported by a post-doctoral fellowship from the Swedish Research Council during the initiation of the study and a post-doctoral fellowship from the Knut and Alice Wallenberg foundation during the completion of the study. D.F. and P.B.B. are supported by HL076779 and HL073719 from the NIH. We thank Dr. Wayne Xu from the Computational Genetics Laboratory at the University of Minnesota for helping with the web service. This study was performed using resources from the Minnesota Supercomputing Institute computational genetics laboratory.
Footnotes
Article published online ahead of print. Article and publication date are at http://www.rnajournal.org/cgi/doi/10.1261/rna.1617009.
REFERENCES
- Akao Y, Nakagawa Y, Naoe T. MicroRNAs 143 and 145 are possible common onco-microRNAs in human cancers. Oncol Rep. 2006;16:845–850.
- Akgul B, Tu CP. Regulation of mRNA stability through a pentobarbital-responsive element. Arch Biochem Biophys. 2007;459:143–150. doi: 10.1016/j.abb.2006.10.026.
- Bailey TL, Williams N, Misleh C, Li WW. MEME: Discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. 2006;34:W369–W373. doi: 10.1093/nar/gkl198.
- Bilanges B, Argonza-Barrett R, Kolesnichenko M, Skinner C, Nair M, Chen M, Stokoe D. Tuberous sclerosis complex proteins 1 and 2 control serum-dependent translation in a TOP-dependent and -independent manner. Mol Cell Biol. 2007;27:5746–5764. doi: 10.1128/MCB.02136-06.
- Blais JD, Filipenko V, Bi M, Harding HP, Ron D, Koumenis C, Wouters BG, Bell JC. Activating transcription factor 4 is translationally regulated by hypoxic stress. Mol Cell Biol. 2004;24:7469–7482. doi: 10.1128/MCB.24.17.7469-7482.2004.
- Brambilla A, Mainieri D, Agostoni Carbone ML. A simple signal element mediates transcription termination and mRNA 3′ end formation in the DEG1 gene of Saccharomyces cerevisiae. Mol Gen Genet. 1997;254:681–688. doi: 10.1007/s004380050466.
- Brewer G. An A+U-rich element RNA-binding factor regulates c-myc mRNA stability in vitro. Mol Cell Biol. 1991;11:2460–2466. doi: 10.1128/mcb.11.5.2460.
- Catts VS, Catts SV, Fernandez HR, Taylor JM, Coulson EJ, Lutze-Mann LH. A microarray study of post-mortem mRNA degradation in mouse brain tissue. Brain Res Mol Brain Res. 2005;138:164–177. doi: 10.1016/j.molbrainres.2005.04.017.
- Chabanon H, Nury D, Mickleburgh I, Burtle B, Hesketh J. Characterization of the cis-acting element directing perinuclear localization of the metallothionein-1 mRNA. Biochem Soc Trans. 2004;32:702–704. doi: 10.1042/BST0320702.
- Cheng AM, Byrom MW, Shelton J, Ford LP. Antisense inhibition of human miRNAs and indications for an involvement of miRNA in cell growth and apoptosis. Nucleic Acids Res. 2005;33:1290–1297. doi: 10.1093/nar/gki200.
- Chiu DK, Kolodziejczak T. Inferring consensus structure from nucleic acid sequences. Comput Appl Biosci. 1991;7:347–352. doi: 10.1093/bioinformatics/7.3.347.
- Ciafre SA, Galardi S, Mangiola A, Ferracin M, Liu CG, Sabatino G, Negrini M, Maira G, Croce CM, Farace MG. Extensive modulation of a set of microRNAs in primary glioblastoma. Biochem Biophys Res Commun. 2005;334:1351–1358. doi: 10.1016/j.bbrc.2005.07.030.
- Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: A sequence logo generator. Genome Res. 2004;14:1188–1190. doi: 10.1101/gr.849004.
- Eddy SR, Durbin R. RNA sequence analysis using covariance models. Nucleic Acids Res. 1994;22:2079–2088. doi: 10.1093/nar/22.11.2079.
- Gardner PP, Giegerich R. A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics. 2004;5:140. doi: 10.1186/1471-2105-5-140.
- Gorodkin J, Heyer LJ, Brunak S, Stormo GD. Displaying the information contents of structural RNA alignments: The structure logos. Comput Appl Biosci. 1997;13:583–586. doi: 10.1093/bioinformatics/13.6.583.
- Gorodkin J, Stricklin SL, Stormo GD. Discovering common stem–loop motifs in unaligned RNA sequences. Nucleic Acids Res. 2001;29:2135–2144. doi: 10.1093/nar/29.10.2135.
- Grigull J, Mnaimneh S, Pootoolal J, Robinson MD, Hughes TR. Genome-wide analysis of mRNA stability using transcription inhibitors and microarrays reveals posttranscriptional control of ribosome biogenesis factors. Mol Cell Biol. 2004;24:5534–5547. doi: 10.1128/MCB.24.12.5534-5547.2004.
- Gutell RR, Power A, Hertz GZ, Putz EJ, Stormo GD. Identifying constraints on the higher-order structure of RNA: Continued development and application of comparative sequence analysis methods. Nucleic Acids Res. 1992;20:5785–5795. doi: 10.1093/nar/20.21.5785.
- Hamada M, Tsuda K, Kudo T, Kin T, Asai K. Mining frequent stem patterns from unaligned RNA sequences. Bioinformatics. 2006;22:2480–2487. doi: 10.1093/bioinformatics/btl431.
- Hiller M, Pudimat R, Busch A, Backofen R. Using RNA secondary structures to guide sequence motif finding towards single-stranded regions. Nucleic Acids Res. 2006;34:e117. doi: 10.1093/nar/gkl544.
- Hochsmann M, Toller T, Giegerich R, Kurtz S. Local similarity in RNA secondary structures. Proc IEEE Comput Soc Bioinform Conf. 2003;2:159–168.
- Hofacker IL, Fekete M, Stadler PF. Secondary structure prediction for aligned RNA sequences. J Mol Biol. 2002;319:1059–1066. doi: 10.1016/S0022-2836(02)00308-X.
- Holmes I. Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics. 2005;6:73. doi: 10.1186/1471-2105-6-73.
- Hu YJ. GPRM: A genetic programming approach to finding common RNA secondary structure elements. Nucleic Acids Res. 2003;31:3446–3449. doi: 10.1093/nar/gkg521.
- Hughes JD, Estep PW, Tavazoie S, Church GM. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000;296:1205–1214. doi: 10.1006/jmbi.2000.3519.
- Jensen ST, Liu JS. BioOptimizer: A Bayesian scoring function approach to motif discovery. Bioinformatics. 2004;20:1557–1564. doi: 10.1093/bioinformatics/bth127.
- Ji Y, Xu X, Stormo GD. A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned sequences. Bioinformatics. 2004;20:1591–1602. doi: 10.1093/bioinformatics/bth131.
- Keene JD. RNA regulons: Coordination of post-transcriptional events. Nat Rev Genet. 2007;8:533–543. doi: 10.1038/nrg2111.
- Kitamura H, Nakagawa T, Takayama M, Kimura Y, Hijikata A, Ohara O. Post-transcriptional effects of phorbol 12-myristate 13-acetate on transcriptome of U937 cells. FEBS Lett. 2004;578:180–184. doi: 10.1016/j.febslet.2004.11.008.
- Kloc M, Bilinski S, Pui-Yee Chan A, Etkin LD. The targeting of Xcat2 mRNA to the germinal granules depends on a cis-acting germinal granule localization element within the 3′UTR. Dev Biol. 2000;217:221–229. doi: 10.1006/dbio.1999.9554.
- Knudsen B, Hein J. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics. 1999;15:446–454. doi: 10.1093/bioinformatics/15.6.446.
- Larsson O, Perlman DM, Fan D, Reilly CS, Peterson M, Dahlgren C, Liang Z, Li S, Polunovsky VA, Wahlestedt C, et al. Apoptosis resistance downstream of eIF4E: posttranscriptional activation of an anti-apoptotic transcript carrying a consensus hairpin structure. Nucleic Acids Res. 2006;34:4375–4386. doi: 10.1093/nar/gkl558.
- Larsson O, Li S, Issaenko OA, Avdulov S, Peterson M, Smith K, Bitterman PB, Polunovsky VA. Eukaryotic translation initiation factor 4E induced progression of primary human mammary epithelial cells along the cancer pathway is associated with targeted translational deregulation of oncogenic drivers and inhibitors. Cancer Res. 2007;67:6814–6824. doi: 10.1158/0008-5472.CAN-07-0752.
- Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB. Prediction of mammalian microRNA targets. Cell. 2003;115:787–798. doi: 10.1016/s0092-8674(03)01018-3.
- Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005;120:15–20. doi: 10.1016/j.cell.2004.12.035.
- Lindgreen S, Gardner PP, Krogh A. MASTR: Multiple alignment and structure prediction of noncoding RNAs using simulated annealing. Bioinformatics. 2007;23:3304–3311. doi: 10.1093/bioinformatics/btm525.
- Linhart C, Halperin Y, Shamir R. Transcription factor and microRNA motif discovery: The Amadeus platform and a compendium of metazoan target sets. Genome Res. 2008;18:1180–1189. doi: 10.1101/gr.076117.108.
- Liu X, Brutlag DL, Liu JS. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput. 2001;2001:127–138.
- Lu X, de la Pena L, Barker C, Camphausen K, Tofilon PJ. Radiation-induced changes in gene expression involve recruitment of existing messenger RNAs to and away from polysomes. Cancer Res. 2006;66:1052–1061. doi: 10.1158/0008-5472.CAN-05-3459.
- Macdonald PM, Kerr K, Smith JL, Leask A. RNA regulatory element BLE1 directs the early steps of bicoid mRNA localization. Development. 1993;118:1233–1243. doi: 10.1242/dev.118.4.1233.
- Mamane Y, Petroulakis E, Martineau Y, Sato TA, Larsson O, Rajasekhar VK, Sonenberg N. Epigenetic activation of a subset of mRNAs by eIF4E explains its effects on cell proliferation. PLoS One. 2007;2:e242. doi: 10.1371/journal.pone.0000242.
- Massiello A, Salas A, Pinkerman RL, Roddy P, Roesser JR, Chalfant CE. Identification of two RNA cis-elements that function to regulate the 5′ splice site selection of Bcl-x pre-mRNA in response to ceramide. J Biol Chem. 2004;279:15799–15804. doi: 10.1074/jbc.M313950200.
- Mathews DH, Turner DH. Dynalign: An algorithm for finding the secondary structure common to two RNA sequences. J Mol Biol. 2002;317:191–203. doi: 10.1006/jmbi.2001.5351.
- Michal S, Ivry T, Cohen O, Sipper M, Barash D. Finding a common motif of RNA sequences using genetic programming: The GeRNAMo system. IEEE/ACM Trans Comput Biol Bioinform. 2007;4:596–610. doi: 10.1109/tcbb.2007.1045.
- Mignone F, Grillo G, Licciulli F, Iacono M, Liuni S, Kersey PJ, Duarte J, Saccone C, Pesole G. UTRdb and UTRsite: A collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res. 2005;33:D141–D146. doi: 10.1093/nar/gki021.
- Pavesi G, Mauri G, Stefani M, Pesole G. RNAProfile: An algorithm for finding conserved secondary structure motifs in unaligned RNA sequences. Nucleic Acids Res. 2004a;32:3258–3269. doi: 10.1093/nar/gkh650.
- Pavesi G, Mereghetti P, Mauri G, Pesole G. Weeder Web: Discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 2004b;32:W199–W203. doi: 10.1093/nar/gkh465.
- Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol. 2006;2:e33. doi: 10.1371/journal.pcbi.0020033.
- Pesole G, Liuni S, Grillo G, Licciulli F, Mignone F, Gissi C, Saccone C. UTRdb and UTRsite: Specialized databases of sequences and functional elements of 5′ and 3′ untranslated regions of eukaryotic mRNAs. Update 2002. Nucleic Acids Res. 2002;30:335–340. doi: 10.1093/nar/30.1.335.
- Pietrokovski S. Searching databases of conserved sequence regions by aligning protein multiple alignments. Nucleic Acids Res. 1996;24:3836–3845. doi: 10.1093/nar/24.19.3836.
- Qin X, Sarnow P. Preferential translation of internal ribosome entry site-containing mRNAs during the mitotic cycle in mammalian cells. J Biol Chem. 2004;279:13721–13728. doi: 10.1074/jbc.M312854200.
- Raghavan A, Dhalla M, Bakheet T, Ogilvie RL, Vlasova IA, Khalid SA, Williams BRG, Bohjanen PR. Patterns of coordinate down-regulation of ARE-containing transcripts following immune cell activation. Genomics. 2004;84:1002–1013. doi: 10.1016/j.ygeno.2004.08.007.
- Rajasekhar VK, Viale A, Socci ND, Wiedmann M, Hu X, Holland EC. Oncogenic Ras and Akt signaling contribute to glioblastoma formation by differential recruitment of existing mRNAs to polysomes. Mol Cell. 2003;12:889–901. doi: 10.1016/s1097-2765(03)00395-2.
- Ruan J, Stormo GD, Zhang W. An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots. Bioinformatics. 2004;20:58–66. doi: 10.1093/bioinformatics/btg373.
- Sandelin A, Wasserman WW. Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. J Mol Biol. 2004;338:207–215. doi: 10.1016/j.jmb.2004.02.048.
- Sandve GK, Drablos F. A survey of motif discovery methods in an integrated framework. Biol Direct. 2006;1:11. doi: 10.1186/1745-6150-1-11.
- Sankoff D. Simultaneous solution of the RNA folding, alignment, and protosequence problems. SIAM J Appl Math. 1985;45:810–825.
- Schones DE, Sumazin P, Zhang MQ. Similarity of position frequency matrices for transcription factor binding sites. Bioinformatics. 2005;21:307–313. doi: 10.1093/bioinformatics/bth480.
- Sethupathy P, Corda B, Hatzigeorgiou AG. TarBase: A comprehensive database of experimentally supported animal microRNA targets. RNA. 2006;12:192–197. doi: 10.1261/rna.2239606.
- Siebert S, Backofen R. MARNA: Multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons. Bioinformatics. 2005;21:3352–3359. doi: 10.1093/bioinformatics/bti550.
- Tabei Y, Kiryu H, Kin T, Asai K. A fast structural multiple alignment method for long RNA sequences. BMC Bioinformatics. 2008;9:33. doi: 10.1186/1471-2105-9-33.
- Theil EC. The IRE (iron regulatory element) family: Structures which regulate mRNA translation or stability. Biofactors. 1993;4:87–93.
- Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005;23:137–144. doi: 10.1038/nbt1053.
- Touzet H, Perriquet O. CARNAC: Folding families of related RNAs. Nucleic Acids Res. 2004;32:W142-145. doi: 10.1093/nar/gkh415.
- Vlasova IA, Tahoe NM, Fan D, Larsson O, Rattenbacher B, SternJohn JR, Vasdewani J, Karypis G, Reilly C, Bitterman PB, et al. Conserved GU-rich elements mediate mRNA decay by binding to CUG-Binding Protein 1. Mol Cell. 2008;29:263–270. doi: 10.1016/j.molcel.2007.11.024.
- Wang T, Stormo GD. Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics. 2003;19:2369–2380. doi: 10.1093/bioinformatics/btg329.
- Washietl S, Hofacker IL, Stadler PF. Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci. 2005;102:2454–2459. doi: 10.1073/pnas.0409169102.
- Xu X, Ji Y, Stormo GD. RNA Sampler: A new sampling based algorithm for common RNA secondary structure prediction and structural alignment. Bioinformatics. 2007;23:1883–1891. doi: 10.1093/bioinformatics/btm272.
- Yao Z, Weinberg Z, Ruzzo WL. CMfinder: A covariance model based RNA motif finding algorithm. Bioinformatics. 2006;22:445–452. doi: 10.1093/bioinformatics/btk008.










