Proceedings of the National Academy of Sciences of the United States of America. 2005 Feb 9;102(7):2269–2270. doi: 10.1073/pnas.0500129102

Tracking down noncoding RNAs

Vincent Moulton
PMCID: PMC549017  PMID: 15703286

Until relatively recently, RNA has taken a predominantly backstage role compared to protein in genome studies. However, this is changing dramatically with the discovery of a plethora of RNAs that do not act as messenger (mRNA), transfer (tRNA), or ribosomal (rRNA) RNAs (1–3). These noncoding RNAs (ncRNAs) play a role in a variety of processes such as transcriptional regulation, chromosome replication, RNA processing and modification, and protein degradation and translocation. Even so, ncRNAs usually lack the statistical signals in their primary sequence (like ORFs and codon bias) that have been used to such great effect in the identification of novel protein-encoding genes. This makes the task of systematically identifying new ncRNAs in genomes one of the most exciting current challenges in computational biology. The work of Washietl et al. in this issue of PNAS (4) faces this challenge head on. Through an elegant use of structural properties of RNA, the authors present an efficient comparative genomics approach to identifying novel ncRNAs and related genomic elements that promises to contribute significantly to the burgeoning field of computational RNomics.

Predicting RNA Structure

As with other computational approaches to identifying ncRNAs, the method of Washietl et al. (4) relies on structural properties of RNA. Unlike double-stranded DNA, an RNA molecule consists of a single-stranded chain, or sequence, of nucleotides. As a consequence, parts of the molecule can base-pair with other complementary parts of the molecule, so that the nucleotide sequence plays a vital role in how the molecule folds. For this reason, it is possible to develop computational methods for predicting structural properties of an RNA molecule based on knowledge of its primary sequence.

As with proteins, the problem of predicting the three-dimensional structure of an RNA molecule directly from its primary sequence is still beyond current computational methods. However, the three-dimensional structure of an RNA molecule often builds on a simpler scaffold known as its secondary structure. This structure consists essentially of nested base-pairings, which makes it well suited to computational prediction. Moreover, secondary structure is commonly preserved under evolution (even when primary sequence is not), suggesting relevance to RNA function.

One of the first efficient algorithms for predicting secondary structure for an RNA sequence used dynamic programming to compute a maximum set of nested base-pairings (5). A more sophisticated extension of this algorithm soon followed (6), which incorporated more detailed secondary structure information. Basically, it used thermodynamic considerations to compute a secondary structure with minimum free energy for an RNA sequence. Although the method has been substantially developed since its introduction, and even greatly extended for the prediction of probably more realistic ensembles of secondary structures (7, 8), the underlying algorithm still lies in essence at the heart of many present day RNA secondary structure prediction tools. However, such tools use primary sequence alone, so they tend not to perform as well as one might hope, commonly predicting only 50–70% of base pairs correctly on average (9).
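
The dynamic programming idea behind the first of these algorithms can be illustrated with a short sketch. It only counts the maximum number of nested base pairs and is not taken from ref. 5; the function names, the inclusion of G-U wobble pairs, and the `min_loop` hairpin constraint are assumptions made for illustration.

```python
# Illustrative sketch only: maximize the number of nested base pairs.
# Allowed pairs (Watson-Crick plus G-U wobble) are an assumption here.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    """Return True if nucleotides a and b may form a base pair."""
    return (a, b) in PAIRS


def max_nested_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq.

    min_loop is the minimum number of unpaired bases enclosed by a pair
    (a hypothetical parameter modelling the hairpin loop constraint).
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            # case 2: position j pairs with some k, splitting the problem
            # into the part left of k and the part enclosed by (k, j)
            for k in range(i, j - min_loop):
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]
```

For example, `max_nested_pairs("GGGAAAUCCC")` evaluates to 3, corresponding to a single stem of three G-C pairs closing a hairpin loop. Energy-based methods such as that of ref. 6 keep the same recursive decomposition but replace the pair count with thermodynamic energy terms.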

Comparative Sequence Analysis

Because secondary structure is often preserved between homologous RNAs, comparative sequence analysis can provide a powerful alternative for its prediction. One of the earliest methods based on comparative analysis used mutual information to detect covarying columns in an alignment of RNA sequences (10). Related, but much more sophisticated, covariance models (11), the RNA analogue of hidden Markov models, were subsequently developed and successfully used in genomic searches for ncRNAs and are now available as part of the recently established Rfam database for RNA families (12).
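
As a rough illustration of how covarying columns can be detected, the sketch below computes the mutual information between two alignment columns. It is a minimal example, not the procedure of ref. 10: the function name is hypothetical, and gap characters are simply treated as ordinary symbols.

```python
from collections import Counter
from math import log2


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j.

    alignment is a list of equal-length aligned sequences (strings).
    Gaps are counted like any other symbol, which is a simplification.
    """
    pairs = [(seq[i], seq[j]) for seq in alignment]
    n = len(pairs)
    freq_i = Counter(a for a, _ in pairs)   # symbol counts in column i
    freq_j = Counter(b for _, b in pairs)   # symbol counts in column j
    freq_ij = Counter(pairs)                # joint counts
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Columns 0 and 2 vary together while always remaining complementary.
toy_alignment = ["GAC", "CAG", "AAU", "UAA"]
```

In this toy alignment, `column_mutual_information(toy_alignment, 0, 2)` is 2 bits, the maximum for a four-letter alphabet, whereas columns 0 and 1 share no information because column 1 is invariant. High mutual information between two columns is the kind of signal that suggests a conserved base pair.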

Covariance models are family-specific and, as such, do not provide a generic tool for finding novel ncRNAs. However, the preservation of RNA secondary structure in an alignment naturally suggests a comparative genomics approach to finding ncRNAs: form alignments between conserved subsequences of genomes and then, by using secondary structure detection approaches, try to decide which of these are alignments of ncRNAs. One of the first programs to employ this strategy was qrna (13), which used probabilistic models to search for covariation in pairwise alignments and has been used to identify novel ncRNAs in bacteria and yeast. More recent methods include ddbrna (14) and msari (15), which look for statistically significant covariation in multiple sequence alignments.

Picking Up the Signal

The method of Washietl et al. (4) employs a similar strategy. Le et al. (16) proposed that ncRNAs are more thermodynamically stable than is expected by chance. There has been much debate over this hypothesis, and the current consensus is that it does not hold in general. Even so, recent findings indicate that certain families of ncRNAs are, in fact, more stable than is expected by chance (most notably microRNA precursors; ref. 17), and Washietl et al. demonstrate that stability can, at the very least, be used as a diagnostic feature for detecting ncRNAs.

In particular, they associate two scores with an alignment: the z score, a measure of thermodynamic stability, and the structure conservation index (SCI), a measure of evolutionary conservation. The z score is quite well known in the RNA computational biology community. However, the SCI is new. It is computed by comparing the minimum free energies of the sequences in an alignment with a “consensus energy,” which is computed by incorporating covariation terms into a free energy minimization computation (18). Subsequently, a support vector machine is used to classify alignments as “functional” or “other” in the SCI/z score plane. This approach has the advantage of not requiring costly sampling of shuffled sequences or alignments, and the results obtained on benchmark data sets indicate that it has high sensitivity and specificity.
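
To make the two scores concrete, the following sketch shows how they might be computed once the relevant energies are available. It rests on assumptions rather than on the exact definitions in ref. 4: the z score is written in its standard form relative to a background mean and standard deviation (however these are obtained), and the SCI is expressed as the ratio of the consensus energy to the mean of the individual minimum free energies, one natural way to make the comparison described above. All function and parameter names are hypothetical.

```python
from statistics import mean


def z_score(mfe, background_mean, background_sd):
    """Standard deviations by which mfe deviates from the background.

    Negative values indicate a structure more stable than expected
    (free energies are negative, so lower means more stable).
    """
    if background_sd <= 0:
        raise ValueError("background_sd must be positive")
    return (mfe - background_mean) / background_sd


def structure_conservation_index(consensus_energy, single_mfes):
    """Assumed form of the SCI: consensus energy divided by the mean
    minimum free energy of the individual sequences in the alignment."""
    avg = mean(single_mfes)
    if avg == 0:
        raise ValueError("mean single-sequence energy is zero")
    return consensus_energy / avg


# Made-up energies in kcal/mol, for illustration only.
example_z = z_score(-30.0, background_mean=-20.0, background_sd=5.0)
example_sci = structure_conservation_index(-18.0, [-20.0, -22.0, -18.0])
```

With these made-up numbers, `example_z` is -2.0 and `example_sci` is 0.9. Under this assumed form, an SCI near 1 means the sequences fold into a common structure almost as favorably as they fold individually, while a value near 0 means no shared structure was found. Each alignment then becomes a point in the SCI/z score plane, which is the input to the classifier.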

A Bright Future

Given the wealth of genomic data that is becoming available and new methods for generating high quality alignments (19), we can soon expect more answers to the question presented in ref. 2: “How many ncRNAs are encoded by the genome?” Even so, we are still faced with tasks such as identifying ncRNAs with little or no conserved secondary structure and elucidating function of newly discovered ncRNAs. Computational approaches will almost certainly play a key role in shedding light on these problems. Thus, in view of the remarkable new discoveries being made concerning the cellular function of ncRNAs, we can expect RNA computational biology to become an increasingly important field in the next few years.

See companion article on page 2454.
