Abstract
Emerging high-throughput technologies have led to a deluge of putative non-coding RNA (ncRNA) sequences identified in a wide variety of organisms. Systematic characterization of these transcripts will be a tremendous challenge. Homology detection is critical to making maximal use of functional information gathered about ncRNAs: identifying homologous sequence allows us to transfer information gathered in one organism to another quickly and with a high degree of confidence. ncRNA presents a challenge for homology detection, as the primary sequence is often poorly conserved and de novo secondary structure prediction and search remains difficult. This protocol introduces methods developed by the Rfam database for identifying “families” of homologous ncRNAs starting from single “seed” sequences using manually curated sequence alignments to build powerful statistical models of sequence and structure conservation known as covariance models (CMs), implemented in the Infernal software package. We provide a step-by-step iterative protocol for identifying ncRNA homologs, then constructing an alignment and corresponding CM. We also work through an example for the bacterial small RNA MicA, discovering a previously unreported family of divergent MicA homologs in genus Xenorhabdus in the process.
Keywords: RNA, Homology, Covariance model
Introduction
Over the past twenty years, the development and commodification of high-throughput technologies for reading DNA has led to dramatic changes in the way biology is done. The development of reliable and inexpensive transcriptomic technologies, in particular, has led to major changes to our understanding of the role RNA plays in both bacterial and eukaryotic systems (Barquist and Vogel, 2015; Rinn and Chang, 2012). While exceptions to the central dogma have been known since the discovery of tRNA in the late 1950s, it was only with the development of genome-scale sequencing technologies that the pervasive nature of regulation mediated by non-coding RNA (ncRNA) has become clear. This has led to a diverse menagerie of RNA classes, ranging from the microRNAs (miRNAs) and bacterial small RNAs (sRNAs) that operate to modulate gene expression through RNA:RNA antisense interactions, to small nucleolar RNAs (snoRNAs) that guide RNA modification enzymes, to the riboswitches capable of sensing metabolites with so-called RNA aptamers, to the diverse long non-coding RNAs (lncRNAs) suspected to blanket eukaryotic genomes. Characterizing a single ncRNA is a challenge in itself; to maximize the utility of this work, it is often desirable to transfer functional annotations between organisms. For instance, one might identify an ncRNA associated with disease in humans, but want to study it in a system where technical and (more importantly) ethical considerations allow for genetic manipulation, such as mice. Similarly in bacteria, molecular tools are often best developed in model strains such as E. coli, making them attractive platforms for characterizing ncRNAs, which can then be inferred to function similarly in related bacteria. However, this inference depends crucially on computational tools capable of transferring these hard-won functional annotations through homology prediction.
This problem of homology prediction is usually approached as the problem of finding sequences which align well to our source sequence, on the reasonable assumption that similar sequences are more likely to be evolutionarily and functionally related. A wide range of critical applications in genomics rely on our ability to produce “good” alignments. Single-sequence homology search as implemented in tools such as BLAST (Altschul et al., 1990) is an (often heuristic) application of alignment. The sensitivity and specificity of homology search can be improved by the use of evolutionary information in the form of accurate substitution and insertion-deletion (indel) rates derived from multiple sequence alignments (MSAs), captured in the statistical models used by HMMER (Finn et al., 2011, 2015; Eddy, 2011) and Infernal (Nawrocki et al., 2009; Nawrocki and Eddy, 2013) for protein and RNA alignments respectively. These models can be thought of as defining “families” of homologous sequences, as in the Pfam and Rfam databases (Finn et al., 2014; Nawrocki et al., 2015). By using these models to classify sequences, we can infer functional and structural properties of uncharacterized sequences.
Unfortunately, producing the high-quality “seed” alignments of RNA these methods require remains difficult. While proteins can be aligned accurately using only primary sequence information with pairwise sequence identities as low as 20% for an average-length sequence (Rost, 1999; Thompson et al., 1999), it appears that the “twilight zone” where blatantly erroneous alignments occur between RNA sequences may begin at above 60% identity (Gardner et al., 2005; Lindgreen et al., 2014). The inclusion of secondary structure information can improve alignment accuracy (Freyhult et al., 2007), but predicting secondary structure is not trivial (Gardner and Giegerich, 2004; Puton et al., 2014). An instructive example of the difficulties this can lead to is the case of the 6S gene, a bacterial ncRNA which modulates σ70 activity during the shift from exponential to stationary growth. The Escherichia coli 6S sequence was determined in 1971 (Brownlee, 1971) and its function determined in 2000 (Wassarman and Storz, 2000). However, the extent of this gene's phylogenetic distribution was not realized until 2005, when Barrick and colleagues carefully constructed an alignment from a number of deeply diverged putative 6S sequences, and through successive secondary-structure aware homology searches demonstrated its presence across large swaths of the bacterial phylogeny (Barrick et al., 2005). Even now, new homologs are discovered on a regular basis (Sharma et al., 2010; Weinberg et al., 2010; Wehner et al., 2014), and 6S appears to be an ancient and important component of the bacterial regulatory machinery. Similar examples of the power of enhanced homology search considering both sequence and structural information can be found for other classes of ncRNAs, such as riboswitches (Barrick and Breaker, 2007), ribosomal leader sequences (Fu et al., 2013), ribozymes (Weinberg et al., 2015), and snoRNAs (Gardner et al., 2010).
In this unit, we hope to make these techniques accessible to sequence analysis novices. We introduce the techniques necessary to construct a high-quality RNA alignment from a single seed sequence, and then use the information contained in this alignment to identify additional, more distant homologs, expanding the alignment in an iterative fashion. These methods, while time-consuming, can be far more sensitive than a BLAST search (Menzel et al., 2009). We present a brief protocol which starts with a single sequence, and then uses a collection of web and command-line based tools for alignment, structure prediction, and search to construct an Infernal covariance model (CM), a probabilistic model which captures many important features of structured RNA sequence variation (Nawrocki and Eddy, 2013). These models may then be used in the iterative expansion of alignments or for homology search and genome annotation. CMs are also used by the Rfam database in defining RNA sequence families, and are the subject of a dedicated RNA families track at the journal RNA Biology (Gardner and Bateman, 2009). We include as an instructive example the construction of an RNA family for the enterobacterial small RNA MicA, used as the basis for the current Rfam model, discovering a convincing divergent clade of homologs in the process.
Strategic Planning
A large variety of tools exist for RNA sequence analysis. Given the diversity of ncRNA sequence, structure, and function, it is likely no one set of tools will work optimally for every ncRNA. Here we review a number of alternative methods and tools that can be easily substituted at various stages of our protocol.
Single Sequence Search
We rely on NCBI BLAST (Altschul et al., 1990) to quickly identify close homologs of RNA sequences in this protocol, though other methods can be substituted (Table 1). NCBI and EMBL-EBI both maintain servers (Boratyn et al., 2013; McWilliam et al., 2013) with slightly different interfaces, though there are no substantive differences in the implementations. We use the NCBI server here. EBI also maintains servers for a number of BLAST and FASTA derivatives, which may be helpful. Both sites also allow users to BLAST against databases of expressed sequences including GEO at NCBI, and high throughput cDNA and transcriptome shotgun assembly databases at EMBL-EBI (Barrett et al., 2013; Silvester et al., 2015). Such searches can be helpful for gathering comparative expression data for your ncRNA.
Table 1. Resources for single sequence homology search.
Resource | Reference | URL |
---|---|---|
NCBI-BLAST | (Johnson et al., 2008) | http://blast.ncbi.nlm.nih.gov/Blast.cgi |
EMBL-EBI NCBI-BLAST | (McWilliam et al., 2013) | http://www.ebi.ac.uk/Tools/sss/ncbiblast/ |
EMBL-EBI Sequence Search | (McWilliam et al., 2013) | http://www.ebi.ac.uk/Tools/sss/ |
HMMER3 | (Finn et al., 2015, 2011) | http://www.ebi.ac.uk/Tools/hmmer/ |
RNAcentral nhmmer search | (RNAcentral Consortium, 2015; Bateman et al., 2011) | http://rnacentral.org/sequence-search/ |
A nucleotide version of the HMMER3 package (Eddy, 2011) for profile-based sequence search provides both increased sensitivity and specificity over BLAST at little additional computational cost (Wheeler and Eddy, 2013). We hope that a web server similar to the one currently available for protein sequences (Finn et al., 2015) will be forthcoming. In the meantime, RNAcentral (RNAcentral Consortium, 2015; Bateman et al., 2011) offers an nhmmer-based search facility, although it is limited to searching known ncRNA sequences. If it is possible that homologous sequences are spliced (e.g. introns in the U3 snoRNA (Myslinski et al., 1990)), then a splice-site aware search method such as BLAT (Kent, 2002) or GenomeWise (Birney et al., 2004) may be useful, but we are not aware of any publicly available webservers for these.
Alignment and Secondary Structure Prediction Tools
We find it best to run a variety of alignment and secondary structure prediction tools simultaneously (see Table 2). Each has its own peculiarities, and our hope is that by looking for shared homology and secondary structure predictions we can mitigate some of the problems discussed in the introduction. In this protocol, we use the WAR webserver (Torarinsson and Lindgreen, 2008) which allows the user to run 14 different methods simultaneously. These include Sankoff-type methods: FoldalignM (Torarinsson et al., 2007), LocARNA (Will et al., 2007), MXSCARNA (Tabei et al., 2008), Murlet (Kiryu et al., 2007), and StrAL (Dalli et al., 2006) with PETcofold (Seemann et al., 2011); Align-then-fold methods, which use a traditional alignment tool (ClustalW (Larkin et al., 2007) or MAFFT (Katoh and Standley, 2013)) followed by structure prediction (RNAalifold (Bernhart et al., 2008) or Pfold (Knudsen and Hein, 2003)); Fold-then-align methods, which predict structures in all the input sequences and attempt to align these structures (RNAcast (Reeder and Giegerich, 2005) + RNAforester (Höchsmann et al., 2003)); Sampling methods which attempt to iteratively refine alignment and structure: MASTR (Lindgreen et al., 2007) and RNASampler (Xu et al., 2007); and other methods which do not fit into the above traditional categories: CMfinder (Yao et al., 2006) and LaRA (Bauer et al., 2007). Finally, WAR also computes a maximum consistency alignment using all the alignment predictions with T-Coffee (Notredame et al., 2000).
Table 2. Resources for RNA sequence alignment.
Resource | Reference | URL |
---|---|---|
Webserver for Aligning structural RNAs (WAR) | (Torarinsson and Lindgreen, 2008) | http://genome.ku.dk/resources/war/ |
Vienna RNA | (Gruber et al., 2008b) | http://rna.tbi.univie.ac.at/ |
Freiburg RNA Tools | (Smith et al., 2010) | http://rna.informatik.uni-freiburg.de |
CBRC Functional RNA Project | (Asai et al., 2008) | http://software.ncRNA.org |
RTH Resources | N/A | http://rth.dk/pages/resources.php |
EMBL-EBI Alignment Tools | (McWilliam et al., 2013) | http://www.ebi.ac.uk/Tools/msa/ |
However, WAR is by no means exhaustive, and the methods may not be the most recent versions available. A number of groups maintain their own servers for RNA sequence analysis. Notable servers include the Vienna RNA WebServers (Gruber et al., 2008b), the Freiburg RNA Tools (Smith et al., 2010), the CBRC Functional RNA Project (Asai et al., 2008; Mituyama et al., 2009), and the Center for Non-Coding RNA in Technology and Health (RTH) Resources page. In addition, EMBL-EBI maintains a number of webservers for popular multiple sequence alignment tools (McWilliam et al., 2013). Ultimately, as you become more comfortable with RNA sequence analysis you may want to begin installing and running new tools on a local *NIX machine; however, this is beyond the scope of the current protocol.
Genome Browsers
Genome browsers are essential for checking the context of putative homologs, and several are available online that offer access to genome annotations (see Table 3). The ENA (Silvester et al., 2015) provides a no-frills sequence browser perfect for quickly checking annotations. For deeper annotations, the UCSC genome browser (Karolchik et al., 2014) and Ensembl (Flicek et al., 2014) both contain a wide range of information for the organisms they cover. For bacterial and archaeal genomes, the Lowe lab maintains a modified version of the UCSC genome browser (Chan et al., 2012; Schneider et al., 2006) which provides a number of tracks of particular interest to those working with ncRNA. The CBRC Functional RNA Project maintains a UCSC genome browser mirror (Mituyama et al., 2009) for a number of eukaryotic organisms with a larger number of ncRNA-related tracks. Finally, the ENCODE Project (ENCODE Project Consortium, 2011) provides access to a large number of functional assays in a number of eukaryotic organisms, many of which have relevance to ncRNA. Offline genome browsers are also available for use on your own computer, for example IGV and Artemis (Thorvaldsdóttir et al., 2013; Carver et al., 2012). There are also methods available for comparing synteny (gene order) information between genomes (Alikhan et al., 2011; Sullivan et al., 2011; Carver et al., 2005).
Table 3. Web-based resources for genome annotations.
Resource | Reference | URL |
---|---|---|
European Nucleotide Archive | (Silvester et al., 2015) | http://www.ebi.ac.uk/ena |
UCSC Genome Browser | (Karolchik et al., 2014) | http://genome.ucsc.edu/ |
Ensembl | (Flicek et al., 2014) | http://www.ensembl.org |
UCSC Microbial Genome Browser | (Chan et al., 2012; Schneider et al., 2006) | http://microbes.ucsc.edu/ |
CBRC UCSC Genome Browser for Functional RNA | (Mituyama et al., 2009) | http://www.ncrna.org/glocal/cgi-bin/hgGateway |
ENCODE Project | (ENCODE Project Consortium, 2011) | https://www.encodeproject.org/ |
Alignment Editors
It is possible to edit alignments in any text editor; however we highly recommend using a secondary structure-aware editor such as Emacs with the RALEE major mode (Griffiths-Jones, 2005). RALEE allows you to color bases according to base identity, secondary structure, or base conservation. It also allows the easy manipulation of sequences which are involved in structural interactions but are not close in sequence space through the use of split screens. A number of other specialized RNA editors are available (Table 4): BoulderALE (Stombaugh et al., 2011) and S2S (Jossinet and Westhof, 2005) both allow the end user to visualize and manipulate tertiary structure in addition to secondary structure, and may be particularly useful if crystallographic information is available for your RNA. Other alternatives for editing RNA secondary structure are SARSE (Andersen et al., 2007) and MultiSeq (Roberts et al., 2006). Recent versions of JalView (Waterhouse et al., 2009) have begun to support RNA secondary structure as well. Finally, a recent attempt has been made to “gamify” RNA alignment editing. This approach, called Ribo, samples poorly resolved regions of alignments, and feeds these regions to gamers who manually refine the alignments. These re-aligned regions can then be reinserted into the original alignment (Waldispühl et al., 2015).
Table 4. RNA alignment editors.
Resource | Reference | URL |
---|---|---|
BoulderALE | (Stombaugh et al., 2011) | http://boulderale.sourceforge.net/ |
JalView | (Waterhouse et al., 2009) | http://www.jalview.org |
MultiSeq | (Roberts et al., 2006) | http://www.scs.illinois.edu/schulten/multiseq/ |
RALEE | (Griffiths-Jones, 2005) | http://sgjlab.org/ralee/ |
Ribo | (Waldispühl et al., 2015) | http://ribo.cs.mcgill.ca/ |
S2S | (Jossinet and Westhof, 2005) | http://bioinformatics.org/S2S/ |
SARSE | (Andersen et al., 2007) | http://sarse.ku.dk/ |
Infernal
The centerpiece of our protocol is the Infernal package for constructing covariance models (CMs) from RNA multiple alignments (Nawrocki and Eddy, 2013). We will use this to construct models of our RNA family. CMs model the conservation of positions in an alignment similar to a hidden Markov model (HMM), while also capturing covariation in structured regions (Eddy and Durbin, 1994; Sakakibara et al., 1994; Durbin et al., 1998). Covariation is the process whereby a mutation of a single base in a hairpin structure will lead to selection in subsequent generations for compensatory mutations of its structural partner in order to preserve canonical base-pairing, i.e.: Watson-Crick plus G-U pairs, and a functional structure. This combination of structural-evolutionary information has been shown to provide the most sensitive and specific homology search for RNA of any tools currently available (Freyhult et al., 2007; Gardner, 2009). Unfortunately, this sensitivity and specificity come at a high computational cost, and Infernal searches can be time-consuming with genome-scale searches often taking hours on desktop computers. The development of heuristics to reduce this computational cost is an area of active research for the Infernal team, and has already been mitigated to some extent by the use of HMM filters and query-dependent banding of alignment matrices (Nawrocki and Eddy, 2007, 2013). We refer the reader to Eric Nawrocki's primer on annotating functional RNAs in genomic sequence for a friendly introduction to the mechanics of the Infernal package (Nawrocki, 2014).
Resource | Reference | URL |
---|---|---|
Infernal | (Nawrocki et al., 2009; Nawrocki and Eddy, 2013) | http://eddylab.org/software/infernal/ |
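To make the notion of covariation concrete, the following minimal Python sketch tallies the base pairs observed at each paired column of a small alignment and flags column pairs where two canonical pairs differ at both positions (e.g. G-C in one sequence, A-U in another). The toy alignment, the function names, and the "differs at both positions" rule are our own illustrative assumptions; this is not how Infernal scores covariation, only a picture of the signal a CM can exploit.

```python
from collections import Counter

# Watson-Crick plus G-U wobble pairs, as described in the text.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def paired_columns(structure):
    """Return sorted (i, j) column pairs from a bracket-notation structure."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol in "(<[{":
            stack.append(j)
        elif symbol in ")>]}":
            pairs.append((stack.pop(), j))
    return sorted(pairs)


def covariation_report(alignment, structure):
    """For each paired column pair, tally observed pairs and flag covariation.

    A column pair is flagged when at least two canonical pairs are seen that
    differ at BOTH positions (an illustrative rule, not Infernal's model).
    """
    rna = [seq.upper().replace("T", "U") for seq in alignment]
    report = []
    for i, j in paired_columns(structure):
        counts = Counter(seq[i] + seq[j] for seq in rna)
        canonical = [pair for pair in counts if pair in CANONICAL]
        covarying = any(a[0] != b[0] and a[1] != b[1]
                        for a in canonical for b in canonical)
        report.append((i, j, dict(counts), covarying))
    return report


# Hypothetical toy hairpin alignment (made up for illustration only).
TOY_STRUCTURE = "((((....))))"
TOY_ALIGNMENT = [
    "GGCAGAAAUGCC",
    "GACAGAAAUGUC",
    "GGUAGAAAUACC",
    "GGCGGAAACGCC",
]

if __name__ == "__main__":
    for i, j, counts, covarying in covariation_report(TOY_ALIGNMENT, TOY_STRUCTURE):
        print(f"columns {i}-{j}: {counts} covarying={covarying}")
```

In the toy hairpin, the outermost pair is invariant and so carries no covariation signal, while the three inner pairs each show compensatory changes that preserve base-pairing.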
Choosing the right protocols
We assume for the sake of this unit that you are starting with a single sequence of interest. We will illustrate our protocol using the example of MicA, an Hfq-dependent bacterial trans-acting antisense small RNA (sRNA). Many bacterial sRNAs are similar in function to eukaryotic microRNAs, pairing to target mRNA transcripts through a short antisense-binding region, generally targeting the transcript for degradation (Barquist and Vogel, 2015). MicA is known to target a wide range of outer membrane protein mRNAs using a 5′ binding-region in both E. coli (Gogol et al., 2011) and S. enterica (Vogel, 2009) in response to membrane stress. The previous covariance model for MicA (accession RF00078) in Rfam (release 10.1) was largely restricted to E. coli, S. enterica, and Y. pestis. Here, as an example, we produce an improved model, which has served as the core for the current Rfam model (release 12.0). In the process, we discover previously unreported MicA homologs in the nematode symbionts of the γ-proteobacterial genus Xenorhabdus.
If you already have a set of putative homologs, you may wish to further diversify your collection of sequences using the methods described in protocol 1, or you may skip directly to protocol 2, or 3 if a secondary structure is known. No matter how many sequences you are starting with, it is always a good idea to run the sequence search tools available on the Rfam website (http://rfam.xfam.org/) on them. This will verify that there isn't already a CM available that covers your sequences. There are a number of other specialist databases that may also be worth searching if you have reason to believe your RNA sequence is a member of a well-defined class of RNAs, e.g. microRNAs, snoRNAs, rRNAs, or tRNAs. We have recently reviewed these specialist RNA databases (Hoeppner et al., 2014). A centralized RNA sequence database aiming to capture all known RNA sequences, RNAcentral (Bateman et al., 2011; RNAcentral Consortium, 2015), has recently launched and provides an integrated resource for easily identifying similar sequences with some evidence of transcription via an integrated nhmmer search.
Basic Protocol 1: Gathering an initial set of homologous sequences
Now that you have confirmed that your sequence is novel using Rfam, RNAcentral, or appropriate specialist databases, we will use NCBI-BLAST to identify additional homologous sequences. Once you have navigated to the nucleotide BLAST server (http://blast.ncbi.nlm.nih.gov/), there are a number of important options to set.
Necessary Resources:
Computer with a modern web browser (e.g. Firefox, Chrome, Internet Explorer)
Basic text editor
Setting NCBI-BLAST Parameters
1. First, it is important to choose a search set appropriate to your sequence. At this initial phase, we want to limit our exposure to sequences which are very distant from ours to limit the number of obviously spurious alignments we will need to examine, increasing the power of our search. So, if your initial sequence is of human origin, you may want to limit your search to Mammalia, Tetrapoda, or Vertebrata depending on sequence conservation. Similarly, if you are working with an Escherichia coli sequence, you may want to limit your initial searches to Enterobacteriaceae or the Gammaproteobacteria. NCBI-BLAST searches are relatively fast, so try several search sets to get a feel for how conserved your sequence is.
2. The second set of options to set are the “Program Selection” and the “Algorithm Parameters”. We recommend blastn as it allows for smaller word sizes. The word size describes the minimum length of an initial perfect match needed to trigger an alignment between our query sequence and a target. Smaller word sizes provide greater sensitivity, and seem to perform better for non-coding RNAs. We recommend a word size of 7, the smallest the NCBI-BLAST server allows.
3. Finally, you should set the “Max Target Sequences” parameter to at least 1000. NCBI-BLAST returns hits in a list ranked from best match to worst by E-value (the number of matches of the same quality expected to be found by chance in a search over a database of this size), and will only display as many as “Max Target Sequences” is set to. We are primarily interested in matches on the edge of what NCBI-BLAST is capable of detecting reliably, and these will naturally fall towards the end of this list.
4. Our example sequence, MicA, is from E. coli, so we will limit our initial searches to Enterobacteriaceae. This sequence, obtained from Gisela Storz’s collection of known E. coli sRNAs (http://cbmp.nichd.nih.gov/segr/ecoli_rnas.html) is:
GAAAGACGCGCATTTGTTATCATCATCCCTGAATTCAGAGATGAAATTTTGGCCACTCACGAGTGGCCTTTTT
Paste this sequence into the query sequence box at the top of the page. Limit the space of your search by entering “Enterobacteriaceae” (or equivalently “taxid:91347”) in the “Organism” box. Under “Program Selection”, click the radio button to select blastn search. Access additional options by clicking the plus sign next to “Algorithm parameters”. Under “General Parameters” set “Max target sequences” to 1000 and “Word size” to 7. Click the “BLAST” button to run the search.
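The effect of the word size described in step 2 can be illustrated with a small Python sketch that counts the exact words of a given length shared between a query and a target; every shared word is a potential starting point for an alignment. This is a deliberate simplification of BLAST's seeding stage and does not reproduce the NCBI implementation. The function names are ours, the larger word sizes are included only for comparison, and the target is the Erwinia pyrifoliae sequence (FP236842) from the example set collected later in this protocol.

```python
def words(sequence, word_size):
    """Return every distinct substring of length word_size in sequence."""
    return {sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)}


def shared_seed_words(query, target, word_size):
    """Exact words shared by query and target; each could seed an alignment."""
    return words(query, word_size) & words(target, word_size)


# E. coli MicA (the query) and the Erwinia pyrifoliae hit (FP236842).
ECOLI_MICA = ("GAAAGACGCGCATTTGTTATCATCATCCCTGAATTCAGAGATGAAATTTTGG"
              "CCACTCACGAGTGGCCTTTTT")
ERWINIA_MICA = ("GAAAGACGCGTATTTGTTATCATCATCTCATCCCTGACAACAGAGATGTTAAT"
                "TTAGGCCACAGTGACGTGGCCTTTTT")

if __name__ == "__main__":
    # Smaller words give more potential seeds, hence greater sensitivity.
    for word_size in (7, 11, 28):
        n = len(shared_seed_words(ECOLI_MICA, ERWINIA_MICA, word_size))
        print(f"word size {word_size}: {n} shared exact words")
```

Running the loop on your own query and a known divergent homolog gives a quick feel for how quickly potential seeds disappear as the word size grows.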
Selecting Sequences
Our goal at this stage is to pick a representative set of homologous sequences with which to “seed” our alignment. As discussed in the introduction, single sequence alignment for nucleotides is generally only reliable to around 60 percent pairwise sequence identity. At the same time, picking a large number of sequences with high percent identity can lead to overfitting of the secondary structure; that is, if our sequences are too similar we can end up predicting alignments and secondary structures which capture accidental features of a narrow clade, rather than conserved structure and sequence variation.
5. We pick additional sequences based on three major criteria, in rough order of importance: percent sequence identity, taxonomy, and sequence coverage. Handily, the NCBI-BLAST output displays measures of all of these (Figure 1). Our first selection criterion, percent identity, should fall between 65% and 95%: much lower and the sequence will be difficult to align; much higher and it will be too similar to contribute any meaningful variation.
6. The second selection criterion, taxonomy, will depend somewhat on the organisms your sequence is associated with, but we generally want to limit the inclusion to a single (orthologous) instance per species. The exception to this rule is for diverged paralogous sequences within the species; if paralogs exist, you will need to decide how broadly you wish to define your family, though it can be difficult to construct families that capture the full phylogenetic range of an element while retaining the ability to discriminate between paralogs within single genomes. Additionally, it may be useful to further limit the maximum percent identity to, say, 90% within a densely sequenced genus to further limit the number of highly similar sequences in your initial alignment.
7. Finally, assuming that you are sure of your sequence boundaries, we want to select sequences that cover the entire starting sequence. If you see many matches covering only a short section of your sequence, this may be due to the matching of a short convergent motif. This most commonly happens with the relatively long, highly-constrained bacterial rho-independent terminators (Gardner et al., 2011), but may occur with other motifs (Gardner and Eldai, 2015). Alternatively, if you do not have well-defined sequence boundaries, you will need to determine these from the conservation you see in your BLAST hits - look for taxonomically diverse hits covering the same segment of your query sequence. In some cases, such as the long non-coding RNAs, conserved domains may be much shorter than the complete transcribed sequence (Burge et al., 2013), but stay aware of the potential motif issue. A taxonomic distribution of sequences that makes biological sense given your knowledge of the molecule's function and that can be explained by direct inheritance of the sequence will be your best guide.
8. Continuing our MicA example, we want to select a group of sequences with a reasonably diverse taxonomic range and as much sequence diversity as possible, while being reasonably confident that they are true homologs. In this case we will choose based on maximizing genus diversity, a percent identity between 75% and 90%, and 100% sequence coverage, as we're fairly confident in the MicA gene boundaries. You can retrieve individual sequences by clicking “Download” on the alignment, then selecting the “FASTA (aligned sequence)” option. Note that you may have to reverse complement sequences (for instance using this webserver: http://www.bioinformatics.org/sms/rev_comp.html), as by default NCBI-BLAST returns sequences from the forward strand of the genomic sequence. For our example alignment, in addition to our original sequence from E. coli (EMBL-Bank accession: U00096) we have chosen sequences from Salmonella Typhimurium (FQ312003), Klebsiella pneumoniae (CP002910), Enterobacter cloacae (CP002272), Yersinia pestis (AM286415), Pantoea sp. At-9b (CP002433), and Erwinia pyrifoliae (FP236842). Collect these in FASTA format as below:
>U00096.3
GAAAGACGCGCATTTGTTATCATCATCCCTGAATTCAGAGATGAAATTTTGGCCACTCACGAGTGGCCTTTTT
>FQ312003
GAAAGACGCGCATTTGTTATCATCATCCCTGTTTTCAGCGATGAAATTTTGGCCACTCCGTGAGTGGCCTTTTT
>CP002272
GAAAGACGCGCATTTGTTATCATCATCCCTGACTTCAGAGATGAAATGTTTGGCCACAGTGATGTGGCCTTTTT
>CP002910
GAAAGACGCGCATTTATTATCATCATCATCCCTGAATCAGAGATGAAAGTTTGGCCACAGTGATGTGGCCTTTTT
>AM286415
GAAAGACGCGCATTTGTTATCATCATCCCTGTTATCAGAGATGTTAATTTGGCCACAGCAATGTGGCCTTTT
>CP002433
GAAAGACGCGCATTTGTTATCATCATCCCTGACAACAGAGATGTTAATTCGGCCACAGTGATGTGGCCTTTT
>FP236842
GAAAGACGCGTATTTGTTATCATCATCTCATCCCTGACAACAGAGATGTTAATTTAGGCCACAGTGACGTGGCCTTTTT
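The selection criteria of steps 5 to 8 can be sketched in Python as a simple filter over BLAST hits, together with a helper for reverse complementing sequences reported on the opposite strand. The `Hit` record, its field names, and the choice to keep the first hit encountered per species are our own assumptions for illustration; the default thresholds are the ones given in step 5, and the values in the usage example are made up rather than real BLAST output.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record holding the BLAST columns used for selection."""
    accession: str
    species: str
    percent_identity: float
    query_coverage: float


def select_seed_hits(hits, min_identity=65.0, max_identity=95.0,
                     min_coverage=100.0):
    """Keep one hit per species within the identity and coverage limits.

    Hits are assumed to arrive in BLAST's ranked order; the first acceptable
    hit for each species is kept (an arbitrary choice for this sketch).
    """
    chosen = {}
    for hit in hits:
        if not min_identity <= hit.percent_identity <= max_identity:
            continue
        if hit.query_coverage < min_coverage:
            continue
        chosen.setdefault(hit.species, hit)
    return list(chosen.values())


def reverse_complement(sequence):
    """Reverse complement a DNA sequence (A, C, G, T in either case)."""
    return sequence.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


if __name__ == "__main__":
    # Illustrative, made-up values; not taken from a real search.
    example_hits = [
        Hit("X1", "species A", 100.0, 100.0),  # too similar to the query
        Hit("X2", "species B", 88.0, 100.0),
        Hit("X3", "species B", 85.0, 100.0),   # second hit in same species
        Hit("X4", "species C", 80.0, 60.0),    # partial coverage
    ]
    for hit in select_seed_hits(example_hits):
        print(hit.accession, hit.species, hit.percent_identity)
    print(reverse_complement("GAAAGACGCG"))
```

For the MicA example, the stricter range used in step 8 would correspond to `select_seed_hits(hits, min_identity=75.0, max_identity=90.0)`. A filter of this kind only narrows the candidate list; the taxonomic and genomic-context checks described in the following steps still need to be done by eye.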
Figure 1. Partial results of a BLAST search using the E. coli MicA sequence from the “nr” sequence database.
The tabular view provides important information that can be used to pick putative homolog sequences, including species and strain information (column 1), query sequence coverage (column 4), E-value (column 5), and percent identity (column 6). Also note the accession number in column 7; this will be useful for looking up sequences in other databases (e.g. ENA).
Examining Your Initial Homolog Set
9. Once you have assembled a set of sequences fitting the criteria described above, it is worth taking a closer look at them. Remember that these sequences will form the core of your alignment and CM, and errors at this stage can dramatically bias your results. A good first test is to examine the taxonomy of your sequences, and make sure it makes sense. Can you identify a clear pattern of inheritance that might explain the taxonomic distribution you see at this stage? Another good check is to examine your sequences in the ENA browser, or a domain-specific browser if one exists for your organisms. For many independently transcribed RNAs, genomic context is better conserved than sequence, and ncRNA genes will often fall in homologous intergenic or intronic regions even over large evolutionary distances. If you are particularly ambitious, and the tools are available for your organisms of interest, you may wish to try to identify promoter sequences upstream of your candidate or terminator sequences downstream. If your sequence is a putative cis-regulatory element, such as a riboswitch, thermosensor, or attenuator, you may want to check that it occurs upstream of genes with similar functions or in similar pathways (Weinberg et al., 2015). Finally, it is always worth searching your putative homologs through the Rfam website even if your initial sequence had no matches - Rfam's models are not perfect, and may miss distant homologs of known families.
10. You can quickly examine your chosen sequences with the ENA browser, accessed by appending the accession number to http://www.ebi.ac.uk/ena/data/view/ (Figure 2). It appears that all of these sequences fall in an intergenic region between a luxS protein homolog and a gshA protein homolog, further increasing our confidence that these are true homologs. From our results, we can also see a few promising hits that don't quite meet our criteria, such as Dickeya, Xenorhabdus, Photorhabdus, and Wigglesworthia. We will keep these in mind later to expand our coverage.
Figure 2. Genomic context of micA.
ENA browser view of the region surrounding micA in E. coli. All of our selected homologs show conserved synteny with the luxS and gshA genes, providing additional evidence for their evolutionary relationship. Note the Rfam annotation in this region, matching our sequence. This view can be generated by navigating to http://www.ebi.ac.uk/ena/data/view/U00096.3 and entering the genome coordinates in the “Base range” boxes.
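When checking many candidates, the browser links can be generated rather than typed. The short Python sketch below simply appends each accession to the ENA view address quoted in the text; the function name is ours, and the URL pattern is assumed to be exactly as given above (it may change as the ENA site evolves).

```python
# Base address for the ENA browser, as quoted in the protocol text.
ENA_VIEW_BASE = "http://www.ebi.ac.uk/ena/data/view/"

# Accessions of the MicA seed sequences selected in protocol 1.
MICA_ACCESSIONS = [
    "U00096", "FQ312003", "CP002910", "CP002272",
    "AM286415", "CP002433", "FP236842",
]


def ena_view_url(accession):
    """Build the ENA browser address for a single accession."""
    return ENA_VIEW_BASE + accession.strip()


if __name__ == "__main__":
    for accession in MICA_ACCESSIONS:
        print(ena_view_url(accession))
```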
Basic Protocol 2: Aligning and predicting secondary structure
We will use the WAR server to construct an initial alignment. Because of the criteria we've set for sequence similarity in our gathering step, all of the sequences in our initial homolog set should have at least 60% pairwise sequence identity with at least one other sequence in the set. Under these conditions alignment methods using primary sequence information only can perform adequately, as discussed in the background information section. These methods combined with alignment folding tools that identify conserved structural signals and covariation can produce reasonable secondary structure predictions (Gardner and Giegerich, 2004). However, it is still often useful to observe the behavior of as many alignment tools as possible. Using WAR, for a fairly fast alignment we recommend running CMfinder, StrAL+PETfold, ClustalW and MAFFT with RNAalifold and Pfold. WAR will also produce a consensus alignment using T-Coffee, which will attempt to find an alignment consistent with all of the individual alignments produced by other methods.
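Before submitting, it can be worth confirming that every sequence really does have at least 60% identity to some other member of the set. The Python sketch below does this with a basic Needleman-Wunsch global alignment. The scoring scheme (match +1, mismatch -1, gap -1) and the definition of percent identity as matches divided by alignment columns are our own simplifying assumptions; other tools define identity differently and will give somewhat different numbers.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical columns as a percentage of all alignment columns."""
    aligned_a, aligned_b = global_align(a, b)
    matches = sum(x == y for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


def poorly_connected(sequences, threshold=60.0):
    """Names whose best identity to any other sequence is below threshold."""
    flagged = []
    for name, seq in sequences.items():
        best = max(percent_identity(seq, other)
                   for other_name, other in sequences.items()
                   if other_name != name)
        if best < threshold:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Three of the MicA seed sequences from protocol 1.
    seed = {
        "U00096.3": "GAAAGACGCGCATTTGTTATCATCATCCCTGAATTCAGAGATGAAATTTTGGCCACTCACGAGTGGCCTTTTT",
        "FQ312003": "GAAAGACGCGCATTTGTTATCATCATCCCTGTTTTCAGCGATGAAATTTTGGCCACTCCGTGAGTGGCCTTTTT",
        "FP236842": "GAAAGACGCGTATTTGTTATCATCATCTCATCCCTGACAACAGAGATGTTAATTTAGGCCACAGTGACGTGGCCTTTTT",
    }
    print("below threshold:", poorly_connected(seed))
```

Any sequence reported by `poorly_connected` is a candidate for removal from the initial set, or a signal that an intermediate sequence is needed to bridge it to the rest.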
Necessary Resources:
Computer with a modern web browser (e.g. Firefox, Chrome, Internet Explorer)
Basic text editor
Submit your sequences to the WAR webserver
1. You will need to format your input sequences in FASTA format; an example can be found in protocol 1.2. Navigate to http://genome.ku.dk/resources/war/ and paste your sequences into the sequence box. Uncheck the tick boxes next to all methods besides CMfinder, StrAL+PETfold, ClustalW+Pfold, ClustalW+RNAalifold, MAFFT+Pfold, and MAFFT+RNAalifold; this will limit WAR to running relatively fast methods. You can include additional methods in later runs to see how this affects the resulting alignments. Submit the job together with your email address. Using fast alignment methods you should get results within 10 minutes, assuming the server isn’t busy.
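If your sequences are held in a script or spreadsheet rather than a text file, a small helper can write them out as FASTA for pasting into the sequence box. This is a minimal sketch: the function name, the 60-character line width, and the example records are assumptions made for illustration, and the sequences shown are placeholders, not real MicA homologs.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text.

    Each record becomes a '>' header line followed by the sequence
    wrapped at `width` characters. Whitespace inside sequences is removed.
    """
    lines = []
    for name, seq in records:
        if not name or any(c.isspace() for c in name):
            raise ValueError(f"header must be a single non-empty token: {name!r}")
        seq = "".join(seq.split()).upper()
        if not seq:
            raise ValueError(f"empty sequence for {name!r}")
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Placeholder records (hypothetical names and sequences).
example = [("example_1", "GAAAGACGCGCAUUUGUUAUC"), ("example_2", "GAAAGACGCGCAUUUAUUAUC")]
print(to_fasta(example), end="")
```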
Understanding WAR results
2. Once WAR returns your alignment results, there are a number of things you should take note of that will assist you in picking an alignment and in later manual refinement. First, the consensus alignment page will display a graphical representation of the consistency of the alignments, which will allow you to quickly tell which areas of the alignment may require attention during manual refinement, or which areas may harbor structure not captured by the majority consensus. For instance, in our example MicA alignment (Figure 3) there is some alignment and structural disagreement around the first alignment gap, suggesting this area might require some manual curation to correct. The consensus can be recomputed based on differing subsets of alignment methods, if you believe one method (or set of methods) may be unduly influencing the consensus. Once you've carefully looked over the consensus alignment, examine each alignment produced by WAR in turn: What structures are shared? Where do the alignments differ from each other? Can you identify any sequence or structural motifs which may help to guide your alignment? At this level of sequence identity, you should hope to see fairly consistent alignments in functional regions of the sequence, interspersed with more difficult to align regions, presumably under weaker selective pressure. Often the consensus alignment is a good choice to move forward with. However, there are cases where certain classes of tools will obviously mis-align regions of the sequence and bias the consensus. Keep in mind what you've seen in the alternative alignments as well; this information may be useful in manual refinement. You will want to save the Stockholm file for the alignment you've chosen to your local computer at this point. This can be done by clicking the “Stockholm” link at the top right of any result page.
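Once a Stockholm file has been saved, it can be useful to load it into a script, for example to confirm that every sequence and the consensus structure line have the same length. The reader below is a minimal sketch that handles only the features used in this protocol (sequence lines and the "#=GC SS_cons" line, possibly split over several blocks); it is not a complete Stockholm parser, and the toy alignment is invented for illustration.

```python
def read_stockholm(text):
    """Return (sequences, ss_cons) from Stockholm-format text.

    sequences maps each name to its aligned sequence; ss_cons is the
    consensus secondary structure string. Other markup lines are ignored.
    """
    seqs = {}
    ss_cons = ""
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//":
            continue
        parts = line.split()
        if parts[0] == "#=GC" and len(parts) >= 3 and parts[1] == "SS_cons":
            ss_cons += parts[2]          # interleaved blocks are concatenated
        elif line.startswith("#"):
            continue                      # header and other annotation lines
        elif len(parts) >= 2:
            name, chunk = parts[0], parts[1]
            seqs[name] = seqs.get(name, "") + chunk
    return seqs, ss_cons


# Invented toy alignment with a single three base-pair hairpin.
toy_sto = """# STOCKHOLM 1.0
seq1         GGGAAACCC
seq2         GCGAAACGC
#=GC SS_cons <<<...>>>
//
"""
sequences, structure = read_stockholm(toy_sto)
print(sequences, structure)
```

A quick consistency check is that all values in sequences have the same length as structure; a mismatch usually means the file was truncated or edited incorrectly.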
Figure 3. Consensus alignment of MicA sequences.
Visualization of a T-Coffee consensus alignment incorporating information from 6 different alignment and structure prediction methods on the WAR webserver.
Aligning more distantly related sequences
3. Later in the family-building process, when you have identified more distant homologs, the average pairwise identity of the sequences in your data set may have dropped below 60%. At this point, you may want to begin including some of the Sankoff-type alignment methods available in WAR. Using these methods can dramatically increase the runtime for your sequence alignment jobs, though, particularly for sequences over a couple of hundred bases long. We will discuss alternatives to re-aligning sequences during the iterative expansion of the alignment in section 3.5.
Basic Protocol 3: Guidance for manually refining alignments
Our goal in manual refinement is to attempt to correct errors made by automatic alignment tools. We generally use RALEE (Griffiths-Jones, 2005), an RNA editing mode for Emacs, for editing alignments. However, any editor you are comfortable with in which you can easily visualize sequence and structural conservation will work; a number of alternative editors are listed in the strategic planning section. RALEE offers coloring based on structure, sequence conservation, and compensatory mutations. These can be accessed from the “Structure” menu in Emacs (Figure 4).
Figure 4. Editing alignments in Emacs RALEE mode.
Screenshots showing three different color markups of the MicA consensus alignment highlighting secondary structure (top), sequence conservation (middle), and compensatory mutations (bottom).
Necessary Resources:
Computer, preferably running a *NIX-based operating system (e.g. Linux, OS X)
Emacs with RALEE mode installed (see http://sgjlab.org/ralee/)
1. A good place to start editing is around the edges of predicted hairpin structures. Are there base-pairs which appear to be misaligned? Can you add base-pairs to the structure? Are there predicted base-pairs which don't appear to be well conserved that should be trimmed? Can individual bases be moved in the alignment to create more convincing support for the predicted structure?
2. Once you are satisfied with your manual refinement of predicted secondary structure elements, next you should turn your attention to areas identified as uncertain in the WAR/T-Coffee consensus alignment. Were there alternative structures predicted in these regions? Do you see support for these structures in the sequences? If these regions are unstructured, can you identify any conserved sequence motifs in the region? If you will be regularly working with a particular class of ncRNA, it can be useful to familiarize yourself with predicted binding motifs of associated RNA-binding proteins as these are likely to be conserved but can have many variable positions (Gardner and Eldai, 2015).
3. Our MicA consensus alignment provides a few opportunities for improvement (Figure 5). The first base-pair in the first stem in CP002433 can be rescued by shifting a few nucleotides, and by pulling apart the alignment between the first and second stem we reveal what appears to be a well-conserved AAUUU sequence motif that was previously hidden (Figure 5). The bacterial RNA chaperone Hfq is known to bind to A/U rich sequences, so this motif may have some functional significance (Schu et al., 2015).
4. At this stage, it is also possible to include information from experimental data. Crystal structure information from a single sequence in the seed alignment can be used to validate and improve a predicted secondary structure. Tertiary structure-aware editors such as BoulderAle (Stombaugh et al., 2011) can help in applying this information to the alignment. Other experimental evidence, such as chemical footprinting, particularly when coupled with high-throughput sequencing (Spitale et al., 2014; Kwok et al., 2015), can also provide valuable information. Knowing whether even a single base is involved in a pairing interaction can drastically reduce the space of possible structures the sequence can fold into, simplifying the problem of predicting secondary structure. Both the RNAfold and RNAalifold web servers available through the Vienna RNA website (Gruber et al., 2008b) are capable of taking advantage of this information in the form of folding constraints. Structural mapping datasets are beginning to be archived in consistent formats (Rocca-Serra et al., 2011; Cordero et al., 2012), and will be increasingly available for a variety of non-coding RNA molecules in the future.
Figure 5. MicA alignment following manual refinement.
Manual refinement using RALEE restores full conservation to the first stem-loop and removes an unlikely insertion within the hairpin structure. Additionally, careful manipulation of the sequence between the two hairpins reveals strong conservation of a short A/U-rich sequence motif that was previously obscured.
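The questions asked during refinement (are the predicted base-pairs conserved, and do they show compensatory changes?) can also be tabulated with a short script. The sketch below is a crude illustration and not how RALEE computes its coloring: it assumes a nested consensus structure written with bracket characters, ignores pseudoknot annotation, treats G-U as a canonical pair, and uses an invented toy alignment.

```python
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}
OPENERS = "<([{"
CLOSERS = ">)]}"


def pair_table(ss):
    """Return sorted (i, j) column pairs from a nested bracket structure string."""
    stack, pairs = [], []
    for i, c in enumerate(ss):
        if c in OPENERS:
            stack.append(i)
        elif c in CLOSERS:
            if not stack:
                raise ValueError("unbalanced structure: extra closing bracket")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: unclosed opening bracket")
    return sorted(pairs)


def pair_support(seqs, ss):
    """For each predicted pair return (i, j, canonical fraction, distinct canonical pair types).

    A high canonical fraction with more than one pair type hints at
    compensatory mutation; a low fraction flags a pair worth re-examining.
    """
    report = []
    for i, j in pair_table(ss):
        observed = [(s[i] + s[j]).upper().replace("T", "U") for s in seqs.values()]
        canonical = [p for p in observed if p in CANONICAL]
        report.append((i, j, len(canonical) / len(observed), len(set(canonical))))
    return report


# Invented toy alignment: one hairpin, with a mismatch in seq4 at the innermost pair.
toy_seqs = {
    "seq1": "GGGAAACCC",
    "seq2": "GCGAAACGC",
    "seq3": "GGAAAAUCC",
    "seq4": "GGGAAAACC",
}
toy_ss = "<<<...>>>"
for row in pair_support(toy_seqs, toy_ss):
    print(row)
```

Columns are numbered from zero. Pairs with a low canonical fraction are natural places to try shifting individual bases, or to consider trimming the pair from the structure.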
Basic Protocol 4: Building a covariance model with Infernal
For those comfortable with the *NIX command line, building an Infernal CM is fairly straightforward. We refer the reader to the User's Guide available from the Infernal website (http://eddylab.org/software/infernal/) for installation instructions and a detailed tutorial.
Necessary Resources:
Computer running a *NIX-based operating system (e.g. Linux, OS X)
Infernal (see http://eddylab.org/infernal/)
Run cmbuild
1. Assuming you have successfully installed Infernal somewhere in your path and you have saved your edited alignment in a Stockholm file named my.sto (see Figure 5 for an example of Stockholm format), run:
> cmbuild my.cm my.sto
This will construct a CM and save the results in a file named my.cm. This command also returns some useful statistics on your model, such as the effective number of sequences, and the relative entropy of your CM compared to an HMM with the same sequences, i.e. how much information is gained by including structural information in your model.
Run cmcalibrate
2. To generate E-values in search results, the CM must now be calibrated. This calibration depends on a sampling procedure, and so can be time-consuming. For our MicA CM, it will take about 5 minutes. Run:
> cmcalibrate my.cm
Note that calibration can take a long time -- hours for longer models. You can get a quick estimate of the time calibration will take using the command:
> cmcalibrate --forecast 1 my.cm
Congratulations! You should now have a working CM for your RNA family. This is a fully capable model, and can be used as is for homology search and genome annotation. However, as it stands, your CM will only capture the sequence diversity that our initial BLAST search was able to detect. In order to take full advantage of the power of CMs, you may want to expand the diversity of the sequences it is trained on through iterative expansion of your initial set of sequence homologs.
Basic Protocol 5: Strategies for expanding model coverage
Now that you have constructed your CM from BLAST results, you will want to try to expand your model coverage to fully capture the phylogenetic diversity of your sequence. There are several strategies for doing this that we discuss below, not all of which will work for every ncRNA.
Plan A: Iterative search of sequence databases
The method Rfam used to identify more divergent homologs to seed sequences prior to recent HMM-based acceleration of the Infernal pipeline (Nawrocki and Eddy, 2013; Nawrocki et al., 2015) was to pre-filter CM-based searches with WU-BLAST. This allows us to cover a large sequence space with a (comparatively) modest investment of computational time. Any of the single sequence search tools mentioned in the strategic planning section would make an effective pre-filter.
The easiest way to perform filtering yourself is to use the NCBI BLAST webserver to search each sequence in your seed alignment following the methods outlined for collecting your initial set of homologs in Basic Protocol 1. You may wish to relax the criteria slightly, then use the CM to perform a more sensitive search on this set of filtered sequences. This will enable you to detect more distantly related sequences, though you should always examine sequence context and the phylogenetic relationship between sequences as a sanity check before including them in your seed. These methods can be automated with basic scripting and bioinformatics modules such as BioPerl (Stajich et al., 2002) or Biopython (Cock et al., 2009), though this is beyond the scope of this chapter.
Once you have identified a new set of homologs, you can align them to your previous CM using Infernal's cmalign:
> cmalign --mapali my.sto my.cm newsequences.fasta > mynewalignment.sto
The --mapali option includes your original training alignment in the mynewalignment.sto output. This alignment can then be used to build a new CM, which will capture the additional sequence variation you have discovered in your BLAST searches.
The disadvantage of this method is that each search only uses the information available in a single sequence, meaning that valuable information about variation is lost and as a result the power of the search suffers. Fast profile-based methods implemented in HMMER3 (Wheeler and Eddy, 2013) can be used to remedy this (Lindgreen et al., 2014), but require significant local computational resources to be used in a manner similar to NCBI BLAST.
Plan B: Directed search of chosen sequences
Another approach is to run the unfiltered CM over selected genomes or genomic regions. While the greater sensitivity and specificity of this method can help identify more distant homologs than is possible with BLAST, it has the disadvantage that it requires a much larger investment of computational resources to provide an equivalent phylogenetic coverage. This method can be particularly powerful in bacterial and archaeal genomes, where small genome size allows us to search a phylogenetically-representative sample of genomes in less than a day on a desktop computer. In the case of larger eukaryotic genomes, it may be necessary to search a few genomes to determine if homologs of your RNA are likely to exist in certain lineages, then extract homologous intergenic regions to continue searching. Our rationale here is much the same as in limiting the database for our initial BLAST search: by only looking in genomes where we have some prior belief that they may contain homologous sequence we reduce the noise in our low-scoring hits, meaning that we have to manually examine fewer hits to establish a score threshold for likely homologs.
Once you have examined candidates following the principles outlined earlier, it is easy to incorporate your new sequences using the Easel package included with Infernal. First, search the genome generating a tabfile:
> cmsearch --tblout searchfile.tab my.cm genome.fasta
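Then use Easel to index the genome and extract the hits. With -C, esl-sfetch expects one hit per line as four fields (new name, start, end, source sequence), so the --tblout table is first reformatted with grep and awk:
> esl-sfetch --index genome.fasta
> grep -v "^#" searchfile.tab | awk '{print $1"/"$8"-"$9, $8, $9, $1}' | esl-sfetch -Cf genome.fasta - > hits.fasta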
These sequences can then be aligned and merged as with BLAST hits. Alternatively, if you discover a divergent lineage, it may be easiest to construct a separate alignment for these sequences, then use shared structural and sequence motifs to manually combine the two alignments. Sankoff-type alignment methods may also be useful for aligning divergent clades.
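Before extracting hits it is often convenient to inspect and filter the search table in a script. The sketch below assumes the default whitespace-separated column layout written by cmsearch --tblout in Infernal 1.1 (target name in column 1, sequence start and end in columns 8 and 9, strand in column 10, bit score in column 15, E-value in column 16); check the header of your own file if you use a different version. The Hit type, the function name, and the two table rows are invented for illustration and are not real search results.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    target: str
    start: int
    end: int
    strand: str
    score: float
    evalue: float


def parse_tblout(text, max_evalue=None):
    """Parse cmsearch --tblout text into Hit records, optionally filtering by E-value."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        f = line.split()
        hit = Hit(f[0], int(f[7]), int(f[8]), f[9], float(f[14]), float(f[15]))
        if max_evalue is None or hit.evalue <= max_evalue:
            hits.append(hit)
    return hits


# Made-up rows in the assumed column layout (not real results).
toy_table = """#target name accession query name accession mdl mdl from mdl to seq from seq to strand trunc pass gc bias score E-value inc description of target
genomeA - my - cm 1 72 1200 1271 + no 1 0.44 0.0 68.3 1.2e-12 ! -
genomeB - my - cm 1 72 5300 5232 - no 1 0.40 0.0 15.1 0.02 ? -
"""
confident = parse_tblout(toy_table, max_evalue=1e-5)
marginal = [h for h in parse_tblout(toy_table) if h.evalue > 1e-5]
print(confident)
print(marginal)
```

Separating confident from marginal hits in this way makes it easier to decide which candidates can be merged directly and which need their genomic context checked by hand first. The 1e-5 cutoff is an arbitrary value chosen for the example, not a recommended threshold.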
Plan C: When A and B fail…
In some cases, it will be very difficult to identify homologs of a candidate RNA across its full phylogenetic range. This can be because of high sequence variability, as in the Vault RNAs (Stadler et al., 2009). Alternatively, some longer RNAs, such as the RNA component of the telomerase ribonucleoprotein, consist of structurally conserved segments interspersed with long variable regions that can't be easily discovered by standard search with naive covariance models (Marz et al., 2012; Xie et al., 2008).
A number of computational techniques exist for approaching these difficult cases, reviewed by Mosig and colleagues (Mosig et al., 2009). These methods include fragrep2 (Mosig et al., 2007), which allows the user to search fragmented conserved regions (including pol III promoter and terminator motifs (Stadler et al., 2009; Gruber et al., 2008a; Marz et al., 2009)), fragrep3, which allows the user to incorporate custom structural motifs with fragmented search, and GotohScan (Hertel et al., 2009), which implements a semi-global alignment algorithm that will align a query sequence to a (potentially) extended genomic region.
Applying Plans A & B to MicA
Now we will follow Plan B to add sequences to our alignment using the genomes for the low-scoring BLAST hits we had previously made a note of while collecting our initial set of sequences, though you could also choose these sequences based on your knowledge of your organism's phylogeny or the suspected function of your RNA. The genomes we have chosen here are Dickeya zeae (CP001655), Sodalis glossinidius (AP008232), Xenorhabdus nematophila (FN667742) and Wigglesworthia glossinidia (BA000021); these can all be downloaded from the EMBL bacterial genomes pages (http://www.ebi.ac.uk/genomes/bacteria.html). Searching these genomes allows us to identify strong hits in D. zeae and S. glossinidius with E-values of 10^-12 and 10^-10, which we can merge into our alignment using the methods described in Plan B above. You should then manually refine the resulting merged alignment with an eye towards maintaining conserved sequence motifs and structure. Already at this evolutionary distance, there has been some apparent small decay in secondary structure, as well as an expansion of the sequence contained in the loop region of the second stem in D. zeae (Figure 6).
We observe a number of hits in X. nematophila with E-values in the range of 10^-2, and we can apply Plan A to investigate these. By checking each of these individually in the ENA browser, we can identify one that falls in the same genomic context as our previous MicA homologs (Figure 7). By using this sequence as the starting point for a BLAST search, we can identify a number of other divergent Xenorhabdus homologs. As these are quite diverged from the E. coli sequence, we first construct an alignment for them using WAR (Figure 7), then attempt to merge our alignments manually using shared structural features as our guide. Interestingly, the target-binding region of MicA appears to have suffered a poly-A insertion along this lineage, suggesting that there may be changes in the regulon it targets, in line with recent findings that RNA:RNA interactions are frequently poorly conserved (Lai and Meyer, 2015; Richter and Backofen, 2012). Using this model to search all of the bacterial genomes in EMBL-Bank (approximately 6 GB of sequence, taking ~30 hours on a 2.26 GHz Intel Core 2 Duo processor) shows that our CM now has high-scoring hits exclusively in Enterobacteriales, while covering a broader range than our initial BLAST searches. This search also reveals a number of possible sources of additional diversity: Photorhabdus asymbiotica and Edwardsiella ictaluri both have strong hits below the average score for other enterobacterial genomes -- incorporating them may further increase the sensitivity of our model, and is left as an exercise to the reader.
Figure 6. MicA alignment containing additional homologs found using Plan B.
Two new sequences have been added to the alignment (D. zeae, CP001655; S. glossinidius, AP008232), adding structural and sequence diversity.
Figure 7. Xenorhabdus MicA homologs found using Plan A.
The top panel shows the location of a putative micA homolog with a marginal Infernal E-value in X. nematophila, sharing synteny with established micA sequences in E. coli. The bottom panel shows a manually refined alignment of three putative Xenorhabdus MicA sequences, displaying structural and sequence similarity to our previously constructed alignment of enterobacterial MicA sequences.
Guidelines for Understanding Results
Given the diversity of ncRNAs in sequence, structure, function, and conservation, it is difficult to provide general guidelines for interpreting results, though we will attempt to provide some guidance here. The best guard against spurious results is maintaining a skeptical attitude during family building. We have suggested some questions to ask yourself during family building above, but they bear repeating: does the phylogenetic distribution make sense? Can it be explained by vertical inheritance? The appearance of a sequence from a distantly related phylum in Infernal search results is far more likely to be a false positive than horizontal transfer in most cases. Do the expanded alignments make sense? Are important structural and sequence features (e.g. intermolecular interaction sites, protein binding sites, stabilizing secondary structures) conserved? If additional sequences no longer contain these important features, you can assume you have hit the limit of your family’s conservation.
Knowing a “good” alignment is more art than science, and often requires experience and domain-specific knowledge of the molecules being aligned. Articles in the RNA Families track at RNA Biology provide good, reviewed examples of RNA alignments for a range of RNA classes, and may serve as instructive examples. Covariation is often taken as gold-standard evidence for the conservation of secondary structure, and can be visualized in RALEE. However, it should be noted that many secondary-structure aware RNA alignment tools explicitly attempt to maximize covariation; this, combined with the tendency of many aligners to produce alignments for any given set of sequences, even if they are not homologous (see discussion in the Commentary below), can lead to an artificial prediction of high levels of covariation in secondary structure. This can generally be avoided by not including sequences with marginal (high) E-values in your alignment without strong additional evidence for homology (e.g. experimental results, conserved synteny, conservation of defining structural and sequence features). Even in the case of good alignments, knowing what degree of covariation to expect in bona fide secondary structure is often difficult. Ultimately, an RNA family only provides a hypothesis concerning the evolutionary relationship between a set of sequences. How good that hypothesis is depends on many factors discussed in this article, but an attitude of healthy skepticism and understanding of the biology of the family on the part of the builder are perhaps the most important.
Commentary
Background Information
RNA sequence alignment remains a challenge despite at least 30 years of work on the problem (Woese et al., 1983; Lane et al., 1985; Guttel et al., 1985; Pace et al., 1989). As discussed in the introduction, alignments based on primary sequence become untrustworthy below ~60% pairwise sequence identity, likely due to the lower information content of individual nucleotides as compared to amino acids in protein alignments. This can be intuitively understood by recalling the fact that 3 nucleotides are required to encode an individual amino acid; so, an amino acid carries 3 times as much information as a nucleotide (a bit less, actually, due to the redundancy of the genetic code). In addition, the larger alphabet size of protein sequences allows for the easy deployment of more complex substitution models, and a glut of protein sequence data allows for highly effective parameterization of these models.
The incorporation of secondary structure, i.e. base-pairing, information has been proposed as a means to make up for these difficulties in RNA alignment methods. The first proposal for such a method is now known as the Sankoff algorithm (Sankoff, 1985). The Sankoff algorithm uses dynamic programming, an optimization technique central to sequence analysis. (A full explanation of dynamic programming is beyond the scope of this unit, but for a brief introduction see two excellent primers by Sean Eddy covering applications to alignment (Eddy, 2004b) and secondary structure prediction (Eddy, 2004a); for those seeking a deeper understanding, Durbin et al. (1998) provide detailed coverage of dynamic programming as well as covariance models.) Dynamic programming had previously been applied to the problems of sequence alignment (Needleman and Wunsch, 1970) and RNA folding (Nussinov et al., 1978). Sankoff proposed a union of these two methods. Unfortunately, the resulting algorithm has a time requirement of O(L^(3N)) and a space requirement of O(L^(2N)), where L is the sequence length and N is the number of sequences aligned. This is impractical, even for small numbers of short sequences. A number of faster heuristic algorithms have been developed to approximate Sankoff alignment. Recent examples include CentroidAlign (Hamada et al., 2009b), mLocARNA (Will et al., 2007), and FoldalignM (Torarinsson et al., 2007). These methods can push the RNA alignment twilight zone as low as 40 percent identity (Gardner et al., 2005).
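To see why these requirements are impractical, it helps to plug in numbers. The short calculation below is only a back-of-the-envelope sketch: it evaluates the growth terms L^(3N) for time and L^(2N) for space while ignoring all constant factors, so the values indicate scaling rather than actual running times or memory use. The function name is hypothetical.

```python
def sankoff_growth(length, n_seqs):
    """Return the (time, space) growth terms L**(3N) and L**(2N), constants ignored."""
    return length ** (3 * n_seqs), length ** (2 * n_seqs)


# Sequences of 100 nucleotides, aligning 2, 3, or 4 of them simultaneously.
for n in (2, 3, 4):
    time_term, space_term = sankoff_growth(100, n)
    print(f"N={n}: time term {time_term:.0e}, space term {space_term:.0e}")
```

For just two sequences of 100 nucleotides the time term is already 100^6 = 10^12, and each additional sequence multiplies it by a further 10^6, which is why practical tools restrict the search space heuristically.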
However, for the purpose of family-building, we are often starting with a single sequence of unknown secondary structure, and have to gather additional homologs using a fast alignment tool, such as BLAST. This is a common starting point in analysing non-coding regions from RNA sequencing experiments, for example (Perkins et al., 2009; Chaudhuri et al., 2011; Lindgreen et al., 2014). Unfortunately, BLAST and similar single-sequence homology search methods are not able to reliably detect homologs below 60 percent sequence identity. In this range of pairwise sequence identities, the slight increases in accuracy of Sankoff-type algorithms over non-structural alignment are only rarely worth the additional computational costs involved. (For relatively recent benchmarks of alignment tools on ncRNA sequences see Hamada et al. (2009b) and the supplementary information of Bradley et al. (2008); Hamada includes comparisons of aligner runtimes, while Bradley examines relative performance over a range of pairwise sequence percent identities.) Alignments generated with standard alignment tools can then be used as a basis for predictions of secondary structure using tools like Pfold (Knudsen and Hein, 2003), RNAalifold (Bernhart et al., 2008), or CentroidFold (Hamada et al., 2009a).
Regardless, all modern alignment tools, Sankoff-type or standard, suffer from a number of known problems. Most alignment tools use progressive alignment. This means that the aligner decomposes the alignment problem into a series of pairwise alignment problems along a guide tree built using some measure of sequence similarity, so that highly similar sequences are aligned first before alignments between more divergent sequences are attempted. This greatly reduces the computational complexity of the alignment problem, but also means that errors in early pairwise alignment steps are propagated through the entire alignment. A number of solutions have been proposed to this problem, such as explicitly modeling insertion-deletion histories (Löytynoja and Goldman, 2008) or using modified or alternative optimization methods such as consistency-guided progressive alignment (Notredame et al., 2000) or sequence annealing (Schwartz and Pachter, 2007). A second issue is that it is not clear which function of the alignment aligners should be optimizing (Iantorno et al., 2014), and many appear to over-predict homology (Schwartz et al., 2005; Bradley et al., 2008, 2009). Finally, many parameters commonly used in alignment, such as gap opening and closing probabilities and substitution matrices, appear to vary across organisms, sequences, and even positions within an alignment. All of this leads to considerable uncertainty in alignment (Wong et al., 2008), which is not easily captured by most current alignment methods, although sampling-based methods do exist which attempt to account for this (Suchard and Redelings, 2006; Westesson et al., 2012). The additional parameters and uncertainty introduced by RNA secondary structure prediction only compound these problems.
An additional problem with alignment is the issue of determining whether two sequences are similar due to homology or analogy. Homology describes a similarity in features based on common descent; for instance, all bird wings are homologous wings. Analogy, on the other hand, describes a similarity in features based on common function without common descent; bat and bird wings perform the same function and appear superficially similar, but their evolutionary histories are quite different. In sequence analysis, we often assume that aligned residues within an alignment share common ancestry, but this assumption can be confounded by analogous sequence. These analogs often take the form of motifs, short sequences which perform specific functions within the RNA molecule and can arise easily through convergent evolution (Gardner and Eldai, 2015). An example of such a motif is the bacterial rho-independent terminator (Gardner et al., 2011), a short hairpin responsible for halting transcription in many species. While such motifs can be a boon in discovering novel ncRNA genes (Argaman et al., 2001; Livny et al., 2005) or aligning homologs which contain them (Nawrocki et al., 2015), they can also be a source of false positives when attempting to build an alignment of homologous sequences.
Acknowledgements
This work was supported by the Wellcome Trust, grant number WT098051. L.B. was supported by a Research Fellowship from the Alexander von Humboldt Stiftung/Foundation. PPG is supported by a Rutherford Discovery Fellowship.
Literature Cited
- Alikhan N-F, Petty NK, Ben Zakour NL, Beatson SA. BLAST Ring Image Generator (BRIG): simple prokaryote genome comparisons. BMC genomics. 2011;12:402. doi: 10.1186/1471-2164-12-402.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of molecular biology. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
- Andersen ES, Lind-Thomsen A, Knudsen B, Kristensen SE, Havgaard JH, Torarinsson E, Larsen N, Zwieb C, Sestoft P, Kjems J, et al. Semiautomated improvement of RNA alignments. RNA. 2007;13:1850–1859. doi: 10.1261/rna.215407.
- Argaman L, Hershberg R, Vogel J, Bejerano G, Wagner EGH, Margalit H, Altuvia S. Novel small RNA-encoding genes in the intergenic regions of Escherichia coli. Current biology: CB. 2001;11:941–950. doi: 10.1016/s0960-9822(01)00270-6.
- Asai K, Kiryu H, Hamada M, Tabei Y, Sato K, Matsui H, Sakakibara Y, Terai G, Mituyama T. Software.ncrna.org: web servers for analyses of RNA sequences. Nucleic acids research. 2008;36:W75–W78. doi: 10.1093/nar/gkn222.
- Barquist L, Vogel J. Accelerating Discovery and Functional Analysis of Small RNAs with New Technologies. Annual review of genetics. 2015;49:367–394. doi: 10.1146/annurev-genet-112414-054804.
- Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic acids research. 2013;41:D991–5. doi: 10.1093/nar/gks1193.
- Barrick JE, Breaker RR. The distributions, mechanisms, and structures of metabolite-binding riboswitches. Genome biology. 2007;8:R239. doi: 10.1186/gb-2007-8-11-r239.
- Barrick JE, Sudarsan N, Weinberg Z, Ruzzo WL, Breaker RR. 6S RNA is a widespread regulator of eubacterial RNA polymerase that resembles an open promoter. RNA. 2005;11:774–784. doi: 10.1261/rna.7286705.
- Bateman A, Agrawal S, Birney E, Bruford EA, Bujnicki JM, Cochrane G, Cole JR, Dinger ME, Enright AJ, Gardner PP, et al. RNAcentral: A vision for an international database of RNA sequences. RNA. 2011;17:1941–1946. doi: 10.1261/rna.2750811.
- Bauer M, Klau GW, Reinert K. Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization. BMC bioinformatics. 2007;8:271. doi: 10.1186/1471-2105-8-271.
- Bernhart SH, Hofacker IL, Will S, Gruber AR, Stadler PF. RNAalifold: improved consensus structure prediction for RNA alignments. BMC bioinformatics. 2008;9:474. doi: 10.1186/1471-2105-9-474.
- Birney E, Clamp M, Durbin R. GeneWise and Genomewise. Genome research. 2004;14:988–995. doi: 10.1101/gr.1865504.
- Boratyn GM, Camacho C, Cooper PS, Coulouris G, Fong A, Ma N, Madden TL, Matten WT, McGinnis SD, Merezhuk Y, et al. BLAST: a more efficient report with usability improvements. Nucleic acids research. 2013;41:W29–33. doi: 10.1093/nar/gkt282.
- Bradley RK, Pachter L, Holmes I. Specific alignment of structured RNA: stochastic grammars and sequence annealing. Bioinformatics. 2008;24:2677–2683. doi: 10.1093/bioinformatics/btn495.
- Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L. Fast statistical alignment. PLoS computational biology. 2009;5:e1000392. doi: 10.1371/journal.pcbi.1000392.
- Brownlee GG. Sequence of 6S RNA of E. coli. Nature. 1971;229:147–149. doi: 10.1038/newbio229147a0.
- Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, Eddy SR, Gardner PP, Bateman A. Rfam 11.0: 10 years of RNA families. Nucleic acids research. 2013;41:D226–32. doi: 10.1093/nar/gks1005.
- Carver T, Harris SR, Berriman M, Parkhill J, McQuillan JA. Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics. 2012;28:464–469. doi: 10.1093/bioinformatics/btr703.
- Carver TJ, Rutherford KM, Berriman M, Rajandream M-A, Barrell BG, Parkhill J. ACT: the Artemis Comparison Tool. Bioinformatics. 2005;21:3422–3423. doi: 10.1093/bioinformatics/bti553.
- Chan PP, Holmes AD, Smith AM, Tran D, Lowe TM. The UCSC Archaeal Genome Browser: 2012 update. Nucleic acids research. 2012;40:D646–52. doi: 10.1093/nar/gkr990.
- Chaudhuri RR, Yu L, Kanji A, Perkins TT, Gardner PP, Choudhary J, Maskell DJ, Grant AJ. Quantitative RNA-seq analysis of the Campylobacter jejuni transcriptome. Microbiology. 2011;157:2922–2932. doi: 10.1099/mic.0.050278-0.
- Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423. doi: 10.1093/bioinformatics/btp163.
- Cordero P, Lucks JB, Das R. An RNA Mapping DataBase for curating RNA structure mapping experiments. Bioinformatics. 2012. doi: 10.1093/bioinformatics/bts554. Available at: http://bioinformatics.oxfordjournals.org/content/28/22/3006.short.
- Dalli D, Wilm A, Mainz I, Steger G. STRAL: progressive alignment of non-coding RNA using base pairing probability vectors in quadratic time. Bioinformatics. 2006;22:1593–1599. doi: 10.1093/bioinformatics/btl142. [DOI] [PubMed] [Google Scholar]
- Durbin R, Eddy SR, Krogh A, Mitchison G. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press; 1998. [Google Scholar]
- Eddy SR. Accelerated Profile HMM Searches. PLoS computational biology. 2011;7:e1002195. doi: 10.1371/journal.pcbi.1002195. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eddy SR. How do RNA folding algorithms work? Nature biotechnology. 2004a;22:1457–1458. doi: 10.1038/nbt1104-1457. [DOI] [PubMed] [Google Scholar]
- Eddy SR. What is dynamic programming? Nature biotechnology. 2004b;22:909–910. doi: 10.1038/nbt0704-909. [DOI] [PubMed] [Google Scholar]
- Eddy SR, Durbin R. RNA sequence analysis using covariance models. Nucleic acids research. 1994 doi: 10.1093/nar/22.11.2079. Available at: http://nar.oxfordjournals.org/content/22/11/2079.short. [DOI] [PMC free article] [PubMed] [Google Scholar]
- ENCODE Project Consortium. A user’s guide to the encyclopedia of DNA elements (ENCODE) PLoS biology. 2011;9:e1001046. doi: 10.1371/journal.pbio.1001046. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. Pfam: the protein families database. Nucleic acids research. 2014;42:D222–30. doi: 10.1093/nar/gkt1223. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Finn RD, Clements J, Arndt W, Miller BL, Wheeler TJ, Schreiber F, Bateman A, Eddy SR. HMMER web server: 2015 update. Nucleic acids research. 2015 doi: 10.1093/nar/gkv397. Available at: http://dx.doi.org/10.1093/nar/gkv397. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic acids research. 2011;39:W29–37. doi: 10.1093/nar/gkr367. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fitzgerald S, et al. Ensembl 2014. Nucleic acids research. 2014;42:D749–55. doi: 10.1093/nar/gkt1196. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Freyhult EK, Bollback JP, Gardner PP. Exploring genomic dark matter: a critical assessment of the performance of homology search methods on noncoding RNA. Genome research. 2007;17:117–125. doi: 10.1101/gr.5890907. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fu Y, Deiorio-Haggar K, Anthony J, Meyer MM. Most RNAs regulating ribosomal protein biosynthesis in Escherichia coli are narrowly distributed to Gammaproteobacteria. Nucleic acids research. 2013;41:3491–3503. doi: 10.1093/nar/gkt055. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gardner PP. The use of covariance models to annotate RNAs in whole genomes. Briefings in functional genomics & proteomics. 2009;8:444–450. doi: 10.1093/bfgp/elp042. [DOI] [PubMed] [Google Scholar]
- Gardner PP, Barquist L, Bateman A, Nawrocki EP, Weinberg Z. RNIE: genome-wide prediction of bacterial intrinsic terminators. Nucleic acids research. 2011;39:5845–5852. doi: 10.1093/nar/gkr168. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gardner PP, Bateman AG. A home for RNA families at RNA Biology. RNA biology. 2009;6:2–4. [Google Scholar]
- Gardner PP, Bateman A, Poole AM. SnoPatrol: how many snoRNA genes are there? Journal of biology. 2010;9:4. doi: 10.1186/jbiol211. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gardner PP, Eldai H. Annotating RNA motifs in sequences and alignments. Nucleic acids research. 2015;43:691–698. doi: 10.1093/nar/gku1327. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gardner PP, Giegerich R. A comprehensive comparison of comparative RNA structure prediction approaches. BMC bioinformatics. 2004;5:140. doi: 10.1186/1471-2105-5-140. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gardner PP, Wilm A, Washietl S. A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic acids research. 2005;33:2433–2439. doi: 10.1093/nar/gki541. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gogol EB, Rhodius VA, Papenfort K, Vogel J, Gross CA. Small RNAs endow a transcriptional activator with essential repressor functions for single-tier control of a global stress regulon. Proceedings of the National Academy of Sciences of the United States of America. 2011;108:12875–12880. doi: 10.1073/pnas.1109379108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Griffiths-Jones S. RALEE--RNA ALignment editor in Emacs. Bioinformatics. 2005;21:257–259. doi: 10.1093/bioinformatics/bth489. [DOI] [PubMed] [Google Scholar]
- Gruber AR, Kilgus C, Mosig A, Hofacker IL, Hennig W, Stadler PF. Arthropod 7SK RNA. Molecular biology and evolution. 2008a;25:1923–1930. doi: 10.1093/molbev/msn140. [DOI] [PubMed] [Google Scholar]
- Gruber AR, Lorenz R, Bernhart SH, Neuböck R, Hofacker IL. The Vienna RNA websuite. Nucleic acids research. 2008b;36:W70–4. doi: 10.1093/nar/gkn188. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Guttel RR, Weiser B, Woese CR, Noller HF. Comparative anatomy of 16-S-like ribosomal RNA. Progress in nucleic acid research and molecular biology. 1985;32:155–216. doi: 10.1016/s0079-6603(08)60348-7. [DOI] [PubMed] [Google Scholar]
- Hamada M, Kiryu H, Sato K, Mituyama T, Asai K. Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics. 2009a;25:465–473. doi: 10.1093/bioinformatics/btn601. [DOI] [PubMed] [Google Scholar]
- Hamada M, Sato K, Kiryu H, Mituyama T, Asai K. CentroidAlign: fast and accurate aligner for structured RNAs by maximizing expected sum-of-pairs score. Bioinformatics. 2009b;25:3236–3243. doi: 10.1093/bioinformatics/btp580. [DOI] [PubMed] [Google Scholar]
- Hertel J, de Jong D, Marz M, Rose D, Tafer H, Tanzer A, Schierwater B, Stadler PF. Non-coding RNA annotation of the genome of Trichoplax adhaerens. Nucleic acids research. 2009;37:1602–1615. doi: 10.1093/nar/gkn1084. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Höchsmann M, Töller T, Giegerich R, Kurtz S. Local similarity in RNA secondary structures. Proceedings / IEEE Computer Society Bioinformatics Conference. IEEE Computer Society Bioinformatics Conference. 2003;2:159–168. [PubMed] [Google Scholar]
- Hoeppner MP, Barquist LE, Gardner PP. An Introduction to RNA Databases. In: Gorodkin J, Ruzzo WL, editors. RNA Sequence, Structure, and Function: Computational and Bioinformatic Methods Methods in Molecular Biology. Humana Press; 2014. pp. 107–123. [DOI] [PubMed] [Google Scholar]
- Iantorno S, Gori K, Goldman N, Gil M, Dessimoz C. Multiple Sequence Alignment Methods Methods in Molecular Biology. Humana Press; 2014. Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment; pp. 59–73. [DOI] [PubMed] [Google Scholar]
- Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Madden TL. NCBI BLAST: a better web interface. Nucleic acids research. 2008;36:W5–9. doi: 10.1093/nar/gkn201. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jossinet F, Westhof E. Sequence to Structure (S2S): display, manipulate and interconnect RNA data from sequence to structure. Bioinformatics. 2005;21:3320–3321. doi: 10.1093/bioinformatics/bti504. [DOI] [PubMed] [Google Scholar]
- Karolchik D, Barber GP, Casper J, Clawson H, Cline MS, Diekhans M, Dreszer TR, Fujita PA, Guruvadoo L, Haeussler M, et al. The UCSC Genome Browser database: 2014 update. Nucleic acids research. 2014;42:D764–70. doi: 10.1093/nar/gkt1168. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution. 2013;30:772–780. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kent WJ. BLAT—The BLAST-Like Alignment Tool. Genome research. 2002;12:656–664. doi: 10.1101/gr.229202. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kiryu H, Tabei Y, Kin T, Asai K. Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics. 2007;23:1588–1598. doi: 10.1093/bioinformatics/btm146. [DOI] [PubMed] [Google Scholar]
- Knudsen B, Hein J. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic acids research. 2003;31:3423–3428. doi: 10.1093/nar/gkg614. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kwok CK, Tang Y, Assmann SM, Bevilacqua PC. The RNA structurome: transcriptome-wide structure probing with next-generation sequencing. Trends in biochemical sciences. 2015;40:221–232. doi: 10.1016/j.tibs.2015.02.005. [DOI] [PubMed] [Google Scholar]
- Lai D, Meyer IM. A comprehensive comparison of general RNA–RNA interaction prediction methods. Nucleic acids research. 2015 doi: 10.1093/nar/gkv1477. Available at: http://nar.oxfordjournals.org/content/early/2015/12/15/nar.gkv1477.abstract. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin ML, Pace NR. Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proceedings of the National Academy of Sciences of the United States of America. 1985;82:6955–6959. doi: 10.1073/pnas.82.20.6955. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23:2947–2948. doi: 10.1093/bioinformatics/btm404. [DOI] [PubMed] [Google Scholar]
- Lindgreen S, Gardner PP, Krogh A. MASTR: multiple alignment and structure prediction of non-coding RNAs using simulated annealing. Bioinformatics. 2007;23:3304–3311. doi: 10.1093/bioinformatics/btm525. [DOI] [PubMed] [Google Scholar]
- Lindgreen S, Umu SU, Lai AS-W, Eldai H, Liu W, McGimpsey S, Wheeler NE, Biggs PJ, Thomson NR, Barquist L, et al. Robust identification of noncoding RNA from transcriptomes requires phylogenetically-informed sampling. PLoS computational biology. 2014;10:e1003907. doi: 10.1371/journal.pcbi.1003907. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Livny J, Fogel MA, Davis BM, Waldor MK. sRNAPredict: an integrative computational approach to identify sRNAs in bacterial genomes. Nucleic acids research. 2005;33:4096–4105. doi: 10.1093/nar/gki715. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Löytynoja A, Goldman N. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science. 2008;320:1632–1635. doi: 10.1126/science.1158395. [DOI] [PubMed] [Google Scholar]
- Marz M, Donath A, Verstraete N, Nguyen VT, Stadler PF, Bensaude O. Evolution of 7SK RNA and its protein partners in metazoa. Molecular biology and evolution. 2009;26:2821–2830. doi: 10.1093/molbev/msp198. [DOI] [PubMed] [Google Scholar]
- Marz M, Mosig A, Podlevsky JD, Stadler PF. The common ancestral core of vertebrate and fungal telomerase RNAs. Nucleic acids. 2012 doi: 10.1093/nar/gks980. Available at: http://nar.oxfordjournals.org/content/early/2012/10/23/nar.gks980.short. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McWilliam H, Li W, Uludag M, Squizzato S, Park YM, Buso N, Cowley AP, Lopez R. Analysis Tool Web Services from the EMBL-EBI. Nucleic acids research. 2013;41:W597–600. doi: 10.1093/nar/gkt376. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Menzel P, Gorodkin J, Stadler PF. The tedious task of finding homologous noncoding RNA genes. RNA. 2009;15:2075–2082. doi: 10.1261/rna.1556009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mituyama T, Yamada K, Hattori E, Okida H, Ono Y, Terai G, Yoshizawa A, Komori T, Asai K. The Functional RNA Database 3.0: databases to support mining and annotation of functional RNAs. Nucleic acids research. 2009;37:D89–92. doi: 10.1093/nar/gkn805. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mosig A, Chen JJ-L, Stadler PF. Algorithms in Bioinformatics Lecture Notes in Computer Science. Springer; Berlin Heidelberg: 2007. Homology Search with Fragmented Nucleic Acid Sequence Patterns; pp. 335–345. [Google Scholar]
- Mosig A, Zhu L, Stadler PF. Customized strategies for discovering distant ncRNA homologs. Briefings in functional genomics & proteomics. 2009;8:451–460. doi: 10.1093/bfgp/elp035. [DOI] [PubMed] [Google Scholar]
- Myslinski E, Ségault V, Branlant C. An intron in the genes for U3 small nucleolar RNAs of the yeast Saccharomyces cerevisiae. Science. 1990;247:1213–1216. doi: 10.1126/science.1690452. [DOI] [PubMed] [Google Scholar]
- Nawrocki EP. Annotating functional RNAs in genomes using Infernal. Methods in molecular biology. 2014;1097:163–197. doi: 10.1007/978-1-62703-709-9_9. [DOI] [PubMed] [Google Scholar]
- Nawrocki EP, Burge SW, Bateman A, Daub J, Eberhardt RY, Eddy SR, Floden EW, Gardner PP, Jones TA, Tate J, et al. Rfam 12.0: updates to the RNA families database. Nucleic acids research. 2015;43:D130–7. doi: 10.1093/nar/gku1063. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29:2933–2935. doi: 10.1093/bioinformatics/btt509. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nawrocki EP, Eddy SR. Query-dependent banding (QDB) for faster RNA similarity searches. PLoS computational biology. 2007 doi: 10.1371/journal.pcbi.0030056. Available at: http://dx.plos.org/10.1371/journal.pcbi.0030056. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nawrocki EP, Kolbe DL, Eddy SR. Infernal 1.0: inference of RNA alignments. Bioinformatics. 2009;25:1335–1337. doi: 10.1093/bioinformatics/btp157. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology. 1970;48:443–453. doi: 10.1016/0022-2836(70)90057-4. [DOI] [PubMed] [Google Scholar]
- Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. Journal of molecular biology. 2000;302:205–217. doi: 10.1006/jmbi.2000.4042. [DOI] [PubMed] [Google Scholar]
- Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ. Algorithms for Loop Matchings. SIAM journal on applied mathematics. 1978;35:68–82. [Google Scholar]
- Pace NR, Smith DK, Olsen GJ, James BD. Phylogenetic comparative analysis and the secondary structure of ribonuclease P RNA—a review. Gene. 1989;82:65–75. doi: 10.1016/0378-1119(89)90031-0. [DOI] [PubMed] [Google Scholar]
- Perkins TT, Kingsley RA, Fookes MC, Gardner PP, James KD, Yu L, Assefa SA, He M, Croucher NJ, Pickard DJ, et al. A strand-specific RNA-Seq analysis of the transcriptome of the typhoid bacillus Salmonella typhi. PLoS genetics. 2009;5:e1000569. doi: 10.1371/journal.pgen.1000569. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Puton T, Kozlowski LP, Rother KM, Bujnicki JM. CompaRNA: a server for continuous benchmarking of automated methods for RNA secondary structure prediction. Nucleic acids research. 2014;42:5403–5406. doi: 10.1093/nar/gku208. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Reeder J, Giegerich R. Consensus shapes: an alternative to the Sankoff algorithm for RNA consensus structure prediction. Bioinformatics. 2005;21:3516–3523. doi: 10.1093/bioinformatics/bti577. [DOI] [PubMed] [Google Scholar]
- Richter A, Backofen R. Accessibility and conservation: General features of bacterial small RNA–mRNA interactions? RNA biology. 2012;9:954–965. doi: 10.4161/rna.20294. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rinn JL, Chang HY. Genome regulation by long noncoding RNAs. Annual review of biochemistry. 2012;81:145–166. doi: 10.1146/annurev-biochem-051410-092902. [DOI] [PMC free article] [PubMed] [Google Scholar]
- RNAcentral Consortium. RNAcentral: an international database of ncRNA sequences. Nucleic acids research. 2015;43:D123–9. doi: 10.1093/nar/gku991. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Roberts E, Eargle J, Wright D, Luthey-Schulten Z. MultiSeq: unifying sequence and structure data for evolutionary analysis. BMC bioinformatics. 2006;7:382. doi: 10.1186/1471-2105-7-382. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rocca-Serra P, Bellaousov S, Birmingham A, Chen C, Cordero P, Das R, Davis-Neulander L, Duncan CDS, Halvorsen M, Knight R, et al. Sharing and archiving nucleic acid structure mapping data. RNA. 2011;17:1204–1212. doi: 10.1261/rna.2753211. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rost B. Twilight zone of protein sequence alignments. Protein engineering. 1999;12:85–94. doi: 10.1093/protein/12.2.85. [DOI] [PubMed] [Google Scholar]
- Sakakibara Y, Brown M, Hughey R, Mian IS. Stochastic context-free grammers for tRNA modeling. Nucleic acids. 1994 doi: 10.1093/nar/22.23.5112. Available at: http://nar.oxfordjournals.org/content/22/23/5112.short. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sankoff D. Simultaneous Solution of the RNA Folding, Alignment and Protosequence Problems. SIAM journal on applied mathematics. 1985;45:810–825. [Google Scholar]
- Schneider KL, Pollard KS, Baertsch R, Pohl A, Lowe TM. The UCSC Archaeal Genome Browser. Nucleic acids research. 2006;34:D407–10. doi: 10.1093/nar/gkj134. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schu DJ, Zhang A, Gottesman S, Storz G. Alternative Hfq-sRNA interaction modes dictate alternative mRNA recognition. The EMBO journal. 2015;34:2557–2573. doi: 10.15252/embj.201591569. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schwartz AS, Myers EW, Pachter L. Alignment Metric Accuracy. arXiv [q-bio.QM] 2005 Available at: http://arxiv.org/abs/q-bio/0510052. [Google Scholar]
- Schwartz AS, Pachter L. Multiple alignment by sequence annealing. Bioinformatics. 2007;23:e24–9. doi: 10.1093/bioinformatics/btl311. [DOI] [PubMed] [Google Scholar]
- Seemann SE, Richter AS, Gesell T, Backofen R, Gorodkin J. PETcofold: predicting conserved interactions and structures of two multiple alignments of RNA sequences. Bioinformatics. 2011;27:211–219. doi: 10.1093/bioinformatics/btq634. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sharma CM, Hoffmann S, Darfeuille F, Reignier J, Findeiss S, Sittka A, Chabas S, Reiche K, Hackermüller J, Reinhardt R, et al. The primary transcriptome of the major human pathogen Helicobacter pylori. Nature. 2010;464:250–255. doi: 10.1038/nature08756. [DOI] [PubMed] [Google Scholar]
- Silvester N, Alako B, Amid C, Cerdeño-Tárraga A, Cleland I, Gibson R, Goodgame N, Ten Hoopen P, Kay S, Leinonen R, et al. Content discovery and retrieval services at the European Nucleotide Archive. Nucleic acids research. 2015;43:D23–9. doi: 10.1093/nar/gku1129. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Smith C, Heyne S, Richter AS, Will S, Backofen R. Freiburg RNA Tools: a web server integrating INTARNA, EXPARNA and LOCARNA. Nucleic acids research. 2010;38:W373–7. doi: 10.1093/nar/gkq316. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Spitale RC, Flynn RA, Torre EA, Kool ET, Chang HY. RNA structural analysis by evolving SHAPE chemistry. Wiley Interdisciplinary Reviews: RNA. 2014;5:867–881. doi: 10.1002/wrna.1253. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stadler PF, Chen JJ-L, Hackermüller J, Hoffmann S, Horn F, Khaitovich P, Kretzschmar AK, Mosig A, Prohaska SJ, Qi X, et al. Evolution of vault RNAs. Molecular biology and evolution. 2009;26:1975–1991. doi: 10.1093/molbev/msp112. [DOI] [PubMed] [Google Scholar]
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JGR, Korf I, Lapp H, et al. The Bioperl toolkit: Perl modules for the life sciences. Genome research. 2002;12:1611–1618. doi: 10.1101/gr.361602. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stombaugh J, Widmann J, McDonald D, Knight R. Boulder ALignment Editor (ALE): a web-based RNA alignment tool. Bioinformatics. 2011;27:1706–1707. doi: 10.1093/bioinformatics/btr258. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Suchard MA, Redelings BD. BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics. 2006;22:2047–2048. doi: 10.1093/bioinformatics/btl175. [DOI] [PubMed] [Google Scholar]
- Sullivan MJ, Petty NK, Beatson SA. Easyfig: a genome comparison visualizer. Bioinformatics. 2011;27:1009–1010. doi: 10.1093/bioinformatics/btr039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tabei Y, Kiryu H, Kin T, Asai K. A fast structural multiple alignment method for long RNA sequences. BMC bioinformatics. 2008;9:33. doi: 10.1186/1471-2105-9-33. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Thompson JD, Plewniak F, Poch O. A comprehensive comparison of multiple sequence alignment programs. Nucleic acids research. 1999;27:2682–2690. doi: 10.1093/nar/27.13.2682. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in bioinformatics. 2013;14:178–192. doi: 10.1093/bib/bbs017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Torarinsson E, Havgaard JH, Gorodkin J. Multiple structural alignment and clustering of RNA sequences. Bioinformatics. 2007;23:926–932. doi: 10.1093/bioinformatics/btm049. [DOI] [PubMed] [Google Scholar]
- Torarinsson E, Lindgreen S. WAR: Webserver for aligning structural RNAs. Nucleic acids research. 2008;36:W79–84. doi: 10.1093/nar/gkn275. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Vogel J. A rough guide to the non-coding RNA world of Salmonella. Molecular microbiology. 2009 doi: 10.1111/j.1365-2958.2008.06505.x. Available at http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2958.2008.06505.x/full. [DOI] [PubMed] [Google Scholar]
- Waldispühl J, Kam A, Gardner PP. Crowdsourcing RNA structural alignments with an online computer game. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing. 2015:330–341. [PubMed] [Google Scholar]
- Wassarman KM, Storz G. 6S RNA regulates E. coli RNA polymerase activity. Cell. 2000;101:613–623. doi: 10.1016/s0092-8674(00)80873-9. [DOI] [PubMed] [Google Scholar]
- Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ. Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009;25:1189–1191. doi: 10.1093/bioinformatics/btp033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wehner S, Damm K, Hartmann RK, Marz M. Dissemination of 6S RNA among bacteria. RNA biology. 2014;11:1467–1478. doi: 10.4161/rna.29894. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Weinberg Z, Kim PB, Chen TH, Li S, Harris KA, Lünse CE, Breaker RR. New classes of self-cleaving ribozymes revealed by comparative genomics analysis. Nature chemical biology. 2015;11:606–610. doi: 10.1038/nchembio.1846. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Weinberg Z, Wang JX, Bogue J, Yang J, Corbino K, Moy RH, Breaker RR. Comparative genomics reveals 104 candidate structured RNAs from bacteria, archaea, and their metagenomes. Genome biology. 2010;11:R31. doi: 10.1186/gb-2010-11-3-r31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Westesson O, Barquist L, Holmes I. HandAlign: Bayesian multiple sequence alignment, phylogeny and ancestral reconstruction. Bioinformatics. 2012;28:1170–1171. doi: 10.1093/bioinformatics/bts058. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wheeler TJ, Eddy SR. nhmmer: DNA homology search with profile HMMs. Bioinformatics. 2013;29:2487–2489. doi: 10.1093/bioinformatics/btt403. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS computational biology. 2007;3:e65. doi: 10.1371/journal.pcbi.0030065. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Woese CR, Gutell R, Gupta R, Noller HF. Detailed analysis of the higher-order structure of 16S-like ribosomal ribonucleic acids. Microbiological reviews. 1983;47:621–669. doi: 10.1128/mr.47.4.621-669.1983. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wong KM, Suchard MA, Huelsenbeck JP. Alignment uncertainty and genomic analysis. Science. 2008;319:473–476. doi: 10.1126/science.1151532. [DOI] [PubMed] [Google Scholar]
- Xie M, Mosig A, Qi X, Li Y, Stadler PF, Chen JJ-L. Size variation and structural conservation of vertebrate telomerase RNA. The Journal of biological chemistry. 2008;283:2049–2059. doi: 10.1074/jbc.M708032200. [DOI] [PubMed] [Google Scholar]
- Xu X, Ji Y, Stormo GD. RNA Sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment. Bioinformatics. 2007;23:1883–1891. doi: 10.1093/bioinformatics/btm272. [DOI] [PubMed] [Google Scholar]
- Yao Z, Weinberg Z, Ruzzo WL. CMfinder—a covariance model based RNA motif finding algorithm. Bioinformatics. 2006;22:445–452. doi: 10.1093/bioinformatics/btk008. [DOI] [PubMed] [Google Scholar]