Abstract
Summary: infernal builds consensus RNA secondary structure profiles called covariance models (CMs), and uses them to search nucleic acid sequence databases for homologous RNAs, or to create new sequence- and structure-based multiple sequence alignments.
Availability: Source code, documentation and benchmark are downloadable from http://infernal.janelia.org. infernal is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and Mac OS X.
Contact: nawrockie,kolbed,eddys@janelia.hhmi.org
1 INTRODUCTION
When searching for homologous structural RNAs in sequence databases, it is desirable to score both primary sequence and secondary structure conservation. The most generally useful tools that integrate sequence and structure take as input any RNA (or RNA multiple alignment), and automatically construct an appropriate statistical scoring system that allows quantitative ranking of putative homologs in a sequence database (Gautheret and Lambert, 2001; Huang et al., 2008; Zhang et al., 2005). Stochastic context-free grammars (SCFGs) provide a natural statistical framework for combining sequence and (non-pseudoknotted) secondary structure conservation information in a single consistent scoring system (Brown, 2000; Durbin et al., 1998; Eddy and Durbin, 1994; Sakakibara et al., 1994).
Here, we announce the 1.0 release of infernal, an implementation of a general SCFG-based approach for RNA database searches and multiple alignment. infernal builds consensus RNA profiles called covariance models (CMs), a special case of SCFGs designed for modeling RNA consensus sequence and structure. It uses CMs to search nucleic acid sequence databases for homologous RNAs, or to create new sequence- and structure-based multiple sequence alignments. One use of infernal is to annotate RNAs in genomes in conjunction with the Rfam database (Gardner et al., 2009), which contains hundreds of RNA families. Rfam follows a seed profile strategy, in which a well-annotated ‘seed’ alignment of each family is curated, and a CM built from that seed alignment is used to identify and align additional members of the family. infernal has been in use since 2002, but 1.0 is the first version that we consider to be a reasonably complete production tool. It now includes E-value estimates for the statistical significance of database hits, and heuristic acceleration algorithms for both database searches and multiple alignment that allow infernal to be deployed in a variety of real RNA analysis tasks with manageable (albeit high) computational requirements.
2 USAGE
A CM is built from a Stockholm format multiple sequence alignment (or single RNA sequence) with consensus secondary structure annotation marking which positions of the alignment are single stranded and which are base paired (Eddy, 2009). CMs assign position-specific scores for the four possible residues at single-stranded positions, the 16 possible base pairs at paired positions and for insertions and deletions. These scores are log-odds scores derived from the observed counts of residues, base pairs, insertions and deletions in the input alignment, combined with prior information derived from structural ribosomal RNA alignments. CM parameterization has been described in more detail elsewhere (Eddy, 2002, 2009; Eddy and Durbin, 1994; Klein and Eddy, 2003; Nawrocki and Eddy, 2007).
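As a minimal sketch of how a position-specific log-odds score can be derived from observed counts, the following Python fragment scores the four residues at one single-stranded position. The flat pseudocount vector and the uniform background are illustrative assumptions, not the prior that infernal actually uses; the cited references describe the real parameterization.

```python
import math

def log_odds_scores(counts, pseudocounts, background):
    """Return log2-odds scores for each residue at one single-stranded position.

    counts:       observed residue counts in one alignment column
    pseudocounts: stand-in for prior information (hypothetical values)
    background:   null-model residue frequencies
    """
    total = sum(counts[r] + pseudocounts[r] for r in background)
    scores = {}
    for r in background:
        # Mean posterior estimate of the emission probability.
        p = (counts[r] + pseudocounts[r]) / total
        # Log-odds relative to the background, in bits.
        scores[r] = math.log2(p / background[r])
    return scores

# Toy column: 8 A, 0 C, 1 G, 1 U (made-up counts for illustration).
counts = {"A": 8, "C": 0, "G": 1, "U": 1}
pseudocounts = {"A": 1.0, "C": 1.0, "G": 1.0, "U": 1.0}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}

scores = log_odds_scores(counts, pseudocounts, background)
for residue, bits in scores.items():
    print(f"{residue}: {bits:+.2f} bits")
```

Residues observed more often than the background expects receive positive scores, and unobserved residues receive negative but finite scores because of the pseudocounts. Base-paired positions would be handled analogously over the 16 possible pairs.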
infernal is composed of several programs that are used in combination by following four basic steps:
1. Build a CM from a structural alignment with cmbuild.
2. Calibrate a CM for homology search with cmcalibrate.
3. Search databases for putative homologs with cmsearch.
4. Align putative homologs to a CM with cmalign.
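The four steps can be sketched as a short Python helper that assembles one command line per program. The file names are hypothetical, and the positional argument order (CM file first, then the alignment or sequence file) is our reading of the user's guide (Eddy, 2009); it should be checked against each program's own help output before use.

```python
import shlex

def infernal_commands(alignment, cm_file, database, sequences):
    """Return argument lists for the build, calibrate, search and align steps.

    All four file names are supplied by the caller; nothing is executed here.
    """
    return [
        ["cmbuild", cm_file, alignment],   # 1. build a CM from a Stockholm alignment
        ["cmcalibrate", cm_file],          # 2. calibrate the CM (optional, slow)
        ["cmsearch", cm_file, database],   # 3. search a sequence database
        ["cmalign", cm_file, sequences],   # 4. align putative homologs to the CM
    ]

# Hypothetical file names, for illustration only.
commands = infernal_commands("my.sto", "my.cm", "genome.fa", "hits.fa")
for argv in commands:
    print(shlex.join(argv))
```

Each argument list could be passed to `subprocess.run(argv, check=True)` on a machine where infernal is installed. Building the lists separately from running them makes it easy to skip the calibration step when E-values are not needed.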
The calibration step is optional and computationally expensive (4 h on a 3.0 GHz Intel Xeon for a CM of a typical RNA family of length 100 nt), but is required to obtain E-values that estimate the statistical significance of hits in a database search. cmcalibrate will also determine appropriate hidden Markov model (HMM) filter thresholds for accelerating searches without an appreciable loss of sensitivity. Each model only needs to be calibrated once.
3 PERFORMANCE
A published benchmark (independent of our lab) (Freyhult et al., 2007) and our own internal benchmark used during development (Nawrocki and Eddy, 2007) both find that infernal and other CM-based methods are the most sensitive and specific tools for structural RNA homology search among those tested. Figure 1 shows updated results of our internal benchmark comparing infernal 1.0 with the previous version (0.72) that was benchmarked in Freyhult et al. (2007), and also with family-pairwise search using BLASTN (Altschul et al., 1997; Grundy, 1998). infernal's sensitivity and specificity have greatly improved, mainly due to three improvements in the implementation (Eddy, 2009): a biased composition correction to the raw log-odds scores; the use of Inside log likelihood scores (the summed score of all possible alignments of the target sequence) in place of CYK scores (the single maximum likelihood alignment score); and the introduction of approximate E-value estimates for the scores.
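The difference between the CYK and Inside scores can be illustrated with a toy calculation. The sketch below assumes that the log-odds scores (in bits) of every alternative alignment of one target sequence are already available as a list, which is a simplification for illustration; real implementations obtain both quantities by dynamic programming without enumerating alignments.

```python
import math

def cyk_score(alignment_bits):
    """Score of the single best alignment (maximum over alignments)."""
    return max(alignment_bits)

def inside_score(alignment_bits):
    """Log2 of the summed odds ratios over all alignments.

    The maximum is factored out before exponentiating for numerical stability.
    """
    best = max(alignment_bits)
    return best + math.log2(sum(2.0 ** (s - best) for s in alignment_bits))

# Hypothetical bit scores of three alternative alignments of one sequence.
alignment_bits = [20.0, 19.0, 15.0]

cyk = cyk_score(alignment_bits)
inside = inside_score(alignment_bits)
print(f"CYK:    {cyk:.2f} bits")
print(f"Inside: {inside:.2f} bits")
```

The Inside score is never lower than the CYK score, and it exceeds it most when several near-optimal alignments share the probability mass, as is common for remote homologs with ambiguous alignments.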
The benchmark dataset used in Figure 1 includes query alignments and test sequences from 51 Rfam (release 7) families [details in Nawrocki and Eddy (2007)]. No query sequence is >60% identical to a test sequence. The 450 total test sequences were embedded at random positions in a 10 Mb ‘pseudogenome’. Previously, we generated the pseudogenome sequence from a uniform residue frequency distribution (Nawrocki and Eddy, 2007). Because base composition biases in the target sequence database cause the most serious problems in separating significant CM hits from noise, we improved the realism of the benchmark by generating the pseudogenome sequence from a 15-state fully connected HMM trained by Baum–Welch expectation maximization (Durbin et al., 1998) on genome sequence data from a wide variety of species. Each of the 51 query alignments was used to build a CM and search the pseudogenome. A single list of all hits for all families was collected and ranked, and true and false hits were defined as described in Nawrocki and Eddy (2007), producing the ROC curves in Figure 1.
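The ranking step can be sketched as follows. This is an illustrative reconstruction, not the benchmark code: it assumes each pooled hit is an (E-value, is_true) pair and that the curve plots sensitivity against false positives per Mb of searched sequence; the hit list itself is made up.

```python
def roc_points(hits, n_true, db_mb):
    """Walk a pooled hit list in order of increasing E-value.

    hits:   list of (e_value, is_true) pairs pooled over all families
    n_true: total number of true test sequences in the database
    db_mb:  size of the searched database in Mb
    Returns a list of (false positives per Mb, sensitivity) points.
    """
    points = []
    true_pos = false_pos = 0
    for _e_value, is_true in sorted(hits, key=lambda hit: hit[0]):
        if is_true:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / db_mb, true_pos / n_true))
    return points

# Made-up hits from a hypothetical search of a 10 Mb database with 4 true sequences.
hits = [(2.0, True), (1e-20, True), (0.5, False), (30.0, False), (1e-8, True)]
points = roc_points(hits, n_true=4, db_mb=10.0)
for fp_per_mb, sensitivity in points:
    print(f"{fp_per_mb:.1f} FP/Mb  sensitivity {sensitivity:.2f}")
```

Pooling hits from all families into one ranked list, rather than ranking per family, rewards methods whose scores are comparable across models, which is one reason E-value estimates matter for this benchmark.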
infernal searches require a large amount of compute time [our 10 Mb benchmark search takes about 30 h per model on average (Fig. 1)]. To alleviate this, infernal 1.0 implements two rounds of filtering. When appropriate, the HMM filtering technique described by Weinberg and Ruzzo (2006) is applied first with filter thresholds configured by cmcalibrate [occasionally a model with little primary sequence conservation cannot be usefully accelerated by a primary sequence-based filter, as explained in Eddy (2009)]. The query-dependent banded (QDB) CYK maximum likelihood search algorithm is used as a second filter with relatively tight bands [β = 10⁻⁷; the β parameter is the subtree length probability mass excluded by imposing the bands, as explained in Nawrocki and Eddy (2007)]. Any sequence fragments that survive the filters are searched a final time with the Inside algorithm [again using QDB, but with looser bands (β = 10⁻¹⁵)]. In our benchmark, the default filters accelerate similarity search by about 30-fold overall, while sacrificing a small amount of sensitivity (Fig. 1). This makes version 1.0 substantially faster than 0.72. BLAST is still orders of magnitude faster, but significantly less sensitive than infernal. Further acceleration remains a major goal of infernal development.
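The control flow of the two filter rounds followed by a final Inside pass can be sketched as a cascade in which each stage only sees what the previous stage let through. The scoring functions, thresholds and candidate windows below are hypothetical placeholders chosen to make the example runnable; they do not reproduce infernal's scoring or its calibrated thresholds.

```python
def filtered_search(windows, hmm_score, cyk_score, inside_score,
                    hmm_threshold, cyk_threshold, report_threshold):
    """Apply a cheap filter, a moderately expensive filter, then the final scorer."""
    hits = []
    for window in windows:
        # Round 1: fast sequence-only HMM filter.
        if hmm_score(window) < hmm_threshold:
            continue
        # Round 2: banded CYK filter (tight bands in the real program).
        if cyk_score(window) < cyk_threshold:
            continue
        # Final pass: Inside score, computed only for survivors.
        final = inside_score(window)
        if final >= report_threshold:
            hits.append((window, final))
    return hits

# Toy scorers (assumptions): count G/C residues as a stand-in for a real score.
def toy_hmm(window):
    return sum(window.count(base) for base in "GC")

def toy_cyk(window):
    return 2.0 * toy_hmm(window)

def toy_inside(window):
    return toy_cyk(window) + 0.5

windows = ["GGCC", "AAUU", "GCAU"]
hits = filtered_search(windows, toy_hmm, toy_cyk, toy_inside,
                       hmm_threshold=2, cyk_threshold=6.0, report_threshold=5.0)
print(hits)
```

In this toy run the second window is rejected by the first filter and the third by the second filter, so the most expensive scorer is called only once. The speedup of a cascade of this kind comes from that ordering, at the cost of losing any true hit that a filter wrongly rejects.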
The computational cost of CM alignment with cmalign has been a limitation of previous versions of infernal. Version 1.0 now uses a constrained dynamic programming approach first developed by Brown (2000) that uses sequence-specific bands derived from a first-pass HMM alignment. This technique offers a dramatic speedup relative to unconstrained alignment, especially for large RNAs such as small and large subunit (SSU and LSU, respectively) ribosomal RNAs, which can now be aligned in roughly 1 and 3 s per sequence, respectively, as opposed to 12 min and 3 h in previous versions. This acceleration has facilitated the adoption of infernal by RDP, one of the main ribosomal RNA databases (Cole et al., 2009).
infernal is now a faster and more sensitive tool for RNA sequence analysis. Version 1.0's heuristic acceleration techniques make some important applications possible on a single desktop computer in less than an hour, such as searching a prokaryotic genome for a particular RNA family, or aligning a few thousand SSU rRNA sequences. Nonetheless, infernal remains computationally expensive, and many problems of interest require the use of a cluster. The most expensive programs (cmcalibrate, cmsearch and cmalign) are implemented in coarse-grained parallel MPI versions which divide the workload into independent units, each of which is run on a separate processor.
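The coarse-grained strategy of dividing the workload into independent units can be sketched in a few lines. This example uses a Python thread pool purely as a stand-in for MPI processes, and the per-unit work is a placeholder; it illustrates the partitioning idea only and is not how infernal's MPI versions are implemented.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_units(items, n_units):
    """Split items into n_units contiguous chunks of nearly equal size."""
    size, extra = divmod(len(items), n_units)
    units, start = [], 0
    for i in range(n_units):
        end = start + size + (1 if i < extra else 0)
        units.append(items[start:end])
        start = end
    return units

def process_unit(unit):
    """Placeholder for the real per-unit work (e.g. aligning each sequence)."""
    return [len(sequence) for sequence in unit]

def run_parallel(items, n_workers):
    """Process each unit on its own worker and merge results in input order."""
    units = split_into_units(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        unit_results = list(pool.map(process_unit, units))
    return [result for unit in unit_results for result in unit]

# Made-up sequences for illustration.
sequences = ["ACGU", "GG", "AUAUAU", "C", "GCGCG"]
lengths = run_parallel(sequences, n_workers=2)
print(lengths)
```

Because the units share no state, the only coordination needed is distributing the input and collecting the results, which is why this style of parallelism suits independent database chunks or independent sequences to align.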
ACKNOWLEDGEMENTS
We thank Goran Ceric for his peerless skill in managing Janelia Farm's high-performance computing resources.
Funding: infernal development is supported by the Howard Hughes Medical Institute. It has been supported in the past by an NIH NHGRI training grant (T32-HG000045) to E.P.N., an NSF Graduate Fellowship to D.L.K., NIH R01-HG01363 and a generous endowment from Alvin Goldfarb.
Conflict of Interest: none declared.
REFERENCES
- Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
- Brown MP. Small subunit ribosomal RNA modeling using stochastic context-free grammars. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2000;8:57–66.
- Cole JR, et al. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009;37:D141–D145. doi: 10.1093/nar/gkn879.
- Durbin R, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press; 1998.
- Eddy SR. A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics. 2002;3:18. doi: 10.1186/1471-2105-3-18.
- Eddy SR. The Infernal user's guide. 2009. Available at http://infernal.janelia.org/ (last accessed March 27, 2009).
- Eddy SR, Durbin R. RNA sequence analysis using covariance models. Nucleic Acids Res. 1994;22:2079–2088. doi: 10.1093/nar/22.11.2079.
- Freyhult EK, et al. Exploring genomic dark matter: a critical assessment of the performance of homology search methods on noncoding RNA. Genome Res. 2007;17:117–125. doi: 10.1101/gr.5890907.
- Gardner PP, et al. Rfam: updates to the RNA families database. Nucleic Acids Res. 2009;37:D136–D140. doi: 10.1093/nar/gkn766.
- Gautheret D, Lambert A. Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles. J. Mol. Biol. 2001;313:1003–1011. doi: 10.1006/jmbi.2001.5102.
- Grundy WN. Homology detection via family pairwise search. J. Comput. Biol. 1998;5:479–491. doi: 10.1089/cmb.1998.5.479.
- Huang Z, et al. Fast and accurate search for non-coding RNA pseudoknot structures in genomes. Bioinformatics. 2008;24:2281–2287. doi: 10.1093/bioinformatics/btn393.
- Klein RJ, Eddy SR. RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics. 2003;4:44. doi: 10.1186/1471-2105-4-44.
- Nawrocki EP, Eddy SR. Query-dependent banding (QDB) for faster RNA similarity searches. PLoS Comput. Biol. 2007;3:e56. doi: 10.1371/journal.pcbi.0030056.
- Sakakibara Y, et al. Stochastic context-free grammars for tRNA modeling. Nucleic Acids Res. 1994;22:5112–5120. doi: 10.1093/nar/22.23.5112.
- Weinberg Z, Ruzzo WL. Sequence-based heuristics for faster annotation of non-coding RNA families. Bioinformatics. 2006;22:35–39. doi: 10.1093/bioinformatics/bti743.
- Zhang S, et al. Searching genomes for noncoding RNA using FastR. IEEE/ACM Trans. Comput. Biol. Bioinform. 2005;2:366–379. doi: 10.1109/TCBB.2005.57.