Abstract
Summary: Here, we present riboPicker, a robust framework for the rapid, automated identification and removal of ribosomal RNA sequences from metatranscriptomic datasets. The results can be exported for subsequent analysis, and the databases used for the web-based version are updated on a regular basis. riboPicker categorizes rRNA-like sequences and provides graphical visualizations and tabular outputs of ribosomal coverage, alignment results and taxonomic classifications.
Availability and implementation: This open-source application was implemented in Perl and can be used as stand-alone version or accessed online through a user-friendly web interface. The source code, user help and additional information is available at http://ribopicker.sourceforge.net/.
Contact: rschmied@sciences.sdsu.edu; rschmied@sciences.sdsu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Metatranscriptomic approaches are drastically improving our understanding of metabolism and gene expression in microbial communities. By investigating all functional mRNA transcripts isolated from an environmental sample, metatranscriptomic analyses provide insights into the metabolic pathways important for a community at the time of sampling. Although metatranscriptomes are used to investigate metabolic activities, the majority of RNA recovered in metatranscriptomic studies is ribosomal RNA (rRNA), often exceeding 90% of the total reads (Stewart et al., 2010). Even after various treatments prior to sequencing, the observed rRNA content decreases only slightly (He et al., 2010) and metatranscriptomes still contain significant amounts of rRNA.
Although rRNA-like sequences are occasionally removed from metatranscriptomes, the removal is performed only with a subset of the publicly available rRNA sequences. Failure to remove all rRNA sequences can lead to misclassifications and erroneous conclusions during the downstream analysis. It is estimated that misannotations of rRNA as proteins may cause up to 90% false positive matches of rRNA-like sequences in metatranscriptomic studies (Tripp et al., 2011). The potential for false positives arrises from a failure to completely remove all rRNA prior to translating the putative rRNA and querying a protein database. The rRNA operons in Bacteria and Archaea are not known to contain expressed protein coding regions that at the same time code for rRNA and therefore, annotations of proteins in rRNA coding regions should be presumed to be misannotations (Aziz et al., 2008). Metagenomic sequence data generated to asses the metabolic potential of a community will also be affected by false positive matches of rRNA sequences when querying a protein database. Therefore, transcript analysis should only proceed after it has been verified that all rRNA-like sequences have been found and removed from the dataset to allow accurate identification of the transcribed functional content. The high-throughput nature of community sequencing efforts necessitates better tools for the automated preprocessing of sequence datasets.
Here, we describe an application able to provide graphical guidance and to perform identification, classification and removal of rRNA-like sequences on metatranscriptomic data. The application incorporates a modified version of the BWA-SW program (http://bioinformatics.oxfordjournals.org/citmgr?gca=bioinfo;26/5/589), and is publicly available through a user-friendly web interface and as stand-alone version. The web interface allows online analysis using rRNA sequences from public databases and provides data export for subsequent analysis.
2 METHODS
2.1 Implementation and computational platform
The riboPicker application was implemented as stand-alone and web-based version in Perl. The web application is currently running on a web server with Ubuntu Linux using an Apache HTTP server to support the web services. The alignments are computed on a connected computing cluster with 10 working nodes (each with 8 CPUs and 16 GB RAM) running the Oracle Grid Engine version 6.2. All graphics are generated using the Cairo graphics library (http://cairographics.org/).
2.2 Identification of rRNA-like sequences
The identification of rRNA-like sequences is based on sequence alignments using a modified version of the BWA-SW program. The modifications do not change the default behavior of the algorithm and include parameter forced changes in the alignment of ambiguous bases and the generation of an alternative output. The documentation provides a detailed list of changes and is available on the program website.
riboPicker uses query sequence coverage, alignment identity and minimum alignment length thresholds to determine if an input sequence is an rRNA-like sequence or not. This approach is based on the idea that looking for similar regions consists of grouping sequences that share some minimum sequence similarity over a specified minimum length. Threshold percentage values are rounded toward the lower integer and should not be set to 100% if errors are expected in the input sequences. The results for multiple databases are automatically joined before generating any outputs.
Using simulated datasets, we evaluated the classification of rRNA-like sequences and showed that riboPicker performed with high accuracy comparable to the latest version of meta_rna (Huang et al., 2009) and BLASTn (Supplementary Material). A comparison on real metatranscriptomic data showed that riboPicker processes data more than twice as fast as Hidden Markov Model (HMM)-based programs and >100 times faster than BLASTn (Supplementary Material).
2.3 Reference databases
The web-based version offers preprocessed databases for 5S/5.8S,16S/18S and 23S/28S rRNA sequences from a variety of resources, currently including SILVA (Pruesse et al., 2007), RDP (Cole et al., 2009), Greengenes (DeSantis et al., 2006), Rfam (Gardner et al., 2011), NCBI (Sayers et al., 2011) and HMP DACC (The NIH HMP Working Group et al., 2009). To reduce the number of possibly misannotated entries, sequences were filtered by length to remove very short and long sequences and by genomic location to remove overlapping rRNA misannotations. The remaining sequences were then converted into DNA sequences (if required) and filtered for read duplicates to reduce redundancy in the sequence data. Detailed information for each reference database is provided on the website. Taxonomic information was either retrieved with the sequence data from the resources or was added based on the NCBI Taxonomy. The databases are automatically updated on a regular basis and can be requested from the authors for offline analysis. A non-redundant database is made available for the stand-alone version on the program website.
3 WEB-INTERFACE
3.1 Inputs
The web interface allows the submission of compressed FASTA or FASTQ files to reduce the time of data upload. Uploaded data can be shared or accessed at a later point using unique data identifiers. It should be noted at this point that the input datasets should only contain quality-controlled, preprocessed sequences to ensure accurate results (Schmieder and Edwards, 2011). In addition to the sequence data, the rRNA reference databases have to be selected from the list of available databases.
Unlike the stand-alone version, the web-based program allows the user to define threshold parameters based on the results after the data are processed. This does not require an a priori knowledge of the best parameters for a given dataset and the parameter choice can be guided by the graphical visualizations.
3.2 Outputs
Users can download the results in FASTA or FASTQ (if provided as input) format or its compressed version. Results will be stored for the time selected by the user (either 1 day or 1 week), if not otherwise requested, on the web server using a unique identifier displayed during data processing and on the result page. This identifier additionally allows users to share the result with other researchers.
The current implementation offers several graphical and tabular outputs in addition to the processed sequence data. The Coverage versus Identity plot shows the number of matching reads for different coverage and identity threshold values. The coverage plots show where the metatranscriptomic sequences aligned to the rRNA reference sequences and provide an easy way to check for possible bias in the alignment or the rRNA-removal prior to sequencing. The coverage data for each database sequence is available for download. The taxonomic classifications of rRNA-like sequences are presented as bar charts for each selected database. The summary report includes information about the input data, selected databases and thresholds, and rRNA-like sequence classifications by database, domain and phyla.
4 BRIEF SURVEY OF ALTERNATIVE PROGRAMS
There are different applications that can identify rRNA-like sequences in metatranscriptomic datasets. The command line program meta_rna (Huang et al., 2009) is written in Python and identifies rRNA sequences based on HMMs using the HMMER package (Eddy, 2009). Another program based on HMMER is rRNASelector (Lee et al., 2011), which is written in Java and can only be used through its graphical interface. The web-based MG-RAST (Meyer et al., 2008) uses the BLASTn program, identifying rRNA-like sequences based on sequence similarity. The HMM-based programs currently allow identification of bacterial and archaeal rRNAs. The sequence similarity-based programs make it easy to assign sequences to taxonomic groups.
5 CONCLUSION
riboPicker allows scientists to efficiently remove rRNA-like sequences from their metatranscriptomic datasets prior to downstream analysis. The web interface is simple and user-friendly, and the stand-alone version allows offline analysis and integration into existing data processing pipelines. The tool provides a computational resource able to handle the amount of data that next-generation sequencers are capable of generating and can place the process more within reach of the average research lab.
Supplementary Material
ACKNOWLEDGEMENT
We thank Matthew Haynes and Ramy Aziz for comments and suggestions. We thank the HMP DACC for making reference genomes data from the NIH Human Microbiome Project publicly available.
Funding: National Science Foundation Advances in Bioinformatics grant (DBI 0850356 to R.E.).
Conflict of Interest: none declared.
REFERENCES
- Aziz R.K., et al. The RAST Server: Rapid Annotations using Subsystems Technology. BMC Genomics. 2008;9:75. doi: 10.1186/1471-2164-9-75. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cole J.R., et al. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009;37:D141–D145. doi: 10.1093/nar/gkn879. [DOI] [PMC free article] [PubMed] [Google Scholar]
- DeSantis T.Z., et al. Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Appl. Environ. Microbiol. 2006;72:5069–5072. doi: 10.1128/AEM.03006-05. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eddy S.R. A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009;23:205–211. [PubMed] [Google Scholar]
- Gardner P.P., et al. Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res. 2011;39:D141–D145. doi: 10.1093/nar/gkq1129. [DOI] [PMC free article] [PubMed] [Google Scholar]
- He S., et al. Validation of two ribosomal RNA removal methods for microbial metatranscriptomics. Nat. Methods. 2010;7:807–812. doi: 10.1038/nmeth.1507. [DOI] [PubMed] [Google Scholar]
- Huang Y., et al. Identification of ribosomal RNA genes in metagenomic fragments. Bioinformatics. 2009;25:1338–1340. doi: 10.1093/bioinformatics/btp161. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H., Durbin R. Fast and accurate long-read alignment with Burrows-wheeler transform. Bioinformatics. 2010;26:589–595. doi: 10.1093/bioinformatics/btp698. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lee J.-H., et al. rRNASelector: a computer program for selecting ribosomal RNA encoding sequences from metagenomic and metatranscriptomic shotgun libraries. J. Microbiol. 2011;49:689–691. doi: 10.1007/s12275-011-1213-z. [DOI] [PubMed] [Google Scholar]
- Meyer F., et al. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9:386. doi: 10.1186/1471-2105-9-386. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pruesse E., et al. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007;35:7188–7196. doi: 10.1093/nar/gkm864. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sayers E.W., et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2011;39:D38–D51. doi: 10.1093/nar/gkq1172. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schmieder R., Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011;27:863–864. doi: 10.1093/bioinformatics/btr026. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stewart F.J., et al. Development and quantitative analyses of a universal rRNA-subtraction protocol for microbial metatranscriptomics. ISME J. 2010;4:896–907. doi: 10.1038/ismej.2010.18. [DOI] [PubMed] [Google Scholar]
- The NIH HMP Working Group, et al. The NIH Human Microbiome Project. Genome Res. 2009;19:2317–2323. doi: 10.1101/gr.096651.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tripp H.J., et al. Misannotations of rRNA can now generate 90% false positive protein matches in metatranscriptomic studies. Nucleic Acids Res. 2011;39:8792–8802. doi: 10.1093/nar/gkr576. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.