Abstract
Translational control is crucial in the regulation of gene expression and deregulation of translation is associated with a wide range of cancers and human diseases. Ribosome profiling is a technique that provides genome wide information of mRNA in translation based on deep sequencing of ribosome protected mRNA fragments (RPF). RPFdb is a comprehensive resource for hosting, analyzing and visualizing RPF data, available at www.rpfdb.org or http://sysbio.sysu.edu.cn/rpfdb/index.html. The current version of database contains 777 samples from 82 studies in 8 species, processed and reanalyzed by a unified pipeline. There are two ways to query the database: by keywords of studies or by genes. The outputs are presented in three levels. (i) Study level: including meta information of studies and reprocessed data for gene expression of translated mRNAs; (ii) Sample level: including global perspective of translated mRNA and a list of the most translated mRNA of each sample from a study; (iii) Gene level: including normalized sequence counts of translated mRNA on different genomic location of a gene from multiple samples and studies. To explore rich information provided by RPF, RPFdb also provides a genome browser to query and visualize context-specific translated mRNA. Overall our database provides a simple way to search, analyze, compare, visualize and download RPF data sets.
INTRODUCTION
Translational control is crucial in the regulation of gene expression. Deregulation of translation is associated with a wide range of cancers and human diseases (1). For example, hereditary hyperferritinaemia-cataract syndrome (HHCS) is an autosomal dominant disorder caused by mutations in the iron-response elements of ferritin. The mutation causes increased translation of ferritin mRNA and, hence, elevated serum levels of ferritin. This leads to nuclear cataract, an eye disease that eventually progresses to total blindness (2). Therefore, accurate measurement of translated mRNA is invaluable to better understand cellular functions and human diseases.
Nuclease footprinting is a conventional way to determine ribosome positions on mRNA, where the 28–30 nucleotides of mRNAs protected by a ribosome indicate translated mRNA (3,4). Ribosome profiling is a recently developed high-throughput strategy based on deep sequencing of ribosome-protected mRNA fragments (RPF) (5,6), that provides genome-wide information of mRNA in translation. Since its inception in 2009, RPF technique has been utilized in a range of studies in both prokaryotic and eukaryotic organisms and the number of studies increases rapidly every year (4,6–8).
To date, GWIPS-viz is the only database specifically designed for RPF data sets (9). GWIPS-viz provides a very valuable online genome browser to view the coverage and distribution of RPF reads. Currently it hosts RPF and mRNA data sets from 45 studies. Another relevant database is TISdb that is based on the recently developed Global Translation Initiation sequencing (GTI-seq) technology which provides global mapping of translation initiation codons (10). TISdb provides tools to search for translation initiation sites and the associated open reading frames (ORFs) based on multiple GTI-seq datasets. Since the numbers of studies using RPF technique have been growing significantly in the recent years, there is a strong need for an integrated database that facilitates the exploration of data from these studies. Furthermore, in addition to visualization of RPF reads, there is also an emerging database demand both for hosting the meta information of the studies but also for in-depth analysis of studies in a consistent way.
Herein, we present RPFdb, a comprehensive resource for hosting, analyzing and visualizing RPF data sets, available at www.rpfdb.org or http://sysbio.sysu.edu.cn/rpfdb/index.html. The current version of database contains 777 samples from 82 studies in 8 species, reprocessed by a unified pipeline. The main functions of the database include Browse, Search and Download, summarized in Figure 1. There are two ways to query the database: by keywords of studies or by genes. The outputs are presented in three levels. (i) Study level: including meta information of studies and reprocessed data for gene expression of translated mRNAs; (ii) Sample level: including global perspective of translated mRNA and a list of the most translated mRNA of each sample from a study; (iii) Gene level: including normalized sequence counts of translated mRNA on different genomic location of a gene from multiple samples and studies. To explore the rich information provided by RPF, RPFdb also provides a genome browser to query and visualize context-specific translated mRNA. Overall our database provides a simple way to search, analyze, compare, visualize and download RPF data sets.
Figure 1.
Structure and contents of RPFdb.
MATERIALS AND METHODS
Data sources
The RPF sequencing data were collected from Gene Expression Omnibus (GEO) and Short Read Archive (SRA) databases. The current version contains 777 samples from 82 studies in 8 species: Arabidopsis, C. elegans, Drosophila, E. coli, Human, Mouse, Yeast and Zebrafish. Figure 2 shows the distribution of studies and samples by species. Human and yeast are the two species that were most studied.
Figure 2.
Distribution of studies and samples in species
Data processing
The pipeline for data processing is summarized in Figure 3. Specifically, SRAToolkit v2.4.3 (http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software) was used to convert sra files to fastq format. And FastQC v0.11.2 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) was used to access the base quality of raw data. Since ribosome encloses a ∼30 nucleotides of mRNA and the 3′ sequences are usually linkers (5), we selected the first 26 nucleotides of the sequences for the subsequent alignment against reference genome as described in the literature (11). STAR v2.4.0i was used for alignment where one mismatch was allowed and multiple alignments were accepted (12). Supplemental Table S1 shows reference genome and related annotation files of eight species. We removed any sample if the number of its uniquely mapped reads is less than 1 million.
Figure 3.
Data processing and RPKM calculation
Data analysis
RNA–SeQC was used to generate two types of statistics of mapped reads: (i) the numbers of mapped and unmapped reads of each sample and (ii) the mapped ratio in exonic, intronic and intergenic regions of each sample (13). Reads Per Kilobase per Million mapped reads (RPKM) was used to measure RPF abundance, as defined below (14).
![]() |
where R is the total mapped reads in all genes, flf is the feature length and rf is the raw mRNA counts.
The raw mRNA counts was calculated by HTseq-count v0.6.1p1 (15). Exonic loci of CDS and exon, were retrieved directly from annotation file against reference genome (Supplemental Table S1), and exonic loci of 3′ UTR and 5′ UTR were extracted by R package GenomicFeatures (16). If an exonic locus was classified as UTR for one transcript and as CDS for another transcript of the same gene, UTR or CDS region for each transcript was annotated separately, and the same exon was counted as both UTR and CDS. The feature length from multiple transcripts of the gene were merged.
Database implementation
The database was implemented by PHP, MySQL and JavaScript. The study and gene information were stored and queried by MySQL and PHP. The JavaScript JQuery and D3.js library were used for producing dynamic and interactive data visualization in the web browser. In addition, we integrated JBrowse in our database for visualizing context-specific translated mRNA intuitively. And the aligned RPF sequences in each species against their reference and annotation information are hosted in the genome browser.
RESULTS
Usage and access
The RPFdb includes Home, Browse, Search, Download and Help pages.
Search
This page provides two ways to query the database. (i) Search gene: by selecting a species and entering a gene symbol or Ensembl ID in the search box of the search page (also appears in the home page), the output shows the genome information of this gene from all the samples for the selected species, including RPKM of the gene and RPKM of the 5′ UTR, CDS and 3′ UTR regions. The users can sort the table by clicking the column names. The users may also use a search box on the output page to filter the results. A snapshot of output of ‘Search gene’ is shown in Supplemental Figure S1. The JBrowse icon provides hyperlink to a genome browser, which is described in the next section. (ii) Search study: users can search studies by keywords. This feature is useful to retrieve data set from the curated studies in the database. A snapshot of output of ‘Search study’ is shown in Supplemental Figure S2.
Genome browser
To explore the distribution of RPF reads for a given gene, RPFdb provides a genome browser to query and visualize context-specific translated mRNA. A snapshot of an example of ‘Genome browser’ is shown in Supplemental Figure S1. The annotated gene track and reference are displayed on the top of browser. RPF reads of the selected gene are shown at the bottom. If users select multiple samples, they can easily compare reads distribution on the genome.
Browse
For each study, this page displays: (i) Meta information of the study, including abstract of study, tissue or cell source, treatment for RPF experiment and reference genome for alignment (Supplemental Figure S3); (ii) The top 200 most translated genes. In addition, the whole RPKM table of the study is also downloadable (Supplemental Figure S4); (iii) Plots showing overall statistics of each sample, including the numbers and fraction of mapped and unmapped reads in each sample, and statistics of RPKM on different genomic regions of each sample, including 5′ UTR, CDS, 3′ UTR and gene (Supplemental Figures S3 and S4).
Download
Download page also has search function so that users can quickly find out their interested data set for downloading. In addition, we also support the Application Programming Interface (API), which allows developers to obtain the analysis result from RPFdb by using a HTTP client. The server-side programs in RPFdb accept a fixed URL syntax for retrieval operations. For example, the search result of gene THI1 in Arabidopsis can be returned by using the URL http://sysbio.sysu.edu.cn/rpfdb/fetchExpression.php?gene=THI1&species=Arabidopsis. Besides, both gene symbol and Ensembl ID can be used as query keys.
A case study
Here we show how to explore genes or studies of interest. For example, we want to know how mouse gene Swi5 is translated under different conditions. On either the home page or the search page, we can select the species, mouse, and enter Swi5 or Ensembl gene ID ENSMUSG00000044627. The output page displays RPKM of Swi5 from 128 samples. We next want to know how Swi5 is translated in mouse embryonic stem cells, ‘E14’. Entering ‘E14’ into the search box of the output page, it displays the results from the studies using ‘E14’ (11,17) (Supplemental Figure S5A). The output shows that RPKM in 3′ UTR is lower than that in 5′ UTR and CDS for all samples. And RPKM in 5′ UTR is much higher than that in CDS for the harringtonine-treated samples (Supplemental Figure S5A). This can be explained that ribosomes dropped after stop codon in 3′ UTR whereas ribosomes accumulate at initiation sites in 5′ UTR for the harringtonine treated cells (11). If we want to explore reads distribution of cycloheximide and harringtonine treated cells in the genome browser. We select samples of interest and click ‘Genome Browse’. The output page shows the reads coverage on genome (Supplemental Figure S5B). It shows that the harringonine-treat cells have one clear peak in the 5′ UTR region, in contrast, cycloheximide treated cells have reads in multiple exon regions and 5′ UTR region (Figure Supplemental Figure S5C and SD). These features are consistent with the results presented in the original publication (11).
Another useful function of RPFdb is to query studies of interest. For example, we enter ‘cell cycle’ in the search study page. The output page shows the relevant study of Stumpf CR et al. (18). Click ‘Details’, it leads to the page displaying the summary of the study and global perspective of the genome-wide distribution of RPF reads (Supplemental Figure S2).
DISCUSSION
The numbers of studies using RPF technique have been growing significantly in the recent years. There is a strong need for an integrated database that facilitates the exploration of data from these studies. Here we present RPFdb, a comprehensive resource for hosting and analyzing the publicly available RPF data sets. The main functions of RPFdb include to search studies of interest, to explore basic statistics of reads of the studies, to compare reads of translated mRNA of given genes from multiple studies and to visualize RPF data in a genome browser.
GWIPS-viz (9) and RPFdb are both useful resources for users who are interested in translated mRNA but with little computational knowledge. In order to help users to choose an appropriate database, we compared RPFdb and GWIPS-viz regarding the number of studies, the type of collected data sets, data processing method, aligner, genome browser and main features (Table 1).
Table 1. Comparison of RPFdb and GWIPS-viz.
RPFdb | GWIPS-viz | |
---|---|---|
No. of studies | 82 | 45 |
Type of data sets | RPF only | RPF and mRNA-seq |
Data processing | • The first 26 nucleotides kept; | • The adaptor linker sequence or poly-(A) tails trimmed from the 3′ ends of reads; |
• One mismatch allowed | • Three mismatches with alignment allowed | |
Aligner | STAR | Bowtie |
Genome browser | Jbrowse | UCSC genome browser |
Meta information | Searchable | Not searchable |
Main features | • Statistics of studies and samples; | • Visualization of RPF, inferred A-sites and mRNA; |
• RPKM of RPF on different genomic location; | • Comparison of RPF and mRNA from the same sample | |
• Visualization of RPF |
Owing to the increasing interest in ribosome profiling and in translational regulation in general, we envision that ribosome profiling technology will be applied to a broader set of species and conditions and more publications will be released in future. RPFdb will be updated in a timely manner with new released data from public studies. We will also make efforts to improve the database. We hope that RPFdb will be a valuable resource for both experimental and computational biologists who are interested in understanding translational regulation and gene regulation.
Acknowledgments
We would like to thank Dr. Nicholas Ingolia, Dr. Pasha Baranov and all the members of Zhi Xie's lab to provide suggestions to help improve the database.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
Recruitment Program of Global Experts and National Natural Science Foundation of China [31471232, 31471252]; National Basic Research Program (973 project) [2013CB933900 and 2012CB911201]; Guangdong Natural Science Foundation [S20120011335 and 2014A030313181]; Program of International S&T Cooperation [2014DFB30020]. Funding for open access charge: Recruitment Program of Global Experts and National Natural Science Foundation of China [31471232, 31471252]; National Basic Research Program (973 project) [2013CB933900 and 2012CB911201]; Guangdong Natural Science Foundation [S20120011335 and 2014A030313181]; Program of International S&T Cooperation [2014DFB30020].
Conflict of interest statement. None declared.
REFERENCES
- 1.Calkhoven C.F., Muller C., Leutz A. Translational control of gene expression and disease. Trends Mol. Med. 2002;8:577–583. doi: 10.1016/s1471-4914(02)02424-3. [DOI] [PubMed] [Google Scholar]
- 2.Millonig G., Muckenthaler M.U., Mueller S. Hyperferritinaemia-cataract syndrome: worldwide mutations and phenotype of an increasingly diagnosed genetic disorder. Hum. Genomics. 2010;4:250–262. doi: 10.1186/1479-7364-4-4-250. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Stern-Ginossar N. Decoding viral infection by ribosome profiling. J. Virol. 2015;89:6164–6166. doi: 10.1128/JVI.02528-14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Ingolia N.T., Ghaemmaghami S., Newman J.R., Weissman J.S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science. 2009;324:218–223. doi: 10.1126/science.1168978. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Ingolia N.T., Brar G.A., Rouskin S., McGeachy A.M., Weissman J.S. The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments. Nat. Protoc. 2012;7:1534–1550. doi: 10.1038/nprot.2012.086. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Oh E., Becker A.H., Sandikci A., Huber D., Chaba R., Gloge F., Nichols R.J., Typas A., Gross C.A., Kramer G. Selective ribosome profiling reveals the cotranslational chaperone action of trigger factor in vivo. Cell. 2011;147:1295–1308. doi: 10.1016/j.cell.2011.10.044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Guo H., Ingolia N.T., Weissman J.S., Bartel D.P. Mammalian microRNAs predominantly act to decrease target mRNA levels. Nature. 2010;466:835–840. doi: 10.1038/nature09267. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Stadler M., Artiles K., Pak J., Fire A. Contributions of mRNA abundance, ribosome loading, and post-or peri-translational effects to temporal repression of C. elegans heterochronic miRNA targets. Genome Res. 2012;22:2418–2426. doi: 10.1101/gr.136515.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Michel A.M., Fox G., M Kiran A., De Bo C., O'Connor P.B.F., Heaphy S.M., Mullan J.P.A., Donohue C.A., Higgins D.G., Baranov P.V. GWIPS-viz: development of a ribo-seq genome browser. Nucleic Acids Res. 2014;42:D859–D864. doi: 10.1093/nar/gkt1035. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Wan J., Qian S.B. TISdb: a database for alternative translation initiation in mammalian cells. Nucleic Acids Res. 2014;42:D845–D850. doi: 10.1093/nar/gkt1085. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Ingolia N.T., Lareau L.F., Weissman J.S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell. 2011;147:789–802. doi: 10.1016/j.cell.2011.10.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Dobin A., Davis C.A., Schlesinger F., Drenkow J., Zaleski C., Jha S., Batut P., Chaisson M., Gingeras T.R. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.DeLuca D.S., Levin J.Z., Sivachenko A., Fennell T., Nazaire M.-D., Williams C., Reich M., Winckler W., Getz G. RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics. 2012;28:1530–1532. doi: 10.1093/bioinformatics/bts196. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Mortazavi A., Williams B.A., McCue K., Schaeffer L., Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods. 2008;5:621–628. doi: 10.1038/nmeth.1226. [DOI] [PubMed] [Google Scholar]
- 15.Anders S., Pyl P.T., Huber W. HTSeq–A Python framework to work with high-throughput sequencing data. Bioinformatics. 2014;31:166–169. doi: 10.1093/bioinformatics/btu638. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Lawrence M., Huber W., Pages H., Aboyoun P., Carlson M., Gentleman R., Morgan M.T., Carey V.J. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 2013;9:e1003118. doi: 10.1371/journal.pcbi.1003118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Ingolia N.T., Brar G.A., Stern-Ginossar N., Harris M.S., Talhouarne G.J., Jackson S.E., Wills M.R., Weissman J.S. Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes. Cell Rep. 2014;8:1365–1379. doi: 10.1016/j.celrep.2014.07.045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Stumpf C.R., Moreno M.V., Olshen A.B., Taylor B.S., Ruggero D. The translational landscape of the mammalian cell cycle. Mol. Cell. 2013;52:574–582. doi: 10.1016/j.molcel.2013.09.018. [DOI] [PMC free article] [PubMed] [Google Scholar]