RiboVIEW: a computational framework for visualization, quality control and statistical analysis of ribosome profiling data

Carine Legrand; Francesca Tuorto

doi:10.1093/nar/gkz1074

Nucleic Acids Res. 2019 Nov 28;48(2):e7. doi: 10.1093/nar/gkz1074


PMCID: PMC6954398 PMID: 31777932

Abstract

Recently developed ribosome profiling methods, based on high-throughput sequencing of ribosome-protected mRNA footprints, allow genome-wide translational changes to be studied in detail. However, computational analysis of the sequencing data still represents a bottleneck for many laboratories. Further, specific pipelines for quality control and statistical analysis of ribosome profiling data, providing high levels of both accuracy and confidence, are currently lacking. In this study, we describe automated bioinformatic and statistical diagnoses to perform robust quality control of ribosome profiling data (RiboQC), to efficiently visualize ribosome positions and to estimate ribosome speed (RiboMine) in an unbiased way. We present an R pipeline to set up and undertake the analyses that offers the user an HTML page to scan their own data regarding the following aspects: periodicity, ligation and digestion of footprints; reproducibility and batch effects of replicates; drug-related artifacts; unbiased codon enrichment including variability between mRNAs, for A, P and E sites; mining of some causal or confounding factors. We expect our pipeline to allow an optimal use of the wealth of information provided by ribosome profiling experiments.

INTRODUCTION

The translation of genetic information into polypeptide sequences is a cellular process common to all kingdoms of life, involving a multitude of orchestrated interactions between mRNAs, translation factors, ribosomes and tRNAs. Translation is a highly regulated and fine-tuned process that enables a fast response to metabolic and environmental changes, and its regulation balances the pool of proteins actively translated from mRNAs (1). While mRNAs and proteins can be measured by RNA-seq and mass spectrometry, respectively, ribosome profiling (Ribo-seq) directly measures protein synthesis by detecting the position of ribosomes on mRNAs (2,3). As a result, Ribo-seq provides a quantitative, high-resolution profile of the translatome, i.e. the set of mRNA species under active translation. More specifically, Ribo-seq is based on the isolation of mRNA fragments (footprints) protected by a ribosome, followed by their identification by deep sequencing. Adequate alignment of these footprints makes it possible to determine the position of translating ribosomes on mRNAs at single-codon resolution (3,4). This method has quickly been adopted by many laboratories, but at present, data analysis requires computational expertise (5), and analyses so far have relied on visualization methods with few dedicated statistical estimates or quality diagnostics. Bioinformatics tools like Rqc (6), centered on quality assessment of reads (data structure, contaminants, etc.), can be used to assess the quality of sequencing but are not informative about the artifacts and batch effects found in ribosome profiling datasets (Table 1). In this study, we compared the performance of RiboVIEW with other existing tools dedicated to ribosome profiling analysis, like Gwips-viz (7), RiboProfiling (8) and riboSeqR (9). This comparison is presented in Table 2 and also includes tools with some quality control capabilities like RiboViz (10), mQC (11), RiboTools (12) and Ribo-TISH (13), though none of these methods provides the full array of controls and visualization that we propose.

Table 1.

Artifacts and batch effects in ribosome profiling experiments

Category	Impact	Ref
Replicate concordance	Single codon-level analysis (ORF, etc.)	(23)
Drugs	Leakage, ribosome run-off, biased codon occupancy	(15,29–31)
Experimental conditions	Ligation-digestion: end base bias, ambiguous FP location, loss of periodicity	(21,30)
	Hybridization-subtraction: bias on codon enrichment	(9)
	EDTA: loss of information from large RPFs
	RiboZero: loss of rRNA-like mRNA segments
	Drop-off due to unwanted amino acid starvation, monitoring with metagene	(32,33)
	RNase digestion	(34)
Organismal specificities	Footprint loss through size selection in S. cerevisiae	(28)
	CHX leakage in S. cerevisiae and S. pombe	(15,29)
PCR, seq., post-treatment	Sequence preference, biased counts due to inefficient alignment	(21)
	Information loss	(35)
Other batch effects	Non-relevant variability for one or all estimates	(21)


Table 2.

Comparison of ribosome profiling tools


RiboVIEW visualizes translation elongation at codon level and provides relevant quality properties. Furthermore, RiboVIEW provides unbiased estimates of codon enrichment and detects some (causal or confounding) covariates (Table 1, Supplementary Figure S1).

MATERIALS AND METHODS

Data preparation

Sequenced reads are submitted to adapter removal, using cutadapt (v1.8.1) with options ‘-a AGATCGGAAGAGCACACGTCT --error-rate=0.1 --times=2 --overlap=1’. Resulting reads are trimmed using Trimmomatic (v0.36) with 30 as minimum quality score, minimum length 11 nt and maximum length 36 nt (options -phred33 LEADING:30 TRAILING:30 MINLEN:11 CROP:36). Remaining reads are depleted of rRNA, tRNA and mitochondrial RNA by aligning them with Bowtie to a depletion reference containing these sequences, with options --seedmms 2 --seedlen 11 --maqerr 70 --tryhard -k 1. Finally, reads that do not align to the depletion reference are aligned to the transcriptome, using Bowtie with options --seedmms 2 --seedlen 11 --maqerr 70 -m 1.
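The four preparation steps above can be written down as argument lists, for example to pass to Python's subprocess module. This is only a sketch: the tool options are those quoted in the text, whereas all file names, index names and the Trimmomatic jar name are hypothetical, and the positional index argument assumes the classic Bowtie 1 command-line layout.

```python
def build_commands(raw_fastq="sample.fastq",
                   depletion_index="depletion_idx",
                   transcriptome_index="transcriptome_idx"):
    """Return the four preparation commands as argument lists.

    File, index and jar names are hypothetical placeholders; only the
    tool options are taken from the text.
    """
    cutadapt = ["cutadapt", "-a", "AGATCGGAAGAGCACACGTCT",
                "--error-rate=0.1", "--times=2", "--overlap=1",
                "-o", "sample.noadapt.fastq", raw_fastq]
    # Single-end mode ("SE") is an assumption about the library type.
    trimmomatic = ["java", "-jar", "trimmomatic-0.36.jar", "SE", "-phred33",
                   "sample.noadapt.fastq", "sample.trimmed.fastq",
                   "LEADING:30", "TRAILING:30", "MINLEN:11", "CROP:36"]
    bowtie_common = ["--seedmms", "2", "--seedlen", "11", "--maqerr", "70"]
    # Depletion: keep the reads that do NOT align (--un) for the next step.
    deplete = ["bowtie", *bowtie_common, "--tryhard", "-k", "1",
               "--un", "sample.depleted.fastq", "--sam",
               depletion_index, "sample.trimmed.fastq", "sample.depletion.sam"]
    # Transcriptome alignment: -m 1 keeps uniquely mapping reads only.
    align = ["bowtie", *bowtie_common, "-m", "1", "--sam",
             transcriptome_index, "sample.depleted.fastq",
             "sample.transcriptome.sam"]
    return [cutadapt, trimmomatic, deplete, align]
```

Each list can be run in order with subprocess.run(command, check=True); the SAM output of the last step would still need conversion to BAM before use in RiboVIEW.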

FASTA-format mRNA and ncRNA references, as well as GTF-format annotation, were downloaded from the Ensembl FTP download page https://www.ensembl.org/info/data/ftp/index.html. A template for data preparation under UNIX/Linux systems is provided in the Supplementary Information.

Workflow

Once the aligned reads are available in BAM format, the next step is carried out from the R command line. A template workflow is provided in the Supplementary Information. In this template, the user defines the paths of the input files and the experimental conditions. A set of commands then generates the results, including the two HTML pages Results-QC.html and Results-MINE.html where the results can be viewed. All steps are implemented as custom R and Python scripts. As a general rule, replicates of the same condition are either integrated or all shown. Where this would be impractical, the resulting plots are saved to the output folder and only one replicate is shown in the HTML pages.

Calculating periodicity

The footprints that align close to the start of the annotated sequence of an mRNA are counted (regardless of codon identity). Tables of this coverage, stratified by footprint length and by position in a window from 20 nt 5′ of the A of the AUG codon to 20 nt on the 3′ side, are generated. These tables are used to display periodicity using a recurrence plot (14) (R function recurr from R package tseriesChaos, adapted for RiboVIEW). A recurrence plot usually represents time autocorrelation in dynamical systems, whereas here it represents spatial autocorrelation. The recurrence plot is generated for each footprint length, together with a barplot showing the coverage achieved for each footprint length.
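To illustrate the idea behind a recurrence plot applied to coverage, the following minimal sketch computes the matrix of distances between delay-embedded coverage vectors, which is what such a plot displays as grey levels. It is not the adapted recurr function used in RiboVIEW; the embedding dimension m and delay d are assumed values chosen for illustration.

```python
def recurrence_matrix(coverage, m=3, d=1):
    """Pairwise Euclidean distances between delay-embedded vectors.

    coverage: counts per nucleotide position around the start codon.
    m, d: embedding dimension and delay (illustrative assumptions).
    A small distance at (i, j) means the local coverage pattern at
    position i recurs at position j.
    """
    n = len(coverage) - (m - 1) * d
    embedded = [[coverage[i + k * d] for k in range(m)] for i in range(n)]
    return [[sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
             for v in embedded]
            for u in embedded]


# Toy coverage with a perfect 3-nt period: one high frame, two low frames.
toy_coverage = [9, 1, 1] * 7
toy_matrix = recurrence_matrix(toy_coverage)
```

In this toy example the distance is zero between positions that are a multiple of 3 nt apart and non-zero otherwise, which would appear as regularly spaced dark bands.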

Selecting adequate footprint lengths

Recurrence plots per footprint length are displayed one at a time (function selectFPlen), after which an interactive dialog prompts the user to select a minimum and maximum footprint length which comply with recurrence every 3nt, starting at −12nt.

Metagene

A metagene is generated using the coverage in the A site for each available position along each mRNA in one sample, which is calculated in enrichment.py and stored in output files *.metagene. Positions are normalized to the following metagene coordinates: [−1; 0] for the 5′ UTR, [0; 1] for the CDS and [1; 2] for the 3′ UTR. Coverage counts are normalized so as to add up to unity, and binned at 0.1 resolution (option res1 in function ‘metagene.all’). These normalized and binned values constitute the metagene profile. The percentage of reads in the UTRs relative to the CDS is calculated and flags a possible selection artifact (indicative cutoffs of 1% and 10% are used). The percentage of reads in the stretch of the first 15 codons at the CDS start, including the AUG codon, is calculated and compared to an indicative threshold of 1% for possible inflation around AUG. Leakage is examined at the AUG and STOP codons. For AUG, a robust linear fit is applied to the metagene profile at and after AUG (metagene coordinates [−0.1; +0.3]). If the slope from this fit is positive and its P-value is significant at the 0.05 level (respectively, 0.1), this yields an indication of strong (respectively, mild) leakage after AUG. Similarly, leakage at STOP is calculated as the percentage of metagene coverage after the STOP codon relative to shortly before (segments [1; 1.3] and [0.9; 1] in metagene coordinates). A percentage larger than 5% (respectively, 1%) triggers a strong (respectively, mild) indication of STOP leakage.
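The coordinate normalization, binning and STOP-leakage percentage described above can be sketched as follows for a single transcript. This is an illustrative reading of the text, not the code of enrichment.py or metagene.all: the function names are hypothetical, bins are taken as half-open intervals, and pooling over all mRNAs is omitted for brevity.

```python
def metagene_coordinate(pos, utr5_len, cds_len, utr3_len):
    """Map a 0-based transcript position to metagene coordinates:
    [-1, 0) for the 5' UTR, [0, 1) for the CDS, [1, 2) for the 3' UTR."""
    if pos < utr5_len:
        return -1.0 + pos / utr5_len
    if pos < utr5_len + cds_len:
        return (pos - utr5_len) / cds_len
    return 1.0 + (pos - utr5_len - cds_len) / utr3_len


def metagene_profile(coverage, utr5_len, cds_len, utr3_len, res=0.1):
    """Bin A-site coverage at resolution `res` and normalize to sum 1."""
    n_bins = round(3 / res)
    bins = [0.0] * n_bins
    for pos, count in enumerate(coverage):
        coord = metagene_coordinate(pos, utr5_len, cds_len, utr3_len)
        # Small epsilon guards against floating-point error at bin edges.
        index = min(int((coord + 1.0) / res + 1e-9), n_bins - 1)
        bins[index] += count
    total = sum(bins)
    return [b / total for b in bins] if total else bins


def stop_leakage_percent(profile, res=0.1):
    """Coverage in [1, 1.3) as a percentage of coverage in [0.9, 1)."""
    def first_bin(coord):
        return round((coord + 1.0) / res)
    before = sum(profile[first_bin(0.9):first_bin(1.0)])
    after = sum(profile[first_bin(1.0):first_bin(1.3)])
    return 100.0 * after / before if before else float("nan")
```

With the cutoffs quoted in the text, a returned value above 5 would be read as strong STOP leakage and a value above 1 as mild leakage.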

Ligation biases, if any, are highlighted in logo plots at the nucleotide and codon level. These plots are generated automatically from coverage counts at nucleotide and codon level, as derived from enrichment.py, using scripts adapted from the R package ggseqlogo. A significant nucleotide or codon sequence bias is indicated here not by a P-value but by the information content, measured in bits. Cutoffs of 0.2 and 0.4 bits are used to indicate possible and strong sequence bias, respectively.
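As an illustration of the information-content criterion, the sketch below computes the bits of information at one position of a logo as the maximum entropy minus the observed Shannon entropy, and applies the two cutoffs. The absence of a small-sample correction and the use of strict inequalities are assumptions, not details given in the text.

```python
import math


def information_content(counts):
    """Information content in bits at one logo position.

    counts: mapping from symbol (nucleotide or codon) to its count.
    Computed as log2(number of symbols) minus the Shannon entropy,
    without small-sample correction (an assumption).
    """
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in counts.values() if n > 0)
    return math.log2(len(counts)) - entropy


def bias_label(bits, possible=0.2, strong=0.4):
    """Classify a position using the indicative cutoffs from the text."""
    if bits > strong:
        return "strong"
    if bits > possible:
        return "possible"
    return "none"
```

For four nucleotides the value ranges from 0 bits (all bases equally frequent) to 2 bits (a single base at that position).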

Correlation between replicates

Counts per mRNA per replicate are displayed in an RPKM plot for each set of replicates of the same condition, along with the Spearman correlation at gene level. Additionally, stretches of 3–100 codons are scanned for a Spearman correlation between replicates higher than 0.4 or, preferably, 0.6. The relevant stretch length, or 100 codons if no stretch fulfills this criterion, is used to display a codon-level RPKM plot.
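The two quantities involved in the gene-level comparison can be sketched as below: RPKM normalization of counts and a Spearman correlation computed as the Pearson correlation of average ranks. This is a generic standard-library illustration with hypothetical function names, not the RiboVIEW implementation.

```python
def rpkm(count, length_nt, total_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (length_nt * total_reads)


def average_ranks(values):
    """Ranks starting at 1, with ties receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation between two replicates (same mRNA order).

    Assumes neither input is constant (non-zero rank variance).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

The same spearman function could be applied to codon-level values within a stretch when scanning for the 0.4 and 0.6 thresholds.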

A heatmap with hierarchical clustering is produced for the full set of samples. Hierarchical clustering is compared to the actual experimental conditions and replicates using a Spearman correlation.

Codon enrichment, relative codon enrichment and codon occupancy

As a rationale for unbiased codon enrichment calculation, we considered the pool of actively translated mRNA from which footprints are derived. Focusing on a codon with identity c, we look for footprints where this codon appears at a certain offset i from the A-site (Supplementary Figure S2). For example, this offset i could be four codons away, on the 5′ side of the A-site. If there is no specific pausing or acceleration of this codon c at offset i, then one would expect codon c at offset i to appear in ribosome footprints at a frequency that simply reflects its codon usage. Based on this rationale, unbiased codon enrichment is calculated as the observed codon usage relative to the expected codon usage. In practice, the Python script dedicated to enrichment calculation first sums the observed codon usage at the mRNA level and then over mRNAs, using weights. Weights by mRNA are defined as the number of reads per mRNA. Furthermore, we make the assumption that the expected codon usage is independent of the position, except in domains near AUG and STOP codons, which are excluded (15 codons near AUG, 5 codons near STOP codons). This yields equation (1), where in particular the weights cancel out when one sums over all mRNAs:

(1)

where Ē_c,i is the codon enrichment for codon identity c at position i (with, especially, i = 0 at A-site, i = 1 at P-site), averaged over mRNAs, n_c,i,g is the number of codons c observed at position i in mRNA g, and global codon usage is defined by equation (2):

(2)

Under the assumptions mentioned, enrichment Ē_c,i is unbiased at unity (Supplementary Figure S2). This relies on the fact that, if codon c is neither paused nor accelerated and if the assumption that the expected codon usage is independent of the position holds, then the observed codon usage converges to the expected codon usage as the coverage in the experiment becomes large enough, as in equation (3):

(3)

As a consequence, still in the case when codon c is neither paused nor accelerated, enrichment Ē_c,i, the ratio of observed to expected codon usage, should converge to 1, as coverage becomes large enough.

Standard deviation of codon enrichment is calculated similarly, using the number of reads as weights. This simplifies into equation (4):

(4)

Standard error of the average codon enrichment across mRNAs is taken as the standard deviation of codon enrichment divided by the square root of the number of mRNAs.
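The weighting scheme described in this section can be illustrated with a short sketch for one codon at one fixed offset. It follows the verbal description only and should not be read as equations (1) and (4) themselves: the per-mRNA weight is taken as the number of counted codons in that mRNA, the expected codon usage is passed in by the caller, and the weighted standard deviation uses one common definition; all three are assumptions.

```python
def codon_enrichment(counts_by_mrna, codon, expected_usage):
    """Weighted mean, SD and SE of the enrichment of one codon.

    counts_by_mrna: one dict per mRNA, mapping codon -> number of times
        it is observed at the chosen offset from the A-site.
    expected_usage: expected frequency of `codon` (supplied by caller).
    Weights are the counts per mRNA (an assumed stand-in for reads).
    """
    data = [(sum(c.values()), c.get(codon, 0)) for c in counts_by_mrna]
    data = [(w, n) for w, n in data if w > 0]  # skip uncovered mRNAs
    total_weight = sum(w for w, _ in data)
    per_mrna = [(n / w) / expected_usage for w, n in data]
    mean = sum(w * e for (w, _), e in zip(data, per_mrna)) / total_weight
    variance = sum(w * (e - mean) ** 2
                   for (w, _), e in zip(data, per_mrna)) / total_weight
    sd = variance ** 0.5
    se = sd / len(data) ** 0.5  # SD divided by sqrt(number of mRNAs)
    return mean, sd, se
```

Because each weight equals the denominator of the corresponding per-mRNA frequency, the weighted mean reduces to the pooled count of the codon divided by the pooled total and by the expected usage, which is the cancellation of weights mentioned in the text.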

Relative codon enrichment is calculated as described previously by Hussmann et al. (15). Relative codon enrichment for arginine codons is produced for each replicate and displayed in RiboQC. The full table of values for all codons is provided in the results folder.

A comparison of the main differences between the calculation of enrichment in this study and in Hussmann et al. (2015) is given in Supplementary Figure S3.

Bulk codon occupancy corresponds to the counts of footprints stratified by codon identity. We provide the codons present in A-site, P-site and E-site, as well as three positions downstream and upstream of the ribosome. The rationale to assign a specific position for each codon was described previously (16). Briefly, the A-site is assigned at the footprint 5′ start +15nt, relaxed by ±1nt to match the closest codon in the main reading frame.
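The A-site assignment rule can be sketched as follows. Positions are expressed relative to the first nucleotide of the CDS (the A of AUG is position 0), which is a convention assumed here; the shift direction for out-of-frame positions follows the "closest codon" wording of the text.

```python
def a_site_codon(fp_start, offset=15):
    """Codon index (0 = AUG) assigned to the A-site of a footprint.

    fp_start: 0-based position of the footprint 5' end relative to the
        first nucleotide of the CDS (negative in the 5' UTR).
    The A-site nucleotide is fp_start + offset, relaxed by +/-1 nt to
    the closest in-frame position.
    """
    a_nt = fp_start + offset
    frame = a_nt % 3
    if frame == 1:
        a_nt -= 1  # one nt past a codon start: move back
    elif frame == 2:
        a_nt += 1  # one nt before the next codon start: move forward
    return a_nt // 3


def site_codons(fp_start):
    """Codon indices in the A, P and E sites for one footprint."""
    a = a_site_codon(fp_start)
    return {"A": a, "P": a - 1, "E": a - 2}
```

With this convention, a footprint starting at −12 nt has the AUG codon in its P-site, consistent with the −12 nt peak expected at initiation.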

Enrichment per experimental condition

Enrichment in one condition is calculated as the weighted mean and standard deviation over replicates, where the weight associated to one replicate is 1/SE²(Ē_c,i). Standard error across replicates is taken as the standard deviation across replicates divided by the square root of the number of replicates. Enrichment per condition is displayed in Results-Mine.html for each codon, along with an error bar corresponding to ±standard deviation.
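A minimal sketch of this inverse-variance weighting is given below. The weighted standard deviation uses one common definition (weighted root mean square deviation from the weighted mean), which is an assumption, and the function name is hypothetical.

```python
def condition_enrichment(means, standard_errors):
    """Combine replicate enrichments of one codon within one condition.

    means: enrichment of the codon in each replicate.
    standard_errors: SE of that enrichment in each replicate (non-zero).
    Each replicate is weighted by 1 / SE^2, as stated in the text.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    mean = sum(w * m for w, m in zip(weights, means)) / total
    variance = sum(w * (m - mean) ** 2
                   for w, m in zip(weights, means)) / total
    sd = variance ** 0.5
    se = sd / len(means) ** 0.5  # SD divided by sqrt(number of replicates)
    return mean, sd, se
```

Replicates with a small standard error therefore pull the condition-level estimate towards their own value.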

Enrichment between experimental conditions

Enrichment in condition (a) relative to condition (b) is calculated by bootstrapping possible quotients from replicates in condition (a) relative to replicates from condition (b). This procedure yields the mean and standard error of the quotient of enrichments. This quotient is shown in Results-Mine.html, with error bars corresponding to the standard error.
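One way to bootstrap such a quotient is sketched below. The resampling scheme is not specified in the text, so the choice made here (resampling replicates with replacement within each condition and dividing the resampled means) is an assumption, as are the number of bootstrap draws and the fixed seed.

```python
import random
import statistics


def bootstrap_quotient(enrich_a, enrich_b, n_boot=1000, seed=0):
    """Bootstrap mean and standard error of enrichment(a) / enrichment(b).

    enrich_a, enrich_b: enrichment of one codon in the replicates of
        conditions (a) and (b).
    The standard deviation of the bootstrap distribution is used as the
    standard error of the quotient.
    """
    rng = random.Random(seed)
    quotients = []
    for _ in range(n_boot):
        sample_a = [rng.choice(enrich_a) for _ in enrich_a]
        sample_b = [rng.choice(enrich_b) for _ in enrich_b]
        quotients.append(statistics.fmean(sample_a)
                         / statistics.fmean(sample_b))
    return statistics.fmean(quotients), statistics.pstdev(quotients)
```

With only two or three replicates per condition the number of distinct resamples is small, so the resulting standard error should be interpreted with caution.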

mRNA tracks

Coverage in the A site is displayed in a barplot along the coding sequence of an mRNA, in every sample. By default, an mRNA with sufficient coverage is picked at random. Tracks for a specific mRNA can be requested using the option ‘mRNA=’ in RiboVIEW function visu.tracks.

Venn diagram

RiboVIEW automatically retrieves the number of footprints per mRNA and per replicate for one condition and creates a Venn diagram using the R package VennDiagram. This is limited to at most five replicates per condition (a limitation of the VennDiagram package).

Group effects

Group effects are first evaluated using a principal component analysis on codon enrichment. A P-value for a significant principal component is derived by bootstrapping the elements of the matrix of occupancies for all samples 10 000 times. The PCA plot is displayed along with this P-value, for interpretation by the user as a batch effect (separation of replicates) or as a functional role (separation of conditions). Additionally, a t-SNE plot is generated, using the average number of replicates to set the parameter ‘perplexity’.

Second, a linear regression is applied to codon occupancy with nucleobases a, c, g or u as explanatory variables. The slope, standard deviation and P-value are retrieved to produce a barplot for display in Results-Mine.html. Error bars show the standard deviation, while a significant P-value is indicated in the text associated with this plot, so that the user can identify whether a batch effect between replicates or a functional effect between conditions is present.

RiboQC HTML page

Text and plot files are retrieved from R data files corresponding to each theme ‘Periodicity’, ‘Replicates’, ‘Selection’ and ‘Drugs’, and to each category within these themes. This hierarchy of themes and categories is specified via a nested list. The output page Results-Qc.html is generated in three phases: (i) HTML (Hyper Text Markup Language) commands for page initiation, definition and header are written to Results-Qc.html. This includes a style sheet ‘output-style’ written in CSS (Cascading Style Sheets) language. (ii) A loop for each theme generates one rounded box-frame per theme. Inside this frame, a nested loop generates one tab per category, containing one plot and corresponding text with relevant values. Plots are included as a character string using Python package ‘base64’. (iii) Footer and closing HTML commands are written to Results-Qc.html.
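The three phases of page generation can be sketched in Python as follows. This is an illustration of the structure described above, not the RiboVIEW script: the PNG image type, the CSS rule and the class names are assumptions, and the nested dictionary stands in for the nested list of themes and categories.

```python
import base64
import html


def embed_png(png_bytes):
    """Encode an image as a base64 data URI inside an <img> tag."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f'<img src="data:image/png;base64,{encoded}" alt="plot">'


def build_page(title, themes):
    """Assemble a results page.

    themes: {theme: {category: (png_bytes, text)}}; hypothetical layout.
    """
    # Phase (i): page initiation, header and style sheet.
    parts = ["<!DOCTYPE html>", "<html><head>",
             f"<title>{html.escape(title)}</title>",
             "<style>.theme { border: 1px solid #888; "
             "border-radius: 10px; padding: 8px; }</style>",
             "</head><body>"]
    # Phase (ii): one rounded box per theme, one tab per category.
    for theme, categories in themes.items():
        parts.append(f'<div class="theme"><h2>{html.escape(theme)}</h2>')
        for category, (png_bytes, text) in categories.items():
            parts.append(f'<div class="tab"><h3>{html.escape(category)}</h3>'
                         f"{embed_png(png_bytes)}"
                         f"<p>{html.escape(text)}</p></div>")
        parts.append("</div>")
    # Phase (iii): footer and closing commands.
    parts.append("</body></html>")
    return "\n".join(parts)
```

Embedding plots as base64 strings keeps each results page a single self-contained file that can be shared without an accompanying image folder.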

RiboMine HTML page

The procedure and structure are the same as for the RiboQC HTML page. The different content is entirely defined by the hierarchy of themes and categories and by the corresponding R data files loaded.

Tests

Using a Python script, synthetic mRNAs were generated and annotated. Following different relevant scenarios (footprint periodicity present or absent, enrichment or not at a specific codon, ribosome leakage or not), footprint reads were sampled from the pool of synthetic mRNAs and written to BAM files. These files were used as input to RiboVIEW in order to check its different functionalities. A checklist of functions and expected outputs was established.

RESULTS

Preparatory work and input

RiboVIEW is meant to be easily integrated into a general ribosome profiling workflow (Figure 1A). Cells can be treated with cycloheximide or with other drugs, or simply flash-frozen, to arrest translating ribosomes. Cytoplasmic extracts from these cells are then treated with RNase to digest regions of mRNAs not protected by ribosomes. 80S monosomes, which mainly protect a ∼30-nucleotide footprint, are purified using a sucrose gradient or alternatively a sucrose cushion. Nucleotide footprints are then size-selected and processed for Illumina high-throughput sequencing (Figure 1A).

Besides the analysis pipeline, we also provide, in the Supplementary Methods, our in-house experimental protocol adapted from (17,18).

The next steps are computational: first, adapters and low-quality bases are trimmed from the footprints, and footprints that align to rRNA (or to further RNA sequences that could be ambiguously aligned to mRNA, such as tRNA) are removed. Remaining reads are mapped to the transcriptome.

We routinely use Bowtie (19) with a seed region of 20 nucleotides and one mismatch allowed. We chose Bowtie because it is fast and dedicated to the alignment of short reads of up to 50 bp, which is compatible with the length of a ribosome footprint. For such reads, it is faster and/or more sensitive than Bowtie2, which was developed for reads longer than 50 bp (http://bowtie-bio.sourceforge.net). The STAR aligner (20) could be a valid alternative. Mapping should preferably be unique (21), which results in the loss of some coverage but avoids skewed codon enrichment or artifacts in translation efficiency (#FP/#mRNA) results.

Afterwards, the resulting BAM files, the reference mRNA sequences (FASTA format) and the annotation of their coding sequences (table format, generated from a GTF file) can be entered in RiboVIEW. We show here example results of the analysis of our in-house samples (16), obtained from HeLa cells treated with the elongation inhibitor cycloheximide (Supplementary HTML files).

Between 1.5 and 2.4 M reads aligned to known CDS regions for the samples grown in medium without queuine (denoted L) and with queuine (denoted L+Q). This corresponds to 88.3–92.7% of the reads remaining after depletion of rRNA, tRNA and mitochondrial RNAs (7.6–12.3 M reads were depleted).

We further validated RiboVIEW using independent datasets. An example obtained from C. elegans samples (22) is provided in Supplementary Figures S4 and S5 and Supplementary HTML files.

RiboQC

Reproducibility and quality control are a concern for any experimental procedure. RiboQC offers a collection of tools to scan one's own data for the most relevant aspects of ribosome profiling quality control.

Periodicity

For any given mRNA, ribosome footprint sequences should mainly correspond to the protein-coding portion of the transcript, extending from the start codon to the stop codon. The footprint-allocated position of the A-site should also show a strong preference for the first nucleotide position within each codon, in agreement with the reading frame. In order to monitor these aspects, we propose both the classical coverage representation (barplot, Figure 1B), which represents well the coverage obtained from different footprint lengths, and a recurrence plot, which is an unsupervised way to display recurring patterns in a series (14). The recurrence plot was previously developed for dynamical systems and can be adapted in a straightforward way to ribosome profiling data. A recurrence plot is robust to non-periodic coverage variations and preserves the positional information, whereas a Fourier transform would yield a summary value over all positions, and methods based on coverage in the 0, +1 or +2 frames lose positional information and could be biased by outlier peaks. Sufficient periodicity is attained if a peak at −12nt is present on the recurrence plot (−12nt corresponding to AUG initiation in the P-site) and distinct recurrence patterns occur every 3nt, at −9nt, −6nt, etc. We call ‘distinct’ a black recurrence band centered around −12, −9, …, +18 nt and separated from the next band by grey to white bands. In our demonstration dataset, the peak at AUG initiation is clearly visible as a dark band at −12nt, and recurrence starting at −12nt is shown by dark bands at −12nt, −9nt, …, +18nt, each well separated by a lighter band (Figure 1C left panel). This was the case for footprints of length 27–30nt. By contrast, footprints of length 32nt show a diffuse peak at −12nt (encompassing positions −13nt and −12nt) and lack recurrence at positions −9, +3, +9, +12, +15, +18 nt (Figure 1C right panel).