Detection of generic differential RNA processing events from RNA-seq data

Van Du T Tran; Oussema Souiai; Natali Romero-Barrios; Martin Crespi; Daniel Gautheret

2016 Feb 5;13(1):59–67. doi: 10.1080/15476286.2015.1118604

PMCID: PMC4829270 PMID: 26849165

ABSTRACT

RNA-seq data analysis has revealed abundant alternative splicing in eukaryotic mRNAs. However, splicing is only one of many processing events that transcripts may undergo during their lifetime. We present here RNAprof (RNA profile analysis), a program for the detection of differential processing events from the comparison of RNA-seq experiments. RNAprof implements a specific gene-level normalization procedure and compares RNA-seq coverage profiles at nucleotide resolution to detect regions of significant coverage differences, independently of splice sites or other gene features. We used RNAprof to analyze the effect of alternative-splicing regulators NSRa and NSRb on the Arabidopsis thaliana transcriptome. A number of intron retention events and alternative transcript structures were specifically detected by RNAprof and confirmed by qRT-PCR. Further tests using a public Mus musculus RNA-seq dataset and comparisons with other RNA isoform predictors showed that RNAprof uniquely identified sets of highly significant processing events as well as other relevant library-specific differences in RNA-seq profiles. This highlights an important layer of variation that remains undetected by current protocols for RNA-seq analysis.

KEYWORDS: Alternative polyadenylation, alternative splicing, alternative transcripts, RNA processing, RNA-seq

Introduction

High-throughput RNA sequencing (RNA-seq) has profoundly transformed transcriptome analysis through its ability to simultaneously characterize transcripts and quantify their abundance. One area where RNA-seq is especially powerful is the discovery of alternative splicing isoforms. Computational analysis of RNA-seq data obtained from different tissues or conditions has revealed a considerable extent of alternative splicing in plants and animals.^1-6

RNA splicing is only one of many events contributing to transcript isoform diversity. In all species, a large fraction of genes have alternative transcription start (TSS) and termination sites and undergo widespread post-transcriptional processing through cleavage by multiple endo- and exo-nucleases. For instance, in eukaryotes, snoRNA and miRNA precursors are processed from mRNA precursors by endonucleases such as Drosha or DCL1⁷ and, in turn, miRNAs are subject to tissue-specific editing and processing.⁸ Pre-mRNAs are cleaved at alternative polyadenylation sites⁹ and mature mRNAs are degraded from their 5′ and 3′ ends by a series of specific exonucleases.¹⁰ In bacteria, it is now clear that every operon produces multiple alternative transcripts with extensive variations in transcript starts and terminations.¹¹ Therefore it is important to interpret any RNA-seq experiment as a snapshot of a complex set of active and inactive RNAs produced by a network of transcription and processing factors.

Functional insight into alternative transcription is best achieved through differential analysis of RNA libraries from different tissues, environments or genetic backgrounds. Such experiments preferentially use 2 or more replicates for each condition to account for fluctuations occurring between libraries. More than a dozen computational tools are available to analyze such data and predict differentially expressed isoforms.¹² Most of these tools, however, focus on the detection of splicing variants. They implement a variety of quantification procedures based either on counting RNA-seq reads at exon junctions³ or within exons,¹ or on full isoform reconstruction and subsequent probabilistic assignment of reads to isoforms.⁶ Departures from this splicing-centric view of alternative transcription are rare. Two programs, DERfinder¹³ and rDiff,¹⁴ can detect variations occurring along RNA-seq coverage profiles at nucleotide resolution, independently of prior transcript models.

Here we introduce a novel nucleotide-level analysis of RNA-seq coverage centered on the transcription unit. By performing profile normalization at the transcription unit level, we focus on smaller-scale variations, which may reflect any type of RNA processing event, including alternative splicing and changes in 5′ and 3′ regions. Importantly, our program does not attempt to reconstruct full transcripts, thus bypassing error-prone reconstruction and counting procedures and enabling the observation of small-scale processing events producing short RNA fragments. Using RNA-seq data sets from Arabidopsis and mouse, we show that this method is able to detect strong, functionally significant events that escape current methods.

Results

The RNAprof pipeline

RNAprof (RNA profile comparison) is a suite of Perl and R programs that aims at discovering any variation occurring along the expressed fraction of a transcript, with or without pre-annotated splicing events or exon definition. RNAprof requires RNA-seq data obtained under 2 different conditions, preferably in biological replicates, mapped onto a reference genome using any common mapping software (mapping should be splice-aware in case of eukaryotic genomes). Structures of all transcription units (TUs), including non-coding genes and non-coding regions of coding genes, should also be provided. Alternatively, a companion program (GFFprof) can infer TU structures from the mapping data, enabling the study of RNA processing beyond pre-annotated exons and genes. RNAprof then seeks local profile variations occurring within the TU boundaries.

RNAprof involves 3 main steps: (1) computation of RNA-seq coverage profiles, (2) profile normalization and (3) differential event detection. The first step is straightforward and relies on the alignment files produced by the mapping software. The following gene-level normalization step is crucial to remove any difference in coverage resulting from differential gene expression, which is a predominant source of read count variation. For this task, we initially considered common normalization procedures used in RNA-seq analysis, including Total Count (TC), Median (MED), Upper Quartile (UQ),¹⁵ TMM,¹⁶ RLE (Relative Log Expression) and DESeq,¹⁷ yet all were found to be unsuited to gene-level normalization. Fig. 1 presents profile shapes that these methods failed to normalize. In Fig. 1A, although the transcript is differentially expressed between 2 conditions in triplicate, the profiles do not change in shape, and thus no differential event is supposed to exist. However, TMM, UQ, and RLE did not scale those profiles, providing size factors of 1 to all samples and resulting in detection of a whole-transcript differential event. In Fig. 1B, DESeq, RLE, and MED produced unwanted corrections to replicates with unusual background noise. UQ and TC were also sensitive to this noise to some extent. Fig. 1C shows a size difference between the 2 blocks. In such a situation, size factors of 1 are expected for all samples, but all methods except DESeq gave size factors different from 1. The situation in Fig. 1D is somewhat similar to the one in Fig. 1C, but there is little or no overlap between the 2 blocks. In this case, all methods failed to produce the required size factors of 1.

Figure 1. — Examples of RNA-seq coverage profiles leading to failure of common normalization procedures. Top frames show simulated RNA-seq profiles, bottom frames show the resulting size factors according to different normalization methods. (A) Conditions where TMM, UQ, RLE provide unwanted results. (B) Conditions where DESeq, RLE, MED provide unwanted results. (C) Conditions where all, but DB and DESeq, provide unwanted results. (D) Conditions where all but DB provide unwanted results. EXPECTED: Expected size factors, DB: Dominant Block, MED: Median, RLE: Relative Log Expression, TC: Total Count, UQ: Upper Quartile. Blue/red lines represent 2 experimental conditions. Dashed/dotted/solid lines: biological replicates.

To deal with those scaling issues, we designed a new normalization method, named Dominant Block, based on the region showing higher expression in each profile, independently of the behavior of the less covered regions (see Fig. 2A, Fig. S1, Materials and Methods). Dominant Block produced the appropriate profile normalization in all the cases where the other methods failed (see Fig. 1). In this approach, we first normalize coverages by library size. For each TU and each library, we sort position-specific counts in decreasing order (Fig. 2B). We then seek the first sharp descent in coverage occurring in any sample in the first quarter of the expressed region of the TU. This defines the “high coverage” region upon which normalization scale factors are computed (Fig. 2B, Materials and Methods).

Figure 2. — Graphical summary of the RNAprof normalization procedure. (A) Original RNA-seq mapping profiles. (B) Sorted profiles showing dominant block definition and normalization factors.

A differential processing event is a stretch of adjacent nucleotides whose coverage significantly differs between conditions. Such an event can be detected with the negative binomial test function implemented in the DESeq package,¹⁷ while considering read counts at the nucleotide, instead of gene, level and size factors obtained from our Dominant Block normalization. This test produces a fold-change and a raw p-value for each position in a TU. Events are then called by combining adjacent positions with significant variation and applying further constraints to limit variance among replicates (Fig. S1, Materials and Methods). The resulting size factors and average read counts are then used to estimate the score (fold-change) and p-value of the event using the same negative binomial test.

Analysis of transcript isoforms in a plant RNA splicing mutant

We applied RNAprof to analyze the RNA processing activity of Nuclear Speckle RNA (NSR) binding proteins NSRa and NSRb, 2 known regulators of alternative splicing in Arabidopsis thaliana.¹⁸ We submitted strand-specific polyA+ cDNA libraries from nsra/nsrb double mutant and wild type plants to Illumina sequencing, and aligned the sequence reads to the A. thaliana genome. The mapping results were then analyzed with RNAprof.

RNAprof identifies 1885 significant events in 1473 genes. The median size of events is 69 nt, with sizes ranging from 24 nt to 1645 nt. A wide range of transcript alterations is detected, including intron retention (Fig. 3A), alternative transcription initiation or termination (Figs. 3B, C), and alternative splicing events (Fig. 3D). Technical alterations affecting the RNA-seq profile, such as an insertion at the NSRa locus (responsible for the mutation), also produce striking signatures (Fig. 3E). We selected 18 highly significant events from 13 distinct genes for RT-qPCR validation. Optimized oligos allowed us to confirm 11 events from 9 genes (Table S1, Figs. S2-10). Further scrutiny of the unconfirmed events revealed that, in 5 out of 7 cases, changes were attributable to artifacts of the mapping software that were independent of the RNAprof analysis (Fig. S11). A set of significant events that we did not seek to confirm was associated with slight differences of coverage in highly expressed genes, probably due to the increased power of the statistical tests rather than to true biological causes (Fig. S12).

Figure 3. — Potential RNA processing events detected by RNAprof from the comparison of *A. thaliana* mRNA-seq libraries produced from wild type (WT) cells (red) and an *Atnsra/b* double mutant¹⁸ (blue). Each condition was analyzed in biological triplicates. In each plot, the X-axis represents gene coordinates (boxes and lines representing exons and introns, respectively) and the Y axis the RNA-seq coverage (number of reads at each position). Triangles interrupting the coverage plots designate collapsed regions with low coverage (less than 5% of the highest coverage), lozenges designate collapsed regions with several triangles between 2 events. Vertical purple lines and p-values indicate differential processing events. Size factors provide the scales of non-normalized profiles relative to normalized profiles. Panels A-E represent various types of differential events detected. (A) Intron retention in 4 introns of gene *AT4G35770*. (B) Alternative 5’ ends in gene *AT2G23810*. (C) Profile inversion in gene *AT1G19220* that may result from a premature 3’ end. (D) A putative new alternative exon in gene *AT3G27990*. (E) A genomic insertion in the *NSRa* gene (*AT1G76940*).

To highlight the characteristics of events detected by RNAprof, we compared our results with those produced by 4 other isoform detection/quantification programs: the Cufflinks/Cuffdiff suite,⁶ Diffsplice,³ DERfinder,¹³ and DEXSeq.¹ We also included DESeq2¹⁹ for assessing differential expression at the whole gene level. We were not able to test rDiff after several unsuccessful installation attempts. We ran each program against the same A. thaliana dataset as above (see Materials and Methods). DERfinder detected 17049 events with a significant p-value, but none was considered significant at a false discovery rate (FDR) < 0.1. There was little overlap among the top 100 hits of the other programs (Fig. S13), except between Cuffdiff and DESeq2, consistent with their similar underlying model (whole transcript or whole gene). The large difference between tools is explained by the different representation of isoforms that each software implements. Cuffdiff identifies full transcript isoforms that are differentially expressed, while DEXSeq and Diffsplice detect alternative splicing events at the exon or splice junction level. The 20% overlap between the top 100 RNAprof and Cuffdiff/DESeq2 hits suggests that a minor fraction of RNAprof hits is associated with differential expression at the whole gene or whole transcript level. However, overlap with the other tools was practically zero. Among the top 100 hits of the other tools that were not detected by RNAprof, we noted a number of events produced by (i) differential expression at the whole gene level, (ii) noise from low-count transcripts, and (iii) noise from inconsistent replicates (Fig. S14). The vast majority of intron retention events found by RNAprof were not identified by any of the other programs, although some were detected by Diffsplice (Fig. S15).

In the rare genes where events were detected by more than one method, individual methods always found events at different locations (Figs. S15-17). Finally, we noted that the only instance of exon skipping detected by RNAprof in this data set (Fig. 3D) could not be detected by any other program, notably DERfinder and DEXSeq, which are supposed to detect such events. DERfinder classified it as an insignificant event, and DEXSeq failed because the exon was not previously annotated, whereas RNAprof succeeded on account of its suitable normalization and independence of the gene structure.

Analysis by RNAprof of published RNA-seq data uncovers novel events

To illustrate the general applicability of RNAprof and assess its capacity to deal with long mammalian genes, we used the program to reanalyze a published RNA-seq survey of the mouse neural retina leucine zipper protein Nrl, a transcription factor involved in eye development.²⁰ The data set comprises 6 poly-A+ RNA libraries in biological triplicates from wild type and Nrl-/- mutant retina. We mapped each library to the mouse genome and analyzed the results using RNAprof and the other software listed above (see Materials and Methods). RNAprof identifies differential events in 1500 genes. Fig. 4 shows 5 examples from the top 20 scoring events. The mutated Nrl gene produces the most significant event due to residual expression of a 5’ exon in the KO mouse (Fig. 4A). Interestingly, several genes that were found differentially expressed in the original study, such as GUCA1B, ROM1 and CALU, were also among the top ranking RNAprof genes due to strong variations in their RNA-seq profiles (Figs. 4B-D). This includes a possible case of internal transcription start (Fig. 4B), accumulation of unspliced forms (Fig. 4C) and an alternative 5’ exon (Fig. 4D). Furthermore, certain genes that did not present differential expression as a whole in the previous publication reveal dramatic changes in their RNA-seq profiles, suggestive of differential processing (Fig. 4E).

Figure 4. — Potential RNA processing events detected by RNAprof from the comparison of *M. musculus* RNA-seq libraries produced from WT (red) and *Nrl*-/- mutant (blue) retinal cells.²⁰ See Legend of Fig. 3. (A) Event in the *Nrl* gene. (B) events in the *CALU* gene. (C) events in the *GUCA1B* gene. (D) event in the *ROM1* gene. (E) events in the *ANP32A* gene.

An inspection of the mapped reads indicates that a significant subset of the RNAprof events detected in Nrl mutants is not attributable to differential RNA processing but instead results from mutations occurring specifically in mutant mice (Figs. S18 and 19). Deletions or mutations in the genome sequence may cause reads to align to different locations or fail to align altogether. This creates local drops in RNA-seq coverage that are captured by RNAprof. Such mutations or deletions were particularly frequent in the mouse Nrl mutant, which we verified by changing our mapping procedure (Figs. S18 and 19). This “side effect” of RNA-seq profile analysis is indeed interesting as it can reveal events with deep functional consequences that, to our knowledge, escape current RNA-seq software. Obviously, RNA-seq profile analysis cannot document the origin of such sequence variation, which may be due to genome mutation as well as editing or modification at the transcript level.

Comparison of events predicted by RNAprof, Diffsplice, Cuffdiff, DEXSeq and DESeq2 using the mouse Nrl data showed only a slightly larger overlap than with the Arabidopsis nsra/nsrb data (Fig. S20). DERfinder found no significant event in this dataset either. RNAprof shared more events with Cuffdiff than with any other program when considering events at the gene level. However, method-specific predictions remained predominant for all programs and the precise locations of events generally differed from one method to another, even for events occurring in the same gene (Figs. S21 and 22).

Discussion

RNAprof is a novel method for the comparative analysis of RNA-seq profiles. It aims to detect local significant variations occurring inside each transcription unit (TU) across different experimental conditions. To this end, we implemented a specific normalization procedure based on the region with highest RNA-seq coverage in the TU. The pipeline then detects local differential variations at nucleotide level using functions from the DESeq library. Finally, a differential event is established from a stretch of consecutive positions with significant variation, after p-value adjustment.

The RNAprof pipeline is applicable in principle to any genome from bacteria to vertebrates and can be used to detect a wide variety of events impacting RNA-seq profiles. This includes biological events caused by changes in environment, genetic background, cell type or compartment, disease status, sex or age. The program can also be employed to observe the systematic effects of technical changes in RNA sequencing libraries, such as comparative profiles from polyA+, total, size-selected or 5′/3′ end-captured RNAs.

In the mammalian and plant gene knockout experiments we analyzed, RNAprof detected a diversity of events including intron retentions, alternative 5′ and 3′ exons, skipped exons, complex changes in transcript structures and profile variations caused by mutations in one of the conditions. A comparison with 4 other programs for alternative transcription analysis using the same data sets showed very little overlap with predictions from the other programs, each algorithm predicting its own specific set of events. Such a poor convergence was also observed in a recent benchmark of splice variant detection software.¹² Indeed, there is no consensus yet as to how transcript isoforms should be modeled and quantified. Although full-length and exon-level transcript representations remain necessary for assessing the functional impact of alternative transcription on protein products, analyses that focus on local events can also provide essential information. Tools such as RNAprof that can detect changes in RNA-seq profiles with no prior assumption should prove very useful as an initial screening to capture transcript variations that may result from any kind of pre- or post-transcriptional event.

Materials and methods

RNAprof overview

Most methods for measuring differential transcript expression measure variation in terms of amount of reads mapped to assembled transcripts or individual exons. Instead, RNAprof seeks variation at the nucleotide level within the limits of each annotated gene or TU, with or without knowledge on gene features, such as exons, introns or splice sites. RNAprof is intended to capture events associated with RNA processing such as alternative splicing, alternative transcription start site, alternative polyadenylation, variable termination mechanisms, intron processing, ncRNA processing or processing from UTRs. The gene coordinates for RNAprof may be obtained from a prior annotation or can be produced based on the mapping data via our assembly procedure. This procedure can either be guided by a prior annotation or performed de novo (see section “Input RNA-seq and annotation data” below). The RNAprof software archive, including documentation and test sets, is available at the following address: http://rna.igmors.u-psud.fr/Software/rnaprof.php

Input RNA-seq and annotation data

We assume that m + n RNA-seq libraries are available, where m replicates correspond to one experimental condition (M) and n replicates to the other experimental condition (N). The input mapping data must be generated in BAM format, with XS tag information, preferably with TopHat.²¹ The second required input to RNAprof is gene annotation, which serves to extract TU boundaries for signal normalization and event detection. Annotation can either be provided by users in the form of a standard GFF format file containing a list of gene/transcript coordinates or generated de novo using the GFFprof procedure. In this procedure, all mapping data are merged using SAMtools,²² nucleotide coverage is computed using BEDtools,²³ and coverage information is then used to identify TUs. One may opt to extend a prior annotation, i.e. merge a de novo and a prior annotation. In this process, genes can be extended at both ends based on the mapping data, forming new TUs.

The de novo annotation procedure can operate on either stranded or non-stranded RNA-seq data. In the case of non-stranded RNA-seq, the program attempts strand assignment based on prior annotation and/or spliced reads. This may lead to inexact gene coverage profiles in regions of dual-strand expression, which may in turn affect event prediction.

Profile calculation and normalization

For each independent mapping file (one per library), we use BEDtools to count reads at each genome position, and then calculate the coverage profiles for each individual TU. The profile of a TU in each library is represented as a vector containing the gene identification, chromosome, start/end positions and coverage counts. These profiles serve as input to the RNAprof R script.
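As an illustration of what a per-TU coverage profile contains, the following Python sketch computes nucleotide-level coverage from a list of read intervals. It is only a stand-in for the BEDtools step described above: the function name `tu_coverage` and the interval representation are assumptions made for this example, and spliced reads are treated as plain intervals.

```python
from typing import List, Tuple


def tu_coverage(reads: List[Tuple[int, int]], tu_start: int, tu_end: int) -> List[int]:
    """Per-nucleotide coverage of one TU.

    reads: (start, end) pairs in 1-based inclusive genome coordinates.
    tu_start, tu_end: 1-based inclusive TU boundaries.
    """
    length = tu_end - tu_start + 1
    diff = [0] * (length + 1)  # difference array: +1 where a read starts, -1 after it ends
    for start, end in reads:
        lo = max(start, tu_start)
        hi = min(end, tu_end)
        if lo > hi:
            continue  # the read does not overlap the TU
        diff[lo - tu_start] += 1
        diff[hi - tu_start + 1] -= 1
    coverage = []
    running = 0
    for delta in diff[:length]:
        running += delta  # the cumulative sum gives the depth at each position
        coverage.append(running)
    return coverage


if __name__ == "__main__":
    # Two reads overlap a TU spanning positions 2..7; the third read lies outside it.
    print(tu_coverage([(1, 5), (3, 8), (20, 25)], 2, 7))
```

One such vector per library, together with the TU identifier and coordinates, corresponds to the profile passed on to the normalization step.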

We introduce a novel normalization approach, which is better adapted to event detection at nucleotide level than state-of-the-art methods, such as Total Count, Median, Upper Quartile, DESeq, RLE, and TMM^15-17 (see Fig. 1 and Results). In this approach, read count is handled at the nucleotide, instead of gene, level, and each gene is considered as a “genome” (i.e., representing the entire set of observations). The normalization process (Fig. S1), named Dominant Block, is performed for each individual TU and uses the most expressed part of the TU in a library as the determinant for the TU expression profile. The process is described as follows.

The m + n expression profiles for a given TU are first normalized by library size (i.e. divided by the corresponding library size and multiplied by the median library size). We then sort position-specific counts in decreasing order, defining a decreasing profile. In the latter, a dominant block is determined with a descent angle of at least $\theta \approx \pi/2$ (Fig. 2). By default, we approximate this with: $\tan\theta \approx \tan(\pi/2) = \frac{\sin(\pi/2)}{\cos(\pi/2)} \approx \frac{1}{1 - \frac{1}{2}\left(\frac{\pi}{2}\right)^{2} + \frac{1}{24}\left(\frac{\pi}{2}\right)^{4}} \approx 50$. This descent should occur between the upper decile and the upper quartile of the part of the TU with coverage above a low-count threshold set by default at 0.05 times the highest coverage in the whole TU. If such a descent is not found, the dominant block is defined as the upper quartile block. The exclusion of low-count positions from the analysis helps reduce computing time and avoid producing figures with large uninformative regions. Moreover, this does not affect the final result, as no significant candidate would be expected in such low expressed regions.
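The default slope threshold of about 50 can be checked numerically. The short Python sketch below evaluates the fourth-order Taylor expansion of the cosine at π/2 used in the approximation above; the variable names are chosen for this illustration only.

```python
import math

x = math.pi / 2
# Fourth-order Taylor expansion of cos(x) around 0, evaluated at pi/2.
cos_approx = 1 - x**2 / 2 + x**4 / 24
# sin(pi/2) = 1, so tan(pi/2) is approximated by 1 / cos_approx.
tan_threshold = 1 / cos_approx

if __name__ == "__main__":
    print(round(cos_approx, 4), round(tan_threshold, 1))
```

The truncated series gives a small positive value for the cosine instead of zero, which is what turns the infinite tangent into a finite, usable threshold.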

We then take the smallest dominant block from all decreasing profiles. Its size determines the common dominant block size D. The total counts in the D-nucleotide blocks are used to normalize the m + n initial profiles. The size factor for each profile is defined by its dominant block total count divided by the median total count.
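The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the RNAprof R implementation: the function names are hypothetical, a "descent" is taken to be a drop of at least 50 counts between two adjacent positions of the sorted profile, and the low-count threshold is applied per profile.

```python
from statistics import median
from typing import List, Sequence


def dominant_block_size(profile: Sequence[float], tan_threshold: float = 50.0,
                        low_frac: float = 0.05) -> int:
    """Size of the dominant block of one library-size-normalized profile."""
    ordered = sorted(profile, reverse=True)  # the "decreasing profile"
    if not ordered or ordered[0] <= 0:
        return 0
    cutoff = low_frac * ordered[0]  # low-count threshold (assumed per profile)
    expressed = [c for c in ordered if c > cutoff]
    n = len(expressed)
    decile = max(1, n // 10)
    quartile = max(1, n // 4)
    # Assumption: a sharp descent is a drop >= tan_threshold between neighbours.
    for i in range(decile, quartile):
        if expressed[i - 1] - expressed[i] >= tan_threshold:
            return i
    return quartile  # fallback: the upper quartile block


def dominant_block_size_factors(profiles: Sequence[Sequence[float]],
                                library_sizes: Sequence[float]) -> List[float]:
    """Size factors for the m + n profiles of one TU."""
    med_lib = median(library_sizes)
    scaled = [[c / size * med_lib for c in prof]
              for prof, size in zip(profiles, library_sizes)]
    d = min(dominant_block_size(p) for p in scaled)  # common block size D
    totals = [sum(sorted(p, reverse=True)[:d]) for p in scaled]
    med_total = median(totals)
    if d == 0 or med_total == 0:
        raise ValueError("TU has no usable coverage")
    return [t / med_total for t in totals]


if __name__ == "__main__":
    same_top = [[100] * 10 + [10] * 30, [100] * 10 + [50] * 30]
    print(dominant_block_size_factors(same_top, [1_000_000, 1_000_000]))
```

In the usage example, the two profiles share the same highly covered block and differ only in their less covered region, so both size factors come out as 1, the behavior expected in the situation of Fig. 1C.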

Event detection

Local variation between expression profiles can be naturally seen as the result of differentially expressed nucleotides in the TU. To detect this, we exploit the negative binomial test implemented in DESeq (version 1.16.0),¹⁷ while considering read counts on nucleotides instead of genes. The size factors are obtained from our Dominant Block normalization. The test yields a fold-change and a raw p-value at each nucleotide, showing to some extent the differential expression measure at nucleotide level. We define, at each nucleotide, a score fc as the fold-change when the raw p-value is less than 1e-3 (by default), and as 1, otherwise. This score function determines a fold-change curve on the whole TU (Fig. S1).
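The score function fc can be written in a few lines. The Python sketch below assumes that the per-nucleotide fold-changes and raw p-values have already been obtained from the negative binomial test; the function name `fc_curve` is hypothetical.

```python
from typing import List, Sequence


def fc_curve(fold_changes: Sequence[float], p_values: Sequence[float],
             alpha: float = 1e-3) -> List[float]:
    """Keep the fold-change where the raw p-value is below alpha, else use 1."""
    return [fc if p < alpha else 1.0 for fc, p in zip(fold_changes, p_values)]


if __name__ == "__main__":
    print(fc_curve([2.0, 3.0, 0.5, 1.2], [1e-5, 0.2, 1e-4, 1e-3]))
```

Positions that fail the p-value threshold are flattened to 1, so only significant positions rise above or fall below the baseline of the curve.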

Each hill- or well-like pattern (each stretch of nucleotides with fc > 1 or fc < 1) identifies a candidate for local variation, i.e., an event. To reduce noise, especially due to DESeq outliers, and enforce stability between replicates at candidate events, we require that the variance between replicates of condition M is lower than the variance between the average of M and any replicate of N, and is also lower than the variance between all libraries. Constraints for condition N are built symmetrically.
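Extracting hill and well candidates from the fold-change curve amounts to finding maximal runs of positions above or below 1. The Python sketch below illustrates this step only; the replicate variance constraints are not implemented, and the function name and the 0-based inclusive coordinates are choices made for this example.

```python
from typing import List, Optional, Tuple


def candidate_events(fc: List[float]) -> List[Tuple[int, int, str]]:
    """Return (start, end, kind) for each maximal run with fc > 1 or fc < 1."""
    events: List[Tuple[int, int, str]] = []
    start = 0
    kind: Optional[str] = None
    for i, value in enumerate(fc):
        current = "hill" if value > 1 else "well" if value < 1 else None
        if current != kind:
            if kind is not None:
                events.append((start, i - 1, kind))  # close the previous run
            start, kind = i, current
    if kind is not None:
        events.append((start, len(fc) - 1, kind))  # run reaching the TU end
    return events


if __name__ == "__main__":
    print(candidate_events([1, 2, 3, 1, 0.5, 0.4, 2, 1]))
```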

We apply the following decision tree model recursively to merge consecutive events. This allows for producing a unique event instead of several fragmented ones separated by small regions. Firstly, the 2 events to be merged possess the same pattern (hill or well). Secondly, the nucleotides in between either have low counts (as defined above) or have the same trend of fold change (greater or less than 1) as the 2 events. Thirdly, the separating segment size is bounded by the sizes of the 2 events to be merged.
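The three merging rules can be sketched as a left-to-right pass over consecutive events. In the Python illustration below, the names are hypothetical, the fold-change trend in the gap is read from the raw (unthresholded) fold-changes, and "bounded by the sizes of the 2 events" is interpreted as a gap no longer than the smaller event, which is an assumption of this sketch.

```python
from typing import List, Sequence, Tuple

Event = Tuple[int, int, str]  # (start, end, kind), 0-based inclusive


def can_merge(a: Event, b: Event, raw_fc: Sequence[float],
              is_low: Sequence[bool]) -> bool:
    if a[2] != b[2]:
        return False  # rule 1: same pattern (hill or well)
    gap = range(a[1] + 1, b[0])
    size_a = a[1] - a[0] + 1
    size_b = b[1] - b[0] + 1
    if len(gap) > min(size_a, size_b):
        return False  # rule 3 (assumed form): gap bounded by the smaller event
    up = a[2] == "hill"
    # rule 2: each gap position is low-count or follows the same trend
    return all(is_low[i] or (raw_fc[i] > 1 if up else raw_fc[i] < 1) for i in gap)


def merge_events(events: List[Event], raw_fc: Sequence[float],
                 is_low: Sequence[bool]) -> List[Event]:
    merged: List[Event] = []
    for ev in events:
        if merged and can_merge(merged[-1], ev, raw_fc, is_low):
            prev = merged.pop()
            merged.append((prev[0], ev[1], ev[2]))  # the grown event is reused
        else:
            merged.append(ev)
    return merged


if __name__ == "__main__":
    evs = [(0, 4, "hill"), (7, 11, "hill"), (30, 34, "hill"), (36, 40, "well")]
    print(merge_events(evs, [1.5] * 41, [False] * 41))
```

Because a merged event is compared again with the next one, a chain of fragments separated by short gaps collapses into a single event, which is the intent of applying the rules recursively.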

Read counts for merged segments are computed based on the total segment size, for each library. Using the size factors computed above, we apply the negative binomial test from DESeq to calculate the fold-change (score) and p-value of this event. P-values are further corrected for multiple testing by applying a Benjamini-Hochberg adjustment²⁴ to the total number of identified events.
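For reference, the Benjamini-Hochberg adjustment applied to the event p-values can be written compactly. The Python sketch below is a generic implementation of the standard step-up procedure (equivalent in intent to R's `p.adjust(..., method = "BH")`), not code taken from RNAprof.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```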

Runtime

A complete RNAprof run on the Nrl mouse dataset comprising 6 libraries of 30-39 M mapped reads each required 14 h on a 24-core 252 Gb RAM server. This included 3.5 h for BAM file analysis and coverage calculation and 10 h for the RNAprof event detection stage per se. An additional 4 h would be required for the optional reannotation of transcription units. The runtime for the RNAprof event detection stage is mostly dependent on the total size of the TU regions covered by RNA-seq reads, hence it will be generally lower for non-mammalian genomes.

Library preparation and RNA sequencing in A. thaliana nsra/nsrb mutants

The A. thaliana nsra/nsrb double mutant was obtained in the Columbia-0 (Col-0) background by crossing mutants AtNSRa (SALK_003214) and AtNSRb (SAIL_717) from the SALK and SAIL T-DNA collections, respectively. The plants were grown under long-day conditions (16 h light/8 h dark) on solid half-strength MS medium. Three independent biological replicates were produced. Plantlets were collected at developmental growth stage 1.04,²⁵ cultivated in the presence of auxin under different light conditions to maximise alternative splicing events.¹⁸ Total RNA was extracted using Qiagen RNeasy® Plant kit according to the supplier’s instructions (QIAGEN S.A, France) and strand-specific PolyA+ cDNA libraries were prepared. For RNA-seq experiments, the Illumina HiSeq2000 technology was used to perform paired-end 100 bp sequencing, using the TruSeq Stranded mRNA SamplePrep Guide 15031047_D protocol (Illumina®, California, USA). The RNA-seq samples were sequenced in paired-end (PE) mode with a sizing of 260 bp and a read length of 100 bp. Six samples were run per HiSeq2000 lane using bar-coded adapters, yielding approximately 30 million PE reads per sample. Sequence files have been submitted to the NCBI GEO database under accession GSE65717.

qPCR event validation

Total RNA was prepared from 14-day-old plantlets treated with 1 µM NAA for 24 hours¹⁸ using Qiagen RNeasy plant mini kit. The DNase treatment was performed according to the manufacturer’s protocols. For reverse transcription with SuperScriptII (Invitrogen), 2 µg of total DNase-treated RNA and oligo-dT primers were used. One microliter of the resulting cDNA solution was used for RT-PCR or RT-qPCR analyses. The latter was performed in biological triplicate using standard protocols (40 cycles, 60 °C annealing) on a Roche LightCycler 480 apparatus (Roche) with the LightCycler® 480 SYBR Green I Master (Roche). The primers used are listed in Table S2. Each cDNA sample was precisely calibrated and verified for 2 constitutive genes (AT1G13320; AT4G26410²⁶). Each differential event was normalized with respect to an internal gene probe (called INPUT) corresponding to a common exon. This allows us to differentiate the ratio of splicing events independently of the individual level of gene expression (splicing index) in each sample. For visual splicing assays to monitor splicing changes, a semi-quantitative RT-PCR was used.¹⁸ PCR amplification was performed as follows: one cycle of 4 min at 98 °C, followed by 26 cycles of 30 s at 98 °C, 30 s at 55 °C and 1 min at 72 °C. The products were separated on a 7.5% polyacrylamide gel stained with SYBR green (Invitrogen) and revealed by ChemiDoc MP Imaging System (BIO-RAD) with Image Lab software.

3′ RACE PCR

For confirmation of alternative 3’ ends in gene AT1G28330 (Fig. S10), total RNA was prepared from 14-day-old nsra/nsrb plantlets treated with 1 µM NAA for 24 hours using Qiagen RNeasy plant mini kit. The DNase treatment was performed according to the manufacturer’s protocols. First, expression of AT1G19220 was evaluated by conventional RT-qPCR with primers located in exon 2. Then, 5 µg of total RNA was subjected to 3’-RACE with a GeneRacer kit (Invitrogen) according to the manufacturer’s protocol. Briefly, total RNA was reverse-transcribed with a GeneRacer™ Oligo dT primer (GCTGTCAACGATACGCTACGTAACGGCATGACAGTG(T)₂₄) that contains a dT tail of 24 nucleotides (SuperScript™ III RT module) and a 5’ end containing the priming sites for the GeneRacer™ 3’ and the GeneRacer™ 3’ Nested primers. This cDNA was used as template for PCR amplification with the GeneRacer™ 3’ primer and an exon 2 (AT1G19220) specific forward primer. Two nested PCR amplifications were then performed using the first amplification product as template: one with the GeneRacer™ 3’ Nested primer and an exon 5 (AT1G19220) specific forward primer, and a second one with the GeneRacer™ 3’ Nested primer and an exon 7 (AT1G19220) specific forward primer.

Mouse Nrl-/- data set

Fastq files corresponding to the 6 wild type (WT) and Nrl-/- sequencing libraries were retrieved from the SRA database (SRA reference: SRP009096).²⁰ Sequences originated from an Illumina Genome Analyzer IIx. Each file contained 36 to 48 million non-directional, single-end, 70 nt reads.

Comparison of isoform calling software

Arabidopsis nsra/nsrb and mouse Nrl libraries were analyzed using the same protocol except when stated otherwise. Common to all isoform calling procedures, we aligned RNA-seq reads to their respective genome (Arabidopsis thaliana genome TAIR10 assembly or Mus musculus genome NCBI 37.1) using TopHat2 V.2.0²¹ with parameter max-multihits=1 (additional parameters for Arabidopsis runs: min-intron-length=5, max-intron-length=2000). DEXSeq and RNAprof were run with annotation files TAIR10.22 or NCBI37.61. Although RNAprof can work either with or without prior annotation, we ran RNAprof using prior annotations to facilitate comparison with the other programs. Moreover, we did not use the RNAprof option for genome re-annotation from BAM files. For Cuffdiff, we first performed transcript assembly from each independent BAM file with Cufflinks and then merged annotations with Cuffmerge. DERfinder and Diffsplice were used with no prior annotation. The program versions and parameters for event determination are presented in Table S3. Finally, to compare predictions from all the programs, we mapped the coordinates of differential events or transcripts provided by each program to the TAIR10.22 or NCBI37.61 gene coordinates and compared events at the gene level (Figs. S13, 20).
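The final gene-level comparison can be sketched as an interval-overlap step followed by set operations. This is a minimal illustration rather than the authors' pipeline: the function `events_to_genes`, the gene identifiers, the program names and all coordinates are hypothetical, and strand is ignored for simplicity.

```python
from collections import defaultdict


def events_to_genes(events, genes):
    """Return the set of gene IDs overlapped by at least one event.

    events: iterable of (chrom, start, end), 1-based inclusive coordinates.
    genes:  iterable of (gene_id, chrom, start, end), same convention.
    """
    by_chrom = defaultdict(list)
    for gene_id, chrom, start, end in genes:
        by_chrom[chrom].append((start, end, gene_id))

    hits = set()
    for chrom, ev_start, ev_end in events:
        for g_start, g_end, gene_id in by_chrom.get(chrom, []):
            # Two closed intervals overlap when each starts before the other ends
            if ev_start <= g_end and g_start <= ev_end:
                hits.add(gene_id)
    return hits


# Hypothetical annotation and per-program event calls
genes = [
    ("GENE_A", "Chr1", 100, 500),
    ("GENE_B", "Chr1", 800, 1200),
    ("GENE_C", "Chr2", 50, 300),
]
calls = {
    "program_1": [("Chr1", 450, 520), ("Chr2", 10, 60)],
    "program_2": [("Chr1", 480, 490), ("Chr1", 900, 950)],
}

gene_sets = {name: events_to_genes(ev, genes) for name, ev in calls.items()}
shared = gene_sets["program_1"] & gene_sets["program_2"]    # genes called by both
unique_1 = gene_sets["program_1"] - gene_sets["program_2"]  # genes called only by program_1
```

Collapsing events to genes makes programs with different output units (exons, regions, transcripts) comparable, at the cost of treating two distinct events in the same gene as agreement.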

Supplementary Material

Supplemental_Files.docx

krnb-13-01-1118604-s001.docx (4.5 MB, docx)

Disclosure of potential conflicts of interest

No potential conflicts of interest were disclosed.

Acknowledgments

This work was supported by grants ANR-12-ADAP-0019 (RNADAPT) and ANR-10-INBS-0009 (France Génomique) from Agence Nationale pour la Recherche. Work in the MC laboratory is supported by a public grant overseen by the French National Research Agency (ANR) as part of the «Investissement d’Avenir» program, through the “Lidex-3P” project and a French State grant (reference ANR-10-LABX-0040-SPS) funded by the IDEX Paris-Saclay, ANR-11-IDEX-0003-02.

References

1. Anders S, Reyes A, Huber W. Detecting differential usage of exons from RNA-seq data. Genome Res 2012; 22:2008-17; PMID:22722343; http://dx.doi.org/10.1101/gr.133744.111
2. Filichkin SA, Priest HD, Givan SA, Shen R, Bryant DW, Fox SE, Wong WK, Mockler TC. Genome-wide mapping of alternative splicing in Arabidopsis thaliana. Genome Res 2010; 20:45-58; PMID:19858364; http://dx.doi.org/10.1101/gr.093302.109
3. Hu Y, Huang Y, Du Y, Orellana CF, Singh D, Johnson AR, Monroy A, Kuan PF, Hammond SM, Makowski L et al. Diffsplice: the genome-wide detection of differential splicing events with RNA-seq. Nucleic Acids Res 2013; 41:e39; PMID:23155066; http://dx.doi.org/10.1093/nar/gks1026
4. Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, Haag JD, Gould MN, Stewart RM, Kendziorski C. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 2013; 29:1035-43; PMID:23428641; http://dx.doi.org/10.1093/bioinformatics/btt087
5. Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 2010; 464:773-7; PMID:20220756; http://dx.doi.org/10.1038/nature08903
6. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol 2013; 31:46-53; PMID:23222703; http://dx.doi.org/10.1038/nbt.2450
7. Brown JW, Marshall DF, Echeverria M. Intronic noncoding RNAs and splicing. Trends Plant Sci 2008; 13:335-42; PMID:18555733; http://dx.doi.org/10.1016/j.tplants.2008.04.010
8. Yang W, Chendrimada TP, Wang Q, Higuchi M, Seeburg PH, Shiekhattar R, Nishikura K. Modulation of microRNA processing and expression through RNA editing by ADAR deaminases. Nat Struct Mol Biol 2006; 13:13-21; PMID:16369484; http://dx.doi.org/10.1038/nsmb1041
9. Beaudoing E, Gautheret D. Identification of alternate polyadenylation sites and analysis of their tissue distribution using EST data. Genome Res 2001; 11:1520-6; PMID:11544195; http://dx.doi.org/10.1101/gr.190501
10. Garneau NL, Wilusz J, Wilusz CJ. The highways and byways of mRNA decay. Nat Rev Mol Cell Biol 2007; 8:113-26; PMID:17245413; http://dx.doi.org/10.1038/nrm2104
11. Nicolas P, Mäder U, Dervyn E, Rochat T, Leduc A, Pigeonneau N, Bidnenko E, Marchadier E, Hoebeke M, Aymerich S et al. Condition-dependent transcriptome reveals high-level regulatory architecture in Bacillus subtilis. Science 2012; 335:1103-6; PMID:22383849; http://dx.doi.org/10.1126/science.1206848
12. Liu R, Loraine AE, Dickerson JA. Comparisons of computational methods for differential alternative splicing detection using RNA-seq in plant systems. BMC Bioinformatics 2014; 15:364; PMID:25511303; http://dx.doi.org/10.1186/s12859-014-0364-4
13. Frazee AC, Sabunciyan S, Hansen KD, Irizarry RA, Leek JT. Differential expression analysis of RNA-seq data at single-base resolution. Biostat 2014; 15(3):413-26; PMID:24398039; http://dx.doi.org/10.1093/biostatistics/kxt053
14. Drewe P, Stegle O, Hartmann L, Kahles A, Bohnert R, Wachter A, Borgwardt K, Rätsch G. Accurate detection of differential RNA processing. Nucleic Acids Res 2013; 41:5189-98; PMID:23585274; http://dx.doi.org/10.1093/nar/gkt211
15. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC Bioinformatics 2010; 11:94; PMID:20167110; http://dx.doi.org/10.1186/1471-2105-11-94
16. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 2010; 11:R25; PMID:20196867; http://dx.doi.org/10.1186/gb-2010-11-3-r25
17. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol 2010; 11(10):R106; PMID:20979621; http://dx.doi.org/10.1186/gb-2010-11-10-r106
18. Bardou F, Ariel F, Simpson CG, Romero-Barrios N, Laporte P, Balzergue S, Brown JWS, Crespi M. Long non-coding RNA modulate alternative splicing regulators in Arabidopsis. Dev Cell 2014; 30(2):166-76; PMID:25073154; http://dx.doi.org/10.1016/j.devcel.2014.06.017
19. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014; 15(12):550; PMID:25516281; http://dx.doi.org/10.1186/s13059-014-0550-8
20. Brooks MJ, Rajasimha HK, Roger JE, Swaroop A. Next-generation sequencing facilitates quantitative analysis of wild-type and Nrl(-/-) retinal transcriptomes. Mol Vis 2011; 17:3034-54; PMID:22162623
21. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 2013; 14:R36; PMID:23618408; http://dx.doi.org/10.1186/gb-2013-14-4-r36
22. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics 2009; 25:2078-9; PMID:19505943; http://dx.doi.org/10.1093/bioinformatics/btp352
23. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 2010; 26:841-2; PMID:20110278; http://dx.doi.org/10.1093/bioinformatics/btq033
24. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc B 1995; 57:289-300
25. Boyes DC, Zayed AM, Ascenzi R, McCaskill AJ, Hoffman NE, Davis KR, Gorlach J. Growth stage-based phenotypic analysis of Arabidopsis: a model for high throughput functional genomics in plants. Plant Cell 2001; 13:1499-1510; PMID:11449047; http://dx.doi.org/10.1105/tpc.13.7.1499
26. Czechowski T, Stitt M, Altmann T, Udvardi MK, Scheible WR. Genome-wide identification and testing of superior reference genes for transcript normalization in Arabidopsis. Plant Physiol 2005; 139:5-17; PMID:16166256; http://dx.doi.org/10.1104/pp.105.063743
