Systematic integration of RNA-Seq statistical algorithms for accurate detection of differential gene expression patterns

Panagiotis Moulos; Pantelis Hatzis

doi:10.1093/nar/gku1273

. 2014 Dec 1;43(4):e25. doi: 10.1093/nar/gku1273

Systematic integration of RNA-Seq statistical algorithms for accurate detection of differential gene expression patterns

Panagiotis Moulos ^1,^*, Pantelis Hatzis ^1,^*

PMCID: PMC4344485 PMID: 25452340

Abstract

RNA-Seq is gradually becoming the standard tool for transcriptomic expression studies in biological research. Although considerable progress has been recorded in the development of statistical algorithms for the detection of differentially expressed genes using RNA-Seq data, the list of detected genes can differ significantly between algorithms. We present a new method (PANDORA) that combines multiple algorithms toward a summarized result, more efficiently reflecting true experimental outcomes. This is achieved through the systematic combination of several analysis algorithms, by weighting their outcomes according to their performance with realistically simulated data sets generated from real data. Results supported by the analysis of both simulated and real data from different organisms as well as correlation with PolII occupancy demonstrate that PANDORA improves the detection of differential expression. It accomplishes this by optimizing the tradeoff between standard performance measurements, such as precision and sensitivity.

INTRODUCTION

One of the common applications of RNA-Seq (1) is genome-wide transcript expression profiling and detection of differentially expressed genes (DEGs) across distinct biological conditions. RNA-Seq experiments, like all high-throughput techniques, are subject to several sources of bias, systematic or experimental (2). To account for such biases, several algorithms have been developed based on different statistical models (3). Another important factor contributing to the native complexity of RNA-Seq data is the often ambiguous interpretation of the observed results. For example, when two genes share functional elements (e.g. overlapping exons), it is challenging to distinguish the real source of expression. Such issues are more evident when trying to explain expression differences between different isoforms of the same gene. On the other hand, although RNA-Seq has inflated the number of potential applications of gene expression experiments, it is still often used as a replacement of DNA microarrays, because of its greater dynamic range, higher resolution and lower experimental noise.

A lot of research effort has been devoted either in developing novel methods for the detection of DEGs using RNA-Seq or in evaluating the performance of existing ones. It was recently shown that different methods perform best in different data sets or sources of error (4,5). This, however, results in significant variations in the quality of gene expression data published by different investigators, in addition to the inherent biological variations between the analyzed samples. Furthermore, different result reporting formats often lead to difficulties in the presentation and comprehension of the findings. A potential way to overcome the above caveats is to combine current algorithms toward the derivation of more robust gene lists, characterized by both higher statistical power and lower numbers of false positives and false negatives (referred to hereafter as ‘false hits’). In other words, instead of creating additional statistical algorithms, prior knowledge on the advantages and disadvantages of existing algorithms derived by extensive comparison studies (e.g. (5)) can be exploited. These algorithms can be combined based on their strength, by using, for example, a P-value weighting scheme (6), resulting in faster convergence to an optimized list of DEGs where the ratio between true positive genes and false hits is maximized.

In this article, we demonstrate that the systematic combination of several current RNA-Seq statistical algorithms based on performance weighting can improve the overall detection of DEGs, by reducing false hits while maintaining true positives. To this end, we developed PANDORA (PerformANce Driven scOring of RNA-Seq stAtistics), a new method for the analysis of RNA-Seq gene expression data which deploys a combination of existing algorithms. We show that PANDORA, as evaluated on simulated and real data sets, achieves significant improvements in both precision and sensitivity, and, in the worst case, similar performance to existing algorithms that have been previously evaluated as most robust (4,5). Finally, by correlating two independent methods for the evaluation of gene expression (RNA-Seq and PolII occupancy data), we show that PANDORA offers an intuitive methodology for the optimization of statistical accuracy and agreement between alternative transcriptional measurements. This method, together with additional statistical test combination methods, is implemented in metaseqR, an R/Bioconductor package. metaseqR provides a straightforward interface to several current statistical tests and normalization algorithms for RNA-Seq data. At the same time it provides greater insight to the data through extensive and comprehensive reports of the findings.

MATERIALS AND METHODS

P-value combination

metaseqR can combine the P-value scores returned by the application of more than one statistical tests with six approaches. Throughout the rest of this section, we generalize the term ‘P-value’ to include the score (1 – posterior probability of differential expression) returned by baySeq and the score (1 – q statistic) returned by NOISeq. Specifically, let p_ij be the resulting P-value for gene i after the application of the statistical test j. Then, the combined P-value for gene i, is derived using one of the following approaches:

Simes method

Let p_i₁, p_i₂, …, p_im be the P-value scores returned for gene i after the application of m statistical tests. Let also p_i₍₁₎, p_i₍₂₎, …, p_i₍_m₎ be the aforementioned P-values sorted in increasing order. Then, according to Simes’ method, the probability

can be used as the combined P-value for a set of m statistical tests (7). In case of independent tests, this is the exact P-value for the combination. In case of dependency, numerical evidence (7) shows that this approximation can still be used.

Fisher's method

According to the Fisher's method, let f be the statistic defined by the natural logarithm of the product of m individual P-values (from m statistical tests) multiplied by –2:

It can be proved (8) that f follows an X² distribution with 2m degrees of freedom, which can be used to derive the combined P-value from m statistical tests. However, Fisher's method was developed with the intention of combining P-values produced by the same statistical test applied in different data sets, where the underlying null hypothesis is the same (meta-analysis), rather than combining P-values produced by different statistical tests applied to the same data set. In the context of different statistical tests, the null hypothesis remains essentially the same (testing differential against non-differential gene expression), although there might be differences in its formulation, according to the context of each statistical test.

Whitlock's method

According to Whitlock's weighted Z-method (9), the weighted Z statistic for each gene i

follows the standard Normal distribution N(0,1), which can be used to derive the P-value of the combined tests. Regarding the usage of the Whitlock method for combining P-values from different statistical tests, the assumptions described above (Fisher's method) also apply.

Maximum P-value

In this case, the combined P-value is

In the case of a predefined level of significance α, this approach is equivalent to the ‘intersection’ of the genes for which the null hypothesis of no differential expression is rejected (p_ij < α), as these are derived from each statistical test. Thus, the final list will contain genes which have been characterized as differentially expressed by all the statistical tests applied. The maximum P-value ensures that the false positives are minimized at a (usually high) cost on the true positives (statistical power).

Minimum P-value

In this case, the combined P-value is

In the case of a predefined level of significance α, this approach is equivalent to the ‘union’ of the genes for which the null hypothesis of no differential expression is rejected (p_ij < α), as these are derived from each statistical test. Thus, the final list will contain genes which have been characterized as differentially expressed by at least one of the statistical tests applied. The minimum P-value ensures that the true positives are maximized at a (usually high) cost on the false positives (type I error).

PANDORA P-value

In this case, the combined P-value is

where w_j represent automatically assigned or user-specific weights for the j statistical tests performed. The weights can be automatically assigned according to the area under the false discovery curve (AUFC) estimated from simulated data based (possibly) on the data set under investigation, or can be user-defined according to previous performance experience (see ‘The choice of weights for PANDORA’ section in Supplementary Material). In the case of automatic estimation of weights using the AUFCs, the weights are estimated using the following formula:

where AUFC_j is the area under the false discoveries curve for the results of statistical test j. Apart from the automatic weights assignment, metaseqR allows the user to specify weights, based. for example. on previous own experience, on previous studies or manual inspection of the performance measurements offered by metaseqR. In all cases, the supplied weights must have a unit sum.

Precision-sensitivity tradeoff metrics

In order to quantify the tradeoff between precision and sensitivity when assessing methods, we use the F₁-score (or F-measure). It comprises a statistical measurement used to assess the performance of statistical tests or binary classifiers (10) by combining both the precision and sensitivity. The F₁-score is defined as the harmonic mean of precision and sensitivity:

If precision and sensitivity are replaced by their own definitions

respectively, it follows that:

As an additional precision-sensitivity tradeoff measurement, we use the ad hoc and more intuitive ratio of True Positives to the sum of False Positives and False Negatives, referred as False Discovery Tradeoff (FDT). The FDT is defined as:

Performance evaluation tools

In order to assess the performance of the combined statistical testing of metaseqR as compared to the usage of single tests, we used several evaluation tools. Specifically:

We used false discovery curves (FDCs) and false negative curves (FNCs) to measure the progression of type I and type II errors, respectively, while traversing lists ranked according to statistical significance from top-to-bottom for FDCs and from bottom-to-top for FNCs.
We used receiver operating characteristic (ROC) analysis (the area under the curve) and assessment of the true false discoveries to demonstrate that weighting and combining statistical tests according to their performance yields results as good as or better than the best performing algorithms in the majority of test cases.
We used the F₁-score and the area under F₁-score curves to demonstrate its maximization in final gene lists, produced by weighted P-value combination, in most simulated cases. The area under the F₁-score curve is constructed by ranking the genes according to their combined P-value in increasing order and calculating the respective F₁-score while traversing the ordered list.
We applied ROC and F₁-score analysis in real data from the SEQC project and true false discovery rate (FDR) analysis in a data subset from Brawand et al.
We applied F₁-score analysis in two RNA-Seq data sets (11,12), based on coupled PolII occupancy across gene bodies, the correlation of which with gene expression as measured by RNA-Seq was used as an independent measurement of transcription quantification.

All the aforementioned performance measurements regarding simulated data were calculated by averaging the results over 10 simulations.

Simulated data

The synthetic data sets referenced in the Results section were generated with the make.sim.data.sd function included in metaseqR. This function implements the simulator described in (5), which creates synthetic RNA-Seq gene counts based on the negative binomial distribution, where the mean and dispersion parameters are estimated from real data. Additionally, as the main normalization algorithm that we use is EDASeq (13), which makes use of the GC content of genes, we assign to each synthetic gene a GC content value sampled from the respective model organism. For the purpose of this article, we used the following data sets (the Human, Mouse and Fruitfly data sets were downloaded from the ReCount database (14)):

For the simulations based on human, we used combined data from Montgomery et al. (15) and Pickrell et al. (16) as also proposed in (5). The Montgomery data set contains gene expression abundances from 60 individual from European descent while the Pickrell data set contains gene expression abundances from 69 Nigerian individuals. The original purpose of both studies was to identify genetic variations.
For the simulations based on chimpanzee, we used a subset of data generated by Brawand et al. (17) who studied gene expression evolution in 6 organs across 10 representative mammalian species. For the needs of the simulations we used a subset of four sequencing runs with libraries constructed from polyadenylated RNA extracted from the same number of male chimpanzee brain prefrontal cortexes.
For the simulations based on mouse, we used data from Bottomly et al. (18). Among other work, the authors studied gene expression differences between two commonly used mouse models, C57BL/6J and DBA/2J. The data set contains RNA-Seq gene read counts from 21 samples, 10 C57BL/6J and 11 DBA/2J, respectively.
For the simulations based on fruitfly, we used a subset of data generated by Graveley et al. (19) who studied Drosophila melanogaster transcriptome at various developmental stages. For the needs of the simulations we used a subset consisting of RNA-Seq read counts from 15 adult male fruitflies.
For the simulations based on Arabidopsis thaliana, we used data from Yanming et al. (20), who studied the defense response of the plant against bacterial infection. The data set was embedded in the R package NBPSeq and the Bioconductor package TCC.

Based on the above data sets and the implemented simulator, we used two main synthetic data configurations from each organism:

A simulated RNA-Seq data set with 10 000 genes, consisting of two experimental conditions, each with three replicates. The percentage of DEGs was set to 10% (1000 hypothetical genes), of which 50% are up-regulated in the first condition and 50% are up-regulated in the second.
A simulated RNA-Seq data set with 10 000 genes, consisting of two experimental conditions, each with seven replicates. The percentage of DEGs was set to 12% (1200 hypothetical genes), of which ∼42% are up-regulated in the first condition and ∼58% are up-regulated in the second, comprising a slightly unbalanced set of deregulated genes among the two conditions.

Real RNA-Seq data

In addition to the usage of simulated data, we evaluated the performance of metaseqR using data from the SEQC project, the latest development under the MAQC study, which historically has been used to assess performance, similarities and reproducibility between microarray platforms. The SEQC data set contains 92 spike-in RNA controls at known concentrations, which allow the construction of an a priori ground truth regarding the specific transcripts. Furthermore, the SEQC project includes a set of ∼1000 genes validated by TaqMan quantitative polymerase chain reaction (qPCR); thus the fold change of these genes as calculated by qPCR can be used to estimate true levels of differential expression and provide measurements regarding the performance of statistical algorithms and the combination methods of metaseqR. Details regarding the SEQC data set are discussed in detail elsewhere (4,21). For the purposes of this study we used processed SEQC data by Rapaport et al. (4), available at https://bitbucket.org/soccin/seqc/. In order to apply EDASeq normalization to SEQC data including the spike-in controls, we retrieved the nucleotide sequences of the spike-ins available from Life Technologies^TM and calculated their GC content. GC content for the remaining genes was retrieved from Ensembl (22).

Moreover, to assess the performance of metaseqR-supported algorithms and P-value combination methods regarding the control of false positive outcomes, we used the two conditions of the SEQC data individually, as well the data from Brawand et al. described in the previous section to perform ‘same versus same’ mock comparisons under settings where differential expression should not be detected between the mock biological conditions. For the SEQC data, we performed two comparisons using data from the individual groups of the data set (A and B) by separating each group to two subgroups (A1, A2 and B1, B2, respectively) with two replicates each. As both the groups contained five technical replicates, we excluded from each group the replicate which was clustered further from the other two in a sample-wise hierarchical clustering.

RNA-Seq and PolII occupancy data

The performance of PANDORA was additionally assessed by correlating the fold change of DEGs derived from RNA-Seq data with the corresponding fold change in PolII occupancy across gene bodies. To this end we used the following two coupled (RNA-Seq and PolII occupancy) data sets: (i) RNA-Seq and PolII data from Mokry et al. (12), where a doxycyclin-inducible shRNA targeting β-catenin allows for complete and specific blocking of the constitutively active Wnt pathway in colorectal cancer cells and (ii) RNA-Seq and PolII data from Lin et al. (11) (GEO accession number GSE38148), where the RNA PolII elongation factor Ell3 is knocked-down in mouse ES cells. Genes were categorized as up- or down-regulated according to PolII occupancy if the logarithm (base 2) of fold change between average ChIP-Seq reads per 1 kb across the gene body in treatment over control was above the 90th or the 10th quantile of the fold change distribution, respectively.

RESULTS

PANDORA's approach and performance evaluation criteria

PANDORA is based on an intuitive approach for ranking each statistical test according to its performance based on real-time simulations. Synthetic data sets, with a priori known DEGs, are constructed using parameters estimated from well characterized real data sets for several organisms or from the investigator's own data. Consequently, the statistical analysis of these data sets can be used to assess the performance of each algorithm. The latter is assessed using FDCs, where the performance is measured by the number of false positive findings encountered in a list, ordered according to decreasing statistical significance. PANDORA uses the AUFC to construct weights for each algorithm which are applied to the P-values returned by each test.

We evaluated the performance of PANDORA using experimental settings based on simulated data or data acquired from the SEQC study (21) and Brawand et al. (17). The synthetic data sets were created using a previously described methodology (5,23) based on real publicly available RNA-Seq count data for five organisms, namely, Homo sapiens (Human), Pan troglodytes (Chimpanzee), Mus musculus (Mouse), Drosophila melanogaster (Fruitfly) and Arabidopsis thaliana (Arabidopsis). We used a limited set of simulations and real data, which are sufficient to prove the added value of PANDORA. Specifically, we restricted the properties of synthetic data sets to two main configurations: a data set with two conditions of three biological replicates each and balanced differential expression between conditions (‘3 replicates—balanced DEG’) and a data set with two conditions of seven biological replicates each and slightly unbalanced differential expression between conditions (‘7 replicates—unbalanced DEG’).

As shown below, the combined usage of statistical tests for RNA-Seq data does not necessarily perform better in terms of statistical power than certain algorithms reported to stand out in related evaluations. With certain evaluation metrics, this is something to be expected because of the convex P-value weighting scheme (Supplementary Material, Section 2.1). However, as performance is not only measured by power, but sensitivity as well, PANDORA offers an optimal tradeoff between precision and sensitivity, when multiple statistical tests are applied on the same data set and their resulting statistical scores are properly weighted. To demonstrate this, we used established measurements, such as ROC analysis and FDCs, as well as the harmonic mean of precision and sensitivity (F₁-score, Materials and Methods). To avoid confusion, we will use the term ‘P-value’ to refer to all the statistical scores (P-values, posterior probabilities, etc.), even if not all statistical scores comprise nominal P-values.

PANDORA performs similar to the best algorithms in detecting differential expression

We compared PANDORA with five additional combination methods, either intuitive (e.g. the intersection of all genes marked as statistically deregulated by each examined test) or mined from the related statistical literature and used in the appropriate context (e.g. the Simes P-value combination method). Specifically, we applied the following methods:

The P-value combination method proposed by Simes (7), referred to henceforth as Simes.
The union of all genes marked as statistically significant deregulated by each examined test by considering the minimum statistical score among the examined tests as the final P-value, referred to henceforth as Union.
The intersection of all genes marked as statistically deregulated by each examined test by considering the maximum P-value among the examined tests as the final P-value, referred to henceforth as Intersection.
The P-value combination method proposed by Fisher for the meta-analysis of multiple independent observations testing the same hypothesis (8), referred to henceforth as Fisher. Although the classical interpretation of meta-analysis refers to the application of the same statistical test in similar data sets to test against a total null hypothesis, we use it to combine the P-values of different statistical tests applied to the same data set under a generalized hypothesis test for each gene, which in our case is ‘differential’ against ‘non-differential’ expression.
The meta-analysis procedure proposed by Whitlock (9), referred to henceforth as Whitlock. Its main advantage over the Fisher method is the possibility to use weights for each set of observations; we thus apply it as a counterpart to the PANDORA weighting scheme, using the same weights. We use it under the same assumptions as the Fisher method.

Moreover, we paralleled the performance of the combination methods to the performance of the six statistical tests (DESeq, edgeR's exact test, limma with the voom method, NBPSeq, NOISeq and baySeq (20,24–28)) when applied alone. In total, we compared 12 possible statistical analyses.

Figure 1 depicts the FDCs for the six statistical tests alone and the six P-value combinations, applied on the synthetic data sets generated based on real data from five organisms. To avoid possible biases introduced by major (DESeq) or minor (edgeR, limma voom, NBPSeq, NOISeq and baySeq) differences in the normalization approaches followed by each algorithm, we used EDASeq (13) as a common normalization framework. An exploration of the ‘3 replicates—balanced DEG’ simulation (left panel), which resembles typical real-life experimental settings, reveals that the overall best performing individual test is limma voom, followed by baySeq and edgeR's exact test (edgeR's General Linear Models test was not examined here). DESeq and NBPSeq always share the last positions in terms of numbers of false positives located among the top ranked genes, whereas NOISeq is in the middle. Regarding the P-value combinations, classical meta-analysis (Fisher, Whitlock) and the Simes method, all perform quite poorly, presenting a large number of false positives among the top genes, very similar to the Union method which is the most liberal, and thus expected to return a large number of false discoveries. PANDORA performs quite well but not better than limma voom because of the convex nature of the weights. Notably, the Intersection, which is supposed to be the strictest selection strategy, appears inferior to limma voom and sometimes baySeq, apart from the Fruitfly simulations. This observation suggests that the often intuitive use of the common DEGs, returned by a variety of tests, as a robust signature to distinguish among biological conditions and sometimes as a ‘ground truth’ for evaluation (4), is not always an optimal solution, especially with a small number of replicates. The performance trends of individual as well as combined tests as assessed by FDCs do not change significantly when using normalization algorithms suggested in the distinct analysis packages (Supplementary Figure S4). They also do not change when combining adjusted for multiple testing instead of raw P-values for the four tests (DESeq, edgeR, limma and NBPSeq) returning nominal P-values (Supplementary Figures S13 and S15). Finally, we examined the progression of Type II errors using FNCs for each algorithm and simulation configuration. In this case, the above performance evaluation is almost reversed and many of the best performing algorithms exhibit a large number of false negatives, with certain combination methods (PANDORA among them) showing quite stable results (Supplementary Results).

ROC analysis and assessment of FDRs place PANDORA among the top performing algorithms

Next, we evaluated the performance of PANDORA as compared to the other methods using ROC analysis and assessment of the true FDR, inspired by previous comparison work (5). ROC analysis using simulated data and the six individual statistical tests is in accordance with previous studies (5,26). Specifically, when using three replicates with balanced differential expression between two conditions (Figure 2, left panels), limma voom was constantly the best performer, followed by edgeR. The remaining four positions are shared among DESeq, NBPSeq, NOISeq and baySeq, with DESeq usually presenting lower performance and baySeq the most variant performance. The performance rank of the tests does not notably change when using a higher number of replicates and slightly unbalanced differential expression between conditions (Figure 2, right panels). Instead, the ROC area under the curve is significantly increased, as expected due to the larger number of replicates. The test that does not seem to benefit from more replicates is NOISeq. In addition, the performance variability of baySeq is reduced. Regarding the performance of the combination methods, Intersection is constantly the lowest scoring and most variable method when using a small number of replicates, followed by Simes. These observations do not particularly change in the presence of more replicates, apart from Intersection, which becomes more accurate, but equally variable. Finally, PANDORA performs well and very close to the highest performing individual tests in both simulation sets for all organisms, but not better, again due to the convex weighting scheme.

Figure 2. — ROC analysis using EDASeq normalization. The boxplots depict the summarized areas under the ROC curve across 10 simulations for each organism and for each simulation configuration. The results for each organism and simulation configuration can be distinguished by the right and top side titles of each panel, referring to the organism from which simulation parameters are estimated and the simulation configuration, respectively. The dashed rectangle highlights the PANDORA method results. Each statistical test's and each P-value combination method's performance trends are similar, apart from the Mouse case, where the performance is increased in both simulation configurations. The impact of incorporating more replicates to the simulation is evident, as most methods achieve area under the curve values closer to 1 in all cases. PANDORA performs very similar to the highest scoring algorithms according to the area under the ROC curves.

We also assessed the accuracy of individual as well as combined tests by measuring their true FDR using simulated data (Figure 3). From this, we excluded NOISeq for the reasons explained in (5). Regarding the individual tests in the ‘3 replicates—balanced DEG’ simulations, only limma voom and baySeq show true FDR levels below 5%. DESeq and NBPSeq show the poorest performance, while edgeR is in the middle. These performance trends do not change for the ‘7 replicates—unbalanced DEG’ case, where the true FDR is closer to 5% for all individual tests. From the P-value combination methods, Intersection performs best, as expected due to stringency of selection, while the Simes method performance is between the performances of individual tests. Remarkably, the Union method, which is the most liberal and expected to exhibit the highest number of true false discoveries, performs better than the Fisher and Whitlock methods. Finally, and most importantly, PANDORA exhibits true FDR values below 5%, which is better than limma voom, the best individual algorithm so far according to our study. All the aforementioned observations do not change when using each package's specific normalization algorithm instead of EDASeq (Supplementary Figure S10).

PANDORA exhibits the best tradeoff between precision and sensitivity

We evaluated the ability of individual as well as proposed combined tests to maximize the number of true positives, while at the same time keeping the number of false hits at a low level thus to maximize the F₁-score. Figure 4 shows that in all simulated cases, PANDORA successfully combines several statistical tests toward a more robust gene list. Specifically, for the ‘3 replicates—balanced DEG’ configuration, PANDORA achieves F₁-score levels slightly higher than the best performing individual tests for Human, Chimpanzee and Arabidopsis. The F₁-score levels are much higher in Mouse and Fruitfly simulations. In the ‘7 replicates—unbalanced DEG’ study, the PANDORA method markedly stands out for all five organisms. Moreover, the good performance of PANDORA is supported by the area under the F₁-score curve (Supplementary Figure S7), which depicts the evolution of the F₁-score while traversing ranked gene lists from higher to lower statistical significance. Regarding the other combination methods, in the majority of cases they demonstrate lower F₁-scores than PANDORA, but higher F₁-scores than certain rather conservative individual tests (NOISeq, baySeq). Specifically, among the other methods, Simes presents the highest F₁-score, followed by Whitlock, Fisher, Union and Intersection. Although Whitlock performs quite well, if we break down the F₁-score value to its components (true and false hits, Supplementary Table S1 and Figure S12), we see that, although it takes into account the performance of each algorithm, the Whitlock method is poor because of the high number of false positives as compared to the gain in true positives. This suggest that the use of weighted Z-scores is not appropriate for combining statistical tests for RNA-Seq data and that PANDORA manages to accomplish this task in simulated data. Overall, our observations strengthen the main rationale behind PANDORA: as each statistical test presents certain advantages (whether these are low numbers of false positives at the cost of true positives or high numbers of true positives but also an excessive number of false positives too), a more robust gene list can be derived by applying an objective methodology to combine them. PANDORA achieves that by AUFC weighting, which ranks the statistical tests according to their ability to provide a reasonably low number of false positives in the top genes (500 in simulation studies). Finally, as expected, the Intersection method presents the lowest F₁-score in all simulations across all organisms, being overly conservative and ignoring a lot of true hits returned by each statistical test (Supplementary Table S1 and Figure S12).

Figure 4. — Analysis of the F₁-score using EDASeq normalization. The boxplots summarize the F₁-scores across 10 simulations for each organism and for each simulation configuration. The F₁-scores for each organism and simulation configuration can be distinguished by the right and top side titles of each panel, referring to the organism from which simulation parameters are estimated and the simulation configuration, respectively. The dashed rectangle highlights the PANDORA method results. The prevalence of PANDORA is evident in all cases. Intersection and baySeq consistently show the lowest F₁-score. F₁-scores are calculated using the final gene lists returned by each method at a P-value cutoff of 0.05.

From the individual tests, limma voom and edgeR demonstrate the highest F₁-score values, while baySeq and NOISeq show consistently lower values in most cases. This suggests that although the latter two are both successful in a satisfactory top-to-bottom statistical ranking of genes, their final outcome is excessively strict, resulting in significant loss of true positives (and a respective disproportional increase in false negatives, Supplementary Table S1 and Figure S12). Notably, baySeq presents an F₁-score as low as the conservative strategy of constructing final gene lists by intersecting lists derived by individual tests. This is also supported by FNCs (Supplementary Figures S3, S5, S14 and S16), where in the majority of the cases baySeq presents a higher area under the FNC as compared to the other algorithms, either combined or individual. The tradeoff between precision and sensitivity is also assessed using corrected P-values (Supplementary Results) and by measuring the ratio of true positives to false hits (Supplementary Figures S25 and S26). Overall, the F₁-score analysis reveals what more classical statistical algorithm performance evaluation measures, such as FDCs and ROC, do not: PANDORA successfully provides optimized gene lists, with respect to the tradeoff between true positives and false hits.

Robust performance of PANDORA on real RNA-Seq data

Our evaluation results are based mostly on the simulated data sets, where we could control the outcome of the experiment. However, as data sets from real biological sources often exhibit unexpected biases and other sources of biological noise, we also evaluated PANDORA regarding (i) its ability to identify true or false differential expression with RNA-Seq data generated in the latest versions of the MAQC study, namely, the SEQC project for the evaluation of sequencing platforms regarding gene expression and (ii) its ability to control false discoveries with mock comparisons using data from SEQC and four male chimpanzee prefrontal cortex samples from Brawand et al. (17) which have also been used in the evaluation of the DEXSeq package (29). The SEQC study includes spike-in RNA controls at known concentrations (ERCC spike-in data), as well as a set of ∼1000 genes validated by TaqMan qPCR analysis (TaqMan data) which aid in testing the performance of DEG discovery.

ROC analysis using the SEQC data classified PANDORA somewhere in the middle of the performances of the individual tests (Figure 5A). This was again to be expected because of the convexity of weights and because limma voom (which is weighted more than the rest) seems to underperform both in TaqMan and ERCC spike-in data. It should be noted that for the analysis of the SEQC data, we used weights estimated for Human from the combined data set of Montgomery et al. (15) and Pickrell et al. (16). We did not estimate weights based on the actual SEQC data, as the technical replication in the experimental design would bias the dispersion estimation in our simulator.

From the individual algorithms, DESeq and NBPSeq show the highest area under the curve, followed by edgeR, limma voom, baySeq and NOISeq when using the TaqMan data. When using ERCC spike-ins, DESeq is again top, followed by NOISeq, edgeR, limma voom and NBPSeq. Regarding the P-value combination methods, they all present similar values, apart from Fisher in TaqMan data and Simes in ERCC spike-in data. Union and Intersection score among the highest in both cases. The performance classification obtained by inspecting the F₁-score values for each data set and for each algorithm (individual or combined, Figure 5B) is quite different. Thus, DESeq shows one of the lowest F₁-scores, very close to the Intersection, which is expected to have a low F₁-score for reasons explained in previous sections. All the other methods perform similarly when using TaqMan data, with NBPSeq and NOISeq slightly above the others. With ERCC spike-ins, edgeR, NBPSeq and NOISeq stand out, followed by limma voom. baySeq performs poorly and Fisher and Whitlock are the best P-value combination methods in this case. Again, the performance of PANDORA is in the middle, for the reasons explained above. The aforementioned observations do not change when using adjusted P-values (Supplementary Figure S23A and B), apart from the performance of DESeq regarding the F₁-score, which drops even more for ERCC spike-in data.

Finally, we evaluated the approximation of the true FDR of individual algorithms and P-value combination methods using three ‘same versus same’ comparisons, two by separating the SEQC data groups into two subgroups each and one more using data from Brawand et al. (Table 1), excluding again NOISeq (5). All the tested methods achieve a ‘same versus same’ comparison FDR below 5% for all three comparisons. From the individual tests DESeq and NBPSeq score worst, presenting a higher FDR in two out of three comparisons. The best individual test is limma voom with FDR < 0.01% in all comparisons, followed by baySeq and edgeR. From the P-value combination methods, Intersection is the best, as expected (but with several pitfalls explained in previous sections), followed by PANDORA. The worst combination method is Fisher, followed by Whitlock, Union and Simes. Notably, PANDORA achieves one of the best rankings, presenting also several advantages, as described in previous sections. The aforementioned performance trend does not change when using each package's normalization algorithm instead of EDASeq (Supplementary Table S2).

Table 1. False discovery rates (Benjamini–Hochberg adjusted P-values) from three ‘same versus same’ comparisons (EDASeq normalization).

	DESeq	edgeR	voom	NBPSeq	baySeq	Simes	Union	Intersection	PANDORA	Fisher	Whitlock
SEQC group A	0.0238	0.0007	<0.0001	0.0177	0.0001	0.0201	0.0270	<0.0001	0.0007	0.0420	0.0266
SEQC group B	0.0003	0.0012	<0.0001	0.0019	<0.0001	0.0014	0.0033	<0.0001	<0.0001	0.0343	0.0101
Brawand data	0.0046	0.0009	<0.0001	0.0048	0.0001	0.0045	0.0065	<0.0001	<0.0001	0.0184	0.0133

Open in a new tab

Optimal performance of PANDORA in correlating RNA-Seq with PolII occupancy

Next, we sought to evaluate PANDORA's performance on real RNA-Seq data by utilizing coupled PolII occupancy across gene bodies as an alternative approach for transcription quantification. The data sets we utilized included coupled RNA-Seq and PolII ChIP-Seq data generated, in one instance, in a colorectal cancer cell line engineered to carry a doxycyclin-inducible shRNA targeting β-catenin, which allows for complete and specific blocking of the—in these cells constitutively active—Wnt pathway (12); in the other instance in mouse ES cells after shRNA-mediated knock-down of the RNA PolII elongation factor Ell3 (11). Both coupled data sets utilized consider knock-down studies of factors that are expected, among others, to change PolII occupancy over gene bodies as well. Expression changes were calculated by RNA-Seq and PolII occupancy knock-down over wild type. F₁-scores for each statistical algorithm and P-value combination method were calculated by using PolII occupancy as a ‘ground truth’ representation. To evaluate the tradeoff between the correlations of transcription changes, the above were deployed as a graphical evaluation method of each algorithm (Figure 6). As expected, stricter algorithms which yield fewer DEGs (Intersection, baySeq, DESeq) show better correlation with PolII occupancy but much lower F₁-scores, because of the exclusion of many true hits, as those are defined by a cutoff in PolII occupancy across gene bodies. On the other hand, more liberal algorithms (Union, NBPSeq, Fisher, edgeR) show a larger F₁-score but a much lower correlation with PolII occupancy. It is evident that PANDORA achieves an optimal tradeoff between precision-sensitivity (F₁-score) as defined by PolII occupancy and correlation of gene expression as measured by two independent techniques. The optimization is clearer for the Lin et al. data set (Figure 6, left panel). This observation highlights the added value of PANDORA regarding both the optimization of the tradeoff between true and false hits, as well as the fact that it yields biologically significant results: i.e. good correlation between two gene expression abundance methods using real data, when the number of DEGs is not known a priori.

Figure 6. — Tradeoff between accuracy (F₁-scores) and correlation with PolII occupancy for all methods. When using two coupled RNA-Seq and PolII ChIP-Seq data sets, stricter algorithms (Intersection, baySeq) yield higher correlation with differences in transcription as measured by PolII occupancy at the cost of false hits (low F₁-scores). The opposite is observed for more liberal methods (Union, edgeR). PANDORA achieves an optimal tradeoff between accuracy and correlation of transcription differences between treatment and control, as measured by RNA-Seq and PolII occupancy. The arrows point at the PANDORA method.

DISCUSSION

In this article, we present PANDORA, a method which combines existing statistical tests for RNA-Seq data. PANDORA is implemented in metaseqR, a Bioconductor package, which offers an interface for several RNA-Seq data normalization methods and statistical tests coupled with extensive and comprehensive reports of the results. The strong points of PANDORA and metaseqR not fully addressed by existing packages are:

The optimization of the tradeoff between precision and sensitivity in the vast majority of test cases.
The robustness in detecting differential expression without loss of statistical power.
The provision of a straightforward interface to eight normalization methods and nine statistical tests for RNA-Seq gene expression data through the use of seven R/Bioconductor packages developed for this purpose, allowing any combination among the above.
The detailed, comprehensive and interactive report produced at the end of each analysis, which to our knowledge is unique among related open-source tools.
The intuitive combination of multiple statistical tests applied on the same data with six possible methods.

RNA-Seq is gradually becoming the standard tool in gene expression research, with a wealth of statistical algorithms for the detection of DEGs emerging recently. Although there is a recent shift in software tool development aimed at understanding more complex genomic events that can be detected and studied by RNA-Seq (e.g. alternative splicing), it is most frequently applied to investigate changes in a ‘summarized gene’ expression across several experimental conditions. In addition, looking into more complex events, such as alternative isoform representation, often benefits from having established robust calls for deregulated genes. This is partly justified by a recent publication (30) where the authors analyzed transcript expression across several human tissues and cell lines and concluded that there is one dominant transcript per gene.

Our results were described based on raw, rather than corrected for multiple testing, P-values where appropriate. The issue of multiple testing correction and its application has been exhaustively treated in the literature. Its effect can be quite misleading with certain algorithms (e.g. DESeq, NBPSeq). Specifically, using simulated data sets for five organisms and two experimental configurations, the performance of DESeq, edgeR, NBPSeq and limma voom is not improved when using adjusted P-values, as measured by FDCs (Supplementary Figures S13 and S15). On the other hand, when using adjusted P-values for these four statistical tests, the performance as measured by FNCs dramatically drops, with limma voom exhibiting the most moderate accuracy loss. We thus believe that such results/phenomena are usually not noted, as most efforts in statistical algorithms are directed toward the minimization of false positives—without regard to the false negatives. This leads to overly strict outcomes and restricted downstream analyses (e.g. pathway analysis).

One possible solution to this problem would be to use either Bayesian approaches (e.g. baySeq or EBSeq) or more empirical ones (e.g. NOISeq), where ‘traditional’ multiple testing correction is not applied. We demonstrate that although these methods perform quite well as assessed by FDCs, FNCs and ROC analysis, when looking at actual numbers of true and false hits at specific statistical thresholds, NOISeq and baySeq perform badly: they generate an excessively high number of false hits, particularly false negatives. For the researcher who wants to look at a strict list of top DEGs, this is welcome. It, however, can also be achieved by imposing stricter statistical thresholds with other algorithms. In addition, not all methods return nominal P-values which can be subjected to multiple testing correction (NOISeq, baySeq) and, for these reasons, raw P-values have been used before for evaluation purposes (5). Using P-value combination applied on many statistical tests with unadjusted P-values seems to provide a solution to the problem described above. Specifically, PANDORA achieves the goal of a good tradeoff between true and false hits, by combining statistical tests using weights based on FDCs. Most importantly, it performs well when assessing accuracies based on real data. Thus, we propose that PANDORA can be used as an empirical alternative to the often strict multiple testing correction procedures.

As shown by F₁-score analysis, PANDORA is not the best performer when combining P-values adjusted for multiple testing correction. This, in our opinion is not a bottleneck, because, first, in a realistic experimental setting, the inclusion of more than three biological replicates is rare. Our analysis indicates that in the presence of a low number of replicates per experimental condition, and when using adjusted P-values for DESeq, edgeR, limma voom and NBPSeq: (i) the restriction of false positives is rather low as demonstrated by FDCs, (ii) the loss in numbers of false negatives is high as demonstrated by FNCs, (iii) ROC analysis does not show notable differences in performance and (iv) F₁-scores slightly drop in most cases for the four tests, especially in the human and mouse cases, two organisms that dominate current biological studies. Second, in some cases a PANDORA type of combining tests may not be desirable. This would occur either because it does not perform well for a specific data set (metaseqR provides evaluation facilities to test this) or a specific statistical test known to work well for a biological system under investigation. In such cases, metaseqR still provides an easy interface to several normalization algorithms, statistical tests and quality control measures, combined with very rich and detailed reporting capabilities.

In simulation studies, where the synthetic data sets are produced by strictly controlled computational means, PANDORA works considerably well. This may not always be the case with real data. The analysis of real RNA-Seq data from the SEQC project demonstrated a more moderate performance of the PANDORA method. At this point, it should be stressed that (i) we used Human weights estimated from the Montgomery–Pickrell combined data sets, which give greater weights to limma voom, edgeR and baySeq algorithms and (ii) the assessment is based on the TaqMan data and the ERCC spike-in controls, which both constitute a fixed set of ‘ground truth’ transcripts and the performance is based on these fixed numbers. In a real-life situation, there is no fixed number of DEGs that are expected from an experiment. This number is rather defined by statistical and/or empirical thresholds chosen by the analyst, without any guarantee that the false positives and false negatives are minimized. In this sense, methodology assessment with simulated data sets, where the number of differential synthetic genes is known but each test is free to return any number of genes, may well prove more crucial in the determination of which methods perform best.

Furthermore, technical replication in the SEQC data set is suitable for assessing technology reproducibility. However, it fails to capture biological noise present in more realistic experimental settings, which can be partially caught by simulated data created with parameters estimated from real data. We thus advocate the use of PANDORA, which excels in the majority of test cases, as its usage can offer a very good starting point for downstream analyses. The gene lists with minimized false hits it generates should be appreciated especially in low replication data sets, such as clinical applications.

The robustness of PANDORA is further supported when using PolII occupancy as a measurement of transcription. When the correlation of differential gene expression, as detected by RNA-seq, with differential average PolII occupancy across gene bodies was used as an independent accuracy measurement, we observed that PANDORA offers a good tradeoff between precision and sensitivity (F₁-score). At the same time it achieves a good correlation with PolII occupancy. On the other hand, very liberal methods (e.g. Union, NBPSeq) fail to correlate well with PolII, even if they generate higher F₁-scores. This is the expected outcome of the inclusion of more false positives. Stricter methods correlate better with PolII (again to be expected because of stringency), but fail to offer a good tradeoff between precision and sensitivity.

Finally, at the end of each analysis, metaseqR creates a very detailed and partially interactive report, not offered by similar open-source packages. This report includes, among others, an automatically generated text which can be directly used in the ‘Methods’ section of a scientific article and aids the bench biologist in understanding the pre-processing, normalization, data filtering and statistical testing that has been performed. Detailed information regarding the numbers of DEGs and the reproducibility of the analysis is also provided. In addition, a lot of diagnostic plots with short explanatory texts are available, coupled with tables depicting the top DEGs and all the intermediately generated read count tables can be retrieved through the report for further analysis. In this way, the experimentalist has a quick overview of the results without traversing multiple and separated gene lists and figures.

To conclude, PANDORA is a new method for weighting the results of statistical tests for RNA-Seq data based on their performance. Using simulated data generated based on parameter estimation from real data sets we have demonstrated that PANDORA achieves optimal tradeoff between precision and sensitivity. Finally, we have shown that PANDORA performs as well as individual tests using classical evaluation measurements and that it achieves an optimal tradeoff between performance and correlation of differential expression as measured by PolII occupancy.

AVAILABILITY

PANDORA is implemented in metaseqR, a Bioconductor (http://www.bioconductor.org) package for the analysis of RNA-Seq gene expression data providing an interface for several normalization methods and statistical tests, methods for combining statistical tests as well as detailed and comprehensive reporting facilities.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

SUPPLEMENTARY DATA

supp_43_4_e25__index.html^{(993B, html)}

Acknowledgments

The authors would like to especially thank Iannis Talianidis, Christoforos Nikolaou and Thanasis Margaritis for fruitful comments that improved the manuscript. The authors would also like to thank Matthieu Lavigne for testing the software and proposing several fixes and improvements and Maria Fousteri for going through metaseqR's results when applied on RNA-Seq data provided by her lab.

Footnotes

Present addresses:

Panagiotis Moulos, Biomedical Sciences Research Center ‘Alexander Fleming’, 34 Fleming str, 16672, Vari, Greece.

Pantelis Hatzis, Biomedical Sciences Research Center ‘Alexander Fleming’, 34 Fleming str, 16672, Vari, Greece.

FUNDING

European Union (European Social Fund – ESF) and Greek national funds through the Operational Program ‘Education and Lifelong Learning’ of the National Strategic Reference Framework (NSRF) – Research Funding Program: THALES, Investing in knowledge society through the European Social Fund. Also, Marie Curie Career Integration to P.H. [293968]. Funding for open access charge: Marie Curie Career Integration to P.H. [293968].

Conflict of interest statement. None declared.

REFERENCES

1.Wang Z., Gerstein M., Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 2009;10:57–63. doi: 10.1038/nrg2484. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Dillies M.A., Rau A., Aubert J., Hennequet-Antier C., Jeanmougin M., Servant N., Keime C., Marot G., Castel D., Estelle J., et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform. 2013;14:671–683. doi: 10.1093/bib/bbs046. [DOI] [PubMed] [Google Scholar]
3.Oshlack A., Robinson M.D., Young M.D. From RNA-seq reads to differential expression results. Genome Biol. 2010;11:220. doi: 10.1186/gb-2010-11-12-220. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Rapaport F., Khanin R., Liang Y., Pirun M., Krek A., Zumbo P., Mason C.E., Socci N.D., Betel D. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 2013;14:R95. doi: 10.1186/gb-2013-14-9-r95. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Soneson C., Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics. 2013;14:91. doi: 10.1186/1471-2105-14-91. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Genovese C.R., Roeder K., Wasserman L. False discovery control with p-value weighting. Biometrika. 2006;93:509–524. [Google Scholar]
7.Simes R.J. An improved Bonferroni procedure for multiple tests of significance. Biometrika. 1986;73:751–754. [Google Scholar]
8.Fisher R.A. Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd; 1932. [Google Scholar]
9.Whitlock M.C. Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach. J. Evol. Biol. 2005;18:1368–1373. doi: 10.1111/j.1420-9101.2005.00917.x. [DOI] [PubMed] [Google Scholar]
10.Powers D.M.W. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation. J. Mach. Learn. Technol. 2011;2:37–63. [Google Scholar]
11.Lin C., Garruss A.S., Luo Z., Guo F., Shilatifard A. The RNA Pol II elongation factor Ell3 marks enhancers in ES cells and primes future gene activation. Cell. 2013;152:144–156. doi: 10.1016/j.cell.2012.12.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Mokry M., Hatzis P., Schuijers J., Lansu N., Ruzius F.P., Clevers H., Cuppen E. Integrated genome-wide analysis of transcription factor occupancy, RNA polymerase II binding and steady-state RNA levels identify differentially regulated functional gene classes. Nucleic Acids Res. 2012;40:148–158. doi: 10.1093/nar/gkr720. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Risso D., Schwartz K., Sherlock G., Dudoit S. GC-content normalization for RNA-Seq data. BMC Bioinformatics. 2011;12:480. doi: 10.1186/1471-2105-12-480. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Frazee A.C., Langmead B., Leek J.T. ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics. 2011;12:449. doi: 10.1186/1471-2105-12-449. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Montgomery S.B., Sammeth M., Gutierrez-Arcelus M., Lach R.P., Ingle C., Nisbett J., Guigo R., Dermitzakis E.T. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010;464:773–777. doi: 10.1038/nature08903. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Pickrell J.K., Marioni J.C., Pai A.A., Degner J.F., Engelhardt B.E., Nkadori E., Veyrieras J.B., Stephens M., Gilad Y., Pritchard J.K. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010;464:768–772. doi: 10.1038/nature08872. [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Brawand D., Soumillon M., Necsulea A., Julien P., Csardi G., Harrigan P., Weier M., Liechti A., Aximu-Petri A., Kircher M., et al. The evolution of gene expression levels in mammalian organs. Nature. 2011;478:343–348. doi: 10.1038/nature10532. [DOI] [PubMed] [Google Scholar]
18.Bottomly D., Walter N.A., Hunter J.E., Darakjian P., Kawane S., Buck K.J., Searles R.P., Mooney M., McWeeney S.K., Hitzemann R. Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays. PLoS ONE. 2011;6:e17820. doi: 10.1371/journal.pone.0017820. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Graveley B.R., Brooks A.N., Carlson J.W., Duff M.O., Landolin J.M., Yang L., Artieri C.G., van Baren M.J., Boley N., Booth B.W., et al. The developmental transcriptome of Drosophila melanogaster. Nature. 2011;471:473–479. doi: 10.1038/nature09715. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Di Y., Schafer D.W., Cumbie J.S., Chang J.H. The NBP negative binomial model for assessing differential gene expression from RNA-Seq. Stat. Appl. Genet. Mol. Biol. 2011;10:1–28. [Google Scholar]
21.Shi L., Campbell G., Jones W.D., Campagne F., Wen Z., Walker S.J., Su Z., Chu T.M., Goodsaid F.M., Pusztai L., et al. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat. Biotechnol. 2010;28:827–838. doi: 10.1038/nbt.1665. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Flicek P., Ahmed I., Amode M.R., Barrell D., Beal K., Brent S., Carvalho-Silva D., Clapham P., Coates G., Fairley S., et al. Ensembl 2013. Nucleic Acids Res. 2013;41:D48–D55. doi: 10.1093/nar/gks1236. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Robles J.A., Qureshi S.E., Stephen S.J., Wilson S.R., Burden C.J., Taylor J.M. Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing. BMC Genom. 2012;13:484. doi: 10.1186/1471-2164-13-484. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Anders S., Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106. doi: 10.1186/gb-2010-11-10-r106. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Hardcastle T.J., Kelly K.A. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010;11:422. doi: 10.1186/1471-2105-11-422. [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Law C.W., Chen Y., Shi W., Smyth G.K. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15:R29. doi: 10.1186/gb-2014-15-2-r29. [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Robinson M.D., McCarthy D.J., Smyth G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140. doi: 10.1093/bioinformatics/btp616. [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Tarazona S., Garcia-Alcalde F., Dopazo J., Ferrer A., Conesa A. Differential expression in RNA-seq: a matter of depth. Genome Res. 2011;21:2213–2223. doi: 10.1101/gr.124321.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Anders S., Reyes A., Huber W. Detecting differential usage of exons from RNA-seq data. Genome Res. 2012;22:2008–2017. doi: 10.1101/gr.133744.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Gonzalez-Porta M., Frankish A., Rung J., Harrow J., Brazma A. Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biol. 2013;14:R70. doi: 10.1186/gb-2013-14-7-r70. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

SUPPLEMENTARY DATA

supp_43_4_e25__index.html^{(993B, html)}

supp_gku1273_nar-02511-met-n-2014-File009.pdf^{(7.5MB, pdf)}

[B1] 1.Wang Z., Gerstein M., Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 2009;10:57–63. doi: 10.1038/nrg2484. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B2] 2.Dillies M.A., Rau A., Aubert J., Hennequet-Antier C., Jeanmougin M., Servant N., Keime C., Marot G., Castel D., Estelle J., et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform. 2013;14:671–683. doi: 10.1093/bib/bbs046. [DOI] [PubMed] [Google Scholar]

[B3] 3.Oshlack A., Robinson M.D., Young M.D. From RNA-seq reads to differential expression results. Genome Biol. 2010;11:220. doi: 10.1186/gb-2010-11-12-220. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B4] 4.Rapaport F., Khanin R., Liang Y., Pirun M., Krek A., Zumbo P., Mason C.E., Socci N.D., Betel D. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 2013;14:R95. doi: 10.1186/gb-2013-14-9-r95. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B5] 5.Soneson C., Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics. 2013;14:91. doi: 10.1186/1471-2105-14-91. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B6] 6.Genovese C.R., Roeder K., Wasserman L. False discovery control with p-value weighting. Biometrika. 2006;93:509–524. [Google Scholar]

[B7] 7.Simes R.J. An improved Bonferroni procedure for multiple tests of significance. Biometrika. 1986;73:751–754. [Google Scholar]

[B8] 8.Fisher R.A. Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd; 1932. [Google Scholar]

[B9] 9.Whitlock M.C. Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach. J. Evol. Biol. 2005;18:1368–1373. doi: 10.1111/j.1420-9101.2005.00917.x. [DOI] [PubMed] [Google Scholar]

[B10] 10.Powers D.M.W. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation. J. Mach. Learn. Technol. 2011;2:37–63. [Google Scholar]

[B11] 11.Lin C., Garruss A.S., Luo Z., Guo F., Shilatifard A. The RNA Pol II elongation factor Ell3 marks enhancers in ES cells and primes future gene activation. Cell. 2013;152:144–156. doi: 10.1016/j.cell.2012.12.015. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B12] 12.Mokry M., Hatzis P., Schuijers J., Lansu N., Ruzius F.P., Clevers H., Cuppen E. Integrated genome-wide analysis of transcription factor occupancy, RNA polymerase II binding and steady-state RNA levels identify differentially regulated functional gene classes. Nucleic Acids Res. 2012;40:148–158. doi: 10.1093/nar/gkr720. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B13] 13.Risso D., Schwartz K., Sherlock G., Dudoit S. GC-content normalization for RNA-Seq data. BMC Bioinformatics. 2011;12:480. doi: 10.1186/1471-2105-12-480. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B14] 14.Frazee A.C., Langmead B., Leek J.T. ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics. 2011;12:449. doi: 10.1186/1471-2105-12-449. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B15] 15.Montgomery S.B., Sammeth M., Gutierrez-Arcelus M., Lach R.P., Ingle C., Nisbett J., Guigo R., Dermitzakis E.T. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010;464:773–777. doi: 10.1038/nature08903. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B16] 16.Pickrell J.K., Marioni J.C., Pai A.A., Degner J.F., Engelhardt B.E., Nkadori E., Veyrieras J.B., Stephens M., Gilad Y., Pritchard J.K. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010;464:768–772. doi: 10.1038/nature08872. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B17] 17.Brawand D., Soumillon M., Necsulea A., Julien P., Csardi G., Harrigan P., Weier M., Liechti A., Aximu-Petri A., Kircher M., et al. The evolution of gene expression levels in mammalian organs. Nature. 2011;478:343–348. doi: 10.1038/nature10532. [DOI] [PubMed] [Google Scholar]

[B18] 18.Bottomly D., Walter N.A., Hunter J.E., Darakjian P., Kawane S., Buck K.J., Searles R.P., Mooney M., McWeeney S.K., Hitzemann R. Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays. PLoS ONE. 2011;6:e17820. doi: 10.1371/journal.pone.0017820. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B19] 19.Graveley B.R., Brooks A.N., Carlson J.W., Duff M.O., Landolin J.M., Yang L., Artieri C.G., van Baren M.J., Boley N., Booth B.W., et al. The developmental transcriptome of Drosophila melanogaster. Nature. 2011;471:473–479. doi: 10.1038/nature09715. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B20] 20.Di Y., Schafer D.W., Cumbie J.S., Chang J.H. The NBP negative binomial model for assessing differential gene expression from RNA-Seq. Stat. Appl. Genet. Mol. Biol. 2011;10:1–28. [Google Scholar]

[B21] 21.Shi L., Campbell G., Jones W.D., Campagne F., Wen Z., Walker S.J., Su Z., Chu T.M., Goodsaid F.M., Pusztai L., et al. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat. Biotechnol. 2010;28:827–838. doi: 10.1038/nbt.1665. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B22] 22.Flicek P., Ahmed I., Amode M.R., Barrell D., Beal K., Brent S., Carvalho-Silva D., Clapham P., Coates G., Fairley S., et al. Ensembl 2013. Nucleic Acids Res. 2013;41:D48–D55. doi: 10.1093/nar/gks1236. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B23] 23.Robles J.A., Qureshi S.E., Stephen S.J., Wilson S.R., Burden C.J., Taylor J.M. Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing. BMC Genom. 2012;13:484. doi: 10.1186/1471-2164-13-484. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B24] 24.Anders S., Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106. doi: 10.1186/gb-2010-11-10-r106. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B25] 25.Hardcastle T.J., Kelly K.A. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010;11:422. doi: 10.1186/1471-2105-11-422. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B26] 26.Law C.W., Chen Y., Shi W., Smyth G.K. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15:R29. doi: 10.1186/gb-2014-15-2-r29. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B27] 27.Robinson M.D., McCarthy D.J., Smyth G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140. doi: 10.1093/bioinformatics/btp616. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B28] 28.Tarazona S., Garcia-Alcalde F., Dopazo J., Ferrer A., Conesa A. Differential expression in RNA-seq: a matter of depth. Genome Res. 2011;21:2213–2223. doi: 10.1101/gr.124321.111. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B29] 29.Anders S., Reyes A., Huber W. Detecting differential usage of exons from RNA-seq data. Genome Res. 2012;22:2008–2017. doi: 10.1101/gr.133744.111. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B30] 30.Gonzalez-Porta M., Frankish A., Rung J., Harrow J., Brazma A. Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biol. 2013;14:R70. doi: 10.1186/gb-2013-14-7-r70. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Systematic integration of RNA-Seq statistical algorithms for accurate detection of differential gene expression patterns

Panagiotis Moulos

Pantelis Hatzis

Abstract

INTRODUCTION

MATERIALS AND METHODS

P-value combination

Simes method

Fisher's method

Whitlock's method

Maximum P-value

Minimum P-value

PANDORA P-value

Precision-sensitivity tradeoff metrics

Performance evaluation tools

Simulated data

Real RNA-Seq data

RNA-Seq and PolII occupancy data

RESULTS

PANDORA's approach and performance evaluation criteria

PANDORA performs similar to the best algorithms in detecting differential expression

Figure 1.

ROC analysis and assessment of FDRs place PANDORA among the top performing algorithms

Figure 2.

Figure 3.

PANDORA exhibits the best tradeoff between precision and sensitivity

Figure 4.

Robust performance of PANDORA on real RNA-Seq data

Figure 5.

Table 1. False discovery rates (Benjamini–Hochberg adjusted P-values) from three ‘same versus same’ comparisons (EDASeq normalization).

Optimal performance of PANDORA in correlating RNA-Seq with PolII occupancy

Figure 6.

DISCUSSION

AVAILABILITY

SUPPLEMENTARY DATA

Acknowledgments

Footnotes

FUNDING

REFERENCES

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases