aRNApipe: a balanced, efficient and distributed pipeline for processing RNA-seq data in high-performance computing environments

Arnald Alonso; Brittany N Lasseigne; Kelly Williams; Josh Nielsen; Ryne C Ramaker; Andrew A Hardigan; Bobbi Johnston; Brian S Roberts; Sara J Cooper; Sara Marsal; Richard M Myers

doi:10.1093/bioinformatics/btx023

. 2017 Jan 20;33(11):1727–1729. doi: 10.1093/bioinformatics/btx023

aRNApipe: a balanced, efficient and distributed pipeline for processing RNA-seq data in high-performance computing environments

Arnald Alonso ^1,², Brittany N Lasseigne ¹, Kelly Williams ¹, Josh Nielsen ¹, Ryne C Ramaker ^1,³, Andrew A Hardigan ^1,³, Bobbi Johnston ¹, Brian S Roberts ¹, Sara J Cooper ¹, Sara Marsal ², Richard M Myers ^1,^✉

PMCID: PMC5447234 PMID: 28108448

Abstract

Summary

The wide range of RNA-seq applications and their high-computational needs require the development of pipelines orchestrating the entire workflow and optimizing usage of available computational resources. We present aRNApipe, a project-oriented pipeline for processing of RNA-seq data in high-performance cluster environments. aRNApipe is highly modular and can be easily migrated to any high-performance computing (HPC) environment. The current applications included in aRNApipe combine the essential RNA-seq primary analyses, including quality control metrics, transcript alignment, count generation, transcript fusion identification, alternative splicing and sequence variant calling. aRNApipe is project-oriented and dynamic so users can easily update analyses to include or exclude samples or enable additional processing modules. Workflow parameters are easily set using a single configuration file that provides centralized tracking of all analytical processes. Finally, aRNApipe incorporates interactive web reports for sample tracking and a tool for managing the genome assemblies available to perform an analysis.

Availability and documentation

https://github.com/HudsonAlpha/aRNAPipe; DOI: 10.5281/zenodo.202950

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Quantification of RNA transcripts by next-generation sequencing technologies continues to increase in both throughput and capabilities as sequencing becomes more affordable and accessible (McGettigan, 2013). Unlike gene expression microarrays, RNA-seq not only quantifies gene expression levels but also measures alternative splicing, transcript fusions, and RNA sequence variants (Finotello and Di Camillo, 2015; Koboldt et al., 2012; Maher et al., 2009). This broad spectrum of applications has fostered development of a rich set of bioinformatics methods focused on each processing stage (Conesa et al., 2016). Current RNA-seq data primary analysis applications usually apply a single processing step, involve complex dependencies between processing stages, and depend on the sequencing protocol performed (see Supplementary Section S1). Consequently, there is an increasing need for tools orchestrating the analysis workflow to ensure repeatability of RNA-seq data processing. In addition to the need for data processing integration, the computational requirements of some RNA-seq analysis steps are a bottleneck (Scholz et al., 2012) and, the use of high-performance computing (HPC) clusters is unavoidable. Because HPC clusters are a valuable and often limited resource, tools integrating RNA-seq processing stages must be carefully designed and optimized. Considering these challenges, we developed a balanced, efficient and distributed pipeline for RNA-seq data analysis: aRNApipe (automated RNA-seq pipeline). This pipeline was optimized to efficiently exploit HPC clusters, to scale from tens to thousands of RNA-seq libraries, and includes modules yielding complete RNA-seq primary analysis.

2 Methods

aRNApipe is designed to overcome the challenges of integration, synchronization and reporting of RNA-seq data analysis by using a project-oriented and balanced design optimized for HPC clusters (Fig. 1).

Fig. 1. — aRNApipe workflow for primary analysis of RNA-seq data

The core application of aRNApipe (Supplementary Section S1) includes six operating modes: (i) executing a new analysis, (ii) updating a previous analysis to include new samples or enable new modules, (iii) showing analysis progress, (iv) building a project skeleton, (v) showing available genome builds, and (vi) stopping an ongoing analysis.

Input data: aRNApipe requires two input files: (i) analysis configuration and (ii) samples to include in the analysis. In the configuration file, the user can set the executing parameters, including enabling/designating arguments of processing modules, assigning computational resources to modules, and selecting the reference genome build.

Current applications: aRNApipe currently includes applications covering the main variations of RNA-seq data generation (Supplementary Section S2). Throughout the workflow, a main daemon process manages pipeline execution (i.e. inter-dependencies between applications) and monitors analysis of each sample at each stage (Fig. 1). First, low-quality reads/bases and adapter sequences can be filtered. Then, a second stack of applications is run in parallel, including assessment of raw data quality, transcript alignment and quantification and identification of gene fusions. The main process launches a third stack of analyses including quantification of genes and exons, conversion of SAM files to BAM sorted files, and alignment quality. Finally, a fourth stack of applications including variant calling and alternative splicing modules are run.

Report generation: The Spider is an aRNApipe add-on module that generates interactive web reports summarizing an aRNApipe analysis. These reports review each module, including sample quality control metrics (Supplementary Figs S1 and S2). The user can also observe the computational resources used by each module and access all logs generated during analysis (Supplementary Fig. S3). Additionally, the Spider generates matrix-like count data files of raw counts, RPKMs and corresponding annotation files (i.e. gene identifier and length). Supplementary Section S3 provides a list of generated outputs.

Reference builder: The programs used for RNA-seq data processing often use different formats and standards. To address this problem and to provide a centralized repository of available genome builds, aRNApipe includes a reference builder that generates all required files for a genome build based on initial files obtained from sources like NCBI and Ensembl repositories (Supplementary Section S4 and Fig. S4).

Implementation: aRNApipe has been developed using Python 2.7 (Supplementary Fig. S5). When running on an HPC cluster, aRNApipe relies on the workload management application to submit jobs for each processing stage, taking into account cross-stage dependencies and custom resource requirements for each stage. aRNApipe has been implemented with the workload management system IBM Platform LSF, but its design allows quick migration to any other workload manager by editing one Python library (Supplementary Section S5). Additionally, a single-machine version is also provided. A configuration library provides supply paths to all applications used by aRNApipe.

3 Results

We have extensively tested aRNApipe and used it to analyze hundreds of RNA-seq libraries with multiple configurations, including different species, different genome builds and different RNA-seq protocols. The reports generated for four example datasets can be accessed online (http://arnapipe.bitbucket.org): (i) Strand-specific paired-end RNA-seq data from nine human samples of different tissues (GSE69241), (ii) unstranded single-end data from two paired normal and colorectal tumor tissues (GSE29580), (iii) unstranded paired-end data from 11 melanoma samples (GSE20156) and 1 prostate cancer cell line (NCIH660), and (iv) in-house unstranded paired-end data from 20 zebrafish libraries.

4 Conclusions

aRNApipe provides an integrated and efficient workflow for analyzing single-end and stranded or unstranded paired-end RNA-seq data. Unlike previous pipelines, aRNApipe is focused on HPC environments and the independent designation of computational resources at each stage allows optimization of HPC resources (see Supplementary Section S1). This application is highly flexible because its project configuration and management options. The Spider provides functional reports for the user at all analytical stages and the reference builder is a valuable genome build manager. Implementation of this pipeline allows users to quickly and efficiently complete primary RNA-seq analysis.

Supplementary Material

Supplementary Data

Click here for additional data file.^{(4.5MB, docx)}

Acknowledgements

We thank the Myers Lab at the HudsonAlpha Institute of Biotechnology for their constructive suggestions, feedback and testing during implementation of this pipeline. We also thank the HudsonAlpha IT team for their continuous support during the development of the pipeline and its optimization on the HPC cluster.

Funding

This work was supported by funds from The HudsonAlpha Institute for Biotechnology. A.H. and R.R. are funded by the NIH-National Institute of General Medical Sciences Medical Scientist Training Program (5T32GM008361-21). B.L. is funded by the William J. Maier III Fellowship in Cancer Prevention (Prevent Cancer Foundation).

Conflict of Interest: none declared.

References

Conesa A. et al. (2016) A survey of best practices for RNA-seq data analysis. Genome Biol., 17, 19.. [DOI] [PMC free article] [PubMed] [Google Scholar]
Finotello F., Di Camillo B. (2015) Measuring differential gene expression with RNA-seq: challenges and strategies for data analysis. Brief. Funct. Genom., 14, 130–142. [DOI] [PubMed] [Google Scholar]
Koboldt D.C. et al. (2012) VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res., 22, 568–576. [DOI] [PMC free article] [PubMed] [Google Scholar]
Maher C.A. et al. (2009) Transcriptome sequencing to detect gene fusions in cancer. Nature, 458, 97–101. [DOI] [PMC free article] [PubMed] [Google Scholar]
McGettigan P.A. (2013) Transcriptomics in the RNA-seq era. Curr. Opin. Chem. Biol., 17, 4–11. [DOI] [PubMed] [Google Scholar]
Scholz M.B. et al. (2012) Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Curr. Opin. Biotechnol., 23, 9–15. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

Click here for additional data file.^{(4.5MB, docx)}

[btx023-B1] Conesa A. et al. (2016) A survey of best practices for RNA-seq data analysis. Genome Biol., 17, 19.. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btx023-B2] Finotello F., Di Camillo B. (2015) Measuring differential gene expression with RNA-seq: challenges and strategies for data analysis. Brief. Funct. Genom., 14, 130–142. [DOI] [PubMed] [Google Scholar]

[btx023-B3] Koboldt D.C. et al. (2012) VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res., 22, 568–576. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btx023-B4] Maher C.A. et al. (2009) Transcriptome sequencing to detect gene fusions in cancer. Nature, 458, 97–101. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btx023-B5] McGettigan P.A. (2013) Transcriptomics in the RNA-seq era. Curr. Opin. Chem. Biol., 17, 4–11. [DOI] [PubMed] [Google Scholar]

[btx023-B7] Scholz M.B. et al. (2012) Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Curr. Opin. Biotechnol., 23, 9–15. [DOI] [PubMed] [Google Scholar]

PERMALINK

aRNApipe: a balanced, efficient and distributed pipeline for processing RNA-seq data in high-performance computing environments

Arnald Alonso

Brittany N Lasseigne

Kelly Williams

Josh Nielsen

Ryne C Ramaker

Andrew A Hardigan

Bobbi Johnston

Brian S Roberts

Sara J Cooper

Sara Marsal

Richard M Myers

Abstract

Summary

Availability and documentation

Supplementary information

1 Introduction

2 Methods

Fig. 1.

3 Results

4 Conclusions

Supplementary Material

Acknowledgements

Funding

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

aRNApipe: a balanced, efficient and distributed pipeline for processing RNA-seq data in high-performance computing environments

Arnald Alonso

Brittany N Lasseigne

Kelly Williams

Josh Nielsen

Ryne C Ramaker

Andrew A Hardigan

Bobbi Johnston

Brian S Roberts

Sara J Cooper

Sara Marsal

Richard M Myers

Abstract

Summary

Availability and documentation

Supplementary information

1 Introduction

2 Methods

Fig. 1.

3 Results

4 Conclusions

Supplementary Material

Acknowledgements

Funding

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases