grabseqs: simple downloading of reads and metadata from multiple next-generation sequencing data repositories

Louis J Taylor; Arwa Abbas; Frederic D Bushman

doi:10.1093/bioinformatics/btaa167

Bioinformatics. 2020 Mar 10;36(11):3607–3609. doi: 10.1093/bioinformatics/btaa167

Editor: Jonathan Wren

PMCID: PMC7267817 PMID: 32154830

Abstract

Summary

High-throughput sequencing is a powerful technique for addressing biological questions. Grabseqs streamlines access to publicly available metagenomic data by providing a single, easy-to-use interface to download data and metadata from multiple repositories, including the Sequence Read Archive, the Metagenomics Rapid Annotation through Subsystems Technology server and iMicrobe. Users can download data and metadata in a standardized format from any number of samples or projects from a given repository with a single grabseqs command.

Availability and implementation

Grabseqs is an open-source tool implemented in Python and released under the MIT license. The source code is freely available at https://github.com/louiejtaylor/grabseqs, and the package is distributed through the Python Package Index and the Anaconda Cloud repository.

Contact

bushman@pennmedicine.upenn.edu

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Next-generation sequencing (NGS) is a powerful technique used to analyze nucleic acid sequences. Applications of NGS include single organism genome sequencing and assembly, metagenomic sequencing to profile complex microbial communities, RNA sequencing to characterize gene expression and single-cell sequencing technologies which enable analysis of genomic and transcriptomic variation in single cells. The development of new analysis tools for working with large datasets as well as decreasing costs has driven the adoption of NGS in diverse contexts, including surveillance for emerging zoonotic pathogens (Rabaa et al., 2015), profiling the genomic and epigenomic landscape (Corces et al., 2017), clinical diagnoses of infectious disease (Naccache et al., 2014) and tracking the safety and efficacy of gene therapies in humans (Fraietta et al., 2018).

In addition to their usefulness as a primary data source, NGS data are valuable for secondary analysis. Many studies have used publicly available sequence data to discover novel, highly divergent microorganisms (Abbas et al., 2019; Dutilh et al., 2014; Pasolli et al., 2019). Another study predicted cell function through massive re-use of single-cell RNAseq data (Kiselev et al., 2018). Public availability of metagenomic data also facilitates reproducibility: new techniques can be benchmarked against existing, published data. This is exemplified in a recent study uncovering bacterial sequence contamination in virus-enriched metagenomes (Zolfo et al., 2019).

Multiple repositories enable public access to NGS data. The National Center for Biotechnology Information (NCBI) hosts the Sequence Read Archive (SRA), a repository containing multiple classes of NGS data (Leinonen et al., 2011). Sequence data archived in the European Nucleotide Archive are also available through the SRA (Cochrane et al., 2016). Other repositories include the Metagenomics Rapid Annotation through Subsystems Technology (MG-RAST) server (Glass et al., 2011) and iMicrobe (Hurwitz, 2014). While these web-based tools facilitate data discovery by simplifying user interaction with study metadata, each repository is currently accessible only through its own interface. To access data, repository-specific command-line tools such as sra-tools or pysradb must be used (Choudhary, 2019; Leinonen et al., 2011), or data must be downloaded from an FTP site or through a web platform such as Galaxy (Afgan et al., 2018). Some software, including the bowtie2 aligner (Langmead et al., 2012) and the Sunbeam metagenomics pipeline (Clarke et al., 2019), supports direct analysis of data from certain repositories. However, no tool yet exists to easily access both raw data and metadata from multiple sources.

Here, we introduce grabseqs, a single command-line utility to facilitate downloading NGS data and descriptive metadata from SRA, MG-RAST and iMicrobe. Data and metadata downloaded by grabseqs are in the same format regardless of repository of origin, streamlining downstream analysis and processing. Our goal in developing grabseqs was to simplify access to sequencing data and promote reproducibility. A comparison between grabseqs and other tools for downloading NGS reads from major repositories is shown in Table 1.

Table 1.

Feature comparison between grabseqs and other tools for accessing NGS data

		Galaxy	grabseqs	MG-RAST API	pysradb	sra-tools
Data format	SRA	FQ.GZ	FQ.GZ		FQ.GZ^a	FQ.GZ^a
	MG-RAST		FQ.GZ	Varies^b
	iMicrobe		FQ.GZ
Metadata format	SRA		CSV		TSV
	MG-RAST		CSV	JSON
	iMicrobe		CSV
Features	Interface	GUI (web)/API	CLI	GUI (web)/API	CLI	CLI
	Bulk downloads	+	+		+
	Retry failed downloads	+	+
	Text-based database searching				+

Note: + signifies feature availability. FQ.GZ, gzipped FASTQ file; CSV/TSV, comma-/tab-separated values; JSON, JavaScript Object Notation; CLI, command-line interface; GUI, graphical user interface; API, application programming interface.

^a Data are optionally available in SRA format.

^b Data available from the MG-RAST API may be in different formats including FASTA and FASTQ.

2 Implementation and usage

Grabseqs is implemented in Python. The package is divided into four modules: one per repository for downloading reads, and one module (utils) implementing functions used across multiple repositories. A diagram of the grabseqs architecture is shown in Supplementary File S1. We also provide a repository module template and a step-by-step example of its use (Supplementary File S2) to facilitate adding support for additional repositories. Grabseqs updates are released on an as-needed basis. Version numbers for grabseqs follow semantic versioning practices (https://semver.org/spec/v2.0.0.html). Grabseqs is tested through continuous integration with CircleCI (https://circleci.com/).
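The one-module-per-repository layout can be illustrated with a small dispatch sketch. This is not the grabseqs source: the program name `repo-fetch`, the handler functions and the `HANDLERS` table are hypothetical, and the handlers only echo their input instead of downloading anything. The sketch shows only how a subcommand can select a repository-specific handler, and how supporting a further repository could amount to registering one more handler.

```python
import argparse

# Hypothetical per-repository handlers. In a real package each would live in
# its own module; here they only echo their input.
def handle_sra(accessions):
    return [("sra", acc) for acc in accessions]

def handle_mgrast(accessions):
    return [("mgrast", acc) for acc in accessions]

# Registering a new repository means adding one entry to this table.
HANDLERS = {"sra": handle_sra, "mgrast": handle_mgrast}

def build_parser(handlers):
    """Create one subcommand per registered repository handler."""
    parser = argparse.ArgumentParser(prog="repo-fetch")  # hypothetical name
    subparsers = parser.add_subparsers(dest="repo", required=True)
    for name, handler in handlers.items():
        sub = subparsers.add_parser(name)
        sub.add_argument("accessions", nargs="+")
        sub.set_defaults(handler=handler)
    return parser

def main(argv):
    """Parse argv and dispatch to the handler chosen by the subcommand."""
    args = build_parser(HANDLERS).parse_args(argv)
    return args.handler(args.accessions)
```

Calling `main(["sra", "ACC1", "ACC2"])` routes both placeholder accessions to the SRA handler, while shared helpers (the role the text assigns to the utils module) would sit outside the handlers.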

Grabseqs can be installed through Conda (www.anaconda.org) or the Python Package Index (PyPI) (https://pypi.org/) and is compatible with most modern macOS and Linux distributions. Specifically, grabseqs has been tested on macOS 10.14 and the following Linux operating systems: CentOS 6, 7 and 8; Debian 9 and 10; Ubuntu 16.04, 18.04 and 19.10; Red Hat Enterprise 6, 7 and 8; and SUSE Enterprise 12 and 15. Installation through Conda is recommended, as all software dependencies are automatically installed together with grabseqs. Installation through PyPI automatically installs only the required Python packages. Accessing data from any repository requires Python 3 and file compression software (either pigz or gzip). Some dependencies are repository-specific: downloading from NCBI SRA requires sra-tools (Leinonen et al., 2011); accessing MG-RAST requires wget (https://www.gnu.org/software/wget/) and the Python requests package (https://requests.readthedocs.io/en/master/); iMicrobe downloads use wget and requests-html (https://requests-html.kennethreitz.org/). Repository-specific dependencies are only required when accessing that repository; for example, grabseqs can download from MG-RAST without sra-tools being installed. At runtime, grabseqs checks whether software dependencies are installed and alerts the user to any that are missing.
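A runtime dependency check of the kind described above could look roughly like the following. This is a minimal sketch, not the grabseqs implementation: the `REQUIREMENTS` table and the function name are hypothetical, the use of `fasterq-dump` as the executable representing sra-tools is an assumption, and `requests_html` is assumed to be the import name of requests-html.

```python
import shutil
from importlib.util import find_spec

# Illustrative table built from the dependencies listed in the text.
# Assumption: sra-tools is detected through its fasterq-dump executable.
REQUIREMENTS = {
    "sra": {"executables": ["fasterq-dump"], "modules": []},
    "mgrast": {"executables": ["wget"], "modules": ["requests"]},
    "imicrobe": {"executables": ["wget"], "modules": ["requests_html"]},
}

def missing_dependencies(repo, which=shutil.which, has_module=None):
    """Return the names of dependencies for `repo` that cannot be found.

    `which` and `has_module` can be replaced for testing.
    """
    if has_module is None:
        has_module = lambda name: find_spec(name) is not None
    req = REQUIREMENTS[repo]
    missing = [exe for exe in req["executables"] if which(exe) is None]
    missing += [mod for mod in req["modules"] if not has_module(mod)]
    # Every repository needs a compressor: either pigz or gzip is enough.
    if which("pigz") is None and which("gzip") is None:
        missing.append("pigz or gzip")
    return missing
```

Because only the selected repository's entry is consulted, a missing sra-tools installation would not be reported when checking `"mgrast"`, which mirrors the behaviour the text describes.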

Data from each repository are accessible through the ‘grabseqs’ command-line program. The target repository is specified through subcommands, e.g. ‘grabseqs sra’ for SRA data or ‘grabseqs mgrast’ for MG-RAST data. The user can specify one or more project or sample accession numbers, and grabseqs will download reads for all samples associated with a given project accession number, and/or reads for all provided sample accession numbers. Reads are stored as gzipped FASTQ files; metadata are formatted as comma-separated values. Paired reads are saved as separate files, with the suffixes ‘_1’ and ‘_2’ added after the sample accession number and before the .fastq.gz extension.
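The naming convention for paired reads can be made concrete with a short helper. The function below is a hypothetical illustration rather than part of grabseqs, and the file name used for unpaired reads (accession followed directly by .fastq.gz) is an assumption, since the text only specifies the paired case.

```python
def expected_read_files(accession, paired):
    """Return the FASTQ file names expected for one sample accession."""
    if paired:
        # Suffixes go after the accession and before the extension.
        return [f"{accession}_1.fastq.gz", f"{accession}_2.fastq.gz"]
    # Assumption: unpaired reads carry no suffix.
    return [f"{accession}.fastq.gz"]
```

A downstream pipeline could use such a helper to check that both mates of every paired sample are present before starting analysis.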

A number of command-line options facilitate data processing and downstream analysis. For all repositories, the ‘-l’ (list) option outputs a list of the files that will be created for the supplied accession number(s), without actually downloading the reads. The ‘-m’ flag saves sample metadata in a comma-separated format. The ‘-t’ option enables multithreaded downloading and compression. We also include multiple repository-specific options, such as using fastq-dump (instead of fasterq-dump) for SRA downloads and passing additional arguments to sra-tools to customize the format of downloaded reads. The grabseqs GitHub repository (https://github.com/louiejtaylor/grabseqs) includes documentation of all command-line options, usage examples and tips for downloading data from each repository.
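For scripted use, the options above can be assembled into a command programmatically. The helper below is a hypothetical sketch: it builds an argument list but does not run it, it covers only the two subcommands named in the text, and it assumes that ‘-m’ is followed by an output file name and ‘-t’ by a thread count, which should be checked against the grabseqs documentation. The accessions in any call are placeholders.

```python
import shlex

# Subcommands named in the text; others are left out of this sketch.
KNOWN_SUBCOMMANDS = {"sra", "mgrast"}

def build_grabseqs_command(repo, accessions, list_only=False,
                           metadata_file=None, threads=None):
    """Assemble a grabseqs argument list without executing it."""
    if repo not in KNOWN_SUBCOMMANDS:
        raise ValueError(f"unsupported repository in this sketch: {repo}")
    if not accessions:
        raise ValueError("at least one accession is required")
    cmd = ["grabseqs", repo]
    if list_only:
        cmd.append("-l")                 # list files, do not download
    if metadata_file is not None:
        cmd += ["-m", metadata_file]     # assumption: -m takes a file name
    if threads is not None:
        cmd += ["-t", str(threads)]      # assumption: -t takes a thread count
    cmd += list(accessions)
    return cmd

def as_shell_string(cmd):
    """Render an argument list as a safely quoted shell command line."""
    return " ".join(shlex.quote(part) for part in cmd)
```

The resulting list can be passed to `subprocess.run` once grabseqs and its dependencies are installed; running with `list_only=True` first is a cheap way to preview which files a project accession would produce.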

3 Conclusions

Grabseqs is a lightweight tool for downloading reads from multiple NGS data repositories. Grabseqs provides a single interface for accessing data and repository-associated metadata in a standardized format from NCBI SRA (Leinonen et al., 2011), iMicrobe (Hurwitz, 2014) and MG-RAST (Glass et al., 2011). Grabseqs is available through Anaconda Cloud, the Python Package Index and on GitHub as an open-source package under the MIT license.

Supplementary Material

btaa167_Supplementary_Data (zip, 148.7 KB)

Acknowledgements

We thank Andrew Connell for his advice regarding Python package design. We thank members of the Bushman Laboratory and the PennCHOP Microbiome Program for discussions, suggestions and testing. We appreciate the anonymous reviewers’ suggestions, which improved both the software and the manuscript.

Author contributions

L.J.T. conceived and designed the software. L.J.T. and A.A. implemented the software and wrote the initial manuscript draft. All authors edited and approved the final version of the manuscript.

Funding

This work was supported by the National Institutes of Health [T32AI007324 to L.J.T.; R01-AI082020, R61-HL137063 and R01-HL113252 to F.D.B.], the Penn Center for AIDS Research [P30-AI045008] and the PennCHOP Microbiome Program.

Conflict of Interest: none declared.

References

Abbas A.A. et al. (2019) Redondoviridae, a family of small, circular DNA viruses of the human oro-respiratory tract associated with periodontitis and critical illness. Cell Host Microbe, 25, 719–729.
Afgan E. et al. (2018) The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res., 46, W537–W544.
Choudhary S. (2019) pysradb: a Python package to query next-generation sequencing metadata and data from NCBI Sequence Read Archive. F1000Research, 532, 1–13.
Clarke E.L. et al. (2019) Sunbeam: an extensible pipeline for analyzing metagenomic sequencing experiments. Microbiome, 7, 2–13.
Cochrane G. et al. (2016) The international nucleotide sequence database collaboration. Nucleic Acids Res., 44, D48–D50.
Corces M.R. et al. (2017) An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues. Nat. Methods, 14, 959–962.
Dutilh B.E. et al. (2014) A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes. Nat. Commun., 5, 1–11.
Fraietta J.A. et al. (2018) Disruption of TET2 promotes the therapeutic efficacy of CD19-targeted T cells. Nature, 558, 307–312.
Glass E.M. et al. (2011) The metagenomics RAST server: a public resource for the automatic phylogenetic and functional analysis of metagenomes. In: de Bruijn,F.J. (ed.) Handbook of Molecular Microbial Ecology I: Metagenomics and Complementary Approaches, Vol. 8. John Wiley & Sons, Inc., Hoboken, New Jersey, pp. 325–331.
Hurwitz B.L. (2014) iMicrobe: advancing clinical and environmental microbial research using the iPlant cyberinfrastructure. In: Proceedings of the International Plant and Animal Genome XXII Conference, San Diego, CA. Scherago International, Livingston, NJ.
Kiselev V.Y. et al. (2018) Scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods, 15, 359–362.
Langmead B. et al. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.
Leinonen R. et al. (2011) The sequence read archive. Nucleic Acids Res., 39, 2010–2012.
Naccache S.N. et al. (2014) A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples. Genome Res., 24, 1180–1192.
Pasolli E. et al. (2019) Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell, 176, 649–662.e20.
Rabaa M.A. et al. (2015) The Vietnam Initiative on Zoonotic Infections (VIZIONS): a strategic approach to studying emerging zoonotic infectious diseases. Ecohealth, 12, 726–735.
Zolfo M. et al. (2019) Detecting contamination in viromes using ViromeQC. Nat. Biotechnol., 37, 1408–1412.
