fqfa: A pure Python package for genomic sequence files

Alan F Rubin

doi:10.21105/joss.02076

. Author manuscript; available in PMC: 2021 Apr 13.

Published in final edited form as: J Open Source Softw. 2020 Mar 1;5(47):2076. doi: 10.21105/joss.02076

fqfa: A pure Python package for genomic sequence files

Alan F Rubin ^1,²

PMCID: PMC8043664 NIHMSID: NIHMS1639132 PMID: 33855258

Summary

Modern bioinformatics requires the use of many field-specific file formats. Two of the most prevalent formats for representing biological sequences are FASTA (Pearson & Lipman, 1988) and FASTQ (Cock, Fields, Goto, Heuer, & Rice, 2010). While multiple feature-rich Python bioinformatics libraries exist that can process biological sequence files (Cock et al., 2009; scikit-bio Development Team, 2013), they require complex compiled dependencies that may limit their use in non-Unix environments. Other FASTA or FASTQ specific Python libraries (Du, 2019; Hunt, 2013; Pedersen, 2010; Shirley, Ma, Pedersen, & Wheelan, 2015) are outdated, require runtime dependencies, or make heavy use of C extensions that prioritize speed over readability and portability.

fqfa is a pure Python package that aims to fill the needs of bioinformatics and computational biology researchers who want a simple and efficient solution for working with files in FASTA and FASTQ formats. It has no dependencies outside of the Python standard library (with the exception of backported dataclasses (Smith, 2017) for Python 3.6 users) and makes use of newer language features such as type hinting and f-strings to improve readability. These implementation details make fqfa highly suitable for use in notebooks and projects that have simple requirements, with underlying code that is easy for novice bioinformaticians and students to understand and explore.

Although fqfa is written in pure Python, its performance is comparable to modules using C extensions like pyfastx (Du, 2019) for tasks such as processing a FASTQ file sequentially and collecting or filtering on quality statistics from the high-throughput sequencing reads. Detailed benchmarking results and usage examples comparing fqfa and pyfastx (Du, 2019) are available as part of the fqfa documentation in static format as well as in Jupyter notebooks (Kluyver et al., 2016).

fqfa is released under the BSD 3-Clause License and is available from GitHub and PyPI.

Acknowledgements

Thank you to Matthew Wakefield for helpful discussion and code review. The research bene-fited by support from the Victorian State Government Operational Infrastructure Support and Australian Government NHMRC Independent Research Institute Infrastructure Support. AFR was supported by the National Human Genome Research Institute of the NIH under award number RM1HG010461.

References

Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, et al. (2009). Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422–1423. doi: 10.1093/bioinformatics/btp163 [DOI] [PMC free article] [PubMed] [Google Scholar]
Cock PJA, Fields CJ, Goto N, Heuer ML, & Rice PM (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research, 38(6), 1767–1771. doi: 10.1093/nar/gkp1137 [DOI] [PMC free article] [PubMed] [Google Scholar]
Du L (2019, March). Lmdu/pyfastx. Retrieved from https://github.com/lmdu/pyfastx
Hunt M (2013, September). Sanger-pathogens/Fastaq. Pathogen Informatics, Wellcome Sanger Institute. Retrieved from https://github.com/sanger-pathogens/Fastaq [Google Scholar]
Kluyver T, Ragan-Kelley B, Pérez F, Granger B, Bussonnier M, Frederic J, Kelley K, et al. (2016). Jupyter notebooks – a publishing format for reproducible computational workflows. (Loizides F & Schmidt B, Eds.). IOS Press. [Google Scholar]
Pearson WR, & Lipman DJ (1988). Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85(8), 2444–2448. doi: 10.1073/pnas.85.8.2444 [DOI] [PMC free article] [PubMed] [Google Scholar]
Pedersen B (2010, July). Brentp/pyfasta. Retrieved from https://github.com/brentp/pyfasta
scikit-bio Development Team. (2013, December). Biocore/scikit-bio. biocore. Retrieved from https://github.com/biocore/scikit-bio
Shirley MD, Ma Z, Pedersen BS, & Wheelan SJ (2015). Efficient ”pythonic” access to FASTA files using pyfaidx (No. e1196). PeerJ Inc. doi: 10.7287/peerj.preprints.970v1 [DOI] [Google Scholar]
Smith EV (2017, May). Ericvsmith/dataclasses. Retrieved from https://github.com/ericvsmith/dataclasses

[R1] Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, et al. (2009). Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422–1423. doi: 10.1093/bioinformatics/btp163 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] Cock PJA, Fields CJ, Goto N, Heuer ML, & Rice PM (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research, 38(6), 1767–1771. doi: 10.1093/nar/gkp1137 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] Du L (2019, March). Lmdu/pyfastx. Retrieved from https://github.com/lmdu/pyfastx

[R4] Hunt M (2013, September). Sanger-pathogens/Fastaq. Pathogen Informatics, Wellcome Sanger Institute. Retrieved from https://github.com/sanger-pathogens/Fastaq [Google Scholar]

[R5] Kluyver T, Ragan-Kelley B, Pérez F, Granger B, Bussonnier M, Frederic J, Kelley K, et al. (2016). Jupyter notebooks – a publishing format for reproducible computational workflows. (Loizides F & Schmidt B, Eds.). IOS Press. [Google Scholar]

[R6] Pearson WR, & Lipman DJ (1988). Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85(8), 2444–2448. doi: 10.1073/pnas.85.8.2444 [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] Pedersen B (2010, July). Brentp/pyfasta. Retrieved from https://github.com/brentp/pyfasta

[R8] scikit-bio Development Team. (2013, December). Biocore/scikit-bio. biocore. Retrieved from https://github.com/biocore/scikit-bio

[R9] Shirley MD, Ma Z, Pedersen BS, & Wheelan SJ (2015). Efficient ”pythonic” access to FASTA files using pyfaidx (No. e1196). PeerJ Inc. doi: 10.7287/peerj.preprints.970v1 [DOI] [Google Scholar]

[R10] Smith EV (2017, May). Ericvsmith/dataclasses. Retrieved from https://github.com/ericvsmith/dataclasses

PERMALINK

fqfa: A pure Python package for genomic sequence files

Alan F Rubin

Summary

Acknowledgements

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

fqfa: A pure Python package for genomic sequence files

Alan F Rubin

Summary

Acknowledgements

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases