Metagenomics workflow for hybrid assembly, differential coverage binning, metatranscriptomics and pathway analysis (MUFFIN)

Renaud Van Damme; Martin Hölzer; Adrian Viehweger; Bettina Müller; Erik Bongcam-Rudloff; Christian Brandt

doi:10.1371/journal.pcbi.1008716

PLoS Comput Biol. 2021 Feb 9;17(2):e1008716. doi: 10.1371/journal.pcbi.1008716

Editor: Mihaela Pertea

PMCID: PMC7899367 PMID: 33561126

Abstract

Metagenomics has redefined many areas of microbiology. However, metagenome-assembled genomes (MAGs) are often fragmented, particularly when sequencing was performed with short reads. Recent long-read sequencing technologies promise to improve genome reconstruction. However, the integration of two different sequencing modalities makes downstream analyses complex. We therefore developed MUFFIN, a complete metagenomic workflow that uses short and long reads to produce high-quality bins and their annotations. The workflow is written in Nextflow, a workflow orchestration software, to achieve high reproducibility and fast and straightforward use. The workflow also produces the taxonomic classification and KEGG pathways of the bins and, if RNA-Seq data are provided, can additionally be used for quantification and annotation of transcripts. We tested the workflow using twenty biogas reactor samples and assessed the capacity of MUFFIN to process and output the relevant files needed to analyze the microbial community and its function. MUFFIN produces functional pathway predictions and, if RNA-Seq data are provided, de novo metatranscript annotations across the metagenomic sample and for each bin. MUFFIN is available on GitHub under the GNU General Public License v3.0: https://github.com/RVanDamme/MUFFIN.

Author summary

Determining the entire DNA of environmental samples (sequencing) is a fundamental approach to gain deep insights into complex bacterial communities and their functions. However, this approach produces enormous amounts of data, which makes analysis time-intensive and complicated. We developed the software "MUFFIN," which untangles the complex sequencing data to reconstruct individual bacterial species and determine their functions. Our software performs multiple complicated steps automatically and in parallel, allowing anyone with only basic informatics skills to analyze complex microbial communities.

For this, we combine two sequencing technologies: "long-sequences" (nanopore, better reconstruction) and "short-sequences" (Illumina, higher accuracy). After the reconstruction, we group the fragments that belong together ("binning") via multiple approaches and refinement steps while also utilizing the information from other bacterial communities ("differential binning"). This process creates hundreds of "bins," each of which represents a different bacterial species with a unique function. We automatically determine their species, assess each genome’s completeness, and attribute their biological functions and activity ("transcriptomics and pathways"). Our software is freely available to everyone and runs on a good computer, a compute cluster, or in the cloud.

This is a PLOS Computational Biology Software paper.

Introduction

Metagenomics is widely used to analyze the composition, structure, and dynamics of microbial communities, as it provides deep insights into uncultivatable organisms and their relationship to each other [1–5]. In this context, whole metagenome sequencing is mainly performed using short-read sequencing technologies, predominantly provided by Illumina. Not surprisingly, the vast majority of tools and workflows for the analysis of metagenomic samples are designed around short reads. However, long-read sequencing technologies, as provided by PacBio or Oxford Nanopore Technologies (ONT), retrieve genomes from metagenomic datasets with higher completeness and less contamination [6]. The long-read information bridges gaps in a short-read-only assembly that often occur due to intra- and interspecies repeats [6]. Complete viral genomes can already be identified from environmental samples without any assembly step via nanopore-based sequencing [7]. Combined with a reduction in cost per gigabase [8] and an increase in data output, the technologies for sequencing long reads quickly became suitable for metagenomic analysis [9–12]. In particular, with the MinION, ONT offers a mobile and cost-effective sequencing device for long reads that paves the way for the real-time analysis of metagenomic samples. Currently, the combination of both worlds (long reads and high-precision short reads) allows the reconstruction of more complete and more accurate metagenome-assembled genomes (MAGs) [6].

One of the main challenges and bottlenecks of current metagenome sequencing studies is the orchestration of various computational tools into stable and reproducible workflows to analyze the data. A 2019 study of 24,490 bioinformatics software resources showed that 26% of these resources are not currently accessible online [13]. Among 99 randomly selected tools, 49% were deemed ‘difficult to install,’ and 28% ultimately failed the installation procedure. For a large-scale metagenomics study, various tools are needed to analyze the data comprehensively. Thus, already during the installation procedure, various issues arise related to missing system libraries, conflicting dependencies and environments, or operating system incompatibilities. To make matters more complicated, metagenomic workflows are compute-intensive and need to be compatible with high-performance computing clusters (HPCs) and thus with different workload managers such as SLURM or LSF. We combined the workflow manager Nextflow [14] with virtualization software (so-called ‘containers’) to generate reproducible results in various working environments and to allow a high degree of workload parallelization.

Several workflows for metagenomic analyses have been published, including MetaWRAP (v1.2.1) [15], Anvi’o [16], SAMSA2 [17], HUMAnN [18], MG-RAST [19], ATLAS [20], and Sunbeam [21]. Unlike those, MUFFIN allows for a hybrid metagenomic approach combining the strengths of short and long reads. It ensures reproducibility through the use of a workflow manager and reliance on either install recipes (Conda [22]) or containers (Docker [23], Singularity).

Design and implementation

MUFFIN integrates state-of-the-art bioinformatic tools via Conda recipes or Docker/Singularity containers for the processing of metagenomic sequences in a Nextflow workflow environment (Fig 1). MUFFIN executes its three steps consecutively, or separately if intermediate results such as MAGs are available, which makes workflow execution more flexible. The three steps represent common metagenomic analysis tasks and are summarized in Fig 1:

Fig 1 — All three steps (Assemble, Classify, Annotate) from top to bottom are shown. The RNA-Seq data for Step 3 (Annotate) is optional. Differential reads are other read data sets that are solely used for "differential coverage binning" to improve the overall binning performance.

Assemble: Hybrid assembly and binning
Classify: Bin quality control and taxonomic assessment
Annotate: Bin annotation and KEGG pathway summary

The workflow takes paired-end Illumina reads (short reads) and nanopore-based reads (long reads) as input for the assembly and binning and allows for additional user-provided read sets for differential coverage binning. Differential coverage binning yields genome bins with higher completeness than other currently used methods [24]. Step 2 is executed automatically after the assembly and binning procedure or can be executed independently by providing MUFFIN with a directory containing MAGs in FASTA format. In step 3, paired-end RNA-Seq data can optionally be supplied to improve the annotation of bins.

On completion, MUFFIN provides various outputs such as the MAGs, KEGG pathways, and bin quality/annotations. Additionally, all mandatory databases are automatically downloaded and stored in the working directory or can be alternatively provided via an input flag.

Step 1—Assemble: Hybrid assembly and binning

The first step (Assembly and binning) uses metagenomic nanopore-based long reads and Illumina paired-end short reads to obtain high-quality and highly complete bins. Short-read quality control is performed using fastp (v0.20.0) [25]. Optionally, Filtlong (v0.2.0) [26] can be used to discard long reads below a length of 1000 bp. The hybrid assembly can be performed according to two principles, which differ substantially in the read set they start from. The default approach starts from a short-read assembly where contigs are bridged via the long reads using metaSPAdes (v3.13.2) [27–29]. Alternatively, MUFFIN can be executed starting from a long-read-only assembly using metaFlye (v2.8) [30,31], followed by polishing the assembly with the long reads using Racon (v1.4.13) [32] and medaka (v1.0.3) [33] and finalizing the error correction by incorporating the short reads using multiple rounds of Pilon (v1.23) [34]. The approach should be chosen based on the amount of raw read data available to the user. For example, if more short-read data are available, metaSPAdes should be the choice (long reads are "supplemental"). If more long-read data are available, e.g., > 15 gigabases (corresponding to a full MinION or GridION flow cell) [35], metaFlye should be used as the assembly approach.
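
The rule of thumb for choosing the assembly approach can be written as a small decision function. This is a minimal sketch and not part of MUFFIN: the function name suggest_assembler, its arguments, and the returned labels are assumptions made for illustration, and the rule is one possible reading of the guideline; only the 15-gigabase figure comes from the text.

```python
# Guideline from the text: roughly one full MinION/GridION flow cell.
LONG_READ_THRESHOLD_GB = 15.0


def suggest_assembler(short_read_gb: float, long_read_gb: float) -> str:
    """Suggest an assembly approach from sequencing yield in gigabases.

    Hypothetical helper: returns "metaflye" when the long-read yield exceeds
    the 15 Gb guideline, otherwise "metaspades" (the default approach, in
    which long reads supplement a short-read assembly).
    """
    if short_read_gb < 0 or long_read_gb < 0:
        raise ValueError("read yields must be non-negative")
    if long_read_gb > LONG_READ_THRESHOLD_GB:
        return "metaflye"
    return "metaspades"


# Example yields (made up for illustration)
print(suggest_assembler(short_read_gb=30.0, long_read_gb=5.0))
print(suggest_assembler(short_read_gb=5.0, long_read_gb=20.0))
```

In practice, the choice may also depend on read quality and community complexity, which this sketch does not consider.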

Besides assembly, binning is one of the most crucial steps in metagenomic analysis. Therefore, MUFFIN combines three different binning tools, CONCOCT (v1.1.0) [36], MaxBin2 (v2.2.7) [37], and MetaBAT2 (v2.13) [38], and refines the obtained bins via MetaWRAP (v1.3) [15]. The user can provide additional read data sets (short or long reads) to automatically perform differential coverage binning, which improves the assignment of contigs to their bins.
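
The intuition behind differential coverage binning can be illustrated with a toy calculation: contigs that originate from the same genome tend to show read depths that rise and fall together across different read sets. The sketch below is not the algorithm used by CONCOCT, MaxBin2, or MetaBAT2, which additionally use sequence composition and more elaborate models; the contig names and depth values are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long coverage profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical mean read depth of three contigs in four read sets
coverage = {
    "contig_A": [10.0, 40.0, 5.0, 20.0],
    "contig_B": [11.0, 42.0, 6.0, 19.0],  # co-varies with contig_A
    "contig_C": [30.0, 8.0, 25.0, 9.0],   # different abundance pattern
}

r_ab = pearson(coverage["contig_A"], coverage["contig_B"])
r_ac = pearson(coverage["contig_A"], coverage["contig_C"])
print(f"A vs B: {r_ab:.2f}, A vs C: {r_ac:.2f}")
```

Here contig_A and contig_B correlate strongly and would be candidates for the same bin, whereas contig_C follows a different pattern. Each additional read set adds one more dimension to these profiles, which is why extra read sets help to separate genomes.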

Moreover, an additional reassembly of bins has shown the capacity to increase the completeness and N50 while decreasing the contamination of some bins [15]. Therefore, MUFFIN allows for an optional reassembly to improve the continuity of the MAGs further. This reassembly is performed by retrieving the reads belonging to one bin and doing an assembly with Unicycler (v0.4.7) [39]. As each reassembly might improve or worsen each bin, this process is optional and therefore deactivated by default. Individual manual curation is necessary by the user to compare each bin before and after reassembly, as described by Uritskiy et al. [15].

To support a transparent and reproducible metagenomics workflow, all reads that cannot be mapped back to the existing high-quality bins (after the refinement) are available as an output for further analysis. These "unused" reads could be further analyzed by other tools such as Kraken2 [40], Kaiju [41], or Centrifuge [42] for read classification, "What the Phage" [43] to search for phages, or mi-faser [44] for functional annotation of the reads, or they could even be used as new input for another MUFFIN run.

Step 2—Classify: Bin quality control and taxonomic assessment

In the second step (Bin quality control and taxonomic assessment), the quality of the bins is evaluated with CheckM (v1.1.3) [45], followed by assigning a taxonomic classification to the bins using sourmash (v2.0.1) [46] and the Genome Taxonomy Database (GTDB release r89) [47]. The GTDB was chosen because it contains many unculturable bacteria and archaea; this allows for monophyletic species assignments, which other databases do not assure [35,48]. Moreover, the coherent taxonomic classifications and more accurate taxonomic boundaries (e.g., for class, genus, etc.) proposed by GTDB substantially increase the general classification accuracy [48]. The user can also analyze other bin sets in this step regardless of their origin by providing a directory with multiple FASTA files (bins).

Step 3—Annotate: Bin annotation and KEGG pathway summary

The last step of MUFFIN (Bin annotation and output summary) comprises the annotation of the bins using eggNOG-mapper (v2.0.1) [49] and the eggNOG database (v5) [50]. If RNA-Seq data of the metagenome sample are provided (Illumina, paired-end), quality control using fastp (v0.20.0) [25] and a de novo metatranscript assembly using Trinity (v2.9.1) [51] are performed, followed by quantification of the metatranscripts by mapping the RNA-Seq reads using Salmon (v1.0) [52]. Lastly, the metatranscripts are annotated using eggNOG-mapper (v2.0.1) [49]. In both cases, the annotation by eggNOG-mapper provides a wide array of annotation information such as GO terms, NOG terms, BiGG reactions, CAZy, KEGG orthology, and pathways.

These gene annotations are parsed and visualized in KEGG pathways for each sample and bin. The expression of low and high abundant genes present in the bins is shown. If only bin sets are provided without any RNA-Seq data, the pathways of all the bins are created based on gene presence alone. The KEGG pathway results are summarized in detail as interactive HTML files (example snippet: Fig 2).

Like step 2, this step can be directly performed with a bin set created via another workflow.

Running MUFFIN and version control

MUFFIN (V1.0.3, doi:10.5281/zenodo.4296623) requires only two dependencies, which allows an easy and user-friendly workflow execution. One of them is the workflow management system Nextflow [14] (version 20.07+), and the other can be either Conda [22] as a package manager or Docker [23] / Singularity to use containerized tools. A detailed installation guide is available at https://github.com/RVanDamme/MUFFIN. Each MUFFIN release specifies the Nextflow version it was tested on, but any version of MUFFIN from V1.0.2 onward will work with Nextflow version 20.07+. A specific Nextflow version can always be downloaded directly as an executable file from https://github.com/nextflow-io/nextflow/releases, which can then be paired with a compatible MUFFIN version via the -r flag.
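
Pinning a MUFFIN release with the -r flag can be sketched as follows. The helper muffin_command is hypothetical and only assembles a command line; it relies on the general Nextflow options "run", "-r" (revision), and "-profile". The revision string and the profile names shown here are placeholders and assumptions, and MUFFIN's own input parameters are deliberately left out; the GitHub repository documents the exact tag names, profiles, and parameters.

```python
import shlex


def muffin_command(revision, profile, extra_args=()):
    """Build a `nextflow run` command line for a pinned MUFFIN release.

    `revision` is passed to Nextflow's -r option and `profile` to -profile.
    `extra_args` would hold workflow-specific parameters (not shown here).
    """
    cmd = ["nextflow", "run", "RVanDamme/MUFFIN", "-r", revision, "-profile", profile]
    cmd.extend(extra_args)
    return cmd


# Placeholder values: check the repository for the actual tag and profile names.
cmd = muffin_command(revision="V1.0.3", profile="local,docker")
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed to subprocess.run without shell quoting issues.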

Results

We chose Nextflow for the development of our metagenomic workflow because of its direct cloud computing support (Amazon AWS, Google Life Sciences, Kubernetes), various ready-to-use batch schedulers (SGE, SLURM, LSF), state-of-the-art container support (Docker, Singularity), and support for a widely used software package manager (Conda). Moreover, Nextflow [14] provides practical and straightforward intermediary file handling with process-specific work directories and the possibility to resume failed executions from the point where the work ceased. Additionally, the workflow code itself is separated from the ‘profile’ code (which contains Docker, Conda, or cluster-related code), which allows for a convenient and fast workflow adaptation to different computing clusters without touching or changing the actual workflow code.

The entire MUFFIN workflow was executed on 20 samples from the BioProject PRJEB34573 (available at ENA or NCBI) using the Cloud Life Sciences API (Google Cloud) with Docker containers. This metagenomic bioreactor study provides paired-end Illumina and nanopore-based data for each sample [35]. We used five different Illumina read sets of the same project for differential coverage binning, and the workflow runtime was less than two days for all samples. Across the 20 datasets, MUFFIN was able to retrieve 1122 MAGs with genome completeness of at least 70% and contamination of less than 10% (Fig 3). In total, MUFFIN retrieved 654 MAGs with genome completeness of over 90%, of which 456 have less than 2% contamination. For comparison, a recent study used 134 publicly available datasets from different biogas reactors and retrieved 1,635 metagenome-assembled genomes with genome completeness of over 50% [53].
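
The quality cut-offs quoted above can be applied to a table of per-bin completeness and contamination estimates as in the following sketch. The function summarize_bins, the dictionary keys, and the example values are assumptions for illustration and do not reflect the actual CheckM output format or the MAGs of this study.

```python
def summarize_bins(bins):
    """Count bins that meet the completeness/contamination cut-offs from the text."""
    reported = sum(
        1 for b in bins if b["completeness"] >= 70 and b["contamination"] < 10
    )
    over_90 = sum(1 for b in bins if b["completeness"] > 90)
    over_90_clean = sum(
        1 for b in bins if b["completeness"] > 90 and b["contamination"] < 2
    )
    return {
        "completeness>=70_contamination<10": reported,
        "completeness>90": over_90,
        "completeness>90_contamination<2": over_90_clean,
    }


# Hypothetical per-bin estimates in percent
example_bins = [
    {"name": "bin.1", "completeness": 98.2, "contamination": 0.8},
    {"name": "bin.2", "completeness": 93.0, "contamination": 4.5},
    {"name": "bin.3", "completeness": 75.1, "contamination": 9.0},
    {"name": "bin.4", "completeness": 55.0, "contamination": 1.0},
    {"name": "bin.5", "completeness": 88.0, "contamination": 12.0},
]

summary = summarize_bins(example_bins)
print(summary)
```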

As an example, we investigated the impact of an additional reassembly of each bin for five samples (Fig 3). The N50 increased on average 6–7 fold across all samples. Twenty-six bins of the five samples had an N50 between 1 and 3 Mbases. Reassembly of bins has been shown to increase the completeness and N50 while decreasing the contamination of some bins [15]. This is in line with our samples, as some bins benefit more from this step than others. Overall, while we observed an increase in N50 for most bins, the genome quality based on CheckM metrics (completeness, contamination) slightly increased or decreased for individual bins.
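
For readers unfamiliar with the metric, the N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The following sketch computes it and the fold change for one bin; the contig lengths are made up for illustration and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bases)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical bin before and after reassembly (same total length)
before = [100_000, 200_000, 300_000, 400_000]
after = [900_000, 100_000]

fold_change = n50(after) / n50(before)
print(n50(before), n50(after), fold_change)
```

Because fewer, longer contigs carry the same total length after reassembly, the N50 rises even though the genome size is unchanged.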

Discussion

The analysis of metagenomic sequencing data evolved as an emerging and promising research field to retrieve, characterize, and analyze organisms that are difficult to cultivate. There are numerous tools available for individual metagenomics analysis tasks, but they are mainly developed independently and are often difficult to install and run. The MUFFIN workflow gathers the different steps of a metagenomics analysis in an easy-to-install, highly reproducible, and scalable workflow using Nextflow, which makes them easily accessible to researchers.

MUFFIN utilizes the advantages of both sequencing technologies. Short reads provide a better representation of low-abundance species due to their higher coverage based on read count. Long reads are utilized to resolve repeats for better genome continuity. This aspect is further exploited in the final reassembly step after binning, which solely aims to improve genome continuity and is optional because of its additional computational burden.

Another critical aspect is the full support of differential binning, for both long and short reads, via a single input option. The additional coverage information from other read sets of similar habitats allows for the generation of more concise bins with higher completeness and less contamination because more coverage information is available for each binning tool to decide which bin each contig belongs to.

With supplied RNA-Seq data, MUFFIN is capable of enhancing the pathway results present in the metagenomic sample by incorporating this data as well as the general expression level of the genes. Such information is essential to further analyze metagenomic data sets in-depth, for example, to define the origin of a sample or to improve environmental parameters for production reactors such as biogas reactors. Knowing whether an organism expresses a gene is a crucial element in deciding whether more detailed analysis of that organism in the biotope where the sample was taken is necessary or not.

MUFFIN utilizes a large number of tools to provide a comprehensive analysis of metagenomics samples. The associated tools were mainly chosen based on benchmark performance, e.g., assembly [29,31,54–56], polishing [55], binning [15], annotation for pathways [49], and taxonomic classification [47]; however, stability and workflow compatibility were also important factors. Due to the modular coding structure of the Nextflow DSL2 language, MUFFIN can quickly be adapted to better tools or improved versions in the future, if necessary.

MUFFIN executes a de novo assembly of the RNA-Seq reads instead of mapping the reads against the MAGs to avoid bias and errors during the mapping. Indeed, not all DNA reads are assembled or binned and thus present in the last step (annotation), so mapping against the MAGs alone might miss transcripts at the sample level. In addition, for similar genes, it is impossible to know to which organism the reads should map. By using metatranscripts and comparing the annotations of the metatranscripts to the annotations of the MAGs, we avoid those issues.

Availability and future directions

MUFFIN is an ongoing workflow project that gets further improved and adjusted. The modular workflow setup of MUFFIN using Nextflow allows for fast adjustments as soon as future developments in hybrid metagenomics arise, including the pre-configuration for other workload managers. MUFFIN can directly benefit from the addition of new bioinformatics software such as for differential expression analysis and short-read assembly that can be easily plugged into the modular system of the workflow. Another improvement is the creation of an advanced user and wizard user configuration file, allowing experienced users to tweak the different parameters of the different software as desired.

MUFFIN will further benefit from different improvements, in particular from a graphical comparison of the generated MAGs via a phylogenetic tree. Furthermore, a convenient approach to include negative controls is under development to allow the reliable analysis of very low-abundance organisms in metagenomic samples.

MUFFIN is publicly available at https://github.com/RVanDamme/MUFFIN under the GNU General Public License v3.0. Detailed information about the program versions used and additional information can be found in the GitHub repository. All tools used by MUFFIN are listed in the S1 Table. The Docker images used in MUFFIN are prebuilt and publicly available at https://hub.docker.com/u/nanozoo, and the GTDB formatted for sourmash (v2.0.1) [46] usage is publicly available at https://osf.io/m5czv/. The MAGs produced from the 20 samples; the template of the output of MUFFIN (README_output.txt); the subset data used in the test profile of MUFFIN (subset_data.tar.gz); and the results of MUFFIN on the subset data with and without RNA-Seq data using both metaFlye and metaSPAdes are also available at https://osf.io/m5czv/. The version of MUFFIN presented in this paper is V1.0.3 (doi:10.5281/zenodo.4296623).

Supporting information

S1 Table. List of MUFFIN tasks with the software and versions used.

(XLSX)


Acknowledgments

We want to thank Hadrien Gourlé and Moritz Buck for the valuable insights into metagenomic analysis and annotation.

Data Availability

All subset files for testing the pipeline are available from https://osf.io/m5czv/. MUFFIN is available at https://github.com/RVanDamme/MUFFIN under the GNU General Public License version 3.

Funding Statement

This study was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – BR 5692/1-1 and BR 5692/1-2. This material is based upon work supported by Google Cloud. BM was funded by FORMAS, grant number 942-2015-1008. MH is supported by the Collaborative Research Centre AquaDiva (CRC 1076 AquaDiva) of the Friedrich Schiller University Jena, funded by the DFG. MH appreciates the support of the Joachim Herz Foundation by the add-on fellowship for interdisciplinary life science. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol. 1998;5: R245–R249. doi:10.1016/s1074-5521(98)90108-9
2. De R. Metagenomics: aid to combat antimicrobial resistance in diarrhea. Gut Pathog. 2019;11: 47. doi:10.1186/s13099-019-0331-8
3. Mukherjee A, Reddy MS. Metatranscriptomics: an approach for retrieving novel eukaryotic genes from polluted and related environments. 3 Biotech. 2020;10: 71. doi:10.1007/s13205-020-2057-1
4. Grossart H-P, Massana R, McMahon KD, Walsh DA. Linking metagenomics to aquatic microbial ecology and biogeochemical cycles. Limnol Oceanogr. 2020;65: S2–S20. doi:10.1002/lno.11382
5. Carabeo-Pérez A, Guerra-Rivera G, Ramos-Leal M, Jiménez-Hernández J. Metagenomic approaches: effective tools for monitoring the structure and functionality of microbiomes in anaerobic digestion systems. Appl Microbiol Biotechnol. 2019;103: 9379–9390. doi:10.1007/s00253-019-10052-5
6. Overholt WA, Hölzer M, Geesink P, Diezel C, Marz M, Küsel K. Inclusion of Oxford Nanopore long reads improves all microbial and viral metagenome-assembled genomes from a complex aquifer system. Environ Microbiol. 2020;22: 4000–4013. doi:10.1111/1462-2920.15186
7. Assembly-free single-molecule nanopore sequencing recovers complete virus genomes from natural microbial communities. bioRxiv. [cited 3 December 2020]. Available: https://www.biorxiv.org/content/10.1101/619684v1
8. Wetterstrand KA. DNA Sequencing Costs: Data. In: www.genome.gov/sequencingcostsdata [Internet]. 5 February 2020. [cited 5 Feb 2020]. Available: www.genome.gov/sequencingcostsdata
9. Somerville V, Lutz S, Schmid M, Frei D, Moser A, Irmler S, et al. Long-read based de novo assembly of low-complexity metagenome samples results in finished genomes and reveals insights into strain diversity and an active phage system. BMC Microbiol. 2019;19: 143. doi:10.1186/s12866-019-1500-0
10. Warwick-Dugdale J, Solonenko N, Moore K, Chittick L, Gregory AC, Allen MJ, et al. Long-read viral metagenomics captures abundant and microdiverse viral populations and their niche-defining genomic islands. PeerJ. 2019;7. doi:10.7717/peerj.6800
11. Driscoll CB, Otten TG, Brown NM, Dreher TW. Towards long-read metagenomics: complete assembly of three novel genomes from bacteria dependent on a diazotrophic cyanobacterium in a freshwater lake co-culture. Stand Genomic Sci. 2017;12. doi:10.1186/s40793-017-0232-8
12. Suzuki Y, Nishijima S, Furuta Y, Yoshimura J, Suda W, Oshima K, et al. Long-read metagenomic exploration of extrachromosomal mobile genetic elements in the human gut. Microbiome. 2019;7: 119. doi:10.1186/s40168-019-0737-z
13. Mangul S, Martin LS, Eskin E, Blekhman R. Improving the usability and archival stability of bioinformatics software. Genome Biol. 2019;20: 47. doi:10.1186/s13059-019-1649-8
14. Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35: 316–319. doi:10.1038/nbt.3820
15. Uritskiy GV, DiRuggiero J, Taylor J. MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome. 2018;6: 158. doi:10.1186/s40168-018-0541-1
16. Eren AM, Esen ÖC, Quince C, Vineis JH, Morrison HG, Sogin ML, et al. Anvi’o: an advanced analysis and visualization platform for ’omics data. PeerJ. 2015;3: e1319. doi:10.7717/peerj.1319
17. Westreich ST, Treiber ML, Mills DA, Korf I, Lemay DG. SAMSA2: a standalone metatranscriptome analysis pipeline. BMC Bioinformatics. 2018;19: 175. doi:10.1186/s12859-018-2189-z
18. Abubucker S, Segata N, Goll J, Schubert AM, Izard J, Cantarel BL, et al. Metabolic Reconstruction for Metagenomic Data and Its Application to the Human Microbiome. PLOS Comput Biol. 2012;8: e1002358. doi:10.1371/journal.pcbi.1002358
19. Meyer F, Paarmann D, D’Souza M, Olson R, Glass E, Kubal M, et al. The metagenomics RAST server–a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9: 386. doi:10.1186/1471-2105-9-386
20. Kieser S, Brown J, Zdobnov EM, Trajkovski M, McCue LA. ATLAS: a Snakemake workflow for assembly, annotation, and genomic binning of metagenome sequence data. BMC Bioinformatics. 2020;21: 257. doi:10.1186/s12859-020-03585-4
21. Clarke EL, Taylor LJ, Zhao C, Connell A, Lee J-J, Fett B, et al. Sunbeam: an extensible pipeline for analyzing metagenomic sequencing experiments. Microbiome. 2019;7: 46. doi:10.1186/s40168-019-0658-x
22. Anaconda Software distribution. Anaconda | The World’s Most Popular Data Science Platform. In: https://anaconda.com [Internet]. 5 Feb 2020 [cited 5 Feb 2020]. Available: https://www.anaconda.com/
23. Boettiger C. An introduction to Docker for reproducible research. ACM SIGOPS Oper Syst Rev. 2015;49: 71–79. doi:10.1145/2723872.2723882
24. Albertsen M, Philip H, Skarshewski A, Nielsen K, Tyson G, Nielsen P. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat Biotechnol. 2013;31. doi:10.1038/nbt.2480
25. Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34: i884–i890. doi:10.1093/bioinformatics/bty560
26. Wick R. rrwick/Filtlong. 2020. Available: https://github.com/rrwick/Filtlong
27. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J Comput Biol. 2012;19: 455–477. doi:10.1089/cmb.2012.0021
28. Antipov D, Korobeynikov A, McLean JS, Pevzner PA. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinforma Oxf Engl. 2016;32: 1009–1015. doi:10.1093/bioinformatics/btv688
29. Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017;27: 824–834. doi:10.1101/gr.213959.116
30. Kolmogorov M, Yuan J, Lin Y, Pevzner PA. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 2019;37: 540–546. doi:10.1038/s41587-019-0072-8
31. Kolmogorov M, Bickhart DM, Behsaz B, Gurevich A, Rayko M, Shin SB, et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat Methods. 2020;17: 1103–1110. doi:10.1038/s41592-020-00971-x
32. Vaser R, Sović I, Nagarajan N, Šikić M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 2017;27: 737–746. doi:10.1101/gr.214270.116
33. nanoporetech/medaka. Oxford Nanopore Technologies; 2020. Available: https://github.com/nanoporetech/medaka
34. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PloS One. 2014;9: e112963. doi:10.1371/journal.pone.0112963
35. Brandt C, Bongcam-Rudloff E, Müller B. Abundance Tracking by Long-Read Nanopore Sequencing of Complex Microbial Communities in Samples from 20 Different Biogas/Wastewater Plants. Appl Sci. 2020;10: 7518. doi:10.3390/app10217518
36. Alneberg J, Bjarnason BS, Bruijn I de, Schirmer M, Quick J, Ijaz UZ, et al. Binning metagenomic contigs by coverage and composition. Nat Methods. 2014;11: 1144–1146. doi:10.1038/nmeth.3103
37. Wu Y-W, Tang Y-H, Tringe SG, Simmons BA, Singer SW. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome. 2014;2: 26. doi:10.1186/2049-2618-2-26
38. Kang DD, Froula J, Egan R, Wang Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ. 2015;3: e1165. doi:10.7717/peerj.1165
39. Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 2017;13. doi:10.1371/journal.pcbi.1005595
40. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15: R46. doi:10.1186/gb-2014-15-3-r46
41. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. 2016;7: 11257. doi:10.1038/ncomms11257
42. Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016. [cited 3 Dec 2020]. doi:10.1101/gr.210641.116
43. Marquet M, Hölzer M, Pletz MW, Viehweger A, Makarewicz O, Ehricht R, et al. What the Phage: A scalable workflow for the identification and analysis of phage sequences. bioRxiv. 2020. doi:10.1101/2020.07.24.219899
44. Zhu C, Miller M, Marpaka S, Vaysberg P, Rühlemann MC, Wu G, et al. Functional sequencing read annotation for high precision microbiome analysis. Nucleic Acids Res. 2018;46: e23. doi:10.1093/nar/gkx1209
45. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25: 1043–1055. doi:10.1101/gr.186072.114
46. Brown C, Irber L. sourmash: a library for MinHash sketching of DNA. In: Journal of Open Source Software [Internet]. 14 September 2016. [cited 18 Nov 2019]. doi:10.21105/joss.00027
47. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol. 2018;36: 996–1004. doi:10.1038/nbt.4229
48. Méric G, Wick RR, Watts SC, Holt KE, Inouye M. Correcting index databases improves metagenomic studies. bioRxiv. 2019; 712166. doi:10.1101/712166
49.Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, von Mering C, et al. Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper. Mol Biol Evol. 2017;34: 2115–2122. 10.1093/molbev/msx148 [DOI] [PMC free article] [PubMed] [Google Scholar]
50.Huerta-Cepas J, Szklarczyk D, Heller D, Hernández-Plaza A, Forslund SK, Cook H, et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 2019;47: D309–D314. 10.1093/nar/gky1085 [DOI] [PMC free article] [PubMed] [Google Scholar]
51.Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 2013;8: 1494–1512. 10.1038/nprot.2013.084 [DOI] [PMC free article] [PubMed] [Google Scholar]
52.Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017;14: 417–419. 10.1038/nmeth.4197 [DOI] [PMC free article] [PubMed] [Google Scholar]
53.Campanaro S, Treu L, Rodriguez-R LM, Kovalovszki A, Ziels RM, Maus I, et al. The anaerobic digestion microbiome: a collection of 1600 metagenome-assembled genomes shows high species diversity related to methane production. bioRxiv. 2019; 680553 10.1101/680553 [DOI] [Google Scholar]
54.Wick RR, Holt KE. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research. 2020;8: 2138 10.12688/f1000research.21782.3 [DOI] [PMC free article] [PubMed] [Google Scholar]
55.Nicholls SM, Quick JC, Tang S, Loman NJ. Ultra-deep, long-read nanopore sequencing of mock microbial community standards. GigaScience. 2019;8 10.1093/gigascience/giz043 [DOI] [PMC free article] [PubMed] [Google Scholar]
56.Lau MCY, Harris RL, Oh Y, Yi MJ, Behmard A, Onstott TC. Taxonomic and Functional Compositions Impacted by the Quality of Metatranscriptomic Assemblies. Front Microbiol. 2018;9 10.3389/fmicb.2018.00009 [DOI] [PMC free article] [PubMed] [Google Scholar]

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008716.r001

Decision Letter 0

Mihaela Pertea

26 Sep 2020

Dear Dr Van Damme,

Thank you very much for submitting your manuscript "Metagenomics workflow for hybrid assembly, differential coverage binning, transcriptomics and pathway analysis (MUFFIN)" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Mihaela Pertea

Software Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This paper describes MUFFIN, a workflow application for binning MAGs from shotgun metagenome data generated from both long- and short-read sequencing platforms.

Overall, the paper is well written, reasonably thorough, and describes a workflow that is certainly needed. We are not aware of competing workflows that deal with long reads, and we feel that MUFFIN is poised to become a widely used tool!

Despite the overall quality of the paper, we do think there are a number of things that could be reworded, clarified, and revised. Please see below.

Major comments --

While ONT sequencing is certainly a fast-advancing technology, in our limited experience it seems still to be mostly suitable for relatively low complexity communities, and this paper glosses over its current limitations. Perhaps we are wrong? If so, more references would be useful (than just the one). Regardless, it would be good to qualify the statement here to make it clear that e.g. reference 6 is about hybrid assembly, and not purely about de novo long read sequencing in isolation. And/or it might be good to provide some minimal guidance (perhaps elsewhere in the paper) as to how many reads from which ONT platform would be useful.

Line 108, "Binning is the most crucial step..." We are not sure that is true - certainly assembly would seem to be important, for example. Perhaps "a crucial step"?

A useful addition would be some discussion of why the tools chosen were chosen. Perhaps there was no reason other than that "they worked in our hands" (which is perfectly fine) but if there is more to add, please do! It might also be good to comment on whether the workflow is flexible enough to "swap in" different tools.

Line 179, are there downsides to this re-assembly? Perhaps chimerism/loss of microdiversity? It would also be good to discuss why additional bins were able to be constructed on a second round, after being missed in the first round.

Perhaps comment on why RNAseq reads are assembled instead of being mapped? What is being mapped to whom, in any case, line 137? Are the de novo metagenome assemblies/MAGs used at all here?

eggNOG outputs a lot of information other than just KEGG... perhaps some mentions of these would be good? (GO terms, NOG terms, suggested taxonomy, etc.)

Minor or easily resolved comments --

We suggest citing ATLAS and Sunbeam, which are short-read-focused workflows for MAG extraction.

Please make an archival copy of the version of MUFFIN used in this paper and provide the DOI; this can easily be done by connecting Zenodo to GitHub and making a new release on GitHub.

line 44, 'such' can be removed.

In Figure 1, what does the word "Differential" (in Differential Reads as an input) mean?

line 110, 'refine' should be plural

line 120, maybe suggest some additional tools beyond running MUFFIN again? mifaser, for example?

line 126, sourmash v3.5 was released recently - perhaps upgrade past an alpha version?

line 182, "more _from_ this step"

line 198, "whereas" should be rephrased - maybe refer to both technologies? Something like "MUFFIN takes advantage of short reads for abundance estimation and long reads for their ability to resolve repeats" or something.

line 222, suggest removing both "all" words.

line 237, "Annotation" should not be capitalized.

Figure 3A colors make it hard to see the most contaminated bins. Maybe invert the color scale? Or choose the inferno palette instead of viridis.

We suggest reporting not only the MAGs/length etc. that MUFFIN recovered, but also how many reads weren't assembled/analyzed/didn't make it into the MAGs.

Reviewer #2: Long reads are more and more common for metagenomics. MUFFIN is a pipeline integrating short reads, long reads and RNA-seq reads for functional annotation of metagenomes. It aggregates the results into KEGG pathways and produces nice HTML reports. It is great that the authors provided multiple executor systems. The installation instructions are clear, and the executor can be specified easily.

Major concerns:

1. Unable to run MUFFIN

I tried to run muffin on a Linux cluster with different combinations of executors and backends without success. The executed commands together with the error logs are attached.

2. More documentation to customize the cluster execution.

It is impossible to make a tool that can be run on any cluster system. The authors did a good job by providing solutions for Slurm and Google Cloud. However, if they claim that MUFFIN can be executed on other systems (L160), they should provide profiles or document how users can create execution profiles for their clusters. It would also be helpful to explain how to configure the profiles.

3. Multiple samples.

It is not completely clear how a user can run MUFFIN with multiple samples. Does the user need to run MUFFIN on each sample separately and define the optional other samples for differential binning? Is there no easier way to run MUFFIN on a set of samples? How can a user compare different samples? The main output of MUFFIN is the quantification of KEGG pathways based on MAGs (and RNA-seq data), but the MAGs are sample-specific and not directly comparable.

4. RNA-seq:

If I understand correctly the RNAseq data comes from a microbial community, not from a single genome. In this case, the term “metatranscriptome” would be more appropriate.

Trinity was originally not designed for metatranscriptomes but works relatively well (10.1186/2049-2618-2-39). Could you discuss the challenges of assembly for metatranscriptomes and why you used genome-independent transcriptome assembly if you have MAGs available?

5. Pathway aggregation.

MUFFIN provides nice HTML reports of the KEGG pathways and the genes present in the MAGs and/or the RNA-seq data. Unfortunately, the gene presence (at least in the provided subset results) is relatively sparse. It is therefore difficult to interpret and compare the data. Could you provide pathway coverage information as numerical values? It would also be interesting to see pathways of the 20 samples from the bioreactor project as a figure.
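
To illustrate what such numerical coverage values could look like, here is a minimal sketch, not MUFFIN's implementation: coverage is taken as the fraction of a pathway's KEGG orthologs (KOs) that are found in a bin. The function name `pathway_coverage`, the pathway names, and the KO sets are all hypothetical placeholders and do not reflect real KEGG pathway definitions.

```python
from typing import Dict, Set


def pathway_coverage(pathways: Dict[str, Set[str]], observed: Set[str]) -> Dict[str, float]:
    """Return, for each pathway, the fraction of its KOs present in `observed`."""
    coverage = {}
    for name, members in pathways.items():
        if not members:
            # An empty pathway definition has no meaningful coverage; report 0.0.
            coverage[name] = 0.0
            continue
        coverage[name] = len(members & observed) / len(members)
    return coverage


# Hypothetical toy input: pathway name -> placeholder KO identifiers.
toy_pathways = {
    "pathway_A": {"K00001", "K00002", "K00003", "K00004"},
    "pathway_B": {"K00005", "K00006"},
}
# Hypothetical KOs annotated in a single bin (one of them belongs to no pathway).
bin_kos = {"K00001", "K00003", "K00006", "K09999"}

result = pathway_coverage(toy_pathways, bin_kos)
for name in sorted(result):
    print(f"{name}\t{result[name]:.2f}")
```

A table of such fractions per bin and per sample would make sparse pathway maps easier to compare than presence/absence colouring alone.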

6. Bioreactor example:

a. Did the re-assembly reduce N50 for some of the bins?

b. Did it change the quality?

c. Why is the re-assembly only performed for 5 samples?
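
For question (a), N50 is the contig length at which the contigs of that length or longer together contain at least half of the assembled bases. A minimal sketch of the comparison being asked for, assuming only a list of contig lengths per bin; the function name `n50` and the example lengths are hypothetical and are not taken from the MUFFIN results.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Length of the contig at which the cumulative sum, taken from the
    longest contig downwards, first reaches half of the total assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        return 0
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths for one bin before and after re-assembly.
before = [100, 80, 60, 40, 20]
after = [150, 90, 60]

print("N50 before:", n50(before))
print("N50 after:", n50(after))
print("N50 reduced by re-assembly:", n50(after) < n50(before))
```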

Minor

• There is no quality control implemented for long reads (except length filtering). Doesn’t it make sense to remove reads that come from contaminants or have low quality?

• The Author summary paragraph is not the same as author contribution.

• Novelty: L71: It seems as if the authors make the indirect claim that MUFFIN is the first pipeline for hybrid assembly. Is this true?

• The sentence L23-25 in the abstract is difficult to read. Proposed reformulation: “and can be further used for quantification and annotation by providing RNA-Seq data (optionally).”

• Is it possible to assemble with long reads only, or does one always need to define short reads for the polishing?

• L98: Are PacBio long reads also supported?

• L127: Is it possible to update the GTDB?

• L129: The sentence "GTDB substantially improved overall downstream results40" is misleading. One would assume that the sentence is about the downstream results of MUFFIN. But then the authors should explain which downstream results and how, instead of citing ref40.

• The authors look ahead on potential future version conflicts. However, I don't fully understand the solution they propose. How can a user make sure that they pick the correct version of nextflow or Muffin? Isn't there another way where it's not up to the user to read the release notes to exclude version conflicts e.g. a conda package providing the tested version of muffin and nextflow?

• It's confusing that the '-profile' command-line argument has only one leading hyphen which is inconsistent with the rest of the CLI.

Reviewer #3: The authors devised a particular metagenomic analysis workflow and applied it to 20 samples of public data. The workflow is composed of popular software tools and the orchestration software. The manuscript is focused on presenting this particular workflow, yet without giving any justification for the choices or discussion of alternatives. The description of the analysed samples is superficial to the point of being non informative.

Thus I struggled to apply criteria for publication in PLOS Computational Biology here:

• Originality - seems to be mostly related to hybrid assembly from short and long reads, which is commonly done when such data exists, which in itself is not very common.

• Innovation - all the components are public 3rd party tools, thus it seems to be mostly related to a particular configuration of the pipeline.

• High importance to researchers in the field - likely not.

• Significant biological and/or methodological insight - no.

• Rigorous methodology - alternatives exist for each step, yet not evaluated.

• Substantial evidence for its conclusions - no conclusions.

It reads as a part of documentation rather than a research article.

Apologies for not being able to be more positive.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: No: could you provide the results from the 20 sample bioreactor or at least the important part of it?

Reviewer #3: No: missed the data for Figure 3.

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: C. Titus Brown

Reviewer #2: Yes: Silas Kieser

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

Attachments (submitted filenames):

annotate_muffin.sh (258 B, sh)
run_muffin.sh (273 B, sh)
annotate_muffin_custom.sh (211 B, sh)
nextflow.log.3 (19.9 KB, log)
command.sh (102 B, sh)
nextflow.log (27.4 KB, log)
command_local.sh (102 B, sh)
nextflow_local.log (34.2 KB, log)
bins.csv (148 B, csv)

PLoS Comput Biol. 2021 Feb 9;17(2):e1008716. doi: 10.1371/journal.pcbi.1008716.r002

Author response to Decision Letter 0

8 Dec 2020

Attachment (submitted filename): Reviewers Comment.docx (29.1 KB, docx)

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008716.r003

Decision Letter 1

Mihaela Pertea

17 Jan 2021

Dear Dr Van Damme,

We are pleased to inform you that your manuscript 'Metagenomics workflow for hybrid assembly, differential coverage binning, metatranscriptomics and pathway analysis (MUFFIN)' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Mihaela Pertea

Software Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: The corrections are satisfying and I managed to run MUFFIN.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Silas Kieser

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008716.r004

Acceptance letter

Mihaela Pertea

4 Feb 2021

PCOMPBIOL-D-20-01293R1

Metagenomics workflow for hybrid assembly, differential coverage binning, metatranscriptomics and pathway analysis (MUFFIN)

Dear Dr Van Damme,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Alice Ellingham

Associated Data

Supplementary Materials

S1 Table. List of the MUFFIN tasks, with the software and versions used.

(XLSX, 14.4 KB)

Data Availability Statement

All subset files for testing the pipeline are available from https://osf.io/m5czv/. MUFFIN is available at https://github.com/RVanDamme/MUFFIN under the GNU General Public License version 3.

[pcbi.1008716.ref001] 1.Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol. 1998;5: R245–R249. 10.1016/s1074-5521(98)90108-9 [DOI] [PubMed] [Google Scholar]

[pcbi.1008716.ref002] 2.De R. Metagenomics: aid to combat antimicrobial resistance in diarrhea. Gut Pathog. 2019;11: 47 10.1186/s13099-019-0331-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref003] 3.Mukherjee A, Reddy MS. Metatranscriptomics: an approach for retrieving novel eukaryotic genes from polluted and related environments. 3 Biotech. 2020;10: 71 10.1007/s13205-020-2057-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref004] 4.Grossart H-P, Massana R, McMahon KD, Walsh DA. Linking metagenomics to aquatic microbial ecology and biogeochemical cycles. Limnol Oceanogr. 2020;65: S2–S20. 10.1002/lno.11382 [DOI] [Google Scholar]

[pcbi.1008716.ref005] 5.Carabeo-Pérez A, Guerra-Rivera G, Ramos-Leal M, Jiménez-Hernández J. Metagenomic approaches: effective tools for monitoring the structure and functionality of microbiomes in anaerobic digestion systems. Appl Microbiol Biotechnol. 2019;103: 9379–9390. 10.1007/s00253-019-10052-5 [DOI] [PubMed] [Google Scholar]

[pcbi.1008716.ref006] 6.Overholt WA, Hölzer M, Geesink P, Diezel C, Marz M, Küsel K. Inclusion of Oxford Nanopore long reads improves all microbial and viral metagenome-assembled genomes from a complex aquifer system. Environ Microbiol. 2020;22: 4000–4013. 10.1111/1462-2920.15186 [DOI] [PubMed] [Google Scholar]

[pcbi.1008716.ref007] 7.Assembly-free single-molecule nanopore sequencing recovers complete virus genomes from natural microbial communities | bioRxiv. [cited 3 December 2020]. Available: https://www.biorxiv.org/content/10.1101/619684v1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref008] 8.Wetterstrand KA. DNA Sequencing Costs: Data. In: www.genome.gov/sequencingcostsdata [Internet]. 5 February 2020. [cited 5 Feb 2020]. Available: www.genome.gov/sequencingcostsdata [Google Scholar]

[pcbi.1008716.ref009] 9.Somerville V, Lutz S, Schmid M, Frei D, Moser A, Irmler S, et al. Long-read based de novo assembly of low-complexity metagenome samples results in finished genomes and reveals insights into strain diversity and an active phage system. BMC Microbiol. 2019;19: 143 10.1186/s12866-019-1500-0 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref010] 10.Warwick-Dugdale J, Solonenko N, Moore K, Chittick L, Gregory AC, Allen MJ, et al. Long-read viral metagenomics captures abundant and microdiverse viral populations and their niche-defining genomic islands. PeerJ. 2019;7 10.7717/peerj.6800 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref011] 11.Driscoll CB, Otten TG, Brown NM, Dreher TW. Towards long-read metagenomics: complete assembly of three novel genomes from bacteria dependent on a diazotrophic cyanobacterium in a freshwater lake co-culture. Stand Genomic Sci. 2017;12 10.1186/s40793-017-0232-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref012] 12.Suzuki Y, Nishijima S, Furuta Y, Yoshimura J, Suda W, Oshima K, et al. Long-read metagenomic exploration of extrachromosomal mobile genetic elements in the human gut. Microbiome. 2019;7: 119 10.1186/s40168-019-0737-z [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref013] 13.Mangul S, Martin LS, Eskin E, Blekhman R. Improving the usability and archival stability of bioinformatics software. Genome Biol. 2019;20: 47 10.1186/s13059-019-1649-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref014] 14.Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35: 316–319. 10.1038/nbt.3820 [DOI] [PubMed] [Google Scholar]

[pcbi.1008716.ref015] 15.Uritskiy GV, DiRuggiero J, Taylor J. MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome. 2018;6: 158 10.1186/s40168-018-0541-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref016] 16.Eren AM, Esen ÖC, Quince C, Vineis JH, Morrison HG, Sogin ML, et al. Anvi’o: an advanced analysis and visualization platform for ’omics data. PeerJ. 2015;3: e1319 10.7717/peerj.1319 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref017] 17.Westreich ST, Treiber ML, Mills DA, Korf I, Lemay DG. SAMSA2: a standalone metatranscriptome analysis pipeline. BMC Bioinformatics. 2018;19: 175 10.1186/s12859-018-2189-z [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref018] 18.Abubucker S, Segata N, Goll J, Schubert AM, Izard J, Cantarel BL, et al. Metabolic Reconstruction for Metagenomic Data and Its Application to the Human Microbiome. PLOS Comput Biol. 2012;8: e1002358 10.1371/journal.pcbi.1002358 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref019] 19.Meyer F, Paarmann D, D’Souza M, Olson R, Glass E, Kubal M, et al. The metagenomics RAST server–a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9: 386 10.1186/1471-2105-9-386 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref020] 20.Kieser S, Brown J, Zdobnov EM, Trajkovski M, McCue LA. ATLAS: a Snakemake workflow for assembly, annotation, and genomic binning of metagenome sequence data. BMC Bioinformatics. 2020;21: 257 10.1186/s12859-020-03585-4 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref021] 21.Clarke EL, Taylor LJ, Zhao C, Connell A, Lee J-J, Fett B, et al. Sunbeam: an extensible pipeline for analyzing metagenomic sequencing experiments. Microbiome. 2019;7: 46 10.1186/s40168-019-0658-x [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref022] 22.Anaconda Software distribution. Anaconda | The World’s Most Popular Data Science Platform. In: https://anaconda.com [Internet]. 5 Feb 2020 [cited 5 Feb 2020]. Available: https://www.anaconda.com/

[pcbi.1008716.ref023] 23.Boettiger C. An introduction to Docker for reproducible research. ACM SIGOPS Oper Syst Rev. 2015;49: 71–79. 10.1145/2723872.2723882 [DOI] [Google Scholar]

[pcbi.1008716.ref024] 24.Albertsen M, Philip H, Skarshewski A, Nielsen K, Tyson G, Nielsen P. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat Biotechnol. 2013;31 10.1038/nbt.2480 [DOI] [PubMed] [Google Scholar]

[pcbi.1008716.ref025] 25.Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34: i884–i890. 10.1093/bioinformatics/bty560 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1008716.ref026] 26.Wick R. rrwick/Filtlong. 2020. Available: https://github.com/rrwick/Filtlong

[pcbi.1008716.ref027] 27.Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J Comput Biol. 2012;19: 455–477. 10.1089/cmb.2012.0021 [DOI] [PMC free article] [PubMed] [Google Scholar]

28. Antipov D, Korobeynikov A, McLean JS, Pevzner PA. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics. 2016;32: 1009–1015. doi:10.1093/bioinformatics/btv688

29. Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017;27: 824–834. doi:10.1101/gr.213959.116

30. Kolmogorov M, Yuan J, Lin Y, Pevzner PA. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 2019;37: 540–546. doi:10.1038/s41587-019-0072-8

31. Kolmogorov M, Bickhart DM, Behsaz B, Gurevich A, Rayko M, Shin SB, et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat Methods. 2020;17: 1103–1110. doi:10.1038/s41592-020-00971-x

32. Vaser R, Sović I, Nagarajan N, Šikić M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 2017;27: 737–746. doi:10.1101/gr.214270.116

33. nanoporetech/medaka. Oxford Nanopore Technologies; 2020. Available: https://github.com/nanoporetech/medaka

34. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014;9: e112963. doi:10.1371/journal.pone.0112963

35. Brandt C, Bongcam-Rudloff E, Müller B. Abundance Tracking by Long-Read Nanopore Sequencing of Complex Microbial Communities in Samples from 20 Different Biogas/Wastewater Plants. Appl Sci. 2020;10: 7518. doi:10.3390/app10217518

36. Alneberg J, Bjarnason BS, Bruijn I de, Schirmer M, Quick J, Ijaz UZ, et al. Binning metagenomic contigs by coverage and composition. Nat Methods. 2014;11: 1144–1146. doi:10.1038/nmeth.3103

37. Wu Y-W, Tang Y-H, Tringe SG, Simmons BA, Singer SW. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome. 2014;2: 26. doi:10.1186/2049-2618-2-26

38. Kang DD, Froula J, Egan R, Wang Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ. 2015;3: e1165. doi:10.7717/peerj.1165

39. Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 2017;13. doi:10.1371/journal.pcbi.1005595

40. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15: R46. doi:10.1186/gb-2014-15-3-r46

41. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. 2016;7: 11257. doi:10.1038/ncomms11257

42. Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016 [cited 3 Dec 2020]. doi:10.1101/gr.210641.116

43. Marquet M, Hölzer M, Pletz MW, Viehweger A, Makarewicz O, Ehricht R, et al. What the Phage: A scalable workflow for the identification and analysis of phage sequences. bioRxiv. 2020. doi:10.1101/2020.07.24.219899

44. Zhu C, Miller M, Marpaka S, Vaysberg P, Rühlemann MC, Wu G, et al. Functional sequencing read annotation for high precision microbiome analysis. Nucleic Acids Res. 2018;46: e23. doi:10.1093/nar/gkx1209

45. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25: 1043–1055. doi:10.1101/gr.186072.114

46. Brown C, Irber L. sourmash: a library for MinHash sketching of DNA. In: Journal of Open Source Software [Internet]. 14 Sep 2016 [cited 18 Nov 2019]. doi:10.21105/joss.00027

47. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol. 2018;36: 996–1004. doi:10.1038/nbt.4229

48. Méric G, Wick RR, Watts SC, Holt KE, Inouye M. Correcting index databases improves metagenomic studies. bioRxiv. 2019; 712166. doi:10.1101/712166

49. Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, von Mering C, et al. Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper. Mol Biol Evol. 2017;34: 2115–2122. doi:10.1093/molbev/msx148

50. Huerta-Cepas J, Szklarczyk D, Heller D, Hernández-Plaza A, Forslund SK, Cook H, et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 2019;47: D309–D314. doi:10.1093/nar/gky1085

51. Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 2013;8: 1494–1512. doi:10.1038/nprot.2013.084

52. Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017;14: 417–419. doi:10.1038/nmeth.4197

53. Campanaro S, Treu L, Rodriguez-R LM, Kovalovszki A, Ziels RM, Maus I, et al. The anaerobic digestion microbiome: a collection of 1600 metagenome-assembled genomes shows high species diversity related to methane production. bioRxiv. 2019; 680553. doi:10.1101/680553

54. Wick RR, Holt KE. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research. 2020;8: 2138. doi:10.12688/f1000research.21782.3

55. Nicholls SM, Quick JC, Tang S, Loman NJ. Ultra-deep, long-read nanopore sequencing of mock microbial community standards. GigaScience. 2019;8. doi:10.1093/gigascience/giz043

56. Lau MCY, Harris RL, Oh Y, Yi MJ, Behmard A, Onstott TC. Taxonomic and Functional Compositions Impacted by the Quality of Metatranscriptomic Assemblies. Front Microbiol. 2018;9. doi:10.3389/fmicb.2018.00009
