Skip to main content
GigaScience logoLink to GigaScience
. 2025 Sep 26;14:giaf017. doi: 10.1093/gigascience/giaf017

PathoGFAIR: a collection of FAIR and adaptable (meta)genomics workflows for (foodborne) pathogens detection and tracking

Engy Nasr 1, Anna Henger 2, Björn Grüning 3, Paul Zierep 4,, Bérénice Batut 5,6,
PMCID: PMC12466118  PMID: 41004266

Abstract

Background

Food contamination by pathogens poses a global health threat, affecting an estimated 600 million people annually. During a foodborne outbreak investigation, microbiological analysis of food vehicles detects responsible pathogens and traces contamination sources. Metagenomic approaches offer a comprehensive view of the genomic composition of microbial communities, facilitating the detection of potential pathogens in samples. Combined with sequencing techniques like Oxford Nanopore sequencing, such metagenomic approaches become faster and easier to apply. A key limitation of these approaches is the lack of accessible, easy-to-use, and openly available pipelines for pathogen identification and tracking from (meta)genomic data.

Findings

PathoGFAIR is a collection of Galaxy-based Findable, Accessible, Interoperable, and Reusable (FAIR) workflows employing state-of-the-art tools to detect and track pathogens from metagenomic Nanopore sequencing. Although initially developed to detect pathogens in food datasets, the workflows can be applied to other metagenomic Nanopore pathogenic data. PathoGFAIR incorporates visualizations and reports for comprehensive results. We tested PathoGFAIR on 130 samples containing different pathogens from multiple hosts under various experimental conditions. For all but 1 sample, workflows have successfully detected expected pathogens at least at the species rank. Further taxonomic ranks are detected for samples with sufficiently high colony-forming unit and low cycle threshold values.

Conclusions

PathoGFAIR detects the pathogens at species and subspecies taxonomic ranks in all but 1 tested sample, regardless of whether the pathogen is isolated or the sample is incubated before sequencing. Importantly, PathoGFAIR is easy to use and can be straightforwardly adapted and extended for other types of analysis and sequencing techniques, making it usable in various pathogen detection scenarios.

Keywords: Galaxy, public health, Nanopore, pipeline, open source, benchmark samples, visualization

Introduction

Foodborne pathogens pose a significant threat to public health worldwide, causing millions of cases of illness and even death every year [1, 2]. These diverse microorganisms, spanning bacteria, viruses, parasites, and fungi, can contaminate a variety of foods, leading to both localized outbreaks and widespread epidemics. Ensuring food safety and controlling foodborne pathogens are key priorities for public health authorities at local, regional, and global levels, including agencies such as the European Food Safety Authority (EFSA), European Centre for Disease Prevention and Control (ECDC), and the World Health Organization (WHO) [3].

Traditional methods for identifying the source of food contamination require isolation of the target pathogen. This process is not only time-consuming but can be labor-intensive, often requiring multiple steps and sophisticated techniques, and lacks a guaranteed success rate [4]. In contrast, shotgun metagenomic approaches provide a solution to these challenges, as they give an overview of the genomic composition in the sample, including the food source itself, the microbial community, and any possible pathogens and their complete genetic information [5]. Importantly, shotgun metagenomic approaches eliminate the need for prior isolation of the targeted pathogen, as required by whole-genome sequencing (WGS) methods, and they are not limited to specific genes as opposed to real-time PCR approaches [6] or 16S ribosomal RNA (rRNA) sequencing. While 16S rRNA sequencing is widely used for bacterial taxonomic profiling, it is limited in scope compared to shotgun metagenomic sequencing. The latter allows for the detection of a wide range of pathogens, including bacteria, viruses, and fungi, and gives access to the full genomes, enabling the taxa-agnostic identification of antimicrobial resistance (AMR) and virulence genes. This broader scope makes shotgun sequencing more suitable for comprehensive pathogen detection, especially in complex foodborne outbreak investigations [7].

Nanopore sequencing provides long-read data that can capture comprehensive genetic information. Its utilization, as exemplified by studies like [8], demonstrates its utility in closing genomic gaps, delivering real-time sequencing data, and enhancing the capabilities of metagenomic approaches for outbreak investigations. This technology enables more accurate and rapid pathogen detection, a critical advancement in scenarios where timely responses are essential for effective outbreak management.

Once sequencing data are generated, they must be processed using bioinformatics tools to identify pathogens, their genetic variations, and virulence factor (VF) genes, thereby facilitating timely and accurate detection [9, 10]. However, available tools and workflows require bioinformatic and computational knowledge and expertise. For example, tool parameters need to be adapted to the specific use case. End-to-end platforms (Table 1) that allow users to analyze their samples are either restricted with only a limited free trial (e.g., BugSeq [11]) or paid subscription (e.g., OneCodex [12]), or they require high computational resources (e.g., SURPI [13] and Sunbeam [14]). For certain free resources, the underlying workflow is not available and adaptable for the user. For example, IDseq [15] (also known as CZID [16]), a free cloud-based service for pathogen detection, can only be externally accessed through the dedicated online user interface. Furthermore, some of these workflows are specific to a certain host, pathogen, or sequencing technique, lacking the flexibility for customization.

Table 1:

Comparison of features between PathoGFAIR and other similar pipelines or systems. This comparison sheds light on various features and characteristics, such as accessibility, technical specifications, and the scope of analyses offered by each system. It serves as a reference to evaluate the suitability of PathoGFAIR and other similar pipelines or systems for specific needs and requirements.

Features PathoGFAIR IDseq BugSeq SURPI OneCodex Sunbeam Innuendo PAIPline Victors
[21] [22] [23]
General characteristics
Free of charge *
Open source code
Web interface **
Automatable API
Accessibility and availability
Simple end-user modification
Publicly available web server
Last updated 2024 2024 2024 2014 2023 2024 2018 2018 2019
User support and documentation
Tutorial
Documentation
User support
Technical specifications
Workflow manager Galaxy Snakemake Nextflow
Sequencing technique Nanopore*** Illumina Illumina Illumina Illumina Illumina Illumina
& Nanopore & Nanopore
Analyses
Preprocessing
Taxonomy profiling
Gene-based pathogen identification
Allele-based pathogen identification
Sample aggregation and visualizations

*Free trial of 10 samples is available.

**Malfunctioned when tested.

***Can be easily adapted to any other types of sequencing techniques via Galaxy, a customizable and automatable API.

Galaxy [17] is an open-source platform for Findable, Accessible, Interoperable, and Reusable (FAIR) data analysis. It enables users to apply a comprehensive suite of bioinformatics tools (that can be combined into workflows) through either its user-friendly web interface or its automatable Application Programming Interface (API) for integrating and customizing workflows, enhancing user flexibility. It ensures reproducibility by capturing the necessary information to repeat and understand data analyses. Galaxy offers a collection of high-quality prebuilt workflows that can be used directly or are easily adapted to the user’s needs via the Galaxy workflow editor. Galaxy workflows can be executed on any Galaxy server, even on the private Galaxy server, making it suitable also for data where privacy concerns are important. Furthermore, Galaxy via the major public servers [17] freely provides a large computing infrastructure, allowing for the execution of computationally challenging workflows, which is often the case for metagenomic analysis.

Here, we present PathoGFAIR, a collection of Galaxy-based workflows for pathogen identification and tracking its presence among (meta)genomics Oxford Nanopore sequencing data. The workflows are openly available on 2 workflow registries (Dockstore [18] and WorkflowHub [19]). They can be used directly on 3 major Galaxy servers (usegalaxy.org, usegalaxy.eu, usegalaxy.org.au) or installed in any other Galaxy server. The workflows are created to work agnostically, detecting all pathogens present in the samples without prior knowledge of the target pathogen. As the workflows are created in Galaxy, they can be adapted for other sequencing techniques or with various downstream analyses, such as differential expression analysis, or further statistics and visualizations [17]. Workflows are documented and supported by an extensive tutorial freely available via the Galaxy Training Network (GTN) [20]. Overall, PathoGFAIR offers an easy-to-use computational solution that speeds up the process of sampling, detecting, and tracking pathogens.

Implementation

Overview

PathoGFAIR comprises a collection of 5 workflows, implemented in Galaxy (Fig. 1). Each workflow serves a specific function and can be executed independently, enabling users to tailor their analysis according to their requirements.

Figure 1:

Figure 1:

Flowchart of the PathoGFAIR workflows. Workflow 1 (olive green) takes as input sequencing data generated by Oxford Nanopore technologies and performs quality control and host filtering. Then, 3 parallel workflows are executed on the output of Workflow 1: Workflow 2 (red) for taxonomy profiling, Workflow 3 (dark cyan) for gene-based pathogen identification, and Workflow 4 (purple) for SNP-based pathogen identification. These 4 workflows can run individually and in parallel. Finally, all outputs for the different provided datasets are aggregated in Workflow 5 (green) for PathoGFAIR sample aggregation and visualization.

The input data for PathoGFAIR comprise sequencing data generated using Oxford Nanopore technologies, along with an optional metadata table describing the datasets. Basecalling for converting raw signal data from the Nanopore sequencer into nucleotide sequences is not included within the PathoGFAIR workflows. In the use cases presented later in the article, real-time basecalling is performed using the MinKNOW software (Oxford Nanopore Technologies) before the reads are used in the workflows. Basecalling is a crucial step, as it affects the quality of the reads. Users are encouraged to ensure that high-quality basecalling is performed before starting the analysis with PathoGFAIR.

The datasets are preprocessed in Workflow 1, which encompasses quality control and host removal procedures. Subsequently, the preprocessed data are directed to 3 parallel workflows: taxonomy profiling (Workflow 2), gene-based pathogen identification (Workflow 3), and allele-based pathogen identification (Workflow 4). This parallel execution allows for efficient analysis and flexibility in workflow selection. Notably, Workflow 4 can optionally synchronize with Workflow 2 or Workflow 3 to leverage prior taxonomic analysis or gene-based pathogen identification results, providing users with flexibility based on specific use cases. By using detailed taxonomic identification from Workflow 2 or gene-based pathogen identification from Workflow 3, Workflow 4 enhances mapping and single-nucleotide polymorphism (SNP) detection accuracy. This process involves selecting the correct reference genome of the pathogen for mapping, informed by results from Workflow 2, Workflow 3, or even Workflow 1, which performs initial taxonomy assignment during the host filtering step.

Since each workflow can be executed independently, users can focus on specific aspects of pathogen detection or analysis. This modular approach empowers users to utilize the full range of functions offered by each workflow individually or to combine them as needed for comprehensive pathogen detection.

Finally, in Workflow 5, outputs from the previous workflows and the metadata of the dataset are aggregated and visualized for comprehensive pathogen tracking across samples. This aggregation step ensures a holistic view of pathogen presence and distribution, facilitating further insights and analysis.

Overall, the independent nature of PathoGFAIR’s workflows provides users with a user-friendly and customizable approach to pathogen detection, allowing for both comprehensive analyses and targeted investigations based on specific research needs or objectives.

Ensuring the accuracy and currency of reference data is indeed fundamental for robust metagenomic analysis. PathoGFAIR leverages Galaxy’s integrated Data Managers, which enables Galaxy admins to provide up-to-date reference data. These Data Managers automate the download, installation, and regular update of essential reference databases, ensuring that PathoGFAIR users work with complete, accurate, and up-to-date reference information. PathoGFAIR workflows are configured to use well-maintained and reputable sources, such as NCBI and other public pathogen reference repositories, which further support accuracy and comprehensiveness. Additionally, Galaxy’s user-friendly interface enables users to select preferred references or request the inclusion of specific databases via Galaxy administrators, adding to the workflow’s adaptability for diverse use cases.

PathoGFAIR offers a competitive, and accessible solution (Table 1) to detect and track pathogens in metagenomic Nanopore data through its 5 Galaxy-based FAIR and customizable workflows.

Workflow 1: Preprocessing

Workflow 1 encompasses essential preprocessing steps to ensure the quality and integrity of sequencing data.

Quality control and sequence filtering, based on quality, length, or low complexity, are performed using Fastp (v 0.23.2) (biotools:fastp) [24]. Porechop (v 0.2.4) [25] trims low-quality base pairs and removes duplicates and adapters. Quality thresholds are set to ensure that reads have an average quality score of Q20, aligning with accepted standards for Nanopore sequencing, where Q20 or higher quality is typically sufficient for reliable results [26].

Quality-controlled (QC) reads are cleaned of sequences from the host or food source (e.g., bovine in case of bovine meat) by mapping to their reference genome using Minimap2 (v 2.26) (RRID:SCR_018550) [27], a tool tens of times faster than mainstream long-read mappers such as BLASR [28], BWA-MEM [29], NGMLR [30], and GMAP [31] and 3 times as fast as Bowtie2 (biotools:bowtie2) [32], designed for Illumina short reads [27]. A variety of reference genomes (e.g., Human, Chicken, or Cow) can be installed on Galaxy servers to work with Minimap2. A wide variety of reference genomes are integrated into Minimap2 on Galaxy, providing users with a convenient selection to choose from before executing the workflow. Kraken2 (v 1.2) (biotools:kraken2) [33] is applied for further contamination detection (e.g., human sequences using the Kalamari database). The Kalamari database includes mitochondrial sequences of various known hosts [34]. Host/food source reads matched to the Kalamari database are assessed and removed using Krakentools (v 1.2) [33].

The workflow returns QC reads without contamination or host sequences as well as interactive reports, produced by FastQC (v 0.12.1) (RRID:SCR_014583), fastp, and MultiQC (v 1.11) (RRID:SCR_014982) [35]. Furthermore, Nanoplot (v 1.39.0) (biotools:nanoplot) [36] is employed to provide detailed quality metrics specifically tailored to the preprocessing step, enriching the suite of analytical insights and facilitating robust data evaluation.

Workflow 2: Taxonomy profiling

Workflow 2 performs taxonomic profiling of the microbial community to identify pathogens and other microorganisms for the QC reads from Workflow 1, using Kraken2 (v 1.2) [33] and the PlusPF (archaea, bacteria, viral, plasmid, human, UniVec_Core, protozoa, fungi, and plant) Refseq database (7 June 2022). Although Kraken2 is a tool designed for short-read sequencing and is known for its false-positive taxonomy assignments, particularly at lower microbial abundances [37], its application to long reads can still yield a substantial overview of the microbial community. This is particularly true for discerning bacteria that could potentially be pathogenic at genus and species taxonomic ranks  [38, 39]. Kraken2 allows for the rapid assignment of taxonomy at multiple ranks, from kingdom to species, using an efficient exact k-mer matching algorithm. Other tools such as Centrifuge (RRID:SCR_016665) or MetaPhlAn (RRID:SCR_004915) are viable alternatives, also available on Galaxy. Kraken2 is selected for its speed, sensitivity, and ability to work with large reference databases, a critical factor when analyzing complex metagenomic samples [33, 40]. The produced community profile is visualized using Krona (RRID:SCR_012785) [41] and observed interactively for different taxonomic ranks using Phinch [42] or Pavian [43].

Workflow 3: Gene-based pathogen identification

In this workflow, the pathogens are identified by the presence of genes associated with pathogenicity. QC reads from Workflow 1 are assembled into contigs using Metaflye (v 2.9.1) (RRID:SCR_017016) [44]. The contigs are then polished using the Medaka Consensus Pipeline (v 1.7.2) (biotools:medaka) [45], which generates consensus sequences using neural networks and shows improved accuracy over graph-based approaches for Oxford Nanopore reads. The polished contigs are screened afterward using ABRicate (v 1.0.1) (biotools:ABRicate) [46] for virulence factors (VFs) with the Virulence Factor Database (VFDB) [47] and for antimicrobial resistance (AMR) genes with AMRFinderPlus [48] database. ABRicate is chosen for its versatility, as it supports multiple databases, including those for antimicrobial resistance genes and the VFDB. This makes it a comprehensive tool for gene-based pathogen detection, capable of identifying a wide range of relevant genetic markers [46].

Workflow 4: Allele-based pathogen identification

Another approach to identifying pathogens is to use an allelic approach by detecting SNPs (i.e., markers showing evolutionary histories of homogeneous strains) [49]. This process includes SNP calling, aimed at identifying novel pathogen strains and elucidating discrepancies compared to reference sequences, thereby facilitating the tracing of emerging variants. Within Workflow 4, both complex variants and SNPs are discerned, serving as crucial elements for subsequent pathogen identification and variant tracing purposes.

QC reads from Workflow 1 are mapped using Minimap2 (v 2.26) to a selected reference genome of a suspected pathogen. Users can choose the reference genome based on their prior knowledge of the target pathogen, the taxonomic analysis in Workflow 2, or the detected pathogenic genes in Workflow 3. Variant calling for mapped reads is performed using Clair3 (v 0.1.12) (biotools:clair3) [50]. Clair3, a tool developed for long reads, has been chosen because it is demonstrated to be faster and more accurate than the Medaka variant pipeline, which its developer has declared deprecated in favor of Clair3 [45]. After that, all complex variants and their information, such as type, genomics position, and quality score, are normalized using bcftools norm (v 1.9) [51]. The normalized reads are filtered using SnpSift filter (v 4.3) (RRID:SCR_015624) [52] based on the SNP quality computed in the SNP identification step with Clair3. Filtered variants fields required for further analyses are extracted using SnpSift extract fields (v 4.3) (RRID:SCR_015624) [52]. Finally, a consensus sequence for each sample is built using bcftools consensus (v 1.9) (RRID:SCR_005227) [53]. In addition to the variants, this workflow outputs tables, including summary metrics like the mapping coverage (breadth of coverage) percentages for every sample, per base covering mean depth (depth of covering), and quality filtered complex variants and SNP numbers. For more accurate results, users should consider only SNPs with a minimum depth of covering of 10× to ensure reliable calls, as demonstrated in the analyses of the following Use Cases section. This threshold effectively minimizes the inclusion of false-positive variants, a challenge often encountered with Nanopore sequencing data due to its inherent error rates.

Workflow 5: PathoGFAIR sample aggregation and visualization

In all previously described workflows, individual samples are analyzed separately. Workflow 5 consolidates the outputs from Workflows 1, 2, 3, and 4 along with sample metadata to generate various visualizations and reports. These reports illustrate the detected pathogens and facilitate the visualization and tracking of their presence across all samples.

VF tables from Workflow 3 are used to generate clustered heatmaps showing the VF genes using ggplot2 Heatmap (v 3.4.0) (RRID:SCR_014601). VF sequences are concatenated per sample, generating a consensus sequence of identified VF genes per sample and aligned over all samples using ClustalW (v 2.1) (RRID:SCR_017277). A phylogenetic tree of the virulence gene sequences is then generated from the multiple sequence alignment using FASTTREE (v 2.1.10) (RRID:SCR_015501) [54] and visualized using Newick Display (v 1.6) (biotools:newick_utilities). The same is performed on the AMR tables from Workflow 3. From Workflows 1 and 4 output tables, bar charts are generated.

Other outputs are aggregated and processed within a Jupyter Notebook [55], interactively launched in Galaxy using JupyTool (v 1.0.0). This Notebook showcases the integration of sample metadata to generate analysis-specific plots, leveraging Python (v 3.10.12) [56] libraries such as Pandas (v 1.5.3) [57, 58], Matplotlib (v 3.7.1) [59], Seaborn (v 0.12.2) [60], and Numpy (v 1.24.3) [61]. Examples of these plots include bar plots illustrating the number of reads before and after quality control for all samples, scatterplots visualizing relationships between different variables such as pathogen count and sample characteristics, and interactive cluster maps displaying the clustering patterns of samples based on pathogen composition. These visualization techniques are further elucidated and exemplified in the Use Cases section of this study, where the output tables from the workflows are aggregated with the corresponding sample metadata and visualized to facilitate comprehensive visual analysis.

VF and AMR genes are often found on mobile genetic elements (MGEs) such as plasmids or phages, meaning they can sometimes appear independently of their bacterial hosts. To address this challenge, PathoGFAIR integrates taxonomic profiling from Workflow 2 with gene detection results from Workflow 3. This cross-referencing ensures accurate attribution of VF and AMR genes to their respective host organisms. For further validation, Workflow 4 enables users to map consensus genomes, generated in Workflow 5 from detected VF genes, against any reference genomes. This process confirms whether these VF or AMR genes, detected in Workflow 3, are genuinely linked to the bacterial genome or merely associated with MGEs, with additional coverage metrics helping to ensure accurate mapping. Future updates to PathoGFAIR will include expanded methodologies to validate gene–host associations using broader taxonomic markers, further refining the precision of pathogen characterization.

Workflow reports

As all PathoGFAIR workflows are designed to run seamlessly on the Galaxy platform, an interactive report is automatically generated upon completion of each workflow. These reports provide a comprehensive overview of the respective workflow’s inputs and outputs. In PathoGFAIR, special attention has been given to refining these reports for enhanced user experience. The reports are carefully curated to automatically showcase and emphasize only the most informative, easily interpretable, and accessible outputs for each workflow. This ensures that users can efficiently extract key insights from the results, facilitating a streamlined and user-friendly analysis experience.

Easily adaptable workflows

The workflows can process raw shotgun (meta)genomics sequencing data from any sample, not only food.

PathoGFAIR has been initially developed to take Oxford Nanopore data as inputs. However, PathoGFAIR can work with Illumina data or other types of sequencing technique data. To adapt to Illumina sequencing, only 1 tool needs to be changed in Workflow 1: Porechop [25] with Cutadapt (RRID:SCR_011841) [62]. Workflows 2, 3, 4, and 5 can be used directly with Illumina datasets without any adaptation. Some tools can be changed based on the tool’s known performance toward short and long reads, such as Clair3 (v 0.1.12) [50] and Metaflye (v 2.9.1) [44]. All the mentioned tools are accessible within Galaxy, allowing for seamless interchangeability.

The workflows can also be adapted to process paired-end reads by adjusting the tools’ parameters to take paired-end read samples instead of single-end reads. These changes can be applied with little effort by using the user-friendly workflow editor in Galaxy.

Users can seamlessly switch between different host reference genomes and Kraken2 databases, as PathoGFAIR supports various preinstalled databases on the Galaxy servers. This feature enhances user convenience and efficiently explores different configurations to suit specific analysis requirements.

Similarly, tool versions and parameters can be adapted, for example, to compare results with legacy versions of the workflows. New tool versions are automatically installed on public Galaxy servers using a sophisticated update infrastructure, ensuring a straightforward mechanism to keep the infrastructure up-to-date [63]. Every time a tool is updated, an update of the workflows is suggested, tested with functional tests, and released on the workflow registries once accepted.

Each of the 5 PathoGFAIR workflows is designed for a distinct type of analysis. Workflows 2, 3, and 4 operate independently, offering the flexibility to run them concurrently or skip them as per user requirements. This modular structure allows users to tailor the analysis to their specific needs, activating only the functionalities necessary for the desired workflow outcome.

FAIR workflows

The FAIR principles [64], which emphasize the importance of making research objects Findable, Accessible, Interoperable, and Reusable, offer valuable guidance for optimizing the utility and promoting the reproducibility and reusability of any research object (data, software  [65], or workflows).

PathoGFAIR has been developed with the FAIR principles in mind and follows the 10 tips for building FAIR workflows, as suggested by de Visser et al. [66]. First, by using Galaxy as a workflow manager, the workflows are portable (Tip 6) and come with a reproducible computational environment (Tip 7). The tools integrated into the workflows use file format standards such as FASTA and FASTQ for sequence data, SAM and BAM from the Samtools project for alignment data, VCF for genetic variations, GenBank and GFF3 for genomic annotations, and PDB for structural data (Tip 5) [64]. As explained in the previous section, the workflows are provided with default values (Tip 8) and are modular (Tip 9).

The 5 workflows are available on the GitHub repository of IWC, the Intergalactic Workflow Commission of the Galaxy community (Tip 3) [67–72]. Workflows in this repository are reviewed and tested using test data before publication and with every new Galaxy release. The IWC automatically updates the workflows whenever a new version of any tool used in these workflows is released. Deposited workflows follow best practices, are versioned using GitHub releases, and contain important metadata (e.g., License, Author, Institutes) (Tip 2). The workflows are automatically added to 2 workflow repositories (Dockstore [18] and WorkflowHub [19]) to facilitate the discovery and reuse of workflows in an accessible and interoperable way (Tip 1) [73–82]. Via Dockstore or WorkflowHub, the PathoGFAIR workflows can be installed on any up-to-date Galaxy server. They are already publicly available on 3 main Galaxy servers (usegalaxy.org, usegalaxy.eu, usegalaxy.org.au), which any user can use and modify without restriction.

A thorough explanation of how to use the workflows in PathoGFAIR, including a more global description of pathogen identification from Oxford Nanopore data, can be found in a dedicated extensive tutorial [83] together with example input data and results (Tips 4 and 10), freely available and hosted via the GTN [20] infrastructure.

Finally, for every invocation of the workflows, a Research Object Crate (RO-Crate [84, 85]) can be created to store the data products of the different steps, along with the run-associated metadata (including parameters, tool, and workflow version).

Use Cases

To showcase PathoGFAIR and its capabilities, 130 samples from 2 distinct studies—one involving samples with prior pathogen isolation and the other without—were analyzed. In the case of nonisolated samples, pathogens were deliberately spiked to mimic real-world scenarios. For isolated samples, prior identification ensured the pathogens’ identities were known. All samples underwent sequencing using Oxford Nanopore technology, highlighting the workflow’s adaptability across diverse sample preparation methods. All workflows of PathoGFAIR were evaluated for their main intended tasks (e.g., the preprocessing workflow for its read quality retaining and host sequence removal performance) but also for their ability to identify the correct pathogen and how well the accuracy with respect to different sampling conditions is.

Samples without prior pathogen isolation

Data generation

In this study, 46 samples had been prepared given the following protocol [86]. Chicken meat was spiked with either 1 of 3 Salmonella enterica subspecies (S. enterica subsp. houtenae DSM 9221, S. enterica subsp. enterica DSM 554, or S. enterica subsp. salamae DSM 9220) or a mix of them, with concentrations that give cycle threshold (Ct) values between 25 and 33. A total of 15 samples were incubated at 37°C for 24 hours before DNA isolation to facilitate bacterial growth. All samples were after incubated at 56°C for 1 hour with lysis buffer and 20 ng/μL Proteinase K, followed by DNA extraction according to the STAR BEADS Pathogen DNA/RNA Extraction kit (CYANAGEN SRL) instructions. In this study, approximately 25 mg of meat was used per aliquot for DNA extraction. DNA concentrations were measured with the Qubit® 4.0 Fluorometer (Thermo Fisher Scientific) using the double-stranded DNA (dsDNA) High-Sensitivity (HS) assay kit (Thermo Fisher Scientific), following the manufacturer’s protocol. The quality was evaluated with a Nanodrop® 1000 (Thermo Fisher Scientific), assessing the 260/280 nm and 260/230 nm ratios. The 260/280 and 260/230 ratios were close to the expected ranges of 1.8–2.0 and 2.0–2.2, respectively. Extracted DNA was barcoded before sequencing using the Native barcoding genomic DNA (with EXP-NBD104, EXP-NBD114, and SQK-LSK109) protocol (Oxford Nanopore). DNA was then loaded on an R9.4.1 MinION Mk flow cell (Oxford Nanopore). SpotON sample port cover and priming port were closed and sequencing was started. The sequencing device control, data acquisition, and real-time basecalling were carried out by the MinKNOW software of the MinION Mk1C device. For 6 samples, adaptive sampling, a technique used in Nanopore sequencing to selectively sequence microbial DNA while excluding unwanted host DNA (here chicken DNA), was used. Generated sequencing data is available via BioProject PRJNA982679. Metadata for the 46 samples are summarized in Supplementary Table S1 into 5 pieces of information: (i) expected subspecies; (ii) incubation before DNA isolation; (iii) adaptive sampling during sequencing; (iv) colony-forming unit (CFU)/mL [87], a measure providing a quantitative assessment of viable microbial entities within a given sample and measured using standard microbiological techniques such as serial dilution and plating on agar medium; and (v) Ct values [88], values inversely proportional to the amount of nucleic acid present in the samples.

Preprocessing

The number of reads after quality control varies significantly between samples (Fig. 2A), which impacts downstream analyses.

Figure 2:

Figure 2:

(A) Bar plot showing the total number of quality-controlled reads per sample before (dark blue) and after (light blue) host sequence removal. On the left, the metadata of the samples are displayed: (i) the expected S. enterica subsp. salamae in yellow, S. enterica subsp. houtenae in blue, and S. enterica subsp. enterica in light purple; (ii) incubation before DNA isolation (incubated for 24 hours in pink and incubated for 1 hour in brown); and (iii) adaptive sampling during sequencing (chicken excluded in green and chicken not excluded in purple). (B) Clustergram displaying the identified VF gene abundances per sample. The VF genes are presented on the y-axis, and all 46 nonisolated samples are on the x-axis along with their sample information. On the top are the metadata of the samples with the same color code as in A. The gray bars on the bottom and on the right represent dendrogram VF gene (right) and sample metadata (bottom) clusters found with hierarchical clustering with a clustering granularity of 0.5. (C) Phylogenetic tree, using the nucleotide evolution model; General Time Reversible (GTR) model with a CAT approximation for rate heterogeneity across sites [54]. The phylogenetic tree was built on the VF gene consensus sequences concatenated per sample and aligned for all samples. (D) Bar plot with the mapping coverage (breadth of coverage), that is, the percentage of covered bases of each sample to the reference genome, measured by calculating the percentage of positions within each bin with at least 1 base aligned against it. (E) Bar plot with the mean of the mapping depth (depth of coverage) of bases mapped to corresponding bases in the reference genome for every sample. (F) Bar plot with the number of variants and SNPs found per sample. Mapping coverage percentage and the depth mean indicate whether to trust the variants and SNPs found by the workflow or not; the higher the coverage percentage and the depth mean, the more trusted the SNP results for the sample.

For host detection using Minimap2 (v 2.26), the option PacBio/Oxford Nanopore read to reference mapping was set here. As expected from the sample sequencing protocol (chicken samples and not isolated pathogen), most sequences were assigned to chicken (Gallus gallus galGal6): above 90% in 31 samples and between 55% and 85% for the remaining 15 samples (Supplementary Fig. S1). However, the percentage of identified host DNA (between 60% and 98%) was not as low as expected for the 6 samples that had undergone adaptive sampling to exclude chicken DNA during sequencing. This shows that the adaptive sampling to exclude chicken in some samples during sequencing may not have removed all the chicken sequences. All sequences identified as chicken were removed (Fig. 2A). After QC and host removal, 19 samples had fewer than 1,000 reads. These samples could only be analyzed using the taxonomy profiling as highlighted in the next sections.

Taxonomy profiling

S. enterica was detected in Workflow 2 for all samples except 1, at its species and different subspecies taxonomic ranks (interactive KRONA plot in Supplementary Fig. S2 and Supplementary Table S3).

Gene-based pathogen identification

In Workflow 3, Metaflye (v 2.9.1) tool mode’s option was chosen to be Nanopore-HQ. Users can expand the workflow and change this option according to their dataset sequencing technique.

No contig was built for 10 of the 27 samples with fewer than 2,700 reads. The identification of VF or AMR genes was then made impossible. For the other 17 samples, only 1 or 2 contigs were created, not enough for identifying VF and AMR genes.

For the remaining 19 samples with created contigs (from 3 to 157) and number of reads higher than 2,700, VF genes were identified in 15 samples (Fig. 2B), 12 of which were incubated before DNA isolation for 24 hours. Three of the 15 samples were incubated for only 1 hour before DNA isolation, resulting in a few VF genes (Fig. 2B) identified, compared to the other 12 samples, mostly because of the low number of reads (Fig. 3E) from almost the absence of incubation (Fig. 2A). It was, for example, the case for the mixed samples (i.e., samples spiked with all 3 S. enterica subspecies or samples spiked only with S. enterica subsp. houtenae and adaptively sampled during sequencing).

Figure 3:

Figure 3:

Scatterplots showing the number of identified VF genes (A, C, E) and AMR genes (B, D, F) in relationship to the Ct value (A, B), CFU/mL value (C, D), and the number of reads after preprocessing (E, F). The green area (A, B, C, D) highlights Ct values or CFU/mL values for which genes had been detected. Pearson correlation for values in the green area: (A) Inline graphic and P  Inline graphic, (B) Inline graphic and PInline graphic, (C) Inline graphic and P  Inline graphic, (D) Inline graphic and P  Inline graphic.

Some identified VF genes were found more than once in the same sample, with a maximum of 4 times. Common VF genes were identified for samples expecting identical S. enterica subspecies (Fig. 2B), such as the mucD gene, a serine protease mucD precursor, which was only found in S. enterica subsp. houtenae spiked samples, or shdA, an AIDA autotransporter-like protein, only found in S. enterica subsp. enterica spiked samples, but not in samples spiked with only S. enterica subsp. houtenae or S. enterica subsp. salamae.

Similar results were found for AMR genes (Supplementary Fig. S3, Fig. 3F). The sampling conditions affected the number of identified VF and AMR genes, as shown by the relationships between the Ct value, CFU/mL value, or the number of remaining reads after preprocessing (Fig. 3). The lower the Ct value, the higher the number of VF genes and AMR genes identified (Fig. 3A, B). No VF or AMR genes were detected for samples with Ct values above 26. For Ct values below 26, there was a negative correlation (Pearson Inline graphic, PInline graphic) between the Ct value and the number of identified AMR genes. Similar but inverse relations were observed for CFU/mL value (Fig. 3C, D), with a threshold for VF and AMR gene detection at Inline graphic. VF and AMR genes were then detected if several conditions were fulfilled: a Ct value below 26, CFU/mL value above Inline graphic, and at least 5,000 reads after preprocessing. The further the samples were from these thresholds, the higher the number of VF genes and AMR genes identified. Indeed, the 3 top scattered dots (in red; Fig. 3A, C, E), with identified VF genes between 250 and 300 were the samples with the highest number of reads, a higher CFU/mL value, and a relatively lower Ct value compared to other samples. Generally, allowing samples to incubate for a short period before sequencing enhances microbial growth, resulting in higher CFU/mL values and lower Ct values. This increase in microbial concentration improves the efficiency of direct sequencing by providing more genetic material for analysis, facilitating faster and more accurate pathogen detection.

Allele-based pathogen identification

In Workflow 4, samples were mapped against a reference genome of an expected pathogen chosen by the user. S. enterica subsp. enterica ser. Typhimurium (NC_003197.2) was chosen for these data, as it is widely recognized and extensively used in genomic studies due to its complete and well-annotated genome sequence [89]. However, given the diversity among the serovariants of S. enterica subsp. enterica, a high number of complex variants and SNPs are anticipated.

The provided mapping statistics (mapping coverage [breadth of coverage] and mapping depth [depth of coverage] in Fig. 2D, E) serve as proxies for assessing the number and quality of identified SNPs (Fig. 2F). SNPs with low mapping depth are less reliable than those with higher depth. Reliable SNP calling typically requires a depth of at least 10, achieved in 2 samples. Samples with the highest mean mapping depth corresponded to samples with the highest number of reads after preprocessing (Fig. 2A). The higher the coverage and the mean mapping depth, the more quality SNPs were identified (Fig. 2D–F). Some of the samples spiked with S. enterica subsp. enterica had a high breadth of coverage but a low mean depth of coverage depth; as a result, the number of their quality filtered identified SNPs was low.

PathoGFAIR sample aggregation and visualization

For the samples for which VF or AMR genes had been identified, phylogenetic trees were built on the concatenated gene consensus sequences (Fig. 2C for VF genes, Supplementary Fig. S4 for AMR genes). These trees help track divergence between samples and can then highlight the contamination point or an evolution of the subspecies because of mutations. Indeed, samples spiked with S. enterica subsp. enterica were found together in the VF-based tree (Fig. 2C), so the identified VF genes were unique to these samples and could clearly separate the samples from samples with other S. enterica subspecies. The samples spiked with S. enterica subsp. houtenae were mostly clustered together, except for 2 samples because of extra identified VF genes common with samples spiked with S. enterica subsp. enterica and/or S. enterica subsp. salamae. The 2 samples spiked with a mix of the 3 subspecies were found in the middle of the tree (Fig. 2C), showing that a mix of VF genes related to the different subspecies was identified. The mixed sample, S45, spiked with a higher concentration of S. enterica subsp. houtenae than the other subspecies, was close to the sample, S02, spiked with S. enterica subsp. houtenae only. For AMR genes phylogenetic tree (Supplementary Fig. S4), samples were not as clearly separated as the tree for VF genes, mostly because the number of identified AMR genes was relatively low compared to the number of identified VF genes.

Sensitivity

The performance of the workflows was evaluated based on their ability to identify the expected S. enterica pathogen, as well as S. enterica subspecies and strain taxonomic ranks for the tested samples (Supplementary Table S3). In a metagenomic setting, other detected species cannot be regarded as false positives, as they may naturally be present in the sample. Therefore, only sensitivity was reported.

For the taxonomy profiling (Workflow 2), the expected pathogen was detected at its species taxonomic rank in all but 1 sample, resulting in a sensitivity of 97.8%. At the subspecies taxonomic rank, the expected subspecies was detected in 28 out of 46 samples, yielding a sensitivity of 64.0%. To further evaluate subspecies classification performance, the sample-wise sensitivity (the percentage of correctly identified S. enterica subspecies out of all detected S. enterica subspecies) was calculated. Averaged across all samples, the sample-wise sensitivity was 47.3%. In the gene-based pathogen identification (Workflow 3), at least 1 VF gene of the expected pathogen, at strain taxonomic rank, was detected in 13 out of 46 samples, corresponding to a sensitivity of 28.2%. For the samples in which no VF gene was detected, no contigs could be generated, preventing gene calling.

Changing the workflow’s default settings, such as using different reference databases for preprocessing in Workflow 1, taxonomy profiling in Workflow 2, or gene-based pathogen identification in Workflow 3, would likely impact these metrics. Different reference databases could influence the accuracy and sensitivity of taxonomic classification and pathogen identification, as they may contain varying levels of strain-specific data. Adjusting parameters like threshold values, filtering criteria, or the inclusion of additional databases could also affect the detection sensitivity and overall performance, potentially improving or reducing the workflow’s ability to accurately identify pathogens and associated genes in the given samples.

Samples with prior pathogen isolation

Data description

To further test PathoGFair, 84 public datasets were used [90]. These samples were sampled in Palestine by the Swiss Tropical and Public Health Institute from chicken meat, chicken stool, or human stool in 2021 or 2022 (Supplementary Fig. S5). In these samples, S. enterica had been isolated in 19 samples and Campylobacter jejuni in 65 samples. The generated sequencing data are provided under BioProjects PRJNA942086 (S. enterica [91]) and PRJNA942088 (C. jejuni [92]).

Preprocessing

Negligible contamination or host sequences were found between 0% and 0.02% (Supplementary Fig. S6), as expected, because of the prior isolation of the pathogen. The number of reads ranges between 3,000 and 217,000 reads per sample, after quality control.

Taxonomy profiling

As presented in the interactive KRONA plot (Supplementary Fig. S6), the first 19 samples, S. enterica isolates, were assigned correctly to S. enterica, and the remaining 65 samples were assigned correctly to C. jejuni. With the KRONA plot (Supplementary Fig. S6), the total number of reads for each sample can be seen along with detailed percentages on the assigned taxa at each taxonomic rank.

Gene-based pathogen identification

In this workflow, we identified VF and AMR genes for all samples, thanks to the higher number of reads retained after preprocessing and the prior isolation of the pathogens. Consequently, VF genes were detected in all samples, with more VF genes identified than AMR genes (Supplementary Fig. S7), as in general, the number of VF genes in a bacterial genome is often higher than the number of AMR genes. Samples containing S. enterica exhibited more VF genes (172 to 207) compared to samples with C. jejuni (96 to 113). The opposite trend was observed for AMR genes; C. jejuni samples typically had 12 AMR genes detected, while S. enterica samples mostly had 6 AMR genes (Supplementary Fig. S7)

The analysis revealed that samples with similarly isolated pathogens clustered together based on detected VF genes (4). For example, samples with S. enterica and C. jejuni formed distinct clusters. Moreover, correlations were observed among samples from different hosts, sampling years, and pathogenic species.

Figure 4:

Figure 4:

Cluster-map showing the identified VF genes on the y-axis for tested samples presented on the x-axis, clustered based on sample information such as sampling year, isolated pathogen species, and the original host of the sample. Clustering was performed using hierarchical clustering implemented in the Clustergrammer Python package.

Specific VF genes were found in samples with similar isolated pathogens, indicating potential subspecies-specific differences. For instance, Cj1419c, a methyltransferase capsule biosynthesis and transport gene product, was exclusively found in C. jejuni isolate samples sequenced in 2022, while flgB gene, encoding flagellar basal body rod protein, was only detected in C. jejuni isolate samples sequenced in 2021. flaA (flagellin), a VF gene product identifying C. jejuni, was present in S. enterica isolate samples from human stool sampled in 2022 and all C. jejuni isolate samples, but not in S. enterica isolate samples from chicken meat, chicken stool, or human stool sampled in 2021.

Furthermore, certain VF genes such as spvC (type III secretion system effector SpvC phosphothreonine lyase) and pefB (plasmid-encoded fimbriae regulatory protein), associated with S. enterica subsp. enterica ser. Typhimurium str. LT2, were exclusively found in S. enterica isolate samples from human stool sampled in 2022. Conversely, fyuA, a pesticin/yersiniabactin receptor protein that identifies Yersinia pestis, was detected in every S. enterica isolate sample except those from human stool sampled in 2022. Finally, some VF genes, like flif, a flagellar M-ring protein known in Yersinia enterocolitica subsp. enterocolitica, were found in all samples, irrespective of the pathogen species.

Allele-based pathogen identification

The 19 S. enterica isolate samples were mapped against the reference genome of the expected pathogen, S. enterica subsp. enterica ser. Typhimurium (NC_003197.2 [89]), and the 65 C. jejuni isolate samples were mapped against C. jejuni (NC_002163.1).

The 19 S. enterica isolate samples have an average mapping coverage of 94.6% and an average mean mapping depth of 31 per base. The average total number of variants found per S. enterica isolate sample was 43,420. For the 65 C. jejuni isolate samples, the average mapping coverage was 93.7%, the average mean mapping depth was 42 per base, and the average total number of variants found per sample was 26,654. These high values for the average total number of variants identified for samples were expected since the used subspecies for the mapping are different from the subspecies of the samples.

PathoGFAIR sample aggregation and visualization

The isolated samples exhibited a higher count of identified AMR genes compared to the metagenomic samples without prior isolation, enabling the incorporation of additional genes into concatenated gene consensus sequences. The resulting phylogenetic tree, constructed based on the AMR genes (Supplementary Fig. S8), distinctly delineated different S. enterica subspecies. Similarly, this differentiation is evident in the phylogenetic tree based on the VF genes.

Sensitivity

Known species were successfully identified and confirmed across all workflows, including taxonomy profiling and VF gene identification (Supplementary Fig. S6), achieving 100% expected pathogen detection at the expected taxonomic rank for all 84 tested samples, resulting in a sensitivity of 1. Since there were no true negatives (TNs) or conditions without pathogens in the dataset, specificity could not be calculated.

Benchmarking PathoGFAIR

To evaluate the effectiveness of PathoGFAIR workflows, a benchmarking analysis was performed comparing PathoGFAIR’s pathogen detection capabilities with the systems and pipelines listed in Table 1. The primary goal was to assess each pipeline’s ability to accurately detect and identify pathogens from shotgun Nanopore metagenomic data using the samples without prior pathogen isolation. The detailed procedures for the selection and benchmarking process can be found on Protocols.io [93]. Replication instructions, including all benchmark systems, sample metadata, notebooks, and results, are publicly available through the PathoGFAIR GitHub repository.

Selection of the pipelines for the benchmarking

To identify suitable systems and pipelines for the benchmark, each pipeline in Table 1 was evaluated based on availability, accepted input sequencing technique compatibility, and pathogen identification capability (Supplementary Table S4). Pipelines were classified as “Free Access” if available at no cost, “Free Trial” if partially or temporarily accessible, “Paid Access Only” if requiring payment, or “Nonfunctional” if outdated or inoperative. Compatibility with single-end Nanopore metagenomic sequencing data was verified to ensure each pipeline’s applicability for pathogen-focused workflows. Each pipeline’s pathogen identification capabilities were also assessed, examining relevant algorithms and tools to determine their accuracy and sensitivity within metagenomic datasets.

Following these criteria, pipelines requiring paid access (e.g., OneCodex) or being nonfunctional, as well as those incompatible with Nanopore data (e.g., SURPI, Sunbeam, Innuendo, and PAIPline) or lacking robust pathogen identification features, were excluded from further analysis.

Each selected pipeline was subsequently set up and tested to confirm ease of use and reproducibility. During testing, Victors was found to be nonfunctional, and SURPI presented difficulties, requiring local downloads of large host and pathogen reference databases. SURPI’s fragmented documentation contained outdated and contradictory requirements, with no updates since 2014. Innuendo, though documented, also presented usability challenges with incomplete setup instructions. BugSeq, while user-friendly and accessible through its web interface, had an average processing time of 4 hours from sample upload to results delivery. However, its free trial is limited to just 10 samples, and it lacked transparency regarding the tools, workflows, and databases used for analysis, as it is neither open-source nor adaptable. Ultimately, only IDseq (CZID) and our workflows, PathoGFAIR, met the selection criteria and were included for benchmarking (Supplementary Table S4).

IDseq offers a highly user-friendly interface, similar to PathoGFAIR on the Galaxy platform. However, unlike PathoGFAIR, IDseq’s pipeline is limited in terms of adaptability to its workflows, particularly with respect to the tools and parameters used. This restriction can hinder customization and flexibility for users with specific analytical needs, reducing the pipeline’s versatility compared to PathoGFAIR. The IDseq pipeline took 1 hour for sample uploads and an additional hour and a half to complete the analysis across all samples.

Benchmarking on samples without prior pathogen isolation

The benchmark was conducted using the 46 samples without prior pathogen isolation, from the first use case. The samples are chicken meat spiked with pathogens and sequenced using Nanopore, as explained in [86]. Metadata for the 46 samples are summarized in Supplementary Table S1, including the expected S. enterica subspecies (S. enterica subsp. houtenae DSM 9221, S. enterica subsp. enterica DSM 554, and/or S. enterica subsp. salamae DSM 9220). For PathoGFAIR, taxonomy profiling from Workflow 2 and VF gene identification from Workflow 3 were evaluated.

In both PathoGFAIR and IDseq (Supplementary Table S3), S. enterica was detected in all samples except 1, resulting in a 97.8% detection rate. In the exceptional sample, Salmonella was not detected by either system. PathoGFAIR provided additional resolution, identifying the subspecies taxonomic rank in 60.9% of samples and the strain taxonomic rank in 13% of samples. In contrast, IDseq’s output report did not provide pathogen information beyond the species rank (Supplementary Fig. S9).

Further benchmarking with diverse public metagenomic datasets could offer more comprehensive insights into the performance of PathoGFAIR across different experimental conditions and sample types. Such analyses would help assess the workflow’s robustness and adaptability to a wider range of use cases, contributing to its ongoing validation and refinement.

Conclusion

In conclusion, we present PathoGFAIR, a collection of Galaxy FAIR adaptable workflows, designed for pathogen detection and tracking among samples. These 5 workflows span the entire analysis pipeline, ranging from preprocessing reads to advanced analyses, including taxonomy profiling, virulence and antimicrobial resistance gene identification, SNP detection, and evolutionary history comparisons. The workflows generate diverse visualizations for a comprehensive understanding of the results, accompanied by interactive reports detailing all relevant inputs and outputs.

Our workflows have successfully identified pathogens down to genus, species, or subspecies taxonomic ranks across diverse samples, surpassing limitations observed in comparable pipelines. Our workflows facilitate comprehensive sample comparisons across diverse types, conditions, and sequencing techniques by offering interpretative and publication-ready visualizations. The open-access and user-friendly design of PathoGFAIR mitigates accessibility challenges and reduces reliance on local computational resources by leveraging Galaxy’s infrastructure for computational tasks, a feature that sets it apart from similar pipelines. This scalable workflow is a versatile solution for processing (meta)genomic samples, extending its utility beyond detecting foodborne pathogens.

In our findings, optimizing sampling, preparation, and sequencing conditions, such as a 24-hour sample incubation, significantly enhances the identification of virulence and antimicrobial resistance genes. Indeed, the workflows’ performance correlates with sample characteristics, with higher CFU/mL values and read counts, as well as lower Ct values yielding more comprehensive results, which can be used to establish sampling guidelines. Moreover, as the preprocessing workflow effectively removes host sequences, adaptive sampling during sequencing to exclude host DNA is not necessary. The workflows were still able to detect pathogens at least at species taxonomic rank for samples without prior pathogen isolation.

The experimental setup for this study serves as a proof of concept, demonstrating the feasibility of using WGS for the detection and characterization of S. enterica in spiked food matrices. By detecting pathogens across varying CFU/mL levels, as shown in the spiking experiments, the workflows showcase their sensitivity and practical applicability in adhering to stringent regulatory standards. These results establish a solid foundation for applying PathoGFAIR in food safety laboratories and outbreak investigations, where detailed subspecies-rank information is critical for monitoring and traceability.

PathoGFAIR’s utility extends to enrichment-based analyses of foodborne pathogens, aligning with European Union food safety standards such as EN/ISO 6579. Traditionally, these standards require enrichment followed by classical microbiological methods or PCR confirmation. PathoGFAIR complements these approaches by enabling direct analysis of enrichment broths through WGS, facilitating efficient detection and subspecies characterization of S. enterica. This capability enhances food safety monitoring and outbreak response by differentiating S. enterica subspecies from previous outbreak strains, thereby advancing traceability and improving public health interventions.

We further supported the scientific community by introducing new 46 benchmark samples, making them publicly available. This demonstrates our significant investment of time and resources, providing valuable assets for future research.

In addition to the allele-based pathogen identification method, our workflow can be further enhanced by incorporating MLST. MLST, or Multi-Locus Sequence Typing, offers an alternative approach by characterizing isolates through the sequences of housekeeping genes [49]. This method provides valuable information about the genetic diversity and evolutionary relationships among isolates, allowing for more precise identification and tracing of pathogens. By integrating MLST using MLST (v 2.22.0) tool [94] into our workflow, users can benefit from a comprehensive analysis that combines both alleles and variant identification methods, providing a more robust and accurate pathogen detection and tracing solution.

To address the complexity of detecting virulence and AMR genes located on MGEs, future versions of PathoGFAIR can incorporate additional validation steps. Specifically, virulence and AMR gene detection could be cross-referenced with broader taxonomic markers (e.g., 16S rRNA) to ensure genes detected from MGEs are correctly attributed to their respective pathogenic hosts. These enhancements aim to further improve the accuracy of pathogen detection by more reliably linking virulence and AMR genes to their bacterial hosts, thereby refining the overall precision of the workflows.

In the future, integrating PathoGFAIR with Galaxy’s automated bot system holds the promise of ongoing updates and analyses requiring minimal human involvement. By establishing a dedicated bot for PathoGFAIR, continuous results will be effortlessly refreshed whenever new datasets are uploaded, similar to the Galaxy bot created for SARS-CoV-2 [95]. The Galaxy bot for SARS-CoV-2 automatically updates and reanalyzes data with each new upload, maintaining up-to-date results and reducing the need for manual intervention. This automation ensures real-time, efficient data processing and analysis, enhancing the workflow’s accuracy and timeliness. Leveraging the user-friendly interface of the Galaxy platform ensures accessibility for users of all computational skill levels, streamlining the entire process from sample upload to result interpretation with ease. This study not only presents a robust computational solution but also lays the groundwork for semiautomated, efficient, and user-friendly pathogen detection and tracking workflows.

Availability of Source Code and Requirements

Supplementary Material

giaf017_Supplemental_Files
giaf017_Authors_Response_To_Reviewer_Comments_Original_Submission
giaf017_Authors_Response_To_Reviewer_Comments_Revision_1
giaf017_Authors_Response_To_Reviewer_Comments_Revision_2
giaf017_GIGA-D-24-00270_Original_Submission
giaf017_GIGA-D-24-00270_Revision_1
giaf017_GIGA-D-24-00270_Revision_2
giaf017_GIGA-D-24-00270_Revision_3
giaf017_Reviewer_1_Report_Original_Submission

Federico Zambelli -- 7/31/2024

giaf017_Reviewer_2_Report_Original_Submission

Ann-Katrin Llarena -- 9/9/2024

Acknowledgments

This research is made possible by the invaluable support of the entire Freiburg Galaxy team, Bioinformatics, University of Freiburg. The authors extend their special thanks to Wolfgang Maier and Mina Ansari for their technical expertise and academic guidance. We also appreciate the contributions of all the researchers who provided input to the project, with particular gratitude to Tobias Schindler for his exceptional assistance and to Peter van Heusden for his insightful contributions.

Contributor Information

Engy Nasr, Bioinformatics Group, Department of Computer Science, University of Freiburg, 79110 Freiburg im Breisgau, Germany.

Anna Henger, Biolytix AG, 4243 Dittingen, Switzerland.

Björn Grüning, Bioinformatics Group, Department of Computer Science, University of Freiburg, 79110 Freiburg im Breisgau, Germany.

Paul Zierep, Bioinformatics Group, Department of Computer Science, University of Freiburg, 79110 Freiburg im Breisgau, Germany.

Bérénice Batut, IFB-Core, Institut Français de Bioinformatique (IFB), CNRS, INSERM, INREA, CEA, 91057 Evry, France; Université Clermont Auvergne, Plateforme AuBi and Mésocentre Clermont-Auvergne, 63000 Clermont-Ferrand, France.

Additional Files

Supplementary Fig. S1. Violin plot with the percentage of quality-controlled host reads detected and removed in samples with respect to adaptive sampling during sequencing (host excluded or not)—samples without prior pathogen isolation.

Supplementary Fig. S2. Krone pie chart for the taxonomy profiling—samples without prior pathogen isolation.

Supplementary Fig. S3. Bar chart for the total number of VF genes (orange) and AMR genes (blue) found in samples with respect to incubation duration before DNA isolation—samples without prior pathogen isolation.

Supplementary Fig. S4. Phylogenetic tree, using the nucleotide evolution model; General Time Reversible (GTR) model with a CAT approximation for rate heterogeneity across sites [54], for the identified AMR genes—samples without prior pathogen isolation.

Supplementary Fig. S5. Upset plot illustrating the intersections of different metadata categories, including sampling year, pathogen species, and the original host of the samples, highlighting common and unique attributes among the datasets—samples with prior pathogen isolation.

Supplementary Fig. S6. Krona pie chart for the taxonomy profiling—samples with prior pathogen isolation.

Supplementary Fig. S7. Violin plot for the total number of VFs and AMR genes—samples with prior pathogen isolation.

Supplementary Fig. S8. Phylogenetic tree, using the nucleotide evolution model; General Time Reversible (GTR) model with a CAT approximation for rate heterogeneity across sites [54], for the identified AMR genes—samples with prior pathogen isolation.

Supplementary Fig. S9. Heatmap with PathoGFAIR benchmark using the first Use Case datasets and comparing it with similar analysis systems presented in Table 1.

Supplementary Table S1. Metadata for samples without prior pathogen isolation.

Supplementary Table S2. Metadata for samples with prior pathogen isolation.

Supplementary Table S3. PathoGFAIR and IDSeq pathogen detection results for samples without prior isolation at various taxonomic ranks.

Supplementary Table S4. Benchmarking PathoGFAIR systems and pipelines evaluation.

Abbreviations

AMR: antimicrobial resistance; API: Application Programming Interface; CFU: colony-forming unit; Ct: cycle threshold; EFSA: European Food Safety Authority; EU: European Union; FAIR: Findable, Accessible, Interoperable, and Reusable; GTN: Galaxy Training Network; IWC: Intergalactic Workflow Commission; MLST: multilocus sequence typing; NGS: next-generation sequencing; QC: quality control; RKI: Robert Koch Institute; SNP: single-nucleotide polymorphism; SRA: Sequence Read Archive; VF: virulence factor; VFDB: Virulence Factor Database; WHO: World Health Organization; WGS: Whole Genome Sequencing.

Author Contributions

Engy Nasr (Formal analysis [lead], Investigation [lead], Software [lead], Validation [lead], Visualization [lead], Writing—original draft [lead], Writing—review & editing [lead]), Anna Henger (Conceptualization [supporting], Formal analysis [supporting], Funding acquisition [supporting], Investigation [supporting], Project administration [supporting], Validation [equal], Visualization [supporting], Writing—original draft [supporting], Writing—review & editing [supporting]), Björn Grüning (Conceptualization [supporting], Funding acquisition [supporting], Investigation [supporting], Project administration [supporting], Software [supporting], Writing – original draft [supporting], Writing – review & editing [supporting]), Paul Zierep (Formal analysis [supporting], Investigation [supporting], Software [supporting], Supervision [equal], Validation [supporting], Visualization [supporting], Writing—original draft [supporting], Writing—review & editing [supporting]), Bérénice Batut (Conceptualization [Lead], Formal analysis [Supporting], Funding acquisition [Lead], Investigation [Supporting], Methodology [Lead], Project administration [Lead], Software [Equal], Supervision [Lead], Validation [Supporting], Visualization [Supporting], Writing—original draft [Supporting], Writing—review & editing [Supporting]).

Funding

This research was supported by the Digital Life Science Call for Academia–Industry Collaborations under EOSC Life [97], European Union’s Horizon 2020 programme under grant agreement number 824087. Additionally, financial support was provided by the German Federal Ministry of Education and Research BMBF grant 031 A538A de.NBI-RBC and the Ministry of Science, Research and the Arts Baden-Württemberg (MWK) within the framework of LIBIS/de.NBI Freiburg. This work was supported by the Programme d’Investissements d’Avenir (PIA), grant Agence Nationale de la Recherche, number ANR-11-INBS-0013.

Data Availability

The raw sequence reads of the 46 samples without prior isolation are available on the Sequence Read Archive (SRA) under BioProjects [98]. The protocol for the preparation of these samples is available on Protocols.io [86]. The workflows presented in the Methods section are available on Intergalactic Workflow Commission (IWC) and 2 workflow registries (Dockstore and WorkflowHub). The training material to understand, learn, and try the workflows is available on the Galaxy Training Network (GTN) [83]. The Jupyter notebook for additional visualizations and generating the figures of this article is available in a GitHub repository [96]. Benchmarking PathoGFAIR protocol is available on protocols.io [93]. Other data further supporting this work are openly available in the GigaScience repository, GigaDB [99].

References

  • 1. Elbehiry  A, Abalkhail  A, Marzouk  E  et al.  An overview of the public health challenges in diagnosing and controlling human foodborne pathogens. Vaccines. 2023;11(4):725. 10.3390/vaccines11040725. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Wei  X, Zhao  X. Advances in typing and identification of foodborne pathogens. Curr Opin Food Sci. 2021;37:52–57. 10.1016/j.cofs.2020.09.002. [DOI] [Google Scholar]
  • 3. Organization  WH. WHO global strategy for food safety 2022–2030: towards stronger food safety systems and global cooperation: executive summary. Geneva (Switzerland): World Health Organization; 2022. https://www.who.int/publications/i/item/9789240057685. Accessed 24 June 2024. [Google Scholar]
  • 4. Priyanka  B, Patil  RK, Dwarakanath  S. A review on detection methods used for foodborne pathogens. Indian J Med Res. 2016;144(3):327. 10.4103/0971-5916.198677. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Yang  S, Johnson  MA, Hansen  MA, et al.  Metagenomic sequencing for detection and identification of the boxwood blight pathogen Calonectria pseudonaviculata. Sci Rep. 2022;12(1):1399. 10.1038/s41598-022-05381-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Thomas  T, Gilbert  J, Meyer  F. Metagenomics—a guide from sampling to data analysis. Microb Inform Exp. 2012;2(1):3. 10.1186/2042-5783-2-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Thompson  LR, Sanders  JG, McDonald  D, et al.  A communal catalogue reveals Earth’s multiscale microbial diversity. Nature. 2017;551(7681):457–63. 10.1038/nature24621. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Bogaerts  B, Van den Bossche  A, Verhaegen  B, et al.  Closing the gap: Oxford Nanopore Technologies R10 sequencing allows comparable results to Illumina sequencing for SNP-based outbreak investigation of bacterial pathogens. J Clin Microbiol. 2024;62:e01576–23. 10.1128/jcm.01576-23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Joensen  KG, Scheutz  F, Lund  O, et al.  Real-time whole-genome sequencing for routine typing, surveillance, and outbreak detection of verotoxigenic Escherichia coli. J Clin Microbiol. 2014;52(5):1501–10. 10.1128/JCM.03617-13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Allard  MW, Bell  R, Ferreira  CM, et al.  Genomics of foodborne pathogens for microbial food safety. Curr Opin Biotechnol. 2018;49:224–29. 10.1016/j.copbio.2017.11.002. [DOI] [PubMed] [Google Scholar]
  • 11. Fan  J, Huang  S, Chorlton  SD. BugSeq: a highly accurate cloud platform for long-read metagenomic analyses. BMC Bioinformatics. 2021;22(1):160. 10.1186/s12859-021-04089-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Minot  SS, Krumm  N, Greenfield  NB. One codex: a sensitive and accurate data platform for genomic microbial identification. bioRxiv. 2015; 10.1101/027607. [DOI]
  • 13. Naccache  SN, Federman  S, Veeraraghavan  N, et al.  A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples. Genome Res. 2014;24(7):1180–92. 10.1101/gr.171934.113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Clarke  EL, Taylor  LJ, Zhao  C, et al.  Sunbeam: an extensible pipeline for analyzing metagenomic sequencing experiments. Microbiome. 2019;7(1):46. 10.1186/s40168-019-0658-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Kalantar  KL, Carvalho  T, de Bourcy  CFA, et al.  IDseq—An open source cloud-based pipeline and analysis service for metagenomic pathogen detection and monitoring. Gigascience. 2020;9(10):giaa111. 10.1093/gigascience/giaa111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Simmonds  SE, Ly  L, Beaulaurier  J, et al.  CZ ID: a cloud-based, no-code platform enabling advanced long read metagenomic analysis. bioRxiv. 2024; 10.1101/2024.02.29.579666. [DOI]
  • 17. The Galaxy Community, Afgan  E, Nekrutenko  A, et al.  The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update. Nucleic Acids Res. 2022;50(W1):W345–51. 10.1093/nar/gkac247. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Yuen  D, Cabansay  L, Duncan  A, et al.  The Dockstore: enhancing a community platform for sharing reproducible and accessible computational protocols. Nucleic Acids Res. 2021;49(W1):W624–32. 10.1093/nar/gkab346. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Implementing FAIR Digital Objects in the EOSC-Life Workflow Collaboratory. Accessed 25 June 2024. 10.5281/zenodo.4605654. [DOI]
  • 20. Hiltemann  S, Rasche  H, Gladman  S, et al.  Galaxy training: a powerful framework for teaching!. PLoS Comput Biol. 2023;19(1):e1010752. 10.1371/journal.pcbi.1010752. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Llarena  A, Ribeiro-Gonçalves  BF, Nuno Silva  D, et al.  INNUENDO: a cross-sectoral platform for the integration of genomics in the surveillance of food-borne pathogens. EFSA Supporting Publications. 2018;15(11). 10.2903/sp.efsa.2018.EN-1498. [DOI] [Google Scholar]
  • 22. Andrusch  A, Dabrowski  PW, Klenner  J, et al.  PAIPline: pathogen identification in metagenomic and clinical next generation sequencing samples. Bioinformatics. 2018;34(17):i715–21. 10.1093/bioinformatics/bty595. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Sayers  S, Li  L, Ong  E, et al.  Victors: a web-based knowledge base of virulence factors in human and animal pathogens. Nucleic Acids Res. 2019;47(D1):D693–700. 10.1093/nar/gky999. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Chen  S, Zhou  Y, Chen  Y, et al.  fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34(17):i884–90. 10.1093/bioinformatics/bty560. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Porechop—de.NBI Nanopore Training Course latest documentation. https://denbi-nanopore-training-course.readthedocs.io/en/latest/read_qc/Porechop_1.html. Accessed 24 June 2024.
  • 26. Wagner  GE, Dabernig-Heinz  J, Lipp  M, et al.  Real-time nanopore Q20+ sequencing enables extremely fast and accurate core genome mlst typing and democratizes access to high-resolution bacterial pathogen surveillance. J Clin Microbiol. 2023;61(4):e01631–22. 10.1128/jcm.01631-22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Li  H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–100. 10.1093/bioinformatics/bty191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Chaisson  MJ, Tesler  G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics. 2012;13(1):238. 10.1186/1471-2105-13-238. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Li  H, Durbin  R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–60. 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Sedlazeck  FJ, Rescheneder  P, Smolka  M, et al.  Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018;15(6):461–68. 10.1038/s41592-018-0001-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Wu  TD, Watanabe  CK. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005;21(9):1859–75. 10.1093/bioinformatics/bti310. [DOI] [PubMed] [Google Scholar]
  • 32. Langmead  B, Salzberg  SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–59. 10.1038/nmeth.1923. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Lu  J, Rincon  N, Wood  DE, et al.  Metagenome analysis using the Kraken software suite. Nat Protoc. 2022;17(12):2815–39. 10.1038/s41596-022-00738-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Katz  LS, Griswold  T, Lindsey  R, et al.  Kraken with Kalamari: contamination detection. 2021. https://github.com/lskatz/kalamari. Accessed 24 June 2024.
  • 35. Ewels  P, Magnusson  M, Lundin  S, et al.  MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047–48. 10.1093/bioinformatics/btw354. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Nanoplot—RCAC Biocontainers v1.0 documentation. https://biocontainer-doc.readthedocs.io/en/latest/source/nanoplot/nanoplot.html. Accessed 24 June 2024.
  • 37. Portik  DM, Brown  CT, Pierce-Ward  NT. Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets. BMC Bioinformatics. 2022;23:541. 10.1186/s12859-022-05103-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Leidenfrost  RM, Pöther  DC, Jäckel  U, et al.  Benchmarking the MinION: Evaluating long reads for microbial profiling. Sci Rep. 2020;10(1):5125. 10.1038/s41598-020-61989-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Govender  KN, Eyre  DW. Benchmarking taxonomic classifiers with Illumina and Nanopore sequence data for clinical metagenomic diagnostic applications. Microbial Genomics. 2022;8(10):000886. 10.1099/mgen.0.000886. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Meyer  F, Fritz  A, Deng  ZL, et al.  Critical assessment of metagenome interpretation: the second round of challenges. Nat Methods. 2022;19(4):429–40. 10.1038/s41592-022-01431-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Ondov  BD, Bergman  NH, Phillippy  AM. Interactive metagenomic visualization in a web browser. BMC Bioinformatics. 2011;12(1):385. 10.1186/1471-2105-12-385. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Bik  HM, Pitch Interactive . Phinch: an interactive, exploratory data visualization framework for –omic datasets. bioRxiv. 2014; 10.1101/009944. [DOI] [Google Scholar]
  • 43. Breitwieser  FP, Salzberg  SL. Pavian: interactive analysis of metagenomics data for microbiomics and pathogen identification. bioRxiv. 2016; 10.1101/084715. [DOI] [PMC free article] [PubMed]
  • 44. Lin  Y, Yuan  J, Kolmogorov  M, et al.  Assembly of long error-prone reads using de Bruijn graphs. Proc Natl Acad Sci U S A. 2016;113(52):E8396–405. 10.1073/pnas.1604560113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Oxford Nanopore Technologies. Medaka. https://github.com/nanoporetech/medaka. Accessed 24 June 2024. [Google Scholar]
  • 46. Seemann  T. ABRicate. Version 2014-07-17. https://github.com/tseemann/abricate. Accessed 24 June 2024.
  • 47. Chen  L, Zheng  D, Liu  B, et al.  VFDB 2016: hierarchical and refined dataset for big data analysis—10 years on. Nucleic Acids Res. 2016;44(D1):D694–97. 10.1093/nar/gkv1239. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Feldgarden  M, Brover  V, Haft  DH, et al.  Validating the AMRFinder Tool and Resistance Gene Database by using antimicrobial resistance genotype-phenotype correlations in a collection of isolates. Antimicrob Agents Chemother. 2019;63(11):e00483–19. 10.1128/AAC.00483-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Pearce  ME, Alikhan  NF, Dallman  TJ, et al.  Comparative analysis of core genome MLST and SNP typing within a European Salmonella serovar Enteritidis outbreak. Int J Food Microbiol. 2018;274:1–11. 10.1016/j.ijfoodmicro.2018.02.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Hong Kong University—Biomedical Algorithms Lab (BAL) . Clair3—symphonizing pileup and full-alignment for high-performance long-read variant calling. https://github.com/HKU-BAL/Clair3. Original Release Date: 2021-03-30. Accessed 25 June 2024.
  • 51. Danecek  P, Bonfield  JK, Liddle  J, et al.  Twelve years of SAMtools and BCFtools. Gigascience. 2021;10(2):giab008. 10.1093/gigascience/giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Cingolani  P, Platts  A, Wang  LL, et al.  A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w Inline graphic; iso-2; iso-3. Fly. 2012;6(2):80–92. 10.4161/fly.19695. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Danecek  P, McCarthy  SA. BCFtools/csq: haplotype-aware variant consequences. Bioinformatics. 2017. 10.1093/bioinformatics/btx100. [DOI] [PMC free article] [PubMed]
  • 54. Price  MN, Dehal  PS, Arkin  AP. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5(3):e9490. 10.1371/journal.pone.0009490. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. jupyter/jupyter: Jupyter metapackage for installation, docs and chat. https://github.com/jupyter/jupyter/tree/master. Accessed 24 June 2024.
  • 56. Van Rossum  G, Drake  FL. Python 3 reference manual. Scotts Valley (CA): CreateSpace; 2009.ISBN:978-1-4414-1269-0 [Google Scholar]
  • 57. McKinney  W. Data structures for statistical computing in Python. In: van der Walt  S, Millman  J, eds. Proceedings of the 9th Python in Science Conference. Texas, USA; 2010:56–61. 10.25080/Majora-92bf1922-00a. [DOI] [Google Scholar]
  • 58. The pandas development team . pandas-dev/pandas: Pandas. Zenodo. 2020. 10.5281/zenodo.3509134.Accessed 24 June 2024. [DOI] [Google Scholar]
  • 59. Hunter  JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007;9(3):90–95. 10.1109/MCSE.2007.55. [DOI] [Google Scholar]
  • 60. Waskom  M. seaborn: statistical data visualization. J Open Source Softw. 2021;6(60):3021. 10.21105/joss.03021. [DOI] [Google Scholar]
  • 61. Harris  CR, Millman  KJ, van der Walt  SJ, et al.  Array programming with NumPy. Nature. 2020;585(7825):357–62. 10.1038/s41586-020-2649-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62. Martin  M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnetjournal. 2011;17(1):10. 10.14806/ej.17.1.200. [DOI] [Google Scholar]
  • 63. Bray  S, Chilton  J, Bernt  M, et al.  The Planemo toolkit for developing, deploying, and executing scientific data analyses in Galaxy and beyond. Genome Res. 2023;33(2):261–68. 10.1101/gr.276963.122. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Wilkinson  MD, Dumontier  M, Aalbersberg  IJ, et al.  The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3(1):160018. 10.1038/sdata.2016.18. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65. Chue Hong  NP, Katz  DS, Barker  M, et al.  FAIR Principles for Research Software (FAIR4RS Principles). 2021. 10.15497/RDA00065.Accessed 24 June 2024. [DOI] [PMC free article] [PubMed]
  • 66. Visser  Cd, Johansson  LF, Kulkarni  P, et al.  Ten quick tips for building FAIR workflows. PLoS Comput Biol. 2023;19(9):e1011369. 10.1371/journal.pcbi.1011369. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Galaxy Project . Galaxy Workflows maintained by the Intergalactic Workflow Commission. 2024. https://github.com/galaxyproject/iwc. Accessed 24 June 2024.
  • 68. Preprocessing Workflow on IWC GitHub Repository. https://github.com/galaxyproject/iwc/tree/main/workflows/microbiome/nanopore-pre-processing. Accessed 3 February 2025.
  • 69. Taxonomy Profiling Workflow on IWC GitHub Repository. https://github.com/galaxyproject/iwc/tree/main/workflows/microbiome/taxonomy-profiling-and-visualization-with-krona. Accessed 3 February 2025.
  • 70. Gene-based Pathogen Identification Workflow on IWC GitHub Repository. https://github.com/galaxyproject/iwc/tree/main/workflows/microbiome/gene-based-pathogen-identification. Accessed 3 February 2025.
  • 71. Allele-based Pathogen Identification Workflow on IWC GitHub Repository. https://github.com/galaxyproject/iwc/tree/main/workflows/microbiome/allele-based-pathogen-identification. Accessed 3 February 2025.
  • 72. PathoGFAIR Samples Aggregation and Visualisation Workflow on IWC GitHub Repository. https://github.com/galaxyproject/iwc/tree/main/workflows/microbiome/pathogen-detection-pathogfair-samples-aggregation-and-visualisation. Accessed 3 February 2025.
  • 73. Nasr  E, Batut  B, Zierep  P. Computational workflow with persistent identifier for preprocessing workflow on Dockstore. Zenodo. 2024. 10.5281/ZENODO.12744322.Accessed 24 June 2024. [DOI] [Google Scholar]
  • 74. Nasr  E, Batut  B, Zierep  P. Computational workflow with persistent identifier for Taxonomy profiling workflow on Dockstore. Zenodo. 2024. 10.5281/zenodo.12744309.Accessed 24 June 2024. [DOI] [Google Scholar]
  • 75. Nasr  E, Batut  B, Zierep  P. Computational workflow with persistent identifier for gene-based pathogen identification workflow on Dockstore. Zenodo. 2024. 10.5281/zenodo.12744315.Accessed 24 June 2024. [DOI] [Google Scholar]
  • 76. Nasr  E, Batut  B, Zierep  P. Computational workflow with persistent identifier for allele-based pathogen identification workflow on Dockstore. Zenodo. 2024. 10.5281/zenodo.12744317.Accessed 24 June 2024. [DOI] [Google Scholar]
  • 77. Nasr  E, Batut  B, Zierep  P. Computational workflow with persistent identifier for PathoGFAIR samples aggregation and visualisation workflow on Dockstore. Zenodo. 2024. 10.5281/zenodo.12744325. [DOI] [Google Scholar]
  • 78. Batut  B, Nasr  E, Zierep  P. nanopore-pre-processing/main. WorkflowHub.  2024. 10.48546/WORKFLOWHUB.WORKFLOW.1061.1. [DOI] [Google Scholar]
  • 79. Nasr  E, Batut  B, Zierep  P. taxonomy-profiling-and-visualization-with-krona/main. WorkflowHub.  2024. 10.48546/WORKFLOWHUB.WORKFLOW.1059.1. [DOI] [Google Scholar]
  • 80. Nasr  E, Batut  B, Zierep  P. gene-based-pathogen-identification/main. WorkflowHub. 2024. 10.48546/WORKFLOWHUB.WORKFLOW.1062.1. [DOI] [Google Scholar]
  • 81. Nasr  E, Batut  B, Zierep  P. allele-based-pathogen-identification/main. WorkflowHub. 2024. 10.48546/WORKFLOWHUB.WORKFLOW.1063.2. [DOI] [Google Scholar]
  • 82. Nasr  E, Batut  B, Zierep  P. pathogen-detection-pathogfair-samples-aggregation-and-visualisation/main. WorkflowHub. 2024. 10.48546/WORKFLOWHUB.WORKFLOW.1060.1. [DOI] [Google Scholar]
  • 83. Batut  B, Nasr  E, Zierep  P. Pathogen detection from (direct Nanopore) sequencing data using Galaxy–Foodborne Edition (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/metagenomics/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html. Accessed 24 June 2024.
  • 84. Soiland-Reyes  S, Sefton  P, Crosas  M, et al.  Packaging research artefacts with RO-Crate. Data Sci. 2022;5(2):97–138. 10.3233/DS-210053. [DOI] [Google Scholar]
  • 85. Sefton  P, Ó Carragáin  E, Soiland-Reyes  S, et al.  RO-Crate Metadata Specification 1.1.3. Zenodo. 2023. 10.5281/zenodo.7867028. [DOI]
  • 86. Nasr  E, Henger  A, Grüning  B, et al.  Samples preparation for foodborne pathogen detection and tracking project. 2023. protocols.io. 10.17504/protocols.io.8epv5x1jdg1b/v1. [DOI]
  • 87. CFU Full Form. https://unacademy.com/content/neet-ug/full-forms/cfu/. Accessed 24 June 2024.
  • 88. Bioscientia . What do the terms dual target PCR and Ct value mean? | Laboratory Diagnostics. 2020. https://www.bioscientia.de/en/home/our-news/2020/07/what-do-the-terms-dual-target-pcr-and-ct-value-mean/. Accessed 24 June 2024.
  • 89. McClelland  M, Sanderson  KE, Spieth  J, et al.  Complete genome sequence of Salmonella enterica serovar Typhimurium LT2. Nature. 2001;413(6858):852–56. 10.1038/35101614. [DOI] [PubMed] [Google Scholar]
  • 90. Abukhattab  S, Hosch  S, Abu-Rmeileh  NME, et al.  Whole-genome sequencing for One Health surveillance of antimicrobial resistance in conflict zones: a case study of Salmonella spp. and Campylobacter spp. in the West Bank, Palestine. Appl Environ Microbiol. 2023;89(9):e00658–23. 10.1128/aem.00658-23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91. Salmonella enterica subsp. enterica (ID 942086)—BioProject—NCBI. https://www.ncbi.nlm.nih.gov/bioproject/942086. Accessed 24 June 2024.
  • 92. Campylobacter jejuni subsp. jejuni (ID 942088)—BioProject—NCBI. https://www.ncbi.nlm.nih.gov/bioproject/942088. Accessed 24 June 2024.
  • 93. Nasr  E, Henger  A, Grüning  B, et al.  Benchmarking PathoGFAIR. 2024. protocols.io. 10.17504/protocols.io.e6nvwbp4zvmk/v1. [DOI]
  • 94. Seemann  T. mlst. https://github.com/tseemann/mlst. Original Release Date: 2014-05-03. Accessed 24 June 2024.
  • 95. Maier  W, Bray  S, van den Beek  M, et al.  Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nat Biotechnol. 2021;39(10):1178–79. 10.1038/s41587-021-01069-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96. usegalaxy-eu/PathoGFAIR: PathoGFAIR: Galaxy FAIR workflows for pathogen detection and samples comparison. https://github.com/usegalaxy-eu/PathoGFAIR. Accessed 25 June 2024.
  • 97. Digital Life Sciences Internal Call for Academia-Industry Collaborations. https://www.eosc-life.eu/industrycall/. Accessed 24 June 2024.
  • 98. PRJNA982679—SRA—NCBI. https://www.ncbi.nlm.nih.gov/sra/PRJNA982679. Accessed 24 June 2024.
  • 99. Nasr  E, Henger  A, Grüning  B, et al.  Supporting data for “PathoGFAIR: A Collection of FAIR and Adaptable (Meta)genomics Workflows for (Foodborne) Pathogens Detection and Tracking.” GigaScience Database. 2025. 10.5524/102648. [DOI]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Citations

  1. Implementing FAIR Digital Objects in the EOSC-Life Workflow Collaboratory. Accessed 25 June 2024. 10.5281/zenodo.4605654. [DOI]
  2. Sefton  P, Ó Carragáin  E, Soiland-Reyes  S, et al.  RO-Crate Metadata Specification 1.1.3. Zenodo. 2023. 10.5281/zenodo.7867028. [DOI]
  3. PRJNA982679—SRA—NCBI. https://www.ncbi.nlm.nih.gov/sra/PRJNA982679. Accessed 24 June 2024.
  4. Nasr  E, Henger  A, Grüning  B, et al.  Supporting data for “PathoGFAIR: A Collection of FAIR and Adaptable (Meta)genomics Workflows for (Foodborne) Pathogens Detection and Tracking.” GigaScience Database. 2025. 10.5524/102648. [DOI]

Supplementary Materials

giaf017_Supplemental_Files
giaf017_Authors_Response_To_Reviewer_Comments_Original_Submission
giaf017_Authors_Response_To_Reviewer_Comments_Revision_1
giaf017_Authors_Response_To_Reviewer_Comments_Revision_2
giaf017_GIGA-D-24-00270_Original_Submission
giaf017_GIGA-D-24-00270_Revision_1
giaf017_GIGA-D-24-00270_Revision_2
giaf017_GIGA-D-24-00270_Revision_3
giaf017_Reviewer_1_Report_Original_Submission

Federico Zambelli -- 7/31/2024

giaf017_Reviewer_2_Report_Original_Submission

Ann-Katrin Llarena -- 9/9/2024

Data Availability Statement

The raw sequence reads of the 46 samples without prior isolation are available on the Sequence Read Archive (SRA) under BioProjects [98]. The protocol for the preparation of these samples is available on Protocols.io [86]. The workflows presented in the Methods section are available on Intergalactic Workflow Commission (IWC) and 2 workflow registries (Dockstore and WorkflowHub). The training material to understand, learn, and try the workflows is available on the Galaxy Training Network (GTN) [83]. The Jupyter notebook for additional visualizations and generating the figures of this article is available in a GitHub repository [96]. Benchmarking PathoGFAIR protocol is available on protocols.io [93]. Other data further supporting this work are openly available in the GigaScience repository, GigaDB [99].


Articles from GigaScience are provided here courtesy of Oxford University Press

RESOURCES