PathoGFAIR: a collection of FAIR and adaptable (meta)genomics workflows for (foodborne) pathogens detection and tracking

Engy Nasr; Anna Henger; Björn Grüning; Paul Zierep; Bérénice Batut

GigaScience. 2025 Sep 26;14:giaf017. doi: 10.1093/gigascience/giaf017

PMCID: PMC12466118 PMID: 41004266

Abstract

Background

Food contamination by pathogens poses a global health threat, affecting an estimated 600 million people annually. During a foodborne outbreak investigation, microbiological analysis of food vehicles detects responsible pathogens and traces contamination sources. Metagenomic approaches offer a comprehensive view of the genomic composition of microbial communities, facilitating the detection of potential pathogens in samples. Combined with sequencing techniques like Oxford Nanopore sequencing, such metagenomic approaches become faster and easier to apply. A key limitation of these approaches is the lack of accessible, easy-to-use, and openly available pipelines for pathogen identification and tracking from (meta)genomic data.

Findings

PathoGFAIR is a collection of Galaxy-based Findable, Accessible, Interoperable, and Reusable (FAIR) workflows employing state-of-the-art tools to detect and track pathogens from metagenomic Nanopore sequencing. Although initially developed to detect pathogens in food datasets, the workflows can be applied to other metagenomic Nanopore pathogenic data. PathoGFAIR incorporates visualizations and reports for comprehensive results. We tested PathoGFAIR on 130 samples containing different pathogens from multiple hosts under various experimental conditions. For all but 1 sample, the workflows successfully detected the expected pathogens at least at the species rank. Further taxonomic ranks are detected for samples with sufficiently high colony-forming unit and low cycle threshold values.

Conclusions

PathoGFAIR detects the pathogens at species and subspecies taxonomic ranks in all but 1 tested sample, regardless of whether the pathogen is isolated or the sample is incubated before sequencing. Importantly, PathoGFAIR is easy to use and can be straightforwardly adapted and extended for other types of analysis and sequencing techniques, making it usable in various pathogen detection scenarios.

Keywords: Galaxy, public health, Nanopore, pipeline, open source, benchmark samples, visualization

Introduction

Foodborne pathogens pose a significant threat to public health worldwide, causing millions of cases of illness and even death every year [1, 2]. These diverse microorganisms, spanning bacteria, viruses, parasites, and fungi, can contaminate a variety of foods, leading to both localized outbreaks and widespread epidemics. Ensuring food safety and controlling foodborne pathogens are key priorities for public health authorities at local, regional, and global levels, including agencies such as the European Food Safety Authority (EFSA), European Centre for Disease Prevention and Control (ECDC), and the World Health Organization (WHO) [3].

Traditional methods for identifying the source of food contamination require isolation of the target pathogen. This process is not only time-consuming but also labor-intensive, often requiring multiple steps and sophisticated techniques, and success is not guaranteed [4]. In contrast, shotgun metagenomic approaches provide a solution to these challenges, as they give an overview of the genomic composition in the sample, including the food source itself, the microbial community, and any possible pathogens and their complete genetic information [5]. Importantly, shotgun metagenomic approaches eliminate the need for prior isolation of the targeted pathogen, as required by whole-genome sequencing (WGS) methods, and they are not limited to specific genes as opposed to real-time PCR approaches [6] or 16S ribosomal RNA (rRNA) sequencing. While 16S rRNA sequencing is widely used for bacterial taxonomic profiling, it is limited in scope compared to shotgun metagenomic sequencing. The latter allows for the detection of a wide range of pathogens, including bacteria, viruses, and fungi, and gives access to the full genomes, enabling the taxa-agnostic identification of antimicrobial resistance (AMR) and virulence genes. This broader scope makes shotgun sequencing more suitable for comprehensive pathogen detection, especially in complex foodborne outbreak investigations [7].

Nanopore sequencing provides long-read data that can capture comprehensive genetic information. Studies such as [8] demonstrate its utility in closing genomic gaps, delivering real-time sequencing data, and enhancing the capabilities of metagenomic approaches for outbreak investigations. This technology enables more accurate and rapid pathogen detection, a critical advancement in scenarios where timely responses are essential for effective outbreak management.

Once sequencing data are generated, they must be processed using bioinformatics tools to identify pathogens, their genetic variations, and virulence factor (VF) genes, thereby facilitating timely and accurate detection [9, 10]. However, available tools and workflows require bioinformatic and computational knowledge and expertise. For example, tool parameters need to be adapted to the specific use case. End-to-end platforms (Table 1) that allow users to analyze their samples are either restricted with only a limited free trial (e.g., BugSeq [11]) or paid subscription (e.g., OneCodex [12]), or they require high computational resources (e.g., SURPI [13] and Sunbeam [14]). For certain free resources, the underlying workflow is not available and adaptable for the user. For example, IDseq [15] (also known as CZID [16]), a free cloud-based service for pathogen detection, can only be externally accessed through the dedicated online user interface. Furthermore, some of these workflows are specific to a certain host, pathogen, or sequencing technique, lacking the flexibility for customization.

Table 1: Comparison of features between PathoGFAIR and other similar pipelines or systems. This comparison sheds light on various features and characteristics, such as accessibility, technical specifications, and the scope of analyses offered by each system. It serves as a reference to evaluate the suitability of PathoGFAIR and other similar pipelines or systems for specific needs and requirements.

Features	PathoGFAIR	IDseq	BugSeq	SURPI	OneCodex	Sunbeam	Innuendo [21]	PAIPline [22]	Victors [23]
General characteristics
Free of charge	✓	✓	✗^*	✓	✗	✓	✓	✓	✓
Open source code	✓	✓	✗	✓	✗	✓	✓	✓	✗
Web interface	✓	✓	✓	✗	✓	✗	✓	✗	✓^**
Automatable API	✓	✗	✗	✗	✗	✓	✓	✗	✗
Accessibility and availability
Simple end-user modification	✓	✗	✗	✗	✗	✓	✗	✓	✗
Publicly available web server	✓	✓	✓	✗	✓	✗	✗	✗	✓
Last updated	2024	2024	2024	2014	2023	2024	2018	2018	2019
User support and documentation
Tutorial	✓	✗	✗	✗	✗	✗	✗	✗	✗
Documentation	✓	✓	✓	✓	✓	✓	✓	✓	✓
User support	✓	✓	✓	✗	✓	✗	✗	✗	✗
Technical specifications
Workflow manager	Galaxy	—	—	—	—	Snakemake	Nextflow	—	—
Sequencing technique	Nanopore^***	Illumina & Nanopore	Illumina & Nanopore	Illumina	—	Illumina	Illumina	Illumina	—
Analyses
Preprocessing	✓	✓	✓	✓	✓	✓	✗	✓	✗
Taxonomy profiling	✓	✓	✓	✓	✓	✓	✗	✓	✗
Gene-based pathogen identification	✓	✓	✓	✓	✓	✓	✓	✓	✓
Allele-based pathogen identification	✓	✗	✓	✗	✗	✗	✓	✗	✓
Sample aggregation and visualizations	✓	✓	✗	✗	✓	✓	✗	✗	✗

*Free trial of 10 samples is available.

**Malfunctioned when tested.

***Can be easily adapted to other sequencing techniques via Galaxy, which offers a customizable workflow editor and an automatable API.

Galaxy [17] is an open-source platform for Findable, Accessible, Interoperable, and Reusable (FAIR) data analysis. It enables users to apply a comprehensive suite of bioinformatics tools (that can be combined into workflows) through either its user-friendly web interface or its automatable Application Programming Interface (API) for integrating and customizing workflows, enhancing user flexibility. It ensures reproducibility by capturing the necessary information to repeat and understand data analyses. Galaxy offers a collection of high-quality prebuilt workflows that can be used directly or easily adapted to the user’s needs via the Galaxy workflow editor. Galaxy workflows can be executed on any Galaxy server, including private Galaxy servers, which also makes them suitable for data with privacy concerns. Furthermore, the major public Galaxy servers [17] freely provide a large computing infrastructure, allowing the execution of computationally demanding workflows, as is often required for metagenomic analysis.
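
As an illustration of the automatable API described above, the following is a minimal sketch of invoking a workflow programmatically. It assumes the third-party BioBlend Python client, which is not named in this article; the server URL, API key, workflow name, history name, and the input label `reads` are all hypothetical placeholders and not values prescribed by PathoGFAIR.

```python
def build_inputs(dataset_ids_by_label):
    """Map workflow input labels to Galaxy history datasets ('hda')."""
    return {
        label: {"id": dataset_id, "src": "hda"}
        for label, dataset_id in dataset_ids_by_label.items()
    }


def run_workflow(url, api_key, workflow_name, fastq_path, input_label="reads"):
    """Upload one FASTQ file and invoke a workflow found by name.

    BioBlend is imported lazily so that build_inputs() stays usable
    on machines where the package is not installed.
    """
    from bioblend.galaxy import GalaxyInstance  # assumption: BioBlend client

    gi = GalaxyInstance(url=url, key=api_key)

    # Create a fresh history and upload the reads into it.
    history = gi.histories.create_history(name="PathoGFAIR run")  # hypothetical name
    upload = gi.tools.upload_file(fastq_path, history["id"])
    dataset_id = upload["outputs"][0]["id"]

    # Look the workflow up by its name on the server.
    workflows = gi.workflows.get_workflows(name=workflow_name)
    if not workflows:
        raise ValueError(f"No workflow named {workflow_name!r} on {url}")

    # Inputs are matched by their label instead of their step index.
    return gi.workflows.invoke_workflow(
        workflows[0]["id"],
        inputs=build_inputs({input_label: dataset_id}),
        history_id=history["id"],
        inputs_by="name",
    )
```

Only `build_inputs` runs without a server; `run_workflow` needs a reachable Galaxy instance and a valid API key, and the actual input labels must be read from the imported workflow.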

Here, we present PathoGFAIR, a collection of Galaxy-based workflows for identifying pathogens and tracking their presence across (meta)genomic Oxford Nanopore sequencing data. The workflows are openly available on 2 workflow registries (Dockstore [18] and WorkflowHub [19]). They can be used directly on 3 major Galaxy servers (usegalaxy.org, usegalaxy.eu, usegalaxy.org.au) or installed in any other Galaxy server. The workflows are created to work agnostically, detecting all pathogens present in the samples without prior knowledge of the target pathogen. As the workflows are created in Galaxy, they can be adapted for other sequencing techniques or with various downstream analyses, such as differential expression analysis, or further statistics and visualizations [17]. Workflows are documented and supported by an extensive tutorial freely available via the Galaxy Training Network (GTN) [20]. Overall, PathoGFAIR offers an easy-to-use computational solution that speeds up the process of sampling, detecting, and tracking pathogens.

Implementation

Overview

PathoGFAIR comprises a collection of 5 workflows, implemented in Galaxy (Fig. 1). Each workflow serves a specific function and can be executed independently, enabling users to tailor their analysis according to their requirements.

Figure 1: Flowchart of the PathoGFAIR workflows. Workflow 1 (olive green) takes as input sequencing data generated by Oxford Nanopore technologies and performs quality control and host filtering. Then, 3 parallel workflows are executed on the output of Workflow 1: Workflow 2 (red) for taxonomy profiling, Workflow 3 (dark cyan) for gene-based pathogen identification, and Workflow 4 (purple) for SNP-based pathogen identification. These 4 workflows can run individually and in parallel. Finally, all outputs for the different provided datasets are aggregated in Workflow 5 (green) for PathoGFAIR sample aggregation and visualization.

The input data for PathoGFAIR comprise sequencing data generated using Oxford Nanopore technologies, along with an optional metadata table describing the datasets. Basecalling for converting raw signal data from the Nanopore sequencer into nucleotide sequences is not included within the PathoGFAIR workflows. In the use cases presented later in the article, real-time basecalling is performed using the MinKNOW software (Oxford Nanopore Technologies) before the reads are used in the workflows. Basecalling is a crucial step, as it affects the quality of the reads. Users are encouraged to ensure that high-quality basecalling is performed before starting the analysis with PathoGFAIR.

The datasets are preprocessed in Workflow 1, which encompasses quality control and host removal procedures. Subsequently, the preprocessed data are directed to 3 parallel workflows: taxonomy profiling (Workflow 2), gene-based pathogen identification (Workflow 3), and allele-based pathogen identification (Workflow 4). This parallel execution allows for efficient analysis and flexibility in workflow selection. Notably, Workflow 4 can optionally synchronize with Workflow 2 or Workflow 3 to leverage prior taxonomic analysis or gene-based pathogen identification results, providing users with flexibility based on specific use cases. By using detailed taxonomic identification from Workflow 2 or gene-based pathogen identification from Workflow 3, Workflow 4 enhances mapping and single-nucleotide polymorphism (SNP) detection accuracy. This process involves selecting the correct reference genome of the pathogen for mapping, informed by results from Workflow 2, Workflow 3, or even Workflow 1, which performs initial taxonomy assignment during the host filtering step.

Since each workflow can be executed independently, users can focus on specific aspects of pathogen detection or analysis. This modular approach empowers users to utilize the full range of functions offered by each workflow individually or to combine them as needed for comprehensive pathogen detection.

Finally, in Workflow 5, outputs from the previous workflows and the metadata of the dataset are aggregated and visualized for comprehensive pathogen tracking across samples. This aggregation step ensures a holistic view of pathogen presence and distribution, facilitating further insights and analysis.

Overall, the independent nature of PathoGFAIR’s workflows provides users with a user-friendly and customizable approach to pathogen detection, allowing for both comprehensive analyses and targeted investigations based on specific research needs or objectives.

Ensuring the accuracy and currency of reference data is fundamental for robust metagenomic analysis. PathoGFAIR leverages Galaxy’s integrated Data Managers, which enable Galaxy admins to provide up-to-date reference data. These Data Managers automate the download, installation, and regular update of essential reference databases, ensuring that PathoGFAIR users work with complete, accurate, and up-to-date reference information. PathoGFAIR workflows are configured to use well-maintained and reputable sources, such as NCBI and other public pathogen reference repositories, which further support accuracy and comprehensiveness. Additionally, Galaxy’s user-friendly interface enables users to select preferred references or request the inclusion of specific databases via Galaxy administrators, adding to the workflow’s adaptability for diverse use cases.

PathoGFAIR offers a competitive and accessible solution (Table 1) to detect and track pathogens in metagenomic Nanopore data through its 5 Galaxy-based FAIR and customizable workflows.

Workflow 1: Preprocessing

Workflow 1 encompasses essential preprocessing steps to ensure the quality and integrity of sequencing data.

Quality control and sequence filtering, based on quality, length, or low complexity, are performed using Fastp (v 0.23.2) (biotools:fastp) [24]. Porechop (v 0.2.4) [25] trims low-quality base pairs and removes duplicates and adapters. Quality thresholds are set to ensure that reads have an average quality score of Q20, aligning with accepted standards for Nanopore sequencing, where Q20 or higher quality is typically sufficient for reliable results [26].
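
To make the average-quality threshold concrete, the sketch below computes the mean quality of a read from its FASTQ quality string, assuming the standard Phred+33 encoding. It shows two common conventions (the arithmetic mean of Phred scores and the mean of the underlying error probabilities); which convention the tools in Workflow 1 apply is not stated here, and the function names are illustrative only.

```python
import math


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def mean_quality(quality_string):
    """Arithmetic mean of the Phred scores of one read."""
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores)


def mean_quality_from_error(quality_string):
    """Phred score of the mean per-base error probability.

    This convention penalizes low-quality bases more strongly than the
    arithmetic mean of the scores.
    """
    probabilities = [10 ** (-q / 10) for q in phred_scores(quality_string)]
    return -10 * math.log10(sum(probabilities) / len(probabilities))


def passes_threshold(quality_string, threshold=20.0):
    """True if the read reaches the average quality threshold (Q20 here)."""
    return mean_quality(quality_string) >= threshold
```

For a read with mixed base qualities, the error-probability convention yields a lower value than the arithmetic mean, so the same nominal threshold is stricter under that convention.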

Quality-controlled (QC) reads are cleaned of sequences from the host or food source (e.g., bovine in case of bovine meat) by mapping them to the corresponding reference genome using Minimap2 (v 2.26) (RRID:SCR_018550) [27], a tool tens of times faster than mainstream long-read mappers such as BLASR [28], BWA-MEM [29], NGMLR [30], and GMAP [31] and 3 times as fast as Bowtie2 (biotools:bowtie2) [32], designed for Illumina short reads [27]. A wide variety of reference genomes (e.g., Human, Chicken, or Cow) can be installed on Galaxy servers for use with Minimap2, giving users a convenient selection to choose from before executing the workflow. Kraken2 (v 1.2) (biotools:kraken2) [33] is applied for further contamination detection (e.g., human sequences using the Kalamari database). The Kalamari database includes mitochondrial sequences of various known hosts [34]. Host/food source reads matched to the Kalamari database are assessed and removed using Krakentools (v 1.2) [33].
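
The principle of host removal by mapping can be sketched as follows: reads that do not align to the host reference are kept. This is a simplified illustration using plain SAM text and uncompressed FASTQ lines, not the implementation used in Workflow 1; it relies only on the SAM convention that bit 0x4 of the FLAG field marks an unmapped read, and the function names are hypothetical.

```python
def unmapped_read_names(sam_lines):
    """Return the names of reads whose SAM FLAG has the 0x4 (unmapped) bit set."""
    names = set()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & 0x4:
            names.add(name)
    return names


def filter_fastq(fastq_lines, keep_names):
    """Return the 4-line FASTQ records whose read name is in keep_names."""
    lines = [line.rstrip("\n") for line in fastq_lines]
    kept = []
    for start in range(0, len(lines) - 3, 4):
        record = lines[start:start + 4]
        # The read name is the first token after the leading '@'.
        name = record[0][1:].split()[0]
        if name in keep_names:
            kept.append(record)
    return kept
```

Passing the output of `unmapped_read_names` to `filter_fastq` keeps only the reads that did not map to the host genome.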

The workflow returns QC reads without contamination or host sequences as well as interactive reports, produced by FastQC (v 0.12.1) (RRID:SCR_014583), fastp, and MultiQC (v 1.11) (RRID:SCR_014982) [35]. Furthermore, Nanoplot (v 1.39.0) (biotools:nanoplot) [36] is employed to provide detailed quality metrics specifically tailored to the preprocessing step, enriching the suite of analytical insights and facilitating robust data evaluation.

Workflow 2: Taxonomy profiling

Workflow 2 performs taxonomic profiling of the microbial community to identify pathogens and other microorganisms in the QC reads from Workflow 1, using Kraken2 (v 1.2) [33] and the PlusPF (archaea, bacteria, viral, plasmid, human, UniVec_Core, protozoa, fungi, and plant) RefSeq database (7 June 2022). Although Kraken2 is a tool designed for short-read sequencing and is known for its false-positive taxonomy assignments, particularly at lower microbial abundances [37], its application to long reads can still yield a substantial overview of the microbial community. This is particularly true for discerning bacteria that could potentially be pathogenic at genus and species taxonomic ranks [38, 39]. Kraken2 allows for the rapid assignment of taxonomy at multiple ranks, from kingdom to species, using an efficient exact k-mer matching algorithm. Other tools such as Centrifuge (RRID:SCR_016665) or MetaPhlAn (RRID:SCR_004915) are viable alternatives, also available on Galaxy. Kraken2 is selected for its speed, sensitivity, and ability to work with large reference databases, a critical factor when analyzing complex metagenomic samples [33, 40]. The produced community profile is visualized using Krona (RRID:SCR_012785) [41] and observed interactively for different taxonomic ranks using Phinch [42] or Pavian [43].
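
As a sketch of how species-level assignments might be read from a Kraken2 result, the code below parses the standard six-column, tab-separated Kraken2 report (percentage, clade reads, directly assigned reads, rank code, NCBI taxonomy ID, indented name). The `min_reads` cutoff is a hypothetical parameter added here to illustrate one way of damping low-abundance false positives; it is not a setting of Workflow 2.

```python
def species_hits(report_lines, min_reads=10):
    """Return (name, taxid, clade_reads) for species-rank rows of a Kraken2 report.

    Rows are sorted by decreasing number of reads assigned to the clade.
    """
    hits = []
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # not a standard report row
        clade_reads = int(fields[1])
        rank_code = fields[3]
        taxid = int(fields[4])
        name = fields[5].strip()  # names are indented by taxonomic depth
        if rank_code == "S" and clade_reads >= min_reads:
            hits.append((name, taxid, clade_reads))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)
```

Rank codes such as `S1` (below species) are ignored by this sketch; a subspecies-level analysis would need to include them.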

Workflow 3: Gene-based pathogen identification

In this workflow, the pathogens are identified by the presence of genes associated with pathogenicity. QC reads from Workflow 1 are assembled into contigs using Metaflye (v 2.9.1) (RRID:SCR_017016) [44]. The contigs are then polished using the Medaka Consensus Pipeline (v 1.7.2) (biotools:medaka) [45], which generates consensus sequences using neural networks and shows improved accuracy over graph-based approaches for Oxford Nanopore reads. The polished contigs are screened afterward using ABRicate (v 1.0.1) (biotools:ABRicate) [46] for virulence factors (VFs) with the Virulence Factor Database (VFDB) [47] and for antimicrobial resistance (AMR) genes with the AMRFinderPlus [48] database. ABRicate is chosen for its versatility, as it supports multiple databases, including those for antimicrobial resistance genes and the VFDB. This makes it a comprehensive tool for gene-based pathogen detection, capable of identifying a wide range of relevant genetic markers [46].

Workflow 4: Allele-based pathogen identification

Another approach to identifying pathogens is to use an allelic approach by detecting SNPs (i.e., markers showing evolutionary histories of homogeneous strains) [49]. This process includes SNP calling, aimed at identifying novel pathogen strains and elucidating discrepancies compared to reference sequences, thereby facilitating the tracing of emerging variants. Within Workflow 4, both complex variants and SNPs are discerned, serving as crucial elements for subsequent pathogen identification and variant tracing purposes.

QC reads from Workflow 1 are mapped using Minimap2 (v 2.26) to a selected reference genome of a suspected pathogen. Users can choose the reference genome based on their prior knowledge of the target pathogen, the taxonomic analysis in Workflow 2, or the detected pathogenic genes in Workflow 3. Variant calling for mapped reads is performed using Clair3 (v 0.1.12) (biotools:clair3) [50]. Clair3, a tool developed for long reads, has been chosen because it is demonstrated to be faster and more accurate than the Medaka variant pipeline, which its developer has declared deprecated in favor of Clair3 [45]. After that, all complex variants and their information, such as type, genomic position, and quality score, are normalized using bcftools norm (v 1.9) [51]. The normalized variants are filtered using SnpSift filter (v 4.3) (RRID:SCR_015624) [52] based on the SNP quality computed in the SNP identification step with Clair3. The fields of the filtered variants required for further analyses are extracted using SnpSift extract fields (v 4.3) (RRID:SCR_015624) [52]. Finally, a consensus sequence for each sample is built using bcftools consensus (v 1.9) (RRID:SCR_005227) [53]. In addition to the variants, this workflow outputs tables, including summary metrics such as the mapping coverage (breadth of coverage) percentage for every sample, the mean per-base depth (depth of coverage), and the numbers of quality-filtered complex variants and SNPs. For more accurate results, users should consider only SNPs with a minimum depth of coverage of 10× to ensure reliable calls, as demonstrated in the analyses of the following Use Cases section. This threshold effectively minimizes the inclusion of false-positive variants, a challenge often encountered with Nanopore sequencing data due to its inherent error rates.
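
The summary metrics and the recommended 10× depth threshold can be illustrated with the following sketch. It is not the filtering step of Workflow 4 (which filters on SNP quality with SnpSift); it assumes plain-text VCF lines in which the read depth is stored as `DP` either in the per-sample FORMAT column or in the INFO column, and the function names are hypothetical.

```python
def coverage_metrics(per_base_depth):
    """Return (breadth of coverage in %, mean depth) from per-base depths."""
    positions = len(per_base_depth)
    if positions == 0:
        raise ValueError("per_base_depth must not be empty")
    covered = sum(1 for depth in per_base_depth if depth >= 1)
    return 100.0 * covered / positions, sum(per_base_depth) / positions


def variant_depth(vcf_line):
    """Extract DP from the first sample column, falling back to INFO."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) >= 10:
        keys = fields[8].split(":")
        values = fields[9].split(":")
        if "DP" in keys:
            index = keys.index("DP")
            if index < len(values) and values[index] != ".":
                return int(values[index])
    for entry in fields[7].split(";"):
        if entry.startswith("DP="):
            return int(entry[3:])
    return None  # depth not reported for this record


def filter_by_depth(vcf_lines, min_depth=10):
    """Keep header lines and variants with a depth of at least min_depth."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        depth = variant_depth(line)
        if depth is not None and depth >= min_depth:
            kept.append(line)
    return kept
```

Records without any depth annotation are dropped by this sketch, which is a conservative choice; a different policy may be preferable depending on the variant caller's output.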

Workflow 5: PathoGFAIR sample aggregation and visualization

In all previously described workflows, individual samples are analyzed separately. Workflow 5 consolidates the outputs from Workflows 1, 2, 3, and 4 along with sample metadata to generate various visualizations and reports. These reports illustrate the detected pathogens and facilitate the visualization and tracking of their presence across all samples.

VF tables from Workflow 3 are used to generate clustered heatmaps showing the VF genes using ggplot2 Heatmap (v 3.4.0) (RRID:SCR_014601). VF sequences are concatenated per sample, generating a consensus sequence of identified VF genes per sample and aligned over all samples using ClustalW (v 2.1) (RRID:SCR_017277). A phylogenetic tree of the virulence gene sequences is then generated from the multiple sequence alignment using FASTTREE (v 2.1.10) (RRID:SCR_015501) [54] and visualized using Newick Display (v 1.6) (biotools:newick_utilities). The same is performed on the AMR tables from Workflow 3. From Workflows 1 and 4 output tables, bar charts are generated.
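
The input to such a heatmap is a gene-by-sample matrix. The sketch below shows one way to build a presence/absence matrix from per-sample, tab-separated gene tables; it assumes each table has a header row containing a `GENE` column (as in ABRicate output), and it only covers the aggregation, not the clustering or plotting performed in Workflow 5.

```python
import csv
import io


def genes_in_table(tsv_text, gene_column="GENE"):
    """Return the set of gene names listed in one tab-separated table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[gene_column] for row in reader}


def presence_matrix(tables_by_sample):
    """Build a gene x sample presence/absence matrix.

    Returns (genes, samples, matrix), where matrix[i][j] is 1 if
    genes[i] was reported for samples[j] and 0 otherwise.
    """
    gene_sets = {
        sample: genes_in_table(text)
        for sample, text in tables_by_sample.items()
    }
    genes = sorted(set().union(*gene_sets.values()))
    samples = sorted(gene_sets)
    matrix = [
        [1 if gene in gene_sets[sample] else 0 for sample in samples]
        for gene in genes
    ]
    return genes, samples, matrix
```

Rows correspond to genes and columns to samples, matching the orientation of the clustergram described in the Use Cases section.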

Other outputs are aggregated and processed within a Jupyter Notebook [55], interactively launched in Galaxy using JupyTool (v 1.0.0). This Notebook showcases the integration of sample metadata to generate analysis-specific plots, leveraging Python (v 3.10.12) [56] libraries such as Pandas (v 1.5.3) [57, 58], Matplotlib (v 3.7.1) [59], Seaborn (v 0.12.2) [60], and Numpy (v 1.24.3) [61]. Examples of these plots include bar plots illustrating the number of reads before and after quality control for all samples, scatterplots visualizing relationships between different variables such as pathogen count and sample characteristics, and interactive cluster maps displaying the clustering patterns of samples based on pathogen composition. These visualization techniques are further elucidated and exemplified in the Use Cases section of this study, where the output tables from the workflows are aggregated with the corresponding sample metadata and visualized to facilitate comprehensive visual analysis.

VF and AMR genes are often found on mobile genetic elements (MGEs) such as plasmids or phages, meaning they can sometimes appear independently of their bacterial hosts. To address this challenge, PathoGFAIR integrates taxonomic profiling from Workflow 2 with gene detection results from Workflow 3. This cross-referencing ensures accurate attribution of VF and AMR genes to their respective host organisms. For further validation, Workflow 4 enables users to map consensus genomes, generated in Workflow 5 from detected VF genes, against any reference genomes. This process confirms whether these VF or AMR genes, detected in Workflow 3, are genuinely linked to the bacterial genome or merely associated with MGEs, with additional coverage metrics helping to ensure accurate mapping. Future updates to PathoGFAIR will include expanded methodologies to validate gene–host associations using broader taxonomic markers, further refining the precision of pathogen characterization.

Workflow reports

As all PathoGFAIR workflows are designed to run seamlessly on the Galaxy platform, an interactive report is automatically generated upon completion of each workflow. These reports provide a comprehensive overview of the respective workflow’s inputs and outputs. In PathoGFAIR, special attention has been given to refining these reports for enhanced user experience. The reports are carefully curated to automatically showcase and emphasize only the most informative, easily interpretable, and accessible outputs for each workflow. This ensures that users can efficiently extract key insights from the results, facilitating a streamlined and user-friendly analysis experience.

Easily adaptable workflows

The workflows can process raw shotgun (meta)genomics sequencing data from any sample, not only food.

PathoGFAIR was initially developed to take Oxford Nanopore data as input. However, it can also work with Illumina data or data from other sequencing techniques. To adapt to Illumina sequencing, only 1 tool needs to be changed in Workflow 1: Porechop [25] is replaced with Cutadapt (RRID:SCR_011841) [62]. Workflows 2, 3, 4, and 5 can be used directly with Illumina datasets without any adaptation. Some tools, such as Clair3 (v 0.1.12) [50] and Metaflye (v 2.9.1) [44], can be swapped based on their known performance on short and long reads. All the mentioned tools are accessible within Galaxy, allowing for seamless interchangeability.

The workflows can also be adapted to process paired-end reads by adjusting the tools’ parameters to take paired-end read samples instead of single-end reads. These changes can be applied with little effort by using the user-friendly workflow editor in Galaxy.

Users can seamlessly switch between different host reference genomes and Kraken2 databases, as PathoGFAIR supports various preinstalled databases on the Galaxy servers. This feature enhances user convenience and makes it efficient to explore different configurations to suit specific analysis requirements.

Similarly, tool versions and parameters can be adapted, for example, to compare results with legacy versions of the workflows. New tool versions are automatically installed on public Galaxy servers using a sophisticated update infrastructure, ensuring a straightforward mechanism to keep the infrastructure up-to-date [63]. Every time a tool is updated, an update of the workflows is suggested, tested with functional tests, and released on the workflow registries once accepted.

Each of the 5 PathoGFAIR workflows is designed for a distinct type of analysis. Workflows 2, 3, and 4 operate independently, offering the flexibility to run them concurrently or skip them as per user requirements. This modular structure allows users to tailor the analysis to their specific needs, activating only the functionalities necessary for the desired workflow outcome.

FAIR workflows

The FAIR principles [64], which emphasize the importance of making research objects Findable, Accessible, Interoperable, and Reusable, offer valuable guidance for optimizing the utility and promoting the reproducibility and reusability of any research object (data, software [65], or workflows).

PathoGFAIR has been developed with the FAIR principles in mind and follows the 10 tips for building FAIR workflows, as suggested by de Visser et al. [66]. First, by using Galaxy as a workflow manager, the workflows are portable (Tip 6) and come with a reproducible computational environment (Tip 7). The tools integrated into the workflows use file format standards such as FASTA and FASTQ for sequence data, SAM and BAM from the Samtools project for alignment data, VCF for genetic variations, GenBank and GFF3 for genomic annotations, and PDB for structural data (Tip 5) [64]. As explained in the previous section, the workflows are provided with default values (Tip 8) and are modular (Tip 9).

The 5 workflows are available on the GitHub repository of IWC, the Intergalactic Workflow Commission of the Galaxy community (Tip 3) [67–72]. Workflows in this repository are reviewed and tested using test data before publication and with every new Galaxy release. The IWC automatically updates the workflows whenever a new version of any tool used in these workflows is released. Deposited workflows follow best practices, are versioned using GitHub releases, and contain important metadata (e.g., License, Author, Institutes) (Tip 2). The workflows are automatically added to 2 workflow repositories (Dockstore [18] and WorkflowHub [19]) to facilitate the discovery and reuse of workflows in an accessible and interoperable way (Tip 1) [73–82]. Via Dockstore or WorkflowHub, the PathoGFAIR workflows can be installed on any up-to-date Galaxy server. They are already publicly available on 3 main Galaxy servers (usegalaxy.org, usegalaxy.eu, usegalaxy.org.au), which any user can use and modify without restriction.

A thorough explanation of how to use the workflows in PathoGFAIR, including a more global description of pathogen identification from Oxford Nanopore data, can be found in a dedicated extensive tutorial [83] together with example input data and results (Tips 4 and 10), freely available and hosted via the GTN [20] infrastructure.

Finally, for every invocation of the workflows, a Research Object Crate (RO-Crate [84, 85]) can be created to store the data products of the different steps, along with the run-associated metadata (including parameters, tool, and workflow version).

Use Cases

To showcase PathoGFAIR and its capabilities, 130 samples from 2 distinct studies—one involving samples with prior pathogen isolation and the other without—were analyzed. In the case of nonisolated samples, pathogens were deliberately spiked to mimic real-world scenarios. For isolated samples, prior identification ensured the pathogens’ identities were known. All samples underwent sequencing using Oxford Nanopore technology, highlighting the workflow’s adaptability across diverse sample preparation methods. All workflows of PathoGFAIR were evaluated for their main intended tasks (e.g., the preprocessing workflow for its read-quality retention and host-sequence removal performance) but also for their ability to identify the correct pathogen and for how their accuracy varies across different sampling conditions.

Samples without prior pathogen isolation

Data generation

In this study, 46 samples were prepared according to the following protocol [86]. Chicken meat was spiked with 1 of 3 Salmonella enterica subspecies (S. enterica subsp. houtenae DSM 9221, S. enterica subsp. enterica DSM 554, or S. enterica subsp. salamae DSM 9220) or a mix of them, with concentrations that give cycle threshold (Ct) values between 25 and 33. A total of 15 samples were incubated at 37°C for 24 hours before DNA isolation to facilitate bacterial growth. All samples were then incubated at 56°C for 1 hour with lysis buffer and 20 ng/μL Proteinase K, followed by DNA extraction according to the STAR BEADS Pathogen DNA/RNA Extraction kit (CYANAGEN SRL) instructions. Approximately 25 mg of meat was used per aliquot for DNA extraction. DNA concentrations were measured with the Qubit® 4.0 Fluorometer (Thermo Fisher Scientific) using the double-stranded DNA (dsDNA) High-Sensitivity (HS) assay kit (Thermo Fisher Scientific), following the manufacturer’s protocol. The quality was evaluated with a Nanodrop® 1000 (Thermo Fisher Scientific), assessing the 260/280 nm and 260/230 nm ratios. The 260/280 and 260/230 ratios were close to the expected ranges of 1.8–2.0 and 2.0–2.2, respectively. Extracted DNA was barcoded before sequencing using the Native barcoding genomic DNA (with EXP-NBD104, EXP-NBD114, and SQK-LSK109) protocol (Oxford Nanopore). DNA was then loaded on an R9.4.1 MinION Mk flow cell (Oxford Nanopore). The SpotON sample port cover and priming port were closed, and sequencing was started. The sequencing device control, data acquisition, and real-time basecalling were carried out by the MinKNOW software of the MinION Mk1C device. For 6 samples, adaptive sampling, a technique used in Nanopore sequencing to selectively sequence microbial DNA while excluding unwanted host DNA (here chicken DNA), was used. The generated sequencing data are available via BioProject PRJNA982679.

Metadata for the 46 samples are summarized in Supplementary Table S1 as 5 pieces of information: (i) expected subspecies; (ii) incubation before DNA isolation; (iii) adaptive sampling during sequencing; (iv) colony-forming unit (CFU)/mL [87], a measure providing a quantitative assessment of viable microbial entities within a given sample and measured using standard microbiological techniques such as serial dilution and plating on agar medium; and (v) Ct values [88], values inversely proportional to the amount of nucleic acid present in the samples.

Preprocessing

The number of reads after quality control varies significantly between samples (Fig. 2A), which impacts downstream analyses.

Figure 2: — (A) Bar plot showing the total number of quality-controlled reads per sample before (dark blue) and after (light blue) host sequence removal. On the left, the metadata of the samples are displayed: (i) the expected *S. enterica* subsp. *salamae* in yellow, *S. enterica* subsp. *houtenae* in blue, and *S. enterica* subsp. *enterica* in light purple; (ii) incubation before DNA isolation (incubated for 24 hours in pink and incubated for 1 hour in brown); and (iii) adaptive sampling during sequencing (chicken excluded in green and chicken not excluded in purple). (B) Clustergram displaying the identified VF gene abundances per sample. The VF genes are presented on the y-axis, and all 46 nonisolated samples are on the x-axis along with their sample information. On the top are the metadata of the samples with the same color code as in A. The gray bars on the bottom and on the right represent dendrogram VF gene (right) and sample metadata (bottom) clusters found with hierarchical clustering with a clustering granularity of 0.5. (C) Phylogenetic tree, using the nucleotide evolution model; General Time Reversible (GTR) model with a CAT approximation for rate heterogeneity across sites [54]. The phylogenetic tree was built on the VF gene consensus sequences concatenated per sample and aligned for all samples. (D) Bar plot with the mapping coverage (breadth of coverage), that is, the percentage of covered bases of each sample to the reference genome, measured by calculating the percentage of positions within each bin with at least 1 base aligned against it. (E) Bar plot with the mean of the mapping depth (depth of coverage) of bases mapped to corresponding bases in the reference genome for every sample. (F) Bar plot with the number of variants and SNPs found per sample. 
Mapping coverage percentage and the depth mean indicate whether to trust the variants and SNPs found by the workflow or not; the higher the coverage percentage and the depth mean, the more trusted the SNP results for the sample.

For host detection with Minimap2 (v 2.26), the PacBio/Oxford Nanopore read-to-reference mapping option was used. As expected from the sample sequencing protocol (chicken samples without pathogen isolation), most sequences were assigned to chicken (Gallus gallus galGal6): above 90% in 31 samples and between 55% and 85% for the remaining 15 samples (Supplementary Fig. S1). However, the percentage of identified host DNA (between 60% and 98%) was not as low as expected for the 6 samples that had undergone adaptive sampling to exclude chicken DNA during sequencing. This suggests that adaptive sampling did not exclude all chicken sequences during sequencing. All sequences identified as chicken were removed (Fig. 2A). After QC and host removal, 19 samples had fewer than 1,000 reads. These samples could only be analyzed using taxonomy profiling, as highlighted in the next sections.
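The bookkeeping behind these figures can be illustrated with a short sketch. It is not the Galaxy workflow itself: the function name, the input format (a mapping from read ID to a flag saying whether the read mapped to the host), and the example counts are assumptions made for illustration. Only the 1,000-read cut-off is taken from the text.

```python
from typing import Dict, List, Tuple

# Cut-off mentioned in the text: samples below it could only be
# analyzed with taxonomy profiling.
MIN_READS = 1_000


def summarize_host_removal(host_flags: Dict[str, bool]) -> Tuple[float, List[str], bool]:
    """Return (host percentage, retained read IDs, enough-reads flag).

    `host_flags` maps a read ID to True when the read mapped to the host
    reference (hypothetical input format, e.g. derived from a mapping step).
    """
    total = len(host_flags)
    if total == 0:
        return 0.0, [], False
    # Keep only reads that did not map to the host reference.
    retained = [read_id for read_id, is_host in host_flags.items() if not is_host]
    host_pct = 100.0 * (total - len(retained)) / total
    return host_pct, retained, len(retained) >= MIN_READS


# Hypothetical sample: 10,000 reads, of which the first 9,200 map to the host.
flags = {f"read{i}": i < 9_200 for i in range(10_000)}
host_pct, kept, enough = summarize_host_removal(flags)
print(f"{host_pct:.1f}% host reads, {len(kept)} reads kept, enough reads: {enough}")
```

In this made-up example, 92% of the reads are assigned to the host and only 800 reads remain, so the sample would fall below the 1,000-read cut-off.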

Taxonomy profiling

S. enterica was detected in Workflow 2 for all samples except 1, at its species and different subspecies taxonomic ranks (interactive KRONA plot in Supplementary Fig. S2 and Supplementary Table S3).

Gene-based pathogen identification

In Workflow 3, the mode option of the metaFlye tool (v 2.9.1) was set to Nanopore-HQ. Users can adapt the workflow and change this option to match the sequencing technology of their dataset.

No contig was built for 10 of the 27 samples with fewer than 2,700 reads, which made the identification of VF or AMR genes impossible. For the other 17 of these samples, only 1 or 2 contigs were created, which was not enough to identify VF and AMR genes.

For the remaining 19 samples, which had more than 2,700 reads and between 3 and 157 contigs, VF genes were identified in 15 samples (Fig. 2B), 12 of which were incubated for 24 hours before DNA isolation. The other 3 samples were incubated for only 1 hour before DNA isolation, and few VF genes were identified in them (Fig. 2B) compared to the other 12 samples, most likely because of the low number of reads (Fig. 3E) resulting from the near absence of incubation (Fig. 2A). This was, for example, the case for the mixed samples (i.e., samples spiked with all 3 S. enterica subspecies or samples spiked only with S. enterica subsp. houtenae and adaptively sampled during sequencing).

Figure 3: — Scatterplots showing the number of identified VF genes (A, C, E) and AMR genes (B, D, F) in relationship to the Ct value (A, B), CFU/mL value (C, D), and the number of reads after preprocessing (E, F). The green area (A, B, C, D) highlights Ct values or CFU/mL values for which genes had been detected. Pearson correlation for values in the green area: (A) and P , (B) and P, (C) and P , (D) and P .

Some identified VF genes were found more than once in the same sample, with a maximum of 4 times. Common VF genes were identified for samples spiked with the same S. enterica subspecies (Fig. 2B). Examples are the mucD gene, a serine protease mucD precursor, which was found only in S. enterica subsp. houtenae spiked samples, and shdA, an AIDA autotransporter-like protein, which was found only in S. enterica subsp. enterica spiked samples and not in samples spiked with only S. enterica subsp. houtenae or S. enterica subsp. salamae.

Similar results were found for AMR genes (Supplementary Fig. S3, Fig. 3F). The sampling conditions affected the number of identified VF and AMR genes, as shown by the relationships between the Ct value, CFU/mL value, or the number of remaining reads after preprocessing (Fig. 3). The lower the Ct value, the higher the number of VF genes and AMR genes identified (Fig. 3A, B). No VF or AMR genes were detected for samples with Ct values above 26. For Ct values below 26, there was a negative correlation (Pearson Inline graphic , P) between the Ct value and the number of identified AMR genes. Similar but inverse relations were observed for CFU/mL value (Fig. 3C, D), with a threshold for VF and AMR gene detection at . VF and AMR genes were then detected if several conditions were fulfilled: a Ct value below 26, CFU/mL value above Inline graphic , and at least 5,000 reads after preprocessing. The further the samples were from these thresholds, the higher the number of VF genes and AMR genes identified. Indeed, the 3 top scattered dots (in red; Fig. 3A, C, E), with identified VF genes between 250 and 300 were the samples with the highest number of reads, a higher CFU/mL value, and a relatively lower Ct value compared to other samples. Generally, allowing samples to incubate for a short period before sequencing enhances microbial growth, resulting in higher CFU/mL values and lower Ct values. This increase in microbial concentration improves the efficiency of direct sequencing by providing more genetic material for analysis, facilitating faster and more accurate pathogen detection.
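As an illustration of the correlation analysis described above, the following sketch computes a Pearson coefficient between Ct values and gene counts, restricted to samples below the Ct threshold of 26. The function names and the (Ct, gene count) pairs are hypothetical and do not reproduce the study data; the sketch also does not compute a P value.

```python
import math
from typing import List, Sequence, Tuple

CT_THRESHOLD = 26.0  # Ct value above which no VF or AMR genes were detected (from the text)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def ct_gene_correlation(samples: List[Tuple[float, int]]) -> float:
    """Correlate Ct value and gene count for samples below the Ct threshold."""
    below = [(ct, count) for ct, count in samples if ct < CT_THRESHOLD]
    cts = [ct for ct, _ in below]
    counts = [float(count) for _, count in below]
    return pearson_r(cts, counts)


# Hypothetical (Ct value, number of identified genes) pairs.
samples = [(22.0, 40), (23.0, 30), (24.0, 20), (25.0, 10), (28.0, 0), (31.0, 0)]
r = ct_gene_correlation(samples)
print(f"Pearson r below Ct {CT_THRESHOLD}: {r:.2f}")
```

A negative coefficient, as in this made-up example, corresponds to the reported trend: the lower the Ct value, the more genes are identified.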

Allele-based pathogen identification

In Workflow 4, samples were mapped against a reference genome of an expected pathogen chosen by the user. S. enterica subsp. enterica ser. Typhimurium (NC_003197.2) was chosen for these data, as it is widely recognized and extensively used in genomic studies due to its complete and well-annotated genome sequence [89]. However, given the diversity among the serovars of S. enterica subsp. enterica, a high number of complex variants and SNPs is anticipated.

The provided mapping statistics (mapping coverage [breadth of coverage] and mapping depth [depth of coverage] in Fig. 2D, E) serve as proxies for assessing the number and quality of identified SNPs (Fig. 2F). SNPs with low mapping depth are less reliable than those with higher depth. Reliable SNP calling typically requires a depth of at least 10, which was achieved in 2 samples. Samples with the highest mean mapping depth corresponded to samples with the highest number of reads after preprocessing (Fig. 2A). The higher the coverage and the mean mapping depth, the more high-quality SNPs were identified (Fig. 2D–F). Some of the samples spiked with S. enterica subsp. enterica had a high breadth of coverage but a low mean depth of coverage; as a result, the number of their quality-filtered SNPs was low.
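The two mapping statistics can be made concrete with a minimal sketch. It assumes per-base depth values for a single reference are already available as a list of integers, whereas the workflow computes breadth per bin. It also averages depth over all reference positions, which is one of several possible conventions. The function name and the example depths are hypothetical.

```python
from typing import Sequence, Tuple

MIN_DEPTH_FOR_SNPS = 10  # depth the text cites as typically required for reliable SNP calling


def coverage_stats(per_base_depth: Sequence[int]) -> Tuple[float, float]:
    """Return (breadth of coverage in %, mean depth of coverage).

    Breadth: share of reference positions with at least 1 aligned base.
    Mean depth: average depth over all reference positions (assumption).
    """
    if not per_base_depth:
        raise ValueError("per_base_depth must not be empty")
    n_positions = len(per_base_depth)
    covered = sum(1 for depth in per_base_depth if depth >= 1)
    breadth = 100.0 * covered / n_positions
    mean_depth = sum(per_base_depth) / n_positions
    return breadth, mean_depth


# Hypothetical depths for a 10-position reference.
depths = [0, 0, 3, 5, 12, 12, 8, 1, 0, 9]
breadth, mean_depth = coverage_stats(depths)
reliable = mean_depth >= MIN_DEPTH_FOR_SNPS
print(f"breadth={breadth:.1f}%, mean depth={mean_depth:.1f}, reliable SNPs: {reliable}")
```

In this toy case, breadth is fairly high but mean depth stays below 10, which mirrors the samples described above whose quality-filtered SNP counts were low despite good breadth of coverage.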

PathoGFAIR sample aggregation and visualization

For the samples in which VF or AMR genes had been identified, phylogenetic trees were built on the concatenated gene consensus sequences (Fig. 2C for VF genes, Supplementary Fig. S4 for AMR genes). These trees help track divergence between samples and can thus help pinpoint the point of contamination or reveal evolution of the subspecies through mutations. Indeed, samples spiked with S. enterica subsp. enterica were found together in the VF-based tree (Fig. 2C), indicating that the identified VF genes were unique to these samples and could clearly separate them from samples with other S. enterica subspecies. The samples spiked with S. enterica subsp. houtenae were mostly clustered together, except for 2 samples, because of extra identified VF genes shared with samples spiked with S. enterica subsp. enterica and/or S. enterica subsp. salamae. The 2 samples spiked with a mix of the 3 subspecies were found in the middle of the tree (Fig. 2C), showing that a mix of VF genes related to the different subspecies was identified. The mixed sample S45, spiked with a higher concentration of S. enterica subsp. houtenae than of the other subspecies, was close to sample S02, spiked with S. enterica subsp. houtenae only. In the phylogenetic tree based on AMR genes (Supplementary Fig. S4), samples were not as clearly separated as in the tree based on VF genes, mostly because the number of identified AMR genes was relatively low compared to the number of identified VF genes.
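The concatenation step that precedes alignment and tree building can be sketched as follows. This is an assumption-laden illustration rather than the workflow's implementation: the function name, the nested-dictionary input, the alphabetical gene ordering, and the example sequences are all hypothetical. The resulting multi-FASTA would still need to be aligned before a tree is inferred.

```python
from typing import Dict, List


def concatenate_consensus(per_sample: Dict[str, Dict[str, str]]) -> str:
    """Build a multi-FASTA string with one concatenated record per sample.

    `per_sample` maps sample name -> {gene name: consensus sequence}.
    Genes are joined in alphabetical order so that shared genes appear in
    the same relative order in every sample (an illustrative choice).
    """
    records: List[str] = []
    for sample in sorted(per_sample):
        genes = per_sample[sample]
        sequence = "".join(genes[name] for name in sorted(genes))
        if sequence:  # skip samples without any identified gene
            records.append(f">{sample}\n{sequence}")
    return ("\n".join(records) + "\n") if records else ""


# Hypothetical consensus sequences, far shorter than real genes.
example = {
    "S02": {"mucD": "ATGC", "geneA": "GGA"},
    "S45": {"geneA": "GGT"},
    "S10": {},
}
fasta = concatenate_consensus(example)
print(fasta, end="")
```

Samples without any identified gene (here the made-up `S10`) produce no record, consistent with trees being built only for samples in which VF or AMR genes were identified.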

Sensitivity

The performance of the workflows was evaluated based on their ability to identify the expected S. enterica pathogen, as well as S. enterica subspecies and strain taxonomic ranks for the tested samples (Supplementary Table S3). In a metagenomic setting, other detected species cannot be regarded as false positives, as they may naturally be present in the sample. Therefore, only sensitivity was reported.

For the taxonomy profiling (Workflow 2), the expected pathogen was detected at its species taxonomic rank in all but 1 sample, resulting in a sensitivity of 97.8%. At the subspecies taxonomic rank, the expected subspecies was detected in 28 out of 46 samples, yielding a sensitivity of 60.9%. To further evaluate subspecies classification performance, the sample-wise sensitivity (the percentage of correctly identified S. enterica subspecies out of all detected S. enterica subspecies) was calculated. Averaged across all samples, the sample-wise sensitivity was 47.3%. In the gene-based pathogen identification (Workflow 3), at least 1 VF gene of the expected pathogen, at strain taxonomic rank, was detected in 13 out of 46 samples, corresponding to a sensitivity of 28.3%. For the samples in which no VF gene was detected, no contigs could be generated, preventing gene calling.
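The two metrics used here can be written down in a few lines. The sketch below uses hypothetical function names and hypothetical per-sample subspecies sets. Only the species-rank count (detection in all but 1 of 46 samples) comes from the text.

```python
from typing import List, Set, Tuple


def sensitivity(n_detected: int, n_samples: int) -> float:
    """Percentage of samples in which the expected taxon was detected."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return 100.0 * n_detected / n_samples


def sample_wise_sensitivity(expected: Set[str], detected: Set[str]) -> float:
    """Percentage of detected subspecies in one sample that were expected."""
    if not detected:
        return 0.0
    return 100.0 * len(expected & detected) / len(detected)


# Species rank: expected pathogen detected in all but 1 of the 46 samples.
species_sensitivity = sensitivity(45, 46)
print(f"species-rank sensitivity: {species_sensitivity:.1f}%")

# Hypothetical (expected, detected) subspecies sets for 2 samples.
pairs: List[Tuple[Set[str], Set[str]]] = [
    ({"houtenae"}, {"houtenae", "enterica"}),
    ({"enterica"}, {"enterica"}),
]
per_sample = [sample_wise_sensitivity(exp, det) for exp, det in pairs]
mean_sample_wise = sum(per_sample) / len(per_sample)
print(f"mean sample-wise sensitivity: {mean_sample_wise:.1f}%")
```

The first call reproduces the species-rank figure of 97.8%. The sample-wise values are illustrative only and show how the metric penalizes samples in which additional, unexpected subspecies are reported.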

Changing the workflow’s default settings, such as using different reference databases for preprocessing in Workflow 1, taxonomy profiling in Workflow 2, or gene-based pathogen identification in Workflow 3, would likely impact these metrics. Different reference databases could influence the accuracy and sensitivity of taxonomic classification and pathogen identification, as they may contain varying levels of strain-specific data. Adjusting parameters like threshold values, filtering criteria, or the inclusion of additional databases could also affect the detection sensitivity and overall performance, potentially improving or reducing the workflow’s ability to accurately identify pathogens and associated genes in the given samples.

Samples with prior pathogen isolation

Data description

To further test PathoGFAIR, 84 public datasets were used [90]. These samples were collected in Palestine by the Swiss Tropical and Public Health Institute from chicken meat, chicken stool, or human stool in 2021 or 2022 (Supplementary Fig. S5). S. enterica had been isolated in 19 of these samples and Campylobacter jejuni in 65. The generated sequencing data are provided under BioProjects PRJNA942086 (S. enterica [91]) and PRJNA942088 (C. jejuni [92]).