BMC Genomics. 2025 Jan 8;26:20. doi: 10.1186/s12864-024-11182-5

Galaxy @Sciensano: a comprehensive bioinformatics portal for genomics-based microbial typing, characterization, and outbreak detection

Bert Bogaerts 1,#, Julien Van Braekel 1,#, Alexander Van Uffelen 1, Jolien D’aes 1, Maxime Godfroid 1, Thomas Delcourt 1, Michael Kelchtermans 1, Kato Milis 1, Nathalie Goeders 1, Sigrid C J De Keersmaecker 1, Nancy H C Roosens 1, Raf Winand 1,#, Kevin Vanneste 1,✉,#
PMCID: PMC11715294  PMID: 39780046

Abstract

The influx of whole genome sequencing (WGS) data in the public health and clinical diagnostic sectors has created a need for data analysis methods and bioinformatics expertise, which can be a bottleneck for many laboratories. At Sciensano, the Belgian national public health institute, an intuitive and user-friendly bioinformatics tool portal was implemented using Galaxy, an open-source platform for data analysis and workflow creation. The Galaxy @Sciensano instance is available to both internal and external scientists and offers a wide range of tools provided by the community, complemented by over 50 custom tools and pipelines developed in-house. The tool selection is currently focused primarily on the analysis of WGS data generated using Illumina sequencing for microbial pathogen typing, characterization and outbreak detection, but it also addresses specific use cases for other data types. Our Galaxy instance includes several custom-developed 'push-button' pipelines, which are user-friendly and intuitive stand-alone tools that perform complete characterization of bacterial isolates based on WGS data and generate interactive HTML output reports with key findings. These pipelines include quality control, de novo assembly, sequence typing, antimicrobial resistance prediction and several relevant species-specific assays. They are tailored for pathogens with active genomic surveillance programs, and clinical relevance, such as Escherichia coli, Listeria monocytogenes, Salmonella spp. and Mycobacterium tuberculosis. These tools and pipelines utilize internationally recognized databases such as PubMLST, EnteroBase, and the NCBI National Database of Antibiotic Resistant Organisms, which are automatically synchronized on a regular basis to ensure up-to-date results. Many of these pipelines are part of the routine activities of Belgian national reference centers and laboratories, some of which use them under ISO accreditation. 
This resource is publicly available for noncommercial use at https://galaxy.sciensano.be/ and can help other laboratories establish reliable, traceable and reproducible bioinformatics analyses for pathogens encountered in public health settings.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12864-024-11182-5.

Keywords: Galaxy, Public health, Whole genome sequencing, Genomic surveillance

Introduction

In recent years, whole genome sequencing (WGS) has become an essential tool for many public health and clinical laboratories. Public health surveillance of microbiological pathogens is currently often based on WGS data as directed by international guidelines and requirements (e.g., the European Centre for Disease Prevention and Control, the European Food Safety Authority, and the U.S. Food and Drug Administration) [1, 2]. WGS can assess relatedness between isolates down to single nucleotide resolution and provide complete information on characteristics such as virulence or antimicrobial resistance (AMR) gene content [3]. In clinical settings, WGS has improved the ability to trace outbreaks, understand AMR, and optimize treatments. Advances in sequencing technology have led to an ever-increasing influx of genomic data, increasing the need for data analysis methods and expertise. Analyzing these data can be complex and often requires working with command line tools, knowledge of multiple programming languages and significant computing and data storage resources. For many laboratories, meeting these increased needs can be a major bottleneck in the processing of sequencing data [4]. Additionally, bioinformatics processing tends to vary between laboratories and many available tools and databases perform the same type of analysis, which can be overwhelming for scientists without experience in bioinformatics.

One possible solution to overcome these hurdles is provided by Galaxy, a free and open-source platform for data analysis and workflow creation that is accessible from a web browser [5–7]. Galaxy provides a user-friendly and intuitive graphical user interface for remotely running bioinformatics tools and workflows. This makes it particularly suitable for scientists who are not accustomed to running bioinformatics tools on the command line. Its ease of use has made it a popular solution worldwide, as evidenced by the 167 open Galaxy instances listed on the Galaxy Project website (https://galaxyproject.org/use/, accessed on August 8th, 2024). These instances range from general-purpose portals offering a wide variety of tools such as the UseGalaxy.org and UseGalaxy.eu instances, to specific instances focused on a particular application scope, such as metagenomics (e.g., ASaiM [8]), transcriptomics (e.g., GIANT [9]), drug discovery (e.g., MPDS [10]), or even a specific sequencing technology such as nanopore sequencing (e.g., NanoGalaxy [11]). In Galaxy, tools can be installed from the Galaxy ToolShed [7], which offers a wide range of community-developed tools and currently hosts 10,030 repositories (https://galaxyproject.org/toolshed/, accessed on August 8th, 2024). As it would be impractical to provide every possible tool in a single instance, tool selection is often curated to match the scope of the Galaxy instance. In Galaxy, users can create workflows to combine multiple tools into a single analysis, reducing the workload and improving reproducibility. Traceability is ensured by tracking tool versions and parameters, even for tools executed within workflows. These workflows can be shared among users and made available to all users on a Galaxy instance.

In this paper, we present Galaxy @Sciensano, a customized Galaxy instance tailored for genomics-based pathogen typing, characterization, and outbreak detection relevant to public health and clinical diagnostics. This instance provides a curated set of ToolShed installations, complemented by wrappers and pipelines developed in-house to create a comprehensive solution for pathogen genomics based on Illumina WGS data for several relevant microbial pathogens. This Galaxy instance aims to provide easy-to-use bioinformatics solutions for public health and clinical laboratories operating under a quality system and for public-health experts with limited bioinformatics experience. This is achieved by providing reliable, traceable, and reproducible bioinformatics analysis, custom training resources and in-house developed tools that generate easy-to-understand and intuitive HTML output reports.

Materials and methods

Accessing Galaxy @Sciensano and associated training material

The Galaxy @Sciensano instance is available at https://galaxy.sciensano.be/ and is free for noncommercial use after registration with a valid email address. Registration and usage of the Galaxy instance is subject to compliance with our usage policy, which can be found at https://galaxy.sciensano.be/policy/disclaimer_policy.html. Under normal circumstances, the instance is always available, except during system maintenance, which is announced at least one week in advance on the Galaxy @Sciensano homepage. Each account is provided with 50 GB of disk space. None of the uploaded data stored on the instance are backed up, and users are encouraged to transfer any necessary files to their local computer, as all datasets will be automatically deleted after three months. Issues can be reported directly through the Galaxy platform's built-in functionality, and feedback can be submitted through the contact email listed on the homepage.

A series of in-house video tutorials detailing the use of the Galaxy @Sciensano instance and a selection of custom tools is available on YouTube (https://www.youtube.com/playlist?list=PL9O-3w2bLZ4X5DJGYlbqL60PQDzn42Wjh). Additionally, the Galaxy Training Network offers a curated list of tutorials on Galaxy (https://training.galaxyproject.org/), with 414 currently available (accessed on September 16th, 2024) [12].

Tool selection

Overview

In total, 474 tools are available on Galaxy @Sciensano (Supplementary Table S1), of which 53 are exclusive to our instance (Table 1) (accessed on September 16th, 2024). The tool catalog was built by adding tools that are commonly used at our institute, or by implementing wrappers at the request of users when no (stable) wrapper was available in the Galaxy ToolShed, specifically to fit within the scope of our Galaxy instance as a comprehensive bioinformatics portal for genomics-based microbial typing, characterization, and outbreak detection for public health and clinical diagnostics. Requests to make tools available are evaluated in terms of the resources needed, their scope, and the value they add. For example, tools that have a well-maintained ToolShed installation and that fit within the scope of the instance are usually accepted. On the other hand, tools that do not have a ToolShed installation will usually be made available only if their functionality fits within the scope of our instance and similar tools are not yet available. Due to the large number of use cases covered by ‘push-button’ pipelines (see below) implemented as stand-alone tools (Table 1, Supplementary Fig S3), the number of public workflows available on this instance is relatively limited. These ‘push-button’ pipelines are comprehensive pipelines for a particular analysis, such as the complete isolate characterization for a particular species. They are available as stand-alone tools in Galaxy and are therefore not customizable, except for the parameters available within the tool panel. In contrast, Galaxy workflows are combinations of tools that can be created, modified and shared within Galaxy. However, a species-agnostic quality control (QC) and de novo assembly Galaxy workflow for Illumina paired-end data and dedicated workflows for the examples presented in this manuscript are available (“Example use cases” section).

Table 1.

Overview of custom tools and wrappers available in Galaxy @Sciensano

Category | Tool | HTML output report | Description
Assembly | Hybrid assembly pipeline | Yes | In-house implementation of a hybrid assembly strategy (ONT + Illumina data) [13].
Phylogeny | MEGA—Model selection and tree construction | No | In-house workflow to perform automatic model selection and tree-building using MEGA X [14].
Phylogeny | MLST phylogeny | Yes | Constructs phylogenies from the sequence typing output generated by the stand-alone tool or the pipelines.
Phylogeny | PACU | Yes | In-house workflow for SNP-based clustering of Illumina and/or ONT data [15].
Phylogeny | SNP dists | No | Converts a FASTA alignment to a SNP distance matrix (https://github.com/tseemann/snp-dists).
Phylogeny | SNP phylogeny: CFSAN | Yes | Complete SNP phylogeny pipeline starting from FASTQ data, including read trimming, variant calling and filtering using the CFSAN SNP pipeline, model selection and tree construction [16].
Phylogeny | SNP phylogeny: SAMtools | Yes | Complete SNP phylogeny pipeline starting from FASTQ data, including read trimming, variant calling using SAMtools, variant filtering, model selection and tree construction [17].
Pipelines | Enterococcus pipeline | Yes | Pipeline for the characterization of Enterococcus spp. isolates.
Pipelines | Klebsiella pipeline | Yes | Pipeline for the characterization of Klebsiella pneumoniae isolates.
Pipelines | Listeria pipeline | Yes | Pipeline for the characterization of Listeria monocytogenes isolates.
Pipelines | Mycobacterium pipeline | Yes | Pipeline for the characterization of Mycobacterium tuberculosis isolates [18].
Pipelines | Neisseria pipeline | Yes | Pipeline for the characterization of Neisseria meningitidis isolates [19].
Pipelines | Salmonella pipeline | Yes | Pipeline for the characterization of Salmonella isolates.
Pipelines | Shigella pipeline | Yes | Pipeline for the characterization of Shigella isolates.
Pipelines | Staphylococcus pipeline | Yes | Pipeline for the characterization of Staphylococcus aureus isolates.
Pipelines | STEC pipeline | Yes | Pipeline for the characterization of Escherichia coli isolates (including STEC) [20].
Pipelines | Yersinia pipeline | Yes | Pipeline for the characterization of Yersinia isolates.
Pipelines | SARS-CoV-2 LoFreq Pipeline | No | Detection of low-frequency variants in SARS-CoV-2 data.
Pipelines | Viral consensus pipeline (Illumina) | Yes | Pipeline for extracting the consensus sequence of viral Illumina sequencing data for Influenza, SARS-CoV-2 and RSV, with support for other species when a reference genome is provided.
Pipelines | Viral consensus pipeline (ONT) | Yes | Pipeline for extracting the consensus sequence of viral ONT sequencing data for Influenza, SARS-CoV-2 and RSV, with support for other species when a reference genome is provided.
Pre-processing | Human read scrubbing pipeline | Yes | Removes human reads using the NCBI human read removal tool.
Quality control | CheckM | Yes | Checks the quality of microbial (meta-)genomes [21].
Quality control | CheckV | Yes | Checks the quality of viral (meta-)genomes [22].
Quality control | ConFindr | Yes | Identifies bacterial intra-species contamination in raw Illumina data [23].
Resistance characterization | AbriTAMR | Yes | AMR gene detection pipeline that runs AMRFinderPlus on a single isolate (or a list of isolates) and collates the results into a table, separating the identified genes into functionally relevant groups [24].
Resistance characterization | NCBI AMRFinder+ | Yes | Identification of AMR genes and mutations [25].
Resistance characterization | IntegronFinder | Yes | Detection of integrons [26].
Resistance characterization | ResFinder4 | Yes | Identification of AMR genes and mutations.
Sequence typing | BTyper3 | Yes | Classification and characterization of Bacillus cereus group isolates [27].
Sequence typing | Gene detection – create DB | Yes | Tool to create custom databases for the in-house gene detection tool.
Sequence typing | Gene detection (BLAST) | Yes | Detection of genes using BLAST+ [28].
Sequence typing | Gene detection (KMA) | Yes | Detection of genes using KMA [29].
Sequence typing | Gene detection (SRST2) | Yes | Detection of genes using SRST2 [30].
Sequence typing | SCCmecFinder | Yes | Characterization of the Staphylococcal SCCmec cassette [31].
Sequence typing | Sequence typing (BLAST) | Yes | Sequence typing using BLAST+ [28].
Sequence typing | Sequence typing (KMA) | Yes | Sequence typing using KMA [29].
Sequence typing | Sequence typing (SRST2) | Yes | Sequence typing using SRST2 [30].
Sequence typing | spa-typing | Yes | spa gene typing for Staphylococcus.
Sequence typing | SRST2 base – gene detection | No | Gene detection using SRST2 [30].
Sequence typing | SRST2 base – typing | No | Sequence typing using SRST2 [30].
Taxonomic classification | ITSx extractor | No | Extracts the highly variable ITS1 and ITS2 subregions from ITS sequences [32].
Taxonomic classification | Kraken 2 | No | Kmer-based taxonomic classification [33].
Taxonomic classification | Krona | Yes | Interactive visualization of Kraken 2 output.
Utilities | Pipeline combine | No | Helper tool to combine the output of multiple pipeline runs.
Utilities | Pipeline viral consensus—Extract sequences | No | Helper tool to combine the consensus sequences generated by the viral consensus pipeline.
Utilities | SCREENED | No | Assessment of the effectiveness of a PCR method (using an amplicon, primers, and probe) for a large set of sequences [34].
Utilities | SCREENED2 | No | Updated version of SCREENED [34].
Variant calling | Clair3 variant calling | No | Variant calling using Clair3 [35].
Variant calling | SAMtools variant calling | No | Variant calling using SAMtools [17].
Variant calling | SAMtools variant filtering | No | In-house variant filtering pipeline for variants called using SAMtools.

The custom tools that are available in the Galaxy @Sciensano instance. The first column shows the category. The second column contains the tool name. The third column indicates whether the tool generates an HTML output report. The last column contains a short description of the tool with a reference to the corresponding manuscript (if available)

Abbreviations: AMR antimicrobial resistance, ONT Oxford Nanopore Technologies

Custom tools and databases

The tools exclusive to our instance can be divided into two categories. One category involves wrappers around command line utilities, similar to other ToolShed installations, whereas the other category involves extended implementations that provide additional functionality built on top of these tools. An example of a wrapper with extra functionality is the custom wrapper for NCBI AMRFinder+ [25]. This custom wrapper generates formatted output in an easy-to-understand HTML report (Fig. 1) that includes important traceability information relevant for public health and clinical laboratories, such as the analysis date, tool parameters, tool version, full command call(s) and database version. In addition to these custom tool wrappers, Galaxy @Sciensano also includes several in-house developed pathogen characterization ‘push-button’ pipelines. Similar to the custom tool wrappers, the pipelines were designed for traceability and ease of use. They are available as stand-alone tools in Galaxy and perform complete characterization of WGS datasets of specific species, including pre-processing and extensive quality control. The pipelines generate comprehensive output reports with key findings. Many of these pathogen characterization pipelines have already been described in peer-reviewed publications, as indicated in Table 1. An overview of the common pre-processing steps and custom assays for each of the pipelines is provided in Supplementary Figure S3 and Supplementary Table S9, respectively.

Fig. 1.

Output of the local AMRFinder+ tool. HTML output of the local wrapper around the NCBI AMRFinder+ [25] tool. The top section (A) shows the analysis information, including the date of the analysis, the selected input files, and the parameter values. The second section (B) shows the output of the analysis in color-coded tables. The overview table provides an overview of the detected genes and their associations with antibiotics. The alignment table shows the statistics of the alignment between the genes in the database and the input assembly. The colors of the table rows correspond to perfect hits (dark green), imperfect full-length hits (light green), and partial hits (gray). Note that no partial hits were detected in this example. The last column of the overview table (C) contains links to the corresponding entries in the NCBI RefSeq database. Below the tables, there is a link to download the raw TSV file generated by AMRFinder+, and the database version that was used (D). The bottom section of the report (E) contains the command line call that was used to generate the results and the citation for the tool used.
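The row coloring described in the caption can be pictured as a simple rule on alignment statistics. The sketch below is only an illustration: the thresholds (100% identity and coverage for a perfect hit, full coverage with lower identity for an imperfect full-length hit, anything shorter counted as partial) are assumptions, and the function name `classify_hit` and the gene labels are hypothetical rather than taken from the wrapper.

```python
def classify_hit(identity_pct: float, coverage_pct: float) -> str:
    """Assign a report category to one alignment (assumed thresholds)."""
    if coverage_pct < 100.0:
        return "partial"                # gray rows in the report
    if identity_pct < 100.0:
        return "imperfect full-length"  # light green rows
    return "perfect"                    # dark green rows


# Hypothetical alignment statistics, used only to exercise the rule.
example_hits = [
    {"gene": "gene_A", "identity": 100.0, "coverage": 100.0},
    {"gene": "gene_B", "identity": 98.7, "coverage": 100.0},
    {"gene": "gene_C", "identity": 99.1, "coverage": 62.0},
]

categories = {
    hit["gene"]: classify_hit(hit["identity"], hit["coverage"])
    for hit in example_hits
}
print(categories)
```

Keeping the rule in one small function makes it easy to adjust the cut-offs if a laboratory uses different definitions of a full-length hit.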

In addition to the custom tools, 82 databases were made available for our Galaxy @Sciensano instance, for use with both the ToolShed installations (e.g., databases for BLASTN) and the custom tools (e.g., typing schemes for the ‘Sequence typing’ tools, or AMR & virulence gene databases for the ‘Gene detection’ tools). An overview of these databases is provided in Table 2. Currently, 66 of these 82 databases are automatically synchronized on a weekly basis to ensure up-to-date results. For example, if genes or mutations associated with AMR are added to the ResFinder4 [36] database, the automated weekly updates ensure that these can be detected by the tool in Galaxy. Similarly, new sequence types and alleles can be detected by the ‘Sequence typing’ tools as they are introduced into the underlying PubMLST, BIGSdb Pasteur, or EnteroBase databases [37–39]. The custom tools include the last database update and the last database change dates in the output report for traceability. A major advantage of Galaxy is that it eliminates the need to store the databases locally, which can be problematic as they can take up a lot of disk space. For example, the instance currently offers 75 pre-indexed sequence typing and gene detection databases for BLAST [28], KMA [29] and SRST2 [30], totaling over 800 GB, allowing for rapid querying with the ‘Sequence typing’ tools. In addition, several custom Kraken 2 [33] databases are available, which are even larger and require a substantial amount of storage to build and query efficiently, which is not feasible without dedicated hardware. Kraken 2 jobs on our Galaxy instance are run on a dedicated host where these databases are pre-loaded into random access memory (RAM), greatly increasing the speed of the analysis. The largest in-house Kraken 2 database, named ‘full’, totals 1.6 TB and includes reference and representative sequences from NCBI across the following taxonomic groups: animals, archaea, bacteria, fungi, plants, protozoa and viruses (accessed on February 24th, 2024), along with the human reference genome (RefSeq accession GCF_000001405.40). The selection criteria for adding sequences to this database are detailed in Table 2. This database can be used for accurate and comprehensive profiling of microbial datasets, for example, to identify contaminants in WGS isolate datasets or the species composition of metagenomic datasets.
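The distinction between the last database update and the last database change can be illustrated with a minimal sketch. It assumes that a change is detected by comparing a checksum of the downloaded content with the stored one; this is an assumption for illustration, not a description of the actual synchronization scripts, and the function and field names (`sync_database`, `last_update`, `last_change`) are hypothetical.

```python
import hashlib
from datetime import date


def sync_database(record: dict, new_content: bytes, today: date) -> dict:
    """Return updated metadata after one synchronization run."""
    digest = hashlib.sha256(new_content).hexdigest()
    updated = dict(record)
    # The update date moves forward on every successful run.
    updated["last_update"] = today.isoformat()
    # The change date moves only when the content actually differs.
    if digest != record.get("sha256"):
        updated["sha256"] = digest
        updated["last_change"] = today.isoformat()
    return updated


# Two weekly runs that retrieve identical (made-up) content.
content = b">allele_1\nACGT\n"
record = sync_database({}, content, date(2024, 9, 2))
record = sync_database(record, content, date(2024, 9, 9))
print(record["last_update"], record["last_change"])
```

Reporting both dates lets a reader of the output report see that the database was checked recently even when its content has not changed for some time.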

Table 2.

Overview of the available databases

Tools | Database name | Description | Automatically updated
blastn, tblastn, tblastx | nt | Collection of all nucleotide sequences. | No
blastn, gene detection (a) | ARG-ANNOT | Antimicrobial resistance genes [40]. | Yes
blastn, gene detection (a) | CARD | Antimicrobial resistance genes [41]. | Yes
blastn, gene detection (a) | NCBI 16S | NCBI 16S database. | No
blastn, gene detection (a) | NCBI AMR genes (NDARO) | Antimicrobial resistance genes [25]. | Yes
blastn, gene detection (a) | NCBI stress response genes (NDARO) | Stress response genes extracted from NDARO. | Yes
blastn, gene detection (a) | PlasmidFinder – entero | Plasmid replicon database for Enterobacteriaceae [42]. | Yes
blastn, gene detection (a) | PlasmidFinder – Gram-positive | Plasmid replicon database for Gram-positive bacteria [42]. | Yes
blastn, gene detection (a) | SerotypeFinder – H-type (E. coli) | Serotype determining genes for E. coli H-type [43]. | Yes
blastn, gene detection (a) | SerotypeFinder – O-type (E. coli) | Serotype determining genes for E. coli O-type [43]. | Yes
blastn, gene detection (a) | UniVec | Database of nucleic acid sequences that might be of vector origin [44]. | No
blastn, gene detection (a) | UniVec (CDS) | Database of nucleic acid sequences that might be of vector origin [44]. | No
blastn, gene detection (a) | UniVec (Full) | Database of nucleic acid sequences that might be of vector origin [44]. | No
blastn, gene detection (a) | VirulenceFinder – E. coli | Virulence genes for E. coli [45]. | Yes
blastn, gene detection (a) | VirulenceFinder – Listeria | Virulence genes for Listeria [45]. | Yes
blastn, gene detection (a) | VirulenceFinder – Shiga toxins | Shiga-toxin genes [45]. | Yes
blastn, gene detection (a) | VirulenceFinder – S. aureus (exo-enzyme) | S. aureus exo-enzyme encoding genes [45]. | Yes
blastn, gene detection (a) | VirulenceFinder – S. aureus (host immunity) | S. aureus host immunity virulence factor genes [45]. | Yes
blastn, gene detection (a) | VirulenceFinder – S. aureus (toxins) | S. aureus toxin genes [45]. | Yes
blastn, gene detection (a) | VirulenceFactor DB (core) | Virulence genes [46]. | No
blastn, gene detection (a) | VirulenceFactor DB (full) | Virulence genes [46]. | No
blastn, tblastn, tblastx | nr | Collection of all non-redundant nucleotide sequences. | No
ResFinder4, blastn, gene detection (a) | ResFinder4 | Database with AMR genes and mutations from the ResFinder4 tool [36]. | Yes
Sequence typing (a) | Bacillus cereus—cgMLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Bacillus cereus—Classic MLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Brucella melitensis—cgMLST | Typing scheme (cgMLST.org) | Yes
Sequence typing (a) | Brucella spp.—MLST (21 loci) | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Brucella spp.—MLST (9 loci) | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Campylobacter jejuni/coli—cgMLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Campylobacter jejuni/coli—Classic MLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Enterobacter cloacae—Classic MLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Escherichia coli—Classic MLST Pasteur | Typing scheme (BIGSdb IPP [38]) | Yes
Sequence typing (a) | Escherichia coli—Classic MLST Warwick | Typing scheme (EnteroBase [39]) | Yes
Sequence typing (a) | Escherichia coli—cgMLST | Typing scheme (EnteroBase [39]) | Yes
Sequence typing (a) | Escherichia coli—cgMLST | Typing scheme (INNUENDO) | No
Sequence typing (a) | Enterococcus faecalis—cgMLST | Typing scheme (cgMLST.org) | Yes
Sequence typing (a) | Enterococcus faecalis—Classic MLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Enterococcus faecium—cgMLST | Typing scheme (cgMLST.org) | Yes
Sequence typing (a) | Enterococcus faecium—Classic MLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Gallibacterium—Classic MLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Klebsiella – cgMLST | Typing scheme (cgMLST.org) | Yes
Sequence typing (a) | Klebsiella—Classic MLST | Typing scheme (BIGSdb IPP) | Yes
Sequence typing (a) | Klebsiella—scgMLST | Typing scheme (BIGSdb IPP) | Yes
Sequence typing (a) | Leptospira—cgMLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Leptospira—MLST (1)—Boonsilp et al | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Leptospira—MLST (2)—Varni et al | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Leptospira—MLST (3)—Ahmed et al | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Listeria—cgMLST | Typing scheme (BIGSdb IPP [47]) | Yes
Sequence typing (a) | Listeria—Classic MLST | Typing scheme (BIGSdb IPP [47]) | Yes
Sequence typing (a) | Listeria—Antibiotic Resistance | Typing scheme (BIGSdb IPP [47]) | Yes
Sequence typing (a) | Listeria—Metal Detergent Resistance | Typing scheme (BIGSdb IPP [47]) | Yes
Sequence typing (a) | Listeria—Serogroup | Typing scheme (BIGSdb IPP [47]) | Yes
Sequence typing (a) | Listeria—Species Confirmation | Typing scheme (BIGSdb IPP [47]) | Yes
Sequence typing (a) | Listeria—Virulence | Typing scheme (BIGSdb IPP [47]) | Yes
Sequence typing (a) | Mycobacterium—cgMLST (cgMLST.org) | Typing scheme (cgMLST.org) | Yes
Sequence typing (a) | Mycobacterium—cgMLST (PubMLST) | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Mycobacterium—Classic MLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Mycoplasma gallisepticum—cgMLST | Typing scheme (cgMLST.org) | Yes
Sequence typing (a) | Mycoplasma gallisepticum—MLST (Bekö et al.) | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Neisseria—cgMLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Neisseria—Classic MLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Pseudomonas aeruginosa—Classic MLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Staphylococcus aureus—cgMLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Staphylococcus aureus—Classic MLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Salmonella—cgMLST | Typing scheme (EnteroBase [39]) | Yes
Sequence typing (a) | Salmonella—Classic MLST | Typing scheme (EnteroBase [39]) | Yes
Sequence typing (a) | rMLST Taxonomic identification | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Vibrio cholerae—cgMLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Vibrio cholerae—Classic MLST | Typing scheme (PubMLST [37]) | Yes
Sequence typing (a) | Yersinia—cgMLST (EnteroBase) | Typing scheme (EnteroBase [39]) | Yes
Sequence typing (a) | Yersinia enterocolitica—cgMLST (Pasteur) | Typing scheme (BIGSdb IPP [48]) | Yes
Sequence typing (a) | Yersinia pseudotuberculosis—cgMLST (Pasteur) | Typing scheme (BIGSdb IPP [48]) | Yes
Sequence typing (a) | Yersinia spp.—cgMLST (Pasteur) | Typing scheme (BIGSdb IPP [48]) | Yes
Sequence typing (a) | Yersinia—Classic MLST (Achtman) | Typing scheme (EnteroBase [39]) | Yes
Sequence typing (a) | Yersinia—Classic MLST (McNally) | Typing scheme (EnteroBase [39]) | Yes
Kraken 2 | Full | Kraken 2 Microbial database with added animal and plant reference/representative genomes. Animals: complete and chromosome genomes from RefSeq. Plants: complete, chromosome, and scaffold genomes from RefSeq; complete and chromosome genomes from GenBank. | No (b)
Kraken 2 | Microbial | Filtered representative/reference genomes from the taxonomic groups archaea, bacteria, fungi, protozoa and viruses. Additionally, it contains the human reference genome. Bacteria and viruses: complete genomes from RefSeq. Archaea, fungi and protozoa: complete, chromosome and scaffold genomes from RefSeq. | No (b)
Kraken 2 | NRL-GMO | Custom collection of representative genomes from RefSeq and GenBank | No (b)

Overview of the available databases and the corresponding tools in Galaxy @Sciensano. Databases uniquely associated with a particular tool, such as the marker genes from CheckM or the locus database for ConFindr, are not included in this overview

Abbreviations: n/a not available, IPP Institut Pasteur Paris

(a) Three separate tools are available for ‘gene detection’ and ‘sequence typing’ with different underlying detection methods (BLAST, KMA, and SRST2)

(b) The Kraken 2 databases are periodically updated semi-automatically

Workflows

Galaxy allows the creation of workflows, which are series of tools and dataset actions (e.g., renaming or formatting) chained together that run in sequence as a batch operation. The ability to run multiple workflows from a single interface reduces workload and ensures repeatability between users. In addition, as workflows support batching, launching multiple tools on tens or hundreds of datasets becomes possible. Users can create workflows from the workflow editor or extract them from a completed analysis in their history. These workflows can be shared with specific users or made available to all the users on the Galaxy instance. We have provided a number of workflows for all users, including a species-agnostic workflow for QC and de novo assembly of bacterial WGS data (‘QC and pre-processing for microbial Illumina data’) and workflows for the examples described in the “Results” section of this manuscript.
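The idea of chaining steps and applying the same chain to many datasets can be shown with a toy sketch. This is not Galaxy's workflow engine or API: the step functions, tool names, and version strings below are hypothetical stand-ins, and the example only demonstrates how a fixed sequence of steps, applied in batch, can record which tool and version produced each result.

```python
from typing import Callable

# One workflow step: (tool name, tool version, function applied to the data).
Step = tuple[str, str, Callable[[str], str]]


def run_workflow(steps: list[Step], dataset: str) -> tuple[str, list[dict]]:
    """Apply every step in order and record what was run."""
    provenance = []
    data = dataset
    for name, version, func in steps:
        data = func(data)
        provenance.append({"tool": name, "version": version})
    return data, provenance


# Hypothetical two-step workflow on toy "reads".
steps: list[Step] = [
    ("trim_ends", "1.0", lambda reads: reads.strip("N")),
    ("uppercase", "0.2", str.upper),
]

# Batch mode: the same workflow is launched on every dataset.
datasets = {"sample_1": "NNacgtNN", "sample_2": "ttgaN"}
results = {name: run_workflow(steps, reads) for name, reads in datasets.items()}

for name, (output, provenance) in results.items():
    print(name, output, provenance)
```

Because the step list is defined once and reused, every dataset in the batch is processed identically, which is the property that makes workflow results repeatable between users.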

Technical setup and maintenance

The Galaxy @Sciensano instance runs on a local virtualized Ubuntu 22.04 server. The underlying Galaxy version is updated regularly, with the aim of being at most two versions behind the latest public Galaxy release. A ‘Development, Test, Acceptance, and Production’ (DTAP) cycle is followed, with separate Galaxy instances where tools and Galaxy framework updates are developed and tested before deployment to the main Galaxy instance to identify and resolve potential issues as early as possible. Ansible (https://github.com/ansible/ansible) is used to deploy custom tools and Galaxy itself, ensuring consistency and repeatability across these environments. Tool versioning is used consistently to ensure reproducibility and traceability. Custom tools and pipelines are versioned using Lmod (https://github.com/TACC/Lmod), whereas the ToolShed installations are maintained using Galaxy Ephemeris (https://github.com/galaxyproject/ephemeris) and are versioned independently of Lmod. Slurm [49] is used for job management and resource allocation to ensure a fair and efficient distribution of resources across users. The internal codebase (all internally developed tools, custom Galaxy tool wrappers, and the Ansible configuration) is tracked using Git repositories for traceability. A code review system is used to evaluate changes to the internal codebase prior to release. Automated testing via Jenkins (available at https://www.jenkins.io/) is used to test the internal codebase, and the Galaxy instance itself, via API calls [50].
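The stated aim of staying at most two versions behind the latest public release can be expressed as a small check. The sketch below assumes that an ordered list of release identifiers is supplied by the caller; the function name `releases_behind` is hypothetical, the installed version is an invented example, and the release list is illustrative input rather than something retrieved from the Galaxy project.

```python
def releases_behind(installed: str, releases: list[str]) -> int:
    """Count releases newer than `installed` in an oldest-to-newest list.

    Raises ValueError if the installed release is not in the list.
    """
    return len(releases) - 1 - releases.index(installed)


# Illustrative input, ordered from oldest to newest.
known_releases = ["23.0", "23.1", "23.2", "24.0", "24.1"]
MAX_LAG = 2  # policy described in the text

lag = releases_behind("23.2", known_releases)
print(f"installed release is {lag} release(s) behind; within policy: {lag <= MAX_LAG}")
```

A check of this kind could run as part of automated testing, so that an instance drifting beyond the allowed lag is flagged before it affects users.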

Results

Galaxy usage

Since its launch in October 2019, the number of external users (i.e., people who do not work at our institute) of Galaxy @Sciensano has grown steadily, currently reaching 755 (accessed on September 20th, 2024). An average of 4,819 jobs were started per month over the last twelve months, as shown in Fig. 2. Of the 60,541 jobs completed since the launch, the most commonly used tools were the in-house developed tools (63.13% of the total number of jobs), followed by ToolShed installations (24.39%) and in-house developed ‘push-button’ pipelines (12.49%). Failed jobs and back-end jobs such as file conversions are not included in the job statistics. Note that these user and job numbers generally do not include the internal use of Galaxy by Sciensano employees, who have their own private instance.
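As a worked illustration of these proportions, the percentages can be converted back into approximate job counts. The counts computed below are derived from the rounded percentages and are not figures reported by the authors, so they should be read as estimates only.

```python
total_jobs = 60_541

# Shares of completed jobs per tool category, as reported in the text.
shares_pct = {
    "in-house tools": 63.13,
    "ToolShed installations": 24.39,
    "push-button pipelines": 12.49,
}

# Back-calculated counts; rounding means they need not sum to the total.
approx_jobs = {
    category: round(total_jobs * pct / 100)
    for category, pct in shares_pct.items()
}

for category, count in approx_jobs.items():
    print(f"{category}: ~{count:,} jobs")
print(f"sum of shares: {sum(shares_pct.values()):.2f}%")
```

The shares add up to 100.01% because each percentage was rounded to two decimals, which is why the estimated counts slightly exceed the reported total.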

Fig. 2.


Overview of the jobs executed on the Galaxy instance. The number of jobs started on the Sciensano public Galaxy instance. The y-axis shows the number of jobs. The x-axis shows the date grouped by month. The colors represent the tool categories. The categories refer to (a) in-house developed ‘push-button’ pipelines that perform complete pathogen characterization (‘inhouse_pipeline’); (b) in-house developed tools or wrappers for specific analyses, such as sequence typing or gene detection (‘inhouse_tool’); and (c) ToolShed installations. Usage data were extracted from the Galaxy database on September 20th, 2024

Example use cases

The following examples demonstrate how the tools in the Galaxy @Sciensano instance can be used to perform bioinformatics analyses in a public health or clinical context. These examples, extracted from previously published research, demonstrate how Galaxy @Sciensano can be used for WGS data analysis following best practices. However, they cover only a portion of the available functionality, and many additional tools are available, as indicated in Table 1 and Supplementary Table S1. The workflows for both examples are available as shared Galaxy workflows that can be used and customized by users.

STEC outbreak analysis

Dataset

In this example, seven Shiga toxin-producing Escherichia coli (STEC) isolates were analyzed, four of which were confirmed to be part of a 2012 Belgian foodborne outbreak traced back to beef [51]. The goal of this analysis was to characterize the isolates based on their AMR and virulence gene profiles, perform typing, and explore their relatedness through phylogenomic analysis, methods that are often used for ensuring food safety. This approach aimed to identify potential links between isolates obtained from the contaminated food products and isolates collected from clinical cases. WGS data for the seven isolates were generated using the Illumina MiSeq and are available under BioProject accession PRJNA574887, with accession numbers for each dataset listed in Supplementary Table S2. These files can be retrieved from the public data library in our Galaxy instance (labelled ‘Galaxy @Sciensano—Example data STEC’), or downloaded directly from the Sequence Read Archive [52] (SRA) or the European Nucleotide Archive (ENA) [53]. Alternatively, the ‘Faster Download and Extract Reads in FASTQ’ tool (Galaxy version 3.1.1) can be used to download the files from SRA directly into the Galaxy history. A schematic overview of the complete analysis is provided in Fig. 3.

Fig. 3.


Schematic overview of the example STEC analysis. Overview of the steps involved in the STEC outbreak analysis (“STEC outbreak analysis” section). The green boxes correspond to tools in Galaxy. The figure numbers in italics below the step name correspond to additional figures that show concrete visualizations of the corresponding steps. Yellow boxes represent datasets with the file format indicated in brackets. Note that, for visual clarity, only output files that are used for further processing are shown. The arrows indicate the data flow. The analyses are grouped by category as indicated by the text in the grey boxes. Notes: (a) the ‘Tree Visualization’ tool is not available in the tool panel, but is built-in as a visualization option within Galaxy

Quality control & read trimming

The analysis started with a quality check of the raw input data using FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) (Galaxy version 0.73) (Supplementary Figure S1a). The reads were then trimmed using Trimmomatic [54] (Galaxy version 0.39) (Supplementary Figure S1b) and checked with FastQC again. MultiQC [55] (Galaxy version 1.11) was subsequently used to generate a single report combining the results for all the input datasets (Supplementary Figure S1c). Afterwards, datasets were screened for inter-species contamination (i.e., the presence of reads originating from organisms other than E. coli) using a custom Kraken 2 [33] wrapper (Galaxy version 2.1.1) and the in-house created ‘full’ database. Table 2 lists the available custom Kraken 2 databases. The Kraken 2 results were visualized using the ‘Krona pie chart’ tool (Galaxy version 2.8.1) (Supplementary Figure S1d). Intra-species contamination (i.e., the presence of multiple E. coli strains within the same dataset) was checked using the custom wrapper for the ConFindr [23] tool (Galaxy version 0.1) (Supplementary Figures S1e and S1f). All of these tools can be batched across multiple input files, so that parameters need to be set only once for each tool. These species-agnostic pre-processing steps are available as a public workflow in Galaxy @Sciensano, labelled ‘QC and pre-processing for microbial Illumina data’ (Supplementary Figure S1g). Note that an extended workflow with the additional steps of this example, except for the phylogenomic analysis, is also available as ‘Galaxy @Sciensano—STEC outbreak analysis example’. The quality checks indicated that all input FASTQ datasets were of high quality, with no evidence of inter- or intra-species contamination. Therefore all isolates could be retained for further analysis.

De novo assembly

The processed reads were then de novo assembled using SPAdes [56] (Galaxy version 3.15.5) to generate contigs. Afterwards, QUAST [57] (Galaxy version 5.2.0) was used to generate a composite report of assembly statistics for the seven input datasets (Supplementary Figure S1h). The assembly statistics for the seven STEC isolates are listed in Supplementary Table S3. The cumulative length plot in the QUAST report showed that all isolates had a total assembly length close to the expected ~5 Mb, although some assemblies were more fragmented than others. The completeness of the assemblies was checked using the custom wrapper for the ‘CheckM—Lineage workflow’ [21] (Galaxy version 0.1) (Supplementary Figure S1i). This tool can also provide information about intra-species contamination based on the number of variants for single-copy core genes. The CheckM analysis showed that, for all STEC datasets, almost all (> 99.9%) core genes were present in the assembly, with no signs of inter-species contamination. The median sequencing depth was estimated by mapping the processed reads to the assembled contigs using Bowtie2 [58] (Galaxy version 2.5.3), followed by SAMtools [17] depth (Galaxy version 1.15.1). Afterwards, Datamash (available at https://www.gnu.org/software/datamash/, Galaxy version 1.8) was used to calculate the median depth value from the tabular SAMtools depth output. The ‘Collect Alignment Summary Metrics’ tool of the Picard tool suite (available at https://github.com/broadinstitute/picard) was then used to calculate the mapping rate. The median depth and mapping rate across all datasets were 39 × and 99.75%, respectively, with all datasets above the minimum thresholds that are enforced in the STEC pipeline [20].
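The median depth calculation described above can be sketched outside Galaxy as follows. This is a minimal illustration, not the Datamash step used in the workflow: it assumes the usual three-column, tab-separated `samtools depth` output (reference name, position, depth), and the contig names and depth values in the toy input are made up. Note that `samtools depth` omits zero-coverage positions unless run with `-a`, which affects the resulting median.

```python
import io
import statistics


def median_depth(depth_lines):
    """Return the median of the depth column of `samtools depth` output.

    `depth_lines` is any iterable of text lines with tab-separated columns:
    reference name, 1-based position, depth.
    """
    depths = []
    for line in depth_lines:
        line = line.strip()
        # Skip blank lines and an optional header line starting with '#'.
        if not line or line.startswith("#"):
            continue
        _contig, _position, depth = line.split("\t")[:3]
        depths.append(int(depth))
    if not depths:
        raise ValueError("no depth records found")
    return statistics.median(depths)


# Toy input for illustration only; these values are not from the study.
example = io.StringIO(
    "NODE_1\t1\t35\n"
    "NODE_1\t2\t39\n"
    "NODE_1\t3\t41\n"
    "NODE_2\t1\t38\n"
    "NODE_2\t2\t44\n"
)
print(median_depth(example))  # prints 39
```

For real data, the function can be given an open file handle instead of the in-memory example.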

Identification of antimicrobial resistance genes and virulence genes

The custom wrapper for AMRFinder+ [25] (Galaxy version 0.1) was used to predict AMR based on the detection of genes and point mutations (Supplementary Figure S1j). The tool generates an HTML output report summarizing the results and providing additional information, such as the database version. The output report for the TIAC1153 isolate is shown in Fig. 1. This isolate was predicted to be resistant to beta-lactam antibiotics (blaTEM-1), fosfomycin (glpT E448K), kanamycin (aph(3')-Ia), streptomycin (aph(3'')-Ib and aph(6)-Id), sulfonamide (sul2), tetracycline (tet(A)), and trimethoprim (dfrA8). With the exception of the glpT E448K mutation, all matches were perfect (i.e., 100% sequence identity over the entire length), as indicated by the dark green color in the output table. For the remaining six isolates, only the glpT E448K mutation was detected, as shown in Supplementary Table S4.

Virulence genes were extracted using the custom ‘Gene detection – BLAST’ tool (Galaxy version 0.1) with the ‘VirulenceFinder E. coli’ database [45]. An overview of the available databases for this tool is provided in Table 2. Many virulence genes were detected in the isolates, as shown in Supplementary Table S5. Six of the seven isolates carried the stx1 and stx2 genes encoding the Shiga-toxin, with TIAC1660 carrying only stx2, confirming them as STEC. All hits were perfect matches to the reference sequence, except for stx2 in TIAC1660, which contained three single nucleotide polymorphisms (SNPs), resulting in a sequence identity of 99.76%. The tool generates an HTML output report with a color-coded output table (Supplementary Figure S1k). Rows are colored based on alignment quality: dark green for perfect matches to the database sequence, light green for imperfect full-length matches, and gray for partial matches. For imperfect and partial hits, such as the aslA gene, the user has the option to manually inspect the alignment of the contigs to the gene sequence to investigate the potential impact of the mismatched bases on the open reading frame of the gene. The report also includes information on the function of the detected virulence genes in the output table.
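The color-coding logic described above can be illustrated with a short sketch. The function names and the use of percentage identity and coverage as inputs are assumptions made for this example; the actual implementation of the ‘Gene detection’ tool is not described here. The gene length of 1,250 bp is a hypothetical value chosen only to show how three mismatches translate into a percentage identity, and it is not the actual length of stx2.

```python
def percent_identity(alignment_length, mismatches):
    """Percentage of identical positions in an ungapped alignment."""
    if alignment_length <= 0:
        raise ValueError("alignment_length must be positive")
    return 100.0 * (alignment_length - mismatches) / alignment_length


def classify_hit(identity, coverage):
    """Map a hit to the report categories described in the text.

    Both arguments are percentages; `coverage` is the fraction of the
    database sequence covered by the alignment.
    """
    if coverage < 100.0:
        return "partial (gray)"
    if identity < 100.0:
        return "imperfect full-length (light green)"
    return "perfect (dark green)"


# Hypothetical example: three SNPs in a full-length hit of 1,250 bp.
identity = percent_identity(alignment_length=1250, mismatches=3)
print(round(identity, 2), classify_hit(identity, coverage=100.0))
```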

Sequence typing

Multi-locus sequence typing (MLST) was performed using the ‘Sequence typing – BLAST’ tool (Galaxy version 0.1) and the Warwick MLST scheme for E. coli, which is automatically updated weekly via the EnteroBase API [39]. An overview of the available typing schemes is provided in Table 2. The in-house developed ‘Gene detection’ and ‘Sequence typing’ tools both generate color-coded HTML output reports (Supplementary Figure S1l), and are also available for detection methods other than BLAST 2.14.0 [28], namely SRST2 0.2.0 [30] and KMA 1.4.12a [29]. Color codes are used to indicate: perfect matches (dark green), imperfect full-length matches (light green), imperfect partial matches (gray), missing loci (red), and ‘multi-hits’ where multiple alleles have the same alignment score (yellow). All isolates were assigned to ST11 with the exception of TIAC1660 which was assigned to ST223. Serotypes were determined using the in-house ‘Gene detection (KMA)’ tool and the SerotypeFinder database [43]. For some databases, the gene detection tools report additional metadata associated with the genes, in this case the corresponding O- and H-types. KMA was used as the detection method, which was shown to produce more accurate results than BLAST-based detection for these loci [59]. Except for isolate TIAC1660, which was assigned to O113:H21, all isolates were assigned to O157:H7.

Phylogenomic analysis

The typing analysis described in the previous section was re-run using the EnteroBase [39] core-genome MLST (cgMLST) scheme containing 2,513 loci, with the TSV outputs exported to separate datasets in the Galaxy history. These tabular output files were then used to construct a minimum spanning tree using the ‘MLST phylogeny’ tool (Galaxy version 0.1). This tool allows filtering of the allele matrix to exclude datasets with low numbers of detected cgMLST loci, or to exclude loci based on their absence in the tested strains. The tool uses GrapeTree 2.2 [60] for tree construction, which offers several algorithms that can be selected from the Galaxy interface, including the recommended MSTreeV2 minimum-spanning algorithm [60, 61]. The output consists of an HTML output report containing information on the allele matrix filtering, pairwise allele distances, a visualization of the phylogeny, and the corresponding command line calls (Fig. 4 and Supplementary Figure S1m). The tool has an option to export the resulting phylogeny in Newick format to a separate file in the Galaxy history, so that it can be visualized using the built-in Galaxy tool 'Phylogenetic Tree Visualization' in the 'Visualize' menu (Supplementary Figure S1n). The resulting phylogeny contains a clade carrying the four outbreak isolates, clearly separated from the three unrelated isolates. This is also evident from the pairwise allele distance matrix, which shows 0 to 1 allele differences between the four outbreak isolates, and much larger distances to the other three isolates. For the six ST11 isolates, excluding the isolate with an unrelated ST since this sample was too distant to include in a reference-based analysis, an additional SNP-based analysis was performed to investigate the relationships between isolates at a higher resolution. The SNP analysis considers nearly the entire genome, while cgMLST analysis is restricted to the core loci. 
The custom wrapper around the in-house developed open-source tool PACU [15] (Galaxy version 0.1) was used for this analysis. The tool requires BAM input files with the processed reads mapped to the reference genome, which can be generated using the associated ‘PACU mapping helper’ (Supplementary Figure S1o) or other read mappers in Galaxy, such as Bowtie2 [58] or BWA [62]. A BED file with regions to be omitted from the SNP analysis can optionally be provided as input. In this example, a file with the coordinates of prophage regions in the reference genome was used. The PACU results are provided in Fig. 5 and Supplementary Figure S1p, which again shows the outbreak isolates on a clade clearly separated from the unrelated isolates, with high bootstrap support on the branch. These phylogenomic tests indicate that there is a high probability that these four isolates are related.
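To make the notion of pairwise cgMLST allele distances concrete, the following sketch counts the loci at which two allele profiles differ. It is a simplified illustration and not the GrapeTree MSTreeV2 algorithm used by the ‘MLST phylogeny’ tool. The isolate names and allele numbers are invented, and ignoring loci that are missing in either profile is an assumption about how missing data might be handled.

```python
from itertools import combinations

MISSING = None  # placeholder for a locus without an allele call


def allele_distance(profile_a, profile_b):
    """Count loci where both profiles have a call and the alleles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        if a is not MISSING and b is not MISSING and a != b
    )


def distance_matrix(profiles):
    """Return a nested dict with the pairwise allele distances."""
    names = sorted(profiles)
    matrix = {name: {name: 0} for name in names}
    for x, y in combinations(names, 2):
        d = allele_distance(profiles[x], profiles[y])
        matrix[x][y] = matrix[y][x] = d
    return matrix


# Invented five-locus profiles; real cgMLST schemes contain thousands of loci.
profiles = {
    "isolate_A": [1, 4, 2, 7, 3],
    "isolate_B": [1, 4, 2, 7, 3],
    "isolate_C": [1, 4, 2, 8, 3],
    "isolate_D": [5, 9, 6, 2, MISSING],
}
matrix = distance_matrix(profiles)
for name in sorted(matrix):
    print(name, [matrix[name][other] for other in sorted(matrix)])
```

In this toy example, isolates A, B, and C differ by at most one allele, whereas isolate D differs from the others at every locus that could be compared, mirroring the pattern of a tight outbreak cluster next to an unrelated isolate.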

Fig. 4.


Output of the ‘MLST phylogeny’ tool. Part of the output report for the ‘MLST phylogeny’ tool. The top two sections contain the basic analysis information and the allele matrix filtering section, which are not shown for this example. The allele distance section shows the pairwise allele distances between the samples expressed as the number of cgMLST alleles. The ‘commands’ and ‘citations’ sections contain the commands that were used and the citations for the corresponding tools, respectively. The full figure including the Galaxy interface is provided in Supplementary Figure S1m. The phylogeny shows a clade of four closely related isolates with 0–1 allele difference(s): TIAC1151, TIAC1152, TIAC1165, and TIAC1169. These four isolates were part of the outbreak, with isolates TIAC1151 and TIAC1152 collected from beef and TIAC1165 and TIAC1169 collected from feces of infected humans. This analysis suggested that TIAC1638, TIAC1153, and TIAC1660 are not associated with the outbreak. An additional whole genome SNP analysis using PACU was performed to confirm this hypothesis (see Fig. 5)

Fig. 5.


Output of the ‘PACU’ SNP phylogeny tool. Part of the output report for the PACU SNP phylogeny workflow [15]. The full report contains the following sections: analysis info, parameters, read mapping statistics, variant calling & filtering statistics, region filtering, phylogeny, pairwise SNP distances, and citations. The SNP matrix and the phylogeny in Newick format can be downloaded from the report for further processing. The automated model selection selected the Tamura 3-parameter mutation model (T92) as the best model based on the Bayesian information criterion. This model was then used to construct the tree. This analysis shows that the outbreak isolates (i.e., TIAC1151, TIAC1152, TIAC1165, and TIAC1169) clearly clustered together based on pairwise SNP distances and their positions in the phylogeny, consistent with the results of the cgMLST approach (see Fig. 4). Note that the TIAC1660 dataset was excluded from the analysis because it was too distant from the reference genome and the other datasets. The full figure including the Galaxy interface is provided in Supplementary Figure S1p

The Shiga toxin-producing Escherichia coli pipeline

The example described above shows how the tools of the Galaxy @Sciensano instance can be used for the characterization and phylogenomic analysis of WGS data from bacterial isolates. For many organisms, ‘push-button’ pipelines are available in Galaxy @Sciensano that perform complete isolate characterization starting from raw FASTQ input (Table 1). The pipelines are offered as stand-alone tools and generate interactive HTML output reports and tabular summary output files. These pipelines are designed for ease of use and traceability, with almost all tool parameters and quality thresholds pre-optimized to require minimal user customization (Supplementary Figure S1g). For example, the in-house developed STEC pipeline 1.1 [20] (Galaxy version 1.1) performs quality checks, sequence typing, AMR and virulence gene detection, serotype detection, and additional plasmid characterization, most of which have been discussed in the previous sections. One of the major advantages of these pipelines is that all results are combined into a single output report and summary file, both of which contain the most relevant information as well as the tool version numbers and commands for traceability (Supplementary Figure S1r).

For this example, the seven strains were also analyzed using the STEC pipeline. The in-house developed ‘Pipeline combine’ helper tool (Galaxy version 0.1) can then be used to combine the tabular output of the seven pipeline jobs into a single TSV file (Supplementary Figure S1s). This file can be used, for example, to annotate a phylogeny with the detected AMR genes, sequence types, or serotypes, or to store relevant genomic indicators in a database. The pipeline output can also be processed directly by the ‘MLST phylogeny’ tool to generate (cg)MLST-based phylogenies without the need to re-run the sequence typing tool.

Analysis of methicillin-resistant Staphylococcus aureus strains collected from hospitals in Benin

Datasets

For this analysis, WGS data of five Staphylococcus aureus isolates were analyzed. These isolates were collected in hospitals in Benin to investigate possible transmission between health care workers (HCWs) and patients [63]. The datasets for this study are publicly available under BioProject accession number PRJNA936823, with accession numbers for each dataset listed in Supplementary Table S2. Two of the selected datasets were collected from patients and three from HCWs (Supplementary Table S2). Note that this example uses only a subset of the isolates collected in the corresponding study. Sequencing quality issues were identified during the initial processing of these datasets, so thorough quality control was needed to ensure that valid conclusions could be drawn from the data analysis. This example demonstrates how to handle datasets with potential contamination issues and highlights specific tools for typing S. aureus, SpaTyping and SCCmecTyping. Similar tools are available for other organisms, including Bacillus cereus (Btyper3 [27]) and Salmonella (SISTR [64]). A schematic overview of the processing is provided in Supplementary Figure S2a. An end-to-end ‘push-button’ pipeline for Staphylococcus aureus (Galaxy version 1.1) is also available (Supplementary Figure S2b) and was used to analyze all datasets in addition to the processing described below.

Quality control

The quality control steps described in the STEC example are species-agnostic, so the same Galaxy workflow can be used to perform basic quality control, read trimming, taxonomic classification of the reads and de novo assembly for S. aureus. For QUAST, the reference genome (RefSeq accession NC_007795.1) and its annotation in GFF3 format were provided as input files, allowing the tool to estimate the completeness of the genome based on gene presence.

The MultiQC report generated from the FastQC output did not reveal any major problems with the raw data quality. However, the combined QUAST output report revealed a substantially larger assembly length for the sa_0680 dataset (Supplementary Figure S2c), with a total assembly length more than twice that of the reference genome length (2.8 Mb). The assembly statistics for all datasets are listed in Supplementary Table S6. The total assembly lengths of the four other datasets were only slightly greater than the reference genome length. An unexpectedly high total assembly length can indicate the presence of reads originating from contaminants that are also assembled into contigs. The Kraken 2 analysis of dataset sa_0680 confirmed this, assigning ~ 13% of the reads to Enterococcus faecalis (Supplementary Figure S2d). As contamination can affect, for example, the AMR or virulence genes that are detected, it is essential that this type of screening is performed to ensure accurate results and subsequent interpretation. The Staphylococcus pipeline and other in-house developed pipelines include 13 quality criteria to identify potential problems with the quality of the input data. An overview of the quality metrics and their thresholds is provided in Supplementary Table S7. Most of these quality metrics can also be calculated using stand-alone tools available in Galaxy @Sciensano. This can be useful, for example, when analyzing data from organisms for which no pipeline is available.

Identification of AMR genes

The 'Pipeline combine' helper tool (Galaxy version 0.1) was used to generate a tabular overview of the detected AMR genes (Supplementary Figure S2e and Supplementary Table S8). The mecA gene, associated with resistance to methicillin, was detected in all strains, along with several other AMR genes associated with resistance to beta-lactam antibiotics, trimethoprim, fosfomycin, and tetracycline. Notably, the erm(C) and lsa(A) genes were detected only in the contaminated sa_0680 dataset. While this isolate would normally be flagged for resequencing due to the failed quality checks, we retained it in this example to demonstrate how Galaxy @Sciensano can be used to perform an in-depth investigation, and more precisely how this contamination affects the results of the WGS analysis. First, the contigs carrying these genes were extracted using the ‘seqtk subseq’ tool (Galaxy version 1.4, available at https://github.com/lh3/seqtk/), with a file containing the identifiers of those contigs. These contigs were then taxonomically classified by (1) aligning them to the complete nucleotide (‘nt’) database using the blastn [28] tool (Galaxy version 2.14.1), and (2) classifying the contigs using Kraken 2 (Galaxy version 2.1.1) with the ‘full’ database. The best BLAST hits for the NODE_21 contig carrying the lsa(A) gene were all E. faecalis chromosome sequences (Supplementary Figure S2f), which was consistent with the Kraken 2 classification of the contig. The contig carrying erm(C) matched an S. aureus plasmid and is, therefore, less likely to be a consequence of E. faecalis contamination. Without quality control, the presence of the lsa(A) gene would have been reported in the sa_0680 isolate, whereas this analysis shows that it is most likely a contamination artifact.
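The contig extraction step can be mimicked with a few lines of plain Python, which may be helpful for understanding what a subsetting tool such as ‘seqtk subseq’ does with a list of identifiers. This is a sketch and not the seqtk implementation: the function names are hypothetical, the sequences in the toy FASTA are made up, and the contig identifier is assumed to be the first whitespace-delimited token of the header line.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def subset_contigs(fasta_lines, wanted_ids):
    """Yield only the records whose identifier is listed in `wanted_ids`."""
    wanted = set(wanted_ids)
    for header, sequence in read_fasta(fasta_lines):
        tokens = header.split()
        identifier = tokens[0] if tokens else ""
        if identifier in wanted:
            yield header, sequence


# Toy assembly with invented sequences.
toy_fasta = [
    ">NODE_20 length=8\n", "ACGTACGT\n",
    ">NODE_21 length=12\n", "TTGACA\n", "GGCATC\n",
    ">NODE_22 length=4\n", "GGCC\n",
]
for header, sequence in subset_contigs(toy_fasta, ["NODE_21"]):
    print(f">{header}\n{sequence}")
```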

Since the detection method of the pipeline was set to ‘blast’, it does not report explicit information on the read depth of the detected AMR genes, which can provide an indication of their genomic context, as genes on high-copy plasmids typically result in higher read depths. While read depth can be estimated indirectly from the k-mer coverage reported in the contig headers, it can also be determined using a read mapping-based approach. For this example, we performed gene detection using the ‘Gene detection (KMA)’ tool (Galaxy version 0.1) with the National Database of Antibiotic Resistant Organisms (NDARO) AMR gene database [25]. In the sa_0680 dataset, the erm(C) gene was detected at a depth of 576 ×, substantially higher than the median depth of ~ 70 × for the other AMR genes detected (Supplementary Figure S2g). Therefore, this gene is likely located on a high copy number plasmid. Finally, the corresponding contig was aligned to the PlasmidFinder Gram-positive database [42] using blastn (Galaxy version 2.14.1), which revealed the presence of the rep10_3 pNE131p1 replicon on the contig carrying the erm(C) gene (Supplementary Figure S2h), further supporting that the gene is located on a plasmid.

Sequence typing

The Staphylococcus pipeline performs sequence typing using the MLST and cgMLST schemes from PubMLST [37]. Similar to the previous STEC example, these can be used to construct phylogenies. For S. aureus, two additional typing methods are available, both in the pipeline and as stand-alone tools: spa typing and SCCmec typing [31]. The SCCmec typing tool detected the C1 mec complex in all isolates, confirming that they were methicillin-resistant S. aureus (MRSA). The spa types of these strains were t314 for sa_2777 and t1476 for all others. Similar results can be obtained using the in-house gene detection tools, which allow screening against arbitrary nucleotide reference sequences. The ‘Gene detection—create database’ tool (Galaxy version 0.1) can be used to create databases for the in-house gene detection tools, supporting BLAST+, KMA and SRST2-based detection (Supplementary Figure S2i). The tool clusters the input sequences using CD-HIT [65], with an adjustable sequence identity threshold available in the Galaxy wrapper (Supplementary Figure S2j). For instance, setting the threshold to 100% ensures that each entry forms its own cluster. The gene detection tools report only the best hit for each cluster to limit clutter in the output (e.g., when several different alleles of the same gene are detected). As an example, the mecA gene sequences were downloaded from the SCCmecFinder database (available at https://bitbucket.org/genomicepidemiology/sccmecfinder_db, accessed on August 8th, 2024). This FASTA file contained nine sequences. A gene detection database containing these sequences was created, with a clustering threshold of 95%, resulting in three clusters (Supplementary Figure S2j). The database was then used as input to the ‘Gene detection KMA’ tool (Galaxy version 0.1). All isolates were found to match the C1 mec complex, consistent with the results of the stand-alone SCCmecFinder tool (Galaxy version 0.1).
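The effect of reporting only the best hit per cluster can be illustrated as follows. The record layout (`cluster`, `allele`, `score`) and the example values are hypothetical and chosen for this sketch only; they do not reflect the internal data structures of the gene detection tools or the contents of the SCCmecFinder database.

```python
def best_hit_per_cluster(hits):
    """Keep the highest-scoring hit for each cluster.

    `hits` is an iterable of dicts with the keys 'cluster', 'allele' and
    'score'. When two hits in a cluster have the same score, the first
    one encountered is kept.
    """
    best = {}
    for hit in hits:
        current = best.get(hit["cluster"])
        if current is None or hit["score"] > current["score"]:
            best[hit["cluster"]] = hit
    return best


# Invented hits: two alleles of one gene fall in the same cluster.
hits = [
    {"cluster": "cluster_1", "allele": "geneX_1", "score": 95.0},
    {"cluster": "cluster_1", "allele": "geneX_2", "score": 99.5},
    {"cluster": "cluster_2", "allele": "geneY_1", "score": 100.0},
]
for cluster, hit in sorted(best_hit_per_cluster(hits).items()):
    print(cluster, hit["allele"], hit["score"])
```

With a 100% clustering threshold, every database entry would form its own cluster, so this filtering step would leave the list of hits unchanged.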

Phylogenomic investigation

The same approach as in the previous example was used for the phylogenomic investigation of these strains. The MLST phylogeny tool was used on the outputs of the ‘Staphylococcus pipeline’ to construct a cgMLST-based minimum spanning tree (Supplementary Figure S2k). The distances between the three ST8 isolates (i.e., sa_0680, sa_1053, and sa_1405) were relatively small with ≤ 7 pairwise allele differences, whereas the distance to the sa_2777 strain (ST121) was greater than 1,700 alleles out of a total of 2,208 loci. An additional SNP analysis with PACU was performed to zoom in on the ST8 cluster. As the bootstrapping step requires at least four isolates, the contaminated sa_0680 dataset was included for the sake of this example. While this approach would normally be discouraged, our analysis traced the problem to the presence of reads originating from E. faecalis. Given that the variant calling relies upon targeted read mapping, the contamination should not substantially affect the results. The same reference genome as in the corresponding study was used (GCF_000013425.1). The distances were relatively large with ≥ 90 SNP differences between the four ST8 isolates, indicating that there was no evidence of direct transmission between HCWs and patients, consistent with previously reported results [63] (Supplementary Figure S2l).

Other use-cases

The previous examples show how Galaxy @Sciensano can be used to perform a comprehensive analysis of bacterial WGS data for typing, characterization and outbreak detection. However, these examples cover only a small part of the functionality available in Galaxy @Sciensano. Below, several other tools and pipelines developed in-house are briefly discussed to provide a concise overview of additional functionalities. The ‘Viral consensus pipeline (Illumina)’ and ‘Viral consensus pipeline (ONT)’ extract the consensus sequence of viral genomes using an iterative mapping approach. In addition to consensus sequence extraction, they also perform human read scrubbing, pre-processing, quality control, primer removal, and multi-allelic site calling. For influenza A and SARS-CoV-2, automated reference selection from a reference genome database and screening with Nextclade [66] are also available. Extracting consensus sequences for other viral species is also supported if the user provides a suitable reference genome. These pipelines produce an easy-to-understand HTML output report, similar to the bacterial ‘push-button’ pipelines (Table 1). The resulting consensus sequences can be used as inputs for the in-house ‘MEGA—Model selection and tree construction’ tool for automated model selection and phylogenetic tree reconstruction. Low-frequency variants can also be detected using the in-house ‘SARS-CoV-2 LoFreq Pipeline’ [67], which was developed for SARS-CoV-2 but also supports other species. The ‘Hybrid assembly pipeline’ performs automated de novo hybrid assembly of ONT and Illumina data according to recently published recommendations [13]. The pipeline is fully automated and includes read pre-processing for both input types, and generates a comprehensive report with statistics on the quality of the generated assembly, including assembly metrics (e.g., N50, total length, number of contigs), SNPs and indels, and structural rearrangements.
The ‘samtools variant calling’ tool can be used to call SNPs and indels from a BAM file with reads mapped to a reference sequence. Variant calling is performed using ‘bcftools mpileup’ and ‘bcftools call’, for which the options can be customized via the interface. The ‘variant filtering’ tool can be used on the output to remove low quality variants or variants located in problematic genomic regions. While the custom Kraken 2 databases were previously mentioned in the context of detecting inter-species contamination in isolate data, they are also applicable for taxonomic profiling of complex metagenomic datasets, including those involving plants and animals. Finally, in addition to the two previous ‘push-button’ pipelines for the characterization of STEC and S. aureus, several other push-button pipelines are available for the complete characterization of other relevant bacterial organisms with active genomic surveillance programs and/or clinical relevance (Table 1).
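For readers who want to reproduce the variant calling step mentioned above on the command line, the two bcftools invocations can be assembled as in the sketch below. The options shown are standard bcftools options, but which options the ‘samtools variant calling’ wrapper actually sets by default is not stated here, so the selection is an assumption. The helper function and the file names are hypothetical, and the code only builds the commands without running them.

```python
import shlex


def build_variant_calling_commands(reference_fasta, bam_path, output_vcf):
    """Return the argument lists for `bcftools mpileup` and `bcftools call`.

    The output of the first command is meant to be piped into the second.
    """
    mpileup = [
        "bcftools", "mpileup",
        "--fasta-ref", reference_fasta,  # reference the reads were mapped to
        "--output-type", "u",            # uncompressed BCF for piping
        bam_path,
    ]
    call = [
        "bcftools", "call",
        "--multiallelic-caller",         # multiallelic calling model
        "--variants-only",               # report variant sites only
        "--output-type", "v",            # plain-text VCF
        "--output", output_vcf,
    ]
    return mpileup, call


# Hypothetical file names for illustration.
mpileup_cmd, call_cmd = build_variant_calling_commands(
    "reference.fasta", "sample.bam", "sample.vcf"
)
print(shlex.join(mpileup_cmd), "|", shlex.join(call_cmd))
```

The printed pipeline can be pasted into a shell where bcftools is installed; a subsequent filtering step would then operate on the resulting VCF file.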

Discussion

In this manuscript, we presented Galaxy @Sciensano, a publicly available resource for applied genomics data analysis, tailored to public health and clinical settings for microbial typing, characterization, and outbreak detection. This tool portal can facilitate pathogen surveillance and diagnostics by providing user-friendly, comprehensive, accurate and reproducible bioinformatic methods for analyzing microbial genomics data. In particular, laboratories with limited bioinformatics expertise or computing resources could benefit from a web-based solution that enables them to generate actionable results from raw sequencing data. Furthermore, the optimized pathogen-specific pipelines may be of particular relevance, as they provide complete isolate characterization with a minimum of required user customization. These pipelines have been developed with consideration of international standards and recommendations [68, 69]. A comprehensive suite of relevant databases, most of which are updated automatically (Table 2), allows users to query databases from multiple sources and with up-to-date information, eliminating the need for manual database collection and maintenance. The available training resources can improve the knowledge of bioinformatics and facilitate the integration of genomics into public health policy and clinical diagnostics [12].

A lack of bioinformatics and command line expertise is a common bottleneck to the successful integration of genomics into the activities of laboratories [70, 71]. Commercial software solutions, such as the CLC Genomics Workbench (QIAGEN), generally offer a more intuitive user interface, lowering the barrier to entry. However, licensing costs can be prohibitive for laboratories, especially those in low- and middle-income countries. In contrast, open-source, command-line tools are free and generally very flexible. These tools can be combined to build very powerful and comprehensive pipelines. However, most of these software packages are more difficult to use and install, are not always well maintained, and rarely offer extensive support or training materials. Galaxy can overcome many of these limitations by providing a graphical user interface to run these tools remotely, eliminating the need for investments in specialized computing hardware, tool installation and command line experience for end users. The tools in Galaxy, including the custom tools in Galaxy @Sciensano, almost exclusively use standard input and output file formats such as FASTA, FASTQ, BAM, TSV or Newick, allowing further processing within and outside of Galaxy. For example, the tabular output of the pipelines and stand-alone sequence typing tools (Table 1) can be used as input for the ‘MLST phylogeny’ tool to generate phylogenies, and can also be imported into a spreadsheet to create custom graphs. Consequently, Galaxy users can mix and match the stand-alone tools, semi-automated pipelines, Galaxy workflows, and any other software to answer their specific questions of interest.

Galaxy is being used successfully as a portal for remote analysis, as evidenced by the 167 publicly listed Galaxy servers on the official directory (https://galaxyproject.org/use), 42 of which are tagged with the keyword ‘genomics’ (accessed on August 8th, 2024). Comparable Galaxy instances have been described previously, such as GalaxyTrakr [72], NanoGalaxy [11] and the Advanced Research Infrastructure for Experimentation in genomicS (ARIES) platform [73]. What sets Galaxy @Sciensano apart from other public instances is its range of 'push-button' pipelines for complete isolate characterization, custom tools with extensive reporting features, and a large catalog of automatically updated databases. GalaxyTrakr provides a complete genomics toolbox for the investigation of foodborne outbreaks and is used mainly by members of the GenomeTrakr network, which is a large distributed international network of laboratories that utilize WGS for food regulatory research [74]. In the corresponding publication, the authors describe how GalaxyTrakr has improved collaboration and rapid data exchange between laboratories worldwide [72]. GalaxyTrakr is not completely open, as access is limited to members of the network, while anyone can register an account at Galaxy @Sciensano. In terms of functionality, many commonly used tools are shared between GalaxyTrakr and our instance, such as FastQC, Trimmomatic or SPAdes. The main differences are the custom tools and wrappers developed internally (Table 1). Similarly, the tools developed exclusively for GalaxyTrakr are not available on our instance. NanoGalaxy [11] is another popular public Galaxy instance that is tailored to the analysis of long-read Oxford Nanopore Technologies (ONT) data. While Galaxy @Sciensano provides several tools that are compatible with ONT data, the main focus is on Illumina sequencing data, which is currently still the standard for most public health laboratories [75]. Consequently, several tools developed specifically for long-read data, such as Miniasm [76] and Racon [77], are available on NanoGalaxy but not currently on Galaxy @Sciensano. Nevertheless, due to the rapid integration of ONT sequencing into public health and clinical activities, the Galaxy @Sciensano tool catalog will be gradually extended to fully support this type of data, including several relevant tools and support for ONT input data in the ‘push-button’ pipelines. A final example is the ARIES platform maintained by the European reference laboratory for E. coli, which is a genomics platform that offers tools to analyze microbial WGS data [73]. Here too, there is overlap in some of the commonly used ‘core’ tools, but the focus of ARIES is mainly on E. coli and L. monocytogenes, including push-button pipelines for, e.g., E. coli, whereas Galaxy @Sciensano focuses on microbial genomics in general, which is reflected in the selection of tools and databases.

Analyses performed in the context of public health or clinical diagnostics should meet certain quality standards to ensure traceability and reproducibility. Therefore, the custom tools in the Galaxy @Sciensano instance include the following features: (1) all tools and dependencies are version controlled, with important changes indicated in the Galaxy change logs; (2) the performance of several ‘push-button’ pipelines has been extensively validated [18–20], and the in-house developed tools and pipelines will not change significantly without notice; (3) databases are version controlled; (4) tool versions, database versions (including the date of the last change to the database content), and command line calls are listed in the output report; and (5) changes are evaluated on a separate test instance before deployment to the main instances. These features make Galaxy @Sciensano well suited for routine use, and several Belgian national reference laboratories and national reference centers have integrated Galaxy @Sciensano into their activities. In addition, certain pipelines such as the Neisseria and Mycobacterium pipelines have been extensively validated and are used under ISO 15189 accreditation [18].
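To make feature (4) more concrete, the sketch below shows one way a tool wrapper could collect tool versions, database versions and the executed command line into a machine-readable record for inclusion in a report. This is a minimal illustration and not the implementation used in Galaxy @Sciensano: the class names, field names, tool name and version strings are all hypothetical.

```python
import json
import shlex
from dataclasses import asdict, dataclass, field


@dataclass
class DatabaseInfo:
    """Hypothetical description of one database used in an analysis."""
    name: str
    version: str
    last_updated: str  # ISO 8601 date of the last change to the database content


@dataclass
class ProvenanceRecord:
    """Hypothetical traceability record attached to an output report."""
    tool_name: str
    tool_version: str
    command: list  # argument vector exactly as executed
    databases: list = field(default_factory=list)

    def to_json(self):
        """Serialize the record, adding a shell-quoted copy of the command."""
        record = asdict(self)
        record["command_line"] = shlex.join(self.command)
        return json.dumps(record, indent=2, sort_keys=True)


# Illustrative values only; none of these names refer to real tools or databases.
record = ProvenanceRecord(
    tool_name="example_typing_tool",
    tool_version="1.2.0",
    command=["example_typing_tool", "--input", "sample 1.fasta", "--db", "example_db"],
    databases=[DatabaseInfo("example_db", "2024.08", "2024-08-01")],
)
report_json = record.to_json()
print(report_json)
```

Storing the argument vector alongside a shell-quoted string keeps the record both unambiguous for re-execution and readable in a report.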

In the “Example use cases” section, example analyses of STEC and MRSA datasets were described to highlight some of the functionalities available within Galaxy @Sciensano. In both cases, the analysis was performed using the stand-alone tools and workflows available in Galaxy, but similar processing was performed using the ‘push-button’ pathogen characterization pipelines. The pre-processing, basic quality control and de novo assembly Galaxy workflows are largely species-agnostic and can be used as a starting point for the analysis of most bacterial organisms, including those for which no 'push-button' pipeline is available. Through these examples, we first showed that the stand-alone tools and pipelines can analyze datasets with a minimal need for user customization, even for large datasets through batching. Second, we demonstrated the flexibility of the comprehensive toolbox of bioinformatics software and utilities that can be used to explore datasets in depth and to perform custom analyses to address specific research questions.

One of the disadvantages for users of web-based platforms such as Galaxy is the reliance on an external service that could go down without notice. We have several procedures in place to minimize unplanned downtime (see “Technical setup and maintenance” section). The scheduled updates, during which the instance is unavailable, are announced at least a week in advance. The integration of Galaxy @Sciensano into the routine activities of many Belgian National Reference Centers (NRCs) and National Reference Laboratories (NRLs) ensures continued support of this instance. A second limitation is the need to transfer potentially sensitive data to an external service. Several procedures and policies are in place to mitigate this concern. First, we never access the data uploaded by a user, except to fix or replicate bugs at the user’s request. Second, we encourage users to remove any reads originating from human DNA (e.g., from human host material from a bacterial infection) before analysis, as these are typically not meant to be analyzed or shared. Galaxy @Sciensano provides the 'Human read scrubbing pipeline' (Galaxy version 0.2), based on the NCBI human read removal tool (available at https://github.com/ncbi/sra-human-scrubber), which removes human reads from FASTQ files, after which the raw FASTQ files can be purged from the system by the user. Finally, all datasets older than three months are automatically deleted.
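The automatic deletion policy can be pictured with the following minimal Python sketch, which selects datasets whose creation date falls outside a retention window. It is an illustration under stated assumptions, not a description of how the Galaxy @Sciensano instance is actually configured: the 90-day approximation of "three months", the dataset names and the dates are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "three months" is approximated as 90 days for this sketch.
RETENTION = timedelta(days=90)


def expired(datasets, now, retention=RETENTION):
    """Return the names of datasets created before the retention cutoff."""
    cutoff = now - retention
    return [name for name, created in datasets.items() if created < cutoff]


# Hypothetical datasets with their creation timestamps (UTC).
now = datetime(2024, 8, 8, tzinfo=timezone.utc)
datasets = {
    "old_run.fastq": datetime(2024, 4, 1, tzinfo=timezone.utc),
    "recent_run.fastq": datetime(2024, 7, 20, tzinfo=timezone.utc),
}
to_delete = expired(datasets, now)
print(to_delete)
```

For users, the practical consequence is that results intended for long-term storage should be downloaded before the retention period ends.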

In summary, we presented a user-friendly, comprehensive and flexible web-based portal for genomics-based microbial typing, characterization and outbreak detection. Combined with online training materials, this instance has contributed to a successful shift toward genomics-based pathogen surveillance within our institute. While the use of our external instance has grown steadily, we hope to reach more scientists who could benefit from the functionalities that we have made available in Galaxy @Sciensano.

Supplementary Information

Acknowledgements

We thank the Galaxy community for maintaining and improving the underlying Galaxy codebase. We thank all the contributors to the Galaxy ToolShed. We would like to thank the ICT service at Sciensano for their help in setting up and maintaining the computing resources that host our Galaxy instances. We thank the various pathogen domain experts from the Sciensano NRCs, NRLs and other microbiology laboratories for their continuous feedback on the development, testing and improvement of the tools, pipelines and databases hosted in Galaxy @Sciensano. We thank the users of our Galaxy instances for their feedback on the development and selection of tools. We thank the PubMLST, BIGSdb Institut Pasteur, EnteroBase, cgMLST.org and Technical University of Denmark teams for the curation and maintenance of the corresponding tools and databases.

Clinical trial number

Not applicable.

Authors’ contributions

Conceptualization: BB, JVB, RW, SDK, NR, KV. Methodology: BB, JVB, RW, KV. Funding acquisition: KV, NR. Project administration: KV, RW. Visualization: BB. Software: JVB, BB, AVU, JD, MG, TD, MK, KM, NG, RW. Formal analysis: BB. Supervision: KV, RW. Writing—Original Draft: BB. Writing—Review & Editing: BB, JVB, AVU, JD, MG, TD, MK, KM, NG, SDK, NR, RW, KV.

Funding

This work was supported by the project “NGS and Bioinformatics Platform” funded by Sciensano (Sciensano RP-PJ, Belgium).

Data availability

The datasets used in the example analyses are available in the data library of the Galaxy instance, and SRA accession numbers are provided in Table S2. The workflows are available as public Galaxy workflows in our Galaxy instance. Training videos are available on YouTube (https://www.youtube.com/playlist?list=PL9O-3w2bLZ4X5DJGYlbqL60PQDzn42Wjh).

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Bert Bogaerts and Julien Van Braekel contributed equally as first authors.

Raf Winand and Kevin Vanneste contributed equally as last authors.

References

  • 1. European Centre for Disease Control (ECDC), et al. EFSA and ECDC technical report on the collection and analysis of whole genome sequencing data from food‐borne pathogens and other relevant microorganisms isolated from human, animal, food, feed and food/feed environmental samples in the joint ECDC‐EFSA molecular typing database. EFS3. 2019;16(5). 10.2903/sp.efsa.2019.EN-1337.
  • 2. Brown E, Dessai U, McGarry S, Gerner-Smidt P. Use of whole-genome sequencing for food safety and public health in the United States. Foodborne Pathog Dis. 2019;16(7):441–50. 10.1089/fpd.2019.2662.
  • 3. Baker KS, et al. Genomics for public health and international surveillance of antimicrobial resistance. Lancet Microbe. 2023;4(12):e1047–55. 10.1016/S2666-5247(23)00283-5.
  • 4. Afolayan AO, et al. Overcoming data bottlenecks in genomic pathogen surveillance. Clin Infect Dis. 2021;73(Supplement_4):S267–74. 10.1093/cid/ciab785.
  • 5. Afgan E, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 2018;46(W1):W537–44. 10.1093/nar/gky379.
  • 6. The Galaxy Community, et al. The Galaxy platform for accessible, reproducible, and collaborative data analyses: 2024 update. Nucleic Acids Res. 2024:gkae410. 10.1093/nar/gkae410.
  • 7. Blankenberg D, et al. Dissemination of scientific software with Galaxy ToolShed. Genome Biol. 2014;15(2):403. 10.1186/gb4161.
  • 8. Batut B, et al. ASaiM: a Galaxy-based framework to analyze microbiota data. GigaScience. 2018;7(6). 10.1093/gigascience/giy057.
  • 9. Vandel J, Gheeraert C, Staels B, Eeckhoute J, Lefebvre P, Dubois-Chevalier J. GIANT: galaxy-based tool for interactive analysis of transcriptomic data. Sci Rep. 2020;10(1):19835. 10.1038/s41598-020-76769-w.
  • 10. Singh Gaur A, Nagamani S, Priyadarsinee L, Mahanta HJ, Parthasarathi R, Sastry GN. Galaxy for open-source computational drug discovery solutions. Expert Opin Drug Discov. 2023;18(6):579–90. 10.1080/17460441.2023.2205122.
  • 11. de Koning W, et al. NanoGalaxy: Nanopore long-read sequencing data analysis in Galaxy. GigaScience. 2020;9(10):giaa105. 10.1093/gigascience/giaa105.
  • 12. Batut B, et al. Community-driven data analysis training for biology. Cell Syst. 2018;6(6):752–758.e1. 10.1016/j.cels.2018.05.012.
  • 13. Bouras G, et al. Hybracter: enabling scalable, automated, complete and accurate bacterial genome assemblies. BioRxiv. 2023. 10.1101/2023.12.12.571215.
  • 14. Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms. Mol Biol Evol. 2018;35(6):1547–9. 10.1093/molbev/msy096.
  • 15. Bogaerts B, et al. Closing the gap: Oxford Nanopore Technologies R10 sequencing allows comparable results to Illumina sequencing for SNP-based outbreak investigation of bacterial pathogens. J Clin Microbiol. 2024:e01576–23. 10.1128/jcm.01576-23.
  • 16. Davis S, et al. CFSAN SNP Pipeline: an automated method for constructing SNP matrices from next-generation sequence data. PeerJ Computer Science. 2015;1:e20. 10.7717/peerj-cs.20.
  • 17. Danecek P, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10(2):giab008. 10.1093/gigascience/giab008.
  • 18. Bogaerts B, et al. A bioinformatics whole-genome sequencing workflow for clinical Mycobacterium tuberculosis complex isolate analysis, validated using a reference collection extensively characterized with conventional methods and in silico approaches. J Clin Microbiol. 2021;59(6):e00202–e221. 10.1128/JCM.00202-21.
  • 19. Bogaerts B, et al. Validation of a bioinformatics workflow for routine analysis of whole-genome sequencing data and related challenges for pathogen typing in a European National Reference Center: Neisseria meningitidis as a proof-of-concept. Front Microbiol. 2019;10:362. 10.3389/fmicb.2019.00362.
  • 20. Bogaerts B, et al. Validation strategy of a bioinformatics whole genome sequencing workflow for Shiga toxin-producing Escherichia coli using a reference collection extensively characterized with conventional methods. Microb Genom. 2021;7(3). 10.1099/mgen.0.000531.
  • 21. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25(7):1043–55. 10.1101/gr.186072.114.
  • 22. Nayfach S, Camargo AP, Schulz F, Eloe-Fadrosh E, Roux S, Kyrpides NC. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol. 2021;39(5):578–85. 10.1038/s41587-020-00774-7.
  • 23. Low AJ, Koziol AG, Manninger PA, Blais B, Carrillo CD. ConFindr: rapid detection of intraspecies and cross-species contamination in bacterial whole-genome sequence data. PeerJ. 2019;7:e6995. 10.7717/peerj.6995.
  • 24. Sherry NL, et al. An ISO-certified genomics workflow for identification and surveillance of antimicrobial resistance. Nat Commun. 2023;14(1):60. 10.1038/s41467-022-35713-4.
  • 25. Feldgarden M, et al. AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Sci Rep. 2021;11(1):12728. 10.1038/s41598-021-91456-0.
  • 26. Néron B, Littner E, Haudiquet M, Perrin A, Cury J, Rocha E. IntegronFinder 2.0: identification and analysis of integrons across bacteria, with a focus on antibiotic resistance in Klebsiella. Microorganisms. 2022;10(4):700. 10.3390/microorganisms10040700.
  • 27. Carroll LM, Kovac J, Miller RA, Wiedmann M. Rapid, high-throughput identification of anthrax-causing and emetic Bacillus cereus group genome assemblies via BTyper, a computational tool for virulence-based classification of Bacillus cereus group isolates by using nucleotide sequencing data. Appl Environ Microbiol. 2017;83(17):19.
  • 28. Camacho C, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10(1):421. 10.1186/1471-2105-10-421.
  • 29. Clausen PTLC, Aarestrup FM, Lund O. Rapid and precise alignment of raw reads against redundant databases with KMA. BMC Bioinformatics. 2018;19(1):307. 10.1186/s12859-018-2336-6.
  • 30. Inouye M, et al. SRST2: Rapid genomic surveillance for public health and hospital microbiology labs. Genome Med. 2014;6(11):90. 10.1186/s13073-014-0090-6.
  • 31. Kaya H, et al. SCCmecFinder, a Web-Based Tool for Typing of Staphylococcal Cassette Chromosome mec in Staphylococcus aureus Using Whole-Genome Sequence Data. mSphere. 2018;3(1):e00612–17. 10.1128/mSphere.00612-17.
  • 32. Bengtsson-Palme J, et al. Improved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for analysis of environmental sequencing data. Methods Ecol Evol. 2013;4(10):914–9. 10.1111/2041-210X.12073.
  • 33. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019;20(1):257. 10.1186/s13059-019-1891-0.
  • 34. Vanneste K, Garlant L, Broeders S, Van Gucht S, Roosens NH. Application of whole genome data for in silico evaluation of primers and probes routinely employed for the detection of viral species by RT-qPCR using dengue virus as a case study. BMC Bioinformatics. 2018;19(1):312. 10.1186/s12859-018-2313-0.
  • 35. Zheng Z, Li S, Su J, et al. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nat Comput Sci. 2022;2:797–803. 10.1038/s43588-022-00387-x.
  • 36. Bortolaia V, et al. ResFinder 4.0 for predictions of phenotypes from genotypes. J Antimicrob Chemother. 2020;75(12):3491–500. 10.1093/jac/dkaa345.
  • 37. Jolley KA, Maiden MC. BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics. 2010;11(1):595. 10.1186/1471-2105-11-595.
  • 38. Jaureguy F, et al. Phylogenetic and genomic diversity of human bacteremic Escherichia coli strains. BMC Genomics. 2008;9(1):560. 10.1186/1471-2164-9-560.
  • 39. Zhou Z, Alikhan NF, Mohamed K, Fan Y, the Agama Study Group, Achtman M. The EnteroBase user’s guide, with case studies on Salmonella transmissions, Yersinia pestis phylogeny, and Escherichia core genomic diversity. Genome Res. 2020;30(1):138–52. 10.1101/gr.251678.119.
  • 40. Gupta SK, et al. ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. Antimicrob Agents Chemother. 2014;58(1):212–20. 10.1128/AAC.01310-13.
  • 41. Jia B, et al. CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res. 2017;45(D1):D566–73. 10.1093/nar/gkw1004.
  • 42. Carattoli A, et al. In Silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrob Agents Chemother. 2014;58(7):3895–903. 10.1128/AAC.02412-14.
  • 43. Joensen KG, Tetzschner AMM, Iguchi A, Aarestrup FM, Scheutz F. Rapid and Easy In Silico Serotyping of Escherichia coli Isolates by Use of Whole-Genome Sequencing Data. J Clin Microbiol. 2015;53(8):2410–26. 10.1128/JCM.00008-15.
  • 44. The UniVec Database. Available: https://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/.
  • 45. Joensen KG, et al. Real-time whole-genome sequencing for routine typing, surveillance, and outbreak detection of Verotoxigenic Escherichia coli. J Clin Microbiol. 2014;52(5):1501–10. 10.1128/JCM.03617-13.
  • 46. Chen L, Zheng D, Liu B, Yang J, Jin Q. VFDB 2016: hierarchical and refined dataset for big data analysis—10 years on. Nucleic Acids Res. 2016;44(D1):D694–7. 10.1093/nar/gkv1239.
  • 47. Moura A, et al. Whole genome-based population biology and epidemiological surveillance of Listeria monocytogenes. Nat Microbiol. 2016;2(2):16185. 10.1038/nmicrobiol.2016.185.
  • 48. Savin C, et al. Genus-wide Yersinia core-genome multilocus sequence typing for species identification and strain characterization. Microb Genom. 2019;5(10). 10.1099/mgen.0.000301.
  • 49. Jette M, Dunlap C, Garlick J, Grondona M. SLURM: Simple Linux Utility for Resource Management. 2002. Available: https://www.osti.gov/biblio/15002962.
  • 50. Sloggett C, Goonasekera N, Afgan E. BioBlend: automating pipeline analyses within Galaxy and CloudMan. Bioinformatics. 2013;29(13):1685–6. 10.1093/bioinformatics/btt199.
  • 51. Nouws S, et al. The benefits of whole genome sequencing for foodborne outbreak investigation from the perspective of a national reference laboratory in a smaller country. Foods. 2020;9(8):1030. 10.3390/foods9081030.
  • 52. Leinonen R, Sugawara H, Shumway M, on behalf of the International Nucleotide Sequence Database Collaboration. The sequence read archive. Nucleic Acids Res. 2011;39(Database):D19–21. 10.1093/nar/gkq1019.
  • 53. Leinonen R, et al. The European Nucleotide Archive. Nucleic Acids Res. 2011;39(Database):D28–31. 10.1093/nar/gkq967.
  • 54. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–20. 10.1093/bioinformatics/btu170.
  • 55. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047–8. 10.1093/bioinformatics/btw354.
  • 56. Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes De Novo Assembler. Curr Protoc Bioinformatics. 2020;70(1):e102. 10.1002/cpbi.102.
  • 57. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–5. 10.1093/bioinformatics/btt086.
  • 58. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9. 10.1038/nmeth.1923.
  • 59. Nouws S, et al. Impact of DNA extraction on whole genome sequencing analysis for characterization and relatedness of Shiga toxin-producing Escherichia coli isolates. Sci Rep. 2020;10(1):14649. 10.1038/s41598-020-71207-3.
  • 60. Zhou Z, et al. GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Res. 2018;28(9):1395–404. 10.1101/gr.232397.117.
  • 61. Uelze L, et al. Typing methods based on whole genome sequencing data. One Health Outlook. 2020;2(1):3. 10.1186/s42522-020-0010-1.
  • 62. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60. 10.1093/bioinformatics/btp324.
  • 63. Laurence Yehouenou C, et al. Whole-genome sequencing-based screening of MRSA in patients and healthcare workers in public hospitals in Benin. Microorganisms. 2023;11(8):1954. 10.3390/microorganisms11081954.
  • 64. Yoshida CE, et al. The Salmonella In Silico Typing Resource (SISTR): an open web-accessible tool for rapidly typing and subtyping draft Salmonella genome assemblies. PLoS ONE. 2016;11(1):e0147101. 10.1371/journal.pone.0147101.
  • 65. Li W, Godzik A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9. 10.1093/bioinformatics/btl158.
  • 66. Aksamentov I, Roemer C, Hodcroft E, Neher R. Nextclade: clade assignment, mutation calling and quality control for viral genomes. JOSS. 2021;6(67):3773. 10.21105/joss.03773.
  • 67. Wilm A, et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012;40(22):11189–201. 10.1093/nar/gks918.
  • 68. ISO. Microbiology of the food chain - Whole genome sequencing for typing and genomic characterization of bacteria - General requirements and guidance. Available: https://www.iso.org/standard/75509.html.
  • 69. European Food Safety Authority (EFSA). EFSA statement on the requirements for whole genome sequence analysis of microorganisms intentionally used in the food chain. EFS2. 2021;19(7). 10.2903/j.efsa.2021.6506.
  • 70. Almeida OGGD, Pereira De Martinis EC. Relating next-generation sequencing and bioinformatics concepts to routine microbiological testing. Electron J Gen Med. 2019;16(3)136. 10.29333/ejgm/108690.
  • 71. Sánchez-Busó L, et al. A community-driven resource for genomic epidemiology and antimicrobial resistance prediction of Neisseria gonorrhoeae at Pathogenwatch. Genome Med. 2021;13(1):61. 10.1186/s13073-021-00858-2.
  • 72. Gangiredla J, et al. GalaxyTrakr: a distributed analysis tool for public health whole genome sequence data accessible to non-bioinformaticians. BMC Genomics. 2021;22(1):114. 10.1186/s12864-021-07405-8.
  • 73. Knijn A, Michelacci V, Orsini M, Morabito S. Advanced Research Infrastructure for Experimentation in genomicS (ARIES): a lustrum of Galaxy experience. Bioinformatics. 2020. preprint. 10.1101/2020.05.14.095901.
  • 74. Timme RE, Sanchez Leon M, Allard MW. Utilizing the public GenomeTrakr database for foodborne pathogen traceback. Methods Mol Biol. 2019;1918:201–12. 10.1007/978-1-4939-9000-9_17.
  • 75. Seth-Smith HMB, Bonfiglio F, Cuénod A, Reist J, Egli A, Wüthrich D. Evaluation of rapid library preparation protocols for whole genome sequencing based outbreak investigation. Front Public Health. 2019;7:241. 10.3389/fpubh.2019.00241.
  • 76. Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 2016;32(14):2103–10. 10.1093/bioinformatics/btw152.
  • 77. Vaser R, Sović I, Nagarajan N, Šikić M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 2017;27(5):737–46. 10.1101/gr.214270.116.


Articles from BMC Genomics are provided here courtesy of BMC
